fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
A powerful open-source tool for analyzing image and video datasets founded by the authors of XGBoost, Apache TVM & Turi Create - Danny Bickson, Carlos Guestrin and Amir Alush.
Install fastdup from PyPI with pip:
pip install fastdup
More installation options are available here.
Initialize and run fastdup:
import fastdup
fd = fastdup.create(input_dir="IMAGE_FOLDER/")
fd.run()
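Before launching a run on a large folder, it can help to confirm that the folder exists and actually contains images. The sketch below wraps the two calls shown above in a small helper. The names `count_images`, `run_fastdup`, and the `IMAGE_SUFFIXES` set are hypothetical and are not part of fastdup; the suffix list is an assumption and does not claim to match the formats fastdup supports.

```python
from pathlib import Path

# Assumption: a common set of image extensions, for illustration only.
# This is NOT fastdup's own list of supported formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}


def count_images(input_dir):
    """Count files under input_dir (recursively) with an image-like suffix."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def run_fastdup(input_dir):
    """Check the folder, then run fastdup exactly as in the quickstart."""
    if count_images(input_dir) == 0:
        raise ValueError(f"No image files found under {input_dir}")

    # Imported lazily so the folder check works even without fastdup installed.
    import fastdup

    fd = fastdup.create(input_dir=str(input_dir))
    fd.run()
    return fd
```

Calling `run_fastdup("IMAGE_FOLDER/")` returns the same `fd` object used in the rest of this quickstart.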
Explore the results in an interactive web UI:
fd.explore()
Alternatively, visualize the result in a static gallery:
fd.vis.duplicates_gallery() # gallery of duplicates
fd.vis.outliers_gallery() # gallery of outliers
fd.vis.component_gallery() # gallery of connected components
fd.vis.stats_gallery() # gallery of image statistics (e.g. blur, brightness, etc.)
fd.vis.similarity_gallery() # gallery of similar images
https://github.com/user-attachments/assets/738a329d-8063-4515-a961-f2527934a0ca
fastdup handles labeled/unlabeled datasets in image or video format, providing a range of features:
What sets fastdup apart from other similar tools:
Learn the basics of fastdup through interactive examples. View the notebooks on GitHub or nbviewer. Even better, run them on Google Colab or Kaggle, for free.
⚡ Quickstart: Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, and dark/bright/blurry images, then view clusters of visually similar images. If you're new, start here!
📌 Dataset: Oxford-IIIT Pet.

🧹 Finding and Removing Duplicates: Learn how to analyze an image dataset for duplicates and near-duplicates.
📌 Dataset: Oxford-IIIT Pet.

🖼 Finding and Removing Mislabels: Learn how to analyze an image dataset for potential mislabels and export the list of mislabeled images for further inspection.
📌 Dataset: Food-101.

🎁 Image Similarity Search: Perform image search in a large dataset of images.
📌 Dataset: Shopee Product Matching.

🤗 Hugging Face Datasets: Load and analyze datasets from Hugging Face Datasets. Perfect if you already have a dataset hosted on the Hugging Face Hub.

🧠 TIMM Embeddings: Compute dataset embeddings using TIMM (PyTorch Image Models) and run fastdup over them to surface dataset issues. Runs on CPU and GPU.

🦖 ONNX Embeddings: Bring your own ONNX model. This example extracts feature vectors from your images using the DINOv2 model. Runs on CPU.
See more examples.
Get help from the fastdup team or community members via the following channels:
Community-contributed blog posts on fastdup:
What our users say:
Visual Layer offers commercial services for managing, cleaning, and curating visual data at scale.
Sign up for free.
https://github.com/visual-layer/fastdup/assets/6821286/57f13d77-0ac4-4c74-8031-07fae87c5b00
Not convinced? Interact with Visual Layer Cloud public dataset with no sign-up required.
We have added an experimental crash report collection using Sentry.
We DO NOT collect user-specific information such as folder names, user names, image names, image content, etc.
We do collect data related to fastdup’s internal operations and performance statistics such as total number of images, average runtime per image, total free memory, total free disk space, number of cores, etc.
This helps us identify and resolve stability issues, thereby improving the overall reliability of fastdup.
The code for the data collection is found here. On macOS, we use Google Crashpad to report crashes.
Users can opt out of the experimental crash reporting system using either of the following methods:
Define an environment variable named SENTRY_OPT_OUT.
Call run() with turi_param='run_sentry=0'.
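As a minimal sketch, both opt-out methods can be applied from Python. Two details here are assumptions rather than documented behavior: the value `"1"` (this section only says the variable must be defined) and setting the variable before fastdup is imported (a cautious ordering, not a stated requirement). The helper name `run_without_crash_reports` is hypothetical.

```python
import os

# Method 1: define the SENTRY_OPT_OUT environment variable.
# Assumption: the value "1" is arbitrary; only the variable's presence
# is described in this README.
os.environ["SENTRY_OPT_OUT"] = "1"


def run_without_crash_reports(input_dir):
    """Run fastdup with crash reporting disabled via turi_param (method 2)."""
    # Imported after the environment variable is set (cautious ordering,
    # assumed rather than documented).
    import fastdup

    fd = fastdup.create(input_dir=input_dir)
    fd.run(turi_param="run_sentry=0")
    return fd
```

Either method should be sufficient on its own; the sketch combines them only to show both in one place.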
fastdup is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
For more information or inquiries regarding the license, please contact us at [email protected] or see the LICENSE file.