An unsupervised and free tool for image and video dataset analysis.
Explore the docs »
Features · Report Bug · Read Blog · Quickstart · Enterprise Edition · About us
🔥 We've released fastdup V1.0! View the release notes here.
fastdup analyzes your image/video dataset for potential issues such as duplicates, outliers, and blurry images.
fastdup works on both labeled and unlabeled data. Additional features include:
- Quality: Find and remove anomalies, outliers, duplicates, and similar images and videos from your dataset at large scale.
- Cost: Reduce data operation costs by sampling high-quality or novel data before labeling, and by assessing the quality of data that is already labeled.
- Scale: fastdup's C++ graph engine is highly efficient and can handle up to 400M images on a single CPU machine.
Supported Python versions:
Supported operating systems:
Option 1 - Install fastdup via PyPI:
# upgrade pip to its latest version
pip install -U pip
# install fastdup
pip install fastdup
# Alternatively, use explicit python version (XX)
python3.XX -m pip install fastdup
Option 2 - Install fastdup via an Ubuntu 20.04 Docker image on DockerHub:
docker pull karpadoni/fastdup-ubuntu-20.04
Detailed installation instructions and common errors here.
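After either install option, it can help to confirm that the package is visible to the Python interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of fastdup.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="fastdup"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("fastdup is not installed for this interpreter; "
              "try: python -m pip install fastdup")
    else:
        print(f"fastdup {found} is installed")
```

Running the check with the same `python3.XX` executable used for the install avoids the common case where the package lands in a different interpreter's site-packages.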
Run fastdup with only 3 lines of code.
Visualize the result.
Here are 8 lines of code you'll need in most cases.
import fastdup
# work_dir: folder for fastdup's output files; images_dir: folder with the images to analyze
fd = fastdup.create(work_dir, images_dir)
fd.run()
fd.vis.duplicates_gallery() # create a visual gallery of found duplicates
fd.vis.outliers_gallery() # create a visual gallery of anomalies
fd.vis.component_gallery() # create a visualization of connected components
fd.vis.stats_gallery() # create a visualization of image statistics (for example blur)
fd.vis.similarity_gallery() # create a gallery of similar images
View the API docs here.
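The quickstart leaves `work_dir` and `images_dir` undefined. The sketch below shows one way to supply them, assuming both are plain folder paths. The helper names `prepare_dirs` and `run_quickstart` are hypothetical; only the `fastdup.create`, `run`, and gallery calls come from the quickstart above.

```python
from pathlib import Path


def prepare_dirs(images_dir, work_dir):
    """Validate the image folder and create the output folder if needed."""
    images = Path(images_dir).expanduser()
    if not images.is_dir():
        raise FileNotFoundError(f"image folder not found: {images}")
    work = Path(work_dir).expanduser()
    work.mkdir(parents=True, exist_ok=True)
    return str(work), str(images)


def run_quickstart(images_dir, work_dir):
    """Run the quickstart calls on concrete folders and return the fastdup object."""
    import fastdup  # imported here so prepare_dirs works without fastdup installed

    work, images = prepare_dirs(images_dir, work_dir)
    fd = fastdup.create(work, images)  # same positional order as the quickstart
    fd.run()
    fd.vis.duplicates_gallery()
    fd.vis.outliers_gallery()
    return fd
```

Checking the image folder up front gives a clear error message before the analysis starts, which is easier to diagnose than a failure inside the run.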
The following advanced features of fastdup are still in beta testing. Sign up for free to become a beta tester and get early access. Drop us an email at [email protected].
Get help from the fastdup team or community members via the following channels:
The following are community-contributed blog posts about fastdup:
- Master Data Integrity to Clean Your Computer Vision Datasets.
- fastdup: A Powerful Tool to Manage, Clean & Curate Visual Data at Scale on Your CPU - For Free.
- Clean Up Your Digital Life: How I Found 1929 Fully Identical Images, Dark, Bright and Blurry Shots in Minutes, For Free.
- The weighty significance of data cleanliness — or as I like to call it, “cleanliness is next to model-ness” — cannot be overstated.
fastdup is licensed under the Creative Commons 4.0 license. See LICENSE.
For any queries, reach us at [email protected]
Usage Tracking
We have added experimental crash report collection using sentry.io. It does not collect user data other than anonymized IP address data, and it logs only the fastdup library's own actions. We do NOT collect folder names, user names, image names, or image content; we collect only aggregate performance statistics such as the total number of images, average runtime per image, total free memory, total free disk space, and number of cores. Collecting fastdup crashes will help us improve stability.
The code for the data collection is found here. On macOS we use Google Crashpad.
It is always possible to opt out of the experimental crash report collection via either of the following two options:
- Define an environment variable called SENTRY_OPT_OUT
- or call run() with turi_param='run_sentry=0'
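Both opt-out routes can be expressed in Python. The sketch below is illustrative rather than definitive: the value `"1"` is an assumption (the text only requires that the variable be defined), setting it before fastdup is imported is a conservative ordering choice, and the helper name `run_without_crash_reports` is hypothetical.

```python
import os

# Option 1: define the variable before fastdup is imported or run.
# The value "1" is an assumption; the text only requires the variable to be defined.
os.environ["SENTRY_OPT_OUT"] = "1"


def run_without_crash_reports(fd):
    """Option 2: pass the opt-out flag quoted above to run()."""
    # `fd` is assumed to be the object returned by fastdup.create(...), and
    # run() is assumed to accept turi_param as the text above describes.
    return fd.run(turi_param="run_sentry=0")
```

Setting the variable in the shell profile or container environment achieves the same effect for every run without touching the code.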
fastdup was founded by the authors of XGBoost, Apache TVM & Turi Create: Danny Bickson, Carlos Guestrin and Amir Alush.
Learn more about Visual Layer here.