reductstore/datasets

Collection of free datasets hosted with ReductStore.

The goal of this repository is to provide a collection of free datasets that can be used for testing and benchmarking machine learning algorithms.

All datasets are hosted on ReductStore and can be downloaded using the Reduct CLI or one of the client libraries.

Why ReductStore?

Although ReductStore is a time series database, we use it to store datasets as collections of records, with the timestamp serving as a unique identifier. This approach has the following advantages:

  • The database is fast and free, so you can mirror datasets on your own instance and use them locally.
  • You can download partial datasets (see the sketch after this list).
  • You can use the database directly from Python, Rust, C++, or Node.js.
  • You can use annotations as a dictionary of labels, with no need to parse them manually.
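
For example, downloading a partial dataset can be done with a time-range query. Below is a minimal sketch using the Python SDK (credentials and full setup are in the Examples section); it assumes that Bucket.query accepts start and stop as Unix timestamps in microseconds, and the window values are illustrative only:

import asyncio
from reduct import Client


async def fetch_slice():
    # Public playground instance; see the credentials in the Examples section
    client = Client("https://play.reduct.store", api_token="reductstore")
    bucket = await client.get_bucket("datasets")
    # Assumption: start/stop are Unix timestamps in microseconds
    start = 0
    stop = 3_600_000_000  # only records within the first hour of timestamps
    async for record in bucket.query("cats", start=start, stop=stop):
        print(record.timestamp, record.labels)  # labels are already a dict


if __name__ == "__main__":
    asyncio.run(fetch_slice())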

Examples

Credentials to access the datasets: URL https://play.reduct.store, bucket datasets, API token reductstore.

Export data with Reduct CLI

You can export datasets to your local machine using the Reduct CLI:

# Install the tool
wget https://github.com/reductstore/reduct-cli/releases/latest/download/reduct-cli.linux-amd64.tar.gz
tar -xvf reduct-cli.linux-amd64.tar.gz
chmod +x reduct-cli
sudo mv reduct-cli /usr/local/bin
# Add the ReductStore instance to aliases
reduct-cli alias add play -L https://play.reduct.store -t reductstore
# Download the dataset(s) specified in --entries. Each sample comes with a JSON document containing metadata and annotations.
reduct-cli cp play/datasets . --entries=<Dataset Name> --with-meta
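
You can also mirror a dataset onto your own ReductStore instance instead of exporting it to a folder. The commands below are a sketch: they assume a local instance running at http://localhost:8383, an API token of your own, and a destination bucket named datasets that already exists there.

# Register your own instance under a second alias (the token is a placeholder)
reduct-cli alias add local -L http://localhost:8383 -t <your-token>
# Copy the entry from the playground bucket into your local bucket
reduct-cli cp play/datasets local/datasets --entries=<Dataset Name>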

Export data with Python Client SDK

You can integrate ReductStore into your Python code and use the datasets directly:

import asyncio
from reduct import Client

HOST = "https://play.reduct.store"
API_TOKEN = "reductstore"
DATASET = "cats"


async def main():
    # Connect to the public playground instance and open the shared bucket
    client = Client(HOST, api_token=API_TOKEN)
    bucket = await client.get_bucket("datasets")
    # Iterate over the entry's records; the labels hold the annotations
    async for record in bucket.query(DATASET):
        print(record.labels)
        jpeg = await record.read_all()
        # Do something with the JPEG image


if __name__ == "__main__":
    asyncio.run(main())
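
Building on the example above, one way to keep a local copy is to write each image next to a JSON file with its labels. This is only a sketch; the output folder and file layout are arbitrary choices, not a ReductStore convention:

import asyncio
import json
from pathlib import Path

from reduct import Client

HOST = "https://play.reduct.store"
API_TOKEN = "reductstore"
DATASET = "cats"
OUT_DIR = Path("cats_export")  # arbitrary local folder


async def export_to_disk():
    OUT_DIR.mkdir(exist_ok=True)
    client = Client(HOST, api_token=API_TOKEN)
    bucket = await client.get_bucket("datasets")
    async for record in bucket.query(DATASET):
        # The timestamp is the unique identifier of the record
        stem = OUT_DIR / str(record.timestamp)
        # Store the image bytes and the annotations side by side
        stem.with_suffix(".jpeg").write_bytes(await record.read_all())
        stem.with_suffix(".json").write_text(json.dumps(record.labels))


if __name__ == "__main__":
    asyncio.run(export_to_disk())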

Datasets

  • cats: over 9,000 images of cats with annotated facial features. Data type: jpeg. Labels: left-eye-x, left-eye-y, right-eye-x, right-eye-y, mouth-x, mouth-y, left-ear-1-x, left-ear-1-y, left-ear-2-x, left-ear-2-y, left-ear-3-x, left-ear-3-y, right-ear-1-x, right-ear-1-y, right-ear-2-x, right-ear-2-y, right-ear-3-x, right-ear-3-y. Original source: Kaggle. Export script: export.py.
  • mnist_training, mnist_test: MNIST handwritten digits. Data type: png. Labels: digit. Original source: MNIST. Export script: export.py.
  • imdb: ~50,000 photos from IMDB with face location, age, and gender. Data type: jpeg. Labels: dob, photo_taken, gender, name, face_location_{x,y,w,h}, face_score, second_face_score, celeb_names, celeb_id. Original source: IMDB-WIKI. Export script: export.py.
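
As a quick end-to-end check, you can read a few MNIST digits straight from the database. The sketch below assumes Pillow (pip install pillow) for decoding the PNG bytes; any image library would do:

import asyncio
import io

from PIL import Image  # assumption: Pillow is installed
from reduct import Client


async def peek_mnist(limit: int = 5):
    client = Client("https://play.reduct.store", api_token="reductstore")
    bucket = await client.get_bucket("datasets")
    shown = 0
    async for record in bucket.query("mnist_training"):
        image = Image.open(io.BytesIO(await record.read_all()))
        print(f"digit={record.labels['digit']}, size={image.size}")
        shown += 1
        if shown >= limit:
            break


if __name__ == "__main__":
    asyncio.run(peek_mnist())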
