reductstore/datasets

Collection of free datasets hosted with ReductStore.

The goal of this repository is to provide a collection of free datasets that can be used for testing and benchmarking machine learning algorithms.

All datasets are hosted on ReductStore and can be downloaded using the Reduct CLI or one of the client libraries.

Why ReductStore?

Although ReductStore is a time series database, we use it to store datasets as collections of records, with the timestamp serving as a unique identifier. This approach has the following advantages:

  • The database is fast and free, so you can mirror datasets on your own instance and use them locally.
  • You can download partial datasets (see the sketch after this list).
  • You can use the database directly from Python, Rust, C++, or Node.js.
  • You can use annotations as a dictionary of labels, with no need to parse them manually.
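
For example, downloading a partial dataset can be done with a time-range query. Below is a minimal sketch using the Python SDK (credentials and full setup are in the Examples section); it assumes that Bucket.query accepts start and stop as Unix timestamps in microseconds, and the window values are illustrative only:

import asyncio
from reduct import Client


async def fetch_slice():
    # Public playground instance; see the credentials in the Examples section
    client = Client("https://play.reduct.store", api_token="reductstore")
    bucket = await client.get_bucket("datasets")
    # Assumption: start/stop are Unix timestamps in microseconds
    start = 0
    stop = 3_600_000_000  # only records within the first hour of timestamps
    async for record in bucket.query("cats", start=start, stop=stop):
        print(record.timestamp, record.labels)  # labels are already a dict


if __name__ == "__main__":
    asyncio.run(fetch_slice())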

Examples

Credentials to access the datasets: URL https://play.reduct.store, bucket datasets, API token reductstore.

Export data with Reduct CLI

You can export datasets to your local machine using the Reduct CLI:

# Install the tool
wget https://github.com/reductstore/reduct-cli/releases/latest/download/reduct-cli.linux-amd64.tar.gz
tar -xvf reduct-cli.linux-amd64.tar.gz
chmod +x reduct-cli
sudo mv reduct-cli /usr/local/bin
# Add the ReductStore instance to aliases
reduct-cli alias add play -L https://play.reduct.store -t reductstore
# Download the dataset(s) specified in --entries. Each sample comes with a JSON document containing metadata and annotations.
reduct-cli cp play/datasets . --entries=<Dataset Name> --with-meta
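
You can also mirror a dataset onto your own ReductStore instance instead of exporting it to a folder. The commands below are a sketch: they assume a local instance running at http://localhost:8383, an API token of your own, and a destination bucket named datasets that already exists there.

# Register your own instance under a second alias (the token is a placeholder)
reduct-cli alias add local -L http://localhost:8383 -t <your-token>
# Copy the entry from the playground bucket into your local bucket
reduct-cli cp play/datasets local/datasets --entries=<Dataset Name>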

Export data with Python Client SDK

You can integrate ReductStore into your Python code and use the datasets directly:

import asyncio
from reduct import Client

HOST = "https://play.reduct.store"
API_TOKEN = "reductstore"
DATASET = "cats"


async def main():
    # Connect to the public playground instance and open the shared bucket
    client = Client(HOST, api_token=API_TOKEN)
    bucket = await client.get_bucket("datasets")
    # Iterate over the entry's records; the labels hold the annotations
    async for record in bucket.query(DATASET):
        print(record.labels)
        jpeg = await record.read_all()
        # Do something with the JPEG image


if __name__ == "__main__":
    asyncio.run(main())
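
Building on the example above, one way to keep a local copy is to write each image next to a JSON file with its labels. This is only a sketch; the output folder and file layout are arbitrary choices, not a ReductStore convention:

import asyncio
import json
from pathlib import Path

from reduct import Client

HOST = "https://play.reduct.store"
API_TOKEN = "reductstore"
DATASET = "cats"
OUT_DIR = Path("cats_export")  # arbitrary local folder


async def export_to_disk():
    OUT_DIR.mkdir(exist_ok=True)
    client = Client(HOST, api_token=API_TOKEN)
    bucket = await client.get_bucket("datasets")
    async for record in bucket.query(DATASET):
        # The timestamp is the unique identifier of the record
        stem = OUT_DIR / str(record.timestamp)
        # Store the image bytes and the annotations side by side
        stem.with_suffix(".jpeg").write_bytes(await record.read_all())
        stem.with_suffix(".json").write_text(json.dumps(record.labels))


if __name__ == "__main__":
    asyncio.run(export_to_disk())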

Datasets

  • cats: over 9,000 images of cats with annotated facial features. Data type: jpeg. Labels: left-eye-x, left-eye-y, right-eye-x, right-eye-y, mouth-x, mouth-y, left-ear-1-x, left-ear-1-y, left-ear-2-x, left-ear-2-y, left-ear-3-x, left-ear-3-y, right-ear-1-x, right-ear-1-y, right-ear-2-x, right-ear-2-y, right-ear-3-x, right-ear-3-y. Original source: Kaggle. Export script: export.py.
  • mnist_training, mnist_test: MNIST handwritten digits. Data type: png. Labels: digit. Original source: MNIST. Export script: export.py.
  • imdb: ~50,000 photos from IMDB with face location, age, and gender. Data type: jpeg. Labels: dob, photo_taken, gender, name, face_location_{x,y,w,h}, face_score, second_face_score, celeb_names, celeb_id. Original source: IMDB-WIKI. Export script: export.py.
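
As a quick end-to-end check, you can read a few MNIST digits straight from the database. The sketch below assumes Pillow (pip install pillow) for decoding the PNG bytes; any image library would do:

import asyncio
import io

from PIL import Image  # assumption: Pillow is installed
from reduct import Client


async def peek_mnist(limit: int = 5):
    client = Client("https://play.reduct.store", api_token="reductstore")
    bucket = await client.get_bucket("datasets")
    shown = 0
    async for record in bucket.query("mnist_training"):
        image = Image.open(io.BytesIO(await record.read_all()))
        print(f"digit={record.labels['digit']}, size={image.size}")
        shown += 1
        if shown >= limit:
            break


if __name__ == "__main__":
    asyncio.run(peek_mnist())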
