The dataset repo of "CLImage: Human-Annotated Datasets for Complementary-Label Learning"
This repo contains four datasets: CLCIFAR10, CLCIFAR20, CLMicroImageNet10, and CLMicroImageNet20 with human-annotated complementary labels for complementary label learning tasks.
TL;DR: the download links to CLCIFAR and CLMicroImageNet datasets
- CLCIFAR10: clcifar10.pkl (148MB)
- CLCIFAR20: clcifar20.pkl (151MB)
- CLMicroImageNet10 Train: clmicro_imagenet10_train.pkl (55MB)
- CLMicroImageNet10 Test: clmicro_imagenet10_test.pkl (6MB)
- CLMicroImageNet20 Train: clmicro_imagenet20_train.pkl (119MB)
- CLMicroImageNet20 Test: clmicro_imagenet20_test.pkl (11MB)
To collect human-annotated labels, we used Amazon Mechanical Turk (MTurk) to deploy our annotation task. The layout and interface design for the MTurk task can be found in the file design-layout-mturk.html.
In each task, a single image was presented alongside the question: "Choose any one 'incorrect' label for this image." Annotators were shown four candidate labels (e.g., dog, cat, ship, bird) and were instructed to select one that does not correctly describe the image.
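As an illustration of this candidate-sampling scheme, here is a minimal sketch (not code from this repo; the function name and the parameters num_classes, num_annotators, and num_candidates are assumptions for the example):

import random

def sample_candidate_labels(num_classes=10, num_annotators=3, num_candidates=4, seed=0):
    # For each annotator, draw num_candidates distinct labels to present as options;
    # the annotator then picks one option they believe is incorrect for the image,
    # and that choice becomes the annotator's complementary label.
    rng = random.Random(seed)
    return [rng.sample(range(num_classes), num_candidates) for _ in range(num_annotators)]

# Candidate label sets shown to the 3 annotators of one image (label indices).
print(sample_candidate_labels())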
The Python version should be 3.8.10 or above.
pip3 install -r requirement.txt
bash run.sh

This complementary labeled CIFAR10 dataset contains 3 human-annotated complementary labels for all 50,000 images in the training split of CIFAR10. The workers are from Amazon Mechanical Turk (https://www.mturk.com). We randomly sampled 4 different candidate labels for each of 3 different annotators, so each image has 3 (possibly repeated) complementary labels.
For more details, please visit our paper at the link.
Dataset download link: clcifar10.pkl (148MB)
We use the pickle package to save and load the dataset objects. Use pickle.load to load the dataset dictionary object data in Python.

import pickle
data = pickle.load(open("clcifar10.pkl", "rb"))
# keys of data: 'names', 'images', 'ord_labels', 'cl_labels'

data would be a dictionary object with four keys: names, images, ord_labels, and cl_labels.
- names: The list of filenames as strings. These filenames are the same as the ones in CIFAR10.
- images: A numpy.ndarray of size (32, 32, 3) representing the image data with 3 channels at 32*32 resolution.
- ord_labels: The ordinary labels of the images, labeled from 0 to 9 as follows: 0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck.
- cl_labels: Three complementary labels for each image, one from each of three different workers.
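A minimal sketch of inspecting one record after loading (assuming clcifar10.pkl has been downloaded to the working directory):

import pickle

with open("clcifar10.pkl", "rb") as f:
    data = pickle.load(f)

idx = 0
print(data["names"][idx])         # filename of the first image
print(data["images"][idx].shape)  # expected: (32, 32, 3)
print(data["ord_labels"][idx])    # ordinary label in 0..9
print(data["cl_labels"][idx])     # three complementary labels from three workers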
Human Intelligence Task (HIT) is the unit of work in Amazon MTurk. We made the following design choices to keep the submission page friendly:
- Enlarging the tiny 32*32-pixel images to 200*200 pixels for clarity
This complementary labeled CIFAR100 dataset contains 3 human-annotated complementary labels for all 50,000 images in the training split of CIFAR100. We group 4-6 categories into a superclass according to [1] and collect the complementary labels of these 20 superclasses. The workers are from Amazon Mechanical Turk (https://www.mturk.com). We randomly sampled 4 different candidate labels for each of 3 different annotators, so each image has 3 (possibly repeated) complementary labels.
Dataset download link: clcifar20.pkl (151MB)
We use the pickle package to save and load the dataset objects. Use pickle.load to load the dataset dictionary object data in Python.

import pickle
data = pickle.load(open("clcifar20.pkl", "rb"))
# keys of data: 'names', 'images', 'ord_labels', 'cl_labels'

data would be a dictionary object with four keys: names, images, ord_labels, and cl_labels.
- names: The list of filenames as strings. These filenames are the same as the ones in CIFAR20.
- images: A numpy.ndarray of size (32, 32, 3) representing the image data with 3 channels at 32*32 resolution.
- ord_labels: The ordinary labels of the images, labeled from 0 to 19 as follows: 0: aquatic_mammals; 1: fish; 2: flowers; 3: food_containers; 4: fruit, vegetables and mushrooms; 5: household electrical devices; 6: household furniture; 7: insects; 8: large carnivores and bear; 9: large man-made outdoor things; 10: large natural outdoor scenes; 11: large omnivores and herbivores; 12: medium-sized mammals; 13: non-insect invertebrates; 14: people; 15: reptiles; 16: small mammals; 17: trees; 18: transportation vehicles; 19: non-transportation vehicles.
- cl_labels: Three complementary labels for each image, one from each of three different workers.
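A minimal sketch for estimating how often the collected complementary labels coincide with the ordinary label (a rough noise estimate; this script is illustrative and not part of the repo):

import pickle

with open("clcifar20.pkl", "rb") as f:
    data = pickle.load(f)

total = noisy = 0
for ord_label, cls in zip(data["ord_labels"], data["cl_labels"]):
    for cl in cls:
        total += 1
        noisy += int(cl == ord_label)  # a "complementary" label that equals the true label

print(f"complementary labels equal to the ordinary label: {noisy}/{total} ({noisy / total:.2%})")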
Human Intelligence Task (HIT) is the unit of work in Amazon MTurk. We made the following design choices to keep the submission page friendly:
- Hyperlinks to all 10 problems to reduce scrolling time
- Example images of the superclasses for a better understanding of the categories
- Enlarging the tiny 32*32-pixel images to 200*200 pixels for clarity
This complementary labeled MicroImageNet10 dataset contains 3 human-annotated complementary labels for all 5,000 images in the training split of MicroImageNet10, a 10-class subset of TinyImageNet200. The workers are from Amazon Mechanical Turk (https://www.mturk.com). We randomly sampled 4 different candidate labels for each of 3 different annotators, so each image has 3 (possibly repeated) complementary labels.
For more details, please visit our paper at the link.
Training set download link: clmicro_imagenet10_train.pkl (55MB)
Testing set download link: clmicro_imagenet10_test.pkl (6MB)
We use the pickle package to save and load the dataset objects. Use pickle.load to load the dataset dictionary object data in Python.

import pickle
data = pickle.load(open("clmicro_imagenet10_train.pkl", "rb"))
# keys of data: 'names', 'images', 'ord_labels', 'cl_labels'

data would be a dictionary object with four keys: names, images, ord_labels, and cl_labels.
- names: The list of filenames as strings. These filenames are the same as the ones in MicroImageNet10.
- images: A numpy.ndarray of size (64, 64, 3) representing the image data with 3 channels at 64*64 resolution.
- ord_labels: The ordinary labels of the images, labeled from 0 to 9 as follows: 0: sulphur-butterfly, 1: backpack, 2: cardigan, 3: kimono, 4: magnetic-compass, 5: oboe, 6: sandal, 7: torch, 8: pizza, 9: alp.
- cl_labels: Three complementary labels for each image, one from each of three different workers.
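A minimal sketch of wrapping the loaded dictionary in a PyTorch Dataset (assuming torch is installed; the class name and normalization choices are illustrative, not part of this repo):

import pickle
import torch
from torch.utils.data import Dataset

class CLMicroImageNet10(Dataset):
    """Yields (image_tensor, complementary_labels) pairs from the pickled dictionary."""

    def __init__(self, pkl_path):
        with open(pkl_path, "rb") as f:
            self.data = pickle.load(f)

    def __len__(self):
        return len(self.data["images"])

    def __getitem__(self, idx):
        # Convert the HWC uint8 image to a CHW float tensor in [0, 1].
        img = torch.from_numpy(self.data["images"][idx]).permute(2, 0, 1).float() / 255.0
        cl = torch.tensor(self.data["cl_labels"][idx])  # three complementary labels
        return img, cl

train_set = CLMicroImageNet10("clmicro_imagenet10_train.pkl")
print(len(train_set), train_set[0][0].shape)  # expected: 5000 torch.Size([3, 64, 64])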
Human Intelligence Task (HIT) is the unit of work in Amazon MTurk. We made the following design choices to keep the submission page friendly:
- Enlarging the tiny 64*64-pixel images to 200*200 pixels for clarity
This complementary labeled MicroImageNet20 dataset contains 3 human-annotated complementary labels for all 10,000 images in the training split of MicroImageNet20, a 20-class subset of TinyImageNet200. The workers are from Amazon Mechanical Turk (https://www.mturk.com). We randomly sampled 4 different candidate labels for each of 3 different annotators, so each image has 3 (possibly repeated) complementary labels.
For more details, please visit our paper at the link.
Training set download link: clmicro_imagenet20_train.pkl (119MB)
Testing set download link: clmicro_imagenet20_test.pkl (11MB)
We use the pickle package to save and load the dataset objects. Use pickle.load to load the dataset dictionary object data in Python.

import pickle
data = pickle.load(open("clmicro_imagenet20_train.pkl", "rb"))
# keys of data: 'names', 'images', 'ord_labels', 'cl_labels'

data would be a dictionary object with four keys: names, images, ord_labels, and cl_labels.
- names: The list of filenames as strings. These filenames are the same as the ones in MicroImageNet20.
- images: A numpy.ndarray of size (64, 64, 3) representing the image data with 3 channels at 64*64 resolution.
- ord_labels: The ordinary labels of the images, labeled from 0 to 19 as follows: 0: tailed frog, 1: scorpion, 2: snail, 3: american lobster, 4: tabby, 5: persian cat, 6: gazelle, 7: chimpanzee, 8: bannister, 9: barrel, 10: christmas stocking, 11: gasmask, 12: hourglass, 13: iPod, 14: scoreboard, 15: snorkel, 16: suspension bridge, 17: torch, 18: tractor, 19: triumphal arch.
- cl_labels: Three complementary labels for each image, one from each of three different workers.
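Many complementary-label learning methods assume a single complementary label per example. A minimal sketch of reducing the three collected labels to one by random choice (an illustrative preprocessing step, not something this repo prescribes):

import pickle
import random

with open("clmicro_imagenet20_train.pkl", "rb") as f:
    data = pickle.load(f)

rng = random.Random(0)
# Keep one of the three worker-provided complementary labels per image.
single_cl_labels = [rng.choice(cls) for cls in data["cl_labels"]]
print(len(single_cl_labels), single_cl_labels[:5])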
Human Intelligence Task (HIT) is the unit of work in Amazon MTurk. We made the following design choices to keep the submission page friendly:
- Enlarging the tiny 64*64-pixel images to 200*200 pixels for clarity
We have published the list of worker IDs for all contributors who helped label the CLImage_Dataset. To safeguard privacy, we have hashed both the original worker IDs and HIT IDs using the SHA-1 algorithm. We have also included the annotation durations (WorkTimeInSeconds) so users can see how long each image-labeling task took. For full details, please refer to the worker_ids folder, which contains the hashed identifiers and timing data for each dataset.
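For reference, a minimal sketch of hashing an identifier with SHA-1 using Python's standard library (the exact encoding and any preprocessing used for the released files are not specified here, so this is illustrative only):

import hashlib

def hash_id(identifier: str) -> str:
    # SHA-1 digest of the UTF-8 encoded identifier, returned as a hex string.
    return hashlib.sha1(identifier.encode("utf-8")).hexdigest()

print(hash_id("A1EXAMPLEWORKERID"))  # hypothetical worker ID, for demonstration only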
[1] Jiaheng Wei, Zhaowei Zhu, and Hao Cheng. Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations. arXiv preprint arXiv:2110.12088, 2021.