数据集:
renumics/cifar100-enriched
? Data-centric AI principles have become increasingly important for real-world use cases. At Renumics we believe that classical benchmark datasets and competitions should be extended to reflect this development.
? This is why we are publishing benchmark datasets with application-specific enrichments (e.g. embeddings, baseline results, uncertainties, label error scores). We hope this helps the ML community in the following ways:
? This dataset is an enriched version of the CIFAR-100 Dataset .
The enrichments allow you to quickly gain insights into the dataset. The open source data curation tool Renumics Spotlight enables that with just a few lines of code:
Install datasets and Spotlight via pip :
!pip install renumics-spotlight datasets
Load the dataset from huggingface in your notebook:
import datasets dataset = datasets.load_dataset("renumics/cifar100-enriched", split="train")
Start exploring with a simple view that leverages embeddings to identify relevant data segments:
from renumics import spotlight df = dataset.to_pandas() df_show = df.drop(columns=['embedding', 'probabilities']) spotlight.show(df_show, port=8000, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding})
You can use the UI to interactively configure the view on the data. Depending on the concrete tasks (e.g. model comparison, debugging, outlier detection) you might want to leverage different enrichments and metadata.
The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. There are 50000 training images and 10000 test images. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). The classes are completely mutually exclusive. We have enriched the dataset by adding image embeddings generated with a Vision Transformer . Here is the list of classes in the CIFAR-100:
Superclass | Classes |
---|---|
aquatic mammals | beaver, dolphin, otter, seal, whale |
fish | aquarium fish, flatfish, ray, shark, trout |
flowers | orchids, poppies, roses, sunflowers, tulips |
food containers | bottles, bowls, cans, cups, plates |
fruit and vegetables | apples, mushrooms, oranges, pears, sweet peppers |
household electrical devices | clock, computer keyboard, lamp, telephone, television |
household furniture | bed, chair, couch, table, wardrobe |
insects | bee, beetle, butterfly, caterpillar, cockroach |
large carnivores | bear, leopard, lion, tiger, wolf |
large man-made outdoor things | bridge, castle, house, road, skyscraper |
large natural outdoor scenes | cloud, forest, mountain, plain, sea |
large omnivores and herbivores | camel, cattle, chimpanzee, elephant, kangaroo |
medium-sized mammals | fox, porcupine, possum, raccoon, skunk |
non-insect invertebrates | crab, lobster, snail, spider, worm |
people | baby, boy, girl, man, woman |
reptiles | crocodile, dinosaur, lizard, snake, turtle |
small mammals | hamster, mouse, rabbit, shrew, squirrel |
trees | maple, oak, palm, pine, willow |
vehicles 1 | bicycle, bus, motorcycle, pickup truck, train |
vehicles 2 | lawn-mower, rocket, streetcar, tank, tractor |
English class labels.
A sample from the training set is provided below:
{ 'image': '/huggingface/datasets/downloads/extracted/f57c1a3fbca36f348d4549e820debf6cc2fe24f5f6b4ec1b0d1308a80f4d7ade/0/0.png', 'full_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32 at 0x7F15737C9C50>, 'fine_label': 19, 'coarse_label': 11, 'fine_label_str': 'cattle', 'coarse_label_str': 'large_omnivores_and_herbivores', 'fine_label_prediction': 19, 'fine_label_prediction_str': 'cattle', 'fine_label_prediction_error': 0, 'split': 'train', 'embedding': [-1.2482988834381104, 0.7280710339546204, ..., 0.5312759280204773], 'probabilities': [4.505949982558377e-05, 7.286163599928841e-05, ..., 6.577593012480065e-05], 'embedding_reduced': [1.9439491033554077, -5.35720682144165] }
Feature | Data Type |
---|---|
image | Value(dtype='string', id=None) |
full_image | Image(decode=True, id=None) |
fine_label | ClassLabel(names=[...], id=None) |
coarse_label | ClassLabel(names=[...], id=None) |
fine_label_str | Value(dtype='string', id=None) |
coarse_label_str | Value(dtype='string', id=None) |
fine_label_prediction | ClassLabel(names=[...], id=None) |
fine_label_prediction_str | Value(dtype='string', id=None) |
fine_label_prediction_error | Value(dtype='int32', id=None) |
split | Value(dtype='string', id=None) |
embedding | Sequence(feature=Value(dtype='float32', id=None), length=768, id=None) |
probabilities | Sequence(feature=Value(dtype='float32', id=None), length=100, id=None) |
embedding_reduced | Sequence(feature=Value(dtype='float32', id=None), length=2, id=None) |
Dataset Split | Number of Images in Split | Samples per Class (fine) |
---|---|---|
Train | 50000 | 500 |
Test | 10000 | 100 |
The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
If you use this dataset, please cite the following paper:
@article{krizhevsky2009learning, added-at = {2021-01-21T03:01:11.000+0100}, author = {Krizhevsky, Alex}, biburl = {https://www.bibsonomy.org/bibtex/2fe5248afe57647d9c85c50a98a12145c/s364315}, interhash = {cc2d42f2b7ef6a4e76e47d1a50c8cd86}, intrahash = {fe5248afe57647d9c85c50a98a12145c}, keywords = {}, pages = {32--33}, timestamp = {2021-01-21T03:01:11.000+0100}, title = {Learning Multiple Layers of Features from Tiny Images}, url = {https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf}, year = 2009 }
Alex Krizhevsky, Vinod Nair, Geoffrey Hinton, and Renumics GmbH.