Dataset:

renumics/cifar100-enriched

Task:

Image classification

Language:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotation creators:

crowdsourced

Source datasets:

extended|other-80-Million-Tiny-Images

Other:

image classification cifar-100 cifar-100-enriched image+classification

License:

mit


Dataset Card for CIFAR-100-Enriched (Enhanced by Renumics)

Dataset Summary

📊 Data-centric AI principles have become increasingly important for real-world use cases. At Renumics we believe that classical benchmark datasets and competitions should be extended to reflect this development.

🔍 This is why we are publishing benchmark datasets with application-specific enrichments (e.g. embeddings, baseline results, uncertainties, label error scores). We hope this helps the ML community in the following ways:

Enable new researchers to quickly develop a profound understanding of the dataset.

Popularize data-centric AI principles and tooling in the ML community.

Encourage the sharing of meaningful qualitative insights in addition to traditional quantitative metrics.

📚 This dataset is an enriched version of the CIFAR-100 Dataset.

Explore the Dataset

The enrichments let you gain insights into the dataset quickly. The open-source data curation tool Renumics Spotlight makes this possible with just a few lines of code:

Install datasets and Spotlight via pip:

!pip install renumics-spotlight datasets

Load the dataset from Hugging Face in your notebook:

import datasets

dataset = datasets.load_dataset("renumics/cifar100-enriched", split="train")

Start exploring with a simple view that leverages embeddings to identify relevant data segments:

from renumics import spotlight

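# Convert to pandas and drop the high-dimensional columns that this view does not use
df = dataset.to_pandas()
df_show = df.drop(columns=['embedding', 'probabilities'])
# Show `image` as an image and `embedding_reduced` as a 2D embedding in the UI
spotlight.show(df_show, port=8000, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding})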

You can use the UI to configure the view on the data interactively. Depending on the task at hand (e.g. model comparison, debugging, outlier detection), you might want to use different enrichments and metadata.
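As one example of a debugging workflow outside the UI, the sketch below lists samples whose predicted fine label disagrees with the annotated one, ordered by model confidence. It assumes each row is a dict with the `fine_label`, `fine_label_prediction`, and `probabilities` fields described in this card. The helper name `find_disagreements` and the toy rows are hypothetical and are not part of Spotlight or the dataset.

```python
def find_disagreements(rows):
    """Return rows whose predicted fine label differs from the annotated one.

    Rows are ordered by the model's top probability, highest first, because
    confident disagreements are often worth reviewing first.
    """
    wrong = [r for r in rows if r["fine_label"] != r["fine_label_prediction"]]
    return sorted(wrong, key=lambda r: max(r["probabilities"]), reverse=True)


# Toy rows with made-up values; real rows carry 100 probabilities each.
rows = [
    {"fine_label": 19, "fine_label_prediction": 19, "probabilities": [0.1, 0.9]},
    {"fine_label": 3, "fine_label_prediction": 7, "probabilities": [0.4, 0.6]},
    {"fine_label": 5, "fine_label_prediction": 2, "probabilities": [0.05, 0.95]},
]

suspects = find_disagreements(rows)
print([r["fine_label"] for r in suspects])
```

The same function should work on rows obtained by iterating over the loaded dataset, since each item is then a plain dict.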

CIFAR-100 Dataset

The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. There are 50000 training images and 10000 test images. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). The classes are completely mutually exclusive. We have enriched the dataset by adding image embeddings generated with a Vision Transformer. Here is the list of classes in CIFAR-100:

Superclass	Classes
aquatic mammals	beaver, dolphin, otter, seal, whale
fish	aquarium fish, flatfish, ray, shark, trout
flowers	orchids, poppies, roses, sunflowers, tulips
food containers	bottles, bowls, cans, cups, plates
fruit and vegetables	apples, mushrooms, oranges, pears, sweet peppers
household electrical devices	clock, computer keyboard, lamp, telephone, television
household furniture	bed, chair, couch, table, wardrobe
insects	bee, beetle, butterfly, caterpillar, cockroach
large carnivores	bear, leopard, lion, tiger, wolf
large man-made outdoor things	bridge, castle, house, road, skyscraper
large natural outdoor scenes	cloud, forest, mountain, plain, sea
large omnivores and herbivores	camel, cattle, chimpanzee, elephant, kangaroo
medium-sized mammals	fox, porcupine, possum, raccoon, skunk
non-insect invertebrates	crab, lobster, snail, spider, worm
people	baby, boy, girl, man, woman
reptiles	crocodile, dinosaur, lizard, snake, turtle
small mammals	hamster, mouse, rabbit, shrew, squirrel
trees	maple, oak, palm, pine, willow
vehicles 1	bicycle, bus, motorcycle, pickup truck, train
vehicles 2	lawn-mower, rocket, streetcar, tank, tractor
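The fine-to-coarse relationship in the table above can also be recovered from the data itself. The following is a minimal sketch, assuming rows are dicts with the `fine_label_str` and `coarse_label_str` fields of this dataset. The helper names are hypothetical, and the toy label strings other than `cattle` are assumed spellings that follow the underscore convention seen in the card's sample.

```python
from collections import defaultdict


def fine_to_coarse(rows):
    """Map each fine label string to its superclass and check consistency."""
    mapping = {}
    for row in rows:
        fine, coarse = row["fine_label_str"], row["coarse_label_str"]
        # setdefault keeps the first superclass seen for this fine label
        if mapping.setdefault(fine, coarse) != coarse:
            raise ValueError(f"{fine!r} appears under more than one superclass")
    return mapping


def classes_by_superclass(mapping):
    """Invert the mapping: superclass -> alphabetically sorted fine labels."""
    grouped = defaultdict(list)
    for fine, coarse in sorted(mapping.items()):
        grouped[coarse].append(fine)
    return dict(grouped)


# Toy rows; label spellings besides 'cattle' are assumptions for illustration.
rows = [
    {"fine_label_str": "cattle", "coarse_label_str": "large_omnivores_and_herbivores"},
    {"fine_label_str": "camel", "coarse_label_str": "large_omnivores_and_herbivores"},
    {"fine_label_str": "wolf", "coarse_label_str": "large_carnivores"},
    {"fine_label_str": "cattle", "coarse_label_str": "large_omnivores_and_herbivores"},
]

mapping = fine_to_coarse(rows)
grouped = classes_by_superclass(mapping)
print(grouped)
```

On the full dataset, one would expect 100 keys in `mapping` and 20 groups of 5 classes each, matching the table.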

Supported Tasks and Leaderboards

image-classification: The goal of this task is to classify a given image into one of 100 classes. The leaderboard is available here.

Languages

English class labels.

Dataset Structure

Data Instances

A sample from the training set is provided below:

{
  'image': '/huggingface/datasets/downloads/extracted/f57c1a3fbca36f348d4549e820debf6cc2fe24f5f6b4ec1b0d1308a80f4d7ade/0/0.png',
  'full_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32 at 0x7F15737C9C50>,
  'fine_label': 19,
  'coarse_label': 11,
  'fine_label_str': 'cattle',
  'coarse_label_str': 'large_omnivores_and_herbivores',
  'fine_label_prediction': 19,
  'fine_label_prediction_str': 'cattle',
  'fine_label_prediction_error': 0,
  'split': 'train',
  'embedding': [-1.2482988834381104,
    0.7280710339546204, ...,
    0.5312759280204773],
  'probabilities': [4.505949982558377e-05,
    7.286163599928841e-05, ...,
    6.577593012480065e-05],
  'embedding_reduced': [1.9439491033554077, -5.35720682144165]
}
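Because the `embedding` field in the sample above is a plain list of floats, two images can be compared directly. The sketch below computes cosine similarity with the standard library only. The function name is hypothetical and the short vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2D vectors; the dataset's `embedding` field has 768 dimensions.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Values close to 1 indicate similar embeddings, which can help when looking for near-duplicates or for samples that sit far from the rest of their class.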

Data Fields

Feature	Data Type
image	Value(dtype='string', id=None)
full_image	Image(decode=True, id=None)
fine_label	ClassLabel(names=[...], id=None)
coarse_label	ClassLabel(names=[...], id=None)
fine_label_str	Value(dtype='string', id=None)
coarse_label_str	Value(dtype='string', id=None)
fine_label_prediction	ClassLabel(names=[...], id=None)
fine_label_prediction_str	Value(dtype='string', id=None)
fine_label_prediction_error	Value(dtype='int32', id=None)
split	Value(dtype='string', id=None)
embedding	Sequence(feature=Value(dtype='float32', id=None), length=768, id=None)
probabilities	Sequence(feature=Value(dtype='float32', id=None), length=100, id=None)
embedding_reduced	Sequence(feature=Value(dtype='float32', id=None), length=2, id=None)
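The `probabilities` field holds one value per fine class, so simple uncertainty measures can be derived from it. The following sketch computes the Shannon entropy and the top-k class indices for a single probability vector. Both helper names are hypothetical, and entropy is only one of several possible uncertainty scores.

```python
import math


def prediction_entropy(probabilities):
    """Shannon entropy in nats; 0 means all mass is on a single class."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def top_k(probabilities, k=5):
    """Indices of the k largest probabilities, highest first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:k]


# Toy vector with 4 classes; the dataset's vectors have length 100.
uniform = [0.25, 0.25, 0.25, 0.25]
print(prediction_entropy(uniform))     # maximal for 4 classes: log(4)
print(top_k([0.1, 0.7, 0.2], k=2))
```

Sorting samples by entropy is one way to surface images the baseline model is unsure about.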

Data Splits

Dataset Split	Number of Images in Split	Samples per Class (fine)
Train	50000	500
Test	10000	100
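The per-class counts in the table above can be checked with a small counting helper. This is a sketch under the assumption that rows are dicts with a `fine_label` field. The function names and the toy rows are hypothetical.

```python
from collections import Counter


def samples_per_class(rows, label_key="fine_label"):
    """Count how many rows carry each label value."""
    return Counter(row[label_key] for row in rows)


def is_balanced(rows, label_key="fine_label"):
    """True if every label that occurs has the same number of rows."""
    counts = samples_per_class(rows, label_key)
    return len(set(counts.values())) <= 1


# Toy rows: two samples each for labels 0 and 1.
rows = [{"fine_label": 0}, {"fine_label": 1}, {"fine_label": 0}, {"fine_label": 1}]
counts = samples_per_class(rows)
print(counts, is_balanced(rows))
```

For the train split, the table implies 100 labels with 500 rows each (100 x 500 = 50000), and for the test split 100 rows each (100 x 100 = 10000).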

Dataset Creation

Curation Rationale

CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

If you use this dataset, please cite the following paper:

@article{krizhevsky2009learning,
  added-at = {2021-01-21T03:01:11.000+0100},
  author = {Krizhevsky, Alex},
  biburl = {https://www.bibsonomy.org/bibtex/2fe5248afe57647d9c85c50a98a12145c/s364315},
  interhash = {cc2d42f2b7ef6a4e76e47d1a50c8cd86},
  intrahash = {fe5248afe57647d9c85c50a98a12145c},
  keywords = {},
  pages = {32--33},
  timestamp = {2021-01-21T03:01:11.000+0100},
  title = {Learning Multiple Layers of Features from Tiny Images},
  url = {https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf},
  year = 2009
}

Contributions

Alex Krizhevsky, Vinod Nair, Geoffrey Hinton, and Renumics GmbH.

Author:

renumics

Dataset size:

313.58 MB