Datasets:

aharley/rvl_cdip

Languages:

en

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotations creators:

found

Source datasets:

extended|iit_cdip

ArXiv:

arxiv:1502.07058

License:

other

Dataset Card for RVL-CDIP

Dataset Summary

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.
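The size constraint can be illustrated with a small calculation. This is a minimal sketch, not the authors' preprocessing: the card does not say how the images were resized, so the function below (a hypothetical helper named `fit_within`) assumes the aspect ratio is preserved and that images already within the limit are left untouched.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1000) -> Tuple[int, int]:
    """Return (width, height) scaled so the larger side is at most max_side.

    Assumptions (not stated in the dataset card): the aspect ratio is kept,
    and images that already fit are never upscaled.
    """
    largest = max(width, height)
    if largest <= max_side:
        return width, height
    scale = max_side / largest
    # Round to the nearest pixel, but never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1508, 2000))  # a hypothetical 1508x2000 scan
print(fit_within(600, 800))    # already within the limit, returned unchanged
```

With Pillow installed, a similar effect on an actual image could be obtained with `Image.thumbnail((1000, 1000))`, which also preserves the aspect ratio and does not enlarge.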

Supported Tasks and Leaderboards

  • image-classification: The goal of this task is to classify a given document into one of 16 classes representing document types (letter, form, etc.). The leaderboard for this task is available here.

Languages

All the classes and documents use English as their primary language.

Dataset Structure

Data Instances

A sample from the training set is provided below:

{
    'image': <PIL.TiffImagePlugin.TiffImageFile image mode=L size=754x1000 at 0x7F9A5E92CA90>,
    'label': 15
}
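Samples of this shape are plain dictionaries, so they can be processed with ordinary Python. The following is a minimal sketch that tallies labels over an iterable of samples; `count_labels` and the stand-in samples are hypothetical, and the `image` values are set to `None` here only to keep the example free of external dependencies.

```python
from collections import Counter


def count_labels(samples):
    """Count how often each integer label occurs in an iterable of samples."""
    return Counter(sample["label"] for sample in samples)


# Stand-in samples; real ones carry a PIL image under the "image" key.
fake_samples = [
    {"image": None, "label": 15},
    {"image": None, "label": 0},
    {"image": None, "label": 15},
]

label_counts = count_labels(fake_samples)
print(label_counts.most_common(1))  # the most frequent label and its count
```

Because the function only reads the `label` key, it should work the same way on any iterable that yields samples in the format shown above.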

Data Fields

  • image: A PIL.Image.Image object containing a document.
  • label: An integer classification label.

Class Label Mappings

{
  "0": "letter",
  "1": "form",
  "2": "email",
  "3": "handwritten",
  "4": "advertisement",
  "5": "scientific report",
  "6": "scientific publication",
  "7": "specification",
  "8": "file folder",
  "9": "news article",
  "10": "budget",
  "11": "invoice",
  "12": "presentation",
  "13": "questionnaire",
  "14": "resume",
  "15": "memo"
}
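In code, the mapping above is often needed in both directions. The sketch below rebuilds it as two dictionaries; the names `CLASS_NAMES`, `id2label`, and `label2id` are illustrative choices, not identifiers defined by the dataset.

```python
# Class names in label order, copied from the mapping above.
CLASS_NAMES = [
    "letter",
    "form",
    "email",
    "handwritten",
    "advertisement",
    "scientific report",
    "scientific publication",
    "specification",
    "file folder",
    "news article",
    "budget",
    "invoice",
    "presentation",
    "questionnaire",
    "resume",
    "memo",
]

# Integer label -> class name, and the reverse lookup.
id2label = dict(enumerate(CLASS_NAMES))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[15])         # class name for the label in the sample instance
print(label2id["invoice"])  # integer label for a given class name
```

Note that the mapping printed in the card uses string keys ("0" to "15"), whereas the `label` field holds integers, so this sketch keys `id2label` by integer.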

Data Splits

                 train     test     validation
# of examples    320000    40000    40000

The dataset was split in proportions similar to those of ImageNet.

  • 320000 images were used for training,
  • 40000 images for validation, and
  • 40000 images for testing.
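The split sizes can be checked against the totals given in the summary with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are illustrative.

```python
# Split sizes as listed in this card.
SPLIT_SIZES = {"train": 320_000, "validation": 40_000, "test": 40_000}
NUM_CLASSES = 16

total = sum(SPLIT_SIZES.values())
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

# The summary gives 25,000 images per class over the whole dataset;
# the card does not say whether each split is class-balanced on its own.
images_per_class = total // NUM_CLASSES

print(total)
print(fractions)
print(images_per_class)
```

The result is an 80/10/10 split over 400,000 images, which matches the figures in the Dataset Summary.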

Dataset Creation

Curation Rationale

From the paper:

This work makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories, useful for training new CNNs for document analysis.

Source Data

Initial Data Collection and Normalization

The same as in the IIT-CDIP collection.

Who are the source language producers?

The same as in the IIT-CDIP collection.

Annotations

Annotation process

The same as in the IIT-CDIP collection.

Who are the annotators?

The same as in the IIT-CDIP collection.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The dataset was curated by the authors of the paper: Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis.

Licensing Information

RVL-CDIP is a subset of IIT-CDIP, which came from the Legacy Tobacco Document Library, for which license information can be found here.

Citation Information

@inproceedings{harley2015icdar,
    title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
    author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
    booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
    year = {2015}
}

Contributions

Thanks to @dnaveenr for adding this dataset.