Datasets:

wkrl/cord

Sub-tasks:

parsing

Languages:

en

Multilinguality:

monolingual

Size Categories:

1K<n<10K

Language Creators:

crowdsourced

Annotations Creators:

crowdsourced

Source Datasets:

original

License:

cc-by-4.0

Dataset Card for CORD (Consolidated Receipt Dataset)

Dataset Summary

[More Information Needed]

Supported Tasks and Leaderboards

[More Information Needed]

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

{
  "id": datasets.Value("string"),
  "words": datasets.Sequence(datasets.Value("string")),
  "bboxes": datasets.Sequence(datasets.Sequence(datasets.Value("int64"))),
  "labels": datasets.Sequence(datasets.features.ClassLabel(names=_LABELS)),
  "images": datasets.features.Image(),
}
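
The schema above suggests that `words`, `bboxes`, and `labels` are parallel sequences, with each label stored as an integer index into the `ClassLabel` names. The following is a minimal sketch of how one record might be unpacked under that reading. The label names in `EXAMPLE_LABELS`, the sample record, and the four-integer box layout are illustrative assumptions; the card does not list the contents of `_LABELS` or document the box format.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical label names: the card does not show what _LABELS contains.
EXAMPLE_LABELS = ["menu.nm", "menu.price", "total.total_price"]


def tokens_with_labels(
    example: Dict[str, Any], label_names: List[str]
) -> List[Tuple[str, str, Tuple[int, ...]]]:
    """Pair each word with its label name and bounding box."""
    words = example["words"]
    boxes = example["bboxes"]
    labels = example["labels"]
    # Assumption: the three sequences are aligned one-to-one.
    if not (len(words) == len(boxes) == len(labels)):
        raise ValueError("words, bboxes and labels must have the same length")
    return [
        (word, label_names[label], tuple(box))
        for word, label, box in zip(words, labels, boxes)
    ]


# Made-up record shaped like the schema (the image field is omitted here).
example = {
    "id": "0",
    "words": ["Latte", "4.50", "4.50"],
    "bboxes": [[10, 20, 80, 40], [200, 20, 250, 40], [200, 60, 250, 80]],
    "labels": [0, 1, 2],
}

rows = tokens_with_labels(example, EXAMPLE_LABELS)
for word, label, box in rows:
    print(f"{word!r:10} {label:20} {box}")
```

With a real record, the label names would come from the dataset's own `ClassLabel` feature rather than from a hand-written list.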

Data Splits

  • train (800 rows)
  • validation (100 rows)
  • test (100 rows)
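
The row counts above amount to an 80/10/10 split of 1,000 receipts. As a rough sketch, the helpers below compute those fractions and compare the listed counts against any mapping of split names to sized objects. `EXPECTED_ROWS`, `split_fractions`, and `find_mismatches` are hypothetical names introduced for illustration and are not part of the dataset or of any library.

```python
from typing import Dict, Mapping, Optional, Sized, Tuple

# Row counts as listed on this card.
EXPECTED_ROWS: Dict[str, int] = {"train": 800, "validation": 100, "test": 100}


def split_fractions(row_counts: Mapping[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of rows."""
    total = sum(row_counts.values())
    return {name: count / total for name, count in row_counts.items()}


def find_mismatches(
    splits: Mapping[str, Sized],
    expected: Mapping[str, int] = EXPECTED_ROWS,
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {split: (expected, actual)} for every split whose size differs.

    A split that is missing entirely is reported with an actual size of None.
    """
    mismatches: Dict[str, Tuple[int, Optional[int]]] = {}
    for name, want in expected.items():
        got = len(splits[name]) if name in splits else None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


fractions = split_fractions(EXPECTED_ROWS)
print(fractions)
```

If the `datasets` library is installed and the hub is reachable, `find_mismatches(datasets.load_dataset("wkrl/cord"))` should return an empty dictionary, provided the hosted data still matches the counts on this card.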

Dataset Creation

Licensing Information

Creative Commons Attribution 4.0 International License

Citation Information

@inproceedings{park2019cord,
  title={CORD: A Consolidated Receipt Dataset for Post-OCR Parsing},
  author={Park, Seunghyun and Shin, Seung and Lee, Bado and Lee, Junyeop and Surh, Jaeheung and Seo, Minjoon and Lee, Hwalsuk},
  booktitle={Document Intelligence Workshop at Neural Information Processing Systems},
  year={2019}
}

Contributions

Thanks to @clovaai for adding this dataset.