Dataset:
pierreguillou/DocLayNet-large
All information on this page, except the content of this paragraph "About this card (01/27/2023)", has been copied from the Dataset Card for DocLayNet.
DocLayNet is a dataset created by Deep Search (IBM Research) and published under the CDLA-Permissive-1.0 license.
I do not claim any rights to the data taken from this dataset and published on this page.
DocLayNet dataset (IBM) provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories.
To date, the dataset can be downloaded through direct links or as a dataset from Hugging Face datasets:
Paper: DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis (06/02/2022)
Both options require downloading all the data (approximately 30 GiB), which takes time (about 45 min in Google Colab) and a large amount of disk space. This could limit experimentation for people with limited resources.
Moreover, even when downloading via the HF datasets library, the EXTRA zip (doclaynet_extra.zip, 7.5 GiB) must be downloaded separately to associate the annotated bounding boxes with the text extracted by OCR from the PDFs. This operation also requires additional code because the bounding boxes of the texts do not necessarily correspond to the annotated ones (computing the percentage of area shared between the annotated bounding boxes and those of the texts makes it possible to match them).
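The area-in-common comparison mentioned above could look roughly like the following. This is only a minimal sketch, not the code used to build this dataset: the corner format `(x0, y0, x1, y1)`, the function names and the `0.5` threshold are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed box format (hypothetical): (x0, y0, x1, y1) corner coordinates.
Box = Tuple[float, float, float, float]


def overlap_ratio(text_box: Box, annotated_box: Box) -> float:
    """Return the fraction of text_box's area that lies inside annotated_box."""
    ix0 = max(text_box[0], annotated_box[0])
    iy0 = max(text_box[1], annotated_box[1])
    ix1 = min(text_box[2], annotated_box[2])
    iy1 = min(text_box[3], annotated_box[3])
    # Width and height of the intersection are clamped at zero when boxes do not overlap.
    intersection = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    return intersection / area if area > 0 else 0.0


def assign_text_boxes(
    text_boxes: Sequence[Box],
    annotated_boxes: Sequence[Box],
    threshold: float = 0.5,  # hypothetical cut-off, tune for your data
) -> List[Optional[int]]:
    """For each text box, return the index of the best-overlapping annotated box, or None."""
    assignments: List[Optional[int]] = []
    for text_box in text_boxes:
        best_index, best_ratio = None, 0.0
        for index, annotated_box in enumerate(annotated_boxes):
            ratio = overlap_ratio(text_box, annotated_box)
            if ratio > best_ratio:
                best_index, best_ratio = index, ratio
        assignments.append(best_index if best_ratio >= threshold else None)
    return assignments
```

Normalizing by the area of the text box (rather than by the union, as in IoU) is a design choice here: a small OCR cell fully contained in a large annotated block then scores 1.0.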
Finally, in order to use the Hugging Face notebooks on fine-tuning layout models like LayoutLMv3 or LiLT, DocLayNet data must be processed into a suitable format.
For all these reasons, I decided to process the DocLayNet dataset:
Note: the layout HF notebooks will greatly help participants of the IBM ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents!
Quote from page 3 of the DocLayNet paper: "We did not control the document selection with regard to language. The vast majority of documents contained in DocLayNet (close to 95%) are published in English language. However, DocLayNet also contains a number of documents in other languages such as German (2.5%), French (1.0%) and Japanese (1.0%). While the document language has negligible impact on the performance of computer vision methods such as object detection and segmentation models, it might prove challenging for layout analysis methods which exploit textual features."
Quote from page 3 of the DocLayNet paper: "The pages in DocLayNet can be grouped into six distinct categories, namely Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents and Government Tenders. Each document category was sourced from various repositories. For example, Financial Reports contain both free-style format annual reports which expose company-specific, artistic layouts as well as the more formal SEC filings. The two largest categories (Financial Reports and Manuals) contain a large amount of free-style layouts in order to obtain maximum variability. In the other four categories, we boosted the variability by mixing documents from independent providers, such as different government websites or publishers. In Figure 2, we show the document categories contained in DocLayNet with their respective sizes."
DocLayNet large contains about 100% of the DocLayNet dataset.
WARNING: The following code downloads DocLayNet large, but it cannot run to completion in Google Colab because of the disk space needed to store cache data and the CPU RAM needed to download the data (for example, the cache data in /home/ubuntu/.cache/huggingface/datasets/ needs almost 120 GB during the download). Even with a suitable instance, downloading the DocLayNet large dataset takes around 1h50. This is one more reason to test your fine-tuning code on DocLayNet small and/or DocLayNet base.
    # !pip install -q datasets

    from datasets import load_dataset

    dataset_large = load_dataset("pierreguillou/DocLayNet-large")

    # overview of dataset_large

    DatasetDict({
        train: Dataset({
            features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
            num_rows: 69103
        })
        validation: Dataset({
            features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
            num_rows: 6480
        })
        test: Dataset({
            features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
            num_rows: 4994
        })
    })
The DocLayNet base makes it easy to display a document image with the annotated bounding boxes of paragraphs or lines.
Check the notebook processing_DocLayNet_dataset_to_be_used_by_layout_models_of_HF_hub.ipynb in order to get the code.
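Before drawing boxes on an image, they usually have to be converted to corner coordinates and, if the image is displayed at a different size, rescaled. The sketch below is not taken from the notebook; it assumes (this is not confirmed on this page) that boxes are stored as `[x, y, width, height]` in the `coco_width` x `coco_height` coordinate space, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def xywh_to_xyxy(box: Sequence[float]) -> List[float]:
    """Convert an assumed [x, y, width, height] box to [x0, y0, x1, y1]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def scale_box(
    box_xyxy: Sequence[float],
    from_size: Tuple[float, float],  # (width, height) the box is expressed in, e.g. coco size
    to_size: Tuple[float, float],    # (width, height) of the target image, e.g. original size
) -> List[float]:
    """Rescale a corner-format box from one image size to another."""
    from_w, from_h = from_size
    to_w, to_h = to_size
    x0, y0, x1, y1 = box_xyxy
    return [x0 * to_w / from_w, y0 * to_h / from_h, x1 * to_w / from_w, y1 * to_h / from_h]


# Hypothetical example: a box on a 1025 x 1025 page mapped to a 612 x 792 page.
box = xywh_to_xyxy([100, 200, 300, 50])
scaled = scale_box(box, from_size=(1025, 1025), to_size=(612, 792))
```

The resulting corner-format box can then be passed to a drawing routine such as Pillow's `ImageDraw.Draw(image).rectangle(...)`.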
[Figures: Paragraphs, Lines]
DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:
We are hosting a competition in ICDAR 2023 based on the DocLayNet dataset. For more information see https://ds4sd.github.io/icdar23-doclaynet/ .
DocLayNet provides four types of data assets:
The COCO image records are defined as in this example:
    ...
    {
      "id": 1,
      "width": 1025,
      "height": 1025,
      "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",

      // Custom fields:
      "doc_category": "financial_reports", // high-level document category
      "collection": "ann_reports_00_04_fancy", // sub-collection name
      "doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
      "page_no": 9, // page number in original document
      "precedence": 0, // Annotation order, non-zero in case of redundant double- or triple-annotation
    },
    ...
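As an illustration of the `precedence` field, the sketch below keeps only the image records with `precedence` equal to 0, that is, it drops the redundant double or triple annotations. It assumes the usual COCO layout with a top-level `images` list; the sample records (apart from the field names) and the function name are hypothetical.

```python
import json

# Hypothetical sample in the shape of the record above. Real JSON cannot
# contain // comments, so they are left out here.
SAMPLE = """
{
  "images": [
    {"id": 1, "file_name": "a.png", "doc_category": "financial_reports",
     "doc_name": "NASDAQ_FFIN_2002.pdf", "page_no": 9, "precedence": 0},
    {"id": 2, "file_name": "b.png", "doc_category": "financial_reports",
     "doc_name": "NASDAQ_FFIN_2002.pdf", "page_no": 9, "precedence": 1}
  ]
}
"""


def primary_images(coco: dict) -> list:
    """Keep image records whose precedence is 0 (a missing value is treated as 0)."""
    return [img for img in coco.get("images", []) if img.get("precedence", 0) == 0]


coco = json.loads(SAMPLE)
primary = primary_images(coco)
print([img["id"] for img in primary])
```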
The doc_category field uses one of the following constants:
financial_reports, scientific_articles, laws_and_regulations, government_tenders, manuals, patents
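One possible use of these constants is to validate records and count pages per category. The following is a small sketch under that assumption; the sample records and the function name are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# The six doc_category constants listed above.
DOC_CATEGORIES = frozenset({
    "financial_reports",
    "scientific_articles",
    "laws_and_regulations",
    "government_tenders",
    "manuals",
    "patents",
})


def count_by_category(records: Iterable[Dict]) -> Counter:
    """Count records per doc_category, rejecting values outside the known constants."""
    counts: Counter = Counter()
    for record in records:
        category = record["doc_category"]
        if category not in DOC_CATEGORIES:
            raise ValueError(f"unexpected doc_category: {category!r}")
        counts[category] += 1
    return counts


# Hypothetical records; only the doc_category key is used here.
sample_records = [
    {"doc_category": "manuals"},
    {"doc_category": "patents"},
    {"doc_category": "manuals"},
]
counts = count_by_category(sample_records)
```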
The dataset provides three splits
The labeling guide used to train the annotation experts is available at DocLayNet_Labeling_Guide_Public.pdf.
Who are the annotators?
Annotations are crowdsourced.
The dataset is curated by the Deep Search team at IBM Research. You can contact us at deepsearch-core@zurich.ibm.com.
Curators:
License: CDLA-Permissive-1.0
    @article{doclaynet2022,
      title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
      doi = {10.1145/3534678.3539043},
      url = {https://doi.org/10.1145/3534678.3539043},
      author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
      year = {2022},
      isbn = {9781450393850},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
      pages = {3743--3751},
      numpages = {9},
      location = {Washington DC, USA},
      series = {KDD '22}
    }
Thanks to @dolfim-ibm and @cau-git for adding this dataset.