Dataset:
jordanparker6/publaynet
PubLayNet is a large dataset of document images whose layout is annotated with both bounding boxes and polygonal segmentations. The documents come from the PubMed Central Open Access Subset (commercial use collection). The annotations are generated automatically by matching the PDF and XML versions of each article in that subset. More details are available in the paper "PubLayNet: largest dataset ever for document layout analysis".
The public dataset is distributed in tar.gz format, which does not work well with Hugging Face streaming. The files have been modified to optimise delivery through the Hugging Face datasets API. The original files can be found here.
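As a rough illustration of the streaming use case mentioned above, the sketch below opens the dataset with the `datasets` library in streaming mode and looks at a couple of records. The split name `"train"` and the helper names (`first_examples`, `stream_publaynet`, `preview`) are assumptions made for this example, and the record fields are not documented on this card, so the code only lists whatever keys come back.

```python
from itertools import islice

DATASET_ID = "jordanparker6/publaynet"  # repository ID shown on this card


def first_examples(examples, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(examples, n))


def stream_publaynet(split="train"):
    """Open the dataset in streaming mode.

    Needs `pip install datasets` and network access.
    The split name "train" is an assumption; check the dataset viewer.
    """
    # Imported lazily so first_examples() works without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


def preview(n=2):
    """Print the field names of the first n streamed records."""
    for example in first_examples(stream_publaynet(), n):
        # Field names are not documented here, so just show what is present.
        print(sorted(example.keys()))
```

Calling `preview()` should fetch only the first few records rather than downloading the full archive, which is the point of streaming.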
Licence: Community Data License Agreement – Permissive – Version 1.0
Author: IBM
GitHub: https://github.com/ibm-aur-nlp/PubLayNet
@article{zhong2019publaynet,
  title = {PubLayNet: largest dataset ever for document layout analysis},
  author = {Zhong, Xu and Tang, Jianbin and Yepes, Antonio Jimeno},
  journal = {arXiv preprint arXiv:1908.07836},
  year = {2019}
}