Dataset: yuanchuan/annotated_reference_strings
Tasks: token classification
Sub-tasks: parsing
Languages: en
Multilinguality: monolingual
Size: 10M<n<100M
Language creators: found
Annotation creators: other
Source datasets: original
License: cc-by-4.0

The annotated_reference_strings dataset comprises millions of annotated reference strings, i.e. each token of a string has an associated label such as author, title, or year.
These strings are synthesized by running a citation processor on millions of citations obtained from various sources, spanning different scientific domains.
This dataset can be used for structure prediction.
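For structure prediction, the tagged `content` field of a record has to be turned into token-label pairs. The following is a minimal sketch, not the author's own preprocessing. It assumes flat, non-nested tags as in the sample record, uses plain whitespace tokenization, and ignores any text outside a tag. The function name `content_to_token_labels` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches one flat, non-nested labeled span such as <author>Mohr W.</author>.
# Assumption: tags are never nested and always properly closed.
TAG_PATTERN = re.compile(r"<([A-Za-z][\w-]*)>(.*?)</\1>", re.DOTALL)


def content_to_token_labels(content: str) -> List[Tuple[str, str]]:
    """Split each tagged span on whitespace and pair every token with its tag name."""
    pairs: List[Tuple[str, str]] = []
    for match in TAG_PATTERN.finditer(content):
        label, text = match.group(1), match.group(2)
        for token in text.split():
            pairs.append((token, label))
    return pairs


# Shortened version of the sample record's `content` field.
sample = (
    "<citation-number>8.</citation-number> <author>Mohr W.</author> "
    "<year>1977.</year> <volume>5:</volume> <page>29–42</page>"
)
pairs = content_to_token_labels(sample)
print(pairs)
```

A real tokenizer would likely separate punctuation from words, so the whitespace split here is only a starting point.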
The dataset is composed of reference strings that are in English.
{
  "source": "pubmed",
  "lang": "en",
  "entry_type": "article",
  "doi_prefix": "pubmed19n0001",
  "csl_style": "annual-reviews",
  "content": "<citation-number>8.</citation-number> <author>Mohr W.</author> <year>1977.</year> <title>[Morphology of bone tumors. 2. Morphology of benign bone tumors].</title> <container-title>Aktuelle Probleme in Chirurgie und Orthopadie.</container-title> <volume>5:</volume> <page>29–42</page>"
}

Important Note
Data splits are not available yet.
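Until official splits exist, users may want a reproducible split of their own. The sketch below is one possible approach, not part of the dataset: it hashes a stable per-record key into 100 buckets and assigns a split by bucket. The function name `assign_split`, the default percentages, and the choice of key (for example, the record's `content` string) are all assumptions.

```python
import hashlib


def assign_split(record_key: str, valid_pct: int = 5, test_pct: int = 5) -> str:
    """Map a stable record key to 'train', 'validation' or 'test'.

    The key is hashed, so the assignment does not depend on record order
    and stays the same across runs and machines.
    """
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "validation"
    return "train"


# Hypothetical usage: key on the rendered reference string.
example_key = "<author>Mohr W.</author> <year>1977.</year>"
split = assign_split(example_key)
print(split)
```

If the same citation turns out to be rendered in several CSL styles, keying on an identifier of the underlying citation instead of the rendered string would keep all of its renderings in the same split.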
The citations that are used to generate these reference strings are obtained from 3 main sources:
If the citation is not in BibTeX format, bibutils is used to convert it to BibTeX.
Who are the source language producers?
The manner in which the citations are rendered as reference strings is based on rules/specifications dictated by the publisher. Citation Style Language (CSL) is an established standard in which such specifications are prescribed. Thousands of citation styles are available.
The annotation process involves 2 main interventions:
The original CSL specifications are available on GitHub.
The modification of the styles and the sanitization process are done by the author of this work.
This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
This dataset is the product of a Master's project done at the National University of Singapore.
If you are using it, please cite the following:
@techreport{kee2021,
  author = {Yuan Chuan Kee},
  title = {Synthesis of a large dataset of annotated reference strings for developing citation parsers},
  institution = {National University of Singapore},
  year = {2021}
}
Thanks to @kylase for adding this dataset.