Datasets:

yuanchuan/annotated_reference_strings

Sub-tasks:

parsing

Languages:

en

Multilinguality:

monolingual

Size:

10M<n<100M

Language creators:

found

Annotation creators:

other

Source datasets:

original

License:

cc-by-4.0

Dataset Card for annotated_reference_strings

Dataset Summary

The annotated_reference_strings dataset comprises millions of annotated reference strings, i.e. each token of a string has an associated label such as author, title, or year.

These strings are synthesized by running a citation processor on millions of citations obtained from various sources, spanning different scientific domains.

Supported Tasks

This dataset can be used for structure prediction.

Languages

The dataset is composed of reference strings that are in English.

Dataset Structure

Data Instances

{
  "source": "pubmed",
  "lang": "en",
  "entry_type": "article",
  "doi_prefix": "pubmed19n0001",
  "csl_style": "annual-reviews",
  "content": "<citation-number>8.</citation-number> <author>Mohr W.</author> <year>1977.</year> <title>[Morphology of bone tumors. 2. Morphology of benign bone tumors].</title> <container-title>Aktuelle Probleme in Chirurgie und Orthopadie.</container-title> <volume>5:</volume> <page>29–42</page>"
}
Important Note
  • Each citation is rendered in at most 17 CSL styles, so the dataset contains near-duplicates.
  • All characters (including punctuation) of a segment (a segment consists of one or more tokens) are enclosed by tag(s).
  • Only tokens that act as "conjunctions" are not enclosed in tags. These tokens are labelled as other.
  • In some instances a segment is enclosed by more than one tag, e.g. <issued><year>2021</year></issued>. This depends on how the style's author(s) wrote the style.
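
The notes above can be turned into token-level labels with a small parser. The following is a minimal sketch, not the author's tooling: the function name `label_tokens` is hypothetical, tokens are assumed to be whitespace-separated, and a token inside nested tags is assumed to take the innermost tag name. Tokens outside every tag are labelled other, as described above.

```python
import re

# Matches an opening or closing tag such as <author> or </container-title>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def label_tokens(content: str) -> list[tuple[str, str]]:
    """Split a tagged reference string into (token, label) pairs.

    Assumptions of this sketch: tokens are separated by whitespace, and a
    token inside nested tags takes the innermost tag as its label.
    """
    pairs = []
    stack = []  # names of the currently open tags, innermost last
    pos = 0
    for match in TAG_RE.finditer(content):
        # Text seen since the previous tag belongs to the innermost open tag.
        label = stack[-1] if stack else "other"
        pairs.extend((tok, label) for tok in content[pos:match.start()].split())
        is_closing, name = match.group(1), match.group(2)
        if is_closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
        pos = match.end()
    # Any text after the last tag.
    label = stack[-1] if stack else "other"
    pairs.extend((tok, label) for tok in content[pos:].split())
    return pairs


example = "<author>Mohr W.</author> <year>1977.</year> and <issued><year>2021</year></issued>"
for token, label in label_tokens(example):
    print(token, label)
```

Choosing the innermost tag is only one option; keeping the full tag path (e.g. issued/year) may suit some parsers better.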

Data Fields

    • source : The source of the citation. One of {pubmed, jstor, crossref}.
    • lang : The language of the citation. One of {en}.
    • entry_type : The BibTeX entry type. One of {article, book, inbook, misc, techreport, phdthesis, incollection, inproceedings}.
    • doi_prefix : For JSTOR and CrossRef, the prefix of the DOI. For PubMed, the directory (e.g. pubmed19nXXXX, where XXXX is 4 digits) from which the citation was generated.
    • csl_style : The CSL style in which the citation is rendered.
    • content : The citation rendered in a specific style, with each segment enclosed by tags named after the CSL variables.
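
As a quick sanity check when reading records, the field descriptions above can be encoded directly. This is a sketch under stated assumptions: `validate_record` and the constant names are hypothetical, and the allowed values are simply the sets listed above, which may not be exhaustive for every release of the data.

```python
# Allowed values copied from the field descriptions above.
ALLOWED_VALUES = {
    "source": {"pubmed", "jstor", "crossref"},
    "lang": {"en"},
    "entry_type": {
        "article", "book", "inbook", "misc",
        "techreport", "phdthesis", "incollection", "inproceedings",
    },
}
REQUIRED_FIELDS = ("source", "lang", "entry_type", "doi_prefix", "csl_style", "content")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"{field}: missing or not a string")
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field)
        if isinstance(value, str) and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems


record = {
    "source": "pubmed",
    "lang": "en",
    "entry_type": "article",
    "doi_prefix": "pubmed19n0001",
    "csl_style": "annual-reviews",
    "content": "<author>Mohr W.</author> <year>1977.</year>",
}
print(validate_record(record))
```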

    Data Splits

    Data splits are not available yet.

    Dataset Creation

    Source Data

    Initial Data Collection and Normalization

    The citations used to generate these reference strings are obtained from 3 main sources: PubMed, JSTOR, and CrossRef.

    If the citation is not in BibTeX format, bibutils is used to convert it to BibTeX.

    Who are the source language producers?

    The manner in which citations are rendered as reference strings is based on rules/specifications dictated by the publisher. Citation Style Language (CSL) is an established standard in which such specifications are written. Thousands of citation styles are available.

    Annotations

    Annotation process

    The annotation process involves 2 main interventions:

  • Modification of the styles' CSL specification to inject the CSL variable names as part of the render process
  • Sanitization of the rendered strings using regular expressions to ensure all tokens and characters are enclosed in the tags
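
To illustrate what the sanitization step has to catch, the sketch below lists the text in a rendered string that sits outside every tag. It is not the author's sanitization code: the function name `untagged_segments` and the tag pattern are assumptions made for illustration. A result containing anything other than "conjunction" tokens would suggest the string still needs sanitizing.

```python
import re

# Matches an opening or closing tag such as <title> or </container-title>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def untagged_segments(rendered: str) -> list[str]:
    """Return the non-whitespace text that is not enclosed by any tag."""
    found = []
    depth = 0  # number of currently open tags
    pos = 0
    for match in TAG_RE.finditer(rendered):
        if depth == 0:
            chunk = rendered[pos:match.start()].strip()
            if chunk:
                found.append(chunk)
        # A closing tag lowers the depth, an opening tag raises it.
        depth = max(depth - 1, 0) if match.group(1) else depth + 1
        pos = match.end()
    tail = rendered[pos:].strip()
    if depth == 0 and tail:
        found.append(tail)
    return found


print(untagged_segments("<author>Mohr W.</author> <year>1977</year>."))
print(untagged_segments("<author>A. Lee</author> and <author>B. Tan</author>"))
```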

    Who are the annotators?

    The original CSL specifications are available on GitHub.

    The modification of the styles and the sanitization process were done by the author of this work.

    Additional Information

    Licensing Information

    This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

    Citation Information

    This dataset is the product of a Master's project done at the National University of Singapore.

    If you use it, please cite the following:

    @techreport{kee2021,
        author = {Yuan Chuan Kee},
        title = {Synthesis of a large dataset of annotated reference strings for developing citation parsers},
        institution = {National University of Singapore},
        year = {2021}
    }
    

    Contributions

    Thanks to @kylase for adding this dataset.