Datasets:

allenai/cord19

ArXiv:

arxiv:2004.07180

Source Datasets:

original

Annotations Creators:

no-annotation

Language Creators:

found

Size Categories:

100K<n<1M

Multilinguality:

monolingual

Languages:

en

Dataset Card for CORD-19

Dataset Summary

CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It's curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research.

Supported Tasks and Leaderboards

See the tasks defined in the related Kaggle challenge.

Languages

The dataset is in English (en).

Dataset Structure

Data Instances

The following code block presents an overview of a sample in JSON-like syntax (abbreviated, since some fields are very long):

{
    "abstract": "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified [...]", 
    "authors": "Madani, Tariq A; Al-Ghamdi, Aisha A", 
    "cord_uid": "ug7v899j", 
    "doc_embeddings": [
        -2.939983606338501,
        -6.312200546264648,
        -1.0459030866622925,
        [...] 766 values in total [...]
        -4.107113361358643,
        -3.8174145221710205,
        1.8976187705993652,
        5.811529159545898,
        -2.9323840141296387
    ],
    "doi": "10.1186/1471-2334-1-6",
    "journal": "BMC Infect Dis",
    "publish_time": "2001-07-04",
    "sha": "d1aafb70c066a2068b02786f8929fd9c900897fb",
    "source_x": "PMC",
    "title": "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia",
    "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC35282/"
}

Data Fields

Currently only the following fields are integrated: cord_uid, sha, source_x, title, doi, abstract, publish_time, authors, journal. With the fulltext configuration, the sections transcribed in pdf_json_files are converted into the fulltext feature.

  • cord_uid: A str-valued field that assigns a unique identifier to each CORD-19 paper. This is not necessarily unique per row, which is explained in the FAQs.
  • sha: A List[str]-valued field that is the SHA1 of all PDFs associated with the CORD-19 paper. Most papers will have either zero or one value here (since we either have a PDF or we don't), but some papers will have multiple. For example, the main paper might have supplemental information saved in a separate PDF. Or we might have two separate PDF copies of the same paper. If multiple PDFs exist, their SHA1 will be semicolon-separated (e.g. '4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a; d4f0247db5e916c20eae3f6d772e8572eb828236').
  • source_x: A List[str]-valued field that is the names of sources from which we received this paper. Also semicolon-separated. For example, 'ArXiv; Elsevier; PMC; WHO'. There should always be at least one source listed.
  • title: A str-valued field for the paper title.
  • doi: A str-valued field for the paper DOI.
  • pmcid: A str-valued field for the paper's ID on PubMed Central. Should begin with PMC followed by an integer.
  • pubmed_id: An int-valued field for the paper's ID on PubMed.
  • license: A str-valued field with the most permissive license we've found associated with this paper. Possible values include: 'cc0', 'hybrid-oa', 'els-covid', 'no-cc', 'cc-by-nc-sa', 'cc-by', 'gold-oa', 'biorxiv', 'green-oa', 'bronze-oa', 'cc-by-nc', 'medrxiv', 'cc-by-nd', 'arxiv', 'unk', 'cc-by-sa', 'cc-by-nc-nd'.
  • abstract: A str-valued field for the paper's abstract.
  • publish_time: A str-valued field for the published date of the paper. This is in yyyy-mm-dd format. Not always accurate, as some publishers will denote unknown dates with future dates like yyyy-12-31.
  • authors: A List[str]-valued field for the authors of the paper. Each author name is in Last, First Middle format and semicolon-separated.
  • journal: A str-valued field for the paper journal. Strings are not normalized (e.g. BMJ and British Medical Journal can both exist). Empty string if unknown.
  • mag_id: Deprecated, but originally an int-valued field for the paper as represented in the Microsoft Academic Graph.
  • who_covidence_id: A str-valued field for the ID assigned by the WHO for this paper. Format looks like #72306.
  • arxiv_id: A str-valued field for the arXiv ID of this paper.
  • pdf_json_files: A List[str]-valued field containing paths from the root of the current data dump version to the parses of the paper PDFs into JSON format. Multiple paths are semicolon-separated. Example: document_parses/pdf_json/4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a.json; document_parses/pdf_json/d4f0247db5e916c20eae3f6d772e8572eb828236.json
  • pmc_json_files: A List[str]-valued field. Same as above, but corresponding to the full text XML files downloaded from PMC, parsed into the same JSON format as above.
  • url: A List[str]-valued field containing all URLs associated with this paper. Semicolon-separated.
  • s2_id: A str-valued field containing the Semantic Scholar ID for this paper. Can be used with the Semantic Scholar API (e.g. s2_id=9445722 corresponds to http://api.semanticscholar.org/corpusid:9445722).
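
Several of the fields above pack multiple values into one semicolon-separated string, and publish_time may carry a placeholder date. The following is a minimal sketch of how such values might be unpacked with the Python standard library. The helper names (split_multi, parse_author, is_suspect_date) are hypothetical and not part of the dataset or its loader, and the date check is only a heuristic assumption, not an official rule.

```python
from datetime import date


def split_multi(value):
    """Split a semicolon-separated field into a list of trimmed items."""
    if not value:
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


def parse_author(name):
    """Split one 'Last, First Middle' name into (last, given)."""
    last, _, given = name.partition(",")
    return last.strip(), given.strip()


def is_suspect_date(publish_time, today):
    """Heuristic (assumption): a date is suspect if it cannot be parsed
    as yyyy-mm-dd or if it lies after the reference date `today`."""
    try:
        parsed = date.fromisoformat(publish_time)
    except ValueError:
        return True
    return parsed > today


# Values taken from the sample record shown in Data Instances.
record = {
    "authors": "Madani, Tariq A; Al-Ghamdi, Aisha A",
    "source_x": "PMC",
    "publish_time": "2001-07-04",
}

authors = [parse_author(a) for a in split_multi(record["authors"])]
sources = split_multi(record["source_x"])
suspect = is_suspect_date(record["publish_time"], date(2021, 1, 1))
```

Passing the reference date explicitly, rather than calling date.today() inside the helper, keeps the check reproducible across runs.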

Extra fields based on selected configuration during loading:

  • fulltext: A str-valued field containing the concatenation of all text sections from the JSON parse (itself extracted from the PDF).
  • doc_embeddings: A sequence of float-valued elements containing the document embedding as a vector of floats (parsed from a string of comma-separated values). Details on the system used to extract the embeddings are available in: SPECTER: Document-level Representation Learning using Citation-informed Transformers. TL;DR: it relies on a BERT model pre-trained on document-level relatedness using the citation graph. The system can be queried through REST (see the public API documentation).
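
As a rough illustration of the doc_embeddings description above, the sketch below parses a comma-separated string into a list of floats and compares two vectors. The function names are hypothetical, and cosine similarity is an assumption chosen here as one common way to compare document embeddings; the source does not prescribe a metric.

```python
import math


def parse_embedding(raw):
    """Parse a comma-separated string of numbers into a list of floats."""
    return [float(item) for item in raw.split(",") if item.strip()]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# First three values of the sample record, as they might appear in the raw string.
head = parse_embedding("-2.939983606338501,-6.312200546264648,-1.0459030866622925")
```

With real records, cosine_similarity would be applied to the full vectors of two papers; higher values suggest more closely related documents under this assumed metric.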

Data Splits

No annotations are provided in this dataset, so all instances are in the training split.

The sizes of each configuration are:

configuration    train
metadata         368618
fulltext         368618
embeddings       368618

Dataset Creation

Curation Rationale

See official readme

Source Data

See official readme

Initial Data Collection and Normalization

See official readme

Who are the source language producers?

See official readme

Annotations

No annotations

Annotation process

N/A

Who are the annotators?

N/A

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@article{Wang2020CORD19TC,
  title={CORD-19: The Covid-19 Open Research Dataset},
  author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and
  K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and
  Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and
  D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier},
  journal={ArXiv},
  year={2020}
}

Contributions

Thanks to @ggdupont for adding this dataset.