Datasets: orieg/elsevier-oa-cc-by
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
ArXiv: arxiv:2008.00774
License: cc-by-4.0

Elsevier OA CC-By: This is a corpus of 40k (40,091) open access (OA) CC-BY articles from across Elsevier’s journals, representing a large-scale, cross-discipline set of research data to support NLP and ML research. The corpus includes full-text articles published from 2014 to 2020, categorized into 27 mid-level ASJC codes (subject classifications).
Distribution of Publication Years
Publication Year | Number of Articles |
---|---|
2014 | 3018 |
2015 | 4438 |
2016 | 5913 |
2017 | 6419 |
2018 | 8016 |
2019 | 10135 |
2020 | 2159 |
Distribution of Articles Per Mid Level ASJC Code. Each article can belong to multiple ASJC codes.
Discipline | Count |
---|---|
General | 3847 |
Agricultural and Biological Sciences | 4840 |
Arts and Humanities | 982 |
Biochemistry, Genetics and Molecular Biology | 8356 |
Business, Management and Accounting | 937 |
Chemical Engineering | 1878 |
Chemistry | 2490 |
Computer Science | 2039 |
Decision Sciences | 406 |
Earth and Planetary Sciences | 2393 |
Economics, Econometrics and Finance | 976 |
Energy | 2730 |
Engineering | 4778 |
Environmental Science | 6049 |
Immunology and Microbiology | 3211 |
Materials Science | 3477 |
Mathematics | 538 |
Medicine | 7273 |
Neuroscience | 3669 |
Nursing | 308 |
Pharmacology, Toxicology and Pharmaceutics | 2405 |
Physics and Astronomy | 2404 |
Psychology | 1760 |
Social Sciences | 3540 |
Veterinary | 991 |
Dentistry | 40 |
Health Professions | 821 |
English (en).
The original dataset was published with the following JSON structure:

{
  "docId": <str>,
  "metadata": {
    "title": <str>,
    "authors": [
      { "first": <str>, "initial": <str>, "last": <str>, "email": <str> },
      ...
    ],
    "issn": <str>,
    "volume": <str>,
    "firstpage": <str>,
    "lastpage": <str>,
    "pub_year": <int>,
    "doi": <str>,
    "pmid": <str>,
    "openaccess": "Full",
    "subjareas": [<str>],
    "keywords": [<str>],
    "asjc": [<int>]
  },
  "abstract": [
    { "sentence": <str>, "startOffset": <int>, "endOffset": <int> },
    ...
  ],
  "bib_entries": {
    "BIBREF0": {
      "title": <str>,
      "authors": [
        { "last": <str>, "initial": <str>, "first": <str> },
        ...
      ],
      "issn": <str>,
      "volume": <str>,
      "firstpage": <str>,
      "lastpage": <str>,
      "pub_year": <int>,
      "doi": <str>,
      "pmid": <str>
    },
    ...
  },
  "body_text": [
    {
      "sentence": <str>,
      "secId": <str>,
      "startOffset": <int>,
      "endOffset": <int>,
      "title": <str>,
      "refoffsets": {
        <str>: { "endOffset": <int>, "startOffset": <int> }
      },
      "parents": [
        { "id": <str>, "title": <str> },
        ...
      ]
    },
    ...
  ]
}
docId: The docId is the identifier of the document. It is unique to the document and can be resolved into a URL by appending it to the ScienceDirect prefix: https://www.sciencedirect.com/science/pii/<docId>
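As a minimal sketch of that resolution step, the helper below builds the URL from a docId. The function name `doc_url` and the sample identifier are hypothetical and used only for illustration; real identifiers come from the `docId` field of each record.

```python
SCIENCEDIRECT_PREFIX = "https://www.sciencedirect.com/science/pii/"


def doc_url(doc_id: str) -> str:
    """Return the ScienceDirect URL for a corpus document, given its docId."""
    return SCIENCEDIRECT_PREFIX + doc_id


# Made-up placeholder identifier, not a real article.
example_url = doc_url("S0000000000000000")
```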
abstract: The author-provided abstract for the document.
body_text: The full text of the document. The text has been split on sentence boundaries, making it easier to use across research projects. Each sentence carries the title (and ID) of the section it comes from, along with the titles (and IDs) of the parent sections. The highest-level section takes index 0 in the parents array. If the array is empty, the sentence's own section title is the highest-level section title. This allows the article structure to be reconstructed. References have been extracted from the sentences. The IDs of the extracted references and their respective offsets within the sentence can be found in the “refoffsets” field. The complete list of references can be found in the “bib_entries” field, along with each reference's metadata. Some will be missing as we only keep ‘clean’ sentences.
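The reconstruction of the article structure described here can be sketched as follows. This is an illustrative approach, not the authors' code: the function names are hypothetical, the sample sentences are made up, and it assumes that the `body_text` list is already in document order.

```python
def section_path(entry):
    """Section titles from the highest-level parent down to the sentence's own section."""
    parent_titles = [parent["title"] for parent in entry.get("parents", [])]
    return tuple(parent_titles + [entry["title"]])


def rebuild_sections(body_text):
    """Group sentences by full section path, keeping the order in which they appear."""
    grouped = {}
    for entry in body_text:
        grouped.setdefault(section_path(entry), []).append(entry["sentence"])
    return {path: " ".join(sentences) for path, sentences in grouped.items()}


# Made-up sample in the shape of the body_text field.
sample_body = [
    {"sentence": "We study X.", "title": "Introduction", "parents": []},
    {"sentence": "Samples were collected.", "title": "Sampling",
     "parents": [{"id": "s2", "title": "Methods"}]},
    {"sentence": "Each was weighed.", "title": "Sampling",
     "parents": [{"id": "s2", "title": "Methods"}]},
]
sections = rebuild_sections(sample_body)
```

Keying on the full path rather than the section title alone keeps apart subsections that share a title under different parents.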
bib_entries: All the references from within the document can be found in this section. If metadata for a reference is available, it is stored against the key for that reference. Where possible, information such as the document title, authors, and relevant identifiers (DOI and PMID) is included. The key for each reference also appears in the sentence where the reference is used, together with the start and end offsets of where in the sentence it was used.
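A sketch of how the reference keys in a sentence might be joined back to their bibliographic metadata is shown below. The function name `cited_references` and the sample record are hypothetical. Because metadata is only present when available, the lookup returns `None` for keys without an entry.

```python
def cited_references(entry, bib_entries):
    """List the references used in one body_text sentence, ordered by start offset."""
    refs = []
    for key, span in entry.get("refoffsets", {}).items():
        refs.append({
            "key": key,
            "start": span["startOffset"],
            "end": span["endOffset"],
            # None when no metadata was recorded for this key.
            "meta": bib_entries.get(key),
        })
    return sorted(refs, key=lambda ref: ref["start"])


# Made-up sample in the shape of the corpus fields.
sample_bib = {"BIBREF0": {"title": "An example title", "pub_year": 2015}}
sample_entry = {
    "sentence": "Prior work reported similar effects.",
    "refoffsets": {
        "BIBREF1": {"startOffset": 30, "endOffset": 34},
        "BIBREF0": {"startOffset": 11, "endOffset": 15},
    },
}
refs = cited_references(sample_entry, sample_bib)
```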
metadata: Metadata includes additional information about the article, such as the list of authors and relevant IDs (DOI and PMID), along with a number of classification schemes such as ASJC and Subject Classification.
author_highlights: Author highlights are included in the corpus where the author(s) have provided them. The coverage is 61% of all articles. The author highlights, consisting of 4 to 6 sentences, are provided by the author with the aim of summarising the core findings and results of the article.
Distribution of Articles Across Data Splits
Subset | Train | Test | Validation |
---|---|---|---|
All Articles | 32072 | 4009 | 4008 |
With Author Highlights | 19644 | 2420 | 2514 |
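As a quick arithmetic check, the share of articles with author highlights can be computed per split from the counts in the table. The variable names here are illustrative only.

```python
# (all articles, articles with author highlights), copied from the table above.
split_counts = {
    "train": (32072, 19644),
    "test": (4009, 2420),
    "validation": (4008, 2514),
}

# Fraction of articles with highlights in each split.
coverage = {name: with_hl / total for name, (total, with_hl) in split_counts.items()}

# Fraction across all three splits combined.
total_articles = sum(total for total, _ in split_counts.values())
total_with_highlights = sum(with_hl for _, with_hl in split_counts.values())
overall_coverage = total_with_highlights / total_articles
```

The overall value comes out at roughly 0.61, in line with the 61% coverage stated for author highlights.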
Date the data was collected: 2020-06-25T11:00:00.000Z
See the original paper for more detail on the data collection process.
Who are the source language producers? See 3.1 Data Sampling in the original paper.
Who are the annotators? [More Information Needed]
@article{Kershaw2020ElsevierOC,
  title = {Elsevier OA CC-By Corpus},
  author = {Daniel James Kershaw and R. Koeling},
  journal = {ArXiv},
  year = {2020},
  volume = {abs/2008.00774},
  doi = {https://doi.org/10.48550/arXiv.2008.00774},
  url = {https://elsevier.digitalcommonsdata.com/datasets/zm33cdndxs},
  keywords = {Science, Natural Language Processing, Machine Learning, Open Dataset},
  abstract = {We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of Scientific Research papers which has a representative sample from across scientific disciplines. This corpus not only includes the full text of the article, but also the metadata of the documents, along with the bibliographic information for each reference.}
}

@dataset{https://10.17632/zm33cdndxs.3,
  doi = {10.17632/zm33cdndxs.2},
  url = {https://data.mendeley.com/datasets/zm33cdndxs/3},
  author = "Daniel Kershaw and Rob Koeling",
  keywords = {Science, Natural Language Processing, Machine Learning, Open Dataset},
  title = {Elsevier OA CC-BY Corpus},
  publisher = {Mendeley},
  year = {2020},
  month = {sep}
}
Thanks to @orieg for adding this dataset.