Dataset: EMBO/biolang
Tasks: text-generation
Sub-tasks: language-modeling
Languages: en
Multilinguality: monolingual
Language creators: expert-generated
Annotations creators: machine-generated
License: cc-by-4.0
BioLang is a dataset based on abstracts from the open access section of Europe PubMed Central, intended for training language models in the domain of biology. The dataset can be used for random masked language modeling or for language modeling in which only tokens with specific part-of-speech tags are masked. More details on generation and use of the dataset at https://github.com/source-data/soda-roberta .
English
{ "input_ids":[ 0, 2444, 6997, 46162, 7744, 35, 20632, 20862, 3457, 36, 500, 23858, 29, 43, 32, 3919, 716, 15, 49, 4476, 4, 1398, 6, 52, 1118, 5, 20862, 819, 9, 430, 23305, 248, 23858, 29, 4, 256, 40086, 104, 35, 1927, 1069, 459, 1484, 58, 4776, 13, 23305, 634, 16706, 493, 2529, 8954, 14475, 73, 34263, 6, 4213, 718, 833, 12, 24291, 4473, 22500, 14475, 73, 510, 705, 73, 34263, 6, 5143, 4313, 2529, 8954, 14475, 73, 34263, 6, 8, 5143, 4313, 2529, 8954, 14475, 248, 23858, 29, 23, 4448, 225, 4722, 2392, 11, 9341, 261, 4, 49043, 35, 96, 746, 6, 5962, 9, 38415, 4776, 408, 36, 3897, 4, 398, 8871, 56, 23305, 4, 20, 15608, 21, 8061, 6164, 207, 13, 70, 248, 23858, 29, 6, 150, 5, 42561, 21, 8061, 5663, 207, 13, 80, 3457, 4, 509, 1296, 5129, 21567, 3457, 36, 398, 23528, 8748, 22065, 11654, 35, 7253, 15, 49, 4476, 6, 70, 3457, 4682, 65, 189, 28, 5131, 13, 23305, 9726, 4, 2 ], "label_ids": [ "X", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "ADJ", "NOUN", "PUNCT", "PROPN", "PROPN", "PROPN", "PUNCT", "AUX", "VERB", "VERB", "ADP", "DET", "NOUN", "PUNCT", "ADV", "PUNCT", "PRON", "VERB", "DET", "ADJ", "NOUN", "ADP", "ADJ", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "ADJ", "ADJ", "PUNCT", "NOUN", "NOUN", "NOUN", "NOUN", "AUX", "VERB", "ADP", "NOUN", "VERB", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "CCONJ", "ADJ", "PROPN", "PROPN", "PROPN", "PROPN", "NOUN", "NOUN", "NOUN", "ADP", "PROPN", "PROPN", "PROPN", "PROPN", "ADP", "PROPN", "PROPN", "PUNCT", "PROPN", "PUNCT", "ADP", "NOUN", "PUNCT", "NUM", "ADP", "NUM", "VERB", "NOUN", "PUNCT", "NUM", "NUM", "NUM", "NOUN", "AUX", "NOUN", "PUNCT", "DET", "NOUN", "AUX", "X", "NUM", "NOUN", "ADP", "DET", "NOUN", "NOUN", "NOUN", "PUNCT", "SCONJ", "DET", "NOUN", "AUX", "X", "NUM", "NOUN", "ADP", "NUM", "NOUN", "PUNCT", "NUM", 
"NOUN", "VERB", "ADJ", "NOUN", "PUNCT", "NUM", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "VERB", "ADP", "DET", "NOUN", "PUNCT", "DET", "NOUN", "SCONJ", "PRON", "VERB", "AUX", "VERB", "ADP", "NOUN", "NOUN", "PUNCT", "X" ], "special_tokens_mask": [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ] }
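The three fields in the instance above are enough to illustrate both masking strategies mentioned in the description. The following is a minimal sketch, not the soda-roberta implementation: the helper names, the toy token ids, and the 15% masking probability are assumptions made for illustration, `50264` is assumed to be the `<mask>` id of the roberta-base vocabulary, and `-100` follows the common convention for label positions that a loss function should ignore.

```python
import random

MASK_TOKEN_ID = 50264  # assumption: <mask> id in the roberta-base vocabulary
IGNORE_INDEX = -100    # common convention for positions excluded from the loss


def pos_positions(example, pos_tags):
    """Indices of non-special tokens whose POS label is in pos_tags."""
    return [
        i
        for i, (label, special) in enumerate(
            zip(example["label_ids"], example["special_tokens_mask"])
        )
        if special == 0 and label in pos_tags
    ]


def random_positions(example, prob=0.15, seed=0):
    """Indices of non-special tokens, each drawn independently with probability prob."""
    rng = random.Random(seed)
    return [
        i
        for i, special in enumerate(example["special_tokens_mask"])
        if special == 0 and rng.random() < prob
    ]


def apply_mask(example, positions):
    """Return (masked_input_ids, labels); labels are IGNORE_INDEX outside positions."""
    chosen = set(positions)
    ids = example["input_ids"]
    masked = [MASK_TOKEN_ID if i in chosen else t for i, t in enumerate(ids)]
    labels = [t if i in chosen else IGNORE_INDEX for i, t in enumerate(ids)]
    return masked, labels


# Toy example with the same three fields as the instance above.
# The inner token ids (11..14) are made up for illustration.
toy = {
    "input_ids": [0, 11, 12, 13, 14, 2],
    "label_ids": ["X", "DET", "NOUN", "AUX", "ADJ", "X"],
    "special_tokens_mask": [1, 0, 0, 0, 0, 1],
}

det_positions = pos_positions(toy, {"DET"})
masked_ids, labels = apply_mask(toy, det_positions)
```

Filtering on `special_tokens_mask` in both helpers keeps the start and end tokens from ever being masked, whichever strategy selects the positions.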
MLM:
DET, VERB, SMALL:
The dataset was assembled to train language models in the field of cell and molecular biology. To expand the size of the dataset and to include many examples with highly technical language, abstracts were complemented with figure legends (or figure 'captions').
The XML content of papers was downloaded in January 2021 from the open access section of EuropePMC. Figure legends and abstracts were extracted from the JATS XML, tokenized with the roberta-base tokenizer, and part-of-speech tagged with spaCy's en_core_web_sm model ( https://spacy.io ).
More details at https://github.com/source-data/soda-roberta
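Because the tokenizer produces subword tokens while the tagger labels whole words, the two outputs have to be aligned to obtain one label per token, as in the `label_ids` field. One way this could be done is sketched below using character offsets. This is an assumption about the general approach, not the code in the soda-roberta repository: `align_tags`, the sentence, its subword split, and the choice to label special tokens `"X"` are all hypothetical.

```python
def align_tags(token_offsets, words):
    """Give each subword token the POS tag of the word containing its first character.

    token_offsets: list of (start, end) character offsets, one per token;
                   special tokens are represented as (0, 0).
    words: list of (start, end, tag) character spans from a word-level tagger.
    """
    labels = []
    for start, end in token_offsets:
        if start == end:  # special token such as <s> or </s>
            labels.append("X")
            continue
        tag = "X"  # fallback when no word covers the token
        for w_start, w_end, w_tag in words:
            if w_start <= start < w_end:
                tag = w_tag
                break
        labels.append(tag)
    return labels


text = "Cells divide rapidly."
# Word-level spans and tags, as a tagger might return them (hypothetical).
words = [(0, 5, "NOUN"), (6, 12, "VERB"), (13, 20, "ADV"), (20, 21, "PUNCT")]
# Hypothetical subword split: <s> | Cells | div | ide | rapid | ly | . | </s>
token_offsets = [(0, 0), (0, 5), (6, 9), (9, 12), (13, 18), (18, 20), (20, 21), (0, 0)]

labels = align_tags(token_offsets, words)
```

Every subword of a word inherits that word's tag, which matches the runs of repeated labels visible in the data instance.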
Who are the source language producers? Expert scientists.
Part-of-speech tags were assigned automatically.
Who are the annotators? spaCy's en_core_web_sm model ( https://spacy.io ) was used for part-of-speech tagging.
Thomas Lemberger
CC-BY 4.0
Thanks to @tlemberger for adding this dataset.