Dataset: acronym_identification
Tasks: token classification
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: expert-generated
Source datasets: original
ArXiv: arxiv:2010.14678
License: mit

This dataset contains the training, validation, and test data for the Shared Task 1: Acronym Identification of the AAAI-21 Workshop on Scientific Document Understanding.
The dataset supports an acronym-identification task, where the aim is to predict which tokens in a pre-tokenized sentence correspond to acronyms. The dataset was released for a Shared Task which supported a leaderboard.
The sentences in the dataset are in English (en).
A sample from the training set is provided below:
{'id': 'TR-0', 'labels': [4, 4, 4, 4, 0, 2, 2, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4], 'tokens': ['What', 'is', 'here', 'called', 'controlled', 'natural', 'language', '(', 'CNL', ')', 'has', 'traditionally', 'been', 'given', 'many', 'different', 'names', '.']}
Please note that for test set sentences only the id and tokens fields are available. The labels field can be ignored for the test set, where every label is set to O.
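As a minimal sketch of how the labels field can be read, the snippet below decodes the training sample above into tag names and groups the tokens into long-form and short-form spans. The tag order in LABEL_NAMES (B-long, B-short, I-long, I-short, O) is an assumption: it is consistent with the sample and with the all-O test labels, but it is not stated in this section. The helper name extract_spans is hypothetical.

```python
# Assumed tag order; consistent with the sample but not stated in this card text.
LABEL_NAMES = ["B-long", "B-short", "I-long", "I-short", "O"]

sample = {
    "id": "TR-0",
    "labels": [4, 4, 4, 4, 0, 2, 2, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    "tokens": ["What", "is", "here", "called", "controlled", "natural",
               "language", "(", "CNL", ")", "has", "traditionally", "been",
               "given", "many", "different", "names", "."],
}


def extract_spans(tokens, labels):
    """Group BIO-tagged tokens into (kind, text) spans; kind is 'long' or 'short'."""
    spans = []
    current_kind, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, kind = LABEL_NAMES[label].partition("-")
        if prefix == "I" and kind == current_kind:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_kind is not None:
            spans.append((current_kind, " ".join(current_tokens)))
        if prefix == "O":
            current_kind, current_tokens = None, []
        else:
            # A B-* tag (or a stray I-* tag) starts a new span.
            current_kind, current_tokens = kind, [token]
    if current_kind is not None:
        spans.append((current_kind, " ".join(current_tokens)))
    return spans


spans = extract_spans(sample["tokens"], sample["labels"])
print(spans)
```

Under the assumed tag order, the sample yields one long form ("controlled natural language") and one short form ("CNL"), which matches how the sentence reads.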
The data instances have the following fields: id, tokens, and labels.
The training, validation, and test sets contain 14,006, 1,717, and 1,750 sentences respectively.
First, most of the existing datasets for acronym identification (AI) are either limited in their sizes or created using simple rule-based methods. This is unfortunate as rules are in general not able to capture all the diverse forms to express acronyms and their long forms in text. Second, most of the existing datasets are in the medical domain, ignoring the challenges in other scientific domains. In order to address these limitations, this paper introduces two new datasets for Acronym Identification. Notably, our datasets are annotated by humans to achieve high quality and have substantially larger numbers of examples than the existing AI datasets in the non-medical domain.
In order to prepare a corpus for acronym annotation, we collect a corpus of 6,786 English papers from arXiv. These papers consist of 2,031,592 sentences that would be used for data annotation for AI in this work.
The dataset paper does not report the exact tokenization method.
Who are the source language producers?
The language comes from papers hosted on the online digital archive arXiv. No more information is available on the selection process or identity of the writers.
Each sentence for annotation needs to contain at least one word in which more than half of the characters are capital letters (i.e., acronym candidates). Afterward, we search for a sub-sequence of words in which the concatenation of the first one, two or three characters of the words (in the order of the words in the sub-sequence) could form an acronym candidate. We call the sub-sequence a long form candidate. If we cannot find any long form candidate, we remove the sentence. Using this process, we end up with 17,506 sentences to be annotated manually by the annotators from Amazon Mechanical Turk (MTurk). In particular, we create a HIT for each sentence and ask the workers to annotate the short forms and the long forms in the sentence. In case of disagreements, if two out of three workers agree on an annotation, we use majority voting to decide the correct annotation. Otherwise, a fourth annotator is hired to resolve the conflict.
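The sentence filter and the two-out-of-three vote described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, matching is assumed to be case-insensitive, the long form candidate is assumed to be a contiguous run of words, and the acronym token itself is assumed to be excluded from it.

```python
from collections import Counter


def is_acronym_candidate(word):
    """True if more than half of the word's characters are capital letters."""
    capitals = sum(1 for ch in word if ch.isupper())
    return len(word) > 0 and capitals > len(word) / 2


def can_spell(acronym, words):
    """True if the first 1, 2, or 3 characters of each word, in order, spell acronym."""
    if not words:
        return acronym == ""
    head, rest = words[0], words[1:]
    for n in (1, 2, 3):
        prefix = head[:n].lower()
        # Skip lengths the word is too short for, then try the remaining words.
        if len(prefix) == n and acronym.startswith(prefix):
            if can_spell(acronym[n:], rest):
                return True
    return False


def find_long_form_candidate(tokens, acronym_index):
    """Return the first contiguous word run that spells the acronym, or None."""
    acronym = tokens[acronym_index].lower()
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            if end - start > len(acronym):
                break  # every word must contribute at least one character
            if start <= acronym_index < end:
                continue  # assumption: the run may not contain the acronym itself
            if can_spell(acronym, tokens[start:end]):
                return tokens[start:end]
    return None


def keep_sentence(tokens):
    """Keep a sentence only if some acronym candidate has a long form candidate."""
    return any(
        is_acronym_candidate(tok) and find_long_form_candidate(tokens, i) is not None
        for i, tok in enumerate(tokens)
    )


def resolve_annotation(annotations):
    """Return the annotation at least two workers gave, or None if a
    further annotator would be needed. Annotations must be hashable."""
    value, count = Counter(annotations).most_common(1)[0]
    return value if count >= 2 else None


tokens = ["What", "is", "here", "called", "controlled", "natural", "language",
          "(", "CNL", ")", "has", "traditionally", "been", "given", "many",
          "different", "names", "."]
print(keep_sentence(tokens))
print(find_long_form_candidate(tokens, tokens.index("CNL")))
```

The recursive check in can_spell tries every split of the acronym into per-word prefixes, which is cheap here because acronyms are short. A non-contiguous reading of "sub-sequence" would need a different search.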
Who are the annotators?
Workers were recruited through Amazon Mechanical Turk and paid $0.05 per annotation. No further demographic information is provided.
Papers published on arXiv are unlikely to contain much personal information, although some include poorly chosen examples that reveal personal details, so the data should be used with care.
[More Information Needed]
[More Information Needed]
Dataset provided for research purposes only. Please check dataset license for additional information.
[More Information Needed]
The dataset provided for this shared task is licensed under the CC BY-NC-SA 4.0 International license.
@inproceedings{Veyseh2020,
  author    = {Amir Pouran Ben Veyseh and Franck Dernoncourt and Quan Hung Tran and Thien Huu Nguyen},
  editor    = {Donia Scott and N{\'{u}}ria Bel and Chengqing Zong},
  title     = {What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics, {COLING} 2020, Barcelona, Spain (Online), December 8-13, 2020},
  pages     = {3285--3301},
  publisher = {International Committee on Computational Linguistics},
  year      = {2020},
  url       = {https://doi.org/10.18653/v1/2020.coling-main.292},
  doi       = {10.18653/v1/2020.coling-main.292}
}
Thanks to @abhishekkrthakur for adding this dataset.