---
annotations_creators:
- machine-generated
language_creators:
- machine-generated
language:
- en
multilinguality:
- monolingual
source_datasets:
- original
license:
- cc-by-sa-4.0
- cc-by-nc-sa-4.0
pretty_name: rebel-dataset
tags:
- arxiv:2005.00614
---
Dataset created for REBEL by interlinking Wikidata and Wikipedia for Relation Extraction, and filtered using NLI.
The dataset is in English, from the English Wikipedia.
REBEL
```
{ 'id': 'Q82442-1',
  'title': 'Arsène Lupin, Gentleman Burglar',
  'context': 'Arsène Lupin , Gentleman Burglar is the first collection of stories by Maurice Leblanc recounting the adventures of Arsène Lupin , released on 10 June 1907 .',
  'triplets': '<triplet> Arsène Lupin, Gentleman Burglar <subj> Maurice Leblanc <obj> author <triplet> Arsène Lupin <subj> Maurice Leblanc <obj> creator' }
```
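A minimal sketch of loading the dataset with the Hugging Face `datasets` library and inspecting an instance like the one above; the repository id comes from this card, and whether extra arguments (e.g. `trust_remote_code`) are needed depends on your `datasets` version:

```python
from datasets import load_dataset

# Load only the train split; the repo id is taken from this card.
train = load_dataset("Babelscape/rebel-dataset", split="train")

example = train[0]
print(example["id"], example["title"])
print(example["context"])
print(example["triplets"])
```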
The original data is in jsonl format and contains much more information. It is divided by Wikipedia article instead of by sentence, and contains metadata about the Wikidata entities, their boundaries in the text, how they were annotated, etc. For more information, check the paper repository and how the data was generated using the Relation Extraction dataset pipeline, cRocoDiLe.
Each instance contains the following fields:

- `id` (`string`): instance identifier, apparently the Wikidata ID of the article's entity plus a per-article sentence counter (e.g. `Q82442-1`); it has no mapping to other datasets.
- `title` (`string`): title of the Wikipedia article the sentence comes from.
- `context` (`string`): the input sentence.
- `triplets` (`string`): the output for the generation task, i.e. the relation triplets linearized as `<triplet> subject <subj> object <obj> relation`, repeated for each triplet in the sentence (see the parsing sketch below).

Entity span indices are not part of this release; the boundary information is only available in the original jsonl data described above.
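As an illustration of the `triplets` format, here is a hedged sketch of a parser; `parse_triplets` is a hypothetical helper, not part of the dataset or the REBEL codebase:

```python
import re

def parse_triplets(linearized: str):
    """Parse a linearized triplet string into (head, tail, relation) tuples.

    Assumed format: <triplet> head <subj> tail <obj> relation, repeated per triplet.
    """
    triplets = []
    for chunk in linearized.split("<triplet>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        m = re.match(r"(.+?)\s*<subj>\s*(.+?)\s*<obj>\s*(.+)", chunk)
        if m:
            head, tail, relation = (part.strip() for part in m.groups())
            triplets.append((head, tail, relation))
    return triplets

example_triplets = ("<triplet> Arsène Lupin, Gentleman Burglar <subj> Maurice Leblanc <obj> author "
                    "<triplet> Arsène Lupin <subj> Maurice Leblanc <obj> creator")
print(parse_triplets(example_triplets))
# [('Arsène Lupin, Gentleman Burglar', 'Maurice Leblanc', 'author'),
#  ('Arsène Lupin', 'Maurice Leblanc', 'creator')]
```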
Test and Validation splits are each 5% of the original data.
The sizes of each split are as follows:
| | Train | Valid | Test |
|---|---|---|---|
| Input sentences | 3,120,296 | 172,860 | 173,601 |
| Input sentences (top 220 relation types as used in original paper) | 784,202 | 43,341 | 43,506 |
| Number of triplets (top 220 relation types as used in original paper) | 878,555 | 48,514 | 48,852 |
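The "top 220 relation types" rows can be approximated with a frequency count over the training triplets. This is a hedged sketch under the assumption that the filtering keeps sentences whose triplets all use a top-220 relation; `parse_triplets` is the hypothetical helper sketched above:

```python
from collections import Counter

from datasets import load_dataset

train = load_dataset("Babelscape/rebel-dataset", split="train")

# Count how often each relation type occurs in the training triplets.
relation_counts = Counter(
    relation
    for row in train
    for _head, _tail, relation in parse_triplets(row["triplets"])
)
top_220 = {rel for rel, _count in relation_counts.most_common(220)}

# Keep only sentences whose triplets all use a top-220 relation type
# (an assumption about the exact filtering criterion used in the paper).
filtered = train.filter(
    lambda row: all(rel in top_220 for _h, _t, rel in parse_triplets(row["triplets"]))
)
print(len(filtered))
```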
This dataset was created to enable the training of a BART-based model as a pre-training phase for Relation Extraction, as presented in the paper REBEL: Relation Extraction By End-to-end Language generation.
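For orientation, a minimal seq2seq setup in this spirit might look as follows, reusing the `example` loaded earlier. This is a sketch, not the authors' training code; the checkpoint, the added special tokens, and the single-example batch are all illustrative assumptions:

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
# Treat the triplet markers as extra tokens (an assumption about the setup).
tokenizer.add_tokens(["<triplet>", "<subj>", "<obj>"], special_tokens=True)

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.resize_token_embeddings(len(tokenizer))

# The sentence is the source sequence; the linearized triplets are the target.
batch = tokenizer(
    example["context"],
    text_target=example["triplets"],
    return_tensors="pt",
)
loss = model(**batch).loss  # standard seq2seq cross-entropy loss
```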
Data comes from Wikipedia text before the table of contents, as well as from Wikidata for the triplet annotations.
Initial Data Collection and Normalization

For the data collection, the dataset extraction pipeline cRocoDiLe (Automatic Relation Extraction Dataset with NLI filtering), inspired by the T-REx Pipeline, was used; more details can be found on the T-REx Website. The starting point is a Wikipedia dump as well as a Wikidata one.
After the triplets are extracted, an NLI system is used to filter out those not entailed by the text.
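As an illustration of that filtering step, here is a hedged sketch using an off-the-shelf MNLI model; the model choice, the naive verbalization template, and the threshold are assumptions, not the paper's exact setup:

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entailed(sentence: str, head: str, relation: str, tail: str,
             threshold: float = 0.75) -> bool:
    """Keep a triplet only if the sentence entails a simple verbalization of it."""
    hypothesis = f"{head} {relation} {tail}."  # naive verbalization (assumption)
    result = nli([{"text": sentence, "text_pair": hypothesis}])[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold

sentence = ("Arsène Lupin , Gentleman Burglar is the first collection of stories "
            "by Maurice Leblanc recounting the adventures of Arsène Lupin , "
            "released on 10 June 1907 .")
print(entailed(sentence, "Arsène Lupin, Gentleman Burglar", "author", "Maurice Leblanc"))
```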
Who are the source language producers?

Any Wikipedia and Wikidata contributor.
The dataset extraction pipeline cRocoDiLe: Automatic Relation Extraction Dataset with NLI filtering.
Who are the annotators?

Annotations are automatic.
All text is from Wikipedia; any personal or sensitive information present there may also be present in this dataset.
The dataset serves as a pre-training step for Relation Extraction models. It is distantly annotated, hence it should only be used as such. A model trained solely on this dataset may produce hallucinations stemming from the silver nature of the dataset.
Since the dataset was automatically created from Wikipedia and Wikidata, it may reflect the biases within those sources.
For Wikipedia text, see for example Dinan et al. (2020) on biases in Wikipedia (esp. Table 1), or Blodgett et al. (2020) for a more general discussion of the topic.
For Wikidata, there are class imbalances, which also stem from Wikipedia.
None for now.
Pere-Lluis Huguet Cabot - Babelscape and Sapienza University of Rome, Italy
Roberto Navigli - Sapienza University of Rome, Italy
Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) . Copyright of the dataset contents belongs to the original copyright holders.
The dataset can be cited as follows:
```bibtex
@inproceedings{huguet-cabot-navigli-2021-rebel,
    title = "REBEL: Relation Extraction By End-to-end Language generation",
    author = "Huguet Cabot, Pere-Llu{\'\i}s and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Online and in the Barceló Bávaro Convention Centre, Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://github.com/Babelscape/rebel/blob/main/docs/EMNLP_2021_REBEL__Camera_Ready_.pdf",
}
```
Thanks to @littlepea13 for adding this dataset.