Datasets:
fever
Tasks:
Text Classification
Languages:
en
Multilinguality:
monolingual
Size Categories:
100K<n<1M
Language Creators:
found
Annotations Creators:
crowdsourced
Source Datasets:
extended|wikipedia
With billions of individual pages on the web providing information on almost every conceivable topic, we should have the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this information is contained in structured sources (Wikidata, Freebase, etc.); we are therefore limited by our ability to transform free-form text to structured knowledge. There is, however, another problem that has become the focus of a lot of recent research and media coverage: false information coming from unreliable sources.
The FEVER workshops are a venue for work on verifiable knowledge extraction and are intended to stimulate progress in this direction.
FEVER Dataset: FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment.
FEVER 2.0 Adversarial Attacks Dataset: The FEVER 2.0 Dataset consists of 1174 claims created by the submissions of participants in the Breaker phase of the 2019 shared task. Participants (Breakers) were tasked with generating adversarial examples that induce classification errors for the existing systems. Breakers submitted a dataset of up to 1000 instances with an equal number of instances for each of the three classes (Supported, Refuted, NotEnoughInfo). Only novel claims (i.e. claims not contained in the original FEVER dataset) were considered valid entries to the shared task. The submissions were then manually evaluated for Correctness (grammatical, appropriately labeled, and meeting the FEVER annotation guidelines requirements).
The task is verification of textual claims against textual sources.
Compared to textual entailment (TE)/natural language inference, the key difference is that in those tasks the passage used to verify each claim is given, and in recent years it has typically consisted of a single sentence, whereas in verification systems the passage is retrieved from a large set of documents in order to form the evidence.
The dataset is in English.
v1.0
An example of 'train' looks as follows.
{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.', 'evidence_wiki_url': 'Nikolaj_Coster-Waldau', 'label': 'SUPPORTS', 'id': 75397, 'evidence_id': 104971, 'evidence_sentence_id': 7, 'evidence_annotation_id': 92206}
v2.0
An example of 'validation' looks as follows.
{'claim': "There is a convicted statutory rapist called Chinatown's writer.", 'evidence_wiki_url': '', 'label': 'NOT ENOUGH INFO', 'id': 500000, 'evidence_id': -1, 'evidence_sentence_id': -1, 'evidence_annotation_id': 269158}
wiki_pages
An example of 'wikipedia_pages' looks as follows.
{'text': 'The following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world . ', 'lines': '0\tThe following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world .\n1\t', 'id': '1928_in_association_football'}
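The 'lines' field in the wiki_pages example appears to hold one sentence per row, each row being a sentence index and the sentence text separated by a tab. The following is a minimal sketch of how a claim record might be resolved to its evidence sentence under that reading. It assumes (the card does not state this) that 'evidence_wiki_url' matches a page 'id' and that 'evidence_sentence_id' is the index used in 'lines'. The helper names `parse_wiki_lines` and `lookup_evidence`, and the `demo_record` that points at the 1928 page, are hypothetical and exist only for illustration.

```python
from typing import Dict, Optional


def parse_wiki_lines(lines: str) -> Dict[int, str]:
    """Split a page's 'lines' field into {sentence_index: sentence_text}."""
    sentences: Dict[int, str] = {}
    for row in lines.split("\n"):
        parts = row.split("\t")
        # Skip rows that do not start with a numeric sentence index.
        if not parts[0].isdigit():
            continue
        # The second column is taken as the sentence text; any further
        # tab-separated columns are ignored in this sketch.
        text = parts[1] if len(parts) > 1 else ""
        if text:
            sentences[int(parts[0])] = text
    return sentences


def lookup_evidence(record: dict, pages: Dict[str, dict]) -> Optional[str]:
    """Return the evidence sentence for a claim record, or None if absent."""
    # Records without evidence use -1 as the sentence id (see the v2.0 example).
    if record["evidence_sentence_id"] < 0:
        return None
    page = pages.get(record["evidence_wiki_url"])
    if page is None:
        return None
    return parse_wiki_lines(page["lines"]).get(record["evidence_sentence_id"])


# The wiki page shown above, keyed by its 'id'.
pages = {
    "1928_in_association_football": {
        "lines": "0\tThe following are the football -LRB- soccer -RRB- events "
                 "of the year 1928 throughout the world .\n1\t",
    }
}

# Hypothetical record, made up for illustration only.
demo_record = {"evidence_wiki_url": "1928_in_association_football",
               "evidence_sentence_id": 0}
# Evidence fields copied from the 'NOT ENOUGH INFO' example above.
nei_record = {"evidence_wiki_url": "", "evidence_sentence_id": -1}

demo_sentence = lookup_evidence(demo_record, pages)
nei_sentence = lookup_evidence(nei_record, pages)
```

Rows with an index but no text, such as the trailing `1\t` in the example, are dropped rather than stored as empty sentences.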
The data fields are the same among all splits.
v1.0
| | train | unlabelled_dev | labelled_dev | paper_dev | unlabelled_test | paper_test |
|---|---|---|---|---|---|---|
| v1.0 | 311431 | 19998 | 37566 | 18999 | 19998 | 18567 |

| | validation |
|---|---|
| v2.0 | 2384 |

| | wikipedia_pages |
|---|---|
| wiki_pages | 5416537 |
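The v1.0 train split has more rows than the 185,445 claims reported for the whole dataset, which would be consistent with one row per claim and evidence sentence pair rather than one row per claim. That reading is an assumption here, not something this card states. Under that assumption, a minimal sketch for collapsing rows back into one entry per claim could look like the following. The function name `group_by_claim` is hypothetical, and the second row for claim 75397 is invented purely to show the grouping.

```python
from typing import Dict, Iterable


def group_by_claim(rows: Iterable[dict]) -> Dict[int, dict]:
    """Collapse per-evidence rows into one entry per claim 'id'."""
    claims: Dict[int, dict] = {}
    for row in rows:
        entry = claims.setdefault(
            row["id"],
            {"claim": row["claim"], "label": row["label"], "evidence": []},
        )
        # A sentence id of -1 marks a row with no evidence attached.
        if row["evidence_sentence_id"] >= 0:
            entry["evidence"].append(
                (row["evidence_wiki_url"], row["evidence_sentence_id"])
            )
    return claims


rows = [
    # Taken from the 'train' example in this card.
    {"id": 75397, "label": "SUPPORTS",
     "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
     "evidence_wiki_url": "Nikolaj_Coster-Waldau", "evidence_sentence_id": 7},
    # Hypothetical extra evidence row for the same claim (illustration only).
    {"id": 75397, "label": "SUPPORTS",
     "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
     "evidence_wiki_url": "Fox_Broadcasting_Company", "evidence_sentence_id": 0},
    # Taken from the 'validation' example in this card.
    {"id": 500000, "label": "NOT ENOUGH INFO",
     "claim": "There is a convicted statutory rapist called Chinatown's writer.",
     "evidence_wiki_url": "", "evidence_sentence_id": -1},
]

claims = group_by_claim(rows)
```

Keeping evidence as (page, sentence index) pairs leaves the choice of how to fetch the sentence text to the caller.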
FEVER license:
These data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy. These annotations are made available under the license terms described on the applicable Wikipedia article pages, or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0), available at http://creativecommons.org/licenses/by-sa/3.0/ (collectively, the “License Terms”). You may not use these files except in compliance with the applicable License Terms.
If you use "FEVER Dataset", please cite:
@inproceedings{Thorne18Fever,
  author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
  title = {{FEVER}: a Large-scale Dataset for Fact Extraction and {VERification}},
  booktitle = {NAACL-HLT},
  year = {2018}
}
If you use "FEVER 2.0 Adversarial Attacks Dataset", please cite:
@inproceedings{Thorne19FEVER2,
  author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit},
  title = {The {FEVER2.0} Shared Task},
  booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}},
  year = {2018}
}
Thanks to @thomwolf, @lhoestq, @mariamabarham, @lewtun, @albertvillanova for adding this dataset.