- Dataset: allenai/scifact
- Task: fact-checking
- Language: English
- Multilinguality: monolingual
- Size: 1K&lt;n&lt;10K
- Language creators: found
- Annotation creators: expert-generated
- Source datasets: original
- License: CC BY-NC 2.0
SciFact is a dataset of 1.4K expert-written scientific claims, each paired with evidence-containing abstracts and annotated with labels and rationales.
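Both configurations can be loaded with the Hugging Face `datasets` library. A minimal sketch (the config names "claims" and "corpus" match the split tables below):

```python
# Minimal sketch: load SciFact's two configurations with the
# Hugging Face `datasets` library.
from datasets import load_dataset

claims = load_dataset("allenai/scifact", "claims")  # claim/evidence annotations
corpus = load_dataset("allenai/scifact", "corpus")  # the evidence abstracts

print(claims)  # DatasetDict with train/validation/test splits
print(corpus)  # DatasetDict with a single train split
```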
claims
An example of 'validation' looks as follows.
{
  "cited_doc_ids": [14717500],
  "claim": "1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.",
  "evidence_doc_id": "14717500",
  "evidence_label": "SUPPORT",
  "evidence_sentences": [2, 5],
  "id": 3
}
corpus
An example of 'train' looks as follows.
This example was too long and was cropped:
{
  "abstract": "[\"Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and res...",
  "doc_id": 4983,
  "structured": false,
  "title": "Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging."
}
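The two configurations can be joined: `evidence_doc_id` in the claims config points at `doc_id` in the corpus config. A minimal sketch, assuming `abstract` is stored as a list of sentences (as the cropped example above suggests):

```python
# Sketch: look up a claim's evidence abstract and rationale sentences.
from datasets import load_dataset

claims = load_dataset("allenai/scifact", "claims", split="validation")
corpus = load_dataset("allenai/scifact", "corpus", split="train")

# Index abstracts by doc_id for O(1) lookup.
abstracts = {doc["doc_id"]: doc for doc in corpus}

example = claims[0]
if example["evidence_doc_id"]:  # some claims carry no evidence document
    doc = abstracts[int(example["evidence_doc_id"])]  # id is stored as a string
    # evidence_sentences holds indices into the abstract's sentence list.
    rationale = [doc["abstract"][i] for i in example["evidence_sentences"]]
    print(example["claim"])
    print(example["evidence_label"], rationale)
```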
The data fields are the same among all splits.
| | train | validation | test |
|---|---|---|---|
| claims | 1261 | 450 | 300 |

| | train |
|---|---|
| corpus | 5183 |
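After loading, the split sizes reported above can be double-checked with a sketch like the following:

```python
# Sketch: print split sizes to compare against the tables above.
from datasets import load_dataset

claims = load_dataset("allenai/scifact", "claims")
for split in ("train", "validation", "test"):
    print(f"claims/{split}: {len(claims[split])}")  # expected 1261 / 450 / 300

corpus = load_dataset("allenai/scifact", "corpus")
print(f"corpus/train: {len(corpus['train'])}")  # expected 5183
```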
The SciFact dataset is released under the CC BY-NC 2.0 license (https://github.com/allenai/scifact/blob/master/LICENSE.md). By using the SciFact data, you are agreeing to its usage terms.
@inproceedings{wadden-etal-2020-fact,
title = "Fact or Fiction: Verifying Scientific Claims",
author = "Wadden, David and
Lin, Shanchuan and
Lo, Kyle and
Wang, Lucy Lu and
van Zuylen, Madeleine and
Cohan, Arman and
Hajishirzi, Hannaneh",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.609",
doi = "10.18653/v1/2020.emnlp-main.609",
pages = "7534--7550",
}
Thanks to @thomwolf, @lhoestq, @dwadden, @patrickvonplaten, @mariamabarham, and @lewtun for adding this dataset.