Datasets: allenai/scifact
Tasks: Text Classification
Sub-tasks: fact-checking
Languages: en
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: found
Annotations creators: expert-generated
Source datasets: original
License: cc-by-nc-2.0

SciFact is a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales.
claims
An example of 'validation' looks as follows.
{ "cited_doc_ids": [14717500], "claim": "1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.", "evidence_doc_id": "14717500", "evidence_label": "SUPPORT", "evidence_sentences": [2, 5], "id": 3 }

corpus
An example of 'train' looks as follows.
This example was too long and was cropped: { "abstract": "[\"Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and res...", "doc_id": 4983, "structured": false, "title": "Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging." }
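The two configurations are linked by document ID: `evidence_doc_id` in a claim record points at `doc_id` in the corpus. The following is a minimal sketch of that join, not an official loader. It assumes that `abstract` is a list of sentence strings, that `evidence_sentences` holds zero-based indices into that list, and that claims without evidence carry an empty `evidence_doc_id`. The corpus record and the helper `rationale_sentences` are hypothetical and exist only for illustration. Note that in the examples above `evidence_doc_id` is a string while `doc_id` is an integer, so the sketch converts before looking up.

```python
from typing import Dict, List

# Claim record copied from the 'validation' example above.
claim = {
    "cited_doc_ids": [14717500],
    "claim": (
        "1,000 genomes project enables mapping of genetic sequence variation "
        "consisting of rare variants with larger penetrance effects than "
        "common variants."
    ),
    "evidence_doc_id": "14717500",
    "evidence_label": "SUPPORT",
    "evidence_sentences": [2, 5],
    "id": 3,
}

# Hypothetical corpus record: the sentences are placeholders, not the real abstract.
corpus = [
    {
        "doc_id": 14717500,
        "title": "placeholder title",
        "structured": False,
        "abstract": [f"placeholder sentence {i}" for i in range(7)],
    },
]


def rationale_sentences(claim: Dict, corpus_by_id: Dict[int, Dict]) -> List[str]:
    """Return the abstract sentences that a claim record marks as evidence."""
    # Assumption: an empty evidence_doc_id means the claim has no evidence.
    if not claim["evidence_doc_id"]:
        return []
    # evidence_doc_id is a string in the claim record; doc_id is an int.
    doc = corpus_by_id[int(claim["evidence_doc_id"])]
    # Assumption: evidence_sentences are zero-based indices into the abstract.
    return [doc["abstract"][i] for i in claim["evidence_sentences"]]


# Index the corpus once so that each lookup is a dictionary access.
corpus_by_id = {doc["doc_id"]: doc for doc in corpus}
selected = rationale_sentences(claim, corpus_by_id)
print(claim["evidence_label"], selected)
```

Building the `corpus_by_id` dictionary once avoids rescanning the corpus for every claim.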
The data fields are the same among all splits.
| | train | validation | test |
|---|---|---|---|
| claims | 1261 | 450 | 300 |

| | train |
|---|---|
| corpus | 5183 |
https://github.com/allenai/scifact/blob/master/LICENSE.md
The SciFact dataset is released under the CC BY-NC 2.0 license. By using the SciFact data, you agree to its usage terms.
@inproceedings{wadden-etal-2020-fact,
    title = "Fact or Fiction: Verifying Scientific Claims",
    author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.609",
    doi = "10.18653/v1/2020.emnlp-main.609",
    pages = "7534--7550",
}
Thanks to @thomwolf, @lhoestq, @dwadden, @patrickvonplaten, @mariamabarham, and @lewtun for adding this dataset.