Dataset: clips/mfaq
Tasks: question-answering
Sub-tasks: multiple-choice-qa
Multilinguality: multilingual
Language creators: other
Annotation creators: no-annotation
Source datasets: original
ArXiv: arxiv:2109.12870
License: cc0-1.0

See MQA or MFAQ Light for an updated version of the dataset.
MFAQ is a multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
```python
from datasets import load_dataset

load_dataset("clips/mfaq", "en")
```

```
{
  "qa_pairs": [
    {
      "question": "Do I need a rental Car in Cork?",
      "answer": "If you plan on travelling outside of Cork City, for instance to Kinsale [...]"
    },
    ...
  ]
}
```
We collected around 6M pairs of questions and answers in 21 different languages. To download a language specific subset you need to specify the language key as configuration. See below for an example.
load_dataset("clips/mfaq", "en") # replace "en" by any language listed below
Language | Key | Pairs | Pages |
---|---|---|---|
All | all | 6,346,693 | 1,035,649 |
English | en | 3,719,484 | 608,796 |
German | de | 829,098 | 111,618 |
Spanish | es | 482,818 | 75,489 |
French | fr | 351,458 | 56,317 |
Italian | it | 155,296 | 24,562 |
Dutch | nl | 150,819 | 32,574 |
Portuguese | pt | 138,778 | 26,169 |
Turkish | tr | 102,373 | 19,002 |
Russian | ru | 91,771 | 22,643 |
Polish | pl | 65,182 | 10,695 |
Indonesian | id | 45,839 | 7,910 |
Norwegian | no | 37,711 | 5,143 |
Swedish | sv | 37,003 | 5,270 |
Danish | da | 32,655 | 5,279 |
Vietnamese | vi | 27,157 | 5,261 |
Finnish | fi | 20,485 | 2,795 |
Romanian | ro | 17,066 | 3,554 |
Czech | cs | 16,675 | 2,568 |
Hebrew | he | 11,212 | 1,921 |
Hungarian | hu | 8,598 | 1,264 |
Croatian | hr | 5,215 | 819 |
The data is organized by page. Each page contains a list of questions and answers.
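For the default nested configuration, iterating over pages and their question/answer pairs might look like the following sketch. This is illustrative only: the `"train"` split name is an assumption, and only the `qa_pairs`, `question` and `answer` fields shown in the example above are relied on.

```python
from datasets import load_dataset

# A minimal sketch, assuming the pages are exposed under a "train" split.
pages = load_dataset("clips/mfaq", "en", split="train")

for page in pages.select(range(3)):  # first three pages
    for pair in page["qa_pairs"]:
        print(pair["question"])
        print(pair["answer"][:80])
```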
A flattened version is also available, in which pages are flattened and the data is organized by pair. You can access the flat version of any language by appending `_flat` to the configuration name (e.g. `en_flat`); the data will then be returned pair by pair instead of page by page, as in the sketch below.
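Loading the flat configuration works the same way. As a sketch, assuming the flat records expose the same `question` and `answer` fields as the nested `qa_pairs` entries (and again assuming a `"train"` split):

```python
from datasets import load_dataset

# Flat configuration: one record per question/answer pair instead of per page.
pairs = load_dataset("clips/mfaq", "en_flat", split="train")

first = pairs[0]
print(first["question"])  # field names assumed by analogy with the nested format
print(first["answer"])
```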
This section was adapted from the source data description of OSCAR.
Common Crawl is a non-profit foundation which produces and maintains an open repository of web crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.
To construct MFAQ, the WARC files of Common Crawl were used. We looked for FAQPage markup in the HTML and subsequently parsed the FAQItem entries from each page.
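The actual pipeline operates on WARC files at Common Crawl scale. Purely as an illustration, here is a minimal sketch of extracting question/answer pairs from schema.org FAQPage markup embedded as JSON-LD in a single HTML page; the function name, the use of BeautifulSoup, and the restriction to JSON-LD (FAQPage markup can also appear as microdata or RDFa) are our assumptions, not the authors' code.

```python
import json
from bs4 import BeautifulSoup

def extract_faq_pairs(html: str):
    """Collect (question, answer) pairs from schema.org FAQPage JSON-LD markup."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        # A page may embed a single JSON-LD object or a list of them.
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict) or item.get("@type") != "FAQPage":
                continue
            for entity in item.get("mainEntity", []):
                question = entity.get("name", "")
                answer = entity.get("acceptedAnswer", {}).get("text", "")
                if question and answer:
                    pairs.append((question, answer))
    return pairs
```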
This dataset was developed by Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.
These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
```bibtex
@misc{debruyn2021mfaq,
  title={MFAQ: a Multilingual FAQ Dataset},
  author={Maxime {De Bruyn} and Ehsan Lotfi and Jeska Buhmann and Walter Daelemans},
  year={2021},
  eprint={2109.12870},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```