Datasets: fquad
Languages: fr
Multilinguality: monolingual
Size: 1K<n<10K
Annotations creators: crowdsourced
Source datasets: original
ArXiv: arxiv:2002.06071
License: cc-by-nc-sa-3.0

FQuAD: French Question Answering Dataset

We introduce FQuAD, a native French Question Answering Dataset.
FQuAD contains 25,000+ question and answer pairs. Fine-tuning CamemBERT on FQuAD yields an F1 score of 88% and an exact match of 77.9%. The dataset was developed to provide a SQuAD equivalent in the French language. Questions are original and based on high-quality Wikipedia articles.
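The F1 and exact-match figures above are the usual SQuAD-style metrics. The following is a minimal sketch of how such scores can be computed for a single prediction, assuming a simplified normalization (lowercasing, replacing ASCII punctuation with spaces, collapsing whitespace). The helper names `normalize`, `exact_match`, and `f1_score` are illustrative, and the official FQuAD evaluation may normalize text differently (for example, by handling French articles).

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, replace ASCII punctuation with spaces, collapse whitespace.

    This is a simplified normalization chosen for illustration only.
    """
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "La Vierge aux rochers"
print(exact_match("la Vierge aux rochers", reference))
print(f1_score("Vierge aux rochers de Léonard", reference))
```

Dataset-level scores are typically obtained by taking, for each question, the best score over all reference answers and then averaging over questions.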
Please note that this dataset is licensed for non-commercial purposes, and users must agree to the following terms and conditions:
Manually request the download of the data from: https://fquad.illuin.tech/
This dataset is exclusively in French, with context data from Wikipedia and questions from French university students (fr).
An example of 'validation' looks as follows.
This example was too long and was cropped:
{
    "answers": {
        "answers_starts": [161, 46, 204],
        "texts": ["La Vierge aux rochers", "documents contemporains", "objets de spéculations"]
    },
    "context": "\"Les deux tableaux sont certes décrits par des documents contemporains à leur création mais ceux-ci ne le font qu'indirectement ...",
    "questions": ["Que concerne principalement les documents ?", "Par quoi sont décrit les deux tableaux ?", "Quels types d'objets sont les deux tableaux aux yeux des chercheurs ?"]
}
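In an extractive QA record of this shape, each start offset is expected to point at the position in the context where the corresponding answer text begins. The sketch below checks that property, assuming the field names shown in the example above (`answers_starts`, `texts`, `context`); the helper `check_spans` and the toy record are hypothetical and not taken from FQuAD. Offsets in the toy record are computed with `str.index` rather than hand-counted.

```python
from typing import Dict, List


def check_spans(record: Dict) -> List[bool]:
    """For each answer, report whether context[start:start+len(text)] equals the text."""
    context = record["context"]
    answers = record["answers"]
    return [
        context[start:start + len(text)] == text
        for start, text in zip(answers["answers_starts"], answers["texts"])
    ]


# Hypothetical toy record, built only to illustrate the check.
context = "Les deux tableaux sont décrits par des documents contemporains."
texts = ["documents contemporains", "Les deux tableaux"]
record = {
    "context": context,
    "questions": [
        "Par quoi sont décrits les deux tableaux ?",
        "Qu'est-ce qui est décrit ?",
    ],
    "answers": {
        # Character offsets of each answer inside the context.
        "answers_starts": [context.index(t) for t in texts],
        "texts": texts,
    },
}

print(check_spans(record))
```

A check like this is a cheap way to catch offset drift after any preprocessing that changes the context string, such as stripping quotes or normalizing whitespace.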
The data fields are the same among all splits.
The FQuAD dataset has 3 splits: train, validation, and test. The test split, however, has not been publicly released at the moment. The splits contain disjoint sets of articles. The following table contains statistics about each split.
Dataset split | Number of articles in split | Number of paragraphs in split | Number of questions in split |
---|---|---|---|
Train | 117 | 4921 | 20731 |
Validation | 18 | 768 | 3188 |
Test | 10 | 532 | 2189 |
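As a quick sanity check, the question counts in the table can be summed and compared with the "25,000+" figure quoted earlier. The sketch below uses only the "Number of questions" column; the variable names are illustrative.

```python
# Question counts per split, copied from the last column of the table above.
questions_per_split = {"train": 20731, "validation": 3188, "test": 2189}

total_questions = sum(questions_per_split.values())

# Share of all questions held by each split, in percent, rounded to one decimal.
shares = {
    split: round(100 * count / total_questions, 1)
    for split, count in questions_per_split.items()
}

print(total_questions)
for split, share in shares.items():
    print(f"{split}: {share}%")
```

The total comes to 26,108 questions, consistent with the 25,000+ pairs stated in the summary, with roughly four fifths of them in the train split.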
The FQuAD dataset was created by Illuin Technology. It was developed to provide a SQuAD equivalent in the French language. Questions are original and based on high-quality Wikipedia articles.
The texts used for the contexts come from the curated list of French high-quality Wikipedia articles.
Annotations (spans and questions) were written by students of the CentraleSupélec school of engineering. Wikipedia articles were scraped, and Illuin used an internally developed tool to help annotators ask questions and indicate the answer spans. Annotators were given paragraph-sized contexts and asked to generate 4 to 5 non-trivial questions about information in the context.
No personal or sensitive information is included in this dataset. This has been manually verified by the dataset curators.
Users should keep in mind that this dataset is sampled from Wikipedia data, which might not be representative of all QA use cases.
The social biases of this dataset have not yet been investigated.
The social biases of this dataset have not yet been investigated, though articles have been selected by their quality and objectivity.
The limitations of the FQuAD dataset have not yet been investigated.
Illuin Technology: https://fquad.illuin.tech/
The FQuAD dataset is licensed under the CC BY-NC-SA 3.0 license.
It allows personal and academic research uses of the dataset, but not commercial uses. Concretely, the dataset cannot be used to train a model that is then put into production within a business or a company. For this type of commercial use, we invite FQuAD users to contact the authors to discuss possible partnerships.
@ARTICLE{2020arXiv200206071,
    author = {d'Hoffschmidt, Martin and Vidal, Maxime and Belblidia, Wacim and Brendlé, Tom},
    title = "{FQuAD: French Question Answering Dataset}",
    journal = {arXiv e-prints},
    keywords = {Computer Science - Computation and Language},
    year = "2020",
    month = "Feb",
    eid = {arXiv:2002.06071},
    pages = {arXiv:2002.06071},
    archivePrefix = {arXiv},
    eprint = {2002.06071},
    primaryClass = {cs.CL}
}
Thanks to @thomwolf, @mariamabarham, @patrickvonplaten, @lewtun, and @albertvillanova for adding this dataset. Thanks to @ManuelFay for providing information on the dataset creation process.