Question-answering French model, using base CamemBERT fine-tuned on a combo of three French Q&A datasets:
python run_squad.py \ --model_type camembert \ --model_name_or_path camembert-base \ --do_train --do_eval \ --train_file data/SQuAD+fquad+piaf.json \ --predict_file data/fquad_valid.json \ --per_gpu_train_batch_size 12 \ --learning_rate 3e-5 \ --num_train_epochs 4 \ --max_seq_length 384 \ --doc_stride 128 \ --save_steps 10000
{"f1": 79.81, "exact_match": 55.14}
{"f1": 80.61, "exact_match": 59.54}
from transformers import pipeline nlp = pipeline('question-answering', model='etalab-ia/camembert-base-squadFR-fquad-piaf', tokenizer='etalab-ia/camembert-base-squadFR-fquad-piaf') nlp({ 'question': "Qui est Claude Monet?", 'context': "Claude Monet, né le 14 novembre 1840 à Paris et mort le 5 décembre 1926 à Giverny, est un peintre français et l’un des fondateurs de l'impressionnisme." })
This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).
@inproceedings{KeraronLBAMSSS20, author = {Rachel Keraron and Guillaume Lancrenon and Mathilde Bras and Fr{\'{e}}d{\'{e}}ric Allary and Gilles Moyse and Thomas Scialom and Edmundo{-}Pavel Soriano{-}Morales and Jacopo Staiano}, title = {Project {PIAF:} Building a Native French Question-Answering Dataset}, booktitle = {{LREC}}, pages = {5481--5490}, publisher = {European Language Resources Association}, year = {2020} }
@article{dHoffschmidt2020FQuADFQ, title={FQuAD: French Question Answering Dataset}, author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl'e and Quentin Heinrich}, journal={ArXiv}, year={2020}, volume={abs/2002.06071} }
@MISC{kabbadj2018, author = "Kabbadj, Ali", title = "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ", editor = "linkedin.com", month = "November", year = "2018", url = "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}", note = "[Online; posted 11-November-2018]", }
HF model card : https://huggingface.co/camembert-base
@inproceedings{martin2020camembert, title={CamemBERT: a Tasty French Language Model}, author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t}, booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, year={2020} }