数据集:
qanastek/ANTILLES
ANTILLES is a part-of-speech tagging corpora based on UD_French-GSD which was originally created in 2015 and is based on the universal dependency treebank v2.0 .
Originally, the corpora consists of 400,399 words (16,341 sentences) and had 17 different classes. Now, after applying our tags augmentation script transform.py , we obtain 60 different classes which add semantic information such as: the gender, number, mood, person, tense or verb form given in the different CoNLL-U fields from the original corpora.
We based our tags on the level of details given by the LIA_TAGG statistical POS tagger written by Frédéric Béchet in 2001.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License .
part-of-speech-tagging : The dataset can be used to train a model for part-of-speech-tagging. The performance is measured by how high its F1 score is. A Flair Sequence-To-Sequence model trained to tag tokens from Wikipedia passages achieves a F1 score (micro) of 0.952.
The text in the dataset is in French, as spoken by Wikipedia users. The associated BCP-47 code is fr .
from datasets import load_dataset dataset = load_dataset("qanastek/ANTILLES") print(dataset)
from flair.datasets import UniversalDependenciesCorpus corpus: Corpus = UniversalDependenciesCorpus( data_folder='ANTILLES', train_file="train.conllu", test_file="test.conllu", dev_file="dev.conllu" )
from flair.models import SequenceTagger tagger = SequenceTagger.load("qanastek/pos-french")
# sent_id = fr-ud-dev_00005 # text = Travail de trés grande qualité exécuté par un imprimeur artisan passionné. 1 Travail travail NMS _ Gender=Masc|Number=Sing 0 root _ wordform=travail 2 de de PREP _ _ 5 case _ _ 3 trés trés ADV _ _ 4 advmod _ _ 4 grande grand ADJFS _ Gender=Fem|Number=Sing 5 amod _ _ 5 qualité qualité NFS _ Gender=Fem|Number=Sing 1 nmod _ _ 6 exécuté exécuter VPPMS _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 1 acl _ _ 7 par par PREP _ _ 9 case _ _ 8 un un DINTMS _ Definite=Ind|Gender=Masc|Number=Sing|PronType=Art 9 det _ _ 9 imprimeur imprimeur NMS _ Gender=Masc|Number=Sing 6 obl:agent _ _ 10 artisan artisan NMS _ Gender=Masc|Number=Sing 9 nmod _ _ 11 passionné passionné ADJMS _ Gender=Masc|Number=Sing 9 amod _ SpaceAfter=No 12 . . YPFOR _ _ 1 punct _ _
Abbreviation | Description | Examples | # tokens |
---|---|---|---|
PREP | Preposition | de | 63 738 |
AUX | Auxiliary Verb | est | 12 886 |
ADV | Adverb | toujours | 14 969 |
COSUB | Subordinating conjunction | que | 3 007 |
COCO | Coordinating Conjunction | et | 10 102 |
PART | Demonstrative particle | -t | 93 |
PRON | Pronoun | qui ce quoi | 667 |
PDEMMS | Singular Masculine Demonstrative Pronoun | ce | 1 950 |
PDEMMP | Plurial Masculine Demonstrative Pronoun | ceux | 108 |
PDEMFS | Singular Feminine Demonstrative Pronoun | cette | 1 004 |
PDEMFP | Plurial Feminine Demonstrative Pronoun | celles | 53 |
PINDMS | Singular Masculine Indefinite Pronoun | tout | 961 |
PINDMP | Plurial Masculine Indefinite Pronoun | autres | 89 |
PINDFS | Singular Feminine Indefinite Pronoun | chacune | 136 |
PINDFP | Plurial Feminine Indefinite Pronoun | certaines | 31 |
PROPN | Proper noun | houston | 22 135 |
XFAMIL | Last name | levy | 6 449 |
NUM | Numerical Adjectives | trentaine vingtaine | 67 |
DINTMS | Masculine Numerical Adjectives | un | 4 254 |
DINTFS | Feminine Numerical Adjectives | une | 3 543 |
PPOBJMS | Singular Masculine Pronoun complements of objects | le lui | 1 425 |
PPOBJMP | Plurial Masculine Pronoun complements of objects | eux y | 212 |
PPOBJFS | Singular Feminine Pronoun complements of objects | moi la | 358 |
PPOBJFP | Plurial Feminine Pronoun complements of objects | en y | 70 |
PPER1S | Personal Pronoun First Person Singular | je | 571 |
PPER2S | Personal Pronoun Second Person Singular | tu | 19 |
PPER3MS | Personal Pronoun Third Person Masculine Singular | il | 3 938 |
PPER3MP | Personal Pronoun Third Person Masculine Plurial | ils | 513 |
PPER3FS | Personal Pronoun Third Person Feminine Singular | elle | 992 |
PPER3FP | Personal Pronoun Third Person Feminine Plurial | elles | 121 |
PREFS | Reflexive Pronouns First Person of Singular | me m' | 120 |
PREF | Reflexive Pronouns Third Person of Singular | se s' | 2 337 |
PREFP | Reflexive Pronouns First / Second Person of Plurial | nous vous | 686 |
VERB | Verb | obtient | 21 131 |
VPPMS | Singular Masculine Participle Past Verb | formulé | 6 275 |
VPPMP | Plurial Masculine Participle Past Verb | classés | 1 352 |
VPPFS | Singular Feminine Participle Past Verb | appelée | 2 434 |
VPPFP | Plurial Feminine Participle Past Verb | sanctionnées | 813 |
VPPRE | Present participle | étant | 2 |
DET | Determinant | les l' | 25 206 |
DETMS | Singular Masculine Determinant | les | 15 444 |
DETFS | Singular Feminine Determinant | la | 10 978 |
ADJ | Adjective | capable sérieux | 1 075 |
ADJMS | Singular Masculine Adjective | grand important | 8 338 |
ADJMP | Plurial Masculine Adjective | grands petits | 3 274 |
ADJFS | Singular Feminine Adjective | franéaise petite | 8 004 |
ADJFP | Plurial Feminine Adjective | légéres petites | 3 041 |
NOUN | Noun | temps | 1 389 |
NMS | Singular Masculine Noun | drapeau | 29 698 |
NMP | Plurial Masculine Noun | journalistes | 10 882 |
NFS | Singular Feminine Noun | téte | 25 414 |
NFP | Plurial Feminine Noun | ondes | 7 448 |
PREL | Relative Pronoun | qui dont | 2 976 |
PRELMS | Singular Masculine Relative Pronoun | lequel | 94 |
PRELMP | Plurial Masculine Relative Pronoun | lesquels | 29 |
PRELFS | Singular Feminine Relative Pronoun | laquelle | 70 |
PRELFP | Plurial Feminine Relative Pronoun | lesquelles | 25 |
PINTFS | Singular Feminine Interrogative Pronoun | laquelle | 3 |
INTJ | Interjection | merci bref | 75 |
CHIF | Numbers | 1979 10 | 10 417 |
SYM | Symbol | é % | 705 |
YPFOR | Endpoint | . | 15 088 |
PUNCT | Ponctuation | : , | 28 918 |
MOTINC | Unknown words | Technology Lady | 2 022 |
X | Typos & others | sfeir 3D statu | 175 |
Train | Dev | Test | |
---|---|---|---|
# Docs | 14 449 | 1 476 | 416 |
Avg # Tokens / Doc | 24.54 | 24.19 | 24.08 |
[Needs More Information]
[Needs More Information]
Who are the source language producers?[Needs More Information]
[Needs More Information]
Who are the annotators?[Needs More Information]
The corpora is free of personal or sensitive information since it has been based on Wikipedia articles content.
[Needs More Information]
The nature of the corpora introduce various biases such as the names of the streets which are temporaly based and can therefore introduce named entity like author or event names. For example, street names such as Rue Victor-Hugo or Rue Pasteur doesn't exist before the 20's century in France.
[Needs More Information]
ANTILLES : Labrak Yanis, Dufour Richard
UD_FRENCH-GSD : de Marneffe Marie-Catherine, Guillaume Bruno, McDonald Ryan, Suhr Alane, Nivre Joakim, Grioni Matias, Dickerson Carly, Perrier Guy
Universal Dependency : Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee
For the following languages German, Spanish, French, Indonesian, Italian, Japanese, Korean and Brazilian Portuguese we will distinguish between two portions of the data. 1. The underlying text for sentences that were annotated. This data Google asserts no ownership over and no copyright over. Some or all of these sentences may be copyrighted in some jurisdictions. Where copyrighted, Google collected these sentences under exceptions to copyright or implied license rights. GOOGLE MAKES THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESS OR IMPLIED. 2. The annotations -- part-of-speech tags and dependency annotations. These are made available under a CC BY-SA 4.0. GOOGLE MAKES THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESS OR IMPLIED. See attached LICENSE file for the text of CC BY-NC-SA. Portions of the German data were sampled from the CoNLL 2006 Tiger Treebank data. Hans Uszkoreit graciously gave permission to use the underlying sentences in this data as part of this release. Any use of the data should reference the above plus: Universal Dependency Annotation for Multilingual Parsing Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee Proceedings of ACL 2013
Please cite the following paper when using this model.
ANTILLES extended corpus:
@inproceedings{labrak:hal-03696042, TITLE = {{ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus}}, AUTHOR = {Labrak, Yanis and Dufour, Richard}, URL = {https://hal.archives-ouvertes.fr/hal-03696042}, BOOKTITLE = {{25th International Conference on Text, Speech and Dialogue (TSD)}}, ADDRESS = {Brno, Czech Republic}, PUBLISHER = {{Springer}}, YEAR = {2022}, MONTH = Sep, KEYWORDS = {Part-of-speech corpus ; POS tagging ; Open tools ; Word embeddings ; Bi-LSTM ; CRF ; Transformers}, PDF = {https://hal.archives-ouvertes.fr/hal-03696042/file/ANTILLES_A_freNch_linguisTIcaLLy_Enriched_part_of_Speech_corpus.pdf}, HAL_ID = {hal-03696042}, HAL_VERSION = {v1}, }
UD_French-GSD corpora:
@misc{ universaldependencies, title={UniversalDependencies/UD_French-GSD}, url={https://github.com/UniversalDependencies/UD_French-GSD}, journal={GitHub}, author={UniversalDependencies} }
{U}niversal {D}ependency Annotation for Multilingual Parsing:
@inproceedings{mcdonald-etal-2013-universal, title = "{U}niversal {D}ependency Annotation for Multilingual Parsing", author = {McDonald, Ryan and Nivre, Joakim and Quirmbach-Brundage, Yvonne and Goldberg, Yoav and Das, Dipanjan and Ganchev, Kuzman and Hall, Keith and Petrov, Slav and Zhang, Hao and T{\"a}ckstr{\"o}m, Oscar and Bedini, Claudia and Bertomeu Castell{\'o}, N{\'u}ria and Lee, Jungmee}, booktitle = "Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)", month = aug, year = "2013", address = "Sofia, Bulgaria", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/P13-2017", pages = "92--97", }
LIA TAGG:
@techreport{LIA_TAGG, author = {Frédéric Béchet}, title = {LIA_TAGG: a statistical POS tagger + syntactic bracketer}, institution = {Aix-Marseille University & CNRS}, year = {2001} }