Model:
classla/xlm-roberta-base-multilingual-text-genre-classifier
Text classification model based on xlm-roberta-base and fine-tuned on a combination of three genre datasets: the Slovene GINCO dataset, the English CORE dataset and the English FTD dataset. The model can be used for automatic genre identification and applied to any text in a language supported by xlm-roberta-base.
The model was fine-tuned on the "X-GENRE" dataset, which combines three genre datasets: CORE, FTD and GINCO. Each dataset has its own genre schema, so they were merged into a joint schema (the "X-GENRE" schema) based on a comparison of the labels and cross-dataset experiments (described in detail here).
Fine-tuning was performed with simpletransformers. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```
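For reference, a minimal sketch of how such fine-tuning could be reproduced with simpletransformers is shown below. The placeholder training DataFrame (and its two example rows) is an assumption for illustration only; the actual X-GENRE training data is not bundled with this model card.

```python
# Minimal fine-tuning sketch with the hyperparameters listed above.
# The DataFrame below is a placeholder showing the expected format:
# a "text" column and an integer "labels" column (0-8, see labels_map below).
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame(
    {
        "text": [
            "First, preheat the oven to 180 degrees. Then mix the flour and sugar ...",
            "The licensee agrees to the following terms and conditions ...",
        ],
        "labels": [3, 7],  # 3 = Instruction, 7 = Legal
    }
)

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "overwrite_output_dir": True,
}

# xlm-roberta-base is the base model that was fine-tuned;
# num_labels matches the 9 X-GENRE labels.
model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",
    num_labels=9,
    use_cuda=True,
    args=model_args,
)

model.train_model(train_df)
```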
An example of preparing data for genre identification and post-processing the results can be found here, where we applied the X-GENRE classifier to the English part of the MaCoCu parallel corpora.
For reliable results, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words). It is advised to discard predictions made with a confidence lower than 0.9. Furthermore, the label "Other" can serve as another indicator of low confidence, as it often means that the text does not have enough features of any genre; these predictions can be discarded as well.
After the proposed post-processing (removal of low-confidence predictions, of the label "Other", and in this specific case also of the label "Forum"), the performance on the MaCoCu data, based on manual inspection, reached macro and micro F1 of 0.92.
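A minimal sketch of this post-processing is given below, assuming scipy is available to turn the returned logits into per-document confidence scores. The word-count and confidence thresholds follow the recommendations above; the input list `texts` is a placeholder.

```python
# Post-processing sketch: keep only predictions on sufficiently long
# documents, with confidence >= 0.9 and a label other than "Other".
import numpy as np
from scipy.special import softmax
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=True,
    args={"max_seq_length": 512, "silent": True},
)

texts = ["Replace this placeholder with the documents you want to classify ..."]

predictions, logit_output = model.predict(texts)
probs = softmax(np.array(logit_output), axis=1)   # per-label probabilities
confidences = probs.max(axis=1)                   # confidence of the predicted label
labels = [model.config.id2label[int(i)] for i in predictions]

reliable = [
    (text, label, conf)
    for text, label, conf in zip(texts, labels, confidences)
    if len(text.split()) >= 75     # documents of sufficient length
    and conf >= 0.9                # discard low-confidence predictions
    and label != "Other"           # "Other" often signals low genre evidence
]
```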
```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "silent": True,
}

model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=True,
    args=model_args,
)

predictions, logit_output = model.predict(
    [
        "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
        "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
    ]
)

predictions
# Output: array([3, 8])

[model.config.id2label[i] for i in predictions]
# Output: ['Instruction', 'Promotion']
```
A usage example for prediction on a dataset with batch processing is available via Google Colab.
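For orientation, a minimal sketch of such batch processing is given below, assuming the documents are stored in a CSV file with a "text" column. The file names and batch size are hypothetical; the linked Colab notebook shows the authors' own workflow.

```python
# Batch prediction sketch over a CSV dataset (file names are hypothetical).
import pandas as pd
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=True,
    args={"max_seq_length": 512, "silent": True},
)

df = pd.read_csv("corpus.csv")  # hypothetical input file with a "text" column

all_labels = []
batch_size = 32  # hypothetical batch size; adjust to available memory
for start in range(0, len(df), batch_size):
    batch = df["text"].iloc[start:start + batch_size].tolist()
    predictions, _ = model.predict(batch)
    all_labels.extend(model.config.id2label[int(i)] for i in predictions)

df["X-GENRE"] = all_labels
df.to_csv("corpus_with_genres.csv", index=False)
```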
List of labels:
```python
labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion']

labels_map = {'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction': 3, 'Opinion/Argumentation': 4, 'Forum': 5, 'Prose/Lyrical': 6, 'Legal': 7, 'Promotion': 8}
```
Description of labels:
Label | Description | Examples |
---|---|---|
Information/Explanation | An objective text that describes or presents an event, a person, a thing, a concept etc. Its main purpose is to inform the reader about something. Common features: objective/factual, explanation/definition of a concept (x is …), enumeration. | research article, encyclopedia article, informational blog, product specification, course materials, general information, job description, manual, horoscope, travel guide, glossaries, historical article, biographical story/history. |
Instruction | An objective text which instructs the readers on how to do something. Common features: multiple steps/actions, chronological order, 1st person plural or 2nd person, modality (must, have to, need to, can, etc.), adverbial clauses of manner (in a way that), of condition (if), of time (after …). | how-to texts, recipes, technical support |
Legal | An objective formal text that contains legal terms and is clearly structured. The name of the text type is often included in the headline (contract, rules, amendment, general terms and conditions, etc.). Common features: objective/factual, legal terms, 3rd person. | small print, software license, proclamation, terms and conditions, contracts, law, copyright notices, university regulation |
News | An objective or subjective text which reports on an event recent at the time of writing or coming in the near future. Common features: adverbs/adverbial clauses of time and/or place (dates, places), many proper nouns, direct or reported speech, past tense. | news report, sports report, travel blog, reportage, police report, announcement |
Opinion/Argumentation | A subjective text in which the authors convey their opinion or narrate their experience. It includes promotion of an ideology and other non-commercial causes. This genre includes a subjective narration of a personal experience as well. Common features: adjectives/adverbs that convey opinion, words that convey (un)certainty (certainly, surely), 1st person, exclamation marks. | review, blog (personal blog, travel blog), editorial, advice, letter to editor, persuasive article or essay, formal speech, pamphlet, political propaganda, columns, political manifesto |
Promotion | A subjective text intended to sell or promote an event, product, or service. It addresses the readers, often trying to convince them to participate in something or buy something. Common features: contains adjectives/adverbs that promote something (high-quality, perfect, amazing), comparative and superlative forms of adjectives and adverbs (the best, the greatest, the cheapest), addressing the reader (usage of 2nd person), exclamation marks. | advertisement, promotion of a product (e-shops), promotion of an accommodation, promotion of company's services, invitation to an event |
Forum | A text in which people discuss a certain topic in form of comments. Common features: multiple authors, informal language, subjective (the writers express their opinions), written in 1st person. | discussion forum, reader/viewer responses, QA forum |
Prose/Lyrical | A literary text that consists of paragraphs or verses. A literary text is deemed to have no other practical purpose than to give pleasure to the reader. Often the author pays attention to the aesthetic appearance of the text. It can be considered art. | lyrics, poem, prayer, joke, novel, short story |
Other | A text which does not fall under any of the other genre categories. |
The X-GENRE model was compared with xlm-roberta-base classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments at https://github.com/TajaKuzman/Genre-Datasets-Comparison ).
In the in-dataset experiments (trained and tested on splits of the same dataset), the X-GENRE classifier outperforms all others except the classifier trained on the FTD dataset, which has a smaller number of X-GENRE labels.
Trained on | Micro F1 | Macro F1 |
---|---|---|
FTD | 0.843 | 0.851 |
X-GENRE | 0.797 | 0.794 |
CORE | 0.778 | 0.627 |
GINCO | 0.754 | 0.75 |
When applied to the test splits of each of the datasets, the classifier performs well:
Trained on | Tested on | Micro F1 | Macro F1 |
---|---|---|---|
X-GENRE | CORE | 0.837 | 0.859 |
X-GENRE | FTD | 0.804 | 0.809 |
X-GENRE | X-GENRE | 0.797 | 0.794 |
X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
X-GENRE | GINCO | 0.749 | 0.758 |
The classifier was also compared with the other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
Trained on | Tested on | Micro F1 | Macro F1 |
---|---|---|---|
X-GENRE | EN-GINCO | 0.688 | 0.691 |
X-GENRE | FinCORE | 0.674 | 0.581 |
GINCO | EN-GINCO | 0.632 | 0.502 |
FTD | EN-GINCO | 0.574 | 0.475 |
CORE | EN-GINCO | 0.485 | 0.422 |
The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier, trained on all three datasets, outperforms classifiers trained on just one of the datasets.
If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:
```bibtex
@misc{Kuzman2022,
  author       = {Kuzman, Taja},
  title        = {{Comparison of genre datasets: CORE, GINCO and FTD}},
  year         = {2022},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
}
```
and the following paper on which the original model is based:
```bibtex
@article{DBLP:journals/corr/abs-1911-02116,
  author     = {Alexis Conneau and Kartikay Khandelwal and Naman Goyal and Vishrav Chaudhary and Guillaume Wenzek and Francisco Guzm{\'{a}}n and Edouard Grave and Myle Ott and Luke Zettlemoyer and Veselin Stoyanov},
  title      = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal    = {CoRR},
  volume     = {abs/1911.02116},
  year       = {2019},
  url        = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint     = {1911.02116},
  timestamp  = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
To cite the datasets that were used for fine-tuning:
CORE dataset:
```bibtex
@article{egbert2015developing,
  title     = {Developing a bottom-up, user-based method of web register classification},
  author    = {Egbert, Jesse and Biber, Douglas and Davies, Mark},
  journal   = {Journal of the Association for Information Science and Technology},
  volume    = {66},
  number    = {9},
  pages     = {1817--1831},
  year      = {2015},
  publisher = {Wiley Online Library}
}
```
GINCO dataset:
```bibtex
@InProceedings{kuzman-rupnik-ljubei:2022:LREC,
  author    = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola},
  title     = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1584--1594},
  url       = {https://aclanthology.org/2022.lrec-1.170}
}
```
FTD dataset:
```bibtex
@article{sharoff2018functional,
  title     = {Functional text dimensions for the annotation of web corpora},
  author    = {Sharoff, Serge},
  journal   = {Corpora},
  volume    = {13},
  number    = {1},
  pages     = {65--95},
  year      = {2018},
  publisher = {Edinburgh University Press}
}
```
The datasets are available at: