mt5-base-multilingual-summarization-multilarge-cs

This model is a checkpoint of google/mt5-base fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.

Task

The model produces multi-sentence summaries in eight languages. By adding documents in other languages alongside a considerable amount of Czech documents, we aimed to improve summarization in Czech. Supported languages and their associated tokens: 'cs': '<extra_id_0>', 'en': '<extra_id_1>', 'de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'
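The language-to-token mapping above can be kept as a small lookup table. The following is only a minimal sketch: the helper name `language_token` is hypothetical, and this card does not specify how the token is attached to the input text, so the sketch stops at the lookup itself.

```python
# Language codes and their tokens, copied from the list above.
LANG_TOKENS = {
    "cs": "<extra_id_0>",
    "en": "<extra_id_1>",
    "de": "<extra_id_2>",
    "es": "<extra_id_3>",
    "fr": "<extra_id_4>",
    "ru": "<extra_id_5>",
    "tu": "<extra_id_6>",
    "zh": "<extra_id_7>",
}


def language_token(lang: str) -> str:
    """Return the token for a supported language code (hypothetical helper)."""
    if lang not in LANG_TOKENS:
        supported = ", ".join(sorted(LANG_TOKENS))
        raise ValueError(f"unsupported language {lang!r}; expected one of: {supported}")
    return LANG_TOKENS[lang]


if __name__ == "__main__":
    print(language_token("cs"))
```

Rejecting unknown codes early avoids silently summarizing into the wrong language when a code is mistyped.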

Usage

## Configuration of summarization pipeline
#
from collections import OrderedDict

def summ_config():
    cfg = OrderedDict([
        
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),
        
        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"), 
        
        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        ## texts to summarize; accepted values: list of strings, a single string, or a dataset
        ("texts",
            [
               "english text1 to summarize",
               "english text2 to summarize",
            ]
        ),
        ## OPTIONAL: target summaries; accepted values: list of strings, a single string, or None
        ('golds',
         [
               "target english text1",
               "target english text2",
         ]),
        #('golds', None),
    ])
    return cfg

cfg = summ_config()
# MultiSummarizer is not defined in this card; it has to be imported from the authors' code.
mSummarize = MultiSummarizer(**cfg)
summaries, scores = mSummarize(**cfg)

Dataset

The Multilingual large summarization dataset consists of 10 sub-datasets, mainly news articles (including CNN/Daily Mail). Training used the entire training set and 72% of the validation set.

Train set:        3 464 563 docs
Validation set:     121 260 docs

dataset	fragment compression	fragment density	fragment coverage	avg document nsent	avg document nwords	avg summary nsent	avg summary nwords	documents
cnc	7.388	0.303	0.088	16.121	316.912	3.272	46.805	750K
sumeczech	11.769	0.471	0.115	27.857	415.711	2.765	38.644	1M
cnndm	13.688	2.983	0.538	32.783	676.026	4.134	54.036	300K
xsum	18.378	0.479	0.194	18.607	369.134	1.000	21.127	225K
mlsum/tu	8.666	5.418	0.461	14.271	214.496	1.793	25.675	274K
mlsum/de	24.741	8.235	0.469	32.544	539.653	1.951	23.077	243K
mlsum/fr	24.388	2.688	0.424	24.533	612.080	1.320	26.93	425K
mlsum/es	36.185	3.705	0.510	31.914	746.927	1.142	21.671	291K
mlsum/ru	78.909	1.194	0.246	62.141	948.079	1.012	11.976	27K
cnewsum	20.183	0.000	0.000	16.834	438.271	1.109	21.926	304K
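The compression, density and coverage columns resemble the extractive-fragment statistics introduced with the Newsroom dataset. This card does not say how they were computed, so the sketch below assumes that definition, with greedy longest-match fragments and whitespace-tokenized input. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def extractive_fragments(article: Sequence[str], summary: Sequence[str]) -> List[List[str]]:
    """Greedily collect the longest token runs shared by summary and article."""
    fragments: List[List[str]] = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(list(summary[i:i + best]))
        i += max(best, 1)  # skip the matched run, or one unmatched token
    return fragments


def fragment_stats(article: Sequence[str], summary: Sequence[str]) -> Dict[str, float]:
    """Compute coverage, density and compression for one document/summary pair."""
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "coverage": sum(len(f) for f in frags) / n,
        "density": sum(len(f) ** 2 for f in frags) / n,
        "compression": len(article) / n,
    }


if __name__ == "__main__":
    doc = "the cat sat on the mat".split()
    summ = "the cat on the mat".split()
    print(fragment_stats(doc, summ))
```

Under this definition, coverage is the share of summary tokens copied from the document, density grows with the length of the copied runs, and compression is the document-to-summary length ratio. Per-dataset values in the table would be averages over pairs.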

Tokenization

Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
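As an illustration of these limits, the sketch below clips or pads a list of token ids to a fixed length. It is a simplification written for this card: the helper name is hypothetical, the pad id of 0 is the one used by T5/mT5 tokenizers, and a real tokenizer would also keep the end-of-sequence token when truncating, which this sketch does not.

```python
from typing import List

ENCODER_MAX_LEN = 512  # input text
DECODER_MAX_LEN = 128  # summary
PAD_ID = 0             # pad token id of T5/mT5 tokenizers


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Return exactly max_len ids: clip long inputs, right-pad short ones."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


if __name__ == "__main__":
    long_input = list(range(1, 601))   # 600 fake token ids
    short_summary = [11, 12, 13]
    print(len(pad_or_truncate(long_input, ENCODER_MAX_LEN)))
    print(len(pad_or_truncate(short_summary, DECODER_MAX_LEN)))
```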

Training

The model was trained with a cross-entropy loss.

Time: 3 days 20 hours
Epochs: 10 of 10 (1080K steps)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss (evaluation loss): 2.462 - 1.797
tloss (training loss): 17.322 - 1.578
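For reference, the cross-entropy objective mentioned above can be sketched in plain Python as the mean negative log-likelihood of the target tokens. The card does not describe how padding was masked, so skipping positions labelled -100 (the PyTorch convention) is an assumption here, and the function name is hypothetical.

```python
import math
from typing import Sequence

IGNORE_INDEX = -100  # assumed label for padded positions (PyTorch convention)


def cross_entropy(logits: Sequence[Sequence[float]],
                  targets: Sequence[int],
                  ignore_index: int = IGNORE_INDEX) -> float:
    """Mean negative log-likelihood over the positions that are not ignored."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    if count == 0:
        raise ValueError("no target positions to score")
    return total / count


if __name__ == "__main__":
    # Two decoder positions over a 4-token vocabulary; the second one is padding.
    print(cross_entropy([[0.0, 0.0, 0.0, 0.0], [9.0, 0.0, 0.0, 0.0]], [1, IGNORE_INDEX]))
```

With uniform logits over a vocabulary of size V, the loss equals ln V, which is a convenient sanity check for any implementation.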

ROUGE results per individual dataset test set:

dataset	ROUGE-1 Precision	ROUGE-1 Recall	ROUGE-1 Fscore	ROUGE-2 Precision	ROUGE-2 Recall	ROUGE-2 Fscore	ROUGE-L Precision	ROUGE-L Recall	ROUGE-L Fscore
cnc	30.62	19.83	23.44	9.94	6.52	7.67	22.92	14.92	17.6
sumeczech	27.57	17.6	20.85	8.12	5.23	6.17	20.84	13.38	15.81
cnndm	43.83	37.73	39.34	20.81	17.82	18.6	31.8	27.42	28.55
xsum	41.63	30.54	34.56	16.13	11.76	13.33	33.65	24.74	27.97
mlsum-tu	54.4	43.29	46.2	38.78	31.31	33.23	48.18	38.44	41
mlsum-de	47.94	44.14	45.11	36.42	35.24	35.42	44.43	41.42	42.16
mlsum-fr	35.26	25.96	28.98	16.72	12.35	13.75	28.06	20.75	23.12
mlsum-es	33.37	24.84	27.52	13.29	10.05	11.05	27.63	20.69	22.87
mlsum-ru	0.79	0.66	0.66	0.26	0.2	0.22	0.79	0.66	0.65
cnewsum	24.49	24.38	23.23	6.48	6.7	6.24	24.18	24.04	22.91
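To make the precision, recall and F-score columns concrete, here is a simplified ROUGE-N sketch based on n-gram overlap. It is not the scorer used to produce the table: lowercasing, whitespace tokenization and the function name are assumptions, and ROUGE-L (longest common subsequence) is not covered. Multiply the returned values by 100 to compare with the table's scale.

```python
from collections import Counter
from typing import Tuple


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, fscore) from clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


if __name__ == "__main__":
    generated = "the cat on the mat"
    gold = "the cat sat on the mat"
    print(rouge_n(generated, gold, n=1))
    print(rouge_n(generated, gold, n=2))
```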

Author:

AI Center FEE CTU

Dataset size:

2.18 GB