Dataset:
dennlinger/eur-lex-sum
The EUR-Lex-Sum dataset is a multilingual resource intended for text summarization in the legal domain. It is based on human-written summaries of legal acts issued by the European Union. It distinguishes itself by introducing a smaller set of high-quality, human-written samples, each of which has much longer references (and summaries!) than comparable datasets. Additionally, the underlying legal acts provide a challenging domain-specific application, as legal texts are so far underrepresented in non-English languages. Each legal act can be available in up to 24 languages (the officially recognized languages of the European Union); the validation and test samples consist entirely of instances available in all languages, and are aligned across all languages at the paragraph level.
The dataset supports all official languages of the European Union. At the time of collection, those were 24 languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish.
Both the reference texts and the summaries are translated from an English original text (this was confirmed by private correspondence with the Publications Office of the European Union). Translations and summaries are written by external (professional) parties, contracted by the EU.
Depending on the availability of document summaries in particular languages, we have between 391 (Irish) and 1,505 (French) samples available. Over 80% of samples are available in at least 20 languages.
Data instances contain fairly minimal information. Aside from a unique identifier, corresponding to the Celex ID generated by the EU, two further fields specify the original long-form legal act and its associated summary.
{
  "celex_id": "3A32021R0847",
  "reference": "REGULATION (EU) 2021/847 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\n [...]",
  "summary": "Supporting EU cooperation in the field of taxation: Fiscalis (2021-2027)\n\n [...]"
}
We provide pre-split training, validation and test splits. To obtain the validation and test splits, we randomly assigned all samples that are available across all 24 languages into two equally large portions. In total, 375 instances are available in 24 languages, which means we obtain a validation split of 187 samples and 188 test instances. All remaining instances are assigned to the language-specific training portions, which differ in their exact size.
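The split assignment described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual script; in particular, the random seed and the exact shuffling procedure are assumptions.

```python
import random

# Hypothetical sketch of the split procedure: samples available in all
# 24 languages are shuffled and divided into two (nearly) equally large
# validation and test portions. The seed value here is an assumption.
def assign_splits(aligned_ids, seed=42):
    ids = sorted(aligned_ids)      # deterministic base order before shuffling
    rng = random.Random(seed)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]  # (validation, test)

# With the 375 fully aligned instances, this yields 187 + 188 samples.
validation_ids, test_ids = assign_splits([f"celex_{i}" for i in range(375)])
```

All remaining, non-aligned instances would then fall into the language-specific training portions.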
We particularly ensured that no duplicates exist across the three splits. For this purpose, we ensured that no exactly matching reference or summary exists for any sample. Further information on the length distributions (for the English subset) can be found in the paper.
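The duplicate check can be expressed as a small sketch: walk over all splits and verify that no exactly matching reference or summary text occurs twice. The function name and toy data are illustrative, not part of the released code.

```python
# Sketch of the duplicate check described above: flag any exactly matching
# reference or summary text that appears more than once across the splits.
def has_cross_split_duplicates(splits):
    seen_refs, seen_sums = set(), set()
    for samples in splits:
        for sample in samples:
            if sample["reference"] in seen_refs or sample["summary"] in seen_sums:
                return True
            seen_refs.add(sample["reference"])
            seen_sums.add(sample["summary"])
    return False

# Toy splits for illustration only.
train_toy = [{"reference": "act A", "summary": "summary A"}]
valid_toy = [{"reference": "act B", "summary": "summary B"}]
```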
The dataset was curated to provide a resource for under-explored aspects of automatic text summarization research. In particular, we want to encourage the exploration of abstractive summarization systems that are not limited by the usual 512-token context window, which usually suffices for (short) news articles, but fails to generate long-form summaries, or does not even work with longer source texts in the first place. Also, existing resources primarily focus on a single (and very specialized) domain, namely news article summarization. We wanted to provide a further resource for legal summarization, for which many languages do not even have any existing datasets. We further noticed that no previous system had utilized the human-written samples from the EUR-Lex platform, which provide an excellent source of training instances suitable for summarization research. We later found out about a resource created in parallel based on EUR-Lex documents, which provides a monolingual (English) corpus constructed in similar fashion. However, we provide a more thorough filtering, and extend the process to the remaining 23 EU languages.
The data was crawled from the aforementioned EUR-Lex platform. In particular, we only use samples which have HTML versions of the texts available, which ensures the alignment across languages, given that translations have to retain the original paragraph structure, which is encoded in HTML elements. We further filter out samples that do not have associated document summaries available.
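The alignment idea above can be illustrated with a minimal sketch: because translations retain the original paragraph structure, extracting the paragraph elements from the HTML versions yields the same number of paragraphs in every language. The parser class and the toy documents below are illustrative, not the actual crawler code.

```python
from html.parser import HTMLParser

# Minimal paragraph extractor: collects the text content of <p> elements.
class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

def paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    return parser.paragraphs

# Toy English/German versions of the "same" document, paragraph-aligned.
en_doc = "<p>Article 1</p><p>Article 2</p>"
de_doc = "<p>Artikel 1</p><p>Artikel 2</p>"
```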
One particular design choice has to be expanded upon: For some summaries, several source documents are considered as an input by the EU. However, since we construct a single-document summarization corpus, we decided to use the longest reference document only. This means we explicitly drop the other reference texts from the corpus. One alternative would have been to concatenate all relevant source texts; however, this generally leads to degradation of positional biases in the text, which can be an important learned feature for summarization systems. Our paper details the effect of this decision in terms of n-gram novelty, which we find is affected by the processing choice.
Who are the source language producers?
The language producers are external professionals contracted by the European Union offices. As previously noted, all non-English texts are generated from the respective English document (all summaries are direct translations of the English summary, all reference texts are translated from the English reference text). No further information on the demographics of the annotators is provided.
The European Union publishes its annotation guidelines for summaries, which target a length of 600 to 800 words. No information on the guidelines for translations is known.
Who are the annotators?
The language producers are external professionals contracted by the European Union offices. No further information on the annotators is available.
The original text was not modified in any way by the authors of this dataset. Explicit mentions of personal names can occur in the dataset; however, we rely on the European Union's own processes to ensure that no further sensitive information is contained in these documents.
The dataset can be used to build summarization systems for previously under-represented languages. For example, the samples in Irish and Maltese (among others) enable the development and evaluation of systems for these languages. A successful cross-lingual system would further enable the automated creation of legal summaries, possibly enabling foreigners in European countries to automatically translate similar country-specific legal acts.
Given the limited amount of training data, this dataset is also suitable as a test bed for low-resource approaches, especially in comparison to strong unsupervised (extractive) summarization systems. We also note that the summaries are explicitly marked as "not legally binding" by the EU. The omission of details (a necessary evil of summaries) implies differences from the (legally binding) original legal act.
Risks associated with this dataset also largely stem from the potential application of systems trained on it. Decisions in the legal domain require careful analysis of the full context, and should not be made based on system-generated summaries at this point in time. Known biases of summarization, specifically factual hallucinations, should act as further deterrents.
Given the availability bias, some of the languages in the dataset are more represented than others. We attempt to mitigate the influence on evaluation by providing validation and test sets of the same size across all languages. Given that we require the availability of HTML documents, we see a particular temporal bias in our dataset, which features more documents from 1990 onwards, simply due to the increase in EU-related activities, but also the native use of the internet as a data storage. This could imply a particular focus on more recent topics (e.g., Brexit, renewable energies, etc. come to mind).
Finally, due to the source of these documents being the EU, we expect a natural bias towards EU-centric (and therefore Western-centric) content; other nations and continents will be under-represented in the data.
As previously outlined, we are aware of some summaries relating to multiple (different) legal acts. For these samples, only one (the longest) text will be available in our dataset.
The web crawler was originally implemented by Ashish Chouhan. Post-filtering and sample correction was later performed by Dennis Aumiller. Both were PhD students employed at the Database Systems Research group of Heidelberg University, under the guidance of Prof. Dr. Michael Gertz.
Data from the EUR-Lex platform is available under the CC BY-SA 4.0 license. We redistribute the dataset under the same license.
For the pre-print version, please cite:
@article{aumiller-etal-2022-eur,
  author     = {Aumiller, Dennis and Chouhan, Ashish and Gertz, Michael},
  title      = {{EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain}},
  journal    = {CoRR},
  volume     = {abs/2210.13448},
  eprinttype = {arXiv},
  eprint     = {2210.13448},
  url        = {https://arxiv.org/abs/2210.13448}
}