Dataset: joelito/legalnero
Tasks: Token Classification
Languages: ro
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: other
Source datasets: original
License: cc-by-nc-nd-4.0
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as location (where a link could be established).
The dataset supports the task of named entity recognition.
Since the legal documents for LegalNERo are extracted from the larger MARCELL-RO corpus, the language in the dataset is Romanian as it is used in national legislation ranging from 1881 to 2021.
The file format is jsonl and three data splits are present (train, validation and test). Named Entity annotations are non-overlapping.
Rows only containing one word (mostly words such as \t\t\t , \n or ----- ) have been filtered out.
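The filtering step described above can be sketched as follows. This is a minimal illustration and not the actual conversion script: the field name `words` and the sample rows are assumptions made for the example, and the helper names are hypothetical.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def drop_single_word_rows(rows, field="words"):
    """Keep only rows whose token list has more than one entry.

    `field` is a hypothetical column name; adjust it to the real schema.
    """
    return [row for row in rows if len(row[field]) > 1]


# Made-up rows that mimic the kind of one-token lines mentioned above.
sample = [
    '{"words": ["-----"]}',
    '{"words": ["Guvernul", "României", "hotărăște", ":"]}',
    '{"words": ["\\n"]}',
]

rows = drop_single_word_rows(read_jsonl(sample))
print(len(rows))
```

Only the second sample row survives, since the other two contain a single token each.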
The files contain the following data fields
The final tagset (in IOB notation) is the following: ['O', 'B-TIME', 'I-TIME', 'B-LEGAL', 'I-LEGAL', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-PER', 'I-PER']
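To make the IOB notation concrete, the sketch below groups a tag sequence into entity spans. It is an illustrative helper, not part of the dataset tooling; the function name `iob_to_spans` and the example tag sequence are made up, and treating a stray `I-` tag as the start of a new span is a design choice of this sketch.

```python
def iob_to_spans(tags):
    """Convert a list of IOB tags into (label, start, end) spans, with end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
            if tag != "O":
                # "B-X" opens a span; a stray "I-X" is also treated as an opening.
                start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tags = ["B-LEGAL", "I-LEGAL", "I-LEGAL", "O", "B-ORG", "I-ORG", "O", "B-TIME"]
print(iob_to_spans(tags))
# [('LEGAL', 0, 3), ('ORG', 4, 6), ('TIME', 7, 8)]
```

Because the annotations are non-overlapping, a single flat tag per token is enough to recover every entity span this way.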
Splits created by Joel Niklaus.
split | number of documents | number of sentences |
---|---|---|
train | 296 (80%) | 7552 |
validation | 37 (10%) | 966 |
test | 37 (10%) | 907 |
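As a quick sanity check on the table, the split shares can be recomputed from the counts. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Document and sentence counts per split, copied from the table above.
docs = {"train": 296, "validation": 37, "test": 37}
sents = {"train": 7552, "validation": 966, "test": 907}

total_docs = sum(docs.values())    # 370 documents in total
total_sents = sum(sents.values())  # 9425 sentences in total

for split in docs:
    doc_share = 100 * docs[split] / total_docs
    sent_share = 100 * sents[split] / total_sents
    print(f"{split}: {doc_share:.0f}% of documents, {sent_share:.1f}% of sentences")
```

The split is exact at the document level (80/10/10), so the sentence shares deviate slightly because documents differ in length.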
The dataset provides gold annotations for organizations, locations, persons, time and legal resources mentioned in Romanian legal documents.
The LegalNERo corpus consists of 370 documents from the larger MARCELL-RO corpus. In the following we give a short description of the crawling process for the MARCELL-RO corpus.
The MARCELL-RO corpus "contains 163,274 files, which represent the body of national legislation ranging from 1881 to 2021. This corpus includes mainly: governmental decisions, ministerial orders, decisions, decrees and laws. All the texts were obtained via crawling from the public Romanian legislative portal. We have not distinguished between in force and "out of force" laws because it is difficult to do this automatically and there is no external resource to use to distinguish between them. The texts were extracted from the original HTML format and converted into TXT files. Each file has multiple levels of annotation: firstly the texts were tokenized, lemmatized and morphologically annotated using the Tokenizing, Tagging and Lemmatizing (TTL) text processing platform developed at RACAI, then dependency parsed with NLP-Cube, named entities were identified using a NER tool developed at RACAI, nominal phrases were identified also with TTL, while IATE terms and EuroVoc descriptors were identified using an internal tool. All processing tools were integrated into an end-to-end pipeline available within the RELATE platform and as a dockerized version. The files were annotated with the latest version of the pipeline completed within Activity 4 of the MARCELL project."
Who are the source language producers?
The source language producers are presumably politicians and lawyers.
“Annotation of the LegalNERo corpus was performed by 5 human annotators, supervised by two senior researchers at the Institute for Artificial Intelligence "Mihai Drăgănescu" of the Romanian Academy (RACAI). For annotation purposes we used the BRAT tool […]. Inside the legal reference class, we considered sub-entities of type organization and time. This allows for using the LegalNERo corpus in two scenarios: using all the 5 entity classes or using only the remaining general-purpose classes. The LegalNERo corpus contains a total of 370 documents from the larger MARCELL-RO corpus. These documents were split amongst the 5 annotators, with certain documents being annotated by multiple annotators. Each annotator manually annotated 100 documents. The annotators were unaware of the overlap, which allowed us to compute an inter-annotator agreement. We used the Cohen’s Kappa measure and obtained a value of 0.89, which we consider to be a good result.” (Pais et al., 2021)
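For readers unfamiliar with the agreement measure mentioned in the quote, the following is a minimal sketch of Cohen's Kappa for two annotators. It is not the authors' evaluation code: the quote does not say at which unit agreement was measured, so the token-level comparison, the function name `cohen_kappa` and the toy label sequences are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy token-level annotations (made up); the annotators disagree on one token.
annotator_a = ["O", "B-ORG", "I-ORG", "O", "B-TIME", "O", "O", "B-PER"]
annotator_b = ["O", "B-ORG", "O", "O", "B-TIME", "O", "O", "B-PER"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Unlike raw agreement, Kappa discounts matches that would occur by chance, which matters here because the `O` tag dominates typical NER label sequences.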
Who are the annotators?
"[...] 5 human annotators, supervised by two senior researchers at the Institute for Artificial Intelligence "Mihai Drăgănescu" of the Romanian Academy (RACAI)."
Note that the information given in this dataset card refers to the dataset version as provided by Joel Niklaus and Veton Matoshi. The dataset at hand is intended to be part of a bigger benchmark dataset. Creating a benchmark dataset consisting of several other datasets from different sources requires postprocessing. Therefore, the structure of the dataset at hand, including the folder structure, may differ considerably from the original dataset. In addition, differences with regard to the dataset statistics as given in the respective papers can be expected. The reader is advised to have a look at the conversion script convert_to_hf_dataset.py in order to retrace the steps for converting the original dataset into the present jsonl format. For further information on the original dataset structure, we refer to the bibliographical references and the original Github repositories and/or web pages provided in this dataset card.
The names of the original dataset curators and creators can be found in the references given below, in the section Citation Information. Additional changes were made by Joel Niklaus and Veton Matoshi.
Creative Commons Attribution Non Commercial No Derivatives 4.0 International
@dataset{pais_vasile_2021_4922385,
  author = {Păiș, Vasile and Mitrofan, Maria and Gasan, Carol Luca and Ianov, Alexandru and Ghiță, Corvin and Coneschi, Vlad Silviu and Onuț, Andrei},
  title = {{Romanian Named Entity Recognition in the Legal domain (LegalNERo)}},
  month = may,
  year = 2021,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.4922385},
  url = {https://doi.org/10.5281/zenodo.4922385}
}
@inproceedings{pais-etal-2021-named,
  author = {Pais, Vasile and Mitrofan, Maria and Gasan, Carol Luca and Coneschi, Vlad and Ianov, Alexandru},
  booktitle = {Proceedings of the Natural Legal Language Processing Workshop 2021},
  doi = {10.18653/v1/2021.nllp-1.2},
  month = {nov},
  pages = {9--18},
  publisher = {Association for Computational Linguistics},
  title = {{Named Entity Recognition in the {R}omanian Legal Domain}},
  url = {https://aclanthology.org/2021.nllp-1.2},
  year = {2021}
}
Thanks to @JoelNiklaus and @kapllan for adding this dataset.