Dataset:

joelito/mapa

Tasks:

Token Classification

Sub-tasks:

named-entity-recognition

Languages:

multilingual

Multilinguality:

multilingual

Size:

1K<n<10K

Language creators:

found

Annotations creators:

other

Source datasets:

original

Other:

named-entity-recognition-and-classification

License:

cc-by-4.0


Dataset Card for Multilingual European Datasets for Sensitive Entity Detection in the Legal Domain

Dataset Summary

The dataset consists of 12 documents per language (9 for Spanish due to parsing errors) taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. The documents have been annotated for named entities following the guidelines of the MAPA project, which defines two annotation levels: a general one and a more fine-grained one. The annotated corpus can be used for named entity recognition and classification.

Supported Tasks and Leaderboards

The dataset supports the task of Named Entity Recognition and Classification (NERC).

Languages

The following languages are supported: bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pt, ro, sk, sv

Dataset Structure

Data Instances

The files are in JSON Lines (jsonl) format, and three data splits are provided (train, validation and test). Named entity annotations are non-overlapping.
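Because each split is a JSON Lines file, it can be read with the Python standard library alone. The following is a minimal sketch: the helper name `read_jsonl` and the sample records are hypothetical and only illustrate the format; no actual file names from the repository are assumed.

```python
import io
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Hypothetical in-memory stand-in for a split file (not real data).
sample = io.StringIO(
    '{"language": "en", "tokens": ["Brussels", ","], "coarse_grained": ["B-ADDRESS", "O"]}\n'
    '\n'
    '{"language": "de", "tokens": ["2019"], "coarse_grained": ["B-DATE"]}\n'
)
records = list(read_jsonl(sample))
print(len(records), records[0]["language"])
```

To read a real split, pass an open file handle instead of the `StringIO` object, for example `open(path, encoding="utf-8")`, where `path` points to one of the downloaded jsonl files.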

Data Fields

For the annotation, the documents have been split into sentences. The annotation was done at the token level. The files contain the following data fields:

language : language of the sentence
type : The document type of the sentence. Currently, only EUR-LEX is supported.
file_name : The document file name the sentence belongs to.
sentence_number : The number of the sentence inside its document.
tokens : The list of tokens in the sentence.
coarse_grained : The coarse-grained annotations for each token
fine_grained : The fine-grained annotations for each token
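The fields listed above can be modelled as a small typed record. This is only a sketch: the class name `MapaSentence`, the Python types (in particular `int` for `sentence_number`) and the example values are assumptions, not part of the dataset. The length check reflects the statement that there is one annotation per token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MapaSentence:
    """One annotated sentence; field names follow the dataset card."""
    language: str
    type: str
    file_name: str
    sentence_number: int  # assumed to be an integer
    tokens: List[str]
    coarse_grained: List[str]
    fine_grained: List[str]

    def __post_init__(self) -> None:
        # One coarse-grained and one fine-grained tag per token.
        n = len(self.tokens)
        if len(self.coarse_grained) != n or len(self.fine_grained) != n:
            raise ValueError("tokens and tag lists must have the same length")


# Hypothetical example values, for illustration only.
example = MapaSentence(
    language="en",
    type="EUR-LEX",
    file_name="example.txt",
    sentence_number=0,
    tokens=["Done", "at", "Brussels"],
    coarse_grained=["O", "O", "B-ADDRESS"],
    fine_grained=["O", "O", "B-CITY"],
)
```

A record parsed from a jsonl line can be converted with `MapaSentence(**record)`, provided the record contains exactly these seven keys.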

As previously stated, the annotation has been conducted on a global and a more fine-grained level.

The tagset used for the global and the fine-grained named entities is the following:

Address
- Building
- City
- Country
- Place
- Postcode
- Street
- Territory
Amount
- Unit
- Value
Date
- Year
- Standard Abbreviation
- Month
- Day of the Week
- Day
- Calendar Event
Person
- Age
- Email
- Ethnic Category
- Family Name
- Financial
- Given Name – Female
- Given Name – Male
- Health Insurance Number
- ID Document Number
- Initial Name
- Marital Status
- Medical Record Number
- Nationality
- Profession
- Role
- Social Security Number
- Title
- Url
Organisation
Time
Vehicle
- Build Year
- Colour
- License Plate Number
- Model
- Type

The final coarse-grained tagset (in IOB notation) is the following:

['O', 'B-ORGANISATION', 'I-ORGANISATION', 'B-ADDRESS', 'I-ADDRESS', 'B-DATE', 'I-DATE', 'B-PERSON', 'I-PERSON', 'B-AMOUNT', 'I-AMOUNT', 'B-TIME', 'I-TIME']

The final fine-grained tagset (in IOB notation) is the following:

['O', 'B-BUILDING', 'I-BUILDING', 'B-CITY', 'I-CITY', 'B-COUNTRY', 'I-COUNTRY', 'B-PLACE', 'I-PLACE', 'B-TERRITORY', 'I-TERRITORY', 'I-UNIT', 'B-UNIT', 'B-VALUE', 'I-VALUE', 'B-YEAR', 'I-YEAR', 'B-STANDARD ABBREVIATION', 'I-STANDARD ABBREVIATION', 'B-MONTH', 'I-MONTH', 'B-DAY', 'I-DAY', 'B-AGE', 'I-AGE', 'B-ETHNIC CATEGORY', 'I-ETHNIC CATEGORY', 'B-FAMILY NAME', 'I-FAMILY NAME', 'B-INITIAL NAME', 'I-INITIAL NAME', 'B-MARITAL STATUS', 'I-MARITAL STATUS', 'B-PROFESSION', 'I-PROFESSION', 'B-ROLE', 'I-ROLE', 'B-NATIONALITY', 'I-NATIONALITY', 'B-TITLE', 'I-TITLE', 'B-URL', 'I-URL', 'B-TYPE', 'I-TYPE']
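Token-level IOB tags are often easier to work with once they are grouped into entity spans. The function below is a minimal sketch of such a conversion; the name `iob_to_spans` is hypothetical, and treating a stray `I-` tag without a preceding `B-` as the start of a new span is a design assumption, not something the dataset card specifies.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans, with end exclusive."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        # Split on the first hyphen only, so labels with spaces stay intact.
        prefix, _, name = tag.partition("-")
        # Close the open span unless this tag continues it.
        if label is not None and not (prefix == "I" and name == label):
            spans.append((label, start, i))
            label = None
        # Open a new span on B-, or on a stray I- with no open span.
        if prefix in ("B", "I") and label is None:
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tags = ["O", "B-PERSON", "I-PERSON", "O", "B-DATE"]
print(iob_to_spans(tags))
```

The same function works for both the coarse-grained and the fine-grained tag lists, since both use the same notation.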

Data Splits

Splits created by Joel Niklaus.

language	# train files	# validation files	# test files	# train sentences	# validation sentences	# test sentences
bg	9	1	2	1411	166	560
cs	9	1	2	1464	176	563
da	9	1	2	1455	164	550
de	9	1	2	1457	166	558
el	9	1	2	1529	174	584
en	9	1	2	893	98	408
es	7	1	1	806	248	155
et	9	1	2	1391	163	516
fi	9	1	2	1398	187	531
fr	9	1	2	1297	97	490
ga	9	1	2	1383	165	515
hu	9	1	2	1390	171	525
it	9	1	2	1411	162	550
lt	9	1	2	1413	173	548
lv	9	1	2	1383	167	553
mt	9	1	2	937	93	442
nl	9	1	2	1391	164	530
pt	9	1	2	1086	105	390
ro	9	1	2	1480	175	557
sk	9	1	2	1395	165	526
sv	9	1	2	1453	175	539
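The relative size of each split can be computed directly from the sentence counts in the table. As a small worked example, the snippet below uses the counts from the `en` row; the helper name `split_shares` is hypothetical.

```python
from typing import Dict


def split_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the total sentence count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Sentence counts for English, taken from the table above.
en_counts = {"train": 893, "validation": 98, "test": 408}
en_total = sum(en_counts.values())
en_shares = split_shares(en_counts)

for name, share in en_shares.items():
    print(f"{name}: {share:.1%}")
```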

Dataset Creation

Curation Rationale

"[…] to our knowledge, there exist no open resources annotated for NERC [Named Entity Recognition and Classification] in Spanish in the legal domain. With the present contribution, we intend to fill this gap. With the release of the created resources for fine-tuning and evaluation of sensitive entities detection in the legal domain, we expect to encourage the development of domain-adapted anonymisation tools for Spanish in this field" (de Gibert Bonet et al., 2022)

Source Data

Initial Data Collection and Normalization

The dataset consists of documents taken from the EUR-Lex corpus, which is publicly available. No further information on the data collection process is given in de Gibert Bonet et al. (2022).

Who are the source language producers?

The source language producers are presumably lawyers.

Annotations

Annotation process

"The annotation scheme consists of a complex two level hierarchy adapted to the legal domain, it follows the scheme described in (Gianola et al., 2020) […] Level 1 entities refer to general categories (PERSON, DATE, TIME, ADDRESS...) and level 2 entities refer to more fine-grained subcategories (given name, personal name, day, year, month...). Eur-Lex, CPP and DE have been annotated following this annotation scheme […] The manual annotation was performed using INCePTION (Klie et al., 2018) by a sole annotator following the guidelines provided by the MAPA consortium." (de Gibert Bonet et al., 2022)

Who are the annotators?

Only one annotator conducted the annotation. No further information is provided in de Gibert Bonet et al. (2022).

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

Note that the dataset at hand represents only a small portion of the larger corpus described in de Gibert Bonet et al. (2022). At the time of writing, only the annotated documents from the EUR-Lex corpus were available.

Note that the information given in this dataset card refers to the dataset version as provided by Joel Niklaus and Veton Matoshi. The dataset at hand is intended to be part of a bigger benchmark dataset. Creating a benchmark dataset consisting of several other datasets from different sources requires postprocessing. Therefore, the structure of the dataset at hand, including the folder structure, may differ considerably from the original dataset. In addition, differences with regard to the dataset statistics as given in the respective papers can be expected. The reader is advised to have a look at the conversion script convert_to_hf_dataset.py in order to retrace the steps for converting the original dataset into the present jsonl format. For further information on the original dataset structure, we refer to the bibliographical references and the original GitHub repositories and/or web pages provided in this dataset card.

Additional Information

Dataset Curators

The names of the original dataset curators and creators can be found in the references given below, in the section Citation Information. Additional changes were made by Joel Niklaus (Email; GitHub) and Veton Matoshi (Email; GitHub).

Licensing Information

Attribution 4.0 International (CC BY 4.0)

Citation Information

@article{DeGibertBonet2022,
author = {{de Gibert Bonet}, Ona and {Garc{\'{i}}a Pablos}, Aitor and Cuadros, Montse and Melero, Maite},
journal = {Proceedings of the Language Resources and Evaluation Conference},
number = {June},
pages = {3751--3760},
title = {{Spanish Datasets for Sensitive Entity Detection in the Legal Domain}},
url = {https://aclanthology.org/2022.lrec-1.400},
year = {2022}
}

Contributions

Thanks to @JoelNiklaus and @kapllan for adding this dataset.

Author:

joelito

Dataset size:

31.22 MB