Dataset: joelito/greek_legal_ner
Tasks: Token Classification
Languages: el
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: other
Source datasets: original
License: cc-by-nc-sa-4.0

This dataset contains an annotated corpus for named entity recognition in Greek legislation. It is the first of its kind for the Greek language in such an extended form and one of the few that examine legal text with a full spectrum of entity types.
The dataset supports the task of named entity recognition.
The language in the dataset is Greek as it is used in the Greek Government Gazette.
The file format is jsonl and three data splits are present (train, validation and test).
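Because each split is a JSON Lines file, it can be read with the Python standard library alone. The sketch below is a minimal illustration, not the loader used by the dataset maintainers. The helper name `read_jsonl` is hypothetical, and no particular field names are assumed, since the card does not list them at this point.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Calling `list(read_jsonl(path))` on one of the split files would load that split into memory; the actual file names depend on how the repository is laid out.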
The files contain the following data fields
The final tagset (in IOB notation) is the following: ['O', 'B-ORG', 'I-ORG', 'B-GPE', 'I-GPE', 'B-LEG-REFS', 'I-LEG-REFS', 'B-PUBLIC-DOCS', 'I-PUBLIC-DOCS', 'B-PERSON', 'I-PERSON', 'B-FACILITY', 'I-FACILITY', 'B-LOCATION-UNK', 'I-LOCATION-UNK', 'B-LOCATION-NAT', 'I-LOCATION-NAT']
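To make the IOB notation concrete, here is a small sketch that builds label/id mappings from the tagset and groups a tag sequence into entity spans. The function `iob_to_spans` and the mapping names are hypothetical helpers for illustration. Numbering the tags in the order listed above is an assumption and may not match the integer ids used in the data files.

```python
from typing import List, Tuple

# Tagset copied from the dataset card.
TAGS = ['O', 'B-ORG', 'I-ORG', 'B-GPE', 'I-GPE', 'B-LEG-REFS', 'I-LEG-REFS',
        'B-PUBLIC-DOCS', 'I-PUBLIC-DOCS', 'B-PERSON', 'I-PERSON',
        'B-FACILITY', 'I-FACILITY', 'B-LOCATION-UNK', 'I-LOCATION-UNK',
        'B-LOCATION-NAT', 'I-LOCATION-NAT']

# Assumption: ids follow the list order above.
LABEL2ID = {tag: i for i, tag in enumerate(TAGS)}
ID2LABEL = {i: tag for tag, i in LABEL2ID.items()}


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB tags to (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An open entity continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            # A stray I- tag with no open entity starts a new one (lenient choice).
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans
```

For example, the sequence `['O', 'B-ORG', 'I-ORG', 'O']` yields a single `ORG` span covering token positions 1 and 2.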
The dataset has three splits: train, validation and test.
Split across the documents:
split | number of documents |
---|---|
train | 23723 |
validation | 5478 |
test | 5084 |
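The counts above work out to roughly a 69/16/15 split. The short sketch below shows the arithmetic; the variable names are illustrative only.

```python
# Counts copied from the table above.
SPLIT_COUNTS = {"train": 23723, "validation": 5478, "test": 5084}

total = sum(SPLIT_COUNTS.values())
shares = {name: count / total for name, count in SPLIT_COUNTS.items()}

for name, share in shares.items():
    # Print each split's share of the total as a percentage.
    print(f"{name}: {share:.1%}")
```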
Split across NER labels (number of instances):
NER label | train | validation | test |
---|---|---|---|
FACILITY | 1224 | 60 | 142 |
GPE | 5400 | 1214 | 1083 |
LEG-REFS | 5159 | 1382 | 1331 |
LOCATION-NAT | 145 | 2 | 26 |
LOCATION-UNK | 1316 | 283 | 205 |
ORG | 5906 | 1506 | 1354 |
PERSON | 1921 | 475 | 491 |
PUBLIC-DOCS | 2652 | 556 | 452 |
The dataset was created to provide a large resource for Greek named entity recognition and entity linking.
Who are the source language producers?
Greek Government Gazette
Who are the annotators?
According to Angelidis et al. (2018), the authors of the paper annotated the data: "Our group annotated all of the above documents for the 6 entity types that we examine."
Note that the information given in this dataset card refers to the dataset version as provided by Joel Niklaus and Veton Matoshi. The dataset at hand is intended to be part of a bigger benchmark dataset. Creating a benchmark dataset from several datasets of different origin requires postprocessing. Therefore, the structure of the dataset at hand, including the folder structure, may differ considerably from the original dataset. In addition, the dataset statistics may differ from those given in the respective papers. The reader is advised to consult the conversion script convert_to_hf_dataset.py to retrace the steps for converting the original dataset into the present jsonl format. For further information on the original dataset structure, we refer to the bibliographical references and the original GitHub repositories and/or web pages provided in this dataset card.
The names of the original dataset curators and creators can be found in the references given below, in the section Citation Information. Additional changes were made by Joel Niklaus and Veton Matoshi.
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
@inproceedings{Angelidis2018NamedER,
  author = {Angelidis, Iosif and Chalkidis, Ilias and Koubarakis, Manolis},
  booktitle = {JURIX},
  keywords = {greek,legal nlp,named entity recognition},
  title = {{Named Entity Recognition, Linking and Generation for Greek Legislation}},
  year = {2018}
}
Thanks to @JoelNiklaus and @kapllan for adding this dataset.