Dataset: lener_br
License: unknown
Source datasets: original
Annotation creators: expert-generated
Language creators: expert-generated
Size: 10K<n<100K
Multilinguality: monolingual
Language: pt
Task: token classification

LeNER-Br is a Portuguese-language dataset for named entity recognition in legal documents. It consists entirely of manually annotated legislation and legal case texts and contains tags for persons, locations, time entities, organizations, legislation, and legal cases. To compose the dataset, 66 legal documents from several Brazilian courts were collected. Courts at the superior and state levels were considered, such as Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais, and Tribunal de Contas da União. In addition, four legislation documents were collected, such as "Lei Maria da Penha", giving a total of 70 documents.
The language supported is Portuguese.
An example from the dataset looks as follows:
{ "id": "0", "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0], "tokens": [ "EMENTA", ":", "APELAÇÃO", "CÍVEL", "-", "AÇÃO", "DE", "INDENIZAÇÃO", "POR", "DANOS", "MORAIS", "-", "PRELIMINAR", "-", "ARGUIDA", "PELO", "MINISTÉRIO", "PÚBLICO", "EM", "GRAU", "RECURSAL"] }
The NER tags correspond to this list:
"O", "B-ORGANIZACAO", "I-ORGANIZACAO", "B-PESSOA", "I-PESSOA", "B-TEMPO", "I-TEMPO", "B-LOCAL", "I-LOCAL", "B-LEGISLACAO", "I-LEGISLACAO", "B-JURISPRUDENCIA", "I-JURISPRUDENCIA"
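The integer ids in `ner_tags` index into the list above. As a minimal sketch, assuming the list order shown here is the id order, the example instance can be decoded as follows. The names `NER_LABELS` and `decode_tags` are illustrative and not part of the dataset.

```python
# Label list in the order given above; each label's index is assumed to be its tag id.
NER_LABELS = [
    "O",
    "B-ORGANIZACAO", "I-ORGANIZACAO",
    "B-PESSOA", "I-PESSOA",
    "B-TEMPO", "I-TEMPO",
    "B-LOCAL", "I-LOCAL",
    "B-LEGISLACAO", "I-LEGISLACAO",
    "B-JURISPRUDENCIA", "I-JURISPRUDENCIA",
]

# The example instance shown above, copied verbatim.
example = {
    "id": "0",
    "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
    "tokens": [
        "EMENTA", ":", "APELAÇÃO", "CÍVEL", "-", "AÇÃO", "DE", "INDENIZAÇÃO",
        "POR", "DANOS", "MORAIS", "-", "PRELIMINAR", "-", "ARGUIDA", "PELO",
        "MINISTÉRIO", "PÚBLICO", "EM", "GRAU", "RECURSAL",
    ],
}


def decode_tags(tag_ids):
    """Hypothetical helper: translate integer tag ids into label strings."""
    return [NER_LABELS[i] for i in tag_ids]


labels = decode_tags(example["ner_tags"])

# Print only the tokens that belong to an entity.
for token, label in zip(example["tokens"], labels):
    if label != "O":
        print(token, label)
```

In this instance the only non-"O" tokens are "MINISTÉRIO" (B-ORGANIZACAO) and "PÚBLICO" (I-ORGANIZACAO).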
The NER tags follow the same format as the CoNLL shared task: a B- prefix marks the first token of an entity, an I- prefix marks any non-initial token of the same entity, and O marks tokens outside any entity.
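To illustrate how B and I tags combine into entities, the sketch below groups consecutive labelled tokens into spans. The function `extract_entities` is a hypothetical helper, not part of the dataset or any library, and it takes one simple design choice as an assumption: an I tag that does not continue an open span of the same type is ignored.

```python
def extract_entities(tokens, labels):
    """Group B-/I- labelled tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the span collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span (ignored by assumption).
            flush()
            current_type, current_tokens = None, []

    flush()
    return entities


# Tail of the example instance, with labels already decoded to strings.
tokens = ["ARGUIDA", "PELO", "MINISTÉRIO", "PÚBLICO", "EM", "GRAU", "RECURSAL"]
labels = ["O", "O", "B-ORGANIZACAO", "I-ORGANIZACAO", "O", "O", "O"]

print(extract_entities(tokens, labels))
```

For evaluation, an established span-level scorer is usually preferable to a hand-written one; this sketch only shows the grouping logic.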
The data is split into train, validation, and test sets. The split sizes are as follows:

| Train | Val | Test |
|---|---|---|
| 7828 | 1177 | 1390 |
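As a quick check on the table, the snippet below sums the split sizes and computes each split's share of the total. The figures are copied from the table; the dictionary name and keys are illustrative.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 7828, "validation": 1177, "test": 1390}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total: {total}")
for name, size in split_sizes.items():
    print(f"{name}: {size} ({fractions[name]:.1%})")
```

The total of 10,395 falls inside the 10K<n<100K size category listed in the metadata.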
@inproceedings{luz_etal_propor2018,
  author = {Pedro H. {Luz de Araujo} and Te\'{o}filo E. {de Campos} and Renato R. R. {de Oliveira} and Matheus Stauffer and Samuel Couto and Paulo Bermejo},
  title = {{LeNER-Br}: a Dataset for Named Entity Recognition in {Brazilian} Legal Text},
  booktitle = {International Conference on the Computational Processing of Portuguese ({PROPOR})},
  publisher = {Springer},
  series = {Lecture Notes on Computer Science ({LNCS})},
  pages = {313--323},
  year = {2018},
  month = {September 24-26},
  address = {Canela, RS, Brazil},
  doi = {10.1007/978-3-319-99722-3_32},
  url = {https://cic.unb.br/~teodecampos/LeNER-Br/},
}
Thanks to @jonatasgrosman for adding this dataset.