Dataset:
pierreguillou/lener_br_finetuning_language_model
The LeNER-Br language modeling dataset is a collection of Portuguese legal texts drawn from the LeNER-Br dataset (official site).
The legal texts were downloaded from this link (93.6MB) and processed to create a DatasetDict with train and validation splits; the validation split holds 20% of the rows.
The LeNER-Br language modeling dataset can be used to fine-tune language models such as BERTimbau base and large.
Portuguese from Brazil.
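NLP | Modelos e Web App para Reconhecimento de Entidade Nomeada (NER) no domínio jurídico brasileiro (29/12/2021) [in English: NLP | Models and Web App for Named Entity Recognition (NER) in the Brazilian legal domain]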
DatasetDict({
    validation: Dataset({
        features: ['text'],
        num_rows: 3813
    })
    train: Dataset({
        features: ['text'],
        num_rows: 15252
    })
})
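As a quick sanity check, the split proportions can be recomputed from the row counts in the summary above. This is only an illustrative sketch; the variable names are not part of the dataset or of any library.

```python
# Row counts copied from the DatasetDict summary above.
num_rows = {"train": 15252, "validation": 3813}

total = sum(num_rows.values())
validation_share = num_rows["validation"] / total

print(f"total rows: {total}")                        # total rows: 19065
print(f"validation share: {validation_share:.1%}")   # validation share: 20.0%
```

The 20% figure therefore refers to the share of all rows, not to a fraction of the training split.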
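# Notebook syntax; in a regular shell, run `pip install datasets` instead.
!pip install datasets
from datasets import load_dataset
# Downloads the dataset from the Hugging Face Hub and returns a DatasetDict
# with the "train" and "validation" splits shown above.
dataset = load_dataset("pierreguillou/lener_br_finetuning_language_model")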