Dataset:
Emanuel/UD_Portuguese-Bosque
Language:
pt
This dataset has been automatically processed by AutoNLP for project pos-tag-bosque.
The BCP-47 code for the dataset's language is pt.
A sample from this dataset looks as follows:
[
  {
    "tags": [5, 7, 0],
    "tokens": ["Um", "revivalismo", "refrescante"]
  },
  {
    "tags": [5, 11, 11, 11, 3, 5, 7, 1, 5, 7, 0, 12],
    "tokens": ["O", "7", "e", "Meio", "\u00e9", "um", "ex-libris", "de", "a", "noite", "algarvia", "."]
  }
]
The dataset has the following fields (also called "features"):
{
  "tags": "Sequence(feature=ClassLabel(num_classes=17, names=['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X'], names_file=None, id=None), length=-1, id=None)",
  "tokens": "Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)"
}
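Each integer in "tags" is an index into the list of 17 class names given in the "tags" feature, so a tag sequence can be decoded by a simple lookup. The following is a minimal sketch in plain Python, assuming the label order shown above; `UPOS_NAMES` and `decode_example` are hypothetical names introduced here for illustration and are not part of the dataset or of any library.

```python
# The 17 class names, in the order listed in the "tags" feature above.
UPOS_NAMES = ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
              'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X']


def decode_example(example):
    """Pair each token with its tag name (hypothetical helper for illustration)."""
    tokens, tags = example["tokens"], example["tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [(token, UPOS_NAMES[tag]) for token, tag in zip(tokens, tags)]


# First example from the sample shown earlier.
sample = {"tags": [5, 7, 0], "tokens": ["Um", "revivalismo", "refrescante"]}
decoded = decode_example(sample)
print(decoded)
# [('Um', 'DET'), ('revivalismo', 'NOUN'), ('refrescante', 'ADJ')]
```

When the dataset is loaded through the Hugging Face `datasets` library, the same mapping should be available from the `ClassLabel` feature itself via its `int2str` method, which avoids hard-coding the name list.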
This dataset is divided into a train split and a validation split. The split sizes are as follows:
Split name | Num samples |
---|---|
train | 8328 |
valid | 476 |