Datasets:

large_spanish_corpus

Languages:

es

Multilinguality:

monolingual

Language Creators:

expert-generated

Annotations Creators:

no-annotation

Source Datasets:

original

License:

mit

Dataset Card for The Large Spanish Corpus

Dataset Summary

The Large Spanish Corpus is a compilation of 15 unlabelled Spanish corpora, ranging from Wikipedia to European Parliament notes. Each config contains the data for a different corpus. For example, all_wiki only includes examples from Spanish Wikipedia:

from datasets import load_dataset
# Load only the Spanish Wikipedia config
all_wiki = load_dataset('large_spanish_corpus', name='all_wiki')

By default, the config is set to "combined", which loads all the corpora.
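Because the default config pulls in every corpus, it can help to make the choice of config explicit. The following is a minimal sketch, not part of the original card: `load_kwargs`, `DATASET_ID` and `DEFAULT_CONFIG` are hypothetical names introduced here for illustration. It also assumes that the installed version of `datasets` can still load this dataset and that streaming works for it, neither of which the card confirms.

```python
from typing import Optional

DATASET_ID = "large_spanish_corpus"
DEFAULT_CONFIG = "combined"  # per the card, this config loads all the corpora


def load_kwargs(config: Optional[str] = None, streaming: bool = False) -> dict:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper).

    Falls back to the "combined" config when no config name is given.
    """
    return {
        "path": DATASET_ID,
        "name": config or DEFAULT_CONFIG,
        "streaming": streaming,
    }


if __name__ == "__main__":
    # Imported here so the helper above can be used without the library installed.
    from datasets import load_dataset

    # Assumption: streaming is supported for this dataset; it avoids
    # downloading the whole corpus before iterating.
    all_wiki = load_dataset(**load_kwargs("all_wiki", streaming=True))
    print(all_wiki)
```

Calling `load_kwargs()` with no arguments selects the "combined" config, which mirrors the default behaviour described above.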

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Spanish

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

The following is taken from the corpus' source repository:

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @lewtun for adding this dataset.