Datasets:

large_spanish_corpus

Tasks:

task_categories:other

Languages:

Multilinguality:

monolingual

Size categories:

100K<n<1M 100M<n<1B 10K<n<100K

Language creators:

expert-generated

Annotations creators:

no-annotation

Source datasets:

original

License:

mit


Dataset Card for The Large Spanish Corpus

Dataset Summary

The Large Spanish Corpus is a compilation of 15 unlabelled Spanish corpora, ranging from Wikipedia to European Parliament notes. Each config contains the data from one corpus. For example, all_wiki only includes examples from Spanish Wikipedia:

from datasets import load_dataset
all_wiki = load_dataset('large_spanish_corpus', name='all_wiki')

By default, the config is set to "combined", which loads all the corpora.
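The config selection described above can be sketched as a small helper. This is a minimal sketch, not part of the original card: the names load_kwargs and load_corpus are hypothetical, and passing streaming=True (a real load_dataset parameter) is only an assumption about what this dataset supports. Whether the dataset identifier still resolves in a given version of the datasets library is also not verified here.

```python
from typing import Any, Dict, Optional

DATASET_ID = "large_spanish_corpus"  # identifier used in the card's own example
DEFAULT_CONFIG = "combined"          # default config named in the card


def load_kwargs(config: Optional[str] = None, streaming: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper).

    Falls back to the "combined" config when no config name is given.
    """
    return {
        "path": DATASET_ID,
        "name": config or DEFAULT_CONFIG,
        "streaming": streaming,
    }


def load_corpus(config: Optional[str] = None, streaming: bool = False):
    """Load one config. Requires the `datasets` package and network access."""
    # Imported lazily so load_kwargs stays usable without the dependency.
    from datasets import load_dataset

    return load_dataset(**load_kwargs(config, streaming))
```

For example, load_corpus("all_wiki", streaming=True) would request only the config from the card's example and iterate over it lazily, which may be preferable given the 5.4 GB total size, provided streaming works for this dataset.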

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Spanish

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

The following is taken from the corpus' source repository:

Spanish Wikis: Wikipedia, Wikinews, Wikiquote and more. These were first processed with wikiextractor ( https://github.com/josecannete/wikiextractorforBERT ) using the wiki dumps of 20/04/2019.
ParaCrawl: Spanish portion of ParaCrawl ( http://opus.nlpl.eu/ParaCrawl.php )
EUBookshop: Spanish portion of EUBookshop ( http://opus.nlpl.eu/EUbookshop.php )
MultiUN: Spanish portion of MultiUN ( http://opus.nlpl.eu/MultiUN.php )
OpenSubtitles: Spanish portion of OpenSubtitles2018 ( http://opus.nlpl.eu/OpenSubtitles-v2018.php )
DGT: Spanish portion of DGT ( http://opus.nlpl.eu/DGT.php )
DOGC: Spanish portion of DOGC ( http://opus.nlpl.eu/DOGC.php )
ECB: Spanish portion of ECB ( http://opus.nlpl.eu/ECB.php )
EMEA: Spanish portion of EMEA ( http://opus.nlpl.eu/EMEA.php )
Europarl: Spanish portion of Europarl ( http://opus.nlpl.eu/Europarl.php )
GlobalVoices: Spanish portion of GlobalVoices ( http://opus.nlpl.eu/GlobalVoices.php )
JRC: Spanish portion of JRC ( http://opus.nlpl.eu/JRC-Acquis.php )
News-Commentary11: Spanish portion of NCv11 ( http://opus.nlpl.eu/News-Commentary-v11.php )
TED: Spanish portion of TED ( http://opus.nlpl.eu/TED2013.php )
UN: Spanish portion of UN ( http://opus.nlpl.eu/UN.php )
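As a quick cross-check of the list above, the corpora and the URLs given next to them can be held in a dictionary and grouped by host. This is only an illustrative sketch: the keys are the labels from the list, not necessarily the config names accepted by load_dataset, and SOURCES and sources_by_host are hypothetical names. Note that the URL for the Spanish Wikis points at the extraction tool, not at the data itself.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Corpus label -> URL listed next to it in the card (labels are not verified config names).
SOURCES: Dict[str, str] = {
    "Spanish Wikis": "https://github.com/josecannete/wikiextractorforBERT",  # extraction tool
    "ParaCrawl": "http://opus.nlpl.eu/ParaCrawl.php",
    "EUBookshop": "http://opus.nlpl.eu/EUbookshop.php",
    "MultiUN": "http://opus.nlpl.eu/MultiUN.php",
    "OpenSubtitles": "http://opus.nlpl.eu/OpenSubtitles-v2018.php",
    "DGT": "http://opus.nlpl.eu/DGT.php",
    "DOGC": "http://opus.nlpl.eu/DOGC.php",
    "ECB": "http://opus.nlpl.eu/ECB.php",
    "EMEA": "http://opus.nlpl.eu/EMEA.php",
    "Europarl": "http://opus.nlpl.eu/Europarl.php",
    "GlobalVoices": "http://opus.nlpl.eu/GlobalVoices.php",
    "JRC": "http://opus.nlpl.eu/JRC-Acquis.php",
    "News-Commentary11": "http://opus.nlpl.eu/News-Commentary-v11.php",
    "TED": "http://opus.nlpl.eu/TED2013.php",
    "UN": "http://opus.nlpl.eu/UN.php",
}


def sources_by_host(sources: Dict[str, str]) -> Dict[str, List[str]]:
    """Group corpus labels by the host name of their listed URL."""
    grouped: Dict[str, List[str]] = {}
    for label, url in sources.items():
        grouped.setdefault(urlparse(url).netloc, []).append(label)
    return grouped
```

Grouping this way shows that 14 of the 15 entries point at OPUS (opus.nlpl.eu), which matches the count of 15 corpora stated in the summary.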

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @lewtun for adding this dataset.

Author:

Anonymous

Dataset size:

5.4 GB