Dataset:
large_spanish_corpus
The Large Spanish Corpus is a compilation of 15 unlabelled Spanish corpora, ranging from Wikipedia to European Parliament notes. Each config contains the data corresponding to a different corpus. For example, all_wiki only includes examples from Spanish Wikipedia:
from datasets import load_dataset
all_wiki = load_dataset('large_spanish_corpus', name='all_wiki')
By default, the config is set to "combined", which loads all the corpora.
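A minimal sketch of how the config selection described above could be wrapped in a helper. The names `build_load_kwargs` and `load_corpus` are hypothetical and not part of the `datasets` library; only `load_dataset` and its `path`, `name`, and `streaming` parameters come from the library. Whether the dataset loads as-is may depend on the installed `datasets` version.

```python
DATASET_ID = "large_spanish_corpus"  # dataset name as used in this card
DEFAULT_CONFIG = "combined"          # default config according to this card


def build_load_kwargs(config=None, streaming=False):
    """Return keyword arguments for datasets.load_dataset (hypothetical helper)."""
    return {
        "path": DATASET_ID,
        "name": config or DEFAULT_CONFIG,  # fall back to the combined corpus
        "streaming": streaming,
    }


def load_corpus(config=None, streaming=False):
    """Load one config; `datasets` is imported lazily so the helper above has no dependencies."""
    from datasets import load_dataset

    return load_dataset(**build_load_kwargs(config, streaming))


# Example usage (requires the `datasets` package and network access):
#   wiki = load_corpus("all_wiki")
#   everything = load_corpus(streaming=True)  # iterate without a full download up front
```

Passing `streaming=True` is worth considering for the "combined" config, since it covers all the corpora at once.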
Spanish
The following is taken from the corpus' source repository:
Spanish Wikis: Includes Wikipedia, Wikinews, Wikiquote, and more. These were first processed with wikiextractor ( https://github.com/josecannete/wikiextractorforBERT ) using the wiki dumps of 20/04/2019.
ParaCrawl: Spanish portion of ParaCrawl ( http://opus.nlpl.eu/ParaCrawl.php )
EUBookshop: Spanish portion of EUBookshop ( http://opus.nlpl.eu/EUbookshop.php )
MultiUN: Spanish portion of MultiUN ( http://opus.nlpl.eu/MultiUN.php )
OpenSubtitles: Spanish portion of OpenSubtitles2018 ( http://opus.nlpl.eu/OpenSubtitles-v2018.php )
DGT: Spanish portion of DGT ( http://opus.nlpl.eu/DGT.php )
DOGC: Spanish portion of DOGC ( http://opus.nlpl.eu/DOGC.php )
ECB: Spanish portion of ECB ( http://opus.nlpl.eu/ECB.php )
EMEA: Spanish portion of EMEA ( http://opus.nlpl.eu/EMEA.php )
Europarl: Spanish portion of Europarl ( http://opus.nlpl.eu/Europarl.php )
GlobalVoices: Spanish portion of GlobalVoices ( http://opus.nlpl.eu/GlobalVoices.php )
JRC: Spanish portion of JRC ( http://opus.nlpl.eu/JRC-Acquis.php )
News-Commentary11: Spanish portion of NCv11 ( http://opus.nlpl.eu/News-Commentary-v11.php )
TED: Spanish portion of TED ( http://opus.nlpl.eu/TED2013.php )
UN: Spanish portion of UN ( http://opus.nlpl.eu/UN.php )
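To trace a sub-corpus back to its origin, the OPUS pages listed above can be kept in a small lookup table. This is only a sketch: the table is keyed by the OPUS corpus name given after "Spanish portion of", the Spanish Wikis are left out because they come from wiki dumps rather than OPUS, and `OPUS_SOURCES` and `source_url` are hypothetical names. The keys are not guaranteed to match the dataset's config names.

```python
# Source pages copied from the list above, keyed by OPUS corpus name.
OPUS_SOURCES = {
    "ParaCrawl": "http://opus.nlpl.eu/ParaCrawl.php",
    "EUBookshop": "http://opus.nlpl.eu/EUbookshop.php",
    "MultiUN": "http://opus.nlpl.eu/MultiUN.php",
    "OpenSubtitles2018": "http://opus.nlpl.eu/OpenSubtitles-v2018.php",
    "DGT": "http://opus.nlpl.eu/DGT.php",
    "DOGC": "http://opus.nlpl.eu/DOGC.php",
    "ECB": "http://opus.nlpl.eu/ECB.php",
    "EMEA": "http://opus.nlpl.eu/EMEA.php",
    "Europarl": "http://opus.nlpl.eu/Europarl.php",
    "GlobalVoices": "http://opus.nlpl.eu/GlobalVoices.php",
    "JRC": "http://opus.nlpl.eu/JRC-Acquis.php",
    "NCv11": "http://opus.nlpl.eu/News-Commentary-v11.php",
    "TED": "http://opus.nlpl.eu/TED2013.php",
    "UN": "http://opus.nlpl.eu/UN.php",
}


def source_url(corpus):
    """Case-insensitive lookup of the source page for an OPUS corpus name."""
    by_lower = {name.lower(): url for name, url in OPUS_SOURCES.items()}
    try:
        return by_lower[corpus.lower()]
    except KeyError:
        raise KeyError(f"unknown corpus name: {corpus!r}") from None
```

For instance, `source_url("europarl")` returns the Europarl page listed above, and an unrecognised name raises a `KeyError`.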
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
Thanks to @lewtun for adding this dataset.