Dataset:
tatoeba
Tatoeba is a collection of sentences and translations.
To load a language pair that isn't part of the predefined configs, specify the two language codes as a pair. The valid pairs are listed on the dataset homepage (the Homepage entry of the Dataset Description): http://opus.nlpl.eu/Tatoeba.php. For example:
from datasets import load_dataset
dataset = load_dataset("tatoeba", lang1="en", lang2="he")
The default release date is v2021-07-22. To load a different release, pass the date argument:
dataset = load_dataset("tatoeba", lang1="en", lang2="he", date="v2020-11-09")
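The two calls above differ only in whether date is passed, so the arguments can be assembled in one place. The following is a minimal sketch, not part of the `datasets` library: `tatoeba_kwargs` and `load_pair` are hypothetical helper names, and the date check assumes that every release date follows the `vYYYY-MM-DD` shape of the two dates shown above.

```python
import re

# Assumption: release dates look like "v2021-07-22", as in the examples above.
_DATE_PATTERN = re.compile(r"v\d{4}-\d{2}-\d{2}")


def tatoeba_kwargs(lang1, lang2, date=None):
    """Build keyword arguments for load_dataset("tatoeba", **kwargs).

    Hypothetical helper for illustration only.
    """
    if not lang1 or not lang2:
        raise ValueError("both language codes are required")
    if lang1 == lang2:
        raise ValueError("lang1 and lang2 must differ")
    kwargs = {"lang1": lang1, "lang2": lang2}
    if date is not None:
        if not _DATE_PATTERN.fullmatch(date):
            raise ValueError(f"unexpected date format: {date!r}")
        # Only pass date when given, so the default release applies otherwise.
        kwargs["date"] = date
    return kwargs


def load_pair(lang1, lang2, date=None):
    """Load one language pair; needs the `datasets` package and network access."""
    # Imported lazily so tatoeba_kwargs can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("tatoeba", **tatoeba_kwargs(lang1, lang2, date))
```

With these helpers, `load_pair("en", "he")` would correspond to the first call above and `load_pair("en", "he", date="v2020-11-09")` to the second. The helper does not check that the pair actually exists on the homepage.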
[More Information Needed]
The languages in the dataset are:
[More Information Needed]
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
Thanks to @abhishekkrthakur for adding this dataset.