Dataset:
opus_gnome
To load a language pair that isn't part of the predefined configs, pass the two language codes as `lang1` and `lang2`. The valid pairs are listed on the dataset homepage (see the Homepage section of the Dataset Description): http://opus.nlpl.eu/GNOME.php. For example:
from datasets import load_dataset
dataset = load_dataset("opus_gnome", lang1="it", lang2="pl")
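When pairs are stored as strings such as `it-pl`, a small wrapper can turn them into the keyword arguments shown above. The following is only a sketch: `pair_kwargs` and `load_pair` are hypothetical helper names, and the code assumes the two language codes are joined by a single hyphen and contain no hyphen themselves. Whether `lang1`/`lang2` loading works may also depend on the installed version of `datasets`.

```python
def pair_kwargs(pair: str) -> dict:
    """Split a pair such as "it-pl" into load_dataset keyword arguments."""
    # Assumption: codes are joined by one hyphen and contain none themselves.
    lang1, sep, lang2 = pair.partition("-")
    if not sep or not lang1 or not lang2:
        raise ValueError(f"expected '<lang1>-<lang2>', got {pair!r}")
    return {"lang1": lang1, "lang2": lang2}


def load_pair(pair: str):
    """Load one language pair; needs the `datasets` package and network access."""
    # Imported lazily so pair_kwargs stays usable without the dependency.
    from datasets import load_dataset

    return load_dataset("opus_gnome", **pair_kwargs(pair))
```

Calling `load_pair("it-pl")["train"]` would then give the train split for that pair, provided the pair exists on the homepage list.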
{ 'id': '0', 'translation': { 'ar': 'إعداد سياسة القفل', 'bal': 'تنظیم کتن سیاست کبل' } }
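Each instance has two fields: `id`, a string identifier, and `translation`, a dictionary that maps each of the two language codes to the text in that language.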
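As the example instance shows, the two fields are `id` and `translation`, the latter keyed by language code. A minimal sketch of reading them follows; the record is copied from the example above, and `split_pair` is a hypothetical helper name, not part of the dataset or the `datasets` library.

```python
# Example instance copied from the card (ar-bal pair).
example = {
    "id": "0",
    "translation": {
        "ar": "إعداد سياسة القفل",
        "bal": "تنظیم کتن سیاست کبل",
    },
}


def split_pair(instance: dict, src: str, tgt: str) -> tuple:
    """Return (source_text, target_text) for the given language codes."""
    translation = instance["translation"]
    return translation[src], translation[tgt]


source_text, target_text = split_pair(example, "ar", "bal")
print(example["id"], source_text, target_text, sep="\t")
```

Because `translation` is keyed by language code, the same helper returns the opposite direction when the two codes are swapped.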
Each subset consists of a single train split. The number of examples for selected language pairs:
language pair | train |
---|---|
ar-bal | 60 |
bg-csb | 10 |
ca-en_GB | 7982 |
cs-eo | 73 |
de-ha | 216 |
cs-tk | 18686 |
da-vi | 149 |
en_GB-my | 28232 |
el-sk | 150 |
de-tt | 2169 |
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
Thanks to @rkc007 for adding this dataset.