Dataset:
opus_ubuntu
These are translations of the Ubuntu software package messages, donated by the Ubuntu community.
To load a language pair that isn't part of the config, specify the two language codes as a pair. You can find the valid pairs in the Homepage section of the Dataset Description: http://opus.nlpl.eu/Ubuntu.php. For example:
dataset = load_dataset("opus_ubuntu", lang1="it", lang2="pl")
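For orientation, the call above can be wrapped in a small helper. This is only a minimal sketch: it assumes the Hugging Face datasets library is installed and that the installed version still resolves the opus_ubuntu identifier with these arguments. The name load_pair is hypothetical and not part of any library.

```python
def load_pair(lang1, lang2):
    """Return the train split for one language pair of opus_ubuntu.

    `load_pair` is a hypothetical helper name used for illustration only.
    """
    # Imported lazily so that defining the helper does not require the
    # Hugging Face `datasets` library; calling it does.
    from datasets import load_dataset

    # Same call as in the text above; it needs network access on first use.
    dataset = load_dataset("opus_ubuntu", lang1=lang1, lang2=lang2)

    # The card states that each subset has a single train split.
    return dataset["train"]
```

Calling load_pair("it", "pl") would then return the Italian-Polish train split, provided the pair is listed on the homepage.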
Example instance:
{ 'id': '0', 'translation': { 'it': 'Comprende Gmail, Google Docs, Google+, YouTube e Picasa', 'pl': 'Zawiera Gmail, Google Docs, Google+, YouTube oraz Picasa' } }
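Each instance has two fields: id, a string identifier for the example, and translation, a dictionary that maps each language code in the pair to the text in that language.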
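The example instance shown above can be unpacked as follows. This is an illustrative sketch that works on a plain Python dictionary copied from that example; sentence_pair is a hypothetical helper name, not part of the dataset or of any library.

```python
# The example instance from this card, written out as a plain dictionary.
instance = {
    "id": "0",
    "translation": {
        "it": "Comprende Gmail, Google Docs, Google+, YouTube e Picasa",
        "pl": "Zawiera Gmail, Google Docs, Google+, YouTube oraz Picasa",
    },
}


def sentence_pair(instance, src, tgt):
    """Return (source text, target text) for one instance.

    `sentence_pair` is a hypothetical helper used for illustration only.
    """
    translation = instance["translation"]  # maps language code -> text
    return translation[src], translation[tgt]


source_text, target_text = sentence_pair(instance, "it", "pl")
print(instance["id"], source_text, target_text, sep="\n")
```

Asking for a language code that is not in the pair raises a KeyError, which is usually the desired behaviour when the wrong subset has been loaded.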
Each subset consists of a single train split. The number of examples for some language pairs is listed below:
language pair | train |
---|---|
as-bs | 8583 |
az-cs | 293 |
bg-de | 184 |
br-es_PR | 125 |
bn-ga | 7324 |
br-hi | 15551 |
br-la | 527 |
bs-szl | 646 |
br-uz | 1416 |
br-yi | 2799 |
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
BSD "Revised" license (see https://help.launchpad.net/Legal#Translations_copyright)
@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
Thanks to @rkc007 for adding this dataset.