Dataset: emea
Tasks: translation
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: found
Annotation creators: found
Source datasets: original
License: unknown

To load a language pair that is not part of the predefined configurations, pass the two language codes as lang1 and lang2. The valid pairs are listed on the dataset homepage, http://opus.nlpl.eu/EMEA.php (the Homepage entry of the Dataset Description). For example:
from datasets import load_dataset
dataset = load_dataset("emea", lang1="en", lang2="nl")
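The pair names listed in this card (bg-el, cs-et, de-mt, en-nl) all put the two codes in alphabetical order. A minimal sketch that normalizes the order before loading follows. The ordering convention is an assumption drawn from those names. `pair_name` and `load_pair` are hypothetical helpers, not part of the datasets library. Whether `load_dataset("emea", ...)` still resolves depends on the installed version of the library.

```python
def pair_name(lang1: str, lang2: str) -> str:
    """Hypothetical helper: build an 'xx-yy' pair name with the codes sorted."""
    first, second = sorted([lang1.strip().lower(), lang2.strip().lower()])
    if first == second:
        raise ValueError("a language pair needs two different codes")
    return f"{first}-{second}"


def load_pair(lang1: str, lang2: str):
    """Hypothetical wrapper around the load_dataset call shown above."""
    # Imported lazily so pair_name() works without the datasets library.
    from datasets import load_dataset

    # Assumption: the loader expects the codes in alphabetical order.
    first, second = pair_name(lang1, lang2).split("-")
    return load_dataset("emea", lang1=first, lang2=second)


print(pair_name("nl", "en"))  # en-nl
```

Calling `load_pair("nl", "en")` would then request the same en-nl pair as the example, at the cost of a network download.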
Here is an example of the en-nl configuration:
{'id': '4', 'translation': {'en': 'EPAR summary for the public', 'nl': 'EPAR-samenvatting voor het publiek'}}
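As a small sketch of how such a record can be read, the snippet below copies the en-nl example into a plain dictionary and pulls out the two sentences. `sentence_pair` is a hypothetical helper introduced only for illustration; it assumes every record has the same 'id' and 'translation' layout as the example.

```python
# The en-nl example record from this card, copied as a plain dictionary.
record = {
    "id": "4",
    "translation": {
        "en": "EPAR summary for the public",
        "nl": "EPAR-samenvatting voor het publiek",
    },
}


def sentence_pair(rec: dict, src: str, tgt: str) -> tuple:
    """Hypothetical helper: return (source sentence, target sentence)."""
    translation = rec["translation"]  # maps language code -> sentence
    return translation[src], translation[tgt]


src_text, tgt_text = sentence_pair(record, "en", "nl")
print(f"{record['id']}: {src_text} -> {tgt_text}")
```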
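The data fields, as seen in the example, are id (a string identifier for the sentence pair) and translation (a dictionary that maps each language code to the sentence in that language).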
Sizes of some language pairs:
| name | train |
|---|---|
| bg-el | 1044065 |
| cs-et | 1053164 |
| de-mt | 1000532 |
| fr-sk | 1062753 |
| es-lt | 1051370 |
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
Thanks to @abhishekkrthakur for adding this dataset.