Dataset: europarl_bilingual
Tasks: Translation
Multilinguality: translation
Size: 100K<n<1M
Language creators: found
Annotation creators: found
Source datasets: original
License: unknown

A parallel corpus extracted from the European Parliament web site by Philipp Koehn (University of Edinburgh). The main intended use is to aid statistical machine translation research.
To load a language pair that is not part of the predefined configs, pass the two language codes as the lang1 and lang2 arguments. The valid pairs are listed on the dataset homepage (https://opus.nlpl.eu/Europarl.php), which is also linked in the Homepage section of the Dataset Description. For example:
from datasets import load_dataset
dataset = load_dataset("europarl_bilingual", lang1="fi", lang2="fr")
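The call above could be wrapped as in the following minimal sketch. The helper names `pair_kwargs` and `load_pair` are hypothetical, and the two-letter-code check is an assumption made for illustration, not a rule taken from the dataset's loader. The `datasets` import is deferred so that the validation step can run without the library or network access; whether the `lang1`/`lang2` arguments are accepted may also depend on the installed `datasets` version.

```python
def pair_kwargs(lang1: str, lang2: str) -> dict:
    """Build the lang1/lang2 keyword arguments for one language pair."""
    codes = [lang1.strip().lower(), lang2.strip().lower()]
    for code in codes:
        # Assumption: language codes are two-letter codes such as "fi" or "fr".
        if len(code) != 2 or not code.isalpha():
            raise ValueError(f"not a two-letter language code: {code!r}")
    if codes[0] == codes[1]:
        raise ValueError("a language pair needs two different languages")
    return {"lang1": codes[0], "lang2": codes[1]}


def load_pair(lang1: str, lang2: str):
    """Load one language pair; requires the `datasets` library and network access."""
    from datasets import load_dataset  # deferred so pair_kwargs works without it

    return load_dataset("europarl_bilingual", **pair_kwargs(lang1, lang2))
```

With these helpers, `load_pair("fi", "fr")` would issue the same call as the example above.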
Tasks: Machine Translation, Cross-Lingual Word Embeddings (CLWE) Alignment
Every pair of the following languages is available:
Here is an example from the en-fr pair:
{ 'translation': { 'en': 'Resumption of the session', 'fr': 'Reprise de la session' } }
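To illustrate how records of this shape might be consumed, here is a small sketch that turns them into (source, target) sentence tuples. The function name `iter_pairs` and the `Record` type alias are hypothetical; the only thing taken from the dataset is the record layout shown in the example above, which is reused here as in-memory sample data so that nothing needs to be downloaded.

```python
from typing import Dict, Iterable, Iterator, Tuple

# One record maps "translation" to a {language code: sentence} dictionary.
Record = Dict[str, Dict[str, str]]


def iter_pairs(records: Iterable[Record], src: str, tgt: str) -> Iterator[Tuple[str, str]]:
    """Yield (source sentence, target sentence) tuples from translation records."""
    for record in records:
        translation = record["translation"]
        yield translation[src], translation[tgt]


# Sample data copied from the en-fr example above.
sample = [
    {"translation": {"en": "Resumption of the session", "fr": "Reprise de la session"}}
]
pairs = list(iter_pairs(sample, "en", "fr"))
```

Swapping the `src` and `tgt` arguments gives the reverse translation direction from the same records.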
Who are the source language producers?
[Needs More Information]
Who are the annotators?
[Needs More Information]
The dataset comes with the same license as the original sources. Please check the source information given at http://opus.nlpl.eu/Europarl-v8.php
@InProceedings{TIEDEMANN12.463,
  author = {Jörg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
Thanks to @lucadiliello for adding this dataset.