Dataset:
poleval2019_mt
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted solutions compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures. One of the tasks in PolEval-2019 was Machine Translation (Task-4).
The task is to train the best possible machine translation system, using any technology, with limited textual resources. The competition covered two language pairs: the more popular English-Polish (into Polish only) and the low-resource Russian-Polish (in both directions).
Here, Polish-English is also made available to allow for training in both directions. However, test data is available ONLY for English-Polish.
Supports machine translation between Russian and Polish and between English and Polish, in both directions.
As the training data set, a set of bi-lingual corpora aligned at the sentence level has been prepared. The corpora are saved in UTF-8 encoding as plain text, one language per file.
An example translation record is shown below:
{ 'translation': {'ru': 'не содержала в себе моделей. Модели это сравнительно новое явление. ', 'pl': 'nie miała w sobie modeli. Modele to względnie nowa dziedzina. Tak więc, jeśli '} }
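Because the corpora are plain UTF-8 text with one language per file and one aligned sentence per line, records of the shape shown above can be rebuilt by reading two files in lockstep. The following is a minimal sketch, not the organizers' loader: the function name `read_parallel` and the file names in the usage note are hypothetical, and it assumes line N of one file aligns with line N of the other.

```python
from itertools import zip_longest
from typing import Dict, Iterator


def read_parallel(
    src_path: str, tgt_path: str, src_lang: str, tgt_lang: str
) -> Iterator[Dict[str, Dict[str, str]]]:
    """Yield one {'translation': {...}} record per aligned line pair.

    Hypothetical helper: assumes both files are UTF-8 plain text and that
    line N of the source file is the translation of line N of the target.
    """
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which lets us detect
        # a misaligned corpus instead of silently truncating it.
        for src_line, tgt_line in zip_longest(src, tgt):
            if src_line is None or tgt_line is None:
                raise ValueError("source and target files differ in line count")
            yield {
                "translation": {
                    # Strip only the newline; other whitespace is kept as-is.
                    src_lang: src_line.rstrip("\n"),
                    tgt_lang: tgt_line.rstrip("\n"),
                }
            }
```

With a pair of files named, say, `train.ru` and `train.pl` (placeholder names), `list(read_parallel("train.ru", "train.pl", "ru", "pl"))` would produce a list of such records. Raising on a length mismatch is a design choice of this sketch: a silent truncation would hide a broken alignment.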
The dataset is divided into three splits: train, validation, and test. All the headlines are scraped from news websites on the internet.
| | train | validation | test |
|---|---|---|---|
| ru-pl | 20001 | 3001 | 2969 |
| pl-ru | 20001 | 3001 | 2969 |
| en-pl | 129255 | 1000 | 9845 |
This data was curated as a task for PolEval 2019. The task is to train the best possible machine translation system, using any technology, with limited textual resources. The competition covered two language pairs: the more popular English-Polish (into Polish only) and the low-resource Russian-Polish (in both directions).
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures.
PolEval 2019-related papers were presented at the AI & NLP Workshop Day (Warsaw, May 31, 2019). Links to the top-performing models on the various tasks (including Task-4: Machine Translation) are available at this link
[More Information Needed]
Who are the source language producers?
The organization details of PolEval are available at this link
[More Information Needed]
Who are the annotators?
[More Information Needed]
@proceedings{ogr:kob:19:poleval,
  editor = {Maciej Ogrodniczuk and Łukasz Kobyliński},
  title = {{Proceedings of the PolEval 2019 Workshop}},
  year = {2019},
  address = {Warsaw, Poland},
  publisher = {Institute of Computer Science, Polish Academy of Sciences},
  url = {http://2019.poleval.pl/files/poleval2019.pdf},
  isbn = {978-83-63159-28-3}
}
Thanks to @vrindaprabhu for adding this dataset.