Dataset:

poleval2019_mt

Tasks:

Translation

Languages:

Multilinguality:

translation

Size:

10K<n<100K

Language creators:

expert-generated found

Annotation creators:

no-annotation

Source datasets:

original

License:

license:unknown

Dataset Card for poleval2019_mt

Dataset Summary

PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted solutions compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures. One of the tasks in PolEval-2019 was Machine Translation (Task-4).

The task is to train the best possible machine translation system, using any technology, with limited textual resources. The competition covers two language pairs: the more popular English-Polish (into Polish only) and Russian-Polish (in both directions), which can be considered a low-resource pair.

Here, Polish-English is also made available to allow for training in both directions. However, test data is available ONLY for English-Polish.

Supported Tasks and Leaderboards

Supports machine translation from Russian to Polish and from English to Polish, as well as in the reverse directions.

Languages

Polish (pl)
Russian (ru)
English (en)

Dataset Structure

Data Instances

As the training data set, a set of bi-lingual corpora aligned at the sentence level has been prepared. The corpora are saved in UTF-8 encoding as plain text, one language per file.
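The sentence-level alignment described above means that line i of one file corresponds to line i of the other. A minimal sketch of pairing two such files into records follows. The function name `pair_sentences` and the file names in the comment are hypothetical, and the record layout mirrors the `translation` dictionary used elsewhere in this card; the actual loader may differ.

```python
from itertools import zip_longest
from typing import Dict, Iterable, Iterator


def pair_sentences(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    src_lang: str,
    tgt_lang: str,
) -> Iterator[Dict[str, Dict[str, str]]]:
    """Yield one record per aligned line pair; fail if the two sides differ in length."""
    missing = object()  # sentinel marking the shorter side running out
    pairs = zip_longest(src_lines, tgt_lines, fillvalue=missing)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        if src is missing or tgt is missing:
            raise ValueError(f"inputs are not aligned: one side ends before line {line_no}")
        # Strip only the line terminator so the sentence text itself is unchanged.
        yield {"translation": {src_lang: src.rstrip("\r\n"), tgt_lang: tgt.rstrip("\r\n")}}


# In-memory demonstration with made-up lines.
records = list(pair_sentences(["a\n", "b\n"], ["x\n", "y\n"], "ru", "pl"))

# With real files (hypothetical names), open both as UTF-8:
# with open("train.ru", encoding="utf-8") as f_ru, open("train.pl", encoding="utf-8") as f_pl:
#     records = list(pair_sentences(f_ru, f_pl, "ru", "pl"))
```

Raising on a length mismatch, rather than silently truncating as plain `zip` would, makes a misaligned pair of files visible early.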

Data Fields

One example of the translation is as below:

{
  'translation': {'ru': 'не содержала в себе моделей. Модели это сравнительно новое явление. ', 
                  'pl': 'nie miała w sobie modeli. Modele to względnie nowa dziedzina. Tak więc, jeśli '}
}
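As a small illustration of this record layout, the sketch below pulls the source and target sentences out of the example above. The helper name `split_pair` is hypothetical, and stripping surrounding whitespace is an assumption about what a consumer might want, not something the dataset requires.

```python
from typing import Dict, Tuple

# The example record from this card, copied as-is (including trailing spaces).
example = {
    "translation": {
        "ru": "не содержала в себе моделей. Модели это сравнительно новое явление. ",
        "pl": "nie miała w sobie modeli. Modele to względnie nowa dziedzina. Tak więc, jeśli ",
    }
}


def split_pair(
    record: Dict[str, Dict[str, str]], src_lang: str, tgt_lang: str
) -> Tuple[str, str]:
    """Return (source, target) text from one record, with surrounding whitespace removed."""
    pair = record["translation"]
    return pair[src_lang].strip(), pair[tgt_lang].strip()


source, target = split_pair(example, "ru", "pl")
print(source)
print(target)
```

Note that in this example the two sides are not perfectly parallel: the Polish text continues past the end of the Russian text.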

Data Splits

The dataset is divided into three splits: train, validation, and test. All the headlines are scraped from news websites on the internet.

pair	train	validation	test
ru-pl	20001	3001	2969
pl-ru	20001	3001	2969
en-pl	129255	1000	9845
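To make the relative split sizes easier to compare, the following sketch computes each split's share of its language pair. The counts are copied from the table above; the names `SPLIT_SIZES` and `split_shares` are illustrative only.

```python
from typing import Dict

# Sentence-pair counts per language pair, copied from the table above.
SPLIT_SIZES: Dict[str, Dict[str, int]] = {
    "ru-pl": {"train": 20001, "validation": 3001, "test": 2969},
    "pl-ru": {"train": 20001, "validation": 3001, "test": 2969},
    "en-pl": {"train": 129255, "validation": 1000, "test": 9845},
}


def split_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the total for one language pair."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


for pair, sizes in SPLIT_SIZES.items():
    shares = split_shares(sizes)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{pair}: {sum(sizes.values())} sentence pairs ({summary})")
```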

Dataset Creation

Curation Rationale

This data was curated for a task in PolEval 2019. The task is to train the best possible machine translation system, using any technology, with limited textual resources. The competition covers two language pairs: the more popular English-Polish (into Polish only) and Russian-Polish (in both directions), which can be considered a low-resource pair.

PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures.

PolEval 2019-related papers were presented at the AI & NLP Workshop Day (Warsaw, May 31, 2019). Links to the top-performing models on the various tasks (including Task-4: Machine Translation) are available at this link.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

Details of the PolEval organization are available at this link.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@proceedings{ogr:kob:19:poleval,
  editor    = {Maciej Ogrodniczuk and Łukasz Kobyliński},
  title     = {{Proceedings of the PolEval 2019 Workshop}},
  year      = {2019},
  address   = {Warsaw, Poland},
  publisher = {Institute of Computer Science, Polish Academy of Sciences},
  url       = {http://2019.poleval.pl/files/poleval2019.pdf},
  isbn      = {978-83-63159-28-3}
}

Contributions

Thanks to @vrindaprabhu for adding this dataset.

Author:

Anonymous

Dataset size:

26.59 KB