数据集:

un_pc

任务:

翻译

计算机处理:

multilingual

大小:

10M<n<100M

语言创建人:

found

批注创建人:

found

源数据集:

original
中文

Dataset Card for [Dataset Name]

Dataset Summary

This parallel corpus consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. 6 languages, 15 bitexts

Supported Tasks and Leaderboards

The underlying task is machine translation.

Languages

[More Information Needed]

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{ziemski-etal-2016-united,
    title = "The {U}nited {N}ations Parallel Corpus v1.0",
    author = "Ziemski, Micha{\\l}  and
      Junczys-Dowmunt, Marcin  and
      Pouliquen, Bruno",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L16-1561",
    pages = "3530--3534",
    abstract = "This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal license. Apart from the pairwise aligned documents, a fully aligned subcorpus for the six official UN languages is distributed. We provide baseline BLEU scores of our Moses-based SMT systems trained with the full data of language pairs involving English and for all possible translation directions of the six-way subcorpus.",
}

Contributions

Thanks to @patil-suraj for adding this dataset.