数据集:
opus_tedtalks
任务:
翻译计算机处理:
multilingual大小:
10K<n<100K语言创建人:
found批注创建人:
found源数据集:
original许可:
license:unknownThis is a Croatian-English parallel corpus of transcribed and translated TED talks, originally extracted from https://wit3.fbk.eu . The corpus is compiled by Željko Agić and is taken from http://lt.ffzg.hr/zagic provided under the CC-BY-NC-SA license. This corpus is sentence aligned for both language pairs. The documents were collected and aligned using the Hunalign algorithm.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Annotation process[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Dataset provided for research purposes only. Please check dataset license for additional information.
[More Information Needed]
[CC-BY-NC-SA license] http://creativecommons.org/licenses/by-sa/3.0/
@InProceedings{TIEDEMANN12.463, author = {J{"o}rg Tiedemann}, title = {Parallel Data, Tools and Interfaces in OPUS}, booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} }
Thanks to @rkc007 for adding this dataset.