Parallel corpora from Web Crawls collected in the ParaCrawl project.
Tha dataset contains:
To load a language pair which isn't part of the config, all you need to do is specify the language code as pairs, e.g.
dataset = load_dataset("opus_paracrawl", lang1="en", lang2="so")
You can find the valid pairs in Homepage section of Dataset Description: http://opus.nlpl.eu/ParaCrawl.php
[More Information Needed]
The languages in the dataset are:
{ 'id': '0', 'translation': { "el": "Συνεχίστε ευθεία 300 μέτρα μέχρι να καταλήξουμε σε μια σωστή οδός (ul. Gagarina)? Περπατήστε περίπου 300 μέτρα μέχρι να φτάσετε το πρώτο ορθή οδός (ul Khotsa Namsaraeva)?", "en": "Go straight 300 meters until you come to a proper street (ul. Gagarina); Walk approximately 300 meters until you reach the first proper street (ul Khotsa Namsaraeva);" } }
The dataset contains a single train split.
[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Annotation process[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
@inproceedings{banon-etal-2020-paracrawl, title = "{P}ara{C}rawl: Web-Scale Acquisition of Parallel Corpora", author = "Ba{\~n}{\'o}n, Marta and Chen, Pinzhen and Haddow, Barry and Heafield, Kenneth and Hoang, Hieu and Espl{\`a}-Gomis, Miquel and Forcada, Mikel L. and Kamran, Amir and Kirefu, Faheem and Koehn, Philipp and Ortiz Rojas, Sergio and Pla Sempere, Leopoldo and Ram{\'\i}rez-S{\'a}nchez, Gema and Sarr{\'\i}as, Elsa and Strelec, Marek and Thompson, Brian and Waites, William and Wiggins, Dion and Zaragoza, Jaume", booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics", month = jul, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2020.acl-main.417", doi = "10.18653/v1/2020.acl-main.417", pages = "4555--4567", }
@InProceedings{TIEDEMANN12.463, author = {Jörg Tiedemann}, title = {Parallel Data, Tools and Interfaces in OPUS}, booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} }
Thanks to @rkc007 for adding this dataset.