Dataset:
opus_wikipedia
This is a corpus of parallel sentences extracted from Wikipedia by Krzysztof Wołk and Krzysztof Marasek.
The dataset contains 20 languages and 36 bitexts.
To load a language pair that isn't part of the config, specify the two language codes as `lang1` and `lang2`, e.g.
from datasets import load_dataset
dataset = load_dataset("opus_wikipedia", lang1="it", lang2="pl")
You can find the valid pairs in the Homepage section of the Dataset Description: http://opus.nlpl.eu/Wikipedia.php
The languages in the dataset are:
{ 'id': '0', 'translation': { "ar": "* Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics.", "en": "*Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics." } }
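A record of the shape shown above can be unpacked into a source/target sentence pair. The following is a minimal sketch, assuming each record carries a `translation` dict keyed by language code as in the example. The helper name `to_pair` is hypothetical and is not part of the `datasets` library, and the example text is abbreviated for illustration.

```python
def to_pair(record, src, tgt):
    """Return (source_sentence, target_sentence) from one record.

    Assumes the record layout shown in the example above:
    {"id": ..., "translation": {<language code>: <sentence>, ...}}.
    """
    translation = record["translation"]
    # Fail early with a clear message if either language is absent.
    missing = [code for code in (src, tgt) if code not in translation]
    if missing:
        raise KeyError(f"record {record.get('id')!r} lacks language(s): {missing}")
    return translation[src], translation[tgt]


# Shortened stand-in for the example record (sentences abbreviated).
example = {
    "id": "0",
    "translation": {
        "ar": "* Encyclopaedia of Mathematics online encyclopaedia from Springer",
        "en": "*Encyclopaedia of Mathematics online encyclopaedia from Springer",
    },
}

ar_text, en_text = to_pair(example, "ar", "en")
print(en_text)
```

Note that in this particular record the "ar" side is actually English text; automatically mined bitexts can contain such pairs, so some filtering may be worth considering before training.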
The dataset contains a single train split.
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
@article{WOLK2014126,
  title = {Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs},
  journal = {Procedia Technology},
  volume = {18},
  pages = {126-132},
  year = {2014},
  note = {International workshop on Innovations in Information and Communication Science and Technology, IICST 2014, 3-5 September 2014, Warsaw, Poland},
  issn = {2212-0173},
  doi = {https://doi.org/10.1016/j.protcy.2014.11.024},
  url = {https://www.sciencedirect.com/science/article/pii/S2212017314005453},
  author = {Krzysztof Wołk and Krzysztof Marasek},
  keywords = {Comparable corpora, machine translation, NLP},
}
@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
Thanks to @rkc007 for adding this dataset.