Dataset:
yhavinga/ccmatrix
This corpus has been extracted from web crawls using the margin-based bitext mining techniques described at https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix .
[More Information Needed]
Configs are generated for all language pairs in both directions. The valid pairs are listed on the homepage given in the Dataset Description: https://opus.nlpl.eu/CCMatrix.php. For example:
from datasets import load_dataset
dataset = load_dataset("yhavinga/ccmatrix", "en-nl", streaming=True)
This opens the en-nl dataset in streaming mode. Without streaming, downloading and preparing the data takes tens of minutes. You can inspect an element with:
print(next(iter(dataset['train'])))
{'id': 0, 'score': 1.2499677, 'translation': {'en': 'They come from all parts of Egypt, just like they will at the day of His coming.', 'nl': 'Zij kwamen uit alle delen van Egypte, evenals zij op de dag van Zijn komst zullen doen.'}}
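To look at more than one element without consuming the whole stream, the iterator can be cut off after a fixed number of items. The following is a minimal sketch, not part of the original card: `head` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in that mimics the record layout so the snippet runs offline. With the real data you would pass `dataset['train']` instead.

```python
from itertools import islice

def head(stream, n=3):
    """Return the first n records of any iterable, e.g. a streaming split."""
    return list(islice(stream, n))

def fake_stream():
    # Hypothetical stand-in for dataset['train']: yields records with the
    # same keys as the card describes (id, score, translation).
    i = 0
    while True:
        yield {"id": i, "score": 1.0, "translation": {"en": "...", "nl": "..."}}
        i += 1

# With the real dataset: first = head(dataset['train'], 3)
first = head(fake_stream(), 3)
for example in first:
    print(example["id"], example["translation"]["en"])
```

Because `islice` stops pulling from the iterator after `n` items, only the beginning of the stream is read.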
For example:
{
  "id": 1,
  "score": 1.2498379,
  "translation": {
    "nl": "En we moeten elke waarheid vals noemen die niet minstens door een lach vergezeld ging.”",
    "en": "And we should call every truth false which was not accompanied by at least one laugh.”"
  }
}
Each example contains an integer id (starting at 0), a score, and a translation dictionary that maps each of the pair's two language codes to the text in that language.
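Given this layout, a common step is to keep only pairs above some score and unpack them into (source, target) tuples. The sketch below is illustrative only: `iter_pairs` is a hypothetical helper, the threshold `1.2499` is an arbitrary value chosen to separate the two sample records, and the `sample` list reuses the two examples shown on this card so the code runs without downloading anything.

```python
def iter_pairs(examples, src, tgt, min_score=0.0):
    """Yield (source_text, target_text) for examples scoring at least min_score."""
    for ex in examples:
        if ex["score"] >= min_score:
            yield ex["translation"][src], ex["translation"][tgt]

# The two records shown on this card, used as offline sample data.
sample = [
    {"id": 0, "score": 1.2499677, "translation": {
        "en": "They come from all parts of Egypt, just like they will at the day of His coming.",
        "nl": "Zij kwamen uit alle delen van Egypte, evenals zij op de dag van Zijn komst zullen doen."}},
    {"id": 1, "score": 1.2498379, "translation": {
        "nl": "En we moeten elke waarheid vals noemen die niet minstens door een lach vergezeld ging.”",
        "en": "And we should call every truth false which was not accompanied by at least one laugh.”"}},
]

# Arbitrary illustrative threshold; only the first record passes it.
pairs = list(iter_pairs(sample, "en", "nl", min_score=1.2499))
for en_text, nl_text in pairs:
    print(en_text, "->", nl_text)
```

The helper accepts any iterable of records, so the same call should also work on a streaming split such as `dataset['train']`.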
Only a train split is provided.
[More Information Needed]
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
IMPORTANT: Please cite references [2] and [3] if you use this data.
This HuggingFace CCMatrix dataset is a wrapper around the service and files prepared and hosted by OPUS: