Dataset:

yhavinga/ccmatrix

Dataset Card for CCMatrix v1

Dataset Summary

This corpus was extracted from web crawls using the margin-based bitext mining techniques described in the LASER repository: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix

  • 90 languages, 1,197 bitexts
  • total number of files: 90
  • total number of tokens: 112.14G
  • total number of sentence fragments: 7.37G

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Configs are generated for all language pairs in both directions. The valid pairs are listed on the dataset homepage: https://opus.nlpl.eu/CCMatrix.php. For example:

from datasets import load_dataset
dataset = load_dataset("yhavinga/ccmatrix", "en-nl", streaming=True)

This opens the en-nl dataset in streaming mode. Without streaming, downloading and preparing the data takes tens of minutes. You can inspect an element with:

print(next(iter(dataset['train'])))
{'id': 0, 'score': 1.2499677, 'translation': {'en': 'They come from all parts of Egypt, just like they will at the day of His coming.', 'nl': 'Zij kwamen uit alle delen van Egypte, evenals zij op de dag van Zijn komst zullen doen.'}}
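
Because every record carries a score, a common next step is to keep only the higher-scoring pairs while streaming. The following is a minimal sketch, not part of the dataset's own tooling. It assumes that a higher score means a more confident alignment, and the helper names high_scoring and take, the threshold values, and the third toy record are all hypothetical. The toy list stands in for dataset['train'], which can be passed in its place.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Example = Dict[str, Any]


def high_scoring(examples: Iterable[Example], min_score: float) -> Iterator[Example]:
    """Lazily yield examples whose score is at least min_score."""
    for example in examples:
        if example["score"] >= min_score:
            yield example


def take(examples: Iterable[Example], limit: int) -> List[Example]:
    """Materialize at most `limit` examples from a (possibly endless) stream."""
    return list(islice(examples, limit))


# Toy records shaped like the streamed examples. The ids and scores of the
# first two come from this card; the third record is made up for illustration,
# and the sentence texts are shortened placeholders.
sample = [
    {"id": 0, "score": 1.2499677, "translation": {"en": "They come ...", "nl": "Zij kwamen ..."}},
    {"id": 1, "score": 1.2498379, "translation": {"en": "And we ...", "nl": "En we ..."}},
    {"id": 2, "score": 1.05, "translation": {"en": "placeholder", "nl": "placeholder"}},
]

# With the real data, replace `sample` with dataset["train"].
kept = take(high_scoring(sample, min_score=1.2), limit=10)
print([example["id"] for example in kept])
```

Filtering lazily keeps memory use flat, which matters when the stream is far too large to hold in memory.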

Dataset Structure

Data Instances

For example:

{
    "id": 1,
    "score": 1.2498379,
    "translation": {
        "nl": "En we moeten elke waarheid vals noemen die niet minstens door een lach vergezeld ging.”",
        "en": "And we should call every truth false which was not accompanied by at least one laugh.”"
    }
}

Data Fields

Each example contains an integer id (starting at 0), a score, and a translation dictionary that maps each of the two language codes in the pair to the sentence in that language.
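
Most translation pipelines want a plain (source, target) tuple rather than the nested dictionary. A minimal sketch of that conversion follows, using the record shown under Data Instances. The helper name to_pair is hypothetical and not part of the datasets library.

```python
from typing import Any, Dict, Tuple


def to_pair(example: Dict[str, Any], src: str, tgt: str) -> Tuple[str, str]:
    """Return (source sentence, target sentence) for the requested direction."""
    translation = example["translation"]
    missing = [code for code in (src, tgt) if code not in translation]
    if missing:
        # The requested language is not part of this config's language pair.
        raise KeyError(f"language code(s) not in example: {missing}")
    return translation[src], translation[tgt]


# The record shown under Data Instances.
example = {
    "id": 1,
    "score": 1.2498379,
    "translation": {
        "nl": "En we moeten elke waarheid vals noemen die niet minstens door een lach vergezeld ging.”",
        "en": "And we should call every truth false which was not accompanied by at least one laugh.”",
    },
}

source, target = to_pair(example, src="nl", tgt="en")
print(source)
print(target)
```

Since each record holds both sentences, the same record can be read in either direction by swapping src and tgt.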

Data Splits

Only a train split is provided.
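
With a single split, any held-out set has to be carved out by the user. One possible approach, sketched below, routes every N-th record to a validation set based on its id. It assumes ids are integers counted from 0, as described under Data Fields. The function names and the holdout_every parameter are hypothetical, and the toy records contain only an id.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Example = Dict[str, Any]


def assign_split(example: Example, holdout_every: int = 1000) -> str:
    """Deterministically map an example to 'validation' or 'train' by its id."""
    if holdout_every < 1:
        raise ValueError("holdout_every must be a positive integer")
    return "validation" if example["id"] % holdout_every == 0 else "train"


def route(examples: Iterable[Example], holdout_every: int = 1000) -> Iterator[Tuple[str, Example]]:
    """Lazily tag each streamed example with the split it belongs to."""
    for example in examples:
        yield assign_split(example, holdout_every), example


# Toy stream with ids 0..9; with the real data, pass dataset["train"] instead.
toy = [{"id": i} for i in range(10)]
routed = list(route(toy, holdout_every=5))
validation_ids = [ex["id"] for split, ex in routed if split == "validation"]
train_ids = [ex["id"] for split, ex in routed if split == "train"]
print(validation_ids)  # [0, 5]
```

Keying the decision on the id makes the split reproducible across runs without storing any state, at the cost of not being random.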

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

IMPORTANT: Please cite references [2] and [3] if you use this data.

  • [1] CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data by Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin and Edouard Grave.
  • [2] CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB by Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin.
  • [3] Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin.
  • This HuggingFace CCMatrix dataset is a wrapper around the service and files prepared and hosted by OPUS:

Contributions