Dataset:
neuclir/neuclir1
License:
odc-by
Source datasets:
extended|c4
Annotation creators:
no-annotation
Language creators:
found
Size:
1M<n<10M
Multilinguality:
multilingual
Subtasks:
document-retrieval
Tasks:
text-retrieval

This is the dataset created for the TREC 2022 NeuCLIR Track. The collection is designed to be similar to HC4, and a large portion of the documents from HC4 are ported to this collection. The documents are Web pages from Common Crawl in Chinese, Persian, and Russian.

Split | Documents |
---|---|
fas (Persian) | 2.2M |
rus (Russian) | 4.6M |
zho (Chinese) | 3.2M |
Using 🤗 Datasets:
```python
from datasets import load_dataset

dataset = load_dataset('neuclir/neuclir1')

dataset['fas']  # Persian documents
dataset['rus']  # Russian documents
dataset['zho']  # Chinese documents
```
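
Since each split holds millions of documents, it may be convenient to look at a few records before downloading everything. The following is a minimal sketch, assuming the `streaming=True` option of `load_dataset` works for this dataset. The helper names `take` and `preview_split` are hypothetical and not part of the dataset or the library.

```python
from itertools import islice


def take(records, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(records, n))


def preview_split(split, n=3):
    """Stream the first n documents of one split (assumed: 'fas', 'rus', 'zho')."""
    # Imported lazily so that take() stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True yields records lazily instead of downloading the full split.
    stream = load_dataset('neuclir/neuclir1', split=split, streaming=True)
    return take(stream, n)
```

Calling `preview_split('fas')` should return a short list of Persian documents as dictionaries. This card does not list the field names, so inspecting `doc.keys()` on the first result is a reasonable first step.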