Datasets:

neuclir/neuclir1

License:

odc-by

Source datasets:

extended|c4

Annotations creators:

no-annotation

Language creators:

found

Size categories:

1M<n<10M

Multilinguality:

multilingual
Chinese

Dataset Card for NeuCLIR1

Dataset Summary

This is the dataset created for the TREC 2022 NeuCLIR Track. The collection is designed to be similar to HC4, and a large portion of the documents from HC4 are ported to this collection. The documents are Web pages from Common Crawl in Chinese, Persian, and Russian.

Languages

  • Chinese
  • Persian
  • Russian

Dataset Structure

Data Instances

Split           Documents
fas (Persian)   2.2M
rus (Russian)   4.6M
zho (Chinese)   3.2M

Data Fields

  • id : unique identifier for this document
  • cc_file : source file from Common Crawl
  • time : extracted date/time from article
  • title : title extracted from article
  • text : extracted article body
  • url : source URL
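
As a rough illustration of the fields listed above, one record can be modeled as a dictionary with six keys. The sketch below assumes every field is a string, which the card does not state explicitly. The class name `NeuCLIRDoc`, the helper `to_index_text`, and all sample values are hypothetical and are not taken from the dataset.

```python
from typing import TypedDict


class NeuCLIRDoc(TypedDict):
    """Shape of one record, following the field list above (string types are an assumption)."""
    id: str
    cc_file: str
    time: str
    title: str
    text: str
    url: str


def to_index_text(doc: NeuCLIRDoc) -> str:
    """Join title and body into one string, skipping the title when it is empty."""
    title = doc["title"].strip()
    body = doc["text"].strip()
    return f"{title}\n{body}" if title else body


# Hypothetical record: every value here is made up for illustration.
sample: NeuCLIRDoc = {
    "id": "doc-0001",
    "cc_file": "example.warc.gz",
    "time": "2020-01-01T00:00:00",
    "title": "Example title",
    "text": "Example body text.",
    "url": "https://example.com/article",
}

print(to_index_text(sample))
```

Concatenating `title` and `text` is one common way to prepare a document for retrieval indexing; whether it suits a given system is a design choice, not something this card prescribes.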

Dataset Usage

Using 🤗 Datasets:

from datasets import load_dataset

dataset = load_dataset('neuclir/neuclir1')
dataset['fas'] # Persian documents
dataset['rus'] # Russian documents
dataset['zho'] # Chinese documents
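
Since each split holds millions of documents, it may be convenient to iterate lazily instead of downloading everything first. The sketch below uses the `streaming=True` option of `load_dataset`; whether streaming works well for this particular dataset is an assumption. The helper names `stream_split` and `first_titles` are hypothetical.

```python
from itertools import islice
from typing import Dict, Iterable, List


def stream_split(split: str):
    """Open one split lazily; needs the `datasets` package and network access."""
    # Imported here so that first_titles below can be used without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("neuclir/neuclir1", split=split, streaming=True)


def first_titles(docs: Iterable[Dict[str, str]], n: int) -> List[str]:
    """Return the titles of the first n documents without reading any further."""
    return [doc["title"] for doc in islice(docs, n)]


# Usage (not executed here): first_titles(stream_split("fas"), 3)
```

Because `first_titles` accepts any iterable of dictionaries with a `title` key, it can be tried on a small in-memory list before pointing it at a streamed split.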