Dataset Card for CSC

Chinese spelling correction dataset

Repository: https://github.com/shibing624/pycorrector

Dataset Description

Chinese Spelling Correction (CSC) is the task of detecting and correcting misspelled characters in Chinese text.

CSC is challenging because many Chinese characters look or sound alike yet have quite different meanings.

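A Chinese spelling correction dataset of roughly 270,000 examples, built by merging and cleaning the original SIGHAN 2013, 2014, and 2015 datasets together with the Wang271k dataset. The data is in JSON format and includes the positions of the erroneous characters.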

Original Dataset Summary

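test.json and dev.json are the SIGHAN datasets (SIGHAN13, 14, and 15), taken from the official csc.html page. File size: 339 KB, roughly 4,000 examples.
train.json is the Wang271k dataset, from Automatic-Corpus-Generation, provided by dimmywang. File size: 93 MB, roughly 270,000 examples.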

To use only the SIGHAN data, load the splits like this:

from datasets import load_dataset
dev_ds = load_dataset('shibing624/CSC', split='validation')
print(dev_ds)
print(dev_ds[0])
test_ds = load_dataset('shibing624/CSC', split='test')
print(test_ds)
print(test_ds[0])

Supported Tasks and Leaderboards

Chinese spelling correction task

The dataset is designed for training pretrained language models on the CSC task.
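
As a rough illustration of how a record could be turned into training targets, the sketch below derives per-character detection labels (1 for an erroneous character, 0 otherwise) and the list of character substitutions. The helper names `detection_labels` and `correction_pairs` are hypothetical and not part of pycorrector; the record is the sample from this card, and the code assumes `original_text` and `correct_text` have the same length.

```python
from typing import Dict, List, Tuple


def detection_labels(example: Dict) -> List[int]:
    """Return one label per character of original_text: 1 if wrong, else 0."""
    wrong = set(example["wrong_ids"])
    return [1 if i in wrong else 0 for i in range(len(example["original_text"]))]


def correction_pairs(example: Dict) -> List[Tuple[int, str, str]]:
    """Return (position, wrong_char, correct_char) for each flagged position.

    Assumes both texts have equal length, as in the sample record.
    """
    src, tgt = example["original_text"], example["correct_text"]
    return [(i, src[i], tgt[i]) for i in example["wrong_ids"]]


# Sample record copied from this dataset card.
sample = {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。",
}

labels = detection_labels(sample)
pairs = correction_pairs(sample)
print(pairs)
```

Labels of this shape suit a detection head, while the pairs give the target characters for a correction head; whether a given model uses one or both is a design choice outside the scope of this card.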

Languages

The data in CSC are in Chinese.

Dataset Structure

Data Instances

An example of "train" looks as follows:

{
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [
        5,
        31
    ],
    "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。"
}

Data Fields

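Field descriptions:

id: unique identifier, carries no meaning
original_text: the original text containing errors
wrong_ids: positions of the erroneous characters, 0-indexed
correct_text: the corrected text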
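
The relationship between the fields can be checked with a short sketch: comparing `original_text` and `correct_text` character by character should reproduce `wrong_ids`. The function name `find_wrong_ids` is hypothetical, and the code assumes errors are same-length character substitutions, which holds for the sample record but is not stated as a guarantee for every record.

```python
from typing import List


def find_wrong_ids(original_text: str, correct_text: str) -> List[int]:
    """Return the 0-based positions where the two texts differ.

    Assumption: both texts have the same length (substitution errors only).
    """
    if len(original_text) != len(correct_text):
        raise ValueError("texts differ in length; positional comparison is undefined")
    return [i for i, (a, b) in enumerate(zip(original_text, correct_text)) if a != b]


# Texts copied from the sample record in this card.
original = "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。"
corrected = "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。"

recovered = find_wrong_ids(original, corrected)
print(recovered)  # [5, 31], matching the sample's wrong_ids
```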

Data Splits

Number of examples per split:

Dataset	train	dev	test
CSC	251835	27981	1100

Licensing Information

The dataset is available under the Apache 2.0 license.

Citation Information

@misc{Xu_Pycorrector_Text_error,
  title={Pycorrector: Text error correction tool},
  author={Xu Ming},
  year={2021},
  howpublished={\url{https://github.com/shibing624/pycorrector}},
}

Contributions

Compiled and uploaded by shibing624.

Author:

shibing624

Dataset size:

105.33 MB