Dataset: shibing624/CSC
Chinese Spelling Correction Dataset
Chinese Spelling Correction (CSC) is a task to detect and correct misspelled characters in Chinese texts.
CSC is challenging because many Chinese characters are visually or phonologically similar yet have quite different meanings.
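This Chinese spelling correction dataset contains about 270,000 entries. It was obtained by merging and cleaning the original SIGHAN 2013, 2014, and 2015 datasets together with the Wang271k dataset. The data is in JSON format and includes the positions of the erroneous characters.
If you only want to use the SIGHAN data, you can load it like this: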
    from datasets import load_dataset

    # Load the validation split and inspect it
    dev_ds = load_dataset('shibing624/CSC', split='validation')
    print(dev_ds)
    print(dev_ds[0])

    # Load the test split and inspect it
    test_ds = load_dataset('shibing624/CSC', split='test')
    print(test_ds)
    print(test_ds[0])
Chinese spelling correction task
The dataset is designed for training pretrained language models on the CSC task.
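As a rough illustration of how a record might be prepared for training, the sketch below turns `wrong_ids` into one 0/1 label per character of `original_text`. Framing the task as per-character error detection is an assumption made here for illustration, not something this card prescribes, and `detection_labels` is a hypothetical helper name. The record is the sample shown in this card.

```python
from typing import Dict, List


def detection_labels(sample: Dict) -> List[int]:
    """Return one label per character of original_text (1 = misspelled, 0 = correct)."""
    wrong = set(sample["wrong_ids"])
    return [1 if i in wrong else 0 for i in range(len(sample["original_text"]))]


# Sample record copied from this dataset card.
sample = {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。",
}

labels = detection_labels(sample)

# Characters flagged as errors, paired with the character the corrected text has there.
flagged = [
    (sample["original_text"][i], sample["correct_text"][i])
    for i, y in enumerate(labels)
    if y == 1
]
print(flagged)
```

The correction target for each flagged position can be read directly from `correct_text` at the same index, since in this sample the two texts have the same length.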
The data in CSC are in Chinese.
An example of "train" looks as follows:
    {
      "id": "B2-4029-3",
      "original_text": "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。",
      "wrong_ids": [5, 31],
      "correct_text": "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
    }
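Field descriptions: id is the sample identifier, original_text is the source sentence containing the errors, wrong_ids lists the 0-based character positions of the errors in original_text, and correct_text is the corrected sentence.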
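The relationship between the fields can be checked with a short sketch: comparing `original_text` and `correct_text` character by character should reproduce `wrong_ids`. This assumes the two texts have equal length, i.e. that errors are character substitutions, which holds for the sample above but is not guaranteed here for every record. `diff_positions` is a hypothetical helper name.

```python
from typing import List


def diff_positions(original: str, corrected: str) -> List[int]:
    """Return the 0-based indices at which the two strings differ."""
    if len(original) != len(corrected):
        # Assumption of this sketch: substitution-only errors, so lengths match.
        raise ValueError("texts differ in length; positions cannot be aligned 1:1")
    return [i for i, (a, b) in enumerate(zip(original, corrected)) if a != b]


original_text = "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。"
correct_text = "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
wrong_ids = [5, 31]

positions = diff_positions(original_text, correct_text)
print(positions == wrong_ids)
```

A check like this can serve as a sanity test when loading the data, for example to flag records whose positions and texts disagree.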
Dataset | train | dev | test |
---|---|---|---|
CSC | 251,835 samples | 27,981 samples | 1,100 samples |
The dataset is available under the Apache 2.0 license.
    @misc{Xu_Pycorrector_Text_error,
      title={Pycorrector: Text error correction tool},
      author={Xu Ming},
      year={2021},
      howpublished={\url{https://github.com/shibing624/pycorrector}},
    }
Compiled and uploaded by shibing624.