Dataset: Bingsu/KcBERT_Pre-Training_Corpus
Language: ko
Multilinguality: monolingual
Size: 10M<n<100M
Language creators: crowdsourced
Annotation creators: no-annotation
Source datasets: original
License: cc-by-sa-4.0

GitHub KcBERT Repo: https://github.com/Beomi/KcBERT

KcBERT is a Korean Comments BERT pretrained on this corpus. (You can use it via Hugging Face's Transformers library!)
This dataset contains the CLEANED text, preprocessed with the code below.
```python
import re

import emoji
from soynlp.normalizer import repeat_normalize

# Whitelist: basic punctuation, ASCII, Korean jamo/syllables, and emoji.
emojis = ''.join(emoji.UNICODE_EMOJI.keys())
pattern = re.compile(f'[^ .,?!/@$%~%·∼()\x00-\x7Fㄱ-힣{emojis}]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')


def clean(x):
    x = pattern.sub(' ', x)    # drop characters outside the whitelist
    x = url_pattern.sub('', x) # strip URLs
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)  # collapse long character repeats
    return x
```
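If you only want to reproduce the cleaning step without installing `emoji` and `soynlp`, a standard-library-only sketch is below. Note the simplifications: the emoji whitelist is dropped (so emoji are removed rather than kept), and `repeat_normalize` is replaced by a crude regex that collapses runs of three or more identical characters down to two.

```python
import re

# Same whitelist as the original, minus the emoji set.
pattern = re.compile(r'[^ .,?!/@$%~%·∼()\x00-\x7Fㄱ-힣]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')


def clean_simple(x: str) -> str:
    x = pattern.sub(' ', x)     # drop characters outside the whitelist
    x = url_pattern.sub('', x)  # strip URLs
    x = x.strip()
    # Crude stand-in for soynlp's repeat_normalize(x, num_repeats=2):
    # collapse any run of 3+ identical characters to 2.
    x = re.sub(r'(.)\1{2,}', r'\1\1', x)
    return x
```

This is a sketch for illustration only; the published corpus was produced with the original `clean` function above.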
```python
>>> from datasets import load_dataset
>>> dataset = load_dataset("Bingsu/KcBERT_Pre-Training_Corpus")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 86246285
    })
})
```
- Download size: 7.90 GiB
- Generated dataset size: 11.86 GiB
- Total size: 19.76 GiB
※ You can also download this dataset from Kaggle; that download is 5 GiB (12.48 GiB when uncompressed).
| | train |
|---|---|
| # of texts | 86246285 |