Datasets:

HeNLP/HeDC4

Languages:

he

Size:

1B<n<10B

ArXiv:

arxiv:2304.11077

Dataset Summary

HeDC4 is a Hebrew Deduplicated and Cleaned Common Crawl Corpus: a thoroughly cleaned and approximately deduplicated dataset for unsupervised learning.
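
Because the corpus falls in the 1B<n<10B size range, it may be convenient to stream a few records before committing to a full download. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository exposes a split named `train`; the helper name `stream_examples` is hypothetical and not part of the dataset or the library.

```python
from itertools import islice

DATASET_ID = "HeNLP/HeDC4"  # repository id taken from this card


def stream_examples(n=3, split="train"):
    """Return the first n records without downloading the full corpus.

    The split name "train" is an assumption; adjust it if the repository differs.
    """
    # Imported lazily so this module loads even where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of materializing the corpus.
    ds = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(islice(ds, n))
```

Calling `stream_examples(3)` returns a list of up to three records as dictionaries; inspect their keys to find the column names, which this card does not specify.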

Citing

If you use HeDC4 in your research, please cite HeRo: RoBERTa and Longformer Hebrew Language Models.

@article{shalumov2023hero,
      title={HeRo: RoBERTa and Longformer Hebrew Language Models}, 
      author={Vitaly Shalumov and Harel Haskey},
      year={2023},
      journal={arXiv:2304.11077},
}