Dataset:
HeNLP/HeDC4
A Hebrew Deduplicated and Cleaned Common Crawl Corpus. A thoroughly cleaned and approximately deduplicated dataset for unsupervised learning.
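To make "approximately deduplicated" concrete, here is a minimal sketch of near-duplicate filtering using word shingles and Jaccard similarity. This card does not describe the method actually used to build HeDC4, so the function names (`shingles`, `jaccard`, `dedup`), the shingle size, and the 0.8 threshold are all illustrative assumptions, not the authors' pipeline.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text.

    Hypothetical helper for illustration; n=3 is an arbitrary choice.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Short documents get a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any kept document.

    The threshold is an assumed value, not one taken from HeDC4.
    This pairwise comparison is quadratic and only suits small inputs.
    """
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about Hebrew language models",
]
unique_docs = dedup(docs)
```

At web-crawl scale, exhaustive pairwise comparison is impractical, so pipelines typically approximate this similarity with hashing schemes such as MinHash; the sketch only shows the underlying idea.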
If you use HeDC4 in your research, please cite "HeRo: RoBERTa and Longformer Hebrew Language Models".
@article{shalumov2023hero,
  title={HeRo: RoBERTa and Longformer Hebrew Language Models},
  author={Vitaly Shalumov and Harel Haskey},
  year={2023},
  journal={arXiv:2304.11077},
}