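Datasets:

hrwac

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling

masked-language-modeling

Languages:

Croatian

Multilinguality:

monolingual

Size Categories:

1B<n<10B

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

License:

cc-by-sa-3.0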


Dataset Card for hrWaC

Dataset Summary

The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraphs. Each paragraph contains metadata on the URL, domain and language identification (Croatian vs. Serbian).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset is monolingual; its language is Croatian.

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

sentence: a sentence from the corpus, stored as a string
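As a rough illustration of the single-field record shape described above, the following Python sketch iterates over records and extracts their `sentence` strings. The helper names `iter_sentences` and `load_hrwac_train` are hypothetical, the sample records are made up, and the dataset id `hrwac` and the `train` split passed to the Hugging Face `datasets` library are assumptions that this card does not confirm.

```python
from typing import Dict, Iterable, Iterator


def iter_sentences(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield the stripped, non-empty `sentence` string of each record."""
    for record in records:
        text = record["sentence"].strip()
        if text:
            yield text


def load_hrwac_train():
    """Load the corpus with Hugging Face `datasets` (assumed id and split)."""
    # Imported lazily so iter_sentences works without the library installed.
    from datasets import load_dataset

    # Assumption: dataset id "hrwac" and a "train" split; the card's
    # Data Splits section does not confirm either.
    return load_dataset("hrwac", split="train")


# Hypothetical records in the shape described by the Data Fields section.
sample = [{"sentence": " Ovo je primjer. "}, {"sentence": ""}]
print(list(iter_sentences(sample)))
```

Because the corpus falls in the 1B<n<10B size category, iterating lazily over records, as the generator does here, is likely preferable to materialising all sentences in memory.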

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is released under the CC-BY-SA 3.0 license.

Citation Information

@misc{11356/1064,
  title = {Croatian web corpus {hrWaC} 2.1},
  author = {Ljube{\v s}i{\'c}, Nikola and Klubi{\v c}ka, Filip},
  url = {http://hdl.handle.net/11356/1064},
  note = {Slovenian language resource repository {CLARIN}.{SI}},
  copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
  year = {2016}
}

Contributions

Thanks to @IvanZidov for adding this dataset.

Author:

Unknown

Dataset size:

12.34 KB