数据集:
hrwac
语言:
hr计算机处理:
monolingual大小:
1B<n<10B语言创建人:
found批注创建人:
no-annotation源数据集:
original许可:
cc-by-sa-3.0The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraphs. Each paragraph contains metadata on the URL, domain and language identification (Croatian vs. Serbian).
[More Information Needed]
Dataset is monolingual in Croatian language.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Dataset is under the CC-BY-SA 3.0 license.
@misc{11356/1064, title = {Croatian web corpus {hrWaC} 2.1}, author = {Ljube{\v s}i{\'c}, Nikola and Klubi{\v c}ka, Filip}, url = {http://hdl.handle.net/11356/1064}, note = {Slovenian language resource repository {CLARIN}.{SI}}, copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)}, year = {2016} }
Thanks to @IvanZidov for adding this dataset.