数据集:
srwac
语言:
sr计算机处理:
monolingual大小:
100M<n<1B语言创建人:
found批注创建人:
no-annotation源数据集:
original许可:
cc-by-sa-3.0The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraphs. Each paragraph contains metadata on the URL, domain and language identification (Serbian vs. Croatian).
[More Information Needed]
Dataset is monolingual in Serbian language.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Dataset is under the CC-BY-SA 3.0 license.
@misc{11356/1063, title = {Serbian web corpus {srWaC} 1.1}, author = {Ljube{\v s}i{\'c}, Nikola and Klubi{\v c}ka, Filip}, url = {http://hdl.handle.net/11356/1063}, note = {Slovenian language resource repository {CLARIN}.{SI}}, copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)}, year = {2016} }
Thanks to @IvanZidov for adding this dataset.