Dataset: bswac
Language: bs
Multilinguality: monolingual
Size: 100M<n<1B
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License: cc-by-sa-3.0

The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated at the paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraph. Each paragraph carries metadata on its URL, domain and language identification (Bosnian vs. Croatian vs. Serbian).
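To make "near-deduplicated at the paragraph level" concrete, here is a minimal sketch of one common approach: compare paragraphs by the Jaccard similarity of their character 5-gram sets and drop any paragraph that is too similar to one already kept. This is an illustration only, not the procedure used to build bsWaC; the function names, the 5-gram size and the 0.8 threshold are all assumptions chosen for the example.

```python
def shingles(text, n=5):
    """Return the set of character n-grams of a lowercased, whitespace-normalised paragraph."""
    norm = " ".join(text.lower().split())
    if len(norm) <= n:
        return {norm}
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_deduplicate(paragraphs, threshold=0.8):
    """Keep the first paragraph of each near-duplicate group (hypothetical threshold)."""
    kept = []
    kept_shingles = []
    for paragraph in paragraphs:
        current = shingles(paragraph)
        # Drop the paragraph if it is too similar to anything already kept.
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue
        kept.append(paragraph)
        kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    sample = [
        "Sarajevo je glavni grad Bosne i Hercegovine.",
        "Sarajevo je glavni grad Bosne i Hercegovine!",
        "Mostar je grad na rijeci Neretvi.",
    ]
    for paragraph in near_deduplicate(sample):
        print(paragraph)
```

The pairwise comparison is quadratic in the number of paragraphs, so it is only suitable for small samples; at web-corpus scale one would typically switch to a hashing scheme such as MinHash.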
[More Information Needed]
The dataset is monolingual; all text is in Bosnian.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The dataset is released under the CC-BY-SA 3.0 license.
@misc{11356/1062,
  title = {Bosnian web corpus {bsWaC} 1.1},
  author = {Ljube{\v s}i{\'c}, Nikola and Klubi{\v c}ka, Filip},
  url = {http://hdl.handle.net/11356/1062},
  note = {Slovenian language resource repository {CLARIN}.{SI}},
  copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
  year = {2016}
}
Thanks to @IvanZidov for adding this dataset.