Datasets: cawac
Languages: Catalan
Multilinguality: monolingual
Size: 10M<n<100M
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License: CC-BY-SA 3.0
caWaC is a 780-million-token web corpus of Catalan built from the .cat top-level domain in late 2013.
The dataset is monolingual and contains only Catalan text.
Who are the source language producers?
[More Information Needed]
Who are the annotators?
[More Information Needed]
The dataset is released under the CC-BY-SA 3.0 license.
@inproceedings{DBLP:conf/lrec/LjubesicT14,
author = {Nikola Ljubesic and
Antonio Toral},
editor = {Nicoletta Calzolari and
Khalid Choukri and
Thierry Declerck and
Hrafn Loftsson and
Bente Maegaard and
Joseph Mariani and
Asunci{\'{o}}n Moreno and
Jan Odijk and
Stelios Piperidis},
title = {caWaC - {A} web corpus of Catalan and its application to language
modeling and machine translation},
booktitle = {Proceedings of the Ninth International Conference on Language Resources
and Evaluation, {LREC} 2014, Reykjavik, Iceland, May 26-31, 2014},
pages = {1728--1732},
publisher = {European Language Resources Association {(ELRA)}},
year = {2014},
url = {http://www.lrec-conf.org/proceedings/lrec2014/summaries/841.html},
timestamp = {Mon, 19 Aug 2019 15:23:35 +0200},
biburl = {https://dblp.org/rec/conf/lrec/LjubesicT14.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @albertvillanova for adding this dataset.