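Dataset:

cawac

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling

masked-language-modeling

Languages:

Catalan

Multilinguality:

monolingual

Size Categories:

10M<n<100M

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

License:

cc-by-sa-3.0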

Dataset Card for caWaC

Dataset Summary

caWaC is a 780-million-token web corpus of Catalan, built from the .cat top-level domain in late 2013.
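As a rough illustration of how a corpus of this size might be inspected without downloading it in full, the sketch below streams a few records with the Hugging Face `datasets` library. The dataset identifier `cawac` comes from the metadata above; the split name `train` and the field name `sentence` are assumptions, since this card leaves Data Fields and Data Splits unspecified. Depending on the installed `datasets` version, loading may need additional options.

```python
from itertools import islice


def preview(records, n=3, field="sentence"):
    """Return the `field` value of the first `n` records of an iterable.

    `field="sentence"` is an assumed column name; the card does not
    document the data fields.
    """
    return [record[field] for record in islice(records, n)]


def load_cawac_stream():
    """Open caWaC as a streamed dataset (needs network access).

    The split name "train" is an assumption, not confirmed by the card.
    """
    # Third-party import kept local so `preview` works without it.
    from datasets import load_dataset

    return load_dataset("cawac", split="train", streaming=True)
```

Calling `preview(load_cawac_stream())` would then return the first three text values, assuming the field and split names above match the published dataset.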

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset is monolingual: all text is in Catalan.

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is released under the CC-BY-SA 3.0 license.

Citation Information

@inproceedings{DBLP:conf/lrec/LjubesicT14,
  author    = {Nikola Ljubesic and
               Antonio Toral},
  editor    = {Nicoletta Calzolari and
               Khalid Choukri and
               Thierry Declerck and
               Hrafn Loftsson and
               Bente Maegaard and
               Joseph Mariani and
               Asunci{\'{o}}n Moreno and
               Jan Odijk and
               Stelios Piperidis},
  title     = {caWaC - {A} web corpus of Catalan and its application to language
               modeling and machine translation},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources
               and Evaluation, {LREC} 2014, Reykjavik, Iceland, May 26-31, 2014},
  pages     = {1728--1732},
  publisher = {European Language Resources Association {(ELRA)}},
  year      = {2014},
  url       = {http://www.lrec-conf.org/proceedings/lrec2014/summaries/841.html},
  timestamp = {Mon, 19 Aug 2019 15:23:35 +0200},
  biburl    = {https://dblp.org/rec/conf/lrec/LjubesicT14.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @albertvillanova for adding this dataset.

Author:

Anonymous

Dataset size:

10.55 KB