Dataset:

Finnish-NLP/mc4_fi_cleaned

Tasks:

Text generation

Fill-mask

Subtasks:

language-modeling masked-language-modeling

Languages:

Finnish

Multilinguality:

monolingual

Size:

size_categories:unknown

Source datasets:

extended|mc4


Dataset Card for mC4 Finnish Cleaned

Dataset Summary

mC4 Finnish Cleaned is a cleaned version of the original Finnish split of mC4.

Supported Tasks and Leaderboards

mC4 Finnish Cleaned is mainly intended for pretraining Finnish language models and word representations.

Languages

Finnish

Dataset Structure

Data Instances

[Needs More Information]

Data Fields

Each record has the following fields:

url: URL of the source, as a string
text: text content, as a string
timestamp: timestamp, as a string
perplexity_kenlm_full: perplexity of the text, as calculated by a KenLM model
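
As an illustration of how these fields might be used, the following minimal sketch filters records by their perplexity_kenlm_full value. The sample records, the helper name filter_by_perplexity, the timestamp format, and the threshold of 1000.0 are all assumptions made for this example and are not taken from the dataset; which threshold is suitable, if any, depends on the use case.

```python
from typing import Dict, Iterable, Iterator

Record = Dict[str, object]


def filter_by_perplexity(
    records: Iterable[Record], max_perplexity: float
) -> Iterator[Record]:
    """Yield only the records whose KenLM perplexity is at or below the threshold."""
    for record in records:
        if float(record["perplexity_kenlm_full"]) <= max_perplexity:
            yield record


# Made-up records that only mirror the four documented fields.
sample_records = [
    {
        "url": "https://example.com/a",
        "text": "Hyvää päivää.",
        "timestamp": "2020-01-01T00:00:00Z",  # hypothetical format
        "perplexity_kenlm_full": 210.5,
    },
    {
        "url": "https://example.com/b",
        "text": "asdf qwer zxcv",
        "timestamp": "2020-01-02T00:00:00Z",  # hypothetical format
        "perplexity_kenlm_full": 15000.0,
    },
]

# The threshold is arbitrary and chosen only for this illustration.
kept = list(filter_by_perplexity(sample_records, max_perplexity=1000.0))
print([record["url"] for record in kept])
```

The same function should work on any iterable of records with these fields. For example, assuming the dataset loads through the Hugging Face datasets library in the usual way, the iterable could come from load_dataset("Finnish-NLP/mc4_fi_cleaned", split="train", streaming=True), which avoids downloading the full dataset up front.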

Data Splits

The dataset has two splits: train and validation.

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

[Needs More Information]

Author:

Finnish-NLP

Dataset size:

60.49 GB