数据集:

wili_2018

计算机处理:

multilingual

大小:

100K<n<1M

语言创建人:

found

批注创建人:

no-annotation

源数据集:

original

预印本库:

arxiv:1801.07779

许可:

odbl
中文

Dataset Card for wili_2018

Dataset Summary

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235000 paragraphs of 235 languages. The dataset is balanced and a train-test split is provided.

Supported Tasks and Leaderboards

[Needs More Information]

Languages

235 Different Languages

Dataset Structure

Data Instances

{
    'label': 207,
    'sentence': 'Ti Turkia ket maysa a demokrata, sekular, unitario, batay-linteg a republika nga addaan ti taga-ugma a tinawtawid a kultura. Ti Turkia ket umadadu a naipatipon iti Laud babaen ti panagkameng kadagiti organisasion a kas ti Konsilo iti Europa, NATO, OECD, OSCE ken ti G-20 a dagiti kangrunaan nga ekonomia. Ti Turkia ket nangrugi a nakitulag ti napno a panagkameng iti Kappon ti Europa idi 2005, nga isu ket maysa idin a kumaduaan a kameng iti Europeano a Komunidad ti Ekonomia manipud idi 1963 ken nakadanon ti maysa a tulagan ti kappon ti aduana idi 1995. Ti Turkia ket nagtaraken iti asideg a kultural, politikal, ekonomiko ken industria a panakibiang iti Tengnga a Daya, dagiti Turko nga estado iti Tengnga nga Asia ken dagiti pagilian ti Aprika babaen ti panagkameng kadagiti organisasion a kas ti Turko a Konsilo, Nagsaupan nga Administrasion iti Turko nga Arte ken Kultura, Organisasion iti Islamiko a Panagtitinnulong ken ti Organisasion ti Ekonomiko a Panagtitinnulong.'
}

Data Fields

[Needs More Information]

Data Splits

175000 lines of text each for train and test data.

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

The dataset was initially created by Thomas Martin

Licensing Information

ODC Open Database License v1.0

Citation Information

@dataset{thoma_martin_2018_841984,
  author       = {Thoma, Martin},
  title        = {{WiLI-2018 - Wikipedia Language Identification database}},
  month        = jan,
  year         = 2018,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.841984},
  url          = {https://doi.org/10.5281/zenodo.841984}
}

Contributions

Thanks to @Shubhambindal2017 for adding this dataset.