Dataset:

universal_morphologies

Tasks:

Token Classification

Text Classification

Sub-tasks:

multi-class-classification multi-label-classification

Languages:

language:ady

language:ang

Multilinguality:

monolingual

Size categories:

10K<n<100K 1K<n<10K n<1K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

Other:

morphology

License:

cc-by-sa-3.0

Dataset Card for universal_morphologies

Dataset Summary

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology in the world’s languages. The goal of UniMorph is to annotate morphological data in a universal schema that allows an inflected word from any language to be defined by its lexical meaning, typically carried by the lemma, and by a rendering of its inflectional form in terms of a bundle of morphological features from our schema. The specification of the schema is described in Sylak-Glassman (2016).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The current version of the UniMorph dataset covers 110 languages.

Dataset Structure

Data Instances

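Each data instance consists of a lemma and the set of its possible realizations, each annotated for morphology and meaning. The annotations are stored as parallel lists: position i in every category list describes the form at position i of word. For example: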

{'forms': {'Aktionsart': [[], [], [], [], []],
  'Animacy': [[], [], [], [], []],
  ...
  'Finiteness': [[], [], [], [1], []],
  ...
  'Number': [[], [], [0], [], []],
  'Other': [[], [], [], [], []],
  'Part_Of_Speech': [[7], [10], [7], [7], [10]],
  ...
  'Tense': [[1], [1], [0], [], [0]],
  ...
  'word': ['ablated', 'ablated', 'ablates', 'ablate', 'ablating']},
 'lemma': 'ablate'}
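
Because the annotations are stored column by column, it can be easier to read them one form at a time. The following is a minimal sketch, not part of the dataset tooling: it copies only the categories printed in the example above (the elided ones are left out), and the helper name forms_as_records is hypothetical. The integers are kept as they are, since this card does not list the tag names they stand for.

```python
# Abridged copy of the instance shown above: only the categories that are
# printed in full are included; the ones elided with "..." are omitted.
instance = {
    "lemma": "ablate",
    "forms": {
        "Finiteness": [[], [], [], [1], []],
        "Number": [[], [], [0], [], []],
        "Part_Of_Speech": [[7], [10], [7], [7], [10]],
        "Tense": [[1], [1], [0], [], [0]],
        "word": ["ablated", "ablated", "ablates", "ablate", "ablating"],
    },
}


def forms_as_records(instance):
    """Regroup the parallel category lists into one record per word form.

    Hypothetical helper: categories with no tag for a form (empty list)
    are dropped from that form's record.
    """
    forms = instance["forms"]
    categories = [name for name in forms if name != "word"]
    records = []
    for i, word in enumerate(forms["word"]):
        tags = {name: forms[name][i] for name in categories if forms[name][i]}
        records.append({"word": word, "tags": tags})
    return records


records = forms_as_records(instance)
for record in records:
    print(record["word"], record["tags"])
```

Each record keeps the raw tag indices, so two identical surface forms (here the two entries for ablated) remain distinguishable by their annotations.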

Data Fields

Each instance in the dataset has the following fields:

lemma: the lemma shared by all forms in the instance
forms: all annotated forms for this lemma, with:
- word: the full word form
- [category]: a categorical variable holding one or several tags from that category (several tags represent a composite tag, originally written A+B). The categories and their possible tags follow the UniMorph schema.
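
The field layout described above implies two invariants: every category list has one entry per word form, and each entry is a list of integer tags (more than one for a composite tag). The sketch below checks them on a made-up instance; the function name check_forms, the placeholder word forms, and the tag indices are all hypothetical and do not come from the dataset.

```python
def check_forms(instance):
    """Return the number of forms, or raise ValueError if the layout is inconsistent."""
    forms = instance["forms"]
    n_forms = len(forms["word"])
    for name, column in forms.items():
        # Every column, including "word", must have one entry per form.
        if len(column) != n_forms:
            raise ValueError(f"{name!r} has {len(column)} entries, expected {n_forms}")
        if name == "word":
            continue
        # Every category entry is a (possibly empty) list of integer tags.
        for tags in column:
            if not all(isinstance(tag, int) for tag in tags):
                raise ValueError(f"{name!r} contains a non-integer tag: {tags!r}")
    return n_forms


# Hypothetical instance: the second form carries a composite tag in "Case",
# stored as two indices in the same entry (the A+B notation of the source files).
toy = {
    "lemma": "lemma-x",
    "forms": {
        "word": ["form-a", "form-b"],
        "Tense": [[1], [0]],
        "Case": [[], [2, 5]],
    },
}

n_forms = check_forms(toy)

# List every (word, category) pair whose entry holds a composite tag.
composite = [
    (word, name)
    for name, column in toy["forms"].items()
    if name != "word"
    for word, tags in zip(toy["forms"]["word"], column)
    if len(tags) > 1
]
print(n_forms, composite)
```

Representing a composite tag as a multi-element list keeps every category a plain multi-label variable, which is consistent with the multi-label-classification sub-task listed for this dataset.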

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @yjernite for adding this dataset.

Author:

Anonymous

Dataset size:

974.55 KB