Dataset:

trec

Task:

Text classification

Subtask:

multi-class-classification

Language:

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

License:

license:unknown


Dataset Card for "trec"

Dataset Summary

The Text REtrieval Conference (TREC) Question Classification dataset contains about 5,500 labeled questions in the training set (5,452 in the train split listed under Data Splits) and another 500 in the test set.

The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10, and the vocabulary size is 8,700.

The data were collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10, which serve as the test set. All questions were manually labeled.

Supported Tasks and Leaderboards

More Information Needed

Languages

The language in this dataset is English (en).

Dataset Structure

Data Instances

Size of downloaded dataset files: 0.36 MB
Size of the generated dataset: 0.41 MB
Total amount of disk used: 0.78 MB

An example from the 'train' split looks as follows.

{
  'text': 'How did serfdom develop in and then leave Russia ?',
  'coarse_label': 2,
  'fine_label': 26
}
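As a minimal sketch of the record shape shown above, the following checks that a record carries the three fields and that both labels fall within the 6 coarse and 50 fine classes stated in the summary. The names `TrecExample`, `is_valid`, `NUM_COARSE`, and `NUM_FINE` are hypothetical helpers introduced for illustration; they are not part of the dataset or of any library.

```python
from typing import TypedDict

NUM_COARSE = 6   # number of coarse classes stated in the dataset summary
NUM_FINE = 50    # number of fine classes stated in the dataset summary


class TrecExample(TypedDict):
    """Hypothetical type mirroring the fields of one record."""
    text: str
    coarse_label: int
    fine_label: int


def is_valid(example: dict) -> bool:
    """Return True if the record has all three fields with in-range labels."""
    return (
        isinstance(example.get("text"), str)
        and isinstance(example.get("coarse_label"), int)
        and isinstance(example.get("fine_label"), int)
        and 0 <= example["coarse_label"] < NUM_COARSE
        and 0 <= example["fine_label"] < NUM_FINE
    )


# The 'train' example from this card.
sample: TrecExample = {
    "text": "How did serfdom develop in and then leave Russia ?",
    "coarse_label": 2,
    "fine_label": 26,
}
print(is_valid(sample))  # True
```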

Data Fields

The data fields are the same among all splits.

text (str): Text of the question.
coarse_label (ClassLabel): Coarse class label. Possible values are:
- 'ABBR' (0): Abbreviation.
- 'ENTY' (1): Entity.
- 'DESC' (2): Description and abstract concept.
- 'HUM' (3): Human being.
- 'LOC' (4): Location.
- 'NUM' (5): Numeric value.
fine_label (ClassLabel): Fine class label. Possible values are:
- ABBREVIATION:
  - 'ABBR:abb' (0): Abbreviation.
  - 'ABBR:exp' (1): Expression abbreviated.
- ENTITY:
  - 'ENTY:animal' (2): Animal.
  - 'ENTY:body' (3): Organ of body.
  - 'ENTY:color' (4): Color.
  - 'ENTY:cremat' (5): Invention, book and other creative piece.
  - 'ENTY:currency' (6): Currency name.
  - 'ENTY:dismed' (7): Disease and medicine.
  - 'ENTY:event' (8): Event.
  - 'ENTY:food' (9): Food.
  - 'ENTY:instru' (10): Musical instrument.
  - 'ENTY:lang' (11): Language.
  - 'ENTY:letter' (12): Letter like a-z.
  - 'ENTY:other' (13): Other entity.
  - 'ENTY:plant' (14): Plant.
  - 'ENTY:product' (15): Product.
  - 'ENTY:religion' (16): Religion.
  - 'ENTY:sport' (17): Sport.
  - 'ENTY:substance' (18): Element and substance.
  - 'ENTY:symbol' (19): Symbols and sign.
  - 'ENTY:techmeth' (20): Techniques and method.
  - 'ENTY:termeq' (21): Equivalent term.
  - 'ENTY:veh' (22): Vehicle.
  - 'ENTY:word' (23): Word with a special property.
- DESCRIPTION:
  - 'DESC:def' (24): Definition of something.
  - 'DESC:desc' (25): Description of something.
  - 'DESC:manner' (26): Manner of an action.
  - 'DESC:reason' (27): Reason.
- HUMAN:
  - 'HUM:gr' (28): Group or organization of persons.
  - 'HUM:ind' (29): Individual.
  - 'HUM:title' (30): Title of a person.
  - 'HUM:desc' (31): Description of a person.
- LOCATION:
  - 'LOC:city' (32): City.
  - 'LOC:country' (33): Country.
  - 'LOC:mount' (34): Mountain.
  - 'LOC:other' (35): Other location.
  - 'LOC:state' (36): State.
- NUMERIC:
  - 'NUM:code' (37): Postcode or other code.
  - 'NUM:count' (38): Number of something.
  - 'NUM:date' (39): Date.
  - 'NUM:dist' (40): Distance, linear measure.
  - 'NUM:money' (41): Price.
  - 'NUM:ord' (42): Order, rank.
  - 'NUM:other' (43): Other number.
  - 'NUM:period' (44): Lasting time of something.
  - 'NUM:perc' (45): Percent, fraction.
  - 'NUM:speed' (46): Speed.
  - 'NUM:temp' (47): Temperature.
  - 'NUM:volsize' (48): Size, area and volume.
  - 'NUM:weight' (49): Weight.
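
The label tables above can be turned into a small lookup for decoding integer labels. This is a sketch only: the lists copy the names and ids from this card, while `COARSE_NAMES`, `FINE_NAMES`, `fine_to_coarse`, and `decode` are hypothetical names chosen for illustration. It assumes the prefix before the colon in a fine label names its coarse class, which holds for every entry listed here.

```python
# Coarse label names, indexed by id as listed in this card.
COARSE_NAMES = ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"]

# Fine label names, indexed by id (0..49) as listed in this card.
FINE_NAMES = [
    "ABBR:abb", "ABBR:exp",
    "ENTY:animal", "ENTY:body", "ENTY:color", "ENTY:cremat", "ENTY:currency",
    "ENTY:dismed", "ENTY:event", "ENTY:food", "ENTY:instru", "ENTY:lang",
    "ENTY:letter", "ENTY:other", "ENTY:plant", "ENTY:product", "ENTY:religion",
    "ENTY:sport", "ENTY:substance", "ENTY:symbol", "ENTY:techmeth",
    "ENTY:termeq", "ENTY:veh", "ENTY:word",
    "DESC:def", "DESC:desc", "DESC:manner", "DESC:reason",
    "HUM:gr", "HUM:ind", "HUM:title", "HUM:desc",
    "LOC:city", "LOC:country", "LOC:mount", "LOC:other", "LOC:state",
    "NUM:code", "NUM:count", "NUM:date", "NUM:dist", "NUM:money", "NUM:ord",
    "NUM:other", "NUM:period", "NUM:perc", "NUM:speed", "NUM:temp",
    "NUM:volsize", "NUM:weight",
]


def fine_to_coarse(fine_id: int) -> int:
    """Derive the coarse id from the prefix of the fine label name."""
    prefix = FINE_NAMES[fine_id].split(":")[0]
    return COARSE_NAMES.index(prefix)


def decode(coarse_id: int, fine_id: int) -> tuple[str, str]:
    """Map a pair of integer labels to their string names."""
    return COARSE_NAMES[coarse_id], FINE_NAMES[fine_id]


# The 'train' example from this card has coarse_label 2 and fine_label 26.
print(decode(2, 26))        # ('DESC', 'DESC:manner')
print(fine_to_coarse(26))   # 2
```

Deriving the coarse id from the fine name also gives a cheap consistency check: for any record, `fine_to_coarse(fine_label)` should equal `coarse_label`.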

Data Splits

name	train	test
default	5452	500
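
For a quick sanity check of the split sizes, the arithmetic can be sketched as follows. The counts come from the table above; the variable names are illustrative only.

```python
# Split sizes as listed in the Data Splits table.
SPLITS = {"train": 5452, "test": 500}

total = sum(SPLITS.values())
shares = {name: count / total for name, count in SPLITS.items()}

print(total)  # 5952
for name, share in shares.items():
    print(f"{name}: {share:.1%}")  # train: 91.6%, then test: 8.4%

# The total falls inside the 1K<n<10K size category given in the metadata.
print(1_000 < total < 10_000)  # True
```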

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@inproceedings{li-roth-2002-learning,
    title = "Learning Question Classifiers",
    author = "Li, Xin  and
      Roth, Dan",
    booktitle = "{COLING} 2002: The 19th International Conference on Computational Linguistics",
    year = "2002",
    url = "https://www.aclweb.org/anthology/C02-1150",
}
@inproceedings{hovy-etal-2001-toward,
    title = "Toward Semantics-Based Answer Pinpointing",
    author = "Hovy, Eduard  and
      Gerber, Laurie  and
      Hermjakob, Ulf  and
      Lin, Chin-Yew  and
      Ravichandran, Deepak",
    booktitle = "Proceedings of the First International Conference on Human Language Technology Research",
    year = "2001",
    url = "https://www.aclweb.org/anthology/H01-1069",
}

Contributions

Thanks to @lhoestq and @thomwolf for adding this dataset.

Author:

Anonymous

Dataset size:

19.76 KB
