Dataset:

TUKE-DeutscheTelekom/skquad

Dataset Card for SK-QuAD

Dataset Summary

SK-QuAD is the first question answering (QA) dataset for the Slovak language. It is manually annotated, so it is free of the distortions that machine translation introduces. The dataset is thematically diverse and does not overlap with SQuAD, so it contributes new knowledge. It went through a second round of annotation: each question and answer was seen by at least two annotators.

Supported Tasks and Leaderboards

  • Question answering
  • Document retrieval

Languages

  • Slovak

Dataset Structure

squad_v2
  • Size of downloaded dataset files: 44.34 MB
  • Size of the generated dataset: 122.57 MB
  • Total amount of disk used: 166.91 MB
  • An example of 'validation' looks as follows.
This example was too long and was cropped:
{
    "answers": {
        "answer_start": [94, 87, 94, 94],
        "text": ["10th and 11th centuries", "in the 10th and 11th centuries", "10th and 11th centuries", "10th and 11th centuries"]
    },
    "context": "\"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave thei...",
    "id": "56ddde6b9a695914005b9629",
    "question": "When were the Normans in Normandy?",
    "title": "Normans"
}

Data Fields

The data fields are the same among all splits.

squad_v2
  • id : a string feature.
  • title : a string feature.
  • context : a string feature.
  • question : a string feature.
  • answers : a dictionary feature containing:
    • text : a string feature.
    • answer_start : an int32 feature.
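
The field list above can be sketched as Python types, together with a small consistency check. This is a minimal sketch, not part of the dataset's tooling: the `Record` and `Answers` types, the helper names, and the sample record are hypothetical. It assumes that `text` and `answer_start` are parallel lists (as in the cropped example record), that `answer_start` is a character offset into `context`, and that unanswerable questions carry empty lists, following the SQuAD v2 convention.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    text: List[str]          # answer strings, one per annotator answer
    answer_start: List[int]  # character offset of each answer in the context


class Record(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def misaligned_answers(record: Record) -> List[int]:
    """Return indices of answers whose text is not found at answer_start."""
    context = record["context"]
    answers = record["answers"]
    bad = []
    for i, (text, start) in enumerate(zip(answers["text"], answers["answer_start"])):
        if context[start:start + len(text)] != text:
            bad.append(i)
    return bad


def is_unanswerable(record: Record) -> bool:
    """Assumption: an unanswerable question has no answer texts at all."""
    return len(record["answers"]["text"]) == 0


# Hypothetical record, written for illustration only (not taken from SK-QuAD).
_context = "Košice is the second largest city in Slovakia."
example: Record = {
    "id": "example-0",
    "title": "Košice",
    "context": _context,
    "question": "In which country is Košice?",
    "answers": {
        "text": ["Slovakia"],
        "answer_start": [_context.find("Slovakia")],
    },
}

print(misaligned_answers(example))
print(is_unanswerable(example))
```

A check like `misaligned_answers` is useful after any preprocessing that alters whitespace or encoding in `context`, since such changes silently shift the offsets.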

Data Splits

                 Train     Dev  Translated
Documents        8,377     940         442
Paragraphs      22,062   2,568      18,931
Questions       81,582   9,583     120,239
Answers         65,839   7,822      79,978
Unanswerable    15,877   1,784      40,261
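
To compare how the splits are balanced, the share of unanswerable questions in each split can be derived from the counts above. The following is a small arithmetic sketch; the dictionary layout and the function name `unanswerable_share` are illustrative choices, and the numbers are copied from the table as given.

```python
# Question and unanswerable counts copied from the Data Splits table.
SPLITS = {
    "Train": {"questions": 81_582, "unanswerable": 15_877},
    "Dev": {"questions": 9_583, "unanswerable": 1_784},
    "Translated": {"questions": 120_239, "unanswerable": 40_261},
}


def unanswerable_share(counts: dict) -> float:
    """Fraction of questions in a split that are marked unanswerable."""
    return counts["unanswerable"] / counts["questions"]


shares = {name: unanswerable_share(counts) for name, counts in SPLITS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} unanswerable")
```

By these counts, roughly one in five questions in Train and Dev is unanswerable, against about one in three in the Translated split, which may matter when mixing the splits for training.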

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

  • Deutsche Telekom Systems Solutions Slovakia
  • Technical University of Košice

Licensing Information

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Citation Information

[More Information Needed]

Contributions

Thanks to @github-username for adding this dataset.