Dataset:
voidful/NMSQA
Download the audio data: https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz
Unzip the audio data: tar -xf nmsqa_audio.tar.gz
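The same download-and-extract step can be scripted. Below is a minimal Python sketch using only the standard library; the archive URL is the one given above, while the local file name and output directory are arbitrary choices:

```python
import tarfile
import urllib.request

# Archive URL from the dataset card above.
AUDIO_URL = "https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz"
ARCHIVE = "nmsqa_audio.tar.gz"

# Download the archive (it is large, so this can take a while).
urllib.request.urlretrieve(AUDIO_URL, ARCHIVE)

# Extract into the current directory, mirroring `tar -xf nmsqa_audio.tar.gz`.
with tarfile.open(ARCHIVE) as tar:
    tar.extractall(path=".")
```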
The Natural Multi-speaker Spoken Question Answering (NMSQA) dataset targets textless spoken question answering. It is built on the SQuAD dataset and contains spoken versions of its questions and passages, along with the original text, transcriptions, and audio files.
The primary task supported by this dataset is textless spoken question answering, where the goal is to answer questions based on spoken passages without relying on textual information. The dataset can also be used for automatic speech recognition tasks.
The dataset is in English.
Each instance contains the original SQuAD-derived text fields (passage, question, and answers) together with audio-related fields such as audio file paths, speaker information, and the audio start/end positions of the answers (see the annotation description below); the loading sketch after the split description shows how to inspect the exact schema.
The dataset is split into train, dev, and test sets.
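A quick way to check the splits and the exact field names is to load the dataset with the Hugging Face datasets library. This is a sketch under the assumption that the standard loading path works for this repository; in particular, the dev split may be exposed under a different key such as "validation":

```python
from datasets import load_dataset

# Load the NMSQA dataset from the Hugging Face Hub.
ds = load_dataset("voidful/NMSQA")

# List the available splits and their sizes.
for split_name, split in ds.items():
    print(split_name, len(split))

# Print the schema and one example to see the actual field names.
print(ds["train"].features)
print(ds["train"][0])
```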
The NMSQA dataset is created to address the challenge of textless spoken question answering, where the model must answer questions based on spoken passages without relying on textual information.
The NMSQA dataset is based on the SQuAD dataset, with spoken questions and passages created from the original text data.
Initial Data Collection and Normalization
The initial data collection involved converting the SQuAD dataset's text-based questions and passages into spoken audio files. The text was first normalized, and audio files were then generated using text-to-speech methods.
Who are the source language producers?
The source language producers are the creators of the SQuAD dataset and the researchers who generated the spoken audio files for the NMSQA dataset.
The annotations for the NMSQA dataset are derived from the original SQuAD dataset. Additional annotations, such as audio start and end positions for correct and incorrect answers, as well as audio file paths and speaker information, are added by the dataset creators.
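As an illustration of how the answer-span timestamps can be used, the sketch below cuts the answer region out of a passage waveform. The field names (content_full_audio_path, audio_full_answer_start, audio_full_answer_end) and the assumption that the timestamps are given in seconds are hypothetical, not confirmed by this card; verify them against the actual schema before use:

```python
import soundfile as sf

def extract_answer_audio(example):
    """Slice the answer span out of the full passage audio.

    The field names below are illustrative assumptions; check the
    real schema (e.g. via ds["train"].features) before relying on them.
    """
    # Read the full passage waveform and its sampling rate.
    audio, sr = sf.read(example["content_full_audio_path"])

    # Assuming the span annotations are in seconds, convert to sample indices.
    start = int(example["audio_full_answer_start"] * sr)
    end = int(example["audio_full_answer_end"] * sr)

    return audio[start:end], sr
```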
Who are the annotators?
The annotators are the creators of the SQuAD dataset and the researchers who generated the spoken audio files and the additional annotations for the NMSQA dataset.
The dataset does not contain any personal or sensitive information.
The NMSQA dataset contributes to the development and evaluation of models for textless spoken question answering tasks, which can lead to advancements in natural language processing and automatic speech recognition. Applications of these technologies can improve accessibility and convenience in various domains, such as virtual assistants, customer service, and voice-controlled devices.
The dataset inherits potential biases from the original SQuAD dataset, which may include biases in the selection of passages, questions, and answers. Additionally, biases may be introduced in the text-to-speech process and the choice of speakers used to generate the spoken audio files.
As the dataset is based on the SQuAD dataset, it shares the same limitations, including the fact that it is limited to the English language and mainly focuses on factual questions. Furthermore, the dataset may not cover a wide range of accents, dialects, or speaking styles.
The NMSQA dataset is curated by Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-Wen Yang, Hsuan-Jui Chen, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, and Lin-Shan Lee.
The licensing information for the dataset is not explicitly mentioned.
@article{lin2022dual,
  title={DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning},
  author={Lin, Guan-Ting and Chuang, Yung-Sung and Chung, Ho-Lam and Yang, Shu-wen and Chen, Hsuan-Jui and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-yi and Lee, Lin-shan},
  journal={arXiv preprint arXiv:2203.04911},
  year={2022}
}
Thanks to @voidful for adding this dataset.