
Dataset Card for NMSQA (Natural Multi-speaker Spoken Question Answering)

Download audio data: https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz

Unzip audio data: tar -xf nmsqa_audio.tar.gz
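
The same two steps can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `download` and `extract` are hypothetical and not part of the dataset's tooling. Only the URL comes from this card.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from this dataset card.
NMSQA_AUDIO_URL = (
    "https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz"
)


def download(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, skipping the request if `dest` already exists."""
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract(archive: Path, out_dir: Path) -> list[str]:
    """Extract a tar archive into `out_dir` and return the member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
        # Newer Python versions provide extraction filters; "data" rejects
        # members that would be written outside `out_dir`.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(out_dir, filter="data")
        else:
            tar.extractall(out_dir)
    return names
```

A typical call would be `extract(download(NMSQA_AUDIO_URL, Path("nmsqa_audio.tar.gz")), Path("."))`, which mirrors the `tar -xf` command above.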

Dataset Summary

The Natural Multi-speaker Spoken Question Answering (NMSQA) dataset is designed for the task of textless spoken question answering. It is based on the SQuAD dataset and contains spoken questions and passages. Each instance includes the original text, the normalized text used to generate the speech, and the corresponding audio files. The dataset was created to evaluate the performance of models on textless spoken question answering tasks.

Supported Tasks and Leaderboards

The primary task supported by this dataset is textless spoken question answering, where the goal is to answer questions based on spoken passages without relying on textual information. The dataset can also be used for automatic speech recognition tasks.

Languages

The dataset is in English.

Dataset Structure

Data Instances

Each instance in the dataset contains the following fields:

  • id: Unique identifier for the instance
  • title: The title of the passage
  • context: The passage text
  • question: The question text
  • answers: The answer annotations, with the following sub-fields
    • answer_start: The start character index of the answer in the passage text
    • audio_full_answer_end: The end position of the answer in the full audio, in seconds
    • audio_full_answer_start: The start position of the answer in the full audio, in seconds
    • audio_full_neg_answer_end: The end position, in seconds, of an incorrect span in the full audio that contains the same words as the answer
    • audio_full_neg_answer_start: The start position, in seconds, of an incorrect span in the full audio that contains the same words as the answer
    • audio_segment_answer_end: The end position of the answer in the segment audio, in seconds
    • audio_segment_answer_start: The start position of the answer in the segment audio, in seconds
    • text: The answer text
  • content_segment_audio_path: The path to the audio of the content segment
  • content_full_audio_path: The path to the full audio of the content
  • content_audio_sampling_rate: The sampling rate of the content audio
  • content_audio_speaker: The speaker of the content audio
  • content_segment_text: The text of the content segment
  • content_segment_normalized_text: The normalized segment text used to generate the audio
  • question_audio_path: The path to the audio of the question
  • question_audio_sampling_rate: The sampling rate of the question audio
  • question_audio_speaker: The speaker of the question audio
  • question_normalized_text: The normalized question text used to generate the audio
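
Because the answer positions are given in seconds, slicing a waveform requires converting them to sample indices with the sampling rate. The sketch below shows one way to do that. It assumes the timing fields are stored under `answers` as parallel lists (as the `answers` field is in SQuAD). The function name and the example values are made up for illustration.

```python
def answer_sample_spans(example: dict) -> list[tuple[int, int]]:
    """Convert answer times (seconds) in the full audio to sample index pairs.

    Assumes `example["answers"]` holds parallel lists of start and end times.
    """
    rate = example["content_audio_sampling_rate"]
    answers = example["answers"]
    spans = []
    for start_s, end_s in zip(
        answers["audio_full_answer_start"], answers["audio_full_answer_end"]
    ):
        # Round to the nearest sample so float noise does not shift the span.
        spans.append((int(round(start_s * rate)), int(round(end_s * rate))))
    return spans


# Hypothetical instance; the numbers are illustrative, not taken from the dataset.
example = {
    "content_audio_sampling_rate": 16000,
    "answers": {
        "audio_full_answer_start": [1.5],
        "audio_full_answer_end": [2.25],
    },
}
spans = answer_sample_spans(example)
```

With a waveform loaded as a one-dimensional array `wave`, the answer audio would then be `wave[start:end]` for each `(start, end)` pair.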

Data Fields

The dataset includes the following data fields:

  • id
  • title
  • context
  • question
  • answers
  • content_segment_audio_path
  • content_full_audio_path
  • content_audio_sampling_rate
  • content_audio_speaker
  • content_segment_text
  • content_segment_normalized_text
  • question_audio_path
  • question_audio_sampling_rate
  • question_audio_speaker
  • question_normalized_text
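
When writing a preprocessing pipeline, it can help to check that each record carries the fields listed above before using it. The following sketch does this with a plain set difference. The helper `missing_fields` is hypothetical, and the check covers only the top-level field names from this list.

```python
# Top-level field names as listed in this dataset card.
EXPECTED_FIELDS = {
    "id",
    "title",
    "context",
    "question",
    "answers",
    "content_segment_audio_path",
    "content_full_audio_path",
    "content_audio_sampling_rate",
    "content_audio_speaker",
    "content_segment_text",
    "content_segment_normalized_text",
    "question_audio_path",
    "question_audio_sampling_rate",
    "question_audio_speaker",
    "question_normalized_text",
}


def missing_fields(example: dict) -> set[str]:
    """Return the expected top-level fields that `example` does not contain."""
    return EXPECTED_FIELDS - set(example)
```

An empty result means the record has every listed field; extra keys in the record are ignored.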

Data Splits

The dataset is split into train, dev, and test sets.
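
A single split could be loaded with the Hugging Face `datasets` library, as in the sketch below. It assumes the split names on the Hub are exactly `train`, `dev`, and `test`, and that the repository id is `voidful/NMSQA` (taken from the download URL in this card). The wrapper `load_split` is a hypothetical helper.

```python
# Split names as described in this card (assumed to match the names on the Hub).
SPLITS = ("train", "dev", "test")


def load_split(name: str):
    """Load one NMSQA split, rejecting names this card does not list."""
    if name not in SPLITS:
        raise ValueError(f"unknown split {name!r}; expected one of {SPLITS}")
    # Imported lazily so the name check works without the library installed.
    # Requires `pip install datasets` and network access.
    from datasets import load_dataset

    return load_dataset("voidful/NMSQA", split=name)
```

For example, `load_split("dev")` would return the development set, while `load_split("validation")` raises a `ValueError` before any download starts.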

Dataset Creation

Curation Rationale

The NMSQA dataset was created to address the challenge of textless spoken question answering, where the model must answer questions based on spoken passages without relying on textual information.

Source Data

The NMSQA dataset is based on the SQuAD dataset, with spoken questions and passages created from the original text data.

Initial Data Collection and Normalization

The initial data collection involved converting the original SQuAD dataset's text-based questions and passages into spoken audio files. The text was first normalized, and then audio files were generated using text-to-speech methods.

Who are the source language producers?

The source language producers are the creators of the SQuAD dataset and the researchers who generated the spoken audio files for the NMSQA dataset.

Annotations

Annotation process

The annotations for the NMSQA dataset are derived from the original SQuAD dataset. Additional annotations, such as audio start and end positions for correct and incorrect answers, as well as audio file paths and speaker information, are added by the dataset creators.

Who are the annotators?

The annotators for the NMSQA dataset are the creators of the SQuAD dataset and the researchers who generated the spoken audio files and additional annotations for the NMSQA dataset.

Personal and Sensitive Information

The dataset does not contain any personal or sensitive information.

Considerations for Using the Data

Social Impact of Dataset

The NMSQA dataset contributes to the development and evaluation of models for textless spoken question answering tasks, which can lead to advancements in natural language processing and automatic speech recognition. Applications of these technologies can improve accessibility and convenience in various domains, such as virtual assistants, customer service, and voice-controlled devices.

Discussion of Biases

The dataset inherits potential biases from the original SQuAD dataset, which may include biases in the selection of passages, questions, and answers. Additionally, biases may be introduced in the text-to-speech process and the choice of speakers used to generate the spoken audio files.

Other Known Limitations

As the dataset is based on the SQuAD dataset, it shares the same limitations, including the fact that it is limited to the English language and mainly focuses on factual questions. Furthermore, the dataset may not cover a wide range of accents, dialects, or speaking styles.

Additional Information

Dataset Curators

The NMSQA dataset is curated by Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-Wen Yang, Hsuan-Jui Chen, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, and Lin-Shan Lee.

Licensing Information

No license is explicitly specified for the dataset.

Citation Information

@article{lin2022dual,
    title={DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning},
    author={Lin, Guan-Ting and Chuang, Yung-Sung and Chung, Ho-Lam and Yang, Shu-wen and Chen, Hsuan-Jui and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-yi and Lee, Lin-shan},
    journal={arXiv preprint arXiv:2203.04911},
    year={2022}
}

Contributions

Thanks to @voidful for adding this dataset.