Dataset:
voidful/NMSQA
Download the audio data: https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz
Unzip the audio data: tar -xf nmsqa_audio.tar.gz
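The same download-and-extract step can be scripted. Below is a minimal Python sketch using only the standard library; the archive URL is the one given above, while the local file name and output directory are arbitrary choices:

```python
import tarfile
import urllib.request

# Archive URL from the dataset card above.
AUDIO_URL = "https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz"
ARCHIVE = "nmsqa_audio.tar.gz"

# Download the archive (it is large, so this can take a while).
urllib.request.urlretrieve(AUDIO_URL, ARCHIVE)

# Extract into the current directory, mirroring `tar -xf nmsqa_audio.tar.gz`.
with tarfile.open(ARCHIVE) as tar:
    tar.extractall(path=".")
```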
The Natural Multi-speaker Spoken Question Answering (NMSQA) dataset targets textless spoken question answering. It is built on the SQuAD dataset and contains spoken versions of its questions and passages, along with the original text, transcriptions, and audio files.
The primary task supported by this dataset is textless spoken question answering, where the goal is to answer questions based on spoken passages without relying on textual information. The dataset can also be used for automatic speech recognition tasks.
The dataset is in English.
Each instance contains the original SQuAD-derived text fields (passage, question, and answers) together with audio-related fields such as audio file paths, speaker information, and the audio start/end positions of the answers (see the annotation description below); the loading sketch after the split description shows how to inspect the exact schema.
The dataset is split into train, dev, and test sets.
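A quick way to check the splits and the exact field names is to load the dataset with the Hugging Face datasets library. This is a sketch under the assumption that the standard loading path works for this repository; in particular, the dev split may be exposed under a different key such as "validation":

```python
from datasets import load_dataset

# Load the NMSQA dataset from the Hugging Face Hub.
ds = load_dataset("voidful/NMSQA")

# List the available splits and their sizes.
for split_name, split in ds.items():
    print(split_name, len(split))

# Print the schema and one example to see the actual field names.
print(ds["train"].features)
print(ds["train"][0])
```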
The NMSQA dataset is created to address the challenge of textless spoken question answering, where the model must answer questions based on spoken passages without relying on textual information.
The NMSQA dataset is based on the SQuAD dataset, with spoken questions and passages created from the original text data.
Initial Data Collection and Normalization
The initial data collection involved converting the SQuAD dataset's text-based questions and passages into spoken audio files. The text was first normalized, and audio files were then generated using text-to-speech methods.
Who are the source language producers?
The source language producers are the creators of the SQuAD dataset and the researchers who generated the spoken audio files for the NMSQA dataset.
The annotations for the NMSQA dataset are derived from the original SQuAD dataset. Additional annotations, such as audio start and end positions for correct and incorrect answers, as well as audio file paths and speaker information, are added by the dataset creators.
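As an illustration of how the answer-span timestamps can be used, the sketch below cuts the answer region out of a passage waveform. The field names (content_full_audio_path, audio_full_answer_start, audio_full_answer_end) and the assumption that the timestamps are given in seconds are hypothetical, not confirmed by this card; verify them against the actual schema before use:

```python
import soundfile as sf

def extract_answer_audio(example):
    """Slice the answer span out of the full passage audio.

    The field names below are illustrative assumptions; check the
    real schema (e.g. via ds["train"].features) before relying on them.
    """
    # Read the full passage waveform and its sampling rate.
    audio, sr = sf.read(example["content_full_audio_path"])

    # Assuming the span annotations are in seconds, convert to sample indices.
    start = int(example["audio_full_answer_start"] * sr)
    end = int(example["audio_full_answer_end"] * sr)

    return audio[start:end], sr
```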
Who are the annotators?
The annotators are the creators of the SQuAD dataset and the researchers who generated the spoken audio files and the additional annotations for the NMSQA dataset.
The dataset does not contain any personal or sensitive information.
The NMSQA dataset contributes to the development and evaluation of models for textless spoken question answering tasks, which can lead to advancements in natural language processing and automatic speech recognition. Applications of these technologies can improve accessibility and convenience in various domains, such as virtual assistants, customer service, and voice-controlled devices.
The dataset inherits potential biases from the original SQuAD dataset, which may include biases in the selection of passages, questions, and answers. Additionally, biases may be introduced in the text-to-speech process and the choice of speakers used to generate the spoken audio files.
As the dataset is based on the SQuAD dataset, it shares the same limitations, including the fact that it is limited to the English language and mainly focuses on factual questions. Furthermore, the dataset may not cover a wide range of accents, dialects, or speaking styles.
The NMSQA dataset is curated by Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-Wen Yang, Hsuan-Jui Chen, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, and Lin-Shan Lee.
The licensing information for the dataset is not explicitly mentioned.
@article{lin2022dual,
  title={DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning},
  author={Lin, Guan-Ting and Chuang, Yung-Sung and Chung, Ho-Lam and Yang, Shu-wen and Chen, Hsuan-Jui and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-yi and Lee, Lin-shan},
  journal={arXiv preprint arXiv:2203.04911},
  year={2022}
}
Thanks to @voidful for adding this dataset.