Dataset: language-and-voice-lab/samromur_asr
Tasks: Automatic Speech Recognition
Languages: is
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
License: cc-by-4.0

This is the first release of the Samrómur Icelandic Speech corpus, which contains 100,000 validated utterances.
The corpus is the result of a crowd-sourcing effort run by the Language and Voice Lab at Reykjavik University, in cooperation with Almannarómur, Center for Language Technology.
The Samrómur Corpus is divided into three splits: train, validation, and test. To load all of them at once:
from datasets import load_dataset
samromur_asr = load_dataset("language-and-voice-lab/samromur_asr")
To load a specific split (for example, the validation split), pass its name through the split argument:
from datasets import load_dataset
samromur_asr = load_dataset("language-and-voice-lab/samromur_asr", split="validation")
automatic-speech-recognition: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER).
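As a rough illustration of the metric mentioned above, WER can be computed as the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, not part of the corpus tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference 'það skipti heldur engu' with a hypothesis that drops the word 'heldur' gives one deletion over four reference words.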
The audio is in Icelandic. The reading prompts were gathered from a variety of sources, mainly from the Icelandic Gigaword Corpus. The corpus includes text from novels, news, plays, and from a list of location names in Iceland. The prompts also came from the Icelandic Web of Science.
{
  'audio_id': '009123-0150695',
  'audio': {
    'path': '/home/david/.cache/HuggingFace/datasets/downloads/extracted/cb428a7f1e46b058d76641ef32f36b49d28b73aea38509983170495408035a10/dev/009123/009123-0150695.flac',
    'array': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32),
    'sampling_rate': 16000
  },
  'speaker_id': '009123',
  'gender': 'female',
  'age': '18-19',
  'duration': 3.299999952316284,
  'normalized_text': 'það skipti heldur engu'
}
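The fields of an instance can be read like those of an ordinary dictionary. The sketch below uses a hand-built stand-in so that it runs without downloading the corpus: the shortened `path`, the list of zeros used as `array`, and the `duration` of 0.5 are invented for the illustration, and `describe_sample` is a hypothetical helper name. Only the field names are taken from the instance shown above.

```python
def describe_sample(sample: dict) -> dict:
    """Summarize one instance using the field names shown above."""
    audio = sample["audio"]
    # Duration implied by the waveform: number of samples / samples per second.
    computed = len(audio["array"]) / audio["sampling_rate"]
    return {
        "audio_id": sample["audio_id"],
        "speaker_id": sample["speaker_id"],
        "num_words": len(sample["normalized_text"].split()),
        "computed_duration": computed,
    }


# Stand-in instance: 8000 zero samples at 16 kHz, i.e. half a second of silence.
example = {
    "audio_id": "009123-0150695",
    "audio": {
        "path": "009123-0150695.flac",  # shortened, illustrative path
        "array": [0.0] * 8000,
        "sampling_rate": 16000,
    },
    "speaker_id": "009123",
    "gender": "female",
    "age": "18-19",
    "duration": 0.5,
    "normalized_text": "það skipti heldur engu",
}
summary = describe_sample(example)
```

With a real instance, `computed_duration` should be close to the stored `duration` field, which gives a quick sanity check on decoded audio.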
The corpus is split into train, validation, and test subsets with no speaker overlap. Each subset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac. The lengths of each portion are: train=114h34m, test=15h51m, validation=15h16m.
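The naming convention makes it possible to recover both IDs from a file path. The following is a minimal sketch, assuming that neither ID contains a hyphen (as in the example file 009123-0150695.flac); `parse_audio_filename` is a hypothetical helper, not part of the corpus tooling.

```python
from pathlib import PurePosixPath


def parse_audio_filename(path: str) -> tuple[str, str]:
    """Split '{speaker_ID}-{utterance_ID}.flac' into (speaker_id, utterance_id)."""
    file = PurePosixPath(path)
    if file.suffix != ".flac":
        raise ValueError(f"expected a .flac file, got {path!r}")
    # Assumption: the first hyphen separates the speaker ID from the utterance ID.
    speaker_id, sep, utterance_id = file.stem.partition("-")
    if not sep or not speaker_id or not utterance_id:
        raise ValueError(f"unexpected file name: {file.name!r}")
    return speaker_id, utterance_id
```

Because each speaker has a dedicated folder, the returned speaker ID can also be compared with the name of the parent directory as a consistency check.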
To load a specific portion, please see the section "Example Usage" above.
Recording started in October 2019 and was still ongoing at the time of this release (May 2021).
This release was authorized in May 2021.
The aim is to create an open-source speech corpus to enable research and development for Icelandic Language Technology.
The corpus contains audio recordings and a metadata file that contains the prompts the participants read.
A Kaldi-based script using this data can be found on the Language and Voice Lab GitHub page: https://github.com/cadia-lvl/samromur-asr
The utterances were recorded on a smartphone or through the web app.
The data was collected using the website https://samromur.is, the code for which is available at https://github.com/cadia-lvl/samromur.
Each recording contains one read sentence from a script.
The script contains 85,080 unique sentences and 90,838 unique tokens.
Prompts were pulled from these corpora if they contained only letters present in the Icelandic alphabet and if they were listed in the DIM (Database of Icelandic Morphology).
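The two selection criteria can be sketched as a simple filter. This is an illustration under stated assumptions, not the project's actual selection script: it interprets the DIM criterion as a per-word lookup, it expects punctuation to have been removed beforehand, and the `lexicon` argument is a toy stand-in for the real morphological database. `is_valid_prompt` is a hypothetical name.

```python
# The 32 letters of the modern Icelandic alphabet (c, q, w and z are not included).
ICELANDIC_LETTERS = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")


def is_valid_prompt(sentence: str, lexicon: set[str]) -> bool:
    """Return True if every word uses only Icelandic letters and is in the lexicon."""
    words = sentence.lower().split()
    if not words:
        return False
    for word in words:
        # Criterion 1: only letters of the Icelandic alphabet.
        if not set(word) <= ICELANDIC_LETTERS:
            return False
        # Criterion 2 (assumed per-word): the word is a known form.
        if word not in lexicon:
            return False
    return True


# Toy lexicon for illustration only; the real check would use DIM.
toy_lexicon = {"það", "skipti", "heldur", "engu"}
```

With this toy lexicon, 'það skipti heldur engu' passes both checks, while a word such as 'pizza' is rejected by the alphabet check because of the letter z.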
There are also synthesised prompts consisting of a name followed by a question or a demand, in order to simulate a dialogue with a smart device.
Who are the annotators?
The content of the audio files was manually verified against the prompts by one or more listeners (mainly summer students).
The dataset consists of people who have donated their voice. You agree to not attempt to determine the identity of speakers in this dataset.
This contribution describes an ongoing project of speech data collection, using the web application Samrómur which is built upon Common Voice, Mozilla Foundation's web platform for open-source voice collection. The goal of the project is to build a large-scale speech corpus for Automatic Speech Recognition (ASR) for Icelandic. Upon completion, Samrómur will be the largest open speech corpus for Icelandic collected from the public domain.
The participants are aged between 18 and 90. Of the recordings, 59,782 are from female speakers and 40,218 are from male speakers. All were recorded on a smartphone or through the web app.
Participants self-reported their age group, gender, and native language.
The corpus contains 100,000 utterances from 8,392 speakers, totalling 145 hours.
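The figures reported in this card can be cross-checked with a little arithmetic. The sketch below sums the split lengths given earlier (reading the train portion as 114 hours 34 minutes) and derives two averages; the derived values are computed here for illustration and are not statistics published with the corpus.

```python
def to_seconds(hours: int, minutes: int) -> int:
    """Convert an hours/minutes pair to seconds."""
    return hours * 3600 + minutes * 60


# Split lengths as stated in this card.
split_seconds = {
    "train": to_seconds(114, 34),
    "test": to_seconds(15, 51),
    "validation": to_seconds(15, 16),
}
total_seconds = sum(split_seconds.values())
total_hours, remainder = divmod(total_seconds, 3600)
total_minutes = remainder // 60

# Recordings by gender and number of speakers, as stated in this card.
utterances = 59_782 + 40_218
speakers = 8392

mean_utterance_seconds = total_seconds / utterances
mean_utterances_per_speaker = utterances / speakers

print(f"total: {total_hours}h{total_minutes:02d}m over {utterances} utterances")
print(f"mean utterance length: {mean_utterance_seconds:.2f} s")
print(f"mean utterances per speaker: {mean_utterances_per_speaker:.1f}")
```

The split lengths add up to 145h41m, consistent with the stated total of 145 hours, and the gender counts add up to the stated 100,000 utterances.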
"Samromur 21.05" by the Language and Voice Laboratory (LVL) at the Reykjavik University is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License with the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
@inproceedings{mollberg-etal-2020-samromur,
    title = "{S}amr{\'o}mur: Crowd-sourcing Data Collection for {I}celandic Speech Recognition",
    author = "Mollberg, David Erik and
      J{\'o}nsson, {\'O}lafur Helgi and
      {\TH}orsteinsd{\'o}ttir, Sunneva and
      Steingr{\'\i}msson, Stein{\th}{\'o}r and
      Magn{\'u}sd{\'o}ttir, Eyd{\'\i}s Huld and
      Gudnason, Jon",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.425",
    pages = "3463--3467",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
The verification of the dataset was funded by the Icelandic Directorate of Labour's Student Summer Job Program.
Special thanks to the summer students for all their hard work.