数据集:
language-and-voice-lab/samromur_children
任务:
自动语音识别语言:
is计算机处理:
monolingual大小:
100K<n<1M语言创建人:
crowdsourced批注创建人:
crowdsourced源数据集:
original其他:
samromurThe Samrómur Children Corpus consists of audio recordings and metadata files containing prompts read by the participants. It contains more than 137000 validated speech-recordings uttered by Icelandic children.
The corpus is a result of the crowd-sourcing effort run by the Language and Voice Lab (LVL) at the Reykjavik University, in cooperation with Almannarómur, Center for Language Technology. The recording process has started in October 2019 and continues to this day (Spetember 2021).
The Samrómur Children Corpus is divided in 3 splits: train, validation and test. To load a specific split pass its name as a config name:
from datasets import load_dataset samromur_children = load_dataset("language-and-voice-lab/samromur_children")
To load an specific split (for example, the validation split) do:
from datasets import load_dataset samromur_children = load_dataset("language-and-voice-lab/samromur_children",split="validation")
automatic-speech-recognition: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER).
The audio is in Icelandic. The reading prompts were gathered from a variety of sources, mainly from the Icelandic Gigaword Corpus . The corpus includes text from novels, news, plays, and from a list of location names in Iceland. The prompts also came from the Icelandic Web of Science .
{ 'audio_id': '015652-0717240', 'audio': { 'path': '/home/carlos/.cache/HuggingFace/datasets/downloads/extracted/2c6b0d82de2ef0dc0879732f726809cccbe6060664966099f43276e8c94b03f2/test/015652/015652-0717240.flac', 'array': array([ 0. , 0. , 0. , ..., -0.00311279, -0.0007019 , 0.00128174], dtype=float32), 'sampling_rate': 16000 }, 'speaker_id': '015652', 'gender': 'female', 'age': '11', 'duration': 4.179999828338623, 'normalized_text': 'eiginlega var hann hin unga rússneska bylting lifandi komin' }
The corpus is split into train, dev, and test portions. Lenghts of every portion are: train = 127h25m, test = 1h50m, dev=1h50m.
To load an specific portion please see the above section "Example Usage".
In the field of Automatic Speech Recognition (ASR) is a known fact that the children's speech is particularly hard to recognise due to its high variability produced by developmental changes in children's anatomy and speech production skills.
For this reason, the criteria of selection for the train/dev/test portions have to take into account the children's age. Nevertheless, the Samrómur Children is an unbalanced corpus in terms of gender and age of the speakers. This means that the corpus has, for example, a total of 1667 female speakers (73h38m) versus 1412 of male speakers (52h26m).
These unbalances impose conditions in the type of the experiments than can be performed with the corpus. For example, a equal number of female and male speakers through certain ranges of age is impossible. So, if one can't have a perfectly balance corpus in the training set, at least one can have it in the test portion.
The test portion of the Samrómur Children was meticulously selected to cover ages between 6 to 16 years in both female and male speakers. Every of these range of age in both genders have a total duration of 5 minutes each.
The development portion of the corpus contains only speakers with an unknown gender information. Both test and dev sets have a total duration of 1h50m each.
In order to perform fairer experiments, speakers in the train and test sets are not shared. Nevertheless, there is only one speaker shared between the train and development set. It can be identified with the speaker ID=010363. However, no audio files are shared between these two sets.
The data was collected using the website https://samromur.is , code of which is available at https://github.com/cadia-lvl/samromur . The age range selected for this corpus is between 4 and 17 years.
The original audio was collected at 44.1 kHz or 48 kHz sampling rate as *.wav files, which was down-sampled to 16 kHz and converted to *.flac. Each recording contains one read sentence from a script. The script contains 85.080 unique sentences and 90.838 unique tokens.
There was no identifier other than the session ID, which is used as the speaker ID. The corpus is distributed with a metadata file with a detailed information on each utterance and speaker. The madata file is encoded as UTF-8 Unicode.
The prompts were gathered from a variety of sources, mainly from The Icelandic Gigaword Corpus, which is available at http://clarin.is/en/resources/gigaword . The corpus includes text from novels, news, plays, and from a list of location names in Iceland. The prompts also came from the Icelandic Web of Science .
Prompts were pulled from these corpora if they met the criteria of having only letters which are present in the Icelandic alphabet, and if they are listed in the DIM: Database Icelandic Morphology .
There are also synthesised prompts consisting of a name followed by a question or a demand, in order to simulate a dialogue with a smart-device.
Who are the annotators?The audio files content was manually verified against the prompts by one or more listener (summer students mainly).
The dataset consists of people who have donated their voice. You agree to not attempt to determine the identity of speakers in this dataset.
This is the first ASR corpus of Icelandic children.
The utterances were recorded by a smartphone or the web app.
Participants self-reported their age group, gender, and the native language.
Participants are aged between 4 to 17 years.
The corpus contains 137597 utterances from 3175 speakers, totalling 131 hours.
The amount of data due to female speakers is 73h38m, the amount of data due to male speakers is 52h26m and the amount of data due to speakers with an unknown gender information is 05h02m
The number of female speakers is 1667, the number of male speakers is 1412. The number of speakers with an unknown gender information is 96.
The audios due to female speakers are 78993, the audios due to male speakers are 53927 and the audios due to speakers with an unknown gender information are 4677.
"Samrómur Children: Icelandic Speech 21.09" by the Language and Voice Laboratory (LVL) at the Reykjavik University is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License with the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The corpus is a result of the crowd-sourcing effort run by the Language and Voice Lab (LVL) at the Reykjavik University, in cooperation with Almannarómur, Center for Language Technology. The recording process has started in October 2019 and continues to this day (Spetember 2021). The corpus was curated by Carlos Daniel Hernández Mena in 2021.
@misc{menasamromurchildren2021, title={Samrómur Children Icelandic Speech 1.0}, ldc_catalog_no={LDC2022S11}, DOI={https://doi.org/10.35111/frrj-qd60}, author={Hernández Mena, Carlos Daniel and Borsky, Michal and Mollberg, David Erik and Guðmundsson, Smári Freyr and Hedström, Staffan and Pálsson, Ragnar and Jónsson, Ólafur Helgi and Þorsteinsdóttir, Sunneva and Guðmundsdóttir, Jóhanna Vigdís and Magnúsdóttir, Eydís Huld and Þórhallsdóttir, Ragnheiður and Guðnason, Jón}, publisher={Reykjavík University} journal={Linguistic Data Consortium, Philadelphia}, year={2019}, url={https://catalog.ldc.upenn.edu/LDC2022S11}, }
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
The verification for the dataset was funded by the the Icelandic Directorate of Labour's Student Summer Job Program in 2020 and 2021.
Special thanks for the summer students for all the hard work.