Dataset:
KBLab/rixvox
RixVox is a speech dataset comprised of speeches from the Riksdag (the Swedish Parliament). It covers speeches from debates during the period 2003-2023. Audio from the speeches has been aligned, at the sentence level, with transcripts from written protocols using aeneas. An observation may consist of one or several concatenated sentences (up to 30 seconds in duration). Detailed speaker metadata is available for each observation, including the speaker's name, gender, political party, birth year and the electoral district they represent. The dataset contains a total of 5493 hours of speech with transcriptions.
Tasks are not supported by default (there are no label fields). The dataset may however be suited for:
To download and extract the files locally you can use load_dataset(). We recommend you set the cache_dir argument to point to a location that has plenty of disk space (1.2 TB+). Here's how to download the train split:

```python
from datasets import load_dataset

# To download/load all splits at once, don't specify a split
rixvox = load_dataset("KBLab/rixvox", split="train", cache_dir="data_rixvox")
```
You can also stream the dataset. This is useful if you want to explore the dataset or if you don't have enough disk space to download the entire dataset. Here's how to stream the train split:
```python
from datasets import load_dataset

rixvox = load_dataset("KBLab/rixvox", cache_dir="data_rixvox", split="train", streaming=True)
print(next(iter(rixvox)))

# Grab 5 observations
rixvox_subset = rixvox.take(5)
for example in rixvox_subset:
    print(example)
```
Create a PyTorch dataloader with your dataset.
Local mode:
```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

# The dataset is not pre-shuffled, so we recommend shuffling it before training.
rixvox = load_dataset("KBLab/rixvox", split="train", cache_dir="data_rixvox")
batch_sampler = BatchSampler(RandomSampler(rixvox), batch_size=32, drop_last=False)
dataloader = DataLoader(rixvox, batch_sampler=batch_sampler)
```
Streaming mode:
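```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# streaming=True returns an iterable dataset instead of downloading everything first
rixvox = load_dataset("KBLab/rixvox", split="train", cache_dir="data_rixvox", streaming=True)
dataloader = DataLoader(rixvox, batch_size=32)
```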
See Hugging Face's guide on streaming datasets for more information on how to shuffle in streaming mode.
There are a total of 835044 observations from 1194 different speakers. Each observation can be up to 30 seconds in duration. An observation belongs to a debate (dokid), is extracted from a speech (anforande_nummer), and is numbered according to its order within the speech (observation_nr). Here is an example of an observation:
```python
{'dokid': 'GR01BOU3',
 'anforande_nummer': 191,
 'observation_nr': 0,
 'audio': {'path': 'GR01BOU3/2442210220028601121_anf191_1_25.wav',
           'array': array([0.01171875, 0.01242065, 0.01071167, ..., 0.00689697, 0.00918579, 0.00650024]),
           'sampling_rate': 16000},
 'text': 'Kristdemokraterna står bakom alla reservationer med kristdemokratiska förtecken, men jag nöjer mig med att yrka bifall till reservation 1. Jag ska i det här inlägget beröra några av de åtta punkter som är föremål för reservationer från kristdemokratiskt håll, i vissa fall tillsammans med andra partier.',
 'debatedate': datetime.datetime(2003, 12, 4, 0, 0),
 'speaker': 'Göran Hägglund',
 'party': 'KD',
 'gender': 'male',
 'birth_year': 1959,
 'electoral_district': 'Hallands län',
 'intressent_id': '0584659199514',
 'speaker_from_id': True,
 'speaker_audio_meta': 'Göran Hägglund (Kd)',
 'start': 1.4,
 'end': 24.96,
 'duration': 23.560000000000002,
 'bleu_score': 0.7212783273624307,
 'filename': 'GR01BOU3/2442210220028601121_anf191_1_25.wav',
 'path': 'GR01BOU3/2442210220028601121_anf191_1_25.wav',
 'speaker_total_hours': 30.621333333333332}
```
See more examples in the dataset viewer.
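The three identifiers described above can be used to put observations back in speech order. The following is a minimal sketch, assuming that (dokid, anforande_nummer) uniquely identifies a speech; the helper name `group_into_speeches` and the toy records are made up for illustration, and only the field names come from the dataset.

```python
from collections import defaultdict


def group_into_speeches(observations):
    """Group observations by (dokid, anforande_nummer) and join their text
    in observation_nr order. Assumes this key pair identifies one speech."""
    speeches = defaultdict(list)
    for obs in observations:
        speeches[(obs["dokid"], obs["anforande_nummer"])].append(obs)
    return {
        key: " ".join(
            o["text"] for o in sorted(group, key=lambda o: o["observation_nr"])
        )
        for key, group in speeches.items()
    }


# Toy records, not real RixVox data.
toy_observations = [
    {"dokid": "D1", "anforande_nummer": 7, "observation_nr": 1, "text": "second part."},
    {"dokid": "D1", "anforande_nummer": 7, "observation_nr": 0, "text": "First part,"},
    {"dokid": "D1", "anforande_nummer": 8, "observation_nr": 0, "text": "Another speech."},
]
speech_texts = group_into_speeches(toy_observations)
print(speech_texts[("D1", 7)])
```

Note that the joined text covers only the observations that are present in the dataset, so it is not necessarily the complete speech.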
Dataset splits were randomly sampled on the speaker level. That is, a speaker is only present in a single split. We sample speakers for each split until the following conditions are met:
Dataset Split | Observations | Total duration of speech (hours) | Average observation duration (seconds) | Number of speakers |
---|---|---|---|---|
Train | 818227 | 5383 | 23.69 | 1165 |
Validation | 7933 | 52 | 23.50 | 18 |
Test | 8884 | 59 | 23.74 | 11 |
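Splitting on the speaker level can be sketched as follows. This is only an illustration, not the procedure used to create the splits above: the stopping conditions (`valid_hours`, `test_hours`), the function name `split_by_speaker`, and the toy speaker durations are assumptions.

```python
import random


def split_by_speaker(speaker_hours, valid_hours, test_hours, seed=0):
    """Assign whole speakers to test and validation until hypothetical
    hour targets are reached; all remaining speakers go to train."""
    speakers = sorted(speaker_hours)  # fixed base order for reproducibility
    random.Random(seed).shuffle(speakers)

    splits = {"train": [], "validation": [], "test": []}
    totals = {"validation": 0.0, "test": 0.0}
    targets = {"validation": valid_hours, "test": test_hours}

    for speaker in speakers:
        for name in ("test", "validation"):
            if totals[name] < targets[name]:
                splits[name].append(speaker)
                totals[name] += speaker_hours[speaker]
                break
        else:
            # Both held-out splits have reached their targets.
            splits["train"].append(speaker)
    return splits


# Toy data: 20 made-up speakers with 1 to 5 hours of speech each.
speaker_hours = {f"speaker_{i}": float(i % 5 + 1) for i in range(20)}
splits = split_by_speaker(speaker_hours, valid_hours=5.0, test_hours=5.0)
print({name: len(members) for name, members in splits.items()})
```

Because speakers are assigned as whole units, no speaker can appear in more than one split.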
For more information about the creation of this dataset, see the article "Finding Speeches in the Riksdag's Debates" from our blog.
Before RixVox, only a couple of hundred hours of transcribed speech were available to train ASR models for Swedish. ASR models such as Whisper have shown that model performance can benefit significantly from adding more supervised data during pretraining or finetuning. Media from debates in the Riksdag are published openly on the web together with transcripts and other metadata. The open data initiatives of the Riksdag presented an opportunity to create a high-quality open speech corpus for Swedish.
The Swedish Parliament.
For information on how the speeches were segmented and identified in debate audio files, see the article "Finding Speeches in the Riksdag's Debates".
For information on how the speech segmentations were used to create the final RixVox dataset, see the article "RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates".
The code to replicate the creation of the dataset is open and available at the GitHub repository KBLab/riksdagen_anforanden. Processing everything can take 1-3 weeks on a workstation with a consumer-grade GPU.
Who are the source language producers?
The written protocols of speeches are manually produced by the Riksdag. Transcription is not always verbatim, but rather captures the intent of the speaker.
Segmenting speeches to determine when they start and end in a debate was done automatically. Sentence-level alignment of the written protocols to the audio files was also done automatically using aeneas. See the articles in citation information for more details.
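To illustrate how sentence-level alignments can become observations of up to 30 seconds, here is a minimal sketch of greedy concatenation. It is an assumption about the general idea, not the pipeline's actual code: the function name `merge_sentences`, the record layout, and the toy timestamps are hypothetical.

```python
def merge_sentences(sentences, max_duration=30.0):
    """Greedily concatenate consecutive aligned sentences into chunks that
    span at most max_duration seconds. A single sentence longer than
    max_duration is kept as its own chunk (no splitting is attempted)."""
    chunks = []
    current = None
    for sent in sentences:
        if current is not None and sent["end"] - current["start"] <= max_duration:
            # Extending the current chunk keeps it within the limit.
            current["end"] = sent["end"]
            current["text"] += " " + sent["text"]
        else:
            if current is not None:
                chunks.append(current)
            current = dict(sent)  # start a new chunk from a copy
    if current is not None:
        chunks.append(current)
    return chunks


# Toy sentence alignments in seconds, not real aeneas output.
toy_sentences = [
    {"start": 0.0, "end": 10.0, "text": "A."},
    {"start": 10.0, "end": 22.0, "text": "B."},
    {"start": 22.0, "end": 35.0, "text": "C."},
    {"start": 35.0, "end": 40.0, "text": "D."},
]
chunks = merge_sentences(toy_sentences)
for chunk in chunks:
    print(chunk["start"], chunk["end"], chunk["text"])
```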
The process of aligning speech to written protocols was automatic. It followed these general steps:
No manual annotations.
The speakers are members of parliament or ministers speaking publicly in the Riksdag. The Riksdag is a public institution and the speeches are publicly available on the web as open data.
We expect the dataset to be used primarily for training ASR models for Swedish. The performance of Swedish speech-to-text in multilingual ASR models may also benefit from the availability of a large Swedish speech corpus. In turn, improved ASR models can help increase the accessibility of audio and video media content for people with hearing impairments.
The dataset can also be used to train models for other audio tasks such as speaker diarization, speaker verification, and speaker recognition.
Since metadata regarding the age, gender, and electoral district of the speaker is included, the dataset can possibly also be used to study bias in ASR models.
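One possible way to look for such bias is to compare word error rate (WER) across metadata groups. The sketch below is a simplified illustration: the `prediction` field (a model's output), the helper names, and the toy records are assumptions and not part of the dataset. Since the protocols are not always verbatim, WER against them is a noisy measure.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(records, group_key):
    """Aggregate WER per value of a metadata field such as 'gender'."""
    errors, words = defaultdict(int), defaultdict(int)
    for rec in records:
        n_errors, n_words = word_errors(rec["text"], rec["prediction"])
        errors[rec[group_key]] += n_errors
        words[rec[group_key]] += n_words
    return {group: errors[group] / words[group] for group in words if words[group]}


# Toy records with made-up model predictions.
toy_records = [
    {"gender": "female", "text": "jag yrkar bifall till reservationen",
     "prediction": "jag yrkar bifall till reservationen"},
    {"gender": "male", "text": "jag yrkar bifall till reservationen",
     "prediction": "jag yrkar till reservation"},
]
group_wer = wer_by_group(toy_records, "gender")
print(group_wer)
```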
The dataset includes parliamentary speeches, which are often more formal than everyday speech.
During the creation of the dataset, we found that speech segmentations based on speaker diarization were more likely to fail when a preceding speaker, the speaker of the house, and the speaker of the following speech were all of the same gender. However, all in all, only a small number of speeches were filtered out of the final RixVox dataset. After quality filtering of the dataset, 5500 out of 5858 hours remained. We do not believe any significant systematic bias was introduced by this filtering.
Only minimal deduplication was performed to weed out commonly repeated phrases. For example, phrases such as "Fru talman!" and "Herr Talman!" are used frequently as a matter of formality. These phrases tend to be present at the beginning of most transcripts regardless of whether they were uttered by the speaker or not. For this reason we have removed the first aligned sentence of each speech when creating RixVox. The aforementioned phrases are also repeated frequently within speeches, though. As such, it might be beneficial to perform more aggressive deduplication of the dataset before training models.
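As a starting point for more aggressive deduplication, one could cap how often an identical transcript is allowed to occur. This is a minimal sketch under assumptions: the function name `cap_repeated_texts`, the `max_repeats` value, and the toy records are hypothetical, and exact matching only catches observations whose whole text is a repeated phrase.

```python
from collections import Counter


def cap_repeated_texts(observations, max_repeats=2):
    """Keep at most max_repeats observations per normalized transcript."""
    seen = Counter()
    kept = []
    for obs in observations:
        # Normalize case and whitespace so trivial variants count as equal.
        key = " ".join(obs["text"].lower().split())
        seen[key] += 1
        if seen[key] <= max_repeats:
            kept.append(obs)
    return kept


# Toy records, not real RixVox data.
toy_texts = [
    {"text": "Fru talman!"},
    {"text": "fru  talman!"},
    {"text": "Herr talman!"},
    {"text": "Jag yrkar bifall till reservation 1."},
    {"text": "Fru talman!"},
]
kept = cap_repeated_texts(toy_texts, max_repeats=1)
print([obs["text"] for obs in kept])
```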
KBLab at the National Library of Sweden.
There is a possibility that RixVox will be updated periodically, at irregular intervals, to include both older and newer speeches. Older recordings of parliamentary debates from 1966 to 2002 do exist, but they are not yet part of the Riksdag's open data. KBLab are exploring the possibility of adding metadata to these recordings by applying the existing speech segmentation and alignment pipeline to them.
Each year also brings new parliamentary debates, with recent years adding 400-500 hours of speech per year.
Cite the Swedish Parliament.
To reference RixVox, feel free to cite KBLab blog posts in the citation information below.
```bibtex
@misc{rekathati2023rixvox:,
  author = {Rekathati, Faton},
  title = {The KBLab Blog: RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates},
  url = {https://kb-labb.github.io/posts/2023-03-09-rixvox-a-swedish-speech-corpus/},
  year = {2023}
}

@misc{rekathati2023finding,
  author = {Rekathati, Faton},
  title = {The KBLab Blog: Finding Speeches in the Riksdag's Debates},
  url = {https://kb-labb.github.io/posts/2023-02-15-finding-speeches-in-the-riksdags-debates/},
  year = {2023}
}
```
The Swedish Parliament.
Thanks to @lhoestq for reviewing the dataset script.