Dataset:

cdminix/libritts-r-aligned

Other:

automatic-speech-recognition audio speech

License:

cc-by-4.0

Preprints:

arxiv:2211.16049 arxiv:1904.02882

Language:

Annotation creators:

crowdsourced

Tasks:

Text-to-Speech

Automatic Speech Recognition

This dataset is identical to cdminix/libritts-aligned, except that it uses the newly released LibriTTS-R corpus. Please cite Y. Koizumi, et al., "LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus", Interspeech 2023.

When using this dataset to download LibriTTS-R, make sure you agree to the terms on https://www.openslr.org

Dataset Card for LibriTTS-R with Forced Alignments (and Measures)

This dataset downloads LibriTTS-R and preprocesses it on your machine to create alignments using the Montreal Forced Aligner. You need to run pip install alignments phones before using this dataset. The first run can take an hour or two, but subsequent runs are much faster.

Requirements

pip install alignments phones (required)
pip install speech-collator (optional)

Note: version >=0.0.15 of alignments is required for this corpus
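
A minimal sketch of how the version requirement above could be checked before loading the dataset. It uses only the standard library. The helper names parse_version and alignments_is_recent_enough are hypothetical, and the parser assumes a plain numeric X.Y.Z version string.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_ALIGNMENTS = (0, 0, 15)  # minimum version stated in the note above


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split(".")[:3])


def alignments_is_recent_enough():
    """Return True if the installed alignments package is >= 0.0.15."""
    try:
        installed = version("alignments")
    except PackageNotFoundError:
        return False  # the package is not installed at all
    return parse_version(installed) >= MIN_ALIGNMENTS


if __name__ == "__main__":
    if not alignments_is_recent_enough():
        print("Please run: pip install -U alignments phones")
```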

Example Item

{
    'id': '100_122655_000073_000002.wav',
    'speaker': '100',
    'text': 'the day after, diana and mary quitted it for distant b.',
    'start': 0.0,
    'end': 3.6500000953674316, 
    'phones': ['[SILENCE]', 'ð', 'ʌ', '[SILENCE]', 'd', 'eɪ', '[SILENCE]', 'æ', 'f', 't', 'ɜ˞', '[COMMA]', 'd', 'aɪ', 'æ', 'n', 'ʌ', '[SILENCE]', 'æ', 'n', 'd', '[SILENCE]', 'm', 'ɛ', 'ɹ', 'i', '[SILENCE]', 'k', 'w', 'ɪ', 't', 'ɪ', 'd', '[SILENCE]', 'ɪ', 't', '[SILENCE]', 'f', 'ɜ˞', '[SILENCE]', 'd', 'ɪ', 's', 't', 'ʌ', 'n', 't', '[SILENCE]', 'b', 'i', '[FULL STOP]'], 
    'phone_durations': [5, 2, 4, 0, 5, 13, 0, 16, 7, 5, 20, 2, 6, 9, 15, 4, 2, 0, 11, 3, 5, 0, 3, 8, 9, 8, 0, 13, 3, 5, 3, 6, 4, 0, 8, 5, 0, 9, 5, 0, 7, 5, 6, 7, 4, 5, 10, 0, 3, 35, 9],
    'audio': '/dev/shm/metts/train-clean-360-alignments/100/100_122655_000073_000002.wav'
}

The phones are IPA phones, and the phone durations are in frames (assuming a hop length of 256, a sample rate of 22050 and a window length of 1024). These attributes can be changed using the hop_length, sample_rate and window_length arguments to LibriTTSAlign.
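
To make the frame unit concrete, the sketch below converts the phone_durations of the example item into seconds with the default hop length and sample rate. The helper names are hypothetical, and the conversion (frames × hop length ÷ sample rate) is the usual convention, assumed here rather than taken from the dataset code.

```python
HOP_LENGTH = 256     # default hop length stated above
SAMPLE_RATE = 22050  # default sample rate stated above

# phone_durations copied from the example item
phone_durations = [
    5, 2, 4, 0, 5, 13, 0, 16, 7, 5, 20, 2, 6, 9, 15, 4, 2, 0, 11, 3,
    5, 0, 3, 8, 9, 8, 0, 13, 3, 5, 3, 6, 4, 0, 8, 5, 0, 9, 5, 0,
    7, 5, 6, 7, 4, 5, 10, 0, 3, 35, 9,
]


def frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Convert a frame count into seconds."""
    return n_frames * hop_length / sample_rate


def phone_boundaries(durations):
    """Return a (start, end) pair in seconds for every phone."""
    boundaries = []
    frame = 0
    for duration in durations:
        boundaries.append(
            (frames_to_seconds(frame), frames_to_seconds(frame + duration))
        )
        frame += duration
    return boundaries


total_seconds = frames_to_seconds(sum(phone_durations))
boundaries = phone_boundaries(phone_durations)
print(f"total duration: {total_seconds:.2f} s")
```

The 314 frames of the example add up to roughly 3.65 seconds, which agrees with the item's end value.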

Data Collator

This dataset comes with a data collator which can be used to create batches of data for training. It can be installed using pip install speech-collator (MiniXC/speech-collator) and can be used as follows:

import json
from datasets import load_dataset
from speech_collator import SpeechCollator
from torch.utils.data import DataLoader

dataset = load_dataset("cdminix/libritts-r-aligned", split="train")

# Load the speaker and phone index mappings
with open("speaker2idx.json") as f:
    speaker2idx = json.load(f)
with open("phone2idx.json") as f:
    phone2idx = json.load(f)

collator = SpeechCollator(
    speaker2idx=speaker2idx,
    phone2idx=phone2idx,
)
dataloader = DataLoader(dataset, collate_fn=collator.collate_fn, batch_size=8)

You can either download the speaker2idx.json and phone2idx.json files from here or create them yourself using the following code:

import json
from datasets import load_dataset
from speech_collator import SpeechCollator, create_speaker2idx, create_phone2idx

dataset = load_dataset("cdminix/libritts-r-aligned", split="train")

# Create speaker2idx and phone2idx
speaker2idx = create_speaker2idx(dataset, unk_idx=0)
phone2idx = create_phone2idx(dataset, unk_idx=0)

# save to json
with open("speaker2idx.json", "w") as f:
    json.dump(speaker2idx, f)
with open("phone2idx.json", "w") as f:
    json.dump(phone2idx, f)
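
For intuition only, here is a plain-Python sketch of what an index mapping with a reserved unknown index might look like. This is not the speech-collator implementation. The functions build_index and lookup are hypothetical, and the real create_speaker2idx and create_phone2idx may assign indices differently.

```python
def build_index(values, unk_idx=0):
    """Map each distinct value to an integer, skipping the reserved unk_idx."""
    index = {}
    next_idx = 0
    for value in sorted(set(values)):
        if next_idx == unk_idx:
            next_idx += 1  # keep unk_idx free for unseen values
        index[value] = next_idx
        next_idx += 1
    return index


def lookup(index, value, unk_idx=0):
    """Return the index of a value, falling back to unk_idx if it is unseen."""
    return index.get(value, unk_idx)


speakers = ["100", "101", "100", "205"]
speaker_index = build_index(speakers)
print(speaker_index)
print(lookup(speaker_index, "999"))
```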

Measures

When using speech-collator you can also use the measures argument to specify which measures to use. The following example extracts Pitch and Energy on the fly.

import json
from torch.utils.data import DataLoader
from datasets import load_dataset
from speech_collator import SpeechCollator, create_speaker2idx, create_phone2idx
from speech_collator.measures import PitchMeasure, EnergyMeasure

dataset = load_dataset("cdminix/libritts-r-aligned", split="train")

speaker2idx = json.load(open("data/speaker2idx.json"))
phone2idx = json.load(open("data/phone2idx.json"))

# Create SpeechCollator
speech_collator = SpeechCollator(
    speaker2idx=speaker2idx,
    phone2idx=phone2idx,
    measures=[PitchMeasure(), EnergyMeasure()],
    return_keys=["measures"]
)

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=8,
    collate_fn=speech_collator.collate_fn,
)

COMING SOON: Detailed documentation on how to use the measures at MiniXC/speech-collator.

Splits

This dataset has the following splits:

train: All the training data, except one sample per speaker which is used for validation.
dev: The validation data, one sample per speaker.
train.clean.100: Training set derived from the original materials of the train-clean-100 subset of LibriSpeech.
train.clean.360: Training set derived from the original materials of the train-clean-360 subset of LibriSpeech.
train.other.500: Training set derived from the original materials of the train-other-500 subset of LibriSpeech.
dev.clean: Validation set derived from the original materials of the dev-clean subset of LibriSpeech.
dev.other: Validation set derived from the original materials of the dev-other subset of LibriSpeech.
test.clean: Test set derived from the original materials of the test-clean subset of LibriSpeech.
test.other: Test set derived from the original materials of the test-other subset of LibriSpeech.
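
The split names above are passed as the split argument of load_dataset. The sketch below wraps that call in a small hypothetical helper, load_split, which rejects unknown names before any download starts. The repository id is assumed to be cdminix/libritts-r-aligned, and calling the helper triggers the download and alignment step described earlier.

```python
SPLITS = [
    "train",
    "dev",
    "train.clean.100",
    "train.clean.360",
    "train.other.500",
    "dev.clean",
    "dev.other",
    "test.clean",
    "test.other",
]


def load_split(name, repo="cdminix/libritts-r-aligned"):
    """Load one of the splits listed above, rejecting unknown names early."""
    if name not in SPLITS:
        raise ValueError(f"unknown split {name!r}; expected one of {SPLITS}")
    # Imported lazily so the name check works without the datasets package.
    from datasets import load_dataset

    return load_dataset(repo, split=name)


# Example (downloads and aligns data on first use):
# dev = load_split("dev")
```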

Environment Variables

A few environment variables can be set:

LIBRITTS_VERBOSE: If set, will print out more information about the dataset creation process.
LIBRITTS_MAX_WORKERS: The number of workers to use when creating the alignments. Defaults to cpu_count().
LIBRITTS_PATH: The path to download LibriTTS to. Defaults to the value of HF_DATASETS_CACHE.
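
A minimal sketch of setting these variables from Python. It assumes they are read when the dataset is built, so they are set before load_dataset is called. The concrete values, including the download path, are placeholders chosen for illustration.

```python
import os

# Any value enables verbose output, since the variable only needs to be set.
os.environ["LIBRITTS_VERBOSE"] = "1"

# Limit the number of alignment workers (placeholder value).
os.environ["LIBRITTS_MAX_WORKERS"] = "4"

# Hypothetical download location; replace with a path that suits your machine.
os.environ["LIBRITTS_PATH"] = os.path.join(
    os.path.expanduser("~"), "data", "libritts-r"
)

# The dataset would then be loaded as in the examples above:
# from datasets import load_dataset
# dataset = load_dataset("cdminix/libritts-r-aligned", split="dev")
```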

Citation

When using LibriTTS-R please cite the following papers:

When using the Measures please cite the following paper (ours):

Evaluating and reducing the distance between synthetic and real speech distributions

Author:

cdminix

Dataset size:

278.83 MB