Dataset:
MLCommons/peoples_speech
The People's Speech Dataset is among the world's largest English speech recognition corpora licensed for academic and commercial usage under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available under a permissive license.
[Needs More Information]
English
{
    "id": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac",
    "audio": {
        "path": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac",
        "array": array([-6.10351562e-05, ...]),
        "sampling_rate": 16000
    },
    "duration_ms": 14490,
    "text": "contends that the suspension clause requires a [...]"
}
{
    "id": datasets.Value("string"),
    "audio": datasets.Audio(sampling_rate=16_000),
    "duration_ms": datasets.Value("int32"),
    "text": datasets.Value("string"),
}
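As a rough illustration of how the fields above fit together, the following sketch checks a record against the documented schema and totals `duration_ms` across records. It works on plain Python dictionaries rather than the `datasets` library. The helper names `check_record` and `total_hours` and the shortened example `id` are hypothetical and not part of the dataset tooling.

```python
from typing import Iterable, List, Mapping

EXPECTED_SAMPLING_RATE = 16_000  # taken from the schema above


def check_record(record: Mapping) -> List[str]:
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for field, expected_type in (("id", str), ("duration_ms", int), ("text", str)):
        if not isinstance(record.get(field), expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    audio = record.get("audio")
    if not isinstance(audio, Mapping):
        problems.append("audio: expected a mapping")
    elif audio.get("sampling_rate") != EXPECTED_SAMPLING_RATE:
        problems.append(f"audio.sampling_rate: expected {EXPECTED_SAMPLING_RATE}")
    return problems


def total_hours(records: Iterable[Mapping]) -> float:
    """Sum duration_ms over records and convert milliseconds to hours."""
    return sum(r["duration_ms"] for r in records) / 3_600_000


# Hypothetical record shaped like the example instance (id shortened).
example = {
    "id": "example_00002.flac",
    "audio": {"path": "example_00002.flac", "array": [], "sampling_rate": 16000},
    "duration_ms": 14490,
    "text": "contends that the suspension clause requires a [...]",
}
```

Calling `check_record(example)` returns an empty list for a well-formed record, and `total_hours` can be applied to any iterable of records to estimate how much audio a subset contains.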
We provide the following configurations for the dataset: cc-by-clean, cc-by-dirty, cc-by-sa-clean, cc-by-sa-dirty, and microset. We don't provide splits for any of the configurations.
See our paper.
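A minimal sketch of selecting one of these configurations, assuming the Hugging Face `datasets` library is installed and that the repository name `MLCommons/peoples_speech` resolves on the Hub. The wrapper `load_peoples_speech` is a hypothetical helper, and the choice of streaming mode is an assumption made here to avoid downloading a large configuration up front.

```python
# Configuration names as listed above.
CONFIGS = ("cc-by-clean", "cc-by-dirty", "cc-by-sa-clean", "cc-by-sa-dirty", "microset")


def load_peoples_speech(config: str = "microset", streaming: bool = True):
    """Load one configuration, rejecting unknown names before any download."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("MLCommons/peoples_speech", config, streaming=streaming)
```

For example, `load_peoples_speech("microset")` would request the `microset` configuration, which, judging by its name, is presumably the smallest and a reasonable starting point for experiments.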
Data was downloaded via the archive.org API. No data inference was done.
Who are the source language producers?
[Needs More Information]
No manual annotation is done. We download only source audio with already existing transcripts.
Who are the annotators?
For the test and dev sets, we paid native American English speakers to do transcriptions. We do not know the identities of the transcriptionists for data in the training set. For the training set, we have noticed that some transcriptions are likely to be the output of automatic speech recognition systems.
Several of our sources are legal and government proceedings, spoken histories, speeches, and so on. Given that these were intended as public documents and licensed as such, it is natural that the involved individuals are aware of this.
The dataset could be used for speech synthesis. However, this requires careful cleaning of the dataset, as background noise is not tolerable for speech synthesis.
The dataset could be used for keyword spotting tasks as well. In particular, this is a good use case for the non-English audio in the dataset.
Our sincere hope is that the large breadth of sources our dataset incorporates reduces existing quality-of-service issues, such as speech recognition systems' poor understanding of non-native English accents. We cannot think of any unfair treatment that could come from using this dataset at this time.
Our data is downloaded from archive.org. As such, the data is biased towards whatever users decide to upload there.
Almost all of our data is American-accented English.
As of version 1.0, a portion of data in the training, test, and dev sets is poorly aligned. Specifically, some words appear in the transcript, but not the audio, or some words appear in the audio, but not the transcript. We are working on it.
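One simple way a user might screen for such misaligned samples is to compare the transcript length with the audio duration. The sketch below is not the authors' method; it is a heuristic assumed here for illustration, and the function names and the `min_wps` / `max_wps` thresholds are hypothetical values that would need tuning on real data.

```python
def speaking_rate(text: str, duration_ms: int) -> float:
    """Words per second implied by a transcript and its audio duration."""
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    return len(text.split()) / (duration_ms / 1000)


def looks_misaligned(text: str, duration_ms: int,
                     min_wps: float = 0.5, max_wps: float = 5.0) -> bool:
    """Flag samples whose implied speaking rate is implausibly low or high.

    A very high rate suggests words in the transcript that are not in the
    audio; a very low rate suggests audio that was not transcribed.
    """
    rate = speaking_rate(text, duration_ms)
    return rate < min_wps or rate > max_wps
```

A check like this only catches gross mismatches. It cannot detect a transcript of plausible length that simply contains the wrong words.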
[Needs More Information]
We provide CC-BY and CC-BY-SA subsets of the dataset.
Please cite:
@article{DBLP:journals/corr/abs-2111-09344,
  author     = {Daniel Galvez and
                Greg Diamos and
                Juan Ciro and
                Juan Felipe Cer{\'{o}}n and
                Keith Achorn and
                Anjali Gopi and
                David Kanter and
                Maximilian Lam and
                Mark Mazumder and
                Vijay Janapa Reddi},
  title      = {The People's Speech: {A} Large-Scale Diverse English Speech Recognition
                Dataset for Commercial Usage},
  journal    = {CoRR},
  volume     = {abs/2111.09344},
  year       = {2021},
  url        = {https://arxiv.org/abs/2111.09344},
  eprinttype = {arXiv},
  eprint     = {2111.09344},
  timestamp  = {Mon, 22 Nov 2021 16:44:07 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-09344.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}