Dataset:

jimregan/clarinpl_sejmsenat

Tasks:

task_categories:other

Automatic speech recognition

Languages:

Multilinguality:

monolingual

Size:

1K<n<10K

Annotation creators:

expert-generated

Source datasets:

original

License:

other


Dataset Card for ClarinPL Sejm/Senat Speech Corpus

Dataset Summary

A collection of 97 hours of parliamentary speeches published on the ClarinPL website.

Supported Tasks and Leaderboards

[Needs More Information]

Languages

The audio is in Polish.

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file, called file, and its transcription, called text, along with an id and a speaker_id. An example from the dataset is:

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/4143b1d75559b10028c1c7e8800c9ccc05934ca5a8ea15f8f9a92770576a1ee3/SejmSenat/audio/AdamAbramowicz-20130410/file000.wav',
 'id': 'AdamAbramowicz-20130410-file000',
 'speaker_id': 'AdamAbramowicz',
 'text': 'panie marszałku wysoka izbo panie ministrze próbuje się przedstawiać polskę jako zieloną wyspę kraj który się szybko rozwija tymczasem rzeczywistość jest zupełnie inna a widać ją także dzisiaj przed polskim parlamentem próbuje się rząd próbuje zagonić polaków do pracy aż do śmierci przedłużać wiek emerytalny czyliczyli sytuacja gospodarcza polski w tym wypadku jest przedstawiana już zupełnie inaczej pakiet klimatyczny i protokół z kioto jak się zgadzają fachowcy od gospodarki jest szkodliwy dla krajów które są na dorobku a polska właśnie jest takim krajem'}
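In the example above, the id appears to be the name of the directory holding the audio file joined to the file name without its extension. The following is a minimal sketch of that relationship, not part of the dataset's own tooling: the helper names utterance_id_from_path and split_session are hypothetical, the cache prefix of the path is omitted for brevity, and reading the trailing digits as a session date (YYYYMMDD) is an assumption based on this single example.

```python
from pathlib import PurePosixPath
from typing import Tuple


def utterance_id_from_path(file_path: str) -> str:
    """Rebuild an id from the last directory name and the file stem (hypothetical helper)."""
    path = PurePosixPath(file_path)
    return f"{path.parent.name}-{path.stem}"


def split_session(session_dir: str) -> Tuple[str, str]:
    """Split 'Speaker-YYYYMMDD' at the last hyphen (the date reading is an assumption)."""
    speaker, _, date = session_dir.rpartition("-")
    return speaker, date


# The example instance from this card, with the local cache prefix dropped.
example = {
    "file": "SejmSenat/audio/AdamAbramowicz-20130410/file000.wav",
    "id": "AdamAbramowicz-20130410-file000",
    "speaker_id": "AdamAbramowicz",
}

derived_id = utterance_id_from_path(example["file"])
speaker, session = split_session(PurePosixPath(example["file"]).parent.name)
print(derived_id == example["id"], speaker == example["speaker_id"], session)
```

Whether every instance follows this naming pattern is not stated in the card, so a check like this is best treated as a sanity test rather than a guarantee.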

Data Fields

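file: A path to the downloaded audio file in .wav format.
id: An identifier for the utterance; in the example above it combines the name of the directory containing the audio file with the file name.
text: The transcription of the audio file.
speaker_id: The ID of the speaker of the audio.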
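As a rough illustration of how these fields might be accessed, the sketch below checks a record for the four keys that appear in the example instance and wraps the Hugging Face datasets loader in a function. The names EXPECTED_FIELDS, missing_fields and load_split are hypothetical. The load_dataset call itself is part of the datasets library, but whether this particular dataset loads without extra arguments depends on the installed library version and has not been verified here.

```python
EXPECTED_FIELDS = ("file", "id", "speaker_id", "text")


def missing_fields(record: dict) -> list:
    """Return the expected keys that are absent from a record, in a fixed order."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def load_split(split: str = "train"):
    """Load one split from the Hub (requires the `datasets` package and network access)."""
    # Imported lazily so the rest of this sketch works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("jimregan/clarinpl_sejmsenat", split=split)


# Offline check on a hand-written record; no download is needed for this part.
partial_record = {"file": "file000.wav", "text": "panie marszałku wysoka izbo"}
print(missing_fields(partial_record))
```

With the package installed, load_split("test") would be the call to try first, since the test split is the smaller of the two.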

Data Splits

	Train	Test
dataset	6622	130
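The split sizes can be put in proportion with a few lines of arithmetic. This sketch assumes that the 97 hours quoted in the summary cover both splits together, which the card does not state explicitly, so the average duration is only an estimate.

```python
# Figures taken from this card; the 97-hour total is assumed to span both splits.
SPLIT_SIZES = {"train": 6622, "test": 130}
TOTAL_HOURS = 97

total_utterances = sum(SPLIT_SIZES.values())
test_share = SPLIT_SIZES["test"] / total_utterances
mean_seconds = TOTAL_HOURS * 3600 / total_utterances

print(f"{total_utterances} utterances in total")
print(f"{test_share:.1%} of utterances are in the test split")
print(f"roughly {mean_seconds:.0f} seconds per utterance on average")
```

Under that assumption the test split holds just under 2% of the utterances, and the average recording is a little under a minute long.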

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

[Needs More Information]

Contributions

[Needs More Information]

Author:

jimregan

Dataset size:

11.07 KB