数据集:
Jzuluaga/uwb_atcc
The UWB-ATCC Corpus is provided provided by University of West Bohemia, Department of Cybernetics. The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with the information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours) but we plan to search for additional data next year. The audio data format is: 8kHz, 16bit PCM, mono.
Important, from the <id (string)> field, you can obtain the speaker roles. For instance:
The text and the recordings are in English. The authors took advantage of the fact that one of their industrial partners develops complex IT solutions for several ATC authorities and airports and, as such, has access to the ATC communication recordings collected in the Czech airspace. This partner was able to secure the following data:
(Not all data is released. Check their website here )
The licensing status of the dataset hinges on the legal status of the UWB-ATCC corpus creators.
They used Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) licensing.
Contributors who prepared, processed, normalized and uploaded the dataset in HuggingFace:
@article{zuluaga2022how, title={How Does Pre-trained Wav2Vec2. 0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications}, author={Zuluaga-Gomez, Juan and Prasad, Amrutha and Nigmatulina, Iuliia and Sarfjoo, Saeed and others}, journal={IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar}, year={2022} } @article{zuluaga2022bertraffic, title={BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications}, author={Zuluaga-Gomez, Juan and Sarfjoo, Seyyed Saeed and Prasad, Amrutha and others}, journal={IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar}, year={2022} } @article{zuluaga2022atco2, title={ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications}, author={Zuluaga-Gomez, Juan and Vesel{\`y}, Karel and Sz{\"o}ke, Igor and Motlicek, Petr and others}, journal={arXiv preprint arXiv:2211.04054}, year={2022} }
Authors of the dataset:
@article{vsmidl2019air, title={Air traffic control communication (ATCC) speech corpora and their use for ASR and TTS development}, author={{\v{S}}m{\'\i}dl, Lubo{\v{s}} and {\v{S}}vec, Jan and Tihelka, Daniel and Matou{\v{s}}ek, Jind{\v{r}}ich and Romportl, Jan and Ircing, Pavel}, journal={Language Resources and Evaluation}, volume={53}, number={3}, pages={449--464}, year={2019}, publisher={Springer} }