ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications
Code for the paper ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications.
Personal assistants, automatic speech recognizers, and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications.
ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. In order to incorporate these novel technologies into ATC (a low-resource domain), large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to a lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotation of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets:

1. The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value).
2. The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate, and an English language detection score per sample. Both corpora are available for purchase through ELDA at this http URL.
3. The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at this https URL.

We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
ATCO2 corpus ecosystem. Blue circles denote annotations available only for the ATCO2 test set corpus. Green circles denote annotations and metadata available for both the ATCO2 test set and the ATCO2 pseudo-labeled corpus sets.
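Since the pseudo-labeled set ships a signal-to-noise ratio estimate and an English language detection score per sample, a common first step is to filter samples by those metadata fields. The sketch below is illustrative only: the field names (`snr`, `eng_score`) and thresholds are assumptions, not the official schema of the released corpus.

```python
# Hypothetical sketch: filtering ATCO2-PL-set samples by per-sample metadata
# (SNR estimate and English language detection score), as described above.
# The dict keys "snr" and "eng_score" are illustrative, not the real schema.

def filter_samples(samples, min_snr=5.0, min_eng_score=0.5):
    """Keep samples whose estimated SNR and English LID score pass thresholds."""
    return [s for s in samples
            if s["snr"] >= min_snr and s["eng_score"] >= min_eng_score]

samples = [
    {"id": "utt1", "snr": 12.3, "eng_score": 0.91},
    {"id": "utt2", "snr": 3.1,  "eng_score": 0.88},  # too noisy
    {"id": "utt3", "snr": 9.7,  "eng_score": 0.20},  # likely non-English
]
kept = filter_samples(samples)
print([s["id"] for s in kept])  # → ['utt1']
```

Thresholds depend on the downstream task: semi-supervised ASR training typically tolerates lower SNR than evaluation data would.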
Repository written by: Juan Pablo Zuluaga.
The first step is to create your environment with the required packages for data preparation, formatting, and running the experiments. You can run the following commands to create the conda environment (assuming CUDA 11.7):
- Step 1: install Python 3.10 and the requirements:
```bash
git clone https://github.com/idiap/w2v2-air-traffic
conda create -n atco2_corpus python==3.10
conda activate atco2_corpus
python -m pip install -r requirements.txt
```
Before running any script, make sure you have the `en_US` locale set and `PYTHONPATH` pointing to the repository root folder:
```bash
export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
export PYTHONPATH=$PYTHONPATH:$(pwd)  # assuming you are in the repository root folder
```
There are several steps to replicate/use our proposed models:
- This system lets you obtain the text-level information of what was said in the ATC communication. Its output is normally used by the downstream systems below.
- With this module, you can detect who is talking in a given communication.
- Here, you aim at understanding what was said in the communication. With the ATCO2 corpus, you can train a system that detects callsigns, commands, and values in the communication.
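To make the three entity types concrete, here is a toy rule-based sketch of what a callsign/command/value extractor produces on an ATC transcript. It is illustrative only: the real systems in this repository are trained NER models, and the small airline and command vocabularies below are assumptions for the example.

```python
import re

# Toy illustration of the three ATCO2 entity types (callsign, command, value).
# AIRLINES and COMMANDS are tiny example vocabularies, not the real ontology.
AIRLINES = {"lufthansa", "speedbird", "ryanair"}
COMMANDS = {"descend", "climb", "maintain", "contact"}

def tag_entities(transcript):
    """Return (entity_type, text) pairs found in a transcript."""
    tokens = transcript.lower().split()
    entities = []
    i = 0
    while i < len(tokens):
        if tokens[i] in AIRLINES:
            # A callsign is the airline designator plus following alphanumeric tokens.
            j = i + 1
            while (j < len(tokens)
                   and re.fullmatch(r"[a-z0-9]+", tokens[j])
                   and tokens[j] not in COMMANDS):
                j += 1
            entities.append(("callsign", " ".join(tokens[i:j])))
            i = j
        elif tokens[i] in COMMANDS:
            entities.append(("command", tokens[i]))
            i += 1
        elif re.fullmatch(r"\d+", tokens[i]):
            entities.append(("value", tokens[i]))
            i += 1
        else:
            i += 1
    return entities

print(tag_entities("Speedbird 123 descend 3000"))
# → [('callsign', 'speedbird 123'), ('command', 'descend'), ('value', '3000')]
```

A trained model replaces these hand-written rules with token-level classification, which also handles spelled-out numbers and the ICAO spelling alphabet.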
Here is a list of papers related to AI/ML for air traffic control communications:
- Fine-tuning a pretrained BERT model on the named entity recognition task to perform text-based diarization for ATC communications
- Fine-tuning a pretrained Wav2vec 2.0 model for automatic speech recognition
- How to use contextual data (biasing) in ATC automatic speech recognition
- Ethics in collection of ATC audio data: Legal and Ethical Challenges in Recording Air Traffic Control Speech
Some other papers:
- Boosting of contextual information in ASR for air-traffic call-sign recognition
- Grammar Based Identification Of Speaker Role For Improving ATCO And Pilot ASR
- Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems
- Automatic Processing Pipeline for Collecting and Annotating Air-Traffic Voice Communication Data
- Automatic Call Sign Detection: Matching Air Surveillance Data with Air Traffic Spoken Communications
- Improving callsign recognition with air-surveillance data in air-traffic communication
- Automatic Speech Recognition Benchmark for Air-Traffic Communications
If you use this code for your research, please cite our papers with the following BibTeX entries:
```bibtex
# article 1 - MAIN
@article{zuluaga2022atco2,
  title={ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications},
  author={Zuluaga-Gomez, Juan and Vesel{\'y}, Karel and Sz{\"o}ke, Igor and Motlicek, Petr and others},
  journal={arXiv preprint arXiv:2211.04054},
  year={2022}
}

# article 2 - Mainly on ASR
@inproceedings{zuluaga2023does,
  title={How does pre-trained Wav2Vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communications},
  author={Zuluaga-Gomez, Juan and Prasad, Amrutha and Nigmatulina, Iuliia and Sarfjoo, Seyyed Saeed and Motlicek, Petr and Kleinert, Matthias and Helmke, Hartmut and Ohneiser, Oliver and Zhan, Qingran},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  pages={205--212},
  year={2023},
  organization={IEEE}
}

# article 3 - Mainly on sequence classification and BERT
@inproceedings{zuluaga2023bertraffic,
  title={Bertraffic: Bert-based joint speaker role and speaker change detection for air traffic control communications},
  author={Zuluaga-Gomez, Juan and Sarfjoo, Seyyed Saeed and Prasad, Amrutha and Nigmatulina, Iuliia and Motlicek, Petr and Ondrej, Karel and Ohneiser, Oliver and Helmke, Hartmut},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  pages={633--640},
  year={2023},
  organization={IEEE}
}
```