Automatic speech recognition (ASR) has become one of the most popular forms of human-machine interaction thanks to the proliferation of Internet of Things (IoT) devices. However, ASR remains difficult in real-world scenarios because of background noise, speaker accent, gender, and other factors. The speech signal picked up by a microphone is typically distorted by background noise. Speakers with accents may pronounce words differently from the norm reflected in the lexicon and training data, leading to ASR misclassification.
Moreover, in Chinese ASR, there are many polyphonic words with identical pronunciations (homophones), making it much more challenging to pick the correct one from the candidates, especially when the speech is noisy.
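To make the ambiguity concrete, here is a small illustration of how one pronunciation can map to several written words. The lexicon below is a hand-picked toy example, and the names `HOMOPHONES`, `candidates`, and `count_readings` are hypothetical; none of this comes from the paper.

```python
from math import prod

# Toy lexicon (illustrative only): toned pinyin -> words sharing that pronunciation.
HOMOPHONES = {
    "gong1 shi4": ["公式", "公事", "攻势", "工事"],
    "shi4": ["是", "事", "市", "试", "室"],
}

def candidates(pronunciation):
    """Return every written form in the toy lexicon for one pronunciation."""
    return HOMOPHONES.get(pronunciation, [])

def count_readings(pronunciations):
    """Count the word sequences that are consistent with a list of pronunciations.

    A pronunciation missing from the toy lexicon contributes zero candidates.
    """
    return prod(len(candidates(p)) for p in pronunciations)

if __name__ == "__main__":
    print(candidates("gong1 shi4"))
    print(count_readings(["shi4", "gong1 shi4"]))
```

Even with only two pronunciations the number of possible readings multiplies quickly, which is why the surrounding context matters so much when choosing among candidates.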
In a typical industrial ASR pipeline, the acoustic model (AM) converts a speech waveform into a phone sequence or a lattice, and the language model (LM) then decodes the AM output into a natural language sentence. For Chinese polyphonic words, traditional LMs lack the robustness needed to withstand noisy AM outputs, and a second LM decoding pass is unlikely to rectify a flawed first pass. Some existing works address these issues by training the ASR system on intentionally noisy speech data.
According to the researchers, a study of the worst-case scenarios of typical ASR systems shows that most misclassified words can be recovered from context information, provided the LM can extract semantics from the contaminated context. Conventional two-pass LM decoding, however, does not use the context information of phone sequences, and it propagates errors from one stage to the next. It is therefore preferable to convert the AM outputs directly into a sentence using the full context.
Researchers have found that using a Transformer can make the LM more resilient to contaminated phone sequences. Pre-training also helps: as the recent success of pre-trained models on several natural language processing (NLP) tasks suggests, it can provide a more robust LM that converts a corrupted phone sequence into the intended sentence.
A new study by researchers at the Hong Kong University of Science and Technology and WeBank proposes a phonetic-semantic pre-training (PSP) approach: training an encoder-decoder model that converts a phone sequence into a sentence by mapping the phones to words.
The proposed framework pre-trains its models using novel phone-perturbation heuristics, self-supervised training techniques, and massive amounts of unpaired text data. The researchers first apply data augmentation that simulates the acoustic model’s output errors during training, making the model more robust to such errors. They then feed the phone sequences generated by the acoustic model on various audio datasets into the pre-trained model and compare the results with those of a conventional ASR system.
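The idea of perturbing phone sequences to imitate acoustic-model errors can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's actual heuristics: the confusion sets in `CONFUSABLE`, the three edit operations, the `rate` parameter, and the function name `perturb_phones` are all hypothetical choices made for illustration.

```python
import random

# Hypothetical confusion sets: phones an acoustic model might plausibly mix up.
# These groupings are illustrative assumptions, not taken from the paper.
CONFUSABLE = {
    "z": ["zh"], "zh": ["z"],
    "c": ["ch"], "ch": ["c"],
    "s": ["sh"], "sh": ["s"],
    "n": ["l"], "l": ["n"],
    "in": ["ing"], "ing": ["in"],
    "en": ["eng"], "eng": ["en"],
}

def perturb_phones(phones, rate=0.15, rng=None):
    """Return a noisy copy of `phones`, editing each phone with probability `rate`.

    An edited phone is substituted, deleted, or followed by an inserted phone,
    with the operation chosen uniformly at random.
    """
    rng = rng or random.Random()
    vocab = sorted(CONFUSABLE)
    noisy = []
    for phone in phones:
        if rng.random() >= rate:
            noisy.append(phone)  # keep the phone unchanged
            continue
        op = rng.choice(["substitute", "delete", "insert"])
        if op == "substitute":
            # Prefer a confusable phone; fall back to any phone in the toy vocabulary.
            noisy.append(rng.choice(CONFUSABLE.get(phone, vocab)))
        elif op == "insert":
            noisy.append(phone)
            noisy.append(rng.choice(vocab))
        # "delete": append nothing, which drops the phone
    return noisy

if __name__ == "__main__":
    clean = ["zh", "ong", "g", "uo", "r", "en"]
    print(perturb_phones(clean, rate=0.3, rng=random.Random(0)))
```

Pairing each perturbed sequence with its original clean sentence would give training examples in which the input is corrupted but the target is not, which is the general shape of the augmentation described above.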
Experiments on synthetic noisy speech datasets at varying signal-to-noise ratios (SNRs) demonstrate the robustness of the proposed framework to environmental noise. The team also ran experiments on two independent real-world corpora and evaluated the models against many industry-standard ASR benchmarks. The findings show that the models outperform the state-of-the-art ASR pipeline, achieving relative character error rate (CER) reductions of 28.63% and 26.38% on the two corpora, respectively.
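Because the reported figures are relative reductions, it may help to see how CER and a relative reduction are usually computed. The sketch below uses the standard definition of CER (character-level edit distance divided by reference length); the function names and the example numbers are made up for illustration and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

def relative_reduction(baseline_cer, new_cer):
    """Fraction of the baseline error that the new system removes."""
    return (baseline_cer - new_cer) / baseline_cer

if __name__ == "__main__":
    # One wrong character out of six in a made-up transcript.
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.3f}")
    # Made-up numbers: a drop from 10% to 7% CER.
    print(f"Relative reduction: {relative_reduction(0.10, 0.07):.1%}")
```

A relative reduction describes the share of the baseline's errors that disappear, so a drop from 10% to 7% CER is a 30% relative reduction even though the absolute change is only three percentage points.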
Going forward, the researchers plan to explore more efficient PSP pre-training techniques with larger unpaired datasets to maximize the benefit of pre-training for noise-robust LMs.
This article is a research summary written by Marktechpost staff based on the research paper 'A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition'. All credit for this research goes to the researchers on this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.