Posted: 11 Jan 2022 04:00

“Speech Recognition” January 2022 — summary from Astrophysics Data System and Arxiv

Astrophysics Data System - summary generated by Brevi Assistant

Automatic speech recognition in low-resource languages improves access for linguistic minorities to the technological advantages offered by artificial intelligence.

In this paper, we address the problem of data scarcity for Hong Kong Cantonese by creating a new Cantonese dataset. Speech recognition is extremely difficult in student learning environments that are characterized by considerable cross-talk and background noise. To address this issue, we present a multilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations.

Self-supervised acoustic pre-training has achieved impressive results in the automatic speech recognition (ASR) task. Most successful acoustic pre-training methods use contrastive learning to learn acoustic representations by distinguishing the representations from different time steps, ignoring speaker and environment robustness. Despite the rapid progress of end-to-end (E2E) ASR, it has been shown that incorporating external language models (LMs) into decoding can further improve the recognition performance of E2E ASR systems. Although several methods have been proposed to incorporate word-level external LMs in E2E ASR, these methods are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which each character sequence can have multiple corresponding word sequences.
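One common way to bring an external LM into decoding is shallow fusion, where each hypothesis is scored as log P_ASR + λ · log P_LM. The summary does not say which integration method the paper uses, so the sketch below is only an illustration of that general idea: the function name, the toy n-best list, and the toy LM scores are all hypothetical.

```python
from typing import Callable, List, Tuple


def rescore_with_shallow_fusion(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.3,
) -> str:
    """Return the hypothesis maximizing log P_asr + lm_weight * log P_lm.

    nbest holds (text, asr_log_probability) pairs from an ASR decoder.
    lm_logprob is any callable returning an external LM log-probability.
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")
    best_text, _ = max(
        nbest,
        key=lambda item: item[1] + lm_weight * lm_logprob(item[0]),
    )
    return best_text


# Hypothetical toy data: the acoustic model slightly prefers the wrong text,
# while the external LM strongly prefers the right one.
TOY_LM = {"recognize speech": -2.0, "wreck a nice beach": -6.0}
NBEST = [("recognize speech", -4.0), ("wreck a nice beach", -3.8)]


def toy_lm_logprob(text: str) -> float:
    # Unknown strings get a low, made-up log-probability.
    return TOY_LM.get(text, -20.0)
```

With `lm_weight=0.0` the ASR score alone decides the winner; raising the weight lets the LM override small acoustic preferences. For languages without clear word boundaries, `lm_logprob` would also have to handle segmentation ambiguity, which this sketch does not attempt.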

Code-switching is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking. We investigate the effect of different language model integration methods on the performance of the proposed model.

Arxiv - summary generated by Brevi Assistant

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. To the best of our knowledge, WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition. In this paper, we propose an open-source, production-first, and production-ready speech recognition toolkit called WeNet, in which a new two-pass approach is implemented to unify streaming and non-streaming end-to-end speech recognition in a single model. WeNet provides an efficient way to ship ASR applications in several real-world scenarios, which is its main difference from, and advantage over, other open-source E2E speech recognition toolkits.

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end speech recognition in a single model. On the AISHELL-1 test set, our unified model achieves a 5.60% relative character error rate reduction in non-streaming ASR compared to a standard non-streaming transformer.
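For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between a reference and a hypothesis, divided by the reference length, and a relative reduction compares two such rates. The following is a minimal sketch of both calculations; the function names are illustrative and the code is not taken from the paper or from WeNet.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction by which new_cer improves on baseline_cer."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer
```

A relative reduction is measured against the baseline, so a drop from a hypothetical 10% CER to 8% CER is a 20% relative reduction but only a 2-point absolute one.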

The unified streaming and non-streaming two-pass end-to-end model for speech recognition has shown great performance in terms of streaming capability, accuracy, real-time factor, and latency. We also propose a new data augmentation method called SpecSub to help the U2++ model become more accurate and robust. Audio-based automatic speech recognition degrades significantly in noisy environments and is especially vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition systems improve robustness by complementing the audio stream with visual information that is invariant to noise and helps the model focus on the desired speaker.
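The summary names SpecSub without describing it. The sketch below assumes it works by overwriting randomly chosen spans of feature frames with earlier frames from the same utterance; that reading, the function name, and the parameters `max_width` and `num_subs` are assumptions for illustration, not a confirmed description of the authors' method.

```python
import random
from typing import List, Optional


def spec_sub(
    frames: List[List[float]],
    max_width: int = 20,
    num_subs: int = 3,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return a copy of frames with random spans replaced by earlier spans.

    frames is a time-major list of feature vectors (for example, filterbank
    frames). The input list is left unmodified.
    """
    rng = rng or random.Random()
    original = [list(f) for f in frames]
    out = [list(f) for f in frames]
    n = len(original)
    if n == 0:
        return out
    for _ in range(num_subs):
        start = rng.randint(0, n - 1)
        end = min(n, start + max_width)
        # Assumption: the source span lies offset frames earlier in time.
        offset = rng.randint(0, start)
        out[start:end] = [list(f) for f in original[start - offset:end - offset]]
    return out
```

Because every replacement frame is copied from the same or an earlier time index, the augmented sequence keeps its length and never uses future context, which is compatible with streaming training.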

