Posted: 26 Oct 2021

"Speech Synthesis" October 2021 — summary from Astrophysics Data System, Arxiv and DOAJ

This study aims to develop an environment-aware text-to-speech (TTS) system that generates speech suited to specific acoustic environments. The key idea is to model the acoustic environment in speech audio as a factor of data variability and to incorporate it as a condition in neural-network-based speech synthesis. In this paper, we present a novel architecture for fine-grained style control in transformer-based text-to-speech synthesis. Specifically, we model the speaking style by extracting a time sequence of local style tokens from the reference speech. Speech synthesis is used in a wide range of industries. We evaluate four configurations with different inputs and training strategies, analyze them, and show that our best model can generate speech files that lie in the same distribution as the original training dataset. Learning an emotion embedding from reference audio is a straightforward approach to multi-emotion speech synthesis in encoder-decoder systems. However, how to obtain a better emotion embedding and how to inject it into the TTS acoustic model more effectively are still under investigation. In a speech-to-speech translation pipeline, the text-to-speech module is essential for delivering the translated speech to users. Moreover, we apply our incremental TTS system to a practical scenario in combination with an upstream simultaneous speech translation system, and show that the gains carry over to this use case.

This paper presents a novel neural network design for fine-grained style modeling, transfer, and prediction in expressive text-to-speech (TTS) synthesis. Collaborative learning and adversarial learning strategies are applied to achieve effective disentanglement of content and style factors in speech and to alleviate the content leakage problem in style modeling. End-to-end text-to-speech synthesis systems have achieved great success in recent years, with improved naturalness and intelligibility. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance, with a mean opinion score of 4.08 and a very low inference time. Recently, emotional speech synthesis has achieved remarkable performance. The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained from an emotion attribute ranking function. This work presents a lifelong learning approach to training a multilingual TTS system, in which each language is treated as an individual task and is learned sequentially and continually. One of the challenges of lifelong learning methods is catastrophic forgetting: in the TTS scenario, this means that model performance degrades quickly on previous languages when the model is adapted to a new language. An end-to-end speech synthesis model can directly take an utterance as reference audio and generate speech from text with prosody and speaker characteristics similar to those of the reference audio. Because only matched text and speech are used during training, using unmatched text and speech at inference causes the model to synthesize speech with low content quality.
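For readers unfamiliar with the metric quoted above, a mean opinion score is conventionally the arithmetic mean of listener ratings on a 1-to-5 scale. The following is a minimal sketch under that assumption; the function name, the normal-approximation confidence interval, and the ratings are all illustrative and are not taken from the summarized paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings.

    Hypothetical helper: the interval uses a simple normal approximation
    (1.96 * standard error), which is an assumption of this sketch.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Made-up ratings for a single synthesized utterance.
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, ratings are usually pooled over many utterances and listeners before averaging, so a single reported figure hides how the listening test was set up.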

The most recent end-to-end speech synthesis systems use phonemes as acoustic input tokens and ignore information about which word the phonemes come from. Subjective evaluations of naturalness show that incorporating acoustic word embeddings can significantly outperform both a purely phoneme-based system and a TTS system with pre-trained linguistic word embeddings. Intonation generation in expressive speech such as storytelling is essential to producing a high-quality expressive speech synthesizer for the Malay language. Then, an improved iterative two-step sinusoidal pitch contour formulation was introduced to modify the pitch contours of neutral speech into the expressive pitch contours of natural speech. A central challenge for articulatory speech synthesis is the simulation of realistic articulatory movements, which is critical for generating highly natural and intelligible speech. To achieve this, the vocal tract target shape of a consonant in the context of a given vowel is derived as the weighted average of three measured and acoustically optimized reference vocal tract shapes for that consonant in the context of the corner vowels /a/, /i/, and /u/. The paper presents a novel architecture and method for training neural networks to generate synthesized speech in a particular voice and speaking style, based on a small amount of target speaker/style training data. The initial model on which speaker/style adaptation was performed was a multi-speaker/multi-style model based on 8.5 hours of American English speech data, corresponding to 16 different speaker/style combinations. Chinese speech synthesis refers to the technology by which machines transform human speech signals into corresponding text or commands through recognition and understanding.
Therefore, the Chinese proficiency of international students is closely related to their interest in the Chinese language: the greater their interest in Chinese, the stronger their motivation to learn and the better their Chinese proficiency.
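The weighted average of three reference vocal tract shapes mentioned above for articulatory synthesis can be sketched as follows. This is only an illustration: the summary does not say how the weights are obtained or how a vocal tract shape is parameterized, so the function name, the plain list-of-floats representation, and the example numbers are assumptions.

```python
def blend_consonant_target(ref_shapes, weights):
    """Weighted average of a consonant's reference vocal tract shapes.

    ref_shapes: dict mapping each corner vowel ("a", "i", "u") to a list of
                vocal tract parameters for the consonant in that context.
    weights:    dict mapping each corner vowel to a non-negative weight;
                the weights are normalized here so that they sum to 1.
    Both the representation and the normalization are assumptions of this sketch.
    """
    vowels = ("a", "i", "u")
    if any(weights[v] < 0 for v in vowels):
        raise ValueError("weights must be non-negative")
    total = sum(weights[v] for v in vowels)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    n = len(ref_shapes["a"])
    if any(len(ref_shapes[v]) != n for v in vowels):
        raise ValueError("reference shapes must have the same length")
    # Parameter-by-parameter convex combination of the three reference shapes.
    return [
        sum(weights[v] / total * ref_shapes[v][k] for v in vowels)
        for k in range(n)
    ]


# Made-up two-parameter shapes for one consonant in the three vowel contexts.
refs = {"a": [0.0, 0.0], "i": [1.0, 0.0], "u": [0.0, 1.0]}
target = blend_consonant_target(refs, {"a": 0.5, "i": 0.25, "u": 0.25})
print(target)
```

Normalizing the weights keeps the blended target inside the range spanned by the three reference shapes, which seems to be the intent of a weighted average over the corner vowels.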

