Posted: 23 Aug 2021 23:00

“Speech Synthesis” August 2021: summary from arXiv

Text to speech (TTS), or speech synthesis, which aims to synthesize natural and intelligible speech from text, is a hot research topic in the speech, language, and machine learning communities and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years.

One paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state of the art on inter-speaker and inter-text prosody transfer. Its results show that adversarial training correctly removes speaker identity information from the prosody representation, which ensures that Daft-Exprt consistently generates speech with the desired voice.

Expressive neural text-to-speech systems include a style encoder that learns a latent embedding as the style information. The rationale of one approach is that a well-designed downsample-upsample filter can force the style encoder to focus on style information rather than on the textual content of the reference speech, i.e., the extracted style embeddings can be downsampled at a certain interval and then upsampled by duplication.

Co-speech gesture generation aims to synthesize a gesture sequence that not only looks real but also matches the input speech audio. Motivated by the fact that speech cannot fully determine the gesture, the authors design a method that learns a set of gesture template vectors to model the latent conditions, which relieves the ambiguity.
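The downsample-upsample filter on style embeddings can be illustrated with a minimal sketch. It assumes the style encoder outputs one embedding per frame as a plain list of vectors. The function name `downsample_upsample` and the `interval` parameter are hypothetical, chosen for illustration, and are not taken from the paper.

```python
from typing import List, Sequence

Frame = Sequence[float]


def downsample_upsample(frames: Sequence[Frame], interval: int) -> List[Frame]:
    """Keep every `interval`-th frame, then repeat each kept frame
    `interval` times so the output has the same length as the input.

    Hypothetical helper: a sketch of the idea, not the paper's code.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    # Downsample: keep frames 0, interval, 2 * interval, ...
    kept = frames[::interval]
    # Upsample by duplication: repeat each kept frame `interval` times.
    upsampled = [frame for frame in kept for _ in range(interval)]
    # Trim the tail when the input length is not a multiple of `interval`.
    return upsampled[:len(frames)]


# Seven one-dimensional "style embeddings", filtered with interval 3.
embeddings = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
print(downsample_upsample(embeddings, 3))
# [[0.0], [0.0], [0.0], [3.0], [3.0], [3.0], [6.0]]
```

Because each kept embedding is held constant over `interval` frames, fine-grained frame-level detail (which tends to carry textual content) is discarded, while slowly varying information such as style can survive. The right interval is a design choice that this sketch does not address.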

Source texts:


