First, an introduction to speech synthesis technology

This article is reproduced from the AI Technology Base Camp public account.

Speech has played a huge role in human development. It is the external form of language and the symbol system that most directly records human thought. It is also one of the most basic and important means of communication by which people survive, develop, and take part in social activities. Making machines speak has been a human dream for thousands of years. Text To Speech (TTS) is the scientific practice through which people keep exploring and realizing this dream, and it is a technical field that the dream continually drives forward.

In this long process of exploration, the first synthesis systems of real practical significance appeared in the 1970s. Benefiting from advances in computer technology and signal processing, the first generation of parametric synthesis systems, the formant synthesis system, was born. It uses the formant information of different sounds to produce intelligible synthesized speech, but its overall sound quality still falls short of commercial requirements.

In the 1990s, storage technology developed rapidly, which gave rise to the concatenative (splicing) synthesis system. A concatenative system uses the PSOLA algorithm to adjust and join stored segments of the original recordings, and thereby achieves better sound quality than formant parametric synthesis.
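
The idea of joining stored segments can be illustrated with a small sketch. This is not PSOLA itself: real PSOLA places its windows at pitch marks and can change pitch and duration, whereas the code below only shows the overlap-add step at a seam, using a linear crossfade. The function name, the segment values, and the overlap length are all hypothetical.

```python
def crossfade_concat(a, b, overlap):
    """Join sample lists a and b, blending `overlap` samples at the seam."""
    if overlap < 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must fit inside both segments")
    start = len(a) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b, rising across the seam
        blended.append((1 - w) * a[start + i] + w * b[i])
    return a[:start] + blended + b[overlap:]

# Hypothetical stored segments (constant values make the blend easy to see).
unit_a = [1.0, 1.0, 1.0, 1.0]
unit_b = [0.0, 0.0, 0.0, 0.0]
joined = crossfade_concat(unit_a, unit_b, overlap=2)
print(joined)
```

The overlapped samples move gradually from the level of the first segment to the level of the second, which avoids an abrupt jump at the boundary.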

Since then, speech synthesis technology has developed continuously. The two main technical routes, parametric synthesis and concatenative synthesis, have both made great progress, competing with and promoting each other, which has greatly improved the quality of synthesized speech. Speech synthesis has been applied in many scenarios. Overall, the progress is mainly reflected in the following aspects:

From rule-driven to data-driven: Most early systems required a great deal of expert knowledge. Tuning pronunciation or acoustic parameters by hand was time-consuming and laborious, could hardly cover the range of different contexts, and to some extent held back practical deployment of the technology. As the technology developed, more and more data was applied to the system. Taking the speech synthesis voice library as an example, the number of pronunciation samples has grown greatly, from a few hundred sentences at first to thousands and then tens of thousands of sentences. Technology based on statistical models has advanced as well: from the initial tree models, hidden Markov models, and Gaussian mixture models to the neural network models of recent years, the ability of speech synthesis systems to describe speech has improved greatly.

Increasingly understandable and pleasant synthesis: The output of a speech synthesis system is generally assessed with subjective evaluation experiments, in which multiple participants score multiple speech samples. If the samples come from different systems, it is called a comparative evaluation. To improve sound quality, parametric synthesis systems have adopted the LPC synthesizer, the STRAIGHT synthesizer, and neural network vocoders represented by WaveNet; concatenative synthesis systems have adopted the strategy of continually enlarging the voice library and improving context coverage. Both have achieved significant results. Ideally, users want synthesized speech to sound like real human speech, and with the continuous development of technology this goal is getting closer and closer. In the extreme case, one set of samples comes from a synthesis system and another set comes from real human speech, so the comparative evaluation can be regarded as a Turing test for the speech synthesis system. If listeners cannot accurately tell which samples are machine-generated and which are human-produced, the synthesis system can be considered to have passed the Turing test.
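
The subjective evaluation described above can be sketched in a few lines. This is a minimal illustration, assuming a 1-to-5 rating scale (a common choice for opinion scores, although the text does not specify one) and entirely hypothetical listener data; the function names are illustrative as well.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average listener rating for one system (assumed 1-5 scale)."""
    return mean(ratings)

def discrimination_accuracy(guesses, truth):
    """Fraction of samples whose origin the listeners guessed correctly."""
    correct = sum(g == t for g, t in zip(guesses, truth))
    return correct / len(truth)

# Hypothetical ratings of two systems on the same sentences
# (a comparative evaluation).
system_a = [4, 5, 4, 3, 4]
system_b = [3, 4, 3, 3, 4]
mos_a = mean_opinion_score(system_a)
mos_b = mean_opinion_score(system_b)

# Hypothetical Turing-style test: listeners label each sample's origin.
truth = ["machine", "human", "machine", "human"]
guesses = ["human", "human", "machine", "machine"]
accuracy = discrimination_accuracy(guesses, truth)
print(mos_a, mos_b, accuracy)
```

An accuracy close to 0.5 on a balanced set means listeners are doing no better than guessing, which is the situation the text describes as passing the Turing test.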

Text processing capabilities continue to improve: When a person reads text aloud, there is in fact a process of understanding, and this process is essential if a machine is to read well. A speech synthesis system generally includes a text-processing front end, which handles numbers and symbols, word segmentation, and polyphonic characters (characters with more than one pronunciation) in the input text. By using massive text data and statistical modeling techniques, text processing in synthesis systems is already good enough for commercial use in most scenarios. Furthermore, natural language understanding techniques can be used to predict the focus, mood, and intonation of a sentence. However, because these are strongly influenced by context and such data is relatively scarce, the emotion-related technology is not yet mature.
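
A toy version of such a front end might look like the following. It is only a sketch under stated assumptions: it works on English text, reads numbers digit by digit, knows two symbols, and covers only number and symbol handling and word segmentation. The tables and function names are hypothetical, and production front ends rely on statistical models rather than lookup tables like these.

```python
import re

# Hypothetical lookup tables for the sketch.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
SYMBOLS = {"%": " percent ", "&": " and "}

def normalize(text):
    """Spell out digits one by one and replace known symbols with words."""
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    return text

def segment(text):
    """Split normalized text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

tokens = segment(normalize("Reading & education cost 25% less"))
print(tokens)
```

Reading "25" as "two five" is a deliberate simplification; deciding whether a number is a quantity, a year, or a phone number is exactly the kind of context-dependent choice that the front end has to make.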

Speech synthesis system block diagram

The above is an overview of the development of speech synthesis technology. Next, let's explore the impact of deep learning on speech synthesis in recent years.

Second, deep learning and speech synthesis

The impact of deep learning on speech synthesis falls into two main stages:

The first stage: icing on the cake. Since 2012, deep learning has gradually gained attention and application in the speech field. At this stage, its main role was to replace the original statistical models and strengthen the descriptive power of the model, for example using a DNN in place of the duration model or an RNN in place of the acoustic parameter model. Speech generation itself still relied on concatenative synthesis or a vocoder, with no essential difference from earlier systems. A careful comparison of the two kinds of system shows that the replaced system is slightly better than the original, but the overall impression is not much different, and there was no qualitative leap.

The second stage: a new path. Much of the research at this stage is groundbreaking and represents a major innovation in speech synthesis. In 2016, a landmark paper was published proposing WaveNet. At the beginning of 2017, another landmark paper was published proposing the end-to-end Tacotron approach. At the beginning of 2018, Tacotron2 merged the two to form the benchmark system in the field of speech synthesis. Along the way, many valuable works such as DeepVoice, SampleRNN, and Char2Wav were published one after another, which greatly promoted the development of speech synthesis and attracted more and more researchers to the field.

Dilated convolution structure in WaveNet

WaveNet was inspired by PixelRNN and is a successful attempt to apply an autoregressive model to generating time-domain waveforms. The sound quality of speech generated by WaveNet greatly surpassed that of earlier parametric synthesis, and some synthesized sentences are even hard to tell apart from real speech, which caused a huge sensation. The dilated convolutions it uses greatly enlarge the receptive field, meeting the requirements of modeling audio time-domain signals at high sampling rates. The advantages of WaveNet are obvious, but because it uses the first N-1 samples to predict the Nth sample, generation is very inefficient, which is its clear disadvantage. Parallel WaveNet and ClariNet were later proposed to solve this problem. The idea is to use neural network distillation, training a parallelizable IAF model from a pre-trained WaveNet model, so as to achieve real-time synthesis while keeping sound quality close to natural speech.
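
To see why dilated convolution enlarges the receptive field so effectively, the following sketch compares a stack of causal convolutions with and without dilation. The kernel size of 2 and the dilations 1, 2, 4, 8 are assumed for illustration, and the function names are hypothetical; this is a plain-Python toy, not WaveNet's actual implementation.

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation]; samples before t=0 are zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

dilated = receptive_field(2, [1, 2, 4, 8])  # dilation doubles per layer
plain = receptive_field(2, [1, 1, 1, 1])    # same depth, no dilation

# Empirical check: push a unit impulse through the four dilated layers
# and count how many output positions it reaches.
signal = [1.0] + [0.0] * 31
for d in [1, 2, 4, 8]:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
touched = sum(1 for v in signal if v != 0.0)
print(dilated, plain, touched)  # 16 5 16
```

With the same four layers, doubling the dilation at each layer covers 16 samples instead of 5, and the coverage keeps doubling with every added layer. The inner loop also shows where the inefficiency comes from at generation time: each output depends only on earlier samples, so samples have to be produced one after another.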

Tacotron is a representative end-to-end speech synthesis system. Unlike earlier synthesis systems, an end-to-end system can be trained directly on pairs of text and the corresponding recorded speech, without extensive expert knowledge or specialized processing. This greatly lowers the threshold for entering the field and provides a new catalyst for the rapid development of speech synthesis.

Tacotron's end-to-end network architecture

Tacotron takes text symbols as input and outputs a magnitude spectrum, then reconstructs the signal with Griffin-Lim to produce high-quality speech. The core of Tacotron is an encoder-decoder model with an attention mechanism, a typical seq2seq structure. With this structure, the local alignment between speech and text no longer has to be handled separately, which greatly reduces the difficulty of preparing training data. Because the Tacotron model is fairly complex, its parameters and attention mechanism can be fully exploited to characterize the sequence more accurately and make the synthesized speech more expressive. Compared with the sample-by-sample modeling of WaveNet, Tacotron models speech frame by frame, so synthesis is far more efficient and the model has some product potential, but its sound quality is lower than WaveNet's.
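
The efficiency gap between sample-level and frame-level modeling is easy to quantify with a little arithmetic. The sketch below counts sequential generation steps for one second of audio, assuming a 16 kHz sampling rate and a 12.5 ms frame shift. Both values are illustrative assumptions rather than figures from the text, and the function name is hypothetical.

```python
def autoregressive_steps(duration_s, sample_rate_hz, frame_shift_ms):
    """Sequential steps needed for sample-level vs frame-level generation."""
    sample_steps = round(duration_s * sample_rate_hz)
    frame_steps = round(duration_s * 1000 / frame_shift_ms)
    return sample_steps, frame_steps

# Assumed values: 1 second of audio, 16 kHz sampling, 12.5 ms frame shift.
sample_steps, frame_steps = autoregressive_steps(1.0, 16000, 12.5)
ratio = sample_steps / frame_steps
print(sample_steps, frame_steps, ratio)  # 16000 80 200.0
```

Under these assumptions a frame-level model takes 200 times fewer sequential steps. Each frame step does more work than a single sample step, so this is a rough indication of the gap rather than a measured speedup.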

Tacotron2 is a natural result of fusing Tacotron and WaveNet: it makes full use of the end-to-end synthesis framework while using a high-quality speech generation algorithm. In this framework, a Tacotron-like structure generates a mel spectrum that serves as the input to WaveNet, and WaveNet is reduced to the role of a neural network vocoder. Together they form an end-to-end, high-quality system.
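
For readers unfamiliar with the mel spectrum, the sketch below shows one common definition of the mel scale and how bands spaced evenly in mel map back to hertz. The text does not say which mel formula or band layout Tacotron2 uses, so the formula variant, the 0 to 8000 Hz range, and the band count here are assumptions chosen for illustration; the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common formula variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(f_min, f_max, n_bands):
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

# Assumed range and band count, for illustration only.
centers = mel_band_centers(0.0, 8000.0, 5)
print([round(c) for c in centers])
```

The centers are packed closely at low frequencies and spread out at high frequencies, which mirrors how human hearing resolves pitch and yields a much more compact representation than a linear-frequency spectrum.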

Tacotron 2 network structure

Third, the application of speech synthesis

Speech synthesis technology has been successfully applied in many fields, including voice navigation and information broadcasting. Because Biaobei Technology is both a voice and data service provider and a provider of complete speech synthesis solutions, it has thought a great deal about the application prospects of speech synthesis and has its own views on them. At present, synthesized voices can meet the needs of most users in terms of synthesis quality, but the choice of timbre is not rich enough and the delivery is still monotonous. In response, Biaobei Technology has launched the "Sound Supermarket" to give partners a choice of voices on a what-you-hear-is-what-you-get basis. We believe that, with synthesis quality closer to the needs of each scenario, speech synthesis will be widely used in the following three scenarios: voice interaction, reading & education, and pan-entertainment.

Voice interaction

In recent years, with the spread of the concept of artificial intelligence, voice interaction has become a hot topic, and applications such as intelligent assistants and intelligent customer service have emerged one after another. Voice interaction relies on three key technologies: speech recognition, speech synthesis, and semantic understanding, and the role of speech synthesis among them is obvious. Limited by the current level of semantic understanding technology, applications are mainly focused on specific vertical domains, where they solve problems in particular fields, and they still have certain limitations.

Reading & Education

Reading is a long-term and wide-ranging need. Every day we take in a great deal of information through reading, including both fragmented information and in-depth reading. The material includes news, social media feeds, blog posts, novels, and classic works. Some reading is for keeping up socially, some is for killing time, and some is for self-improvement. Among these varied information needs, speech synthesis offers a "simple" way, a "parallel" way to take in information, and a "cheap" way. Compared with traditional reading, it has its own advantages: information can be obtained easily while driving, walking, or exercising.

In education, especially language education, imitation and interaction are essential forms of practice. With current teaching methods, learning standard pronunciation is costly, requiring things like extracurricular classes or even one-on-one tutoring. As speech synthesis continues to advance, it can on the one hand greatly increase the amount of audio teaching material, and on the other hand partially replace the conversational practice that now requires a human partner.

Pan-entertainment

Pan-entertainment is a scenario that has so far had little overlap with speech synthesis, but we think it is a huge market waiting to be developed. We already have a wealth of voice IP resources, which can be showcased through the Sound Supermarket so that everyone can buy their favorite voice. These are all preparations for the widespread application of speech synthesis in pan-entertainment. Taking dubbing as an example, speech synthesis can greatly reduce the cost and turnaround time of voice-over work. Taking today's short videos as an example, speech synthesis makes it very easy to pair video content with entertaining voices. Taking virtual hosts as an example, speech synthesis can improve the timeliness of information while greatly easing the host's workload.

In short, with the rapid development of speech synthesis technology, generated speech will become more and more natural and vivid, and more and more emotionally expressive. We firmly believe that technological progress will keep breaking through existing obstacles, meet the needs of more and more users, allow better applications to keep emerging, and realize the beautiful vision of changing lives with sound!