Understanding speech recognition in one article

What is speech recognition? What is its value, and how does it work? This article answers common questions about speech recognition.


What is Speech Recognition Technology (ASR)?

For a machine to hold a conversation with a person, it has to carry out three steps, which correspond to the work of the "ears", the "brain", and the "mouth".

For the machine to understand human speech in the first place, speech recognition technology (ASR) is indispensable.

Speech recognition use cases

Speech recognition has become a common technology that many people use in their daily lives:

  • Apple users have probably tried Siri, which is a typical speech recognition application.
  • WeChat has a "voice-to-text" function, which also uses speech recognition.
  • The smart speakers that have recently become popular are built around speech recognition.
  • Most newer cars offer voice control, which also relies on speech recognition.


Speech recognition technology

Speech recognition can be broken down into four stages: input, encoding, decoding, and output.

How does speech recognition work?

First of all, sound itself is a wave, which is why audio is commonly represented as a waveform.
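To make the idea of sound as a wave concrete, here is a minimal Python sketch that builds a short waveform as a list of amplitude samples. The function name `sine_wave`, the 16 kHz sample rate, and the 440 Hz tone are illustrative assumptions, not details from the original article.

```python
import math

def sine_wave(freq_hz, duration_s, sample_rate=16000):
    """Return amplitude samples of a pure tone (a very simple sound wave)."""
    n_samples = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]

# 10 ms of a 440 Hz tone, sampled 16,000 times per second (assumed values).
samples = sine_wave(440.0, 0.01)
print(len(samples))  # 160
```

Real recordings are far less regular than a pure tone, but they are stored the same way: as a sequence of amplitude values sampled at a fixed rate.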

Recognition then proceeds in the following steps:

  1. After signal processing, the audio is split into frames (on the order of milliseconds), and each framed waveform segment is converted into a multi-dimensional vector based on the characteristics of the human ear.
  2. Each frame is identified as a state (an intermediate unit that is smaller than a phoneme).
  3. The states are combined into phonemes (usually 3 states = 1 phoneme).
  4. Finally, the phonemes are combined into words (dà jiā hǎo), and the words are concatenated into sentences. This is how speech is converted into text.
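The four steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how a production recognizer works: the 25 ms frame length and 10 ms hop, the two-number feature vector (log energy and zero-crossing rate, standing in for the ear-inspired features the text describes), the state names such as `d1`, and the tiny `LEXICON` are all hypothetical. Real systems pick states, phonemes, and words by probabilistic search rather than by the fixed lookups used here.

```python
import math
from itertools import groupby

SAMPLE_RATE = 16000  # assumed sampling rate in Hz

def split_into_frames(samples, frame_ms=25, hop_ms=10):
    """Step 1a: cut the waveform into overlapping millisecond-level frames."""
    frame_len = SAMPLE_RATE * frame_ms // 1000
    hop_len = SAMPLE_RATE * hop_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop_len)]

def frame_features(frame):
    """Step 1b: turn one frame into a toy feature vector."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [math.log(energy + 1e-10), crossings / (len(frame) - 1)]

def states_to_phonemes(frame_states, states_per_phoneme=3):
    """Steps 2-3: merge repeated per-frame states, then group 3 states into 1 phoneme."""
    states = [state for state, _ in groupby(frame_states)]
    # Hypothetical naming scheme: states "d1", "d2", "d3" belong to phoneme "d".
    return [states[i][:-1] for i in range(0, len(states), states_per_phoneme)]

# Hypothetical pronunciation lexicon: phoneme sequence -> word.
LEXICON = {("d", "a"): "dà", ("j", "i", "a"): "jiā", ("h", "ao"): "hǎo"}

def phonemes_to_words(phonemes, lexicon):
    """Step 4: greedily match the longest phoneme sequence found in the lexicon."""
    words, i = [], 0
    max_len = max(len(key) for key in lexicon)
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + n])
            if key in lexicon:
                words.append(lexicon[key])
                i += n
                break
        else:
            raise ValueError(f"no word matches the phonemes at position {i}")
    return words

# Step 1 on 100 ms of a synthetic 440 Hz tone.
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(1600)]
frames = split_into_frames(tone)
vectors = [frame_features(f) for f in frames]
print(len(frames), len(frames[0]), len(vectors[0]))  # 8 400 2

# Steps 2-4 on made-up state labels, as if a model had labelled nine frames.
frame_states = ["d1", "d1", "d2", "d3", "d3", "a1", "a2", "a2", "a3"]
phonemes = states_to_phonemes(frame_states)
print(phonemes)                              # ['d', 'a']
print(phonemes_to_words(phonemes, LEXICON))  # ['dà']
```

The frames overlap on purpose: with a hop shorter than the frame length, sounds that fall on a frame boundary still appear whole in a neighbouring frame.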


Baidu Encyclopedia version

Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the lexical content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It differs from speaker recognition and speaker verification, which attempt to identify or verify the person speaking rather than the lexical content of what is said.

Wikipedia version

Speech recognition is an interdisciplinary subfield of computational linguistics that develops methods and technologies enabling computers to recognize spoken language and translate it into text. It is also known as Automatic Speech Recognition (ASR), computer speech recognition, or speech to text (STT). It combines knowledge and research from linguistics, computer science, and electrical engineering.

Some speech recognition systems require "training" (also called "enrollment"), in which an individual speaker reads text or isolated vocabulary into the system. The system analyzes that person's specific voice and uses it to fine-tune recognition of their speech, thereby improving accuracy. Systems that do not use training are called "speaker-independent" systems. Systems that use training are called "speaker-dependent" systems.

