
Training an RNN differs from training an ordinary feedforward neural network: during backpropagation through time the same weight matrix is multiplied over and over, so a small deviation of its eigenvalues from 1 is amplified over long time spans and the gradient vanishes or explodes. The orthogonal initialization and the choice of activation function described earlier both aim to keep the eigenvalues of the parameter matrix as close to 1 as possible. Here we introduce an indirect, "work around the problem" approach: rather than constraining the eigenvalues of the weights directly, we re-parameterize the neuron and build a linear self-looping path that removes the recurrent weight W of the plain RNN. This is the long short-term memory unit, LSTM (long short-term memory).
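To make this concrete, here is a minimal sketch of the backpropagated gradient in a plain RNN; the notation is assumed here, since the original formulas are not reproduced. With hidden state $h_t = f(W h_{t-1} + U x_t)$, the gradient over a span of time steps factorizes into a product of Jacobians,

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \operatorname{diag}\big(f'(\cdot)\big)\, W,$$

so eigenvalues of W slightly below 1 shrink the product exponentially (vanishing gradient), and eigenvalues slightly above 1 blow it up (exploding gradient).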

There is a widespread misunderstanding: this unit does not contain a long-term memory plus a short-term memory; it is a "long short-term memory", that is, a short-term memory that is kept for a long time. Short-term memory here refers to the psychological notion of working memory. As the English name shows, it is long short-term memory, not long- and short-term memory.

In the earlier article "How to Understand the Recurrent Structure of Neural Networks", we clarified the nature of the recurrent structure: the same network is applied to the data at every time step, but the output of one or several layers is stored and fed into the next time step. The structure that does this storing is the memory unit:

As the simple RNN structure shows, we feed in the serialized data step by step. The hidden layer at the previous time step stores what it has computed in the memory unit; the memory unit is then multiplied by the matrix W and fed into the hidden layer of the next time step.

In the figure above, the information Ct stored in the memory unit at the next time step is:
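The formula itself appears only as an image in the original; under the setup just described (input $x_t$, activation $f$, recurrent weight $W$, and an assumed input weight $U$), it reads

$$C_t = f\big(W\,C_{t-1} + U\,x_t\big).$$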

We now remove the weight parameter W and re-parameterize this memory unit:

  • Storing into the memory unit is parameterized by a gate: only when the gate is open is the information written into the memory unit. This gate is called the input gate (Input Gate).
  • Whether the memory unit passes its information on to the next time step is also controlled by a gate: only when that gate is open does the information flow out. This gate is called the output gate (Output Gate).

The memory unit is not a separate entity; it can be embedded in the hidden layer itself. Adding these two gates to control input and output gives us a basic structure:

The input and output gates are controlled by the sigmoid function. Its output lies in [0, 1], which describes the open or closed state of a gate well, and the value itself indicates how far the gate is open. We write the result of the input gate as Fi and the result of the output gate as Fo. To let the input gate's state act on the input, we multiply the input by Fi directly before it enters the memory; the state of the memory is denoted C:
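The accompanying formulas are also images in the original; a minimal reconstruction, writing $Z$ for the (transformed) current input and leaving the arguments of the sigmoids unspecified, since they are not visible in the reproduction:

$$F_i = \sigma(\cdot), \qquad F_o = \sigma(\cdot), \qquad F_i, F_o \in [0,1], \qquad C = F_i \cdot Z.$$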

The value stored in memory is thus scaled by the input gate; when the gate is fully closed, no information flows in. At output time we multiply by the result of the output gate. We also want the input and output gates to be as independent as possible: if the output were just the direct product of the two gates and the input, a small input gate would force a small output even when the output gate is wide open. So we first apply a function g to the content of the memory, and only then let the output gate act:
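In the same notation (the symbol $a$ for the output is my addition), the gated output is

$$a = F_o \cdot g(C).$$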

At this point we have a more complex neuron: the input gate controls the inflow of information and the output gate controls the outflow. It may now seem that the memory unit itself is unnecessary, but this is exactly where LSTM departs from the RNN: the plain RNN must use the weight W to control how information flows across time, while the LSTM uses no weight on this path and simply adds:
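A sketch of this additive update, with $Z_t$ again standing for the candidate input at step $t$ (no forget gate yet):

$$C_t = C_{t-1} + F_i \cdot Z_t.$$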

As the sequence gets longer and the time step index grows, the memory of every earlier step keeps flowing into the next memory, so the value stored in the memory unit keeps growing. There are two possible consequences (a small numerical example follows the list):

  • If g is a squashing activation function, an excessively large value keeps the activation permanently saturated, and the unit loses its ability to learn.
  • If g is a ReLU-type function, the value itself becomes very large, which defeats the output gate: even though the gate's value is small, multiplying it by a huge memory still yields a large output.
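A rough numerical illustration of both cases (the numbers are invented for the example): if each step writes roughly $+1$ into the memory, then after 200 steps $C \approx 200$, and

$$g(C)=\tanh(200)\approx 1 \quad(\text{permanently saturated}), \qquad F_o \cdot C = 0.01 \times 200 = 2 \quad(\text{large output despite a nearly closed gate}).$$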

Either way, we need to be able to discard some of the information in the memory unit. LSTM's solution is to add a forget gate to the unit, that is, to re-parameterize the memory once more: the information already held in the memory unit is multiplied by the forget gate's result Ff before being kept, so the current memory becomes:

You can write the formula as follows:
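With the notation used so far, the formula (shown only as an image in the original) is

$$C_t = F_f \cdot C_{t-1} + F_i \cdot Z_t.$$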

Note that if the sigmoid function is used for the forget gate, then a forget-gate value of 1 means the information from the previous step is kept intact, which is exactly the opposite of the gate's name: when the forget gate is closed (near 0) it forgets, and when it is open (near 1) it remembers. (A bit of a tongue-twister.)

The whole process is: the data at the current time step is multiplied by the result of the input gate; the memory unit of the previous step is multiplied by the result of the forget gate; the two are added together; and the result, after g, is multiplied by the result of the output gate to give the output passed to the next layer. At the same time, the memory unit itself takes part in the computation at the next time step.
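Collecting the pieces into the standard form (the exact gate parameterization is not shown in the reproduced figures, so the arguments of the gates are assumed):

$$C_t = F_f \odot C_{t-1} + F_i \odot \tilde{C}_t, \qquad H_t = F_o \odot g(C_t),$$

where $\tilde{C}_t$ is the transformed current input and $F_f$, $F_i$, $F_o$ are sigmoid functions of the current input $x_t$ and the previous output $H_{t-1}$.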

This is the LSTM schematic everyone likes to show. If you have followed the content so far, you can walk through it with me: 1 marks the forget gate, 2 the input gate, and 3 the output gate; the symbols in the middle, ➕ and ✖️, are the two operations. Reading from left to right: the memory unit C(t-1) of the previous step is multiplied by the forget gate, the product of the input gate and the input is added to it, and the result is the current memory unit Ct; Ct flows along the arrow into the memory computation of the next time step. At the same time, Ct is multiplied by the output gate to give the output Ht, and Ht flows along the arrow into the input computation of the next time step.
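To make the walk-through concrete, here is a minimal NumPy sketch of one LSTM step in the article's notation (F_f, F_i, F_o, Ct, Ht). The parameter shapes, the use of tanh both for g and for the candidate input, and all helper names are assumptions made for the sketch, not something specified in the article:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: F_f = forget gate, F_i = input gate,
    F_o = output gate, c = memory unit (Ct), h = output (Ht)."""
    Wf, Uf, bf = params["f"]   # forget-gate parameters
    Wi, Ui, bi = params["i"]   # input-gate parameters
    Wo, Uo, bo = params["o"]   # output-gate parameters
    Wc, Uc, bc = params["c"]   # candidate-input parameters

    F_f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)   # 1: forget gate
    F_i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)   # 2: input gate
    F_o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)   # 3: output gate
    z_t = np.tanh(Wc @ x_t + Uc @ h_prev + bc)   # transformed current input

    c_t = F_f * c_prev + F_i * z_t               # additive memory update, no weight matrix on this path
    h_t = F_o * np.tanh(c_t)                     # apply g (here tanh) to the memory, then gate the output
    return h_t, c_t

# Toy usage: a length-5 sequence of 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = {k: (rng.standard_normal((n_hid, n_in)),
              rng.standard_normal((n_hid, n_hid)),
              np.zeros(n_hid))
          for k in "fioc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

The key line is the memory update `c_t = F_f * c_prev + F_i * z_t`: the path from the previous memory to the current one involves no weight matrix, only the forget gate.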

Having understood the LSTM workflow, we naturally ask how it solves the long-term dependency problem. Some say it is because the unit contains both short-term and long-term memory; others look at the diagram and say it is because the information of the previous layer flows into the next layer without loss. Both statements are wrong.

The real answer is the removal of the weight W described above. In an ordinary RNN, as the time steps accumulate, the same memory unit has to store more and more information, and a weight parameter W is used to learn what to keep and what to discard, but that is exactly what creates the long-term dependency problem. The most essential part of the whole LSTM is the design of the forget gate: it solves the problem of information redundancy without the weight W. The factor connecting the memory at adjacent time steps changes from the weight W to the forget gate's output, as the formulas below show.
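Ignoring, for the sketch, the gates' own dependence on the previous state, the factor that backpropagation multiplies at each step changes from

$$\frac{\partial C_t}{\partial C_{t-1}} \propto W \quad(\text{plain RNN}) \qquad\text{to}\qquad \frac{\partial C_t}{\partial C_{t-1}} = F_f \quad(\text{LSTM}),$$

so over many steps the accumulated factor is a product of forget-gate values, $\prod_k F_f^{(k)}$, rather than repeated powers of W.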

When the forget gate is open, its value is close to 1, and the gradient vanishes with much lower probability as it is propagated backwards repeatedly. In practice we therefore usually want to make sure the forget gate stays open most of the time.