Motivation for recurrent neural networks

Suppose we use a traditional language model for a simple prediction task on the following sentence:

Core Reading is an excellent public account with lots of useful content. I often look for information and knowledge in ___.

If we use an n-gram model to predict the word in the blank, say a 2-gram model, it will output the word with the highest probability of following "in". Although "Core Reading" looks like the obvious choice to us, after training on a large corpus the probability of "Core Reading" appearing after "in" is bound to be extremely low, because common phrases such as "in school" or "at home" dominate. Of course, we could keep enlarging the context the model sees until it covers the mention of Core Reading earlier in the sentence, but this idea does not work in practice: the larger n is, the longer the Markov chain and the more severe the data sparsity, making computation intractable. More importantly, n must be fixed in advance, so the model cannot handle sentences of arbitrary length.
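The limitation above can be made concrete with a tiny bigram model. This is a minimal sketch over a made-up toy corpus (the words and counts are assumptions for illustration, not real training data):

```python
from collections import Counter, defaultdict

# Toy corpus: a 2-gram model only ever sees one word of context.
corpus = (
    "i often look for information in core-reading . "
    "i stay in home . we stay in home . they stay in home ."
).split()

# Count the successors of each word.
successors = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    successors[prev][word] += 1

# The 2-gram prediction after "in" is its most frequent successor,
# "home", even though the earlier context points to "core-reading".
prediction = successors["in"].most_common(1)[0][0]
print(prediction)  # -> home
```

Because the model conditions only on the single preceding word, no amount of earlier context can change its answer; enlarging n helps only at the cost of the sparsity and fixed-length problems described above.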

A network with a cyclic structure, called a recurrent neural network (RNN), solves this problem very effectively. In theory it can handle sequences of any length, and since it does not need n fixed in advance, it is far more flexible.

Convolutional neural networks can also handle sequence problems; when we only care about whether certain features appear, not where they appear, a CNN has real advantages. But consider natural-language data such as a sentiment-classification task, where we label a text as positive or negative. A CNN may reduce a positive review to a few indicator words, such as "like" and "beautiful". Now take a sentence like:

It's not that I don't like this movie, and it can't be called unbeautiful, but...

We immediately recognize this as a negative review. A model that only extracts keyword features and ignores the relationships between words might see "like" and classify the text as positive, see "don't like" and call it negative, then see "not that I don't like" and flip back to positive. Classifying by keywords alone is clearly unreasonable; here we need a language model that understands the meaning of the whole sentence.
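To make this failure mode concrete, here is a hypothetical bag-of-keywords classifier (the word lists and review text are invented for illustration). Because it ignores word order, negation is invisible to it:

```python
# A keyword-based classifier that ignores word order entirely
# (hypothetical word lists, for illustration only).
POSITIVE = {"like", "beautiful"}
NEGATIVE = {"boring", "terrible"}

def keyword_sentiment(text):
    words = text.lower().replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative"

# The classifier sees "like" and "beautiful" and ignores both "not"s:
review = "i do not like this movie and it is not beautiful"
print(keyword_sentiment(review))  # -> positive, though the review is clearly negative
```

A model that processes the sentence in order, as an RNN does, can learn that "not" flips the meaning of what follows.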

Understanding recurrent neural networks

In learning materials we often see this RNN diagram:

This diagram is unfriendly to beginners and easily breeds misunderstandings. It does show the time-series dependence: the arrow from s_{t-1} to s_t indicates that information flows directly from earlier neurons to later ones. But many questions about the picture go unanswered, such as:

  • Is all the data entered at the same time?
  • At the very beginning, does the first unit also accept input from a previous neuron? If so, the first neuron has no predecessor, so where does this input come from?
  • Why are the weight matrix W for propagation between hidden units, the weight matrix U from the input to the hidden layer, and the weight matrix V from the hidden layer to the output the same at every time step?

Although the picture above is the widely used schematic, it is in fact not the best way to understand the recurrent structure.

To help you understand the recurrent structure better, I drew a few sketches. Imagine a simple feedforward neural network with only one neuron in each of the input, hidden, and output layers, as shown below. The circles here carry no abstraction: one circle is one neuron:

When we feed in the first data point of the time series, we store the value of the hidden layer in a memory unit. This memory unit is not a physical entity but simply a structure for storing and retrieving values:

The memory unit now stores the hidden-layer features of x1. When we feed in the second data point, note that the structure of the network is unchanged, but we also feed the contents of the memory into the hidden layer. The hidden layer thus combines the second data point with the features of the first to produce its output, and its new value is written back into the memory and saved again:

When the third data point arrives, the hidden layer reads from the memory the information accumulated in the previous step, and the newly produced information is saved in turn as input for the next time step:

In this way we arrive at the familiar recurrent structure. To express the influence of past information on the present, we parameterize the memory with a weight matrix W when it is fed back into the hidden layer, so the update becomes s_t = f(U x_t + W s_{t-1}):

So far we have read each circle as a single neuron. If instead we read each circle as a whole layer of neurons, that is, abstract it into a structure in which the input, hidden layer, and output are all vectors, and then unroll it over time, we get:
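The unrolled structure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the dimensions and random weights are assumptions, the memory is initialized to zeros, and the same U, W, V are reused at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 3, 2

# One set of weights, shared across every time step.
U = rng.normal(size=(hidden_dim, input_dim))   # input  -> hidden
W = rng.normal(size=(hidden_dim, hidden_dim))  # memory -> hidden
V = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output

def rnn_forward(xs):
    """Unroll the loop over a sequence of input vectors xs."""
    h = np.zeros(hidden_dim)           # empty memory before the first step
    outputs = []
    for x in xs:                       # data points enter one at a time
        h = np.tanh(U @ x + W @ h)     # mix the new input with stored memory
        outputs.append(V @ h)          # h overwrites the memory for the next step
    return outputs, h

xs = [rng.normal(size=input_dim) for _ in range(5)]
ys, h_final = rnn_forward(xs)
print(len(ys), ys[0].shape)  # one output per time step
```

Note that the loop runs for as many steps as the sequence has elements, which is exactly why an RNN needs no fixed n.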

On this basis, we can answer the questions above:

  • Is all the data entered at the same time? Clearly not: the data enters in order, one time step at a time. When handling serialized data, we often use sliding windows to construct sequences of different lengths.
  • At the very beginning, does the first unit also accept input from a previous neuron? If so, the first neuron has no predecessor, so where does this input come from?
    When we feed in the first data point, we must set an initial value for the memory unit. This initial value can be treated as a learnable parameter, or simply set to zero to indicate that no information precedes it.
  • Why are the weight matrix W propagating between hidden units, the weight matrix U from the input to the hidden layer, and the weight matrix V from the hidden layer to the output the same at every time step?
    This is a very important point that cannot be inferred from the diagram alone. Sharply reducing the number of parameters is one reason, but not the essential one. Set W aside and consider only the ordinary feedforward part: the same network is applied to every element of the sequence, just as a CNN uses a parameter-shared convolution kernel to extract the same feature everywhere. In an RNN, the shared U and V guarantee that the same input produces the same output. For example, the word "puppy" may appear many times in a text; parameter sharing makes the network produce the same hidden state and output each time "puppy" is input, given the same preceding memory.
    As for the matrix W, parameter sharing in W ensures that the same preceding context produces the same subsequent behavior.
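A tiny numeric check of the sharing argument: with fixed U, W, V, feeding the same (hypothetical) "puppy" vector with the same stored memory twice yields identical hidden states and outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(3, 4))  # shared input-to-hidden weights
W = rng.normal(size=(3, 3))  # shared hidden-to-hidden weights
V = rng.normal(size=(2, 3))  # shared hidden-to-output weights

def step(x, h_prev):
    h = np.tanh(U @ x + W @ h_prev)
    return h, V @ h

puppy = rng.normal(size=4)   # stand-in embedding for "puppy"
h = np.zeros(3)

# The same word with the same preceding memory always produces the
# same hidden state and output, because U, V, W never change over time.
h1, y1 = step(puppy, h)
h2, y2 = step(puppy, h)
assert np.allclose(h1, h2) and np.allclose(y1, y2)
```

If the weights were different at each time step, nothing would force the network to treat repeated occurrences of the same word consistently.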