Understanding reinforcement learning in one article

Reinforcement learning is one of the learning paradigms of machine learning, alongside supervised learning and unsupervised learning. This article introduces the basic concepts of reinforcement learning, its application scenarios, and the mainstream reinforcement learning algorithms and how they are classified.


What is reinforcement learning?

Reinforcement learning is not a specific algorithm, but a general term for a class of algorithms.

By way of comparison, it sits alongside supervised learning and unsupervised learning as one of the overall learning paradigms.

Reinforcement learning is one of the learning methods of machine learning

The idea behind reinforcement learning algorithms is very simple. Take games as an example: if adopting a certain strategy in a game earns a higher score, that strategy is reinforced further in the hope of continuing to achieve good results. This is very similar to the "performance rewards" of everyday life, and it is how we often improve at games ourselves.

In the game Flappy Bird, simple taps control the bird so that it avoids the pipes and flies as far as possible, because the farther it flies, the more points it is rewarded with.

This is a typical reinforcement learning scenario:

  • The machine has a clearly defined role, the bird - agent
  • It needs to make the bird fly as far as possible - goal
  • It needs to avoid the pipes throughout the game - environment
  • The way to avoid the pipes is to make the bird flap upward - action
  • The farther the bird flies, the more points it earns - reward

The game is a typical reinforcement learning scenario

You will find that the biggest difference between reinforcement learning and supervised or unsupervised learning is that it does not need a large pre-collected dataset to be fed in. Instead, the agent generates its own experience and learns skills through continual trial and error.
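
The agent, environment, action and reward roles described above can be sketched as a small interaction loop. This is only an illustrative sketch: `ToyBirdEnv`, `run_episode`, `hover_policy` and `random_policy` are hypothetical names invented here, and the environment is a heavily simplified stand-in for Flappy Bird, not the real game.

```python
import random


class ToyBirdEnv:
    """Hypothetical, heavily simplified Flappy-Bird-like environment.

    The bird has an integer height. Flapping raises it by 1, doing nothing
    lowers it by 1. Touching the floor (0) or the ceiling (4) ends the
    episode. Every step survived earns a reward of 1.
    """

    def __init__(self, max_steps=20):
        self.max_steps = max_steps
        self.height = 2
        self.steps = 0

    def reset(self):
        self.height = 2
        self.steps = 0
        return self.height

    def step(self, action):
        # action: 1 = flap, 0 = do nothing
        self.height += 1 if action == 1 else -1
        self.steps += 1
        crashed = self.height <= 0 or self.height >= 4
        done = crashed or self.steps >= self.max_steps
        reward = 0.0 if crashed else 1.0
        return self.height, reward, done


def run_episode(env, policy):
    """Let the agent (policy) interact with the environment for one episode."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # reward accumulates
    return total_reward


def random_policy(state):
    # Pure trial and error: ignore the state and act at random.
    return random.choice([0, 1])


def hover_policy(state):
    # A hand-written good strategy: flap when low, fall when high.
    return 1 if state <= 2 else 0
```

Calling `run_episode(ToyBirdEnv(), random_policy)` usually ends early with a low total, while `hover_policy` survives all 20 steps. A reinforcement learning algorithm would have to discover something like `hover_policy` from the rewards alone.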


Application scenarios for reinforcement learning

Reinforcement learning is not yet mature, and its application scenarios are relatively limited. The biggest application area is games.


Reinforcement learning is most widely used in games

2016: AlphaGo defeats Lee Sedol. AlphaGo Zero, trained with reinforcement learning, later needed only 40 days to beat its predecessor AlphaGo Master.

"What does AlphaGo Zero, hailed by scientists as a "world feat", mean for ordinary people?"

January 25, 2019: AlphaStar defeats top human professional players 10:1 in StarCraft 2.

"StarCraft 2: humans lose 1:10 to AI! The evolution of DeepMind's "AlphaStar""

April 13, 2019: OpenAI defeats the human world champions in a Dota 2 match.

"2:0! Dota 2 world champion OG crushed by OpenAI"



Reinforcement learning also has many applications in the field of robotics.

Robots are much like the "agents" of reinforcement learning, so reinforcement learning can also play a major role in robotics.

"Robots can achieve human-like balance control through reinforcement learning"

"Deep learning and reinforcement learning: Google trains robot arms for long-term reasoning"

"New Berkeley reinforcement learning research: robots learn trajectory tracking with only a small amount of random data"



Reinforcement learning also has some applications in recommendation systems, dialogue systems, education and training, advertising, and finance:

"The powerful combination of reinforcement learning and recommendation systems"

"Policy adaptation in dialogue management based on deep reinforcement learning"

"Practical applications of reinforcement learning in industry"


Mainstream reinforcement learning algorithms

Model-Free vs. Model-Based

Before introducing specific algorithms, let's look at the two major categories of reinforcement learning algorithms. The key difference between them is whether the agent has access to, or learns, a model of the environment.

Model-Based methods have an understanding of the environment up front and can plan ahead, but the drawback is that if the model is inconsistent with the real world, they will not perform well in the actual use scenario.

Model-Free methods give up on learning a model and are less sample-efficient than model-based methods, but they are easier to implement and easier to tune into a good state in real-world scenarios. As a result, model-free methods are more popular and have been more widely developed and tested.


Mainstream reinforcement learning algorithm classification

Model-free learning – Policy Optimization

Methods in this family represent the policy explicitly as \pi_{\theta}(a|s). They optimize the parameters \theta either directly, by gradient ascent on the performance objective J(\pi_{\theta}), or indirectly, by maximizing local approximations of J(\pi_{\theta}). The optimization is almost always on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy. Policy optimization usually also involves learning an approximator V_{\phi}(s) for the on-policy value function V^{\pi}(s), which is used to work out how to update the policy.

Examples of policy optimization methods:

  • A2C / A3C, which maximize performance directly through gradient ascent
  • PPO, whose updates do not maximize performance directly but instead maximize a surrogate objective function, which serves as an approximate estimate of the objective J(\pi_{\theta}).
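
To make the idea of gradient ascent on J(\pi_{\theta}) concrete, here is a minimal sketch of a policy-gradient (REINFORCE-style) update on a multi-armed bandit. It is not A2C or PPO. The function `train_bandit_policy`, the bandit setup and all hyperparameters are assumptions chosen for illustration. The running-average `baseline` plays a role loosely analogous to the learned value function V_{\phi}(s) mentioned above.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences (the parameters theta) into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(reward_probs, steps=2000, lr=0.1, seed=0):
    """REINFORCE-style gradient ascent on expected reward (illustrative only).

    reward_probs[k] is the (hidden) probability that action k pays reward 1.
    """
    rng = random.Random(seed)
    actions = list(range(len(reward_probs)))
    theta = [0.0] * len(actions)   # one preference per action
    baseline = 0.0                 # running average of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(theta)
        # On-policy: the action is sampled from the current policy.
        action = rng.choices(actions, weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        advantage = reward - baseline
        for k in actions:
            # d/d theta_k of log pi(action) for a softmax policy
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * advantage * grad_log_pi   # ascent, not descent
        baseline += (reward - baseline) / t
    return softmax(theta)
```

For example, `train_bandit_policy([0.2, 0.8])` should end up placing most of the probability on the second action, since that action pays out more often.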


Model-free learning – Q-Learning

Methods in this family learn an approximator Q_{\theta}(s,a) for the optimal action-value function Q^*(s,a). They typically use an objective function based on the Bellman equation. The optimization is almost always off-policy, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained. The corresponding policy is obtained through the connection between Q^* and \pi^*: the action taken by the agent is given by the following formula:

a(s) = \arg \max_a Q_{\theta}(s,a).
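
As a rough illustration, the sketch below runs tabular Q-learning on a hypothetical five-state corridor. This is a table, not the neural approximator Q_{\theta} that DQN uses. The environment, the function names and the hyperparameters are all assumptions. Data is collected by a uniformly random behaviour policy, which shows the off-policy property, and the greedy rule a(s) = \arg \max_a Q(s,a) is applied only afterwards.

```python
import random

N_STATES = 5       # states 0..4; state 4 is the goal and ends the episode
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical corridor dynamics: reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: uniformly random, unrelated to the greedy policy.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Bellman target: r + gamma * max_a' Q(s', a'), or just r at the end.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_action(q, state):
    """a(s) = argmax_a Q(s, a)"""
    return max(ACTIONS, key=lambda a: q[state][a])
```

After training, `greedy_action(q, s)` should return 1 (move right) for every non-goal state, even though the data came from random wandering.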

Examples of Q-Learning based methods:

  • DQN, a classic that helped launch the field of deep reinforcement learning
  • C51, a variant that learns a distribution over returns whose expectation is Q^*


Model-based learning – Pure Planning

The most basic approach never represents the policy explicitly and instead selects actions purely with planning techniques such as model-predictive control (MPC). In MPC, each time the agent observes the environment, it computes a plan that is optimal with respect to the current model. The plan here covers all the actions the agent will take over a fixed window of time into the future (the planning algorithm may account for future rewards beyond that horizon through a learned value function). The agent executes the first action of the plan and then immediately discards the rest. Each time it prepares to interact with the environment, it computes a new plan, which avoids using an action from a plan whose remaining horizon is shorter than desired.

  • MBMF explores model-predictive control with learned environment models on some standard benchmark tasks for deep reinforcement learning
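
The plan, act, re-plan cycle of MPC can be sketched on a toy problem. Everything here is an assumption made for illustration: the one-dimensional `model`, the `GOAL` position, and exhaustive enumeration as the planning algorithm (practical systems use a learned model and a cheaper search). For simplicity the "real" environment is taken to be identical to the model.

```python
from itertools import product

ACTIONS = (-1, 0, 1)   # move left, stay, move right
GOAL = 5


def model(state, action):
    """Hypothetical environment model: predicts next state and reward."""
    next_state = state + action
    reward = -abs(GOAL - next_state)   # closer to the goal is better
    return next_state, reward


def plan(state, horizon=3):
    """Return the action sequence with the highest predicted return."""
    best_return, best_plan = float("-inf"), None
    for candidate in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in candidate:
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_plan = total, candidate
    return best_plan


def mpc_control(start, steps=8, horizon=3):
    """Re-plan at every step, executing only the first planned action."""
    state, trajectory = start, [start]
    for _ in range(steps):
        first_action = plan(state, horizon)[0]   # discard the rest of the plan
        # Assumption: the real environment behaves exactly like the model.
        state, _ = model(state, first_action)
        trajectory.append(state)
    return trajectory
```

Starting from `mpc_control(0)`, the agent walks one step at a time to position 5 and then stays there. No policy is ever stored; every action comes from a fresh plan.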


Model-based learning – Expert Iteration

A straightforward follow-on to pure planning is to use and learn an explicit representation of the policy: \pi_{\theta}(a|s). The agent applies a planning algorithm, such as Monte Carlo Tree Search, in the model and generates candidate actions for the plan by sampling from its current policy. The planning algorithm produces better actions than the policy alone would, so it is an "expert" relative to the policy. The policy is then updated to produce actions more like the planning algorithm's output.

  • ExIt uses this approach to train deep neural networks to play Hex
  • AlphaZero is another example of this approach


Besides the distinction between model-free and model-based learning, there are several other ways to classify reinforcement learning:

  • Policy-based (probability-based) vs. value-based
  • Episode-based updates vs. single-step updates
  • Online learning vs. offline learning

For details, see "Summary of reinforcement learning methods"


Baidu Encyclopedia and Wikipedia

Baidu Encyclopedia version

Reinforcement learning, also known as evaluative learning, is an important machine learning method with many applications in fields such as intelligent robot control and analysis and prediction.

However, reinforcement learning is not mentioned in the traditional classification of machine learning. In connectionist learning, learning algorithms are divided into three types: unsupervised learning, supervised learning, and reinforcement learning.


Wikipedia version

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Because of its generality, the problem is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In the operations research and control literature, reinforcement learning is called approximate dynamic programming or neuro-dynamic programming.
