Reinforcement learning is a branch of machine learning, standing alongside supervised learning and unsupervised learning. This article introduces the basic concepts of reinforcement learning, its application scenarios, and the mainstream reinforcement learning algorithms and how they are classified.
What is reinforcement learning?
Reinforcement learning is not one specific algorithm, but an umbrella term for a whole class of algorithms.
In this respect it is like "supervised learning" and "unsupervised learning": a collective name for a family of learning methods.
The idea behind reinforcement learning algorithms is very simple. Take games as an example: if a strategy you adopt in a game earns a higher score, you reinforce that strategy so you keep getting better results. This is very similar to the various "performance rewards" in everyday life; we often use exactly this kind of strategy to improve our own game skills.
In the game Flappy Bird, we control the bird with simple taps, dodging the pipes and flying as far as possible, because the farther the bird flies, the more points it earns.
This is a typical reinforcement learning scenario:
- The machine controls a clearly defined bird character - the agent
- The bird should fly as far as possible - the goal
- The pipes that must be dodged throughout the game - the environment
- Making the bird flap upward to dodge a pipe - the action
- The farther the bird flies, the more points it earns - the reward
You will notice that the biggest difference between reinforcement learning and supervised or unsupervised learning is that it does not need large amounts of labeled data to be "fed" in. Instead, the agent acquires skills by continually trying things out.
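The agent-environment loop described above can be sketched in a few lines of Python. The environment below is a hypothetical, heavily simplified stand-in for Flappy Bird (the survival probabilities and rewards are invented for illustration), but the interaction pattern of observe, act, and receive a reward is the real shape of every reinforcement learning problem:

```python
import random

class ToyFlappyEnv:
    """Hypothetical toy stand-in for Flappy Bird: the bird survives each
    step with some probability and earns 1 point per unit of distance."""

    def reset(self):
        self.distance = 0
        return self.distance

    def step(self, action):  # action: 0 = do nothing, 1 = flap
        # In this made-up model, flapping makes survival more likely.
        survive_prob = 0.9 if action == 1 else 0.5
        alive = random.random() < survive_prob
        reward = 1 if alive else 0
        self.distance += reward
        return self.distance, reward, not alive  # state, reward, done

env = ToyFlappyEnv()
state = env.reset()
total_reward, done = 0, False
while not done:
    action = 1  # a trivial "always flap" policy; a real agent would learn this
    state, reward, done = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```

A real agent would replace the fixed `action = 1` line with a learned policy that maps states to actions.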
Application scenarios for reinforcement learning
Reinforcement learning is not yet mature, and its application scenarios are still relatively limited. Its biggest application area so far is games.
2016: AlphaGo defeated Lee Sedol 4:1. Its successor, AlphaGo Zero, needed only 40 days of self-play training to surpass AlphaGo Master.
January 25, 2019: AlphaStar defeated top human professional players 10:1 in StarCraft II.
April 13, 2019: OpenAI Five defeated the human world champion team in a Dota 2 match.
Robots are very much like the "agents" of reinforcement learning, so reinforcement learning can also play a large role in robotics.
Reinforcement learning also has applications in recommendation systems, dialogue systems, education and training, advertising, and finance.
Mainstream algorithm for reinforcement learning
Model-Free vs. Model-Based
Before introducing specific algorithms, let's look at the two broad classes of reinforcement learning algorithms. The key distinction between them is whether the agent has access to (or can learn) a model of the environment.
Model-based methods understand the environment up front and can plan ahead, but the drawback is that if the learned model does not match the real world, they perform poorly in actual use.
Model-free methods give up on learning a model. They are less sample-efficient than model-based methods, but they are easier to implement and easier to tune to a good state in real-world scenarios. For this reason, model-free methods are more popular and are more widely developed and tested.
Model-free learning - Policy Optimization
Methods in this family represent the policy explicitly as π_θ(a|s). They optimize the parameters θ either directly, by gradient ascent on the performance objective J(π_θ), or indirectly, by maximizing a local approximation of J(π_θ). The optimization is almost always on-policy: each update uses only data collected while acting under the most recent version of the policy. Policy optimization usually also involves learning an approximator V_φ(s) for the on-policy value function, which is used to decide how to update the policy.
Examples of policy optimization methods:
- A2C / A3C, which maximize performance directly via gradient ascent
- PPO, which updates indirectly: it maximizes a surrogate objective function that conservatively estimates how much J(π_θ) will change as a result of the update
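As a minimal illustration of the policy-gradient idea behind these methods, the sketch below runs REINFORCE (the simplest policy-gradient algorithm, not A2C or PPO themselves) on a hypothetical two-armed bandit; the arm payoffs, learning rate, and iteration count are all invented for illustration. The policy is an explicit softmax over per-arm logits, and each update nudges the parameters along the gradient of expected reward:

```python
import math
import random

# REINFORCE on a made-up 2-armed bandit: arm 1 pays 1.0, arm 0 pays 0.2.
theta = [0.0, 0.0]   # policy parameters: one logit per arm
alpha = 0.1          # learning rate

def softmax(logits):
    exps = [math.exp(l - max(logits)) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1   # sample an action from the policy
    reward = 1.0 if a == 1 else 0.2
    # For a softmax policy, grad of log pi(a) w.r.t. logit i is 1[i == a] - probs[i].
    for i in range(2):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * reward * grad_log_pi  # ascend expected reward

print(softmax(theta))  # probability of the better arm should approach 1
```

Note this is on-policy: every update uses an action sampled from the current policy, matching the description above.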
Model-free learning – Q-Learning
Methods in this family learn an approximator Q_θ(s,a) for the optimal action-value function Q*(s,a). The objective function is typically based on the Bellman equation. The optimization is off-policy: each update can use data collected at any point during training, no matter how the agent chose to explore the environment when the data was gathered. The corresponding policy is obtained through the connection between Q* and π*: the agent's actions are given by a(s) = argmax_a Q_θ(s,a).
Examples of Q-learning methods:
- DQN, the classic method that launched the field of deep reinforcement learning
- C51, which learns a distribution over returns whose expectation is Q*
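The core Q-learning update can be shown in tabular form. The sketch below uses a hypothetical 5-state chain environment (the environment, step count, and hyperparameters are all invented for illustration): each update moves Q(s,a) toward the Bellman target r + γ·max_a' Q(s',a'), and taking the max over next actions, rather than the action actually taken, is what makes the method off-policy:

```python
import random

# Tabular Q-learning on a made-up 5-state chain: moving right from the
# last state earns reward 1 and resets the episode; everything else earns 0.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def step(s, a):
    if a == 1 and s == n_states - 1:
        return 0, 1.0               # goal reached: reward 1, reset to state 0
    ns = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return ns, 0.0

def greedy(qs):
    m = max(qs)                     # break ties randomly so early exploration is unbiased
    return random.choice([i for i, q in enumerate(qs) if q == m])

random.seed(0)
s = 0
for _ in range(5000):
    # Epsilon-greedy behavior policy; the update itself is off-policy because
    # the target uses the max over next actions, not the action actually taken.
    a = random.randrange(n_actions) if random.random() < epsilon else greedy(Q[s])
    ns, r = step(s, a)
    Q[s][a] += alpha * (r + gamma * max(Q[ns]) - Q[s][a])
    s = ns

print([q.index(max(q)) for q in Q])  # greedy policy: should pick "right" in every state
```

DQN follows the same update, but replaces the table with a neural network and draws its training batches from a replay buffer, which is only possible because the update is off-policy.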
Model-based learning - pure planning
This most basic approach never represents the policy explicitly; instead it selects actions purely with planning techniques, such as model-predictive control (MPC). In MPC, each time the agent observes the environment, it computes a plan that is optimal with respect to the current model, where the plan consists of all the actions to be taken over a fixed window of future time (a learned value function can let the planning algorithm account for rewards beyond that horizon). The agent executes only the first action of the plan and immediately discards the rest; each time it is about to interact with the environment again, it computes a fresh plan, which avoids executing actions from a plan whose remaining horizon is shorter than intended.
- MBMF, which explores model-predictive control with learned environment models on some standard benchmark tasks for deep reinforcement learning
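The replan-execute-discard loop described above can be sketched with the simplest planner, random shooting. Everything here is invented for illustration (a hypothetical 1-D world with known dynamics, a made-up goal, horizon, and candidate count); a real system like MBMF would use a learned dynamics model and a more capable optimizer:

```python
import random

# Random-shooting MPC on a made-up 1-D task: the state is a position,
# actions are -1/0/+1 steps, and reward is higher closer to the goal.
GOAL = 7

def model(state, action):
    """The (assumed known or learned) dynamics model."""
    return state + action

def reward(state):
    return -abs(state - GOAL)

def plan(state, horizon=5, n_candidates=50):
    """Sample candidate action sequences, score each one under the model,
    and return only the FIRST action of the best-scoring plan."""
    best_score, best_first = float("-inf"), 0
    for _ in range(n_candidates):
        seq = [random.choice([-1, 0, 1]) for _ in range(horizon)]
        s, score = state, 0
        for a in seq:
            s = model(s, a)
            score += reward(s)
        if score > best_score:
            best_score, best_first = score, seq[0]
    return best_first

random.seed(0)
state = 0
for _ in range(15):                 # replan at every step, discard the rest of the plan
    state = model(state, plan(state))
print("final state:", state)        # should end up at or near the goal
```

The key structural point is in the loop: the planner produces a whole sequence, but only its first action is ever executed before replanning.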
Model-based learning - Expert Iteration
A natural successor to pure planning, this approach uses and learns an explicit representation of the policy, π_θ(a|s). The agent applies a planning algorithm (such as Monte Carlo tree search) inside the model, generating candidate actions for the plan by sampling from its current policy. The planning algorithm produces better actions than the policy alone would, so it acts as an "expert" relative to the policy. The policy is then updated to produce actions closer to the planning algorithm's output.
- ExIt, an algorithm that uses this approach to train deep neural networks to play Hex
- AlphaZero, another example of this approach
Beyond the model-free vs. model-based distinction, there are several other ways to classify reinforcement learning algorithms:
- Policy-based vs. value-based
- Episodic update vs. single-step update
- Online learning vs. offline learning
For details, see the article "Summary of Reinforcement Learning Methods".
Sources: Baidu Baike and Wikipedia.