Although convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are becoming increasingly important due to their applications in computer vision (CV) and natural language processing (NLP), reinforcement learning (RL), as a framework for computational neuroscience and decision making, seems to be underestimated. Moreover, there seem to be few resources detailing how RL can be applied to different industries. Despite the criticisms of RL's weaknesses, it should never be overlooked in corporate research given its enormous potential to assist decision making. As Koray Kavukcuoglu, director of research at DeepMind, said at a conference: "If one of the goals that we work on here is AI, then it is at the core of that. Reinforcement learning is a very general framework for learning sequential decision-making tasks. And deep learning, on the other hand, is of course the best set of algorithms we have to learn representations. The combination of these two different models is the best answer so far we have in terms of learning very good state representations of very challenging tasks that are not just for solving toy domains but actually solving challenging real-world problems."

Therefore, this article aims to 1) survey the breadth and depth of RL applications in the real world; 2) view RL from different angles; and 3) persuade decision makers and researchers to put more effort into RL research.

The rest of this article is organized as follows. Section I is a general introduction. Section II presents applications of RL in different domains, with brief descriptions of how RL was applied. Section III summarizes what you need before applying RL. Section IV offers intuitions from other disciplines, and Section V discusses how RL could be useful in the future. Section VI is the conclusion.

Introduction to reinforcement learning

RL, often described as a third paradigm of machine learning alongside supervised and unsupervised learning, is a technique that allows an agent to take actions and interact with an environment so as to maximize the total reward. RL is usually modeled as a Markov decision process (MDP).

Source: Reinforcement Learning: An Introduction

Imagine a baby being given a TV remote control in your home (environment). In simple terms, the baby (agent) first observes and constructs his or her own representation of the environment (state). Then the curious baby takes certain actions, such as hitting the remote control (action), and observes how the TV responds (next state). Since an unresponsive TV is dull, the baby dislikes that outcome (receives a negative reward) and takes fewer actions that lead to such a result (updates the policy), and vice versa. The baby repeats the process until he or she finds a satisfactory policy (what to do in different situations) that maximizes the (discounted) total reward.

Research in RL establishes mathematical frameworks to solve such problems. For example, to find a good policy, we can use a value-based approach such as Q-learning to measure how good an action is in a particular state, or a policy-based approach to directly identify which actions to take in different states without knowing how good those actions are.
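To make the value-based approach concrete, here is a minimal tabular Q-learning sketch on a toy corridor MDP. Everything in it (the five-state environment, the reward of +1 at the goal, and the hyperparameters) is an illustrative assumption, not taken from any paper discussed here:

```python
import random

# Toy corridor: states 0..4, reward +1 only upon reaching state 4.
# All constants and the environment itself are illustrative assumptions.
N_STATES, ACTIONS = 5, [0, 1]          # action 0 = left, action 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration

def step(state, action):
    """Toy dynamics: move left or right (clipped at the walls); reward 1 at the goal."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def train(episodes=500, seed=0):
    random.seed(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]        # Q[state][action]
    for _ in range(episodes):
        s = random.randrange(N_STATES - 1)           # random non-goal start
        while s != N_STATES - 1:
            # epsilon-greedy: mostly exploit, occasionally explore
            a = random.choice(ACTIONS) if random.random() < EPSILON \
                else max(ACTIONS, key=lambda x: q[s][x])
            s2, r = step(s, a)
            # Core Q-learning update: pull Q(s,a) toward r + gamma * max_a' Q(s',a')
            q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# After training, the greedy policy should point right (toward the goal) everywhere.
greedy = [max(ACTIONS, key=lambda x: q[s][x]) for s in range(N_STATES - 1)]
```

The learned values tell the agent how good each action is in each state; acting greedily with respect to them recovers a policy without ever modeling the policy directly.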

However, the problems we face in the real world can be extremely complicated in many different ways, and typical RL algorithms have no clue how to solve them. For example, the state space is enormous in the game of Go, the environment is only partially observable in poker, and many agents interact with each other in the real world. Researchers have invented methods that use deep neural networks to approximate the required policies, value functions, and even transition models to address some of these problems, hence the so-called deep reinforcement learning. This article does not distinguish between RL and deep RL.

There are plenty of RL resources online, and interested readers can visit Awesome-RL, Argmin, and Dennybritz for good material.

Reinforcement learning applications

This section is written for general readers, while readers with some knowledge of RL should get even more value out of it.

Resource management in computer clusters

Designing algorithms to allocate limited resources to different tasks is challenging and requires human-generated heuristics. The paper "Resource Management with Deep Reinforcement Learning" [2] shows how to use RL to automatically learn to allocate and schedule computer resources for waiting jobs, with the objective of minimizing the average job slowdown.

The state space is formulated as the current resource allocation and the resource profile of the jobs. For the action space, they use a trick that allows the agent to select more than one action at each time step. The reward is the sum of (-1 / duration of the job) over all jobs in the system. They then combine the REINFORCE algorithm with a baseline value to compute the policy gradients and find the policy parameters that give the action-probability distribution minimizing the objective. The code is available on GitHub.
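To illustrate the learning rule the paper builds on, here is a minimal REINFORCE-with-baseline sketch on a toy three-action problem. This is not the paper's DeepRM code; the softmax policy, reward values, and step sizes are all invented for illustration:

```python
import numpy as np

# Minimal REINFORCE with a running-average baseline on a 3-armed bandit.
# Action 2 has the highest expected reward; the policy should converge to it.
rng = np.random.default_rng(0)
theta = np.zeros(3)                       # action preferences -> softmax policy
baseline, alpha, beta = 0.0, 0.1, 0.05    # baseline, policy lr, baseline lr

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

true_reward = np.array([0.0, 0.5, 1.0])   # expected reward per action (assumed)

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_reward[a] + rng.normal(0, 0.1)      # noisy observed return
    # grad of log pi(a) for a softmax policy: onehot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # REINFORCE update, with the baseline subtracted to reduce variance
    theta += alpha * (r - baseline) * grad_log_pi
    baseline += beta * (r - baseline)            # running-average baseline

best = int(np.argmax(softmax(theta)))
```

Subtracting the baseline does not bias the gradient but substantially reduces its variance, which is exactly why the paper pairs REINFORCE with baseline values.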

Traffic light control

In the paper "Reinforcement learning-based multi-agent system for network traffic signal control" [3], researchers tried to design a traffic light controller to solve the congestion problem. Although tested only in a simulated environment, their method showed better results than traditional methods and sheds light on the potential use of multi-agent RL in designing traffic systems.

Traffic light control. Source: [3]

Five agents were placed in a five-intersection traffic network, with an RL agent at the central intersection to control the traffic signaling. The state was defined as an eight-dimensional vector, with each element representing the relative traffic flow of one lane. Eight choices were available to the agent, each representing a phase combination, and the reward function was defined as the reduction in delay compared with the previous time step. The authors used a DQN to learn the Q value of the {state, action} pairs.
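As a sketch of the update a DQN performs on such {state, action} pairs, here is the same temporal-difference rule with a simple linear Q-function standing in for the neural network. The eight-dimensional state and eight phase actions mirror the paper's setup, but the toy dynamics, weights, and reward value below are assumptions:

```python
import numpy as np

# Linear stand-in for a DQN: Q(s, a) = w[a] . s, one weight vector per action.
rng = np.random.default_rng(0)
N_ACTIONS, STATE_DIM = 8, 8               # 8 phase combinations, 8-dim flow vector
w = np.zeros((N_ACTIONS, STATE_DIM))
alpha, gamma, epsilon = 0.01, 0.9, 0.1

def q_values(s):
    return w @ s                           # Q(s, .) for all 8 phase choices

def act(s):
    if rng.random() < epsilon:             # epsilon-greedy exploration
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(s)))

def td_update(s, a, r, s_next):
    """One TD step: the same error signal a DQN would backpropagate."""
    target = r + gamma * np.max(q_values(s_next))
    td_error = target - q_values(s)[a]
    w[a] += alpha * td_error * s           # gradient step on the squared TD error
    return td_error

# One toy interaction: relative flows as the state, delay reduction as reward.
s = rng.random(STATE_DIM)
a = act(s)
s_next = rng.random(STATE_DIM)
reward = 0.3                               # pretend delay dropped this step
err = td_update(s, a, reward, s_next)
```

The paper replaces the linear map with a deep network, but the target `r + gamma * max Q(s', .)` and the error being minimized are the same.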


Robotics

There is tremendous work on applying RL to robotics. Readers are referred to [10] for a survey of RL in robotics. In particular, [11] trained a robot to learn policies that map raw video images to the robot's motions. RGB images were fed to a CNN, and the outputs were motor torques. The RL component was guided policy search, which generates training data from its own state distribution.


Web system configuration

A web system has more than 100 configurable parameters, and the process of tuning them requires skilled operators and numerous trial-and-error tests. The paper "A Reinforcement Learning Approach to Online Web System Auto-configuration" [5] showed the first attempt in the domain to autonomously reconfigure parameters in multi-tier web systems in VM-based dynamic environments.

The reconfiguration process can be formulated as a finite MDP. The state space is the system configuration, the action space is {increase, decrease, hold} for each parameter, and the reward is defined as the difference between the given target response time and the measured response time. The authors used a model-free Q-learning algorithm to do the task.
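A minimal encoding of that MDP for a single tunable parameter might look like the following. The parameter name, step size, and numbers are assumptions for illustration; they are not from the paper:

```python
# Hypothetical encoding of the tuning MDP for one web-server parameter
# (say, a MaxClients-like setting). STEP and all numbers are assumptions.
ACTIONS = ("increase", "decrease", "hold")
STEP = 25                                  # how much one action changes the value

def apply_action(max_clients, action):
    """Transition: the next configuration after one tuning action."""
    if action == "increase":
        return max_clients + STEP
    if action == "decrease":
        return max(STEP, max_clients - STEP)   # never drop below one step
    return max_clients                         # hold

def reward(target_ms, measured_ms):
    """As in the paper: target response time minus measured response time,
    so the agent is rewarded for beating the target and penalized otherwise."""
    return target_ms - measured_ms

cfg = apply_action(150, "increase")
r = reward(200.0, 240.0)    # negative: the measured response time missed the target
```

With the state, action, and reward defined this way, plain tabular Q-learning can be run over observed (state, action, reward, next state) transitions.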

Although the authors used some other techniques, such as policy initialization, to remedy the large state space and the computational complexity of the problem, rather than the potential combination of RL and neural networks, it is believed that this pioneering work paved the way for future research in this area.


Reinforcement learning in chemistry

RL can also be applied to optimizing chemical reactions. In the paper "Optimizing Chemical Reactions with Deep Reinforcement Learning" [4], the researchers showed that their model outperformed a state-of-the-art algorithm and generalized to dissimilar underlying mechanisms.

Combined with an LSTM that models the policy function, the RL agent optimized the chemical reaction with a Markov decision process characterized by {S, A, P, R}, where S is the set of experimental conditions (e.g., temperature, pH), A is the set of all possible actions that can change the experimental conditions, P is the transition probability from the current experimental condition to the next, and R is the reward, a function of the state.
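To make the {S, A, P, R} formulation tangible, here is a gym-style environment sketch. The quadratic "yield" surface, the two conditions, and their optimum are invented for illustration; the paper's agent learns from real reaction outcomes and uses an LSTM policy:

```python
# Toy environment for reaction-condition optimization, framed as an MDP.
# The yield function peaking at temp=60, pH=5 is purely illustrative.
class ReactionEnv:
    def __init__(self):
        self.temp, self.ph = 25.0, 7.0            # S: current experimental conditions

    def _yield(self):
        """Hypothetical reaction yield, highest at temp=60 and pH=5."""
        return -((self.temp - 60.0) ** 2) / 100.0 - (self.ph - 5.0) ** 2

    def step(self, d_temp, d_ph):
        """A: perturb the conditions. P is deterministic in this sketch.
        R: the change in measured yield caused by the perturbation."""
        before = self._yield()
        self.temp += d_temp
        self.ph += d_ph
        return (self.temp, self.ph), self._yield() - before

env = ReactionEnv()
state, r = env.step(10.0, -1.0)    # warm the reaction up and acidify slightly
```

A policy (LSTM or otherwise) would observe the returned state and reward and propose the next perturbation, iterating toward the optimal conditions.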

This application is a good demonstration of how RL can reduce time-consuming trial-and-error work in a relatively stable environment.

Personalized recommendations

Previous work on news recommendation faced several challenges, including the rapidly changing dynamics of news, users becoming bored easily, and click-through rates failing to reflect user retention. Guanjie et al. applied RL to news recommendation systems in a paper titled "DRN: A Deep Reinforcement Learning Framework for News Recommendation" [1] to combat these problems.

In practice, they constructed four categories of features: A) user features and B) context features as state features of the environment, and C) user-news features and D) news features as action features. The four feature groups were fed into a deep Q-network (DQN) to compute the Q value. A list of news items was recommended based on the Q values, and the user's clicks on the news were part of the reward the RL agent received.
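A hypothetical sketch of how those four feature groups could be assembled into the DQN input is shown below. All dimensions are invented for illustration; the paper defines its own feature sets and network architecture:

```python
import numpy as np

# Assemble DRN-style features into a single Q-network input vector.
# The dimensions (5, 3, 4, 6) are assumptions, not from the paper.
rng = np.random.default_rng(0)
user_f      = rng.random(5)    # A) user features          -> state
context_f   = rng.random(3)    # B) context features       -> state
user_news_f = rng.random(4)    # C) user-news features     -> action
news_f      = rng.random(6)    # D) news features          -> action

state  = np.concatenate([user_f, context_f])       # describes the environment
action = np.concatenate([user_news_f, news_f])     # describes one candidate item
dqn_input = np.concatenate([state, action])        # input used to score Q(state, action)
```

Scoring every candidate news item means building one such vector per item and ranking the items by the Q values the network outputs.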

The authors also employed other techniques to address other challenging problems, including memory replay, survival models, and Dueling Bandit Gradient Descent. Please refer to the paper for details.

Bidding and advertising

Researchers from Alibaba Group published the paper "Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising" [6]. They claim that their distributed cluster-based multi-agent bidding solution (DCMAB) has achieved promising results, and thus they plan to conduct live tests on the Taobao platform.

The implementation details are left for readers to investigate. Generally speaking, the Taobao ad platform is a place where merchants bid in order to display ads to customers. This can be a multi-agent problem because merchants bid against each other and their actions are interrelated. In the paper, merchants and customers were clustered into different groups to reduce computational complexity. The agents' state space represents the agents' cost-revenue status, the action space is the bid (continuous), and the reward is the revenue generated by the customer cluster.

DCMAB algorithm. Source: [6]

Other questions, including the impact of different reward settings (self-interested vs. coordinated) on agents' revenue, are also studied in the paper.


Games

RL is so well-known these days largely because it is the mainstream algorithm used to solve different games, sometimes achieving superhuman performance.

Reinforcement learning versus linear models versus humans.

The most famous examples must be AlphaGo [12] and AlphaGo Zero [13]. AlphaGo, trained on countless human games, achieved superhuman performance by using a value network and Monte Carlo tree search (MCTS) with its policy network. However, the researchers later reconsidered and tried a purer RL approach: training the agent from scratch. The researchers let the new agent, AlphaGo Zero, play against itself, and it eventually defeated AlphaGo 100-0.

Deep learning

More and more attempts to combine RL with other deep learning architectures have been made recently, and they have shown impressive results.

One of the most influential works in RL is DeepMind's pioneering work combining CNNs with RL [7]. By doing so, the agent is able to "see" the environment through high-dimensional sensory input and then learn to interact with it.

RL and RNNs are another combination people have used to try new ideas. An RNN is a type of neural network that has "memory." When combined with RL, an RNN gives the agent the ability to memorize things. For example, [8] combined an LSTM with RL to create a Deep Recurrent Q-Network (DRQN) for playing Atari 2600 games. [4] also used RNNs and RL to solve the chemical reaction optimization problem.

DeepMind showed [9] how to use generative models and RL to synthesize programs. In the model, the adversarially trained agent used the reward signal to improve its actions, instead of propagating gradients to the input space as in GAN training.

Input and generated results.

What you need to know before applying RL to your problem

There are a few things to do before applying RL:

  • Understand your problem: You do not necessarily need to use RL for your problem, and sometimes you simply cannot. Before deciding to use RL, you may want to check whether your problem has some of the following characteristics: a) trial and error (it can be improved by receiving feedback from the environment); b) delayed rewards; c) it can be modeled as an MDP; d) it is a control problem.
  • Simulated environment: Many iterations are needed before an RL algorithm works, and I believe you don't want to see an RL agent trying different things in a self-driving car on a highway, right? Therefore, a simulated environment that correctly reflects the real world is needed.
  • MDP: You need to formulate your problem as an MDP. You need to design the state space, the action space, the reward function, and so on. Your agent will do what it is rewarded for, under the given constraints, so you may not get the results you want if you design things differently.
  • Algorithms: There are different RL algorithms you can choose, and questions to ask yourself. Do you want to find out the policy directly, or do you want to learn a value function? Do you want to go model-free or model-based? Do you need to combine other kinds of deep neural networks or methods to solve your problem?

To be objective and impartial, you are also encouraged to read about the shortcomings of RL. Here is a great post on that.

Intuition from other disciplines

RL has a very close relationship with psychology, biology, and neuroscience. If you think about it, what an RL agent does is just trial and error: it learns how good or bad its actions are based on the rewards it receives from the environment. This is exactly how humans learn to make decisions. Moreover, the exploration-exploitation trade-off, the credit assignment problem, and attempts to model the environment are also things we face in our everyday lives.

Economic theory can also shed some light on RL. In particular, multi-agent reinforcement learning (MARL) can be understood from the perspective of game theory, a research area developed by John Nash to understand the interactions of agents in a system. In addition to game theory, MARL and the partially observable Markov decision process (POMDP) may also be useful for understanding other economic topics such as market structure (e.g., monopoly, oligopoly), externalities, and information asymmetry.

What can reinforcement learning achieve in the future?

RL still has many problems and cannot be applied casually. Nevertheless, as more research effort is put into addressing these problems, RL will be influential in the following ways:

  • Assisting humans: It may be too much to say that RL will one day evolve into artificial general intelligence (AGI), but RL will surely assist and collaborate with people. Imagine a robot or a virtual assistant working with you, taking your actions into account in order to achieve a common goal. Wouldn't that be great?
  • Understanding the consequences of different strategies: Life is amazing because time doesn't flow backwards and things only happen once. However, sometimes we would like to know how things might differ (at least in the short term) if we had taken a different action. Would Croatia have had a better chance of winning the 2018 World Cup if the coach had adopted another strategy? Of course, to achieve this we would need to model the environment, the transition functions, and so on perfectly, and also analyze the interactions between agents, which seems impossible for now.

Conclusion

This article shows only some examples of reinforcement learning applications in various industries. They should not limit your RL use cases, and, as always, you should use first principles to understand the nature of RL and of your problem.

If you are a decision maker at your company, I hope this article is enough to convince you to rethink your business and see whether RL can be used. If you are a researcher, I hope you will agree with me that although RL still has various shortcomings, this also means it has great potential for improvement and plenty of research opportunities.

What do you think? Can you think of any problems that RL could solve?