Recently, we have seen computers playing games against humans, either as bots in multiplayer games or as one-on-one opponents in games like Dota 2, PUBG, and Mario. DeepMind (a research company) made history in 2016 when its AlphaGo program beat the Korean Go world champion. If you are an avid gamer, you may have heard of OpenAI Five for Dota 2, in which machines defeated some of the world's top Dota 2 players over a series of games (if you are interested, here is a complete analysis of the algorithm and of the games the machines played).

The latest version of OpenAI Five taking on Roshan. (src)

So here is the core question: why do we need reinforcement learning? Is it only useful for games, or can it be applied to real-world scenarios and problems? If you are learning about reinforcement learning for the first time, the answer may surprise you: it is one of the most widely used and fastest growing technologies in the field of artificial intelligence.

Here are some applications that may motivate you to build a reinforcement learning system.

  1. Self-driving cars
  2. Gaming
  3. Robotics
  4. Recommendation systems
  5. Advertising and marketing

A brief review of the origins of reinforcement learning

So, when we already have so many machine learning and deep learning techniques, where does reinforcement learning come from? It was invented by Rich Sutton and Andrew Barto, Rich's Ph.D. thesis advisor. It took shape in the 1980s but was considered archaic back then. Rich, however, believed it was promising and that it would eventually be recognized.

Reinforcement learning, like machine learning and deep learning, supports automation, but it does so by learning from the environment it is placed in. The strategies are not the same, yet all of them support automation. So why reinforcement learning?

It is much like the natural learning process, where the process/model receives feedback on whether it performed well or not. Deep learning and machine learning are also learning processes, but they focus mostly on finding patterns in existing data. Reinforcement learning, on the other hand, learns through trial and error and eventually arrives at the right action, or at the global optimum. Another significant advantage of reinforcement learning is that we do not have to provide the entire training data as in supervised learning; a few chunks are enough.

Learn about reinforcement learning

Imagine that you are teaching your cat a new trick, but unfortunately cats do not understand our language, so we cannot tell them what we want them to do. Instead, we emulate a situation, and the cat tries to respond in many different ways. If the cat's response is the desired one, we reward it with milk. Now guess what: the next time the cat is exposed to the same situation, it executes a similar action with even more enthusiasm, expecting more food. So this is learning from positive responses; if it is met with negative responses, such as an angry face, it tends not to learn from them.

Again, this is how reinforcement learning works. We give the machine some input and actions and then reward them based on the output. Maximizing rewards will be our ultimate goal. Let us now see how we interpret the same problem above as a reinforcement learning problem.

  • The cat is the "agent" that is exposed to the "environment".
  • The environment is the house or play area, depending on what you are teaching it.
  • The situations it encounters are called "states"; for example, your cat crawling under the bed or running can be interpreted as states.
  • The agent reacts by performing "actions" that take it from one state to another.
  • After a state change, we give the agent a "reward" or a "penalty" depending on the action performed.
  • The "policy" is the strategy of choosing actions in pursuit of better outcomes.

Now that we understand what reinforcement learning is, let us dive into the origins and evolution of reinforcement learning and deep reinforcement learning, and into how they can solve problems that supervised or unsupervised learning cannot. A fun fact: Google's search engine is optimized using reinforcement learning algorithms.

Getting familiar with reinforcement learning terminology

The agent and the environment play a vital role in a reinforcement learning algorithm. The environment is the world the agent lives in. The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current state of the world is. The agent's goal is to maximize its cumulative reward, called the return. Before we write our first reinforcement learning algorithm, we need to understand the following terminology.

Reinforcement learning terminology
  1. State: A state is a complete description of the world; no information about the world is hidden from the state. It can be a position, a constant, or something dynamic. We mostly record these states in arrays, matrices, or higher-order tensors.
  2. Action: Actions are usually based on the environment; different environments lead to different actions being available to the agent. The set of valid actions for an agent is recorded in a space called the action space. These are usually finite in number.
  3. Environment: This is the place where the agent lives and interacts. For different types of environments, we use different rewards, policies, etc.
  4. Reward and return: The reward function R is the one we must keep track of throughout reinforcement learning. It plays a vital role in tuning and optimizing the algorithm and in deciding when to stop training. It depends on the current state of the world, the action just taken, and the next state of the world.
  5. Policy: A policy is the rule an agent uses to select the next action. Policies are also referred to as the agent's brain.
Reinforcement learning loop

Now that we have seen all of the reinforcement learning terminology, let us solve a problem with a reinforcement learning algorithm. Before that, we need to understand how to frame the problem and map this terminology onto it.

Solving the taxi problem

Suppose we have a training area for our taxi where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B). Before that, we need to understand and set up the environment and get Python running. If you are starting with Python from scratch, I would recommend this article.

We can use OpenAI's Gym to set up the taxi-problem environment; it is one of the most commonly used libraries for solving reinforcement learning problems. Before using it, we need to install gym on your machine, which can be done with the Python package installer, also known as pip. The command to install it is below.

pip install gym

Now let's see how our environment is going to be rendered. All the models and interfaces for this problem are already configured in gym under the name Taxi-v2. To render this environment, see the code snippet after the problem statement below.

"There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions." (Source: https://gym.openai.com/envs/Taxi-v2/)
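
A minimal snippet along these lines should create and render the environment (a sketch assuming gym with the Taxi-v2 environment is installed; .env unwraps the default time-limit wrapper, as used throughout this article):

import gym

# Create the Taxi-v2 environment (.env skips the default step-limit wrapper)
env = gym.make("Taxi-v2").env

env.reset()   # start from a random initial state
env.render()  # print the grid, taxi, passenger and destinations to the console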

This will be the rendered output on the console:

[Rendered output of the Taxi-v2 environment on the console]

Perfect. The env is the core of OpenAI Gym; it is the unified environment interface. The following env methods are quite helpful to us:

env.reset: Resets the environment and returns a random initial state.
env.step(action): Steps the environment forward by one timestep.
env.render: Renders one frame of the environment (helpful for visualizing the environment).

env.step(action) returns the following variables:

  • observation: Observations of the environment.
  • reward: Whether your action was beneficial or not.
  • done: Indicates whether we have successfully picked up and dropped off a passenger, also called one episode.
  • info: Additional information such as performance and latency, for debugging purposes.

A short example using these calls follows this list.
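
As a quick illustration (a rough sketch assuming the classic gym API in which step returns a 4-tuple), here is what one random step looks like:

import gym

env = gym.make("Taxi-v2").env
state = env.reset()                 # random initial state (an integer in Taxi-v2)

action = env.action_space.sample()  # pick a random action
observation, reward, done, info = env.step(action)

print(observation, reward, done, info)
env.render()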

Now that we have seen the environment, let us understand the problem better. The taxi is the only car in this parking lot. We can break the parking lot up into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. Note that the current location of our taxi is the coordinate (3, 1).

In this environment, the passenger can be placed at four possible locations: R, G, Y, B, or, in (row, col) coordinates, [(0,0), (0,4), (4,0), (4,3)], if you interpret the environment rendered above as a coordinate grid.

When we also account for one (1) additional passenger state, inside the taxi, we can take all combinations of passenger locations and destination locations to get the total number of states for our taxi environment: there are four (4) destinations and five (4 + 1) passenger locations, so our taxi environment has 5 × 5 × 5 × 4 = 500 total possible states. At any moment the agent encounters one of these 500 states and takes an action. In our case, the action can be moving in a direction or deciding to pick up or drop off a passenger.

In other words, we have six possible actions: pickup, dropoff, north, east, south, west (the last four are the directions in which the taxi can move).

The set of all the actions our agent can take in a given state is called the action space. A quick sanity check of these numbers is sketched below.
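
As a rough sanity check (a sketch that assumes the Taxi environment exposes the encode(taxi_row, taxi_col, passenger_index, destination_index) helper found in its gym source), we can verify these counts and compute the state number used for illustration below:

import gym

env = gym.make("Taxi-v2").env

print(env.observation_space.n)  # 500 possible states
print(env.action_space.n)       # 6 possible actions

# Encode (taxi row, taxi col, passenger index, destination index) into one state number.
# Taxi at (3, 1), passenger at location index 2, destination index 0 gives 328,
# the illustration state used in the rest of this article.
state = env.encode(3, 1, 2, 0)
print(state)  # 328

env.s = state  # set the environment to that state
env.render()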

You will notice in the illustration above that the taxi cannot perform certain actions in certain states because of walls. In the environment's code, we simply provide a -1 penalty for every wall hit, and the taxi will not move anywhere. This just racks up penalties and causes the taxi to consider going around the wall.

Reward table: When the taxi environment is created, an initial reward table called P is also created. We can think of it as a matrix that has the number of states as rows and the number of actions as columns, i.e. a states × actions matrix.

Since every state is in this matrix, we can see the default reward values assigned to our illustration state:

>>> import gym
>>> env = gym.make("Taxi-v2").env
>>> env.P[328]
{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]
}

This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.

  • 0-5 correspond to the actions (south, north, east, west, pickup, dropoff) that the taxi can perform at our current state in the illustration.
  • done is used to tell us when we have successfully dropped off a passenger at the right location.
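
For example (a small sketch reusing the illustration state), an individual entry of this table can be unpacked like this:

import gym

env = gym.make("Taxi-v2").env

# Take the single (probability, next_state, reward, done) tuple for action 1 (north)
probability, next_state, reward, done = env.P[328][1][0]
print(probability, next_state, reward, done)  # 1.0, the state one row north, -1, False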

To solve the problem without any reinforcement learning, we can set the goal state and keep sampling from the action space, iterating until the goal state is reached. Reaching the goal earns the maximum reward (+20), every step in between costs 1 point, and an illegal pick-up or drop-off earns the minimum reward of -10.

Let us now code this up without reinforcement learning.

Since we have our default reward table P for each state, we can try to have our taxi navigate using just that.

We will create an infinite loop that runs until one passenger reaches one destination (one episode), or in other words, until the received reward is 20. The env.action_space.sample() method automatically selects one random action from the set of all possible actions.

Let's see what happened:

import gym
from time import sleep
from IPython.display import clear_output

# Creating the env
env = gym.make("Taxi-v2").env

# Setting the environment to our illustration state
env.s = 328


# Setting the number of iterations, penalties and reward to zero
epochs = 0
penalties, reward = 0, 0

frames = []

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1

    # Put each rendered frame into the dictionary for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    }
    )

    epochs += 1

print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

# Printing all the stored states, actions and rewards as an animation
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'].getvalue())
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)

print_frames(frames)

Output:

[Animation of the agent's rendered frames printed to the console]

Our problem is solved, but it is not optimized, and this approach will not work in all situations. We need a properly interacting agent so that the machine/algorithm requires far fewer iterations. The Q-Learning algorithm does exactly that; let us see how it is implemented in the next section.

Introduction to Q-Learning

This algorithm is the most basic and most commonly used reinforcement learning algorithm; it uses the environment's rewards to learn, over time, the best action to take in a given state. In the implementation above, we have the reward table P from which the agent learns. Using the reward table it selects the next action, checks whether it is beneficial, and then updates a new value called the Q-value. The newly created table is called the Q-table, and it maps each (State, Action) combination to a Q-value. If the Q-values are better, we get more optimized rewards.

For example, if the taxi is in a state where the passenger is at the taxi's current location, it is highly likely that the Q-value for pickup is higher than for other actions, such as dropoff or north.

Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the following equation:

Q(state, action) ← (1 − α) * Q(state, action) + α * (reward + γ * max Q(next state, all actions))

A question arises here: how are the Q-values initialized and how are they calculated? We initialize the Q-values with arbitrary constants, and then, as the agent exposes itself to the environment, it receives various rewards by executing different actions. Once an action is executed, the Q-value is updated by the equation above.

Here alpha and gamma are the parameters of the Q-learning algorithm. Alpha is known as the learning rate and gamma as the discount factor; both range between 0 and 1 and are sometimes equal to one. Gamma can be zero while alpha cannot, since the loss should be updated with some learning rate. Alpha here represents the same thing as in supervised learning. Gamma determines how much importance we give to future rewards.

Here is the algorithm broken down into steps:

  • Step 1: Initialize the Q-table with all zeros (or any arbitrary constant) as the Q-values.
  • Step 2: Let the agent react to the environment and explore the actions. For each change in state, select any one among all possible actions for the current state (S).
  • Step 3: Travel to the next state (S') as a result of that action (a).
  • Step 4: For all possible actions from the state (S'), select the one with the highest Q-value.
  • Step 5: Update the Q-table values using the equation.
  • Step 6: Set the next state as the current state.
  • Step 7: If the goal state is reached, end the episode and then repeat the process.

Q-Learning in Python

import gym
import numpy as np
import random
from IPython.display import clear_output

# Init Taxi-V2 Env
env = gym.make("Taxi-v2").env

# Init arbitrary values (a Q-table of zeros)
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1


all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    # Init vars
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            # Check the action space
            action = env.action_space.sample()
        else:
            # Check the learned values
            action = np.argmax(q_table[state])

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        # Update the new value
        new_value = (1 - alpha) * old_value + alpha * \
            (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print("Episode: {i}")

print("Training finished.")

Perfect. Now all of your learned values are stored in the variable q_table.

With this, the model is trained and the agent can now drop off passengers much more accurately. From here, you can explore reinforcement learning further and start coding up new problems.
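
As a follow-up (a hedged sketch that is not shown above and assumes the training code with its env, np, and q_table variables is still in scope), you could evaluate the learned Q-table by always acting greedily, i.e. picking np.argmax(q_table[state]) at every step, and measuring timesteps and penalties per episode:

# Evaluating the greedy policy derived from the learned Q-table
total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties = 0, 0
    done = False

    while not done:
        action = np.argmax(q_table[state])   # always take the best known action
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        epochs += 1

    total_epochs += epochs
    total_penalties += penalties

print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")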

This article is reproduced from Towards Data Science. Original address