What is Q-Learning?

Q-learning is a value-based reinforcement learning algorithm.

Suppose a robot must cross a maze and reach the end point. There are mines in the maze, and the robot can only move one floor tile at a time. If the robot steps on a mine, it dies. The robot must reach the end point in the shortest possible time.

The scoring/reward system is as follows:

  1. The robot loses 1 point at every step. This pushes the robot to take the shortest path and reach the goal as quickly as possible.
  2. If the robot steps on a mine, it loses 100 points and the game ends.
  3. If the robot picks up power ⚡️, it gains 1 point.
  4. If the robot reaches the final goal, it gains 100 points.
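
The scoring rules above can be sketched as a small lookup. This is only an illustration: the tile names and the `reward` helper are hypothetical, and the sketch assumes that the step penalty applies to plain tiles while special tiles use their own listed value, which the rules do not spell out.

```python
# Reward per tile type, taken from the scoring rules above.
# Assumption: a plain ("empty") tile costs the 1-point step penalty, and
# special tiles use their listed value instead of adding to the penalty.
REWARDS = {"empty": -1, "power": 1, "mine": -100, "goal": 100}

# Assumption: the game ends on a mine (stated above) and on the goal.
TERMINAL_TILES = {"mine", "goal"}


def reward(tile: str) -> int:
    """Return the reward for stepping onto a tile of the given type."""
    return REWARDS[tile]


def is_terminal(tile: str) -> bool:
    """Return True if stepping onto this tile ends the game."""
    return tile in TERMINAL_TILES
```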

Now, the obvious question is: how do we train a robot to reach the final goal by the shortest path without stepping on a mine?

Q-Learning Overview

So how do we solve this problem?


What is a Q-Table?

A Q-table is just a fancy name for a simple lookup table in which we store the maximum expected future reward for each action in each state. Basically, this table guides us to the best action in each state.


Each non-edge tile has four possible actions. When the robot is in a given state, it can move up, down, right, or left.

So let's model this environment in Q-Table.

In the Q-table, the columns are the actions and the rows are the states.

Each Q-table score is the maximum expected future reward that the robot will receive if it takes that action in that state. This is an iterative process, because we need to improve the Q-table on each iteration.
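
As a rough sketch of how such a table is read, the snippet below looks up the highest-scoring action for a state. The state names and Q values are hypothetical placeholders chosen for illustration; in practice the values are learned during training.

```python
ACTIONS = ["up", "down", "left", "right"]  # one column per action

# Hypothetical Q values: one row per state, one entry per action.
q_table = {
    "start": [0.0, -1.2, -0.5, 3.4],
    "near_mine": [2.1, -80.0, 0.3, -0.7],
}


def best_action(state: str) -> str:
    """Return the action with the highest Q value in the given state."""
    row = q_table[state]
    return ACTIONS[row.index(max(row))]
```

Calling `best_action("start")` picks the column with the largest value in the `"start"` row.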

But the problem is:

  • How do we calculate the values of the Q-table?
  • Are the values available or predefined?

To learn each value of the Q-table, we use the Q-learning algorithm.


Mathematical basis for Q-Learning


The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).

Bellman equation
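
In standard notation, the Q-function is usually written as the expected discounted sum of future rewards, given the current state and action. The symbols below (π for the policy, γ for the discount factor, and R with a subscript for the reward received at that time step) are conventional and are assumed here rather than taken from the text:

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid s_t, a_t \right]
```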

Using the above function, we get the Q values for the cells in the table.

When we start, all the values in the Q-table are zero.

Updating the values is an iterative process. As we explore the environment, we keep updating the Q values in the table, and the Q-function gives us better and better approximations.

Now let's understand how the update works.


Detailed process of the Q-Learning algorithm

Q-Learning process

Each colored box is a step. Let's take a closer look at each step.

Step 1: Initialize the Q table

We first build a Q-table. It has n columns, where n is the number of actions, and m rows, where m is the number of states. We initialize all values to 0.

In our robot example, we have four actions (a = 4) and five states (s = 5). So we will build a table with four columns and five rows.
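
A minimal sketch of this initialization in plain Python, using the sizes from the robot example. The variable names are illustrative only.

```python
N_STATES = 5   # rows, one per state (from the robot example)
N_ACTIONS = 4  # columns: up, down, left, right

# Build each row separately so the rows do not share one list object.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
```

Building the rows with a comprehension, rather than `[[0.0] * N_ACTIONS] * N_STATES`, avoids every row aliasing the same list, which would make an update to one state appear in all of them.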

Steps 2 and 3: Select and perform an action

These two steps are repeated for an undefined amount of time. That is, they run until we stop training, or until the training loop stops as defined in the code.

We choose an action (a) in the state (s) based on the Q-table. However, as mentioned earlier, every Q value is 0 when the first episode begins.

This is where the trade-off between exploration and exploitation comes into play.

We will use something called the epsilon-greedy strategy.

At the beginning, the epsilon (ε) rate is high. The robot explores the environment and selects actions at random. The logic behind this is that the robot does not yet know anything about the environment.

As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit what it has learned.

During exploration, the robot gradually becomes more confident in its estimates of the Q values.

In the robot example, there are four actions to choose from: up, down, left, and right. We are just starting training, so our robot knows nothing about the environment. The robot therefore chooses a random action, say right.
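
The epsilon-greedy choice described above could look roughly like this. It is a sketch, not the original author's code: the function names, the multiplicative decay, and the default `rate` and `minimum` values are assumptions made for illustration.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Pick an action index from one Q-table row with an epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: pick a best-valued action, breaking ties at random.
    best = max(q_row)
    best_actions = [i for i, q in enumerate(q_row) if q == best]
    return rng.choice(best_actions)


def decay_epsilon(epsilon, rate=0.995, minimum=0.01):
    """Shrink epsilon after an episode (assumed multiplicative schedule)."""
    return max(minimum, epsilon * rate)
```

With `epsilon` close to 1 the robot acts almost entirely at random, and as `epsilon` shrinks it relies more and more on the Q-table.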

Q-Learning execution and operation

We can now use the Bellman equation to update the Q value for being at the start and moving right.

Steps 4 and 5: Evaluation

We have now taken an action and observed an outcome and a reward. We need to update the function Q(s, a).

Q-Learning assessment
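
The usual Q-learning update moves Q(s, a) toward the observed reward plus the discounted best value of the next state: Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)]. The sketch below applies that rule to a list-of-lists table. The learning rate `alpha`, the discount factor `gamma`, their default values, and the `done` flag are assumptions introduced for illustration.

```python
def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # After a terminal step (mine or goal) there is no future reward.
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]
```

For example, with an all-zero table, a step that earns a reward of -1 moves the corresponding entry from 0 to `alpha * -1`.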

To reiterate, the scoring/reward structure in the robot game is:

  • power = +1
  • mine = -100
  • end = +100

We will repeat this process over and over again until learning stops. In this way, the Q table will be updated.
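
Putting the steps together, here is a small self-contained sketch of the whole loop. It does not reproduce the maze from this article: it uses a hypothetical one-dimensional corridor (mine on tile 0, start on tile 1, goal on tile 4) with only two actions, and the hyperparameters and epsilon schedule are assumptions.

```python
import random

# Hypothetical toy corridor: tile 0 is a mine, tile 1 the start, tile 4 the goal.
N_STATES, MINE, START, GOAL = 5, 0, 1, 4
MOVES = (-1, +1)  # action 0 = move left, action 1 = move right


def step(state, action):
    """Move one tile and return (next_state, reward, done)."""
    next_state = state + MOVES[action]
    if next_state == MINE:
        return next_state, -100, True
    if next_state == GOAL:
        return next_state, 100, True
    return next_state, -1, False  # ordinary step costs 1 point


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the Q-table with zeros.
    q = [[0.0] * len(MOVES) for _ in range(N_STATES)]
    epsilon = 1.0
    for _ in range(episodes):
        state, done = START, False
        while not done:
            # Steps 2 and 3: choose an action (epsilon-greedy) and perform it.
            if rng.random() < epsilon:
                action = rng.randrange(len(MOVES))
            else:
                action = q[state].index(max(q[state]))
            next_state, reward, done = step(state, action)
            # Steps 4 and 5: observe the reward and update Q(s, a).
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
        # Explore less as training goes on.
        epsilon = max(0.05, epsilon * 0.99)
    return q


q_table = train()
# Greedy action per tile after training (rows 0 and 4 are terminal and stay at zero).
greedy = [row.index(max(row)) for row in q_table]
```

After training, the greedy action on tiles 1 to 3 should be to move right, away from the mine and toward the goal.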

This article was translated from "An introduction to Q-Learning: reinforcement learning".


Wikipedia version

Q-learning is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy that tells the agent what action to take under what circumstances. It does not require a model of the environment (hence "model-free"), and it can handle stochastic transitions and rewards without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense of maximizing the expected value of the total reward over all successive steps, starting from the current state. [1] Given infinite exploration time and a partly random policy, Q-learning can identify an optimal action-selection policy for any given FMDP. "Q" names the function that returns the reward used to provide the reinforcement, and it can be said to stand for the "quality" of an action taken in a given state.
