## What is Q-Learning?

Q-learning is a value-based reinforcement learning algorithm.

Suppose a robot must cross a maze and reach the end point. There are mines, and the robot can only move one tile at a time. If the robot steps on a mine, it dies. The robot must reach the end point in the shortest possible time.

The scoring/reward system is as follows:

1. The robot loses 1 point at every step. This is done to make the robot take the shortest path and reach the target as quickly as possible.
2. If the robot steps on a mine, it loses 100 points and the game ends.
3. If the robot picks up power ⚡️, it gains 1 point.
4. If the robot reaches the final goal, it gains 100 points.
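The scoring rules above can be sketched as a small reward function. This is a minimal sketch: the tile labels (`"mine"`, `"power"`, `"goal"`, `"empty"`) are hypothetical names chosen for illustration, not from the original article.

```python
def step_reward(tile):
    """Return (reward, episode_over) for stepping onto a tile."""
    if tile == "mine":
        return -100, True   # stepping on a mine: lose 100 points, game ends
    if tile == "power":
        return 1, False     # picking up power: gain 1 point
    if tile == "goal":
        return 100, True    # reaching the end: gain 100 points
    return -1, False        # every ordinary step costs 1 point
```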

Now, the obvious question is: how do we train the robot to reach the final goal along the shortest path without stepping on a mine?

## What is a Q-Table?

A Q-table is just a fancy name for a simple lookup table in which we calculate the maximum expected future reward for each action at each state. Basically, this table guides us to the best action at each state. Each non-edge tile has four possible actions: when the robot is on such a tile, it can move up, down, left, or right.

So let's model this environment in Q-Table.

In the Q-table, the columns are the actions and the rows are the states. Each Q-table score is the maximum expected future reward the robot will receive if it takes that action in that state. This is an iterative process, because we need to improve the Q-table at each iteration.

But the problem is:

• How do we calculate the value of the Q table?
• Are the values available or predefined?

In order to learn each value of the Q-table, we use the Q-learning algorithm.

## Mathematical basis for Q-Learning

#### Q-Function

The Q-function uses the Bellman equation and takes two inputs: a state (s) and an action (a). Using this function, we get the Q value for each cell in the table.

When we start, all the values in the Q-table are zero.

Updating the values is an iterative process: as we explore the environment and keep updating the Q values in the table, the Q function gives us better and better approximations.
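A standard form of this iterative update, derived from the Bellman equation, is the following (the learning rate $\alpha$ and discount factor $\gamma$ are conventional symbols not named in the text above):

```latex
Q^{\text{new}}(s, a) \leftarrow Q(s, a)
  + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]
```

Here $r$ is the reward received for taking action $a$ in state $s$, and $s'$ is the resulting next state.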

Now let's understand how the update works.

## Detailed process of the Q-Learning algorithm

Each colored box is a step. Let's take a closer look at each step.

#### Step 1: Initialize the Q table

We will first build a Q-table. It has n columns, where n = the number of actions, and m rows, where m = the number of states. We initialize all values to 0. In our robot example, we have four actions (a = 4) and five states (s = 5), so we build a table with four columns and five rows.
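Step 1 can be sketched in a few lines. The state and action counts are the ones from the robot example; using a NumPy array is just one convenient choice.

```python
import numpy as np

# Five states (m rows) and four actions (n columns: up, down, left, right).
n_states = 5
n_actions = 4

# All Q values start at zero.
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)  # (5, 4)
```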

#### Steps 2 and 3: Select and perform an action

These two steps run in combination for an indefinite amount of time: they repeat until we stop training, or until the training loop terminates as defined in the code.

We will select an action (a) in the current state based on the Q-table. However, as mentioned earlier, every Q value is 0 when the episode begins.

This is where the exploration/exploitation trade-off comes into play.

We will use something called the epsilon greedy strategy.

At the beginning, the epsilon rate will be higher. The robot will explore the environment and randomly choose actions. The logic behind this is that the robot does not yet know anything about the environment.

As the robot explores the environment, the epsilon rate decreases and the robot begins to exploit what it has learned.

During the exploration process, the robot gradually becomes more confident in its estimates of the Q values.

For the robot example, there are four actions to choose from: up, down, left, and right. We now start training; our robot knows nothing about the environment, so it chooses a random action, say right. We can now use the Bellman equation to update the Q value for being at the start and moving right.

#### Steps 4 and 5: Evaluation

Now we have taken an action and observed the result and the reward. We need to update the function Q(s, a). In the case of the robot game, the recurring score/reward structure is:

• power = +1
• mine = -100
• end = +100

We will repeat this process over and over again until learning stops. In this way, the Q-table gets updated.
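Putting steps 1 through 5 together, the whole loop can be sketched on a tiny one-dimensional corridor: states 0 to 4 in a row, with the goal at state 4 and a cost of 1 point per step. The mine is omitted to keep the example short, and the layout, hyperparameters, and episode count are illustrative assumptions, not taken from the article.

```python
import random

import numpy as np

random.seed(0)

N_STATES, N_ACTIONS = 5, 2              # actions: 0 = left, 1 = right
ALPHA, GAMMA = 0.1, 0.9                 # learning rate, discount factor
q_table = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    """Move one tile; return (next_state, reward, done)."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    if nxt == N_STATES - 1:
        return nxt, 100, True           # reached the goal
    return nxt, -1, False               # ordinary step costs 1 point

epsilon = 1.0
for episode in range(500):
    state, done = 0, False
    while not done:
        # Steps 2-3: epsilon greedy action selection, then act.
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = int(np.argmax(q_table[state]))
        nxt, reward, done = step(state, action)
        # Steps 4-5: Bellman update of Q(s, a).
        q_table[state, action] += ALPHA * (
            reward + GAMMA * np.max(q_table[nxt]) - q_table[state, action])
        state = nxt
    epsilon *= 0.99                     # decay exploration over time

# After training, "right" should be the preferred action in every
# non-goal state, since it leads toward the goal.
```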