## What is Q-Learning?

Q-learning is a value-based learning algorithm in reinforcement learning.

Suppose a **robot** has to cross a **maze** and reach an end point. There are **mines**, and the robot can only move one tile at a time. If the robot steps on a mine, it dies. The robot has to reach the end point in the shortest time possible.

The scoring/reward system is as follows:

- The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches the goal as fast as possible.
- If the robot steps on a mine, it loses 100 points and the game ends.
- If the robot gets power ⚡️, it gains 1 point.
- If the robot reaches the end goal, it gains 100 points.
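The scoring rules above can be sketched as a simple reward function. The tile labels (`"power"`, `"mine"`, `"goal"`) are hypothetical names chosen for illustration, not from any library:

```python
# A minimal sketch of the scoring rules above. Tile names are hypothetical.
def reward(tile):
    """Reward for moving onto a tile: -1 per ordinary step, with the
    special values from the scoring table for power, mines, and the goal."""
    return {"power": 1, "mine": -100, "goal": 100}.get(tile, -1)

def is_terminal(tile):
    """The episode ends on a mine (death) or at the goal."""
    return tile in ("mine", "goal")
```
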

Now, the obvious question is: **how do we train a robot to reach the end goal by the shortest path without stepping on a mine?**

So how do we solve this problem?

## What is a Q-Table?

Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future reward for each action at each state. Basically, this table will guide us to the best action at each state.

Each non-edge tile has four possible actions: when the robot is at such a state, it can move up, down, left, or right.

So let's model this environment in Q-Table.

In the Q-table, the columns are the actions and the rows are the states.

Each Q-table score will be the maximum expected future reward that the robot will receive when taking the action in that state. This is an iterative process because we need to improve the Q-Table on each iteration.

But the problem is:

- How do we calculate the value of the Q table?
- Is the value available or predefined?

In order to learn each value of the Q-table, we use the **Q-Learning algorithm**.

## Mathematical basis for Q-Learning

#### Q-Function

The Q-function uses the Bellman equation and takes two inputs: state (**s**) and action (**a**).

Using the above function, we get the values of **Q** for the cells in the table.
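The update performed by this function appeared as an image in the original article; the usual formulation of the Q-learning update based on the Bellman equation is:

$$
Q^{\text{new}}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Big]
$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r_t$ is the reward received for taking action $a_t$ in state $s_t$.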

When we start, all the values in the Q table are zero.

There is an iterative process of updating the values. As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.
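One such update can be sketched in a few lines; the values of `alpha` (learning rate) and `gamma` (discount factor) below are illustrative defaults, not prescribed by the article:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update: nudge Q[s][a] toward the Bellman target
    r + gamma * max over a' of Q[s_next][a']."""
    target = r + gamma * max(Q[s_next])      # best estimated future value
    Q[s][a] += alpha * (target - Q[s][a])    # move estimate toward the target
    return Q[s][a]
```
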

Now let's understand how the update works.

## Detailed process of the Q-Learning algorithm

Each colored box is a step. Let's take a closer look at each step.

**Step 1: Initialize the Q table**

We will first build a Q-table. There are n columns, where n = the number of actions. There are m rows, where m = the number of states. We will initialize the values to 0.

In our robot example, we have four actions (a = 4) and five states (s = 5). So we will build a table with four columns and five rows.
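Step 1 as code: a 5 × 4 table of zeros, with rows as states and columns as actions (plain lists, to stay dependency-free):

```python
n_states, n_actions = 5, 4   # the robot example: s = 5, a = 4

# Rows are states, columns are actions (up, down, left, right),
# every entry initialized to zero.
q_table = [[0.0] * n_actions for _ in range(n_states)]
```
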

**Steps 2 and 3: Select and perform an action**

The combination of these steps is performed for an undefined amount of time: it runs until we stop the training, or until the training loop stops as defined in the code.

We will select the action (a) in the state according to the Q-Table. However, as mentioned earlier, each Q value is 0 when the episode begins.

This is where the trade-off between exploration and exploitation comes in.

We will use something called the **epsilon-greedy strategy**.

At the beginning, the ε rate will be higher. The robot will explore the environment and randomly select actions. The logic behind this is that robots don't know anything about the environment.

As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.

During the process of exploration, the robot becomes progressively more confident in its estimates of the Q-values.
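The epsilon-greedy rule with a decaying ε can be sketched like this; the decay schedule (`rate`, `floor`) is an assumption, as many variants exist:

```python
import random

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon (explore),
    otherwise the best-known action for this state (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q[state]))
    return max(range(len(Q[state])), key=lambda a: Q[state][a])

def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    """Shrink epsilon after each episode so exploitation gradually wins."""
    return max(floor, epsilon * rate)
```
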

**In our robot example, there are four actions to choose from**: up, down, left, and right. We are starting the training now, and our robot knows nothing about the environment. So the robot chooses a random action, say right.

We can now update the Q-value for being at the start and moving right using the Bellman equation.

**Steps 4 and 5: Evaluation**

Now we have taken action and observed the results and rewards. We need to update the function Q(s, a).

In the case of the robot game, the recurring score/reward structure is:

- **power** ⚡️ = +1
- **mine** = -100
- **end** = +100

We will repeat this process over and over again until learning stops. In this way, the Q table will be updated.
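Putting the steps together, the whole loop looks roughly like this. The environment interface (`env.reset()`, `env.step(a)`) is hypothetical; with a gym-style environment the structure is the same:

```python
import random

def train(env, n_states, n_actions, episodes=1000,
          alpha=0.1, gamma=0.9, epsilon=1.0, eps_decay=0.995, eps_min=0.01):
    """Tabular Q-learning: repeat select -> act -> update until training stops."""
    # Step 1: initialize the Q-table to zeros.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Step 2: epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            # Step 3: perform the action, observe next state and reward.
            s_next, r, done = env.step(a)
            # Steps 4 and 5: Bellman update of Q(s, a).
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
        # Decay epsilon so the agent exploits more as it learns.
        epsilon = max(eps_min, epsilon * eps_decay)
    return Q
```

A usage sketch: wrap the maze as an object whose `step(a)` returns `(next_state, reward, done)`, call `train(env, n_states=5, n_actions=4)`, and read the greedy policy off the returned table.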

This article was translated from "An introduction to Q-Learning: reinforcement learning".

## Wikipedia version

Q-learning is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy that tells the agent what action to take under what circumstances. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. [1] Given infinite exploration time and a partly random policy, Q-learning can identify an optimal action-selection policy for any given FMDP. "Q" names the function the algorithm computes, which returns the expected reward for an action taken in a given state, and can be said to stand for the "quality" of that action.
