We previously discussed **differentiable programming**, the idea of integrating existing programs into deep learning models. But if you are a practitioner working on something like self-driving cars, what does differentiable programming mean in practice? How does it affect the way we express problems, train our models, assemble our datasets, and ultimately the results we achieve?

This post shows what DP can bring to some simple but classic control problems, of the kind we would usually tackle with reinforcement learning (RL). DP-based models not only learn far more effective control strategies than RL, they also train orders of magnitude faster. The **code** is all available to run for yourself; most of the models train in a matter of seconds on any laptop.

## Follow the gradient

Differentiation is the key to deep learning: given a function $y = f(x)$, we use the gradient $\frac{dy}{dx}$ to work out how a change in $x$ will affect $y$. Despite its mathematical costume, the gradient is actually a very general and intuitive concept. Forget the formulas you had to stare at in school; let's do something more fun, like throwing stuff.

When we throw something with a trebuchet, our $x$ represents a setting (say, the mass of the counterweight, or the release angle), and $y$ is the distance the projectile travels before landing. If you are trying to aim, the gradient tells you something very useful: whether a change in a given setting will increase or decrease the distance. To maximise the distance, just follow the gradient.

OK, but how can we get this magic number? The trick is a process called **algorithmic differentiation**, which can differentiate not only the simple formulas you learned in school, but also arbitrarily complex *programs*, such as our trebuchet simulator. The upshot is that we can take a **simple trebuchet simulator** written in Julia with **DiffEq**, with no deep learning in sight, and get gradients for it in a single function call.

```julia
# `gradient` is provided by Flux/Zygote

# What you did in school
gradient(x -> 3x^2 + 2x + 1, 5) # (32,)

# Something a little more advanced: differentiating a whole simulator
gradient((wind, angle, weight) -> Trebuchet.shoot(wind, angle, weight),
         -2, 45, 200) # (4.02, -0.99, 0.051)
```

Now that we have it, let's do something interesting with it.

## Throwing things

An easy use of this is to aim the trebuchet at a target, using gradients to fine-tune the release angle; this kind of thing is common under the name *parameter estimation*, and we have **covered a similar example before**. We can make things more interesting by going meta: instead of aiming the trebuchet given a single target, we will optimise a neural network that can aim it at *any* target. Here is how it works: the network takes two inputs, the target distance in metres and the current wind speed. It spits out trebuchet settings (the mass of the counterweight and the release angle) that get fed into the simulator, which calculates the achieved distance. We then compare that with the target and *backpropagate through the entire chain*, end to end, to adjust the weights of the network. Our "dataset" is a randomly chosen set of targets and wind speeds.
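To make the aiming step concrete, here is a minimal Python sketch of gradient-based aiming, using the ideal-projectile range formula as a stand-in for the trebuchet simulator. The function names, launch speed, and learning rate here are illustrative assumptions, not from the post (the post's `Trebuchet.shoot` is a Julia simulator whose gradient comes from algorithmic differentiation; here we write the derivative by hand).

```python
import math

G = 9.81  # gravity, m/s^2

def shoot(angle, v=20.0):
    """Range of an ideal projectile launched at `angle` radians.
    A hand-rolled stand-in for a trebuchet simulator."""
    return v**2 * math.sin(2 * angle) / G

def shoot_grad(angle, v=20.0):
    """Analytic d(range)/d(angle); algorithmic differentiation
    would produce this for us automatically."""
    return 2 * v**2 * math.cos(2 * angle) / G

def aim(target, angle=0.3, lr=1e-4, steps=200):
    """Gradient descent on the loss (distance - target)^2 / 2."""
    for _ in range(steps):
        err = shoot(angle) - target
        angle -= lr * err * shoot_grad(angle)  # follow the gradient
    return angle

angle = aim(30.0)  # release angle that lands roughly 30 m away
```

The same loop is what the neural-network version does end to end: the network proposes settings, the simulator scores them, and the error gradient flows back through both.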

A nice property of this simple model is that training it is *fast*, because we have expressed exactly what we want from the model in a fully differentiable way. Initially, it looks like this:

After about five minutes of training (on a single core of my laptop CPU), it looks like this:

If you want to try to throw it off, increase the wind speed:

Its aim is off by only 16 centimetres, or about 0.3%.

This is about the simplest control problem possible, and we use it mainly for illustration. But we can apply the same technique, in more advanced ways, to classic RL problems as well.

## Cart, meet pole

A more recognisable control problem is **CartPole**, the "hello world" of reinforcement learning. The task is to learn to balance an upright pole by nudging its base left or right. Our setup is broadly similar to the trebuchet case: a **Julia implementation** means we can treat the reward produced by the environment directly as a loss. DP allows us to move seamlessly from model-free to model-based RL.

A sharp-eyed reader may notice a snag. CartPole's action space (nudge left or nudge right) is discrete, and therefore not differentiable. We get around this by introducing a *differentiable discretisation*, defined **as follows**:

$$
\begin{aligned}
f(x) &= \begin{cases} 1 & x \ge 0 \\ -1 & x < 0 \end{cases} \\
\frac{df}{dx} &= 1
\end{aligned}
$$

In other words, we force the gradient to behave as if $f$ were the identity function. Given how much the mathematical notion of differentiability is already abused in ML, it is perhaps unsurprising that we can cheat here; all we need for training is a signal to steer our pseudo-random walk around parameter space, and the rest is details.
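A minimal Python sketch of this differentiable discretisation, with the forward and backward passes written out by hand (the function names and the toy loss below are our own illustrations, not from the post):

```python
def sign_st(x):
    """Forward pass: hard {-1, +1} discretisation of the action."""
    return 1.0 if x >= 0 else -1.0

def sign_st_grad(upstream):
    """Backward pass: pretend sign_st was the identity, so the
    upstream gradient flows through unchanged (df/dx = 1)."""
    return upstream

# Chain rule through a tiny loss g(y) = (y - 1)^2 applied to f(x):
x = -0.3
y = sign_st(x)               # y = -1.0
dg_dy = 2 * (y - 1)          # gradient of the loss at y: -4.0
dg_dx = sign_st_grad(dg_dy)  # straight-through: also -4.0
```

In a real framework this would be registered as a custom gradient, so the backward pass replaces the true derivative (zero almost everywhere) with 1.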

The results speak for themselves. Where RL methods need hundreds of episodes of training before solving the problem, the DP model needs only around five episodes to win conclusively.

## The pendulum and backprop through time

A major talking point in RL is dealing with *delayed reward*, when an action only helps us several steps into the future. DP allows this too, and in a very familiar way: when the environment is differentiable, we can actually train the agent using backpropagation, just like a regular recurrent network! In this case, the environment state becomes the "hidden state" that changes between time steps.
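Here is a toy Python sketch of this idea: a one-dimensional stand-in environment (not the Pendulum), a linear policy with a single weight, and a hand-written backward pass through the unrolled time steps. The dynamics and hyperparameters are illustrative assumptions.

```python
def rollout(w, s0=1.0, steps=10, dt=0.1):
    """Unroll the environment: action a_t = w * s_t,
    dynamics s_{t+1} = s_t + dt * a_t."""
    states = [s0]
    for _ in range(steps):
        s = states[-1]
        states.append(s + dt * w * s)
    return states

def bptt_grad(w, states, dt=0.1):
    """Backprop through time for the loss s_T^2 (drive the final
    state to zero). The environment state plays the role of the
    hidden state carried between time steps."""
    ds = 2.0 * states[-1]  # dL/ds_T
    dw = 0.0
    for s in reversed(states[:-1]):
        dw += ds * dt * s    # through a_t = w * s_t
        ds *= 1.0 + dt * w   # through s_{t+1} = s_t + dt * w * s_t
    return dw

w = 0.0  # single policy weight
for _ in range(500):
    states = rollout(w)
    w -= 0.1 * bptt_grad(w, states)
```

The backward loop mirrors BPTT in a recurrent network: gradients flow through every time step of the environment back to the policy weight.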

To demonstrate this technique, we looked at the **Pendulum** environment, where the task is to swing a pendulum until it stands upright, then keep it balanced with minimal effort. This is hard for RL models; after around 20 episodes of training the problem gets solved, but the usual route to a solution is visibly sub-optimal. In contrast, BPTT can beat the **RL leaderboard** after *a single episode of training*. It is instructive to watch this episode unfold; at the beginning of the recording the strategy is random, and the model improves over time. The pace of learning is almost alarming.

Despite having seen only a single episode, the model copes well with any initial angle, and has something pretty close to the optimal strategy. When restarted, the model looks more like this.

This is just the beginning, though; the real wins will come from applying DP to environments that are too hard for RL to make progress on at all.

## The map is not the territory

A limitation of these toy models is that they equate the simulated training environment with the test environment; the real world, of course, is not differentiable. In a more realistic model, the simulation would give us a coarse outline of the behaviour, to be refined with data. That data would inform, say, the simulated effect of wind, in turn improving the quality of the gradients the simulator passes to the controller. Models could even form part of a controller's forward pass, enabling it to refine its predictions without having to learn system dynamics from scratch. Exploring these new architectures will make for exciting future work.

## Conclusion

The core idea is that *differentiable programming*, in which we simply write an arbitrary numerical program and optimise it via gradients, is a powerful way to arrive at better deep learning models and architectures, especially when we have a large library of differentiable code to hand. The toy models described here are really just previews, but we hope they give an intuition for how these ideas can be applied in more realistic ways.

Just as functional programming involves reasoning about and expressing algorithms using functional patterns, differentiable programming involves expressing algorithms using differentiable patterns. Many such design patterns have already been developed by the deep learning community, for example for handling control problems or sequence- and tree-structured data. This post introduced a couple of new ones, and as the field matures many more will be invented. The resulting programs are likely to make even the most advanced of today's deep learning architectures look crude by comparison.

This article is republished from Medium (original address).
