OpenAI trains agents in a simple game of hide-and-seek, and the agents learn many other skills in the process.

Competition is one of the socio-economic dynamics that has shaped the evolution of our species. A great deal of the complexity and diversity on this planet emerged from the co-evolution of and competition between organisms, guided by natural selection. By competing against others, we are constantly forced to improve our knowledge and skills in specific areas. Recent developments in artificial intelligence (AI) have begun to exploit some of these competitive principles to influence the learning behavior of AI agents. Specifically, the field of multi-agent reinforcement learning (MARL) is heavily influenced by competition and game-theoretic dynamics. Recently, researchers from OpenAI began training AI agents in a simple game of hide-and-seek and were astonished by some of the behaviors the agents naturally developed. The findings were just published in an interesting research paper.

Learning through competition is one of the emerging paradigms in AI, and it is strikingly similar to how knowledge evolves in humans. From the time we are babies, we develop new knowledge by exploring our surrounding environment and interacting with others, sometimes collaboratively and sometimes competitively. This dynamic contrasts sharply with the way we build AI systems today. Although supervised learning methods remain the dominant paradigm in AI, they are relatively impractical to apply in many real-world situations. This is especially true in settings where agents need to interact with physical objects in relatively unknown environments. In those cases, it is natural for agents to develop new knowledge organically by continually collaborating and/or competing with one another.

Multi-agent autocurricula and emergent behavior

One of the side effects of learning through competition is that agents can develop unexpected behaviors. In AI theory, this is known as an autocurriculum, and it offers a front-row view of how knowledge develops. Imagine that you are training an AI agent to master a particular game, and suddenly the agent discovers a strategy that had never been tried before. Although autocurricula also occur in single-agent reinforcement learning systems, the phenomenon is far more evident when learning develops through competition. This is the so-called multi-agent autocurriculum.

In a competitive multi-agent AI environment, different agents compete against each other to evaluate a particular strategy. When a new successful strategy or mutation emerges, it changes the implicit task distribution that neighboring agents need to solve and creates new pressure for adaptation. These evolutionary arms races create implicit autocurricula, whereby competing agents continually create new tasks for one another, as the toy sketch below illustrates. A key element of multi-agent autocurricula is that the emergent behaviors the agents learn evolve organically rather than as the result of pre-established incentives. Not surprisingly, multi-agent autocurricula have become one of the most successful techniques for training AI agents in multiplayer games.
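To make this dynamic concrete, here is a toy, self-contained Python sketch of an arms race between two competitors: each side repeatedly best-responds to the other's latest frozen strategy, so every improvement implicitly defines a harder task for the opponent. The numeric "strategies", the scoring function, and the update rule are all invented for illustration; nothing here comes from OpenAI's system.

```python
# Toy illustration of a multi-agent autocurriculum via alternating adaptation.
# The scalar "skill" values and the best-response rule are invented for this sketch.

import random


def play(hider_skill: float, seeker_skill: float) -> float:
    """Return the hider's score for one match; the seeker's score is the negative (zero-sum)."""
    return hider_skill - seeker_skill + random.gauss(0, 0.1)


def best_response(own_skill: float, opponent_skill: float, lr: float = 0.5) -> float:
    """Move toward whatever skill level beats the current (frozen) opponent."""
    return own_skill + lr * max(0.0, opponent_skill + 1.0 - own_skill)


hider, seeker = 0.0, 0.0
for generation in range(5):
    hider = best_response(hider, seeker)    # hiders adapt to the current seekers
    seeker = best_response(seeker, hider)   # seekers adapt to the improved hiders
    print(f"gen {generation}: hider={hider:.2f}, seeker={seeker:.2f}, "
          f"hider score={play(hider, seeker):+.2f}")
```

Each generation raises the bar for both sides, which is the essence of an implicit curriculum: no task designer is needed, because the opponent is the task.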

Training agents in hide-and-seek

The original OpenAI experiment was designed to train a series of reinforcement learning agents to master the game of hide-and-seek. In the target setting, agents play a team-based game of hide-and-seek in a physics-based environment. The hiders' task is to stay out of the seekers' line of sight, and the seekers' task is to keep the hiders in view. Objects are scattered throughout the environment, and agents can grab these objects and lock them in place. There are also randomly generated, immovable rooms and walls that agents must learn to navigate. The environment has no explicit incentives for agents to interact with the objects. Agents receive a team-based reward: hiders receive +1 if all hiders are hidden and -1 if any hider is seen by a seeker. Seekers receive the opposite reward, -1 if all hiders are hidden and +1 otherwise. To keep agent behavior within reasonable bounds, agents are penalized if they move too far outside the play area. During the preparation phase, all agents receive zero reward.
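As a rough illustration, the team-based reward rule described above might look something like the sketch below. The function name, the boolean flags, and the exact out-of-bounds penalty value are assumptions made for illustration; they are not taken from OpenAI's implementation.

```python
# Minimal sketch of the team-based hide-and-seek reward rule described above.
# The helper name, flags, and penalty magnitude are illustrative assumptions.

def team_rewards(any_hider_seen: bool,
                 in_preparation_phase: bool,
                 agents_out_of_bounds: dict[str, bool],
                 out_of_bounds_penalty: float = -1.0) -> dict[str, float]:
    """Return one step's reward for the hider team and the seeker team."""
    if in_preparation_phase:
        hider_r, seeker_r = 0.0, 0.0        # no reward during the preparation phase
    elif any_hider_seen:
        hider_r, seeker_r = -1.0, +1.0      # a seeker keeps at least one hider in view
    else:
        hider_r, seeker_r = +1.0, -1.0      # every hider is out of sight

    rewards = {"hiders": hider_r, "seekers": seeker_r}
    # Agents that wander too far outside the play area are additionally penalized.
    for team, out in agents_out_of_bounds.items():
        if out:
            rewards[team] += out_of_bounds_penalty
    return rewards


# Example: mid-game, all hiders hidden, nobody out of bounds.
print(team_rewards(any_hider_seen=False, in_preparation_phase=False,
                   agents_out_of_bounds={"hiders": False, "seekers": False}))
# -> {'hiders': 1.0, 'seekers': -1.0}
```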

To train the hide-and-seek agents, the OpenAI researchers used the same training infrastructure they had used for other multiplayer games such as OpenAI Five and Dactyl. This infrastructure relies on self-play to train the agents, which acts as a natural curriculum because agents always face opponents of an appropriate level. Agent policies are composed of two separate networks with different parameters: a policy network that produces an action distribution and a critic network that predicts the discounted future returns. Each object is embedded and then passed through a masked residual self-attention block, similar to those used in transformers, where the attention is over objects rather than over time. Objects that are not within the agent's line of sight and field of view are masked out so that the agent has no information about them.
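A rough sketch of an entity-centric policy along these lines is shown below: each observed object is embedded, passed through a masked residual self-attention block over objects (not over time), and pooled into features for policy and value heads. The layer sizes, the use of PyTorch's MultiheadAttention, and the shared trunk for both heads are assumptions made for brevity (the text above describes the policy and critic as separate networks); this is not OpenAI's exact architecture.

```python
# Sketch of an entity-centric policy with masked residual self-attention over objects.
# Layer sizes and the shared policy/value trunk are illustrative assumptions.

import torch
import torch.nn as nn


class EntityAttentionPolicy(nn.Module):
    def __init__(self, obj_dim=16, embed_dim=64, n_heads=4, n_actions=9):
        super().__init__()
        self.embed = nn.Linear(obj_dim, embed_dim)          # per-object embedding
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.policy_head = nn.Linear(embed_dim, n_actions)  # action distribution
        self.value_head = nn.Linear(embed_dim, 1)           # discounted-return estimate

    def forward(self, objects, unseen_mask):
        # objects: (batch, n_objects, obj_dim)
        # unseen_mask: (batch, n_objects) bool, True where the object is NOT visible
        x = self.embed(objects)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=unseen_mask)
        x = self.norm(x + attn_out)                          # residual self-attention block
        # Pool over objects, ignoring masked (unseen) entries.
        keep = (~unseen_mask).unsqueeze(-1).float()
        pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        action_dist = torch.distributions.Categorical(logits=self.policy_head(pooled))
        return action_dist, self.value_head(pooled)


# Example: one agent observing 5 objects, 2 of which are out of line of sight.
policy = EntityAttentionPolicy()
objs = torch.randn(1, 5, 16)
mask = torch.tensor([[False, False, False, True, True]])
dist, value = policy(objs, mask)
print(dist.sample(), value)
```

Attending over objects rather than time steps lets the same network handle a variable number of boxes, ramps, and other agents in each scene.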

Self-play and emergent behavior in hide-and-seek

As the AI agents competed against one another in the environment described earlier, they not only mastered the game of hide-and-seek but also developed as many as six distinct strategies, none of which were part of the initial incentives.

Initially, the hiders and seekers simply learned to crudely run away and chase. After approximately 25 million episodes of hide-and-seek, the hiders learned to use the tools at their disposal and intentionally modify their environment. They began to construct secure shelters to hide in by moving many boxes together or against walls and locking them in place. After another 75 million episodes, the seekers also learned rudimentary tool use; they learned to move and use ramps to jump over obstacles, allowing them to enter the hiders' shelters. Ten million episodes later, the hiders learned to defend against this strategy: they learned to bring the ramps to the edge of the play area and lock them in place, seemingly removing the only tool the seekers had. After a total of 380 million episodes of training, the seekers learned to bring a box to the edge of the play area where the hiders had locked the ramps. The seekers then jump on top of the box and surf it to the hiders' shelter; this works because the environment allows agents to move together with a box regardless of whether or not they are on the ground. In response, the hiders learned to lock all of the boxes in place before building their shelter. The figure below shows some of these emergent behaviors.

The fascinating thing about the emergent behaviors of the hide-and-seek agents is that they are part of an autocurriculum driven by internal competition and evolved completely organically. In almost all cases, these emergent behaviors performed better than behaviors driven by explicit incentives.

OpenAI's hide-and-seek experiment is absolutely fascinating and clearly demonstrates the potential of competitive multi-agent environments as a catalyst for learning. Many of OpenAI's techniques can be extrapolated to other AI scenarios where learning through competition is more feasible than supervised learning.

This article was originally published on towardsdatascience.