Experienced algorithm engineers know this well: in a model's development cycle, most of the work actually goes into feature engineering, model evaluation, and deployment. Machine learning platforms are now so mature that implementing and adjusting a model structure is often just a matter of a few lines of code. So anything that improves the efficiency of model evaluation and online AB Testing is a great liberation of the algorithm engineer's productivity.

Today's article introduces streaming media giant Netflix's trick for fast online evaluation: Interleaving.

As we all know, Netflix is a streaming media giant in the United States. It is well known not only for its many acclaimed original shows and its high market value, but also because, in the field of recommendation technology, Netflix has always been at the forefront of the industry. An important technology driving the rapid, iterative innovation of Netflix's recommendation system is the fast online evaluation method we introduce today: Interleaving.

Netflix recommendation system

The problem background of Netflix's recommendation system

Almost every Netflix page is driven by recommendation algorithms, and each algorithm is optimized for a different recommendation scenario.

As shown in the figure below, the "Top Picks" row on the home page provides recommendations based on a personalized ranking of videos, while the "Trending Now" row reflects recent trends. Together, these personalized rows make up a personalized homepage for Netflix's nearly 100 million members.

An example of a personalized Netflix homepage. Each row is a recommendation category, and within a given row, the left-to-right ordering of videos is determined by a specific ranking algorithm.

For the heavily algorithm-driven Netflix, iterative innovation of these algorithms is of course essential. To maximize Netflix's business goals through its algorithms (business metrics include monthly user subscriptions, total viewing time, and so on), a large number of AB Tests are needed to verify that a new algorithm actually improves these key product metrics.

This creates a contradiction: the ever-growing AB Test demand from algorithm engineers versus the serious shortage of online AB Test resources.

Online AB Tests must occupy valuable online traffic and may also hurt the user experience, yet online traffic is obviously limited and only a small portion can be used for AB Testing. Meanwhile, algorithm-driven scenarios keep multiplying, and a large number of candidate algorithms each need their own AB Test. The contradiction between the two is bound to intensify, so a fast online evaluation method is urgently needed.

To this end, Netflix designed a two-stage online testing process:

Using Interleaving for fast online testing. Light bulbs represent candidate algorithms; the winning algorithm is shown in red. Interleaving can quickly narrow down the initial set of candidate algorithms and identify the best one faster than a traditional AB Test.

1. The first stage uses a test method called Interleaving to rapidly screen candidates, selecting a small number of "excellent" Ranking algorithms from a large pool of initial ideas.

2. The second stage runs a traditional AB Test on the reduced set of algorithms to measure their long-term impact on user behavior.

Most readers are already familiar with the traditional AB Test, so this article focuses on how Netflix runs fast online tests with the Interleaving method.

Problems with the traditional AB Test

Beyond the efficiency problem, the traditional AB Test also has some statistical pitfalls. The following describes a typical one.

Suppose we design an AB Test to find out whether a user population prefers Coca-Cola or Pepsi. Following the traditional approach, we would randomly split the test population into two groups and run a "blind test", i.e. without revealing the cola brand. The first group is served only Coca-Cola, the second group only Pepsi, and we then compare cola consumption over a certain period to see whether people prefer Coca-Cola or Pepsi.

This experiment is valid in the general sense, and it is how such tests are often run. But it has some potential problems:

1. Within the test population, cola consumption habits are bound to vary widely, from people who barely drink cola to those who drink large amounts every day.

2. Heavy cola drinkers are certainly only a small fraction of the test population, but they may account for a disproportionately large share of overall consumption.

Together, these two facts mean that even a small imbalance of heavy consumers between the two groups can have a disproportionate impact on the conclusion.

The same problem exists in Internet scenarios. In Netflix's case, highly active users are a small minority, yet they contribute a large share of total viewing time. Whether these active users happen to land more in group A or group B of an AB Test therefore has an outsized effect on the results, masking the model's real effect.

So how can this be solved? One way is to not split the testers into groups at all, but to let every tester freely choose between Pepsi and Coca-Cola (still with no brand labels in the test, but the two colas remain distinguishable). At the end of the experiment, we compute each person's consumption ratio of Coca-Cola to Pepsi, then average these ratios to obtain the overall preference.
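To make the statistical point concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration (the 5% heavy-drinker share, the 20x consumption multiplier, the 10% true preference), not data from any real test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Assumption: 5% of testers are heavy drinkers who consume ~20x more cola.
heavy = rng.random(n) < 0.05
volume = np.where(heavy, 20.0, 1.0)

# Assumption: the population genuinely prefers cola X by 10%.
true_ratio = 1.1

# Traditional AB Test: random split, compare total consumption between groups.
in_a = rng.random(n) < 0.5
total_x = (volume * true_ratio)[in_a].sum()   # group A is served only X
total_y = volume[~in_a].sum()                 # group B is served only Y
print(f"AB estimate of X/Y: {total_x / total_y:.3f}")
# Re-running with different seeds shows this estimate fluctuating noticeably
# around 1.1, driven mostly by how the few heavy drinkers happened to be split.

# Free-choice design: everyone gets both colas; average the per-user shares.
share_x = true_ratio / (true_ratio + 1.0)     # each user's share going to X
print(f"Per-user mean share of X: {share_x:.3f} (every tester weighted equally)")
```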

The advantages of this test solution are:

1. It eliminates the problem of unevenly distributed tester attributes between the A and B groups;

2. It reduces the excessive influence of heavy consumers on the outcome by giving everyone the same weight.

Applying this test idea to the Netflix scenario gives us Interleaving.

Netflix's fast online evaluation method - Interleaving

The following differences exist between AB Test and Interleaving.

  • In the traditional AB Test, Netflix selects two groups of subscribers: one group receives the recommendation results of Ranking algorithm A, the other group receives the recommendation results of Ranking algorithm B.
  • In the Interleaving test, there is only one group of subscribers, who receive a blended ranking generated by interleaving the outputs of Ranking algorithms A and B.

This way, a user sees the recommendations of algorithms A and B simultaneously within a single row (without being able to tell which algorithm recommended any given item), and metrics such as viewing time attributed to each algorithm's items can be compared to determine whether algorithm A or algorithm B is better.

Traditional AB Test vs. Interleaving. In the traditional AB Test, test users are split into two groups: one group is exposed to ranking algorithm A, the other to algorithm B, and core evaluation metrics such as viewing time are compared between the groups. Interleaving, by contrast, exposes all test users to a blended ranking of algorithms A and B, and then compares the metrics attributed to each algorithm's items.

Of course, when testing with the Interleaving method, position bias must be taken into account, to prevent videos from algorithm A from always occupying the first slot. Algorithms A and B should therefore take the lead with equal probability. This is like a pickup game where the two captains flip a coin to decide who picks first, and then take turns choosing players.

Use the "Captain Picker" approach to mix videos from two ranking algorithms. The ranking algorithms A and B respectively generate a list of recommended videos. It is determined by random coin toss whether the ranking algorithm A or B contributes the first video. Then, the video is selected from the high to the bottom in algorithms A and B in turn.
Use the "Captain Picker" approach to mix videos from two ranking algorithms. The ranking algorithms A and B respectively generate a list of recommended videos. It is determined by random coin toss whether the ranking algorithm A or B contributes the first video. Then, the video is selected from the high to the bottom in algorithms A and B in turn.

Having described the Interleaving method, we still need to verify whether this evaluation approach can stand in for the traditional AB Test without leading to wrong conclusions. Netflix validated it along two dimensions: the "sensitivity" of Interleaving and the "correctness" of Interleaving.

Sensitivity comparison between Interleaving and traditional AB Test

Netflix's experiment aimed to verify how efficiently Interleaving can distinguish the better of algorithms A and B compared to the traditional AB Test. We have repeatedly stressed how scarce online testing resources are, so the natural hope is that Interleaving can settle the evaluation with fewer online resources and fewer test users. This is the "sensitivity" comparison.

Figure 5 shows the experimental result. The horizontal axis is the number of samples in the experiment; for the vertical axis, Netflix does not give a precise definition, but it can be understood as the probability of wrongly judging whether algorithm A is better than algorithm B. As the figure shows, Interleaving can determine whether algorithm A beats B with about 10^3 samples, while the AB Test needs about 10^5 samples to push the error rate below 5%. In other words, the resources of one AB Test can support roughly 100 Interleaving experiments, which greatly expands online testing capacity.

Sensitivity of Interleaving versus traditional AB Test metrics. Compared with the most sensitive AB Test metric, Interleaving needs only 1/100 of the subscriber sample to determine which algorithm users prefer.
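To illustrate why a per-user preference signal can need far fewer samples than a between-group comparison of a heavy-tailed metric, consider this simulation sketch. Every parameter in it (a 53% true per-user win rate for A, a 5% viewing-time lift, lognormal engagement noise) is an assumption chosen for illustration; it does not reproduce Netflix's actual Figure 5 data:

```python
import numpy as np

rng = np.random.default_rng(1)

def wrong_rate_interleaving(n_users, trials=500, p_true=0.53):
    """Assumed: algorithm A truly wins 53% of per-user interleaving 'votes'.
    An experiment is wrong when B gets at least half the votes."""
    wins = rng.binomial(n_users, p_true, size=trials)
    return np.mean(wins <= n_users / 2)

def wrong_rate_ab(n_users, trials=500, lift=0.05, sigma=1.5):
    """Assumed: per-user viewing time is heavy-tailed (lognormal) and A gives
    a small true lift. An experiment is wrong when group B's mean is higher."""
    half = n_users // 2
    wrong = 0
    for _ in range(trials):
        a = rng.lognormal(lift, sigma, half).mean()
        b = rng.lognormal(0.0, sigma, half).mean()
        wrong += a <= b
    return wrong / trials

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: interleaving wrong {wrong_rate_interleaving(n):.3f}, "
          f"AB wrong {wrong_rate_ab(n):.3f}")
```

Under these assumed parameters, the interleaving error rate drops below 5% at around 10^3 users, while the AB comparison needs on the order of 10^5, which mirrors the qualitative gap Netflix reports.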

Correlation between Interleaving Indicators and AB Test Indicators

Beyond evaluating algorithms quickly with small samples, whether Interleaving's verdict is consistent with the AB Test's is the other key to whether Interleaving can serve as the first stage before the online AB Test.

Figure 6 shows the correlation between Interleaving's experimental metric and the AB Test metric; each data point represents one Ranking algorithm. A strong correlation was found between the two, which shows that an algorithm winning the Interleaving experiment is very likely to also win the subsequent AB Test.

Correlation between the Interleaving metric and the AB Test metric. Each point represents the experimental result of one Ranking algorithm. The two metrics are strongly correlated.
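What such a correlation check looks like can be sketched with purely synthetic data: assume a set of hypothetical ranking algorithms whose hidden quality both metrics observe with independent noise, then measure agreement. All numbers below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumption: synthetic results for 40 ranking algorithms. Each algorithm has
# a hidden true quality; both metrics observe it with independent noise.
true_quality = rng.normal(0.0, 1.0, 40)
interleaving_pref = true_quality + rng.normal(0.0, 0.3, 40)  # Interleaving preference
ab_test_lift = true_quality + rng.normal(0.0, 0.3, 40)       # AB Test engagement lift

r = np.corrcoef(interleaving_pref, ab_test_lift)[0, 1]
print(f"Pearson correlation across algorithms: {r:.2f}")

# Agreement in direction is what matters in practice: does the Interleaving
# winner also win the AB Test?
agree = np.mean(np.sign(interleaving_pref) == np.sign(ab_test_lift))
print(f"Fraction of algorithms where the two metrics agree on direction: {agree:.2f}")
```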

Conclusion

Through these experiments we know that Interleaving is a powerful and fast algorithm validation method that accelerates the iterative innovation of Netflix's various Ranking algorithms.

But we should also be clear that the Interleaving method has certain limitations, mainly the following two points:

1. The engineering framework is more complex than that of a traditional AB Test. Because the Interleaving experiment logic is entangled with business logic, it can disrupt the business logic. Moreover, implementing Interleaving requires adding a large number of auxiliary data labels throughout the data pipeline, which is a difficult point of the engineering implementation;

2. Interleaving is, after all, only a relative measure of users' preference between two algorithms' recommendations; it cannot produce the full performance picture of a single algorithm. For example, if we want to know how much algorithm A can lift users' overall viewing time, Interleaving cannot answer that. This is why Netflix designed the two-stage Interleaving + AB Test structure to complete the full online testing framework.
