Understanding logistic regression in one article

This article introduces the basic concepts of logistic regression, its advantages and disadvantages, and a practical application case in an easy-to-understand way. It also compares logistic regression with linear regression so that you can clearly tell the two algorithms apart.


What is logistic regression?

Logistic regression in machine learning

The position of logistic regression is shown in the figure above. It belongs to: machine learning > supervised learning > classification > logistic regression.

Extended reading:

"Understanding machine learning in one article (3 learning methods + 7 practical steps + 15 common algorithms)"

"Understanding supervised learning in one article (basic concepts + 4-step workflow + 9 typical algorithms)"

Logistic regression mainly solves binary classification problems and expresses the probability that something will happen.


For example:

  • The probability that an email is spam (yes, no)
  • The probability that a user buys a product (buy, not buy)
  • The probability that an ad is clicked (click, no click)
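
Probabilities like the ones in these examples come from passing a weighted sum of the features through the sigmoid function, which squeezes any real number into the range (0, 1). The following is a minimal sketch of that idea. The feature names, weights, bias, and the 0.5 decision threshold are all illustrative assumptions and do not come from a trained model.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical spam model: features are (number of links, mentions of "free").
weights = [0.8, 1.2]   # illustrative values, not learned from data
bias = -3.0

p_spam = predict_proba([3, 2], weights, bias)   # score = -3.0 + 2.4 + 2.4 = 1.8
is_spam = p_spam >= 0.5                         # the 0.5 threshold is an assumption
print(f"P(spam) = {p_spam:.3f}, spam: {is_spam}")
```

The model outputs a probability first. The yes/no answer only appears once a threshold is applied, and that threshold can be moved depending on how costly each kind of mistake is.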


Advantages and disadvantages of logistic regression


Advantages:

  • Simple to implement and widely used in industry;
  • Very little computation is needed at classification time, so prediction is fast and storage requirements are low;
  • Probability scores for each sample are easy to inspect;
  • Multicollinearity is manageable for logistic regression: it can be addressed by adding L2 regularization;
  • Computationally cheap and easy to understand;
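
The point about multicollinearity and L2 regularization can be illustrated with a small experiment. This is a sketch under stated assumptions, not a production implementation: the toy data, the helper `fit`, the penalty strength `l2=1.0`, and the omission of a bias term are all choices made for the illustration. The two features are exact copies of each other, which is the most extreme form of collinearity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, l2=0.0, lr=0.5, steps=5000):
    """Batch gradient descent on mean log loss + (l2 / 2) * ||w||^2 (no bias term)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Prediction error for every sample under the current weights.
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - yi
                  for row, yi in zip(X, y)]
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n + l2 * w[j]
            w[j] -= lr * grad
    return w

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
X = [[x, x] for x in xs]        # second feature duplicates the first
y = [0, 0, 1, 0, 1, 1]          # noisy labels, so the classes overlap

w_plain = fit(X, y)             # no regularization
w_l2 = fit(X, y, l2=1.0)        # with an L2 penalty

print("without L2:", w_plain)
print("with L2:   ", w_l2)
```

With the penalty, the weight shared by the two duplicated features is pulled toward zero and split evenly between them, which keeps the coefficients small and stable.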

Disadvantages:

  • When the feature space is large, logistic regression does not perform very well;
  • It under-fits easily, so accuracy is generally not very high;
  • It does not handle a large number of multi-category features or variables well;
  • It can only handle binary classification problems (softmax, which is derived from it, can be used for multi-class classification), and the data must be linearly separable;
  • Nonlinear features must be transformed first;
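
The last two points can be seen on a tiny example. In the sketch below the label is positive when |x| > 1, so no single threshold on x separates the classes, but the transformed feature x² does. The data, the helper names `train_1d` and `accuracy`, and the learning rate and step count are assumptions chosen for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_1d(features, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(w * f + b) by batch gradient descent on mean log loss."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        errors = [sigmoid(w * f + b) - y for f, y in zip(features, labels)]
        w -= lr * sum(e * f for e, f in zip(errors, features)) / n
        b -= lr * sum(errors) / n
    return w, b

def accuracy(features, labels, w, b):
    """Share of samples classified correctly with a 0.5 threshold."""
    hits = sum((sigmoid(w * f + b) >= 0.5) == bool(y)
               for f, y in zip(features, labels))
    return hits / len(labels)

xs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
ys = [1, 1, 0, 0, 0, 1, 1]              # positive when |x| > 1

squared = [x * x for x in xs]           # the nonlinear feature conversion

acc_raw = accuracy(xs, ys, *train_1d(xs, ys))
acc_sq = accuracy(squared, ys, *train_1d(squared, ys))
print(f"raw x: {acc_raw:.2f}, x squared: {acc_sq:.2f}")
```

The model itself is unchanged between the two runs. Only the input representation differs, which is why feature engineering matters so much for logistic regression.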


Logistic regression vs. linear regression

Linear regression and logistic regression are two classic algorithms that are often compared. Here are some of the differences between them:

The difference between linear regression and logistic regression

  1. Linear regression can only be used for regression problems. Logistic regression, despite having "regression" in its name, is mostly used for classification problems. (For the difference between regression and classification, see "Understanding supervised learning in one article (basic concepts + 4-step workflow + 9 typical algorithms)".)
  2. Linear regression requires the dependent variable to be a continuous numerical variable, while logistic regression requires the dependent variable to be a discrete variable.
  3. Linear regression requires a linear relationship between the independent and dependent variables, while logistic regression does not.
  4. Linear regression expresses the relationship between the independent and dependent variables intuitively, while that relationship cannot be read directly from a logistic regression model.
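
The second difference can be checked numerically. The sketch below fits an ordinary least-squares line to 0/1 labels and shows that its predictions leave the range [0, 1], while a logistic model always stays inside (0, 1). The toy data and the logistic parameters `w` and `b` are hypothetical values picked for the illustration, not fitted results.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]                  # discrete 0/1 labels

slope, intercept = fit_line(xs, ys)
linear_at_0 = slope * 0 + intercept      # falls below 0
linear_at_10 = slope * 10 + intercept    # rises above 1

# Hypothetical logistic parameters with the decision boundary at x = 3.5.
w, b = 2.0, -7.0
logistic_at_0 = sigmoid(w * 0 + b)
logistic_at_10 = sigmoid(w * 10 + b)

print(f"linear:   {linear_at_0:.3f} at x=0, {linear_at_10:.3f} at x=10")
print(f"logistic: {logistic_at_0:.6f} at x=0, {logistic_at_10:.6f} at x=10")
```

A straight line has no way to stop at 0 or 1, so its output cannot be read as a probability. That is the practical reason a discrete dependent variable calls for the logistic form.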


Independent variable: the variable that is actively manipulated; it can be regarded as the cause of the "dependent variable".

Dependent variable: the variable that changes because the "independent variable" changes; it can be regarded as the result of the "independent variable". It is also the result we want to predict.

Interpretation of independent variables and dependent variables


Meituan application case

Meituan applies logistic regression in its business to solve practical problems. Take predicting a user's purchase preference for a category as an example: the question can be recast as predicting whether the user will purchase from a given category at some point in the future. If a purchase is labeled 1 and no purchase is labeled 0, this becomes a binary classification problem. The features we use include historical information such as the user's browsing and purchases, as shown in the following table:

Meituan's application of logistic regression

The features are extracted over a 30-day window, and the label is taken over a 2-day window. The generated training data is on the order of 70 million (users who were active on Meituan during a month). We manually merged similar small categories, ending up with 18 fairly typical category groups. If a user purchases from a category group within the given time, the sample is treated as a positive example. With the training data in hand, we use the Spark version of the LR algorithm to train a binary classification model for each category. With the number of iterations set to 100, model training takes about 40 minutes, an average of about 2 minutes per model. Most of the AUC values are above 0.8. The trained models are saved and used to predict the purchase probability for each category. The predictions are used in scenarios such as recommendation.
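
AUC, the metric quoted above, can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The function `auc` and the sample labels and scores are illustrative assumptions, and the pairwise loop is only practical for small data, not for tens of millions of rows.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predictions: 1 = purchased in the label window, 0 = did not.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]

print(f"AUC = {auc(labels, scores):.3f}")   # 8 of the 9 pairs are ordered correctly
```

Because AUC depends only on how the scores are ordered, it is a convenient way to compare per-category models whose positive rates differ.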

Because the ratio of positive to negative examples differs between categories, some categories are highly imbalanced. We tried different sampling methods, with the ultimate goal of improving online metrics such as the order rate. After some parameter tuning, the category preference feature brought an increase of more than 1% in orders to recommendation and ranking.

In addition, because the LR model is simple, efficient, and easy to implement, it provides a good baseline for subsequent model optimization. We also use LR models in services such as ranking.
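
The text does not say which sampling methods were tried. As one common option, the sketch below downsamples the negative class so that each positive example is matched by a fixed number of negatives. The function name `downsample_negatives`, the ratio parameter, and the toy rows are all assumptions made for illustration.

```python
import random

def downsample_negatives(rows, negatives_per_positive=1.0, seed=0):
    """Keep every positive row and a random subset of the negative rows.

    rows: list of (features, label) pairs, where label is 1 or 0.
    """
    rng = random.Random(seed)            # fixed seed for reproducibility
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    keep = min(len(negatives), int(len(positives) * negatives_per_positive))
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)
    return balanced

# Toy data: 5 positives and 95 negatives.
rows = [([i], 1) for i in range(5)] + [([i], 0) for i in range(5, 100)]
balanced = downsample_negatives(rows, negatives_per_positive=2.0)

print(len(balanced), "rows,", sum(label for _, label in balanced), "positives")
```

A model trained on downsampled data tends to overestimate the positive probability, so the scores are best used for ranking or recalibrated before being read as true probabilities.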


Baidu Encyclopedia + Wikipedia

Baidu Encyclopedia version

Logistic regression is a generalized linear regression analysis model that is often used in data mining, automatic disease diagnosis, economic forecasting, and other fields. For example, it can be used to explore the risk factors that cause a disease and to predict the probability of the disease from those risk factors.

Take the analysis of gastric cancer as an example. Two groups of people are selected, one with gastric cancer and one without. The two groups will differ in their physical signs and lifestyles. The dependent variable is whether the person has gastric cancer, with the value "yes" or "no". The independent variables can include many factors, such as age, gender, eating habits, and Helicobacter pylori infection, and they can be either continuous or categorical. Logistic regression analysis then yields a weight for each independent variable, which gives a rough picture of which factors are risk factors for gastric cancer. The same weights can also be used to predict, from a person's risk factors, how likely that person is to have cancer.
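
To make "the weight of the independent variable" concrete: exponentiating a weight gives an odds ratio, and plugging a person's factors into the model gives a predicted probability. The sketch below assumes made-up weights and a made-up bias purely for illustration. They are not taken from any study and say nothing about real cancer risk.

```python
import math

# Hypothetical weights for illustration only; they are not from any real study.
weights = {"age_over_60": 0.7, "h_pylori_infection": 1.1, "high_salt_diet": 0.4}
bias = -4.0

def odds_ratios(weights):
    """exp(weight): factor by which the odds change when a 0/1 factor is present."""
    return {name: math.exp(w) for name, w in weights.items()}

def risk(person, weights, bias):
    """Predicted probability for one person described by 0/1 risk factors."""
    score = bias + sum(weights[name] * value for name, value in person.items())
    return 1.0 / (1.0 + math.exp(-score))

ratios = odds_ratios(weights)
person = {"age_over_60": 1, "h_pylori_infection": 1, "high_salt_diet": 0}
p = risk(person, weights, bias)          # score = -4.0 + 0.7 + 1.1 = -2.2

for name, ratio in ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
print(f"predicted probability: {p:.3f}")
```

An odds ratio above 1 marks a factor that raises the odds of the outcome, which is how the weights point to candidate risk factors.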


Wikipedia version

In statistics, the logistic model is a widely used statistical model. In its basic form it uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression means estimating the parameters of a logistic model; it is a form of binomial regression.

Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, win/lose, alive/dead, or healthy/sick. These are represented by an indicator variable whose two values are labeled "0" and "1". In the logistic model, the log-odds (the logarithm of the odds) of the value labeled "1" is a linear combination of one or more independent variables ("predictors"). Each independent variable can be a binary variable (two classes, encoded by an indicator variable) or a continuous variable (any real value).
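
Written out, the statement about log-odds takes the following form. The symbols are notation assumed here rather than taken from the text: p is the probability that the dependent variable equals 1, x_1, ..., x_k are the independent variables, and the β values are the parameters to be estimated.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

The left-hand form shows that the model is linear in the log-odds, and the right-hand form is the logistic function that turns the linear combination back into a probability.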