This article introduces the basic concepts, advantages and disadvantages, and a practical application case of logistic regression in an easy-to-understand way. It also compares logistic regression with linear regression, so that you can clearly tell the two algorithms apart.
What is logistic regression?
The position of logistic regression is shown in the figure above: machine learning > supervised learning > classification > logistic regression.
Logistic regression mainly solves binary classification problems and is used to express the probability that something happens. For example:
- The probability that an email is spam (yes, no)
- The probability that a product is purchased (buy, not buy)
- The probability that an ad is clicked (click, no click)
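To make "the probability that something happens" concrete, here is a minimal Python sketch of how a logistic regression model turns a weighted sum of features into a probability. The spam features, weights, and bias are made-up values for illustration, not taken from any real model.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    # Numerically stable form: never calls exp() on a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """P(y = 1 | x) for a logistic regression model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical spam model with two made-up features:
# the number of links in the email, and whether the sender is a known contact.
weights = [0.8, -2.0]
bias = -1.0

p_spam = predict_proba(weights, bias, [3, 0])  # 3 links, unknown sender
is_spam = p_spam >= 0.5                        # threshold to get yes / no
print(round(p_spam, 3), is_spam)
```

The model itself outputs a probability; the yes/no answer comes from comparing that probability with a threshold (0.5 here, but the threshold is a choice that depends on the application).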
Advantages and disadvantages of logistic regression
Advantages:
- Simple to implement and widely used on industrial problems;
- Very little computation is needed at classification time, so prediction is fast and storage requirements are low;
- Probability scores for each sample are easy to inspect;
- Multicollinearity is not a serious problem for logistic regression; it can be addressed by combining the model with L2 regularization;
- Computationally cheap, and easy to understand and implement.
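The points about cheap computation and L2 regularization can be sketched in a few lines of Python. This is a minimal illustration that assumes plain batch gradient descent on the mean log loss; the function name `train_logistic`, the parameters `l2`, `lr`, and `epochs`, and the toy data are all hypothetical. Production libraries use more sophisticated solvers.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, l2=0.0, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log loss plus (l2 / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The L2 term shrinks the weights; the bias is left unpenalized.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))

# Toy data with two perfectly collinear features (x2 == 2 * x1).
X = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
y = [0, 0, 1, 1]

w_plain, b_plain = train_logistic(X, y, l2=0.0)
w_reg, b_reg = train_logistic(X, y, l2=1.0)
print(norm(w_plain), norm(w_reg))
```

On this toy data the regularized weights end up with a smaller norm than the unregularized ones. Keeping the weights small and finite is the sense in which L2 regularization helps when features are strongly correlated.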
Disadvantages:
- When the feature space is very large, logistic regression does not perform very well;
- Prone to underfitting, so accuracy is generally not very high;
- Does not handle a large number of multi-class features or variables well;
- Can only handle binary classification problems (softmax, which is derived from it, can be used for multi-class classification), and the data must be linearly separable;
- Nonlinear features need to be transformed first.
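The last two points, linear separability and feature conversion, can be illustrated with a small Python sketch. Points inside and outside a circle cannot be separated by a straight line in the original (x1, x2) space, but become separable once squared features are added. The weights below are hand-picked for illustration rather than learned, and the helper name `add_squares` is hypothetical.

```python
import math

def sigmoid(z):
    # Simple form; adequate for the small scores used in this example.
    return 1.0 / (1.0 + math.exp(-z))

def add_squares(point):
    """Feature conversion: (x1, x2) -> (x1, x2, x1^2, x2^2)."""
    x1, x2 = point
    return [x1, x2, x1 * x1, x2 * x2]

# Hand-picked (not learned) weights: the score is x1^2 + x2^2 - 1,
# which is negative inside the unit circle and positive outside it.
weights = [0.0, 0.0, 1.0, 1.0]
bias = -1.0

def predict(point):
    features = add_squares(point)
    score = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(score)

inside = [(0.0, 0.0), (0.5, -0.5), (-0.3, 0.2)]
outside = [(2.0, 0.0), (-1.5, 1.5), (0.0, -3.0)]

for p in inside:
    print(p, "inside ", round(predict(p), 3))
for p in outside:
    print(p, "outside", round(predict(p), 3))
```

The model is still linear in its inputs; only the inputs have changed. That is why the transformation has to be done before the features are handed to logistic regression.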
Logistic regression VS linear regression
Linear regression and logistic regression are two classic algorithms that are often compared. Here are some of the differences between them:
- Linear regression can only be used for regression problems. Logistic regression, despite having "regression" in its name, is mostly used for classification problems. (For the difference between regression and classification, see the article "Understand supervised learning in one article (basic concepts + 4-step process + 9 typical algorithms)".)
- Linear regression requires the dependent variable to be a continuous numerical variable, while logistic regression requires it to be a discrete variable.
- Linear regression requires a linear relationship between the independent and dependent variables, while logistic regression does not.
- Linear regression expresses the relationship between the independent and dependent variables intuitively, while logistic regression does not express that relationship as directly.
Independent variable: a variable that is actively manipulated and can be regarded as the cause of the dependent variable.
Dependent variable: a variable that changes in response to the independent variable and can be regarded as its result. It is also the outcome we want to predict.
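One way to read the contrast above is to put the two outputs side by side. The Python sketch below uses a made-up one-feature model (the values of `w` and `b` are illustrative only). The raw weighted sum is what a linear regression would output, while logistic regression passes it through a sigmoid. The resulting probability is not linear in x, but its log-odds are.

```python
import math

def sigmoid(z):
    # Simple form; adequate for the small scores used in this example.
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logit: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Illustrative one-feature model (values are made up).
w, b = 0.7, -2.0

for x in [0, 1, 2, 3, 4]:
    linear_output = w * x + b              # unbounded, continuous value
    probability = sigmoid(linear_output)   # always between 0 and 1
    print(x, round(linear_output, 2), round(probability, 3),
          round(log_odds(probability), 2))
```

The last column matches the second: the log-odds equal `w * x + b`. So each unit increase in x multiplies the odds by `exp(w)`, even though the probability itself follows an S-shaped curve.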
Meituan application case
Meituan applies logistic regression in its business to solve practical problems. Take predicting a user's purchase preference for a category as an example: the question can be recast as predicting whether the user will purchase from a certain category within a certain period in the future. If a purchase is labeled 1 and no purchase is labeled 0, this becomes a binary classification problem. The features we use include historical information such as the user's browsing and purchases, as shown in the following table:
Features are extracted over a 30-day window, and the label window is 2 days. The generated training data is on the order of 70 million (users who were active on Meituan within a month). We manually merge similar small categories, ending up with 18 representative category sets. If a user purchases from a category set within the given time window, that sample is treated as a positive example. With the training data ready, we use the Spark version of the LR algorithm to train one binary classification model per category set. With the number of iterations set to 100, training all the models takes about 40 minutes, an average of roughly 2 minutes per model. Most of the models reach an AUC above 0.8. The trained models are saved and used to predict the purchase probability for each category, and the predictions are used in scenarios such as recommendation.
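Since the models above are judged by AUC, a small sketch of what that number measures may help. The Python function below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. It is an illustration on made-up labels and scores, not the evaluation code used in the case; the pairwise loop is only practical for small samples.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half. This is the O(P * N) pairwise version.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up predictions for one category model.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]

# 8 of the 9 positive/negative pairs are ordered correctly.
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so values above 0.8 indicate that buyers are usually ranked ahead of non-buyers.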
Because the ratio of positive to negative examples differs across categories, some categories have a very unbalanced distribution. We tried different sampling methods, with the ultimate goal of improving online metrics such as the order rate. After some parameter tuning, the category preference feature brought an increase of more than 1% in orders for recommendation and ranking.
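The text does not say which sampling methods were tried. As one common possibility, the Python sketch below downsamples negative examples so that each positive is matched by at most a fixed number of negatives. The function name `downsample_negatives`, the `ratio` parameter, and the data are all hypothetical.

```python
import random

def downsample_negatives(rows, ratio=1.0, seed=0):
    """Keep all positives and at most `ratio` negatives per positive.

    `rows` is a list of (features, label) pairs with label 0 or 1.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    keep = min(len(negatives), int(len(positives) * ratio))
    sampled = rng.sample(negatives, keep)
    balanced = positives + sampled
    rng.shuffle(balanced)
    return balanced

# Made-up data: 5 buyers among 100 users.
rows = [([i], 1 if i < 5 else 0) for i in range(100)]
balanced = downsample_negatives(rows, ratio=2.0)

n_pos = sum(1 for _, label in balanced if label == 1)
n_neg = sum(1 for _, label in balanced if label == 0)
print(n_pos, n_neg)
```

Training on resampled data shifts the predicted probabilities away from the true base rate, so the scores may need recalibration if their absolute values matter rather than just the ranking.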
In addition, because the LR model is simple, efficient, and easy to implement, it provides a good baseline for subsequent model optimization. We also use the LR model in services such as ranking.
Baidu Encyclopedia + Wikipedia