Once you have enough data, you can turn to feature selection or feature extraction (many of these techniques do a similar job and are sometimes used interchangeably). There are usually two families of methods:

- **Feature extraction / selection**
- **Dimensionality reduction (feature reduction)**

Let's work through them step by step.

## Part 2: Feature Extraction / Selection

### So what is feature selection? Feature extraction? What's the difference between them?

→ In machine learning and statistics, feature selection (also called variable selection) is the process of selecting a relevant subset of features (variables, predictors) for model building.

→ Feature extraction is used to create a new set of smaller features that will still capture most useful information.

→ Put simply, feature selection keeps a subset of the original features, while feature extraction creates new ones.

**Importance of feature selection / extraction**

→ This becomes especially important when the number of features is large.

→ You don't need to use every available feature to build an algorithm.

→ You can assist the algorithm by feeding it only the features that really matter.

**Why use feature selection?**

→ It makes training machine learning algorithms faster, reduces complexity, and makes models easier to interpret.

→ If the correct subset is selected, the accuracy of the model can be improved.

→ It can reduce overfitting.

It can be broadly divided into two techniques (though this is not the only way to slice it):

i. Univariate feature selection

ii. Multivariate feature selection

**Univariate feature selection:** This technique involves a fair amount of manual-style work: go through each feature and check its importance with respect to the target. To implement *univariate feature selection*, you should master a few skills.

→ If you have the relevant **domain knowledge** and trust your judgment calls, always start with this step. Analyze all the features and delete the unnecessary ones. Yes, this is a time-consuming and laborious step, but hey, who do you trust more, **the machine or yourself**?

→ **Check the variance** of all the features (yes, it's always that confusing bias-variance tradeoff :). The rule of thumb here is to set a threshold (for example, a feature with zero variance means every sample has the same value, so the feature brings no predictive power to the model) and delete features accordingly.
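The variance rule of thumb above can be sketched with scikit-learn's `VarianceThreshold`; the toy array below is made up purely for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy data: the second column is constant (variance 0), so it carries no signal
X = np.array([[1.0, 5.0, 0.2],
              [2.0, 5.0, 0.9],
              [3.0, 5.0, 0.4],
              [4.0, 5.0, 0.7]])

selector = VarianceThreshold(threshold=0.0)  # drop features with variance <= 0
X_reduced = selector.fit_transform(X)

print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # the constant column is gone
```

In practice you would pick the threshold after inspecting `selector.variances_`, not hard-code it.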

→ **Use the Pearson correlation coefficient:** This is probably the most applicable of the three methods. If you don't understand it or find it confusing, read up on it first.

- In short, it gives us the interdependence between the target variable and the features.

- Rules of thumb when using Pearson:

i. Select only features with a moderate to strong relationship with the target variable (see above).

ii. When two features are strongly related both to each other and to the target variable, choose either one of them (keeping both adds little value). Use **seaborn.heatmap()** for visualization and selection; it is **very helpful**.

iii. There is a trap here: it is most effective for linear data and much less effective for non-linear data (so avoid it there).
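A minimal sketch of both rules of thumb on a small synthetic DataFrame (the feature names and the 0.4 cutoff are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"feat_a": rng.normal(size=n)})
df["feat_b"] = df["feat_a"] * 0.95 + rng.normal(scale=0.1, size=n)  # near-duplicate of feat_a
df["feat_c"] = rng.normal(size=n)                                   # pure noise
df["target"] = 2.0 * df["feat_a"] + rng.normal(scale=0.5, size=n)

corr = df.corr(method="pearson")
print(corr["target"].drop("target"))

# rule i: keep only features with at least a moderate link to the target
candidates = corr["target"].drop("target").abs()
selected = candidates[candidates > 0.4].index.tolist()

# rule ii: feat_a and feat_b are highly correlated with each other, so keeping
# both adds little value -- a heatmap makes this obvious:
#   import seaborn as sns; sns.heatmap(corr, annot=True)
print(selected)
```

Here `selected` keeps the correlated pair and drops the noise feature; rule ii would then tell you to keep only one of `feat_a` / `feat_b`.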

**Multivariate feature selection:** When you have a lot of features (hundreds or thousands of them), it becomes practically impossible to check each of them manually; likewise, if you don't have enough domain knowledge, you have to trust the following techniques. In easy-to-understand terms, you are selecting multiple features at once.

**Multivariate feature selection is broadly divided into three categories:**

Let's check them out (we will discuss the most widely used techniques in each category).

## Filter methods:

→ Filter methods are usually applied as a preprocessing step. The selection of features is independent of any machine learning algorithm.

→ A filter method ranks the features; the ranking indicates how "useful" each feature is likely to be for classification. Once the ranking is computed, a feature set consisting of the best N features is created.

→ Features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. (Correlation is a subjective term here.)

**Pearson correlation:** Oh yes! Pearson correlation is a filter method; we have already discussed it.

**Variance threshold:** We have already discussed this one too.

**Linear discriminant analysis (LDA):** The goal is to project the dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting (the *curse of dimensionality*) and to reduce computational cost.

→ Without getting into the mathematics, LDA brings all the higher-dimensional variables (which we cannot plot and analyze) down onto a 2D graph, removing useless features as it does so.

→ LDA is still more of a "**dimensionality reduction**" technique than a "**selection**" one; it is closer to **feature extraction** (because it creates new variables while reducing the number of dimensions). Note that it applies only to labeled data.

→ It maximizes between-class *separability*. (Too many technical terms, yes. Don't worry, watch the video.)
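As a quick sketch of that projection in code, using scikit-learn's `LinearDiscriminantAnalysis` on the iris data (chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 4 features, 3 classes

# LDA can project onto at most (n_classes - 1) components -- here, 2
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)        # note: it needs the labels y (supervised)

print(X.shape, "->", X_lda.shape)      # 4 dimensions down to 2
print(lda.explained_variance_ratio_)   # separability captured per component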

**Others:**

**ANOVA (analysis of variance):** ANOVA is similar to LDA except that it operates with one or more categorical independent features and a continuous dependent feature. It provides a statistical test of whether the means of several groups are equal.

**Chi-square:** This is a statistical test applied to groups of categorical features to assess the likelihood of correlation or association between them using their frequency distributions.
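A hedged sketch of such a statistical filter with scikit-learn's `SelectKBest` and the chi-squared score (the iris data and k=2 are just illustration choices; swap in `f_classif` for the ANOVA test above):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # all features are non-negative, as chi2 requires

# keep the 2 features whose distributions are most associated with the class
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)             # chi-squared statistic per feature
print(X_new.shape)                  # two features survive
```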

One thing to keep in mind is that filter methods do not remove multicollinearity. So you must also deal with multicollinearity among the features before training a model on the data.

## Wrapper methods:

→ Based on inferences drawn from the previous model, we decide whether to add features to or remove features from the subset.

→ They are called wrapper methods because they wrap a classifier inside the feature selection algorithm. Typically, a set of features is selected; the efficiency of that set is determined; then the set is modified and the efficiency of the new set is evaluated.

→ The problem with these methods is that the feature space is large, and looking at every possible combination takes a lot of time and computation.

→ In essence, the problem has been reduced to a search problem. These methods are often very computationally expensive.

**1. Forward selection:** Forward selection is an iterative method in which we start with no features in the model. In each iteration we keep adding the feature that best improves the model, until adding a new variable no longer improves performance.

**2. Backward elimination:** In backward elimination, we start with all the features and remove the least significant feature at each iteration, as long as doing so improves the model's performance. We repeat this until no improvement is observed on removing a feature.

**3. Recursive Feature Elimination (RFE):** This works by recursively removing attributes and building a model on the attributes that remain. It uses an external estimator that assigns weights to features, such as the coefficients of a linear model, to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.

→ This is a greedy optimization algorithm that aims to find the best subset of features.

→ It repeatedly builds models and sets aside the best- or worst-performing feature at each iteration.

→ It then builds the next model with the remaining features, until all the features are exhausted. Finally, the features are ranked according to the order of their elimination.

```python
# Recursive Feature Elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# create a base estimator used to evaluate a subset of attributes
model = LinearRegression()
X, y = iowa.iloc[:, :-1], iowa.iloc[:, -1]

# create the RFE model and select 10 attributes
rfe = RFE(model, n_features_to_select=10)
rfe = rfe.fit(X, y)

# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
```
**Output:**

```
[False False True True False False False False False False False False False False False True True False True True True False True False False False False False False False False False]
[16 24 1 1 4 9 18 13 14 15 11 6 7 12 10 1 1 1 2 1 1 1 1 5 1 23 17 20 22 19 8 21 25 3]
```

→ This is what happened in the example above:

i. '**rfe.support_**' gives a True/False result for each feature, in feature order (based on the selected model, obviously).

ii. '**rfe.ranking_**' gives each feature its own rank. This is really handy when you need more features than the **n_features_to_select** you passed in (such as 10 above): you can set a threshold on the rank and select all the features above it.

**4. Sequential feature selector:** Sequential feature selection algorithms are a family of greedy search algorithms that reduce an initial *d*-dimensional feature space to a *k*-dimensional feature subspace (where *k* < *d*).

→ Forward sequential feature selection starts by evaluating each individual feature and selects the one that yields the best-performing model for the chosen algorithm.

→ Backward feature selection is closely related; as you may have guessed, it starts with the entire feature set and works backward from there, removing features to find the optimal subset of a predefined size.

*→ What is "best"?*

That depends entirely on the evaluation criterion you define (AUC, prediction accuracy, RMSE, etc.). Next, all possible combinations of the selected feature and a subsequent feature are evaluated, a second feature is selected, and so on, until the required predefined number of features is reached.

→ In short, SFAs remove or add one feature at a time based on classifier performance, until a feature subset of the desired size *k* is reached.

Note: I suggest you visit the official documentation to learn more about it, with examples.
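As a sketch, scikit-learn ships a `SequentialFeatureSelector` implementing both directions (the diabetes data and k = 4 are arbitrary illustration choices; the mlxtend library offers a similar class):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)        # 10 features

# forward SFA: start empty, add the one feature that most improves the
# cross-validated score, and repeat until k features are selected
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=4,
                                direction="forward",  # "backward" starts full
                                cv=5)
sfs.fit(X, y)

print(sfs.get_support())                     # boolean mask of chosen features
print(X.shape[1], "->", sfs.transform(X).shape[1])
```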

## Embedded methods:

→ Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods.

→ So these are not separate feature selection or extraction techniques in themselves, and they also help to avoid overfitting. Examples include:

- Lasso regularization in linear regression
- Selecting the k best features via random forest feature importances
- Gradient boosting machines (GBM)
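A minimal sketch of the Lasso case: the L1 penalty drives the coefficients of weak features exactly to zero, so selection happens inside the model fit itself (the diabetes data and `alpha=10.0` are arbitrary illustration choices; in practice you would tune alpha, e.g. with `LassoCV`):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)        # 10 features
X = StandardScaler().fit_transform(X)        # the L1 penalty is scale-sensitive

# features whose coefficients are shrunk to exactly 0 are effectively dropped
lasso = Lasso(alpha=10.0).fit(X, y)

kept = np.flatnonzero(lasso.coef_)
print(lasso.coef_)
print("features kept:", kept)
```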

## Difference between filter and wrapper methods

## Part 3: Dimensionality Reduction

**So, starting from the same question again: what is dimensionality reduction?**

In simple terms, the initial *d*-dimensional feature space is reduced to a *k*-dimensional feature subspace (where *k* < *d*).

**So is it the same as feature selection and extraction?**

Yes, in a sense (but only in *layman's* terms). To understand why, we have to dig a little deeper.

In machine learning, *dimensionality* simply refers to the number of *features* (i.e., input variables) in a dataset. When the number of features is very large relative to the number of observations, *some* algorithms struggle to train effective models. This is called the "curse of dimensionality", and it is particularly relevant to clustering algorithms that rely on distance calculations.

(Quora users provide a good analogy for the "curse of dimensionality".)
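The distance-concentration effect behind the curse is easy to demonstrate numerically; a small sketch (the point count and the chosen dimensions are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
contrasts = {}

# as dimensionality grows, pairwise distances between random points
# "concentrate": the gap between the nearest and the farthest pair shrinks
# relative to the mean distance, which hurts distance-based algorithms
# (k-means, k-NN, hierarchical clustering, ...)
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))     # 200 random points in the unit hypercube
    dists = pdist(X)             # all pairwise Euclidean distances
    contrasts[d] = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}  relative contrast={contrasts[d]:.3f}")
```

The printed contrast falls steadily as `d` grows: in 1000 dimensions, "near" and "far" points are almost indistinguishable.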

So when you have 100 or even 1,000 features, **dimensionality reduction** is your only real option. Let's discuss two extremely robust and popular techniques.

**1. Linear Discriminant Analysis (LDA):** Yes, the same technique that appeared among the filter methods above also serves as a dimensionality reduction technique.

→ We use LDA in supervised learning, when the observations are labeled.

→ Read up on and understand LDA (if you haven't already).

**2. Principal Component Analysis (PCA):** The main purpose of PCA is to analyze the data to identify patterns, and to reduce the dimensionality of the dataset with minimal loss of information.

→ PCA tries to reduce dimensionality by exploring how one feature of the data is expressed in terms of the other features (linear correlation). Feature selection, by contrast, takes the target into account.

→ PCA works best on datasets with 3 or more dimensions, because as the number of dimensions increases, it becomes harder to interpret the resulting data cloud.

PCA is a bit complicated, yes, and explaining it here would make this already long blog even more boring. So please use these two excellent resources to understand it:

I. One-stop principal component analysis

ii. The video explanation by Josh Starmer (the same person behind StatQuest)
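Still, the basic mechanics fit in a few lines with scikit-learn's `PCA` (the iris data and the 95% variance target are illustration choices, not part of the references above):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # note: PCA never looks at the target

# keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

print(X.shape, "->", X_pca.shape)        # 4 dimensions down to 2
print(pca.explained_variance_ratio_)     # variance captured per component
```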

**Wrapping it up** (but not with the wrapper method... duh 😜): this is the end of the series. Some final (but not exhaustive) tips:

i. Never skip feature engineering or feature selection and throw everything at the algorithm.

ii. In the endnotes, I have shared two very useful and powerful tools (mark my words, they will be of great help to you).
