In machine learning, feature selection is the process of selecting the features that are most useful for your predictions. Although it sounds simple, it is one of the most complicated problems you face when building a machine learning model.
In this article, I will share some of the methods we studied during a previous project at Fiverr.
You'll get some ideas about the basic methods I tried and the more complicated methods that got the best results: removing 60% or more of the features while maintaining accuracy and achieving higher stability for our model. I will also share our improvements to the algorithm.

Why is feature selection so important?

If you have ever built a machine learning model, you know how difficult it is to tell which features are important and which are just noise.

Removing noisy features helps with memory, computational cost, and model accuracy.
In addition, removing features helps you avoid overfitting the model.

Sometimes a feature has business meaning, but that doesn't mean it will help you make predictions.
You need to remember that a feature may be useful in one algorithm (such as a decision tree) but not in another (such as a regression model); not all features are created equal :)

Irrelevant or only partially relevant features can have a negative impact on model performance. Feature selection and data cleaning should be the first and most important steps in designing your model.

Feature selection methods

Although there are many techniques for feature selection, such as backward elimination and lasso regression, in this article I will share three methods that we found most useful for doing better feature selection, each with its own advantages.

"All But X"

At Fiverr, we named this technique "All But X." The technique is simple but useful.

  1. Train and evaluate the model repeatedly.
  2. In each iteration, remove a single feature.
    If you have a lot of features, you can remove a "family" of features. At Fiverr, we usually aggregate features over different time windows, such as 30-day clicks, 60-day clicks, and so on. This is a family of features.
  3. Compare the evaluation metrics against the baseline.

The goal of this technique is to see which feature families do not affect the evaluation, or whether removing them even improves it.

The problem with this approach is that removing one feature at a time does not capture interactions between features (non-linear effects). Perhaps the combination of feature X and feature Y is producing the noise, not feature X alone.
Running "All But X" over all the feature families: after running all the iterations, we compare them to see which removals do not affect the accuracy of the model.
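To make the loop concrete, here is a minimal sketch of "All But X" on synthetic data. The feature families, column names, and the scikit-learn model are all illustrative stand-ins, not our production code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in practice, use your own dataframe
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

# Group related columns into "families" (e.g. 30-day clicks, 60-day clicks, ...)
families = {
    "family_a": ["f0", "f1"],
    "family_b": ["f2", "f3"],
    "family_c": ["f4", "f5", "f6", "f7"],
}

def evaluate(df):
    """Cross-validated accuracy of a baseline model on the given features."""
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    return cross_val_score(model, df, y, cv=3).mean()

baseline = evaluate(X)
# "All But X": drop one family per iteration and compare to the baseline
scores = {name: evaluate(X.drop(columns=cols)) for name, cols in families.items()}
for name, score in scores.items():
    print(f"without {name}: {score:.3f} (baseline {baseline:.3f})")
```

Any family whose removal leaves the score at or above the baseline is a candidate for deletion.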


Feature importance + random features

Another method we tried is to use the feature importance that most machine learning model APIs provide.

What we did was not simply take the top N features by feature importance. Instead, we added three random features to the data:

  1. A binary random feature (0 or 1)
  2. A uniform random feature between 0 and 1
  3. An integer random feature

After getting the feature importance list, we selected only the features that ranked higher than the random features.

It is important to use random features from different distributions, as each distribution has a different impact.

In trees, the model "likes" continuous features (because of the splits), so those features will sit higher in the importance ranking. Therefore, you need to compare each feature against a random feature of a similar type.
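As a sketch of the idea (the column names, threshold, and scikit-learn model are illustrative assumptions, not our exact code), adding the three random features and filtering by importance can look like this:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Add three random features with different distributions
X["rand_binary"] = rng.randint(0, 2, len(X))
X["rand_uniform"] = rng.uniform(0, 1, len(X))
X["rand_int"] = rng.randint(0, 100, len(X))

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep only features more important than the best-scoring random feature
threshold = importances[["rand_binary", "rand_uniform", "rand_int"]].max()
selected = importances[importances > threshold].index.tolist()
print(selected)
```

Using the maximum of the random importances as the cutoff is a conservative choice; a gentler variant compares each feature only against the random feature of the matching type.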


Boruta

Boruta is a feature ranking and selection algorithm that was developed at the University of Warsaw. The algorithm is based on random forests, but can also be used with XGBoost and other tree algorithms.

At Fiverr, I used this algorithm with some improvements for XGBoost ranking and classifier models, which I will cover briefly.

This algorithm is a combination of the two methods I mentioned above.

  1. Create a "shadow" feature for each feature in the dataset, with the same feature values but shuffled between the rows.
  2. Loop until one of the stopping conditions is met:
    2.1. We are not removing any more features
    2.2. We removed enough features - say, we wanted to remove 60% of the features
    2.3. We ran N iterations - we limit the number of iterations to avoid getting into an infinite loop
  3. Run X iterations (we used 5) to eliminate the randomness of the model:
    3.1. Train the model using both the regular features and the shadow features
    3.2. Save the average feature importance score for each feature
    3.3. Remove all features whose importance is lower than that of their shadow feature
def _create_shadow(x):
    """
    Take all X variables, create copies, and randomly shuffle them.
    :param x: the dataframe to create shadow features on
    :return: dataframe of 2x width and the names of the shadow features, for removing them later
    """
    x_shadow = x.copy()
    for c in x_shadow.columns:
        np.random.shuffle(x_shadow[c].values)  # shuffle the values of each feature column independently
    # Rename the shadow columns
    shadow_names = ["shadow_feature_" + str(i + 1) for i in range(x.shape[1])]
    x_shadow.columns = shadow_names
    # Combine to make one new dataframe
    x_new = pd.concat([x, x_shadow], axis=1)
    return x_new, shadow_names

# Set up the parameters for running the model in XGBoost
param = booster_params
df = pd.DataFrame() # initial empty dataframe

for i in range(1, n_iterations + 1):
    # Create the shadow variables and run the model to obtain importances
    new_x, shadow_names = _create_shadow(x)
    bst, df = _run_model(new_x, y, group, weights, param, num_boost_round, early_stopping_rounds, i == 1, df)
    df = _check_feature_importance(bst, df, i, importance_type)

df[MEAN_COLUMN] = df.mean(axis=1)
# Split the real and shadow features back out
real_vars = df[~df['feature'].isin(shadow_names)]
shadow_vars = df[df['feature'].isin(shadow_names)]

# Get mean value from the shadows
mean_shadow = shadow_vars[MEAN_COLUMN].mean() * (perc / 100)
real_vars = real_vars[(real_vars[MEAN_COLUMN] > mean_shadow)]

criteria = _check_stopping_criteria(delta, real_vars, x)

return criteria, real_vars['feature']
The Boruta run flow: create shadow features, train, compare, delete features, and repeat.
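The snippet above relies on helpers from our codebase (`_run_model`, `_check_feature_importance`, and the surrounding function). As a self-contained sketch of a single create-shadow, train, compare pass, with a scikit-learn random forest standing in for XGBoost and all names illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
x = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Shadow features: same values as the originals, shuffled between the rows
rng = np.random.RandomState(0)
x_shadow = x.apply(lambda col: rng.permutation(col.values))
x_shadow.columns = [f"shadow_{c}" for c in x.columns]
shadow_names = list(x_shadow.columns)

# Train on real + shadow features and compare importances
x_all = pd.concat([x, x_shadow], axis=1)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_all, y)
imp = pd.Series(model.feature_importances_, index=x_all.columns)

# Keep only the real features that beat the mean shadow importance
mean_shadow = imp[shadow_names].mean()
kept = [c for c in x.columns if imp[c] > mean_shadow]
print(kept)
```

A full run repeats this pass, averaging importances over several iterations before dropping features, as in the algorithm above.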

Boruta 2.0

This is the best part of this article: our improvement to Boruta.

We ran Boruta using a "short version" of the original model. By taking a sample of the data and a small number of trees (we use XGBoost), we improved the runtime of the original Boruta without compromising accuracy.
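The "short version" idea can be sketched as follows. We use XGBoost in production, but a scikit-learn gradient boosting model stands in here, and the sample fraction and tree count are illustrative assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
y = pd.Series(y)

# "Short version": a 25% sample of the rows and a small number of trees,
# so each Boruta iteration is cheap
sample_idx = X.sample(frac=0.25, random_state=1).index
short_model = GradientBoostingClassifier(n_estimators=20, random_state=1)
short_model.fit(X.loc[sample_idx], y.loc[sample_idx])

importances = pd.Series(short_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```

The importance ranking from the cheap model drives the shadow-feature comparisons; only the final selected feature set is then used to train the full model.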

Another improvement is that we ran the algorithm with the random features mentioned earlier. Seeing that the algorithm removed all the random features from the dataset is a good sanity check.

With the improvements, we did not see any change in the accuracy of the model, but we did see an improvement in runtime. By removing features, we were able to go from more than 200 features to fewer than 70. We saw the model remain stable across different numbers of trees and different stages of training.

We also saw an improvement in the gap between the training loss and the validation loss.

The advantage of Boruta and our improvements is that you are running your own model. In this case, the features found to be problematic are problematic for your model, not for a different one.

Final Thoughts

In this article, you learned about three different techniques for selecting features from your datasets and how to use them to build effective predictive models. You saw our implementation of Boruta, its runtime improvements, and the random features we added for sanity checks.

With these improvements, our model runs faster, is more stable, and maintains accuracy with only 35% of the original features.

Choose the technique that works best for you. Keep in mind that feature selection can help improve accuracy, stability, and runtime, and help you avoid overfitting. More importantly, fewer features make debugging and interpretability easier.

This article was originally published on Medium.