Have you ever thought about how we apply machine learning algorithms to a problem in order to analyze and visualize data, discover trends, and find correlations? In this article, I will discuss the common steps of building a machine learning model and how to choose the right model for your data. The inspiration comes from common interview questions about how to approach a data science problem and why a particular approach was chosen.
As data scientists, we follow some guidelines when creating a model:
- Collect data (usually tons of it)
- Establish goals, hypotheses to test, and a timeline for completing the task
- Check for anomalies or outliers
- Explore missing data
- Clean the data based on our constraints, goals, and hypotheses
- Perform statistical analysis and initial visualization
- Scale, regularize, and normalize the data, engineer features, sample randomly, and validate the data to prepare it for modeling
- Train and test on our data, holding out a validation portion to play the role of unseen data
- Build models based on classification or regression metrics, for supervised or unsupervised learning
- Establish a baseline accuracy and compare the current model's accuracy against it on the training and test data
- Double-check that we have solved the problem, and present the results
- Prepare models for deployment and product delivery (AWS, Docker, Buckets, App, Website, Software, Flask, etc.)
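Two of the steps above, holding out data and establishing a baseline accuracy, can be sketched in plain Python. This is only a minimal illustration under assumptions, not a prescribed pipeline: the function names `split_train_test` and `majority_baseline_accuracy`, the 25% test fraction, and the toy labels are all hypothetical.

```python
import random
from collections import Counter


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# Hypothetical toy data: (feature, label) pairs with an 80/20 class balance.
rows = [(i, "yes" if i % 5 == 0 else "no") for i in range(100)]
train, test = split_train_test(rows)
baseline = majority_baseline_accuracy(
    [label for _, label in train], [label for _, label in test]
)
print(len(train), len(test), round(baseline, 2))
```

A trained model is only interesting if it beats this kind of trivial baseline on the held-out rows.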
Machine learning tasks can be divided into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. This article does not focus on the last two, but I will briefly explain what they mean.
Semi-supervised learning uses unlabeled data to gain a better general understanding of the structure of the population. In purely supervised learning, we learn features only from a small training set because it is labeled, and we do not take advantage of the test set, which contains a lot of valuable information, because it is unlabeled. Semi-supervised learning therefore looks for a way to learn from a large amount of unlabeled data as well.
According to GeeksForGeeks, reinforcement learning is about taking suitable actions to maximize reward in a particular situation. The machine or robot learns by trying all possible paths and then choosing the path that yields the greatest reward with the fewest obstacles.
Here are some approaches to consider when choosing a model for a machine learning or deep learning task:
1. Data imbalance is relatively common.
We can deal with imbalanced data by resampling, a way of using data samples to improve accuracy and to quantify the uncertainty of a population parameter. Resampling methods can also be nested inside one another.
We divide the raw data into training and test sets. After finding the coefficients that fit our model with the help of the training set, we can apply the model to the test set and measure its accuracy. This is the final accuracy before applying the model to unknown data (also known as our validation set). A good final accuracy gives us more confidence that we will get accurate results on unknown data.
However, if we further divide the training set into training and test subsets, calculate the accuracy on each, and repeat this for many such subsets, we can find the maximum accuracy achieved across them. We hope that the resulting model will also give the maximum accuracy on our final test set. In short, we resample to improve the accuracy of the model. There are different ways to resample data, such as the bootstrap, cross-validation, and repeated cross-validation.
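The repeated splitting described above can be sketched with the standard library alone. This is a hedged illustration: the helper names `k_fold_splits`, `bootstrap_sample`, and `cross_validated_mse` are hypothetical, the "model" is just the training mean, and averaging the per-fold score is a common convention assumed here rather than something the text prescribes.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, nearly equal folds
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


def bootstrap_sample(rows, seed=0):
    """Draw len(rows) items with replacement (one bootstrap resample)."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


def cross_validated_mse(ys, k=5):
    """Average squared error of a 'predict the training mean' model over k folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(ys), k):
        mean_y = sum(ys[i] for i in train_idx) / len(train_idx)
        errors = [(ys[i] - mean_y) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Hypothetical target values.
ys = [float(i % 7) for i in range(30)]
print(round(cross_validated_mse(ys), 3))
print(bootstrap_sample([1, 2, 3, 4, 5]))
```

Every sample lands in exactly one test fold, so each observation is used for evaluation once and for training in the remaining rounds.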
2. We can create new features through principal component analysis
Also known as PCA, it helps to reduce the dimensionality of the data. Clustering techniques are also very common in unsupervised machine learning.
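To make the idea of a principal component concrete, here is a minimal sketch that finds only the first component by power iteration on the covariance matrix. It is an illustration under assumptions, not a full PCA: the function names and the sample points are hypothetical, and the starting direction is assumed not to be orthogonal to the leading component. In practice a library routine would normally be used instead.

```python
import math


def first_principal_component(rows, n_iterations=100):
    """Return the column means and the unit direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    vector = [1.0] * d  # arbitrary starting direction (an assumption)
    for _ in range(n_iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    return means, vector


def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [
        sum((row[j] - means[j]) * component[j] for j in range(len(means)))
        for row in rows
    ]


# Hypothetical 2-D points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, component = first_principal_component(points)
print([round(c, 3) for c in component])
print([round(z, 2) for z in project(points, means, component)])
```

The two original columns are replaced by one new feature, the coordinate along the direction in which the points vary most.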
3. With regularization techniques, we can reduce overfitting and limit the influence of outliers and noise. Regularization does not fix underfitting; too much of it can cause underfitting.
4. We need to address the black-box AI problem
This leads us to consider strategies for building interpretable models. According to KDNuggets, black-box AI systems for automated decision making are typically based on machine learning over big data, mapping a user's features into a class that predicts the behavioral traits of individuals without explaining why.
This is a problem not only because of the lack of transparency, but also because the algorithms may inherit biases from human prejudices and from artifacts hidden in the training data, which can lead to unfair or wrong decisions and flawed analysis.
5. Understand algorithms that are not sensitive to outliers
We can decide whether to use randomness in our models, for example in the form of random forests, to overcome the skew that outliers introduce.
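Point 3 above mentions regularization. As a minimal sketch of what it does, consider a one-feature linear fit with an L2 (ridge) penalty on the slope: minimizing the squared error plus `penalty * slope**2` gives a slope of `S_xy / (S_xx + penalty)`. The function name `fit_ridge_line` and the data are hypothetical, and the intercept is assumed to be unpenalized.

```python
def fit_ridge_line(xs, ys, penalty=0.0):
    """Fit y = intercept + slope * x with an L2 (ridge) penalty on the slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / (s_xx + penalty)  # penalty = 0 gives ordinary least squares
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: a roughly linear trend with one outlying point at the end.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0]
for penalty in (0.0, 1.0, 10.0):
    intercept, slope = fit_ridge_line(xs, ys, penalty)
    print(penalty, round(intercept, 3), round(slope, 3))
```

A larger penalty shrinks the slope toward zero, which makes the fit less responsive to the outlying point but, if pushed too far, leaves the model too flat to follow the real trend.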
Machine learning models
Most of these models are covered in my data scientist study guide. For each model, the guide gives a clear definition, its purpose, its uses, when to use it, and a simple verbal example. If you would like to access my guide, please click here; it was also published and recommended in Medium's "Towards Data Science."
1. The first approach to predicting continuous values: linear regression is usually the first and most common choice, for example for house prices.
2. Binary classification approaches usually look like a logistic regression model. If you have a two-class classification problem, support vector machines (SVMs) are very helpful for getting the best results!
3. Multi-class classification: random forests are highly favored, but SVMs have their advantages. Random forests are better suited to multiple classes!
For multiple classes, an SVM needs the problem to be reduced to several binary classification problems. A random forest works well with a mixture of numerical and categorical features, even when the features are on different scales, which means you can use the data as it is. An SVM maximizes the margin and relies on the concept of distance between points, so it really matters whether distance is meaningful for your data!
Because of this, with an SVM we must one-hot encode (dummy-encode) the categorical features. In addition, min-max or another form of scaling is highly recommended as a preprocessing step. For the most common classification problems, a random forest gives you the probability of belonging to a class, while an SVM gives you the distance to the boundary. If you need a probability from an SVM, you still have to convert that distance into one in some way. On the problems where an SVM is applicable, it tends to outperform a random forest. An SVM also gives you the support vectors, the points in each class that lie closest to the boundary between the classes.
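The two preprocessing steps recommended above for distance-based models can be sketched as follows. This is an illustrative sketch only: the helper names `one_hot_encode` and `min_max_scale` and the housing rows are hypothetical.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    return categories, [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Hypothetical housing rows: neighbourhood and area in square metres.
neighbourhoods = ["north", "south", "north", "east"]
areas = [50.0, 120.0, 85.0, 200.0]

categories, encoded = one_hot_encode(neighbourhoods)
scaled = min_max_scale(areas)
features = [row + [area] for row, area in zip(encoded, scaled)]
print(categories)
print(features)
```

After this step every feature lies between 0 and 1, so no single column dominates the distances that an SVM relies on.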
4. Which classification model is the simplest to start with? Decision trees are considered the easiest to use and understand. They are also the building blocks of models such as random forests and gradient boosting.
5. Competition models? Kaggle competitions favor random forests and XGBoost! What is a gradient-boosted tree?
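As a rough answer to the question in point 5, gradient boosting builds a model by adding small trees one at a time, each fitted to the errors left by the trees before it. The sketch below shows this for squared-error regression on a single feature, using one-split trees (stumps). It is a teaching sketch under assumptions, far simpler than XGBoost: all function names, the learning rate, the number of rounds, and the data are hypothetical.

```python
def fit_stump(xs, targets):
    """Best single-threshold split of one feature, by squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_value(stump, x):
    """Prediction of a single stump for input x."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_value(stump, x)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical one-feature regression data with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = fit_boosted_stumps(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
print(round(baseline_mse, 4), round(boosted_mse, 4))
```

For squared loss the residuals are the negative gradient of the loss, which is where the name comes from. The learning rate keeps each stump's correction small so that no single tree dominates.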
Deep learning models
According to Investopedia, deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data and creating patterns for use in decision making.
1. We can use a multilayer perceptron to focus on complex features that are not easy to specify, provided there is a large amount of labeled data!
According to Techopedia, a multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a directed graph between the input layer and the output layer.
2. For vision-based machine learning, such as image classification, object detection, image segmentation, or image recognition, we use a convolutional neural network. CNNs are used in image recognition and processing and are designed specifically to process pixel data.
3. For sequence modeling tasks such as language translation or text classification, we use a recurrent neural network.
RNNs come into play whenever a model needs context in order to produce an output from its input. Sometimes the context is the single most important thing for the model to predict the most appropriate output. In other neural networks, all inputs are independent of each other.
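The difference between a network that carries context and one that treats inputs independently can be shown with a toy scalar recurrence. This is a hedged sketch, not a trained RNN: the weights are fixed, hypothetical values, there are no biases or output layer, and the function names are made up for illustration.

```python
import math


def run_simple_rnn(inputs, w_input=1.0, w_hidden=0.8, initial_state=0.0):
    """Scalar recurrence: h_t = tanh(w_input * x_t + w_hidden * h_{t-1})."""
    state = initial_state
    states = []
    for x in inputs:
        state = math.tanh(w_input * x + w_hidden * state)
        states.append(state)
    return states


def run_feedforward(inputs, w_input=1.0):
    """Context-free counterpart: each output depends only on its own input."""
    return [math.tanh(w_input * x) for x in inputs]


# Two hypothetical sequences that end with the same input but differ earlier.
sequence_a = [1.0, 0.0, 0.5]
sequence_b = [-1.0, 0.0, 0.5]

print(run_simple_rnn(sequence_a)[-1], run_simple_rnn(sequence_b)[-1])
print(run_feedforward(sequence_a)[-1], run_feedforward(sequence_b)[-1])
```

Both sequences end with the same input, yet the recurrent version produces different final values because its hidden state remembers what came earlier, while the feedforward version produces identical ones.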
Thank you for taking the time to read my article. In my interviews, I have been asked how I would solve a data science problem and choose the right machine learning model. The question was not whether I was wrong, but why I chose a particular direction. In fact, there is no single correct answer for your approach, but some model choices are more appropriate than others. Ultimately, it depends on the data you are analyzing and the goals you are trying to achieve!
This article is reposted from Medium. Original address