One of the key components of our team's success in placing first in the text subtask of this year's CALL Shared Task Challenge was careful preparation and cleaning of the data. Data cleaning and preparation is the most critical first step in any AI project. Evidence suggests that most data scientists spend most of their time (up to 70%) cleaning data.
In this blog post, we'll walk you through the initial steps of data cleaning and preprocessing in Python, from importing the most popular libraries through to the actual encoding of features.
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. // Wikipedia
Step 1. Load the data set
The first thing you need to do is import the libraries for data preprocessing. There are many libraries available, but the most popular and important Python libraries for working with data are NumPy, Matplotlib, and pandas. NumPy is the library used for all things mathematical. pandas is the best tool for importing and managing data sets. Matplotlib (matplotlib.pyplot) is the library for making charts.
For future use, you can import these libraries using a shortcut alias:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Load data into pandas
After downloading the data set and saving it as a .csv file, you need to load it into a pandas DataFrame to explore it and perform some basic cleanup tasks, removing information you don't need that would otherwise slow down data processing.
Typically, such tasks include:
- Delete the first line: it contains irrelevant text instead of column headings. This text prevents the pandas library from parsing the data set correctly:
my_dataset = pd.read_csv('data/my_dataset.csv', skiprows=1, low_memory=False)
- Delete columns that contain text descriptions we don't need, url columns and other unnecessary columns:
my_dataset = my_dataset.drop(['url'], axis=1)
- Delete all columns that contain only one value, or in which more than 50% of the values are missing, so that you can work faster (provided your data set is large enough to remain meaningful):
half_count = len(my_dataset) // 2  # minimum number of non-missing values a column must have
my_dataset = my_dataset.dropna(thresh=half_count, axis=1)
It is also good practice to give the filtered data set a different name to keep it separate from the raw data. This ensures that you still have the original data in case you need to return to it.
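The cleanup tasks above can be sketched on a small in-memory DataFrame. This is only an illustration under stated assumptions: the column names and values are made up, the data is built in code rather than read from a file, and the 50% threshold simply follows the rule of thumb in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data standing in for the loaded CSV.
raw = pd.DataFrame({
    "url": ["a.com", "b.com", "c.com", "d.com"],
    "amount": [100.0, 250.0, np.nan, 80.0],
    "country": ["US", "US", "US", "US"],        # only one unique value
    "notes": [np.nan, np.nan, np.nan, "late"],  # 75% missing
})

# Work on a separately named copy so the raw data stays untouched.
filtered = raw.drop(["url"], axis=1)

# Keep only columns that have at least half of their values present.
half_count = len(filtered) // 2
filtered = filtered.dropna(thresh=half_count, axis=1)

# Drop columns that contain a single unique value.
single_valued = [c for c in filtered.columns if filtered[c].nunique() <= 1]
filtered = filtered.drop(columns=single_valued)

print(list(filtered.columns))
```

Because every step assigns to filtered, the raw DataFrame keeps all four original columns and can be reused if a cleanup decision turns out to be wrong.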
Step 2. Explore the data set
Understand the data
Now that you have set up the data, you should still take some time to explore it and understand what each column represents. This manual review of the data set is important to avoid errors in the data analysis and modeling process.
To simplify the process, you can create a DataFrame that lists, for each column, its name, its data type, its first-row value, and its description from the data dictionary.
As you explore the features, pay attention to any column that:
- is poorly formatted,
- needs more data or a lot of preprocessing to become a useful feature, or
- contains redundant information,
because if not handled properly, these things can damage your analysis.
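One possible way to build such an overview table is sketched below. The data set, the data dictionary, and all column names are hypothetical; the sketch assumes the data dictionary is itself a table with one row per column name.

```python
import pandas as pd

# Hypothetical data set and data dictionary; names are illustrative only.
dataset = pd.DataFrame({
    "loan_amnt": [5000, 2500],
    "term": ["36 months", "60 months"],
})
data_dictionary = pd.DataFrame({
    "name": ["loan_amnt", "term"],
    "description": ["Amount requested by the borrower",
                    "Number of payments on the loan"],
})

# One row per column: its name, data type, and first-row value.
preview = pd.DataFrame({
    "name": dataset.columns,
    "dtype": dataset.dtypes.astype(str).values,
    "first_value": dataset.iloc[0].values,
})

# Attach the description from the data dictionary.
preview = preview.merge(data_dictionary, on="name", how="left")
print(preview)
```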
You should also pay attention to data leakage, which can cause the model to overfit. This happens because the model also learns from features that will not be available when we use it to make predictions. For example, in a loan-approval problem, we need to ensure that the model is trained only on data that was available at the time of the loan application.
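As a small illustration of guarding against leakage, the sketch below drops columns that would only be known after a loan was issued. The column names and the choice of which ones leak are assumptions made for this example; in practice that decision comes from the data dictionary.

```python
import pandas as pd

# Hypothetical loan data; "total_pymnt" and "recoveries" are only known
# after the loan was issued, so they would leak the outcome.
loans = pd.DataFrame({
    "loan_amnt": [5000, 2500, 12000],
    "annual_inc": [40000, 32000, 85000],
    "total_pymnt": [5600.0, 900.0, 13100.0],
    "recoveries": [0.0, 120.0, 0.0],
    "loan_status": [1, 0, 1],
})

# Columns assumed (for this sketch) to be unavailable at application time.
leaky_columns = ["total_pymnt", "recoveries"]
loans_clean = loans.drop(columns=leaky_columns)
print(list(loans_clean.columns))
```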
Determine the target column
By exploring the filtered data set, you need to create a matrix of the independent variables (the features) and a vector of the dependent variable (the target). First, decide on the appropriate column to use as the target column for modeling, based on the question you want to answer. For example, if you want to predict the development of cancer, or the chance that a credit application will be approved, you need to find a column with the disease status or the loan decision and use it as the target column.
For example, if the target column is the last column, you can create the matrix of independent variables by typing:
X = dataset.iloc[:, :-1].values
The first colon (:) means that we want all the rows in our data set. :-1 means that we want all the columns except the last one. The .values at the end means that we want all the values.
To get the dependent variable vector that contains only the last column of data, type:
y = dataset.iloc[:, -1].values
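To see what these two selections return, here is a minimal sketch on a made-up data set whose last column is the target. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data set whose last column is the target.
dataset = pd.DataFrame({
    "age": [25, 40, 31],
    "salary": [30000, 72000, 48000],
    "approved": [0, 1, 1],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, last column only

print(X.shape, y.shape)
```

X is a two-dimensional array with one row per record and one column per feature, while y is a one-dimensional array with one target value per record.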
Step 3. Prepare the features for machine learning
Finally, it is time to prepare the features to feed to the ML algorithms. To clean the data set, you need to handle missing values and categorical features, because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. Moreover, the scikit-learn library returns an error if you try to train a model such as linear regression or logistic regression on data that contains missing or non-numeric values.
Handling missing values
Missing data is perhaps the most common trait of unclean data. These values usually take the form of NaN or None.
There are several reasons for missing values: sometimes values are missing because they do not exist, and sometimes because of improper data collection or poor data entry. For example, if someone is underage and a question applies only to people over 18, that question will have a missing value. In such cases, it would be wrong to fill in a value for that question.
There are several ways to fill in missing values:
- If your data set is large enough and the percentage of missing values is high (for example, more than 50%), you can delete the rows and columns that contain the missing data;
- You can fill all missing values with 0, if you are dealing with numerical values;
- You can use the Imputer class from the scikit-learn library to fill in missing values with the data's mean, median, or most frequent value (in current scikit-learn versions this class is SimpleImputer in sklearn.impute);
- You can also decide to fill in missing values with whatever value comes directly after them in the same column.
These decisions depend on the type of data, what you want to do with the data, and the reason the values are missing. In fact, just because something is popular doesn't necessarily make it the right choice. The most common strategy is to use the mean value, but depending on your data, you might take a completely different approach.
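The filling strategies listed above can be compared side by side. The sketch below is a minimal illustration on made-up numbers; it uses SimpleImputer from sklearn.impute, which replaced the older Imputer class in recent scikit-learn releases, and the pandas bfill method as one assumed way to take the next value in the same column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with gaps.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0], "score": [1.0, 2.0, np.nan]})

# Option 1: replace every missing value with 0.
filled_zero = df.fillna(0)

# Option 2: replace missing values with the column mean.
imputer = SimpleImputer(strategy="mean")
filled_mean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Option 3: back-fill, taking the next valid value in the same column
# (a trailing NaN stays missing because nothing follows it).
filled_next = df.bfill()

print(filled_mean)
```

Changing strategy to "median" or "most_frequent" switches the imputation rule without touching the rest of the code.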
Handling categorical data
Machine learning models generally work only with numeric values (float or int data types). However, data sets often contain object data types that need to be converted to numbers. In most cases, categorical values are discrete and can be encoded as dummy variables, assigning a number to each category. The simplest way is to use OneHotEncoder, specifying the index of the column you want to work on:
from sklearn.preprocessing import OneHotEncoder
# categorical_features takes the index of the column to encode; this parameter was removed in scikit-learn 0.22
onehotencoder = OneHotEncoder(categorical_features = )
X = onehotencoder.fit_transform(X).toarray()
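For readers on a recent scikit-learn release, where OneHotEncoder no longer accepts a column index, one possible equivalent is to encode the categorical column on its own and then join it back to the numeric columns. The sketch below assumes a made-up data set with one categorical and one numeric column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({
    "country": ["France", "Spain", "France", "Germany"],
    "age": [44, 27, 30, 38],
})

# Encode only the categorical column; the default output is a sparse
# matrix, so convert it to a dense array.
encoder = OneHotEncoder()
country_encoded = encoder.fit_transform(df[["country"]]).toarray()

# Put the dummy columns next to the untouched numeric column.
X = np.hstack([country_encoded, df[["age"]].values])

print(encoder.categories_[0])
print(X)
```

The encoder orders categories alphabetically, so each row of X holds three dummy values (France, Germany, Spain) followed by the age.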
Handling inconsistent data entry
Inconsistencies can occur when a column contains different unique values that are meant to be the same. Think of different capitalization, simple misprints, and inconsistent formats to get an idea. One way to remove data inconsistencies is to strip the whitespace before and after each entry and convert everything to lowercase.
However, if there are a large number of inconsistent unique entries, you cannot check for the closest matches manually. You can use the FuzzyWuzzy package to identify which strings are most likely to be the same. It takes two strings and returns a ratio. The closer the ratio is to 100, the more likely it is that the two strings should be unified.
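The two ideas above, normalizing whitespace and case and then matching near-duplicates, can be sketched as follows. To keep the example free of extra dependencies it uses difflib.SequenceMatcher from the standard library in place of FuzzyWuzzy; the city names, the list of canonical spellings, and the threshold of 80 are assumptions chosen for illustration.

```python
import difflib
import pandas as pd

# Hypothetical column with inconsistent entries.
cities = pd.Series([" New York", "new york ", "New Yrok", "Boston", "boston"])

# Strip surrounding whitespace and lowercase everything.
cleaned = cities.str.strip().str.lower()

def similarity(a, b):
    """Return a 0-100 similarity score for two strings."""
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

# Assumed list of correct spellings to unify towards.
canonical = ["new york", "boston"]

def unify(value, threshold=80):
    """Map a value to its closest canonical name if the score is high enough."""
    best = max(canonical, key=lambda name: similarity(value, name))
    return best if similarity(value, best) >= threshold else value

unified = cleaned.map(unify)
print(unified.tolist())
```

The threshold controls how aggressive the unification is: a lower value merges more entries but raises the risk of merging strings that are genuinely different.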
Handling dates and times
A specific type of data inconsistency is an inconsistent date format, such as dd/mm/yy and mm/dd/yy in the same column. Your date values may also not be stored in the right data type, which prevents you from manipulating them efficiently and gaining insight from them. In this case you can use the datetime package to fix the type of the dates.
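As a minimal sketch of fixing the date type, the example below converts text dates to proper datetime values. It uses the pandas to_datetime function rather than the datetime package directly, and it assumes the column is known to be in dd/mm/yy format; the sample values are made up.

```python
import pandas as pd

# Hypothetical column of dates stored as plain text (dd/mm/yy).
dates = pd.Series(["03/01/19", "25/12/18", "not a date"])

# An explicit format avoids guessing between dd/mm/yy and mm/dd/yy;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%d/%m/%y", errors="coerce")

print(parsed)
```

Entries that come back as NaT are worth inspecting by hand, since they often reveal rows written in the other date format.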
Scaling and normalization
Scaling is important when a change in one quantity is not equivalent to the same change in another. With scaling, you ensure that features are not used as the main predictor just because their values are large. For example, if you use a person's age and salary in a prediction, some algorithms will pay more attention to salary because its values are bigger, which makes no sense.
Normalization involves transforming or converting a data set into a normal distribution. Algorithms such as SVM converge much faster on normalized data, so it makes sense to normalize the data to get better results.
There are many ways to perform feature scaling. In short, we put all the features on the same scale so that no one feature dominates another. For example, you can use the StandardScaler class from the sklearn.preprocessing package to fit and transform the data set:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
You don't need to fit the scaler on the test set; you only need to apply the transformation.
sc_y = StandardScaler()
# StandardScaler expects a 2-D array, so reshape the 1-D target vector first
y_train = sc_y.fit_transform(y_train.reshape(-1, 1))
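The effect of fitting on the training set and only transforming the test set can be checked on a tiny example. The age and salary values below are made up, and the variable names are chosen for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age and salary features on very different scales.
X_train = np.array([[25.0, 30000.0], [35.0, 60000.0], [45.0, 90000.0]])
X_test = np.array([[30.0, 45000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1, so age and salary contribute on comparable scales.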
Save as CSV
To ensure that you still have raw data, it's a good idea to store the final output of each part or stage of the workflow in a separate csv file. This way, you can make changes in the data processing flow without having to recalculate everything.
As before, you can use the pandas to_csv() function to store a DataFrame as a .csv file.
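A minimal sketch of saving one stage of the workflow and reloading it later is shown below. The file name is hypothetical, and a temporary directory is used only so that the example runs anywhere; a real project would write to a stable location.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned data from one stage of the workflow.
cleaned = pd.DataFrame({"age": [25, 40], "approved": [0, 1]})

# Write to a temporary directory here; the file name is illustrative.
out_path = os.path.join(tempfile.mkdtemp(), "cleaned_stage1.csv")
cleaned.to_csv(out_path, index=False)  # index=False avoids an extra index column

# Reload later without repeating the cleaning steps.
reloaded = pd.read_csv(out_path)
print(reloaded)
```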
These are the basic steps required to work through a large data set, cleaning and preparing the data for any data science project. You may find other forms of data cleaning useful as well. For now, the point to remember is that you need to clean and organize your data before you build any model. Better, cleaner data beats the best algorithm. If you use a very simple algorithm on the cleanest data, you will get very impressive results. Moreover, performing basic preprocessing is not difficult!
This article is reproduced from Medium.