
Today, many organizations are looking for faster and more accurate ways to prepare data, both to solve data challenges and to enable machine learning (ML). Before data is imported into a machine learning model or any other analytical project, it must be clean, consistent, and accurate. Because so much of today's analysis depends on the context of the data, the people closest to its content are best placed to do this work: business experts who can apply intuition, theory, and business knowledge to the data.

Unfortunately, business users often lack data science skills, and bridging this gap is key to getting value from data quickly. As a result, many organizations use data preparation (DP) tools to help data scientists and machine learning practitioners quickly prepare and annotate their corporate data, expanding the value of data across the enterprise.

How data collection and preparation form the basis of a trusted ML model

To create successful machine learning models, companies must be able to train, test, and validate them before putting them into production. Data preparation techniques create the clean, annotated foundation that modern machine learning requires. Historically, however, good data preparation has taken more time than any other part of the machine learning process.

Reducing the time spent on data preparation is increasingly important, because it leaves more time for model testing, debugging, and optimization, where greater value is created. Well-prepared data also accelerates machine learning and data science projects for analytics and ML teams, speeding up and automating data insight through the six key steps below.

Step 1: Data collection

This is the most fundamental step, and a good tool should handle several common problems, including:

· Automatically determining the relevant attributes of string data stored in .csv files.

· Parsing highly nested data structures, such as XML or JSON files, into tabular form to make scanning and pattern detection easier (see the sketch after this list).

· Searching for and identifying relevant data in external storage.
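As a rough illustration of the second point, here is a minimal sketch in Python with pandas that flattens a nested JSON record into a table; the record structure and field names are invented for the example:

```python
import pandas as pd

# Hypothetical nested JSON record: one order with a list of line items.
record = {
    "order_id": 1001,
    "customer": {"name": "Ada", "state": "CA"},
    "items": [
        {"sku": "A1", "price": 5.50},
        {"sku": "B2", "price": 12.00},
    ],
}

# Flatten to one row per line item, repeating the order- and
# customer-level fields on each row for easy scanning.
flat = pd.json_normalize(
    record,
    record_path="items",
    meta=["order_id", ["customer", "name"], ["customer", "state"]],
)
print(flat)
```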

When evaluating a DP solution, also make sure it can combine multiple files into a single input; for example, you may have a set of files containing daily transaction records while the machine learning model needs a full year of data as input. Likewise, make sure there is a contingency plan for dataset issues related to sampling and bias in the machine learning model.
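A minimal sketch of that combination, assuming a directory of daily transaction CSVs (the path pattern and the presence of a `date` column are assumptions for this example):

```python
import glob
import pandas as pd

# Gather the hypothetical daily files and stack them into one frame.
daily_files = sorted(glob.glob("transactions/2017-*.csv"))
frames = [pd.read_csv(path, parse_dates=["date"]) for path in daily_files]
full_year = pd.concat(frames, ignore_index=True)

print(f"{len(daily_files)} daily files -> {len(full_year)} rows, "
      f"{full_year['date'].min():%Y-%m-%d} to {full_year['date'].max():%Y-%m-%d}")
```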

Step 2: Data exploration and analysis

Once data collection is complete, you need to assess the state of the data, looking for trends, outliers, anomalies, errors, inconsistencies, and missing or skewed information. This is important because the source data shapes every result the model produces, so you must make sure it contains no hidden biases. For example, if you want behavioral data for a nation's consumers but extract data only from a limited sample, you may miss important geographic regions. Look for anything that could bias the model's results across the entire dataset, not just in a partial or sampled subset.
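A minimal profiling sketch in pandas, continuing with the `full_year` frame from the previous step and a hypothetical `state` column for checking geographic coverage:

```python
# Quick profile of the collected data: types, gaps, and distributions.
print(full_year.dtypes)
print(full_year.isna().sum())             # missing values per column
print(full_year.describe(include="all"))  # summary stats for every column

# Check geographic coverage: an under-represented region here is
# exactly the kind of hidden bias this step is meant to surface.
print(full_year["state"].value_counts(normalize=True))
```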

Step 3: Adjust and unify the data format

The next step in data preparation is to ensure that the format of the data matches the machine learning model. If the collected data comes from different sources, or if the dataset has been modified by hand by more than one contributor, check for inconsistencies in format (e.g., USD5.50 versus $5.50). Normalizing the values in a column to a single convention (for example, fully spelled-out versus abbreviated state names) ensures the data aggregates correctly. A consistent data format avoids these errors by making the entire dataset follow the same input conventions.
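A minimal sketch of both normalizations, with toy values invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["USD5.50", "$5.50", "5.5"],
    "state": ["California", "CA", "calif."],
})  # toy data

# Strip currency prefixes such as "USD" or "$" and parse to floats.
df["amount"] = (df["amount"]
                .str.replace(r"(?i)^(?:usd|\$)\s*", "", regex=True)
                .astype(float))

# Map spelling variants of state names onto one abbreviated form.
state_map = {"california": "CA", "ca": "CA", "calif.": "CA"}
df["state"] = df["state"].str.lower().map(state_map)
print(df)
```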

Step 4: Improve data quality

In this step, you first need a strategy for handling erroneous data, missing values, extreme values, and outliers. If your self-service data preparation tool has built-in smart matching, it can help align data attributes across databases and merge them intelligently. For example, if one database has the columns "name" and "last name", and another contains a "customers" list that appears to combine last name and first name, an intelligent algorithm should be able to work out how to match the two and consolidate the databases into a single customer view.
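The matching logic in such tools is proprietary, but here is a simplified sketch of the idea, using a deterministic join key in place of real fuzzy matching; all names and columns are invented:

```python
import pandas as pd

# Database A: separate first- and last-name columns (toy data).
a = pd.DataFrame({"name": ["Ada", "Alan"],
                  "last name": ["Lovelace", "Turing"]})

# Database B: one "customer" column combining both (toy data).
b = pd.DataFrame({"customer": ["Lovelace, Ada", "Turing, Alan"],
                  "segment": ["research", "enterprise"]})

# Build a common key on each side, then merge into a single view.
a["key"] = (a["name"] + " " + a["last name"]).str.lower()
b[["last", "first"]] = b["customer"].str.split(", ", expand=True)
b["key"] = (b["first"] + " " + b["last"]).str.lower()

single_view = a.merge(b[["key", "segment"]], on="key").drop(columns="key")
print(single_view)
```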

For continuous variables, use a histogram to examine the distribution of the data and reduce skew. Be sure to check records that fall outside the acceptable range of values: such an outlier may be an input error, or it may be a real, meaningful result that reflects future events. Duplicate or near-duplicate records may carry the same information and should be removed. Likewise, be careful before automatically deleting every record that contains missing values, because deleting too many can skew the dataset so that it no longer reflects real-world situations.
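A minimal sketch of these checks, continuing with the assumed `full_year` frame and its hypothetical `amount` column:

```python
# Inspect the distribution of a continuous variable (binned counts
# stand in for a plotted histogram here).
amount = full_year["amount"]
print(amount.value_counts(bins=10).sort_index())

# Flag values outside 1.5 * IQR for review instead of deleting them:
# an outlier may be an input error or a real, meaningful result.
q1, q3 = amount.quantile([0.25, 0.75])
iqr = q3 - q1
flagged = full_year[(amount < q1 - 1.5 * iqr) | (amount > q3 + 1.5 * iqr)]
print(f"{len(flagged)} records flagged for manual review")

# Drop exact duplicates, but only review (not auto-delete) rows with
# missing values, since mass deletion can skew the dataset.
full_year = full_year.drop_duplicates()
print(full_year.isna().sum())
```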

Step 5: Feature engineering

This step is part art and part science: transforming raw data into features that let learning algorithms pick up patterns more easily. For example, data can be broken down into finer-grained parts to capture more specific relationships, such as analyzing sales performance by day of the week rather than only by month or year. In this case, separating the day of the week out of the date column (e.g., "Monday; 2017.06.19") may give the algorithm more relevant information.
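A minimal sketch of this date decomposition in pandas, with toy data:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2017-06-19", "2017-06-20"]),
                   "sales": [120.0, 95.5]})  # toy data

# Derive day-of-week features so the model can learn weekly patterns
# that a raw date, or month/year alone, would hide.
df["day_of_week"] = df["date"].dt.day_name()    # e.g. "Monday"
df["is_weekend"] = df["date"].dt.dayofweek >= 5
print(df)
```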

Step 6: Split the data into training and evaluation sets

The final step is to split the data into two sets: one for training the algorithm and one for evaluation. Choose non-overlapping subsets for the training and evaluation sets to ensure the test is valid. When feeding both raw and prepared data into a machine learning algorithm, choose tools that provide versioning and cataloging, so that the relationship between the two kinds of data stays clear. That way you can trace predictions back to the input data that produced them, which helps you improve and optimize the model in the future.
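A minimal sketch with scikit-learn's `train_test_split`, assuming the prepared `full_year` frame has a label column named `target` (an assumption for this example):

```python
from sklearn.model_selection import train_test_split

X = full_year.drop(columns="target")  # features
y = full_year["target"]               # label column (assumed name)

# Hold out 20% of rows as a non-overlapping evaluation set; a fixed
# random_state keeps the split reproducible and hence traceable.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), "training rows /", len(X_eval), "evaluation rows")
```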

Promoting business performance - how to implement ML with DP and solve data problems

Data preparation has long been recognized as a way to help business leaders and analysts prepare data for analytical, operational, and management needs. Self-service data preparation offerings from Amazon Web Services (AWS) and Azure have taken this to another level by exploiting the many valuable properties of a cloud-based environment.

With built-in intelligent algorithms, the business users who are closest to the data and most familiar with the business context can therefore prepare datasets quickly and accurately. Using intuitive visual applications, they can access, retrieve, shape, collaborate on, and publish data with mouse clicks instead of code, while full governance and security are preserved. IT professionals can manage data volume and variety at scale across enterprise and cloud data sources to meet the timely, repeatable data service needs of business scenarios.

A DP solution of this kind solves many data challenges, enables ML and data science workflows, and enhances applications with machine intelligence. More importantly, it brings data to information consumers, making all the people, processes, and systems in the organization smarter.