Exploratory data analysis is the process of taking raw data and using technical means to understand it better, extract "good features", and build a preliminary model.
This article explains how to classify data and how to visualize each type of data.
What is exploratory data analysis?
When it comes to basketball, everyone knows that height and wingspan are the key characteristics of athletes.
What about handball? Most people probably can't say.
When you encounter an unfamiliar field, you need to build a working understanding of it quickly.
There are two ways to do that:
- Consult industry insiders. Experienced insiders can pass on some of what they know.
- Study data from the unfamiliar field. We can analyze the physical data and performance data of handball players to see what characterizes the best ones. Even without any industry experience, some discoveries can be made through the data.
The second way is exploratory data analysis (EDA).
Exploratory data analysis is an approach to data analysis that uses various technical means (mostly data visualization) to explore the internal structure and patterns of data.
The purpose of exploratory data analysis is to gain as much insight as possible into the data set, discover the internal structure of the data, extract important features, detect outliers, test basic hypotheses, and establish preliminary models.
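As a minimal sketch of two of these goals (summarizing a data set and detecting outliers), the snippet below uses only the Python standard library. The height values are invented for illustration, and the 1.5 × IQR rule is one common convention assumed here, not something this article prescribes.

```python
import statistics

# Hypothetical player heights in cm, invented for illustration only.
heights_cm = [178, 182, 185, 186, 188, 190, 191, 193, 195, 231]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Basic descriptive statistics for a first look at the data.
summary = {
    "mean": statistics.mean(heights_cm),
    "median": statistics.median(heights_cm),
    "stdev": statistics.stdev(heights_cm),
}
outliers = iqr_outliers(heights_cm)

print(summary)
print("possible outliers:", outliers)
```

A flagged value is only a candidate for a closer look. Whether it is a data-entry error or a genuinely unusual observation still has to be judged from context.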
The 3-step approach to exploratory data analysis
The process of exploratory data analysis is roughly divided into 3 steps:
- Data classification
- Data visualization
- Insight into the data
The first step: data classification
Once we have the data, the first step is to classify it, so that each type of data can be processed with a suitable method.
The data can be classified in the following ways from coarse to fine:
Structured data vs unstructured data
Structured data: Data that can be organized in tables is considered structured data.
For example: data in Excel, data in MySQL...
Unstructured data: Any data that is not organized in tabular form.
For example: text, pictures, video...
Quantitative data vs qualitative data
Quantitative data: Numeric values that measure the quantity of something.
For example: 1985
Qualitative data: Categories that describe the nature of something.
For example: post-80s
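The two examples can be connected with a small sketch. Assuming 1985 is read as a birth year, the hypothetical function `birth_cohort` below turns the quantitative value into a qualitative label of the "post-80s" kind. The name and the labeling rule are illustrative assumptions.

```python
def birth_cohort(year: int) -> str:
    """Map a birth year (quantitative) to a decade label (qualitative)."""
    decade = (year % 100) // 10 * 10
    return f"post-{decade:02d}s"


year_a, year_b = 1985, 1992

# Arithmetic is meaningful on the quantitative values...
age_gap = year_b - year_a

# ...while the qualitative labels can only be compared for equality.
same_cohort = birth_cohort(year_a) == birth_cohort(year_b)

print(birth_cohort(year_a), birth_cohort(year_b), age_gap, same_cohort)
```

Converting a number into a category discards information: the label says which decade a person was born in, but the exact year can no longer be recovered from it.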
4 levels of data
Nominal level: This is the first level of data, and its structure is the weakest. Values can only be categorized by name.
For example: blood type (A, B, AB, O), name, color
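For nominal data, counting is essentially the only meaningful operation. The sketch below computes frequencies, proportions, and the mode with `collections.Counter`. The sample of blood types is invented for illustration.

```python
from collections import Counter

# Hypothetical sample of blood types, invented for illustration.
blood_types = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A"]

counts = Counter(blood_types)  # frequency of each category
total = len(blood_types)
proportions = {label: n / total for label, n in counts.items()}

# The mode is the most frequent category.
mode, mode_count = counts.most_common(1)[0]

print(counts)
print(proportions)
print("mode:", mode)
```

Sorting the labels or averaging them would have no meaning here, because the categories have no natural order.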
Ordinal level: The ordinal level adds a natural ordering on top of the nominal level, so that we can compare different values.
For example: the star rating of the restaurant, the evaluation level of the company
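Because ordinal values can be ranked, a median becomes meaningful even though the labels are not numbers. The sketch below assumes a hypothetical four-level evaluation scale and made-up reviews. `statistics.median_low` is used so that the result is always one of the actual categories.

```python
import statistics

# Hypothetical ordered scale, from worst to best.
LEVELS = ["poor", "fair", "good", "excellent"]
rank = {label: i for i, label in enumerate(LEVELS)}

# Made-up evaluations, for illustration only.
reviews = ["good", "fair", "excellent", "good", "poor", "good", "fair"]

# Comparison is meaningful because the categories are ordered.
good_beats_fair = rank["good"] > rank["fair"]

# Median: take the middle rank and map it back to its label.
ranks = [rank[r] for r in reviews]
median_label = LEVELS[statistics.median_low(ranks)]

print(good_beats_fair, median_label)
```

The ranks only encode order. The gap between "poor" and "fair" is not known to equal the gap between "good" and "excellent", so a mean of the ranks would be hard to interpret.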
Interval level: Interval-level data must be numeric. The values can be used not only for sorting, but also for addition and subtraction.
For example: Fahrenheit, Celsius (temperatures can be negative and the zero point is arbitrary, so multiplication and division are not meaningful)
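A short worked example shows why ratios do not make sense at the interval level. The same two temperatures are expressed in Celsius and in Fahrenheit: the difference converts consistently between the scales, but the ratio changes with the unit. The variable names are illustrative.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


a_c, b_c = 10.0, 20.0
a_f, b_f = c_to_f(a_c), c_to_f(b_c)

# Differences are meaningful: they only rescale by the factor 9/5.
diff_c = b_c - a_c
diff_f = b_f - a_f

# Ratios are not: the answer depends on which scale is used.
ratio_c = b_c / a_c
ratio_f = b_f / a_f

print(diff_c, diff_f)    # 10.0 18.0
print(ratio_c, ratio_f)  # 2.0 1.36
```

Saying that 20 degrees Celsius is "twice as hot" as 10 degrees Celsius is therefore not a statement about temperature, only about one particular scale.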
Ratio level: On top of the interval level, an absolute zero point is added, so multiplication and division are meaningful in addition to addition and subtraction.
For example: money, weight
The second step: data visualization
To gain better insight into the data, we can visualize it and observe its characteristics more easily.
The four data levels above call for different visualization methods. The table below can help you choose a suitable one.
These are basic options only. In practical applications, more complex or combined charts can also be used.
Data level | Attributes | Descriptive statistics | Charts |
---|---|---|---|
Nominal | Discrete, unordered | Relative frequency, mode | Bar chart, pie chart |
Ordinal | Ordered categories, comparable | Frequency, mode, median, percentile | Bar chart, pie chart |
Interval | Differences between values are meaningful | Frequency, mode, median, mean, standard deviation | Bar chart, pie chart, box plot |
Ratio | Continuous | Mean, standard deviation | Bar chart, line chart, pie chart, box plot |
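The table can also be written as a small lookup, which may be convenient when many columns have to be reviewed. This is only a sketch: `SUGGESTIONS` and `suggest` are hypothetical names, and the levels are written here as nominal, ordinal, interval, and ratio.

```python
# Lookup built from the table above; all names are hypothetical.
SUGGESTIONS = {
    "nominal": {
        "statistics": ["relative frequency", "mode"],
        "charts": ["bar chart", "pie chart"],
    },
    "ordinal": {
        "statistics": ["frequency", "mode", "median", "percentile"],
        "charts": ["bar chart", "pie chart"],
    },
    "interval": {
        "statistics": ["frequency", "mode", "median", "mean",
                       "standard deviation"],
        "charts": ["bar chart", "pie chart", "box plot"],
    },
    "ratio": {
        "statistics": ["mean", "standard deviation"],
        "charts": ["bar chart", "line chart", "pie chart", "box plot"],
    },
}


def suggest(level: str) -> dict:
    """Return the suggested statistics and charts for a data level."""
    key = level.strip().lower()
    if key not in SUGGESTIONS:
        raise ValueError(f"unknown data level: {level!r}")
    return SUGGESTIONS[key]


print(suggest("Ordinal")["statistics"])
print(suggest("interval")["charts"])
```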
The third step: insight into the data
Data visualization can help us gain better insights into the data. We can more efficiently discover which data is more important, the possible relationships between different data, and which data will affect each other...
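One simple way to check a possible relationship between two quantitative variables is the Pearson correlation coefficient. The sketch below implements it with the standard library only. The heights and goal counts are invented for illustration and say nothing about real handball players.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data, invented for illustration only.
heights_cm = [180, 184, 188, 190, 193, 196, 199]
goals_per_season = [41, 55, 48, 62, 70, 66, 81]

r = pearson(heights_cm, goals_per_season)
print(f"correlation between height and goals: {r:.2f}")
```

A value close to 1 or -1 suggests a strong linear relationship, and a value near 0 suggests none. A correlation alone does not show that one variable causes the other, so it is best treated as a hint about where to look next.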
It is called exploratory data analysis precisely because there is no fixed routine, so no single procedure can be prescribed for this step.
Final Thoughts
Exploratory data analysis is an approach to data analysis that uses various technical means (mostly data visualization) to explore the internal structure and patterns of data.
The process of exploratory data analysis is roughly divided into 3 steps:
- Data classification
- Data visualization
- Insight into the data