Visualizations are great. However, a good visualization is difficult to come up with.

In addition, it takes time and effort to present these visualizations to a larger audience.

We all know how to make bar plots, scatter plots and histograms, yet we don't pay much attention to beautifying them.

This hurts us: our credibility with peers and managers suffers. You may not feel it now, but it happens.

I also find it essential to reuse my code. Do I need to start from scratch every time I visit a new dataset? I want some reusable charts that help us find information about the data FAST.

In this article, I will discuss three cool visualization tools:

  • Categorical correlation with graphics,
  • Pairplots,
  • Swarmplots and graph annotations using Seaborn.

In short, this article is about useful and presentable charts.

I will use data from the FIFA 19 complete player dataset on Kaggle, which contains detailed attributes of every player registered in the latest edition of the FIFA 19 database.

Since the dataset has many columns, we will focus only on a subset of categorical and continuous columns.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# We probably don't need the gridlines, do we? If you want them, comment out this line.
sns.set(style="ticks")
player_df = pd.read_csv("../input/data.csv")
# Potential, Value and Wage are included because the later sections use them
numcols = [
    'Potential',
    'Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration',
    'SprintSpeed', 'Agility', 'Stamina',
    'Value', 'Wage']
catcols = ['Name', 'Club', 'Nationality', 'Preferred Foot', 'Position', 'Body Type']
# Subset the columns
player_df = player_df[numcols + catcols]
# First few rows of the data
player_df.head(5)
Player data

This data is well formatted, but the Wage and Value columns need some preprocessing (they are in euros and stored as strings) so that they become numeric for our subsequent analysis.

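def wage_split(x):
    # Drop the leading currency symbol and the trailing "K"; unparseable entries become 0
    try:
        return int(x.split("K")[0][1:])
    except (AttributeError, ValueError):
        return 0
player_df['Wage'] = player_df['Wage'].apply(wage_split)
def value_split(x):
    # Express every value in millions: "M" amounts as-is, "K" amounts divided by 1000
    try:
        if 'M' in x:
            return float(x.split("M")[0][1:])
        elif 'K' in x:
            return float(x.split("K")[0][1:])/1000
    except (TypeError, ValueError):
        return 0
    return 0
player_df['Value'] = player_df['Value'].apply(value_split)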
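
As a quick sanity check of the parsing idea, here is a minimal standalone sketch. The helper name `parse_euro_thousands` and the sample strings are my own assumptions (a currency symbol, a number, then an optional K or M suffix), not part of the original kernel. Unlike the two functions above, it reports everything in a single unit, thousands of euros.

```python
import re

# Hypothetical helper, not from the original article: one parser for both
# the Wage and the Value format, returning thousands of euros.
_MONEY = re.compile(r"^€?(\d+(?:\.\d+)?)([KM]?)$")

def parse_euro_thousands(text):
    """Convert strings such as '€565K' or '€110.5M' to thousands of euros."""
    match = _MONEY.match(str(text).strip())
    if match is None:
        return 0.0  # mirror the article's fallback for unparseable entries
    amount, suffix = float(match.group(1)), match.group(2)
    if suffix == "M":
        return amount * 1000
    if suffix == "K":
        return amount
    return amount / 1000  # plain euros, e.g. '€0'

# Assumed sample inputs, including a missing value
samples = ["€565K", "€110.5M", "€0", None]
parsed = [parse_euro_thousands(s) for s in samples]
print(parsed)
```

Keeping both columns in the same unit makes them directly comparable, at the cost of larger numbers on the Value axis.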

Categorical Correlation with Graphics

Simply put, correlation is a measure of how two variables move together.

For example, in the real world, income and expenses are positively correlated: if one increases, the other increases as well.

Academic performance and time spent on video games are negatively correlated: an increase in one predicts a decrease in the other.

Therefore, if our predictor is positively or negatively correlated with our target variable, then it is valuable.

So when we try to understand our data, looking at the correlations between the different variables is a very good place to start.

We can easily create a pretty good correlation plot using Seaborn.

# numeric_only=True (pandas 1.5+) skips the string columns, which newer pandas would otherwise reject
corr = player_df.corr(numeric_only=True)
g = sns.heatmap(corr, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt='.2f', cmap='coolwarm')
Where are all the categorical variables?

But have you noticed any problems?

Yes, this graph only calculates the correlation between the numerical columns.

What do I do if my target variable is Club or Position?

I want to be able to get a correlation in three different cases, and we use the following correlation metrics to calculate them:

1. Numerical variables

We already have this in the form of Pearson's correlation, which is a measure of how two variables move together. It ranges over [-1, 1].
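
To make the definition concrete, here is a minimal sketch that computes Pearson's correlation by hand using only the standard library. The function name and all of the numbers are invented for illustration; in practice `DataFrame.corr` does this for you.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-movement divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

# Made-up numbers, for illustration only
income = [20, 35, 50, 65, 80]
expenses = [18, 30, 41, 50, 66]
hours_gaming = [1, 2, 3, 4, 5]
grades = [90, 85, 70, 65, 50]

r_positive = pearson_r(income, expenses)    # moves together: close to +1
r_negative = pearson_r(hours_gaming, grades)  # moves apart: close to -1
print(round(r_positive, 3), round(r_negative, 3))
```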

2. Categorical variables

We will use Cramér's V for the categorical-categorical case. It measures the association between two discrete variables and is used with variables that have two or more levels. It is a symmetric measure, because the order of the variables does not matter: Cramer(A, B) == Cramer(B, A).

For example, in our dataset, Club and Nationality must be related in some way.

Let's examine this using a stacked graph, which is a great way to understand the distribution of one categorical variable across another. We use a subset of the data because it contains many nationalities and clubs.

We keep only the best teams (FC Porto is kept just to make the sample more diverse) and the most common nationalities.

Keeping only the best teams and the most common nationalities

Club preference says a lot about nationality: knowing the former helps a great deal in predicting the latter.

We can see that if a player is from England, he is more likely to play for Chelsea or Manchester United than for FC Barcelona, Bayern Munich or Porto.

So there is some information here. Cramér's V captures that same information.

If all clubs have the same proportion of players of every nationality, then Cramér's V is 0.

If every club prefers a single nationality, then Cramér's V is 1: for example, all English players play for Manchester United, all German players for Bayern Munich, and so on.

In all other cases, it lies between 0 and 1.
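
The two extreme cases can be checked with a small sketch. This is the plain textbook formula, V = sqrt(chi² / (n · (min(rows, cols) − 1))), written with the standard library only; library implementations such as dython may apply a bias correction, so their numbers can differ slightly. The rosters are toy data invented for the example.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Uncorrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals, col_totals = Counter(xs), Counter(ys)
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a variable with a single level carries no association
    chi2 = 0.0
    for row, row_count in row_totals.items():
        for col, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((row, col), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))

# Toy rosters, invented for illustration
clubs = ["Man Utd"] * 4 + ["Bayern"] * 4
one_nationality_each = ["England"] * 4 + ["Germany"] * 4
same_mix_everywhere = ["England", "England", "Germany", "Germany"] * 2

v_perfect = cramers_v(clubs, one_nationality_each)  # each club has one nationality
v_none = cramers_v(clubs, same_mix_everywhere)      # identical mix in every club
print(v_perfect, v_none)
```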

3. Numerical and categorical variables

We will use the correlation ratio for the categorical-continuous case.

Without going into too much math, it is a measure of dispersion: it compares how much the values vary between categories with how much they vary overall.

Given a number, can we find out which category it belongs to?

For example:

Suppose we have two columns in our dataset, SprintSpeed and Position:

  • GK: 58 (De Gea), 52 (T. Courtois), 58 (M. Neuer), 43 (G. Buffon)
  • CB: 68 (D. Godin), 59 (V. Kompany), 73 (S. Umtiti), 75 (M. Benatia)
  • ST: 91 (C. Ronaldo), 94 (G. Bale), 80 (S. Aguero), 76 (R. Lewandowski)

As you can see, these numbers are quite predictive of the bucket they fall into, so the correlation ratio is high.

If I know that the sprint speed is above 85, I can say with near certainty that this player plays at ST.

This ratio also lies in the range [0, 1].
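
Using the twelve sprint speeds listed above, a minimal standard-library sketch of the correlation ratio looks like this. It takes the square root of the between-category variation over the total variation, which is my reading of the usual definition rather than a copy of the dython code.

```python
from math import sqrt

def correlation_ratio(groups):
    """Correlation ratio (eta) for a dict mapping category -> list of values."""
    values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(values) / len(values)
    # Variation explained by the category means
    between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    # Total variation of all values around the grand mean
    total = sum((v - grand_mean) ** 2 for v in values)
    return sqrt(between / total) if total else 0.0

sprint_speed = {
    "GK": [58, 52, 58, 43],
    "CB": [68, 59, 73, 75],
    "ST": [91, 94, 80, 76],
}
eta = correlation_ratio(sprint_speed)
print(round(eta, 2))
```

For these numbers the ratio comes out high, which matches the intuition that sprint speed says a lot about position.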

The code for this is taken from the dython package. I won't go into much of the code here; you can find it in my Kaggle kernel. The final result looks like this:

# Assumes the dython package is installed; return_results follows the dython API at the time of writing
from dython.nominal import associations
player_df = player_df.fillna(0)
results = associations(player_df, nominal_columns=catcols, return_results=True)
Categorical vs. categorical, categorical vs. numerical, numerical vs. numerical. Much more fun.

Isn't it beautiful?

We can learn so much about football just by looking at this data. For example:

  • A player's position is highly correlated with dribbling ability. You wouldn't play Messi at the back, right?
  • Value is more highly correlated with passing and ball control than with dribbling. The rule is to always pass the ball. Neymar, I am looking at you.
  • Club and Wage are highly correlated. That is to be expected.
  • Body Type and Preferred Foot are highly correlated. Does that mean that if you are lean, you are more likely to be left-footed? It doesn't make much sense. One could investigate further.

Moreover, this simple chart reveals so much information that would not be visible in a typical correlation plot without categorical variables.

I will leave it here. One can study the chart further and find more meaningful results, but the point is that it makes finding patterns so much easier.


Pairplots

Although I have talked a lot about correlation, it is a fickle metric.

To understand what I mean, let us look at an example.

Anscombe's quartet consists of four datasets that have nearly identical correlations (about 0.82 each), yet have very different distributions and look very different when plotted.

Anscombe's quartet: correlations can be fickle.

So sometimes it becomes crucial to plot correlated data, and to look at the distributions individually.
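
To see how little the correlation alone tells us, here is a sketch that computes Pearson's r for the four datasets. The numbers are Anscombe's published values typed in by hand, so treat them as an assumption and check them against a trusted copy before reusing them.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from scratch."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

# Datasets I to III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(name, round(r, 3))
```

All four values land within a hair of each other, even though dataset II is a curve and dataset IV is a vertical stack with a single outlier.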

We now have many columns in our dataset. Plotting them all would take a lot of effort.

No, it takes a single line of code.

filtered_player_df = player_df[(player_df['Club'].isin(['FC Barcelona', 'Paris Saint-Germain',
       'Manchester United', 'Manchester City', 'Chelsea', 'Real Madrid','FC Porto','FC Bayern München'])) & 
                      (player_df['Nationality'].isin(['England', 'Brazil', 'Argentina',
       'Italy','Spain','Germany']))]
# Single line to create pairplot
g = sns.pairplot(filtered_player_df[['Value','SprintSpeed','Potential','Wage']])
Visual presentation of the data

Pretty good. We can see so much in this chart.

  • Wage and Value are highly correlated.
  • Most of the other values are correlated as well. However, the trend of Potential versus Value is unusual: Value grows exponentially once we reach a certain Potential threshold. This information may be helpful in modeling. Could I apply some transformation to Potential to make it more correlated?
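
One possible answer to that last question, sketched on synthetic data: if Value really grows exponentially with Potential, then taking the logarithm of Value (or, equivalently, exponentiating Potential) turns the curve into a straight line and raises the linear correlation. The growth rate and scale below are invented, not fitted to the FIFA data.

```python
from math import exp, log, sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from scratch."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

# Synthetic data, invented for illustration: value = 0.001 * exp(0.12 * potential)
potential = list(range(60, 96, 5))
value = [0.001 * exp(0.12 * p) for p in potential]

r_raw = pearson_r(potential, value)                    # curved relationship
r_log = pearson_r(potential, [log(v) for v in value])  # straightened by the log
print(round(r_raw, 3), round(r_log, 3))
```

On real data the improvement would be smaller, and zero values would need special handling before taking a logarithm.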

Caveat: there are no categorical columns.

Can we do better? We always can.

g = sns.pairplot(filtered_player_df[['Value','SprintSpeed','Potential','Wage','Club']],hue = 'Club')
Plus color for more information

So much more information. Just add the hue parameter with the categorical variable Club.

  • Porto's wage distribution is skewed heavily toward the low end.
  • I don't see such a steep distribution in the value of Porto players. Porto's players will always be looking out for an opportunity.
  • See how many pink points (Chelsea) form a cluster on the Potential vs. Wage plot. Chelsea has a lot of high-potential players on lower wages. This needs more attention.

I already know some of the points on the Wage vs. Value subplot.

The blue point at a wage of 500K is Messi. And the orange point that is more valuable than Messi is Neymar.

Although this hack still doesn't solve the categorical problem, I have something cool for looking into the distribution of categorical variables, albeit one at a time.


Swarmplots

How can we see the relationship between categorical and numerical data?

Enter swarmplots. Just as the name suggests, a swarm of points is drawn for each category, scattered slightly along the y-axis so that the points are easier to see.

They are currently my go-to method for plotting such relationships.

g = sns.swarmplot(y = "Club",
              x = 'Wage', 
              data = filtered_player_df,
              # Decrease the size of the points to avoid crowding 
              size = 7)
# remove the top and right line in graph
sns.despine()

Why don't I use boxplots? Where are the median values? Can I plot them too? Obviously. Combine the swarm with a boxplot, and we have a great-looking graph.

g = sns.boxplot(y = "Club",
              x = 'Wage', 
              data = filtered_player_df, whis=np.inf)
g = sns.swarmplot(y = "Club",
              x = 'Wage', 
              data = filtered_player_df,
              # Decrease the size of the points to avoid crowding 
              size = 7,color = 'black')
# remove the top and right line in graph
sns.despine()
Swarmplot + Boxplot, interesting

Pretty good. We can see the individual points on the chart, read off some statistics and clearly understand the differences in wages.

The point on the far right is Messi. However, I should not have to tell you that in text below the chart, right?

This chart is going into a presentation. Your boss says: "I want Messi written on this chart." Enter annotations.

max_wage = filtered_player_df.Wage.max()
# Build the mask from the filtered frame so that it lines up with the rows being selected
max_wage_player = filtered_player_df[filtered_player_df['Wage'] == max_wage]['Name'].values[0]
g = sns.boxplot(y = "Club",
              x = 'Wage', 
              data = filtered_player_df, whis=np.inf)
g = sns.swarmplot(y = "Club",
              x = 'Wage', 
              data = filtered_player_df,
              # Decrease the size of the points to avoid crowding 
              size = 7,color='black')
# remove the top and right line in graph
sns.despine()
# Annotate. xy for coordinate. max_wage is x and 0 is y. In this plot y ranges from 0 to 7 for each level
# xytext for coordinates of where I want to put my text
# The label is passed positionally because newer matplotlib renamed the `s` keyword to `text`
plt.annotate(max_wage_player,
             xy = (max_wage,0),
             xytext = (500,1), 
             # Shrink the arrow to avoid occlusion
             arrowprops = {'facecolor':'gray', 'width': 3, 'shrink': 0.03},
             backgroundcolor = 'white')
Annotations, statistical information and a point swarm. Off to the presentation I go.
  • Look at Porto down there, competing with the giants on such a small wage budget.
  • Real Madrid and Barcelona have so many highly paid players.
  • Manchester City has the highest median wage.
  • Manchester United and Chelsea believe in equality: many of their players cluster around the same wage level.
  • I am happy that, although Neymar is valued higher than Messi, there is a huge wage difference between Messi and Neymar.

A semblance of normalcy in this crazy world.
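
One detail of the annotation code deserves a note: the y coordinate 0 works because Messi's club happens to be the first category on the axis. As far as I know, seaborn lays out string categories at positions 0, 1, 2, ... in order of first appearance when no `order` argument is given. Under that assumption, a small hypothetical helper (not part of the original kernel) can look up the position instead of hard-coding it:

```python
def category_position(labels, target):
    """Index of `target` among the distinct labels, in order of first appearance."""
    distinct = list(dict.fromkeys(labels))  # dicts preserve insertion order
    return distinct.index(target)

# Invented club column, for illustration only
clubs = ["FC Barcelona", "Real Madrid", "FC Barcelona", "Chelsea", "Real Madrid"]
y_barcelona = category_position(clubs, "FC Barcelona")
y_chelsea = category_position(clubs, "Chelsea")
print(y_barcelona, y_chelsea)
```

With the real data, the result for the club of the highest-paid player could then replace the 0 in `xy = (max_wage, 0)`.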

Conclusion

To recap, in this article we discussed how to compute and read correlations between different variable types, how to plot correlations between numerical data, and how to plot categorical against numerical data with swarmplots. I love how we can layer chart elements on top of each other in Seaborn.

Also, if you want to learn more about visualization, I would like to point out an excellent course on data visualization and applied plotting from the University of Michigan, which is part of a pretty good data science specialization with Python. Do check it out.

This article was reposted from towardsdatascience.