Correlation matrix and Descriptive Analysis — NBA

11 min read · Mar 7, 2021

I’ll start this article by saying that I’m passionate about sports, and basketball is one of my three favorites.

Since we can’t see how tall writers on the internet are, we tend to imagine that anyone who writes about basketball must be a fan who stands 6.2 ft tall.

But here I am measuring 4.11 ft tall.

But why have I exposed myself like this here? To say that data analysis is a wonderful field!

In the real world I can’t stand out playing basketball (and I have tried hard enough) but with data analysis, I can work with the sport I love so much! Being infiltrated in the middle of so many giant people is wonderful =)

But let’s get to the point.

The goal of this project is to perform a descriptive statistical analysis, focused on the correlation between variables, of the boxscore data for NBA (National Basketball Association) teams during the 2019–20 season.

You can see the entire review process of this project on my Github page but the file view is better through Colab, click here to access it.

NBA

Brief history

Data analysis in sports is becoming increasingly important for better team performance.

One of the great teams that has embraced deeper data analysis is the Houston Rockets.

The analysis covers both the team’s in-game variables and the players’ physical data. It starts in the pre-season, with the team studying both itself and its opponents.

During the game, the coaches and staff look for information that helps them put more efficient combinations of players on the court.

It’s possible to profile which opposing player is the most worthwhile to foul, because that player is more likely to miss the free throw, and that miss can be decisive in the outcome of the match.

Analyses are also run on college teams to list the best players for the draft.

It’s very important to know the environment in which we will perform the analysis, because knowing how to use the calculations to optimize results matters much more than knowing how to make amazing calculations.

What is Descriptive Analysis

Data analysis is done to strategize, plan, and pursue desired results.

Descriptive statistics aims, through various techniques, to describe and summarize data to facilitate understanding and aid problem-solving.

It seeks to describe the trends in the data. It does so through graphs, frequency tables, and measures such as the mean, median, standard deviation, percentiles, and quartiles.

The analysis covers the entire process of collecting, organizing, tabulating, and describing data. It is a fundamental step for any analysis within Data Science.
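As a minimal sketch of the measures listed above, the snippet below summarizes a small, made-up list of points per game using only Python’s standard library. The numbers and the name `points_per_game` are illustrative assumptions, not values from the NBA dataset.

```python
import statistics

# Hypothetical points-per-game values, invented for illustration only
points_per_game = [110, 112, 108, 115, 105, 120, 111]

mean_pts = statistics.mean(points_per_game)      # arithmetic average
median_pts = statistics.median(points_per_game)  # middle value of the sorted data
std_pts = statistics.stdev(points_per_game)      # sample standard deviation
q1, q2, q3 = statistics.quantiles(points_per_game, n=4)  # quartile cut points

print(f"mean={mean_pts:.2f} median={median_pts} std={std_pts:.2f}")
print(f"quartiles: Q1={q1} Q2={q2} Q3={q3}")
```

With pandas, a call such as `df_team2.describe()` reports most of these measures for every numeric column at once.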

# Imports for data manipulation, visualization and analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import numpy as np

sns.set()

# Merging DataFrames df_team and df_win
df_team2 = df_team.merge(df_win, on='Team', how='inner')
df_team2.columns

To start our analysis of the teams, let’s check the variable names, the amount of data, and whether or not there are missing values.

# show volume of data
# rows
print("Number of observations:\t {}".format(df_team2.shape[0]))
# columns
print("Variables:\t{}\n".format(df_team2.shape[1]))

Boxscore = Dictionary of Variables in Basketball

Anyone who has ever watched a basketball game will remember the table that is shown with the results of the game or the players.

A “boxscore” is a table in which all the main events of a basketball match are counted.

It can be followed in real time, which helps with decision-making during the match, or it can be analyzed afterwards to tell the story of what happened on the court.

From it, we can verify some reasons for victories and defeats.

As stated earlier in this project, the goal is to analyze this data per team throughout the 2019–2020 season.

Dictionary of Variables in Basketball

** Wins and losses in the regular season only. Playoff games are not included.

Exploring the data

Let’s see what our dataset looks like.

# viewing the data
df_team2.head(30)

Teams marked with an asterisk are the ones that made it into the playoffs.

What is the percentage of missing values in the dataset?

By identifying the amount of missing data we can check the quality of the dataset.

The quality of a dataset is directly related to the number of missing values. By doing this analysis we can understand if these missing/null values are significant compared to the total inputs.

# Identifies the share of missing data in each variable, in descending order
(df_team2.isnull().sum() / df_team2.shape[0]).sort_values(ascending=False)

For this dataset, we don’t have any variables with missing values, which is great for our analysis.

For a first analysis, we will calculate the average of the variables and see if we find any interesting values for our analysis.

# see the average of the numeric variables (the 'Team' column holds names)
df_team2.mean(numeric_only=True)

Mean of variables:

We can observe some interesting data.

This dataset has an average of 70 games per team. The number of games is different between teams because of playoffs.

FG% → Field goals, that is, 2-point and 3-point shots together. On average, 45% of shots are converted
3P% → On average, 35% of 3-point shots are converted
2P% → On average, 52% of 2-point shots are converted
FT% → On average, 77% of free throws are converted
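To make these percentages concrete, here is a small sketch of how each one is derived from made and attempted shots. The shot counts are invented for illustration, and `shooting_pct` is a hypothetical helper, not part of the original notebook.

```python
def shooting_pct(made: int, attempted: int) -> float:
    """Return the fraction of attempts converted (0.0 when nothing was attempted)."""
    return made / attempted if attempted else 0.0

# Hypothetical single-game totals, invented for illustration only
made_2p, att_2p = 30, 58
made_3p, att_3p = 12, 34
made_ft, att_ft = 17, 22

# Field goals combine 2-point and 3-point shots
made_fg = made_2p + made_3p
att_fg = att_2p + att_3p

fg_pct = shooting_pct(made_fg, att_fg)
p2_pct = shooting_pct(made_2p, att_2p)
p3_pct = shooting_pct(made_3p, att_3p)
ft_pct = shooting_pct(made_ft, att_ft)

print(f"FG%={fg_pct:.3f} 2P%={p2_pct:.3f} 3P%={p3_pct:.3f} FT%={ft_pct:.3f}")
```

Because FG% pools both shot types, it always falls between 2P% and 3P%, closer to whichever type the team attempts more often.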

In the variables related to defensive aspects, the average number of defensive rebounds (2453) is much higher than the average number of offensive rebounds (711).

Offensive rebounds are the ones a team grabs during its own attack, and defensive rebounds are the ones it grabs while defending.

Defensive rebounds are easier because the whole team is positioned inside the paint (the painted area between the end line and the free-throw line near each basket) to get the ball. That doesn’t happen with offensive rebounds, since the attacking team is more spread out while it sets up the play.

We can see then that the defense data match reality because there are more defensive rebounds than offensive ones.

Analysis of the data of the two types of rebounds per team helps to understand the team’s style of play.

Exploratory Questions

What are the top teams?

Let’s look at the top 10 teams by points, by field goal percentage (2-point and 3-point shots), by free throw percentage, by total rebounds, and by 3-point percentage.

# Top 10 teams by Points
team_pts = df_team2[['Team', 'PTS']].sort_values('PTS', ascending=False)[:10]
print(team_pts)
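The same ranking can be repeated for the other variables. The sketch below wraps it in a hypothetical helper, `top_n`, and runs it on a tiny made-up table standing in for `df_team2`; the team names and numbers are invented for illustration.

```python
import pandas as pd

def top_n(df: pd.DataFrame, column: str, n: int = 10) -> pd.DataFrame:
    """Return the n rows with the largest values in `column`, keeping only Team and that column."""
    return df.nlargest(n, column)[['Team', column]]

# Toy stand-in for df_team2; teams and numbers are invented for illustration only
toy = pd.DataFrame({
    'Team': ['A', 'B', 'C', 'D'],
    'PTS': [8200, 7900, 8500, 8000],
    'FG%': [0.46, 0.48, 0.45, 0.47],
    'FT%': [0.77, 0.80, 0.75, 0.79],
})

for stat in ['PTS', 'FG%', 'FT%']:
    print(top_n(toy, stat, n=3), end='\n\n')
```

On the real data, a call such as `top_n(df_team2, 'FG%')` would give the top 10 by field goal percentage, assuming the column is named that way in the merged table.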

It’s interesting to note that the winning team of the season does not appear in the top 10 “points”, but is in first place in the percentage of converted shots.

This may indicate that scoring more points is not necessarily what wins the season, but rather the quality of the shots combined with a good defense.

Is there a relationship between being “better” in a given variable and winning the season?

This is what we will verify through the correlation analysis between the variables.

Correlation between variables

A correlation is a mutual relationship or a connection between two variables.

Correlation seeks to understand how one variable behaves when another one varies, aiming to identify whether there is any relationship between the two.

Pearson Correlation Coefficient

Pearson’s correlation coefficient (r) measures the strength of the linear relationship between two quantitative variables and expresses it as a value between -1 and 1.

When the correlation coefficient approaches 1, one notices an increase in the value of one variable when the other also increases, that is, there is a positive linear relationship. When the coefficient approaches -1, it is also possible to say that the variables are correlated, but in this case when the value of one variable increases that of the other decreases. This is what is called negative or inverse correlation.

A correlation coefficient close to zero indicates that there is no linear relationship between the two variables, and the closer it gets to 1 or -1, the stronger the relationship is.
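The behavior described above can be checked with a short sketch. The function below computes r as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. `pearson_r` and the toy sequences are illustrative assumptions, not code from the original notebook.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Undefined (division by zero) if either sequence is constant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy sequences, invented for illustration only
x = [-2, -1, 0, 1, 2]
r_positive = pearson_r(x, [2 * v + 1 for v in x])  # y rises with x
r_negative = pearson_r(x, [-3 * v for v in x])     # y falls as x rises
r_curved = pearson_r(x, [v ** 2 for v in x])       # y depends on x, but not linearly

print(r_positive, r_negative, r_curved)
```

The last case is worth noting: y is fully determined by x, yet r is zero, because Pearson’s coefficient only captures linear relationships.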

Which variables are most related to each other?

For a better correlation analysis, I removed some variables that could show a misleadingly high correlation simply because they are already related to each other by construction, such as:

Number of attempted shots and number of converted shots
Percentage of converted shots
FG, which combines 2-point and 3-point shots
Total rebounds, which is the sum of offensive and defensive rebounds
Win-loss ratio

Remember: correlation does not imply causation.
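To see why variables that are related by construction are worth removing, consider total rebounds, which is the sum of offensive and defensive rebounds. The sketch below uses made-up rebound counts (not the NBA dataset) to show that the total correlates strongly with its largest component simply because that component is part of it.

```python
import pandas as pd

# Hypothetical per-game rebound counts, invented for illustration only
toy = pd.DataFrame({
    'ORB': [10, 12, 9, 11, 8, 13],
    'DRB': [35, 30, 38, 33, 36, 31],
})
# Total rebounds are, by definition, the sum of the two components
toy['TRB'] = toy['ORB'] + toy['DRB']

corr = toy.corr()
print(corr.round(2))

# TRB correlates strongly with DRB simply because DRB is part of TRB
r_drb_trb = corr.loc['DRB', 'TRB']
```

A high value here says nothing new about the game, so keeping both columns would only clutter the correlation matrix.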

# creating the correlation matrix
df_tcorr = df_team2.copy()
df_tcorr.drop(['Rk','G', 'MP','W/L%', 'SRS','FG','FGA','FG%','2P%','3P%', '3PA','2PA', 'FTA', 'PS/G', 'PA/G','TRB'], axis=1, inplace=True)
# keep only numeric columns (the 'Team' column holds names) before correlating
corr_team = df_tcorr.select_dtypes('number').corr()
corr_team

For a better visualization let’s make the heatmap.

# creating heatmap
plt.figure(figsize=(20, 15))
sns.heatmap(corr_team,
            annot = True,
            fmt = '.2f',
            cmap='Blues')
plt.title('Correlation between variables of the Team dataset')
plt.show()

To further improve the visualization, let’s focus on the variables that are most correlated with the W variable, which represents wins.

# correlation Team
k = 10
# finding the k variables most correlated with wins ('W')
cols = corr_team.nlargest(k, 'W')['W'].index
# correlation matrix of those variables, computed from the data itself
cm = np.corrcoef(df_tcorr[cols].values.T)
# creating heatmap
plt.figure(figsize=(20, 15))
sns.heatmap(cm,
            annot = True,
            fmt = '.2f',
            cmap='Blues',
            yticklabels=cols.values,
            xticklabels=cols.values)
plt.title('Variables that most influence the outcome of the game')
plt.show()

Table with the 9 variables most correlated with the W variable, which represents the number of wins the team had in the season.

Interesting observations:

The defense variables, defensive rebounds and blocks, have a high correlation with the wins variable.
As expected, the points variable has a positive relationship with wins.
Free throws, which are theoretically easier shots to convert, also have a high correlation. And as we saw earlier, the average free throw percentage is 77%, a high percentage.
Long shots, that is, 3-point baskets, are also highly correlated with wins.

We have seen that points are an important variable and highly correlated with the victory variable.

Let’s then analyze the variables that most influence the points variable.

Variables most related to the variable points:
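One way such a ranking can be obtained is to take the PTS column of the correlation matrix and sort it. The sketch below does this on a tiny made-up table standing in for `df_tcorr`; all numbers are invented for illustration, so the resulting order says nothing about the real NBA data.

```python
import pandas as pd

# Toy stand-in for df_tcorr; all numbers are invented for illustration only
toy = pd.DataFrame({
    'PTS': [100, 105, 110, 115, 120],
    '3P': [10, 11, 12, 13, 14],
    '2P': [30, 29, 31, 30, 32],
    'DRB': [33, 34, 33, 35, 34],
})

# Correlation of every other variable with PTS, strongest positive first
pts_ranking = (
    toy.corr()['PTS']
    .drop('PTS')
    .sort_values(ascending=False)
)
print(pts_ranking)
```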

Once again we see that defense-related variables are important and make a difference in the score.

Often, grabbing the rebound gives the team an advantage.

On an offensive rebound, the player is close to the hoop, is usually one of the taller players, and is likely to draw a foul.

A defensive rebound, in turn, can be followed by a long pass to a player further up the court, which makes the shot easier to convert.

Over time, 3-point attempts have grown relative to 2-point attempts because, according to the teams’ analysts, taking more 3-point shots is more strongly related to the number of victories.

Over the years of the NBA, teams started to worry about the efficiency of 3-point shots and started to adopt long shots as a game strategy, starting a change in the pattern of the game.

The Houston Rockets have been breaking the record for 3-point shots per game since the 16/17 season, but we cannot leave aside the “Splash Brothers”, Stephen Curry and Klay Thompson, who are surprisingly quick to open up a lead on the scoreboard.

We can see then that the “3P” variable outperforms the “2P” when correlated with the points variable, which in turn is highly correlated with the wins variable.

Pandas — Profiling

An easier and more practical alternative to performing a descriptive analysis of the data is to use Pandas Profiling.

It generates a report in HTML format with all the necessary information, making it easier for us to visualize the data.

Some information that it brings us:

Size of the dataset
Number of duplicated rows
The type of data in each column
Which columns contain missing values
Warnings

The warnings section points out aspects of our dataset that need more attention, such as the degree of correlation between variables, the normality of the data, unique values, etc.

The variables section gives information for each variable, as if we were using df_team2.describe(), and also shows histograms with the frequency distribution.

It also shows us the correlation matrix of the variables.

Best of all, we can save the report in html format for later use.

Conclusions

We can conclude that, although converting 3-point shots intuitively seems more difficult, in reality it is an important variable for the team’s victory.

Defense variables have a high correlation with wins and points. An efficient defensive strategy can make a difference in the game.

Free throws are also important because they are theoretically easier to convert, since the player has no opponent getting in the way.

Final Considerations

For a future analysis, it would be interesting to check in more detail the efficiency of the teams related to the variables.

Among the teams that made it to the playoffs, are there any similarities in their data?

The relationship between a team and its number of rebounds or assists can indicate its style of play: a more individual or a more collective game.

Do teams from the same conference have similar characteristics?

Curiosities

1. The first basketball game was played on January 20, 1892, in the city of Springfield, Massachusetts (USA). The game was created by Canadian James A. Naismith, a physical education teacher at the local YMCA. Initially, the novelty was restricted to the association. Only on March 11 did a game take place in front of an outside audience: 200 people saw the students beat the teachers 5 to 1. In 1894, the Amateur Athletic Union formalized the rules. The first women’s basketball game took place on April 4, 1896, also in the United States. Stanford University beat the University of California.

2. In the early years of professional basketball, players were no taller than six feet.

3. Until 1992, the rules of basketball in the Olympic Games excluded professional teams, which left NBA players out of the competition. Thus, the US team was made up of athletes from university teams. From that year on, however, everything changed. The team that would represent the land of “Uncle Sam” at the Barcelona Games had so many stars that it earned the nickname Dream Team. The salaries of its twelve members were also far from a nightmare: together they earned 73 million dollars a year. The team’s first basket was made by Larry Bird, in a 136–57 victory over Cuba.

4. The first basketball was produced by a company in Chicopee Falls, Massachusetts, in 1894. Before that, a soccer ball was used.

5. The official NBA game ball is a size 7 basketball that meets all size and weight specifications set by the NBA, with a circumference of 29.5" (75 cm).

Source: Guia dos curiosos