Wednesday, 7 September 2022

Plotting a fancy diagonal correlation matrix in python within dataframe

I have the following synthetic dataframe, which includes numerical and categorical columns as well as a label column. I want to plot a diagonal correlation matrix and display the correlation coefficients in the upper part, as follows:

expected output:

img

The categorical columns in the synthetic dataframe df still need to be converted to numerical values. So far I have adapted this seaborn example, using the 'titanic' dataset, which mixes categorical and numerical columns and fits my task, and added a label column as follows:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Load the titanic dataset (categorical + numerical columns)
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)

# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))

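# Compute the correlation matrix on the numeric columns only
# (pandas >= 2.0 raises on non-numeric columns unless numeric_only=True)
corr = df.corr(numeric_only=True)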

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
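The categorical-to-numerical conversion mentioned above can be sketched roughly as follows. This is only one possible encoding, assuming plain integer category codes are acceptable; the helper name encode_categoricals and the small demo frame are hypothetical and not part of the seaborn example. Note that the codes follow the sorted category order, so correlations on nominal columns should be read with care.

```python
import numpy as np
import pandas as pd


def encode_categoricals(frame):
    """Return a copy where non-numeric columns are replaced by integer codes.

    Hypothetical helper for illustration; missing values stay NaN.
    """
    out = frame.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            codes = out[col].astype("category").cat.codes
            # cat.codes uses -1 for missing values; turn those back into NaN
            out[col] = codes.astype("float64").where(codes >= 0)
    return out


# Made-up data that mimics a few titanic-like columns
demo = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "sex": ["male", "female", "female", "male"],
    "deck": pd.Categorical(["C", None, "E", "C"]),
})

encoded = encode_categoricals(demo)
corr_encoded = encoded.corr()
print(corr_encoded.round(2))
```

With every column numeric, df.corr() no longer drops or rejects the categorical ones.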

I checked a related post but couldn't figure out how to do this task. The best I could find so far is this workaround, which uses this package and gives me the following output:

#!pip install heatmapz
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Load the titanic dataset
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)

# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))

# Compute the correlation matrix first, since the mask is built from its shape
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle (np.triu already includes the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(8, 8))
corrplot(corr[mask], size_scale=300)

img

Sadly corr[mask] doesn't mask the upper triangle in this package.
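One option for blanking the upper triangle in the data itself, rather than relying on indexing, is DataFrame.mask, which replaces cells with NaN wherever the condition is True. Below is a minimal pandas-only sketch on made-up data (the demo frame and its column names are hypothetical). Whether corrplot then leaves NaN cells empty is an assumption that would still need checking.

```python
import numpy as np
import pandas as pd

# Made-up numeric data, only to get a small correlation matrix
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
corr = demo.corr()

# True on the diagonal and above it
upper = np.triu(np.ones(corr.shape, dtype=bool))

# Cells where `upper` is True become NaN; the strict lower triangle is kept
lower_only = corr.mask(upper)
print(lower_only.round(2))
```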

I also noticed that producing this kind of plot is much easier in R, so I'm open to converting the pandas DataFrame to an R data frame if there is a straightforward way. There seems to be a package called rpy2 that lets Python and R be used together, even in a Google Colab notebook: Ref.1

from rpy2.robjects import pandas2ri
pandas2ri.activate() 

If that is the case, I found post1 and post2, which use R to visualize a correlation matrix. In short, my first priority is Python and its packages (Matplotlib, seaborn, Plotly Express), and only then R and its packages, to reach the expected output.
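As a starting point on the Python side, the target layout (colored cells below the diagonal, coefficients written above it) can be sketched with plain Matplotlib. This is a rough sketch under assumptions, not a finished solution: the function name plot_corr_triangle, the demo data, and the "RdBu_r" colormap are illustrative choices, and the styling does not try to match the expected image.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_corr_triangle(corr, ax=None, cmap="RdBu_r"):
    """Draw the lower triangle as colored cells and the upper one as numbers.

    Hypothetical helper for illustration only.
    """
    n = len(corr)
    values = corr.to_numpy()
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))

    # Hide the diagonal and everything above it in the colored layer
    upper = np.triu(np.ones((n, n), dtype=bool))
    colored = np.ma.masked_where(upper, values)
    im = ax.imshow(colored, cmap=cmap, vmin=-1, vmax=1)

    # Write each coefficient in the strictly upper triangle
    # (imshow puts column j on the x axis and row i on the y axis)
    for i in range(n):
        for j in range(i + 1, n):
            ax.text(j, i, f"{values[i, j]:.2f}", ha="center", va="center")

    ax.set_xticks(range(n))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(n))
    ax.set_yticklabels(corr.index)
    ax.figure.colorbar(im, ax=ax, shrink=0.5)
    return ax


# Made-up numeric data plus a random 0/1 label column
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
demo["label"] = rng.integers(0, 2, size=len(demo))

ax = plot_corr_triangle(demo.corr())
```

Calling plt.show() afterwards displays the figure; passing an existing ax lets the plot sit inside a larger layout.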



from Plotting a fancy diagonal correlation matrix in python within dataframe
