Explore Data

Explore data outside the context of a model

show_correlation

 show_correlation (df:pandas.core.frame.DataFrame, method='pearson',
                   cmap:str|Colormap='PuBu', low:float=0, high:float=0,
                   axis:Axis|None=0, subset:Subset|None=None,
                   text_color_threshold:float=0.408, vmin:float|None=None,
                   vmax:float|None=None, gmap:Sequence|None=None)

Show correlation heatmap

If output is not rendering properly when you reopen a notebook, make sure the notebook is trusted.

Parameters:

  • df: DataFrame

  • method: Method of correlation to pass to df.corr()

    Remaining parameters are passed to pandas.io.formats.style.Styler.background_gradient.
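
The line quoted in the warning traceback of the iris example further down suggests that the function computes `df.corr(method=method)`, shades it with `Styler.background_gradient`, and formats each cell to two decimals. The sketch below follows that pattern under stated assumptions: `correlation_table` and `styled_correlation` are hypothetical names, not part of this library, and restricting the input to numeric columns is an added choice rather than documented behavior.

```python
import pandas as pd


def correlation_table(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Pairwise correlations of the numeric columns (hypothetical helper)."""
    return df.select_dtypes(include="number").corr(method=method)


def styled_correlation(df: pd.DataFrame, method: str = "pearson", **kwargs):
    """Shade the correlation table; kwargs go to Styler.background_gradient.

    Hypothetical stand-in for show_correlation. Rendering a Styler needs
    jinja2, and colormaps need matplotlib, so it is not called below.
    """
    kwargs.setdefault("cmap", "PuBu")
    return (
        correlation_table(df, method)
        .style.background_gradient(**kwargs)
        .format("{0:,.2f}")
    )


# Toy data: b rises with a, c falls as a rises.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
table = correlation_table(toy)
print(table.round(2))
```

In a notebook, `styled_correlation(toy, vmin=-1, vmax=1)` would be the last expression in a cell so that the shaded table renders.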

import pandas as pd
import sklearn.datasets

# Correlate the diabetes features with each other and with the target.
X_diabetes, y_diabetes = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)
show_correlation(pd.concat((X_diabetes, y_diabetes), axis="columns"))

            age    sex    bmi     bp     s1     s2     s3     s4     s5     s6 target
age        1.00   0.17   0.19   0.34   0.26   0.22  -0.08   0.20   0.27   0.30   0.19
sex        0.17   1.00   0.09   0.24   0.04   0.14  -0.38   0.33   0.15   0.21   0.04
bmi        0.19   0.09   1.00   0.40   0.25   0.26  -0.37   0.41   0.45   0.39   0.59
bp         0.34   0.24   0.40   1.00   0.24   0.19  -0.18   0.26   0.39   0.39   0.44
s1         0.26   0.04   0.25   0.24   1.00   0.90   0.05   0.54   0.52   0.33   0.21
s2         0.22   0.14   0.26   0.19   0.90   1.00  -0.20   0.66   0.32   0.29   0.17
s3        -0.08  -0.38  -0.37  -0.18   0.05  -0.20   1.00  -0.74  -0.40  -0.27  -0.39
s4         0.20   0.33   0.41   0.26   0.54   0.66  -0.74   1.00   0.62   0.42   0.43
s5         0.27   0.15   0.45   0.39   0.52   0.32  -0.40   0.62   1.00   0.46   0.57
s6         0.30   0.21   0.39   0.39   0.33   0.29  -0.27   0.42   0.46   1.00   0.38
target     0.19   0.04   0.59   0.44   0.21   0.17  -0.39   0.43   0.57   0.38   1.00

iris = sklearn.datasets.load_iris()
X_iris, y_iris = iris["data"], iris["target"]
y_iris = pd.Series(y_iris, name="iris type").map(
    {num: name for num, name in zip([0, 1, 2], iris["target_names"])}
)
X_iris = pd.DataFrame(X_iris, columns=iris["feature_names"])
show_correlation(
    pd.concat((X_iris, pd.Series(y_iris == "setosa", name="setosa"))), method="spearman"
)
/var/folders/wv/pmfhhk1d4h1fkd_z5l83m0dw0000gq/T/ipykernel_96080/529445965.py:15: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  return df.corr(method=method).style.background_gradient(**kwargs).format("{0:,.2f}")

                   sepal length (cm)   sepal width (cm)  petal length (cm)   petal width (cm)
sepal length (cm)               1.00              -0.17               0.88               0.83
sepal width (cm)               -0.17               1.00              -0.31              -0.29
petal length (cm)               0.88              -0.31               1.00               0.94
petal width (cm)                0.83              -0.29               0.94               1.00
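
The table above has no `setosa` row or column, even though the call passes a `setosa` series. A likely explanation, offered as an assumption rather than a confirmed diagnosis, is that `pd.concat` defaults to `axis=0`: the series is stacked beneath the feature rows instead of being attached as a column, and the resulting non-numeric column is then dropped by `df.corr()`, which would also account for the `numeric_only` warning. The sketch below contrasts the two calls on toy data; `features` and `is_setosa` are hypothetical stand-ins for `X_iris` and the boolean series.

```python
import pandas as pd

# Hypothetical stand-ins for X_iris and the boolean "setosa" series.
features = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]})
is_setosa = pd.Series([True, True, False, False, False, False], name="setosa")

# Default axis=0 appends the series as extra rows, doubling the row count.
stacked = pd.concat((features, is_setosa))

# axis="columns" aligns on the index and adds the series as a new column.
# Casting to int keeps the indicator numeric for the correlation.
side_by_side = pd.concat((features, is_setosa.astype(int)), axis="columns")
corr = side_by_side.corr(method="spearman")

print(stacked.shape, side_by_side.shape)
print(corr.round(2))
```

With the column-wise form, the indicator appears in the correlation matrix alongside the features.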

plot_column_clusters

 plot_column_clusters (df, corr_method:str='spearman',
                       ax:matplotlib.axes._axes.Axes=None, p=30,
                       truncate_mode=None, color_threshold=None,
                       get_leaves=True, orientation='top', labels=None,
                       count_sort=False, distance_sort=False,
                       show_leaf_counts=True, no_plot=False,
                       no_labels=False, leaf_font_size=None,
                       leaf_rotation=None, leaf_label_func=None,
                       show_contracted=False, link_color_func=None,
                       above_threshold_color='C0')

Plot a dendrogram based on column correlations

If output is not rendering properly when you reopen a notebook, make sure the notebook is trusted.

Adapted from https://github.com/fastai/book_nbs/blob/master/utils.py#L58-L64

Parameters:

  • df: DataFrame

  • corr_method: Method of correlation to pass to df.corr()

  • ax: Matplotlib Axes object. Plot will be added to this object if provided; otherwise a new Axes object will be generated.

    Remaining parameters are passed to scipy.cluster.hierarchy.dendrogram.

ax = plot_column_clusters(X_iris)
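
As a rough illustration of how a dendrogram can be derived from column correlations, the sketch below converts the correlation matrix into a distance matrix (`1 - correlation`), condenses it, applies average linkage, and passes the result to `scipy.cluster.hierarchy.dendrogram`. This follows the general approach of the fastai utility referenced above, but `column_linkage` is a hypothetical helper and the details (average linkage, symmetrizing and zeroing the diagonal) are assumptions, not the library's confirmed implementation.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform


def column_linkage(df: pd.DataFrame, corr_method: str = "spearman") -> np.ndarray:
    """Hierarchical linkage of columns using 1 - correlation as distance."""
    corr = df.corr(method=corr_method).to_numpy()
    dist = 1 - corr
    # Guard against floating-point asymmetry and a non-zero diagonal.
    dist = (dist + dist.T) / 2
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    return hc.linkage(condensed, method="average")


# Toy data: a and b have identical ranks, c runs mostly opposite to them.
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 3, 4, 1, 2]}
)
z = column_linkage(toy)

# no_plot=True returns the layout without requiring matplotlib; drop it and
# pass ax=... to draw on an existing Axes.
result = hc.dendrogram(z, labels=list(toy.columns), no_plot=True)
print(result["ivl"])
```

Columns that merge near the bottom of the tree are strongly positively correlated under this distance, which makes them candidates for redundancy checks.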