W4 Flashcards

Question

What is important about the bin widths of a histogram?

Answer 1

Bins do not necessarily need to have the same width.
* Without normalisation, the height of each bar represents the frequency of the values that fall into the bin
  - Can be misleading when the bins have unequal widths, so it is better avoided
* With normalisation, the area of each bar represents the probability of the values falling into the bin
EXAMPLE: age of the Titanic's passengers, with [0, 1) as the first bin and [50, 100] as the last bin:
    _, ax = plt.subplots(1, 2, figsize=(15, 2))
    bins = [0, 1, 10, 20, 30, 40, 50, 100]
    ax[0].hist(titanic['Age'], bins=bins)
    ax[0].set_title('with original frequency')
    ax[1].hist(titanic['Age'], bins=bins, density=True)
    ax[1].set_title('with relative frequency');
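
The area rule can be checked numerically. The following is a minimal sketch, assuming a small made-up array of ages (not the Titanic data): with `density=True`, NumPy returns bar heights whose areas (height times bin width) equal the proportion of values in each bin.

```python
import numpy as np

# Hypothetical ages, used only for illustration (not the Titanic data).
ages = np.array([0.5, 3, 8, 15, 22, 24, 27, 31, 35, 38, 42, 47, 55, 63, 71])
bins = [0, 1, 10, 20, 30, 40, 50, 100]

counts, edges = np.histogram(ages, bins=bins)             # raw frequencies
density, _ = np.histogram(ages, bins=bins, density=True)  # normalised bar heights
widths = np.diff(edges)

# Area of each bar = height * width = proportion of values in that bin.
areas = density * widths
print(counts)
print(areas.round(3), areas.sum())
```

Note that the widest bin, [50, 100], holds three values yet gets a very low normalised bar, because its area is spread over a width of 50.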

Answer 2

Instead of visualising the probability density discretely with a histogram, a density plot can be used to get a smooth estimate.
EXAMPLE: age of the Titanic's passengers using kernel density estimation (KDE):
    plt.subplots(figsize=(15, 3))
    sns.kdeplot(titanic['Age']);

Answer 3

Kernel density estimation (KDE) is used to get a density plot.
EXAMPLE:
* Consider the data:
    kde_data = [0, 1, 3, 4, 5]
* Place a density function, such as that of the normal distribution (the Gaussian kernel), on each data point, with mean equal to the data point and a given standard deviation (here we use 1):
    from scipy import stats
    x_values = np.linspace(-3, 8, 100)
    y_pdf = np.array([stats.norm.pdf(x_values, loc=mu, scale=1) for mu in kde_data])
    y_pdf = pd.DataFrame(y_pdf.T, index=x_values)
* Graphical illustration:
    _, ax = plt.subplots(1, 2, figsize=(15, 3), sharex=True)
    sns.rugplot(kde_data, ax=ax[0], height=0.8)
    y_pdf.plot(ax=ax[1], legend=False)
    sns.rugplot(kde_data, ax=ax[1], height=0.8)
    ax[0].set(ylim=(0, 0.5), title='rug plot')
    ax[1].set(ylim=(0, 0.5), title='with kernels');
* Then normalise the kernels (divide by the number of data points, here 5) and sum them together:
    normalised_pdf = y_pdf / 5
    kde = normalised_pdf.sum(axis=1)
* Visualise the KDE:
    _, ax = plt.subplots(1, 2, figsize=(15, 3), sharex=True)
    normalised_pdf.plot(ax=ax[0], legend=False)
    kde.plot(ax=ax[1])
    ax[0].set(ylim=(0, 0.5), title='normalised kernels')
    ax[1].set(ylim=(0, 0.5), title='KDE: summing all normalised kernels');
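
The same construction can be written as one small function. This is only a sketch using NumPy; the function name `gaussian_kde_manual` and the grid limits are illustrative assumptions. It also checks that the resulting curve has total area close to 1, which is what dividing by the number of data points achieves.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth=1.0):
    """Average of Gaussian kernels centred on each data point (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    # z[i, j] = standardised distance from grid point i to data point j
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Mean over data points = sum of the kernels divided by n
    return kernels.mean(axis=1)

kde_data = [0, 1, 3, 4, 5]
grid = np.linspace(-10, 15, 2501)        # wide enough to cover both tails
kde = gaussian_kde_manual(kde_data, grid)

dx = grid[1] - grid[0]
total_area = kde.sum() * dx              # simple Riemann-sum approximation
print(round(total_area, 6))
```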

Answer 4

The standard deviation of the kernel (the bandwidth) controls the smoothness of the KDE, similar to how the number of bins controls the amount of detail we can see in a histogram.
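
A rough way to see this effect without plotting is to count the peaks of the estimated curve for a small and a large kernel standard deviation. The sketch below reuses the five data points from the KDE card; the helper names and the two bandwidth values are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_on_grid(data, grid, bandwidth):
    """Average of Gaussian kernels with the given standard deviation (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

def count_local_maxima(y):
    """Count interior grid points that are higher than their left neighbour
    and at least as high as their right neighbour."""
    inner = y[1:-1]
    return int(np.sum((inner > y[:-2]) & (inner >= y[2:])))

kde_data = [0, 1, 3, 4, 5]
grid = np.linspace(-5, 10, 1501)

n_modes = {bw: count_local_maxima(gaussian_kde_on_grid(kde_data, grid, bw))
           for bw in (0.2, 3.0)}
print(n_modes)
```

With the narrow kernel every data point produces its own peak, while the wide kernel merges them into a single smooth bump.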

Answer 5

The smoothing parameter (bw_adjust) scales the bandwidth of the density plot: larger values give a smoother curve.
EXAMPLE: age of the Titanic's passengers with different values of bw_adjust:
    _, ax = plt.subplots(1, 4, figsize=(20, 4))
    for i, bw_adj in enumerate(np.array([0.1, 0.5, 1, 10])):
        g = sns.kdeplot(titanic['Age'], ax=ax[i], bw_adjust=bw_adj)
        g.set(title=f'bw adjustment = {bw_adj}', xlabel='');

Answer 6

From the distribution plots, we can see some characteristics of the distributions like:
* Central tendency and spread
* Modes
* Skewness
* Tail and outliers

Answer 7

A mode of a distribution is a local or global maximum of its density.
* Unimodal: a distribution with a single clear maximum
* Bimodal: a distribution with two modes
* Multimodal: a distribution with more than two modes
EXAMPLE:
    _, ax = plt.subplots(1, 3, figsize=(15, 4))
    sns.histplot(np.random.normal(size=1000), ax=ax[0], kde=True).set_title('unimodal')
    sns.histplot(np.hstack([np.random.normal(loc=-2, size=500),
                            np.random.normal(loc=2, size=500)]),
                 ax=ax[1], kde=True).set_title('bimodal')
    sns.histplot(np.hstack([np.random.normal(loc=-5, scale=1.5, size=500),
                            np.random.normal(loc=0, size=500),
                            np.random.normal(loc=5, size=500)]),
                 ax=ax[2], kde=True).set_title('multimodal');

Answer 8

* Right skewed: the distribution has a long right tail
  - Mean is typically to the right of the median
* Left skewed: the distribution has a long left tail
  - Mean is typically to the left of the median
* Symmetric: both tails are of equal size
EXAMPLE:
    skewness_data = {'right skewed': np.random.beta(1, 20, size=1000),
                     'left skewed': np.random.beta(20, 1, size=1000),
                     'symmetric': np.random.normal(size=2000)}
    _, ax = plt.subplots(1, 3, figsize=(18, 4))
    for i, lab in enumerate(skewness_data):
        sns.histplot(skewness_data[lab], ax=ax[i], kde=True).set_title(lab);
        ax[i].axvline(x=skewness_data[lab].mean(), color='green')
        ax[i].axvline(x=np.median(skewness_data[lab]), color='red', linestyle='--')
        ax[i].legend(ax[i].get_lines()[-2:], ['mean', 'median'])
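
The link between the long tail and the position of the mean can be seen on a tiny example. This sketch uses made-up numbers (not from the course data): one extreme value on the right pulls the mean above the median, and mirroring the data reverses the effect.

```python
import numpy as np

# Hypothetical data with one large value forming a long right tail.
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 20])
# Mirror image of the same data, which has a long left tail instead.
left_skewed = 21 - right_skewed

for name, x in [('right skewed', right_skewed), ('left skewed', left_skewed)]:
    print(name, 'mean =', x.mean(), 'median =', np.median(x))
```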

Answer 9

Box plot is another type of visualisation for the distribution of numerical variables.
* It summarises several characteristics of a distribution (e.g. central tendency, symmetry, skewness, outliers) by visualising some descriptive statistics, which are graphically represented by:
  - Box: shows the location, spread and skewness of numerical data through their quartiles:
    -> First quartile (Q1)
    -> Median
    -> Third quartile (Q3)
  - Whiskers: extend from the box, indicating variability outside Q1 and Q3
    -> There are a few ways to calculate the whisker boundaries. One commonly used, with IQR = Q3 - Q1:
    -> Lower boundary: Q1 - 1.5 * IQR
    -> Upper boundary: Q3 + 1.5 * IQR
  - Outliers: all other observations outside the boundaries of the whiskers
EXAMPLE: age of Titanic passengers:
    plt.subplots(figsize=(2.5, 3))
    sns.boxplot(y=titanic['Age']).set_xlim(-0.8, 1)
    for q, lab in zip([0.25, 0.5, 0.75], ['Q1', 'median', 'Q3']):
        plt.annotate(lab, (0.5, titanic['Age'].quantile(q)))
    plt.annotate('whisker', (0.5, 50)); plt.annotate('whisker', (0.5, 10))
    plt.annotate('outlier', (0.5, 72));
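
The 1.5 * IQR rule can be worked through by hand. Below is a minimal sketch on made-up values (not the Titanic ages), assuming NumPy's default linear interpolation for the quartiles; the whiskers are drawn to the most extreme observations that still lie inside the boundaries.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the rule.
values = np.array([1, 3, 4, 5, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the boundaries.
inside = values[(values >= lower_bound) & (values <= upper_bound)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the boundaries is drawn as an outlier.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(q1, median, q3, iqr)
print(lower_bound, upper_bound, whisker_low, whisker_high, outliers)
```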

Answer 10

* Box plot provides a compact summary of a distribution, from which we can easily observe:
  - Central tendency: via the median line
  - Variability: via the length of the box (IQR) and the whiskers
  - Skewness: via the relative location of the median line in the box, and/or the relative length of the upper and lower whiskers
  - Amount of extreme values
* Based on "robust" statistics like median and IQR
* Good for comparing distributions of different variables and exploring relations between a categorical and a quantitative variable through side-by-side box plots
* Statistics used in a box plot could easily be provided in a non-graphical way, but a box plot allows us to notice the skewness and extreme values:
  - EXAMPLE:
    from matplotlib.cbook import boxplot_stats
    pd.DataFrame(boxplot_stats(titanic['Age'].dropna())).drop(['mean', 'cilo', 'cihi'], axis=1)
* Box plots can be used alongside histograms

Answer 11

* With a box plot, we roughly summarise the data with only 5 statistics - useful information can be lost
* Can be misleading about aspects such as multimodality
* Different datasets may produce similar box plots, yet look vastly different when visualised with a KDE or a histogram
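
The last point can be illustrated with two small made-up samples (both are assumptions for illustration only). They share the same minimum, quartiles, median and maximum, so their boxes would look the same, but one is concentrated around the median while the other has two clusters near the quartiles.

```python
import numpy as np

# Hypothetical sample with most values close to the median.
one_cluster = np.array([0, 1, 2, 4, 4.5, 4.8, 5, 5.2, 5.5, 6, 8, 9, 10])
# Hypothetical sample with two clusters, around 4 and around 6.
two_clusters = np.array([0, 3.8, 3.9, 4, 4, 4.1, 5, 5.9, 6, 6, 6.1, 6.2, 10])

def five_number_summary(x):
    """Minimum, Q1, median, Q3 and maximum."""
    return np.percentile(x, [0, 25, 50, 75, 100])

summary_a = five_number_summary(one_cluster)
summary_b = five_number_summary(two_clusters)

# Number of observations within 0.5 of the shared median (5).
near_median_a = int(np.sum(np.abs(one_cluster - 5) <= 0.5))
near_median_b = int(np.sum(np.abs(two_clusters - 5) <= 0.5))

print(summary_a, summary_b)
print(near_median_a, near_median_b)
```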

Answer 12

Violin plot is another type of visualisation for the distribution of numerical variables, and it can be considered as a combination of a box plot and a kernel density plot.
* Like a box plot, it can show the three quartiles and whiskers
* Like a kernel density plot, it shows the approximated distribution
* Violin plots can show differences in distribution that box plots fail to show for the previous dataset
EXAMPLE:
    _, ax = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
    sns.violinplot(y=titanic['Age'], ax=ax[0]).set_xlim(-1, 1); ax[0].set_title('violin plot')
    for q, lab in zip([0.25, 0.5, 0.75], ['Q1', 'median', 'Q3']):
        ax[0].annotate(lab, (0.6, titanic['Age'].quantile(q)))
    ax[0].annotate('whisker', (0.6, 50)); ax[0].annotate('outlier', (0.6, 72))
    ax[1].violinplot(titanic['Age'].dropna(), showextrema=True, quantiles=[0.25, 0.5, 0.75])
    ax[1].set_xlim(0.5, 1.5); ax[1].set_xticks([], []); ax[1].set_title('violin plot')
    sns.boxplot(y=titanic['Age'], ax=ax[2]); ax[2].set_title('box plot');

Answer 13

QQ plot (quantile-quantile plot) is a visualisation method to see if a sample follows a particular distribution.
* QQ plot compares two probability distributions by plotting their quantiles against each other
* If the two distributions being compared are similar, the points in the QQ plot will approximately lie on the identity line x=y
* QQ plot provides a graphical view of comparing two distributions, to see how properties such as location, scale, and skewness are similar or different between the two distributions
* One common use of QQ plot is to compare data with the normal distribution
  - It lets us see how well the given data matches a normal distribution with the same mean and variance as the sample mean and variance from the data
    -> Many statistical tests are based on the assumption that the data is approximately normally distributed. If this assumption is not consistent with the data, the conclusions from those tests may not be trustworthy
  - It can also help us detect skewness, fat tails, etc.
EXAMPLE:
    import statsmodels.api as sm
    _, ax = plt.subplots(figsize=(2.5, 2.5))
    sm.qqplot(titanic['Age'][titanic['Age'].notnull()], line='45',
              loc=titanic['Age'].mean(), scale=titanic['Age'].std(), ax=ax);
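
The points of a normal QQ plot can also be computed by hand, which may help to see what is being compared. The sketch below is an illustration under assumptions: it uses a synthetic sample rather than the Titanic ages, and the plotting positions (i - 0.5) / n, which are one common convention among several.

```python
import numpy as np
from statistics import NormalDist

# Synthetic sample standing in for real data (an assumption for illustration).
rng = np.random.default_rng(0)
sample = rng.normal(loc=30, scale=14, size=500)

n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n   # one common choice of plotting positions

# Reference normal distribution with the sample mean and standard deviation.
reference = NormalDist(mu=sample.mean(), sigma=sample.std(ddof=1))
theoretical_q = np.array([reference.inv_cdf(p) for p in probs])
sample_q = np.sort(sample)                # empirical quantiles

# If the sample is close to normal, the pairs lie near the line x = y,
# so the two sets of quantiles are almost perfectly correlated.
corr = np.corrcoef(theoretical_q, sample_q)[0, 1]
print(round(corr, 4))
```

Plotting `theoretical_q` against `sample_q` would give the QQ plot itself; systematic curvature away from the line would point to skewness or heavy tails.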

Answer 14

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
sns.set_style("darkgrid")
matplotlib.rcParams['figure.figsize'] = (4, 2.5)

Answer 15

# update the type of data
titanic = pd.read_csv('data/titanic/train.csv')
titanic['Survived'] = pd.Categorical.from_codes(titanic.Survived, ['not survived', 'survived'])
titanic['Sex'] = titanic['Sex'].astype('category')
titanic['Pclass'] = pd.Categorical(titanic['Pclass'], ordered=True)  # ordinal
# select variables we are interested in
titanic = titanic[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]

Answer 16

A contingency table displays the multivariate frequency distribution of the variables.
* For two categorical variables, the column headers match the levels of one variable, and the row headers match the levels of the other variable
* A contingency table provides a basic picture of the interrelation between two variables and can help find interactions between them
* EXAMPLE: compare the number of survived/not survived passengers of the Titanic with different ticket classes:
    survived = pd.crosstab(titanic['Pclass'], titanic['Survived'])
    survived

Answer 17

A side-by-side bar chart provides simultaneous comparison of the distributions of a categorical variable "conditioning" on another categorical variable.
* It allows you to compare each subgroup directly
* EXAMPLE: compare the number of survived/not survived passengers of the Titanic with different ticket classes:
    survived.plot.bar(figsize=(6, 2), rot=0);

Answer 18

A stacked bar chart is another way to simultaneously compare the distribution of a categorical variable conditioning on another categorical variable.
* Compared with a side-by-side bar chart, it focuses more on the part-to-whole relation
* EXAMPLE: compare the number of survived/not survived passengers of the Titanic with different ticket classes:
    survived.plot.bar(stacked=True, figsize=(6, 2), rot=0);
NOTE: Here, we can compare the distribution of ticket class as in a normal bar chart, but we can also see how the distribution of survival differs conditioning on the ticket class

Answer 19

By comparing the descriptive statistics of the quantitative variable across different groups of the categorical variable.
* EXAMPLE: age (quantitative variable) vs ticket class (categorical variable):
    titanic[['Age', 'Pclass']].groupby('Pclass', observed=True). \
        agg(['min', 'mean', 'median', 'max']).T.round(1)

Answer 20

Overlaid histograms and density curves are possible ways to compare the distributions of different quantitative variables (or how a variable differs over specific groups).
* Superposition: multiple plots drawn on top of each other
* EXAMPLE: compare the age of passengers in different classes:
    fig, ax = plt.subplots(1, 2, figsize=(15, 3))
    sns.histplot(data=titanic, x='Age', hue='Pclass', ax=ax[0], stat='density')
    sns.kdeplot(data=titanic, x='Age', hue='Pclass', ax=ax[1]);

Answer 21

* Overlapping histograms can be difficult to read
* Overlapping density plots are easier to read, but can become difficult to read when we have more categories

Answer 22

ALTERNATIVE 1: We can plot multiple histograms and/or density curves sharing the same axes.
* Juxtaposition: multiple plots with the same scale, displayed side-by-side
* EXAMPLE: compare the age of passengers from different classes:
    _, ax = plt.subplots(3, 2, figsize=(8, 2.5), sharex=True, sharey=True); plt.tight_layout()
    for i in range(3):
        sns.histplot(data=titanic[titanic.Pclass == i+1], x='Age', ax=ax[i, 0], stat='density')
        sns.kdeplot(data=titanic[titanic.Pclass == i+1], x='Age', ax=ax[i, 1])
        ax[i, 0].set(title=f'class = {i+1}'); ax[i, 1].set(title=f'class = {i+1}');
ALTERNATIVE 2: It may be better to compare distributions using side-by-side box plots or violin plots.
* EXAMPLE: age vs ticket class:
  - 1 quantitative variable (age on the y-axis)
  - 1 categorical variable (ticket class on the x-axis)
    _, ax = plt.subplots(1, 2, figsize=(10, 1.5), sharey=True)
    sns.boxplot(data=titanic, y='Age', x='Pclass', ax=ax[0])
    sns.violinplot(data=titanic, y='Age', x='Pclass', ax=ax[1]);

Answer 23

* The (over)simplified visualisation provided by a box plot makes it useful for comparing a quantitative variable across groups and seeing the relationship between a quantitative variable and a categorical variable
* It highlights the range, quartiles, median and any outliers present in a data set for each group

Answer 24

A split violin plot allows you to display the distributions from 2 groups on different sides of the density plot.
* This allows us to explore the relations between 3 variables
  - 1 quantitative variable (y-axis)
  - 2 categorical variables (x-axis and both sides of the violin plot)
* EXAMPLE: age vs ticket class and gender of passengers of the Titanic:
    sns.catplot(data=titanic, x='Pclass', y='Age', kind='violin', hue='Sex', split=True, height=3, aspect=1.5);

Answer 25

1. Scatter plot
2. Hex plot
3. Contour plot
4. Scatter matrix

Answer 26

Scatter plots are used to reveal relationships between pairs of quantitative variables.
* Use Cartesian coordinates to display the values of (typically) two variables for a set of data
* EXAMPLE:
    x = [1, 2, 4, 4, 3, 2, 5]; y = [0, 2, 5, 3, 2, 1, 4]
    _, ax = plt.subplots(figsize=(2.5, 2.5))
    g = sns.scatterplot(x=x, y=y, ax=ax)
    g.set(xlabel='x', ylabel='y');

Answer 27

A scatter plot helps us to find out if there is a relationship, and the type of relationship (linear, non-linear, unequal spread).
EXAMPLE1: simulated data showing the different types of relationship:
    import numpy.random as rn
    _, ax = plt.subplots(1, 4, figsize=(15, 4))
    n = 300; x = rn.randn(n)
    sns.scatterplot(x=x, y=rn.normal(scale=0.5, size=n), ax=ax[0]).set_title('no relation')
    sns.scatterplot(x=x, y=x+rn.normal(scale=0.5, size=n), ax=ax[1]).set_title('linear')
    sns.scatterplot(x=x, y=x**2+rn.normal(scale=0.5, size=n), ax=ax[2]).set_title('non-linear')
    sns.scatterplot(x=x, y=x+x*rn.normal(scale=0.3, size=n), ax=ax[3]).set_title('unequal spread');
EXAMPLE2: see if there is any relation between mpg and other variables like displacement, acceleration and weight in the auto dataset:
    auto = pd.read_csv('data/auto-mpg.csv')
    auto['origin'] = auto['origin'].astype('category')
    _, ax = plt.subplots(1, 3, figsize=(10, 3))
    sns.scatterplot(data=auto, x='displacement', y='mpg', ax=ax[0])
    sns.scatterplot(data=auto, x='acceleration', y='mpg', ax=ax[1])
    sns.scatterplot(data=auto, x='weight', y='mpg', ax=ax[2]);

Answer 28

We can have a scatter plot with marginal histograms of the two quantitative variables.
EXAMPLE:
    g = sns.jointplot(data=auto, x='displacement', y='mpg')
    g.fig.set_size_inches(3, 3);

Answer 29

* Like rug plots, scatter plots can be subject to overplotting
* One possible solution: smaller markers
* Another solution: 2D "histogram", density plot
* NOTE: if we only want to see the relations between variables, overplotting is not necessarily an issue

Answer 30

Hex plot is a tool to visualise the joint distribution. It divides the plane into regular hexagons, counts the number of observations that fall into each hexagon, and then maps the count to the hexagon fill.
* Can be thought of as a two-dimensional histogram
* More heavily shaded hexagons typically indicate a greater density/frequency
EXAMPLE:
    _, ax = plt.subplots(figsize=(3, 2.5))
    auto.plot.hexbin(x="displacement", y="mpg", gridsize=10, ax=ax);
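
The counting step behind a two-dimensional histogram can be sketched with NumPy. Note the swap: this sketch uses rectangular bins via `np.histogram2d` rather than hexagons, and synthetic data rather than the auto dataset, purely to show how observations are counted per cell.

```python
import numpy as np

# Synthetic pair of related variables (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)

# Divide the plane into a 10 x 10 grid of rectangles and count points per cell.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape, int(counts.sum()))
print(int(counts.max()))   # the most populated cell would get the darkest fill
```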

Answer 31

The use of hexagon bins can:
* Avoid the visual artefacts sometimes generated by the very regular alignment of grids
* Visualise relations better
EXAMPLE (the same data with rectangular bins, for comparison):
    g = sns.displot(auto, x="displacement", y="mpg", cbar=True)
    g.fig.set_size_inches(3, 2.5);

Answer 32

Contour plots are two-dimensional density plots.
EXAMPLE:
    sns.jointplot(data=auto, x='displacement', y='mpg', kind='kde', fill=True, height=3);

Answer 33

* You may want to use a scatter matrix to visualise the relations between all pairs of quantitative variables for multivariate data
* It consists of:
  - Histograms / KDE plots on the diagonal to visualise the marginal distribution of each variable
  - Scatter plots for all possible pairs
* It is a convenient visualisation for quick exploratory data analysis
EXAMPLE1: with marginal density plots:
    g = sns.pairplot(auto[['mpg', 'displacement', 'weight']], diag_kind='kde', plot_kws={"marker": '+'})
    g.fig.set_size_inches(3, 3);
EXAMPLE2: lower triangle only:
    g = sns.pairplot(auto[['mpg', 'displacement', 'weight']], corner=True, plot_kws={"marker": '+'})
    g.fig.set_size_inches(3, 3);

Answer 34

It is possible to show more than 2 variables on a scatter plot. We can do so by encoding the other variables with different colours, sizes and styles of the markers:
* Quantitative: marker colour, marker size
* Categorical: marker colour, marker style
* NOTE: While in principle you can show 5 variables in one scatter plot, use it with care as the plot can be difficult to read.
* EXAMPLE1: acceleration vs mpg, with marker colour representing the number of cylinders:
    plt.subplots(figsize=(10, 3))
    sns.scatterplot(data=auto, x='acceleration', y='mpg', hue='cylinders', palette='flare');
* EXAMPLE2: five variables in one scatter plot:
    plt.subplots(figsize=(10, 3))
    sns.scatterplot(data=auto, x='acceleration', y='mpg', hue='cylinders', style='origin', size='weight', palette='flare');

Answer 35

1. Line plot
2. Moving average plot
3. Stacked area graph
4. Multiple lines in one graph
5. Different y-axes
6. Line plot (juxtaposition)

Answer 36

A line plot is a type of chart which displays ordered data connected by straight line segments.
* Useful for sequential data to find patterns, trends, seasonality, changes and anomalies
* For time series data, we often plot the data against time to see if there is any trend
* EXAMPLE: coronavirus cases in the UK:
    cases = pd.read_csv('data/cases.csv', usecols=[3, 4], parse_dates=[0], index_col=0)
    cases.columns = ['cases']
    cases = cases.sort_index()  # sort_index returns a new frame, so assign it; chronological order
    cases.plot(figsize=(8, 1.5), legend=False, ylabel='cases');

Answer 37

To visualise the overall trend better, you may want to remove the periodic fluctuations around the trend.
* By using a 7-day rolling mean, we can remove the periodic fluctuations and visualise the overall trend better
* EXAMPLE: the number of COVID cases reported is higher on weekdays than at weekends:
    # average number of cases for each day of the week
    cases['weekday'] = cases.index.weekday
    cases.groupby('weekday').mean().plot.bar(ylabel='average number of cases', legend=False, figsize=(5, 2.5))
    plt.xticks(range(7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
    plt.tick_params(axis='x', rotation=0);
    # 7-day centred rolling mean
    plt.subplots(figsize=(8, 3))
    cases['cases'].rolling(7, center=True).mean().plot(ylabel='cases');
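
Why a 7-day window works can be seen on a synthetic series. The sketch below builds made-up data (not the UK case counts) as a linear trend plus a weekly pattern that sums to zero over each week; any 7 consecutive days then contain one full cycle, so the centred 7-day mean cancels the pattern and returns the trend.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 4 weeks starting on a Monday (illustrative only).
dates = pd.date_range('2021-01-04', periods=28, freq='D')
trend = np.linspace(100, 154, 28)                       # rises by 2 per day
weekly = np.tile([10, 10, 10, 10, 10, -25, -25], 4)     # weekday/weekend effect, sums to 0
cases_demo = pd.Series(trend + weekly, index=dates)

# Centred 7-day rolling mean: the first and last 3 days have no full window.
smoothed = cases_demo.rolling(7, center=True).mean()

print(cases_demo.head(7).round(1).tolist())
print(smoothed.dropna().head(4).round(1).tolist())
```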

Answer 38

A stacked area graph visualises how a quantitative variable of each group changes over time.
* The value of each group at each time point is represented by the height
* EXAMPLE: number of COVID cases in France, Spain and Italy:
    import seaborn.objects as so
    eu_cases = pd.read_csv('data/cases_eu.csv', usecols=[0, 4, 5, 6], parse_dates=[0], dayfirst=True, index_col=0)
    eu_cases.columns = ['cases', 'deaths', 'country']; eu_cases.index.name = 'date'
    eu_cases = eu_cases[(eu_cases['country'].isin(['France', 'Spain', 'Italy'])) &
                        (eu_cases.index >= pd.to_datetime('2021-09-01'))]
    p = so.Plot(eu_cases, "date", "cases", color='country')
    p.add(so.Area(alpha=1), so.Stack()).layout(size=(15, 3))

Answer 39

Multiple lines in one graph can be used to visualise how a quantitative variable of each group changes over time.
* EXAMPLE:
    p = so.Plot(eu_cases, "date", "cases", color='country')
    p.add(so.Line(), so.Agg()).layout(size=(15, 4))

Answer 40

When the scales of the variables are different, it is difficult to compare the lines when they are plotted on the same graph.
* EXAMPLE: number of COVID cases vs deaths in France
SOLUTION1: Plot with different y-axes
* One possible solution is to use different y-axes for the two variables
* But use with care - can be misleading
* EXAMPLE:
    ax_1 = eu_cases.loc[eu_cases.country == 'France', 'cases'].plot(figsize=(8, 4))
    eu_cases.loc[eu_cases.country == 'France', 'deaths'].plot(secondary_y=True)
    plt.legend(ax_1.get_lines() + ax_1.right_ax.get_lines(), ['cases', 'deaths'])
    ax_1.set_ylabel('cases'); ax_1.right_ax.set_ylabel('deaths');
SOLUTION2: Line plot juxtaposition
* Possible solution: side-by-side plots with the same x-axis
* EXAMPLE:
    ax = eu_cases[eu_cases.country == 'France'].plot(figsize=(8, 3), subplots=True, legend=False)
    ax[0].set_ylabel('cases'); ax[1].set_ylabel('deaths');

(65 cards)