W4 Flashcards

1
Q

What are the relevant visualisation libraries used and how do you import them?

A

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2
Q

What are the 4 potential types of visualisations to consider?

A
  1. Distribution: how a variable in the dataset distributes over a range of possible values (histograms)
  2. Comparison: how multiple variables compare (boxplots)
  3. Relationship: how the values of variables in the dataset relate (scatterplots)
  4. Trend: how values evolve over time (time-series related)
3
Q

Titanic EXAMPLE: How do you preprocess the data to (1) change the data types and (2) select the variables we are interested in?

A

Open the data:

titanic = pd.read_csv('data/titanic/train.csv')

Change the data types:

titanic['Survived'] = pd.Categorical.from_codes(titanic['Survived'], ['not survived', 'survived'])

titanic['Sex'] = titanic['Sex'].astype('category')

titanic['Pclass'] = pd.Categorical(titanic['Pclass'], ordered=True)  # make the data ordinal

Select the variables we are interested in:

titanic = titanic[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]

4
Q

Titanic EXAMPLE: How do you attain the descriptive data from the data set?

A

titanic.describe(include='all')

5
Q

Titanic EXAMPLE: How can you learn about the distribution of age of the passengers by looking at the raw data and by visualising the same data using a histogram?

A

print(titanic['Age'].to_list()[:300])

plt.subplots(figsize=(15, 3))
sns.histplot(titanic['Age']);

6
Q

What is a motivation for reviewing data sets before analysing them?

A
  • Looking at data graphically before analysing them can reveal unusual patterns that the descriptive statistics would never lead us to expect
  • Basic statistics alone are inadequate for describing datasets

EXAMPLE: In Anscombe’s quartet, all 4 data sets have nearly the same mean, standard deviation, correlation and fitted linear regression line, yet they have very different distributions and look very different when graphed
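
SKETCH (not part of the original card): the claim about Anscombe's quartet can be checked numerically. The values below are the published quartet, hard-coded here, and the names `quartet` and `summary` are illustrative only.

```python
import numpy as np

# Anscombe's quartet (Anscombe, 1973); sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares regression line
    summary[name] = {
        "mean_y": y.mean(),
        "std_y": y.std(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }
    print(name, {k: round(float(v), 2) for k, v in summary[name].items()})
```

The four printed rows are nearly identical, which is exactly why the data sets need to be plotted to tell them apart.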

7
Q

Why do we use visualisation?

A
  • We tend to see patterns/structure in data much more easily by visual means than by looking at raw numbers
  • Descriptive statistics may not be adequate for us to understand the data
  • Identify hidden, unexpected patterns and trends
  • Visualisation complements statistics. Both descriptive statistics and visualisation should be used to help us to understand the data
8
Q

What are the 2 main goals of visualisation?

A
  1. Exploratory: Understand your data
    * Key part of exploratory data analysis (EDA)
    * Evaluate model performance
    * Audience: yourself
    * Tool you use while thinking - no need to worry too much about the formatting, etc.
  2. Explanatory: Communicate results to others
    * Explain and inform
    * Provide evidence and support
    * Audience: others
    * Tool you use to influence and persuade - highly editorial and selective
9
Q

What is EDA and how does it differ from IDA?

A

Exploratory data analysis (EDA) is an approach to analysing datasets to:
* Summarise their main characteristics, often by visualising the data or some summary statistics
* Understand the data beyond the formal modelling or hypothesis testing

EDA is different from initial data analysis (IDA):
* IDA: Process of data inspection - check the quality of data, handle issues with the data, etc.

EDA is a critical first step for data analysis, followed by formal (confirmatory) data analysis.

10
Q

What are the objectives of EDA?

A
  1. Enable unexpected discoveries in the data
  2. Discover relationships among variables
  3. Suggest hypotheses about the causes of observed phenomena
  4. Preliminary selection of appropriate statistical tools, techniques and models
  5. Assess assumptions on which statistical inference will be based
11
Q

What are the types of EDA?

A
  1. Graphical vs non-graphical
    * Last week: non-graphical, descriptive statistics
  2. Univariate vs multivariate
    * Univariate: look at only 1 variable at a time
      - For tabular data, this can mean looking at only one column
    * Multivariate: look at two or more variables at a time to explore relationships
12
Q

RECAP: What are the different descriptive statistics?

A

Central tendency
* Mean
* Median
* Mode

Spread/Dispersion
* Range
* IQR
* Standard deviation

Relations
* Correlation
* Cross tabulation/contingency table
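
SKETCH (not part of the original card): each statistic in the recap can be computed with pandas. The sample values and the small data frame below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample, for illustration only
values = pd.Series([0, 2, 0, 0, 3, 0, 1, 0, 5, 10])

# Central tendency
central = {"mean": values.mean(), "median": values.median(), "mode": values.mode().iloc[0]}

# Spread/dispersion
q1, q3 = values.quantile(0.25), values.quantile(0.75)
spread = {"range": values.max() - values.min(), "IQR": q3 - q1, "std": values.std()}
print(central)
print(spread)

# Relations: correlation for two quantitative variables,
# cross tabulation for two categorical variables
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 9],
                   "group": ["a", "a", "b", "b"], "flag": ["yes", "no", "yes", "yes"]})
corr = df["x"].corr(df["y"])
table = pd.crosstab(df["group"], df["flag"])
print(round(corr, 3))
print(table)
```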

13
Q

What are some plots/distributions for univariate variables?

A

Categorical data:
* Bar chart
* Pie chart

Quantitative data
* Rug plot
* Histogram and density plot
* Box plot and violin plot

14
Q

For univariate categorical variables, how can we display the distribution by tabulation?

A

EXAMPLE: The gender of the Titanic’s passengers

titanic_gender_data = titanic.value_counts('Sex')

titanic_gender_data.to_frame()

15
Q

What is a bar chart and how do you attain it?

A

A bar chart is commonly used to display the distribution of a univariate categorical variable (and quantitative data with only very few distinct values).

  • The rectangular bars have heights or lengths proportional to the number of observations in the corresponding category
  • The width of a bar does not represent any property of the data

EXAMPLE: Gender of the titanic passengers:

titanic_gender_data.plot.bar(ylabel='number of passengers', rot=0, figsize=(3, 1.5));

16
Q

What is a Pie Chart and how do you attain it?

A

A pie chart is another commonly used chart to show the distribution of a univariate categorical variable.

  • The arc length of each slice (or its angle and area) is proportional to the number of observations for the corresponding category

EXAMPLE: Gender of the Titanic’s passengers:

titanic_gender_data.plot(kind='pie', legend=False, autopct=lambda p: f'{p:.2f}%', textprops={'fontsize': 11}, ylabel='', figsize=(4, 2.5));

17
Q

What is a Rug Plot and how do you attain it?

A

A rug plot visualises the distribution of a univariate quantitative variable by simply mapping each data value to a location on an axis.

  • It helps to show the distribution of a single quantitative variable
  • It shows every value
  • Note that the y-axis does not represent any property of the data

EXAMPLE:
rug_data = [0, 2, 0, 0, 3, 0, 1, 0, 5, 10]

plt.subplots(figsize=(15, 0.5))

g = sns.rugplot(rug_data, height=1)

g.set(ylim=(0, 1), yticks=[]);

18
Q

What are possible issues with the Rug Plot?

A
  • Too much detail: there is no need to know each value
  • Overplotting: we cannot tell how many observations each mark represents
19
Q

What is an alternative to a rug plot?

A

We could use a spike plot, which maps the data to positions on the x-axis and uses height to represent the frequency of each value.
- But like a rug plot, a spike plot can still show too much detail, which makes it difficult to generalise and interpret the graph.
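
SKETCH (not part of the original card): the spike heights are just the counts of each distinct value. Using the rug plot data from the earlier card:

```python
from collections import Counter

rug_data = [0, 2, 0, 0, 3, 0, 1, 0, 5, 10]

# Spike height for each distinct value = number of occurrences of that value
spike_heights = dict(sorted(Counter(rug_data).items()))
print(spike_heights)   # {0: 5, 1: 1, 2: 1, 3: 1, 5: 1, 10: 1}
```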

20
Q

What is a histogram and how do you attain it?

A

A Histogram is another type of plot which is used to show the distribution of a univariate (i.e. single) quantitative variable.

  • Like a spike plot, it can use height to represent frequency
  • Unlike a spike plot, instead of counting (and plotting) the occurrences of each value, it groups values into intervals (“bins”) and counts how many observations fall into each bin
  • We lose details but see the big picture

EXAMPLE:
plt.subplots(figsize=(15, 2.5))
heights, bins, _ = plt.hist(titanic['Age'], bins=list(range(0, 100, 10)))
plt.ylabel('count'); plt.xlabel('age');

21
Q

What does the height of the histogram rectangles represent when the bins are of equal size?

A

When the bins are of equal size, the height of the rectangles represents the frequency (absolute/relative) of the values in the corresponding bin.

  • We can verify that each height equals the number of values in that bin for the example above:
  • Height of the rectangles:
    -> EXAMPLE: heights
  • Number of observations inside each interval, where every interval is [left, right) except the last one, which is [left, right]:
    -> EXAMPLE: [sum(titanic['Age'].between(bins[i], bins[i+1], inclusive='left'))
    for i in range(len(bins)-1)]
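
SKETCH (not part of the original card): the bin convention can be seen on a tiny made-up sample with `np.histogram`, which `plt.hist` uses for its counting. The values 10 and 20 sit exactly on bin edges.

```python
import numpy as np

data = np.array([0, 5, 10, 10, 20])   # hypothetical values
edges = [0, 10, 20]

counts, _ = np.histogram(data, bins=edges)

# Manual count: half-open [left, right) for every bin except the last, which is closed
manual = []
for i in range(len(edges) - 1):
    left, right = edges[i], edges[i + 1]
    is_last = i == len(edges) - 2
    in_bin = (data >= left) & ((data <= right) if is_last else (data < right))
    manual.append(int(in_bin.sum()))

print(counts.tolist(), manual)   # [2, 3] [2, 3]
```

The two 10s are counted in the second bin, and the 20 is kept because the last bin includes its right edge.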
22
Q

Why may you want to normalise a histogram and how do you normalise it?

A

A histogram can be “normalised” to display the relative frequency, so that the areas of all rectangles sum to 1.

  • Can be considered an approximate representation of the probability density of the data
  • The area of each rectangle represents the empirical probability of an observation falling in the corresponding interval (bin) on the x-axis
  • EXAMPLE (age of the Titanic’s passengers):

plt.subplots(figsize=(15, 2.5))

heights, bins, _ = plt.hist(titanic['Age'], bins=list(range(0, 100, 10)), density=True)

plt.ylabel('density'); plt.xlabel('age');

NOTE1: You can verify that the areas of all rectangles sum to 1 via: sum(heights*10)

NOTE2: To check the proportion of passengers in a range (e.g. 20 <= age < 40), add the areas of the matching bins: heights[2]*10 + heights[3]*10

23
Q

How can you select an appropriate number of bins?

A
  • You may need to see the shape of the distribution (e.g. via a histogram) to decide on an appropriate number of bins, so it can be an iterative process
  • Often the default choice is quite good, but there is no guarantee
  • EXAMPLE:
    titanic['Age'].hist(figsize=(5,2));
24
Q

What affects how the estimated probability density of a histogram looks?

A
  • Different number of bins affects how the estimated probability density looks.
  • The higher the number of bins, the more detailed the histogram is

*Beware of drawing strong conclusions from the looks of a histogram

EXAMPLE:
_, ax = plt.subplots(1, 4, figsize=(20, 5))

for i, bins in enumerate(np.array([1, 4, 12, 100])):
    ax[i].hist(titanic['Age'], bins=bins, density=True)
    ax[i].set_title(f'number of bins = {bins}')

25
What is important about the widths of histogram bins?
Bins do not necessarily need to have the same width.
* Without normalisation, the height of each bar represents the frequency of the values that fall into the bin
  - Can be misleading when the widths differ, better not to do it!
* With normalisation, the area of each bar represents the probability of values falling into the bin
EXAMPLE (age of the Titanic's passengers, with [0, 1) as the first bin and [50, 100] as the last bin):
_, ax = plt.subplots(1, 2, figsize=(15, 2))
bins = [0, 1, 10, 20, 30, 40, 50, 100]
ax[0].hist(titanic['Age'], bins=bins)
ax[0].set_title('with original frequency')
ax[1].hist(titanic['Age'], bins=bins, density=True)
ax[1].set_title('with relative frequency');
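SKETCH (not part of the original card): with unequal bin widths, the normalised height is count / (total count × bin width), so it is the area, not the height, that equals the relative frequency. The ages below are made up; only NumPy is assumed.

```python
import numpy as np

ages = np.array([0.5, 3, 8, 15, 22, 25, 28, 35, 41, 60])   # hypothetical ages
edges = np.array([0, 1, 10, 20, 30, 40, 50, 100])           # unequal bin widths
widths = np.diff(edges)

counts, _ = np.histogram(ages, bins=edges)
density, _ = np.histogram(ages, bins=edges, density=True)

# density = count / (total count * bin width), so bar areas are relative frequencies
manual_density = counts / (counts.sum() * widths)
areas = density * widths

print(counts)
print(density.round(3))
print(areas.round(2), areas.sum())
```

The first bin [0, 1) holds a single observation but gets the tallest normalised bar because it is so narrow, which is why heights alone can mislead.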
26
What is a density plot?
Instead of visualising the probability density discretely with a histogram, a density plot can be used to get a smooth estimate.
EXAMPLE (age of the Titanic's passengers, using kernel density estimation (KDE)):
plt.subplots(figsize=(15, 3))
sns.kdeplot(titanic['Age']);
27
What is a KDE and how does it work?
Kernel density estimation (KDE) is used to get a density plot.
EXAMPLE:
* Consider the data:
kde_data = [0, 1, 3, 4, 5]
* Place a density function, such as that of the normal distribution (the Gaussian kernel), on each data point, with mean equal to the data point and a given standard deviation (here we use 1):
from scipy import stats
x_values = np.linspace(-3, 8, 100)
y_pdf = np.array([stats.norm.pdf(x_values, loc=mu, scale=1) for mu in kde_data])
y_pdf = pd.DataFrame(y_pdf.T, index=x_values)
* Graphical illustration:
_, ax = plt.subplots(1, 2, figsize=(15, 3), sharex=True)
sns.rugplot(kde_data, ax=ax[0], height=0.8)
y_pdf.plot(ax=ax[1], legend=False)
sns.rugplot(kde_data, ax=ax[1], height=0.8)
ax[0].set(ylim=(0, 0.5), title='rug plot')
ax[1].set(ylim=(0, 0.5), title='with kernels');
* Then normalise the kernels (divide by the number of data points, here 5) and sum them together:
normalised_pdf = y_pdf / len(kde_data)
kde = normalised_pdf.sum(axis=1)
* Visualise the KDE:
_, ax = plt.subplots(1, 2, figsize=(15, 3), sharex=True)
normalised_pdf.plot(ax=ax[0], legend=False)
kde.plot(ax=ax[1])
ax[0].set(ylim=(0, 0.5), title='normalised kernels')
ax[1].set(ylim=(0, 0.5), title='KDE: summing all normalised kernels');
28
What is the effect of setting the standard deviation in the KDE?
The standard deviation controls the smoothness of the KDE (a larger standard deviation gives a smoother curve), similar to how the number of bins controls the amount of detail we can see in a histogram.
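SKETCH (not part of the original card): the effect of the standard deviation can be checked without plotting by counting the peaks of the KDE. The helper names `gaussian_kde_on_grid` and `count_peaks` are hypothetical, and the data is the `kde_data` from the previous card.

```python
import numpy as np

kde_data = np.array([0, 1, 3, 4, 5])
x = np.linspace(-20, 25, 4501)         # evaluation grid with step 0.01

def gaussian_kde_on_grid(data, grid, sd):
    """Average of Gaussian kernels (standard deviation sd) centred on each data point."""
    z = (grid[None, :] - data[:, None]) / sd
    kernels = np.exp(-0.5 * z ** 2) / (sd * np.sqrt(2 * np.pi))
    return kernels.mean(axis=0)

def count_peaks(curve):
    """Number of strict local maxima of a sampled curve."""
    inner = curve[1:-1]
    return int(np.sum((inner > curve[:-2]) & (inner > curve[2:])))

for sd in [0.2, 1.0, 3.0]:
    kde = gaussian_kde_on_grid(kde_data, x, sd)
    area = kde.sum() * (x[1] - x[0])   # Riemann sum approximating the total area
    print(f"sd={sd}: peaks={count_peaks(kde)}, area={area:.3f}")
```

A small standard deviation leaves one peak per data point, while a large one merges everything into a single smooth bump; the total area stays close to 1 in every case.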
29
What is the effect of setting the smoothing parameter?
The smoothing parameter (bw_adjust in seaborn) scales the kernel bandwidth: smaller values give a more detailed, wigglier curve and larger values give a smoother one.
EXAMPLE (age of the Titanic's passengers):
_, ax = plt.subplots(1, 4, figsize=(20, 4))
for i, bw_adj in enumerate(np.array([0.1, 0.5, 1, 10])):
    g = sns.kdeplot(titanic['Age'], ax=ax[i], bw_adjust=bw_adj)
    g.set(title=f'bw adjustment = {bw_adj}', xlabel='');
30
What can we see from distribution plots?
From the distribution plots, we can see some characteristics of the distributions like:
* Central tendency and spread
* Modes
* Skewness
* Tails and outliers
31
What are the different types of mode?
A mode of a distribution is a local or global maximum.
* Unimodal: a distribution with a single clear maximum
* Bimodal: a distribution with two modes
* Multimodal: a distribution with more than two modes
EXAMPLE:
_, ax = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(np.random.normal(size=1000), ax=ax[0], kde=True).set_title('unimodal')
sns.histplot(np.hstack([np.random.normal(loc=-2, size=500),
                        np.random.normal(loc=2, size=500)]),
             ax=ax[1], kde=True).set_title('bimodal')
sns.histplot(np.hstack([np.random.normal(loc=-5, scale=1.5, size=500),
                        np.random.normal(loc=0, size=500),
                        np.random.normal(loc=5, size=500)]),
             ax=ax[2], kde=True).set_title('multimodal');
32
What is skewness?
Skewness describes the asymmetry of a distribution.
* Right skewed: the distribution has a long right tail
  - Mean is typically to the right of the median
* Left skewed: the distribution has a long left tail
  - Mean is typically to the left of the median
* Symmetric: both tails are of equal size
EXAMPLE:
skewness_data = {'right skewed': np.random.beta(1, 20, size=1000),
                 'left skewed': np.random.beta(20, 1, size=1000),
                 'symmetric': np.random.normal(size=2000)}
_, ax = plt.subplots(1, 3, figsize=(18, 4))
for i, lab in enumerate(skewness_data):
    sns.histplot(skewness_data[lab], ax=ax[i], kde=True).set_title(lab)
    ax[i].axvline(x=skewness_data[lab].mean(), color='green')
    ax[i].axvline(x=np.median(skewness_data[lab]), color='red', linestyle='--')
    ax[i].legend(ax[i].get_lines()[-2:], ['mean', 'median'])
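SKETCH (not part of the original card): the mean/median rule of thumb can be checked numerically on the same kind of simulated data. The seed and the helper `sample_skewness` (third standardised moment) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
samples = {
    "right skewed": rng.beta(1, 20, size=10_000),
    "left skewed": rng.beta(20, 1, size=10_000),
    "symmetric": rng.normal(size=10_000),
}

def sample_skewness(x):
    # Third standardised moment: > 0 for a long right tail, < 0 for a long left tail
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

for label, x in samples.items():
    print(f"{label:>12}: mean={x.mean():.3f} median={np.median(x):.3f} "
          f"skewness={sample_skewness(x):.2f}")
```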
33
What is a box plot?
A box plot is another type of visualisation for the distribution of numerical variables.
* It summarises several characteristics of a distribution (e.g. central tendency, symmetry, skewness, outliers) by visualising some descriptive statistics, which are graphically represented by:
  - Box: shows the locality, spread and skewness of numerical data through their quartiles:
    -> First quartile (Q1)
    -> Median
    -> Third quartile (Q3)
  - Whiskers: extend from the box, indicating variability outside Q1 and Q3
    -> There are a few ways to calculate the whisker boundaries. One commonly used:
    -> Lower boundary: Q1 - 1.5 × IQR
    -> Upper boundary: Q3 + 1.5 × IQR
  - Outliers: all other observations outside the boundaries of the whiskers
EXAMPLE (age of the Titanic's passengers):
plt.subplots(figsize=(2.5, 3))
sns.boxplot(y=titanic['Age']).set_xlim(-0.8, 1)
for q, lab in zip([0.25, 0.5, 0.75], ['Q1', 'median', 'Q3']):
    plt.annotate(lab, (0.5, titanic['Age'].quantile(q)))
plt.annotate('whisker', (0.5, 50)); plt.annotate('whisker', (0.5, 10))
plt.annotate('outlier', (0.5, 72));
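SKETCH (not part of the original card): the box plot statistics can be worked out by hand with the 1.5 × IQR rule. The sample is made up, and quartiles use NumPy's default linear interpolation, which may differ slightly from other quartile conventions.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])   # hypothetical sample

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations that are still inside the boundaries
inside = data[(data >= lower_bound) & (data <= upper_bound)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_bound) | (data > upper_bound)]

print(q1, median, q3, iqr)            # 3.25 5.5 7.75 4.5
print(lower_bound, upper_bound)       # -3.5 14.5
print(whisker_low, whisker_high, outliers)
```

Here the value 30 lies above the upper boundary of 14.5, so it is drawn as an outlier and the upper whisker ends at 9.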
34
What are the advantages of a Box Plot?
* A box plot provides a compact summary of a distribution, from which we can easily observe:
  - Central tendency: via the median line
  - Variability: via the length of the box (IQR) and the whiskers
  - Skewness: via the relative location of the median line in the box, and/or the relative lengths of the upper and lower whiskers
  - Amount of extreme values
* Based on "robust" statistics like the median and IQR
* Good for comparing distributions of different variables and exploring relations between a categorical and a quantitative variable through side-by-side box plots
* The statistics used in a box plot could easily be provided in a non-graphical way, but a box plot lets us notice the skewness and extreme values:
  - EXAMPLE:
from matplotlib.cbook import boxplot_stats
pd.DataFrame(boxplot_stats(titanic['Age'].dropna())).drop(['mean', 'cilo', 'cihi'], axis=1)
* Box plots can be used alongside histograms
35
What are possible problems with the Box Plot?
* A box plot roughly summarises the data with only 5 statistics, so useful information can be lost
* Can be misleading about aspects such as multimodality
* Different data sets may produce similar box plots yet look vastly different when visualised with a KDE or histogram
36
What is a violin plot?
A violin plot is another type of visualisation for the distribution of numerical variables. It can be considered a combination of a box plot and a kernel density plot.
* Like a box plot, it can show the three quartiles and whiskers
* Like a kernel density plot, it shows the approximated distribution
* Violin plots can show differences in distribution that box plots fail to reveal
EXAMPLE:
_, ax = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
sns.violinplot(y=titanic['Age'], ax=ax[0]).set_xlim(-1, 1)
ax[0].set_title('violin plot')
for q, lab in zip([0.25, 0.5, 0.75], ['Q1', 'median', 'Q3']):
    ax[0].annotate(lab, (0.6, titanic['Age'].quantile(q)))
ax[0].annotate('whisker', (0.6, 50)); ax[0].annotate('outlier', (0.6, 72))
ax[1].violinplot(titanic['Age'].dropna(), showextrema=True, quantiles=[0.25, 0.5, 0.75])
ax[1].set_xlim(0.5, 1.5); ax[1].set_xticks([]); ax[1].set_title('violin plot')
sns.boxplot(y=titanic['Age'], ax=ax[2]); ax[2].set_title('box plot');
37
What is a QQ plot?
A QQ plot (quantile-quantile plot) is a visualisation method to see if a sample follows a particular distribution.
* It compares two probability distributions by plotting their quantiles against each other
* If the two distributions being compared are similar, the points in the QQ plot will lie approximately on the identity line y = x
* It provides a graphical view of how properties such as location, scale and skewness are similar or different between the two distributions
* One common use of a QQ plot is to compare data with the normal distribution
  - It lets us see how well the given data matches a normal distribution with the same mean and variance as the sample mean and variance of the data
    -> Many statistical tests are based on the assumption that the data is approximately normally distributed. If this assumption is not consistent with the data, the conclusions from those tests may not be trustworthy
  - It can also help us detect skewness, fat tails, etc.
EXAMPLE:
import statsmodels.api as sm
_, ax = plt.subplots(figsize=(2.5, 2.5))
sm.qqplot(titanic['Age'][titanic['Age'].notnull()], line='45', loc=titanic['Age'].mean(), scale=titanic['Age'].std(), ax=ax);
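SKETCH (not part of the original card): the points of a normal QQ plot can be computed by hand. The simulated sample and the plotting positions (i - 0.5) / n are assumptions for this illustration; statsmodels may use a slightly different convention.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
sample = rng.normal(loc=30, scale=14, size=500)   # hypothetical "age-like" data

n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n           # plotting positions in (0, 1)

# Reference normal distribution with the sample's own mean and standard deviation
reference = NormalDist(mu=float(sample.mean()), sigma=float(sample.std(ddof=1)))
theoretical_q = np.array([reference.inv_cdf(float(p)) for p in probs])
sample_q = np.sort(sample)

# For normally distributed data the points (theoretical_q, sample_q) lie close to y = x
corr = np.corrcoef(theoretical_q, sample_q)[0, 1]
max_gap = np.abs(sample_q - theoretical_q).max()
print(round(float(corr), 4), round(float(max_gap), 2))
```

Plotting `sample_q` against `theoretical_q` gives the QQ plot; skewed or fat-tailed data would bend away from the identity line, mostly at the ends.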
38
What modules should you import for preparation of data visualisation?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
sns.set_style("darkgrid")
matplotlib.rcParams['figure.figsize'] = (4, 2.5)
39
How can you load and prepare the Titanic EXAMPLE?
# update the type of data
titanic = pd.read_csv('data/titanic/train.csv')
titanic['Survived'] = pd.Categorical.from_codes(titanic.Survived, ['not survived', 'survived'])
titanic['Sex'] = titanic['Sex'].astype('category')
titanic['Pclass'] = pd.Categorical(titanic['Pclass'], ordered=True)  # ordinal
# select variables we are interested in
titanic = titanic[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]
40
What is a contingency table and how do you attain it for 'Pclass' and 'Survived' in the Titanic EXAMPLE?
A contingency table displays the multivariate frequency distribution of the variables.
* For two categorical variables, the column headers match the levels of one variable, and the row headers match the levels of the other variable
* It provides a basic picture of the interrelation between two variables and can help find interactions between them
* EXAMPLE (compare the number of survived/not survived passengers of the Titanic across ticket classes):
survived = pd.crosstab(titanic['Pclass'], titanic['Survived'])
survived
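SKETCH (not part of the original card): raw counts are hard to compare when the groups differ in size, so it can help to add totals or row proportions. The toy data below is made up to stand in for the Titanic columns; `margins` and `normalize` are existing `pd.crosstab` parameters.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": ["survived", "survived", "not survived", "survived", "not survived",
                 "not survived", "not survived", "not survived", "survived", "not survived"],
})

# Counts with row and column totals
counts = pd.crosstab(toy["Pclass"], toy["Survived"], margins=True)

# Proportions within each ticket class (each row sums to 1)
row_share = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(row_share.round(2))
```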
41
How do you attain bullet points in Markdown?
Start the line with * followed by a space
42
What is a side-by-side bar chart and how do you attain it?
A side-by-side bar chart provides simultaneous comparison of the distributions of a categorical variable "conditioning" on another categorical variable.
* It allows you to compare each subgroup directly
* EXAMPLE (compare the number of survived/not survived passengers of the Titanic across ticket classes):
survived.plot.bar(figsize=(6, 2), rot=0);
43
What is a stacked bar chart, and how do you attain it?
A stacked bar chart is another way to simultaneously compare the distribution of a categorical variable conditioning on another categorical variable.
* Compared with a side-by-side bar chart, it focuses more on the part-to-whole relation
* EXAMPLE (compare the number of survived/not survived passengers of the Titanic across ticket classes):
survived.plot.bar(stacked=True, figsize=(6, 2), rot=0);
NOTE: Here, we can compare the distribution of ticket class as in a normal bar chart, but we can also see how the distribution of survival differs conditioning on the ticket class
44
How do we explore relationships between quantitative and categorical variables?
By comparing the descriptive statistics of the quantitative variable across the different groups of the categorical variable.
* EXAMPLE (age (quantitative variable) vs ticket class (categorical variable)):
titanic[['Age', 'Pclass']].groupby('Pclass', observed=True) \
    .agg(['min', 'mean', 'median', 'max']).T.round(1)
45
What are overlaid histograms and density curves?
Overlaid histograms and density curves are possible ways to compare the distributions of different quantitative variables (or how a variable differs across specific groups).
* Superposition: multiple plots drawn on top of each other
* EXAMPLE (compare the age of passengers in different classes):
fig, ax = plt.subplots(1, 2, figsize=(15, 3))
sns.histplot(data=titanic, x='Age', hue='Pclass', ax=ax[0], stat='density')
sns.kdeplot(data=titanic, x='Age', hue='Pclass', ax=ax[1]);
46
What are some potential issues with overlapping histograms and density plots?
* Overlapping histograms can be difficult to read
* Overlapping density plots are easier to read, but they also become difficult to read when there are more categories
47
What are alternatives to overlapping histograms and density plots for exploring relationships between quantitative and categorical variables?
ALTERNATIVE 1: Plot multiple histograms and/or density curves that share the same axes.
* Juxtaposition: multiple plots with the same scale, displayed side by side
* EXAMPLE (compare the age of passengers from different classes):
_, ax = plt.subplots(3, 2, figsize=(8, 2.5), sharex=True, sharey=True); plt.tight_layout()
for i in range(3):
    sns.histplot(data=titanic[titanic.Pclass == i+1], x='Age', ax=ax[i, 0], stat='density')
    sns.kdeplot(data=titanic[titanic.Pclass == i+1], x='Age', ax=ax[i, 1])
    ax[i, 0].set(title=f'class = {i+1}'); ax[i, 1].set(title=f'class = {i+1}');
ALTERNATIVE 2: It may be better to compare the distributions using side-by-side box plots or violin plots.
* EXAMPLE (age vs ticket class):
  - 1 quantitative variable (age on the y-axis)
  - 1 categorical variable (ticket class on the x-axis)
_, ax = plt.subplots(1, 2, figsize=(10, 1.5), sharey=True)
sns.boxplot(data=titanic, y='Age', x='Pclass', ax=ax[0])
sns.violinplot(data=titanic, y='Age', x='Pclass', ax=ax[1]);
48
What is the advantage of side-by-side box plots?
* The (over)simplified visualisation provided by a box plot makes it useful for comparing a quantitative variable across groups and seeing the relationship between a quantitative variable and a categorical variable
* It highlights the range, quartiles, median and any outliers present in the data for each group
49
What is a split violin plot?
A split violin plot displays the distributions of 2 groups on the two sides of the same violin.
* This allows us to explore the relations between 3 variables
  - 1 quantitative variable (y-axis)
  - 2 categorical variables (x-axis and the two sides of each violin)
* EXAMPLE (age vs ticket class and gender of the Titanic's passengers):
sns.catplot(data=titanic, x='Pclass', y='Age', kind='violin', hue='Sex', split=True, height=3, aspect=1.5);
50
What are ways to explore relationships between 2 quantitative variables?
1. Scatter plot
2. Hex plot
3. Contour plot
4. Scatter matrix
51
What is a scatter plot and how do you attain it?
Scatter plots are used to reveal relationships between pairs of quantitative variables.
* They use Cartesian coordinates to display the values of (typically) two variables for a set of data
* EXAMPLE:
x = [1, 2, 4, 4, 3, 2, 5]; y = [0, 2, 5, 3, 2, 1, 4]
_, ax = plt.subplots(figsize=(2.5, 2.5))
g = sns.scatterplot(x=x, y=y, ax=ax)
g.set(xlabel='x', ylabel='y');
52
What relationships can scatter plots help to identify?
A scatter plot helps us find out whether there is a relationship, and what type of relationship it is (linear, non-linear, unequal spread).
EXAMPLE1 (simulated data):
import numpy.random as rn
_, ax = plt.subplots(1, 4, figsize=(15, 4))
n = 300; x = rn.randn(n)
sns.scatterplot(x=x, y=rn.normal(scale=0.5, size=n), ax=ax[0]).set_title('no relation')
sns.scatterplot(x=x, y=x+rn.normal(scale=0.5, size=n), ax=ax[1]).set_title('linear')
sns.scatterplot(x=x, y=x**2+rn.normal(scale=0.5, size=n), ax=ax[2]).set_title('non-linear')
sns.scatterplot(x=x, y=x+x*rn.normal(scale=0.3, size=n), ax=ax[3]).set_title('unequal spread');
EXAMPLE2 (see if there is any relation between mpg and other variables like displacement, acceleration and weight in the auto dataset):
auto = pd.read_csv('data/auto-mpg.csv')
auto['origin'] = auto['origin'].astype('category')
_, ax = plt.subplots(1, 3, figsize=(10, 3))
sns.scatterplot(data=auto, x='displacement', y='mpg', ax=ax[0])
sns.scatterplot(data=auto, x='acceleration', y='mpg', ax=ax[1])
sns.scatterplot(data=auto, x='weight', y='mpg', ax=ax[2]);
53
What is a scatter plot with marginal density?
We can draw a scatter plot with marginal histograms of the two quantitative variables.
EXAMPLE:
g = sns.jointplot(data=auto, x='displacement', y='mpg')
g.fig.set_size_inches(3, 3);
54
What are possible issues with scatter plots?
* Like rug plots, scatter plots can be subject to overplotting
* One possible solution: smaller markers
* Another solution: a 2D "histogram" or density plot
* NOTE: if we only want to see the relations between variables, overplotting is not necessarily an issue
55
What is a hexplot and how do you attain it?
A hex plot is a tool to visualise the joint distribution of two variables. It divides the plane into regular hexagons, counts the number of observations that fall into each hexagon, and then maps the count to the hexagon's fill colour.
* Can be thought of as a two-dimensional histogram
* More heavily shaded hexagons typically indicate a greater density/frequency
EXAMPLE:
_, ax = plt.subplots(figsize=(3, 2.5))
auto.plot.hexbin(x="displacement", y="mpg", gridsize=10, ax=ax);
56
How can the use of hexagon bins help?
The use of hexagon bins can:
* Avoid the visual artefacts sometimes generated by the very regular alignment of grids
* Visualise relations better
EXAMPLE (a 2D histogram with rectangular bins, for comparison):
g = sns.displot(auto, x="displacement", y="mpg", cbar=True)
g.fig.set_size_inches(3, 2.5);
57
What are contour plots (density plots) and how do you attain it?
Contour plots are two-dimensional density plots.
EXAMPLE:
sns.jointplot(data=auto, x='displacement', y='mpg', kind='kde', fill=True, height=3);
58
What is a scatter matrix and how do you attain it?
* You may want to use a scatter matrix to visualise the relations between all pairs of quantitative variables in multivariate data
* It consists of:
  - Histograms / KDE plots on the diagonal to visualise the marginal distribution of each variable
  - Scatter plots for all possible pairs
* It is a convenient visualisation for quick exploratory data analysis
EXAMPLE1 (with marginal density plots):
g = sns.pairplot(auto[['mpg', 'displacement', 'weight']], diag_kind='kde', plot_kws={"marker": '+'})
g.fig.set_size_inches(3, 3);
EXAMPLE2 (lower triangle only):
g = sns.pairplot(auto[['mpg', 'displacement', 'weight']], corner=True, plot_kws={"marker": '+'})
g.fig.set_size_inches(3, 3);
59
Is it possible to add more dimensions to a scatter plot?
It is possible to show more than 2 variables on a scatter plot, by encoding the other variables with different colours, sizes and styles of the markers:
* Quantitative: marker colour, marker size
* Categorical: marker colour, marker style
* NOTE: While in principle you can show 5 variables in one scatter plot, use this with care as the plot can be difficult to read.
* EXAMPLE1 (acceleration vs mpg, with marker colour representing the number of cylinders):
plt.subplots(figsize=(10, 3))
sns.scatterplot(data=auto, x='acceleration', y='mpg', hue='cylinders', palette='flare');
* EXAMPLE2 (five variables in one scatter plot):
plt.subplots(figsize=(10, 3))
sns.scatterplot(data=auto, x='acceleration', y='mpg', hue='cylinders', style='origin', size='weight', palette='flare');
60
What are the different ways you can show trend relationships between variables?
1. Line plot
2. Moving average plot
3. Stacked area graph
4. Multiple lines in one graph
5. Different y-axes
6. Line plot (juxtaposition)
61
What is a line plot and how do you attain it?
A line plot is a type of chart which displays ordered data connected by straight line segments.
* Useful for sequential data to find patterns, trends, seasonality, changes and anomalies
* For time series data, we often plot the data against time to see if there is any trend
* EXAMPLE (coronavirus cases in the UK):
cases = pd.read_csv('data/cases.csv', usecols=[3, 4], parse_dates=[0], index_col=0)
cases.columns = ['cases']
cases = cases.sort_index()  # sort_index returns a new object, so assign it; keeps the dates in chronological order
cases.plot(figsize=(8, 1.5), legend=False, ylabel='cases');
62
What is a moving average plot and how do you attain it?
A moving average plot shows a rolling mean of the data instead of the raw values. To visualise the overall trend better, you may want to remove the periodic fluctuations around the trend.
* EXAMPLE (the number of COVID cases reported is higher on weekdays than at weekends):
cases['weekday'] = cases.index.weekday
cases.groupby('weekday').mean().plot.bar(ylabel='average number of cases', legend=False, figsize=(5, 2.5))
plt.xticks(range(7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.tick_params(axis='x', rotation=0);
* By using a 7-day rolling mean, we can remove this weekly fluctuation and visualise the overall trend better:
plt.subplots(figsize=(8, 3))
cases['cases'].rolling(7, center=True).mean().plot(ylabel='cases');
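SKETCH (not part of the original card): why a 7-day window works can be seen on synthetic data. The series below is made up as a linear trend plus a weekday effect that sums to zero over a week.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-04", periods=56, freq="D")       # starts on a Monday
trend = np.linspace(100, 210, len(dates))                       # hypothetical linear trend
weekday_effect = np.array([10, 8, 6, 4, 2, -15, -15])           # Mon..Sun, sums to 0
weekly = weekday_effect[dates.weekday.to_numpy()]
cases = pd.Series(trend + weekly, index=dates, name="cases")

rolling = cases.rolling(7, center=True).mean()

# Every 7-day window contains each weekday exactly once, so the weekday effect averages out
recovered = rolling.dropna()
print(np.allclose(recovered.to_numpy(), trend[3:-3]))   # True
print(int(rolling.isna().sum()))                        # 6 (3 days at each end have no full window)
```

A window length that is not a multiple of 7 would leave part of the weekly pattern in the smoothed series.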
63
What is a stacked area graph?
A stacked area graph visualises how a quantitative variable of each group changes over time.
* The value of each group at each time point is represented by the height of its band
* EXAMPLE (number of COVID cases in France, Spain and Italy):
import seaborn.objects as so
eu_cases = pd.read_csv('data/cases_eu.csv', usecols=[0, 4, 5, 6], parse_dates=[0], dayfirst=True, index_col=0)
eu_cases.columns = ['cases', 'deaths', 'country']; eu_cases.index.name = 'date'
eu_cases = eu_cases[(eu_cases['country'].isin(['France', 'Spain', 'Italy'])) &
                    (eu_cases.index >= pd.to_datetime('2021-09-01'))]
p = so.Plot(eu_cases, "date", "cases", color='country')
p.add(so.Area(alpha=1), so.Stack()).layout(size=(15, 3))
64
What are multiple lines in one graph (trends) and how do you attain it?
Multiple lines in one graph can be used to visualise how a quantitative variable of each group changes over time.
* EXAMPLE:
p = so.Plot(eu_cases, "date", "cases", color='country')
p.add(so.Line(), so.Agg()).layout(size=(15, 4))
65
What are issues with multiple lines in one graph and possible solutions?
ISSUE: When the scales of the variables are different, it is difficult to compare the lines when they are plotted on the same graph.
* EXAMPLE: number of COVID cases vs deaths in France
SOLUTION1: Plot with different y-axes
* One possible solution is to use different y-axes for the two variables
* But use with care - can be misleading
* EXAMPLE:
ax_1 = eu_cases.loc[eu_cases.country == 'France', 'cases'].plot(figsize=(8, 4))
eu_cases.loc[eu_cases.country == 'France', 'deaths'].plot(secondary_y=True)
plt.legend(ax_1.get_lines() + ax_1.right_ax.get_lines(), ['cases', 'deaths'])
ax_1.set_ylabel('cases'); ax_1.right_ax.set_ylabel('deaths');
SOLUTION2: Line plot juxtaposition
* Possible solution: side-by-side plots with the same x-axis
* EXAMPLE:
ax = eu_cases[eu_cases.country == 'France'].plot(figsize=(8, 3), subplots=True, legend=False)
ax[0].set_ylabel('cases'); ax[1].set_ylabel('deaths');