W4 Flashcards
What are the relevant visualisation libraries used and how do you import them?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
What are the 4 potential types of visualisations to consider?
- Distribution: how a variable in the dataset distributes over a range of possible values (histograms)
- Comparison: how multiple variables compare (boxplots)
- Relationship: how the values of variables in the dataset relate (scatterplots)
- Trend: how values evolve over time (time-series related)
Titanic EXAMPLE: What preprocessing lets us 1. change the type of the data and 2. select the variables we are interested in?
Open the data
titanic = pd.read_csv('data/titanic/train.csv')
# map the 0/1 codes in Survived to readable labels
titanic['Survived'] = pd.Categorical.from_codes(titanic['Survived'], ['not survived', 'survived'])
titanic['Sex'] = titanic['Sex'].astype('category')
# make Pclass ordinal (an ordered categorical)
titanic['Pclass'] = pd.Categorical(titanic['Pclass'], ordered=True)
# keep only the variables we are interested in
titanic = titanic[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]
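To see what these type conversions do, here is a minimal sketch on a tiny hand-made table. The values are made up for illustration and are not the Titanic data; the sketch only assumes, as the code above does, that survival is coded as 0/1.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic file
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'male'],
})

# from_codes maps integer codes to labels: 0 -> 'not survived', 1 -> 'survived'
toy['Survived'] = pd.Categorical.from_codes(toy['Survived'], ['not survived', 'survived'])

# astype('category') gives a nominal (unordered) categorical
toy['Sex'] = toy['Sex'].astype('category')

# ordered=True gives an ordinal categorical, so order comparisons are allowed
toy['Pclass'] = pd.Categorical(toy['Pclass'], ordered=True)

print(toy.dtypes)
print(toy['Survived'].tolist())
print((toy['Pclass'] < 3).tolist())  # which passengers are in class 1 or 2
```

The order comparison on Pclass only works because the column was declared as ordered; the same comparison on the unordered Sex column would raise an error.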
Titanic EXAMPLE: How do you obtain descriptive statistics for the data set?
titanic.describe(include='all')
Titanic EXAMPLE: How can you learn about the distribution of age of the passengers by looking at the raw data and by visualising the same data using a histogram?
print(titanic['Age'].to_list()[:300])
plt.subplots(figsize=(15, 3))
sns.histplot(titanic['Age']);
What is a motivation for reviewing data sets before analysing them?
- Looking at data graphically before analysing them can reveal unusual patterns that the descriptive statistics would never lead us to expect
- Inadequacy of basic statistics for describing datasets
EXAMPLE: In Anscombe's quartet, all 4 data sets give you nearly the same mean, standard deviation and correlation, and even nearly the same fitted linear regression line. Yet they have very different distributions and look very different when graphed
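A minimal sketch of the Anscombe's quartet point, with the published values typed in by hand (the variable names are my own). It prints the summary statistics and the least-squares line for each of the four data sets; they come out nearly identical even though the scatterplots look nothing alike.

```python
import numpy as np

# Anscombe's quartet: data sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    summary[name] = {
        'mean_y': y.mean(),
        'std_y': y.std(ddof=1),             # sample standard deviation
        'corr': np.corrcoef(x, y)[0, 1],
        'slope': slope,
        'intercept': intercept,
    }
    print(name, {k: round(float(v), 2) for k, v in summary[name].items()})
```

Plotting each pair with a scatterplot is what exposes the differences (a curve, an outlier, a single high-leverage point) that these numbers hide.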
Why do we use visualisation?
- We tend to see patterns/structure of data much more easily by visual means than looking at raw numbers
- Descriptive statistics may not be adequate for us to understand the data
- Identify hidden, unexpected patterns and trends
- Visualisation complements statistics. Both descriptive statistics and visualisation should be used to help us to understand the data
What are the 2 main goals of visualisation?
- Exploratory: Understand your data
* Key part of exploratory data analysis (EDA)
* Evaluate model performance
* Audience: yourself
* Tool you use while thinking - no need to worry too much about the formatting, etc.
- Explanatory: Communicate results to others
* Explain and inform
* Provide evidence and support
* Audience: others
* Tool you use to influence and persuade - highly editorial and selective
What is EDA and how does it differ from IDA?
Exploratory data analysis (EDA) is an approach to analysing datasets to:
* Summarise their main characteristics, often by visualising the data or some summary statistics
* Understand the data beyond the formal modelling or hypothesis testing
EDA is different from initial data analysis (IDA)
* IDA: Process of data inspection - check the quality of data, handle issues with the data, etc.
EDA is a critical first step for data analysis, followed by formal (confirmatory) data analysis.
What are the objectives of EDA?
- Enable unexpected discoveries in the data
- Discover relationships among variables
- Suggest hypotheses about the causes of observed phenomena
- Preliminary selection of appropriate statistical tools, techniques and models
- Assess assumptions on which statistical inference will be based
What are the types of EDA?
- Graphical vs non-graphical
* Last week: non-graphical, descriptive statistics
- Univariate vs multivariate
* Univariate: look at only 1 variable at a time
- For tabular data, it can be only looking at one column
* Multivariate: look at two or more variables at a time to explore relationships
RECAP: What are the different descriptive statistics?
Central tendency
* Mean
* Median
* Mode
Spread/Dispersion
* Range
* IQR
* Standard deviation
Relations
* Correlation
* Cross tabulation/contingency table
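As a quick refresher, the statistics listed above can be computed with pandas roughly as follows. This is a sketch on small made-up samples (all values and variable names are hypothetical), not on the Titanic data.

```python
import pandas as pd

values = pd.Series([0, 2, 0, 0, 3, 0, 1, 0, 5, 10])  # hypothetical sample

# Central tendency
print('mean  :', values.mean())
print('median:', values.median())
print('mode  :', values.mode().tolist())  # mode() returns all tied modes

# Spread / dispersion
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)
print('range :', value_range)
print('IQR   :', iqr)
print('std   :', values.std())  # sample standard deviation (ddof=1)

# Relations: correlation between two quantitative variables
other = pd.Series([1, 3, 0, 1, 4, 0, 2, 1, 6, 9])  # hypothetical second variable
print('corr  :', values.corr(other))

# Relations: cross tabulation of two categorical variables
sex = pd.Series(['male', 'female', 'female', 'male', 'female'], name='Sex')
outcome = pd.Series(['not survived', 'survived', 'survived',
                     'not survived', 'not survived'], name='Survived')
table = pd.crosstab(sex, outcome)
print(table)
```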
What are some plots/distributions for univariate variables?
Categorical data:
* Bar chart
* Pie chart
Quantitative data:
* Rug plot
* Histogram and density plot
* Box plot and violin plot
For univariate categorical variables, how can we display the distribution by tabulation?
EXAMPLE: The gender of the Titanic’s passengers
titanic_gender_data = titanic.value_counts('Sex')
titanic_gender_data.to_frame()
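A tabulation can also report proportions next to the raw counts. The sketch below uses a small made-up column (hypothetical values, not the Titanic data) and the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Hypothetical categorical column
sex = pd.Series(['male', 'female', 'male', 'male', 'female'], name='Sex')

counts = sex.value_counts()                     # absolute frequencies
proportions = sex.value_counts(normalize=True)  # relative frequencies, sum to 1

# Combine both into one table; pandas aligns the rows on the category labels
table = pd.DataFrame({'count': counts, 'proportion': proportions})
print(table)
```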
What is a bar chart and how do you create it?
A bar chart is commonly used to display the distribution of a univariate categorical variable (and quantitative data with only very few distinct values).
- The rectangular bars have heights or lengths proportional to the number of observations for the corresponding category
- The bar width does not represent any property of the data
EXAMPLE: Gender of the Titanic's passengers:
titanic_gender_data.plot.bar(ylabel='number of passengers', rot=0, figsize=(3, 1.5));
What is a Pie Chart and how do you create it?
A pie chart is another commonly used chart to show the distribution of a univariate categorical variable.
- The arc length of each slice (or its angle and area) is proportional to the number of observations for the corresponding category
EXAMPLE: Gender of the Titanic's passengers:
titanic_gender_data.plot(kind='pie', legend=False, autopct=lambda p: f'{p:.2f}%', textprops={'fontsize': 11}, ylabel='', figsize=(4, 2.5));
What is a Rug Plot and how do you create it?
A rug plot visualises the distribution of a univariate quantitative variable by simply mapping each data value to a location on an axis
- It helps to show the distribution of a single quantitative variable
- It shows every value
- Note the y-axis does not represent any property of the data
EXAMPLE:
rug_data = [0, 2, 0, 0, 3, 0, 1, 0, 5, 10]
plt.subplots(figsize=(15, 0.5))
g = sns.rugplot(rug_data, height=1)
g.set(ylim=(0, 1), yticks=[]);
What are possible issues with the Rug Plot?
- Too much detail: there is no need to know each value
- Overplotting: we cannot tell how many observations each mark represents
What is an alternative to a rug plot?
We could use a spike plot, which maps the data to positions on the x-axis, with the height of each spike representing the frequency of occurrences of the corresponding value.
- But like a rug plot, a spike plot can still have too much detail, which makes it difficult to generalise and interpret the graph.
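A spike plot can be sketched by counting how often each value occurs and drawing one vertical line per distinct value. This is one possible way to do it, using `collections.Counter` and matplotlib's `vlines` on the same `rug_data` as the rug plot example; the figure size and axis labels are my own choices.

```python
from collections import Counter

import matplotlib.pyplot as plt

rug_data = [0, 2, 0, 0, 3, 0, 1, 0, 5, 10]

# Count the occurrences of each distinct value
counts = Counter(rug_data)
values = sorted(counts)
frequencies = [counts[v] for v in values]

# One spike per distinct value; the spike height is the frequency
fig, ax = plt.subplots(figsize=(15, 1.5))
ax.vlines(values, 0, frequencies)
ax.set(xlabel='value', ylabel='count', ylim=(0, max(frequencies) + 1));
```

Unlike the rug plot, the y-axis now carries information: the five zeros appear as a single spike of height 5 instead of five overlapping marks.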
What is a histogram and how do you create it?
A histogram is another type of plot used to show the distribution of a univariate (i.e. single) quantitative variable.
- Like a spike plot, it can use height to represent frequency
- Unlike a spike plot, instead of counting (and plotting) the occurrences of each value, it groups values into intervals (“bins”) and counts how many observations fall into each bin
- We lose details but see the big picture
EXAMPLE:
plt.subplots(figsize=(15, 2.5))
heights, bins, _ = plt.hist(titanic['Age'], bins=list(range(0, 100, 10)))
plt.ylabel('count'); plt.xlabel('age');
What does the height of the histogram rectangles represent when the bins are of equal size?
When the bins are of equal size, the height of the rectangles represents the frequency (absolute/relative) of the values in the corresponding bin.
- Verify that the height equals the number of values in each bin for our example above:
- Height of the rectangles:
-> EXAMPLE: heights
- Number of observations inside each interval, where every interval is [left, right) except the last one, which is [left, right]:
-> EXAMPLE: [sum(titanic['Age'].between(bins[i], bins[i+1], inclusive='left')) for i in range(len(bins)-1)]
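The bin-edge convention can be checked on a small made-up sample (the ages below are hypothetical). The sketch uses `np.histogram`, which follows the same half-open convention, and compares it with a manual count that closes only the last interval on the right.

```python
import numpy as np

ages = np.array([5, 10, 19, 20, 20, 35, 40])  # hypothetical values
edges = [0, 10, 20, 30, 40]

# Bin counts as a histogram would compute them
counts, _ = np.histogram(ages, bins=edges)

# Manual count: [left, right) for every bin, [left, right] for the last one
manual = []
for i in range(len(edges) - 1):
    left, right = edges[i], edges[i + 1]
    is_last = i == len(edges) - 2
    if is_last:
        inside = (ages >= left) & (ages <= right)
    else:
        inside = (ages >= left) & (ages < right)
    manual.append(int(inside.sum()))

print(counts.tolist())
print(manual)
```

Note how the value 10 is counted in the second bin, not the first, while the value 40 is still counted in the last bin.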
Why may you want to normalise a histogram and how do you normalise it?
A histogram can be “normalised” to display the relative frequency, so that the sum of the areas of all rectangles is 1.
- Can be considered as an approximate representation of the probability density of the data
- Area of each rectangle represents the empirical probability of observations in the corresponding interval (bin) indicated by the x axis
- EXAMPLE (age of the Titanic's passengers):
plt.subplots(figsize=(15, 2.5))
heights, bins, _ = plt.hist(titanic['Age'], bins=list(range(0, 100, 10)), density=True)
plt.ylabel('density'); plt.xlabel('age');
NOTE1: You can verify that the sum of the areas of all rectangles is 1 via: sum(heights*10)
NOTE2: To check the proportion of the distribution in a range (e.g. the proportion of passengers with 20 <= age < 40): heights[2]*10 + heights[3]*10
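The two notes can be tried out on a small made-up sample (hypothetical ages, not the Titanic data). With `density=True`, each height is the bin's relative frequency divided by the bin width, so height times width gives back the proportion of observations in that bin.

```python
import numpy as np

ages = np.array([5, 10, 19, 20, 20, 35, 40, 62])  # hypothetical values
edges = np.arange(0, 80, 10)                       # bin edges 0, 10, ..., 70

heights, edges = np.histogram(ages, bins=edges, density=True)
widths = np.diff(edges)  # every bin is 10 wide here

# NOTE1: the total area of all rectangles is 1
total_area = (heights * widths).sum()

# NOTE2: proportion with 20 <= age < 40, i.e. bins [20, 30) and [30, 40)
share_20_40 = (heights[2:4] * widths[2:4]).sum()

print(total_area)
print(share_20_40)  # 3 of the 8 values fall in this range
```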
How can you select an appropriate number of bins?
- You may need to see the shape of the distribution (e.g. via histogram) to decide an appropriate number of bins. So it can be an iterative process
- Often the default choice is quite good, but there is no guarantee
- EXAMPLE:
titanic['Age'].hist(figsize=(5, 2));
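Besides trial and error, NumPy offers a few rule-of-thumb bin estimators that can serve as a starting point. The sketch below is not part of the original notes: it runs `np.histogram_bin_edges` on a synthetic sample (randomly generated, not the Titanic ages) and compares how many bins each rule suggests.

```python
import numpy as np

# Synthetic sample for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=30, scale=14, size=500)

# Number of bins suggested by a few built-in rules
n_bins = {}
for rule in ['sturges', 'fd', 'auto']:
    edges = np.histogram_bin_edges(sample, bins=rule)
    n_bins[rule] = len(edges) - 1
    print(rule, n_bins[rule])
```

The same rule names can be passed as the `bins` argument of `plt.hist`. The rules can disagree, so it is still worth looking at the resulting histogram before settling on one.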
What affects how the estimated probability density of a histogram looks?
- The number of bins affects how the estimated probability density looks
- The higher the number of bins, the more detailed the histogram is
* Beware of drawing strong conclusions from the look of a histogram
EXAMPLE:
_, ax = plt.subplots(1, 4, figsize=(20, 5))
for i, bins in enumerate(np.array([1, 4, 12, 100])):
    ax[i].hist(titanic['Age'], bins=bins, density=True)
    ax[i].set_title(f'number of bins = {bins}')