W5 Flashcards

1
Q

What are the relevant libraries to import for visualisation?

A

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import seaborn as sns

import numpy as np
import numpy.random as nr
from datetime import datetime

# Set the default size of plots
import matplotlib
matplotlib.rcParams['figure.figsize'] = (4, 2.5)

2
Q

How can you produce a bar chart and a pie chart of the survival counts in the Titanic EXAMPLE?

A

Different ways to visualise the same data (i.e. bar chart and pie chart)

survive_count = titanic.value_counts('Survived')
_, ax = plt.subplots(ncols=2, figsize=(8, 2))
survive_count.plot.bar(ylabel='count', rot=0, ax=ax[0], title='Bar chart')
survive_count.plot.pie(ylabel='', ax=ax[1], title='Pie chart');

3
Q

What are the different ways to draw a scatter plot of 3 variables, with the third shown in colour, in the auto EXAMPLE?

A

_, ax = plt.subplots(ncols=3, figsize=(10, 2), sharey=True)
sc_m = ax[0].scatter(x=auto.displacement, y=auto.mpg, c=auto.weight, s=5)
plt.subplots_adjust(wspace=0.3); plt.colorbar(sc_m)
auto.plot.scatter(x='displacement', y='mpg', c='weight', s=5, ax=ax[1], title='pandas default')
sc_s = sns.scatterplot(auto, x='displacement', y='mpg', hue='weight', s=5, ax=ax[2])
sns.move_legend(sc_s, "upper left", bbox_to_anchor=(1, 1))
ax[0].set_title('matplotlib default'); ax[2].set_title('seaborn default');

4
Q

What are the different ways we can map data into visual properties when visualising data?

A
  1. Length or height
    * i.e. whether the marks run vertically or horizontally
  2. Position
    * i.e. where a mark sits along a scale
  3. Area
    * i.e. how much of the graph space a mark covers
  4. Angle/area
  5. Line weight
  6. Hue and shade
5
Q

What do we want to achieve through visualisation and what are the considerations?

A

Take advantage of the human visual system to
* Understand data and extract information
* Communicate

Considerations
* Correctness
* Effectiveness: e.g. match human perception

Through visualisation, we encode the data into plots. Understanding the data through the plots relies on the audience's ability to decode the plots correctly.

6
Q

What are marks and channels?

A

Marks and channels are building blocks for visual encoding.

  • Marks: geometric primitives
  • Channels: control the appearance of marks based on attributes
7
Q

What do marks do, and what are some examples?

A

Marks represent items or links. For now, we only consider items.

Basic geometric elements, classified according to the number of dimensions

EXAMPLES:
* Points (zero-dimensional)
* Lines (one-dimensional)
* Areas (two-dimensional)

_, ax = plt.subplots(ncols=3, figsize=(11, 3)); plt.subplots_adjust(wspace=0.3)
auto.plot.scatter(x='displacement', y='mpg', ax=ax[0], s=5, title='points (0d)')
survive_count.plot.bar(ylabel='count', rot=0, ax=ax[1], width=0.1, title='lines (1d)')
survive_count.plot.pie(ylabel='', ax=ax[2], title='areas (2d)');

8
Q

What do channels do, what are the 2 types, and what are some examples?

A

Channels (or visual variables) control the appearance of marks, proportional to / based on attributes

2 types of channels:

  1. Identity channels: what something is
    * E.g. shape, hue of colours, spatial region
  2. Magnitude channels: ordered attributes
    * E.g. position, length, area, angle (or tilt), lightness of colours

EXAMPLES:
1. Position (horizontal, vertical, both)
2. Shape (triangle, star, line, right angle)
3. Size (length, area, volume)
4. Colour
5. Tilt
6. Volume

9
Q

What are the 2 principles of the use of visual channels?

A
  1. Expressiveness principle
    * Visual encoding should express all of, and only, the information in the dataset attributes
  2. Effectiveness principle
    * The importance of the attribute should match the salience of the channel
    - i.e. the most important attributes should be encoded with the most effective channels
10
Q

What is the expressiveness principle, and what is a counterexample (with EXAMPLE code)?

A
  1. Expressiveness Principle: Visual encoding should express all of, and only, the information in the dataset attributes
    * Magnitude channels: Quantitative and ordinal data
    * Identity channels: Categorical attributes

Counterexample (using an identity channel, here marker shape, for a quantitative variable):

EXAMPLE:
g = sns.scatterplot(auto, x='displacement', y='mpg', style='cylinders')
sns.move_legend(g, "upper left", bbox_to_anchor=(1, 1));
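
For contrast, a magnitude channel keeps the order of the cylinder counts visible. The following is only a sketch: `demo` is a hypothetical stand-in for the auto data with made-up values, and it uses plain matplotlib with the lightness of the single-hue 'Blues' colourmap as the channel.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the auto dataset; the values are made up.
demo = pd.DataFrame({
    'displacement': [97, 120, 200, 250, 318, 400],
    'mpg': [32, 27, 21, 19, 16, 13],
    'cylinders': [4, 4, 6, 6, 8, 8],
})

_, ax = plt.subplots(figsize=(4, 2.5))
# Lightness of a single-hue sequential colourmap is a magnitude channel,
# so the order 4 < 6 < 8 stays readable in the plot.
sc = ax.scatter(demo['displacement'], demo['mpg'], c=demo['cylinders'],
                cmap='Blues', edgecolors='grey')
plt.colorbar(sc, label='cylinders')
ax.set_xlabel('displacement'); ax.set_ylabel('mpg')
```

With marker shape there is no natural way to read "8 is more than 4"; with lightness, darker points read as more cylinders.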

11
Q

What are the magnitude and identity channels, ranked by effectiveness?

A

A. Magnitude channels (ordered attributes):
1. Position on common scale
2. Position on unaligned scale
3. Length (1D size)
4. Tilt/angle
5. Area (2D size)
6. Depth (3D position)
7. Colour luminance
8. Colour Saturation
9. Curvature
10. Volume (3D Size)

B. Identity channels (categorical attributes):
1. Spatial region
2. Colour hue
3. Motion
4. Shape

12
Q

What factors contribute to the effectiveness of a channel?

A
  1. Accuracy: the capability to estimate the magnitude of the data encoded
  2. Discriminability: the capability to distinguish items as intended (this quantifies the number of bins available for visual encoding)
    * EXAMPLE: When using different lightnesses of green to represent different categories, it is quite difficult to distinguish which is which, as the number of bins available when using lightness as a channel is limited.
  3. Separability: whether multiple channels can be combined without interfering with one another
13
Q

What are the rankings of accuracy across different channels?

A

Ranking of enabling accurate estimates:
1. Position along common scales
2. Position along identical, nonaligned scales
3. Length
4. Direction/slope
5. Angle
6. Area
7. Volume
8. Shading and saturation
9. Colour hue

14
Q

What are some different ways we can interpret separability?

A

a) Fully Separable
- Position + hue (colour)

b) Some interference
- Size + hue (colour)

c) Some/significant interference
- Width + Height

15
Q

What is proportional judgement, and how do visual encodings rank from least to most error-prone?

A

Proportional judgement is the ability to recognise and distinguish proportions in data/graphs.

Ranked from least to most error-prone for proportional judgements:
1. Positions
2. Angles
3. Circular Areas
4. Rectangular Areas

16
Q

What are the capabilities and limitations of human perception when interpreting graphs?

A

The human perceptual system is fundamentally based on relative judgements, not absolute ones. Our perception of colour and luminance is contextual, based on the contrast with surrounding colours.

Our visual system evolved to provide colour constancy, so that the same surface is identifiable across a broad set of illumination conditions, even though a physical light meter would yield very different readings.

  1. Comparing lengths (unframed unaligned vs. framed unaligned vs. unframed aligned)
  2. Comparing Luminance
17
Q

What are the key takeaways for selecting channels?

A
  1. Choose the channels that suit your need, which are not necessarily the ones that enable accurate estimates
    * For example, using channels at the bottom half of the scale can be appropriate if the goal is not to enable accurate judgements, but to reveal general patterns
    * One can annotate the plot to help the audience to decode
  2. Multiple graphic forms may enable multiple tasks
    * For example, if you want to show a general impression of the share of each grade and, at the same time, allow readers to compare the number of students in each grade easily and accurately, you may want to use separate charts
18
Q

Should colour be included as a channel?

A
  • Colour can be a redundant channel, and therefore unnecessary
  • Alternative channels can be more effective
  • Colour luminance and colour saturation rank low among the magnitude channels (for ordered attributes), whereas colour hue ranks 2nd among the identity channels (for categorical attributes)
19
Q

What is a colourmap and what are its different types?

A

A colourmap (or colour palette) specifies a mapping between colours and data values.

  • Taxonomy of colourmaps:
    1. Categorical (or qualitative)
    2. Continuous
      * Sequential
      * Diverging
  • It is important to match colourmaps to data type characteristics.
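
To make the taxonomy concrete, here is a small sketch that picks one built-in matplotlib colourmap per type. The particular choices ('tab10', 'Blues', 'RdBu') are illustrative picks, not ones prescribed by these notes.

```python
import matplotlib.pyplot as plt

# One matplotlib colourmap per type in the taxonomy above.
categorical = plt.get_cmap('tab10')  # qualitative: distinct hues, no implied order
sequential = plt.get_cmap('Blues')   # single hue, light to dark, for ordered data
diverging = plt.get_cmap('RdBu')     # two hues meeting at a light midpoint

print(categorical.N)                       # number of distinct colours available
print(sequential(0.0), sequential(1.0))    # RGBA at the low and high ends
print(diverging(0.5))                      # RGBA at the midpoint
```

Calling a colourmap with a number between 0 and 1 returns the RGBA colour at that position, which is how data values end up mapped to colours.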
20
Q

What are the categorical colormaps?

A
  • Often made up of miscellaneous colours: Pastel1, Pastel2, Paired, Accent, Dark2, Set1, Set2, Set3, tab10, tab20,…
  • Suitable for categorical nominal data
  • Ideally, each colour should have the same lightness
  • But this would restrict the number of discriminable colours
  • Note that the number of discriminable colours is limited in noncontiguous small regions.
21
Q

What are the (continuous) sequential colours?

A

A sequential colourmap ranges from a minimum value to a maximum value.

  • The colour changes incrementally in lightness, and possibly saturation, from one end of the range to the other, often using a single hue
  • Good for ordered data
22
Q

What are problems with rainbow colours?

A
  1. Perceptually unordered:
    * No clear “greater than” or “less than” logic to order the colour
    * Hue, which represents the type of colours, may not be appropriate to represent order
  2. Perceptually nonlinear:
    * Steps of the same size at different points in the colourmap range are not perceived equally by our eyes
    * Humans are not very good at perceiving changes in hue
  3. Colour-blind readers may not be able to distinguish red and green
  4. Sharp transitions in colour can mislead:
    * Readers perceive sharp transitions in colour as sharp transitions in the data, even when this is not the case
23
Q

What is a Perceptually uniform sequential colourmap?

A

Perceptually uniform sequential colourmaps are colourmaps that may contain multiple types of colours, but equal steps in data are perceived as equal steps in the colour space in terms of lightness.

  • The default colourmaps of both matplotlib and seaborn for quantitative data are perceptually uniform sequential colourmaps.

_, ax = plt.subplots(ncols=2, figsize=(7, 2.5), sharey=True)
sc_m = ax[0].scatter(x=auto.displacement, y=auto.mpg, c=auto.weight)
ax[0].set_title('matplotlib default'); ax[0].set_ylabel('mpg'); plt.colorbar(sc_m)
sc_s = sns.scatterplot(auto, x='displacement', y='mpg', hue='weight', ax=ax[1])
sns.move_legend(sc_s, "upper left", bbox_to_anchor=(1, 1))
plt.subplots_adjust(wspace=0.3); ax[1].set_title('seaborn default');

24
Q

What is a key difference between Perceptually uniform sequential colourmap and rainbow colourmap?

A

Perceptually uniform sequential colourmaps have reasonable representations in grayscale, whereas rainbow colourmaps may not (i.e. lightness changes steadily across a perceptually uniform sequential colourmap, but not across a rainbow colourmap)
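
One rough way to check the grayscale behaviour is to convert each colour of a colourmap to an approximate gray value and see whether it changes in one direction only. This is a sketch under assumptions: the weighted RGB sum below is only an approximation of perceived lightness, `approx_lightness` is a hypothetical helper name, and 'viridis' and 'jet' are used as representatives of the two kinds of colourmap.

```python
import numpy as np
import matplotlib.pyplot as plt

def approx_lightness(cmap_name, n=256):
    """Rough grayscale value of each colour in a colourmap (weighted RGB sum)."""
    rgb = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))[:, :3]
    return rgb @ np.array([0.2126, 0.7152, 0.0722])

viridis_l = approx_lightness('viridis')  # perceptually uniform sequential
jet_l = approx_lightness('jet')          # rainbow

# True if the gray value never decreases along the colourmap.
viridis_monotonic = bool(np.all(np.diff(viridis_l) >= 0))
jet_monotonic = bool(np.all(np.diff(jet_l) >= 0))
print(viridis_monotonic, jet_monotonic)
```

A colourmap whose gray value rises steadily still conveys order when printed in black and white; one whose gray value goes up and down does not.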

25
Q

What are diverging colourmaps?

A

A diverging colourmap changes in lightness, and possibly saturation, of two different colours that meet in the middle. It is suitable for data with a critical middle value such as zero.

26
Q

What are the colourmaps provided by matplotlib and seaborn?

A

* Matplotlib: https://matplotlib.org/3.5.1/tutorials/colors/colormaps.html
* Seaborn: https://seaborn.pydata.org/tutorial/color_palettes.html

27
Q

What are different ways to visualise correlation?

A

1. Heat map
    * Useful when we are interested in visualising the correlation between each pair of variables in the dataset

28
Q

What is a heat map and how do you produce one?

A

One can visualise the correlation using a heat map.
* A heat map is a 2-dimensional data visualisation technique that maps values to colours

EXAMPLE: Heat map to visualise pairwise correlation using the default colourmap:
sns.heatmap(auto_corr, annot=True, square=True, linewidths=.5, annot_kws={"fontsize":6}, cbar_kws={"shrink":.8});

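
The heat map call takes `auto_corr` as given. A minimal sketch of how such a correlation matrix might be computed with pandas follows; `demo` is a hypothetical stand-in for the auto data with made-up values, and the actual `auto_corr` in the course material may have been built differently.

```python
import pandas as pd

# Hypothetical stand-in for the auto dataset; the values are made up.
demo = pd.DataFrame({
    'displacement': [97, 120, 200, 250, 318, 400],
    'weight': [2100, 2400, 3000, 3300, 3900, 4400],
    'mpg': [32, 27, 21, 19, 16, 13],
})

# Pairwise Pearson correlation between every pair of columns.
demo_corr = demo.corr()
print(demo_corr.round(2))
```

The result is a square, symmetric DataFrame with 1 on the diagonal, which is the shape of input a heat map of correlations expects. For a DataFrame that also holds non-numeric columns, recent pandas versions accept `corr(numeric_only=True)`.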
29
Q

What are ways of improving heat maps to visualise correlation?

A

The default heat map uses a sequential colourmap.
* Issues:
    - Difficult to see easily which pairs have positive or negative correlation
    - Difficult to see the magnitude
* For correlation, there is a critical middle value, 0. Therefore, it is more appropriate to use a diverging colourmap.

EXAMPLE:
sns.heatmap(auto_corr, vmin=-1, vmax=1, center=0, cmap='RdBu', linewidths=.5, annot=True, square=True, annot_kws={"fontsize":6}, cbar_kws={"shrink":.8});

NOTE:
* Red represents a negative linear correlation, blue represents a positive linear correlation, and white represents no linear correlation
    - center=0 sets the critical centre value to 0
* A darker colour represents a larger magnitude of correlation

30
Q

What are considerations when using colours?

A

1. The type of data you are plotting: categorical? ordered?
2. Meaning/emotion associated with colours
    * Red is "warm" and blue is "cold"
    * Political parties are often associated with particular colours

31
Q

How can you represent a third variable in a scatterplot?

A

Scatter plot using colour to represent the third variable:

EXAMPLE:
_, ax = plt.subplots(ncols=3, figsize=(10, 2), sharey=True)
sc_m = ax[0].scatter(x=auto.displacement, y=auto.mpg, c=auto.weight, s=5)
plt.subplots_adjust(wspace=0.3); plt.colorbar(sc_m)
auto.plot.scatter(x='displacement', y='mpg', c='weight', s=5, ax=ax[1], title='pandas default')
sc_s = sns.scatterplot(auto, x='displacement', y='mpg', hue='weight', s=5, ax=ax[2])
sns.move_legend(sc_s, "upper left", bbox_to_anchor=(1, 1))
ax[0].set_title('matplotlib default'); ax[2].set_title('seaborn default');

32
Q

How can you load the Titanic, DC, and auto datasets?

A

titanic = pd.read_csv('data/titanic.csv')
dc = pd.read_csv('data/dc-wikia-data.csv')
auto = pd.read_csv('data/auto-mpg.csv', na_values='?')
auto['origin'] = auto['origin'].astype('category')

33
Q

For choosing an appropriate chart type, what are the things to consider and the different chart options?

A

What would you like to show?

1. Relationship
  a) Two variables: SCATTER CHART
  b) Three variables: BUBBLE CHART
2. Comparison
  a) Among items
    -> a1) One variable per item
      --> a1i) Few categories
        ---> a1iA) Few items: COLUMN CHART
        ---> a1iB) Many items: BAR CHART
      --> a1ii) Many categories: TABLE or TABLE WITH EMBEDDED CHARTS
    -> a2) Two variables per item: VARIABLE WIDTH COLUMN CHART
  b) Over time
    -> b1) Many periods
      --> b1i) Cyclical data: CIRCULAR AREA CHART
      --> b1ii) Non-cyclical data: LINE CHART
    -> b2) Few periods
      --> b2i) Single or few categories: COLUMN CHART
      --> b2ii) Many categories: LINE CHART
3. Distribution
  a) Single variable
    -> a1) Few data points: COLUMN HISTOGRAM
    -> a2) Many data points: LINE HISTOGRAM
  b) Two variables: SCATTER CHART
  c) Three variables: 3D AREA CHART
4. Composition
  a) Static
    -> a1) Simple share of total: PIE CHART
    -> a2) Accumulation or subtraction to total: WATERFALL CHART
    -> a3) Components of components: STACKED 100% COLUMN CHART WITH SUBCOMPONENTS
  b) Changing over time
    -> b1) Few periods
      --> b1i) Only relative differences matter: STACKED 100% COLUMN CHART
      --> b1ii) Relative and absolute differences matter: STACKED COLUMN CHART
    -> b2) Many periods
      --> b2i) Only relative differences matter: STACKED 100% AREA CHART
      --> b2ii) Relative and absolute differences matter: STACKED AREA CHART

34
Q

What are the different graphical components which should be considered?

A

* Figure layout
* Legends (for the graphic components)
* Titles
* Aspect ratio
* Orientation and grid
* Ticks (major and minor): i.e. they run along the outside of the graph and mark the steps of the scale
    - Major vs minor tick labels
* Axes: x-axis and y-axis labels
* Markers
* Colours
* Spines

35
Q

What is important to note about the default settings provided?

A

The default may not be the best.
* A plotting library comes with a set of default settings
* Manually tune the different settings to visualise the data and express the message better

EXAMPLE:
x = np.linspace(-3, 3, 1000); y1 = np.cos(x); y2 = np.sin(x)
_, ax = plt.subplots(ncols=2, figsize=(15, 2.5))
ax[0].plot(x, y1); ax[0].plot(x, y2); ax[0].set_title('default')  # default
ax[1].plot(x, y1, label='y=cos(x)'); ax[1].plot(x, y2, label='y=sin(x)')
ax[1].spines['left'].set_position('center'); ax[1].spines['bottom'].set_position('center')
ax[1].spines['right'].set_color('none'); ax[1].spines['top'].set_color('none')
ax[1].legend(); ax[1].set_title('customised'); ax[1].set_yticks([-1, 0, 1])
ax[1].set_xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi], [r'$-\pi$', r'$-\pi/2$', '0', r'$\pi/2$', r'$\pi$'])

36
Q

What is the layout and why is it important?

A

The layout (arrangement of multiple panels, facets, or subplots) is important for efficient comparison.
* Compare y-axis values: aligned horizontally with a single y-axis
* Compare x-axis values: aligned vertically with a single x-axis
* Matrix layouts (multiple rows and columns in a single figure) should only be used if:
    - Data in individual panels are not related
    - There are too many panels to fit on a single row / column

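
As an illustration of the first point, here is a small sketch with two panels placed side by side on a single shared y-axis. The two groups are made-up random data, not from any dataset in these notes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for two groups measured on the same scale.
rng = np.random.default_rng(0)
group_a = rng.normal(20, 3, 50)
group_b = rng.normal(25, 3, 50)

# Side by side with one shared y-axis: y-values can be compared directly.
fig, ax = plt.subplots(ncols=2, sharey=True, figsize=(6, 2.5))
ax[0].plot(group_a, '.'); ax[0].set_title('group A')
ax[1].plot(group_b, '.'); ax[1].set_title('group B')
ax[0].set_ylabel('value')
```

To compare x-axis values instead, the same idea applies with `nrows=2, sharex=True`, which stacks the panels vertically on a single x-axis.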
37
Q

What is important about the aspect ratio?

A

* A square figure should be considered, in particular if the two axes share a commonality, such as a measurement before and after some event.

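
A sketch of the before/after case, assuming made-up measurements: both axes get the same limits and an equal aspect ratio, so the dashed y = x line runs along the diagonal and points above it are cases where the value went up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up before/after measurements of the same quantity.
rng = np.random.default_rng(1)
before = rng.normal(50, 10, 40)
after = before + rng.normal(2, 5, 40)

fig, ax = plt.subplots(figsize=(3, 3))
ax.scatter(before, after, s=8)

# Same limits on both axes plus an equal aspect ratio give a square plot.
lo = min(before.min(), after.min())
hi = max(before.max(), after.max())
ax.plot([lo, hi], [lo, hi], linestyle='--', color='grey')  # y = x reference line
ax.set_xlim(lo, hi); ax.set_ylim(lo, hi)
ax.set_aspect('equal')
ax.set_xlabel('before'); ax.set_ylabel('after')
```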
38
Q

What is important about including the origin and setting the axis limits?

A

A bar chart uses height to represent the data. Therefore, in general, 0 should be included on the axis (i.e. a zero baseline should be used).

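
A small sketch of why the baseline matters, using two made-up values: on the truncated axis the bar heights suggest a large gap, while on the zero-based axis the heights are proportional to the values.

```python
import matplotlib.pyplot as plt

# Made-up values that differ by only 4%.
labels = ['A', 'B']
values = [96, 100]

fig, ax = plt.subplots(ncols=2, figsize=(6, 2.5))
# Truncated axis: bar heights no longer reflect the ratio of the values.
ax[0].bar(labels, values); ax[0].set_ylim(95, 101); ax[0].set_title('truncated axis')
# Zero-based axis: bar height is proportional to the value.
ax[1].bar(labels, values); ax[1].set_ylim(bottom=0); ax[1].set_title('zero-based axis')
```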
39
Q

Why are ticks and labels important?

A

The steps (ticks) and their labels, as well as the location of the spines, affect how easily the plot can be read.

40
Q

What can you do when the data has an exponential relation?

A

Use logarithmic axes.
* When the data has some extreme values, taking the log of the values may help to reveal the patterns and provide a more meaningful visualisation.

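
A minimal sketch with made-up exponential data: on a linear y-axis the early points are squashed near zero, while on a log y-axis the same points fall on a straight line.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data that doubles at every step.
x = np.arange(1, 11)
y = 2.0 ** x

fig, ax = plt.subplots(ncols=2, figsize=(6, 2.5))
ax[0].plot(x, y, 'o-'); ax[0].set_title('linear y-axis')
# A log scale turns constant growth ratios into equal vertical steps.
ax[1].plot(x, y, 'o-'); ax[1].set_yscale('log'); ax[1].set_title('log y-axis')
```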
41
Q

How can you reorder data based on frequency?

A

Sort the bars in a bar chart if the "important" categories are the ones with the highest (or lowest) frequency, and if such reordering does not cause confusion or give a wrong impression.

EXAMPLE (same data in both plots, but the right one orders the bars by frequency):
_, ax = plt.subplots(ncols=2, figsize=(10, 3)); plt.subplots_adjust(wspace=0.3)
wc_df = pd.DataFrame.from_dict(wc.process_text(text), orient='index', columns=['count'])
wc_df[:20].plot.barh(legend=False, title='First 20, not sorted', ax=ax[0])
wc_df.sort_values('count').tail(20).plot.barh(legend=False, title='Top 20', ax=ax[1]);

42
Q

How does orientation improve interpretations?

A

EXAMPLE: the same side-by-side boxplots, but in different orientations

43
Q

What are the goals of explanatory visualisation?

A

1. Present data
2. Provide evidence and support
3. Influence and persuade

44
Q

What are features of plots used for explanatory purposes?

A

Some features of plots for explanatory purposes:
* Attractive and aesthetically pleasing
    - Draw attention
    - Easier to understand, more intuitive to people
* Use annotations and colours to highlight what the author wants you to see
* Use captions to guide readers
    - What the plot is about
    - How readers should understand the plot
* Have graphical integrity and be truthful
    - Plots should be accurate and not misleading
    - Intentionally misleading people is ethically unacceptable
    - Any misleading/inaccurate plots will undermine your credibility
* Be clear about what you want to show:
    - Do not bury the lead
    - Use the right graphic form (marks and channels)
    - Arrange the graphic components
      --> All plots should also be self-contained with axis labels, title, etc.
    - Keep it simple
    - Use captions and annotations to guide your audience (to be covered in the workshop)

45
Q

What is a word cloud?

A

Word cloud: The importance of each word is visualised through font size or colour.
* This plot provides a general idea of the most important words in the given article and is aesthetically pleasing, but it is difficult to compare the word counts between different tokens.

from wordcloud import WordCloud, STOPWORDS
wc = WordCloud(stopwords=STOPWORDS, background_color="white").generate(text)
plt.imshow(wc); plt.axis("off");

NOTE: You may need to install the wordcloud package first

46
Q

What is a key takeaway for selecting graphics?

A

* Choose the graphic form that suits your need, which is not necessarily the one that enables accurate estimates
* Multiple graphic forms may enable multiple tasks
    - E.g. if you want to show a general impression of the important words and, at the same time, allow readers to compare the word counts easily and accurately, use both charts

47
Q

What are auxiliary elements?

A

You can add additional elements. For example, in a time-series graph, you can add a horizontal line showing the average.

_, ax = plt.subplots(1, 2, figsize=(10, 2.5), sharey=True); plt.subplots_adjust(wspace=0.3)
income_df['income'].plot(ylabel='income (US dollars)', title='original', ax=ax[0])
income_df['income'].plot(ylabel='US dollars', title='With average shown', ax=ax[1])
ax[1].axhline(y=income_df['income'].mean(), color='b', linestyle='--')
ax[1].annotate(f' Average income: \n {income_df["income"].mean()}', xy=(datetime(2000,12,31), income_df['income'].mean()-200), fontsize=12, annotation_clip=False);  # annotation outside the plot

48
Q

How can you handle the scale of the data?

A

Data can span very different magnitudes, which makes the plot difficult to read. Possible solutions:
* Set the scale of the axes
* Create more than one plot, each with its own scale, to "zoom in" on the data

49
Q

What are some issues with colour?

A

Convention/intuition may actually be biased:
* Good = Green, Bad = Red
* Male = Blue, Female = Pink/red

50
Q

What is a general tip on the detail of graphs?

A

Keep it simple
* Avoid chart junk
* Maximise the data-ink ratio
* If it can be visualised in 2d, do not visualise it in 3d. If it can be visualised in 1d, do not visualise it in 2d

51
Q

What is chart junk?

A

Chart junk refers to visual embellishments that are not essential to understanding the data.
* They are non-data and/or redundant data elements in a graph
* They can be artistic decoration, but more often take the form of conventional graphical elements that are unnecessary in that they add no value
* They can, however, be useful for telling a captivating message/story

Findings from the study "Useful Junk? The Effects of Visual Embellishment on Comprehension and Memorability of Charts":
- Accuracy in describing the embellished charts ("Holmes style" charts) was no worse than for "plain" charts
- Recall after 2-3 weeks was significantly better for embellished charts
- Participants saw value messages in the "Holmes style" charts significantly more often than in the plain charts
- Participants found the Holmes charts more attractive, enjoyed them most, and found them easiest and fastest to remember

52
Q

What is the data-ink ratio?

A

data-ink ratio = data ink / total ink used in the graphic

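53
Q

Why only show half the correlation matrix?

A

DUPLICATION: the correlation matrix is symmetric, so the upper half repeats the lower half, and the diagonal only shows each variable's correlation with itself (always 1).
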
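
One way to drop the repeated half is to build a boolean mask over the upper triangle. The sketch below uses a made-up `demo` DataFrame as a hypothetical stand-in for the auto data and only pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the auto dataset; the values are made up.
demo = pd.DataFrame({
    'displacement': [97, 120, 200, 250, 318, 400],
    'weight': [2100, 2400, 3000, 3300, 3900, 4400],
    'mpg': [32, 27, 21, 19, 16, 13],
})
corr = demo.corr()

# True strictly above the diagonal: those cells repeat the ones below it.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
lower = corr.mask(mask)  # masked cells become NaN
print(lower.round(2))
```

The same boolean array can be passed as the `mask` argument of `sns.heatmap` so that only the lower triangle is drawn. Using `k=0` instead of `k=1` would hide the diagonal as well.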
54
Q

What are remarks on embellished graphs?

A

* The messages of the selected plots are simple, and the embellishments are relevant to those messages
    - Therefore, it is no surprise that readers can comprehend the message as well as they would have had the graph lacked embellishment
* The examples in the study consisted of only a few embellished charts, all created by Nigel Holmes
* The plain plots used in the study are unnecessarily hard on the eyes, and some of them were designed in a confusing way

55
Q

How can you determine whether something is chart junk?

A

* If a graphical element supports the chart's message in a meaningful way, it is not junk
* Graphical embellishments can potentially support the effectiveness of a data visualisation:
    - Engage the interest of the reader (i.e. getting them to read the content)
    - Draw the reader's attention to particular items that merit emphasis
    - Make the message more memorable
* Embellishments only enhance effectiveness, however, if they refrain from undermining the message by distracting from it or misrepresenting it.