W5 Flashcards

Question

What are diverging colour maps?

Answer 1

Colour maps in which two different colours change in lightness (and possibly saturation) and meet at a neutral middle. They are suitable for data with a critical middle value, such as zero.
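As a rough illustration of this definition, the sketch below samples Matplotlib's built-in 'RdBu' colormap at both ends and at the midpoint. The helper `approx_lightness` is hypothetical and uses the common Rec. 601 luma weights only as an assumed stand-in for perceived lightness.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

cmap = plt.get_cmap("RdBu")  # a built-in diverging colormap

def approx_lightness(rgba):
    """Approximate lightness of an RGBA colour using Rec. 601 luma weights."""
    r, g, b, _ = rgba
    return 0.299 * r + 0.587 * g + 0.114 * b

# Sample the two extremes and the midpoint of the colormap
low, mid, high = cmap(0.0), cmap(0.5), cmap(1.0)
for name, colour in [("low", low), ("mid", mid), ("high", high)]:
    print(name, round(approx_lightness(colour), 2))
```

The midpoint should come out lighter than either end, with a red hue at the low end and a blue hue at the high end.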

Answer 2

* Matplotlib: https://matplotlib.org/3.5.1/tutorials/colors/colormaps.html
* Seaborn: https://seaborn.pydata.org/tutorial/color_palettes.html

Answer 3

1. Heat Map
Suppose we are interested in visualising the correlation between each pair of variables in the dataset.

Answer 4

One can visualise the correlation using a heat map.
* A heat map is a 2-dimensional data visualisation technique that maps values to colours.
EXAMPLE: Heat map to visualise pairwise correlation using the default colourmap:
    sns.heatmap(auto_corr, annot=True, square=True, linewidths=.5, annot_kws={"fontsize":6}, cbar_kws={"shrink":.8});

Answer 5

A sequential colourmap is used in the previous slide.
* Issues:
  - Difficult to see at a glance which pairs have positive or negative correlation
  - Difficult to see the magnitude
* For correlation, there is a critical middle value, 0. Therefore, it is more appropriate to use a diverging colour map.
EXAMPLE:
    sns.heatmap(auto_corr, vmin=-1, vmax=1, center=0, cmap='RdBu', linewidths=.5, annot=True, square=True, annot_kws={"fontsize":6}, cbar_kws={"shrink":.8});
NOTE:
* Red represents a negative linear correlation, blue represents a positive linear correlation, and white represents no linear correlation
  - center=0 sets the critical centre value to 0
* A darker colour represents a larger magnitude of correlation
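The same idea can be sketched with Matplotlib alone. The data here are synthetic (an assumption, since the auto dataset is not reproduced in this deck), and `imshow` stands in for `sns.heatmap`; the key point is that `vmin` and `vmax` are symmetric about 0 so that the white midpoint of 'RdBu' means "no correlation".

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Three synthetic variables: x, one positively related to x, one negatively related
data = np.column_stack([
    x,
    x + rng.normal(scale=0.5, size=200),
    -x + rng.normal(scale=0.5, size=200),
])
corr = np.corrcoef(data, rowvar=False)  # 3x3 correlation matrix

fig, ax = plt.subplots(figsize=(3, 3))
# Symmetric limits keep 0 at the centre of the diverging colormap
im = ax.imshow(corr, cmap="RdBu", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, shrink=0.8)
# Annotate each cell with its correlation value
for (i, j), value in np.ndenumerate(corr):
    ax.text(j, i, f"{value:.2f}", ha="center", va="center", fontsize=6)
```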

Answer 6

1. The type of data you are plotting - categorical? ordered?
2. Meaning/emotion associated with colours
   * Red is "warm" and blue is "cold"
   * Political parties are often associated with some colours. For example:

Answer 7

Scatter plot using colour to represent the third variable:
EXAMPLE:
    _, ax = plt.subplots(ncols=3, figsize=(10, 2), sharey=True)
    # matplotlib: the third variable is passed via c=, and the colour bar is added explicitly
    sc_m = ax[0].scatter(x=auto.displacement, y=auto.mpg, c=auto.weight, s=5)
    plt.subplots_adjust(wspace=0.3); plt.colorbar(sc_m)
    # pandas: c= names the column to map to colour
    auto.plot.scatter(x='displacement', y='mpg', c='weight', s=5, ax=ax[1], title='pandas default')
    # seaborn: hue= maps the third variable to colour
    sc_s = sns.scatterplot(data=auto, x='displacement', y='mpg', hue='weight', s=5, ax=ax[2])
    sns.move_legend(sc_s, "upper left", bbox_to_anchor=(1, 1))
    ax[0].set_title('matplotlib default'); ax[2].set_title('seaborn default');

Answer 8

    import pandas as pd
    titanic = pd.read_csv('data/titanic.csv')
    dc = pd.read_csv('data/dc-wikia-data.csv')
    auto = pd.read_csv('data/auto-mpg.csv', na_values='?')  # '?' marks missing values
    auto['origin'] = auto['origin'].astype('category')

Answer 9

What would you like to show?
1. Relationship
   a) Two variables: SCATTER CHART
   b) Three variables: BUBBLE CHART
2. Comparison
   a) Among items
      a1) One variable per item
          a1i) Few categories
               a1iA) Few items: COLUMN CHART
               a1iB) Many items: BAR CHART
          a1ii) Many categories: TABLE or TABLE WITH EMBEDDED CHARTS
      a2) Two variables per item: VARIABLE WIDTH COLUMN CHART
   b) Over time
      b1) Many periods
          b1i) Cyclical data: CIRCULAR AREA CHART
          b1ii) Non-cyclical data: LINE CHART
      b2) Few periods
          b2i) Single or few categories: COLUMN CHART
          b2ii) Many categories: LINE CHART
3. Distribution
   a) Single variable
      a1) Few data points: COLUMN HISTOGRAM
      a2) Many data points: LINE HISTOGRAM
   b) Two variables: SCATTER CHART
   c) Three variables: 3D AREA CHART
4. Composition
   a) Static
      a1) Simple share of total: PIE CHART
      a2) Accumulation or subtraction to total: WATERFALL CHART
      a3) Components of components: STACKED 100% COLUMN CHART WITH SUBCOMPONENTS
   b) Changing over time
      b1) Few periods
          b1i) Only relative differences matter: STACKED 100% COLUMN CHART
          b1ii) Relative and absolute differences matter: STACKED COLUMN CHART
      b2) Many periods
          b2i) Only relative differences matter: STACKED 100% AREA CHART
          b2ii) Relative and absolute differences matter: STACKED AREA CHART

Answer 10

* Figure layout
* Legends (for the graphic components)
* Titles
* Aspect ratio
* Orientation and grid
* Ticks (major and minor): they run along the outside of the graph and mark the steps of the scale
  - Major vs minor tick labels
* Axes: x-axis and y-axis labels
* Markers
* Colours
* Spines

Answer 11

The defaults may not be the best.
* A plotting library comes with a set of default settings
* Manually tune the different settings to visualise the data and express the message better
EXAMPLE:
    x = np.linspace(-3, 3, 1000); y1 = np.cos(x); y2 = np.sin(x)
    _, ax = plt.subplots(ncols=2, figsize=(15, 2.5))
    ax[0].plot(x, y1); ax[0].plot(x, y2); ax[0].set_title('default')  # default
    ax[1].plot(x, y1, label='y=cos(x)'); ax[1].plot(x, y2, label='y=sin(x)')
    # move the left and bottom spines to the centre and hide the other two
    ax[1].spines['left'].set_position('center'); ax[1].spines['bottom'].set_position('center')
    ax[1].spines['right'].set_color('none'); ax[1].spines['top'].set_color('none')
    ax[1].legend(); ax[1].set_title('customised'); ax[1].set_yticks([-1, 0, 1])
    # raw strings avoid invalid-escape warnings for the LaTeX labels
    ax[1].set_xticks([-np.pi,-np.pi/2,0,np.pi/2,np.pi], [r'$-\pi$',r'$-\pi/2$','0',r'$\pi/2$',r'$\pi$'])

Answer 12

The layout (arrangement of multiple panels, facets, or subplots) is important for efficient comparison.
* Compare y-axis values: aligned horizontally with a single y-axis
* Compare x-axis values: aligned vertically with a single x-axis
* Matrix layouts (multiple rows and columns in a single figure) should only be used if:
  - Data in individual panels are not related
  - Too many panels to fit on a single row / column
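A minimal sketch of the first two layout rules, using synthetic data (an assumption): `sharey=True` with panels in a row gives a single y scale for comparing y values, and `sharex=True` with panels in a column gives a single x scale for comparing x values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Three synthetic groups with different means
groups = [rng.normal(loc=m, size=100) for m in (0, 2, 5)]

# Compare y values: panels side by side, one shared y-axis
fig_row, axes_row = plt.subplots(ncols=3, sharey=True, figsize=(9, 2.5))
for ax, g in zip(axes_row, groups):
    ax.plot(g)

# Compare x values: panels stacked vertically, one shared x-axis
fig_col, axes_col = plt.subplots(nrows=3, sharex=True, figsize=(4, 6))
for ax, g in zip(axes_col, groups):
    ax.hist(g, bins=20)
```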

Answer 13

* A square figure should be considered in particular if the two axes share a commonality, such as a measurement before and after some event.
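One way this might look in Matplotlib, assuming hypothetical before/after measurements: both axes get the same limits and an equal aspect ratio, so points on the dashed diagonal mean "no change".

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
before = rng.uniform(50, 100, size=40)            # hypothetical measurements before the event
after = before + rng.normal(scale=5, size=40)     # hypothetical measurements after the event

fig, ax = plt.subplots(figsize=(4, 4))            # square figure
ax.scatter(before, after, s=10)

# Use identical limits on both axes and draw the y = x reference line
lims = (min(before.min(), after.min()), max(before.max(), after.max()))
ax.plot(lims, lims, linestyle="--", color="grey")
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_aspect("equal")                            # one unit on x equals one unit on y
ax.set_xlabel("before")
ax.set_ylabel("after")
```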

Answer 14

A bar chart uses bar height to represent the data. Therefore, in general, the axis should include 0 (that is, a zero baseline should be used).
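The arithmetic behind this rule can be sketched with a small hypothetical helper, `apparent_ratio`, which gives the ratio of drawn bar heights for two values when the axis starts at a given baseline. The numbers are made up for illustration.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for values a and b when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# With a zero baseline the bars show the true ratio of the values
true_ratio = apparent_ratio(100, 95)

# With the axis truncated at 90, the first bar is drawn twice as tall as the second
truncated_ratio = apparent_ratio(100, 95, baseline=90)

print(round(true_ratio, 3), truncated_ratio)
```

A 5% difference in the data becomes a 100% difference in bar height once the baseline moves to 90, which is why a truncated axis can mislead.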

Answer 15

Steps (ticks) and labels (as well as the location of spines)
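A short sketch of controlling ticks and spines in Matplotlib, using a sine curve as placeholder data. The tick spacings chosen here are arbitrary assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import AutoMinorLocator, MultipleLocator

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots(figsize=(6, 2.5))
ax.plot(x, np.sin(x))

# Major ticks every 2 units on the x-axis, with 4 minor intervals between them
ax.xaxis.set_major_locator(MultipleLocator(2))
ax.xaxis.set_minor_locator(AutoMinorLocator(4))
ax.tick_params(which="minor", length=2)

# Only the y values that matter for a sine curve
ax.set_yticks([-1, 0, 1])

# Hide the spines that carry no ticks
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
```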

Answer 16

Add logarithmic axes
* When the data have some extreme values, taking the log of the values may help to reveal patterns and give a more meaningful visualisation.
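A minimal sketch with synthetic, heavily skewed data (an assumption): the same values are plotted on a linear and on a logarithmic y-axis. Note that a log scale needs strictly positive values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Log-normal data: mostly small values plus a few extreme ones
values = np.sort(rng.lognormal(mean=3, sigma=2, size=500))

fig, ax = plt.subplots(ncols=2, figsize=(10, 2.5))
ax[0].plot(values)
ax[0].set_title("linear y-axis")

ax[1].plot(values)
ax[1].set_yscale("log")  # the axis is log-scaled; the data themselves are unchanged
ax[1].set_title("logarithmic y-axis")
```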

Answer 17

Sort the bars in a bar chart if the "important" categories are the ones with the highest (or lowest) frequency, and if such reordering does not cause confusion or give a wrong impression.
EXAMPLE (same data in both panels; in the right panel the bars are ordered by frequency):
    _, ax = plt.subplots(ncols=2, figsize=(10, 3)); plt.subplots_adjust(wspace=0.3)
    # wc and text come from the word cloud example
    wc_df = pd.DataFrame.from_dict(wc.process_text(text), orient='index', columns=['count'])
    wc_df[:20].plot.barh(legend=False, title='First 20, not sorted', ax=ax[0])
    wc_df.sort_values('count').tail(20).plot.barh(legend=False, title='Top 20', ax=ax[1]);

Answer 18

Same side-by-side boxplots but different orientations

Answer 19

1. Present data 2. Provide evidence and support 3. Influence and persuade

Answer 20

Some features of plots for explanatory purposes:
* Attractive and aesthetically pleasing
  - Draw attention
  - Easier to understand, more intuitive to people
* Use annotations and colours to highlight what they want you to see
* Use captions to guide readers
  - What the plot is about
  - How readers should understand the plot
* Have graphical integrity and be truthful
  - Plots should be accurate and not misleading
  - Intentionally misleading people is ethically unacceptable
  - Any misleading/inaccurate plots will undermine your credibility
* Be clear about what you want to show:
  - Do not bury the lead
  - Use the right graphic form (marks and channels)
  - Arrange the graphic components --> All plots should also be self-contained with axis labels, title, etc.
  - Keep it simple
  - Use captions and annotations to guide your audience (to be covered in the workshop)

Answer 21

Word cloud: The importance of each word is visualised through font size or colour.
* This plot gives a general idea of the most important words in the given article and is aesthetically pleasing, but it is difficult to compare the word counts between different tokens.
EXAMPLE:
    from wordcloud import WordCloud, STOPWORDS
    wc = WordCloud(stopwords=STOPWORDS, background_color="white").generate(text)
    plt.imshow(wc); plt.axis("off");
NOTE: You may need to install the wordcloud package first

Answer 22

* Choose the graphic form that suits your needs, which is not necessarily the one that enables accurate estimates
* Multiple graphic forms may enable multiple tasks
  - E.g. if you want to give a general impression of the important words and at the same time let readers compare the word counts easily and accurately, use both charts
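For the accurate-comparison half of that pairing, a sorted bar chart of word counts can be sketched without the wordcloud package. The sample sentence and the use of `collections.Counter` are assumptions for illustration, not the deck's own method.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from collections import Counter

# Hypothetical sample text standing in for the article
sample_text = "the quick brown fox jumps over the lazy dog the fox"

counts = Counter(sample_text.split())   # word -> frequency
top = counts.most_common(3)             # three most frequent words, highest first

# barh draws from bottom to top, so reverse to put the most frequent word on top
words, freqs = zip(*reversed(top))
fig, ax = plt.subplots(figsize=(4, 2))
ax.barh(words, freqs)
ax.set_xlabel("count")
ax.set_title("Top words")
```

Bar length lets readers compare counts directly, which font size in a word cloud does not.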

Answer 23

You can add additional elements. For example, in a time-series graph you can add a horizontal line showing the average.
EXAMPLE:
    from datetime import datetime
    _, ax = plt.subplots(1, 2, figsize=(10,2.5), sharey=True); plt.subplots_adjust(wspace=0.3)
    income_df['income'].plot(ylabel='income (US dollars)', title='original', ax=ax[0])
    # draw on the second panel explicitly rather than relying on the current axes
    income_df['income'].plot(ylabel='US dollars', title='With average shown', ax=ax[1])
    ax[1].axhline(y=income_df['income'].mean(), color='b', linestyle='--')
    ax[1].annotate(f' Average income: \n {income_df["income"].mean()}', xy=(datetime(2000,12,31), income_df['income'].mean()-200), fontsize=12, annotation_clip=False);  # annotation outside the plot

Answer 24

The data can differ greatly in magnitude, which makes the plot difficult to read. Possible solutions:
* Set the scale of the axes
* Create more than one plot, each with its own scale, to "zoom in" on the data
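The second solution might be sketched as follows, with two made-up series (`revenue` and `returns` are hypothetical names and values). On a shared scale the small series is flattened near zero; in separate panels each gets its own y scale.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 13)
revenue = 10_000 + 500 * months   # hypothetical series with large magnitude
returns = 5 + 0.5 * months        # hypothetical series with small magnitude

fig, ax = plt.subplots(ncols=3, figsize=(12, 2.5))

# One shared scale: the small series is barely distinguishable from zero
ax[0].plot(months, revenue, label="revenue")
ax[0].plot(months, returns, label="returns")
ax[0].legend()
ax[0].set_title("shared scale")

# Separate panels: each series is shown on its own scale
ax[1].plot(months, revenue)
ax[1].set_title("revenue only")
ax[2].plot(months, returns)
ax[2].set_title("returns only")
```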

Answer 25

Convention/intuition may actually be biased:
* Good = Green, Bad = Red
* Male = Blue, Female = Pink/Red

Answer 26

Keep it simple
* Avoid chart junk
* Maximise data-ink ratio
* If it can be visualised in 2D, do not visualise it in 3D. If it can be visualised in 1D, do not visualise it in 2D

Answer 27

Chart junk consists of visual embellishments that are not essential to understanding the data.
* They are non-data and/or redundant data elements in a graph
* They can be artistic decoration, but more often take the form of conventional graphical elements that are unnecessary because they add no value
* They can, however, be useful for telling a captivating message/story
Findings from the study "Useful Junk? The Effects of Visual Embellishment on Comprehension and Memorability of Charts":
  - Accuracy in describing the embellished charts ("Holmes style" charts) was no worse than for "plain" charts
  - Recall after 2-3 weeks was significantly better for the embellished charts
  - Participants saw value messages in the "Holmes style" charts significantly more often than in the plain charts
  - Participants found the Holmes charts more attractive, enjoyed them most, and found them easiest and fastest to remember

Answer 28

data-ink ratio = data ink / total ink used in graphic
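The ratio is rarely computed in practice; the idea is to remove ink that carries no data. The sketch below shows the same hypothetical bar data twice, once with extra non-data ink and once with some of it removed. Which elements count as removable is a judgment call assumed here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]   # hypothetical categories
values = [4, 7, 5]             # hypothetical values

fig, ax = plt.subplots(ncols=2, figsize=(8, 2.5), sharey=True)
for a in ax:
    a.bar(categories, values, color="steelblue")

# Left: extra non-data ink (background fill, heavy grid, full frame)
ax[0].set_facecolor("lightgrey")
ax[0].grid(True, linewidth=1.5)
ax[0].set_title("more non-data ink")

# Right: same data, with the top and right spines removed
for side in ("top", "right"):
    ax[1].spines[side].set_visible(False)
ax[1].set_title("less non-data ink")
```

Both panels encode identical values, so the right panel has the higher data-ink ratio.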

Answer 29

DUPLICATION (i.e. you are showing correlation with itself)
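Assuming this card refers to a correlation heat map, where the diagonal shows each variable against itself and the upper triangle mirrors the lower one, one possible remedy is to mask the repeated cells. The sketch uses synthetic data and Matplotlib's `imshow` with a NumPy masked array.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(100, 4))            # four synthetic variables
corr = np.corrcoef(data, rowvar=False)      # 4x4 symmetric correlation matrix

# True on the diagonal and above: the cells that repeat information
mask = np.triu(np.ones_like(corr, dtype=bool))
masked_corr = np.ma.masked_where(mask, corr)

fig, ax = plt.subplots(figsize=(3, 3))
im = ax.imshow(masked_corr, cmap="RdBu", vmin=-1, vmax=1)  # masked cells are left blank
fig.colorbar(im, ax=ax, shrink=0.8)
```

Seaborn's `heatmap` accepts a `mask` argument that can take the same boolean array.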

Answer 30

* Messages for the selected plots are simple, and the embellishments are relevant to those messages
  - Therefore, it is no surprise that readers can comprehend the message as well as they would have had the graph lacked embellishment
* The examples in the study consisted of only a few embellished charts, all created by Nigel Holmes
* The plain plots used in the study are unnecessarily hard on the eyes, and some of them were designed in a confusing way

Answer 31

* If a graphical element supports the chart’s message in a meaningful way, it is not junk
* Graphical embellishments can potentially support the effectiveness of a data visualisation:
  - Engage the interest of the reader (i.e., getting them to read the content)
  - Draw reader's attention to particular items that merit emphasis
  - Make the message more memorable
* Embellishments only enhance effectiveness, however, if they refrain from undermining the message by distracting from it or misrepresenting it.

(55 cards)