Sect 4 - Data visualization, R programming, Python, statistical significance, & ANOVA Flashcards

1
Q

advantages of data visualization

A

Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and squares from circles. Our culture is visual, including everything from art and advertisements to TV and movies. Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t see a trend, you know how much more effective a visualization can be.

Some other advantages of data visualization include:

Easily share information.
Interactively explore opportunities.
Visualize patterns and relationships.

2
Q

disadvantages of data visualization

A

While there are many advantages, some of the disadvantages are less obvious. For example, when viewing a visualization with many different data points, it’s easy to make an inaccurate assumption. Or the visualization may simply be designed poorly, so that it is biased or confusing.

Some other disadvantages include:

Biased or inaccurate information.
Correlation doesn’t always mean causation.
Core messages can get lost in translation.

3
Q

The importance of data visualization is simple: it helps people see, interact with, and better understand data. Whether simple or complex, the right visualization can bring everyone on the same page, regardless of their level of expertise.

It’s hard to think of a professional industry that doesn’t benefit from making data more understandable. Every STEM field benefits from understanding data—and so do fields in government, finance, marketing, history, consumer goods, service industries, education, sports, and so on.

While we’ll always wax poetic about data visualization (you’re on the Tableau website, after all), there are practical, real-life applications that are undeniable. And, since visualization is so widespread, it’s also one of the most useful professional skills to develop. The better you can convey your points visually, whether in a dashboard or a slide deck, the better you can leverage that information. The concept of the citizen data scientist is on the rise. Skill sets are changing to accommodate a data-driven world. It is increasingly valuable for professionals to be able to use data to make decisions and to use visuals to tell stories about how data informs the who, what, when, where, and how.

A

While traditional education typically draws a distinct line between creative storytelling and technical analysis, the modern professional world also values those who can cross between the two: data visualization sits right in the middle of analysis and visual storytelling.

4
Q

As the “age of Big Data” kicks into high gear, visualization is an increasingly key tool to make sense of the trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from data and highlighting useful information.

A

However, it’s not as simple as dressing up a graph to make it look better or slapping on the “info” part of an infographic. Effective data visualization is a delicate balancing act between form and function. The plainest graph could be too boring to catch any notice, or it could make a powerful point; the most stunning visualization could utterly fail at conveying the right message, or it could speak volumes. The data and the visuals need to work together, and there’s an art to combining great analysis with great storytelling.

5
Q

General types of data visualization

A

Chart: Information presented in a tabular, graphical form with data displayed along two axes. Can be in the form of a graph, diagram, or map.

Table: A set of figures displayed in rows and columns.

Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables in comparison to each other, usually along two axes at a right angle.

Geospatial: A visualization that shows data in map form using different shapes and colors to show the relationship between pieces of data and specific locations.

Infographic: A combination of visuals and words that represent data. Usually uses charts or diagrams.

Dashboards: A collection of visualizations and data displayed in one place to help with analyzing and presenting data.

6
Q

Specific types of data visualization

A

Area Map: A form of geospatial visualization, area maps are used to show specific values set over a map of a country, state, county, or any other geographic location. Two common types of area maps are choropleths and isopleths.

Bar Chart: Bar charts represent numerical values compared to each other. The length of the bar represents the value of each variable.

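Box-and-whisker Plots: These summarize the distribution of a measure. The box spans the middle range of the values (typically the first to third quartile, with a line at the median), and the whiskers extend toward the minimum and maximum.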

Bullet Graph: A bar marked against a background to show progress or performance against a goal, denoted by a line on the graph.

Gantt Chart: Typically used in project management, Gantt charts are a bar chart depiction of timelines and tasks.

Heat Map: A type of geospatial visualization in map form which displays specific data values as different colors (this doesn’t need to be temperatures, but that is a common use).

Highlight Table: A form of table that uses color to categorize similar data, allowing the viewer to read it more easily and intuitively.

Histogram: A type of bar chart that splits a continuous measure into different bins to help analyze the distribution.

Pie Chart: A circular chart with wedge-shaped segments that shows data as a percentage of a whole.

Treemap: A type of chart that shows different, related values in the form of rectangles nested together.
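
As a rough illustration of two of the chart types listed in this card, the sketch below uses base R to compute the numbers behind a histogram and a box-and-whisker plot. The sample values and the bin width are made up for the example and do not come from the source.

```r
# Hypothetical sample of a continuous measure (not from the source)
values <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 9, 15)

# Histogram: split the continuous measure into bins and count values per bin
# plot = FALSE returns the bin counts without drawing anything
h <- hist(values, breaks = seq(0, 16, by = 4), plot = FALSE)
print(h$counts)

# Box-and-whisker: the five numbers that define the box and the whiskers
b <- boxplot.stats(values)
print(b$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)    # values that would be drawn individually as outliers
```

Calling hist(values) or boxplot(values) without plot = FALSE draws the corresponding chart on the active graphics device.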

7
Q

Types of Information Visualization
Information visualization tools can help users compare different values, show the bigger picture, track trends in the data, and understand different relationships between variables. The following visualization formats are most commonly used for these purposes:

A

Column chart
Bar graph
Network graph
Stacked bar graph
Histogram
Line chart
Pie chart
Scatter plot or 3D scatter plot
Box plot
Bubble chart
Dual-axis chart
Stream graph
Sankey diagram
Chord diagram
Choropleth map
Hex map
Voronoi polygon diagram
Ridgeline plot
Interactive decision tree
Heatmap
Tree map
Circle packing
Violin plot
Real-time tracker

8
Q

Almost everyone within modern organizations is demanding access to data, making the representation of that data in an easy-to-understand format even more important. Business users need a way to interpret data and interact with it in an intuitive way. Information visualization tools help these decision-makers navigate the data with less difficulty and therefore deliver value to the entire organization.

A

Information visualization is a key skill today as more companies look to digitally transform and make data a key asset across the organization. With ever-growing volumes of data, being able to present data in a meaningful way for others to understand has become crucial for a business to remain competitive. Information visualization turns data into actionable insights.

9
Q

What Makes an Information Visualization Successful?
Information visualization is an art and therefore relies on the following aspects of design:

A

The subject matter: information or data being represented
The story: the concept being portrayed in the visualization
The goal: meeting the purpose with the right visualization
The visual: using key elements of structure and design

10
Q

select() — Selecting Columns in your Data Set
Selecting only the columns continent, year, and pop.

A

# Setup assumed by the examples in this deck; the value of rows is an assumption
library(dplyr)
library(gapminder)
rows <- 10

gapminder %>%
  select(continent, year, pop) %>%
  head(rows)

Selecting all columns but the year column.
gapminder %>%
  select(-year) %>%
  head(rows)

Selecting all columns that start with co using starts_with(). Please have a look at the documentation for additional useful functions, including ends_with() or contains().
gapminder %>%
  select(starts_with("co")) %>%
  head(rows)

11
Q

rename() — Renaming Columns
Rename the columns year to Year and lifeExp to Life Expectancy.

A

gapminder %>%
  select(country, year, lifeExp) %>%
  rename(
    Year = year,
    "Life Expectancy" = lifeExp
  ) %>%
  head(rows)

12
Q

arrange() — Sorting your Data Set
Sort by year.

A

gapminder %>%
  select(continent, year, lifeExp) %>%
  arrange(year) %>%
  head(rows)

Sort by lifeExp and then by year (descending).
gapminder %>%
  select(continent, year, lifeExp) %>%
  arrange(lifeExp, desc(year)) %>%
  head(rows)

13
Q

filter() — Filtering Rows in your Data Set
Filter rows with the year 1972.

A

gapminder %>%
  select(country, year, lifeExp) %>%
  filter(year == 1972) %>%
  head(rows)

Filter rows with the year 1972 and with a life expectancy below average.
gapminder %>%
  select(country, year, lifeExp) %>%
  filter(
    year == 1972,
    # mean(lifeExp) is computed over all rows passed to filter(), not only 1972
    lifeExp < mean(lifeExp)
  ) %>%
  head(rows)

Filter rows with the year 1972 and with a life expectancy below average, and with the country being either Bolivia OR Angola.
gapminder %>%
  select(country, year, lifeExp) %>%
  filter(
    year == 1972,
    lifeExp < mean(lifeExp),
    country == "Bolivia" | country == "Angola"
  ) %>%
  head(rows)

14
Q

mutate() — Generate New Columns in your Data Set
Create a column that combines continent and country information, and another column that shows the rounded lifeExp information.

A

gapminder %>%
  arrange(year, pop) %>%
  mutate(
    con_country = paste(continent, "-", country),
    rn_lifeExp = round(lifeExp)
  ) %>%
  select(continent, country, con_country, lifeExp, rn_lifeExp) %>%
  head(rows)

15
Q

summarize() — Create Summary Calculations in your Data Set
For the whole data set, calculate the mean and standard deviation of population and life expectancy.

A

gapminder %>%
  summarize(
    pop_mean = mean(pop),
    pop_sd = sd(pop),
    le_mean = mean(lifeExp),
    le_sd = sd(lifeExp)
  )

16
Q

group_by() — Group your Data Set and Create Summary Calculations
The summarize() function is of limited use without the group_by() function. Using both together is a powerful way to create new data sets. In the example below, I will group the data set by continent and then create summaries for population and lifeExp.

A

gapminder %>%
  group_by(continent) %>%
  summarize(
    pop_mean = mean(pop),
    pop_sd = sd(pop),
    le_mean = mean(lifeExp),
    le_sd = sd(lifeExp)
  )

It is also possible to group by more than one column. In the next example I use group_by() with continent and year.
gapminder %>%
  filter(year > 1989) %>%
  group_by(continent, year, .add = TRUE) %>%
  summarize(
    pop_mean = mean(pop),
    pop_sd = sd(pop),
    le_mean = mean(lifeExp),
    le_sd = sd(lifeExp)
  ) %>%
  head(rows)

17
Q

View()

A

Before starting with any kind of data analysis, it is crucial to understand the data we are dealing with. Plotting is a very important tool to get a quick overview of the statistical properties of data and to detect possible outliers. However, visualization might not always be possible, due to the size or complexity of the data set.

As an alternative solution, it might be convenient to interactively dig through the data set. This can be done with a spreadsheet-like interface, similar to Microsoft Excel, which lets you filter, sort, and inspect tabular data structures.

R provides the function View(), which opens an interactive data viewer. Depending on the platform and editor, this viewer may look different.

18
Q

str()

A

Sometimes we need to analyze very large and complex data structures. Displaying these data sources may already be overwhelming and simply not possible with interactive tools. In these cases, the str() function comes to the rescue and prints the structure, as well as the first few values of any R object. Even very large and complex data structures can easily be displayed in the console that way.
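
A minimal sketch of str() on a nested object follows. The object, its field names, and its values are hypothetical and exist only to show the kind of output the function prints.

```r
# Hypothetical nested object (names and values are made up for illustration)
study <- list(
  id = 42L,
  scores = data.frame(subject = c("a", "b", "c"), value = c(1.5, 2.5, 3.5)),
  notes = c("pilot", "n = 3")
)

# Print a compact summary of the structure: types, sizes, and first values
str(study)

# max.level limits how deeply nested components are expanded
str(study, max.level = 1)
```

Limiting max.level can help when the full structure would still be too long to read in the console.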

19
Q

Data frames are generic data objects in R used to store tabular data. They are the most popular data objects in R programming because tabular data is a familiar form. They are two-dimensional, heterogeneous data structures: lists of vectors of equal length.

Data frames have the following constraints placed upon them:

A data frame must have column names, and every row must have a unique name.
Each column must have the same number of items.
Each item in a single column must be of the same data type.
Different columns may have different data types.

To create a data frame we use the data.frame() function.

Example:

A

# R program to illustrate a data frame

# A character vector
Name <- c("Amiya", "Raj", "Asish")

# A character vector
Language <- c("R", "Python", "Java")

# A numeric vector
Age <- c(22, 25, 45)

# Pass each of the vectors
# we have created as arguments
# to the function data.frame()
df <- data.frame(Name, Language, Age)

print(df)

20
Q

confidence intervals

A

A confidence interval is an error bar that can be placed on a graph to show sampling error. Confidence intervals visually show the reader the most plausible range of the unknown population average. They are usually 90% or 95% by convention. What’s nice about confidence intervals is that they act as a shorthand statistical test, even for people who don’t understand p-values. They show the upper and lower bounds of a value and indicate whether two values are statistically different.

That is, if two confidence intervals do not overlap, the difference is statistically significant at that level of confidence (in most cases). For example, Figure 1 shows the findability rates on two websites for different products along with 90% confidence intervals depicted as the black “whisker” error bars.
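
For reference, here is a minimal sketch of how the bounds of such an interval can be computed for a sample mean in base R, using the t distribution. The sample values are hypothetical. The figure in the source shows findability rates, which are proportions and would normally need a different interval formula, so this is only the simplest case.

```r
# Hypothetical sample (values are made up, not taken from the source's figure)
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.1)

n  <- length(x)
m  <- mean(x)
se <- sd(x) / sqrt(n)  # standard error of the mean

# A 90% two-sided interval uses the 95th percentile of t with n - 1 df
conf  <- 0.90
tcrit <- qt(1 - (1 - conf) / 2, df = n - 1)
ci    <- c(lower = m - tcrit * se, upper = m + tcrit * se)
print(ci)

# Cross-check against t.test(), which reports the same interval
print(t.test(x, conf.level = conf)$conf.int)
```

Changing conf to 0.95 widens the interval, which is the trade-off between the two conventional levels mentioned in this card.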

21
Q

Standard Error Error Bars

A

Another type of error bar uses the standard error. Standard error error bars tend to be the default visualization in academia. Don’t be confused by the name: standard error error bars aren’t necessarily the “standard.” The name comes from the fact that they display the standard error (an estimate of the standard deviation of the sampling distribution of the mean).
The standard error is used in many statistical calculations (e.g., for computing confidence intervals and statistical significance), so an advantage of showing just the standard error is that other researchers can more easily create derived computations.

The main disadvantage I see is that people still interpret it as a confidence interval, but non-overlap no longer corresponds to the typical thresholds of statistical significance. Showing one standard error is roughly equivalent to showing a 68% confidence interval. The 90% confidence intervals for the same data are shown in Figure 3. You can see the overlap in R1 and R2 (meaning they are NOT statistically different), whereas the lack of a statistically significant difference is less easy to spot with standard error error bars (Figure 2).
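
The 68% figure can be checked with a short calculation. The sketch below assumes a normal approximation for the coverage and uses a hypothetical sample to compare the half-width of a one-standard-error bar with that of a 90% confidence interval.

```r
# Coverage of mean +/- 1 standard error under a normal approximation
coverage_1se <- pnorm(1) - pnorm(-1)
print(round(coverage_1se, 3))  # 0.683

# Hypothetical sample (values are made up for illustration)
x  <- c(12, 15, 11, 14, 13, 16, 12, 15, 14, 13)
se <- sd(x) / sqrt(length(x))

# Half-widths of the two kinds of error bar for the same data
half_se   <- se
half_ci90 <- qt(0.95, df = length(x) - 1) * se
print(c(one_se = half_se, ci90 = half_ci90))
```

With small samples the t distribution applies, so the coverage of a one-standard-error bar is somewhat below 68%.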

22
Q

Shaded Graphs

A

Error bars of any kind can add a lot of “ink” to a graph, which can overwhelm some readers. A visualization that avoids error bars is to vary the shading on the bars of a graph that are statistically significant. The dark red bars in Figure 4 show which comparisons are statistically significant. This shading can be done in color or in black-and-white to be printer friendly.

23
Q

Asterisks

A

An asterisk (*) or other symbol can indicate statistical significance for a modest number of comparisons (shown in Figure 5). We’ve also seen (and occasionally use) multiple symbols to indicate statistical significance at two thresholds (often p)

24
Q

Notes

A

It’s often the case that so many comparisons are statistically significant that any visual indication would be overwhelming (or undesired). In those cases, a note depicting significance is ideal. These notes can be in the footer of a table, the caption of an image (as shown in Figure 6), or in the notes section of slides.

25
Q

Connecting Lines and Hybrids

A

When the differences aren’t between adjacent items, an alternative approach is to include connecting lines, as shown in Figure 7 below. It shows 8 conditions in a UX research study using three measures (satisfaction, confidence, likelihood to purchase). Three differences were statistically significant, as indicated by the connecting lines. The graph also includes 95% confidence intervals and notes in the caption.

26
Q

Analysis of variance (ANOVA) tests the hypothesis that the means of two or more populations are equal. ANOVAs assess the importance of one or more factors by comparing the response variable means at the different factor levels. The null hypothesis states that all population means (factor level means) are equal while the alternative hypothesis states that at least one is different.

To perform an ANOVA, you must have a continuous response variable and at least one categorical factor with two or more levels. ANOVAs require data from approximately normally distributed populations with equal variances between factor levels. However, ANOVA procedures work quite well even if the normality assumption has been violated, unless one or more of the distributions are highly skewed or if the variances are quite different. Transformations of the original dataset may correct these violations.

A

For example, you design an experiment to assess the durability of four experimental carpet products. You put a sample of each carpet type in ten homes and you measure durability after 60 days. Because you are examining one factor (carpet type) you use a one-way ANOVA.

If the p-value is less than your alpha, then you conclude that at least one durability mean is different. For more detailed information about the differences between specific means, use a multiple comparison method such as Tukey’s.

The name “analysis of variance” is based on the approach in which the procedure uses variances to determine whether the means are different. The procedure works by comparing the variance between group means versus the variance within groups as a way of determining whether the groups are all part of one larger population or separate populations with different characteristics.
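
The carpet example can be sketched in base R as follows. The durability scores are simulated, so every number here is hypothetical. The code fits a one-way ANOVA with aov(), recomputes F as the ratio of the between-group to the within-group mean square, and then runs Tukey's method.

```r
set.seed(1)  # simulated durability scores; all numbers are hypothetical

carpet <- data.frame(
  type = factor(rep(c("A", "B", "C", "D"), each = 10)),
  durability = c(rnorm(10, mean = 14), rnorm(10, mean = 10),
                 rnorm(10, mean = 12), rnorm(10, mean = 18))
)

# One-way ANOVA: continuous response, one categorical factor
fit <- aov(durability ~ type, data = carpet)
tab <- summary(fit)[[1]]
print(tab)

# F is the between-group mean square divided by the within-group mean square
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
p_value  <- tab[["Pr(>F)"]][1]

# Tukey's method for pairwise comparisons between specific means
print(TukeyHSD(fit))
```

If p_value is below the chosen alpha, the Tukey output shows which pairs of carpet types differ.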

27
Q

Within-Subjects ANOVA

A

A within-subjects ANOVA is appropriate when examining for differences in a continuous level variable over time. A within-subjects ANOVA is also called a repeated measures ANOVA. This type of test is frequently used when using a pretest and posttest design, but is not limited to only two time periods. The repeated measures ANOVA can be used when examining for differences over two or more time periods. For example, this analysis would be appropriate if the researcher seeks to explore for differences in job satisfaction levels, measured at three points in time (pretest, posttest, 2-month follow up).
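
One way to run such an analysis in base R is aov() with an Error() term, sketched below on simulated data. The subject count, effect sizes, and variable names are assumptions made for the example, and the sketch does not check the sphericity assumption.

```r
set.seed(2)  # simulated data; all values are hypothetical

n_subj <- 12
long <- data.frame(
  subject = factor(rep(seq_len(n_subj), times = 3)),
  time = factor(rep(c("pretest", "posttest", "followup"), each = n_subj),
                levels = c("pretest", "posttest", "followup"))
)

# Each subject has a personal baseline, plus a time effect and noise
baseline <- rnorm(n_subj, mean = 50, sd = 5)
effect   <- c(pretest = 0, posttest = 4, followup = 6)
long$satisfaction <- baseline[as.integer(long$subject)] +
  unname(effect[as.character(long$time)]) +
  rnorm(nrow(long), sd = 1)

# Error(subject/time) tells aov() that time varies within each subject
fit <- aov(satisfaction ~ time + Error(subject/time), data = long)
print(summary(fit))
```

The data must be in long format, with one row per subject and time point.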

28
Q

A one-way ANOVA is used when assessing for differences in one continuous variable across the levels of ONE grouping variable. For example, a one-way ANOVA would be appropriate if the goal of research is to assess for differences in job satisfaction levels between ethnicities. In this example, there is only one dependent variable (job satisfaction) and ONE independent variable (ethnicity).

A

A factorial ANOVA is a general term applied when examining multiple independent variables. For example, a factorial ANOVA would be appropriate if the goal of a study was to examine for differences in job satisfaction levels by ethnicity and education level. In this example, there is only one dependent variable (job satisfaction) and TWO independent variables (ethnicity and education level). A factorial ANOVA can be applied when there are two or more independent variables.
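
A two-factor version can be sketched with aov() in base R. The data are simulated and the factor level labels are placeholders, not real categories. The formula operator * requests both main effects and their interaction.

```r
set.seed(3)  # simulated data; level labels and effects are hypothetical

# Ten observations for every combination of the two factors
d <- expand.grid(
  rep = 1:10,
  education = c("secondary", "bachelor", "graduate"),
  ethnicity = c("group1", "group2")
)
d$satisfaction <- 50 +
  ifelse(d$education == "graduate", 5, 0) +
  rnorm(nrow(d), sd = 3)

# `*` expands to ethnicity + education + ethnicity:education
fit <- aov(satisfaction ~ ethnicity * education, data = d)
tab <- summary(fit)[[1]]
print(tab)
```

The table has one row per main effect, one for the interaction, and one for the residuals.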

29
Q

Mixed-model ANOVA

A

A mixed model ANOVA, sometimes called a within-between ANOVA, is appropriate when examining for differences in a continuous level variable by group and time. This type of ANOVA is frequently applied when using a quasi-experimental or true experimental design. This analysis would be applicable if the purpose of the research is to examine for potential differences in a continuous level variable between a treatment and control group, and over time (pretest and posttest).
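
A minimal sketch of this design in base R follows, again with simulated data. Group is the between-subjects factor and time is the within-subjects factor. The sample size and the size of the treatment gain are assumptions made for the example.

```r
set.seed(4)  # simulated data; all values are hypothetical

n_per_group <- 10
subjects <- data.frame(
  subject = factor(seq_len(2 * n_per_group)),
  group = factor(rep(c("control", "treatment"), each = n_per_group))
)
times <- data.frame(
  time = factor(c("pretest", "posttest"), levels = c("pretest", "posttest"))
)

# merge() with no shared columns gives every subject at every time point
long <- merge(subjects, times)

baseline <- rnorm(nlevels(long$subject), mean = 20, sd = 3)
gain <- ifelse(long$group == "treatment" & long$time == "posttest", 5, 0)
long$score <- baseline[as.integer(long$subject)] + gain +
  rnorm(nrow(long), sd = 1)

# group varies between subjects, time varies within subjects
fit <- aov(score ~ group * time + Error(subject/time), data = long)
print(summary(fit))
```

The group:time interaction is usually the term of interest, since it indicates whether change over time differs between the groups.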

30
Q

ANCOVA

A

An analysis of covariance (ANCOVA) is appropriate when examining for differences in a continuous dependent variable between groups, while controlling for the effect of additional variables. The “C” in ANCOVA denotes that a covariate is being inputted into the model, and this covariate examination can be applied to a between-subjects design, a within-subjects design, or a mixed-model design. ANCOVAs are frequently used in experimental studies when the researcher wants to account for the effects of an antecedent (control) variable.
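
As a sketch under simulated data, an ANCOVA can be fitted in base R by adding the covariate to the aov() formula. Here a hypothetical pretest score is the covariate and group is the factor of interest. The variable names and effect sizes are assumptions.

```r
set.seed(5)  # simulated data; all values are hypothetical

n <- 30
d <- data.frame(
  group = factor(rep(c("control", "treatment"), each = n / 2)),
  pretest = rnorm(n, mean = 50, sd = 10)
)
d$posttest <- 5 + 0.8 * d$pretest +
  ifelse(d$group == "treatment", 6, 0) +
  rnorm(n, sd = 4)

# Enter the covariate first: aov() uses sequential (Type I) sums of squares,
# so the group effect is tested after adjusting for pretest
fit <- aov(posttest ~ pretest + group, data = d)
tab <- summary(fit)[[1]]
print(tab)
```

Because the sums of squares are sequential, swapping the order of pretest and group in the formula can change the reported tests.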

31
Q

MANOVA

A

A multivariate analysis of variance (MANOVA) is an extension of the ANOVA, and is appropriate when examining for differences in multiple continuous level variables between groups. For example, a MANOVA would be applicable if assessing for differences between ethnicities in job satisfaction AND intrinsic motivation levels of participants. In this example, job satisfaction and intrinsic motivation are the continuous level dependent variables. The MANOVA can be conducted with multiple independent variables, and can also include covariates (i.e., MANCOVA).
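
A minimal sketch with base R's manova() follows. The data are simulated, the group labels are placeholders, and the choice of Pillai's trace as the test statistic is an assumption made for the example.

```r
set.seed(6)  # simulated data; group labels and effects are hypothetical

n <- 20
d <- data.frame(group = factor(rep(c("g1", "g2", "g3"), each = n)))
shift <- unname(c(g1 = 0, g2 = 2, g3 = 4)[as.character(d$group)])
d$satisfaction <- 50 + shift + rnorm(nrow(d), sd = 3)
d$motivation   <- 30 + 0.5 * shift + rnorm(nrow(d), sd = 3)

# cbind() on the left-hand side supplies the multiple dependent variables
fit <- manova(cbind(satisfaction, motivation) ~ group, data = d)
print(summary(fit, test = "Pillai"))

# Follow-up univariate ANOVAs, one per dependent variable
print(summary.aov(fit))
```

Adding a covariate to the right-hand side of the formula would turn this into a MANCOVA.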