2.1 Examining Numerical Data Flashcards

1
Q

What does a scatterplot provide?

A

A scatterplot provides a case-by-case view of data for two numerical variables. Each point
represents a single case.

2
Q

When are scatterplots helpful?

A

Scatterplots are helpful for quickly spotting associations between variables, whether those
associations are simple trends or more complex relationships.

3
Q

What does a dot plot provide?

A

A dot plot provides a view of a single variable.

It’s the most basic of displays: a dot plot is a one-variable scatterplot.

4
Q

What is the sample mean?

A

The mean, often called the average, is a common way to measure the center of a distribution
of data. To compute the mean interest rate, for example, we add up all the interest rates and divide by the number
of observations. The sample mean is often labeled x̄.
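
As a minimal sketch of the computation described above, the snippet below adds up a handful of interest rates and divides by the number of observations. The values and the `sample_mean` helper are made up for illustration.

```python
# Hypothetical interest rates in percent; these values are invented for the example.
interest_rates = [5.5, 7.0, 10.0, 12.5, 15.0]


def sample_mean(values):
    """Add up all observations and divide by the number of observations."""
    return sum(values) / len(values)


x_bar = sample_mean(interest_rates)
print(f"sample mean = {x_bar}%")
```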

5
Q

What is a population mean?

A

The population mean is the average of all observations in the entire population. It is computed the same way as the sample mean,
but it has a special label: µ, the Greek letter mu. Sometimes a subscript, such as x, is used
to indicate which variable the population mean refers to, e.g. µ_x. It is often too expensive
to measure the population mean precisely, so we often estimate µ using the sample mean, x̄.

6
Q

What does a histogram provide?

A

Histograms are useful for larger data sets. Rather than showing the value of each observation, we think of each value as belonging to a bin. Observations that fall on the boundary of a bin (e.g. 10.00%) are allocated to the lower bin. These binned counts are plotted as bars to form what is called a histogram.
Histograms provide a view of the data density: higher bars show where the data are relatively more common.
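
The binning rule can be sketched in a few lines. This is only an illustration: the rates, the bin start, and the bin width are made up, and `bin_counts` is a hypothetical helper. It assumes every value is at least `start` and places a value equal to `start` in the first bin. Note that many plotting libraries use the opposite boundary convention, so their counts can differ for values that sit exactly on a bin edge.

```python
import math
from collections import Counter


def bin_counts(values, start, width):
    """Count observations per bin, where bin k covers (start + k*width, start + (k+1)*width].

    A value exactly on a boundary goes to the lower bin, as the card describes.
    Assumption: all values are >= start, and a value equal to start goes in bin 0.
    """
    counts = Counter()
    for x in values:
        k = max(math.ceil((x - start) / width) - 1, 0)
        counts[k] += 1
    return counts


# Hypothetical interest rates in percent (note the two values exactly at 10.0).
rates = [6.0, 7.5, 10.0, 10.0, 11.0, 12.5, 14.0, 18.0]
counts = bin_counts(rates, start=5.0, width=5.0)

# Crude text histogram: one '#' per observation in each bin.
for k in sorted(counts):
    low, high = 5.0 + k * 5.0, 5.0 + (k + 1) * 5.0
    print(f"{low:>5.1f}% to {high:>5.1f}%: {'#' * counts[k]}")
```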

7
Q

What do right skewed and left skewed mean? And what about symmetric?

A

When data trail off to the right and have a longer right tail, the shape is said to be right skewed. Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed
to the positive end.
Data sets with the reverse characteristic – a long, thinner tail to the left – are said to be left skewed. We also say that such a distribution has a long left tail. Data sets that show roughly equal trailing off in both directions are called symmetric.

8
Q

What does it mean, when a distribution has a “long tail”?

A

A distribution has a long tail when the data trail off in one direction.

9
Q

What is a mode?

A

A mode is represented by a prominent peak in the distribution.
Histograms can have one, two, or three prominent peaks. Such distributions
are called unimodal, bimodal, and multimodal, respectively. Any distribution with more than
two prominent peaks is called multimodal. A second, less prominent peak in an otherwise unimodal
distribution is not counted as a mode when it differs from its
neighboring bins by only a few observations.

10
Q

What are the two measures of variability?

A

The variance and the standard deviation.
The standard deviation roughly describes how far away the typical observation is from the mean.
We call the distance of an observation from its mean its deviation.
If we square deviations and then take an average, the result is equal to the sample variance, denoted by s^2.
We divide by n − 1, rather than dividing by n, when computing a sample’s variance; there’s some
mathematical nuance here, but the end result is that doing this makes this statistic slightly more
reliable and useful.
Notice that squaring the deviations does two things. First, it makes large values relatively
much larger, seen by comparing (−0.67)^2, (−1.65)^2, (14.73)^2, and (−5.49)^2. Second, it gets rid of
any negative signs.
The standard deviation is defined as the square root of the variance.

The variance is the average squared distance from the mean. The standard deviation is the
square root of the variance. The standard deviation is useful when considering how far the data
are distributed from the mean.
The standard deviation represents the typical deviation of observations from the mean. Usually
about 70% of the data will be within one standard deviation of the mean and about 95% will
be within two standard deviations. However, as seen in Figures 2.8 and 2.9, these percentages
are not strict rules.
The population values for variance and standard deviation have special symbols:
σ^2 for the variance and σ for the standard deviation.
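
The steps above (deviations, squaring, dividing by n − 1, then taking a square root) can be sketched as follows. The data values are made up, and the last two lines only cross-check the hand computation against Python's standard `statistics` module.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up observations
n = len(data)
mean = sum(data) / n

# Deviation: the distance of each observation from the mean.
deviations = [x - mean for x in data]

# Sample variance s^2: sum of squared deviations divided by n - 1 (not n).
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

# Standard deviation s: square root of the variance.
sample_sd = math.sqrt(sample_variance)

print(f"s^2 = {sample_variance:.3f}, s = {sample_sd:.3f}")

# Cross-check against the standard library, which also divides by n - 1.
print(math.isclose(sample_variance, statistics.variance(data)))
print(math.isclose(sample_sd, statistics.stdev(data)))
```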

11
Q

What can you see in a box plot?

A

A box plot summarizes a data set using five statistics while also plotting unusual observations.

12
Q

What is the median, and how do you find it?

A

The median splits the data in half: 50% of the data fall below the median and the other 50% fall above it. If there is an even number of observations, the median is the average of the two observations closest to
the 50th percentile.
When there is an odd number of observations, there will be exactly one observation that splits the data into two halves, and in such a case that observation is the median (no average needed).
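
A minimal sketch of the two cases, using a hypothetical `median` helper and made-up values:

```python
def median(values):
    """Return the middle value of the sorted data, averaging the two middle values if needed."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one observation splits the data in half.
        return ordered[mid]
    # Even count: average the two observations closest to the 50th percentile.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([3, 1, 7])
even_median = median([3, 1, 7, 5])
print(odd_median, even_median)
```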

13
Q

What is the interquartile range (IQR)?

A

The IQR is the length of the box in a box plot. It is computed as
IQR = Q3 - Q1
where Q1 and Q3 are the 25th and 75th percentiles.
It, like the standard deviation, is a measure of variability in data. The more
variable the data, the larger the standard deviation and IQR tend to be. The two boundaries of the
box are called the first quartile (the 25th percentile, i.e. 25% of the data fall below this value) and
the third quartile (the 75th percentile).
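
As a small sketch, the IQR can be computed with Python's `statistics.quantiles`. The data are made up, and the `method="inclusive"` setting is an assumption for this example: software packages use slightly different conventions for percentiles, so Q1 and Q3 may differ a little from tool to tool.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up observations

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```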

14
Q

What are whiskers?

A

Extending out from the box, the whiskers attempt to capture the data outside of the box.
However, their reach is never allowed to be more than 1.5 × IQR. They capture everything within
this reach.

15
Q

What are outliers?

A

Any observation lying beyond the whiskers is labeled with a dot. The purpose of labeling these
points (instead of extending the whiskers to the minimum and maximum observed values) is to help
identify any observations that appear to be unusually distant from the rest of the data. Unusually
distant observations are called outliers.
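
The whisker and outlier rules can be sketched together. The data are made up, and the quartile convention (`method="inclusive"`) is an assumption for this example; plotting software may compute the quartiles slightly differently.

```python
import statistics

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 30])  # made-up observations

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of the whiskers: 1.5 * IQR beyond each end of the box.
lower_reach = q1 - 1.5 * iqr
upper_reach = q3 + 1.5 * iqr

# Each whisker stops at the most extreme observation that is still within reach.
within_reach = [x for x in data if lower_reach <= x <= upper_reach]
whisker_low, whisker_high = within_reach[0], within_reach[-1]

# Observations beyond the whiskers are the ones drawn as individual dots.
outliers = [x for x in data if x < lower_reach or x > upper_reach]

print(f"whiskers span {whisker_low} to {whisker_high}; outliers: {outliers}")
```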

16
Q

What are robust statistics? Give an example.

A

The median and IQR are called robust statistics because extreme observations have little effect on their values: moving the most extreme value generally has little influence on these statistics.
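
To illustrate, the sketch below moves the largest value of a small made-up data set far to the right and compares the summaries before and after. The `summarize` helper and the quartile convention (`method="inclusive"`) are assumptions for this example.

```python
import statistics


def summarize(values):
    """Return the mean, SD, median, and IQR of the values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
shifted = original[:-1] + [90]  # move the most extreme value from 9 to 90

before = summarize(original)
after = summarize(shifted)

# The mean and SD change a lot; the median and IQR do not change here.
for name in before:
    print(f"{name:>6}: {before[name]:.2f} -> {after[name]:.2f}")
```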

17
Q

What is a transformation?

A

There are some standard transformations that may be useful for strongly right skewed data
where much of the data is positive but clustered near zero. A transformation is a rescaling of
the data using a function. For instance, a plot of the logarithm (base 10) of county populations
results in the new histogram in Figure 2.13(b). This data is symmetric, and any potential outliers
appear much less extreme than in the original data set. By reining in the outliers and extreme
skew, transformations like this often make it easier to build statistical models against the data.
Transformations other than the logarithm can be useful, too. For instance, the square root
and inverse are commonly used by data scientists. Common goals in transforming data are to see the data structure differently, reduce skew, assist in
modeling, or straighten a nonlinear relationship in a scatterplot.
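
A minimal sketch of a log transformation, using made-up county populations (not the data behind the figure mentioned above). The square root is shown alongside as one of the other transformations the card lists.

```python
import math

# Hypothetical county populations: mostly small, with one very large value.
county_populations = [1_200, 8_500, 23_000, 95_000, 410_000, 9_800_000]

# Rescale each value with a function: base-10 logarithm and square root.
log_populations = [math.log10(p) for p in county_populations]
sqrt_populations = [math.sqrt(p) for p in county_populations]

for p, lg, rt in zip(county_populations, log_populations, sqrt_populations):
    print(f"{p:>10,}  log10 = {lg:5.2f}  sqrt = {rt:8.1f}")
```

On the log scale the largest county is only a few units away from the others, while the ordering of the counties is unchanged.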

18
Q

What is an intensity map?

A

When we encounter geographic data, we should create an intensity map, where colors are used to show higher and lower values of a variable.
The color key indicates which colors correspond
to which values. The intensity maps are not generally very helpful for getting precise values in
any given county, but they are very helpful for seeing geographic trends and generating interesting
research questions or hypotheses.

19
Q

What is a contingency table?

A

A table that summarizes data for two categorical variables. Each value in the table
represents the number of times a particular combination of variable outcomes occurred. Row and column totals are also included. We can also create a table that shows only the overall
percentages or proportions for each combination of categories, or we can create a table for a single
variable.
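
As a rough sketch, a contingency table can be built by counting combinations of outcomes. The cases and the two variable names (homeownership and application type) are made up for illustration.

```python
from collections import Counter

# Hypothetical cases: (homeownership, application type).
cases = [
    ("rent", "individual"), ("rent", "individual"), ("rent", "joint"),
    ("mortgage", "individual"), ("mortgage", "joint"), ("mortgage", "joint"),
    ("own", "individual"),
]

# Each cell counts how often a particular combination of outcomes occurred.
cell_counts = Counter(cases)
row_totals = Counter(home for home, _ in cases)
col_totals = Counter(app for _, app in cases)

rows = ["rent", "mortgage", "own"]
cols = ["individual", "joint"]

# Print the table with row and column totals.
print(f"{'':10}" + "".join(f"{c:>12}" for c in cols) + f"{'Total':>8}")
for r in rows:
    cells = "".join(f"{cell_counts[(r, c)]:>12}" for c in cols)
    print(f"{r:10}" + cells + f"{row_totals[r]:>8}")
print(f"{'Total':10}" + "".join(f"{col_totals[c]:>12}" for c in cols) + f"{len(cases):>8}")
```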

20
Q

What is a stacked bar plot?

A

A stacked bar plot is a graphical display of contingency table information.
The stacked bar plot is most useful when it’s reasonable to assign one variable as the explanatory
variable and the other variable as the response, since we are effectively grouping by one variable first
and then breaking it down by the others.

21
Q

What is a side-by-side bar plot?

A

A graphical display of contingency table information.
Compared to the stacked bar plot, side-by-side bar plots are more agnostic in their display about which variable, if any, represents the explanatory and which the response variable. It is also easy to discern the number of cases in different group combinations. However, one downside is that it tends to require more horizontal
space; the narrowness of Figure 2.23(b) makes the plot feel a bit cramped. Additionally, when two
groups are of very different sizes, as we see in the own group relative to either of the other two
groups, it is difficult to discern if there is an association between the variables.

22
Q

What is the standardized stacked bar plot?

A

A graphical display of contingency table information.
The standardized stacked bar plot is helpful if the primary variable in the stacked bar plot is relatively
imbalanced, e.g. the own category has only a third of the observations in the mortgage category,
making the simple stacked bar plot less useful for checking for an association. The major downside
of the standardized version is that we lose all sense of how many cases each of the bars represents.
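
The standardization step amounts to dividing each count by its group total. The sketch below uses made-up counts in which the own group is a third the size of the mortgage group; `row_proportions` is a hypothetical helper.

```python
# Made-up counts: outer keys are homeownership groups, inner keys are application types.
counts = {
    "rent": {"individual": 30, "joint": 10},
    "mortgage": {"individual": 45, "joint": 15},
    "own": {"individual": 6, "joint": 14},
}


def row_proportions(table):
    """Divide each count by its group total so every group sums to 1."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {level: count / total for level, count in row.items()}
    return result


proportions = row_proportions(counts)
for group, row in proportions.items():
    print(group, {level: round(p, 2) for level, p in row.items()})
```

The proportions make the groups directly comparable, but the group sizes (40, 60, and 20 cases here) are no longer visible, which is the downside noted above.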

23
Q

What is a mosaic plot?

A

A mosaic plot is a visualization technique suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well.
It is built from a square that is broken up into columns, one for each category (level) of a variable.
The column widths correspond to the number of cases in each of those categories.
In general, mosaic plots use box areas to represent the number of cases in each category.

24
Q

When are pie charts useful?

A

Pie charts can be useful for giving a high-level overview to show how a set of cases break down.
However, it is also difficult to decipher details in a pie chart.

25
Q

What are hollow histograms used for?

A

Hollow histograms compare numerical data across groups. They are just the outlines of histograms of each group put on the same plot.
The hollow histograms are more useful for seeing distribution shape, skew, and potential anomalies.

26
Q

What is the side-by-side box plot used for?

A

The side-by-side box plot is a traditional tool for comparing numerical data across groups.
There are two box plots, one for each group, placed
into one plotting window and drawn on the same scale.
The side-by-side box plots are especially useful for comparing centers and spreads.