ST102.1 - Data visualisation and descriptive statistics Flashcards

1
Q

What are the two broad aims of statistical analysis?

A
  1. Descriptive statistics: summarise the data which were collected, in order to make them more understandable.
  2. Statistical inference: use the observed data to draw conclusions about some broader population.
2
Q

What is the purpose of descriptive statistics?

A

Descriptive statistics attempt to summarise some key features of the data to make them understandable and easy to communicate. These summaries may be graphical or numerical (tables or individual summary statistics).

3
Q

What are categorical variables?

A

Categorical variables (aka, qualitative) take on values that are names or labels. The colour of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables.

  1. Nominal categorical variables have categories that are names only, with no natural ordering. The term comes from the Latin “nomen” (meaning name).
    - Unordered categories are nominal data.
    - EXAMPLE. Region is a nominal variable coded (in alphabetical order) as follows:
    - 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = Northern America, 6 = Oceania.
  2. Ordinal categorical variables have categories with a natural order. However, the differences between adjacent categories are not measured on a standard scale.
    - Ordered categories are ordinal data.
    - EXAMPLE. The level of democracy is measured on an 11-point ordinal scale from 0 (lowest level of democracy) to 10 (highest level of democracy).
4
Q

What are quantitative variables?

A

Quantitative data are information about quantities, and are therefore recorded as numbers. Qualitative data, by contrast, are descriptive and concern phenomena which can be observed but not measured, such as language.

5
Q

Where is statistical data typically stored?

A

The statistical data in a sample are typically stored in a data matrix.

  • Variables are organised column-wise (x).
  • Individual observations (units) are organised row-wise (y).
  • The number of units in a dataset is the sample size, typically denoted by n. EXAMPLE. Here, n = 155 countries.
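
A minimal sketch of a data matrix in Python, assuming each row is stored as a dictionary. The country labels and all values below are invented for illustration and are not taken from the 155-country dataset:

```python
# Hypothetical data matrix: each row is one unit (a country),
# each key is one variable. All values are invented.
data = [
    {"country": "A", "region": 1, "democracy": 3, "gdp_per_capita": 2.1},
    {"country": "B", "region": 3, "democracy": 10, "gdp_per_capita": 35.4},
    {"country": "C", "region": 2, "democracy": 6, "gdp_per_capita": 8.9},
]

n = len(data)                                    # sample size: number of rows
variables = list(data[0].keys())                 # the columns of the matrix
region_column = [row["region"] for row in data]  # one variable across all units

print(n, variables, region_column)
```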
6
Q

Why would you need to distinguish between uppercase and lowercase N in a data set?

A

Capital N denotes population size whereas lowercase n denotes sample size.

7
Q

What are the 2 different characteristics of variables?

A
  1. Continuous variables can, in principle, take any real values within some interval.
    - EXAMPLE. GDP per capita is continuous, taking any non-negative value.
  2. Discrete variables are not continuous, i.e. they can only take certain values, but not any others.
    - EXAMPLE. Region and the level of democracy are discrete, with possible values of 1, 2, . . . , 6, and 0, 1, 2, . . . , 10, respectively.
    - However, a discrete variable can also have an unlimited number of possible values.
    - EXAMPLE. The number of visitors to a website in a day: 0, 1, 2, . . . .
8
Q

What is the simplest possibility of a discrete variable?

A

Many discrete variables have only a finite number of possible values. The simplest possibility is a binary, or dichotomous, variable, with just two possible values.
- EXAMPLE. A person’s sex could be recorded as 1 = female and 2 = male. Similarly, the answer to a yes/no question could be coded as 1 = Yes and 2 = No.

9
Q

What does the sample distribution of a variable consist of?

A

The sample distribution of a variable consists of:

  • A list of the values of the variable which are observed in the sample.
  • The number of times each value occurs (the counts or frequencies of the observed values).
10
Q

What can we do when the number of different observed values is small in a sample distribution?

A

When the number of different observed values is small, we can show the whole sample distribution as a frequency table of all the values and their frequencies.

11
Q

What is relative frequency?

A

Relative frequency or experimental probability is calculated from the number of times an event happens, divided by the total number of trials in an actual experiment.

  • This is a measure of proportion.
  • EXAMPLE. Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the sample.
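
A frequency table with relative frequencies can be sketched in Python as follows. The region codes are an invented sample of 10 units, not the 155-country data, and the names `freq` and `table` are only illustrative:

```python
from collections import Counter

# Hypothetical sample of a nominal variable coded 1..6 (values invented).
regions = [1, 3, 3, 2, 1, 3, 6, 2, 3, 1]

n = len(regions)
freq = Counter(regions)  # observed value -> number of times it occurs

# Each entry holds (frequency, relative frequency in %).
table = {value: (count, 100 * count / n) for value, count in sorted(freq.items())}

for value, (count, pct) in table.items():
    print(f"{value}: frequency={count}, %={pct:.1f}")
```

Values that never occur in the sample (here 4 and 5) do not appear in the table, because the sample distribution lists only observed values.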
12
Q

What is a bar chart?

A

A bar chart is the graphical equivalent of a table of frequencies. The relative frequencies of each region are clearly visible. You can display grouped data here which can be derived from raw data.

13
Q

What is a histogram?

A

A histogram is like a bar chart, but without gaps between bars, and often uses more bars (intervals of values) than is sensible in a table.

Histograms are usually drawn using statistical software, such as R. You can let the software choose the intervals and the number of bars.

14
Q

What is important when you group your data into non-overlapping intervals?

A

When you group your data into non-overlapping intervals, the intervals need to be Mutually Exclusive and Collectively Exhaustive (MECE), meaning that each observation must belong to one and only one of these intervals.

  • Mutually Exclusive means that each observation belongs to at most one interval/group.
  • Collectively Exhaustive means that each observation belongs to at least one of these intervals/groups.
15
Q

What is the meaning of different brackets with non-overlapping intervals in frequency tables?

A

( or ) means that the interval runs up to but does not include the endpoint (exclusive), so it corresponds to < or >.
[ or ] means that the interval includes the endpoint (inclusive), so it corresponds to ≤ or ≥.
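
The bracket convention can be sketched in Python. The boundaries below are hypothetical and describe the intervals [0, 10), [10, 20) and [20, 30]; closing the last interval on the right is an assumption made here so that the maximum still belongs to exactly one interval:

```python
# Hypothetical interval boundaries: [0, 10), [10, 20), [20, 30].
edges = [0, 10, 20, 30]

def interval_index(x, edges):
    """Return the index of the single interval that contains x."""
    if not edges[0] <= x <= edges[-1]:
        raise ValueError(f"{x} is outside every interval")
    for k in range(len(edges) - 1):
        is_last = k == len(edges) - 2
        # Left endpoint inclusive, right endpoint exclusive,
        # except for the last interval, which is closed on the right.
        if edges[k] <= x < edges[k + 1] or (is_last and x == edges[-1]):
            return k

values = [0, 9.99, 10, 20, 30]
print([interval_index(x, edges) for x in values])  # [0, 0, 1, 2, 2]
```

A boundary value such as 10 lands in [10, 20) only, which is what keeps the intervals mutually exclusive.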

16
Q

How can you better display the sample distribution on a histogram?

A

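A greater number of intervals on a histogram will generally show the shape of the sample distribution in more detail, although with too many intervals the bars become sparse and the overall shape is harder to see.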

17
Q

What is skewness and symmetry used for in data presentation?

A

Skewness and symmetry are terms used to describe the general shape of a sample distribution.

18
Q

What is the interpretation if something is positively/negatively skewed or symmetric?

A
  • If the distribution of GDP per capita has a ‘long right tail’ it is called positively skewed (or skewed to the right).
  • A distribution with a longer left tail (i.e. toward small values) is negatively skewed (or skewed to the left).
  • A distribution is symmetric if it is not skewed in either direction. A symmetric distribution need not be normal: for example, it could have equally long tails in both directions.
19
Q

What are summary statistics?

A

Summary statistics are descriptive statistics that summarise one feature of the sample distribution in a single number.
- Frequency tables, bar charts and histograms aim to summarise the whole sample distribution of a variable.

20
Q

What are measures of central tendency?

A

We consider the following measures of central tendency:

  • Mean (i.e. the average, sample mean or arithmetic mean)
  • Median
  • Mode.
21
Q

What is the difference between univariate and bivariate?

A
  1. Univariate means “one variable” (one type of data).
  2. Bivariate means “two variables”, in other words, there are two types of data.

22
Q

What single letter is used to denote a variable and what denotes a single observation of a variable?

A
  1. In formulae, a generic variable is denoted by a single letter. In these course notes, this is usually X.
    - However, any other letter (Y , W etc.) can also be used, as long as it is used consistently.
  2. A letter with a subscript denotes a single observation of a variable.
    - We use Xi to denote the value of X for unit i, where i can take values 1, 2, . . . , n, and n is the sample size.
    - Therefore, the n observations of X in the dataset (the sample) are X1, X2, . . . , Xn. These can also be written as Xi, for i = 1, 2, . . . , n.
23
Q

How do you use summation notation to find the sum of numbers?

A

Let X1, X2, . . . , Xn (i.e. Xi, for i = 1, 2, . . . , n) be a set of n numbers. The sum of the numbers is written as:
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n.$$

This may be written as ΣiXi, or just ΣXi.

24
Q

How do you use summation notation to find the infinite sum of numbers?

A

$$\sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots$$

25
Q

How do you use summation notation to find sums of sets of observations other than 1 to n?

A

$$\sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}$$
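
Finite sums over an index range map onto Python list slices. This is a small sketch with invented numbers; note that Python lists are 0-indexed, so the observation Xi is stored at position i − 1:

```python
# Hypothetical observations X_1, ..., X_n (values invented), with n = 6.
x = [4, 8, 15, 16, 23, 42]
n = len(x)

total = sum(x)  # sum over i = 1, ..., n

# Sum over i = 2, ..., n/2. X_i lives at x[i - 1], so X_2, ..., X_{n/2}
# is the slice from position 1 up to (but not including) position n // 2.
partial = sum(x[1 : n // 2])

print(total, partial)  # 108 23
```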

26
Q

What is the sample mean and how do you calculate it?

A

The sample mean (‘arithmetic mean’, ‘mean’ or ‘average’) is the most common measure of central tendency.

  • The sample mean of a variable X is denoted X¯ (where the bar is above the X).
  • It is the ‘sum of the observations (Σ Xi)’ divided by the ‘number of observations (n)’ (sample size) expressed as:
      $$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

EXAMPLE. The mean X¯ of the numbers 1, 4 and 7 is:
(1 + 4 + 7)/3 = 4
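
The same calculation as a short Python sketch, using the three numbers from the example:

```python
x = [1, 4, 7]       # the numbers from the example above
n = len(x)          # sample size
mean = sum(x) / n   # sum of the observations divided by the number of observations

print(mean)  # 4.0
```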

27
Q

What is a linear operator?

A

Function f is called a linear operator if it has the two properties:

  • f(x+y)=f(x)+f(y) for all x and y;
  • f(cx)=cf(x) for all x and all constants c.

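EXAMPLE. Summation is a linear operator, so:

$$\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X} = n\bar{X} - n\bar{X} = 0.$$

Here $\sum_{i=1}^{n} X_i = n\bar{X}$ follows from the definition of the sample mean, and $\sum_{i=1}^{n} \bar{X} = n\bar{X}$ because the constant $\bar{X}$ is added n times.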
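
A quick numerical check that the deviations from the sample mean sum to zero, using invented numbers. A tolerance is used in the comparison because floating-point arithmetic may leave a tiny rounding error for other data:

```python
import math

x = [2, 5, 11]      # hypothetical observations (values invented)
n = len(x)
mean = sum(x) / n   # sample mean, here 6.0

deviations = [xi - mean for xi in x]   # X_i minus the sample mean
total_deviation = sum(deviations)

# The two sums on the right-hand side are both equal to n times the mean.
assert math.isclose(sum(x), n * mean)
print(deviations, math.isclose(total_deviation, 0.0, abs_tol=1e-12))
```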

28
Q

What are the order statistics and what are they used to calculate?

A

The order statistics are used to calculate the sample median.

  • Let X(1), X(2), . . . , X(n) denote the sample values of X when ordered from the smallest to the largest, known as the order statistics, such that:
  • X(1) is the smallest observed value (the minimum) of X. This is not to be confused with X1, which is the first value of X in the dataset; in general X(1) ≠ X1.
  • X(n) is the largest observed value (the maximum) of X.
29
Q

How can you calculate the sample median if n is odd or even?

A

The (sample) median, q50, of a variable X is the value that is ‘in the middle’ of the ordered sample.

  1. If n is odd, then q50 = X((n+1)/2).
    - EXAMPLE. If n = 3, q50 = X(2).
    - EXAMPLE. If n = 155, q50 = X(78), i.e. the 78th ordered value of X.
  2. If n is even, then q50 = (X(n/2) + X(n/2+1))/2.
    - EXAMPLE. If n = 4, q50 = (X(2) + X(3))/2, i.e. the average of the 2nd and 3rd ordered values.
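
The two cases can be sketched as a small Python function. The name `sample_median` is illustrative, and the index shifts by one arise because Python lists are 0-indexed while the order statistics are numbered from 1:

```python
def sample_median(values):
    """Median computed from the order statistics X_(1) <= ... <= X_(n)."""
    ordered = sorted(values)  # the order statistics
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # n odd: the single middle value X_((n+1)/2)
        return ordered[(n + 1) // 2 - 1]
    # n even: the average of X_(n/2) and X_(n/2+1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(sample_median([7, 1, 4]))      # n = 3, gives 4
print(sample_median([7, 1, 4, 10]))  # n = 4, gives 5.5
```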
30
Q

What is more sensitive to outliers? Median or mean?

A

In general, the mean is affected much more than the median by outliers, i.e. unusually small or large observations.

  • Therefore, you should identify outliers early on and investigate them – perhaps there has been a data entry error, which can simply be corrected.
  • If deemed genuine outliers, a decision has to be made about whether or not to remove them.
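
A small illustration of this sensitivity, using invented numbers and the `mean` and `median` functions from Python's standard `statistics` module:

```python
from statistics import mean, median

clean = [2, 3, 3, 4, 5]       # hypothetical observations (values invented)
with_outlier = clean + [100]  # the same sample plus one unusually large value

mean_clean, median_clean = mean(clean), median(clean)
mean_out, median_out = mean(with_outlier), median(with_outlier)

print(mean_clean, median_clean)  # mean 3.4, median 3
print(mean_out, median_out)      # mean 19.5, median 3.5
```

Adding one extreme value moves the mean from 3.4 to 19.5, while the median only moves from 3 to 3.5.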
31
Q

How do the mean and median relate to the skew of a distribution?

A

Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the long tail of the sample distribution.
- When summarising variables with skewed distributions, it is useful to report both the mean and the median.

  1. For a positively skewed distribution, the mean is larger than the median.
  2. For a negatively skewed distribution, the mean is smaller than the median.
  3. For an exactly symmetric distribution, the mean and median are equal.
32
Q

What is the mode?

A

The (sample) mode of a variable is the value that has the highest frequency (i.e. appears most often) in the data.
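
A minimal sketch of finding the mode by counting, with invented values. Because two or more values can tie for the highest frequency, the sketch collects all of them in a list:

```python
from collections import Counter

x = [3, 1, 3, 2, 1, 3]  # hypothetical observations (values invented)

counts = Counter(x)     # value -> frequency
highest = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == highest)  # every tied value

print(modes)  # [3]
```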

33
Q

What are the measures of dispersion?

A

Measures of dispersion describe how spread out the values of a variable are. Dispersion can be measured by several different statistics, such as the range, variance and standard deviation.

34
Q

What is the sample variance?

A

The sample variance of a variable X, denoted S^2 (or S_X^2), is defined as:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

35
Q

What is the sample standard deviation?

A

The sample standard deviation of X, denoted S (or SX ), is the positive square root of the sample variance.
- The standard deviation is more understandable than the variance, because the standard deviation is expressed in the same units as X (rather than the variance, which is expressed in squared units).
$$S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$
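
Both formulas can be checked with a short Python sketch. It reuses the numbers 1, 4 and 7 from the sample mean example and compares the hand calculation with `statistics.variance`, which also uses the n − 1 divisor:

```python
import math
import statistics

x = [1, 4, 7]
n = len(x)
mean = sum(x) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)
# Sample standard deviation: positive square root of the variance.
s = math.sqrt(s2)

print(s2, s)                   # 9.0 3.0
print(statistics.variance(x))  # library value for comparison
```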

36
Q

What is important about the distribution within 1 or 2 standard deviations from the mean?

A

A useful rule-of-thumb for interpretation is that for many symmetric distributions, such as the ‘normal’ distribution:

  • About 2/3 of the observations are between X¯ − S and X¯ + S, that is, within one (sample) standard deviation about the (sample) mean.
  • About 95% of the observations are between X¯ − 2 × S and X¯ + 2 × S, that is, within two (sample) standard deviations about the (sample) mean.
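
The rule of thumb can be explored with simulated data. This is only a sketch: the sample below is drawn from a standard normal distribution with Python's `random.gauss`, it is not real data, and the exact proportions will vary slightly with the seed and sample size:

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable
x = [random.gauss(0, 1) for _ in range(10_000)]  # simulated observations

m = statistics.mean(x)
s = statistics.stdev(x)  # sample standard deviation (n - 1 divisor)

# Proportion of observations within one and two standard deviations of the mean.
within_1 = sum(m - s <= xi <= m + s for xi in x) / len(x)
within_2 = sum(m - 2 * s <= xi <= m + 2 * s for xi in x) / len(x)

print(round(within_1, 3), round(within_2, 3))
```

For a sample of this kind the two proportions should come out close to 2/3 and 95% respectively.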
37
Q

What is important to remember about standard deviation and variance?

A

Standard deviations (and variances) are never negative, and they are zero only if all the Xi observations are the same (that is, there is no variation in the data).

38
Q

What are some sample quantiles?

A
  1. The first quartile, q25 or Q1, is the value which divides the sample into the smallest 25% of observations and the largest 75%.
  2. The median, q50, is the value which divides the sample into the smallest 50% of observations and the largest 50%.
  3. The third quartile, q75 or Q3, gives the 75%–25% split.
  4. If we consider other percentage splits, we get other (sample) quantiles (percentiles), qc.
    - The extremes in this spirit are the minimum, X(1) (the ‘0% quantile’, so to speak), and the maximum, X(n) (the ‘100% quantile’).
    - These are no longer ‘in the middle’ of the sample, but they are more general measures of location of the sample distribution.
39
Q

What are the quantile-based measures of dispersion?

A
  1. Range: X(n) − X(1) = maximum − minimum
    - The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the extremes of the distribution, i.e. the minimum and maximum observations.
  2. Interquartile range (IQR): IQR = q75 − q25 = Q3 − Q1.
    - The IQR focuses on the middle 50% of the distribution, so it is completely insensitive to outliers.
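
The contrast between the two measures can be sketched in Python with invented data. The quartiles come from `statistics.quantiles` with its default "exclusive" method; quartile conventions differ between textbooks and software, and the choice here is an assumption rather than the definition used in the course notes:

```python
import statistics

def spread(values):
    """Return (range, IQR) of a sample."""
    ordered = sorted(values)
    value_range = ordered[-1] - ordered[0]          # maximum - minimum
    q1, _, q3 = statistics.quantiles(ordered, n=4)  # the three quartile cut points
    return value_range, q3 - q1

x = list(range(1, 12))   # hypothetical sample: 1, 2, ..., 11
x_out = x[:-1] + [1000]  # same sample with the maximum replaced by an outlier

print(spread(x))      # (10, 6.0)
print(spread(x_out))  # (999, 6.0)
```

Replacing the maximum by 1000 changes the range from 10 to 999 but leaves the IQR at 6.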
40
Q

What is a boxplot?

A

A boxplot (in full, a box-and-whiskers plot) summarises some key features of a sample distribution using quantiles. The plot is comprised of the following:

  • The line inside the box, which is the median.
  • The box, whose edges are the first and third quartiles (Q1 and Q3).
  • Hence the box captures the middle 50% of the data. Therefore, the length of the box is the interquartile range.
  • The bottom whisker extends either to the minimum or up to a length of 1.5 times the interquartile range below the first quartile, whichever is closer to the first quartile.
  • The top whisker extends either to the maximum or up to a length of 1.5 times the interquartile range above the third quartile, whichever is closer to the third quartile.
  • Points beyond 1.5 times the interquartile range below the first quartile or above the third quartile are regarded as outliers, and plotted as individual points.
  • Boxplots are useful for comparisons of how the distribution of a continuous variable varies across different groups, i.e. across different levels of a discrete variable.
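
The numbers behind a boxplot can be sketched as follows. Two assumptions are made here: the quartiles use the default method of `statistics.quantiles`, and each whisker ends at the most extreme observation that lies within 1.5 × IQR of the box, which is a common software convention. The function name and data are illustrative only:

```python
import statistics

def boxplot_summary(values):
    """Compute the quantities a boxplot displays (a sketch, not a plotting routine)."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1  # length of the box
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside[0],    # most extreme point inside the lower fence
        "whisker_high": inside[-1],  # most extreme point inside the upper fence
        "outliers": outliers,        # plotted as individual points
    }

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])  # invented data
print(summary)
```

With these values the fences are at −6 and 18, so 30 is flagged as an outlier and the top whisker stops at 10.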
41
Q

What is a scatterplot?

A

A scatterplot shows the values of two continuous variables against each other, plotted as points in a two-dimensional coordinate system.

42
Q

What is a lineplot?

A

A common special case of a scatterplot is a line plot (time series plot), where the variable on the x-axis is time. The points are connected in time order by lines, to show how the variable on the y-axis changes over time.

43
Q

What is a contingency table?

A

A (two-way) contingency table (or cross-tabulation) shows the frequencies in the sample of each possible combination of the values of two discrete variables. Such tables often show the percentages within each row or column of the table.
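
A contingency table with row percentages can be sketched by counting pairs of values. The pairs below are invented (a region code and a yes/no indicator), and the names `table` and `row_pct` are illustrative:

```python
from collections import Counter

# Hypothetical paired observations of two discrete variables (values invented).
pairs = [(1, "yes"), (1, "no"), (2, "yes"), (1, "yes"), (2, "no"), (2, "no")]

counts = Counter(pairs)  # (row value, column value) -> frequency
rows = sorted({r for r, _ in pairs})
cols = sorted({c for _, c in pairs})

# Frequencies for every combination, including any combination that never occurs (count 0).
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Percentages within each row.
row_pct = {
    r: {c: 100 * table[r][c] / sum(table[r].values()) for c in cols}
    for r in rows
}

print(table)
print(row_pct)
```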