Qual and Quant Research Methods - Statistics Flashcards

1
Q

Why are statistics necessary in political science research?

A

Stats are used to conduct quantitative analysis and understand raw data collected in research.

Descriptive stats are often used to describe the data collected even if the project uses mixed qual and quant methods.

Recent trends have increased the prevalence of statistical analysis in political science research in tandem with an increase in data availability.

2
Q

Is data usually in data matrix form in our research?

A

Yes

3
Q

Describe a data matrix… what are the rows, columns and cells?

A

Rows = observations (individuals, countries, elections…)

Columns = variables (income, age, level of education, campaign spending…)

Cells = represent the value of a variable for a specific observation (e.g. a specific individual’s income, a country’s GDP per capita…)

The specific format of the file depends on the program it was created with (e.g. Excel spreadsheet, Stata file…)

4
Q

What is nominal data?

A

Data where response categories cannot be placed in a meaningful order (and you cannot judge the distance between categories) - the categories are just labels

E.g. country of birth, ethnicity…

5
Q

What is ordinal data?

A

Data where response categories can be placed in rank order (but distance between categories cannot be measured mathematically - if lots of categories we sometimes treat them as continuous for analysis purposes)

E.g. Likert scale, rank preference, levels of education…

6
Q

What is quantitative (interval and ratio) data?

A

Responses are measured on a continuous scale with rank order - assuming uniform distance/interval between responses. Treated as continuous.

E.g. age in years, temperatures in degrees, 1-10 ranking, income in GBP…

7
Q

What are the three measures of central tendency?

A

Mean, median and mode

8
Q

What are the four measures of spread and position?

A

Range, standard deviation, percentiles and interquartile range

9
Q

What is the mean? How do you calculate it?

A

The simple average - you take the sum of all values (∑) in a sample, and then divide them by the number of observations (n)

Mean is denoted by 𝑥̅

The mean is only appropriate for quantitative (interval or ratio) data
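
As a rough illustration of the calculation, the sketch below sums the values and divides by n. The ages are hypothetical, made up for this example (chosen so that the mean comes out at 28, the figure used in the variance card).

```python
# Hypothetical ages for a sample of 10 respondents (illustrative data only).
ages = [19, 45, 18, 38, 21, 23, 31, 29, 29, 27]

n = len(ages)            # number of observations
total = sum(ages)        # the sum of all values (the ∑ step)
mean_age = total / n     # divide the sum by n

print(mean_age)
```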

10
Q

What is the median?

A

The observation in the middle when we rank/order all observations from lowest to highest (e.g. ages lowest to highest)

BUT if we have an even number of observations, take the mid-point between the two middle values

Appropriate for both interval and ordinal variables, but not nominal variables!
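
A minimal sketch of both cases (odd and even number of observations), using Python's standard `statistics` module. The ages are hypothetical and only there for illustration.

```python
import statistics

# Hypothetical ages (illustrative data only).
ages_even = [19, 45, 18, 38, 21, 23, 31, 29, 29, 27]   # 10 observations
ages_odd = ages_even[:-1]                               # drop one to get 9

# Odd n: the single middle value after sorting.
median_odd = statistics.median(ages_odd)

# Even n: the mid-point between the two middle values (27 and 29 here).
median_even = statistics.median(ages_even)

print(sorted(ages_odd), median_odd)
print(sorted(ages_even), median_even)
```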

11
Q

Ordinal example: Imagine the same sample of 10 respondents and what social class they identify as (between working, middle, and upper) - can you take the mean and/or median?

A

No - because there are no numerical values to be summed…

However, you can take the median if we arrange the values in order

12
Q

What is the mode?

A

The mode is the value that occurs most frequently

If two values tie for the highest frequency (each occurring more often than all the other values), the distribution is called bimodal

Appropriate for interval, ordinal AND nominal variables
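
A short sketch of finding the mode with `statistics.multimode`, which returns every value tied for the highest frequency. The data are hypothetical; the regions are made-up labels showing that the mode also works for nominal data.

```python
import statistics

# Hypothetical data (illustrative only).
ages = [19, 45, 18, 38, 21, 23, 31, 29, 29, 27]
regions = ["North", "South", "South", "East", "South", "North"]
two_peaks = [1, 1, 2, 2, 3]

mode_ages = statistics.multimode(ages)        # one mode
mode_regions = statistics.multimode(regions)  # works for nominal labels too
modes_tied = statistics.multimode(two_peaks)  # two values tie: bimodal

print(mode_ages, mode_regions, modes_tied)
```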

13
Q

Nominal example: Imagine the same sample of 10 respondents and what region they live in - can you take the mean, median and/or mode?

A

Mean = no (there are no numeric values to be summed)

Median = no (there is no meaningful order to put the categories into)

Mode = YES (it is the response that occurs the most)

14
Q

How do you compute the new mean if the origin of measurement is shifted?

A

Adding a constant to every value adds the same constant to the mean. Say next year all respondents are 1 year older and we want the new mean age: it will simply be 1 year greater than the old mean

Also applies to the median and mode when used for interval (numeric) variables

15
Q

How do you compute the new mean when there is a change in scale?

A

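Multiplying every value by a constant multiplies the mean by the same constant. Say we want to measure age in months rather than years: multiplying everyone’s age by 12 gives each respondent’s age in months, so the new mean is just the old mean multiplied by 12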

This also applies to the median and mode when used for interval (numeric) data
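
The shift rule (add a constant) and the scale rule (multiply by a constant) can be checked in a few lines. This is only a sketch with hypothetical ages; the variable names are illustrative.

```python
import statistics

# Hypothetical ages in years (illustrative data only).
ages = [19, 45, 18, 38, 21, 23, 31, 29, 29, 27]

older = [a + 1 for a in ages]     # shift of origin: everyone is 1 year older
months = [a * 12 for a in ages]   # change of scale: years -> months

for name, f in [("mean", statistics.mean),
                ("median", statistics.median),
                ("mode", statistics.mode)]:
    # Each measure moves by the same shift / factor as the data.
    print(name, f(ages), f(older), f(months))
```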

16
Q

How do you get the mean of two related variables?

A

To get mean of the sum, add the two means together

E.g. imagine variable age is actually composed of two variables: years spent in school and years not spent in school - once you get the means of the two variables separately you add them to get the mean of the sum of two variables

Does not work for mode and median
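
A small check of the rule, assuming two hypothetical variables (years in school and years out of school) that add up to age. The numbers are invented for illustration.

```python
import math
import statistics

# Hypothetical values for 5 respondents (illustrative data only).
years_in_school = [13, 16, 12, 18, 14]
years_out_of_school = [10, 20, 15, 12, 30]

# Age is the sum of the two variables for each respondent.
ages = [a + b for a, b in zip(years_in_school, years_out_of_school)]

mean_of_sum = statistics.mean(ages)
sum_of_means = statistics.mean(years_in_school) + statistics.mean(years_out_of_school)

# The two agree (up to floating-point rounding).
print(mean_of_sum, sum_of_means)

# The same shortcut fails for the median in this example.
median_of_sum = statistics.median(ages)
sum_of_medians = statistics.median(years_in_school) + statistics.median(years_out_of_school)
print(median_of_sum, sum_of_medians)
```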

17
Q

Why/when may the mean be not as informative as the median?

A

Where there are strong outliers that may affect the sample…

The mean is often heavily influenced by outliers (observations that have extreme values), and where there are strong outliers, the median might be a better measure of central tendency, or of a ‘typical observation’

18
Q

Why/when may the median be uninformative?

A

If there are relatively few values and/or a lot of zeroes!

Here the mean is often far more informative.

19
Q

How do you calculate the range?

A

Largest value MINUS smallest value of a data set

20
Q

Can two samples have the same mean but different ranges?

A

Yes

21
Q

Why/when may the range be uninformative?

A

The range is extremely sensitive to outliers - the range may not represent the spread of the majority of the data

E.g. could have a data set of 1, 2, 2, 3, 5, 6, 6, 29 - range would be uninformative
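
Using the data set from the example, a quick sketch shows how much the single outlier (29) drives the range, and how it pulls the mean but not the median.

```python
import statistics

data = [1, 2, 2, 3, 5, 6, 6, 29]   # data set from the card above
without_outlier = data[:-1]        # same data with the 29 dropped

data_range = max(data) - min(data)
range_without_outlier = max(without_outlier) - min(without_outlier)

mean_value = statistics.mean(data)
median_value = statistics.median(data)

print(data_range, range_without_outlier)
print(mean_value, median_value)
```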

22
Q

What do percentiles divide the data set into?

A

Hundredths of the distribution

The first percentile is the value below which the lowest 1% of the data fall, the second percentile the lowest 2%, and so on… the median is the 50th percentile

23
Q

What are quartiles in data sets?

A

Divides the data into quarters

The inter-quartile range (the distance between the first and third quartiles) is often more informative than the range and is often presented as a box-plot
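
A minimal sketch of quartiles and the inter-quartile range with `statistics.quantiles`. The ages are hypothetical, and note that different software uses slightly different interpolation conventions for quartiles, so exact cut points can differ between packages.

```python
import statistics

# Hypothetical ages (illustrative data only).
ages = [19, 45, 18, 38, 21, 23, 31, 29, 29, 27]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(ages, n=4)

iqr = q3 - q1   # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```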

24
Q

What is the ‘variance’ in data sets?

A

Variance is a measure of dispersion - it is a measure of how far a set of numbers is spread out from their average/mean value

You take the difference between each value and the mean (e.g. difference between age 19 and mean of 28 is -9)… you then square each of the differences making them all positive (prevents the sum from being zero)…

Then we sum the squared differences and divide by n-1 for a sample of a population (and by n if we have the entire population)…

If the sum of distances from the mean squared is 656 you divide it by the sample number of 10-1, so 656 divided by 9 and you get a variance of 72.9
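
The steps above can be sketched as follows. The ages are hypothetical, constructed for this example so that the mean is 28 and the squared differences sum to 656, matching the figures in the card; they are not the original sample.

```python
import math
import statistics

# Hypothetical sample of 10 ages, built to reproduce the card's figures.
ages = [19, 45, 18, 38, 21, 23, 31, 29, 29, 27]

n = len(ages)
mean_age = sum(ages) / n

# Step 1: difference between each value and the mean (e.g. 19 - 28 = -9).
differences = [a - mean_age for a in ages]

# Step 2: square each difference so they are all non-negative.
squared = [d ** 2 for d in differences]

# Step 3: sum and divide by n - 1 (sample variance).
sum_of_squares = sum(squared)
sample_variance = sum_of_squares / (n - 1)

print(sum_of_squares, round(sample_variance, 1))
```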

25
What is standard deviation?
The square root of the variance - represents the typical distance of an observation from the mean
As we calculated the variance using squared differences, taking the square root of the variance gives a rough measure of the typical distance of an observation from the mean, in the original units
So for our sample of 10 where the variance is 72.9, the standard deviation is 8.54 - if our observations are ages, observations are typically around 8.54 years away from the mean (in either direction, of course)
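The arithmetic can be checked directly from the figures in the variance card (sum of squared differences 656, n = 10). This is just a sketch of the calculation.

```python
import math

sum_of_squares = 656   # sum of squared differences from the mean (from the card)
n = 10                 # sample size

variance = sum_of_squares / (n - 1)
standard_deviation = math.sqrt(variance)

print(round(variance, 1), round(standard_deviation, 2))
```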
26
For what type of data is frequency and frequency distribution analysis useful?
Categorical data
It looks at how many observations are in each category - often in terms of relative frequencies, i.e. the proportions or percentages in different categories
27
How can relative frequency distribution be well presented?
With bar charts
The height of the bar shows the frequency or relative frequency in that category - the bars are kept separate to emphasise that it is a categorical variable
E.g. each bar may represent the relative frequency of each social class respondents identify themselves as
28
When are proportions particularly helpful?
When we have a dichotomous/dummy/binary variable (nominal variable with only 2 categories - e.g. yes/no, true/false...)
Let's say we code the values of this variable as 0 (no) and 1 (yes), add up all 10 given values and divide by 10 (no. of respondents). If we have four 1s and six 0s, we find that the proportion of respondents that answered yes is 0.4.
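In other words, the proportion is just the mean of a 0/1 variable. A tiny sketch with hypothetical answers (four 1s and six 0s, as in the card):

```python
# Hypothetical yes/no answers coded 1 (yes) and 0 (no): four 1s and six 0s.
answers = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# Adding up the values counts the "yes" answers; dividing by n gives the proportion.
proportion_yes = sum(answers) / len(answers)

print(proportion_yes)
```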
29
What are histograms?
A useful way to represent frequency distributions for quantitative variables
Values of the variable go on the x (horizontal) axis and how often each value occurs on the y (vertical) axis - useful for displaying the frequency distribution of age, or maybe GDP per capita by country...
(As the sample size increases, the sample distribution looks more like the population distribution. For continuous variables, if we imagine the sample size growing indefinitely, the shape of the histogram approaches a smooth curve.)
30
What is statistical inference?
The process of using what we know about a sample to make probabilistic statements about the broader population! Using probability theory, we can estimate what is going on in the population based on the sample
31
What does statistical inference rely on?
Probability - the mathematical likelihood of a given event occurring (e.g. a proportion of a population voting for Donald Trump)
32
What is a population parameter?
A numerical summary of the whole population (e.g. the population mean or a population proportion)
33
What is a sample statistic?
A quantity of the sample - this sample statistic provides an estimate of the population parameter
34
What can probability theory be used for quantifying?
Quantifying the uncertainty that using a sample statistic brings
35
What is probability in a random sample?
In a random sample or randomised experiment, the probability that an observation has a particular outcome is the proportion of times that outcome would occur in a very long sequence of similar observations. It is a number between 0 and 1 but in practice it is often expressed as a percentage.
35
What is a probability distribution?
A probability distribution lists the possible outcomes of an event and the probability of each outcome
It assigns a probability to each possible value of a random variable - the probabilities of all possible values sum to 1! (e.g. with 5 possible outcomes, the probabilities could be 0.1, 0.2, 0.4, 0.1, 0.2)
36
What is an example of a continuous variable?
Age - in theory it can take an infinite number of values
37
What does a graph of the probability distribution of a continuous variable look like?
A smooth continuous curve (e.g. relative frequency distribution of age variable)
38
Under the curve of a continuous variable's probability distribution graph you may want the probability of an interval of values (e.g. probability of ages 25-30) - how do you calculate it?
The area under the curve for an interval of values represents the probability that the variable takes a value in that interval. The total area under the curve (the interval containing all possible values) equals 1 - so the probability of an interval is the share of the total area that lies above it, not a sum of probabilities for individual values (for a continuous variable, any single exact value has probability 0)
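As a hedged sketch of the idea, the area over an interval can be computed as the difference between two cumulative probabilities. The example assumes, purely for illustration, that age follows a normal curve with mean 40 and standard deviation 12; these parameters are hypothetical.

```python
from statistics import NormalDist

# Assumption for illustration only: age ~ normal with mean 40, sd 12.
age_dist = NormalDist(mu=40, sigma=12)

# cdf(x) is the area under the curve to the left of x.
# Area between 25 and 30 = area left of 30 minus area left of 25.
p_25_to_30 = age_dist.cdf(30) - age_dist.cdf(25)

print(round(p_25_to_30, 3))
```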
39
What is the empirical rule?
The empirical rule states that, for an approximately bell-shaped distribution, approx 68% of values are within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% (almost all) within 3 standard deviations
E.g. if a distribution on an x axis of 0-100 is approximately bell-shaped with a mean (y̅) of 50 and a standard deviation (s) of 15, then:
-About 68% of the observations fall between 35 and 65 (y̅ − s and y̅ + s)
-About 95% of the observations fall between 20 and 80 (y̅ − 2s and y̅ + 2s)
-All or nearly all observations fall between 5 and 95 (y̅ − 3s and y̅ + 3s)
It is called the Empirical Rule because many frequency distributions seen in practice are approximately bell-shaped (e.g. exam results are on a bell curve). Both the standard deviation (s) and the mean (y̅) are key!
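The three percentages can be recovered from a normal curve with the card's mean of 50 and standard deviation of 15. This is a sketch using the standard library's `NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=15)   # mean and sd from the card's example

def share_within(k):
    """Share of the distribution within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

within_1 = share_within(1)   # between 35 and 65
within_2 = share_within(2)   # between 20 and 80
within_3 = share_within(3)   # between 5 and 95

print(round(within_1, 3), round(within_2, 3), round(within_3, 3))
```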
40
What is normal distribution in probability?
The normal distribution is the theoretical bell-shaped probability distribution from which the empirical rule's percentages come
It has the following characteristics:
-Bell-shaped curve
-Symmetric about the mean (mean, median, and mode are equal)
-Defined by two parameters - the mean and standard deviation
41
Why is normal distribution (normal probability distribution) important?
It approximates many real-world distributions well AND we can use it to make statistical inferences
42
How can the sampling distribution (and a wider set of samples) help us to better understand sample means?
The true mean may be 50 BUT WE DO NOT KNOW THIS... using random samples we get means of 47, 49, 51, 53... HOWEVER, from any one sample we do not know how close its mean falls to the true population mean (50)
The more samples we have, the better the profile we can build to estimate the true value - and using information about the spread of the sampling distribution, we can predict how close a sample mean is likely to fall!
43
What is central limit theorem in understanding sampling distribution?
As the sample size increases, the sampling distribution of the sample mean approximates the normal distribution! This occurs even if the variable you are interested in is not normally distributed - and the mean of the sampling distribution is the population mean!
Larger samples = sample means that cluster more tightly around the true population mean (key assumption is that samples are randomly drawn from the population)
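A small simulation can make this concrete. The sketch below assumes a deliberately skewed (exponential) population with a true mean of 50; the population, the sample size of 50 and the 2,000 repetitions are all arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)      # fixed seed so the sketch is reproducible

POP_MEAN = 50        # true population mean (normally unknown to the researcher)
SAMPLE_SIZE = 50     # observations per random sample
N_SAMPLES = 2000     # how many samples we draw

def draw_sample_mean():
    # expovariate(1 / mean) draws from a skewed, non-normal population.
    sample = [random.expovariate(1 / POP_MEAN) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)

sample_means = [draw_sample_mean() for _ in range(N_SAMPLES)]

# Centre and spread of the sampling distribution of the mean.
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(round(mean_of_means, 2), round(sd_of_means, 2))
```

The sample means centre on the population mean, and their spread is close to the population standard deviation divided by √n (here 50/√50 ≈ 7.1), even though the population itself is far from bell-shaped.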
44
What is a point estimate?
Point estimate - a single number that is the best guess for the parameter value
45
What is a confidence interval?
Confidence interval - an interval of numbers around the point estimate that we believe (with some confidence) contains the parameter value
The interval has a chosen confidence level (commonly 95%) - this is the percentage of intervals that would contain the parameter value if we repeated the sampling many times
46
How do we calculate a confidence interval?
Confidence interval...
Lower point = point estimate - margin of error
Upper point = point estimate + margin of error
Point estimate ± margin of error = ȳ ± t × se = ȳ ± t × s/√n
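A worked sketch of the formula, using hypothetical ages (n = 10, mean 28). The t value of 2.262 is the standard t-table critical value for a 95% confidence level with n − 1 = 9 degrees of freedom; it is hard-coded here as an assumption rather than computed.

```python
import math

# Hypothetical sample of 10 ages (illustrative data only).
ages = [19, 45, 18, 38, 21, 23, 31, 29, 29, 27]

n = len(ages)
y_bar = sum(ages) / n                                          # point estimate
s = math.sqrt(sum((y - y_bar) ** 2 for y in ages) / (n - 1))   # sample sd
se = s / math.sqrt(n)                                          # standard error

t_crit = 2.262   # assumption: 95% level, 9 degrees of freedom (from a t-table)

margin_of_error = t_crit * se
lower = y_bar - margin_of_error
upper = y_bar + margin_of_error

print(round(lower, 2), round(upper, 2))
```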
47
What is the algebraic representation of the point estimate? (i.e. which letter)
ȳ
48
How do you calculate the margin of error?
t × s/√n
The t-score times the standard error (standard deviation / square root of number of cases)
t = t-score (critical value) for the chosen confidence level
s = standard deviation
√n = square root of number of cases
49
What is a significance test?
A statistical significance test uses data to summarise the evidence about a hypothesis – by comparing point estimates of the parameters with the values predicted by the hypothesis!
Involves...
1 - assumptions
2 - hypotheses
3 - test statistic
4 - p-value
5 - conclusion of statistical significance
50
What kind of assumptions do we make in statistical significance tests?
-Type of data (quant or qual)
-Randomisation (assumed random sampling maybe)
-Population distribution (some tests assume a certain distribution)
-Sample size (approx normal sampling distribution - but if the sample is large enough there is no need for a normal population distribution)
51
What is the null hypothesis and the alternative hypothesis?
Null hypothesis H₀: a statement that the parameter takes a particular value (e.g. the difference between men and women's salaries is 0)
Alternative hypothesis Hₐ: the parameter falls in some alternative range of values (e.g. the difference between men and women's salaries is higher/lower than 0 - it is not 0)
52
What does a significance test do with regard to comparing and analysing the null vs alternative hypothesis?
A sig test analyses the sample evidence about the null hypothesis, and investigates whether the data contradict the null hypothesis, which would support the alternative hypothesis
It works like a proof by contradiction - the null hypothesis is presumed to be true. Under this presumption, if the data observed would be very unusual, we reject the null.
53
What does the test statistic do in analysing the estimate vs the parameter value?
The test statistic summarises the entire data set.
In a regression, the test statistic compares (1) how much Y varies with X - the regression slope estimate - and (2) how accurately we can measure the population slope - the standard error
It displays how consistent the estimate is with, or how far it falls from, the parameter value in H₀ (null hypothesis) - it is the number of standard errors between the estimate and the H₀ value, e.g. t = 1.22 or t = 7.15
54
What is the p-value in relation to the test statistic?
The probability that we would have obtained data like our sample if the null hypothesis were actually true
The p-value is the probability that the test statistic equals the observed value, or a value even more extreme in the direction predicted by Hₐ (alternative hypothesis), assuming H₀ is true
The smaller the p-value, the stronger the evidence against the null hypothesis - e.g. a p-value below 0.05 is significant at the 5% level (matching a 95% confidence level) and below 0.01 at the 1% level (99%)
A moderate to large p-value means the data are consistent with H₀
55
What is the α-level (alpha level) in relation to the p-value?
The α-level is the boundary value for the p-value, conventionally 0.05 - if the p-value is below/equal to this α-level of 0.05 (i.e. 95% confidence) we can reject the null hypothesis! (i.e. the result is statistically significant)
If the p-value is above this 0.05 α-level then there is not enough evidence to reject the null hypothesis
The smaller the α-level, the stronger the evidence must be to reject H₀. To avoid bias in the decision-making process you select α before analysing the data!
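The test statistic, p-value and α-level steps can be tied together in one short sketch. The estimate and standard error below are hypothetical, and the p-value uses a normal approximation (reasonable for large samples) instead of the t distribution, to stay within the standard library.

```python
from statistics import NormalDist

# Hypothetical inputs (illustrative only).
estimate = 2.5          # e.g. an estimated difference or slope
null_value = 0.0        # value of the parameter under H0
standard_error = 1.0
alpha = 0.05            # chosen BEFORE looking at the data

# Test statistic: number of standard errors between estimate and H0 value.
test_statistic = (estimate - null_value) / standard_error

# Two-sided p-value under a normal approximation (assumption, see lead-in).
p_value = 2 * (1 - NormalDist().cdf(abs(test_statistic)))

reject_null = p_value <= alpha

print(test_statistic, round(p_value, 4), reject_null)
```

If `reject_null` is false, the conclusion is that we fail to reject H₀, not that H₀ has been accepted.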
56
Do we ever say that we accept the null hypothesis?
No - we say that we fail to reject it when the p-value is above the α-level!
We cannot be sure that the null hypothesis is correct - our data just do not provide enough evidence to reject it
57
What are the three key components of statistical association we need to think about?
Nature/Direction - how do the variables actually affect each other (e.g. does a higher income increase how right wing you are)
Strength of relationship - how strongly does variable A affect variable B
Statistical significance - how likely is it that the association you observe in a sample generalises to the population
58
What is the simplest way of describing (or showing) a relationship between two quantitative variables?
As a linear relationship - with a straight line
59
What actually is a linear regression?
A linear regression gives us the best-fitting straight line describing the association between two variables
60
What is the formula of a linear function?
y = α + βx
y = the dependent variable
x = the independent variable
α = intercept/constant (value of y when x = 0)
β = slope/gradient (how much y changes when x increases by 1 - higher = stronger effect of x)
61
How do we rewrite the linear function to show the EXPECTED value of the dependent variable (as real world data is messy, not deterministic)?
E(y) = α + βx
E(y) is the mean of y for a given x - we account for variation around the regression line
Linear regression does not make exact predictions; it predicts an average value of Y for a given X
62
What is least squares estimation (i.e. minimising the sum of squared errors)?
The line that (under some assumptions) best fits the data at hand
It is the mathematical procedure for finding the best-fitting line for a given set of points by minimising the sum of the squares of the offsets ("the residuals") of the points from the line = ∑(y − ŷ)²
Residual = prediction error (distance between an observed value and the predicted value) = y − ŷ
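A minimal sketch of least squares for one predictor, using the standard closed-form solution (slope = sum of cross-products / sum of squares of x). The five data points are hypothetical.

```python
import math

# Hypothetical data points (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least squares estimates for y-hat = alpha + beta * x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta = s_xy / s_xx
alpha = y_bar - beta * x_bar

# Residual = observed value minus predicted value.
predicted = [alpha + beta * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# The quantity that least squares minimises.
sum_squared_errors = sum(r ** 2 for r in residuals)

print(round(alpha, 2), round(beta, 2), round(sum_squared_errors, 2))
```

Any other straight line through these points would give a larger sum of squared errors than this one.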
63
What is the sign of the slope coefficient (𝜷)? What does it tell us if 𝜷 is > 0 OR < 0 OR =0
Tells us about the nature (direction) of linear associations
β > 0 = positive relationship (as X increases, so does Y)
β < 0 = negative relationship (as X increases, Y decreases)
β = 0 = independence (as X increases, Y stays the same)
The size of β gets bigger with a stronger linear association
Something to bear in mind is that the size of β depends on measurement units
64
What does Pearson's r summarise/represent?
Pearson's r summarises the association between two quantitative variables in a single number. It is calculated using a standardised regression slope on a range of -1 to +1 i.e. it is the correlation between x and y, denoted by r
65
Pearson's r is on a scale of -1 to +1, if r < 0, r > 0, and r = 0, what does this mean regarding the association between the variables?
r = 1 = perfect positive association
r > 0 = positive association (as a rule of thumb: over 0.40 is strong, around 0.20 to 0.40 is moderate, below 0.20 is weak)
r = 0 = no linear association
r < 0 = negative association
r = -1 = perfect negative association
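A short sketch of computing Pearson's r by hand, using the usual formula r = S_xy / √(S_xx × S_yy). The data are hypothetical, and the names `s_xy`, `s_xx` and `s_yy` are just illustrative labels for the sums of cross-products and squares.

```python
import math

# Hypothetical data points (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)

# Rescaling x (e.g. years -> months) changes the slope but not r.
x_months = [xi * 12 for xi in x]
xm_bar = sum(x_months) / n
r_rescaled = (
    sum((xi - xm_bar) * (yi - y_bar) for xi, yi in zip(x_months, y))
    / math.sqrt(sum((xi - xm_bar) ** 2 for xi in x_months) * s_yy)
)

print(round(r, 3), round(r_rescaled, 3))
```

Unlike the slope β, r does not depend on measurement units, which is why it is useful as a standardised summary of strength.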
66
How do you represent the null and alternative hypotheses with β (the slope coefficient)?
Null hypothesis: β = 0
Alternative hypothesis: β ≠ 0
67
What does a coefficient plot ('dot and whisker') show when regression results are plotted?
A point is plotted - this can be seen as the point estimate, BUT with a confidence interval (i.e. upper and lower bound of margin of error are plotted around the main point) - it displays uncertainty to help put data in greater perspective
68
What does statistical control do?
It measures and accounts for potential confounders (z) in observational research - i.e. does the association between X and Y remain after controlling for Z?
69
How can statistical control be implemented?
Using a straightforward extension of a simple linear regression known as a multiple linear regression function
70
What does a multiple linear regression function help us do?
It helps us to engage in statistical control and rule out confounders (z) Interpretation remains the same, there are just new predictor variables (k)