Qual and Quant Research Methods - Statistics Flashcards
Why are statistics necessary in political science research?
Stats are used to conduct quantitative analysis and understand raw data collected in research.
Descriptive stats are often used to describe the data collected even if the project uses mixed qual and quant methods.
Recent trends have increased the prevalence of statistical analysis in political science research in tandem with an increase in data availability.
Is data usually in data matrix form in our research?
Yes
Describe a data matrix… what are the rows, columns and cells?
Rows = observations (individuals, countries, elections…)
Columns = variables (income, age, level of education, campaign spending…)
Cells = represent the value of a variable for a specific observation (e.g. a specific individual’s income, a country’s GDP per capita…)
The specific format of the file depends on the program it was created with (e.g. Excel spreadsheet, Stata file…)
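As a rough illustration of this layout, the sketch below builds a tiny data matrix in plain Python. The respondents, variable names and values are all invented for the example, and `cell` is a hypothetical helper, not part of any statistics package.

```python
# Hypothetical data matrix: each row is one observation (a respondent),
# each column is one variable.
columns = ["age", "income", "education"]
rows = [
    [34, 28000, "degree"],
    [51, 41000, "secondary"],
    [27, 19500, "degree"],
]


def cell(row_index, variable):
    """Return the value of one variable for one observation."""
    return rows[row_index][columns.index(variable)]


# The income of the second respondent (row index 1).
second_income = cell(1, "income")
print(second_income)
```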
What is nominal data?
Data where response categories cannot be placed in a specific order (you cannot rank them or judge the distance between categories) - they are simply labels
E.g. country of birth, ethnicity…
What is ordinal data?
Data where response categories can be placed in rank order (but the distance between categories cannot be measured mathematically - if there are many categories, we sometimes treat them as continuous for analysis purposes)
E.g. Likert scale, rank preference, levels of education…
What is quantitative (interval and ratio) data?
Responses are measured on a continuous scale with rank order - assuming uniform distance/interval between responses. Treated as continuous.
E.g. age in years, temperatures in degrees, 1-10 ranking, income in GBP…
What are the three measures of central tendency?
Mean, median and mode
What are the four measures of spread and position?
Range, standard deviation, percentiles and interquartile range
What is the mean? How do you calculate it?
The simple average - you take the sum of all values (∑) in a sample and then divide it by the number of observations (n)
Mean is denoted by 𝑥̅
The mean is only appropriate for quantitative (interval or ratio) variables
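A minimal sketch of the calculation, assuming a made-up sample of five ages (the values are illustrative only):

```python
# Hypothetical sample of respondents' ages.
ages = [19, 22, 25, 31, 43]

total = sum(ages)      # the sum of all values
n = len(ages)          # the number of observations
mean_age = total / n   # sum divided by n

print(mean_age)  # 28.0
```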
What is the median?
The observation in the middle when we rank/order all observations from lowest to highest (e.g. ages lowest to highest)
BUT if we have an even number of observations, take the mid-point between the two middle values
Appropriate for both interval and ordinal variables, but not nominal variables!
Ordinal example: Imagine the same sample of 10 respondents and what social class they identify as (between working, middle, and upper) - can you take the mean and/or median?
No - because there are no numerical values to be summed…
However, you can take the median if we arrange the values in order
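The sketch below works through both cases using Python's standard `statistics` module. The ages and social-class responses are invented, and coding the classes as ranks 0, 1, 2 is just one possible way to order them for the computation.

```python
import statistics

# Interval data: odd and even numbers of observations (hypothetical ages).
ages_odd = [43, 19, 31, 22, 25]
ages_even = [43, 19, 31, 22, 25, 50]

median_odd = statistics.median(ages_odd)    # middle value: 25
median_even = statistics.median(ages_even)  # midpoint of 25 and 31: 28.0

# Ordinal data: ten hypothetical social-class responses.
class_order = ["working", "middle", "upper"]  # lowest to highest
responses = ["middle", "working", "middle", "upper", "middle",
             "working", "middle", "middle", "working", "middle"]

# Replace each category with its rank so the responses can be ordered.
codes = [class_order.index(r) for r in responses]

# median_low returns an actual observed value, so no averaging of
# categories is needed.
median_class = class_order[statistics.median_low(codes)]

print(median_odd, median_even, median_class)
```

Note that no mean is computed for the ordinal variable: the rank codes only record order, not distance.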
What is the mode?
The mode is the value that occurs most frequently
If two values tie for the highest frequency, the distribution is called bimodal (with more than two such values, multimodal)
Appropriate for interval, ordinal AND nominal variables
Nominal example: Imagine the same sample of 10 respondents and what region they live in - can you take the mean, median and/or mode?
Mean = no (there are no numeric values to be summed)
Median = no (there is no meaningful order to put the categories into)
Mode = YES (it is the response that occurs the most)
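As a small illustration, the mode of a nominal variable can be found by counting categories. The regions and ages below are invented; `statistics.mode` and `statistics.multimode` are from Python's standard library (3.8 or later).

```python
import statistics

# Hypothetical nominal data: region of residence for 10 respondents.
regions = ["London", "Wales", "London", "Scotland", "London",
           "Wales", "North East", "London", "Scotland", "Wales"]

# The mode is simply the most frequent response.
modal_region = statistics.mode(regions)

# Hypothetical ages where two values tie for most frequent (bimodal).
ages = [19, 22, 22, 30, 30, 47]
modes = statistics.multimode(ages)

print(modal_region, modes)
```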
How do you compute the new mean if the origin of measurement is shifted?
Add the shift to the old mean - e.g. if next year all respondents are 1 year older and we want the new mean age, it is simply the old mean age plus 1 year
Also applies to the median and mode when used for interval (numeric) variables
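A quick check of the shift rule on a hypothetical sample of ages (values invented for illustration):

```python
import statistics

ages = [19, 22, 22, 30, 47]            # hypothetical ages this year
ages_next_year = [a + 1 for a in ages]  # everyone is 1 year older

# Each measure moves by exactly the size of the shift.
mean_shift = statistics.mean(ages_next_year) - statistics.mean(ages)
median_shift = statistics.median(ages_next_year) - statistics.median(ages)
mode_shift = statistics.mode(ages_next_year) - statistics.mode(ages)

print(mean_shift, median_shift, mode_shift)
```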
How do you compute the new mean when there is a change in scale?
Multiply the old mean by the scaling factor - e.g. if we want to measure age in months rather than years, we multiply everyone’s age by 12 to get each respondent’s age in months, so the new mean is the old mean multiplied by 12
This also applies to the median and mode when used for interval (numeric) data
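The same kind of check for a change of scale, again on invented ages:

```python
import statistics

ages_years = [19, 22, 22, 30, 47]          # hypothetical ages in years
ages_months = [a * 12 for a in ages_years]  # the same ages in months

# Each measure is multiplied by the scaling factor (12).
mean_months = statistics.mean(ages_months)
median_months = statistics.median(ages_months)
mode_months = statistics.mode(ages_months)

print(mean_months, median_months, mode_months)
```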
How do you get the mean of two related variables?
To get mean of the sum, add the two means together
E.g. imagine variable age is actually composed of two variables: years spent in school and years not spent in school - once you get the means of the two variables separately you add them to get the mean of the sum of two variables
This does not work for the median or the mode
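A sketch of the school example with invented numbers. It shows the means adding up, and the medians failing to do so for this particular sample:

```python
import statistics

# Hypothetical respondents: years spent in school and not in school.
school = [10, 18, 12, 16, 14]
not_school = [9, 4, 10, 14, 33]

# Age is the sum of the two variables for each respondent.
ages = [s + o for s, o in zip(school, not_school)]

# Mean of the sum equals the sum of the means.
sum_of_means = statistics.mean(school) + statistics.mean(not_school)
mean_of_sum = statistics.mean(ages)

# The same shortcut is not guaranteed for the median.
sum_of_medians = statistics.median(school) + statistics.median(not_school)
median_of_sum = statistics.median(ages)

print(sum_of_means, mean_of_sum)      # equal
print(sum_of_medians, median_of_sum)  # differ in this sample
```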
Why/when may the mean be not as informative as the median?
Where there are strong outliers (observations that have extreme values)…
The mean is often heavily influenced by outliers, so where they are present the median might be a better measure of central tendency, or of a ‘typical observation’
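To make the point concrete, here is a made-up income sample with one extreme value:

```python
import statistics

# Hypothetical incomes in GBP; the last respondent is a strong outlier.
incomes = [20000, 22000, 25000, 27000, 31000, 1000000]

mean_income = statistics.mean(incomes)      # pulled up by the outlier
median_income = statistics.median(incomes)  # midpoint of 25000 and 27000

print(mean_income, median_income)
```

In this sample the mean is larger than every income except the outlier's, while the median stays close to what most respondents earn.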
Why/when may the median be uninformative?
If there are relatively few values and/or a lot of zeroes!
Here the mean is often far more informative.
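An invented example of this situation, where most respondents report zero:

```python
import statistics

# Hypothetical count variable: protests attended by 10 respondents.
protests = [0, 0, 0, 0, 0, 0, 1, 2, 5, 12]

median_protests = statistics.median(protests)  # both middle values are 0
mean_protests = sum(protests) / len(protests)  # 20 / 10

print(median_protests, mean_protests)
```

The median only tells us that at least half the sample attended none, whereas the mean still reflects the respondents who did attend.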
How do you calculate the range?
Largest value MINUS smallest value of a data set
Can two samples have the same mean but different ranges?
Yes
Why/when may the range be uninformative?
The range is extremely sensitive to outliers - the range may not represent the spread of the majority of the data
E.g. could have a data set of 1, 2, 2, 3, 5, 6, 6, 29 - range would be uninformative
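Using the data set from the example above, a short sketch of how a single outlier dominates the range:

```python
data = [1, 2, 2, 3, 5, 6, 6, 29]

# Range = largest value minus smallest value.
full_range = max(data) - min(data)

# Dropping the single outlier (29), purely for comparison.
without_outlier = [x for x in data if x != 29]
trimmed_range = max(without_outlier) - min(without_outlier)

print(full_range, trimmed_range)  # 28 5
```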
What do percentiles divide the data set into?
Hundredths - the ordered data are split into 100 equal parts
The first percentile is the value below which the lowest 1% of the data fall, the second percentile the lowest 2%, and so on… the median is the 50th percentile
What are quartiles in data sets?
They divide the ordered data into quarters
The inter-quartile range (the distance between the first and third quartiles) is oftentimes more informative than the range and is often presented as a box-plot
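A rough sketch of quartiles and the inter-quartile range using `statistics.quantiles` from Python's standard library. The data reuse the outlier example from the range card. Software packages use slightly different conventions for locating quartiles in small samples, so exact values can vary between programs.

```python
import statistics

data = [1, 2, 2, 3, 5, 6, 6, 29]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                       # inter-quartile range
data_range = max(data) - min(data)  # range, for comparison

print(q1, q2, q3, iqr, data_range)
```

The second quartile is the median, and the inter-quartile range is far smaller than the range here because it ignores the outlier at 29.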
What is the ‘variance’ in data sets?
Variance is a measure of dispersion - it is a measure of how far a set of numbers is spread out from their average/mean value
You take the difference between each value and the mean (e.g. difference between age 19 and mean of 28 is -9)… you then square each of the differences making them all positive (prevents the sum from being zero)…
Then we sum the squared differences and divide by n-1 for a sample of a population (and by n if we have the entire population)…
If the sum of squared distances from the mean is 656 and the sample size is 10, you divide by 10-1, so 656 divided by 9 gives a variance of approximately 72.9
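The steps above can be sketched as follows. The function name `sample_variance` and the five ages are assumptions made for illustration. The ages are chosen so that the mean is 28 and the first deviation is -9, as in the description. The last lines check the arithmetic of the 656 example.

```python
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / (n - 1)


# Hypothetical sample of ages with mean 28.
ages = [19, 22, 22, 30, 47]

var_by_hand = sample_variance(ages)         # divides by n - 1
var_library = statistics.variance(ages)     # standard-library sample variance
var_population = statistics.pvariance(ages)  # divides by n instead

# The arithmetic from the example: 656 / (10 - 1).
example_variance = round(656 / (10 - 1), 1)

print(var_by_hand, var_library, var_population, example_variance)
```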