Statistical Concepts Flashcards
Orderable versus nonorderable
Nonorderable variable: A discrete measure in which the sequence of categories cannot be meaningfully ordered
Orderable discrete variable: A discrete measure that can be meaningfully arranged into an ascending or descending sequence
Discrete versus continuous
Discrete variable: A variable that classifies persons, objects, or events according to the kind or quality of their attribute
Dichotomous variable: A discrete measure with two categories that may or may not be ordered (two mutually exclusive and collectively exhaustive (MECE) categories like “male” and “female”)
Continuous variable: A variable that, in theory, can take on all possible numerical values in a given interval (slightly counterintuitive examples: age, years at school, number of children born, occupational prestige, annual earned income)
Dichotomous variable
Discrete variables with directionality, as they can always be coded as 0 or 1. Example: Disability status.
Nominal variable
Discrete variables with no clear ordering. Example: Nationality.
Ordinal variable
Discrete variables that have a clear ordering to them. Example: Satisfaction.
Interval variable
Continuous variables where the distance between levels is equal. Yet they do not have meaningful zero points; for example, 80 degrees Fahrenheit is not twice as hot as 40 degrees Fahrenheit.
Ratio variable
Continuous variables where the distance between levels is equal and they DO have meaningful zero points; for example, 25 pounds is 5 times as heavy as 5 pounds.
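The interval versus ratio distinction can be checked with a little arithmetic. The sketch below is illustrative only: converting to the Kelvin scale (which has a true zero) is an assumption added here to show why the raw Fahrenheit ratio is misleading; the function name is hypothetical.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (deg_f - 32) * 5 / 9 + 273.15

# Interval scale: the raw Fahrenheit readings have a ratio of 2.0,
# but Fahrenheit's zero point is arbitrary, so this ratio is not meaningful.
naive_ratio = 80 / 40

# On a scale with a meaningful zero, the same two temperatures
# differ by a factor of only about 1.08.
kelvin_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

# Ratio scale: pounds have a true zero, so this ratio is meaningful.
weight_ratio = 25 / 5

print(naive_ratio, round(kelvin_ratio, 2), weight_ratio)
```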
Manifest variable
A variable that can be observed (opposite of latent)
Latent variable
A variable that cannot be observed and can only be measured indirectly (e.g., degree of centralization in decision making; inclusivity of company culture); opposite of manifest
Status variable
A variable whose outcome cannot be manipulated, like race or gender (often treated as independent variables nevertheless)
Predictor variable
A variable that has an antecedent or potentially causal role, usually appearing first in hypotheses
Outcome variable
A variable that has an affected role in relation to the predictor variable; in other words, the values taken on by the outcome variable depend on the predictor variable
Recoding
The process of changing the codes established for a variable; this could either be grouping values together, or winnowing down the number of groups.
Recoding becomes important when considering statistical tests: as a rule of thumb, at least 80% of categories should have 5 or more observations, and every category should have at least 1 observation.
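A minimal sketch of recoding by grouping values together, using only the Python standard library. The satisfaction scale, the counts, and the helper name meets_rule_of_thumb are hypothetical, chosen only to show sparse categories being collapsed.

```python
from collections import Counter

# Hypothetical survey responses on a five-point satisfaction scale.
responses = (
    ["very dissatisfied"] * 2
    + ["dissatisfied"] * 3
    + ["neutral"] * 10
    + ["satisfied"] * 12
    + ["very satisfied"] * 4
)

# Recode map: collapse five categories into three.
recode_map = {
    "very dissatisfied": "dissatisfied",
    "dissatisfied": "dissatisfied",
    "neutral": "neutral",
    "satisfied": "satisfied",
    "very satisfied": "satisfied",
}

recoded = [recode_map[r] for r in responses]
counts = Counter(recoded)

def meets_rule_of_thumb(category_counts):
    """True if at least 80% of categories have 5+ observations
    and no category has fewer than 1."""
    values = list(category_counts.values())
    share_with_five = sum(v >= 5 for v in values) / len(values)
    return share_with_five >= 0.8 and min(values) >= 1

print(meets_rule_of_thumb(Counter(responses)))  # original coding: False
print(meets_rule_of_thumb(counts))              # after recoding: True
```

In this made-up data only two of the five original categories reach 5 observations; after collapsing, all three do.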
Inferential Statistics
Allows generalizations or conclusions to be made about population parameters based on statistics computed from a random sample.
Requires a random sample: every member of the population has an equal, non-zero chance of being selected for the research.
Null Hypothesis Significance Testing (NHST)
Judges whether variables are related in a population by testing the hypothesis that they are unrelated based on what we see in our sample. That is, we use inference to test the null hypothesis.
Test the null hypothesis, not the alternative hypothesis
We can reject the null hypothesis (in which case we “tentatively accept” the alternative hypothesis) or fail to reject it
Need a strong theoretical basis in your proposal for an alternative hypothesis - the test alone doesn’t answer all of your questions
If the probability of observing the sample relationship under the null hypothesis is small, we can tentatively decide that the observed relationship is characteristic of the “true” population
Why does random sampling matter?
Researchers can often only study a sample, because studying the whole population would be too difficult, too expensive and too time-consuming. However, in order to generalise, the sample has to be representative of the population. The only way to ensure representativeness is to draw a random sample, so that every unit of the population has an equal chance of being included in the sample; otherwise generalisations to the population are not possible, as one might have selected a biased sample that does not represent the underlying population well.
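A minimal sketch of drawing a simple random sample with Python's standard library. The population of 1,000 ID numbers, the sample size of 50, and the seed are all hypothetical.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 1,000 people.
population = list(range(1, 1001))

# A fixed seed is used only so that the sketch is reproducible.
rng = random.Random(42)

# Simple random sample without replacement: every unit has the same
# chance (50 / 1000) of ending up in the sample.
sample = rng.sample(population, k=50)

print(len(sample))
```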
Measures of central tendency
In statistics, the term central tendency relates to the way in which quantitative data tend to cluster around some value. A measure of central tendency is any of a number of ways of specifying this “central value”. Measures of central tendency include the mode, median and mean.
Mean
The mean gives information about the average value of a variable, e.g. the number of hours spent online. For skewed data, the mean might not be the most appropriate statistic as it is influenced by extreme values or outliers.
Median
With the median, researchers can describe the value that splits the distribution in half: 50% of the values lie below it and 50% lie above it. This is especially important for skewed distributions like income, where the mean is not the best measure of central tendency because it is heavily influenced by extreme values and outliers.
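The effect of an extreme value on the mean but not on the median can be sketched as follows. The income figures are hypothetical.

```python
import statistics

# Hypothetical annual incomes; one very large value skews the data.
incomes = [22_000, 25_000, 27_000, 30_000, 31_000, 35_000, 400_000]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # the middle value: 30000

print(round(mean_income), median_income)
```

Here the mean (about 81,429) is higher than six of the seven incomes, while the median stays at a typical value.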
Measures of Dispersion
Measures of dispersion are important for describing the spread of the data, or its variation around a central value. The most important measures of dispersion are the variance and its positive square root, the standard deviation.
Standard Deviation
The variance is expressed in squared units of measurement, which in most cases makes an intuitive interpretation impossible. To restore the original units of measurement, the positive square root of the variance is taken; this is the standard deviation.
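A short sketch of the relationship between the two measures, using the sample versions (dividing by n - 1) from Python's statistics module. The data are hypothetical.

```python
import math
import statistics

# Hypothetical hours spent online per day for six people.
hours_online = [1, 2, 2, 3, 4, 6]

variance = statistics.variance(hours_online)  # sample variance, in hours squared
std_dev = statistics.stdev(hours_online)      # sample standard deviation, in hours

print(variance, round(std_dev, 2))
```

The squared deviations from the mean of 3 sum to 16, so the sample variance is 16 / 5 = 3.2 and the standard deviation is its square root, about 1.79 hours.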
Positive skew
Right skew. The tail of a skewed distribution is to the right of the median (mean greater than median)
Negative skew
Left skew. The tail of a skewed distribution is to the left of the median (median greater than mean)
Why does skewness matter?
Information about skewness is very important in social statistics, as most distributions of social factors are positively skewed: for income, for example, there are many people in the lower classes, but fewer in the upper classes.
Moreover, skewness matters because certain statistics become less meaningful for skewed data. For example, the mean is not the best statistic to describe skewed distributions, as it is highly influenced by extreme values and might therefore not represent a very typical value of the distribution.
Interpreting skewness statistics
If skewness is less than -1 or greater than 1, the distribution is highly skewed.
If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
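The cut-offs above can be applied as in the sketch below. It assumes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second); statistical packages often report a bias-adjusted variant, so their values may differ slightly. The function names and data are hypothetical, and values falling exactly on a cut-off are treated here as belonging to the milder category.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no bias adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def describe_skew(g1):
    """Classify a skewness value using the rule-of-thumb cut-offs."""
    if g1 < -1 or g1 > 1:
        return "highly skewed"
    if g1 < -0.5 or g1 > 0.5:
        return "moderately skewed"
    return "approximately symmetric"

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]  # long tail to the right: positive skew

print(describe_skew(skewness(symmetric)))
print(describe_skew(skewness(right_tailed)))
```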
Percentiles
Percentiles are very useful in social statistics for describing the value below which x% of values lie and above which (100 - x)% lie. With percentiles, distributions can be compared and information relative to other values can be given: for example, grades are often not given in absolute terms, but as percentiles, so that the 99th percentile means that 99% of the class scored lower.
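A minimal sketch of a percentile rank, defined here as the percentage of scores strictly below a given score. The function name and the exam scores are hypothetical, and other conventions (for example, counting ties as half) also exist.

```python
def percentile_rank(scores, value):
    """Percentage of scores strictly below `value`."""
    below = sum(s < value for s in scores)
    return 100 * below / len(scores)

# Hypothetical exam scores for a class of ten.
scores = [55, 60, 62, 65, 70, 71, 75, 78, 80, 95]

rank_of_80 = percentile_rank(scores, 80)
print(rank_of_80)  # 80.0: eight of the ten scores are lower
```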
Measures of Association
Statistics that show the direction and/or magnitude of a relationship between pairs of discrete variables. Usually five aspects of an association are assessed: existence, significance, strength, direction and pattern.
Many measures of association use the PRE interpretation concept: PRE (proportionate reduction in error) statistics reflect how much knowledge of one variable improves prediction of the second variable.
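The PRE idea can be sketched with Goodman and Kruskal's lambda, one PRE statistic for nominal variables. Choosing lambda as the example is an assumption, since no specific measure is named above; the function name and the counts in the table are hypothetical. Lambda compares the prediction errors made without knowing the predictor (E1) to the errors made when knowing it (E2): lambda = (E1 - E2) / E1.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable from the row variable.

    `table` is a list of rows of observed counts. The result is undefined
    (division by zero) if every case falls in a single column.
    """
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    # E1: errors when always guessing the overall modal column.
    errors_without = n - max(col_totals)
    # E2: errors when guessing the modal column within each row.
    errors_with = sum(sum(row) - max(row) for row in table)
    return (errors_without - errors_with) / errors_without

# Hypothetical counts: rows are two predictor categories,
# columns are two outcome categories.
table = [
    [10, 40],
    [35, 15],
]

pre = goodman_kruskal_lambda(table)
print(round(pre, 3))  # 0.444
```

With these counts E1 = 45 and E2 = 25, so knowing the row variable reduces prediction errors by 20 / 45, or about 44%.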
Statistical Significance
A test of inference that conclusions based on a sample of observations also hold true for the population from which the sample was selected.
The null hypothesis always states that differences in percentages are due to chance, and the alternative hypothesis states that observed differences are too large to be explained by chance alone, therefore variables are associated.
Chi-Square Test
A test of statistical significance based on a comparison of the observed cell frequencies of a joint contingency table with frequencies that would be expected under the null hypothesis of no relationship.
The main difference from the t-test is that the chi-square test focuses on association more broadly (with categorical variables), whereas the t-test compares means (continuous or ordinal outcome variables). Both are tests of significance.
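The comparison of observed and expected cell frequencies can be sketched in plain Python. The function name and the 2x2 counts are hypothetical, and the sketch stops at the statistic and its degrees of freedom; obtaining a p-value would need a chi-square distribution table or a statistics library.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table
    of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under the null hypothesis of no relationship.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, df

# Hypothetical 2x2 contingency table of observed counts.
observed = [
    [10, 40],
    [35, 15],
]

stat, df = chi_square(observed)
print(round(stat, 2), df)  # 25.25 1
```

A table whose rows have identical percentage distributions, such as [[10, 20], [20, 40]], yields a statistic of 0 because every observed count equals its expected count.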
Existence
If there is any difference in the percentage distributions, an association exists.
Significance
The observed differences are too large to be explained by chance alone, and are thus statistically significant.
Strength
Effect size (what is the size of the effect of the IV on the DV?). Not to be confused with statistical significance.
Direction
Positive or negative; only applies to ordinal variables. Determined by sort order of pairs.
Pattern
For ordinal variables, linear is simplest. For nominal variables, interpret category by category.
Sample Chi-Square Statistic Interpretation
Null hypothesis: there is no relationship between internet usage and TV consumption.
Alternative hypothesis: there is a relationship between internet usage and TV consumption.
According to the output, we returned a chi-square statistic of 193.21 at df=2 and p