Chapters 12 and 13 Flashcards
Types of statistics
Descriptive and inferential
Define both types of statistics
Descriptive statistics: used to describe the characteristics of a sample or population.
Ex: class average
Inferential statistics: used to infer (estimate) population parameters (value within a population) from a subgroup (sample) of the population.
Technical assumptions: parametric vs non-parametric statistics
Parametric statistics: have built-in assumptions about the data distribution that must be met if the statistic is to be used.
For example, you could assume a normal distribution of the underlying populations.
Non-parametric statistics: no built-in assumptions about the data distribution.
Raw vs relative frequency
Raw: the results indicate the actual number of cases that take on each value
Relative: the results are expressed as a percentage of the cases that take on each value
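A minimal sketch of the difference, assuming a small made-up list of survey responses (the data and variable names are illustrative only):

```python
from collections import Counter

# Hypothetical survey responses (illustrative data, not from the chapters)
responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "no"]

raw = Counter(responses)        # raw frequency: number of cases per value
total = sum(raw.values())
relative = {value: 100 * count / total for value, count in raw.items()}

for value in raw:
    print(f"{value}: raw = {raw[value]}, relative = {relative[value]:.1f}%")
```

Because relative frequencies are scaled by the total, they stay comparable across data sets of different sizes.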
What are the measures of central tendency?
A group of statistics that present a single value that best represents the distribution of responses
Mean, mode, median
Measure of dispersion
group of statistics that indicate how well the measure of central tendency represents the distribution
Variation ratio, Range and Standard deviation
Measures of central tendency and dispersion of nominal variables
Mode: measure of central tendency used with nominal variables
Most frequent
Variation ratio: proportion of cases that do not fit within the modal category
Larger values indicate more variation, meaning the mode does not represent the distribution well
Smaller values indicate less variation, indicating the mode does a good job of representing the distribution
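The mode and variation ratio can be sketched as follows, assuming a made-up nominal variable (the function name and data are hypothetical):

```python
from collections import Counter

def variation_ratio(values):
    """Return (mode, proportion of cases outside the modal category)."""
    counts = Counter(values)
    modal_value, modal_count = counts.most_common(1)[0]
    return modal_value, 1 - modal_count / len(values)

# Hypothetical party preferences for ten respondents
party = ["A", "A", "A", "B", "C", "A", "B", "A", "C", "A"]
mode, ratio = variation_ratio(party)
print(mode, round(ratio, 2))  # 6 of 10 cases are "A", so 4 of 10 fall outside the mode
```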
Measures of central tendency and dispersion of ordinal variables
Median: the most appropriate measure of central tendency. Value of observation that splits the distribution of cases in half
Range: the measure of dispersion used with ordinal-level variables. The range of possible values that the variable encompasses. Ignores all information except for the two most extreme scores
Interquartile range is more commonly used. The range between the 25th and 75th percentile. Not influenced by outliers
what is an outlier?
Outlier: a case that differs significantly from the others
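A short sketch of how one outlier affects the range but not the median or interquartile range, assuming made-up scores and the "inclusive" quartile method from Python's statistics module (other quartile conventions give slightly different cut points):

```python
from statistics import median, quantiles

def spread_summary(values):
    """Return (median, range, interquartile range) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), max(values) - min(values), q3 - q1

# Hypothetical scores; the second list swaps one score for an outlier
scores = [1, 2, 2, 3, 3, 4, 4, 5, 5]
with_outlier = [1, 2, 2, 3, 3, 4, 4, 5, 50]

print(spread_summary(scores))        # range is 4
print(spread_summary(with_outlier))  # range jumps to 49; median and IQR stay the same
```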
Measures of central tendency and dispersion for interval/ratio variables
Arithmetic mean: calculated by adding all of the values and then dividing by the total number of cases
When the distribution is skewed, the median is a better measure because it is not influenced by extreme cases
Standard deviation: estimates the average amount that each observation differs from the mean.
Positive vs negative skew
Extreme scores skew a distribution by pulling the mean in their direction
Positively skewed: extreme scores pull the mean above the median
Negatively skewed: extreme scores pull the mean below the median
The greater the difference between the mean and median…
the more skewed the distribution is.
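A minimal sketch of the mean-versus-median comparison, assuming two made-up data sets (the function name is hypothetical):

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skew by comparing the mean with the median."""
    gap = mean(values) - median(values)
    if gap > 0:
        return "positive"
    if gap < 0:
        return "negative"
    return "none"

# Hypothetical data: one very high score, then one very low score
high_extreme = [30, 32, 35, 38, 40, 200]   # mean 62.5, median 36.5
low_extreme = [1, 60, 62, 65, 68, 70]      # mean below the median of 63.5

print(skew_direction(high_extreme))  # positive
print(skew_direction(low_extreme))   # negative
```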
what is the standard deviation?
Standard deviation: estimates the average amount that each observation differs from the mean.
Characteristics of standard deviation
The size of the standard deviation depends on how clustered the scores are around the mean
Smaller deviation if the scores are closer to the mean
The value of a standard deviation is never negative
If all scores are identical, there is no deviation, meaning the standard deviation equals zero.
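These characteristics can be checked with a quick sketch, assuming three made-up sets of scores that share the same mean. It uses the population form of the standard deviation (pstdev); the sample form divides by n − 1 instead.

```python
from statistics import pstdev

# Hypothetical scores, all with a mean of 10
identical = [10, 10, 10]
clustered = [9, 10, 11]
spread_out = [0, 10, 20]

for name, scores in [("identical", identical),
                     ("clustered", clustered),
                     ("spread out", spread_out)]:
    print(f"{name}: standard deviation = {pstdev(scores):.2f}")
```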
standardized scores
Scores expressed as the number of standard deviations they fall from the mean of the total distribution of scores.
Standardized scores can be positive or negative depending on whether they fall above or below the mean
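A minimal sketch of standardizing scores, assuming made-up test scores and the population standard deviation (names are illustrative):

```python
from statistics import mean, pstdev

def standardize(scores):
    """Convert each score to the number of standard deviations from the mean."""
    m, sd = mean(scores), pstdev(scores)
    return [(score - m) / sd for score in scores]

# Hypothetical test scores with a mean of 80
scores = [60, 70, 80, 90, 100]
z_scores = standardize(scores)
for score, z in zip(scores, z_scores):
    print(f"score {score}: z = {z:+.2f}")
```

Scores below the mean come out negative, scores above it positive, and a score equal to the mean gets a standardized score of 0.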
contingency tables
Contingency tables: used when working with ordinal or nominal variables. Tables in which the cell where an individual case is located is contingent upon its scores for each of the variables.
scatter plots
Scatter plots: used when working with interval/ratio variables. Graphs in which the location of each individual case's point is contingent upon its scores for each of the variables.
what is a perfect correlation?
when knowing the value of one variable always allows us to predict the value of the other.
Measures of association
indicate the strength of the relationship with a single numerical value
What is the range of measures of association for each type of variable?
Nominal: 0 to 1
The closer the coefficient is to 0, the weaker the relationship
Ordinal and interval/ratio: -1 to +1
0 means no relationship, while values closer to +1 or -1 mean a stronger relationship
positive vs negative coefficient
A positive coefficient means a positive relationship (change in the same direction)
A negative coefficient means a negative relationship (change in opposite directions: one variable increases as the other decreases)
If you want to compare two data sets of different sizes, would it be better to use…
relative frequency
What are standardized scores also called?
Z scores
How to identify which of two Z scores is less typical
The one that is further away from the mean (the larger absolute value) is less typical
what is the alpha level also known as
significance level (the confidence level is its complement, 1 − alpha)
p-value
The probability of observing a sample statistic at least as extreme as the one obtained, assuming the null hypothesis is true
The lower the p-value, the greater the confidence that the null hypothesis does not describe the population from which our sample was drawn.
what does it mean when a value is statistically significant?
If a p-value is lower than our pre-determined alpha level, we conclude the relationship to be “statistically significantly different from the null hypothesis.”
Meaning a result like ours would be unlikely to occur if the null hypothesis were true.
Type I vs Type II error
Type I means it is a false positive: we infer that a relationship found in the sample exists in the population when in fact it does not
Type II means it is a false negative: we do not find a relationship within the sample data when one does in fact exist in the population
What is the PRE measure of association for nominal data?
Lambda
Used when one or both bivariate variables are nominal
Uses the mode for prediction
PRE?
Proportional reduction in error measures: before and after comparison. Comparing the amount of error we have before knowing the value of an independent variable with the amount of remaining error after knowledge about the independent variable is taken into account.
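The before-and-after comparison can be sketched with lambda, which predicts the mode. This is a minimal illustration assuming a made-up contingency table; the function name and data are hypothetical.

```python
def lambda_pre(table):
    """table maps each independent-variable category to a dict of
    dependent-variable counts. Returns the proportional reduction in error."""
    n = sum(sum(row.values()) for row in table.values())
    dependent_totals = {}
    for row in table.values():
        for category, count in row.items():
            dependent_totals[category] = dependent_totals.get(category, 0) + count
    # Errors made by always guessing the overall mode of the dependent variable
    errors_before = n - max(dependent_totals.values())
    # Errors made by guessing the mode within each independent-variable category
    errors_after = sum(sum(row.values()) - max(row.values()) for row in table.values())
    return (errors_before - errors_after) / errors_before

# Hypothetical data: residence (independent) by party preference (dependent)
votes = {
    "urban": {"party A": 30, "party B": 10},
    "rural": {"party A": 10, "party B": 30},
}
print(lambda_pre(votes))  # errors drop from 40 to 20
```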
What is the non-pre measure of association for nominal data?
Cramer’s V
Compares the number of cases that would be expected in each cell if there were no relationship between the two variables to the actual number of cases observed.
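A sketch of the expected-versus-observed comparison, assuming a made-up 2×2 table. It uses the standard chi-square-based formula for Cramér's V; the function names and data are illustrative.

```python
from math import sqrt

def chi_square(observed):
    """Return (chi-square statistic, total cases) for a table given as rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected with no relationship
            stat += (count - expected) ** 2 / expected
    return stat, n

def cramers_v(observed):
    stat, n = chi_square(observed)
    smaller_dimension = min(len(observed), len(observed[0]))
    return sqrt(stat / (n * (smaller_dimension - 1)))

# Hypothetical observed counts (expected count is 20 in every cell)
observed = [[30, 10],
            [10, 30]]
print(chi_square(observed)[0], cramers_v(observed))
```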
T-test
Comparing the means of two groups
Considers the difference between the two sample means and the amount of variation within each sample
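A minimal sketch of how both ingredients enter the t statistic, assuming made-up samples and the unequal-variance (Welch) form of the standard error, which may differ from the pooled version used in some textbooks:

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(sample_a, sample_b):
    """Difference between sample means divided by its standard error."""
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical groups: same means (7 vs 3) but different within-group variation
tight_a, tight_b = [5, 6, 7, 8, 9], [1, 2, 3, 4, 5]
spread_a, spread_b = [1, 4, 7, 10, 13], [-3, 0, 3, 6, 9]

print(t_statistic(tight_a, tight_b))    # little variation, larger t
print(t_statistic(spread_a, spread_b))  # more variation, smaller t
```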
Chi-square test
Measures the association between two categorical variables.
tests the independence of two variables by assessing the likelihood that the relationship observed in the sample is due to chance.
What are the measures of association for ordinal data?
Gamma: A PRE measure that can be interpreted in terms of percentage reduction of error. Uses less information than Tau measures
Overstates the strength of the association
Can be used with both asymmetrical and symmetrical tables
Tau: uses more information and is less likely to inflate the strength of relationships
The choice between tau measures depends on the table dimensions
Tau B for symmetrical (square) tables, Tau C for asymmetrical (rectangular) tables
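A sketch comparing gamma with tau-b on ordinal data, assuming six made-up cases. Gamma ignores tied pairs, while tau-b (Kendall's formula) takes them into account; the function name and data are illustrative.

```python
from itertools import combinations
from math import sqrt

def ordinal_association(x, y):
    """Return (concordant pairs, discordant pairs, gamma, tau-b)."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    gamma = (concordant - discordant) / (concordant + discordant)
    tau_b = (concordant - discordant) / sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return concordant, discordant, gamma, tau_b

# Hypothetical ordinal scores (1 = low, 3 = high) for six cases
education = [1, 1, 2, 2, 3, 3]
interest = [1, 2, 1, 3, 2, 3]
c, d, gamma, tau_b = ordinal_association(education, interest)
print(c, d, round(gamma, 3), round(tau_b, 3))
```

With these data gamma comes out larger than tau-b, because the tied pairs are dropped from gamma's denominator.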
What is the measure of association for interval/ratio data?
Pearson’s r: measures the linear relationship between an independent and dependent variable.
Varies between -1 and +1
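A minimal sketch of Pearson's r computed from deviations around the means, assuming made-up data (names are illustrative):

```python
from math import sqrt

def pearson_r(x, y):
    """Linear correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical independent variable and three possible dependent variables
hours = [1, 2, 3, 4, 5]
print(pearson_r(hours, [2, 4, 6, 8, 10]))  # perfect positive relationship
print(pearson_r(hours, [10, 8, 6, 4, 2]))  # perfect negative relationship
print(pearson_r(hours, [2, 4, 5, 4, 5]))   # positive but imperfect
```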
Sampling distribution
The distribution of all the possible sample means for a given sample size. Created by totalling the number of sample combinations that produce each sample mean
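A small sketch of building a sampling distribution by enumeration, assuming a made-up population of five values and samples of size 2 drawn without replacement:

```python
from collections import Counter
from itertools import combinations

# Hypothetical population (mean = 3)
population = [1, 2, 3, 4, 5]
sample_size = 2

# Every possible sample of the given size, and the mean of each
sample_means = [sum(sample) / sample_size
                for sample in combinations(population, sample_size)]
distribution = Counter(sample_means)

for value in sorted(distribution):
    print(f"sample mean {value}: {distribution[value]} combination(s)")
```

The sample means pile up around the population mean, with fewer combinations producing the extreme values.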
One-tailed vs two-tailed test
If the direction of the difference is not important, we use a two-tailed test
If the direction of difference does matter, we use a one-tailed test
Basic linear regression
A statistical technique that estimates the location of the best-fitting straight line, giving a predicted value of the dependent variable for every value of the independent variable
In other words, predicting the value of Y from the values of X
standard error
Standard error of the estimate: analogous to the standard deviation around the mean; it provides a single value that summarizes how closely the regression line fits the data.
What do we need to minimize to get the best linear regression line?
Unexplained variance
How to analyze the data presented by the OLS formula?
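First, look at the intercept (a), which defines Y when X is 0. It gives the predicted value of Y (for example, predicted growth) at that point
Then look at the slope (b), which indicates how much the predicted value of Y (for example, predicted growth) changes when X moves by 1 unit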
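A minimal sketch of estimating a and b by ordinary least squares for a single predictor, assuming made-up data (the function name and values are illustrative):

```python
def ols_fit(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx               # slope: change in predicted Y per 1-unit change in X
    a = mean_y - b * mean_x     # intercept: predicted Y when X is 0
    return a, b

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = ols_fit(x, y)
print(f"predicted Y = {a:.1f} + {b:.1f} * X")
```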
Explain the type I and II errors that can occur in the following scenario,
A new drug is proposed but must undergo a clinical trial. The null hypothesis is that there are no dangerous side effects
If a Type I error is made, the drug is found to have dangerous side effects when it does not, and an opportunity for a promising drug is missed.
But if a Type II error is made, meaning we conclude there are no dangerous side effects when there are, this poses a danger to the public. Both are risks
what are dummy variables?
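Dummy variables: categorical variables recoded into 0/1 indicators (1 if the case belongs to the category, 0 if not). We can enter dummy variables as independent variables to assess their effect on the dependent variable.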
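A small sketch of dummy coding, assuming a made-up "region" variable with one category held out as the reference group (the function name, the `is_` prefix, and the data are hypothetical):

```python
def make_dummies(values, reference):
    """Recode a categorical variable into 0/1 indicators,
    leaving out the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"is_{category}": int(value == category) for category in categories}
            for value in values]

# Hypothetical regions for four cases; "north" is the reference category
regions = ["north", "south", "east", "south"]
for region, row in zip(regions, make_dummies(regions, reference="north")):
    print(region, row)
```

A case in the reference category scores 0 on every dummy, so the other categories are interpreted relative to it.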
what is insufficient evidence?
If the calculated value is less than the critical value, there is insufficient evidence and we fail to reject the null hypothesis.
how are statistical significance tests affected by sample size?
The bigger the sample size, the more likely you’ll find statistical significance
What is the difference between substantive and statistical significance?
A relationship is statistically significant if it is unlikely to be due to chance. A relationship is substantively significant if it is theoretically important, if it plays a role in elaborating, modifying, or rejecting the theory. The need for substantive significance requires that the researcher fully examine the relationship between the variables
The central limit theorem
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution
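A simulation sketch of the theorem, assuming a strongly skewed (exponential) population and using skewness near 0 as a rough sign of a normal shape. The seed, sample sizes, and number of samples are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed standardized score; 0 for a symmetric distribution."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in values) / n

def sample_means(sample_size, n_samples=2000):
    """Draw many samples from an exponential population and return their means."""
    return [sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
            for _ in range(n_samples)]

skew_by_size = {size: skewness(sample_means(size)) for size in (1, 5, 50)}
for size, skew in skew_by_size.items():
    print(f"sample size {size:>2}: skewness of sample means = {skew:.2f}")
```

As the sample size grows, the skewness of the sample means should shrink toward 0 even though the population itself is heavily skewed.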
Coefficient of determination
Also known as R². It is the proportion of variance in the outcome that can be explained by the predictor(s) in the model.
With a single predictor, it is equivalent to the squared correlation between X and Y
In other words, R² is the proportion of common variance between Y and the other variables in the model.
The closer it is to 1 (range of 0 to 1), the better the model fits the data.
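A sketch showing two routes to R² for a single predictor, assuming made-up data: one minus the unexplained share of variation, and the squared correlation. The function name is hypothetical.

```python
def r_squared(x, y):
    """Return R² computed two ways for a one-predictor regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)   # total variation in Y
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Unexplained variation: squared distances between observed and predicted Y
    unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    from_fit = 1 - unexplained / syy
    from_correlation = sxy ** 2 / (sxx * syy)
    return from_fit, from_correlation

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(r_squared(x, y))
```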
we reject the null hypothesis…
if the p-value is less than the alpha level, conventionally 5% (p < 0.05).
Meaning a result like this would occur less than 5% of the time if the null hypothesis were true, which corresponds to a 95% confidence level.
In Chi-square testing, when would you reject the null hypothesis?
if the chi-square obtained exceeds the chi-square critical, we reject the null hypothesis of no relationship
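The decision rule can be sketched as a lookup, assuming an alpha level of 0.05 and tables small enough to have 1 to 4 degrees of freedom. The critical values are the standard tabled ones; the obtained values in the examples are made up.

```python
# Standard chi-square critical values at alpha = 0.05, keyed by degrees of freedom
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def reject_null(chi_square_obtained, n_rows, n_cols):
    """True if the obtained chi-square exceeds the critical value."""
    degrees_of_freedom = (n_rows - 1) * (n_cols - 1)
    return chi_square_obtained > CRITICAL_05[degrees_of_freedom]

# Hypothetical obtained values
print(reject_null(20.0, n_rows=2, n_cols=2))  # True: 20.0 exceeds 3.841
print(reject_null(2.5, n_rows=2, n_cols=2))   # False: 2.5 does not exceed 3.841
```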