STAT Definitions (EDUS 608) Flashcards

1
Q

Variable

A

a characteristic that can vary in value among subjects in a sample or a population.

2
Q

Categorical Variable (Qualitative)

A

The scale of measurement is a set of categories.

Examples:
Racial-ethnic group (white, black, Hispanic)
Political party identification (Dem., Repub., Indep.)
Vegetarian? (yes, no)
Happiness (very happy, pretty happy, not too happy)
Gender
Religious affiliation
Major

3
Q

Quantitative Variable

A

possible values differ in magnitude.

Examples:
Age, height, weight, BMI
Annual income
GPA
Time spent on Internet yesterday
Reaction time to a stimulus
(e.g., cell phone while driving in experiment)
Number of “life events” in past year
4
Q

Nominal Scale

A

Used to measure CATEGORICAL VARIABLES by using unordered categories.

Example:
Preference for President, Race, Gender,
Religious affiliation, Major
Opinion items (favor vs. oppose, yes vs. no)

5
Q

Ordinal Scale

A

Used to measure CATEGORICAL VARIABLES by using ordered categories.

Political ideology (very liberal, liberal, moderate, conservative, very conservative)
Anxiety, stress, self esteem (high, medium, low)
Mental impairment (none, mild, moderate, severe)
Government spending on environment (up, same, down)

6
Q

Interval Scale

A

Used to measure QUANTITATIVE VARIABLES by using numerical values.

The differences between values are consistent:
-Moving from $20,000 to $21,000 is the same magnitude as moving from $50,000 to $51,000
-Moving from 90 degrees F to 95 degrees F is the same as moving from 70 to 75

Note: In practice, ordinal categorical variables are often treated as interval by assigning scores.

(e.g., grades A, B, C, D, E form an ordinal scale, but are treated as interval if scores 4, 3, 2, 1, 0 are assigned to construct a GPA)

7
Q

What is Descriptive Statistics?

A
  1. Describing data with tables and graphs
    (quantitative or categorical variables)
  2. Numerical descriptions of center
    (mean/median) and variability (standard
    deviation/ variance) (quantitative variables)
8
Q

Histogram

A

Bar graph of frequencies or percentages.

9
Q

Skewed right

A

Long tail on the right. Mean is to the RIGHT of the Median

10
Q

Skewed left

A

Long tail on the left. Mean is LEFT of the Median.

11
Q

Bimodal

A

Mean and median are the same, but there are two modes.

12
Q

Bell-shaped

A

Mean, median, and mode are the same.

13
Q

Median

A

Middle measurement of ordered sample.

14
Q

Mean

A

The average used to describe the central tendency of the data in question. It is determined by adding all the data points and dividing the total by the number of points.
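As a quick sketch with made-up numbers, Python's standard library can compute both the mean and the median; note how a single outlier pulls the mean while barely moving the median:

```python
from statistics import mean, median

data = [2, 3, 3, 4, 5, 5, 6]      # hypothetical small sample
print(mean(data))                 # 28 / 7 = 4
print(median(data))               # middle of the ordered sample = 4

with_outlier = data + [100]       # add one extreme observation
print(mean(with_outlier))         # jumps to 16
print(median(with_outlier))       # barely moves: (4 + 5) / 2 = 4.5
```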

15
Q

Mean vs. Median (Distribution)

A

The mean is sensitive to “outliers” (the median is often preferred for highly skewed distributions).

When the distribution is symmetric, mildly skewed, or discrete with few values, the mean is preferred because it uses the numerical values of the observations.

16
Q

Range

A

Difference between largest and smallest observations (highly sensitive to outliers).

17
Q

Standard Deviation

A

A “typical” distance from the mean. It is the square root of the variance.

18
Q

Variance

A

Measures how far a data set is spread out. It comes from calculating the average of the squared differences from the mean.
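A minimal sketch with hypothetical data: compute the deviations, average their squares (the variance), and take the square root (the standard deviation); the stdlib population formulas agree:

```python
from statistics import pstdev, pvariance

data = [4, 8, 6, 2]                       # hypothetical observations
m = sum(data) / len(data)                 # mean = 5.0
deviations = [x - m for x in data]        # [-1.0, 3.0, 1.0, -3.0]
var = sum(d ** 2 for d in deviations) / len(data)  # average squared deviation = 5.0
sd = var ** 0.5                           # standard deviation = sqrt(5) ~ 2.236

print(var, sd)
print(pvariance(data), pstdev(data))      # stdlib population formulas agree
# Note: the *sample* variance divides by n - 1 instead (statistics.variance)
```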

19
Q

Deviation

A

The difference of an observation’s value from the mean.

20
Q

Properties of Standard Deviation

A
  • s ≥ 0, and only equals 0 if all observations are equal
  • s increases with the amount of variation around the mean
  • like the mean, s is affected by outliers
21
Q

Empirical Rule

A

If distribution is approximately bell-shaped:
• about 68% of data within 1 standard dev. of mean
• about 95% of data within 2 standard dev. of mean
• all or nearly all data within 3 standard dev. of mean
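For an exactly normal distribution, these percentages can be checked against the normal CDF, built here from the standard library's error function:

```python
from math import erf, sqrt

def within(k):
    """P(|Z| < k) for a standard normal Z, via the error function."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within(k) * 100, 1))   # about 68.3, 95.4, 99.7 percent
```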

22
Q

Point Estimation

A

Estimating parameters (mean, median, standard dev.)

23
Q

Inference

A

Testing theories about parameters.

24
Q

Hypothesis Testing

A

Creating models based on hypotheses and testing them with data to see if they are consistent with the data.

25
Q

Null Hypothesis H0

A

– There is no effect.
– E.g. contestants on “Survivor” and members of the public will not differ in their scores on personality disorder questionnaires

It is called the “null” because it is frequently, though not always, used to say that something is 0.
• Examples of null hypotheses:
– Ho: μ=0
– Ho: There are no differences in math achievement
by SES level.
– Ho: “I don’t have the flu”

26
Q

The alternative hypothesis, HA (or H1)

A

– There is an effect.
– E.g. contestants on “Survivor” will score higher on personality disorder questionnaires than members of the public

• Typically suggests that an effect exists, or
(in this class) is statistically significant
• Examples of alternative hypotheses
corresponding to the previous examples:
– Ha: μ≠0
– Ha: There are differences in math achievement by SES level.
– Ha: “I have the flu”

27
Q

Null versus Alternative

A
  • The null and the alternative can’t both be true and are mutually exclusive
  • Using statistics, we have strong tools to assess the probability that one is correct and the other isn’t…
  • Based on the results you obtain, you will either reject the null hypothesis (you have evidence an effect exists), or you will fail to reject the null hypothesis (you don’t have enough evidence that an effect exists)
  • Instead of saying “fail to reject” the null hypothesis, some disciplines use “retain” the null hypothesis
28
Q

Type I Error

A

Aka. False Positive.
Reject the null if it is true.

• My test says that μ≠0, but actually μ=0
• My test says that achievement differs by SES, it actually
doesn’t
• My swab results say “I have the flu”, but I actually don’t
– i.e., I got a false positive
– In medical tests, testing “positive” means rejecting the null

29
Q

Type II Error

A

Aka. False Negative.
Fail to reject the null when it’s false.

• My test says that I can’t reject the assertion that μ=0, but in reality μ≠0
• My test says achievement doesn’t differ by SES, but it actually does
• My swab results say “I don’t have the flu”, but I actually do
– i.e., I got a false negative
– In medical tests, testing “negative” means you do not reject the null

30
Q

Alpha (α)

A
  • the proportion of the times I can expect to reject the null when it’s true in repeated randomly drawn samples of the same sample size from the population
  • aka the probability that I will make a type I error (with repeated sampling)
  • This is also called the “significance level”
31
Q

How Do We Choose Alpha?

A

• There are conventional choices for α
– The most common choice for α is .05
– Also common are .1 and .01
• All these choices are arbitrary, but they are attempting to be conservative

The smaller the alpha level, the smaller the area where you would reject the null hypothesis. So if you have a tiny area, there’s more of a chance that you will NOT reject the null, when in fact you should. This is a Type II error.
In other words, the more you try to avoid a Type I error, the more likely a Type II error could creep in. Scientists have found that an alpha level of 5% is a good balance between these two issues.

32
Q

Approaches to Conducting Hypothesis Tests

A
  1. Test of significance (t-test, z-test)

2. Confidence interval (90%, 95%, 99%)

33
Q

Test of Significance

A
  1. Use formula to create a test statistic
  2. Compare the value of the test statistic with the value corresponding to our choice of α in the distribution appropriate to the test (could be z, t, chi-square, etc.)
    * This value is called the “critical value”
  3. If the test statistic is larger than the critical value, you reject the null hypothesis. If the test statistic is smaller than the critical value, you fail to reject the null hypothesis.
34
Q

Critical Value

A

In hypothesis testing, the value against which a test statistic is compared to determine whether or not the null hypothesis is rejected. For a two-sided z-test with alpha = .05, this is ±1.96. If the test statistic is larger in absolute value than the critical value, then we reject the null.
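A minimal sketch with hypothetical summary numbers: a two-sided z-test of H0: μ = 0 at alpha = .05, comparing the test statistic to the ±1.96 critical value:

```python
# Hypothetical sample summary: mean 1.2, standard deviation 3.0, n = 49; H0: mu = 0
xbar, mu0, sd, n = 1.2, 0.0, 3.0, 49
z = (xbar - mu0) / (sd / n ** 0.5)     # test statistic: 1.2 / (3/7) = 2.8
critical = 1.96                        # two-sided critical value for alpha = .05
print(round(z, 2), abs(z) > critical)  # 2.8 True -> reject H0
```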

35
Q

Rejection Region

A

The set of values of a test statistic that leads to rejecting the null hypothesis. If the test statistic falls in the rejection region, then we reject the null hypothesis.

36
Q

p-value

A

The smallest significance level at which the null hypothesis can be rejected.

It captures the amount of confidence we have in our inference
based on the probability that we would get our particular
parameter estimate under our null hypothesis.

It can be thought of as the amount of evidence against the null.
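As a sketch, the two-sided p-value for a z statistic can be computed from the normal CDF (built from math.erf); the z values below are made up for illustration:

```python
from math import erf, sqrt

def two_sided_p(z):
    """Two-sided p-value: probability of a z statistic at least this extreme under H0."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(round(two_sided_p(2.80), 4))   # about 0.0051 -> reject at alpha = .05
print(round(two_sided_p(1.50), 4))   # about 0.1336 -> fail to reject at .05
```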

37
Q

What would a low p-value indicate?

A

– The lower the p-value, the less likely it is that we would get the result we got if the null were true
– Thus, the lower the p-value, the more significantly different our estimate is from the hypothesized null value of the parameter.

Typically, we say that p-values need to be .05 or smaller to reject, but this is just one choice.

38
Q

Confidence Interval

A

an interval of numbers within which a given parameter is believed to fall

It has the form: Estimate ± Margin of error OR
[Estimate - Margin of error, Estimate + Margin of error]
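With hypothetical sample numbers, a 95% confidence interval takes exactly this Estimate ± Margin of error form:

```python
# Hypothetical sample summary: mean 50, standard deviation 10, n = 25
xbar, sd, n = 50.0, 10.0, 25
z = 1.96                                 # critical value for 95% confidence
margin = z * sd / n ** 0.5               # margin of error = 1.96 * (10 / 5) = 3.92
print((xbar - margin, xbar + margin))    # approximately (46.08, 53.92)
```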

39
Q

Elements of Confidence Interval

A
–Your estimate of the population mean
–Your estimate of the standard deviation of the population
–The sample size
–The level of confidence that you specify
40
Q

Alpha-Level

A

• We choose the level of α to give us a specific degree of confidence
– α =.10 => 90% confidence
– α =.05 => 95% confidence
– α =.01 => 99% confidence
• This is very similar to choosing a threshold for Type I error in hypothesis testing- you are
choosing which level of confidence is appropriate (usually at least .05/ 95%)
• Just like in significance testing, each level of confidence corresponds to a particular critical value

41
Q

Correlation

A

It is a way of measuring the extent to which two variables are related.

It measures the strength of the association between two interval/ratio quantitative variables.

It measures the pattern of responses across variables.

42
Q

Covariance

A

A measure of how much two variables change together.

43
Q

Problems with Covariance

A

It depends upon the units of measurement.
– E.g. The covariance of two variables measured in miles might be 4.25, but if the same scores are converted to km, the covariance is 11.

44
Q

(Pearson) Correlation coefficient

A

• One solution to problems with covariance is to standardize it.
– Divide by the standard deviations of both variables.
– It is relatively unaffected by units of measurement.
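A small sketch with made-up data showing both points: the covariance changes when the units change, while the standardized version (Pearson's r) does not:

```python
from statistics import pstdev, mean

def cov(x, y):
    """Population covariance: average product of paired deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def corr(x, y):
    """Pearson r: the covariance standardized by both standard deviations."""
    return cov(x, y) / (pstdev(x) * pstdev(y))

miles_x = [1.0, 2.0, 3.0, 4.0]        # hypothetical distances in miles
miles_y = [2.0, 3.0, 5.0, 6.0]
km_x = [v * 1.609 for v in miles_x]   # the same data converted to km
km_y = [v * 1.609 for v in miles_y]

print(cov(miles_x, miles_y), cov(km_x, km_y))    # covariance depends on units
print(corr(miles_x, miles_y), corr(km_x, km_y))  # r is unchanged
```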

45
Q

About Correlation

A
• It varies between -1 and +1
– 0 = no relationship
• Positive values suggest a positive relationship
between the two variables
– As X increases, Y increases
– As X decreases, Y decreases
• Negative values suggest a negative relationship
– As X increases, Y decreases
– As X decreases, Y increases
46
Q

Effect Size (Correlation)

A

It can be interpreted as an effect size:
– ±.1 = small effect
– ±.3 = medium effect
– ±.5 = large effect

47
Q

Coefficient of Determination, r-squared

A

– By squaring the value of r you get the proportion of variance in one variable shared by the other.

48
Q

Correlation + Causality

A

• The third-variable problem:
– in any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results.

• Direction of causality:
– Correlation coefficients say nothing about which variable causes the other to change

• Correlation is not causation

49
Q

Nonparametric Correlation

A

• For small samples (say less than 30), or for data that are severely non-normally
distributed
• Spearman’s Rho
– Pearson’s correlation on the ranked data
• Kendall’s Tau
– Better than Spearman’s for small samples
• For this class: we will focus on the Pearson correlation (r)

50
Q

Regression

A

A way of predicting the value of one variable from another.
– It is a hypothetical model of the relationship between two variables.
– The model used is a linear one;
– Therefore, we describe the relationship using the equation of a straight line.

51
Q

Linear Model

A

y = b0 + b1*X (plus error)
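With hypothetical data, the least-squares estimates of b0 and b1 can be computed directly from sums of deviations (a sketch, not a full regression routine):

```python
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0]   # hypothetical predictor values
y = [2.0, 3.0, 5.0, 6.0]   # hypothetical outcome values
mx, my = mean(x), mean(y)

# Least-squares estimates: slope from deviation products, intercept from the means
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx          # the fitted line passes through (x-bar, y-bar)

print(b0, b1)              # intercept ~ 0.5, slope = 1.4
print(b0 + b1 * 5.0)       # prediction for a new x = 5
```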

52
Q

Population Model

A

Used to describe an overall theoretical linear relationship between two variables.

53
Q

Prediction Model

A

We use the same format as a population model but “plug in” estimated values and make a prediction.

Big difference: there is no error term; for prediction we assume the error washes out.

54
Q

Sum of Squares Regression

A

Model variability (difference in variability between the model and the mean).

55
Q

Sum of Squares Residual

A

Residual/Error variability (variability between the regression model and the actual data).

56
Q

Sum of Squares Total

A

Total Variability (variability between scores and the mean)

57
Q

How Sum of Squares Relate

A

Sum of Squares Total = Sum of Squares Regression + Sum of Squares Residual: the total variability splits into the part explained by the model and the leftover residual/error variability.

58
Q

How Good is the Model?

A

We need to test the model. We can do this in two ways, using the sums of squares to create two diagnostic tests: the F-test and R Square.

59
Q

Interpreting the F-test

A

Divide each Sum of Squares by its degrees of freedom; this gives you the Mean Square. Then F = Mean Square Regression / Mean Square Residual. Compare this value to the F distribution with (1, n-2) degrees of freedom and check the p-value.
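The steps above can be sketched with hypothetical data whose least-squares line is y-hat = 0.5 + 1.4x; the sums of squares decompose as described, and F = MS Regression / MS Residual:

```python
from statistics import mean

# Hypothetical data; its least-squares line is y-hat = 0.5 + 1.4x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
yhat = [0.5 + 1.4 * a for a in x]
ybar = mean(y)

ss_total = sum((v - ybar) ** 2 for v in y)             # scores vs. the mean
ss_resid = sum((v - f) ** 2 for v, f in zip(y, yhat))  # data vs. the model
ss_reg = sum((f - ybar) ** 2 for f in yhat)            # model vs. the mean

ms_reg = ss_reg / 1                  # regression df = 1 (one predictor)
ms_resid = ss_resid / (len(y) - 2)   # residual df = n - 2
F = ms_reg / ms_resid
print(abs(ss_reg + ss_resid - ss_total) < 1e-9)  # SS Total = SS Reg + SS Resid
print(round(F, 2))                   # compare to the F table with (1, n-2) df
```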

60
Q

R

A

the absolute value of the correlation coefficient.

61
Q

R Square

A

The square of the correlation coefficient, also called the “coefficient of determination.” Equal to Sum of Squares Regression divided by Sum of Squares Total.

Super useful for determining model fit - it represents the proportion of the total variance (sum of Squares Total) that is explained by your regression equation.

R Square shows you how well your model fits your data; it is much more precise than the F-test, which just tells us whether things fit in very general terms.

You can think of R Square as an effect size for your regression, with the same range of values:
.1 - .3 (small); .3 - .5 (medium); .5 or higher (large)
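With the same kind of made-up data (its least-squares line is y-hat = 0.5 + 1.4x), a sketch showing that squaring r gives the same number as SS Regression / SS Total:

```python
from statistics import pstdev, mean

x = [1.0, 2.0, 3.0, 4.0]   # hypothetical data
y = [2.0, 3.0, 5.0, 6.0]
mx, my = mean(x), mean(y)

cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
r = cov / (pstdev(x) * pstdev(y))    # Pearson correlation

yhat = [0.5 + 1.4 * a for a in x]    # the least-squares line for these points
ss_reg = sum((f - my) ** 2 for f in yhat)
ss_total = sum((v - my) ** 2 for v in y)

print(round(r ** 2, 4), round(ss_reg / ss_total, 4))  # both 0.98
```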

62
Q

Synthesize Results (Steps)

A

How much evidence do you have to justify your findings?

  1. Is the relationship significant?
  2. What is the direction of the relationship?
  3. How well does the model fit the data?
63
Q

Generalizability

A

The goal of statistics: we want the research to apply to a larger population.

64
Q

Generalizability Assumptions for Regressions

A
  1. Variable Type
  2. Non-Zero Variance
  3. Linearity
  4. Independence
  5. Homoscedasticity
  6. Normally-distributed Errors
  7. No Multicollinearity
65
Q

Assumption - Variable Type

A
  • Outcome must be continuous (interval/ratio)
  • Predictors can be continuous (interval/ratio) or dichotomous.

66
Q

Assumption - Non-Zero Variance

A

Predictors must not have zero variance (have to have some variation).

67
Q

Assumption - Linearity

A

The relationship we model is, in reality, linear.

68
Q

Assumption - Independence

A

Each value of the outcome should come from a different person (i.e., the observations are independent).

69
Q

Assumption - Homoscedasticity

A

For each value of the predictors the variance of the error term should be constant.

70
Q

Assumption - Normally-distributed Errors

A

Residual error values should be normally distributed when viewed in a histogram.

71
Q

No Multicollinearity

A

Predictors must not be highly correlated.

*This only applies to multiple regressions, not simple linear regressions.

72
Q

Tools for Checking Assumptions Using Residuals (Errors) - aka going through the garbage

A
  1. Homoscedasticity and Linearity - plot standardized residual values (ZRESID) against standardized predicted values (ZPRED).
  2. Normality of Errors - (1) normal probability plot or (2) a histogram of the standardized residuals.
73
Q

Big Picture for Residual Plots

A

From Field, p. 348: “If everything is OK [assumptions have been met] then this graph should look like a random array of dots, if the graph funnels out then that is a sign of heteroscedasticity and any curve suggests non-linearity.”
Look for:
1. Funneling- the width of the points gets smaller/ larger (heteroscedasticity)
2. Curving- the points show a curved trend up or down (or both!) (non-linearity)

74
Q

Homoscedasticity vs. Heteroscedasticity

A

We assume that the residual errors demonstrate
homoscedasticity (a good thing):
– This means that for each value of your predictor(s) the variance of the error term should be constant
– We will practice how to look for this next

If the residuals errors are not homoscedastic, then they demonstrate heteroscedasticity (a bad thing):
– This means that the variance of the error term is not constant
– The assumption of homoscedasticity has been violated

75
Q

Normally-Distributed Errors vs. Non-Normality of Errors

A

We assume that our residual error values are normally distributed (a good thing):
– This means that the residual errors should be normally distributed when viewed in a histogram or a probability plot

If the normality of errors assumption is violated, we have evidence of non-normality of errors (a bad thing):
– This means that there may be outliers or influential observations skewing our results, or maybe we
specified the wrong kind of model

76
Q

Linear vs. Non-Linear Relationship

A

We assume that our outcome and predictor(s) have a linear relationship (a good thing):
– This means that the relationship can be modeled with a line
– We will practice how to look for this next
• If the linearity assumption is violated, we have evidence of a non-linear relationship (a bad thing):
– This means that the relationship should not be modeled with a straight line (might be quadratic,
exponential, etc.)
– We would need another kind of model to accurately
represent it

77
Q

Normal Probability Plot

A

Look for a nearly straight line; curves or patterns are a sign that something is wrong with the data.

78
Q

Multiple Linear Regression (MLR)

A

Relationships between more than two variables.

79
Q

Why we need MLR

A

Bivariate analyses (e.g. simple linear regression) are informative, but we usually need to take into account many variables.

  • Many predictors (“x”es) have an influence on any particular outcome (y). Especially in education: many things influence achievement, as an example.
  • The effect of a given predictor (x) on an outcome (y) may change when we take into account other variables.