Qualitative and Quantitative Flashcards
3 Steps in the Statistical Process
1) Collect Data
2) Describe & Summarize the Distribution
3) Interpret - draw general conclusions about the population on the basis of the sample
Nominal Data
Mutually exclusive groups that lack intrinsic order.
Zoning classification, social security numbers, sex.
Ordinal Data
Ordered values implying a ranking of observations. The numeric values themselves are meaningless - only the rank is important.
Letter grades, response scales on a survey 1-5, suitability for development
Interval data
Data with ordered relationship where the difference between scales has meaning.
Temperature. The difference between 40 and 30 degrees is the same as between 30 and 20 degrees, but 20 degrees is not twice as cold as 40 degrees.
Ratio Data
Gold standard of measurement. Absolute and relative difference have meaning.
Distance measurement. The difference between 40 and 30 miles is the same as between 30 and 20 miles, and 40 miles is twice as far as 20 miles.
Quantitative Variables
Variables where numerical value is meaningful.
Interval or ratio measurement.
HH income, level of pollution in river
Qualitative Variables
Variables where numerical value is not meaningful.
Nominal/Ordinal measurement.
Zoning classification
Continuous Variables
Infinite number of values.
Positive & negative.
Most measurements in physical sciences yield continuous variables.
Discrete variables
A countable number of distinct values.
Accidents per month - can’t be negative.
Binary/dichotomous variables
Special case of discrete variables which can only take on two values - 0/1 typically.
Descriptive Statistics
Describe the characteristics of the distribution of values in a population or sample.
Ex: on average, AICP test takers in 2018 are 30 years old
Inferential Statistics
Use probability to determine characteristics of a population based on a sample.
Distribution
the overall shape of observed data.
Ordered table, or histogram, or density plot
Normal or Gaussian Distribution
the bell curve.
Distribution is symmetric. The spread around the mean can be related to the proportion of observations.
More specifically, approximately 95% of the observations that follow a normal distribution are within two standard deviations of the mean
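The two-standard-deviation rule can be checked with a quick simulation (a sketch assuming NumPy; the mean, standard deviation, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
# Draw 100,000 values from a normal distribution with mean 50 and sd 10.
sample = rng.normal(loc=50, scale=10, size=100_000)

# Share of observations within two standard deviations of the mean.
within_2sd = np.mean(np.abs(sample - 50) <= 2 * 10)
print(round(within_2sd, 3))  # roughly 0.95
```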
Symmetric distribution
equal number of observations are below and above the mean
Central tendency
Typical or representative value for the distribution of observed values
Coefficient of Variation
the relative dispersion from the mean by taking the standard deviation and dividing by the mean.
z-score
This is a standardization of the original variable by subtracting the mean and dividing by the standard deviation.
The z-score in effect transforms the original measure into standard deviation units.
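The transformation can be sketched in a few lines (assuming NumPy; the scores are made up for illustration):

```python
import numpy as np

scores = np.array([62.0, 70.0, 74.0, 80.0, 94.0])
mean = scores.mean()        # 76.0
sd = scores.std(ddof=0)     # population standard deviation

# Each z-score expresses the original value in standard-deviation units.
z = (scores - mean) / sd
print(z.round(2))
```

By construction, the z-scores have mean 0 and standard deviation 1.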
Inter-Quartile Range (IQR)
Alternative measure of dispersion.
The difference between the third quartile (75th percentile) and the first quartile (25th percentile), covering the middle 50% of observations.
This is visualized in a box plot (also called box and whiskers plot).
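A minimal sketch of the IQR and the common 1.5 × IQR outlier rule used in box plots (assuming NumPy; the data are hypothetical):

```python
import numpy as np

data = np.array([3, 5, 7, 8, 9, 11, 13, 15, 40])  # 40 looks like an outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # spread of the middle 50% of the observations

# Box-plot convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(iqr, outliers)  # 6.0 [40]
```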
Confidence Interval
This constitutes a range around the sample statistic that contains the population parameter with a given level of confidence, typically 95% or 99%.
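A sketch of a 95% confidence interval for a sample mean using the t distribution (assuming NumPy and SciPy; the sample is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
sample = rng.normal(loc=100, scale=15, size=50)  # simulated sample

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% confidence interval using the t distribution, n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(round(low, 1), round(high, 1))
```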
Standard Deviation
a measure of how much the data in a collection are scattered around the mean. A low standard deviation means that the data are tightly clustered; a high standard deviation means that they are widely scattered. There are two common formulas: the population formula divides the sum of squared deviations by n, while the sample formula divides by n - 1, so they give slightly different results.
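A sketch contrasting the population formula (divide by n) and the sample formula (divide by n - 1), which NumPy exposes via the `ddof` argument (the data are illustrative):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_sd = np.std(data, ddof=0)     # population formula: divide by n
sample_sd = np.std(data, ddof=1)  # sample formula: divide by n - 1
print(pop_sd, round(sample_sd, 3))  # 2.0 2.138
```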
Variance
the square of the standard deviation. It is the expectation of the squared deviations from the mean. The formula is the same as that for the standard deviation except that the "s" symbol is squared and no square root is taken.
Coefficient of Variation
unlike the other three measures of dispersion, it captures relative dispersion from the mean rather than absolute dispersion. It is simply the standard deviation divided by the mean (CV = s / x̄).
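A sketch comparing two hypothetical income samples with the same mean but very different relative dispersion (assuming NumPy):

```python
import numpy as np

# Two hypothetical income samples (in $1,000s) with the same mean.
city_a = np.array([40.0, 45.0, 50.0, 55.0, 60.0])
city_b = np.array([10.0, 30.0, 50.0, 70.0, 90.0])

def cv(x):
    """Coefficient of variation: standard deviation divided by the mean."""
    return x.std(ddof=0) / x.mean()

# Same mean (50), but city_b's incomes are far more dispersed relative to it.
print(round(cv(city_a), 3), round(cv(city_b), 3))  # 0.141 0.566
```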
Hypothesis Testing
is conducted to determine outcomes based on the scientific method. First, the statistician must declare the predicted (desired) outcome, then must also identify and describe all possible outcomes.
• The Research Hypothesis (designated H1) is a statement that describes the interrelationships between different characteristics. It is what the researcher is seeking to prove through the analysis.
• The Null Hypothesis (designated H0) is the opposite of the research hypothesis. It is what the researcher is seeking to prove wrong so that the research hypothesis can be assumed to be correct by implication.
• Remember that it is easier to prove something wrong than to prove it correct (statistically speaking), so the null hypothesis is used.
• There are two kinds of error a researcher can make in hypothesis testing. The first is a Type 1 Error, where H0 is rejected even though it is true. The second is a Type 2 Error, where H0 is accepted even though it is false.
t-test
allows us to compare the means of two groups and determine how likely it is that the difference between the two means occurred by chance.
correlated t-test
concerned with the difference between the average scores of a single sample of individuals who are assessed at two different times ("before" vs. "after") or on two different measures. The measures must be correlated (co-related), so it can also compare average scores of samples of individuals who are paired in some way (e.g., parent-child).
independent t-test
compares the averages of two samples that are selected independently of each other. Independent t-tests come in “equal variance” and “unequal variance” flavors, but these go beyond the scope of this work.
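Both flavors can be sketched with SciPy (the data are simulated for illustration; `ttest_rel` handles the correlated/paired case, `ttest_ind` the independent case):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Correlated (paired) t-test: the same 30 individuals measured twice.
before = rng.normal(loc=70, scale=8, size=30)
after = before + rng.normal(loc=3, scale=2, size=30)  # built-in average gain of 3
t_paired, p_paired = stats.ttest_rel(before, after)

# Independent t-test: two samples drawn separately from each other.
group_a = rng.normal(loc=70, scale=8, size=30)
group_b = rng.normal(loc=75, scale=8, size=30)
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

print(round(p_paired, 4), round(p_ind, 4))
```

A small p-value means the observed difference in means is unlikely to have occurred by chance.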
ANOVA
an extension of the t-test. It stands for Analysis of Variance. It allows a composite view of the data: by placing variable x into groups, a better understanding of variable y can be gained.
o ANOVA identifies the relationship between two variables.
o The x variable is always nominal
o The y variable is always interval
• Mathematically, a line is expressed as y = mx + b
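A one-way ANOVA sketch with SciPy's `f_oneway` (the zones and commute times are hypothetical; x is the nominal grouping, y the interval measurement):

```python
import numpy as np
from scipy import stats

# Hypothetical commute times (minutes), grouped by a nominal x variable (zone).
zone_a = np.array([22.0, 25.0, 27.0, 24.0, 26.0])
zone_b = np.array([30.0, 33.0, 31.0, 35.0, 32.0])
zone_c = np.array([41.0, 38.0, 40.0, 43.0, 39.0])

# One-way ANOVA: does the interval y (commute time) differ across the groups?
f_stat, p_value = stats.f_oneway(zone_a, zone_b, zone_c)
print(round(f_stat, 1), p_value < 0.05)
```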
Correlation
measures the strength of the relationship between variables or the degree to which two variables are correlated (co-related). It is used to demonstrate relationships between situations and/or actors, even disparate ones (think apples and oranges). The test is linear.
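A sketch of the Pearson correlation coefficient with NumPy (the tract data are hypothetical):

```python
import numpy as np

# Hypothetical tract data: median household income vs. median home value.
income = np.array([40.0, 48.0, 55.0, 61.0, 70.0, 82.0])           # $1,000s
home_value = np.array([150.0, 180.0, 200.0, 230.0, 260.0, 310.0])  # $1,000s

# Pearson's r ranges from -1 to +1; values near +1 mean a strong linear link.
r = np.corrcoef(income, home_value)[0, 1]
print(round(r, 3))
```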
Regression
a statistical test of the effect one variable (condition/actor) has on another while holding all other conditions constant. This test is also linear. If there is no correlation, there is no need to run a regression. Regression allows us to predict the value of one variable given the value of the other, or to explore the relationships between variables.
o There is always one dependent variable (y) in regression.
o In simple regression, there is only one independent variable. The formula for simple regression is y = b0 + b1x1.
o In multiple regression, there are two or more independent variables. Multiple regression simply extends simple regression: y = b0 + b1x1 + b2x2 + … + bnxn.
o Regression answers one or more of these questions:
. What is the association between x and y?
. How can changes in y be explained by changes in x?
. What are the functional relationships between y and x?
o Beware of false relationships! Correlation and regression can be used to “prove” that fire trucks cause house fires (if there is a house fire, there are likely fire trucks).
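A simple-regression sketch fitting y = b0 + b1x1 by least squares (assuming NumPy; the data are hypothetical):

```python
import numpy as np

# Hypothetical simple regression: predict home value (y) from income (x).
x = np.array([40.0, 48.0, 55.0, 61.0, 70.0, 82.0])        # $1,000s
y = np.array([150.0, 180.0, 200.0, 230.0, 260.0, 310.0])  # $1,000s

# Least-squares fit of y = b0 + b1*x; np.polyfit returns [b1, b0].
b1, b0 = np.polyfit(x, y, deg=1)
predicted = b0 + b1 * 90.0  # predicted home value at income = 90
print(round(b1, 2), round(predicted, 1))
```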