Lesson 3 - Quantitative Analysis (Statistics) Flashcards
Population
the entire set of some entity. All of the planners preparing for the 2011 AICP exam would be a population.
Sample
a subset of the population. For example, 25 candidates out of the total number of planners preparing for the 2011 AICP exam.
Descriptive Statistics
describe the characteristics of a population.
Inferential Statistics
determine characteristics of a population based on observations made on a sample from that population. We infer things about the population based on what is observed in the sample.
Central tendency
the typical or representative value of a dataset. There are several ways to report central tendency, including mean, median, and mode.
The appropriate measure of central tendency depends on the data type and the situation.
Mean
the average of a distribution. The mean of [2, 3, 4, 5] is 3.5.
Weighted mean
a mean used when greater importance is placed on specific entries, or when a frequency distribution assigns a representative value to each class. Each value is multiplied by its weight, and the sum of those products is divided by the sum of the weights.
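A minimal sketch of the weighted mean for a frequency distribution. The household sizes and counts are hypothetical, chosen only to illustrate the arithmetic, and `weighted_mean` is an illustrative helper name.

```python
def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weight)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical frequency distribution: a representative household size
# for each class, and the number of households in that class.
household_sizes = [1, 2, 3]
household_counts = [10, 30, 10]

# (1*10 + 2*30 + 3*10) / (10 + 30 + 10)
average_household_size = weighted_mean(household_sizes, household_counts)
```

Here the class with 30 households counts three times as much as either of the other classes.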
Median
the middle number of a ranked distribution. The median of [2, 3, 4, 6, 7] is 4.
Mode
the most frequent number in a distribution. The modes of [1, 2, 3, 3, 5, 6, 7, 7] are 3 and 7. There can be more than one mode for a data set.
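The three examples from the Mean, Median, and Mode cards can be checked with Python's standard `statistics` module. This is only a quick sketch; `multimode` requires Python 3.8 or later.

```python
import statistics

# Mean: the average of the values.
mean_value = statistics.mean([2, 3, 4, 5])

# Median: the middle value of the ranked list.
median_value = statistics.median([2, 3, 4, 6, 7])

# Mode: multimode returns every value tied for most frequent,
# in the order each first appears in the data.
modes = statistics.multimode([1, 2, 3, 3, 5, 6, 7, 7])
```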
Nominal data
is classified into mutually exclusive groups that lack intrinsic order. Race, social security number, and sex are examples of nominal data. Mode is the only measure of central tendency that can be used for nominal data.
Ordinal data
has values that are ranked so that inferences can be made regarding the magnitude. However, ordinal data has no fixed interval between values. Educational attainment or a letter grade on a test are examples of ordinal data. Mode and median are the only measures of central tendency that can be used for ordinal data.
Interval data
has an ordered relationship with equal intervals between values, but no true zero point, so ratios between values are not meaningful. For temperature, 60 degrees is not twice as warm as 30 degrees. Mean is the best measure of central tendency for interval data. Where the data is skewed, the median can be used.
Ratio data
has an ordered relationship, equal intervals, and a true zero point, so ratios between values are meaningful. Distance is an example of ratio data because 3.2 miles is twice as long as 1.6 miles. Any form of central tendency can be used for this type of data.
Qualitative Variables
can be nominal or ordinal.
Quantitative Variables
can be interval or ratio.
Continuous Variables
can have an infinite number of values, such as 1.1111.
Dichotomous Variables
can only have two possible values, such as unemployed or employed, which are symbolized as 0 and 1.
Hypothesis Test
allows for a determination of possible outcomes and the interrelationship between variables.
Null Hypothesis
shown as H0, is a statement that there are no differences. For example, a Null Hypothesis could be that Traffic Calming has no impact on traffic speed.
Alternate Hypothesis
designated as H1, proposes the relationship - Traffic Calming reduces traffic speed.
Normal distribution (data)
is one that is symmetrical around the mean. This is a bell curve.
Distribution skewed to the right
has a few high numbers (outliers) that pull the mean to the right. For example, if there are three $20 million homes in your community, they are likely to skew the mean home value to the right.
Distribution skewed to the left
has a few low numbers (outliers) that pull the mean to the left. When taking the AICP exam, for instance, a few people may give up and walk out resulting in a few very low scores, which would skew the mean score to the left.
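A short sketch of how an outlier pulls the mean but leaves the median nearly unchanged. The home values are hypothetical, with a single $20 million home standing in for the high outliers in the right-skew example.

```python
import statistics

# Hypothetical home values in dollars; the last entry is a high outlier.
home_values = [250_000, 300_000, 350_000, 400_000, 20_000_000]

mean_value = statistics.mean(home_values)      # pulled toward the outlier
median_value = statistics.median(home_values)  # stays at the middle home

# In a right-skewed distribution the mean sits above the median.
skewed_right = mean_value > median_value
```

This is also why the median is often preferred over the mean for skewed data such as home values or incomes.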
Range (dispersion)
the simplest measure of dispersion. The range is the difference between the highest and lowest scores in a distribution. For example, if the respondents in a neighborhood survey are between 18 and 62 years old, the age range is 62 - 18 = 44.
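The survey example works out as follows. The individual ages between 18 and 62 are hypothetical; only the lowest and highest values affect the result.

```python
# Hypothetical respondent ages; only the minimum and maximum matter.
ages = [18, 25, 37, 44, 62]

age_range = max(ages) - min(ages)
```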
Variance (dispersion)
the average squared difference of scores from the mean score of a distribution. Variance describes how far the numbers in a distribution lie from the mean.
Standard Deviation (dispersion)
is the square root of the variance. For instance, to describe how much wages vary among three employees at a planning department, we calculate the mean, variance, and standard deviation. If the employees earn $10, $20, and $35 per hour, the mean is $21.67. Employee 1 makes $21.67 - $10 = $11.67 less than the mean; employee 2 makes $21.67 - $20 = $1.67 less than the mean; and employee 3 makes $35 - $21.67 = $13.33 more than the mean.
To compute the variance, we first square each difference and sum the results: (11.67)^2 + (1.67)^2 + (13.33)^2 = 136.19 + 2.79 + 177.69 = 316.67. We then divide 316.67 by the number of observations minus 1 (the sample variance formula), which gives 316.67/(3-1) = 158.33.
The standard deviation is simply the square root of the variance. In this case, the square root of 158.33 is $12.58.
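The wage example can be reproduced step by step. This sketch follows the same arithmetic as the card, dividing by n - 1; the variable names are illustrative.

```python
import math

wages = [10.0, 20.0, 35.0]  # hourly wages from the example
n = len(wages)

# Step 1: the mean.
mean_wage = sum(wages) / n

# Step 2: squared difference of each wage from the mean.
squared_deviations = [(w - mean_wage) ** 2 for w in wages]

# Step 3: sum the squared differences and divide by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)

# Step 4: the standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)
```

Rounded to two decimals, these values match the card: a mean of 21.67, a variance of 158.33, and a standard deviation of 12.58.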
Coefficient of Variation (dispersion)
measures the relative dispersion from the mean and is measured by taking the standard deviation and dividing by the mean.
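As a sketch, the coefficient of variation for the wages in the Standard Deviation card ($10, $20, and $35 per hour) can be computed with the standard `statistics` module. `statistics.stdev` uses the n - 1 denominator, matching that card.

```python
import statistics

wages = [10.0, 20.0, 35.0]  # hourly wages from the standard deviation example

mean_wage = statistics.mean(wages)
sd_wage = statistics.stdev(wages)  # sample standard deviation (n - 1)

# Standard deviation divided by the mean; a unitless ratio.
coefficient_of_variation = sd_wage / mean_wage
```

The result is about 0.58, meaning the standard deviation is roughly 58% of the mean wage.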
Standard Error (dispersion)
is the standard deviation of a sampling distribution. Standard errors indicate the degree of sampling fluctuation. The larger the sample size, the smaller the standard error.
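For the sample mean, the standard error is commonly estimated as the sample standard deviation divided by the square root of the sample size. The sketch below uses a hypothetical standard deviation of 12 to show the effect of a larger sample; `standard_error_of_mean` is an illustrative helper name.

```python
import math


def standard_error_of_mean(sample_sd, sample_size):
    """Estimate the standard error of the mean as sd / sqrt(n)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_sd / math.sqrt(sample_size)


# Hypothetical standard deviation of 12 with two different sample sizes.
se_n25 = standard_error_of_mean(12.0, 25)
se_n100 = standard_error_of_mean(12.0, 100)
```

Quadrupling the sample size from 25 to 100 halves the standard error, from 2.4 to 1.2.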
Confidence Interval (dispersion)
gives an estimated range of values that is likely to include an unknown population parameter. The width of the confidence interval gives us an idea of how uncertain we are about the unknown parameter. A wide interval may indicate that we need more data before we can make a definitive statement. You frequently see confidence intervals reported with polls. For example, 42% of California residents support one presidential candidate, 36% support another candidate, and 22% are undecided, +/- 3%. The 3% is the margin of error; the confidence interval for the first candidate runs from 39% to 45%.
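The +/- figure in the poll example can be approximated with the normal-approximation interval for a proportion. The card does not give a sample size or a confidence level, so the 1,000 respondents and the 95% level below are assumptions, and `proportion_confidence_interval` is an illustrative helper name.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Normal-approximation interval for a sample proportion.

    Returns (lower, upper, margin), where margin is the half-width.
    """
    # z value leaving (1 - confidence) / 2 in each tail, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


# 42% support from the example; n = 1000 is an assumed sample size.
lower, upper, margin = proportion_confidence_interval(0.42, 1000)
```

With these assumptions the half-width comes out at roughly 0.03, which is in line with the +/- 3% in the example.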
Chi Square (testing)
a non-parametric test statistic that provides a measure of the amount of difference between two frequency distributions. Chi Square is commonly used for probability distributions in inferential statistics. The Chi Square distribution is used to test the goodness of fit of an observed distribution to a theoretical one.
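A minimal sketch of the goodness-of-fit statistic: for each category, square the difference between the observed and expected counts, divide by the expected count, and sum. The counts are hypothetical, and `chi_square_statistic` is an illustrative helper name.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical survey counts across three categories versus the counts
# a theoretical distribution would predict for the same 100 respondents.
observed = [30, 50, 20]
expected = [25, 50, 25]

chi_square = chi_square_statistic(observed, expected)

# For a goodness-of-fit test, degrees of freedom = categories - 1.
degrees_of_freedom = len(observed) - 1
```

The statistic is then compared against a Chi Square table for the given degrees of freedom to judge whether the difference is larger than chance would explain.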
z-score (testing)
a measure of the distance, in standard deviation units, from the mean. This allows one to determine the likelihood, or probability, that something would happen.
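A short sketch of a z-score and the probability it implies, assuming the scores are normally distributed. The score, mean, and standard deviation are hypothetical.

```python
from statistics import NormalDist


def z_score(value, mean, sd):
    """Distance of value from the mean, in standard deviation units."""
    return (value - mean) / sd


# Hypothetical exam: mean score 60, standard deviation 5, one score of 70.
z = z_score(70.0, 60.0, 5.0)

# Probability of scoring at or below this value under a normal curve.
probability_at_or_below = NormalDist().cdf(z)
```

A z-score of 2 means the score is two standard deviations above the mean, which about 97.7% of a normal distribution falls at or below.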
t-test (testing)
allows comparison of the means of two groups to determine how likely it is that the difference between the two means occurred by chance. In order to conduct a t-test, one needs to know the number of subjects in each group, the difference between the means of each group, and the standard deviation for each group.
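The three inputs listed above are enough to compute a t statistic. The sketch below uses the unequal-variance (Welch) form, which is one common variant rather than the only one. The group figures are hypothetical, and `welch_t_statistic` is an illustrative helper name.

```python
import math


def welch_t_statistic(mean_difference, sd1, n1, sd2, n2):
    """t = difference in means / standard error of that difference."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return mean_difference / standard_error


# Hypothetical: two groups of 25 subjects, each with a standard deviation
# of 10, whose mean scores differ by 5 points.
t_value = welch_t_statistic(5.0, 10.0, 25, 10.0, 25)
```

Converting the t statistic into a probability requires a t distribution table or a statistics package; the Python standard library does not include one.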
ANOVA (testing)
an analysis of variance. It studies the relationship between two variables: the first variable must be nominal and the second must be interval.
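A minimal sketch of a one-way ANOVA F statistic, where group membership is the nominal variable and the measured values are the interval variable. The three groups of values are hypothetical, and `one_way_anova_f` is an illustrative helper name.

```python
def one_way_anova_f(groups):
    """F = (between-group variance) / (within-group variance)."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of each value around its own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical measurements for three groups.
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value indicates that the group means differ by more than the spread within each group would suggest.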
Correlation (testing)
tests the strength of the relationship between variables. The Correlation Coefficient indicates the type and strength of the relationship between variables, ranging from -1 to 1. The closer to 1 or -1, the stronger the relationship between the variables; a value near 0 indicates little or no linear relationship. For example, you would expect a strong correlation coefficient between score on the AICP exam and hours of study. Squaring the correlation coefficient results in r^2, the coefficient of determination.
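A sketch of the Pearson correlation coefficient for the study-hours example. The hours and scores are hypothetical data made up for illustration, and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = sum((xi - mean_x) ** 2 for xi in x)
    spread_y = sum((yi - mean_y) ** 2 for yi in y)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical hours of study and exam scores for five candidates.
hours_studied = [10, 20, 30, 40, 50]
exam_scores = [50, 55, 65, 70, 80]

r = pearson_r(hours_studied, exam_scores)
r_squared = r ** 2
```

For this made-up data r is about 0.99, a strong positive relationship, and r squared is about 0.99 as well.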
Regression (testing)
a test of the effect of independent variables on a dependent variable. A regression analysis explores the relationship between variables. For example, AICP Exam Score depends on number of hours studied, years of experience, and educational attainment. The result could show that for every 50 hours studied the score increases by 10%.
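A minimal least-squares sketch with a single independent variable. The example above lists several predictors, which would call for multiple regression; this simpler case only shows how a slope is read. The data are hypothetical, and `simple_linear_regression` is an illustrative helper name.

```python
def simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical hours of study (independent) and exam scores (dependent).
hours_studied = [10, 20, 30, 40, 50]
exam_scores = [50, 55, 65, 70, 80]

slope, intercept = simple_linear_regression(hours_studied, exam_scores)

# Predicted score for a hypothetical candidate who studies 60 hours.
predicted_score = intercept + slope * 60
```

For this made-up data the slope is 0.75, so each additional hour of study is associated with a 0.75-point increase in the predicted score.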
Sampling Error (testing)
occurs when one has taken a sample from a larger population and the sample is not representative of the population as a whole.
Nonsampling error
is one that cannot be explained by the representativeness of the sample. A nonsampling error can occur as a result of respondents misunderstanding a question or misreporting their answer, and can also include missing values.