Research and assessment methods Flashcards
Mean (average)
sum values, then divide by count
median
middle number in ranked data
mode
most frequent number or value
variance
average squared deviation from the mean
- calculate mean
- calculate the squared deviation for each observation (observation - mean)^2
- sum squared deviations
- divide by count of observations
note - if the observations are from a sample, rather than the whole population, in step 4, divide by one less than the count of observations
squared deviation
(observation - mean)^2
standard deviation
square root of variance
sqrt(variance)
coefficient of variation
standard deviation divided by the mean
standard deviation/mean
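The variance steps, standard deviation, and coefficient of variation above can be sketched in Python as follows. This is only a minimal illustration; the function name `variance` and the observations in `data` are made up for the example.

```python
import math

def variance(values, sample=False):
    """Average squared deviation from the mean.

    With sample=True, divide by one less than the count (sample variance).
    """
    mean = sum(values) / len(values)                       # step 1: mean
    squared_devs = [(v - mean) ** 2 for v in values]       # step 2: squared deviations
    total = sum(squared_devs)                              # step 3: sum them
    divisor = len(values) - 1 if sample else len(values)   # step 4: divide by count
    return total / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]                 # made-up observations, mean = 5
pop_variance = variance(data)                   # population variance
std_dev = math.sqrt(pop_variance)               # standard deviation = sqrt(variance)
coef_var = std_dev / (sum(data) / len(data))    # standard deviation / mean
sample_variance = variance(data, sample=True)   # divides by 7 instead of 8
```

For this data the population variance is 4, the standard deviation is 2, and the coefficient of variation is 0.4; the sample variance is slightly larger (32/7).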
z-score
- standardization of original variable
- subtract mean and divide by standard deviation
- mean of z-score is 0 and variance is 1
- z-score greater than 2 indicates observation is more than 2 standard deviations from the mean
z = (observation - mean)/standard deviation
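A short Python sketch of the z-score formula, assuming the population standard deviation is the one intended (`statistics.pstdev`). The helper name `z_scores` and the data are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize each observation: (observation - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)   # population standard deviation (assumption)
    return [(v - mean) / std_dev for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up observations: mean 5, standard deviation 2
z = z_scores(data)
```

The standardized values have mean 0 and variance 1; the observation 9 gets a z-score of 2, meaning it sits 2 standard deviations above the mean.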
interquartile range and fences
- difference in value of 75th percentile and 25th percentile
- fences = 1st quartile minus 1.5x the interquartile range and 3rd quartile plus 1.5x the interquartile range
- outliers are outside the fences
for example, in a set of 20 ranked observations, subtract the 5th value from the 15th value to approximate the interquartile range
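The fences can be sketched in Python as below. Quartile conventions differ between textbooks and software, so treat `statistics.quantiles` (default method) as one reasonable choice rather than the only one. The helper name `fences` and the data are made up.

```python
import statistics

def fences(values):
    """Return (lower fence, upper fence) using 1.5 x the interquartile range."""
    q1, _median, q3 = statistics.quantiles(values, n=4)   # 25th, 50th, 75th percentiles
    iqr = q3 - q1                                         # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [3, 5, 7, 8, 9, 10, 11, 12, 13, 14, 40]   # made-up observations
low, high = fences(data)
outliers = [v for v in data if v < low or v > high]   # values outside the fences
```

Here the quartiles are 7 and 13, so the fences are -2 and 22 and only the value 40 is flagged as an outlier.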
P-value, type 1 error
- false positive
- probability we reject the null hypothesis when it is actually true
- want this probability to be 5% or 1% or smaller (0.05 or 0.01)
t-test
compare means of two populations based on their sample averages
ANOVA
- analysis of variance
- more complex form of testing equality of means between groups
- more than 2 groups
- compare means of different groups
chi squared test
- measures fit
- tests relationship between 2 variables
- observed proportions compared to what is expected if variables are independent
- chi squared distribution: skewed; distribution of a sum of squared standard normal variables
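A minimal Python sketch of the chi squared statistic for a table of counts: each expected count is what independence would predict (row total x column total / grand total). The function name and the counts are hypothetical.

```python
def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic

table = [[30, 20],   # made-up counts, e.g. group A: yes / no
         [20, 30]]   #                      group B: yes / no
stat = chi_squared(table)
```

For `table` every expected count is 25, so the statistic is 4.0. A statistic of 0 would mean the observed counts match independence exactly.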
correlation coefficient
- measures strength of linear relationship of 2 variables
- between -1 and 1
- r-squared is square of correlation coefficient
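The correlation coefficient can be computed by hand as in this Python sketch (Pearson's r). The function name and the paired data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

xs = [1, 2, 3, 4, 5]   # made-up paired observations
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
r_squared = r ** 2
```

For this data r is about 0.77 and r-squared is 0.6.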
linear regression
hypothesizes relationship between a dependent variable and one or more explanatory variables
y = a + bx + e
y = dependent variable
x = independent variable
e = random error
a = intercept
b = slope coefficient
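A sketch of fitting y = a + bx + e with one explanatory variable. The card does not say how a and b are estimated; ordinary least squares is assumed here, and the function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept a and slope b (assumed method)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x   # the fitted line passes through the point of means
    return a, b

xs = [1, 2, 3, 4, 5]   # made-up independent variable
ys = [2, 4, 5, 4, 5]   # made-up dependent variable
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # e for each observation
```

For this data the slope is about 0.6 and the intercept about 2.2, and the residuals sum to roughly zero.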
what are the 3 measures of central tendency?
mean
median
mode
what are the 3 measures of dispersion?
range
variance
standard deviation
Linear Method
Population Estimation
- uses change in population over a period of time to determine change into the future in a linear fashion
- example: population growth historically 1,000 people per year; assume future growth to be 1,000 people per year
- results in a straight line
Exponential Method
Population Estimation
- uses rate of population change to estimate current or future population
- for example: growth historically at 2% per year; growth in the future will be 2% per year
- results in a curved line
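The linear and exponential methods can be compared with a small Python sketch. The base population of 50,000 and the function names are made up; the 1,000 people per year and 2% per year come from the examples on the cards.

```python
def linear_projection(base_population, annual_change, years):
    """Add the same number of people every year (straight line)."""
    return base_population + annual_change * years

def exponential_projection(base_population, annual_rate, years):
    """Grow by the same percentage every year (curved line)."""
    return base_population * (1 + annual_rate) ** years

base = 50_000   # hypothetical current population
linear_10 = linear_projection(base, 1_000, 10)
exponential_10 = exponential_projection(base, 0.02, 10)
```

After 10 years the linear method gives 60,000 and the exponential method about 60,950; the gap widens as the horizon lengthens because the percentage is applied to a growing base.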
Modified Exponential Method
Population Estimation
- like exponential method, it uses rate of change in population historically to predict future population
- assumes there is a cap to the change and at some point growth will slow or stop
- results in an S-shaped curve
Gompertz Projection
Population Estimation
- variation of exponential and modified exponential methods of estimating population
- growth is slowest at the beginning and speeds up over time
Symptomatic Method
Population Estimation
- uses available data indirectly related to population size, such as housing starts, new driver's licenses, water taps, phone lines, voter registration, utility connections, etc.
- population estimate based on data and the average household size (or other relevant ratio)
- for example: if 100 new single family building permits are issued in a year, and average household size is 2.5, estimate 250 new people in community.
Step-Down Ratio Method
Population Estimation
- uses the ratio of population of a smaller geography to a larger geography, such as city to county, at a known time to estimate current or future population
- example: city makes up 20% of population of county in 2000. If county population in 2005 is 20,000, then 20% of that is the estimated city population (4,000)
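The step-down example can be sketched in Python. The base-year counts (3,600 city residents out of 18,000 county residents) are hypothetical numbers chosen to give the 20% share in the example; the function name is also made up.

```python
def step_down_estimate(small_area_base, large_area_base, large_area_target):
    """Apply the small area's base-year share to the larger area's later population."""
    share = small_area_base / large_area_base
    return share * large_area_target

# Hypothetical 2000 counts giving a 20% share, applied to the 2005 county population
city_2005 = step_down_estimate(3_600, 18_000, 20_000)
```

This returns about 4,000, matching the example on the card. The method assumes the share stays constant over time.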
Distributed Housing Unit Method
Population Estimation
- multiplies number of housing units by occupancy rate and persons per household
- reliable for slow growth or stable communities, less so for quickly changing communities
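A one-line calculation is enough to illustrate the method. All three inputs below are hypothetical, as is the function name.

```python
def housing_unit_estimate(housing_units, occupancy_rate, persons_per_household):
    """Population = housing units x occupancy rate x persons per household."""
    return housing_units * occupancy_rate * persons_per_household

# Hypothetical community: 4,000 units, 95% occupied, 2.5 persons per household
population = housing_unit_estimate(4_000, 0.95, 2.5)
```

With these inputs the estimate is about 9,500 people.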
Cohort Survival Method
Population Estimation
- uses current population plus net natural increase (births minus deaths) plus net migration (in-migration minus out-migration) to calculate future population
- calculated for men and women in specific age groups
- uses specific time intervals - smallest interval is based on the time it takes for all members of a cohort to age to the next cohort (typically 5 years)
- natural increase = children born minus deaths during the time interval
- death rate = number of deaths per 1,000 people
- crude birth rate = total number of births per 1,000 people
- general fertility rate = number of births per 1,000 females of childbearing age
- age-specific fertility rate = number of births per 1,000 females in a given age group
- net migration = difference between number of people moving in and moving out
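One interval of the cohort survival method might be sketched as below. This is a simplified illustration under several assumptions: a single sex, only three age groups, births supplied directly rather than derived from fertility rates, and an open-ended oldest group. Every number and name is made up.

```python
def project_one_interval(cohorts, survival, births, net_migration):
    """Advance age cohorts by one interval (youngest group first).

    cohorts[i]        people currently in age group i
    survival[i]       share of group i expected to survive the interval
    births            children born during the interval (new youngest group)
    net_migration[i]  in-migrants minus out-migrants for group i
    """
    # survivors of each group move up one group; births fill the youngest group
    aged = [births] + [c * s for c, s in zip(cohorts[:-1], survival[:-1])]
    # assumption: the oldest group is open-ended, so its survivors stay in it
    aged[-1] += cohorts[-1] * survival[-1]
    return [a + m for a, m in zip(aged, net_migration)]

cohorts = [1_000, 900, 800]       # hypothetical current population by age group
survival = [0.99, 0.98, 0.90]     # hypothetical survival rates
future = project_one_interval(cohorts, survival, births=120, net_migration=[10, -5, 0])
```

A full projection would repeat this step for each interval and run it separately for men and women in each age group.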
Discrete variable
- a numerical variable that can be counted and comes in distinct values with nothing in between (i.e., no fractions, only certain increments, etc.)
- example: the number of accidents (come in increments of one)
Binary variable
dichotomous variable
- only offers two choices
- example: 1 or 0
Continuous Variable
- variables that can take any value, in any increment/fraction
- example: temperature can be 51 degrees or 51.23 degrees
Nominal data
- mutually exclusive groups or categories
- lack intrinsic order
- examples: zoning districts, social security number, gender
for example, the labels do not matter and do not imply an order or specific numerical value
Ordinal Data
- ordered categories implying a ranking of observations
- may be given numerical values, but the values themselves are meaningless, only the rank matters
- examples: letter grades, suitability for development, response scales on a survey
for example, a rank of 2 versus 4 only implies that 2 is better/before 4, but not that 2 is half as much as 4.
Interval Data
- has an ordered relationship where the difference between the scales has a meaningful interpretation
- example: temperature - the difference between 30 and 40 degrees is the same as the difference between 20 and 30 degrees, but 40 degrees is not twice as warm as 20 degrees
Ratio Data
- both absolute and relative differences have a meaning
- for example: distance - the difference between 30 and 40 miles is the same as 20 to 30 miles AND 40 miles is twice as far as 20 miles
Population versus sample
- population = the entire group you want to draw conclusions about
- sample = the specific group you will collect data from to inform conclusions about the entire group
Do the American Community Survey (ACS) and the decennial census measure the entire population or a sample?
- decennial census measures data about the entire population
- ACS only measures a sample, a small percentage of the entire population
Descriptive statistics
- draw conclusions on data that has been observed
- can be for a sample or a population
- organized and presented as purely factual
- examples: mean, median, mode, standard deviation, quantiles etc.
Inferential Statistics
- describes or predicts what has not been observed
- when using a sample to generalize about the full population, or when you are trying to describe/predict behavior of a new population
- present results in form of probabilities
- draw conclusions beyond available data
- examples: hypothesis testing, confidence intervals, regression, correlation
Normal or Gaussian Distribution
bell curve
- symmetric
- spread around the mean can be related to the size of samples
- about 95% of observations are within 2 standard deviations of the mean
95% confidence interval
there is a 95% chance that, given your sample, the sample result is within about two standard errors of the actual (population) value
Margin of error
- expresses the amount of random sampling error in the results of a survey
- larger margin of error means less confidence
- about 2x the standard error of the estimate (for a 95% confidence level)
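A rough Python sketch of a 95% confidence interval and margin of error for a sample mean. It assumes the spread being doubled is the standard error of the mean (sample standard deviation divided by the square root of the sample size), which the cards do not spell out. Names and data are hypothetical.

```python
import math
import statistics

def mean_with_margin(sample, multiplier=2.0):
    """Return (mean, margin of error) for a sample mean.

    multiplier=2.0 is the rule-of-thumb value for about 95% confidence.
    """
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))   # sample sd / sqrt(n)
    return mean, multiplier * std_error

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up survey responses
mean, margin = mean_with_margin(sample)
interval = (mean - margin, mean + margin)   # approximate 95% confidence interval
```

A larger sample shrinks the standard error, and with it the margin of error.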
Hypothesis test
- null hypothesis and alternative hypothesis
- goal is to reject the null hypothesis
Economic Base Analysis
- looks at basic and non-basic activities
- exporting industries make up economic base of a region
- calculate location quotient for each industry - less than 1 indicates importing economy, greater than 1 indicates exporting economy
- basic industry = can be exported; makes up the economic base of a region
- non-basic industry = locally oriented, cannot be exported
- location quotient = an industry’s share of local employment divided by its share of national employment (or employment in another reference geography)
Basic activity/industry
- can be exported
- make up economic base of a region
Non-basic activity/industry
- locally oriented
- cannot be exported
location quotient
- an industry’s share of local employment divided by its share of national employment (or employment in another reference geography)
- less than 1 indicates importing economy
- greater than 1 indicates exporting economy
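The location quotient can be sketched in Python as follows. The employment counts and the function name are hypothetical.

```python
def location_quotient(local_industry, local_total, national_industry, national_total):
    """Industry's share of local employment divided by its share of national employment."""
    local_share = local_industry / local_total
    national_share = national_industry / national_total
    return local_share / national_share

# Hypothetical: 3,000 of 20,000 local jobs (15%) vs 15 of 150 million national jobs (10%)
lq = location_quotient(3_000, 20_000, 15_000_000, 150_000_000)
is_exporting = lq > 1   # greater than 1 suggests an exporting (basic) industry
```

Here the quotient is about 1.5, so the industry is more concentrated locally than nationally.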
Shift-Share analysis
- analyzes regional economy in comparison with national economy
- determines what portion of local economic growth or decline can be attributed to national, industry-specific, or regional factors
- Industrial mix effect, national growth effect, expected change, and regional competitive effect
- actual change - expected change = competitive effect
industrial mix effect
Shift Share Analysis
- the number of jobs expected to be added or lost within an industry in the region based on the industry’s national growth or decline
- (industry growth rate - national economy growth rate) X Number of regional industry jobs
National Growth Effect
Shift Share Analysis
- the number of jobs an industry is expected to gain or lose according to the nation’s job growth
- national growth rate X number of regional industry jobs
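The shift-share components can be put together as in this Python sketch. It assumes the expected change is the national growth effect plus the industrial mix effect, which the cards imply but do not state. All figures and names are made up.

```python
def shift_share(regional_jobs, national_rate, industry_rate, actual_change):
    """Split an industry's regional job change into shift-share components.

    Rates are decimal growth rates over the study period (0.02 = 2%).
    """
    national_growth = national_rate * regional_jobs
    industrial_mix = (industry_rate - national_rate) * regional_jobs
    expected = national_growth + industrial_mix    # assumed definition of expected change
    competitive = actual_change - expected         # actual change - expected change
    return national_growth, industrial_mix, expected, competitive

# Hypothetical: 1,000 regional jobs, the nation grows 2%, the industry grows 5%
# nationally, and the region actually adds 80 jobs in that industry.
national_growth, industrial_mix, expected, competitive = shift_share(1_000, 0.02, 0.05, 80)
```

With these inputs the national growth effect is about 20 jobs, the industrial mix effect about 30, the expected change about 50, and the competitive effect about 30.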
Input-Output analysis
- determine the employment effect that a particular project has on a local economy
- uses a series of multipliers to estimate direct, indirect, and induced employment effects
- identify primary suppliers, intermediate suppliers, intermediate purchasers, and final purchasers
- economy’s total output is equal to total production plus intermediate sales
- three tables: transactions, direct requirements, and total requirements
- requires a lot of data, costly
- primary supplier = purchase inputs for the production of final goods
- intermediate suppliers = purchase inputs for the production of intermediate goods
- intermediate purchaser = buy intermediate goods and use them for the production of final goods
- final purchasers = purchase final goods for their own use, not production
North American Industry Classification System
NAICS
- standard used by Federal statistical agencies in classifying business establishments for the purpose of collecting, analyzing, and publishing statistical data about the U.S. economy
- developed by the Office of Management and Budget; in 1997 it replaced the Standard Industrial Classification (SIC) system
- developed in partnership with Canada and Mexico
- The first two digits designate the largest business sector, the third digit designates the subsector, the fourth digit designates the industry group, the fifth digit designates the NAICS industries, and the sixth digit designates the national industries.
Survey
- research method that allows one to collect data on a topic that cannot be directly observed, such as opinions on downtown retailing opportunities
- typically taken of a sample of a population
cross-sectional survey
- gathers information about a population at a single point in time
longitudinal survey
- conducted over a period of time at specific time intervals
Written surveys
- mailed, printed, or administered in group setting
- large/broad sample size
- low-cost
- low response rate - around 20%
- requires literacy of respondents
Group administered surveys
- small sample size
- high and quick response rate
- challenge getting everyone together to complete
example: survey at the end of a workout class
drop-off survey
- survey is hand-delivered or dropped off at respondent’s residence or business
- personal contact increases response rates (compared to typical mail surveys)
- expensive - time and people to deliver surveys
- smaller sample size than mail survey
phone survey
- best for yes/no questions; longer questions or multiple answers harder to administer
- allow follow-up or further explanation on answers
- response rate varies greatly and is declining as fewer people have landline phones
- more expensive than mail or online
- can be biased from interaction with interviewer
online survey
- inexpensive and quick responses
- higher response rate than written and interview surveys
- will not reach people without internet access
Probability sampling
- there is a direct mathematical relation between the sample and the population so that precise conclusions can be drawn
- examples: random, systematic, stratified, cluster samples
Non-probability sampling
- no precise connection between sample and population
- results must be interpreted with caution since they are not necessarily representative of the population
- examples: convenience, snowball, or volunteer samples
Random sample
- everyone has the same chance of being selected to participate
- best when little is known about the population, or when there are too many differences to divide it into subsets
systematic sampling
- sample members are selected from a larger population at a fixed periodic interval after a starting point is chosen
steps:
1. define your population
2. settle on a sample size
3. assign every member of the population a number
4. divide population size by the desired sample size to determine sampling interval
5. choose a starting point
6. identify every nth member of the population (n being sampling interval) to be members of the sample
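The steps above can be sketched in Python. Picking the starting point at random within the first interval is an assumption (the steps only say to choose one); the function name and the numbered population are hypothetical.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Select every nth member of a numbered population list."""
    interval = len(population) // sample_size   # step 4: sampling interval
    rng = random.Random(seed)
    start = rng.randrange(interval)             # step 5: starting point (random, by assumption)
    # step 6: take every nth member from the starting point onward
    return population[start::interval][:sample_size]

population = list(range(1, 101))   # steps 1 and 3: 100 members numbered 1..100
sample = systematic_sample(population, sample_size=10, seed=1)   # step 2: sample of 10
```

With 100 members and a sample of 10, the interval is 10, so the sample contains 10 members spaced exactly 10 apart.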
Stratified Random Sampling
- population is divided into separate groups or classes, from which a sample is drawn such that the classes in the population are represented by the classes in the sample.
- divide population into homogeneous groups called strata (age, income, etc)
- select random samples from each stratum in a number proportional to the stratum’s size compared to the population
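Proportional allocation across strata might look like the Python sketch below. The strata names, their sizes, and the function name are made up; note that rounding each stratum's share can leave the total slightly off the target sample size.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Draw a random sample from each stratum in proportion to its size.

    strata maps a stratum name to the list of its members.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(sample_size * len(members) / total)   # stratum's proportional share
        sample[name] = rng.sample(members, k)           # simple random draw within the stratum
    return sample

strata = {                              # hypothetical population of 1,000
    "owners": list(range(0, 600)),
    "renters": list(range(600, 900)),
    "other": list(range(900, 1000)),
}
sample = stratified_sample(strata, sample_size=50, seed=1)
```

Here the strata make up 60%, 30%, and 10% of the population, so a sample of 50 draws 30, 15, and 5 members respectively.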
Cluster Sample
- a form of stratified sampling where a specific target group out of the general population is sampled from, such as the elderly, or residents of a specific neighborhood
Convenience Sample
- sampling individuals readily available
- non-probability sample, not necessarily representative of population
Snowball sample
- when one interviewed person suggests other potential interviewees
- non-probability sample, not necessarily representative of population
volunteer sample
- sample consists of self-selected respondents
one specific example is volunteered geographic information (VGI) - when participants enter information on a web map