8: Research Methods / Statistics Flashcards
What is ratio data?
Ratio data is the gold standard of measurement, where both absolute and relative differences have meaning. An example would be a distance measurement.
Dif between 40 and 30 miles is the same as the dif between 30 and 20 miles, AND 40 miles is twice as far as 20 miles.
Nominal Data
This type of measurement is classified into mutually exclusive groups or categories and lacks intrinsic order.
(examples: zoning classification, social security number).
The labels of the categories do not imply any order.
Hypothesis test
This type of test is designed to reject a null hypothesis, but never to accept the alternative hypothesis.
Symptomatic Method
Uses data sets that reflect population change, such as building permits, to estimate the current population.
Systematic random sample
Equal chance of being selected, every Xth person is surveyed
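A minimal sketch of systematic sampling, assuming the sampling frame is a simple list. The function name `systematic_sample`, the random starting offset, and the resident IDs are illustrative assumptions, not part of the card.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Return every k-th element of frame, starting at a random offset in [0, k)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start so each unit has an equal chance
    return frame[start::k]

residents = list(range(1, 101))  # hypothetical frame of 100 resident IDs
sample = systematic_sample(residents, k=10, seed=42)
print(sample)
```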
What is the probability of an event that is certain to happen?
1 - probabilities range from 0 to 1
Total acreage of federal indian reservations in the U.S.
56.2 million
What is a positive correlation?
When high scores on one variable are associated with high scores on a second variable
Analysis of the relationship between two variables
Regression analysis
Total acreage of national forest land in the U.S.
192 million
Difference between the lowest and highest score on an exam
Range
Coefficient of Correlation
Measures the degree to which two variables are related
Stratified Sample
Subdivide the population into at least two different subgroups that share the same characteristics, then draw a sample from each subgroup.
SYSTEMATIC stratified sample represents the most effective way to get an accurate cross-section of the local population.
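A rough sketch of stratified sampling, assuming a simple random draw of the same fraction within each subgroup (a systematic draw within each stratum would be a variation). The function name and the owner/renter households are hypothetical.

```python
import random

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Group units into strata, then draw the same fraction at random from each."""
    rng = random.Random(seed)
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 60 owner households and 40 renter households.
households = [("owner", i) for i in range(60)] + [("renter", i) for i in range(40)]
picked = stratified_sample(households, stratum_of=lambda h: h[0], fraction=0.1, seed=1)
owners = sum(1 for h in picked if h[0] == "owner")
renters = sum(1 for h in picked if h[0] == "renter")
print(owners, renters)  # 6 4
```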
Qualitative V Quantitative V Mixed Methods
Qualitative: Approach for understanding the meaning individuals and groups ascribe to a human or social problem. Emerging questions.
Quantitative: Approach for testing objective theories by examining the relationships among variables (deductive). Numbered data.
Mixed methods: Collection of both qualitative & quantitative data. Integrating the 2 forms of data.
Discourse Analysis
Study of the way versions of the world, society, events, and psyche are produced in the use of language and discourse. It is often concerned with the construction of subjects within various forms of knowledge / power.
EXAMPLES: Semiotics, deconstruction, narrative analysis
Ethnography
Multi-method qualitative approach - studies people in their “naturally occurring settings” or “fields” by means of methods which capture their social meanings and ordinary activities.
Grounded Theory
Inductive form of qualitative research where data collection & analysis are conducted together. Theories remain grounded in the observations rather than generated in the abstract.
Approach that develops the theory from the data collected rather than applying a theory to the data.
Narrative Analysis
Form of discourse analysis that seeks to study the textual devices at work in the constructions of process or sequence within a text. Tells the researcher about the meaning people give to events in their lives.
3 steps to a statistical process:
1- Collect Data
2- Describe & Summarize the distribution of values in the data set.
3- Interpret by means of inferential stats & stat modeling.
Ordinal Data
Ordered categories, which imply a ranking of the observations. Even though ordinal data may be given numeric values (example: 1, 2, 3, 4), the values only indicate order; the differences between them are not meaningful.
Example: Letter grades, suitability for development, response scales on a survey.
Interval data
Ordered relationship where the difference between the scales has a meaningful interpretation. Typical example = temperature.
Dif between 40 & 30 degrees is the same as the dif between 30 & 20 degrees, but 40 degrees is NOT twice as warm as 20 degrees (the zero point is arbitrary).
Continuous variables
Can take an infinite number of values, both positive and negative, & with as fine of a degree of precision as desired.
Discrete variables
Can only take a finite number of distinct values. Example = count of the number of events, such as the number of accidents per month. Cannot be negative, can only take on integer values.
Binary or dichotomous variables = can only take on 2 values coded as 0 and as 1.
Population
Totality of some entity
Sample
Subset of the population.
Descriptive Statistics
Describes the characteristics of the distribution of values in a population or in a sample.
For example - descriptive stat such as the mean could be applied to the age distribution in the population of AICP exam takers. On average, test takers are 30 years old.
Inferential statistics
Use probability theory to determine characteristics of a population based on observations made on a sample from that population.
We infer things about the population based on what is observed in the sample.
Example - take a sample of 25 test takers and use their average age to say something about the mean age of all the test takers.
Distribution
Overall shape of all observed data.
Can be listed as an ordered table or graphically represented by a histogram or density plot.
HISTOGRAM: groups observations in bins represented in bar chart.
DENSITY PLOT: Shows a smooth curve.
Characteristics are summarized by descriptive statistics: like central tendency, dispersion, symmetry or lack thereof (skewness) & presence of thick tails aka higher likelihood of extreme values (kurtosis).
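A small sketch of how a histogram groups observations into bins, assuming equal-width bins labeled by their lower edge. The ages, the bin width, and the function name are made up for illustration.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count observations per bin; each bin is keyed by its lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

ages = [22, 25, 31, 34, 38, 41, 47, 52]  # hypothetical ages
width = 10
counts = histogram_counts(ages, width)
for lower in sorted(counts):
    # One '#' per observation gives a text version of the bar chart.
    print(f"{lower}-{lower + width - 1}: {'#' * counts[lower]}")
```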
Range
Difference between largest & smallest value
Normal / Gaussian distribution
Bell curve.
Distribution is symmetric & has the additional property that spread around the mean can be related to the proportion of observations.
Often used as the reference distribution for statistical inference.
Symmetric distribution
One where an equal number of observations are below the mean & above the mean.
Asymmetric distribution: there are either more observations below the mean or more above the mean (skewed).
Central Tendency
Typical or representative value for the distribution of observed values.
Ways to measure this = mean, median, mode.
Can be applied to the population as a whole, or to a sample from the population.
Mean, Weighted Mean
Average of a distribution. Computed by adding up the values and dividing by the number of observations.
Weighted mean - when there is a greater importance placed on specific entries or when representative values are used for groups of observations.
Mean appropriate for interval & ratio data, not for ordinal or nominal.
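A quick sketch of the two calculations, assuming group midpoints stand in as the representative values and group counts serve as the weights. The numbers are hypothetical.

```python
def mean(values):
    """Add up the values and divide by the number of observations."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

midpoints = [25, 35, 45]  # hypothetical age-group midpoints
counts = [20, 20, 10]     # hypothetical number of people in each group
print(mean(midpoints))                   # 35.0
print(weighted_mean(midpoints, counts))  # 33.0
```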
Median
Middle value of a ranked distribution
Mode
Most frequent number in a distribution
Symmetry
Symmetric distribution: Mean and median tend to be very close.
Skewed distributions: Mean and median tend to be different.
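A short illustration using Python's standard `statistics` module with two made-up data sets: in the symmetric one the mean and median coincide, while a single extreme value pulls the mean away from the median in the skewed one.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 40]  # one extreme value pulls the mean upward

for data in (symmetric, skewed):
    print("mean:", statistics.mean(data), "median:", statistics.median(data))

# The mode is the most frequent value.
print("mode:", statistics.mode([2, 3, 3, 5]))
```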
Variance & Standard Deviation
Both based on the squared difference from the mean
Standard deviation is the square root of the variance.
Standard deviation is in the same units as the original variable and is preferred to describe how values are spread out around the central tendency.
LARGER VARIANCE = greater spread around the mean.
Used for interval and ratio data.
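A minimal sketch of both measures, assuming the population form (divide by the number of observations). A sample variance would typically divide by n - 1 instead. The data are made up.

```python
import math

def variance(values):
    """Average of the squared differences from the mean (population form)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5
print(variance(data))            # 4.0
print(standard_deviation(data))  # 2.0
```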
Coefficient of Variation
Relative dispersion from the mean by taking the standard deviation and dividing by the mean.
Used for interval and ratio data.
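A sketch of the ratio, assuming the population standard deviation from the standard `statistics` module. The two made-up data sets have the same spread but different means, so their relative dispersion differs.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (a unitless ratio)."""
    return statistics.pstdev(values) / statistics.fmean(values)

low_mean = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical, mean 5, SD 2
high_mean = [v + 100 for v in low_mean]  # same SD, mean 105

cv_low = coefficient_of_variation(low_mean)
cv_high = coefficient_of_variation(high_mean)
print(round(cv_low, 3), round(cv_high, 3))
```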
Z-Score
Standardization of the original variable by subtracting the mean and dividing by the standard deviation.
As a result the mean of the z-score is 0 and the variance (and therefore the standard deviation) is 1.
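A brief sketch of the standardization, assuming the population standard deviation and made-up data, with a check that the resulting scores have mean 0 and standard deviation 1.

```python
import statistics

def z_scores(values):
    """Subtract the mean from each value, then divide by the standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
z = z_scores(data)
print(z)
print(statistics.fmean(z), statistics.pstdev(z))
```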
Inter-quartile range (IQR)
Difference in value between the 75th percentile and the 25th percentile.
Example - if we have 20 observations ranked in increasing order, take the 5th and the 15th observation and compute the difference between those two values. This is the IQR.
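A sketch that follows the rank rule in the example above (the observations at one quarter and three quarters of the way through the ranked list). The function name is an assumption, and statistical packages usually interpolate between observations, so their quartiles can differ slightly.

```python
def iqr_by_rank(values):
    """IQR from the ranked observations at positions n/4 and 3n/4 (1-indexed)."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 4:
        raise ValueError("need at least 4 observations")
    lower = ranked[n // 4 - 1]      # the 5th observation when n = 20
    upper = ranked[3 * n // 4 - 1]  # the 15th observation when n = 20
    return upper - lower

scores = list(range(1, 21))  # hypothetical: 20 observations, values 1..20
print(iqr_by_rank(scores))   # 10
```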
Hypothesis Test
Start by setting up a null hypothesis (used as a reference).
Then find evidence in the data to REJECT the null hypothesis statement in the direction of the alternative hypothesis.
Statistical evidence only provides support to reject the null hypothesis, never to accept the alternative hypothesis.
Statistical decision
The significance level of a test is the probability of a Type I Error: rejecting the null hypothesis when it is actually true. The null hypothesis is rejected when the p-value of the test falls below this level.
Confidence interval
Range of the confidence interval depends on the sampling error.
If the sampling error is large, there isn’t much information in the sample relative to the population.
SMALLER sampling error = more precise statements.
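A rough sketch of an interval for a mean, assuming a normal approximation (small samples would more typically use a t distribution). The function name and the sample figures are hypothetical. It shows that a larger sample shrinks the sampling error and narrows the interval.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Interval = mean +/- z * (sd / sqrt(n)), using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)           # z times the standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical: average age 30 with standard deviation 5.
small = mean_confidence_interval(30, 5, n=25)
large = mean_confidence_interval(30, 5, n=100)
print(small)
print(large)
```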
T-test (Student's t-test) & how to test
Typically used to compare the means of 2 populations based on their sample averages.
Can be carried out by testing the significance of a regression coefficient.
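A sketch of the test statistic only, assuming the Welch form (which does not require equal variances). Turning it into a p-value needs the t distribution, which is left out here. The commute times are made up.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Difference of sample means divided by its standard error."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance uses the n - 1 (sample) denominator.
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

commute_a = [30, 32, 34, 36, 38]  # hypothetical minutes, neighborhood A
commute_b = [20, 22, 24, 26, 28]  # hypothetical minutes, neighborhood B
t = welch_t(commute_a, commute_b)
print(t)  # 5.0
```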
ANOVA or Analysis of variance
More complex form of testing the equality of means between groups.
Use a treatment group & a control group.
F-test is a simple case of ANOVA.
Chi Square Test
Measure of Fit - Test that assesses the difference between a sample distribution and a hypothesized distribution.
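A sketch of the goodness-of-fit statistic, assuming observed and expected counts per category. The counts are hypothetical, and comparing the statistic against a chi-square critical value is not shown.

```python
def chi_square_statistic(observed, expected):
    """Sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical survey of 100 residents across 4 districts;
# the hypothesized distribution is an even split.
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 2))  # 12.32
```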
Correlation Coefficient
Measures the strength of a linear relationship between two variables.
Does NOT imply anything about causation (ex- whether one variable influences the other).
Positive correlation vs negative correlation
Positive = high values of one variable match high values of the other.
Negative = high values of one variable match low values of the other and vice versa.
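A sketch of the Pearson correlation coefficient computed by hand. The study hours, free hours, and exam scores are invented to show one negative and one positive relationship.

```python
import math

def pearson_r(xs, ys):
    """Strength of the linear relationship between xs and ys, from -1 to +1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)

study_hours = [5, 10, 15, 20, 25]  # hypothetical
free_hours = [30, 26, 20, 16, 8]   # falls as study hours rise
exam_score = [55, 62, 70, 74, 85]  # rises as study hours rise

r_negative = pearson_r(study_hours, free_hours)
r_positive = pearson_r(study_hours, exam_score)
print(round(r_negative, 2), round(r_positive, 2))
```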
Linear regression
Hypothesizes a linear relationship between a dependent variable & one or more explanatory variables.
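A minimal least-squares sketch for the case of one explanatory variable, using made-up data that lie exactly on a line. The function name is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

x = [1, 2, 3, 4]  # hypothetical explanatory variable
y = [3, 5, 7, 9]  # hypothetical dependent variable (exactly 1 + 2x)
slope, intercept = fit_line(x, y)
print(slope, intercept)  # 2.0 1.0
```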
TIGER
Acronym for Topologically Integrated Geographic Encoding and Referencing, the map format used for Census data. A TIGER map includes streets, railroads, zip codes, and landmarks.
Used by the Census Bureau and can be downloaded into a GIS system.
Digital Aerial Photography
Allowed for increased accuracy, can be incorporated into GIS.
Digital Elevation Models (DEM)
Show digital data about the elevation of the earth’s surface as it varies across communities, allows planners to analyze and map it. DEMs can be used for stormwater management, flood control, land use decisions, and other purposes.
Light Detection and Ranging (LIDAR)
New technology using a laser, instead of radio waves, that is mounted in an airplane to provide detailed topo information.
LIDAR can provide a dense pattern of data points to create 1-foot contours for digital elevation models (DEMs)
Use in watershed mapping, hydrologic modeling for flood control.
UrbanSim
Simulation software program that models planning and urban development. FREE. Designed to be used by MPOs
CommunityViz
ESRI Software that allows agencies to analyze land use scenarios and create 3D images. Allows citizens to visualize potential for development and redevelopment.
Urban Footprint
Developed by Peter Calthorpe & Associates; a more recent addition to the simulation program options for planners.
Uses a library of place types, block types, and building types to support interactive scenario building.
Sampling frame
Population of interest for a survey.
Cross-sectional survey vs Longitudinal surveys
CS Survey- Gathers information about a population at a single point in time.
Longitudinal Survey- Conducted over a period of time.
Written surveys- Pros & Cons
Pros: low cost, convenient for survey takers.
Cons: low response rate (AVERAGE 20%), requires literacy.
Could be a poor fit for seniors, groups who don’t speak English, and groups with low rates of literacy.
Drop-Off Survey - Pros & Cons
Pros: convenient for respondents. Response rates higher than mail survey because person dropping off the survey may have personal contact with the respondent.
Cons: Can be expensive because of time required to distribute the surveys. Sample generally smaller than mail survey.
Phone survey- pros and cons
Pros: Allows the interviewer to follow up and get further explanations.
Cons: Response rates vary greatly.
More expensive than mail or internet surveys.
Can be biased due to interaction with the interviewer.
Long questions & those with multiple answers are difficult to administer.
Online Surveys
Popular - administered on a website, e-mail or text.
INEXPENSIVE. Higher response rate than interview or written surveys.
Won’t reach people without internet access
Keep these things in mind:
Make all questions clear (don’t use technical jargon).
Make sure each question only asks about one issue.
Make questions as short as possible.
Avoid negative items as they can confuse respondents.
Avoid biased items and terms.
Use a consistent response method, such as a scale of 1 to 7 or yes/no.
Sequence questions from general to specific.
Make the questions as easy to answer as possible.
Define any unique or unusual terms. For example, when you are conducting a survey about open space zoning be sure to define what the term means.
Cluster sample
A special form of stratified sampling where a specific target group out of the general population is sampled from, such as the elderly or residents of a specific neighborhood.
Non-probability sampling (Convenience, snowball, volunteer)
No precise connection between the sample and the population.
Convenience: individuals that are readily available
Snowball: one interviewed person suggests other potential interviewees.
Volunteer: Self-selected respondents.
Choropleth maps
Best way to link statistical data to discrete geographic areas using different colors or shades of the same color.
Correlation Values
STRONG correlation would be a value close to either +1 or -1
WEAK correlation would be a value close to 0.
NEGATIVE correlation implies an inverse relationship between the two variables. If you study a lot, you don’t have a lot of free time.
Two variables, studying for the AICP Exam and Amount of Free Time, were examined to determine if there is a correlation. The analysis resulted in a correlation value of -0.85. What does this mean?
There is a strong negative correlation between studying for the AICP Exam and the amount of free time people have: more studying goes with less free time.
Best reason for selecting a random sample from a large population?
To provide an approximation of the characteristics of the population.
Sample Selection Bias can occur when:
The availability of data is influenced by the selection process
Future value equation
FV = (1 + r)^y PV, with the interest rate r expressed as a fraction (so 5% = 5/100) and y the number of years. The present value is thus PV = FV / (1 + r)^y.
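A quick sketch of the future value relation and its inverse, assuming annual compounding as in the formula. The dollar amount, rate, and term are hypothetical.

```python
def future_value(pv, r, y):
    """FV = (1 + r)^y * PV, with r as a fraction (5% -> 0.05)."""
    return (1 + r) ** y * pv

def present_value(fv, r, y):
    """Inverse relation: PV = FV / (1 + r)^y."""
    return fv / (1 + r) ** y

fv = future_value(1000, 0.05, 10)  # hypothetical: 1,000 at 5% for 10 years
print(round(fv, 2))                           # 1628.89
print(round(present_value(fv, 0.05, 10), 2))  # 1000.0
```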
What can be described as a measure of dispersion around the mean that is calculated as the average of the sum of the squared deviations from the mean?
Variance