Final Exam Flashcards
In networks, what is degree? How is it measured?
A measure of local centrality. It is the most crude measure of how well connected a node is to other nodes.
It is measured by counting the number of edges (connections) a node has.
In networks, what is betweenness? How is it measured?
A measure of global centrality. It is a way to measure how well connected a specific node is to other nodes.
It is measured by counting all the SHORTEST paths in a network that the node is on.
To calculate: Take all shortest paths between all two-node combinations and count how many times the specific node appears.
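As a rough sketch of that counting procedure, the Python below enumerates every shortest path between each pair of other nodes in a small made-up network and counts the paths that pass through the chosen node. The graph and the function names `all_shortest_paths` and `betweenness_count` are assumptions for illustration. Textbook betweenness usually weights each pair by the fraction of its shortest paths that use the node; this sketch follows the simpler raw count described on the card.

```python
from collections import deque
from itertools import combinations


def all_shortest_paths(graph, source, target):
    """Return every shortest path from source to target (unweighted graph)."""
    # Breadth-first search, remembering every predecessor on a shortest path
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                preds[nbr] = [node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                preds[nbr].append(node)
    if target not in dist:
        return []
    # Walk the predecessor lists back from the target to the source
    paths = [[target]]
    while paths[0][0] != source:
        paths = [[p] + path for path in paths for p in preds[path[0]]]
    return paths


def betweenness_count(graph, node):
    """Count shortest paths between pairs of OTHER nodes that include node."""
    others = [n for n in graph if n != node]
    count = 0
    for a, b in combinations(others, 2):
        for path in all_shortest_paths(graph, a, b):
            if node in path:
                count += 1
    return count


# Hypothetical undirected network: A - B - C - D, plus B - E
graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}

print(betweenness_count(graph, "B"))  # B sits on 5 shortest paths
print(betweenness_count(graph, "C"))  # C sits on 3 shortest paths
```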
In networks, what is centrality?
The extent to which each node is connected to other nodes and appears in the center of the graph.
In networks, what are the 4 measures of centrality?
degree
farness
closeness
betweenness
In networks what is a node?
An individual unit in the analysis
In networks, what is an edge?
A line that represents the existence of a relationship between a pair of nodes. (A vertex is another name for a node, not for an edge.)
What is a directed network? Give an example
A directed network is a network in which each edge travels in only one direction, either into or out of a node.
Example: Twitter, where you can follow someone without being followed back.
What is an undirected network? Give an example
An undirected network is a network with edges that represent a two-way relationship that travels in both directions and therefore has no direction.
Example: On Facebook, being “friends” means the relationship has to go both ways.
In networks, what does in-degree mean? Give an example
In a directed network, in-degree is a measure of centrality that measures the number of incoming edges a node has
Example: On Twitter, the number of people who follow you is your in-degree.
In networks, what does out-degree mean? Give an example
In a directed network, out-degree is a measure of centrality that measures the number of outgoing edges a node has
Example: On Twitter, the number of people you follow is your out-degree.
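A minimal sketch of counting in-degree and out-degree from a list of directed edges. The account names and the `follows` list are made up for illustration; each pair reads "follower follows followed".

```python
from collections import Counter

# Hypothetical directed edges: (follower, followed)
follows = [
    ("ana", "ben"),
    ("cara", "ben"),
    ("dev", "ben"),
    ("ben", "ana"),
]

# Out-degree: how many edges leave each node (accounts it follows)
out_degree = Counter(src for src, dst in follows)
# In-degree: how many edges arrive at each node (its followers)
in_degree = Counter(dst for src, dst in follows)

print(in_degree["ben"], out_degree["ben"])  # 3 followers, follows 1 account
```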
In networks, what is farness? How is it measured?
Farness is a measure of centrality that measures how far away (in distance) a node is from every other node.
To measure farness, sum the distances between a node and every other node.
In networks, what is closeness? How is it measured?
Closeness is the inverse of farness. It tells you how close a node is to every other node.
To measure closeness, take the reciprocal of farness: closeness = 1 / farness
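The degree, farness, and closeness cards can be checked on a toy network. This is a sketch under assumptions: the graph is made up, it is undirected and connected, distance means the number of edges on the shortest path, and the helper names are illustrative.

```python
from collections import deque


def distances_from(graph, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def degree(graph, node):
    """Number of edges attached to the node."""
    return len(graph[node])


def farness(graph, node):
    """Sum of distances from the node to every other node."""
    return sum(distances_from(graph, node).values())


def closeness(graph, node):
    """Reciprocal of farness."""
    return 1 / farness(graph, node)


# Hypothetical undirected network: A - B - C - D, plus B - E
graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}

print(degree(graph, "B"), farness(graph, "B"), closeness(graph, "B"))
print(degree(graph, "D"), farness(graph, "D"), closeness(graph, "D"))
```

In this toy network B has the lowest farness and therefore the highest closeness, while D, at the end of the chain, has the highest farness.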
Explain the intuition behind interactions
When a hypothesis is conditional and the effect of a variable depends on another variable, the second variable becomes part of the equation rather than being “controlled” for in the equation. Interactions model this conditional effect.
What three terms are required for interactions?
Two separate constituent variable components and the interaction term.
In interactions, what do the constituent terms mean?
The effect of that term on Y when the other constituent term is zero
In interactions, what does the interaction term mean?
The slope of the conditional relationship: how much the effect of one constituent variable on Y changes for each one-unit increase in the other
In interactions, what is the interactive effect?
The combined effect of all three terms
What is the equation for interactions
y = α + β1(Constituent 1) + β2(Constituent 2) + β3(Constituent 1 * Constituent 2)
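To see what the three terms do together, here is a small sketch with made-up coefficients (the values of `alpha`, `b1`, `b2`, and `b3` are assumptions for illustration, not estimates from any data). The interaction term multiplies the two constituent variables, so the effect of X1 on Y works out to b1 + b3 * X2 and changes with X2.

```python
def predict(x1, x2, alpha, b1, b2, b3):
    """Predicted Y from two constituent terms and their interaction."""
    return alpha + b1 * x1 + b2 * x2 + b3 * (x1 * x2)


def effect_of_x1(x2, b1, b3):
    """Change in Y for a one-unit increase in X1, at a given value of X2."""
    return b1 + b3 * x2


# Hypothetical coefficients, chosen only to make the arithmetic easy
alpha, b1, b2, b3 = 1.0, 2.0, 0.5, -1.0

# When X2 = 0 the effect of X1 is just its constituent coefficient b1
print(effect_of_x1(0, b1, b3))
# When X2 = 1 the effect of X1 shifts by b3
print(effect_of_x1(1, b1, b3))
```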
What is the unit of analysis
the unit that represents the entity you are studying
ex. country, individual, household, congressional district, state
What is the unit of observation
what uniquely identifies the observation being studied.
- is a characteristic of the unit of analysis
ex. country-year, state-month, individual wave
What is bias?
What are the 5 types of potential bias in survey sampling?
Bias is a systematic fault in the sampling process. If the error is not systematic, it is just white noise and not bias.
1.) frame bias
2.) selection bias
3.) Unit non-response bias
4.) Item non-response bias
5.) response bias
What is Frame bias?
When the sampling frame (the list the sample is drawn from) is not representative of the general population
What is selection bias?
When units are selected into the sample systematically rather than randomly
What is unit non-response bias?
When people in the sample or frame population systematically do not respond/participate in the survey
What is item non-response bias?
When participants in the survey systematically do not respond to a specific item on the survey
What is response bias?
When respondents lie on the survey or do not tell you the real response
ex.) social desirability bias, people tell you the answer they think is the most socially correct, not their real answer.
What are list experiments?
When are they useful?
Example?
In a list experiment, the control group of respondents is given a list of 3 items and asked how many of the 3 they support (or another indicator), and the treatment group is given the same list plus an extra 4th item. If the average number of “supported” items is higher in the treatment group than in the control group, this indicates “support” for the 4th item; the difference in averages estimates the share of respondents who support it.
Useful when the questions are sensitive or there is social pressure, because respondents never have to reveal which items they support.
Ex.) To determine whether Afghans supported the Taliban, a control group was given a list of 3 organizations and asked how many they supported, and the average response was calculated. A treatment group was given the same question and list with the addition of the Taliban. The average number of supported groups was 2 in the control group and 3 in the treatment group. This increase indicates support for the Taliban.
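A minimal sketch of the list-experiment estimate: the difference between the treatment and control averages. The response counts below are invented for illustration and do not come from any real survey.

```python
from statistics import mean

# Hypothetical answers: how many items on the list each respondent supports
control = [2, 1, 3, 2, 2, 1, 3, 2]    # saw the 3-item list
treatment = [3, 2, 3, 2, 3, 2, 4, 2]  # saw the same list plus the sensitive item

# The difference in means estimates the share supporting the sensitive item
estimated_support = mean(treatment) - mean(control)
print(estimated_support)
```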
What is probability Sampling?
Why is it used?
It is when every unit in the population has a known, non-zero probability of being selected to participate in the study.
It is used to ensure representativeness.
What is Simple Random Sampling?
In simple random sampling, every unit has an equal selection probability.
It is used to properly randomize the sample. The bigger the sample, the more precise the results.
How do you find the interquartile range?
Subtract Q1 from Q3
How do you find Range?
subtract the minimum number from the maximum number
How do you find the three Quartiles?
Start by finding the median of the entire list. The median is considered Q2. The median then separates the list into two halves. Locate the median of the first half of the list, this median is Q1. Locate the median of the second half of the list, this median is Q3.
How do you determine if a number is an outlier?
You must find the lowest and highest limits for non-outlier numbers in the dataset. To find the lowest acceptable number, take Q1 - 1.5 * IQR. To find the highest acceptable number, take Q3 + 1.5 * IQR. If the number in question is below the lower limit or above the upper limit, it is an outlier.
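The quartile, IQR, and outlier cards can be sketched together. One assumption is labeled here because the cards do not settle it: when the list has an odd length, the overall median is left out of both halves. Other textbooks and software use slightly different quartile rules. The data values are made up.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: for an odd-length list, skip the middle value itself
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


def find_outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    return [v for v in values if v < low_limit or v > high_limit]


data = [1, 3, 4, 5, 6, 7, 8, 9, 40]  # hypothetical data
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, q3 - q1)       # quartiles and the interquartile range
print(max(data) - min(data))     # range
print(find_outliers(data))       # 40 lies above the upper limit of 16
```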
In linear regressions, how do you find standard deviation? What is the formula?
Formula:
SD = sqrt( (1/(n-1)) * sum( (xi - mean of X)^2 ) )
Steps:
1.) find the mean of X
2.) Subtract the mean from each value of X
3.) Square each result from step 2
4.) Add together all the squares
5.) Divide the sum of the squares by the total number of observations minus 1
6.) Square root the result of step 5
The result of step 6 is the standard deviation
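The six steps translate almost line for line into code. This is a sketch with made-up data; Python's built-in `statistics.stdev` computes the same sample standard deviation.

```python
from math import sqrt


def standard_deviation(xs):
    n = len(xs)
    mean_x = sum(xs) / n                        # step 1: mean of X
    deviations = [x - mean_x for x in xs]       # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]      # step 3: square each result
    sum_of_squares = sum(squares)               # step 4: add the squares
    variance = sum_of_squares / (n - 1)         # step 5: divide by n - 1
    return sqrt(variance)                       # step 6: square root


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print(standard_deviation(data))
```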
How do you find the median?
What is the benefit of using median
If you have an odd amount of numbers locate the exact middle number.
If you have an even amount of numbers locate the two middle numbers, add them together, and divide the sum by 2.
benefit: more robust against the impact of outliers
How do you find Mean?
What is a detriment to using mean?
add together all of the numbers and divide the sum by the total amount of numbers
Detriment: can be influenced by outliers which pull the average too high or too low.
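A quick sketch of why the median is more robust than the mean. The income figures are invented; adding one extreme value moves the mean a lot and the median only slightly.

```python
from statistics import mean, median

incomes = [30, 35, 40, 45, 50]   # hypothetical incomes (in thousands)
with_outlier = incomes + [1000]  # add one extreme value

print(mean(incomes), median(incomes))            # both sit at 40
print(mean(with_outlier), median(with_outlier))  # mean jumps to 200, median to 42.5
```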
What is non-probability sampling
When some members of the population have an unknown or zero chance of being selected to participate in the study
what is a variable?
What is the key rule?
An empirical measure of a concept/characteristic.
key rule: variables must vary across observations
what are the 2 types of variables, describe them
1: Quantitative/Interval/Continuous- observations can take on an infinite number of numerical values between any two values (decimals).
2: Categorical — observations belong to one of a discrete set of categories & we assign a number to each category
what are the 3 types of categorical variables, describe them
1.) Nominal — categories are named (independent) but there is no order or ranking involved.
2.) Ordinal — categories are ranked
3.) Dichotomous variables — two values (e.g., yes/no)
What does the distribution of a variable tell us?
what values a variable takes and how often it takes on these values
what are the two types of modes a distribution can have, define them
unimodal: one mode/one hump in a distribution
bimodal: two modes/two humps in a distribution
what two S words are used to describe distribution?
define them
symmetric- looks the same on both sides, e.g. a normal bell curve distribution
skewed- the data bunches on one side of the curve and creates a tail on the other.
what is a Z score?
the score given to each observation of a variable which measures the number of standard deviations an observation is above or below the mean
It is a measure of deviation from the mean
It is not sensitive to how the variable is scaled and/or shifted.
Differentiate the two types of skewness
right skew- the tail is on the right
left skew- the tail is on the left
how can you transform variables
You can collapse continuous variables into ordinal (or nominal) variables. this does not work in the reverse
ex. you can turn incomes into categories of incomes
Log Transformation for continuous variables
Why do we plot distributions
To better understand the spread of the data and to know if we need to log transform it.
What is probability
the set of mathematical tools that measure and model randomness in the world. It is a mathematical model of uncertainty
What is independent probability?
Events are independent when the outcome of any one trial is NOT affected by the outcome of any other trial. (This is not the same as mutually exclusive: mutually exclusive events cannot both happen, so one occurring does change the probability of the other.)
i.e. the probability of event A happening does not change the probability of event B happening.
What is conditional probability?
The probability is conditional when the outcome of one trial IS affected by the outcome of another trial. The events are dependent.
i.e. the probability of event A happening affects the probability of event B happening
What is sampling variability?
the extent to which the value of a statistic differs across a series of samples
What is a sample?
The smaller portion of the true population being used for the study.
Example: In a town of 50, 10 people participate in the study. The 10 people are the sample
What is population?
The entire group the study hopes to say something about
Example: In a town of 50 people where 10 people participate in the study, the population is the 50 people.
What is the relationship between the sample and the population in the law of large numbers?
As the sample size increases, the sample average of a random variable approaches its expected value, the true population average
Define probability in terms of outcome
the expected proportion of times that the outcome would occur in the long run
What is the equation for conditional probability?
P(A | B)
P(A | B) = P(A and B) / P(B)
If A and B are independent, this simplifies to P(A | B) = P(A)
What is the equation for the probability of A?
p(A)
Number of elements in A / total number of elements (when all outcomes are equally likely)
What is the equation for the probability of A or B if the events are not mutually exclusive?
P(A or B)
P(A or B) = P(A) + P(B) – P(A and B)
What is the equation for the probability of A and B if both events are independent?
P(A and B)
P(A and B) = P(A) * P(B)
What is the equation for the probability of A or B if the events are mutually exclusive?
P(A or B) = P(A) + P(B)
What is the equation for the probability of A and B if the events are not independent?
P(A and B)= P(A | B)P(B)
or P(B | A)P(A)
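The probability rules can be checked by counting outcomes of a single fair six-sided die. The events below are chosen for illustration, and exact fractions are used so nothing is lost to rounding.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die


def prob(event):
    """Number of elements in the event / total number of elements."""
    return Fraction(len(event & outcomes), len(outcomes))


A = {2, 4, 6}  # roll is even
B = {4, 5, 6}  # roll is greater than 3
C = {1, 2}     # roll is 2 or less

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = prob(A) + prob(B) - prob(A & B)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# A and C are independent here: P(A and C) equals P(A) * P(C)
independent_a_c = prob(A & C) == prob(A) * prob(C)

# A and B are not independent: P(A | B) differs from P(A)
independent_a_b = p_a_given_b == prob(A)

print(p_a_or_b, p_a_given_b, independent_a_c, independent_a_b)
```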
Why do you square and then square root in standard deviation?
You square to eliminate the impact of being on the opposite side of the mean, it negates negative numbers and gives you a flat distance from the mean. Squaring prevents the numbers from canceling each other out so you don’t end up with 0. You need to square root because once the numbers are squared they are no longer in the same units of the original data, the square root brings them back to the same unit.
How do you determine Z scores (formula)?
For each observation, subtract the mean of the variable from the observation's value and then divide the result by the standard deviation of the variable.
z score of Xi = (Xi-x̄) / SD of X
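A short sketch of the z-score formula, using the sample standard deviation (n - 1 in the denominator). The data are made up. It also checks the claim that z-scores are unchanged when the variable is rescaled or shifted.

```python
from statistics import mean, stdev


def z_scores(xs):
    """(value - mean) / standard deviation for every observation."""
    m = mean(xs)
    sd = stdev(xs)
    return [(x - m) / sd for x in xs]


data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data
rescaled = [10 * x + 3 for x in data]  # same data, scaled and shifted

z_original = z_scores(data)
z_rescaled = z_scores(rescaled)
print(z_original)
print(z_rescaled)  # matches z_original up to floating-point rounding
```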
What is the z scores threshold?
What are the threshold numbers at the 90%, 95%, and 99% confidence levels
The threshold is the critical value the z score must reach (in absolute value) to be considered statistically significant at the designated confidence level.
90% confidence- 1.64
95% confidence- 1.96
99% confidence- 2.58
How do you find the root mean square error (RMSE)?
RMSE = sqrt (RSS/n)
1.) find the value of RSS (subtract predicted y from real y, square the results, add the squares)
2.) divide RSS by the total number of values
3.) square root the result
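The three RMSE steps as a sketch. The actual and predicted values are invented for illustration.

```python
from math import sqrt


def rmse(actual, predicted):
    # Step 1: RSS = sum of squared differences between real and predicted y
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    # Step 2: divide by the number of values; step 3: take the square root
    return sqrt(rss / len(actual))


actual = [3, 5, 7, 9]      # hypothetical observed y
predicted = [2, 5, 8, 11]  # hypothetical model predictions
print(rmse(actual, predicted))  # RSS is 6, so this is sqrt(6 / 4)
```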
What does the RMSE indicate?
the absolute fit of the model to the data–how close the observed data points are to the model’s predicted values
In probability what are permutations?
The number of ways to arrange objects with regard to order
In probability what are combinations
The number of ways to select/arrange objects without regard to their arrangement/order
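A small sketch of the difference, choosing 2 items out of 4 made-up items. `itertools` lists the arrangements and `math.perm` / `math.comb` (Python 3.8+) give the counts directly.

```python
from itertools import combinations, permutations
from math import comb, perm

items = ["A", "B", "C", "D"]

# Order matters: ("A", "B") and ("B", "A") are counted separately
ordered = list(permutations(items, 2))
# Order does not matter: ("A", "B") is counted once
unordered = list(combinations(items, 2))

print(len(ordered), perm(4, 2))    # 12 permutations
print(len(unordered), comb(4, 2))  # 6 combinations
```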
What is a random variable?
A variable whose value is a numerical outcome of a random phenomenon
What are the two types of random variables? define them
Discrete
* X has a finite number of possible values
*Probability distribution of X lists values and their probabilities
*Can determine probability of event by adding the probabilities of individual outcomes
Continuous
*X can take any value in an interval of numbers
*Probability distribution of X is described by a density curve
*Probability of event is the area under the density curve and above the values of X that make up the event
What is the central limit theorem
As the sample size increases, the distribution of sample means approaches a Normal distribution, whatever the shape of the population distribution (the standard Normal once the means are standardized)
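A simulation sketch of this behavior, under assumptions chosen for illustration: the population is Uniform(0, 1), which is flat rather than bell-shaped, and 2,000 samples are drawn at each sample size. The sample means cluster around the population mean, their spread shrinks as the sample size grows, and roughly 95% of them land within 1.96 standard errors, as a Normal distribution would predict.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable


def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from Uniform(0, 1); return each sample's mean."""
    return [
        mean([random.random() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


small = sample_means(5)
large = sample_means(50)

# Population mean and standard deviation of Uniform(0, 1)
pop_mean, pop_sd = 0.5, sqrt(1 / 12)
se_large = pop_sd / sqrt(50)

# Share of sample means within 1.96 standard errors of the population mean
within = [abs(m - pop_mean) <= 1.96 * se_large for m in large]
share_within = sum(within) / len(within)

print(stdev(small), stdev(large))  # spread shrinks with larger samples
print(mean(large), share_within)
```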
what is positive correlation?
When x is larger than its mean, y is likely to be larger than its mean
Line is slanted upwards /
What is negative correlation?
When x is larger than its mean, y is likely to be smaller than its mean
Line is slanted downwards \
what does it mean to have high correlation?
data cluster tightly around a line
indicates the two variables have a strong relationship
what are the properties of the correlation coefficient r
1.) Correlation is between −1 and 1
2.) Order does not matter: cor(x, y) = cor(y, x)
3.) Not affected by changes of scale
4.) Correlation measures linear association
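The first three properties can be checked numerically. This sketch computes r from its usual definition (the sum of cross-products over the square root of the two sums of squares); the data are made up.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)                          # order does not matter
r_rescaled = correlation([10 * v + 3 for v in x], y)   # change of scale

print(r, r_swapped, r_rescaled)
```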
What are the 4 things to keep in mind when it comes to OLS regressions , and their components?
1.) OLS regressions are linear
-uses line of best fit, but it may not be appropriate
-not resistant to the influence of outliers
-slope is constant
-true relationship may not be linear
2.) OLS allows for unreasonable predictions
-only want to generate reasonable predictions
-Evaluating predictions is key to assessing the relationship between variables & the strength of the model
3.) OLS correlations do not necessarily indicate causation
-correlation can be driven by unobserved variables
4.) OLS regressions are versatile and robust
-Models the relationship between IV and DV and allows for making predictions
-allows for including additional variables in the model
-Continuous DVs & continuous and/or dichotomous IVs
In OLS it is valuable that we can add additional IVs, why should we control for additional variables?
Worried about omitted variable bias: some underlying (unobserved) factor (X2) is driving relationship between X1 and Y
It is important to ‘control’ for other variables that we think drive both X1 and Y. When we control, we can determine how much effect each X is having on Y
-Venn diagram, find the net effect of each by removing overlapping areas.
are OLS regressions sensitive to outliers?
Why?
yes
Because it picks the best-fit line by minimizing the sum of squared distances from all points to the line, and if one or more points are far out of the pattern, the slope of the line can change considerably
Ex. palm beach vote share
What is the equation for a linear regression model?
Y= α +βX + ε
What do the variables in the linear regression model mean?
Y= α +βX + ε
Y: dependent variable, what you are trying to predict
α: alpha, the y-intercept. The value of Y when X = 0
β: Beta, slope, the increase in Y when X has a one-unit increase
X: independent variable, the predictor
ε: error term, the observed error (actual y - predicted y)
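A minimal sketch of fitting Y = α + βX by least squares with one predictor, using the textbook closed-form solution (slope = sum of cross-products / sum of squares of X). The data and the function name `fit_ols` are illustrative assumptions.

```python
def fit_ols(xs, ys):
    """Return (alpha, beta) for the least-squares line through the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                 # slope: change in Y per one-unit X
    alpha = mean_y - beta * mean_x   # intercept: predicted Y when X = 0
    return alpha, beta


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

alpha, beta = fit_ols(x, y)
predicted = [alpha + beta * xi for xi in x]
# Observed error for each point: actual y - predicted y
residuals = [yi - pi for yi, pi in zip(y, predicted)]

print(alpha, beta)
print(residuals)
```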
What is statistical inference
Guessing what we do not observe from what we do observe
What is the problem with the true population parameter
it is unobservable and can only be estimated
Why can we not know the estimation error
estimation error would be calculated by subtracting the estimated value from the true population parameter but we can never know the true population parameter and cannot do the calculation
What is a consistent estimator?
An estimator is consistent if, as n increases, the estimates converge toward the true population parameter
In the law of large numbers, how does X behave as n increases
As n increases, the sample mean of X should get closer to the true population parameter
What is an unbiased estimator
an estimator is unbiased if over repeated sampling the parameter estimates produced are on average correct
what is standard error? What does it tell us?
The standard deviation of a statistic, estimated from the data
Tells us, on average, how far the sample mean estimate is likely to be from the true population parameter
what is the formula for standard error of sample means
standard deviation / √n
what is the difference between standard error and standard deviation
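Standard deviation describes the spread of individual observations and does not shrink with n; standard error describes the spread of an estimate across samples and gets smaller as n grows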
what is the equation for standard error of a proportion
√ (x̄*(1-x̄)/n)
x̄= estimate
what is the equation for standard error of sampling distribution
√ (p*(1-p)/n)
p=sample average
what is the equation for the standard error of a difference in means estimator
(comparing two samples)
√ ((s1^2 / n) + (s2^2 / m))
m= sample size of sample 2
n= sample size of sample 1
s1= sample 1 standard deviation
s2= sample 2 standard deviation
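The standard error formulas as small functions, for the mean, for a proportion, and for a difference in means. The function names and input numbers are made up for illustration.

```python
from math import sqrt


def se_mean(sd, n):
    """Standard error of a sample mean: standard deviation / sqrt(n)."""
    return sd / sqrt(n)


def se_proportion(p_hat, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p_hat * (1 - p_hat) / n)


def se_diff_means(s1, n, s2, m):
    """Standard error of a difference in means between two samples."""
    return sqrt(s1 ** 2 / n + s2 ** 2 / m)


print(se_mean(10, 100))           # 10 / sqrt(100)
print(se_proportion(0.5, 100))    # sqrt(0.25 / 100)
print(se_diff_means(3, 9, 4, 16)) # sqrt(9/9 + 16/16)
```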
what do confidence intervals tells us?
how confident we can be that, over repeated sampling, the population parameter will be in the confidence interval
What is a confidence interval?
A range of numbers that contains the true value 95% of the time over repeated runs of the data generating process
Can be created around the null or around the estimate
can also be at 90% or 99%
what is the equation for a confidence interval at the 95% level ?
What changes at other levels of confidence?
What changes to build the interval around the null?
[(estimate - 1.96 * SE), (estimate + 1.96 * SE)]
at other levels exchange 1.96 for the critical z value.
to build around the null exchange the estimate for the null value (usually 0)
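A sketch of building the interval at any confidence level. It uses Python's `statistics.NormalDist` to look up the critical z value, so it assumes a large sample where z (rather than t) applies; the estimate and SE are made up.

```python
from statistics import NormalDist


def confidence_interval(center, se, level=0.95):
    """Interval of center +/- z * SE, where z is the critical value."""
    # Two-tailed critical value, e.g. about 1.96 for the 95% level
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return center - z * se, center + z * se


estimate, se = 2.0, 0.5  # hypothetical estimate and standard error

low, high = confidence_interval(estimate, se)           # around the estimate
null_low, null_high = confidence_interval(0.0, se)      # around a null of 0
wide_low, wide_high = confidence_interval(estimate, se, level=0.99)

print(low, high)
print(null_low, null_high)
print(wide_low, wide_high)
```

Raising the confidence level swaps in a larger critical value, so the 99% interval is wider than the 95% one.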
How do you interpret the confidence interval?
say the CI contains the true parameter 95% (or 90% or 99%) of the time over repeated sampling
when does the central limit theorem not apply?
when the n is very small
How do you calculate degrees of freedom?
Is it different for the t-statistic?
n-1-k
for t-distribution just do n-1
n= number of observations
k= number of parameters to be estimated
why do we use n-1 in t-distribution?
it penalizes smaller samples more by requiring wider confidence intervals
When do you use a t-test? Why do you use it? How do you use it?
You use a t-test when n is less than 50. Above 50, the z-score threshold takes over because the 95% threshold stays around 1.96. Below 50, you need the t table to find the exact threshold for your degrees of freedom; you then build a confidence interval using that t threshold, as you would with z, to determine if the estimate is statistically significant.
what is a type 1 error
mistakenly reject the null hypothesis
what is a type 2 error
Mistakenly fail to reject the null hypothesis
describe the steps for a one sample t test
1- determine the value of the null hypothesis (usually 0)
2- determine what µ (mean) would be if the null were true
3- solve for the t-statistic
(estimate - null) / (standard deviation / √n), where standard deviation / √n is the standard error
4- use the chart to determine the p-value for the specific t-statistic
5- Judge if the p value is significant at the specified level.
90%- less than or equal to .10
95%- less than or equal to .05
99%- less than or equal to .01
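A sketch of steps 1 to 3 for a one-sample t-test. The sample and the null value of 3 are made up. The p-value in step 4 still comes from a t table (or statistical software) at n - 1 degrees of freedom; this sketch stops at the t-statistic.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, null_value=0.0):
    """Return (t-statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # sample SD / sqrt(n)
    t = (mean(sample) - null_value) / standard_error
    return t, n - 1


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean is 5
t_stat, df = one_sample_t(sample, null_value=3)
print(t_stat, df)  # look up this t at df = 7 in a t table
```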
in terms of confidence intervals when is an estimate not statistically significant? Explain
when zero is included in the confidence interval. when zero is in the bounds of the CI you cannot distinguish the estimate and the null at the designated confidence level
what is the rule about average treatment effect and statistical significance?
If the average treatment effect is at least double the standard error (in absolute value), it is likely statistically significant
what is the equation to calculate t
coefficient/standard error
why can you not make categorical statements when using t
because the statistically significant threshold is not steady, you need the table to tell you the threshold
what are the two key assumptions of OLS regression with uncertainty?
Exogeneity and Homoskedasticity
What is exogeneity? what are the main problems?
The mean of ε does not depend on X; the error is random and uncorrelated with your Xs
the main problems are endogeneity (the explanatory variable is correlated with the error term) and omitted variable bias
what is Homoskedasticity
The variance of the error does not depend on X, and the spread of the errors is fairly uniform throughout
What is used to estimate parameters in OLS regression with uncertainty? what is the equation for this?
least squares
SSR = sum((actual y - predicted y)^2), the sum of squared residuals; least squares picks the α and β that make it as small as possible
What are the statistical properties of least squares?
under the exogeneity and homoskedasticity assumptions standard errors are unbiased
under exogeneity, alpha and beta are unbiased
t distribution is often used
what are the three ways to tell statistical significance at the 95% level?
1.) Is p ≤ 0.05?
2.) Is the absolute value of t greater than 1.96 in a two-tailed test (with a large N)?
3.) Does a 95% confidence interval have the same sign on the lower and upper bound?
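The three checks should agree with one another. This sketch assumes a large N so that the Normal distribution can stand in for the t distribution; the coefficients and standard errors are made up.

```python
from statistics import NormalDist


def significance_checks(coef, se):
    """Apply the three 95%-level checks under a large-sample approximation."""
    t = coef / se
    # Two-tailed p-value from the standard Normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    low, high = coef - 1.96 * se, coef + 1.96 * se
    return {
        "p_at_most_0.05": p_value <= 0.05,
        "t_beyond_1.96": abs(t) > 1.96,
        "ci_same_sign": low * high > 0,  # interval excludes zero
    }


print(significance_checks(1.2, 0.5))  # t = 2.4: all three checks pass
print(significance_checks(0.8, 0.5))  # t = 1.6: all three checks fail
```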
what affects the statistical significance
1.) the size of the coefficient (the bigger the effect X has on Y the less likely it is that the true effect in the population is zero)
2.) the size of the standard error (t-statistic is the coefficient divided by standard error -smaller standard error means bigger t)
3.) sample size (the same t gives a smaller p-value as degrees of freedom increase; this gain diminishes quickly and essentially stops at about 1,000 degrees of freedom. A larger n also makes the SE smaller regardless of the truth)
why do we need substantive significance? What does it tell us?
because statistical significance does not tell us how large or meaningful the effect of X is on Y
What a one-unit change in x means and what that predicts in terms of units of Y
What is t
A measure of how far the estimate is from the null value, expressed in standard errors
explain the intuition behind the standard error of the coefficient and prediction math
You need variation in how x changes or else you cannot understand why y varies. You need variance among all components to get the standard error and evaluate significance. If there is no variation, you cannot determine statistical significance.
Tells you how precise the predictions are
what is the t-statistic formula
(mean X - population mean) / (s/√n)
What is distance in networks
The number of edges in the shortest path between two nodes