Final Exam Flashcards

1
Q

In networks, what is degree? How is it measured?

A

A measure of local centrality. It is the crudest measure of how well connected a node is to other nodes.

It is measured by counting the number of edges (connections) a node has.

2
Q

In networks, what is betweenness? How is it measured?

A

A measure of global centrality. It is a way to measure how well connected a specific node is to other nodes.

It is measured by counting all the SHORTEST paths in a network that the node is on.

To calculate: Take all shortest paths between all two-node combinations and count how many times the specific node appears on them.
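
The counting recipe for degree and betweenness can be sketched in Python. This is a minimal illustration on a made-up five-node network; the graph, the helper names `shortest_paths` and `betweenness_count`, and the choice to count only pairs that exclude the node itself are assumptions, not part of the original card. It returns a raw path count rather than a normalized score.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected network: a path A - B - C - D plus a branch B - E.
graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}

def shortest_paths(graph, source, target):
    """Return every shortest path from source to target (breadth-first search)."""
    best, paths = None, []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # every remaining path is longer than the shortest one found
        node = path[-1]
        if node == target:
            best = len(path)
            paths.append(path)
            continue
        for neighbor in graph[node]:
            if neighbor not in path:
                queue.append(path + [neighbor])
    return paths

def betweenness_count(graph, node):
    """Count shortest paths between pairs of other nodes that pass through node."""
    others = [n for n in graph if n != node]
    count = 0
    for source, target in combinations(others, 2):
        for path in shortest_paths(graph, source, target):
            if node in path:
                count += 1
    return count

# Degree: the number of edges attached to each node.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
```

In this toy network B has the highest degree and also sits on the most shortest paths, while the end nodes A, D, and E sit on none.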

3
Q

In networks, what is centrality?

A

The extent to which each node is connected to other nodes and appears in the center of the graph.

4
Q

In networks, what are the 4 measures of centrality?

A

degree
farness
closeness
betweenness

5
Q

In networks what is a node?

A

An individual unit in the analysis

6
Q

In networks, what is an edge (also called a link or tie)?

A

A line that represents the existence of a relationship between any pair of nodes. (A vertex is another name for a node, not for an edge.)

7
Q

What is a directed network? Give an example

A

A directed network is a network in which each edge travels in only one direction, either into or out of a node.

Example: Twitter, where accounts follow other accounts (followers/following).

8
Q

What is an undirected network? Give an example

A

An undirected network is a network with edges that represent a two-way relationship that can travel in both directions and therefore has no direction.

Example: On Facebook, being “friends” means the relationship has to go both ways.

9
Q

In networks, what does in-degree mean? Give an example

A

In a directed network, in-degree is a measure of centrality that measures the number of incoming edges a node has

Example: On Twitter, the number of people who follow you is your in-degree

10
Q

In networks, what does out-degree mean? Give an example

A

In a directed network, out-degree is a measure of centrality that measures the number of outgoing edges a node has

Example: On Twitter, the number of people you follow is your out-degree

11
Q

In networks, what is farness? How is it measured?

A

Farness is a measure of centrality that measures how far away (in distance) a node is from every other node.

To measure farness, sum the distances between a node and every other node.

12
Q

In networks, what is closeness? How is it measured?

A

Closeness is the inverse of farness. It tells you how close a node is to every other node.

To measure closeness, divide 1 by farness (closeness = 1/farness).
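
Farness and closeness can be sketched in Python as follows. This is a minimal illustration, assuming a connected, undirected network where distance is the number of edges on the shortest path; the graph and the function names are made up for the example.

```python
from collections import deque

# Hypothetical undirected network: a path A - B - C - D plus a branch B - E.
graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}

def distances_from(graph, source):
    """Breadth-first search: shortest-path distance from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def farness(graph, node):
    """Sum of the distances between node and every other node."""
    return sum(distances_from(graph, node).values())

def closeness(graph, node):
    """Inverse of farness (assumes the network is connected)."""
    return 1 / farness(graph, node)
```

Here B has farness 5 (three neighbors at distance 1, plus D at distance 2), so its closeness is 1/5; D, at the end of the path, has farness 9 and a lower closeness.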

13
Q

Explain the intuition behind interactions

A

When a hypothesis is conditional and the effect of a variable depends on another variable, the second variable becomes part of the equation rather than being “controlled” for in the equation. Interactions model this conditional effect.

14
Q

What three terms are required for interactions?

A

Two separate constituent variable components and the interaction term.

15
Q

In interactions, what do the constituent terms mean?

A

The effect of that term on Y when the other constituent term is zero

16
Q

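In interactions, what does the interaction term mean?

A

The slope of the conditional relationship: how much the effect of one constituent variable on Y changes when the other constituent variable increases by one unit.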

17
Q

In interactions, what is the interactive effect?

A

The combined effect of all three terms

18
Q

What is the equation for interactions?

A

Y = α + β1(Constituent 1) + β2(Constituent 2) + β3(Constituent 1 × Constituent 2)
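
A short Python sketch may make the conditional effect concrete. It assumes the interaction term is the product of the two constituent variables; the function names and the coefficient values are hypothetical, chosen only for illustration.

```python
def predict(x1, x2, alpha, b1, b2, b3):
    """Predicted Y from a model with two constituent terms and their interaction."""
    return alpha + b1 * x1 + b2 * x2 + b3 * (x1 * x2)

def effect_of_x1(x2, b1, b3):
    """Change in predicted Y for a one-unit increase in x1, at a given x2."""
    return b1 + b3 * x2

# Hypothetical coefficients.
ALPHA, B1, B2, B3 = 1.0, 2.0, 0.5, -0.5

# When x2 is zero, the effect of x1 is just its constituent coefficient.
effect_at_zero = effect_of_x1(0, B1, B3)

# When x2 is two, the interaction term shifts the effect of x1.
effect_at_two = effect_of_x1(2, B1, B3)
```

With these numbers the effect of x1 is 2.0 when x2 = 0 and 1.0 when x2 = 2, which is why the constituent coefficient alone only describes the case where the other variable is zero.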

19
Q

What is the unit of analysis

A

the unit that represents the entity you are studying

ex. country, individual, household, congressional district, state

20
Q

What is the unit of observation

A

what uniquely identifies the observation being studied.

  • is a characteristic of the unit of analysis
    ex. country-year, state-month, individual-wave
21
Q

What is bias?
What are the 5 types of potential bias in survey sampling?

A

Bias is a systematic fault in the sampling system. If the fault is not systematic, it is just white noise, not bias.
1.) Frame bias
2.) Selection bias
3.) Unit non-response bias
4.) Item non-response bias
5.) Response bias

22
Q

What is Frame bias?

A

When the sampling frame (the list of the population that the sample is drawn from) is not representative of the general population

23
Q

What is selection bias?

A

When the sample is systematically not random, so that some kinds of units are more likely to be selected than others

24
Q

What is unit non-response bias?

A

When people in the sample or frame population systematically do not respond/participate in the survey

25
Q

What is item non-response bias?

A

When participants in the survey systematically do not respond to a specific item on the survey

26
Q

What is response bias?

A

When respondents lie on the survey or do not tell you the real response

Ex.) social desirability bias, where people give the answer they think is the most socially correct rather than their real answer.

27
Q

What are list experiments?
When are they useful?
Example?

A

In a list experiment, the control group of respondents is given a list of 3 items and asked how many of the 3 they support (or another indicator), while the treatment group is given the same list plus an extra 4th item. If the average number of “supported” items is higher in the treatment group than in the control group, the difference indicates “support” for the 4th item: it estimates the share of respondents who support it.

Useful when the questions are sensitive or there is social pressure.

Ex.) To determine whether Afghans supported the Taliban, a control group was given a list of 3 organizations and the average number supported was calculated. A treatment group was given the same question and list with the addition of the Taliban. The average number of supported groups was 2 in the control and 3 in the treatment. This indicates they do support the Taliban.
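
The list-experiment estimate is a difference in group averages, which can be sketched in Python. The response counts below are invented for illustration and are not data from any real survey.

```python
from statistics import mean

# Hypothetical answers: how many items each respondent says they support.
control_counts = [2, 1, 3, 2, 2, 2]      # list of 3 items
treatment_counts = [3, 2, 3, 3, 2, 3]    # same 3 items plus the sensitive 4th item

# Difference in averages: estimated share of respondents supporting the 4th item.
estimated_support = mean(treatment_counts) - mean(control_counts)
```

With these made-up answers the control average is 2 and the treatment average is about 2.67, so roughly two thirds of respondents are estimated to support the sensitive item, without anyone having to say so directly.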

28
Q

What is probability Sampling?
Why is it used?

A

Probability sampling is when every unit in the population has a known, non-zero probability of being selected to participate in the study.
It is used to ensure representativeness.

29
Q

What is Simple Random Sampling?

A

Simple random sampling is used to properly randomize the sample. The bigger the sample, the more accurate the results.

In simple random sampling, every unit has an equal selection probability.

30
Q

How do you find the interquartile range?

A

Subtract Q1 from Q3

31
Q

How do you find Range?

A

subtract the minimum number from the maximum number

32
Q

How do you find the three Quartiles?

A

Start by finding the median of the entire list. The median is Q2, and it separates the list into two halves. The median of the first half of the list is Q1. The median of the second half of the list is Q3.

33
Q

How do you determine if a number is an outlier?

A

You must find the lower and upper limits of the dataset for non-outlier numbers. To find the lowest acceptable number, take Q1 - 1.5 × IQR. To find the highest acceptable number, take Q3 + 1.5 × IQR. If the number in question is below the lower limit or above the upper limit, it is an outlier.

34
Q

In linear regressions, how do you find standard deviation? What is the formula?

A

Formula:
SD = sqrt( (1/(n-1)) * sum( (xi - mean of X)^2 ) )

Steps:
1.) find the mean of X
2.) Subtract the mean from each x value
3.) Square each result from step 2
4.) Add together all the squares
5.) Divide the sum of the squares by the total number of observations minus 1
6.) Square root the result of step 5

The result of step 6 is the standard deviation
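
The six steps map directly onto a few lines of Python. This is a minimal sketch with made-up data; the function name is an assumption, and the standard library's `statistics.stdev` computes the same sample standard deviation.

```python
from math import sqrt

def standard_deviation(xs):
    n = len(xs)
    mean_x = sum(xs) / n                      # step 1: mean of X
    deviations = [x - mean_x for x in xs]     # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]    # step 3: square each result
    total = sum(squares)                      # step 4: add the squares
    variance = total / (n - 1)                # step 5: divide by n - 1
    return sqrt(variance)                     # step 6: square root

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
sd = standard_deviation(sample)
```

For this sample the mean is 5, the squared deviations add up to 32, and the standard deviation is the square root of 32/7, about 2.14.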

35
Q

How do you find the median?
What is the benefit of using median

A

If you have an odd amount of numbers locate the exact middle number.

If you have an even amount of numbers locate the two middle numbers, add them together, and divide the sum by 2.

benefit: more robust against the impact of outliers

36
Q

How do you find Mean?
What is a detriment to using mean?

A

add together all of the numbers and divide the sum by the total amount of numbers

Detriment: can be influenced by outliers which pull the average too high or too low.

37
Q

What is non-probability sampling?

A

When units are selected in a way that does NOT give every member of the population a known, non-zero chance of participating in the study

38
Q

what is a variable?
What is the key rule?

A

An empirical measure of a concept/characteristic.

key rule: variables must vary across observations

39
Q

what are the 2 types of variables, describe them

A

1: Quantitative/Interval/Continuous- observations can take on an infinite number of numerical values between any two values (decimals).
2: Categorical — observations belong to one of a discrete set of categories & we assign a number to each category

40
Q

what are the 3 types of categorical variables, describe them

A

1.) Nominal — categories are named (independent) but there is no order or ranking involved.
2.) Ordinal — categories are ranked
3.) Dichotomous variables — two values (e.g., yes/no)

41
Q

What does the distribution of a variable tell us?

A

what values a variable takes and how often it takes on these values

42
Q

what are the two types of modes a distribution can have, define them

A

unimodal: one mode/one hump in a distribution
bimodal: two modes/two humps in a distribution

43
Q

what two S words are used to describe distribution?
define them

A

symmetric- looks the same on both sides, e.g., a normal bell curve distribution

skewed- the data bunches on one side of the curve and creates a tail on the other.

44
Q

what is a Z score?

A

the score given to each observation of a variable which measures the number of standard deviations an observation is above or below the mean

It is a measure of deviation from the mean

It is not sensitive to how the variable is scaled and/or shifted.

45
Q

Differentiate the two types of skewness

A

right skew- the tail is on the right
left skew- the tail is on the left

46
Q

how can you transform variables

A

You can collapse continuous variables into ordinal (or nominal) variables. This does not work in reverse.
Ex. you can turn incomes into categories of incomes

You can also apply a log transformation to continuous variables.

47
Q

Why do we plot distributions

A

To better understand the spread of the data and to know if we need to log transform it.

48
Q

What is probability

A

the set of mathematical tools that measure and model randomness in the world. It is a mathematical model of uncertainty

49
Q

What is independent probability?

A

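Events are independent when the outcome of any one trial is NOT affected by the outcome of any other trial. Independent does not mean mutually exclusive: mutually exclusive events cannot happen together, so knowing that one occurred changes the probability of the other.

i.e., the probability of event A happening does not change the probability of event B happening.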

50
Q

What is conditional probability?

A

The probability is conditional when the outcome of any one trial IS affected by the outcome of any other trial. It is the probability of an event given that another event has occurred, written P(A | B).

i.e., whether event A happens affects the probability of event B happening

51
Q

What is sampling variability?

A

the extent to which the value of a statistic differs across a series of samples

52
Q

What is a sample?

A

The smaller portion of the true population being used for the study.

Example: In a town of 50, 10 people participate in the study. The 10 people are the sample

53
Q

What is population?

A

The entire group the study hopes to say something about

Example: In a town of 50 people where 10 people participate in the study, the population is 50

54
Q

What is the relationship between the sample and the population in the law of large numbers?

A

As the sample size increases, the sample average of a random variable approaches its expected value, the true population average

55
Q

Define probability in terms of outcome

A

the expected proportion of times that the outcome would occur in the long run

56
Q

What is the equation for conditional probability?
P(A | B)

A

P(A | B) = P(A and B) / P(B)

If A and B are independent, this reduces to P(A | B) = P(A).

57
Q

What is the equation for the probability of A?
p(A)

A

Number of elements in A / total number of elements (when all outcomes are equally likely)

58
Q

What is the equation for the probability of A or B if the events are not mutually exclusive (they can both occur)?
P(A or B)

A

P(A or B) = P(A) + P(B) – P(A and B)

59
Q

What is the equation for the probability of A and B if both events are independent?
P(A and B)

A

P(A and B) = P(A) * P(B)

60
Q

What is the equation for the probability of A or B if the events are mutually exclusive (they cannot both occur)?

A

P(A or B) = P(A) + P(B)

61
Q

What is the equation for the probability of A and B if the events are not independent?

A

P(A and B)= P(A | B)P(B)
or P(B | A)P(A)
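
These probability rules can be checked by brute force on a small sample space. The sketch below uses two fair dice; the events A and B and the helper `prob` are made up for the example, and exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered result of rolling two fair dice (36 outcomes).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Number of outcomes in the event / total number of outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

def event_a(outcome):
    return outcome[0] == 6        # A: the first die shows 6

def event_b(outcome):
    return sum(outcome) >= 10     # B: the total is at least 10

p_a = prob(event_a)
p_b = prob(event_b)
p_a_and_b = prob(lambda o: event_a(o) and event_b(o))
p_a_or_b = prob(lambda o: event_a(o) or event_b(o))
p_a_given_b = p_a_and_b / p_b     # conditional probability P(A | B)
```

Here P(A) = 1/6 but P(A | B) = 1/2, so the events are not independent, and P(A and B) = P(A | B)P(B) = 1/12 rather than P(A) * P(B) = 1/36. The addition rule gives P(A or B) = 1/6 + 1/6 - 1/12 = 1/4.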

62
Q

How do you find Standard Deviation? What is the formula?

A

Formula:
SD = sqrt( (1/(n-1)) * sum( (xi - mean of X)^2 ) )

Steps:
1.) find the mean of X
2.) Subtract the mean from each x value
3.) Square each result from step 2
4.) Add together all the squares
5.) Divide the sum of the squares by the total number of observations minus 1
6.) Square root the result of step 5

The result of step 6 is the standard deviation

63
Q

Why do you square and then square root in standard deviation?

A

You square to eliminate the impact of being on the opposite side of the mean, it negates negative numbers and gives you a flat distance from the mean. Squaring prevents the numbers from canceling each other out so you don’t end up with 0. You need to square root because once the numbers are squared they are no longer in the same units of the original data, the square root brings them back to the same unit.

64
Q

what is a Z score?

A

the score given to each observation of a variable which measures the number of standard deviations an observation is above or below the mean

It is a measure of deviation from the mean

It is not sensitive to how the variable is scaled and/or shifted.

65
Q

How do you determine Z scores (formula)?

A

For each observation, subtract the mean of the variable from the observation's value and then divide the result by the standard deviation of the variable.

z score of Xi = (Xi-x̄) / SD of X
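
A short Python sketch of the z-score formula, with made-up data. It also illustrates the claim that z scores do not change when a variable is rescaled or shifted; the function name is an assumption.

```python
from statistics import mean, stdev

def z_scores(xs):
    """(Xi - mean of X) / SD of X for every observation."""
    x_bar, sd = mean(xs), stdev(xs)
    return [(x - x_bar) / sd for x in xs]

original = [1, 2, 3]                                # hypothetical observations
rescaled = [10 * x for x in original]               # same data on a different scale
shifted = [100 + 5 * x for x in original]           # rescaled and shifted

z_original = z_scores(original)
z_rescaled = z_scores(rescaled)
z_shifted = z_scores(shifted)
```

All three lists give the same z scores (-1, 0, 1): each observation keeps its position relative to the mean in standard deviation units.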

66
Q

What is the z score threshold?
What are the threshold numbers at the 90%, 95%, and 99% confidence levels?

A

The threshold is the critical value the z score must reach to be considered statistically significant at the designated confidence level.
90% confidence- 1.64
95% confidence- 1.96
99% confidence- 2.58

67
Q

How do you find the root mean square error (RMSE)?

A

RMSE = sqrt (RSS/n)

1.) find the value of RSS (subtract predicted y from real y, square the results, add the squares)

2.) divide RSS by the total number of values

3.) square root the result
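
The three RMSE steps can be sketched in Python. The actual and predicted values below are hypothetical, and the function name is an assumption.

```python
from math import sqrt

def rmse(actual, predicted):
    # Step 1: RSS = sum of squared (real y - predicted y).
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    # Step 2: divide RSS by the total number of values.
    mean_squared_error = rss / len(actual)
    # Step 3: square root the result.
    return sqrt(mean_squared_error)

actual_y = [3, 5, 7, 9]        # hypothetical observed values
predicted_y = [2, 5, 8, 11]    # hypothetical model predictions
error = rmse(actual_y, predicted_y)
```

The residuals are 1, 0, -1, and -2, so RSS is 6, RSS/n is 1.5, and the RMSE is about 1.22 in the units of Y.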

68
Q

What does the RMSE indicate?

A

the absolute fit of the model to the data–how close the observed data points are to the model’s predicted values

69
Q

In probability what are permutations?

A

The number of ways to arrange objects with regard to order

70
Q

In probability what are combinations

A

The number of ways to select objects without regard to their arrangement/order
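
The difference between permutations and combinations can be seen by counting both for the same objects. This sketch uses Python's `math.perm` and `math.comb` (available from Python 3.8) and checks them against explicit enumeration; the five letters are arbitrary.

```python
from itertools import combinations, permutations
from math import comb, perm

items = "ABCDE"  # five hypothetical objects
k = 3            # choose three of them

ordered_count = perm(len(items), k)     # order matters: ABC and CBA are different
unordered_count = comb(len(items), k)   # order ignored: ABC and CBA are the same

# Explicit enumeration gives the same counts.
ordered_list = list(permutations(items, k))
unordered_list = list(combinations(items, k))
```

There are 60 ordered arrangements but only 10 unordered selections, because each selection of 3 objects can be arranged in 3 × 2 × 1 = 6 ways.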

71
Q

What is a random variable?

A

a variable whose value is a numerical outcome of a random phenomenon

72
Q

What are the two types of random variables? define them

A

Discrete
* X has a finite (or countable) number of possible values
* Probability distribution of X lists values and their probabilities
* Can determine probability of event by adding the probabilities of individual outcomes

Continuous
* X can take any value in an interval of numbers
* Probability distribution of X is described by a density curve
* Probability of event is the area under the density curve and above the values of X that make up the event

73
Q

What is the central limit theorem

A

As the sample size increases, the distribution of sample means approaches a Normal distribution (the standard Normal once the means are standardized)
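
A small simulation can illustrate the central limit theorem. This is a rough sketch, assuming fair six-sided die rolls as the population; the sample size, number of samples, and seed are arbitrary choices. A single die roll is flat rather than bell-shaped, yet the sample means cluster around 3.5 roughly as a Normal curve would predict.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 50          # rolls per sample
NUM_SAMPLES = 2000        # number of repeated samples
POP_MEAN = 3.5            # mean of one fair die roll
POP_SD = sqrt(35 / 12)    # standard deviation of one fair die roll

def sample_mean(n):
    """Average of n simulated die rolls."""
    return mean([random.randint(1, 6) for _ in range(n)])

sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

# Theory: the sample means have standard deviation POP_SD / sqrt(n).
expected_se = POP_SD / sqrt(SAMPLE_SIZE)

# Under a Normal curve about 95% of means fall within 1.96 standard errors.
within = sum(1 for m in sample_means if abs(m - POP_MEAN) <= 1.96 * expected_se)
share_within = within / NUM_SAMPLES
```

The simulated share should land near 0.95, and the standard deviation of the simulated means should be close to `expected_se`, although the exact numbers depend on the random draws.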

74
Q

what is positive correlation?

A

When x is larger than its mean, y is likely to be larger than its mean

Line is slanted upwards /

75
Q

What is negative correlation?

A

When x is larger than its mean, y is likely to be smaller than its mean

Line is slanted downwards \

76
Q

what does it mean to have high correlation?

A

data cluster tightly around a line
indicates the two variables have a strong relationship

77
Q

what are the properties of the correlation coefficient r

A

1.) Correlation is between −1 and 1
2.) Order does not matter: cor(x, y) = cor(y, x)
3.) Not affected by changes of scale
4.) Correlation measures linear association

78
Q

What are the 4 things to keep in mind when it comes to OLS regressions, and their components?

A

1.) OLS regressions are linear
-uses line of best fit, but it may not be appropriate
-not resistant to the influence of outliers
-slope is constant
-true relationship may not be linear

2.) OLS allows for unreasonable predictions
-only want to generate reasonable predictions
-Evaluating predictions is key to assessing the relationship between variables & the strength of the model

3.) OLS correlations do not necessarily indicate causation
-correlation can be driven by unobserved variables

4.) OLS regressions are versatile and robust
-Models the relationship between IV and DV and allows for making predictions
-allows for including additional variables in the model
-Continuous DVs & continuous and/or dichotomous IVs

79
Q

In OLS it is valuable that we can add additional IVs, why should we control for additional variables?

A

Worried about omitted variable bias: some underlying (unobserved) factor (X2) is driving relationship between X1 and Y

It is important to ‘control’ for other variables that we think affect both X1 and Y. When we control, we can determine how much effect each X is having on Y on its own

-Venn diagram, find the net effect of each by removing overlapping areas.

80
Q

are OLS regressions sensitive to outliers?
Why?

A

Yes.
Because it fits the line that minimizes the sum of squared distances from the points to the line. If one or more points are far out of the pattern, the slope of the line can change considerably.

Ex. Palm Beach vote share

81
Q

What is the equation for a linear regression model?

A

Y= α +βX + ε

82
Q

What do the variables in the linear regression model mean?
Y= α +βX + ε

A

Y: dependent variable, what you are trying to predict
α: alpha, the y-intercept. The value of Y when X=0
β: beta, the slope. The change in Y when X has a one-unit increase
X: independent variable, the predictor
ε: error term, the observed error (actual y - predicted y)

83
Q

What is statistical inference

A

Guessing what we do not observe from what we do observe

84
Q

What is the problem with the true population parameter

A

it is unobservable and can only be estimated

85
Q

Why can we not know the estimation error

A

estimation error would be calculated by subtracting the estimated value from the true population parameter but we can never know the true population parameter and cannot do the calculation

86
Q

what is a consistent estimator?

A

an estimator is consistent if as the n increases the estimates converge toward the true population parameter

87
Q

In the law of large numbers, how does X̄ (the sample mean) behave as n increases

A

As n increases, the sample mean X̄ should get closer to the true population parameter

88
Q

What is an unbiased estimator

A

an estimator is unbiased if over repeated sampling the parameter estimates produced are on average correct

89
Q

what is standard error? What does it tell us?

A

The standard deviation of a statistic's sampling distribution, estimated from the data

Tells us on average how far the sample mean estimate is likely from the true population parameter

90
Q

what is the formula for standard error of sample means

A

standard deviation / √n

91
Q

what is the difference between standard error and standard deviation

A

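Standard deviation describes the spread of individual observations and does not shrink as n grows. Standard error describes the spread of an estimate across repeated samples and gets smaller as n increases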

92
Q

what is the equation for standard error of a proportion

A

√ (x̄*(1-x̄)/n)

x̄= estimate

93
Q

what is the equation for standard error of sampling distribution

A

√ (p*(1-p)/n)

p=sample average

94
Q

what is the equation for the standard error of a difference in means estimator

(comparing two samples)

A

√( (s1^2 / n) + (s2^2 / m) )

n= sample size of sample 1
m= sample size of sample 2
s1= sample 1 standard deviation
s2= sample 2 standard deviation
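
The standard error formulas for a mean, a proportion, and a difference in means can be sketched in Python. The function names and the numbers plugged in are hypothetical; the square root is taken over the whole sum in the difference-in-means case.

```python
from math import sqrt

def se_mean(sd, n):
    """Standard error of a sample mean: standard deviation / sqrt(n)."""
    return sd / sqrt(n)

def se_proportion(p, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)

def se_difference_in_means(s1, n, s2, m):
    """Standard error when comparing two samples of sizes n and m."""
    return sqrt(s1 ** 2 / n + s2 ** 2 / m)

# Hypothetical inputs.
se_one_sample = se_mean(sd=10, n=100)
se_share = se_proportion(p=0.5, n=100)
se_two_samples = se_difference_in_means(s1=3, n=100, s2=4, m=100)
```

With these inputs the three standard errors are 1.0, 0.05, and 0.5. In each formula a larger sample size shrinks the standard error.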

95
Q

what do confidence intervals tells us?

A

how confident we can be that, over repeated sampling, the population parameter will be in the confidence interval

96
Q

What is a confidence interval?

A

a range of numbers that contains the true value 95% of the time over repeated runs of the data-generating process

Can be created around the null or around the estimate

can also be at 90% or 99%

97
Q

what is the equation for a confidence interval at the 95% level ?

What changes at other levels of confidence?
What changes to build the interval around the null?

A

[(estimate - 1.96 × SE), (estimate + 1.96 × SE)]

at other levels exchange 1.96 for the critical z value.

to build around the null exchange the estimate for the null value (usually 0)
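
A confidence interval around an estimate can be sketched in Python. This assumes a sample mean as the estimate, a sample large enough for z critical values to apply, and the rounded critical values 1.64, 1.96, and 2.58; the data and function names are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

CRITICAL_Z = {90: 1.64, 95: 1.96, 99: 2.58}

def confidence_interval(xs, level=95):
    """Return (lower, upper) = estimate -/+ critical z * standard error."""
    estimate = mean(xs)
    se = stdev(xs) / sqrt(len(xs))
    margin = CRITICAL_Z[level] * se
    return estimate - margin, estimate + margin

def excludes_null(interval, null_value=0):
    """True when the null value lies outside the interval."""
    lower, upper = interval
    return not (lower <= null_value <= upper)

scores = list(range(40, 100))  # hypothetical sample of 60 values
ci_95 = confidence_interval(scores, 95)
ci_99 = confidence_interval(scores, 99)
```

The 99% interval is wider than the 95% interval because it uses a larger critical value. When zero falls outside the interval, the estimate can be distinguished from a null of zero at that confidence level.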

98
Q

How do you interpret the confidence interval?

A

say the CI contains the true parameter 95% (or 90% or 99%) of the time over repeated sampling

99
Q

when does the central limit theorem not apply?

A

when the n is very small

100
Q

How do you calculate degrees of freedom?
Is it different for the t-statistic?

A

n-1-k
for t-distribution just do n-1
n= number of observations
k= number of parameters to be estimated

101
Q

why do we use n-1 in t-distribution?

A

it penalizes smaller samples more by requiring wider confidence intervals

102
Q

When do you use a t-test? Why do you use it? How do you use it?

A

You use a t-test when n is less than 50. Above 50, the 95% threshold stays around 1.96, so the z score can be used. Below 50, you need the chart to tell you the exact confidence threshold for your degrees of freedom. You then build a confidence interval with that t value, just as you would with z, to determine whether the estimate is statistically significant.

103
Q

what is a type 1 error

A

mistakenly reject the null hypothesis

104
Q

what is a type 2 error

A

mistakenly fail to reject the null hypothesis

105
Q

describe the steps for a one sample t test

A

1- determine the value of the null hypothesis (usually 0)

2- determine what µ (mean) would be if the null were true

3- solve for the t-statistic

(estimate - null) / (sample standard deviation / √n)

4- use the chart to determine the p-value for the specific t-statistic

5- Judge if the p value is significant at the specified level.

90%- less than or equal to .10
95%- less than or equal to .05
99%- less than or equal to .01
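
The t-statistic step can be sketched in Python. This is a minimal illustration with made-up data and a hypothetical null value of 3; it stops at the t-statistic and compares it with a critical value read from a t-table (2.365 for 7 degrees of freedom at the 95% level, two-tailed) rather than computing a p-value.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(xs, null_value=0.0):
    """t = (estimate - null) / (sample standard deviation / sqrt(n))."""
    n = len(xs)
    standard_error = stdev(xs) / sqrt(n)
    return (mean(xs) - null_value) / standard_error

sample = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical observations
t_stat = one_sample_t(sample, null_value=3)

degrees_of_freedom = len(sample) - 1  # n - 1 for a one-sample test
CRITICAL_T_95 = 2.365                 # from a t-table, 7 degrees of freedom
significant = abs(t_stat) > CRITICAL_T_95
```

Here the sample mean is 5 and t is about 2.65, which is larger than 2.365, so the difference from the null value of 3 is statistically significant at the 95% level.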

106
Q

in terms of confidence intervals when is an estimate not statistically significant? Explain

A

when zero is included in the confidence interval. when zero is in the bounds of the CI you cannot distinguish the estimate and the null at the designated confidence level

107
Q

what is the rule about average treatment effect and statistical significance?

A

If the average treatment effect is at least double the standard error, it is likely statistically significant

108
Q

what is the equation to calculate t

A

coefficient/standard error

109
Q

why can you not make categorical statements when using t

A

because the statistically significant threshold is not steady, you need the table to tell you the threshold

110
Q

what are the two key assumptions of OLS regression with uncertainty?

A

Exogeneity and Homoskedasticity

111
Q

What is exogeneity? what are the main problems?

A

The mean of ε does not depend on X; the error is random and uncorrelated with your Xs

the main problems are endogeneity (the explanatory variable is correlated with the error term) and omitted variable bias

112
Q

what is Homoskedasticity

A

the variance of the error does not depend on X, and the error is fairly uniform throughout

113
Q

What is used to estimate parameters in OLS regression with uncertainty? what is the equation for this?

A

least squares

ssr= sum ((actual y - predicted y)^2)
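
Least squares for a single predictor can be sketched in Python. This uses the standard closed-form solution for one X (slope = covariation of X and Y divided by variation in X); the data and function name are hypothetical, and real analyses would normally use a statistics package.

```python
from statistics import mean

def ols_fit(xs, ys):
    """Return (alpha, beta, ssr) for the line Y = alpha + beta * X."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variation_x = sum((x - x_bar) ** 2 for x in xs)
    beta = covariation / variation_x    # slope
    alpha = y_bar - beta * x_bar        # intercept
    # Sum of squared residuals: (actual y - predicted y)^2, added up.
    ssr = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    return alpha, beta, ssr

xs = [1, 2, 3, 4, 5]   # hypothetical independent variable
ys = [2, 4, 5, 4, 5]   # hypothetical dependent variable
alpha, beta, ssr = ols_fit(xs, ys)
```

For these points the fitted line is Y = 2.2 + 0.6X with a sum of squared residuals of 2.4. Any other line through the same data gives a larger sum, which is what "least squares" means.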

114
Q

What are the statistical properties of least squares?

A

under the exogeneity and homoskedasticity assumptions standard errors are unbiased

under exogeneity, alpha and beta are unbiased

t distribution is often used

115
Q

what are the three ways to tell statistical significance at the 95% level?

A

1.) Is p ≤ 0.05?
2.) Is |t| greater than 1.96 in a two-tailed test (with a large N)?
3.) Does a 95% confidence interval have the same sign on the lower and upper bound?

116
Q

what affects the statistical significance

A

1.) the size of the coefficient (the bigger the effect X has on Y the less likely it is that the true effect in the population is zero)

2.) the size of the standard error (t-statistic is the coefficient divided by standard error -smaller standard error means bigger t)

3.) sample size (as degrees of freedom increase, a smaller t is enough to reach the same p-value. This gain diminishes quickly and (essentially) stops at about 1,000 degrees of freedom. A larger sample also makes the SE smaller, regardless of the truth.)

117
Q

why do we need substantive significance? What does it tell us?

A

because statistical significance does not tell us how large or meaningful the effect of X is on Y

What a one-unit change in x means and what that predicts in terms of units of Y

118
Q

What is t

A

a measure of how far the estimate is from the null value, in terms of standard errors

119
Q

explain the intuition behind the standard error of the coefficient and prediction math

A

you need to have error in how x changes or else you cannot understand why y varies. you need variance among all components to get the standard error to be able to evaluate significance. If there is no error you cannot determine statistical significance.

Tells you how precise the predictions are

120
Q

what is the t-statistic formula

A

(mean X - population mean) / (s/√n)

121
Q

What is distance in networks

A

The number of edges in the shortest path between two nodes