Terms Flashcards
explanatory variable
independent variable (IV) = predictor = regressor
response variable
dependent variable (DV)
Gaussian distribution
= normal distribution. It is a continuous probability distribution for a real-valued random variable.
stratum
a subset (subgroup) of the population that is being sampled
stratification
process of dividing members of the population into homogeneous subgroups before sampling
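A minimal sketch of stratified sampling with proportional allocation. The population, the `urban`/`rural` strata, and the helper name `stratified_sample` are hypothetical and only illustrate the definition.

```python
import random
from collections import defaultdict

def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[key(member)].append(member)   # divide members into subgroups
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)   # sample size for this stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 urban and 20 rural members
population = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
sample = stratified_sample(population, key=lambda m: m[0], fraction=0.1)
counts = {g: sum(1 for m in sample if m[0] == g) for g in ("urban", "rural")}
```

Each stratum contributes in proportion to its size, so the sample keeps the 80/20 split of the population.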
Coefficient of determination
R squared
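A small sketch of how R squared can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The observed and predicted values are made up for illustration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [3, 5, 7, 9]             # hypothetical observed values
predicted = [2.5, 5.5, 6.5, 9.5]  # hypothetical model predictions
r2 = r_squared(actual, predicted)
```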
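Pearson correlation coefficient
One of the most widely used correlation coefficients. It measures the strength and direction of the linear relationship between two variables. Graphically, this can be understood as “how close is the data to the line of best fit?”
r = 1 is a perfect positive fit; r = 0 is no fit (no linear relationship); r = -1 is a perfect negative fit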
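The two extreme cases can be checked with a short sketch. The function name `pearson_r` and the data are illustrative assumptions, not part of the original card.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points lie exactly on a rising line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points lie exactly on a falling line
```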
p-value
probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed in our data
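As a sketch, assuming the test statistic is standard normal under the null hypothesis (a z-test; this assumption is not in the original card), the two-sided p-value can be computed from the complementary error function.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. the two-sided p-value."""
    return math.erfc(abs(z) / math.sqrt(2))

p = two_sided_p_value(1.96)  # close to the conventional 0.05 cutoff
```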
How does beta behave if the estimates are consistent?
Betas converge to the true values with increasing sample size
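A rough simulation sketch of consistency, assuming a simple linear model y = 2x + noise with a made-up true slope of 2. The OLS slope estimate should tend to land closer to the true value as the sample grows.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least squares slope for a single regressor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

TRUE_BETA = 2.0          # hypothetical true coefficient
rng = random.Random(0)   # fixed seed for reproducibility

def simulate(n):
    """Draw n observations from y = TRUE_BETA * x + noise and estimate beta."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_BETA * x + rng.gauss(0, 1) for x in xs]
    return ols_slope(xs, ys)

estimates = {n: simulate(n) for n in (10, 1000, 100000)}
```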
heteroscedasticity
refers to the circumstances in which the variability of a variable is unequal across the range of a second variable that predicts it
homoscedasticity
= having the same variance.
logarithmic scale
= log scale
Often exponential growth curves are displayed on a log scale, otherwise they would increase too quickly to fit within a small graph
cross-entropy
- commonly used in ML as a loss function
- calculates the difference between two probability distributions for a given random variable
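- equals the entropy of the true distribution plus the relative entropy (KL divergence) between the two distributions, so it can be thought of as the total entropy between the distributions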
- Cross-entropy builds upon the idea of entropy from information theory and calculates the number of bits required to represent or transmit an average event from one distribution compared to another distribution.
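A minimal sketch of the bits interpretation, using base-2 logarithms. The distributions `p` (true) and `q` (model) are made-up examples.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; p is the true distribution, q the model."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
h_pp = cross_entropy(p, p)  # entropy of p itself: 1.5 bits
h_pq = cross_entropy(p, q)  # 1.75 bits: encoding p with a code built for q costs more
```

The extra 0.25 bits is the penalty for using the wrong distribution, which is why minimizing cross-entropy as a loss pushes `q` toward `p`.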
Sensitivity
= True Positive Rate
refers to the proportion of those who have the condition that received a positive result on the test
Specificity
= True Negative Rate
refers to the proportion of those who do not have the condition that received a negative result on the test
Sensitivity vs. Specificity
For all testing, both diagnostic and screening, there is usually a trade-off between sensitivity and specificity, such that higher sensitivities will mean lower specificities and vice versa.
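The trade-off can be sketched with a hypothetical test that calls a case positive when its score reaches a threshold. The labels, scores, and thresholds below are invented for illustration.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold means positive."""
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

labels = [0, 0, 1, 0, 1, 1]               # 1 = has the condition
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]   # hypothetical test scores

low = sensitivity_specificity(labels, scores, threshold=0.35)
high = sensitivity_specificity(labels, scores, threshold=0.65)
```

With the lower threshold every true case is caught but one healthy case is flagged; raising the threshold removes the false positive and misses one true case.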
Standard Error
tells you how accurate the mean of any given sample from that population is likely to be compared to the true population mean. When the standard error increases, i.e. the means are more spread out, it becomes more likely that any given mean is an inaccurate representation of the true population mean.
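A short sketch of the usual estimate of the standard error of the mean, the sample standard deviation divided by the square root of the sample size. The sample values are made up.

```python
import math
import statistics

def standard_error(values):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
se = standard_error(sample)
```

Because of the division by the square root of n, larger samples give a smaller standard error for the same spread in the data.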
Time-series
focuses on a single individual at multiple time intervals
Panel data
focuses on multiple individuals at multiple time intervals
Equidispersion
special property of the Poisson distribution. It means that the variance equals the mean
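Equidispersion can be checked numerically from the Poisson probability mass function. This sketch assumes a rate of 3 and truncates the sum at k = 59, where the remaining tail is negligible.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0        # hypothetical rate
ks = range(60)   # truncation point; the tail beyond it is negligible here
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
```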
Negative binomial regression
DV is an observed count that follows the negative binomial distribution. The DV takes non-negative integer values: 0, 1, 2, 3, ...
It is a generalization of Poisson regression that loosens the assumption of equidispersion.
Negative binomial distribution
In probability theory and statistics, the negative binomial distribution is a discrete probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of failures (denoted r) occurs.
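A sketch of the probability mass function under the convention used in this card (number of successes before r failures, with success probability p). The values r = 3 and p = 0.5 are arbitrary, and the sum is truncated where the tail is negligible.

```python
import math

def neg_binomial_pmf(k, r, p):
    """P(k successes occur before the r-th failure), success probability p."""
    return math.comb(k + r - 1, k) * p ** k * (1 - p) ** r

r, p = 3, 0.5    # hypothetical parameters
ks = range(200)  # truncation point; the tail beyond it is negligible here
mean = sum(k * neg_binomial_pmf(k, r, p) for k in ks)
variance = sum((k - mean) ** 2 * neg_binomial_pmf(k, r, p) for k in ks)
```

Here the mean is r·p/(1-p) = 3 and the variance is r·p/(1-p)² = 6, so the variance exceeds the mean, unlike the Poisson case.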
RMSE
Root Mean Squared Error
RMSE is the square root of the mean squared deviation of the predictions from the actual values, i.e. the typical size of a prediction error.
It is the SD of the residuals (prediction errors). Residuals are a measure of how far from the regression line the data points are.
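A minimal sketch of the calculation, with made-up actual and predicted values.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))

actual = [3, 5, 7, 9]             # hypothetical observed values
predicted = [2.5, 5.5, 6.5, 9.5]  # hypothetical model predictions
error = rmse(actual, predicted)   # every residual is +/-0.5, so RMSE is 0.5
```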
Pooled linear regression
Is a simple linear regression for panel data that does not take into account the possibility of unobserved individual specific effects.
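A toy sketch of what ignoring individual-specific effects means. The panel below is invented: individuals A and B share a slope of 2 but have different intercepts (1 and 5), and the pooled fit stacks all rows and estimates a single line.

```python
# Hypothetical panel rows: (individual, time, x, y)
panel = [
    ("A", 1, 1.0, 3.0), ("A", 2, 2.0, 5.0), ("A", 3, 3.0, 7.0),   # y = 1 + 2x
    ("B", 1, 1.0, 7.0), ("B", 2, 2.0, 9.0), ("B", 3, 3.0, 11.0),  # y = 5 + 2x
]

def pooled_ols(rows):
    """Fit one intercept and one slope to all rows, ignoring who each row belongs to."""
    xs = [row[2] for row in rows]
    ys = [row[3] for row in rows]
    n = len(rows)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = pooled_ols(panel)
```

The pooled intercept of 3 sits between the two individual intercepts, so the individual-specific level differences end up in the error term.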