Terms Flashcards

1
Q

explanatory variable

A

IV = predictor = regressor

2
Q

response variable

A

DV

3
Q

Gaussian distribution

A

The normal distribution. It is a continuous probability distribution for a real-valued random variable.

4
Q

stratum

A

A subset of the population that is being sampled.

5
Q

stratification

A

The process of dividing the members of the population into homogeneous subgroups before sampling.

6
Q

Coefficient of determination

A

R squared

7
Q

Pearson correlation coefficient

A

One of the most widely used correlation coefficients. Graphically, this can be understood as “how close is the data to the line of best fit?”

r = 1 is a perfect positive fit
r = 0 is no linear relationship
r = -1 is a perfect negative fit
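
As a quick illustration, r can be computed directly from its definition (a minimal Python sketch; the function name is mine):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))  # 1.0: points lie exactly on a rising line
print(round(pearson_r([1, 2, 3], [3, 2, 1]), 6))  # -1.0: points lie exactly on a falling line
```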
8
Q

p-value

A

The probability, under the null hypothesis, of a test statistic at least as extreme as the one observed in our data.
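
For example, an exact one-sided p-value for a coin-flip experiment (a hedged sketch; the helper name is mine):

```python
from math import comb

def binom_p_value(k, n, p=0.5):
    """One-sided p-value: probability under H0 (success prob = p) of
    seeing k or more successes in n trials, i.e. a result at least as
    extreme as the one observed."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 8 heads in 10 flips of a supposedly fair coin:
print(binom_p_value(8, 10))  # 0.0546875 -- not quite significant at alpha = 0.05
```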

9
Q

How does beta behave if the estimates are consistent?

A

The beta estimates converge to the true values as the sample size increases.

10
Q

heteroscedasticity

A

Refers to the circumstances in which the variability of a variable is unequal across the range of a second variable that predicts it.

11
Q

homoscedasticity

A

= having the same variance.

12
Q

logarithmic scale

A

= log scale
Exponential growth curves are often displayed on a log scale; otherwise they would increase too quickly to fit in a small graph.

13
Q

cross-entropy

A
  • commonly used in ML as a loss function
  • calculates the difference between two probability distributions for a given random variable
  • can be used to calculate the total entropy between the distributions
  • Cross-entropy builds upon the idea of entropy from information theory and calculates the number of bits required to represent or transmit an average event from one distribution compared to another distribution.
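
A minimal Python sketch of the definition H(p, q) = -Σ p(x) · log2 q(x) (the function name is mine):

```python
from math import log2

def cross_entropy(p, q):
    """Cross-entropy in bits: average number of bits needed to encode
    events drawn from p using a code optimized for q."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
print(cross_entropy(p, p))           # 1.0 -- equals the entropy of p itself
print(cross_entropy(p, [0.9, 0.1]))  # ~1.74 -- the mismatched code costs extra bits
```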
14
Q

Sensitivity

A

= True Positive Rate

Refers to the proportion of those who have the condition who receive a positive result on the test.

15
Q

Specificity

A

= True Negative Rate

Refers to the proportion of those who do not have the condition who receive a negative result on the test.
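
Both rates can be read off a confusion matrix (a hedged sketch with made-up counts):

```python
def sensitivity(tp, fn):
    """True positive rate: share of people with the condition whom the test flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of people without the condition whom the test clears."""
    return tn / (tn + fp)

# hypothetical screening test: 90 true positives, 10 false negatives,
# 160 true negatives, 40 false positives
print(sensitivity(90, 10))   # 0.9
print(specificity(160, 40))  # 0.8
```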

16
Q

Sensitivity vs. Specificity

A

For all testing, both diagnostic and screening, there is usually a trade-off between sensitivity and specificity, such that higher sensitivities will mean lower specificities and vice versa.

17
Q

Standard Error

A

Tells you how accurately the mean of any given sample from a population is likely to represent the true population mean. When the standard error increases, i.e. the sample means are more spread out, it becomes more likely that any given mean is an inaccurate representation of the true population mean.
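
A minimal sketch of the usual formula, SE = s / sqrt(n) (sample standard deviation over the square root of the sample size):

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: sample standard deviation over sqrt(n).
    It shrinks as n grows, so larger samples pin down the population
    mean more precisely."""
    return stdev(sample) / sqrt(len(sample))

small = [4, 6, 5, 7, 4, 6]
print(standard_error(small))       # larger SE from only six observations
print(standard_error(small * 10))  # same spread, ten times the data: smaller SE
```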

18
Q

Time-series

A

Focuses on a single individual at multiple time intervals.

19
Q

Panel data

A

Focuses on multiple individuals at multiple time intervals.

20
Q

Equidispersion

A

A special property of the Poisson distribution: the variance equals the mean.
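
This can be checked numerically from the Poisson pmf (a sketch; the truncation point of the support is arbitrary):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0
ks = range(100)  # truncated support; the tail mass beyond this is negligible
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum(k**2 * poisson_pmf(k, lam) for k in ks) - mean**2
print(round(mean, 6), round(var, 6))  # both 3.0: variance equals the mean
```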

21
Q

Negative binomial regression

A

The DV is an observed count that follows the negative binomial distribution; its possible values are the non-negative integers 0, 1, 2, 3, …

It is a generalization of Poisson regression that loosens the assumption of equidispersion.

22
Q

Negative binomial distribution

A

In probability theory and statistics, the negative binomial distribution is a discrete probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of failures (denoted r) occurs.

23
Q

RMSE

A

Root Mean Squared Error
RMSE is the average deviation of the predictions from the actual values of the data.

It is the SD of the residuals (prediction errors). Residuals are a measure of how far the data points are from the regression line.
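
A minimal Python sketch of the definition (the function name is mine):

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: the standard deviation of the residuals,
    expressed in the same units as the response variable."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(rmse([3, 5, 7], [2, 5, 9]))  # sqrt((1 + 0 + 4) / 3) ~= 1.29
```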

24
Q

Pooled linear regression

A

A simple linear regression for panel data that does not take into account the possibility of unobserved individual-specific effects.

25
Q

Endogenous Variables

A

Have values that are determined by other variables in the system. A variable is said to be endogenous within the causal model M if its value is determined or influenced by one or more of the independent variables.

26
Q

FIXED OR RANDOM Effects model?

A

Hausman test
To decide between fixed and random effects you can run a Hausman test, where the null hypothesis is that the preferred model is random effects and the alternative is fixed effects (see Greene, 2008, chapter 9). It basically tests whether the unique errors (ui) are correlated with the regressors; the null hypothesis is that they are not.

Run a fixed effects model and save the estimates, then run a random effects model and save the estimates, then perform the test. If the p-value is significant (for example < 0.05), use fixed effects; if not, use random effects.

> phtest(fixed, random)
Hausman Test
data: y ~ x1 
chisq = 3.674, df = 1, p-value = 0.05527
alternative hypothesis: one model is inconsistent
27
Q

What happens if you have endogenous regressors in a model?

A

It makes the ordinary least squares (OLS) estimators biased and inconsistent. One of the key assumptions of OLS is that there is no correlation between a predictor variable and the error term.

28
Q

Interaction Term meaning

A

An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.
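
A toy numeric sketch (all coefficients are made up): in y = b0 + b1·x1 + b2·x2 + b3·x1·x2, the marginal effect of x1 is b1 + b3·x2, so it depends on the level of x2.

```python
# hypothetical coefficients for y = b0 + b1*x1 + b2*x2 + b3*x1*x2
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 1.5

def effect_of_x1(x2):
    """Slope of y with respect to x1, evaluated at a given level of x2."""
    return b1 + b3 * x2

print(effect_of_x1(0))  # 2.0: at x2 = 0 only the main effect of x1 remains
print(effect_of_x1(2))  # 5.0: the interaction changes the effect of x1
```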

29
Q

Types of panel analytic models

A

(1) Pooled regression model, (2) fixed effects model, and (3) random effects model.

30
Q

Fixed effect model

A

In panel data where longitudinal observations exist for the same subject, fixed effects represent the subject-specific means. In panel data analysis the term fixed effects estimator (also known as the within estimator) is used to refer to an estimator for the coefficients in the regression model including those fixed effects (one time-invariant intercept for each subject)

31
Q

Random effects model

A

If you think there are no omitted variables – or if you believe that the omitted variables are uncorrelated with the explanatory variables that are in the model – then a random effects model is probably best. It will produce unbiased estimates of the coefficients, use all the data available, and produce the smallest standard errors. More likely, however, is that omitted variables will produce at least some bias in the estimates.

32
Q

residuals

A

The residuals are the deviations of the values estimated by our model from the observed values.

33
Q

Gauss-Markov theorem

A

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, provided the errors in the linear regression model are uncorrelated, have equal variances, and have an expected value of zero. The errors need not be normal, nor independent and identically distributed (only uncorrelated with mean zero and homoscedastic with finite variance).

34
Q

i.i.d

A

Independent and identically distributed

A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.

The outcomes we get from flipping a coin are independent and identically distributed: independent because one outcome does not depend on another, and identically distributed because every sample comes from the same distribution (the distribution does not change between flips).

Identically distributed does not mean equiprobable. It is not required that the two random variables can only have the probability of 0.5 each or four random variables can only have the probability of 0.25 each in order for them to be i.i.d.

The data-generating process is the same for all observations (identically distributed), and the observations are independent. In particular, the order of the indexing (the order of the rows of the data table) can be considered arbitrary.

35
Q

Naive Bayes Assumptions

A

Attributes are independent and equally important

36
Q

When is Naive Bayes especially appropriate?

A

When the dimension of the feature set is high, making density estimation unattractive

37
Q

classification

A

Prediction of a class label by means of the attributes.

38
Q

regression

A

Prediction of a numeric value by means of the attributes.

39
Q

# of possible trees

A
m = number of attributes
n = number of classes
n^(n^m) = number of possible trees
40
Q

Bootstrapping

A

Bootstrapping is any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and it falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
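
A minimal sketch of a percentile bootstrap for the mean (data, seed, and resample count are all made up):

```python
import random

def bootstrap_means(sample, n_resamples=10_000, seed=42):
    """Resample the observed data with replacement and record the mean of
    each resample, approximating the sampling distribution of the mean."""
    rng = random.Random(seed)
    n = len(sample)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples)]

data = [2.1, 2.5, 2.8, 3.0, 3.3, 3.9, 4.2, 4.8]
means = sorted(bootstrap_means(data))
# 95% percentile confidence interval for the population mean
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(lo, hi)
```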

41
Q

Quantile

A

When you sort a set of data and divide it into equal parts so that each part contains the same number of values, these cut-off points are called quantiles.

42
Q

Decile

A

When a set of data is divided into ten equal parts, each of them is called a decile.
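
With Python's statistics module, the nine cut points that split a data set into ten equal parts can be computed directly (a minimal sketch):

```python
from statistics import quantiles

data = list(range(1, 101))  # the values 1..100
deciles = quantiles(data, n=10)  # nine cut points -> ten equal parts
print(deciles)  # the middle cut point (the 5th) is the median, 50.5
```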

43
Q

significance level

A

In a hypothesis test, the significance level is the probability of making the wrong decision and rejecting H0 when it is actually true (a Type I error).

44
Q

confidence level

A

The confidence level describes the proportion of confidence intervals that would contain the true parameter value if we repeated the experiment many times.