Practical Statistics Flashcards

1
Q

Deviations

A

The difference between the observed values and the estimate of location

errors, residuals

2
Q

Variance

A

The sum of squared deviations from the mean divided by n-1 where n is the number of data values

mean-squared-error

3
Q

Standard Deviation

A

The square root of the variance

4
Q

Mean absolute deviation

A

The mean of the absolute values of the deviations from the mean

l1-norm, Manhattan norm
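
As a rough illustration of the deviation-based measures on these cards (variance, standard deviation, mean absolute deviation), here is a minimal Python sketch. The function name `spread_measures` and the data values are made up for the example.

```python
import math

def spread_measures(values):
    """Return (variance, standard deviation, mean absolute deviation)."""
    n = len(values)
    mean = sum(values) / n
    # Deviations: observed value minus the estimate of location (here, the mean).
    deviations = [x - mean for x in values]
    variance = sum(d * d for d in deviations) / (n - 1)  # divide by n - 1
    std_dev = math.sqrt(variance)
    mean_abs_dev = sum(abs(d) for d in deviations) / n
    return variance, std_dev, mean_abs_dev

data = [1, 4, 4]  # made-up values
variance, std_dev, mean_abs_dev = spread_measures(data)
```

For this tiny data set the mean is 3, so the deviations are -2, 1, 1.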

5
Q

Sample statistic

A

A metric calculated for a sample of data drawn from a larger population

6
Q

Data Distribution

A

The frequency distribution of individual values in a data set

7
Q

Sampling distribution

A

The frequency distribution of a sample statistic over many samples or resamples

8
Q

Central limit theorem

A

The tendency of the sampling distribution to take on a normal shape as sample size rises

9
Q

Standard error

A

The variability (standard deviation) of a sample statistic over many samples (not to be confused with standard deviation, which by itself, refers to variability of individual data values)

10
Q

Bootstrap sample

A

A sample taken with replacement from an observed data set

powerful tool for assessing the variability of a sample statistic
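
One way to see how bootstrap samples assess the variability of a statistic is to estimate a standard error from them. The sketch below is an illustration only; `bootstrap_standard_error`, the seed, and the data are assumptions made for the example.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic, n_resamples=1000, seed=0):
    """Standard deviation of a statistic across bootstrap samples."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # A bootstrap sample: n draws with replacement from the observed data.
        resample = [rng.choice(data) for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]  # made-up values
se_boot = bootstrap_standard_error(data, statistics.mean)
# Textbook formula for the standard error of the mean, for comparison.
se_formula = statistics.stdev(data) / len(data) ** 0.5
```

For the mean, the two numbers should land close to each other; the bootstrap version also works for statistics that have no simple formula.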

11
Q

Resampling

A

The process of taking repeated samples from observed data; includes both bootstrap and permutation procedures

12
Q

Confidence level

A

The percentage of confidence intervals, constructed in the same way from the same population, that are expected to contain the statistic of interest

13
Q

Interval endpoints

A

The top and bottom of the confidence interval
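
A confidence interval and its endpoints can be sketched with the percentile bootstrap, which is one common construction among several. The function name `bootstrap_ci`, the 90% level, and the data are assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(data, statistic, level=0.90, n_resamples=2000, seed=0):
    """Percentile bootstrap interval: returns (lower endpoint, upper endpoint)."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic([rng.choice(data) for _ in data])  # resample with replacement
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    # Trim an equal share of the sorted estimates from each end.
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]  # made-up values
lower, upper = bootstrap_ci(data, statistics.mean)
```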

14
Q

Error

A

The difference between a data point and a predicted or average value

15
Q

Standardize

A

Subtract the mean and divide by the standard deviation

16
Q

z-score

A

The result of standardizing an individual data point
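
Standardizing can be sketched in a few lines of Python. The helper name `standardize` and the input values are made up; the sample standard deviation (n - 1 denominator) is assumed here.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(x - mean) / sd for x in values]

z_scores = standardize([10, 12, 14, 16, 18])  # made-up values
```

Each element of `z_scores` says how many standard deviations the original value lies from the mean.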

17
Q

Standard normal

A

A normal distribution with mean = 0 and standard deviation = 1

18
Q

Tail

A

The long narrow portion of a frequency distribution, where relatively extreme values occur at low frequency

19
Q

Skew

A

Where one tail of a distribution is longer than the other

20
Q

Trial

A

An event with a discrete outcome (e.g. a coin flip)

21
Q

Success

A

The outcome of interest for a trial

“1” (as opposed to “0”)

22
Q

Binomial

A

Having two outcomes

yes/no, 0/1, binary

23
Q

Binomial Trial

A

A trial with two outcomes

Bernoulli trial

24
Q

Binomial distribution

A

Distribution of number of successes in n trials parameterized by p. Can be approximated by normal distribution with large n and p not too close to 0 or 1

Bernoulli distribution (the special case n = 1)
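
The probability of exactly k successes in n trials can be computed directly. This is a small sketch using the standard binomial formula; the name `binomial_pmf` and the values n = 10, p = 0.3 are chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # made-up parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * prob for k, prob in enumerate(pmf))  # equals n * p
```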

25
Q

Lambda

A

The rate (per unit of time or space) at which events occur

26
Q

Poisson distribution

A

The frequency distribution of the number of events in sampled units of time or space

27
Q

Exponential distribution

A

The frequency distribution of the time or distance from one event to the next event
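
The link between lambda, the Poisson distribution, and the exponential distribution can be sketched by simulation: if the gaps between events are exponential with rate lambda, the number of events per unit of time should average about lambda. The function names, the seed, and lambda = 2 are assumptions for the example.

```python
import math
import random

def poisson_pmf(k, lam):
    """P(exactly k events in one unit of time or space) at rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def simulate_counts(lam, n_units=5000, seed=0):
    """Count events in each unit interval when gaps between events are exponential."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_units):
        t = rng.expovariate(lam)  # time of the first event
        k = 0
        while t < 1.0:
            k += 1
            t += rng.expovariate(lam)  # time from one event to the next
        counts.append(k)
    return counts

lam = 2.0  # made-up rate
counts = simulate_counts(lam)
mean_count = sum(counts) / len(counts)
```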

28
Q

Weibull distribution

A

A generalized version of the exponential distribution in which the event rate is allowed to shift over time

29
Q

Treatment

A

Something (drug, price, web headline) to which a subject is exposed

30
Q

Treatment group

A

A group of subjects exposed to a specific treatment

31
Q

Control group

A

A group of subjects exposed to no (or standard) treatment

32
Q

Subjects

A

The items (web visitors, patients, etc.) that are exposed to treatments

33
Q

Test statistic

A

The metric used to measure the effect of the treatment

34
Q

Null hypothesis

A

The hypothesis that chance is to blame

35
Q

Alternative hypothesis

A

Counterpoint to the null (what you hope to prove)

36
Q

One-way test

A

Hypothesis test that counts chance results only in one direction (e.g. B is better than A)

37
Q

Two-way test

A

Hypothesis test that counts chance results in two directions (e.g. A is different from B; could be bigger or smaller)

38
Q

Permutation test

A

The procedure of combining two or more samples together and randomly (or exhaustively) reallocating the observations to resamples

Randomization test, random permutation test, exact test

39
Q

Resampling

A

Drawing additional examples (“resamples”) from an observed data set

40
Q

p-value

A

Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results

not “What is the probability that this happened by chance?”
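
A permutation test gives one concrete chance model for a p-value. The sketch below assumes a two-way test on the difference in group means with random (not exhaustive) reallocation; `permutation_p_value`, the seed, and the data are made up for illustration.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Share of random reallocations whose mean difference is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = statistics.mean(group_b) - statistics.mean(group_a)
    pooled = list(group_a) + list(group_b)  # combine the samples
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reallocate observations to the two groups
        diff = statistics.mean(pooled[n_a:]) - statistics.mean(pooled[:n_a])
        if abs(diff) >= abs(observed):  # two-way: extreme in either direction
            extreme += 1
    return extreme / n_permutations

p_same = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
p_diff = permutation_p_value([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16])
```

When the groups are identical the observed difference is zero, so every reallocation counts as at least as extreme; when the groups barely overlap, very few do.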

41
Q

Alpha

A

The probability threshold of “unusualness” that chance results must surpass for actual outcomes to be deemed statistically significant

typically 5% or 1%

42
Q

Type I error

A

Mistakenly concluding an effect is real (when it is due to chance)

43
Q

Type II error

A

Mistakenly concluding an effect is due to chance (when it is real)

44
Q

Multi-arm bandit

A

An imaginary slot machine with multiple arms for the customer to choose from, each with different payoffs, here taken to be an analogy for a multitreatment experiment

Alters traditional sampling process to incorporate information learned during the experiment and reduce the frequency of the inferior treatment

epsilon-greedy
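
The epsilon-greedy idea can be sketched as follows: with probability epsilon pick a random arm (explore), otherwise pick the arm with the best observed win rate so far (exploit). This is a toy simulation under assumed win rates, not a production algorithm; all names and numbers are hypothetical.

```python
import random

def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; returns (pulls per arm, wins per arm)."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(n_rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: random arm (also used until every arm has been tried once).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: arm with the highest observed win rate.
            rates = [w / n for w, n in zip(wins, pulls)]
            arm = rates.index(max(rates))
        pulls[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]  # a win adds 1, a loss adds 0
    return pulls, wins

pulls, wins = epsilon_greedy([0.05, 0.10, 0.30])  # made-up win rates per arm
```

Over the run, most pulls should shift toward the arm with the highest true rate, which is how the inferior treatments end up shown less often.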

45
Q

Arm

A

A treatment in an experiment (e.g. “headline A in a web test”)

46
Q

Win

A

The experimental analog of a win at the slot machine (e.g. “customer clicks on the link”)

47
Q

Effect size

A

The minimum size of the effect that you hope to be able to detect in a statistical test, such as “a 20% improvement to click rates”

The bigger the effect size, the fewer samples you probably need to detect it

48
Q

Power

A

The probability of detecting a given effect size with a given sample size

49
Q

Significance level

A

The statistical significance level at which the test will be conducted

alpha

50
Q

Response

A

The variable we are trying to predict

dependent variable, Y variable, target, outcome

51
Q

Independent variable

A

The variable used to predict the response

X variable, feature, attribute, predictor

52
Q

Record

A

The vector of predictor and outcome values for a specific individual or case

row, case, instance, example

53
Q

Intercept

A

The intercept of the regression line – that is, the predicted value when X = 0

b_0, B_0

54
Q

Regression coefficient

A

The slope of the regression line

slope, b_1, B_1, parameter estimates, weights

55
Q

Fitted values

A

The estimates Y_hat_i obtained from the regression line

predicted values

56
Q

Residuals

A

The difference between the observed values and the fitted values

errors

57
Q

Least squares

A

The method of fitting a regression by minimizing the sum of squared residuals

ordinary least squares, OLS

58
Q

Root mean squared error

A

The square root of the average squared error of the regression (this is the most widely used metric to compare regression models)

RMSE

59
Q

Residual standard error

A

The same as the root mean squared error, but adjusted for degrees of freedom

RSE

60
Q

R-squared

A

The proportion of variance explained by the model, from 0 to 1

coefficient of determination, R^2
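
The regression terms on the preceding cards (intercept, slope, fitted values, residuals, RMSE, RSE, R-squared) can be tied together in one small sketch. It assumes a single predictor fitted by ordinary least squares; the function name and data are made up for the example.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x, with a few fit metrics."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx           # regression coefficient (slope)
    b0 = y_bar - b1 * x_bar  # intercept: predicted value when x = 0
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    ss_res = sum(r * r for r in residuals)  # the quantity least squares minimizes
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "b0": b0,
        "b1": b1,
        "rmse": math.sqrt(ss_res / n),
        "rse": math.sqrt(ss_res / (n - 2)),  # degrees of freedom: n - p - 1, p = 1
        "r_squared": 1 - ss_res / ss_tot,
    }

x = [1, 2, 3, 4, 5]               # made-up predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]    # made-up responses
fit = fit_simple_regression(x, y)
```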

61
Q

t-statistic

A

The coefficient for a predictor, divided by the standard error of the coefficient, giving a metric to compare the importance of variables in the model

62
Q

Weighted regression

A

Regression with the records having different weights

63
Q

Correlated variables

A

When the predictor variables are highly correlated, it is difficult to interpret the individual coefficients

64
Q

Multicollinearity

A

When the predictor variables have perfect, or near-perfect, correlation, the regression can be unstable or impossible to compute

collinearity

65
Q

Confounding variables

A

An important predictor that, when omitted, leads to spurious relationships in a regression equation

66
Q

Main effects

A

The relationship between a predictor and the outcome variable, independent of other variables

67
Q

Interactions

A

An interdependent relationship between two or more predictors and the response

68
Q

Conditional probability

A

The probability of observing some event (say, X = i) given some other event (say, Y = i), written as P(X_i | Y_i)

69
Q

Posterior probability

A

The probability of an outcome after the predictor information has been incorporated (in contrast to the prior probability of outcomes, not taking predictor information into account)
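
A worked example of moving from a prior to a posterior probability with Bayes' rule, for a two-class outcome. The function name and the numbers (1% prior, 95% and 5% conditional probabilities) are hypothetical.

```python
def posterior(prior, p_evidence_given_class, p_evidence_given_other):
    """P(class | evidence) for a two-class problem, via Bayes' rule."""
    evidence = prior * p_evidence_given_class + (1 - prior) * p_evidence_given_other
    return prior * p_evidence_given_class / evidence

# Made-up numbers: 1% prior, evidence seen in 95% of the class and 5% of the rest.
post = posterior(prior=0.01, p_evidence_given_class=0.95, p_evidence_given_other=0.05)
```

Here the posterior works out to about 0.16: far above the 1% prior, yet still well below 50%.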

70
Q

Covariance

A

A measure of the extent to which one variable varies in concert with another (i.e., similar magnitude and direction)

71
Q

Discriminant function

A

The function that, when applied to the predictor variables, maximizes the separation of the classes

Fisher’s Linear Discriminant maximizes the “between” sum of squares relative to the “within” sum of squares

72
Q

Discriminant weights

A

The scores that result from the application of the discriminant function and are used to estimate probabilities of belonging to one class or another

73
Q

Logit

A

The function that maps class membership probability to a range from negative to positive infinity

log odds

74
Q

Odds

A

The ratio of “success” (1) to “not success” (0)

The probability of an event divided by the probability that the event will not occur

75
Q

Log odds

A

The response in the transformed model (now linear), which gets mapped back to a probability
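
The relationship among probability, odds, and log odds can be sketched with three small helper functions. The names are chosen for illustration.

```python
import math

def odds(p):
    """Probability of the event divided by probability of no event."""
    return p / (1 - p)

def logit(p):
    """Log odds: maps a probability in (0, 1) onto the whole real line."""
    return math.log(odds(p))

def inverse_logit(x):
    """Map log odds back to a probability."""
    return 1 / (1 + math.exp(-x))

example_odds = odds(0.75)                 # 3 to 1
round_trip = inverse_logit(logit(0.2))    # back to 0.2, up to rounding
```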

76
Q

Logistic Regression

A

Analogous to multiple linear regression but the outcome is binary. Is a special instance of a “generalized linear model” (GLM). Fit with Maximum Likelihood Estimation

77
Q

Maximum Likelihood Estimation (MLE)

A

A process that tries to find the model that is most likely to have produced the data we see. Involves quasi-Newton optimization that iterates between a scoring step, based on the current parameters, and an update to the parameters to improve the fit

78
Q

Recall

A

tp / (tp + fn)

Sensitivity, TPR, hit-rate

79
Q

Precision

A

tp / (tp + fp)

80
Q

Specificity

A

tn / (tn + fp)

True negative rate

81
Q

F1 Score

A

Harmonic mean of the precision and recall

2 * Recall * Precision / (Recall + Precision)
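
The formulas on the recall, precision, specificity, and F1 cards can be computed together from the four confusion-matrix counts. The counts below are made up.

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall, precision, specificity, and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return {"recall": recall, "precision": precision,
            "specificity": specificity, "f1": f1}

metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)  # made-up counts
```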

82
Q

ROC curve

A

The plot of the true positive rate (TPR, recall, y-axis) against the false positive rate (FPR, x-axis), at various threshold settings

Some definitions use specificity (TNR) for the x-axis
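
A minimal sketch of how ROC points arise: sweep the threshold from strict to lenient and record (FPR, TPR) at each setting. It assumes labels coded 0/1 and a higher score meaning "more likely positive"; the function name and data are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to most lenient."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

labels = [1, 1, 0, 1, 0, 0]              # made-up true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # made-up model scores
curve = roc_points(labels, scores)
```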

83
Q

Bias error

A

Error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)

More tunable parameters -> lower bias -> higher variance

84
Q

Variance error

A

Error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting)

More tunable parameters -> lower bias -> higher variance

85
Q

Convex and non-convex functions

A

convex: any local minimum is also the global minimum
- important: an optimization algorithm (like gradient descent) won't get stuck in a local minimum

non-convex: can have several valleys (local minima) that are not as low as the lowest point overall (the global minimum)
- optimization algorithms can get stuck in a local minimum, and it can be hard to tell when this happens
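
The difference can be seen with plain gradient descent on two toy functions. The functions, learning rate, and starting points are assumptions chosen for illustration: (x - 2)^2 is convex, while x^4 - 3x^2 + x has a global minimum near x = -1.30 and a shallower local minimum near x = 1.13.

```python
def gradient_descent(grad, start, learning_rate=0.01, steps=2000):
    """Repeatedly step against the gradient from a starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def convex_grad(x):
    return 2 * (x - 2)  # derivative of (x - 2)^2

def nonconvex(x):
    return x**4 - 3 * x**2 + x

def nonconvex_grad(x):
    return 4 * x**3 - 6 * x + 1  # derivative of x^4 - 3x^2 + x

# Convex: every starting point ends at the same minimum.
from_left = gradient_descent(convex_grad, -5.0)
from_right = gradient_descent(convex_grad, 10.0)

# Non-convex: the result depends on where the search starts.
stuck = gradient_descent(nonconvex_grad, 2.0)    # ends in the local minimum
best = gradient_descent(nonconvex_grad, -2.0)    # ends in the global minimum
```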

86
Q

Kullback-Leibler divergence

A

A measure of how one probability distribution diverges from a second, expected probability distribution

KL-divergence
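
For discrete distributions the divergence can be computed directly. The sketch assumes both distributions are lists of probabilities over the same outcomes, that Q is nonzero wherever P is nonzero, and that natural logarithms (nats) are used; the name `kl_divergence` and the example values are made up.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # made-up distributions
q = [0.9, 0.1]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
```

The two directions give different values, so the divergence is not symmetric and is not a distance in the usual sense.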

87
Q

Kolmogorov-Smirnov test

A

A nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test)
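
The two-sample K–S statistic is the largest vertical gap between the two empirical distribution functions. The sketch below computes only that statistic, not the p-value, and the function name and samples are made up.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(1 for v in sample if v <= x) / len(sample)

    # The gap can only change at observed values, so checking those is enough.
    points = set(sample_a) | set(sample_b)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

d_same = ks_statistic([1, 2, 3], [1, 2, 3])
d_disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```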

88
Q

ANOVA

A

Analysis of variance is a statistical method used to test for differences among the means of two or more groups

89
Q

PCA

A

Principal Component Analysis

  • orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  • transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components
  • resulting vectors are an uncorrelated orthogonal basis set.
  • PCA is sensitive to the relative scaling of the original variables.
90
Q

p-value principles

A
  1. p-values can indicate how incompatible the data are with a specified statistical model
  2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
  4. Proper inference requires full reporting and transparency
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis