Practical Statistics Flashcards
Deviations
The difference between the observed values and the estimate of location
errors, residuals
Variance
The sum of squared deviations from the mean divided by n-1 where n is the number of data values
mean-squared-error
Standard Deviation
The square root of the variance
Mean absolute deviation
The mean of the absolute values of the deviations from the mean
l1-norm, Manhattan norm
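These variability metrics in a minimal NumPy sketch (toy data, for illustration):

```python
import numpy as np

data = np.array([1.0, 4.0, 4.0, 7.0, 9.0])   # toy data values

deviations = data - data.mean()               # deviations from the mean
variance = (deviations ** 2).sum() / (len(data) - 1)  # sum of squares over n - 1
std_dev = np.sqrt(variance)                   # square root of the variance
mad = np.abs(deviations).mean()               # mean absolute deviation

# NumPy's var/std divide by n unless ddof=1 is passed
assert np.isclose(variance, data.var(ddof=1))
```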
Sample statistic
A metric calculated for a sample of data drawn from a larger population
Data Distribution
The frequency distribution of individual values in a data set
Sampling distribution
The frequency distribution of a sample statistic over many samples or resamples
Central limit theorem
The tendency of the sampling distribution to take on a normal shape as sample size rises
Standard error
The variability (standard deviation) of a sample statistic over many samples (not to be confused with standard deviation, which by itself, refers to variability of individual data values)
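A small simulation sketch of the central limit theorem and the standard error (the population here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=100_000)  # a skewed population

n = 50
sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

# The sampling distribution of the mean is roughly normal (CLT);
# its standard deviation (the standard error) is about population std / sqrt(n)
print(np.std(sample_means), population.std() / np.sqrt(n))
```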
Bootstrap sample
A sample taken with replacement from an observed data set
powerful tool for assessing the variability of a sample statistic
Resampling
The process of taking repeated samples from observed data; includes both bootstrap and permutation procedures
Confidence level
The percentage of confidence intervals, constructed in the same way from the same population, that are expected to contain the statistic of interest
Interval endpoints
The top and bottom of the confidence interval
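A bootstrap sketch (hypothetical data) estimating the standard error of the median and a 90% percentile confidence interval:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.lognormal(size=200)   # an observed data set (hypothetical)

boot_medians = np.array([
    np.median(rng.choice(sample, size=len(sample), replace=True))  # sample with replacement
    for _ in range(5_000)
])

std_error = boot_medians.std()                           # variability of the statistic
ci_low, ci_high = np.percentile(boot_medians, [5, 95])   # 90% interval endpoints
```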
Error
The difference between a data point and a predicted or average value
Standardize
Subtract the mean and divide by the standard deviation
z-score
The result of standardizing an individual data point
Standard normal
A normal distribution with mean = 0 and standard deviation = 1
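Standardizing in one line (a sketch; dividing by n vs. n - 1 in the standard deviation is a convention choice):

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])
z_scores = (data - data.mean()) / data.std(ddof=1)  # subtract mean, divide by std dev
# z_scores are on a standard normal scale: mean 0, standard deviation 1
```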
Tail
The long narrow portion of a frequency distribution, where relatively extreme values occur at low frequency
Skew
Where one tail of a distribution is longer than the other
Trial
An event with a discrete outcome (e.g. a coin flip)
Success
The outcome of interest for a trial
“1” (as opposed to “0”)
Binomial
Having two outcomes
yes/no, 0/1, binary
Binomial Trial
A trial with two outcomes
Bernoulli trial
Binomial distribution
Distribution of the number of successes in n trials, parameterized by p. Can be approximated by a normal distribution for large n, provided p is not too close to 0 or 1
Bernoulli distribution
Lambda
The rate (per unit of time or space) at which events occur
Poisson distribution
The frequency distribution of the number of events in sampled units of time or space
Exponential distribution
The frequency distribution of the time or distance from one event to the next event
Weibull distribution
A generalized version of the exponential distribution in which the event rate is allowed to shift over time
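How these distributions map onto scipy.stats (parameter values are made up):

```python
from scipy import stats

n, p = 20, 0.1   # binomial: n trials, success probability p
lam = 2.0        # lambda: events per unit of time

stats.binom.pmf(2, n, p)           # P(exactly 2 successes in 20 trials)
stats.poisson.pmf(3, mu=lam)       # P(3 events in one unit of time)
stats.expon.cdf(1.0, scale=1/lam)  # P(next event arrives within 1 unit)
stats.weibull_min.rvs(c=1.5, scale=1.0, size=5)  # shape c != 1 lets the rate shift
```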
Treatment
Something (drug, price, web headline) to which a subject is exposed
Treatment group
A group of subjects exposed to a specific treatment
Control group
A group of subjects exposed to no (or standard) treatment
Subjects
The items (web visitors, patients, etc.) that are exposed to treatments
Test statistic
The metric used to measure the effect of the treatment
Null hypothesis
The hypothesis that chance is to blame
Alternative hypothesis
Counterpoint to the null (what you hope to prove)
One-way test
Hypothesis test that counts chance results only in one direction (e.g. B is better than A)
Two-way test
Hypothesis test that counts chance results in two directions (e.g. A is different from B; could be bigger or smaller)
Permutation test
The procedure of combining two or more samples together and randomly (or exhaustively) reallocating the observations to resamples
Randomization test, random permutation test, exact test
Resampling
Drawing additional samples (“resamples”) from an observed data set
p-value
Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results
not “What is the probability that this happened by chance?”
Alpha
The probability threshold of “unusualness” that chance results must surpass for actual outcomes to be deemed statistically significant
typically 5% and 1%
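A permutation-test sketch tying these cards together (group values are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.array([23.0, 21.0, 19.0, 24.0, 35.0])   # group A
b = np.array([31.0, 28.0, 35.0, 41.0, 27.0])   # group B

observed = b.mean() - a.mean()
combined = np.concatenate([a, b])

diffs = []
for _ in range(10_000):
    perm = rng.permutation(combined)            # reallocate observations to resamples
    diffs.append(perm[len(a):].mean() - perm[:len(a)].mean())

# p-value: share of chance results at least as extreme as observed (two-way test)
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(p_value, p_value <= 0.05)                 # compare against alpha
```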
Type I error
Mistakenly concluding an effect is real (when it is due to chance)
Type II error
Mistakenly concluding an effect is due to chance (when it is real)
Multi-arm bandit
An imaginary slot machine with multiple arms for the customer to choose from, each with different payoffs, here taken to be an analogy for a multitreatment experiment
Alters traditional sampling process to incorporate information learned during the experiment and reduce the frequency of the inferior treatment
epsilon-greedy
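An epsilon-greedy sketch for a three-arm web test (click rates are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
true_rates = np.array([0.04, 0.05, 0.08])  # unknown click rate per arm
wins = np.zeros(3)
pulls = np.zeros(3)
epsilon = 0.1

for _ in range(10_000):
    if rng.random() < epsilon:
        arm = rng.integers(3)                         # explore: random arm
    else:
        arm = np.argmax(wins / np.maximum(pulls, 1))  # exploit: best rate so far
    pulls[arm] += 1
    wins[arm] += rng.random() < true_rates[arm]       # a "win", e.g. a click

print(pulls)  # inferior arms are pulled less often as evidence accumulates
```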
Arm
A treatment in an experiment (e.g. “headline A in a web test”)
Win
The experimental analog of a win at the slot machine (e.g. “customer clicks on the link”)
Effect size
The minimum size of the effect that you hope to be able to detect in a statistical test, such as “a 20% improvement to click rates”
The bigger the effect size, the fewer samples you generally need to detect it
Power
The probability of detecting a given effect size with a given sample size
Significance level
The statistical significance level at which the test will be conducted
alpha
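A sample-size sketch with statsmodels, assuming the effect size is expressed as Cohen's d:

```python
from statsmodels.stats.power import TTestIndPower

# Samples per group to detect d = 0.2 at alpha = 0.05 with 80% power
n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(n_per_group)  # a larger effect size would need fewer samples
```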
Response
The variable we are trying to predict
dependent variable, Y variable, target, outcome
Independent variable
The variable used to predict the response
X variable, feature, attribute, predictor
Record
The vector of predictor and outcome values for a specific individual or case
row, case, instance, example
Intercept
The intercept of the regression line – that is, the predicted value when X = 0
b_0, B_0
Regression coefficient
The slope of the regression line
slope, b_1, B_1, parameter estimates, weights
Fitted values
The estimates Y_hat_i obtained from the regression line
predicted values
Residuals
The difference between the observed values and the fitted values
errors
Least squares
The method of fitting a regression by minimizing the sum of squared residuals
ordinary least squares, OLS
Root mean squared error
The square root of the average squared error of the regression (this is the most widely used metric to compare regression models)
RMSE
Residual standard error
The same as the root mean squared error, but adjusted for degrees of freedom
RSE
R-squared
The proportion of variance explained by the model, from 0 to 1
coefficient of determination, R^2
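These regression cards in one NumPy sketch (toy x and y):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, deg=1)      # least squares slope and intercept
fitted = b0 + b1 * x                  # fitted (predicted) values
residuals = y - fitted                # observed minus fitted

rmse = np.sqrt(np.mean(residuals ** 2))                # root mean squared error
rse = np.sqrt((residuals ** 2).sum() / (len(x) - 2))   # adjusted for 2 df (b0, b1)
r2 = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()  # R-squared
```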
t-statistic
The coefficient for a predictor, divided by the standard error of the coefficient, giving a metric to compare the importance of variables in the model
Weighted regression
Regression with the records having different weights
Correlated variables
When the predictor variables are highly correlated, it is difficult to interpret the individual coefficients
Multicollinearity
When the predictor variables have perfect, or near-perfect, correlation, the regression can be unstable or impossible to compute
collinearity
Confounding variables
An important predictor that, when omitted, leads to spurious relationships in a regression equation
Main effects
The relationship between a predictor and the outcome variable, independent of other variables
Interactions
An interdependent relationship between two or more predictors and the response
Conditional probability
The probability of observing some event (say, X = i) given some other event (say, Y = i), written as P(X_i | Y_i)
Posterior probability
The probability of an outcome after the predictor information has been incorporated (in contrast to the prior probability of outcomes, not taking predictor information into account)
Covariance
A measure of the extent to which one variable varies in concert with another (i.e., similar magnitude and direction)
Discriminant function
The function that, when applied to the predictor variables, maximizes the separation of the classes
Fisher’s Linear Discriminant maximizes the “between” sum of squares relative to the “within” sum of squares
Discriminant weights
The scores that result from the application of the discriminant function and are used to estimate probabilities of belonging to one class or another
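A minimal linear discriminant sketch with scikit-learn (toy data; coef_ plays the role of the discriminant weights here):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # predictors
y = np.array([0, 0, 1, 1])                                      # class labels

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)             # weights on the predictors
print(lda.predict_proba(X))  # estimated class-membership probabilities
```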
Logit
The function that maps class membership probability to a range from negative to positive infinity
log odds
Odds
The ratio of “success” (1) to “not success” (0)
The probability of an event divided by the probability that the event will not occur
Log odds
The response in the transformed model (now linear), which gets mapped back to a probability
Logistic Regression
Analogous to multiple linear regression, but with a binary outcome. A special instance of a “generalized linear model” (GLM), fit with maximum likelihood estimation
Maximum Likelihood Estimation (MLE)
A process that tries to find the model that is most likely to have produced the data we see. Involves quasi-Newton optimization that iterates between a scoring step, based on the current parameters, and an update to the parameters to improve the fit
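A logistic regression sketch showing the probability/odds/logit chain (data is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)   # fit by maximum likelihood
p = model.predict_proba(X)[:, 1]         # probability of "1"

odds = p / (1 - p)                       # odds of success
log_odds = np.log(odds)                  # the logit, linear in X
# log_odds equals model.decision_function(X), the linear predictor
```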
Recall
tp / (tp + fn)
Sensitivity, TPR, hit-rate
Precision
tp / (tp + fp)
Specificity
tn / (tn + fp)
True negative rate
F1 Score
Harmonic mean of the precision and recall
2 * Recall * Precision / (Recall + Precision)
ROC curve
The plot of the true positive rate (TPR, recall, y-axis) against the false positive rate (FPR, x-axis), at various threshold settings
Some definitions use specificity (TNR) for the x-axis
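The classification metrics from raw confusion-matrix counts (numbers are invented):

```python
tp, fp, tn, fn = 80, 20, 90, 10   # hypothetical counts

recall = tp / (tp + fn)           # sensitivity, true positive rate
precision = tp / (tp + fp)
specificity = tn / (tn + fp)      # true negative rate
f1 = 2 * recall * precision / (recall + precision)

fpr = fp / (fp + tn)              # false positive rate, the ROC x-axis
# An ROC curve plots (fpr, recall) across many score thresholds,
# e.g. with sklearn.metrics.roc_curve(y_true, y_score)
```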
Bias error
Error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)
More tunable parameters -> lower bias -> higher variance
Variance error
Error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting)
More tunable parameters -> lower bias -> higher variance
Convex and non-convex functions
convex: one minimum
- important: an optimization algorithm (like gradient descent) won't get stuck in a local minimum
non-convex: has valleys (local minima) that aren't as low as the overall lowest point (the global minimum)
- optimization algorithms can get stuck in a local minimum, and it can be hard to tell when this happens
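A gradient-descent sketch of the difference (functions chosen for illustration):

```python
def gradient_descent(grad, x, lr=0.01, steps=2_000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex: f(x) = x^2 has one minimum; any start converges to it
print(gradient_descent(lambda x: 2 * x, x=5.0))   # ~0.0

# Non-convex: f(x) = x^4 - 3x^2 + x has two valleys;
# where descent ends up depends on where it starts
grad = lambda x: 4 * x**3 - 6 * x + 1
print(gradient_descent(grad, x=-2.0))   # one local minimum (~ -1.30)
print(gradient_descent(grad, x=2.0))    # a different one (~ 1.13)
```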
Kullback-Leibler divergence
A measure of how one probability distribution diverges from a second, expected probability distribution
KL-divergence
Kolmogorov-Smirnov test
A nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test)
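Both compare distributions; a quick scipy sketch (toy inputs):

```python
import numpy as np
from scipy import stats

# KL divergence between two discrete distributions (asymmetric; 0 iff equal)
p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])
kl = stats.entropy(p, q)   # sum of p * log(p / q)

# Two-sample K-S test: could these samples share a distribution?
rng = np.random.default_rng(4)
a = rng.normal(size=200)
b = rng.normal(loc=0.5, size=200)
stat, p_value = stats.ks_2samp(a, b)
```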
ANOVA
Analysis of variance is a statistical method used to test for differences among two or more group means
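A one-way ANOVA sketch via scipy (group values invented):

```python
from scipy import stats

g1 = [20.1, 21.3, 19.8, 22.0]
g2 = [24.5, 23.9, 25.1, 24.0]
g3 = [20.5, 21.0, 19.9, 20.7]

f_stat, p_value = stats.f_oneway(g1, g2, g3)  # tests equality of group means
```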
PCA
Principal Component Analysis
- orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
- transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components
- resulting vectors are an uncorrelated orthogonal basis set.
- PCA is sensitive to the relative scaling of the original variables.
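PCA via the SVD in NumPy (synthetic correlated data; scaling first because of the sensitivity noted above):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # make two columns correlated

Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # center and scale
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

components = Vt                       # orthogonal directions, decreasing variance
scores = Xs @ Vt.T                    # data in the principal-component basis
explained_var = S**2 / (len(X) - 1)   # variance captured by each component
```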
p-value principles
- p-values can indicate how incompatible the data are with a specified statistical model
- p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
- Proper inference requires full reporting and transparency
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis