Statistics Flashcards

1
Q

Accuracy

A

(TP + TN) / (TP + TN + FP + FN)

Number of correct predictions /
Number of all predictions

Good general report of model performance with BALANCED data sets
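As a quick sketch (the confusion counts are made up for illustration), the formula maps directly to code:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical, roughly balanced confusion counts:
print(accuracy(tp=40, tn=45, fp=10, fn=5))  # 0.85
```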

2
Q

Why is accuracy alone not enough to evaluate classification models?

A

Consider benign versus malignant tumors. A typical set of random people would be more than 90% benign (class 0), because benign tumors are much more common than malignant ones. A model that predicts 0 for every example, without doing any real calculation, achieves over 90% accuracy, which is useless. We need other measures: precision and recall.
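The scenario above can be sketched in a few lines (the 95/5 split is illustrative):

```python
# Imbalanced data: 95 benign (0), 5 malignant (1).
# A "model" that always predicts 0 does no real calculation:
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 0
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # 95
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 5

accuracy = (tp + tn) / len(y_true)  # 0.95 -- looks great
recall = tp / (tp + fn)             # 0.0  -- catches no malignant cases
```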

3
Q

Precision (PPV)

A

TP / (TP + FP)

Correct positives /
Positive tests

From all positive PREDICTED, how many are ACTUAL positive?

Focus on precision when you want to be confident in the YESes the model gives you; what your model pings should be the real deal. It will miss some YESes, but you can be confident in what it does flag as YES.

Example: applicant screening. Some viable applicants will slip through, but when the model pings an applicant as viable, you can be confident about it.

4
Q

Recall (Sensitivity/TPR)

A

TP / (TP + FN)

Correct positives /
Actual positives

From all ACTUAL positives, how many did we PREDICT correctly?

5
Q

Increasing precision _______ recall

A

Decreases

6
Q

F1 score

A

2 × precision × recall /
(precision + recall)

The harmonic mean of precision and recall, combining the two numbers into one.

Use when working with IMBALANCED data sets

Example: classifying tweets by sentiment (positive, negative, neutral), where the data set was imbalanced with far more neutral tweets. The F1 score describes overall model performance while caring equally about all three classes.
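A minimal sketch of the three metrics together (the counts tp=8, fp=2, fn=8 are made up):

```python
def precision(tp, fp):
    """Of everything flagged positive, how much was truly positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly positive, how much did we flag?"""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# precision(8, 2) = 0.8, recall(8, 8) = 0.5,
# F1 = 2*0.8*0.5/1.3 ≈ 0.615 -- below the arithmetic mean of 0.65,
# since the harmonic mean punishes the weaker of the two numbers.
```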

7
Q

Sensitivity (Recall / TPR)

A

TP / (TP + FN)
= 1 − FNR

Correct positives /
Actual positives

How good is the model at catching YES’s?

A sensitive test helps rule out a disease when the test is negative.
Highly SeNsitive = SNout = rule out

Use sensitivity/specificity when every instance of what you're looking for is too precious to let slip by (illnesses, fraud, terrorist attacks). A sensitivity-focused model will catch ALL REAL terrorist attacks, ALL TRUE cases of heart disease, etc.
CAVEAT: there will be some false positives: innocent travelers flagged as terrorists, some healthy people labeled as diseased.

8
Q

Specificity (TNR)

A

TN / (TN + FP)
= 1 − FPR

Correct negatives /
Actual negatives

How good is the model at catching NO’s?
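Sensitivity, specificity, and FPR as a short sketch (the example counts are hypothetical):

```python
def sensitivity(tp, fn):
    """TPR / recall: how good is the model at catching YESes?"""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TNR: how good is the model at catching NOs?"""
    return tn / (tn + fp)

def fpr(tn, fp):
    """False positive rate = 1 - specificity."""
    return 1 - specificity(tn, fp)

# e.g. tp=90, fn=10 -> sensitivity 0.9; tn=80, fp=20 -> specificity 0.8
```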

9
Q

Prevalence

A

The proportion of cases in a defined population at a single point in time. Expressed as a decimal or percentage.

10
Q

Positive predictive value (PPV) (Precision)

A

TP / (TP + FP)

Actual positive /
Tested positive

The probability that, following a positive test result, the individual TRULY has the disease. Can also be thought of as the clinical relevance of a test.

Related to prevalence, whereas sensitivity and specificity are independent of prevalence.

As prevalence decreases, PPV decreases because there will be more false positives for every true positive

These enable you to rule in/out conditions but not definitively diagnose a condition
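The prevalence effect can be sketched with Bayes' rule; the 0.99 sensitivity/specificity figures are illustrative:

```python
def ppv(sens, spec, prev):
    """P(disease | positive test) via Bayes' rule."""
    tp_rate = sens * prev                # true positives per person screened
    fp_rate = (1 - spec) * (1 - prev)    # false positives per person screened
    return tp_rate / (tp_rate + fp_rate)

# Same test, different prevalence:
print(ppv(0.99, 0.99, 0.10))   # ~0.917
print(ppv(0.99, 0.99, 0.001))  # ~0.090 -- most positives are now false
```

With a rare disease, even an excellent test yields mostly false positives, which is exactly why PPV falls as prevalence falls.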

11
Q

Negative predictive value (NPV)

A

TN / (TN + FN)

Actual Negative /
Tested Negative

The probability that, following a NEGATIVE test result, the individual TRULY does NOT have the disease. Can also be thought of as the clinical relevance of a test.

Related to prevalence, whereas sensitivity and specificity are independent of prevalence.

As prevalence decreases, NPV increases because there will be more true negatives for every false negative

These enable you to rule in/out conditions but not definitively diagnose a condition

12
Q

Type I error

A

False Positive

REJECTING the NULL when it is TRUE

13
Q

Alpha level

(significance level)

A

Probability of REJECTING the NULL when it is TRUE (type I error)

14
Q

Beta level

A

Probability that you’ll fail to reject the null when it’s false (type II error)

i.e. ACCEPT the NULL when it’s FALSE

15
Q

Type II error

A

False Negative

ACCEPTING the NULL when it’s FALSE

16
Q

AUC ROC Curve

A

Tells how much the model is capable of distinguishing between classes.

X axis is FPR (1 − specificity)
Y axis is TPR (sensitivity)

AUC = area under the curve. The higher the value, the better the model is at predicting TPs and TNs, i.e. at distinguishing between patients with the disease and without.

ROC = receiver operating characteristic; a probability curve.
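One way to sketch AUC without plotting: it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (the scores below are made up):

```python
def auc(scores_pos, scores_neg):
    """AUC = P(random positive scores above random negative),
    counting ties as half -- equivalent to area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for 3 positives and 3 negatives:
print(auc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2]))  # 8/9 ≈ 0.889
```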

17
Q

FPR

A

1 - Specificity

18
Q

Bessel’s correction

A

The use of n − 1 instead of n in the formulas for the sample variance and sample standard deviation, where n is the number of observations in a sample.

19
Q

bias (or bias function) of an estimator

A
  1. The difference between this estimator's expected value and the true value of the parameter being estimated.
  2. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator.
  3. An unbiased estimator is preferable to a biased one, although in practice, biased estimators (with generally small bias) are frequently used. When a biased estimator is used, bounds on the bias are calculated. A biased estimator may be used for various reasons:
    1. because an unbiased estimator does not exist without further assumptions about a population;
    2. because an estimator is difficult to compute (as in unbiased estimation of standard deviation);
    3. because an estimator is median-unbiased but not mean-unbiased (or the reverse);
    4. because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or
    5. because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful.
20
Q

Nominal Data

A

data that is used for naming or labelling variables, without any quantitative value.

“named” data

no intrinsic ordering to nominal data

Examples: country, gender, race, hair color

Analysis is done by grouping input variables into categories and calculating the percentage or mode of the distribution.

non-parametric tests

21
Q

Ordinal Data

A

A type of categorical data with an order: the variables in ordinal data are listed in an ordered manner.

The ordinal variables are usually numbered to indicate the order of the list. However, the numbers are not mathematically measured or determined; they are merely assigned as labels for opinions.

Example: Good, Neutral, Bad

Analysed by computing the mode, median, and other positional measures like quartiles, percentiles, etc.

Usually analyzed with non-parametric tests. Although discouraged, ordinal data is sometimes analysed using parametric statistics.

22
Q

When to use Parametric Tests

A
  1. Interval or Ratio
  2. Normally distributed
  3. No outliers
  4. Equal variances
  5. Large samples (>30)
23
Q

When to use non-parametric tests

A
  1. Nominal or ordinal data
  2. Not normally distributed
  3. Outliers present
  4. Unequal variances
  5. Small samples
24
Q

Sufficient Statistic

A

“no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter”

Given a set of independent identically distributed data conditioned on an unknown parameter θ, a sufficient statistic is a function T(X) whose value contains all the information needed to compute any estimate of the parameter (e.g. a maximum likelihood estimate).

A statistic t = T(X) is sufficient for underlying parameter θ precisely if the conditional probability distribution of the data X, given the statistic t = T(X), does not depend on the parameter θ.

For example, the sample mean is sufficient for the mean (μ) of a normal distribution with known variance. Once the sample mean is known, no further information about μ can be obtained from the sample itself. On the other hand, for an arbitrary distribution the median is not sufficient for the mean: even if the median of the sample is known, knowing the sample itself would provide further information about the population mean. For example, if the observations that are less than the median are only slightly less, but observations exceeding the median exceed it by a large amount, then this would have a bearing on one's inference about the population mean.

25
Q

Statistical Inference

A

practice of forming judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling.

26
Q

Beta Distribution

A

PDF: f(x; α, β) = x^(α−1) (1−x)^(β−1) / B(α, β), for x ∈ [0, 1]

parameterized by two positive shape parameters, denoted by α and β, that control the shape of the distribution.

applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines.

In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli, binomial, negative binomial and geometric distributions. The beta distribution is a suitable model for the random behavior of percentages and proportions.
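A minimal sketch of the conjugate-prior property for the Bernoulli case (the 7-successes/3-failures data is hypothetical):

```python
# Conjugate Beta-Bernoulli update: with prior Beta(a, b), observing
# k successes and m failures gives posterior Beta(a + k, b + m).
def update_beta(a, b, successes, failures):
    return a + successes, b + failures

a, b = update_beta(1, 1, successes=7, failures=3)  # Beta(1, 1) = uniform prior
posterior_mean = a / (a + b)  # (1+7) / (2+10) = 2/3
```

The update is pure bookkeeping on the two shape parameters, which is what makes the conjugate pair so convenient.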

27
Q

Dirichlet Distribution

A

generalization to multiple variables of Beta distribution

28
Q

Gamma Distribution

A

A two-parameter family of continuous distributions on (0, ∞), usually parameterized by a shape and a scale (or rate) parameter. The exponential and chi-squared distributions are special cases. In Bayesian statistics it is the conjugate prior for the rate parameter of a Poisson or exponential distribution.

29
Q

Wishart Distribution

A

a generalization to multiple dimensions of the gamma distribution

These distributions are of great importance in the estimation of covariance matrices in multivariate statistics. In Bayesian statistics, the Wishart distribution is the conjugate prior of the inverse covariance-matrix of a multivariate-normal random-vector.

30
Q

Exponential Family of Distributions

A

A parametric set of probability distributions chosen for mathematical convenience, based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions.

Most of the commonly used distributions form an exponential family or a subset of an exponential family. Includes: normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart, geometric, binomial (with fixed number of trials), multinomial (with fixed number of trials), negative binomial (with fixed number of failures).

They have properties:

  1. have sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values
  2. have conjugate priors, an important property in Bayesian statistics.
  3. posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form (provided that the normalizing factor of the exponential-family distribution can itself be written in closed form)
  4. In the mean-field approximation in variational Bayes (used for approximating the posterior distribution in large Bayesian networks), the best approximating posterior distribution of an exponential-family node (a node is a random variable in the context of Bayesian networks) with a conjugate prior is in the same family as the node.
31
Q

Max Likelihood

A

A way to estimate parameter values, which are found such that they maximize the likelihood that the model describes the data that were actually observed.
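A sketch for the Bernoulli case, where the maximizer has the closed form p̂ = k/n (the data values k=7, n=10 are made up):

```python
import math

# For n Bernoulli observations with k successes, the log-likelihood
# l(p) = k*log(p) + (n-k)*log(1-p) is maximized at p = k/n,
# the sample proportion.
def log_likelihood(p, k, n):
    return k * math.log(p) + (n - k) * math.log(1 - p)

k, n = 7, 10
p_hat = k / n  # 0.7

# p_hat beats nearby candidate values of p:
assert all(log_likelihood(p_hat, k, n) >= log_likelihood(p, k, n)
           for p in (0.5, 0.6, 0.8, 0.9))
```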

32
Q

Error/Disturbance

A

Of an observation: the deviation of an observed value from the true value of a quantity (e.g. the population mean).

33
Q

Residual

A

The difference between an observed value and the estimated value of the quantity of interest (e.g. the sample mean).

34
Q

Degrees of Freedom

A

The degrees of freedom of an estimate of a parameter are equal to the number of independent scores that go into the estimate, minus the number of parameters used as intermediate steps in the estimation of the parameter itself.

(Most of the time the sample variance has N − 1 degrees of freedom, since it is computed from N random scores minus the 1 parameter estimated as an intermediate step: the sample mean.)

Also the number of dimensions of the domain of a random vector, or essentially the number of "free" components (how many components need to be known before the vector is fully determined).

In the context of linear models (linear regression, analysis of variance), certain random vectors are constrained to lie in linear subspaces, and the number of degrees of freedom is the dimension of the subspace. Degrees of freedom are also commonly associated with the squared lengths (or "sums of squares" of the coordinates) of such vectors, and with the parameters of chi-squared and other distributions.

35
Q

Bessel’s Correction

A

n / (n-1)

n − 1 is the number of degrees of freedom in the vector of residuals (residuals, not errors, because the population mean is unknown):

While there are n independent observations in the sample, there are only n − 1 independent residuals, as they sum to 0

An approach to reduce the bias due to finite sample size.

The sum of squares of the distance from samples to the population mean will always be bigger than the sum of squares of the distance to the sample mean, except when the sample mean happens to be the same as the population mean, in which case the two are equal.

The sum of squares of the deviations from the sample mean is too small to give an unbiased estimate of the population variance when the average of those squares is taken. The smaller the sample size, the larger the difference between the sample variance and the population variance.

three caveats to consider regarding Bessel’s correction:

  1. It does not yield an unbiased estimator of standard deviation.
  2. The corrected estimator often has a higher mean squared error (MSE) than the uncorrected estimator. Furthermore, there is no population distribution for which it has the minimum MSE, because a different scale factor can always be chosen to minimize MSE.
  3. It is only necessary when the population mean is unknown (and estimated as the sample mean). In practice, this generally happens.
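A small deterministic sketch (the sample values are made up) of why the correction is needed:

```python
xs = [2.0, 4.0, 6.0, 8.0]
n = len(xs)
mean = sum(xs) / n                        # 5.0
residuals = [x - mean for x in xs]
assert abs(sum(residuals)) < 1e-12        # residuals sum to 0: only n-1 are free

ss = sum(r * r for r in residuals)        # 20.0
var_biased = ss / n                       # 5.0
var_unbiased = ss / (n - 1)               # ~6.67 = var_biased * n/(n-1)

# The sample mean minimizes the sum of squares, so ss computed around
# any other point (e.g. the unknown population mean) would be larger:
assert ss < sum((x - 5.5) ** 2 for x in xs)
```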
36
Q

Bias

A

bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased.

Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiased from the usual mean-unbiasedness property.

unbiased estimator is preferable to a biased estimator, although in practice, biased estimators (with generally small bias) are frequently used. When a biased estimator is used, bounds of the bias are calculated. A biased estimator may be used for various reasons:

  1. because an unbiased estimator does not exist without further assumptions about a population;
  2. because an estimator is difficult to compute (as in unbiased estimation of standard deviation);
  3. because an estimator is median-unbiased but not mean-unbiased (or the reverse);
  4. because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or
  5. because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful.

The bias of an estimator θ̂ relative to θ is defined as bias(θ̂) = E[θ̂] − θ.

θ̂ is unbiased if its bias is equal to zero for all values of the parameter θ, or equivalently, if the expected value of the estimator matches that of the parameter.

37
Q

Confidence Interval (Frequentist stats)

A

A 95% confidence interval means that, over a large number of repeated samples, 95% of the calculated confidence intervals would include the true value of the parameter. In frequentist terms, the parameter is fixed (it cannot be considered to have a distribution of possible values) and the confidence interval is random (as it depends on the random sample).