Exam info Flashcards

1
Q

NUMERICAL VARIABLES

A

Numerical:
• Continuous (each entity gets a distinct score), e.g. temperature, body length.
• Discrete (counts), e.g. number of defects.

2
Q

CATEGORICAL VARIABLES

A

Categorical (entities are divided into distinct categories):
• Binary variable (two outcomes), e.g. dead or alive.
• Ordinal variable, e.g. bad, intermediate, good.
• Nominal variable (order not important), e.g. whether someone is an omnivore,
vegetarian or vegan

3
Q

HYPOTHESIS TESTING

A
  1. State the null hypothesis H0 and the alternative hypothesis Ha.
  2. Collect evidence (data).
  3. Can H0 be maintained, given the evidence?
    if p-value <= 0.05 – Reject H0
    if p-value > 0.05 – Do not reject H0
  4. State the conclusion: at the α significance level (here α = 0.05), there is (not) sufficient statistical evidence to infer …
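
The decision rule in step 3 can be sketched in a few lines. This is a minimal illustration; the function name `decide` and the default `alpha=0.05` are assumptions chosen to match the card, not part of any library.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare the p-value with the significance level (step 3 of the card)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    # Reject H0 when the p-value is at or below the significance level.
    return "Reject H0" if p_value <= alpha else "Do not reject H0"


print(decide(0.03))  # Reject H0
print(decide(0.20))  # Do not reject H0
```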
4
Q

Types of Hypothesis errors

A
  1. Type I error (α): Reject H0 when H0 is true – Jury convicts an innocent person.
  2. Type II error (β): Do not reject H0 when H0 is false – Jury acquits a guilty person.
  3. Correct decision: Reject H0 when H0 is false – Jury convicts a guilty person.
  4. Correct decision: Do not reject H0 when H0 is true – Jury acquits an innocent person.
5
Q

Confidence interval

A

Confidence interval - an interval of numbers built around a point estimate, together with an associated confidence level specifying the probability that an interval constructed this way contains the population parameter.
• Confidence intervals have the general form:
Point Estimate +/- Margin of Error
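
As a sketch of the general form above, the block below builds a confidence interval for a proportion. It assumes a large sample (normal approximation); the function name `proportion_ci` and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes: int, n: int, confidence: float = 0.95):
    """Point estimate +/- margin of error for a proportion (normal approximation)."""
    p_hat = successes / n                                # point estimate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value, about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)           # margin of error
    return p_hat - margin, p_hat + margin


low, high = proportion_ci(successes=120, n=400)
print(f"{low:.3f} to {high:.3f}")  # 0.255 to 0.345
```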

6
Q

Statistical Inference

A

Methods for estimating/predicting and testing hypotheses about population
characteristics based on information contained in a sample.

7
Q

Population

A

A population is the collection of all elements of interest for a particular study.

8
Q

Parameter

A

A parameter is a characteristic of a population (e.g., the mean number of customer service calls of all customers).

9
Q

Sample

A

A sample is a subset of the population, preferably a representative one.

10
Q

Statistic

A

A statistic is a characteristic of a sample (e.g., the mean number of customer
service calls of the 5000 customers in the sample, 1.563).

11
Q

Sample Proportion

A

The sample proportion p is the statistic used to estimate the unknown value of the population proportion π.

12
Q

Point estimation

A

Use of a single known value of a statistic to estimate the associated population
parameter.

13
Q

p-value

A

The probability of observing a sample statistic at least as extreme as the statistic actually observed,
if we assume that H0 is true.
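
To make "at least as extreme" concrete, here is a small sketch for a test statistic that follows the standard normal distribution under H0. The function name `z_p_value` and the `tail` labels are assumptions for illustration.

```python
from statistics import NormalDist


def z_p_value(z: float, tail: str = "two-sided") -> float:
    """P-value of an observed Z statistic, assuming H0 is true."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "greater":      # Ha: parameter is larger than the H0 value
        return 1 - std_normal.cdf(z)
    if tail == "less":         # Ha: parameter is smaller than the H0 value
        return std_normal.cdf(z)
    if tail == "two-sided":    # Ha: parameter differs from the H0 value
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'greater', 'less' or 'two-sided'")


print(round(z_p_value(1.96), 3))  # 0.05
```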

14
Q

1-sample t-test

A

H0: μ = μ0
Can be used for a numerical variable

The test statistic is t, from the t distribution.
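
A minimal sketch of the t statistic for this test, using only the standard library. The function name `one_sample_t` and the sample values are hypothetical; the critical value in the comment is taken from a t table, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) for H0: mu = mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(sample) - mu0) / standard_error, n - 1


t, df = one_sample_t([5.1, 4.9, 5.6, 5.8, 5.2, 5.4], mu0=5.0)
# With df = 5, the two-sided 5% critical value from a t table is 2.571;
# reject H0 only if |t| exceeds it.
print(round(t, 3), df)
```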

15
Q

Test for Proportion

A

H0: π = π0
Can be used for a categorical variable

The test statistic is Z, from the standard normal distribution.
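
The Z statistic for this test can be sketched as follows. It assumes a sample large enough for the normal approximation and a two-sided alternative; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes: int, n: int, pi0: float):
    """Return (Z, two-sided p-value) for H0: pi = pi0."""
    p = successes / n
    # The standard error is computed under H0, so it uses pi0 rather than p.
    z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = proportion_z_test(successes=60, n=100, pi0=0.5)
print(round(z, 2), round(p_value, 4))
```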

16
Q

Two-sample t-test

A

H0: μ1 = μ2
Can be used for a numerical and a binary variable

The test statistic is t, from the t distribution.
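
A sketch of the two-sample t statistic. The card does not say whether variances are pooled, so this version assumes the unpooled (Welch) form; the function name `welch_t` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """t statistic for H0: mu1 = mu2, without assuming equal variances."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# The binary variable defines the two groups; the numerical variable is compared.
print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```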

17
Q

Two-sample Z-test

A

H0: π1 = π2
Can be used for two binary variables

The test statistic is Z, from the standard normal distribution.
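
A sketch of the Z statistic for comparing two proportions, assuming large samples and the usual pooled standard error under H0. The function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Return (Z, two-sided p-value) for H0: pi1 = pi2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion assumed under H0
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


z, p_value = two_proportion_z_test(60, 100, 40, 100)
print(round(z, 3), round(p_value, 4))
```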

18
Q

Chi-square test

A

H0: π1 = π2 = π3
or
H0: π1=ρ1; π2=ρ2; π3=ρ3
Can be used for two categorical variables (with > 2 categories).

The test statistic is χ^2, from the chi-square distribution.
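
A sketch of the second form of H0 (each πi equals a stated ρi) with three categories. The function name and counts are hypothetical. With three categories there are 2 degrees of freedom, and for exactly 2 degrees of freedom the chi-square tail probability is exp(-χ^2/2); that shortcut does not hold for other degrees of freedom.

```python
from math import exp


def chi_square_statistic(observed, expected_props):
    """Sum of (observed - expected)^2 / expected over all categories."""
    n = sum(observed)
    expected = [n * prop for prop in expected_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chi2 = chi_square_statistic([50, 30, 20], [0.4, 0.4, 0.2])
p_value = exp(-chi2 / 2)  # valid only because df = 3 - 1 = 2
print(round(chi2, 2), round(p_value, 3))
```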

19
Q

Analysis of Variance (ANOVA)

A

H0: μ1 = μ2 = μ3
Can be used for a numerical variable and a categorical variable (with > 2 categories).

The test statistic is F = MSTR/MSE, from the F distribution.
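
A sketch of how F = MSTR/MSE is computed for a one-way ANOVA. The function name `anova_f` and the groups are hypothetical; MSTR is the mean square between groups and MSE the mean square within groups.

```python
from statistics import mean


def anova_f(groups):
    """F statistic for H0: all group means are equal."""
    k = len(groups)                           # number of categories
    n = sum(len(g) for g in groups)           # total number of records
    grand_mean = mean(x for g in groups for x in g)
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    mstr = sstr / (k - 1)                     # between-group variability
    mse = sse / (n - k)                       # within-group variability
    return mstr / mse


# F is 13 for these groups (up to floating-point rounding).
print(anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```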

20
Q

Correlation Test

A

H0: ρ = 0
Can be used for two numerical variables

The test statistic is t, from the t distribution with n - 2 degrees of freedom.
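
A sketch of the correlation test statistic, t = r * sqrt((n - 2) / (1 - r^2)), where r is the sample correlation. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def correlation_t(x, y):
    """Return (r, t) for H0: rho = 0; t has n - 2 degrees of freedom."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)                # sample correlation coefficient
    t = r * sqrt((n - 2) / (1 - r ** 2))     # undefined when |r| = 1
    return r, t


r, t = correlation_t([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(r, 2), round(t, 3))
```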

21
Q

k-Nearest Neighbor Algorithm

A

The k-Nearest Neighbor algorithm is an instance-based learning method in which the
training set records are first stored. A new, unclassified record is then classified
by comparing it to the k most similar records in the training set.
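
A minimal sketch of the idea, assuming Euclidean distance as the similarity measure and a simple majority vote among the k nearest records. The function name `knn_classify` and the toy data are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(train, new_point, k=3):
    """Classify new_point from a list of (features, label) training records."""
    # "Training" is just storing the records; the work happens at lookup time.
    nearest = sorted(train, key=lambda record: dist(record[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
print(knn_classify(train, (1.1, 1.0)))  # A
```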

22
Q

Consequences of smaller k

A
  • Choosing a small value for k may lead the algorithm to overfit the data.
  • Noise or outliers may unduly affect classification.
23
Q

Consequences of larger k

A

• Larger values will tend to smooth out idiosyncratic or obscure data values in the
training set.
• If k becomes too large, locally interesting values will be overlooked

24
Q

Overfitting

A

Overfitting occurs when the model tries to fit every possible trend/structure in the
training set, including noise, so it generalizes poorly to new data.

25
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a binary classification model at all classification thresholds. The y-axis shows the True Positive Rate, which is the same thing as Sensitivity. The x-axis shows the False Positive Rate, which is the same thing as 1 - Specificity.
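
As a sketch of how the curve's points arise, the block below sweeps the classification threshold over a set of predicted scores and records (False Positive Rate, True Positive Rate) at each one. The function name `roc_points`, the labels and the scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```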
26
AUC
The Area Under the Curve (AUC) of the ROC curve is often used as a measure of the quality of classification models.
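
One way to compute the AUC without drawing the curve: it equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative record, with ties counted as one half. The sketch below uses that pairwise form; the function name and data are hypothetical.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(round(auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]), 3))  # 0.889
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.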
27
MSE
Mean Squared Error (MSE) is a measure of predictive accuracy for numerical predictions. A lower MSE means more accurate predictions. MSE is never negative.
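
A sketch of the calculation, assuming the plain average of squared errors (dividing by n); some regression texts divide by the residual degrees of freedom instead. The function name and data are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


print(mean_squared_error([3, 5, 7], [2, 5, 9]))  # (1 + 0 + 4) / 3
```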
28
Decision Tree
Decision Tree - a tree-shaped algorithm used to determine a course of action (each branch represents a possible decision).
29
Nodes, root nodes, leaf nodes, decision rule
• Node - a test that splits the records into different categories.
• Root node - the node at the top of the decision tree.
• Leaf node - each external node in the decision tree (a.k.a. category).
• Decision rule (association rule) - the possible paths through the decision tree, each read as IF ... and ... THEN ...
30
Entropy
A measure of the messiness (impurity) of a data collection.
31
Information gain
The decrease in entropy obtained by splitting the data set on some condition.
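
A sketch of both quantities, assuming Shannon entropy in bits (log base 2) and information gain as parent entropy minus the size-weighted entropy of the child subsets. Function names and labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(round(entropy(parent), 3), round(information_gain(parent, children), 3))
```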
32
Dangers of extrapolation
Extrapolation - estimating or concluding something by assuming that existing trends will continue.
1. Analysts should restrict estimates and predictions to values within the range of the x values in the data set.
33
Residual standard error
The value of the residual standard error indicates the size of the “typical” prediction error. It is never negative, and lower values suggest better predictions.
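
A sketch of the usual formula, s = sqrt(SSE / (n - m - 1)), where m is the number of predictors. The function name and the data (actual values with fitted values from a hypothetical one-predictor regression) are assumptions.

```python
from math import sqrt


def residual_standard_error(actual, predicted, n_predictors=1):
    """Typical size of a prediction error, in the units of the target variable."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    degrees_of_freedom = len(actual) - n_predictors - 1
    return sqrt(sse / degrees_of_freedom)


actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
print(round(residual_standard_error(actual, predicted), 3))
```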
34
Confusion Matrix
Table that shows the number of correct and incorrect predictions made by a classification model, compared to the actual outcomes (target values) in the data.
True Positive (TP) --- False Positive (FP)
False Negative (FN) --- True Negative (TN)
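
The four cells can be counted directly from actual and predicted labels. In this sketch the function name `confusion_counts`, the choice of 1 as the positive class, and the labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))
```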
35
R-squared statistic r^2
Measures how closely the linear regression fits the data (ranges from 0 to 1). Example: a model with an R-squared value of 0.7455 explains about 75% of the variability of the target variable (here, revenue).
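
A sketch of the calculation as r^2 = 1 - SSE/SST, where SSE is the sum of squared prediction errors and SST the total variability of the target around its mean. The function name and data are hypothetical and unrelated to the revenue example.

```python
from statistics import mean


def r_squared(actual, predicted):
    """Share of the target's variability explained by the fitted model."""
    y_bar = mean(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - y_bar) ** 2 for a in actual)                 # total
    return 1 - sse / sst


print(round(r_squared([2, 4, 5, 4, 5], [2.8, 3.4, 4.0, 4.6, 5.2]), 3))  # 0.6
```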
36
Four assumptions of linear regression
Before implementing a model, the requisite model assumptions must be verified. The assumptions are:
• Linearity of the relationship between the predictors and the target
• Independence of residuals
• Normal distribution of residuals
• Constant variance of residuals
37
Types of tests to validate a partition for these types of target variables?
1. Continuous
2. Flag/Binary
3. Multinomial
1. Continuous: two-sample t-test for difference in means
2. Flag/Binary: two-sample Z-test for difference in proportions
3. Multinomial: test for homogeneity of proportions