Exam info Flashcards
NUMERICAL VARIABLES
Numerical:
• Continuous (can take any value within a range), e.g. temperature,
body length.
• Discrete (counts, whole numbers only), e.g. number of defects.
CATEGORICAL VARIABLES
Categorical (entities are divided into distinct categories):
• Binary variable (two outcomes), e.g. dead or alive.
• Ordinal variable, e.g. bad, intermediate, good.
• Nominal variable (order not important), e.g. whether someone is an omnivore,
vegetarian or vegan
HYPOTHESIS TESTING
- State the null hypothesis H0 and the alternative hypothesis Ha
- Collect evidence (data)
- Can H0 be maintained, given
the evidence?
- If p-value <= 0.05 – Reject H0
- If p-value > 0.05 – Do not reject H0
- Conclusion: at the α% significance level, there is (not) sufficient statistical evidence to infer …
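The decision rule above can be sketched in a few lines of Python (the `decide` helper and the default 0.05 cutoff are illustrative, not part of the course material):

```python
# Minimal sketch of the p-value decision rule; the helper name and
# the default alpha = 0.05 cutoff are illustrative choices.
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value."""
    if p_value <= alpha:
        return "Reject H0"
    return "Do not reject H0"

print(decide(0.03))  # p <= 0.05 -> "Reject H0"
print(decide(0.20))  # p > 0.05 -> "Do not reject H0"
```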
Types of Hypothesis errors
- Type I error (α): Reject H0 when H0 is true – Jury convicts an innocent person.
- Type II error (β): Do not reject H0 when H0 is false – Jury acquits a guilty person.
- Correct decision: Reject H0 when H0 is false – Jury convicts a guilty person.
- Correct decision: Do not reject H0 when H0 is true – Jury acquits an innocent person.
Confidence interval
Confidence interval - an interval of numbers constructed around a
point estimate, with an associated confidence level specifying the
probability that the interval contains the population parameter.
• Confidence intervals have the general form:
Point Estimate +/- Margin of Error
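As a sketch, a 95% confidence interval for a population mean built from the general form above, using the normal critical value 1.96 (the sample values are invented):

```python
import math

# Sketch: Point Estimate +/- Margin of Error for a population mean,
# using the 95% normal critical value 1.96; the data are invented.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
n = len(sample)
mean = sum(sample) / n                       # point estimate
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
margin = 1.96 * sd / math.sqrt(n)            # margin of error
ci = (mean - margin, mean + margin)
print(mean, ci)
```

For a small sample like this one, the t critical value with n − 1 degrees of freedom would normally replace 1.96.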
Statistical Inference
Methods for estimating/predicting and testing hypotheses about population
characteristics based on information contained in a sample.
Population
A population is a collection of all elements of interest for a particular study.
Parameter
A parameter is a characteristic of a population
(e.g., the mean number of
customer service calls of all customers).
Sample
A sample is a representative subset of the population.
Statistic
A statistic is a characteristic of a sample (e.g., mean number of customer
service calls of the 5000 customers in the sample (1.563)).
Sample Proportion
The sample proportion p̂ is the statistic used to estimate the unknown value of the population proportion π.
Point estimation
Use of a single known value of a statistic to estimate the associated population
parameter.
p-value
The probability of observing a sample statistic at least as extreme as the statistic actually observed,
if we assume that H0 is true.
1-sample t-test
H0: μ = μ0
Can be used for a numerical variable
the test statistic is t from the t distribution
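A minimal sketch of the 1-sample t statistic computed by hand (the data and μ0 are invented; in practice the p-value comes from a t table or software, with df = n − 1):

```python
import math

# Sketch: 1-sample t statistic, t = (xbar - mu0) / (s / sqrt(n)).
# The data and hypothesized mean are invented for illustration.
data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8]
mu0 = 2.0                                    # hypothesized mean under H0
n = len(data)
xbar = sum(data) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 3))
```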
Test for Proportion
H0: π = π0
Can be used for a categorical variable
the test statistic is Z from the standard normal distribution
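A sketch of the Z statistic for a one-sample test of a proportion (the counts are invented):

```python
import math

# Sketch: Z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n) for H0: pi = pi0.
# The counts below are invented for illustration.
successes, n = 58, 100
pi0 = 0.5
p_hat = successes / n
z = (p_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
print(round(z, 2))  # → 1.6
```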
Two-sample t-test
H0: μ1 = μ2
Can be used for a numerical and a binary variable
the test statistic is t from the t distribution
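A sketch of the pooled two-sample t statistic for H0: μ1 = μ2 (toy data; the p-value would use the t distribution with n1 + n2 − 2 degrees of freedom):

```python
import math

# Sketch: pooled two-sample t statistic (toy data, invented groups).
g1 = [3.0, 3.4, 2.8, 3.2]
g2 = [2.5, 2.9, 2.7, 2.3]

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m1, v1 = mean_var(g1)
m2, v2 = mean_var(g2)
n1, n2 = len(g1), len(g2)
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 3))
```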
Two-sample Z-test
H0: π1 = π2
Can be used for two binary variables
the test statistic is Z from the standard normal distribution
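A sketch of the two-sample Z statistic for H0: π1 = π2 using the pooled sample proportion (the counts are invented):

```python
import math

# Sketch: two-sample Z statistic with a pooled proportion (invented counts).
x1, n1 = 45, 100
x2, n2 = 30, 100
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print(round(z, 3))
```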
Chi-square test
H0: π1 = π2 = π3
or
H0: π1 = p1; π2 = p2; π3 = p3 (goodness of fit against specified proportions)
Can be used for two categorical variables (with > 2 categories)
the test statistic is χ^2 from the chi-square distribution
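The χ² statistic is the sum of (observed − expected)² / expected over all cells; a minimal goodness-of-fit sketch with invented counts:

```python
# Sketch: chi-square statistic from observed vs expected counts
# (goodness-of-fit layout; the counts are invented).
observed = [50, 30, 20]
expected = [40, 35, 25]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 3))  # → 4.214
```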
Analysis of Variance (ANOVA)
H0: μ1 = μ2 = μ3
Can be used for a numerical and a categorical (with > 2 categories) variable.
test statistic is F = MSTR/MSE from the F distribution
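The F statistic above can be computed by hand as the treatment mean square over the error mean square; a sketch with three invented groups:

```python
# Sketch: F = MSTR / MSE for three toy groups, mirroring H0: mu1 = mu2 = mu3.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
values = [x for g in groups for x in g]
grand_mean = sum(values) / len(values)
k, n = len(groups), len(values)

# Between-group (treatment) and within-group (error) sums of squares
sstr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

mstr = sstr / (k - 1)   # treatment mean square, df = k - 1
mse = sse / (n - k)     # error mean square, df = n - k
f = mstr / mse
print(round(f, 2))  # → 19.0
```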
Correlation Test
H0: ρ = 0
Can be used for two numerical variables
the test statistic is t from the t distribution (with n − 2 degrees of freedom)
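A sketch computing Pearson's r by hand and the t statistic for H0: ρ = 0 (the data are invented; df = n − 2):

```python
import math

# Sketch: Pearson r and its t statistic for H0: rho = 0 (toy data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(r, 3), round(t, 3))
```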
k-Nearest Neighbor Algorithm
The k-Nearest Neighbor algorithm is an instance-based learning method in which the training
set records are first stored. A new, unclassified record is then classified
by comparing it to the most similar records in the training set.
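A minimal k-NN sketch: store the training records, then classify a new record by majority vote among its k nearest neighbours (the 2-D data and the function name are invented for illustration):

```python
import math
from collections import Counter

# Sketch: k-NN by majority vote over Euclidean distance (toy 2-D data).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]

def knn_classify(point, records, k=3):
    # Sort stored records by distance to the new point, keep the k nearest
    nearest = sorted(records, key=lambda rec: math.dist(point, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9), train))  # → A
```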
Consequences of smaller k
- Choosing a small value for k may lead the algorithm to overfit the data.
- Noise or outliers may unduly affect classification.
Consequences of larger k
• Larger values will tend to smooth out idiosyncratic or obscure data values in the
training set.
• If k becomes too large, locally interesting values will be overlooked.
Overfitting
Overfitting occurs when the model tries to fit every possible trend/structure in the
training set.
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a binary classification model at all classification thresholds.
The y-axis shows the True Positive Rate, which is the same thing as Sensitivity.
The x-axis shows the False Positive Rate, which is the same thing as 1 - Specificity.
AUC
The Area Under the Curve (AUC) is a common measure of the quality of a classification model; values closer to 1 indicate better discrimination between the classes.
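AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which gives a short way to sketch it (labels and scores are invented):

```python
# Sketch: AUC as the probability that a random positive outscores a
# random negative; ties count half (labels and scores are invented).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
auc = wins / (len(pos) * len(neg))
print(round(auc, 3))  # → 0.889
```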
MSE
Mean Squared Error (MSE) is a measurement of predictive accuracy. Lower MSE means more accurate prediction. (POSITIVE VALUES)
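A one-line sketch of MSE as the mean of squared prediction errors (the numbers are invented):

```python
# Sketch: MSE = mean of squared (actual - predicted) errors (toy numbers).
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 2.0, 8.0]
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(mse)  # → 0.3125
```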
Decision Tree
Decision Tree - a tree-shaped model used to determine a course of action (each branch represents a possible decision or outcome)
Nodes, root nodes, leaf nodes, decision rule
Node - a test that splits the data into different categories.
Root node - the node at the top of the decision tree.
Leaf node - an external (terminal) node of the decision tree (a.k.a. category).
Decision rule (association rule) - a path through the decision tree (IF ... AND ... THEN ...).
Entropy
A measure of the messiness (impurity) of a data collection.
Information gain
The decrease in entropy obtained by splitting the data set based on some condition.
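A sketch of both quantities: the entropy of a label collection, and the information gain of a candidate split (the labels and the split are invented):

```python
import math
from collections import Counter

# Sketch: entropy of a label collection, and the information gain of a
# candidate split (invented labels; the split separates perfectly).
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]   # a perfectly separating split
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(entropy(parent), gain)  # → 1.0 1.0
```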
Dangers of extrapolation
Extrapolation - estimating or concluding something by assuming that existing trends will continue.
Analysts should restrict estimates and predictions to values within the range of the x values in the dataset.
Residual standard error
The value of the Residual standard error indicates the size of the “typical” prediction error.
It takes POSITIVE VALUES; lower values indicate better predictions.
Confusion Matrix
Table that shows the number of correct and incorrect predictions made by a classification model, compared to the actual outcomes (target values) in the data.
True Positive (TP)  --- False Positive (FP)
False Negative (FN) --- True Negative (TN)
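A sketch of tallying the four cells from actual vs predicted labels (1 = positive, 0 = negative; the labels are invented):

```python
# Sketch: confusion-matrix counts from actual vs predicted labels
# (1 = positive, 0 = negative; invented labels).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)
print(tp, fp, fn, tn)  # → 3 1 1 3
```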
R-squared statistic r^2
Measures how closely the linear regression fits the data (ranges from 0 to 1)
- The model has an R-squared value of 0.7455, which means that about 75% of the variability of the target variable (revenue) is explained by our regression model.
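A sketch of fitting a least-squares line and computing r² as 1 − SSE/SST (the data are invented):

```python
# Sketch: least-squares fit and r^2 = 1 - SSE/SST (invented data).
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx                                  # intercept
pred = [b0 + b1 * a for a in x]
sse = sum((b - p) ** 2 for b, p in zip(y, pred))   # residual sum of squares
sst = sum((b - my) ** 2 for b in y)                # total sum of squares
r2 = 1 - sse / sst
print(round(r2, 4))
```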
Four assumptions of linear regression
Before implementing a model, the requisite model assumptions must be verified. The assumptions are:
- Linearity of residuals
- Independence of residuals
- Normal distribution of residuals
- Constant variance of residuals
Types of tests to validate partition for these types of target variables?
- Continuous: Two-sample t-test for difference in means
- Flag/Binary: Two-sample Z-test for difference in proportions
- Multinomial: Test for homogeneity of proportions