Exam info Flashcards
NUMERICAL VARIABLES
Numerical:
• Continuous (can take any value within a range), e.g. temperature,
body length.
• Discrete (counts, whole numbers only), e.g. number of defects.
CATEGORICAL VARIABLES
Categorical (entities are divided into distinct categories):
• Binary variable (two outcomes), e.g. dead or alive.
• Ordinal variable, e.g. bad, intermediate, good.
• Nominal variable (order not important), e.g. whether someone is an omnivore,
vegetarian or vegan
HYPOTHESIS TESTING
- State the null hypothesis H0 and the alternative hypothesis Ha
- Collect evidence (data)
- Can H0 be maintained, given
the evidence?
- If p-value <= 0.05 – Reject H0
- If p-value > 0.05 – Do not reject H0
- Conclusion: at the α% significance level, there is (not) sufficient statistical evidence to infer …
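The decision rule above can be sketched in a few lines of Python (the `decide` helper and the default 0.05 cutoff are illustrative, not part of the course material):

```python
# Minimal sketch of the p-value decision rule; the helper name and
# the default alpha = 0.05 cutoff are illustrative choices.
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value."""
    if p_value <= alpha:
        return "Reject H0"
    return "Do not reject H0"

print(decide(0.03))  # p <= 0.05 -> "Reject H0"
print(decide(0.20))  # p > 0.05 -> "Do not reject H0"
```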
Types of Hypothesis errors
- Type I error (α): Reject H0 when H0 is true – Jury convicts an innocent person.
- Type II error (β): Do not reject H0 when H0 is false – Jury acquits a guilty person.
- Correct decision: Reject H0 when H0 is false – Jury convicts a guilty person.
- Correct decision: Do not reject H0 when H0 is true – Jury acquits an innocent person.
Confidence interval
Confidence interval - an interval of numbers constructed around a
point estimate, with an associated confidence level specifying the
probability that the interval contains the population parameter.
• Confidence intervals have the general form:
Point Estimate +/- Margin of Error
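As a sketch, a 95% confidence interval for a population mean built from the general form above, using the normal critical value 1.96 (the sample values are invented):

```python
import math

# Sketch: Point Estimate +/- Margin of Error for a population mean,
# using the 95% normal critical value 1.96; the data are invented.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
n = len(sample)
mean = sum(sample) / n                       # point estimate
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
margin = 1.96 * sd / math.sqrt(n)            # margin of error
ci = (mean - margin, mean + margin)
print(mean, ci)
```

For a small sample like this one, the t critical value with n − 1 degrees of freedom would normally replace 1.96.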
Statistical Inference
Methods for estimating/predicting and testing hypotheses about population
characteristics based on information contained in a sample.
Population
A population is a collection of all elements of interest for a particular study.
Parameter
A parameter is a characteristic of a population
(e.g., the mean number of
customer service calls of all customers).
Sample
A sample is a representative subset of the population.
Statistic
A statistic is a characteristic of a sample (e.g., mean number of customer
service calls of the 5000 customers in the sample (1.563)).
Sample Proportion
The sample proportion p̂ is the statistic used to estimate the unknown value of the population proportion π.
Point estimation
Use of a single known value of a statistic to estimate the associated population
parameter.
p-value
The probability of observing a sample statistic at least as extreme as the statistic actually observed,
if we assume that H0 is true.
1-sample t-test
H0: μ = μ0
Can be used for a numerical variable
the test statistic is t from the t distribution
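A minimal sketch of the 1-sample t statistic computed by hand (the data and μ0 are invented; in practice the p-value comes from a t table or software, with df = n − 1):

```python
import math

# Sketch: 1-sample t statistic, t = (xbar - mu0) / (s / sqrt(n)).
# The data and hypothesized mean are invented for illustration.
data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8]
mu0 = 2.0                                    # hypothesized mean under H0
n = len(data)
xbar = sum(data) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 3))
```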
Test for Proportion
H0: π = π0
Can be used for a categorical variable
the test statistic is Z from the standard normal distribution
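A sketch of the Z statistic for a one-sample test of a proportion (the counts are invented):

```python
import math

# Sketch: Z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n) for H0: pi = pi0.
# The counts below are invented for illustration.
successes, n = 58, 100
pi0 = 0.5
p_hat = successes / n
z = (p_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
print(round(z, 2))  # → 1.6
```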
Two-sample t-test
H0: μ1 = μ2
Can be used for a numerical and a binary variable
the test statistic is t from the t distribution
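A sketch of the pooled two-sample t statistic for H0: μ1 = μ2 (toy data; the p-value would use the t distribution with n1 + n2 − 2 degrees of freedom):

```python
import math

# Sketch: pooled two-sample t statistic (toy data, invented groups).
g1 = [3.0, 3.4, 2.8, 3.2]
g2 = [2.5, 2.9, 2.7, 2.3]

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m1, v1 = mean_var(g1)
m2, v2 = mean_var(g2)
n1, n2 = len(g1), len(g2)
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 3))
```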
Two-sample Z-test
H0: π1 = π2
Can be used for two binary variables
the test statistic is Z from the standard normal distribution
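A sketch of the two-sample Z statistic for H0: π1 = π2 using the pooled sample proportion (the counts are invented):

```python
import math

# Sketch: two-sample Z statistic with a pooled proportion (invented counts).
x1, n1 = 45, 100
x2, n2 = 30, 100
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print(round(z, 3))
```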
Chi-square test
H0: π1 = π2 = π3
or
H0: π1 = p1; π2 = p2; π3 = p3 (goodness of fit against specified proportions)
Can be used for two categorical variables (with > 2 categories)
the test statistic is χ^2 from the chi-square distribution
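The χ² statistic is the sum of (observed − expected)² / expected over all cells; a minimal goodness-of-fit sketch with invented counts:

```python
# Sketch: chi-square statistic from observed vs expected counts
# (goodness-of-fit layout; the counts are invented).
observed = [50, 30, 20]
expected = [40, 35, 25]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 3))  # → 4.214
```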
Analysis of Variance (ANOVA)
H0: μ1 = μ2 = μ3
Can be used for a numerical and a categorical (with > 2 categories) variable.
test statistic is F = MSTR/MSE from the F distribution
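The F statistic above can be computed by hand as the treatment mean square over the error mean square; a sketch with three invented groups:

```python
# Sketch: F = MSTR / MSE for three toy groups, mirroring H0: mu1 = mu2 = mu3.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
values = [x for g in groups for x in g]
grand_mean = sum(values) / len(values)
k, n = len(groups), len(values)

# Between-group (treatment) and within-group (error) sums of squares
sstr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

mstr = sstr / (k - 1)   # treatment mean square, df = k - 1
mse = sse / (n - k)     # error mean square, df = n - k
f = mstr / mse
print(round(f, 2))  # → 19.0
```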
Correlation Test
H0: ρ = 0
Can be used for two numerical variables
the test statistic is t from the t distribution (with n − 2 degrees of freedom)
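A sketch computing Pearson's r by hand and the t statistic for H0: ρ = 0 (the data are invented; df = n − 2):

```python
import math

# Sketch: Pearson r and its t statistic for H0: rho = 0 (toy data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(r, 3), round(t, 3))
```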
k-Nearest Neighbor Algorithm
The k-Nearest Neighbor algorithm is an instance-based learning method in which the training
set records are first stored. A new, unclassified record is then classified
by comparing it to the most similar records in the training set.
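A minimal k-NN sketch: store the training records, then classify a new record by majority vote among its k nearest neighbours (the 2-D data and the function name are invented for illustration):

```python
import math
from collections import Counter

# Sketch: k-NN by majority vote over Euclidean distance (toy 2-D data).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]

def knn_classify(point, records, k=3):
    # Sort stored records by distance to the new point, keep the k nearest
    nearest = sorted(records, key=lambda rec: math.dist(point, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9), train))  # → A
```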
Consequences of smaller k
- Choosing a small value for k may lead the algorithm to overfit the data.
- Noise or outliers may unduly affect classification.
Consequences of larger k
• Larger values will tend to smooth out idiosyncratic or obscure data values in the
training set.
• If k becomes too large, locally interesting values will be overlooked.
Overfitting
Overfitting occurs when the model tries to fit every possible trend/structure in the
training set.
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a binary classification model at all classification thresholds.
The y-axis shows the True Positive Rate, which is the same thing as Sensitivity.
The x-axis shows the False Positive Rate, which is the same thing as 1 - Specificity.
AUC
The Area Under the Curve (AUC) is a common measure of the quality of a classification model; values closer to 1 indicate better discrimination between the classes.
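AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which gives a short way to sketch it (labels and scores are invented):

```python
# Sketch: AUC as the probability that a random positive outscores a
# random negative; ties count half (labels and scores are invented).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
auc = wins / (len(pos) * len(neg))
print(round(auc, 3))  # → 0.889
```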
MSE
Mean Squared Error (MSE) is a measurement of predictive accuracy. Lower MSE means more accurate prediction. (POSITIVE VALUES)
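A one-line sketch of MSE as the mean of squared prediction errors (the numbers are invented):

```python
# Sketch: MSE = mean of squared (actual - predicted) errors (toy numbers).
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 2.0, 8.0]
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(mse)  # → 0.3125
```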
Decision Tree
Decision Tree - a tree-shaped model used to determine a course of action (each branch represents a possible decision or outcome)
Nodes, root nodes, leaf nodes, decision rule
Node - a test that splits the data into different categories.
Root node - the node at the top of the decision tree.
Leaf node - an external (terminal) node of the decision tree (a.k.a. category).
Decision rule (association rule) - a path through the decision tree (IF ... AND ... THEN ...).
Entropy
A measure of the messiness (impurity) of a data collection.
Information gain
The decrease in entropy obtained by splitting the data set based on some condition.
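A sketch of both quantities: the entropy of a label collection, and the information gain of a candidate split (the labels and the split are invented):

```python
import math
from collections import Counter

# Sketch: entropy of a label collection, and the information gain of a
# candidate split (invented labels; the split separates perfectly).
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]   # a perfectly separating split
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(entropy(parent), gain)  # → 1.0 1.0
```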
Dangers of extrapolation
Extrapolation - estimating or concluding something by assuming that existing trends will continue.
Analysts should restrict estimates and predictions to values within the range of the x values in the dataset.
Residual standard error
The value of the Residual standard error indicates the size of the “typical” prediction error.
It takes POSITIVE VALUES; lower values indicate better predictions.
Confusion Matrix
Table that shows the number of correct and incorrect predictions made by a classification model, compared to the actual outcomes (target values) in the data.
True Positive (TP)  --- False Positive (FP)
False Negative (FN) --- True Negative (TN)
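A sketch of tallying the four cells from actual vs predicted labels (1 = positive, 0 = negative; the labels are invented):

```python
# Sketch: confusion-matrix counts from actual vs predicted labels
# (1 = positive, 0 = negative; invented labels).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)
print(tp, fp, fn, tn)  # → 3 1 1 3
```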
R-squared statistic r^2
Measures how closely the linear regression fits the data (ranges from 0 to 1)
- The model has an R-squared value of 0.7455, which means that about 75% of the variability of the target variable (revenue) is explained by our regression model.
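A sketch of fitting a least-squares line and computing r² as 1 − SSE/SST (the data are invented):

```python
# Sketch: least-squares fit and r^2 = 1 - SSE/SST (invented data).
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx                                  # intercept
pred = [b0 + b1 * a for a in x]
sse = sum((b - p) ** 2 for b, p in zip(y, pred))   # residual sum of squares
sst = sum((b - my) ** 2 for b in y)                # total sum of squares
r2 = 1 - sse / sst
print(round(r2, 4))
```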
Four assumptions of linear regression
Before implementing a model, the requisite model assumptions must be verified. The assumptions are:
- Linearity of residuals
- Independence of residuals
- Normal distribution of residuals
- Constant variance of residuals
Types of tests to validate partition for these types of target variables?
- Continuous: Two-sample t-test for difference in means
- Flag/Binary: Two-sample Z-test for difference in proportions
- Multinomial: Test for homogeneity of proportions