Exam info Flashcards
NUMERICAL VARIABLES
Numerical:
• Continuous (entities get a distinct score), e.g. temperature,
body length.
• Discrete (counts), e.g. number of defects.
CATEGORICAL VARIABLES
Categorical (entities are divided into distinct categories):
• Binary variable (two outcomes), e.g. dead or alive.
• Ordinal variable, e.g. bad, intermediate, good.
• Nominal variable (order not important), e.g. whether someone is an omnivore,
vegetarian or vegan
HYPOTHESIS TESTING
- State the null-hypothesis H0 and the alternative Ha
- Collect evidence (data)
- Can H0 be maintained, given the evidence?
- If p-value <= 0.05: reject H0.
- If p-value > 0.05: do not reject H0.
- Conclusion: at the α significance level (here α = 0.05), there is (not) sufficient statistical evidence to infer …
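The decision rule above can be sketched as a small function. This is only an illustration: the name `decide` and the default `alpha=0.05` are assumptions taken from the threshold used on this card.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return "Reject H0" if p_value <= alpha else "Do not reject H0"


print(decide(0.03))  # Reject H0
print(decide(0.20))  # Do not reject H0
```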
Types of Hypothesis errors
- Type I error (α): Reject H0 when H0 is true – Jury convicts an innocent person.
- Type II error (β): Do not reject H0 when H0 is false – Jury acquits a guilty person.
- Correct decision: Reject H0 when H0 is false – Jury convicts a guilty person.
- Correct decision: Do not reject H0 when H0 is true – Jury acquits an innocent person.
Confidence interval
A confidence interval consists of an interval of numbers produced by a point estimate, together with an associated confidence level specifying the probability that an interval constructed this way contains the population parameter.
• Confidence intervals have the general form:
Point Estimate +/- Margin of Error
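A minimal sketch of "Point Estimate +/- Margin of Error" for a mean. It assumes a normal critical value (for small samples a t critical value is the usual choice); the function name and the data are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(data, confidence=0.95):
    """Return (lower, upper) as point estimate +/- margin of error."""
    n = len(data)
    point_estimate = mean(data)
    # Critical value of the standard normal for the chosen confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin_of_error = z * stdev(data) / sqrt(n)
    return point_estimate - margin_of_error, point_estimate + margin_of_error


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data only
lower, upper = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```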
Statistical Inference
Methods for estimating/predicting and testing hypotheses about population
characteristics based on information contained in a sample.
Population
A population is the collection of all elements of interest for a particular study.
Parameter
A parameter is a characteristic of a population (e.g., the mean number of customer service calls of all customers).
Sample
A sample is a representative subset of the population.
Statistic
A statistic is a characteristic of a sample (e.g., mean number of customer
service calls of the 5000 customers in the sample (1.563)).
Sample Proportion
The sample proportion p̂ is the statistic used to estimate the unknown value of the population proportion p.
Point estimation
Use of a single known value of a statistic to estimate the associated population
parameter.
p-value
The probability of observing a sample statistic at least as extreme as the statistic actually observed, if we assume that H0 is true.
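As a sketch of this definition, the two-sided p-value for a Z statistic is the probability, under H0, of a value at least as far from zero as the one observed. The function name is an assumption, and the two-sided form is only one possible choice of "extreme".

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, i.e. assuming H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


print(round(two_sided_p_value(1.96), 3))  # 0.05
```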
1-sample t-test
H0: μ = μ0
Can be used for a numerical variable
The test statistic is t, from the t distribution.
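A minimal sketch of the 1-sample t statistic, t = (x̄ - μ0) / (s / √n) with n - 1 degrees of freedom. The function name and data are illustrative assumptions; the p-value would then come from a t table or a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, mu0):
    """Return (t, degrees of freedom) for H0: mu = mu0."""
    n = len(data)
    standard_error = stdev(data) / sqrt(n)  # stdev is the sample standard deviation
    t = (mean(data) - mu0) / standard_error
    return t, n - 1


t, df = one_sample_t([2, 4, 4, 4, 5, 5, 7, 9], mu0=4)  # illustrative data only
print(f"t = {t:.3f}, df = {df}")
```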
Test for Proportion
H0: π = π0
Can be used for a categorical variable
The test statistic is Z, from the standard normal distribution.
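A sketch of the Z statistic for one proportion, Z = (p̂ - π0) / √(π0(1 - π0)/n), with a two-sided p-value. The function name and the counts are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, pi0):
    """Return (z, two-sided p-value) for H0: pi = pi0."""
    p_hat = successes / n
    standard_error = sqrt(pi0 * (1 - pi0) / n)  # computed under H0
    z = (p_hat - pi0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = proportion_z_test(successes=60, n=100, pi0=0.5)  # illustrative counts
print(f"z = {z:.2f}, p-value = {p:.4f}")
```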
Two-sample t-test
H0: μ1 = μ2
Can be used for a numerical and a binary variable
The test statistic is t, from the t distribution.
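A sketch of a two-sample t statistic. It uses the unequal-variance (Welch) form, which is an assumption here since the card does not say whether variances are pooled; the function name and data are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """t statistic for H0: mu1 = mu2, without assuming equal variances."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# The binary variable splits the numerical variable into two groups.
print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```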
Two-sample Z-test
H0: π1 = π2
Can be used for two binary variables
The test statistic is Z, from the standard normal distribution.
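A sketch of the two-sample Z statistic for proportions, using the pooled proportion under H0: π1 = π2. The function name and counts are assumptions for illustration.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: pi1 = pi2, with x successes out of n in each group."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion assumed under H0
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


print(round(two_proportion_z(60, 100, 40, 100), 3))
```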
Chi-square test
H0: π1 = π2 = π3
or
H0: π1=ρ1; π2=ρ2; π3=ρ3
Can be used for two categorical variables (with > 2 categories).
The test statistic is χ^2, from the chi-square distribution.
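A sketch of the χ^2 statistic for a contingency table: the sum over cells of (observed - expected)^2 / expected, where expected = row total × column total / grand total. The function name and the table are made up for illustration.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: two groups; columns: three categories (illustrative counts).
chi2, df = chi_square_statistic([[10, 20, 30], [20, 20, 20]])
print(f"chi-square = {chi2:.3f}, df = {df}")
```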
Analysis of Variance (ANOVA)
H0: μ1 = μ2 = μ3
Can be used for a numerical and a categorical (with > 2 categories) variable.
The test statistic is F = MSTR/MSE, from the F distribution.
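A sketch of F = MSTR/MSE for one-way ANOVA, where MSTR = SSTR/(k - 1) is the between-group mean square and MSE = SSE/(n - k) is the within-group mean square. The function name and groups are illustrative assumptions.

```python
from statistics import mean


def anova_f(groups):
    """F statistic for H0: all group means are equal."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group (treatment) sum of squares.
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares.
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    mstr = sstr / (k - 1)
    mse = sse / (n - k)
    return mstr / mse


print(anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```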
Correlation Test
H0: ρ = 0
Can be used for two numerical variables
The test statistic is t, from the t distribution.
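A sketch of the correlation test statistic, t = r√(n - 2) / √(1 - r^2) with n - 2 degrees of freedom, where r is the sample correlation. The function name and data are assumptions for illustration.

```python
from math import sqrt
from statistics import mean


def correlation_t(x, y):
    """Return (r, t) for H0: rho = 0, from paired numerical observations."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return r, t


r, t = correlation_t([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # illustrative data only
print(f"r = {r:.2f}, t = {t:.3f}")
```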
k-Nearest Neighbor Algorithm
The k-Nearest Neighbor algorithm is an instance-based learning method in which the
training set records are first stored. The classification of a new, unclassified record
is then performed by comparing it to the most similar records in the training set.
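A minimal sketch of this idea, assuming Euclidean distance as the similarity measure and a simple majority vote among the k closest records. Both choices, the function name, and the data are assumptions, not part of the card.

```python
from collections import Counter
from math import dist


def knn_classify(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest stored records.

    training is a list of (features, label) pairs; it is stored as-is and
    only consulted when a new record has to be classified.
    """
    neighbors = sorted(training, key=lambda rec: dist(rec[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


training_set = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_classify(training_set, (2, 2)))      # A
print(knn_classify(training_set, (6.5, 6.5)))  # B
```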
Consequences of smaller k
- Choosing a small value for k may lead the algorithm to overfit the data.
- Noise or outliers may unduly affect classification.
Consequences of larger k
• Larger values will tend to smooth out idiosyncratic or obscure data values in the
training set.
• If k becomes too large, locally interesting values will be overlooked
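The effect of k can be illustrated with one noisy record inside a cluster. With k = 1 the noisy record decides the class; with k = 5 it is outvoted. The classifier below is the same hypothetical Euclidean majority-vote sketch, and the data are made up.

```python
from collections import Counter
from math import dist


def knn_classify(training, new_point, k):
    """Majority vote among the k nearest training records (Euclidean distance)."""
    neighbors = sorted(training, key=lambda rec: dist(rec[0], new_point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


training_set = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"), ((1.5, 1.5), "A"),
    ((1.6, 1.6), "B"),  # a single noisy record inside the "A" cluster
]
query = (1.6, 1.7)
print(knn_classify(training_set, query, k=1))  # B: the noisy record wins
print(knn_classify(training_set, query, k=5))  # A: the noise is smoothed out
```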
Overfitting
Overfitting occurs when the model tries to fit every possible trend/structure in the
training set.