Exam Flashcards
what is statistical learning (also known as machine learning)?
relies on the idea that algorithms can learn from data
supervised learning?
is task-driven and the data is labelled
target variable (supervised learning)
a variable that we need to gain more information on, or predict the value of.
true values of the target variable are called?
labels
predictors?
variables used in predictive analytics to make predictions about the target
target variables could be:
continuous or discrete
discrete
can have two levels (binary target) or multiple
classification
is used to predict the value of a discrete target variable, given predictor variable values.
continuous target
can have a large number of possible outcomes
regression:
is used to predict the value of a continuous target variable
Cross-validation
technique that evaluates predictive models by partitioning the original data into a training set and a testing set
training set
to build (train) the model
testing set
to evaluate it
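A minimal sketch of the training/testing partition, using only the Python standard library. The helper name `split_train_test`, the 25% test fraction, and the fixed seed are assumptions made for illustration; they are not part of the flashcards or of any particular library.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (training set, testing set)."""
    shuffled = list(rows)                     # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # seeded shuffle for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows are held out for evaluation; the rest train the model.
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))                        # stand-in for 20 labelled observations
train, test = split_train_test(data)
print(len(train), len(test))                  # 15 5
```

The model is built on `train` only and scored on `test`, so the score reflects performance on data the model has not seen.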
overfitting
when the algorithm predicts the training data so well that it does not generalize well to new data
classification
when the value to be predicted is a categorical variable, the supervised learning is of type classification
Regression
when the value to be predicted is a numerical variable the supervised learning is of type regression
Unsupervised Learning
there is no target variable; the algorithm needs to come up with the assignment based on the data
Clustering
no known classes or categories. the algorithm tries to learn similarities and discover groups of similar data points
association
tries to find relationships between different variables.
parametric
rely on the estimation of parameters of a function, or set of functions, for the purpose of prediction.
non-parametric
do not rely on parameter estimation in order to predict outcome
hyper-parameters
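settings that must be chosen rather than estimated from the data; even a non-parametric model may still involve them (e.g. the number of neighbors k in K-nearest Neighbors)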
Inherent Error
unavoidable. also called ‘noise’ or ‘irreducible error’
Bias
error due to over-simplification. an overly simple model misses real patterns in the data (underfitting)
variance
error due to over-complication. an overly complex model fits the noise in the training data, so it fails to generalize and correctly predict the target variable on new data
K-nearest Neighbors
algorithm assigns each data point to a class based on the most common class among its k nearest points
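A minimal sketch of the K-nearest Neighbors idea, assuming Euclidean distance and a simple majority vote. The function name `knn_predict`, the default `k=3`, and the toy points are illustrative assumptions, not a reference implementation.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.1, 1.0)))  # a
print(knn_predict(points, labels, (5.1, 5.0)))  # b
```

Here k is a hyper-parameter: it is chosen before prediction rather than estimated from the data.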
classification report
provides information on different aspects of the classifier
precision
proportion of correct positive (event) predictions to all positive predictions
= TP/(TP+FP)
Recall
recall for class x indicates the proportion of correct positive predictions to all actual positive cases.
= TP/(TP+FN)
F1-score
the harmonic mean of the precision and recall for each class
= 2 * precision * recall / (precision + recall)
support
support indicates the number of actual instances of each class in the testing data
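A minimal sketch of how the four entries of a classification report can be computed for one class from true and predicted labels. The function name `class_metrics` and the toy label lists are assumptions made for illustration; returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, F1-score, and support for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of actual instances of the class
    return precision, recall, f1, support

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(class_metrics(y_true, y_pred, positive=1))  # (0.75, 0.75, 0.75, 4)
```

In this toy example TP = 3, FP = 1, and FN = 1 for class 1, so precision and recall are both 3/4 and the support is 4.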