Exam Flashcards
what is statistical learning (also known as machine learning)?
relies on the idea that algorithms can learn from data
supervised learning?
is task-driven and the data is labelled
target variable (supervised learning)
a variable that we need to gain more information on, or predict the value of.
true values of the target variable are called?
labels
predictors?
variables used in predictive analytics to make predictions about the target variable
target variables could be:
continuous or discrete
discrete
can have two levels (binary target) or multiple
classification
is used to predict the value of a discrete target variable, given predictor variable values.
continuous target
can have a large number of possible outcomes
regression:
is used to predict the value of a continuous target variable
Cross-validation
technique that evaluates predictive models by partitioning the original dataset into a training set and a testing set
training set
to build (train) the model
testing set
to evaluate it
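The train/test partition above can be sketched in pure Python (the 80/20 ratio, seed, and the `split_train_test` helper name are illustrative assumptions, not from the cards):

```python
import random

def split_train_test(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then partition it into (training, testing)."""
    rng = random.Random(seed)
    shuffled = data[:]                      # copy: leave the original order intact
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(list(range(10)))   # 8 training rows, 2 testing rows
```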
overfitting
when the algorithm predicts the training data so well that it does not generalize well to new data
classification
when the value to be predicted is a categorical variable, the supervised learning is of type classification
Regression
when the value to be predicted is a numerical variable the supervised learning is of type regression
Unsupervised Learning
there is no target variable; the algorithm needs to come up with the assignment based on the data alone
Clustering
no known classes or categories. The algorithm tries to learn similarities and discover groups of similar data points
association
tries to find relationships between different variables.
parametric
rely on the estimation of parameters of a function, or set of functions, for the purpose of prediction.
non-parametric
do not rely on parameter estimation in order to predict outcome
hyper-parameters
settings that are not learned from the data but chosen beforehand; even a non-parametric model may involve the determination of such settings (e.g. k in K-nearest neighbors)
Inherent Error
unavoidable. also called ‘noise’ or ‘irreducible error’
Bias
due to over-simplification of the model.
variance
due to over-complication. An overly complex model fits noise in the training data and is unable to generalize and correctly predict the target variable on new data
K-nearest Neighbors
algorithm assigns each data point to a class based on the majority class of its k nearest points
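A minimal KNN sketch in pure Python (the point set, labels, and `knn_predict` name are illustrative; distance here is squared Euclidean):

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
```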
classification report
provides information on different aspects of the classifier
precision
proportion of correct positive (event) predictions to all positive predictions
Therefore = TP/(TP+FP)
Recall
recall for class x indicates the proportion of correct positive predictions to all actual positive cases.
= TP/(TP+FN)
F1-score
the harmonic mean of the precision and recall for each class
support
support indicates the number of instances of each class in our testing data
accuracy
number of correct predictions over the total number of predictions
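The four metrics above, written out from the TP/FP/TN/FN counts (the example counts are made up for illustration):

```python
def precision(tp, fp):
    """Correct positive predictions over all positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correct positive predictions over all actual positive cases."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    """Correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)
```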
ROC Curve (Receiver Operating Characteristic curve)
the ROC curve is a plot that depicts how the true positive rate changes with respect to the false positive rate
in ROC FalsePositiveRate should be
close to 0
in ROC TruePositiveRate should be
close to 1
Scaling
helps us bring all features into the same scale
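One common form of scaling is standardization, sketched here in pure Python (population standard deviation; the `standardize` helper is an assumption, not from the cards):

```python
def standardize(values):
    """Rescale a feature to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```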
Logistic Regression
the model outputs a probability, which is often mapped to a binary outcome using a threshold
Logistic regression falls into?
supervised learning of the classification type
Probabilities need to satisfy two conditions
Always be positive
always be between 0 and 1
odds of an event
is the probability of that event divided by the probability of its complement: p / (1 - p).
While probability of an event is always between 0 and 1
the odds could be any non-negative value
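The odds definition as a one-liner (the example probabilities are illustrative):

```python
def odds(p):
    """Odds of an event: its probability over the probability of its complement."""
    return p / (1 - p)

# probability 0.5 -> odds 1 ("even odds"); probability 0.8 -> odds 4 ("4 to 1")
```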
b0,b1 (Logistic Regression)
are the estimates of the model's parameters, also called the weights or coefficients of the features; b0 is the intercept
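How b0 and b1 produce a probability can be sketched with the sigmoid function (a single-feature sketch; the `predict_proba` name and coefficient values are assumptions):

```python
import math

def predict_proba(x, b0, b1):
    """Logistic model: pass the linear score b0 + b1*x through the sigmoid,
    so the output always lies between 0 and 1."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))
```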
Data profiling
understanding what the data entails and identifying anomalies, missing values, inconsistencies, etc.
data cleansing
activities include imputing missing values, removing missing values, addressing outliers, fixing variables that have inconsistent data
data structuring
bringing data into a structured form used for the analysis
data transformation
data may need to be transformed, rescaled, or normalized
Data collection
if the data is not provided to us, we need to collect it
Simple random sampling
each member of the population has the exact same probability of being selected in the sample
Systematic sampling
members of the population are selected based on a system (set of rules)
stratified sampling
population is divided into homogeneous slices (strata). Within each slice simple random sampling is performed and the results are combined (reduces sampling bias and improves accuracy of sampling)
Cluster sampling
the population is divided into subgroups (clusters), such that each cluster is a good representative of the population; a random sample of clusters is then selected.
lower fence
Q1 - 1.5IQR
upper fence
Q3 + 1.5IQR
IQR
Q3 - Q1
data point is an outlier if
it is smaller than the lower fence or larger than the upper fence.
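The fence formulas and outlier rule above, sketched in pure Python (the quartile values in the example are made up for illustration):

```python
def iqr_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(x, q1, q3):
    """A point is an outlier if it falls outside the fences."""
    lower, upper = iqr_fences(q1, q3)
    return x < lower or x > upper
```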
Dummy variables
binary (0/1) variables created through one-hot encoding; they are used in place of the original categorical variable
Label encoding
each category of the categorical variable is assigned a number based on some order
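Both encodings can be sketched in pure Python (the colour/size categories are illustrative examples, not from the cards):

```python
def one_hot(categories, value):
    """One dummy (0/1) column per category; exactly one is 'hot'."""
    return [1 if c == value else 0 for c in categories]

def label_encode(ordered_categories):
    """Assign each category an integer based on the given order."""
    return {c: i for i, c in enumerate(ordered_categories)}
```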
Regression
is a mathematical relationship between the features of a problem and the target variable that is to be predicted.
Linear regression
is a parametric method, requires a response variable (target) and one or multiple predictor variables (features)
the least squares method
produces a line that minimizes the sum of squared errors
y and y hat
y is the actual value of the target variable
y-hat is the predicted value of the target variable
e (the residual)
is the difference between y and y hat
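The least-squares line for a single predictor, written out from the means (a textbook closed-form sketch; the `least_squares` name and example data are assumptions):

```python
def least_squares(xs, ys):
    """Intercept b0 and slope b1 minimizing the sum of squared residuals e = y - y_hat."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1
```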
R^2
coefficient of determination
MSE
Mean Squared Error
RMSE
Root Mean Squared Error
Coef of Determination
indicates the goodness of our model's fit to the data; always between 0 and 1, and a higher value is preferred
Mean Squared Error
measure that evaluates the average of the squared deviation between the values of the target and the predicted values of the target. Smaller values of MSE are preferred; a value of 0 is ideal but rarely achievable in practice
root mean squared error
RMSE is the average amount of deviation of data points from the regression line
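The three regression metrics above, written out in pure Python (the y/y-hat example values are made up for illustration):

```python
def mse(y, y_hat):
    """Average squared deviation between actual and predicted targets."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Square root of the MSE, in the same units as the target."""
    return mse(y, y_hat) ** 0.5

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot
```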
Adj R^2
explicitly accounts for the number of explanatory variables. It is common to use adjusted R^2 for model selection because it imposes a penalty for any additional explanatory variable included in the analysis; it only increases when a newly added variable improves the prediction enough to offset that penalty.
decision trees
the repeated splitting of nodes until we reach pure subsets is the building block of the classification and regression trees (CART) algorithm
When the target variable is categorical
the decision tree is a classification tree
when the target is numerical
the decision tree is a regression tree
Gini Index
measures the degree of impurity of a set of classes in the target variable
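The Gini index of a node, sketched in pure Python (1 minus the sum of squared class proportions; the label lists in the test are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, higher for a more mixed node."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())
```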
K mean algorithm step one
randomly pick k centroids from the sample points as initial cluster centers
K mean algorithm step 2
assign each sample to the nearest centroid
K mean algorithm step 3
Move the centroids to the center of the samples that were assigned to it
k mean algorithm step 4
repeat steps 2 and 3 until the cluster assignments stop changing or the maximum number of iterations is reached
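Steps 2-4 above can be sketched in pure Python (restricted to 1-D points for simplicity; the `kmeans_1d` name, data, and initial centroids are illustrative assumptions):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Assign each point to its nearest centroid (step 2), then move each
    centroid to the mean of its assigned points (step 3); repeat (step 4)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```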
elbow method
find the value of k at which the decrease in inertia slows down markedly (the "elbow") as k increases.
inertia
sum of squared distances between data points in each cluster and their cluster centre
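Inertia for a 1-D clustering, written out directly from the definition (the clusters and centres in the test are made-up examples):

```python
def inertia(clusters, centroids):
    """Sum of squared distances from each point to its own cluster centre."""
    return sum((p - c) ** 2
               for pts, c in zip(clusters, centroids)
               for p in pts)
```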