Machine Learning Flashcards
Most common cost functions for linear regression
MSE - Mean Squared Error (the cost minimized by OLS - Ordinary Least Squares), MAE - Mean Absolute Error, Huber loss function
MSE
Mean squared error:
- A measure of the quality of a model/estimator
- The average squared difference between the estimated values and the actual values.
- It is the “second moment” (about the origin) (L2) of the error, and thus incorporates both the variance of the estimator (how widely spread the estimates are from one data sample to another) and its bias (how far off the average estimated value is from the truth).
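A minimal NumPy sketch of the three cost functions named above; the arrays are made-up values used only for illustration:

```python
import numpy as np

# Illustrative target and prediction vectors (not from any real dataset)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)      # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error

# Huber loss: quadratic for small errors, linear beyond delta
delta = 1.0
err = y_true - y_pred
huber = np.mean(np.where(np.abs(err) <= delta,
                         0.5 * err ** 2,
                         delta * (np.abs(err) - 0.5 * delta)))

print(mse, mae, huber)
```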
Cost function for Logistic Regression
Log loss or cross-entropy
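A small sketch of binary log loss (cross-entropy), assuming NumPy and made-up labels and predicted probabilities:

```python
import numpy as np

# Toy labels and predicted probabilities (illustrative only)
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])

# Binary cross-entropy / log loss: penalizes confident wrong predictions heavily
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_loss)
```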
Sigmoid function
1 / (1 + e^-z)
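A minimal NumPy sketch of the sigmoid; the test inputs are arbitrary:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                         # 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # values between 0 and 1
```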
Logistic Regression
A classification algorithm used to assign probabilities to a discrete set of classes.
Monotonic function
Always increasing or always decreasing
Softmax function
Used to normalize results in multi-class logistic regression. Transforms a vector of predictions (real numbers) so that each value falls in the interval [0, 1] and all values sum to 1, so they can be interpreted as probabilities. (aka "normalized exponential function" or softargmax)
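A minimal NumPy sketch of softmax; the score vector is made up, and subtracting the max is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # illustrative class scores
probs = softmax(scores)
print(probs, probs.sum())            # probabilities summing to 1
```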
Popular regression algorithms
- Ordinary Least Squares Regression (OLSR)
- Linear Regression
- Logistic Regression
- Stepwise Regression
- Multivariate Adaptive Regression Splines (MARS)
- Locally Estimated Scatterplot Smoothing (LOESS)
Popular instance-based algorithms
- k-Nearest Neighbors (kNN)
- Learning Vector Quantization (LVQ) (is also neural-network-inspired)
- Self-Organizing Map (SOM)
- Locally Weighted Learning (LWL)
- Support Vector Machine (SVM)
Popular regularization algorithms
- Ridge Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Elastic Net
- Least-Angle Regression
Popular decision tree algorithms
- Classification and Regression Tree (CART)
- Iterative Dichotomiser 3 (ID3)
- C4.5 and C5.0 (different versions of a powerful approach)
- Chi-squared Automatic Interaction Detection (CHAID)
- Decision Stump
- M5
- Conditional Decision Trees
Popular Bayesian algorithms
- Naive Bayes
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Averaged One-Dependence Estimators (AODE)
- Bayesian Belief Network (BBN)
- Bayesian Network (BN)
Popular clustering algorithms
- k-Means
- k-Medians
- Expectation Maximization (EM)
- Hierarchical Clustering
Popular association rule learning algorithms
- Apriori algorithm
- Eclat algorithm
Popular ensemble algorithms
- Boosting
- Bootstrapped Aggregation (Bagging)
- AdaBoost
- Weighted Average (Blending)
- Stacked Generalization (Stacking)
- Gradient Boosting Machines (GBM)
- Gradient Boosted Regression Trees (GBRT)
- Random Forest
Describe regression algorithms
TODO
Data cleaning/prep checklist
- Duplicate rows or values?
- Missing values? What strategy should be used to handle them?
- Does any data need to be recoded?
- Does any categorical data need to be transformed into dummy variables?
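A small pandas sketch of a few checklist items on a hypothetical toy frame (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical frame used only to illustrate the checklist steps
df = pd.DataFrame({"city": ["NY", "LA", "NY", None],
                   "price": [10.0, 12.5, 10.0, 9.0]})

df = df.drop_duplicates()                   # duplicate rows
df["city"] = df["city"].fillna("unknown")   # missing values
df = pd.get_dummies(df, columns=["city"])   # categorical -> dummy variables
print(df)
```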
Data exploration checklist
- Summary statistics
- Correlations
- Subsets
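A small pandas sketch of the same exploration steps on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 6.2, 8.1]})

print(df.describe())    # summary statistics
print(df.corr())        # pairwise correlations
print(df[df["x"] > 2])  # a simple subset
```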
Machine learning DevOps challenges
- High heterogeneity
- High composability
- More options for performance and success metrics
- Iteration - models may require frequent retraining and redeployment
- Infrastructure - varied and dynamic loads, evolving ecosystem
- Scalability from unpredictable loads & high performance demands
- Auditability - Need to explain the “black box”
Properties of standardized data set
- mean of zero
- unit variance (std dev = 1)
- same distribution shape as the original data (often assumed approximately normal/Gaussian; standardizing alone does not make data normal)
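A minimal NumPy sketch of standardization (z-scoring) on made-up values:

```python
import numpy as np

# Illustrative data
x = np.array([2.0, 4.0, 6.0, 8.0])

z = (x - x.mean()) / x.std()  # standardized ("z-score") version
print(z.mean(), z.std())      # ~0.0 and 1.0
```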
Gini Impurity formula for a node
Gini impurity = 1 - SUM over classes[ (probability of class)^2 ]; for two classes: 1 - p1^2 - p2^2
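A minimal NumPy sketch of node Gini impurity; the class probabilities are illustrative:

```python
import numpy as np

def gini_node(class_probs):
    # Gini impurity of a node: 1 minus the sum of squared class probabilities
    p = np.asarray(class_probs)
    return 1.0 - np.sum(p ** 2)

print(gini_node([0.5, 0.5]))  # 0.5, maximally impure two-class node
print(gini_node([1.0, 0.0]))  # 0.0, pure node
```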
Gini Impurity formula for a condition
= weighted average of the Gini impurity of the resulting leaf nodes
= SUM over leaf nodes[ (fraction of samples reaching the node) * (Gini impurity of the node) ]
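A minimal sketch of the weighted split Gini; the per-child impurities could come from gini_node in the previous sketch, and the counts here are made up:

```python
def gini_split(counts_per_child, gini_per_child):
    # Weighted average of child impurities, weighted by the share of samples each child receives
    total = sum(counts_per_child)
    return sum((n / total) * g
               for n, g in zip(counts_per_child, gini_per_child))

# Illustrative split: left child holds 6 samples, right child holds 4
print(gini_split([6, 4], [0.32, 0.48]))  # 0.384
```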