Machine Learning Flashcards

1
Q

Hierarchical clustering is most likely used when the problem involves

A

Classifying unlabeled data

2
Q

What is supervised machine learning?

A

Involves training an algorithm to take a set of inputs (X variables) and find a model that best relates them to outputs (Y variables).

Training algorithm - set of inputs - find a model that relates them to outputs.

3
Q

What is unsupervised machine learning?

A

Same as supervised learning, but does not make use of labeled training data.

We give it data and expect the algorithm to make sense of it.

4
Q

What is overfitting?

A

ML algorithms can produce overly complex models that fit the training data too well and thereby do not generalize well to new data.

The prediction model built on the training sample (in-sample data) is too complex.

The model trained on that data does not perform well on new data.

5
Q

Name Supervised ML Algorithms

A

Penalized regression
Support Vector Machine (SVM)
K-Nearest Neighbor (KNN)
Classification and Regression Trees (CART)
Ensemble learning
Random Forest

6
Q

Name unsupervised ML algorithms

A

Principal component analysis
K-means clustering
Hierarchical clustering

7
Q

High Bias Error in ML

A

High Bias Error means the model does not fit the training data well.

8
Q

High Variance Error in ML

A

High variance error means the model does not predict well on the test data.

9
Q

Name dimension reduction algorithms in ML

A

Principal component analysis
(unsupervised ML)

Penalized regression
(supervised ML)

10
Q

What does Penalized Regression do?

A
  • Similar to maximizing adjusted R².
  • Dimension reduction.
  • Eliminates/minimizes overfitting.

Regression coefficients are chosen to minimize the sum of the squared errors, plus a penalty term that increases with the number of included features.
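The objective above can be sketched in a few lines of Python (a hypothetical toy example; the L1-style penalty on coefficient size shown here is one common form, as in LASSO):

```python
# Minimal sketch of a penalized-regression objective (assumed example):
# sum of squared errors plus a penalty on the absolute size of the
# regression coefficients, weighted by the hyperparameter lam.

def penalized_sse(y, x, betas, lam):
    """SSE of the linear fit plus an L1 penalty weighted by lam."""
    sse = sum((yi - sum(b * xij for b, xij in zip(betas, xi))) ** 2
              for yi, xi in zip(y, x))
    penalty = lam * sum(abs(b) for b in betas)
    return sse + penalty

# With lam = 0 this is ordinary least squares; raising lam pushes
# small, unhelpful coefficients toward zero (feature elimination).
y = [1.0, 2.0, 3.0]
x = [[1.0], [2.0], [3.0]]
print(penalized_sse(y, x, [1.0], lam=0.0))   # perfect fit: 0.0
print(penalized_sse(y, x, [1.0], lam=0.5))   # same fit + penalty: 0.5
```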

11
Q

What is SVM?

A

Support Vector Machine

Used for classification, regression, and outlier detection.
Suited to classifying data that is not complex or non-linear.

It is a linear classifier that determines the hyperplane that optimally separates the observations into two sets of data points.

Does not require any hyperparameters.

Maximizes the probability of making a correct prediction by determining the boundary that is furthest from all observations.

Outliers do not affect either the support vectors or the discriminant boundary.
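A hedged sketch of the decision rule: the discriminant boundary is a hyperplane w·x + b = 0, and observations are assigned to a class by the sign of w·x + b. The weights below are made up for illustration, not produced by an actual SVM fit:

```python
# Classify a point by which side of a hypothetical separating
# hyperplane (w, b) it falls on.

def classify(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [1.0, -1.0], 0.0             # assumed hyperplane x1 - x2 = 0
print(classify(w, b, [3.0, 1.0]))   # one side of the boundary: 1
print(classify(w, b, [1.0, 3.0]))   # the other side: -1
```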

12
Q

What is K-Nearest Neighbor?

A

Classification

Classifies a new observation by finding similarities ("nearness") between it and the existing data.

Makes no assumption about the distribution of the data.
It is non-parametric.

KNN results can be sensitive to the inclusion of irrelevant or correlated features, so it may be necessary to select features manually, thereby removing irrelevant information.
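A minimal pure-Python sketch of the idea, on assumed toy data: the new point takes the majority label of its k nearest labeled neighbors.

```python
from collections import Counter
import math

def knn_predict(train, new_point, k=3):
    """train: list of (features, label); majority vote of k nearest."""
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], new_point))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
train = [([0, 0], "A"), ([0, 1], "A"), ([1, 0], "A"),
         ([5, 5], "B"), ([5, 6], "B"), ([6, 5], "B")]
print(knn_predict(train, [1, 1]))       # nearest neighbors are all "A"
print(knn_predict(train, [5.5, 5.5]))   # nearest neighbors are all "B"
```

Note there is no fitting step at all: the "model" is the stored training data, which is what makes KNN non-parametric.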

13
Q

What is CART?

A

Classification and Regression Trees

Part of supervised ML

Typically applied when the target is binary.

If the goal is regression, the prediction would be the mean of the values of the terminal node.

Makes no assumptions about the characteristics of the training data, so if left unconstrained, it can potentially learn the training data perfectly.

To avoid overfitting, regularization parameters can be added, such as the maximum depth of the tree.
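The core splitting step can be sketched as follows (an assumed single-feature toy example; a real CART implementation repeats this recursively and typically scores splits with measures such as Gini impurity rather than raw error counts):

```python
# Find the single-feature threshold that best separates a binary
# target: predict 1 at or above the threshold, 0 below it.

def best_split(xs, ys):
    """Return (threshold, misclassification count) for the best split."""
    best_t, best_err = None, len(ys) + 1
    for t in xs:
        err = sum(1 for x, y in zip(xs, ys)
                  if (1 if x >= t else 0) != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))   # threshold 10 separates perfectly: (10, 0)
```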

14
Q

What are the 3 types of layers in a neural network?

A
  1. Input layer
  2. Hidden layer
  3. Output layer
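A minimal forward pass through the three layers, with made-up weights (a toy sketch, not a trained network):

```python
import math

def forward(x, w_hidden, w_out):
    # hidden layer: weighted sums of the inputs, squashed by a sigmoid
    hidden = [1 / (1 + math.exp(-sum(w * xi for w, xi in zip(ws, x))))
              for ws in w_hidden]
    # output layer: weighted sum of the hidden activations
    return sum(w * h for w, h in zip(w_out, hidden))

x = [1.0, 2.0]                        # input layer (2 features)
w_hidden = [[0.5, -0.5], [1.0, 1.0]]  # 2 hidden nodes (assumed weights)
w_out = [1.0, 1.0]
print(round(forward(x, w_hidden, w_out), 3))
```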
15
Q

What are non-linear functions more susceptible to?

A

Variance error and overfitting

16
Q

What are linear functions more susceptible to?

A

Bias error and underfitting

17
Q

The main distinction between clustering and classification algorithms is that

A

The groups in clustering are determined by the data.

In classification, they are determined by the analyst/researcher.

18
Q

What is K-Means clustering in ML?

A

K-means partitions observations into a fixed number, k, of non-overlapping clusters.

Each cluster is characterized by its centroid, and each observation is assigned by the algorithm to the cluster with the centroid to which that observation is closest.
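The assign-then-update loop can be sketched in pure Python (toy data and starting centroids assumed):

```python
import math

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                     # assignment step
            i = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [                        # update step: move to mean
            [sum(col) / len(cl) for col in zip(*cl)] if cl else ctr
            for cl, ctr in zip(clusters, centroids)]
    return centroids

pts = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
print(kmeans(pts, [[0, 0], [10, 10]]))   # centroids settle on the two groups
```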

19
Q

High bias error and high variance error are indicative of…

A

Underfitting

High bias error = model does not fit the training data well.

High variance error = model does not predict well on test data.

The combination of both results in an underfitted model.

20
Q

Low bias error but high variance error is indicative of ..

A

Overfitting

Bias error = model does not fit the training data well.

Variance error = Model does not predict well on test data.

21
Q

What are linear models more susceptible to?

A

Bias Error (underfitting)

22
Q

What are non-linear models more prone to?

A

Variance Error
(overfitting)

23
Q

What is Principal Components Analysis?

A

It is part of unsupervised ML.
Dimension reduction.

Used to reduce highly correlated features of the data into a few main uncorrelated composite variables.
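For two features, the first principal component can be computed by hand from the covariance matrix (a sketch on assumed toy data; real PCA libraries generalize this to many features via eigen- or singular-value decomposition):

```python
import math

def first_component(xs, ys):
    """First principal component of two features, as a unit vector."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # sample covariance matrix entries
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # largest eigenvalue of the 2x2 symmetric matrix (quadratic formula)
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # corresponding eigenvector, normalized to unit length
    vx, vy = lam - syy, sxy
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# perfectly correlated features -> component points along the diagonal
vx, vy = first_component([1, 2, 3, 4], [1, 2, 3, 4])
print(round(vx, 3), round(vy, 3))   # 0.707 0.707
```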

24
Q

What are the 3 types of error in ML?

A

Bias error
Variance error
Base error

25
Q

What is variance error in ML?

A

Variance error is how much the model's results change in response to new data from validation and test samples.

Unstable models pick up noise and produce high variance, causing overfitting and ↑ out-of-sample error.

26
Q

What is Bias error in ML?

A

Bias error is the degree to which a model fits the training data.
Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and ↑ in-sample error.

(Adding more training samples will not improve the model)

27
Q

What is Base error in ML?

A

Base error is error due to randomness in the data.

(Out-of-sample accuracy increases as the training sample size increases)

28
Q

Name 2 ways of preventing overfitting in supervised machine learning

A

Occam’s Razor: the problem-solving principle that the simplest solution tends to be the correct one.

In supervised ML, it means preventing the algorithm from getting too complex during selection and training by limiting the no. of features and penalizing algorithms that are too complex or too flexible by constraining them to include only parameters that reduce out-of-sample error.

K-Fold Cross Validation: This strategy comes from the principle of avoiding sampling bias.
The challenge is having a large enough data set to make both training and testing possible on representative samples.
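The fold construction behind k-fold cross-validation can be sketched as follows (a hypothetical index-splitting helper; real implementations typically also shuffle the data first):

```python
# Split n observation indices into k folds; each fold serves once as
# the validation set while the other k-1 folds form the training set,
# so every observation is used for both training and validation.

def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in kfold_indices(6, 3):
    print("validate on", sorted(val), "train on", sorted(train))
```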