11_Machine Learning Flashcards
1
Q
Overfitting
- A model overfitted to its training data is unable to generalize to new data
- The model fails to generalize: it cannot account for data that is slightly different but close enough
- Causes of Overfitting:
  - Not enough training data
    - Need more variety of samples
  - Too many features
    - Model is too complex
    - Model fitted to unnecessary features unique to the training data, a.k.a. "noise"
- Solving for Overfitting:
  - Use more data:
    - Add more training data
    - More varied data allows for better generalization
  - Make the model less complex:
    - Use fewer (but more relevant) features = Feature Selection
    - Combine multiple co-dependent/redundant features into a single representative feature
      - This also helps reduce model training time
    - Remove noise
    - Increase regularization parameters
  - Regularization
  - Early Stopping
  - Cross Validation
  - Dropout methods
  - If data is scarce:
    - Use independent test data
    - Cross Validation (see the sketch below)
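A minimal sketch of how cross validation (listed above) can expose overfitting, assuming scikit-learn is available; the dataset and model here are illustrative stand-ins:

```python
# A model that scores much higher on its own training data than on
# held-out folds is failing to generalize, i.e. it is overfitted.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset standing in for real training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Accuracy on the data the model was trained on.
train_acc = model.fit(X, y).score(X, y)
# Average accuracy across 5 held-out folds approximates performance on new data.
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.3f}, cross-validation accuracy: {cv_acc:.3f}")
```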
A
3
Q
Hyperparameters
- Selection: Hyperparameter values need to be specified before training begins
- Types of Hyperparameters:
  - Model hyperparameters relate directly to the model that is selected.
  - Algorithm hyperparameters relate to the training of the model.
- Tuning: The process of finding the optimal, or near-optimal, values for hyperparameters.
- Not learned from the training data!
- Examples:
- Batch size
- Training epochs
- Number of hidden layers in neural network
- Number of nodes in hidden layers in neural network
- Regularization type
- Regularization rate
- Learning rate, a.k.a. step size
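A hedged sketch of hyperparameter tuning using scikit-learn's GridSearchCV with an SGD classifier; the grid below maps onto the examples above (regularization type and rate, learning rate/step size, iterations), and every value is illustrative rather than a recommended setting:

```python
# Hyperparameters are fixed before training starts; tuning searches over
# candidate values and keeps the combination with the best validation score.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "penalty": ["l1", "l2"],       # regularization type
    "alpha": [1e-4, 1e-3, 1e-2],   # regularization rate
    "learning_rate": ["constant", "optimal"],
    "eta0": [0.01, 0.1],           # initial learning rate / step size
    "max_iter": [500, 1000],       # roughly analogous to training epochs
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```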
A
4
Q
Feature Engineering
Transform data so it is fit for Machine Learning.
- Imputation (for missing data)
- Outliers and Feature Clipping
- If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain value to a fixed value (see the sketch after this list)
- One-hot Encoding (for categorical data)
- One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany dataset chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you’ll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
- Linear Scaling
- Log Scaling
- Bucketing/Bucketization
- Transformation of numeric features into categorical features, using a set of thresholds (e.g. latitude in equally spaced buckets to predict house prices)
- Feature prioritization
- Feature Crosses
- A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
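A compact sketch, assuming numpy and pandas, showing the transformations above (imputation, clipping, one-hot encoding, scaling, bucketing, and a feature cross) on a tiny made-up table; column names and values are purely illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rooms": [2, 3, np.nan, 4, 50],           # contains a missing value and an outlier
    "latitude": [32.1, 34.7, 36.2, 33.9, 37.5],
    "species": ["oak", "pine", "oak", "fir", "pine"],
    "price": [1e5, 2e5, 1.5e5, 3e5, 9e6],
})

# Imputation: fill missing values (here, with the column median).
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

# Feature clipping: cap extreme outliers at a fixed value.
df["rooms"] = df["rooms"].clip(upper=10)

# One-hot encoding for the categorical feature.
df = pd.get_dummies(df, columns=["species"])

# Linear scaling to [0, 1], and log scaling for the skewed price.
df["latitude_scaled"] = (df["latitude"] - df["latitude"].min()) / (
    df["latitude"].max() - df["latitude"].min())
df["log_price"] = np.log1p(df["price"])

# Bucketing: turn latitude into equally spaced categorical buckets.
df["latitude_bucket"] = pd.cut(df["latitude"], bins=3, labels=False)

# Feature cross: a synthetic feature formed by crossing two existing features.
df["rooms_x_latitude_bucket"] = df["rooms"] * df["latitude_bucket"]

print(df.head())
```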
A
5
Q
Regularization
- Training: minimise loss(Data | Model)
- Regularization: complexity(Model)
  - L2 Regularization term
  - L1 Regularization term
- Training with Regularization: minimise loss(Data | Model) + Lambda * complexity(Model)
- Adds a penalty to a model as it becomes more complex
- Penalizing parameters = better generalization
- Cuts out noise and unimportant data, to avoid overfitting
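A small numeric sketch (plain numpy, made-up weights and loss value) of the objective above, using the L1 and L2 terms as the complexity measure:

```python
# Regularized training objective: loss(Data | Model) + Lambda * complexity(Model).
import numpy as np

weights = np.array([0.8, -1.2, 0.05, 3.0])
data_loss = 0.42          # stand-in value for loss(Data | Model)
lam = 0.1                 # regularization rate (Lambda)

l1_complexity = np.sum(np.abs(weights))   # L1 term: sum of |w|
l2_complexity = np.sum(weights ** 2)      # L2 term: sum of w^2

print("L1-regularized objective:", data_loss + lam * l1_complexity)
print("L2-regularized objective:", data_loss + lam * l2_complexity)
```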
Regularization Types
L1 and L2 regularization - different approaches to tuning out noise. Each has a different use case and purpose.
- L1 - Lasso Regression: Sparsity. Assigns greater importance to more influential features
  - Shrinks the influence of less important features to zero
  - Good for models with many features, some more important than others
  - Example: Choosing features to predict the likelihood of a home selling:
    - House price is a more influential feature than carpet color
- L2 - Ridge Regression: Simplicity. Performs better when all the input features influence the output and all weights are of roughly equal size.
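A hedged sketch comparing Lasso (L1) and Ridge (L2) from scikit-learn on synthetic data where only a few features matter; it illustrates the sparsity point above, with Lasso driving irrelevant weights to exactly zero while Ridge keeps them small but non-zero:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 10 features actually influence the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", lasso.coef_.round(2))   # many exact zeros (sparsity)
print("Ridge coefficients:", ridge.coef_.round(2))   # small but mostly non-zero
```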
A
8
Q
Techniques Glossary
- Precision: of all the examples the model predicted as positive, the fraction that are actually positive (TP / (TP + FP)).
- Recall: of all the actually positive examples, the fraction the model correctly predicted as positive (TP / (TP + FN)).
- Gradient Descent: optimization algorithm to find the minimal value of a function. Gradient descent is used to minimise the RMSE or cost function.
- Dropout Regularization: regularization method that removes a random selection of a fixed fraction of the units in a neural network layer during a training step. The more units dropped out, the stronger the regularization.
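A minimal sketch in plain numpy tying the glossary items together: precision and recall from counts, a tiny gradient descent loop, and an inverted-dropout mask; all data is made up:

```python
import numpy as np

# Precision and recall from binary predictions.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
print(f"precision={precision:.2f}, recall={recall:.2f}")

# Gradient descent on a simple cost function f(w) = (w - 3)^2.
w, learning_rate = 0.0, 0.1
for _ in range(100):
    gradient = 2 * (w - 3)          # derivative of the cost w.r.t. w
    w -= learning_rate * gradient   # step towards the minimum
print(f"w after gradient descent: {w:.3f}")  # approaches 3, the minimizer

# Dropout: randomly zero out a fraction of a layer's units during training.
rng = np.random.default_rng(0)
activations = rng.random(10)
drop_rate = 0.5
mask = rng.random(10) >= drop_rate
dropped = activations * mask / (1 - drop_rate)   # inverted-dropout scaling
```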
A