11_Machine Learning Flashcards
1
Q
Overfitting
- A model overfitted to its training data is unable to generalize to new data
- The model fails to generalize: it cannot account for data that is slightly different but close enough
- Causes of Overfitting:
  - Not enough training data
    - Need more variety of samples
  - Too many features
    - Model is too complex
    - Model fitted to unnecessary features unique to the training data, a.k.a. "noise"
- Solving for Overfitting:
  - Use more data:
    - Add more training data
    - More varied data allows for better generalization
  - Make the model less complex:
    - Use fewer (but more relevant) features = Feature Selection
    - Combine multiple co-dependent/redundant features into a single representative feature
      - This also helps reduce model training time
    - Remove noise
    - Increase regularization parameters
  - Regularization
  - Early Stopping
  - Cross Validation
  - Dropout methods
  - If data is scarce:
    - Use independent test data
    - Cross Validation (see the sketch below)
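A minimal sketch of how cross validation (listed above) can expose overfitting, assuming scikit-learn is available; the dataset and model here are illustrative stand-ins:

```python
# A model that scores much higher on its own training data than on
# held-out folds is failing to generalize, i.e. it is overfitted.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset standing in for real training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Accuracy on the data the model was trained on.
train_acc = model.fit(X, y).score(X, y)
# Average accuracy across 5 held-out folds approximates performance on new data.
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.3f}, cross-validation accuracy: {cv_acc:.3f}")
```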
A
3
Q
Hyperparameters
- Selection: Hyperparameter values need to be specified before training begins
- Types of Hyperparameters:
  - Model hyperparameters relate directly to the model that is selected.
  - Algorithm hyperparameters relate to the training of the model.
- Tuning: The process of finding the optimal, or near-optimal, values for hyperparameters.
- Not learned from the training data!
- Examples:
- Batch size
- Training epochs
- Number of hidden layers in neural network
- Number of nodes in hidden layers in neural network
- Regularization type
- Regularization rate
- Learning rate, a.k.a. step size
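A hedged sketch of hyperparameter tuning using scikit-learn's GridSearchCV with an SGD classifier; the grid below maps onto the examples above (regularization type and rate, learning rate/step size, iterations), and every value is illustrative rather than a recommended setting:

```python
# Hyperparameters are fixed before training starts; tuning searches over
# candidate values and keeps the combination with the best validation score.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "penalty": ["l1", "l2"],       # regularization type
    "alpha": [1e-4, 1e-3, 1e-2],   # regularization rate
    "learning_rate": ["constant", "optimal"],
    "eta0": [0.01, 0.1],           # initial learning rate / step size
    "max_iter": [500, 1000],       # roughly analogous to training epochs
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```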
A
4
Q
Feature Engineering
Transform data so it is fit for Machine Learning.
- Imputation (for missing data)
- Outliers and Feature Clipping
- If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain value to a fixed value (see the sketch after this list)
- One-hot Encoding (for categorical data)
- One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany dataset chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you’ll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
- Linear Scaling
- Log Scaling
- Bucketing/Bucketization
- Transformation of numeric features into categorical features, using a set of thresholds (e.g. latitude in equally spaced buckets to predict house prices)
- Feature prioritization
- Feature Crosses
- A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
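A compact sketch, assuming numpy and pandas, showing the transformations above (imputation, clipping, one-hot encoding, scaling, bucketing, and a feature cross) on a tiny made-up table; column names and values are purely illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rooms": [2, 3, np.nan, 4, 50],           # contains a missing value and an outlier
    "latitude": [32.1, 34.7, 36.2, 33.9, 37.5],
    "species": ["oak", "pine", "oak", "fir", "pine"],
    "price": [1e5, 2e5, 1.5e5, 3e5, 9e6],
})

# Imputation: fill missing values (here, with the column median).
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

# Feature clipping: cap extreme outliers at a fixed value.
df["rooms"] = df["rooms"].clip(upper=10)

# One-hot encoding for the categorical feature.
df = pd.get_dummies(df, columns=["species"])

# Linear scaling to [0, 1], and log scaling for the skewed price.
df["latitude_scaled"] = (df["latitude"] - df["latitude"].min()) / (
    df["latitude"].max() - df["latitude"].min())
df["log_price"] = np.log1p(df["price"])

# Bucketing: turn latitude into equally spaced categorical buckets.
df["latitude_bucket"] = pd.cut(df["latitude"], bins=3, labels=False)

# Feature cross: a synthetic feature formed by crossing two existing features.
df["rooms_x_latitude_bucket"] = df["rooms"] * df["latitude_bucket"]

print(df.head())
```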
A
5
Q
Regularization
- Training: minimise loss(Data | Model)
- Regularization: complexity(Model)
  - L2 Regularization term
  - L1 Regularization term
- Training with Regularization: minimise loss(Data | Model) + Lambda * complexity(Model)
- Adds a penalty to a model as it becomes more complex
- Penalizing parameters = better generalization
- Cuts out noise and unimportant data, to avoid overfitting
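A small numeric sketch (plain numpy, made-up weights and loss value) of the objective above, using the L1 and L2 terms as the complexity measure:

```python
# Regularized training objective: loss(Data | Model) + Lambda * complexity(Model).
import numpy as np

weights = np.array([0.8, -1.2, 0.05, 3.0])
data_loss = 0.42          # stand-in value for loss(Data | Model)
lam = 0.1                 # regularization rate (Lambda)

l1_complexity = np.sum(np.abs(weights))   # L1 term: sum of |w|
l2_complexity = np.sum(weights ** 2)      # L2 term: sum of w^2

print("L1-regularized objective:", data_loss + lam * l1_complexity)
print("L2-regularized objective:", data_loss + lam * l2_complexity)
```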
Regularization Types
L1 and L2 regularization - different approaches to tuning out noise. Each has a different use case and purpose.
- L1 - Lasso Regression: Sparsity. Assigns greater importance to more influential features
  - Shrinks the influence of less important features to zero
  - Good for models with many features, some more important than others
  - Example: Choosing features to predict the likelihood of a home selling:
    - House price is a more influential feature than carpet color
- L2 - Ridge Regression: Simplicity. Performs better when all the input features influence the output and all weights are of roughly equal size.
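A hedged sketch comparing Lasso (L1) and Ridge (L2) from scikit-learn on synthetic data where only a few features matter; it illustrates the sparsity point above, with Lasso driving irrelevant weights to exactly zero while Ridge keeps them small but non-zero:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 10 features actually influence the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", lasso.coef_.round(2))   # many exact zeros (sparsity)
print("Ridge coefficients:", ridge.coef_.round(2))   # small but mostly non-zero
```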
A
8
Q
Techniques Glossary
- Precision: of all the examples the model predicted as positive, the fraction that are actually positive (TP / (TP + FP)).
- Recall: of all the actually positive examples, the fraction the model correctly predicted as positive (TP / (TP + FN)).
- Gradient Descent: optimization algorithm to find the minimal value of a function. Gradient descent is used to minimise the RMSE or cost function.
- Dropout Regularization: regularization method that removes a random selection of a fixed fraction of the units in a neural network layer during a training step. The more units dropped out, the stronger the regularization.
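A minimal sketch in plain numpy tying the glossary items together: precision and recall from counts, a tiny gradient descent loop, and an inverted-dropout mask; all data is made up:

```python
import numpy as np

# Precision and recall from binary predictions.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
print(f"precision={precision:.2f}, recall={recall:.2f}")

# Gradient descent on a simple cost function f(w) = (w - 3)^2.
w, learning_rate = 0.0, 0.1
for _ in range(100):
    gradient = 2 * (w - 3)          # derivative of the cost w.r.t. w
    w -= learning_rate * gradient   # step towards the minimum
print(f"w after gradient descent: {w:.3f}")  # approaches 3, the minimizer

# Dropout: randomly zero out a fraction of a layer's units during training.
rng = np.random.default_rng(0)
activations = rng.random(10)
drop_rate = 0.5
mask = rng.random(10) >= drop_rate
dropped = activations * mask / (1 - drop_rate)   # inverted-dropout scaling
```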
A