ML Concepts Flashcards
Feature crossing
- Create new features by combining existing features, e.g. taking the Cartesian product of the values of two categorical features
- A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together.
Example: crossing two features, country and language
f1: [USA, China, England]
f2: [English, Chinese]
Generates 6 new features
- USA-English
- USA-Chinese
- China-English
- China-Chinese
- England-English
- England-Chinese
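A minimal sketch in pandas (the data and column names are illustrative):

```python
import pandas as pd

# Toy rows with the two categorical features from the example above.
df = pd.DataFrame({
    "country": ["USA", "China", "England"],
    "language": ["English", "Chinese", "English"],
})

# Feature cross: concatenate the two values, then one-hot encode the
# result so each (country, language) pair gets its own column.
df["country_x_language"] = df["country"] + "-" + df["language"]
crossed = pd.get_dummies(df["country_x_language"])
print(crossed)
```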
Feature selection
- Goal is to reduce the number of features to only those most useful
- Can use a GBDT (gradient-boosted decision tree) to select a subset of features based on their importance scores, as in the sketch below
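A minimal sketch with scikit-learn (synthetic data; the median threshold is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Fit a GBDT, then keep only the features whose importance score
# is at or above the median importance.
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
selector = SelectFromModel(gbdt, threshold="median", prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)  # e.g. (500, 20) -> (500, 10)
```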
Feature extraction
Reduce the number of features by creating new features from existing ones. The new features concentrate most of the predictive information of the originals into fewer dimensions.
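For instance, PCA extracts new features as linear combinations of the originals. A minimal sketch with scikit-learn on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 original features

# Extract 10 new features, each a linear combination of the originals,
# chosen to preserve as much variance as possible.
pca = PCA(n_components=10)
X_new = pca.fit_transform(X)
print(X_new.shape)  # (200, 10)
```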
Unsupervised learning
Training data has no labels; the model must find structure (e.g. clusters) on its own
Dimensionality Reduction
Techniques for reducing the number of input variables in training data.
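PCA (above) is one such technique; for sparse inputs like one-hot-encoded data, TruncatedSVD is a common alternative. A minimal sketch:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A sparse 1000 x 500 matrix, similar in shape to heavily
# one-hot-encoded data (99% of entries are zero).
X = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# Unlike PCA, TruncatedSVD accepts sparse input directly.
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (1000, 20)
```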
Over-fitting
The model gives accurate predictions for training data but not new data
Under-fitting
- Model is unable to capture the relationship between inputs and output variables accurately
- High error rate for both training data and unseen data
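A quick sketch that makes both failure modes visible: fit polynomials of increasing degree to noisy data and compare train vs. test R² (degree 1 under-fits, degree 15 over-fits):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # under-fit, reasonable, over-fit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(model.score(X_train, y_train), 2),  # train R^2
          round(model.score(X_test, y_test), 2))    # test R^2
```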
Pros of Feature Crossing
- Captures pair-wise, second-order feature interactions
Cons of Feature Crossing
- Requires a human with domain knowledge to choose which features to cross
- Won’t capture all complex interactions
- If the original features are sparse, the cardinality of the crossed features can be much larger, leading to even more sparsity
One-hot encoding
- Technique for representing a categorical variable as a binary vector with one column per category, so it can be used as numerical input to a model
Advantages of one-hot encoding
- It allows the use of categorical variables in models that require numerical input.
- It avoids imposing a spurious ordering on categories that have none: integer-encoding a nominal variable would falsely imply one. (For genuinely ordered categories such as “small”, “medium”, “large”, ordinal encoding may be the better fit.)
What would a categorical feature for fruit (apple, mango, banana) and an associated price look like using one-hot encoding?
A four-column feature vector: apple, mango, banana, price. Exactly one fruit column is 1 and the others are 0; the price column holds the numerical price. For example, a mango costing 2.50 becomes [0, 1, 0, 2.50].
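A minimal sketch with pandas (the prices are made up):

```python
import pandas as pd

# Toy data matching the flashcard: three fruits with made-up prices.
df = pd.DataFrame({
    "fruit": ["apple", "mango", "banana"],
    "price": [1.00, 2.50, 0.75],
})

# One-hot encode the fruit column; price stays numerical.
encoded = pd.get_dummies(df, columns=["fruit"])
print(encoded)  # price plus one 0/1 indicator column per fruit
```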
Disadvantages of one-hot encoding
- Increased dimensionality: one extra column per category
- Sparsity: for any given row, most of the encoded columns are zero
- Can contribute to overfitting, especially with high-cardinality features
- Use it cautiously and consider alternatives such as ordinal encoding or binary encoding
Embeddings
- Way to encode a categorical feature
- Alternative to one-hot encoding which can generate very sparse vectors
- Maps high-dimensional (often sparse) vectors to low-dimensional dense vectors
- Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words.
- Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.
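A minimal sketch with PyTorch (the vocabulary size and dimension are arbitrary):

```python
import torch
import torch.nn as nn

# 10,000 categories (e.g. words) embedded in 16 dimensions. One-hot
# encoding would need 10,000 columns; the embedding needs only 16.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=16)

# Look up dense vectors for a batch of category indices.
ids = torch.tensor([3, 42, 9_999])
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([3, 16])
```

The embedding weights are learned during training, which is what lets semantically similar inputs end up close together in the embedding space.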