ML Concepts Flashcards

1
Q

Feature crossing

A
  • Create new features by combining existing features (taking their Cartesian product, i.e. every pairwise combination)
  • A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together.
2
Q

Example of crossing two features such as country and language

f1: [USA, China, England]
f2: [English, Chinese]

A

Generates 6 new features (one per country-language pair):
- USA-English
- USA-Chinese
- China-English
- China-Chinese
- England-English
- England-Chinese

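A minimal sketch of building such a cross with pandas (the data frame and column names are made up for illustration; with more rows, all six combinations would appear):

```python
import pandas as pd

# Toy data with the two categorical features from the card
df = pd.DataFrame({
    "country":  ["USA", "China", "England"],
    "language": ["English", "Chinese", "English"],
})

# Cross the features by concatenating their values into one synthetic feature,
# then one-hot encode the crossed feature
df["country_x_language"] = df["country"] + "-" + df["language"]
crossed = pd.get_dummies(df["country_x_language"])
print(crossed)
```
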
3
Q

Feature selection

A
  • Goal is to reduce the number of features to only those most useful
  • Can use a GBDT (gradient-boosted decision tree) to select a subset of features based on their importance scores, as in the sketch below
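
A minimal sketch of GBDT-based selection with scikit-learn's SelectFromModel; the dataset and the median threshold are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Fit a gradient-boosted tree ensemble and keep only the features whose
# importance is above the median importance
selector = SelectFromModel(GradientBoostingClassifier(random_state=0),
                           threshold="median")
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)
```
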
4
Q

Feature extraction

A

Reduce the number of features by creating new features from existing ones. The new, smaller set of features concentrates the predictive power of the originals.

5
Q

Unsupervised learning

A

Training on data without labels; the model discovers patterns or structure on its own (e.g. clustering)

6
Q

Dimensionality Reduction

A

Techniques for reducing the number of input variables in training data.

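PCA is one common example; a minimal scikit-learn sketch (the dataset and number of components are arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 64 input features per sample

# Project the data onto the 10 directions of highest variance
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (1797, 64) -> (1797, 10)
```
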
7
Q

Over-fitting

A

The model gives accurate predictions for training data but not new data

8
Q

Under-fitting

A
  • Model is unable to capture the relationship between inputs and output variables accurately
  • High error rate for both training data and unseen data
9
Q

Pros of Feature Crossing

A
  • Captures pair-wise, second-order feature interactions
10
Q

Cons of Feature Crossing

A
  • Requires human with domain knowledge to choose pairs
  • Won’t capture all complex interactions
  • If the original features are sparse, the cardinality of the crossed features can become much larger, leading to even more sparsity
11
Q

One hot encoding

A
  • Technique used to represent categorical variables as numerical values in a machine learning model
12
Q

Advantages of one hot encoding

A
  • It allows the use of categorical variables in models that require numerical input.
  • It avoids implying an order or magnitude among category values, which happens when categories are mapped to integers (e.g. encoding “small”, “medium”, “large” as 1, 2, 3).
13
Q

What would a categorical feature for fruit (apple, mango, banana) and associated price look like using one hot encoding

A

A vector per example, with columns apple, mango, banana, and price. The three fruit columns contain 1 or 0 (exactly one 1 per example), and the price column contains the numerical price.

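A minimal sketch with pandas (the prices are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "mango", "banana"],
    "price": [1.20, 2.50, 0.80],
})

# One-hot encode the categorical column; the numeric price column is kept as-is
encoded = pd.get_dummies(df, columns=["fruit"])
print(encoded)  # columns: price, fruit_apple, fruit_banana, fruit_mango
```
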
14
Q

Disadvantages of one hot encoding

A
  • Increased dimensionality (one extra column per category value)
  • Sparsity (most columns are zero)
  • Can contribute to overfitting (poor predictions on new data)
  • Use it cautiously and consider alternatives such as ordinal encoding or binary encoding
15
Q

Embeddings

A
  • Way to encode a categorical feature
  • Alternative to one-hot encoding which can generate very sparse vectors
  • Map high-dim vectors to low-dim vectors
  • Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words.
  • Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.
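
A minimal sketch of a learnable embedding table in PyTorch (the framework, vocabulary size, and dimension are assumptions for illustration):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 16    # e.g. 10k distinct category values -> 16-dim vectors

# Learnable lookup table: one dense vector per categorical value,
# trained jointly with the rest of the model
embedding = nn.Embedding(vocab_size, embed_dim)

ids = torch.tensor([3, 42, 999])      # integer indices of three category values
vectors = embedding(ids)
print(vectors.shape)                  # torch.Size([3, 16])
```
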
16
Q

Feature Extraction

A

Create new features from existing ones for dimensionality reduction

17
Q

Model Categories

A

Supervised (labels)
Unsupervised (no labels)
Reinforcement learning (trial and error; this is how your Roomba learns the shape of your living room)

18
Q

Types of supervised learning

A
  • Classification and regression
19
Q

How do you handle missing values in training data?

A
  • Feature imputation techniques
  • Use defaults
  • Use the mean, median, or mode
  • Regression imputation (predict missing values from correlated features)
  • k-nearest neighbors (use the average of the nearest neighbors), as in the sketch below
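
A minimal sketch of mean and k-NN imputation with scikit-learn (the toy matrix is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each missing entry with its column mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# Replace each missing entry with the average of the nearest neighbors
print(KNNImputer(n_neighbors=1).fit_transform(X))
```
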
20
Q

What is embedding learning

A

Learning an n-dim vector for each unique value a categorical feature may take

21
Q

Upsampling

A
  • Strategy to handle data with imbalanced classes by replicating or generating new samples from the minority class to achieve a more balanced distribution
  • Reduces model bias for the majority class
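
A minimal sketch of random upsampling with scikit-learn's resample utility (the 8-vs-2 class split is made up):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10),
                   "label": [0] * 8 + [1] * 2})   # 8 majority vs 2 minority samples

majority = df[df.label == 0]
minority = df[df.label == 1]

# Sample the minority class with replacement until it matches the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced.label.value_counts())
```
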
22
Q

Why deep neural networks

A
  • One hidden layer is good enough for many tasks, but that single layer might have to be impractically wide
  • Stacking layers can reach the same capacity with fewer units and parameters overall
  • Shallow networks often perform poorly on high-dimensional inputs such as natural language
23
Q

Downsampling

A

Strategy to handle data with imbalanced classes by randomly removing samples from the majority class

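The mirror image of the upsampling sketch above: a minimal sketch that randomly subsamples the majority class without replacement (same made-up class split):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10),
                   "label": [0] * 8 + [1] * 2})

majority = df[df.label == 0]
minority = df[df.label == 1]

# Randomly drop majority samples until both classes are the same size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced = pd.concat([majority_down, minority])
print(balanced.label.value_counts())
```
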
24
Q

Epochs

A
  • An epoch is one full pass of the model through the entire training set; the epoch count is the number of such passes
  • During an epoch, the model processes each training sample (usually in mini-batches), calculates the loss, and updates its parameters based on the gradients
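
A minimal sketch of an epoch loop in PyTorch (the model, data, and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data and a tiny linear model, purely for illustration
X, y = torch.randn(100, 4), torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):                 # 5 epochs = 5 full passes over the training set
    for xb, yb in loader:              # one pass over every training sample, in mini-batches
        loss = loss_fn(model(xb), yb)  # forward pass + loss
        optimizer.zero_grad()
        loss.backward()                # gradients
        optimizer.step()               # parameter update
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```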