General Flashcards

1
Q

What are “labels”?

A

The thing we’re predicting. e.g. the “y” variable in a simple linear regression.

2
Q

What is a “feature”?

A

An input variable, e.g. the “x” in a simple linear regression. There may be multiple input variables, indicated as {x1, x2, … xN}.

3
Q

What is an “example”?

A

An instance of data, either “labeled” or “unlabeled”. Labeled examples are used to train the model, which can then predict the labels on unlabeled examples.

4
Q

What are “training” and “inference”?

A

Training is using labeled data to create, or “learn”, the model. Inference is using the model to make predictions on unlabeled data.

5
Q

What are “regression” and “classification”?

A

A regression model predicts continuous values, e.g. prices. A classification model predicts discrete values, e.g. spam or not spam, or the animal in an image.

6
Q

What is “linear regression”?

A

Linear regression is a method for finding the straight line or hyperplane that best fits a set of points.

7
Q

What is “L2 Loss”?

A

Also known as “squared error”, this is simply the sum of “(prediction - actual)²” over the examples.
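As an illustration (a hypothetical Python sketch, not part of the original card):

```python
# L2 loss (squared error): sum of (prediction - actual)^2 over the examples.
def l2_loss(predictions, labels):
    return sum((p - y) ** 2 for p, y in zip(predictions, labels))

# Example: predictions [3.0, 5.0] vs. labels [2.0, 5.0]
# -> (3-2)^2 + (5-5)^2 = 1.0
```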

8
Q

How would you write the equation for a simple linear regression model?

A

y’ = b + w1 x1, where y’ is the predicted label, b is the bias (the y-intercept), w1 is the weight of feature 1, and x1 is a feature (a known input). With more features, “y’ = b + w1 x1 + w2 x2 + …”.
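A minimal sketch of this model in Python (illustrative names, not from the original card):

```python
# Simple linear model: y' = b + w1*x1 + w2*x2 + ...
def predict(bias, weights, features):
    return bias + sum(w * x for w, x in zip(weights, features))

# One feature: y' = 1.0 + 2.0 * 3.0 = 7.0
```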

9
Q

What is the process of trying many example models to find one that minimizes loss called?

A

Empirical risk minimization.

10
Q

What is MSE?

A

Mean squared error. The sum of the squared losses for each example, divided by the number of examples. It is commonly used, but it is neither the only loss function nor the best choice for all cases.
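A sketch of the computation (hypothetical Python, plain lists assumed):

```python
# MSE: sum of squared errors divided by the number of examples.
def mse(predictions, labels):
    n = len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / n
```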

11
Q

What does it mean that a model has “converged”?

A

While training the model, the loss has reached a point where each iteration yields little or no improvement.

12
Q

What is the shape of the “loss” vs “weight” plot for regression problems?

A

Convex (i.e. bowl-shaped). Convex problems have only one minimum (i.e. one place where the slope is exactly 0).

13
Q

What is the symbol “Δ”?

A

The uppercase form of the Greek letter delta, it is used to represent “change in”. e.g. in the “slope-intercept” equation “y = mx + b”, then “Δy = m Δx”.

14
Q

What is “∂f / ∂x”?

A

The partial derivative of f with respect to x (the derivative of f considered as a function of x alone).

15
Q

What is “∇f”?

A

The gradient of a function, denoted as shown, is the vector of partial derivatives with respect to all of the independent variables.

16
Q

What is a “gradient”?

A

The gradient is a multi-variable generalization of the derivative. Like the derivative, the gradient represents the slope of the tangent of the graph of the function. More precisely, the gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction.
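As an illustration, the gradient can be approximated numerically (a hypothetical Python sketch using central differences; names are illustrative):

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at a point: one partial derivative
    per independent variable, via central differences."""
    grad = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad

# f(x, y) = x^2 + 3y has gradient (2x, 3); at (2, 1) that is (4, 3).
```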

17
Q

What is the gradient descent algorithm?

A

The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible. To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient’s magnitude to the starting point.
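A single step can be sketched as follows (hypothetical Python for a one-weight model y’ = w·x under MSE loss; not part of the original card):

```python
# d(MSE)/dw for y' = w*x is (2/n) * sum(x_i * (w*x_i - y_i)).
def gradient_descent_step(w, xs, ys, learning_rate):
    n = len(xs)
    grad = (2.0 / n) * sum(x * (w * x - y) for x, y in zip(xs, ys))
    # Step in the direction of the negative gradient to reduce loss.
    return w - learning_rate * grad
```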

18
Q

What is the “learning rate”?

A

Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.

19
Q

What are hyperparameters?

A

Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. e.g. If you pick a learning rate that is too small, learning will take too long; too large, and it may overshoot the minimum.

20
Q

In gradient descent, what is a “batch”?

A

The total number of examples you use to calculate the gradient in a single iteration.

21
Q

What is SGD and “mini-batch SGD”?

A

Stochastic Gradient Descent uses one random example for each iteration of training. This is fast but can be noisy. “Mini-batch” SGD chooses a number of random examples (typically 10 - 1000), which has less noise and is still much more efficient than “full batch”.
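A sketch of mini-batch SGD for the one-weight model y’ = w·x (hypothetical Python; function and parameter names are illustrative):

```python
import random

def minibatch_sgd(xs, ys, batch_size=2, learning_rate=0.05, iterations=500):
    """Fit y' = w*x; each iteration computes the gradient on a random mini-batch."""
    w = 0.0
    for _ in range(iterations):
        batch = random.sample(range(len(xs)), batch_size)
        n = len(batch)
        grad = (2.0 / n) * sum(xs[i] * (w * xs[i] - ys[i]) for i in batch)
        w -= learning_rate * grad
    return w
```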

22
Q

What is root mean squared error? What is nice about it?

A

It is just the square root of MSE (mean squared error). A nice property of RMSE is that it can be interpreted on the same scale as the original targets.
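As a sketch (hypothetical Python, plain lists assumed):

```python
import math

# RMSE: square root of the mean squared error,
# so it is on the same scale as the targets.
def rmse(predictions, labels):
    n = len(labels)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, labels)) / n)
```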

23
Q

What two sets are the data typically divided into?

A

The training set, used to build the model, and the test set, used to validate the accuracy of the model. The subsets should stay consistent throughout training, and the model should never be trained on the test data.

24
Q

What other subset might be added to the training and test sets, and why?

A

A validation set can be used to avoid overfitting. Tweak the hyperparameters until the model does well on the validation set, then verify against the test set that the performance holds.

25
Q

What is “one-hot encoding”?

A

Take a highly variable categorical column, like a “street name” string, and represent it as a sparse vector, where all values are 0 except the index representing that street, which is 1.
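For example (a hypothetical Python sketch; the street names are made up):

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a sparse 0/1 vector over a known vocabulary."""
    return [1 if v == value else 0 for v in vocabulary]

streets = ["Main St", "Oak Ave", "Elm St"]
# one_hot("Oak Ave", streets) -> [0, 1, 0]
```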

26
Q

What is “feature engineering”?

A

Turning raw data into a feature vector. Machine learning models typically expect examples to be represented as real-numbered vectors. The vector is constructed by deriving features for each field, then concatenating them all together.

27
Q

What is “scaling”, and when/why is it useful?

A

This puts feature values within a similar range (e.g. -1 to 1, or 0 to 1). It helps gradient descent converge more quickly, avoids NaNs due to out-of-range operations, and prevents some features from having an outsized impact.
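One common form is min-max scaling (a hypothetical Python sketch, illustrative names):

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale values into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

# min_max_scale([10, 20, 30]) -> [0.0, 0.5, 1.0]
```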

28
Q

What is the “Z score”?

A

The number of standard deviations a value is from the mean. It is a useful way of scaling. (e.g. “z-score = (value - mean) / std dev.”).
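A sketch of z-score scaling over a column of values (hypothetical Python, population standard deviation assumed):

```python
import math

def z_scores(values):
    """Scale each value to its number of standard deviations from the mean."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]
```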

29
Q

What is a “feature cross”?

A

A feature cross is a “synthetic feature” formed by multiplying (crossing) two or more features.

30
Q

What is the value of a “feature cross”?

A

It allows the encoding of non-linearity in a linear system.

31
Q

What are feature crosses commonly used for?

A

Combining one-hot encodings as a type of conjunction, e.g. country=usa & lang=es, or lat=32 & long=45. Crossing binned latitude and longitude features with 5 buckets each results in a 25-element one-hot encoding for the lat & long pair.
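The bucketed lat/long cross can be sketched as follows (hypothetical Python; bucket indices stand in for the one-hot inputs):

```python
def cross_one_hot(a_index, b_index, a_size, b_size):
    """Cross two one-hot features (given as bucket indices) into a single
    one-hot vector of length a_size * b_size."""
    vec = [0] * (a_size * b_size)
    vec[a_index * b_size + b_index] = 1
    return vec

# 5 latitude bins x 5 longitude bins -> a 25-element one-hot encoding.
```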