Final Exam Prep Flashcards

1
Q

What is SLR?

A

simple linear regression

2
Q

What is the equation for simple linear regression?

A

H(x) = w_0 + w_1x

3
Q

What is the equation for the constant model?

A

H(x) = h

4
Q

What is the ith observation?

A

(x_i, y_i)

5
Q

What is H?

A

the hypothesis function, used to make predictions

6
Q

What are summary statistics?

A

summarize a collection of numbers

7
Q

What are examples of summary statistics?

A

mean, median

8
Q

What is a loss function?

A

quantifies how bad a prediction is for a single data point

9
Q

What is squared loss?

A

L_sq(y_i, H(x_i)) = (y_i - H(x_i))^2

10
Q

What does y_i represent?

A

actual values

11
Q

What does H(x_i) represent?

A

predicted values

12
Q

What is R?

A

the average loss for all points

13
Q

What is another name for R?

A

risk

14
Q

What does MSE stand for?

A

mean squared error

15
Q

What is the equation for MSE?

A

R_sq(h) = (1/n) Σ_{i=1}^n (y_i - h)^2

16
Q

What is our goal when calculating the MSE?

A

to find the h that minimizes R_sq(h)

17
Q

What is the definition for MSE?

A

the average squared loss

18
Q

What is the h in H(x) = h?

A

it is a parameter

19
Q

If c(x) = a(x) + b(x), what is the derivative?

A

d/dx c(x) = d/dx a(x) + d/dx b(x)

20
Q

What is the value that minimizes MSE?

A

the mean

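A quick NumPy sketch, using made-up data, that checks numerically that the mean minimizes the MSE of the constant model:

import numpy as np

# Made-up data; any numbers work.
y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])

def mse(h, y):
    # R_sq(h) = (1/n) Σ (y_i - h)^2
    return np.mean((y - h) ** 2)

# Among a fine grid of candidate h values, the minimizer should match the mean.
candidates = np.linspace(y.min(), y.max(), 1001)
best_h = candidates[np.argmin([mse(h, y) for h in candidates])]
print(best_h, y.mean())  # both 5.6
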
21
Q

What is the definition of convexity?

A

a function is convex if the line segment between any two points on its graph lies on or above the graph

22
Q

What steps does the modeling recipe consist of?

A

choose a model, choose a loss function, minimize average loss to find optimal model parameters.

23
Q

What does MAE stand for?

A

mean absolute error

24
Q

What does the MAE calculate?

A

average absolute loss

25
Q

What is the equation for MAE?

A

R_abs(h) = (1/n) Σ_{i=1}^n |y_i - h|

26
Q

What is the h that minimizes MAE?

A

the median

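A similar sketch for MAE, with a made-up outlier to show why the minimizer (the median) is robust:

import numpy as np

# Made-up data with an outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def mae(h, y):
    # R_abs(h) = (1/n) Σ |y_i - h|
    return np.mean(np.abs(y - h))

candidates = np.linspace(y.min(), y.max(), 100_001)
best_h = candidates[np.argmin([mae(h, y) for h in candidates])]
print(best_h, np.median(y))  # both approximately 3.0; the outlier barely moves it
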
27
Q

What can we say about our data if the h* that minimizes R_abs is not unique?

A

n is even

28
Q

How do we guarantee a unique h* that minimizes R_abs?

A

n has to be odd

29
Q

_____ is sensitive to outliers!

A

mean

30
Q

_____ is robust to outliers!

A

median

31
Q

What is empirical risk minimization?

A

formal name for minimizing average loss

32
Q

What minimizes L_infinity loss?

A

the midrange

33
Q

What minimizes 0-1 loss?

A

the mode!

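A sketch, again with made-up data, checking that the midrange minimizes the worst-case (L_infinity) error and the mode minimizes average 0-1 loss:

import numpy as np
from collections import Counter

y = np.array([1, 2, 2, 2, 5, 9])  # made-up data

linf_risk = lambda h: np.max(np.abs(y - h))   # worst absolute error
zero_one_risk = lambda h: np.mean(y != h)     # fraction of wrong predictions

midrange = (y.min() + y.max()) / 2                 # 5.0
mode = Counter(y.tolist()).most_common(1)[0][0]    # 2

candidates = np.linspace(y.min(), y.max(), 10_001)
print(candidates[np.argmin([linf_risk(h) for h in candidates])], midrange)  # ~5.0, 5.0
print(min(set(y.tolist()), key=zero_one_risk), mode)                        # 2, 2
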
34
Q

What is a feature?

A

an attribute of the data (columns)

35
Q

What type of values can features be?

A

numerical, categorical, boolean

36
Q

What happens when we make MSE zero?

A

we are overfitting to the data

37
Q

Why is overfitting to our data bad?

A

because we want our model to generalize well to unseen data and make good predictions in the real world

38
Q

What is an example of a quadratic regression equation?

A

H(x) = w_0 + w_1x^2

39
Q

What is an example of an exponential regression equation?

A

H(x) = w_0 e^(w_1 x)

40
Q

What does w_0 represent?

A

intercept

41
Q

What does w_1 represent?

A

slope

42
Q

What is the equation for the loss surface?

A

R_sq(w_0, w_1) = (1/n) Σ_{i=1}^n (y_i - (w_0 + w_1 x_i))^2

43
Q

What is the least squares solution for w_0?

A

w_0* = ybar - w_1* xbar

44
Q

What is the least squares solution for w_1?

A

w_1* = (Σ_{i=1}^n (x_i - xbar)(y_i - ybar)) / (Σ_{i=1}^n (x_i - xbar)^2)

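A sketch with made-up data that computes the slope and intercept from these formulas and compares them against np.polyfit:

import numpy as np

# Made-up data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 4.0, 4.5, 6.0])

xbar, ybar = x.mean(), y.mean()
w1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope w_1*
w0 = ybar - w1 * xbar                                           # intercept w_0*

# np.polyfit returns [slope, intercept] for degree 1 and should agree.
print(w0, w1, np.polyfit(x, y, 1))
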
45
Q

What is the resulting line for the least squares solution?

A

the regression line

46
Q

What does “fitting to the data” mean?

A

the process of finding optimal parameters

47
Q

What equation would we use to make predictions?

A

H(x) = w_0 + w_1*x

48
Q

What is r?

A

the correlation coefficient

49
Q

What does the correlation coefficient measure?

A

the strength of the linear association of two variables

50
Q

What is the range for r?

A

-1 <= r <= 1

51
Q

What can we tell about our data if r is negative?

A

there is a negative association; the scatter plot trends downward from upper left to lower right

52
Q

What can we tell about our data if r is positive?

A

there is a positive association; the scatter plot trends upward from lower left to upper right

53
Q

What happens as r gets closer to +1 or -1?

A

the linear association gets stronger

54
Q

How do we convert a value x_i to standard units?

A

(x_i - mean) / (standard deviation of x)

55
Q

How do we calculate r?

A

r = (1/n) Σ_{i=1}^n ((x_i - mean of x) / (SD of x)) * ((y_i - mean of y) / (SD of y))

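A sketch with made-up data that computes r as the average product of x and y in standard units and compares it against np.corrcoef:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
y = np.array([2.0, 2.5, 4.0, 4.5, 6.0])

# Standard units: (value - mean) / SD, using the population SD (ddof=0)
# so that it matches the 1/n in the formula.
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()
r = np.mean(x_su * y_su)

print(r, np.corrcoef(x, y)[0, 1])  # the two values agree
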
56
Q

What happens as y spreads out?

A

SD of y increases and the slope gets steeper

57
Q

What happens as x gets more spread out?

A

SD of x increases and slope gets more shallow

58
Q

What is the equivalent of finding models that minimize MSE in terms of r?

A

finding models that maximize r^2

59
Q

What is the value of R_sq(w_0*, w_1*) for the regression line?

A

(SD of y)^2 * (1-r^2)

60
Q

What is a more flexible version of the constant model?

A

the simple linear regression model

61
Q

What can be said if A and B are two matrices?

A

in general, AB != BA (matrix multiplication is not commutative)

62
Q

What is a vector?

A

an ordered collection of n numbers in R^n

63
Q

What is another name for length of a vector?

A

the l_2 norm

64
Q

What is the equation for the length of a vector?

A

||v|| = sqrt(v_1^2 + v_2^2 + … + v_n^2)

65
Q

What do vectors have?

A

a magnitude and a direction

66
Q

What is the dot product of two vectors?

A

u · v = u_1 v_1 + u_2 v_2 + … + u_n v_n

67
Q

What is the result of the dot product?

A

a scalar

68
Q

What is a scalar?

A

a single number

69
Q

What is another way we can calculate the dot product?

A

u · v = ||u|| ||v|| cos(theta)

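A sketch with made-up vectors showing both ways of computing the dot product and recovering the angle between them:

import numpy as np

u = np.array([1.0, 2.0, 2.0])   # made-up vectors
v = np.array([3.0, 0.0, 4.0])

elementwise = np.sum(u * v)   # u_1 v_1 + u_2 v_2 + u_3 v_3
dot = np.dot(u, v)            # same value

# From u · v = ||u|| ||v|| cos(theta), recover the angle.
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
print(elementwise, dot, np.degrees(np.arccos(cos_theta)))
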
70
Q

When are two vectors orthogonal?

A

if and only if their dot product is 0

71
Q

What does it mean if the dot product of two (nonzero) vectors is 0?

A

the angle between them is 90 degrees (they are orthogonal)

72
Q

How do we get the sum of two vectors?

A

using an element-wise sum

73
Q

If c is a scalar and we are trying to multiply it with a vector, how do we compute that?

A

multiply each element by the scalar

74
Q

What is a linear combination?

A

any vector of the form a_1v_1 + a_2v_2 + … + a_d v_d

75
Q

What is a span?

A

the set of all vectors that can be created using linear combinations of those vectors

76
Q

What is the vector in span x that is closest to y?

A

the orthogonal projection of y onto span(x)

77
Q

What is the equation for the projection error?

A

e = y - wx

78
Q

How do we get the w* that minimizes the projection error?

A

w* = (x · y) / (x · x)

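A sketch with made-up vectors computing w* = (x · y) / (x · x) and checking that the resulting error vector is orthogonal to x:

import numpy as np

x = np.array([1.0, 2.0, 2.0])   # made-up vectors
y = np.array([3.0, 1.0, 4.0])

w_star = np.dot(x, y) / np.dot(x, x)   # w* = (x · y) / (x · x)
projection = w_star * x                # the vector in span(x) closest to y
e = y - projection                     # projection error

print(w_star, projection, np.dot(e, x))  # e · x is 0 (up to rounding)
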
79
Q

What is the equation for the error vector?

A

||e|| = ||y - wx||

80
Q

What does w*x represent?

A

the orthogonal projection of y onto span x

81
Q

What is an n x d matrix?

A

a table of numbers with n rows and d columns

82
Q

When can we add two matrices?

A

when they have the same dimensions

83
Q

If A(B + C) then

A

AB + AC

84
Q

If (AB)C then

A

A(BC)

85
Q

If (A + B)^T then

A

A^T + B^T

86
Q

If (AB)^T then

A

B^T A^T

87
Q

How can a vector be viewed as a matrix?

A

it is a matrix with n rows and 1 column

88
Q

How can we think of matrix-vector multiplication?

A

a linear combination of the columns of A using the weights in v

89
Q

What does the span of the columns of X consist of?

A

all of the vectors that can be written in the form Xw

90
Q

What does the w represent in Xw?

A

the weights, also known as the parameter vector

91
Q

What are the normal equations?

A

X^TXw* = X^Ty

92
Q

What is w* for the normal equations?

A

(X^TX)^-1X^Ty

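A sketch with made-up data that builds a design matrix, solves the normal equations, and compares the result against np.linalg.lstsq:

import numpy as np

# Made-up data with two features.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
y = np.array([1.5, 3.0, 3.5, 5.5])

# Design matrix: a first column of all ones, then one column per feature.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve X^T X w* = X^T y (preferable to explicitly inverting X^T X).
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w_star, np.linalg.lstsq(X, y, rcond=None)[0])  # the two solutions agree
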
93
Q

When do the normal equations have a unique solution w*?

A

when X^TX is full rank

94
Q

What happens if X^TX is not full rank?

A

there are infinitely many solutions for w*

95
Q

What is the observation vector?

A

vector of all observed values, y

96
Q

What is a hypothesis vector?

A

vector of predicted values, h

97
Q

What is the error vector?

A

the vector of all errors between the observed and predicted values, e

98
Q

What is a design matrix?

A

a matrix where all the values for each feature are in columns and the first column is all ones

99
Q

What is the equation for the error vector?

A

e_i = y_i - H(x_i)

100
Q

What is the formula for the norm of a vector?

A

||v|| = sqrt(v_1^2 + v_2^2 + … + v_n^2)

101
Q

What is the parameter vector?

A

a vector with all the parameter values, e.g., slope and intercept

102
Q

What is a function of a vector?

A

a function of multiple variables

103
Q

What is the gradient with respect to w?

A

∇R_sq(w) = dR_sq/dw = the vector of partial derivatives of R_sq with respect to each parameter

104
Q

What is multiple linear regression?

A

linear regression with multiple features

105
Q

What is an augmented feature vector?

A

the feature vector with a 1 prepended: Aug(x) = [1, x_1, …, x_d]^T (each row of the design matrix is an Aug(x_i), transposed)

106
Q

What is the hypothesis function in terms of the augmented feature vector?

A

H(x) = w · Aug(x)

107
Q

What is a hypothesis function for multiple linear regression?

A

H(x) = w_1 (1/x_2) + w_2 sin(x) + w_3 e^x

108
Q

What is feature engineering?

A

the process of creating new features out of existing information in our dataset

109
Q

How many parameters can we have?

A

as many as we want, as long as our hypothesis function is linear in the parameters

110
Q

Where is our minimum if the slope of the tangent line is positive?

A

to the left and we need to decrease t

111
Q

Where is our minimum if the slope of the tangent line is negative?

A

to the right and we need to increase t

112
Q

What is the equation for gradient descent?

A

t_1 = t_0 - alpha * df/dt (t_0)

113
Q

What is t_0?

A

our initial guess

114
Q

What is alpha?

A

the learning rate; the step size

115
Q

How many times do we repeat gradient descent?

A

as many times as we can until convergence

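A minimal gradient descent sketch on the made-up convex function f(t) = (t - 3)^2, whose derivative is 2(t - 3) and whose minimizer is t = 3:

def df(t):
    return 2 * (t - 3)   # derivative of f(t) = (t - 3)^2

t = 0.0        # t_0, the initial guess
alpha = 0.1    # learning rate / step size

for _ in range(1000):
    step = alpha * df(t)
    t = t - step             # t_{i+1} = t_i - alpha * df/dt(t_i)
    if abs(step) < 1e-8:     # stop once the updates become negligible
        break

print(t)  # approximately 3.0
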
116
Q

What is gradient descent?

A

a method for finding the input to a function f that minimizes the function

117
Q

What is a numerical method?

A

a technique for approximating the solution to a mathematical problem

118
Q

How can we tell if a function is convex?

A

if no line segment connecting two points on the graph ever dips below the graph

119
Q

What are examples of convex functions?

A

|x - 4|, e^x, (x -3)^24

120
Q

What is an example of a function that is not convex?

A

sqrt(x-1)

121
Q

What does it mean for a function to be concave?

A

it is the negative of a convex function

122
Q

What is the second derivative test?

A

if a function is twice differentiable, check its second derivative: if it is greater than or equal to 0 everywhere, the function is convex; if it is less than or equal to 0 everywhere, it is concave

123
Q

What happens to the gradient descent if f(t) is convex and differentiable?

A

it converges to a global minimum as long as the step size is small enough

124
Q

How can we tell when gradient descent has converged?

A

when the derivative is 0

125
Q

What happens if the derivative of a function is 0 only in one place?

A

for a convex function, that point is the global minimum

126
Q

Can gradient descent work on nonconvex functions?

A

yes but it is not guaranteed to find a global minimum

127
Q

What is an experiment?

A

some process whose outcome is random

128
Q

What are some examples of an experiment?

A

flipping a coin, rolling a die

129
Q

What is a set?

A

an unordered collection of items

130
Q

How are sets denoted?

A

using {}

131
Q

What does |A| denote?

A

number of elements in set A

132
Q

What is a sample space?

A

finite or countable set of possible outcomes of an experiment

133
Q

What is a probability distribution?

A

assignment of probabilities to outcomes in S

134
Q

What must the probability be for each s in S?

A

0 <= p(s) <= 1

135
Q

What do the probabilities have to sum up to?

A

1

136
Q

What can be said about the sample spaces and probability distributions for flipping a fair coin and a biased coin?

A

they have the same sample spaces but different probability distributions

137
Q

What is an Event E?

A

a subset of the sample space

138
Q

What is a uniform distribution?

A

it assigns the probability of 1/n to each element of S

139
Q

What is the addition rule if A and B are mutually exclusive?

A

P(A or B) = P(A u B) = P(A) + P(B)

140
Q

What does it mean for two events to be mutually exclusive?

A

they cannot happen simultaneously

141
Q

What is the addition rule if two events are not mutually exclusive?

A

P(A u B) = P(A) + P(B) - P(A and B)

142
Q

What is the multiplication rule?

A

P(A and B) = P(A n B) = P(A) * P(B|A)

143
Q

What is the complement rule?

A

P(not A) = 1 - P(A)

144
Q

What is conditional probability?

A

P(B|A) means “the probability that B happens, given that A happened.”

145
Q

What happens if A and B are independent from one another?

A

P(B|A) = P(B)

146
Q

What is the intuition behind independency?

A

A and B are independent if knowing that A happened gives you no additional information about B and vice versa

147
Q

What is Simpson’s Paradox?

A

a trend that appears in the aggregated (joined) data reverses or disappears when the data is analyzed in separate groups

148
Q

What does it mean to sample with replacement?

A

drawing one element uniformly at random and returning it to the list, repeat

149
Q

What does it mean to sample without replacement?

A

drawing one element uniformly at random without returning it, then repeating

150
Q

What are the traits of sequences?

A

list, order matters, repetitions allowed (with replacement), elements in listed order

151
Q

What are the traits of sets?

A

collection of elements, order does not matter, no repetitions allowed (without replacement), elements in no particular order

152
Q

What is the symbol for permutations?

A

P(n, k)

153
Q

What are the traits for permutations?

A

order matters, no repetitions allowed, counts the # of sequences of k distinct elements chosen from n possible elements

154
Q

What are the traits for combinations?

A

order does not matter, no repetitions allowed, counts the # of sets of size k chosen from n possible elements

155
Q

What is the symbol for combinations?

A

C(n, k)

156
Q

What is the equation for permutations?

A

n! / (n - k)!

157
Q

What is the equation for combinations?

A

n! / (k! (n - k)!)

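A sketch checking the factorial formulas against Python's built-in math.perm and math.comb for one made-up choice of n and k:

import math

n, k = 5, 3   # made-up values

p_formula = math.factorial(n) // math.factorial(n - k)                        # P(n, k)
c_formula = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))  # C(n, k)

print(p_formula, math.perm(n, k))  # 60 60
print(c_formula, math.comb(n, k))  # 10 10
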
158
Q

What does Bayes’ Theorem follow from?

A

the multiplication rule and the definition of conditional probability

159
Q

What is the first equation for Bayes’ Theorem?

A

P(B|A) = P(A|B) * P(B) / P(A)

160
Q

What is the final equation for Bayes’ Theorem?

A

P(B|A) = (P(A|B) * P(B)) / (P(A|B) * P(B) + P(A|notB) * P(notB))

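A sketch that plugs made-up probabilities into the expanded form of Bayes’ Theorem:

# Made-up probabilities, purely for illustration.
p_B = 0.01              # P(B), the prior
p_A_given_B = 0.95      # P(A | B)
p_A_given_notB = 0.05   # P(A | not B)

# Denominator: P(A) = P(A|B) P(B) + P(A|notB) P(notB)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# P(B | A) = P(A|B) P(B) / P(A), the updated belief about B
p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)  # about 0.16
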
161
Q

What does Bayes’ Theorem describe?

A

how to update the probability of one event given that another has occurred

162
Q

What is P(A) in Bayes’ Theorem?

A

prior belief that A happens

163
Q

What is P(A|B) in Bayes’ Theorem?

A

our updated belief that A happens, now that we know B happens

164
Q

What is an example of when updating our beliefs does not matter?

A

getting heads on your first coin flip does not affect the outcome of your second flip

165
Q

If two events are both mutually exclusive and independent, what can we say about their probabilities?

A

at least one of them must have a zero probability

166
Q

Can events be independent in the real world?

A

almost never

167
Q

What is conditional independence?

A

events that become independent once some other information is learned (learning information can also make independent events dependent)

168
Q

What is classification?

A

predicting a categorical label (class) for new data based on labeled examples

169
Q

What is Bayes’ Theorem for classification?

A

P(class|features) = P(features|class) * P(class) / P(features)

170
Q

How do we select which classification wins?

A

whichever class has the larger numerator, P(features|class) * P(class); the denominator P(features) is the same for every class

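A sketch of picking the winning class by comparing the numerators P(features|class) * P(class), using made-up estimates and treating words as independent given the class (a naive-Bayes-style simplification):

# Made-up estimates from a hypothetical training set.
p_class = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {
    "spam": {"prize": 0.30, "meeting": 0.05},
    "ham":  {"prize": 0.02, "meeting": 0.40},
}

email_words = ["prize", "meeting"]

def numerator(c):
    # P(class) times the product of P(word | class) for each word in the email.
    prob = p_class[c]
    for word in email_words:
        prob *= p_word_given_class[c][word]
    return prob

scores = {c: numerator(c) for c in p_class}
print(scores, max(scores, key=scores.get))  # the class with the larger numerator wins
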
171
Q

How do we estimate each term in classification?

A

based on the training data

172
Q

What is spam filtering?

A

separating spam from ham (good, non-spam email)

173
Q

What is the bag of words model?

A

a model that ignores the location (order) of words within an email and the frequency of words, keeping track only of which words appear

174
Q

What is smoothing?

A

adjusting estimated probabilities (e.g., by adding 1 to each count) so that previously unseen words or feature values don’t get probability zero