Final Exam Prep Flashcards

(174 cards)

1
Q

What is SLR?

A

simple linear regression

2
Q

What is the equation for simple linear regression?

A

H(x) = w_0 + w_1x

3
Q

What is the equation for the constant model?

A

H(x) = h

4
Q

What is the ith observation?

A

(x_i, y_i)

5
Q

What is H?

A

the hypothesis function, used to make predictions

6
Q

What are summary statistics?

A

summarize a collection of numbers

7
Q

What are examples of summary statistics?

A

mean, median

8
Q

What is a loss function?

A

quantifies how bad a prediction is for a single data point

9
Q

What is squared loss?

A

L_sq(y_i, H(x_i)) = (y_i - H(x_i))^2

10
Q

What does y_i represent?

A

actual values

11
Q

What does H(x_i) represent?

A

predicted values

12
Q

What is R?

A

the average loss for all points

13
Q

What is another name for R?

A

risk

14
Q

What does MSE stand for?

A

mean squared error

15
Q

What is the equation for MSE?

A

R_sq(h) = (1/n) Σ_{i=1}^{n} (y_i - h)^2

16
Q

What is our goal when calculating the MSE?

A

to find the h that minimizes R_sq(h)

17
Q

What is the definition for MSE?

A

the average squared loss

18
Q

What is the h in H(x) = h?

A

it is a parameter

19
Q

If c(x) = a(x) + b(x), what is the derivative?

A

d/dx c(x) = d/dx a(x) + d/dx b(x)

20
Q

What is the value that minimizes MSE?

A

the mean

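The claim that the mean minimizes MSE can be checked numerically. A minimal sketch (the data values below are made up for illustration):

```python
# Sketch: numerically check that the mean minimizes
# R_sq(h) = (1/n) * sum((y_i - h)^2). Data values are made up.
def mse(h, ys):
    return sum((y - h) ** 2 for y in ys) / len(ys)

ys = [2.0, 3.0, 7.0, 8.0]
mean = sum(ys) / len(ys)  # 5.0

# Try a grid of candidate h values; none should beat the mean.
candidates = [i / 10 for i in range(101)]
best = min(candidates, key=lambda h: mse(h, ys))
```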
21
Q

What is the definition of convexity?

A

a function is convex if the line segment between any two points on its graph lies on or above the graph

22
Q

What steps does the modeling recipe consist of?

A

choose a model, choose a loss function, minimize average loss to find optimal model parameters.

23
Q

What does MAE stand for?

A

mean absolute error

24
Q

What does the MAE calculate?

A

average absolute loss

25
What is the equation for MAE?
R_abs(h) = (1/n) Σ_{i=1}^{n} |y_i - h|
26
What is the h that minimizes MAE?
the median
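Similarly, the claim that the median minimizes MAE can be verified numerically. A minimal sketch with made-up data (odd n, so the median is unique):

```python
# Sketch: numerically check that the median minimizes
# R_abs(h) = (1/n) * sum(|y_i - h|). Data values are made up.
def mae(h, ys):
    return sum(abs(y - h) for y in ys) / len(ys)

ys = [1.0, 2.0, 9.0]               # odd n, so the median is unique
median = sorted(ys)[len(ys) // 2]  # 2.0

candidates = [i / 10 for i in range(101)]
best = min(candidates, key=lambda h: mae(h, ys))
```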
27
What can we say about our data if the h* minimizing R_abs is not unique?
n is even
28
How do we guarantee a unique h* minimizing R_abs?
n has to be odd
29
_____ is sensitive to outliers!
mean
30
_____ is robust to outliers!
median
31
What is empirical risk minimization?
formal name for minimizing average loss
32
What minimizes L_infinity loss?
the midrange
33
What minimizes 0-1 loss?
the mode!
34
What is a feature?
an attribute of the data (columns)
35
What type of values can features be?
numerical, categorical, boolean
36
What happens when we make MSE zero?
we are overfitting to the data
37
Why is overfitting to our data bad?
because we want our model to generalize well to unseen data and make good predictions in the real world
38
What is an example of a quadratic regression equation?
H(x) = w_0 + w_1x^2
39
What is an example of an exponential regression equation?
H(x) = w_0e^w_1x
40
What does w_0 represent?
intercept
41
What does w_1 represent?
slope
42
What is the equation for the loss surface?
R_sq(w_0, w_1) = (1/n) Σ_{i=1}^{n} (y_i - (w_0 + w_1 x_i))^2
43
What is the least squares solution for w_1 (the slope)?
w_1* = (Σ_{i=1}^{n} (x_i - xbar)(y_i - ybar)) / (Σ_{i=1}^{n} (x_i - xbar)^2)
44
What is the least squares solution for w_0 (the intercept)?
w_0* = ybar - w_1* xbar
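A quick sketch applying the least squares formulas, with data chosen to lie exactly on y = 1 + 2x so the fitted slope and intercept are easy to verify:

```python
# Sketch: fit simple linear regression with the least squares formulas
#   w_1* = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2)  (slope)
#   w_0* = ybar - w_1* * xbar                                   (intercept)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
w1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
w0 = ybar - w1 * xbar
```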
45
What is the resulting line for the least squares solution?
the regression line
46
What does "fitting to the data" mean?
the process of finding optimal parameters
47
What equation would we use to make predictions?
H*(x) = w_0* + w_1*x
48
What is r?
the correlation coefficient
49
What does the correlation coefficient measure?
the strength of the linear association of two variables
50
What is the range for r?
-1 <= r <= 1
51
What can we tell about our data if r is negative?
there is a negative association; the line slopes downward from left to right
52
What can we tell about our data if r is positive?
there is a positive association; the line slopes upward from left to right
53
What happens as r gets closer to +1 or -1?
the linear association gets stronger
54
How do we convert a value to standard units?
(x_i - mean of x) / (standard deviation of x)
55
How do we calculate r?
r = (1/n) Σ_{i=1}^{n} ((x_i - mean of x) / SD of x) * ((y_i - mean of y) / SD of y)
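The formula for r (the average product of x and y in standard units) can be sketched directly; the data below is made up and perfectly linear, so r should come out to 1:

```python
import math

# Sketch: r = average of (x in standard units) * (y in standard units).
# Data is made up and perfectly linear, so r should be 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
n = len(xs)

def standard_units(vals):
    m = sum(vals) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in vals) / n)
    return [(v - m) / sd for v in vals]

xu, yu = standard_units(xs), standard_units(ys)
r = sum(a * b for a, b in zip(xu, yu)) / n
```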
56
What happens as y spreads out?
SD of y increases and the slope gets steeper
57
What happens as x gets more spread out?
SD of x increases and slope gets more shallow
58
What is the equivalent of finding models that minimize MSE in terms of r?
finding models that maximize r^2
59
What is the equation for R_sq(w_0*, w_1*)?
(SD of y)^2 * (1-r^2)
60
What is a more flexible version of the constant model?
the simple linear regression model
61
What can be said if A and B are two matrices?
in general, AB != BA (matrix multiplication is not commutative)
62
What is a vector?
an ordered collection of n numbers in R^n
63
What is another name for length of a vector?
the l_2 norm
64
What is the equation for the length of a vector?
||v|| = sqrt(v_1^2 + v_2^2 + ... + v_n^2)
65
What do vectors have?
a magnitude and a direction
66
What is the dot product of two vectors?
u · v = u_1 v_1 + u_2 v_2 + ... + u_n v_n
67
What is the result of the dot product?
a scalar
68
What is a scalar?
a single number
69
What is another way we can calculate the dot product?
||u||||v|| cos theta
70
When are two vectors orthogonal?
if and only if their dot product is 0
71
What does it mean if the dot product of two vectors is 0?
the angle between them is 90 degrees (they are orthogonal)
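The two views of the dot product (elementwise sum vs. ||u|| ||v|| cos theta) can be connected in a few lines; u and v below are made-up vectors chosen so that u · v = 0:

```python
import math

# Sketch: the dot product two ways, and orthogonality.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(w):
    return math.sqrt(dot(w, w))

u = [1.0, 2.0]
v = [-2.0, 1.0]   # chosen so that u . v = 0

cos_theta = dot(u, v) / (norm(u) * norm(v))
angle_deg = math.degrees(math.acos(cos_theta))  # should be 90 degrees
```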
72
How do we get the sum of two vectors?
using an element wise sum
73
If c is a scalar and we are trying to multiply it with a vector, how do we compute that?
multiply each element by the scalar
74
What is a linear combination?
any vector of the form a_1v_1 + a_2v_2 + ... + a_d v_d
75
What is a span?
the set of all vectors that can be created using linear combinations of those vectors
76
What is the vector in span x that is closest to y?
the orthogonal projection of y onto span(x)
77
What is the equation for the projection error?
e = y - wx
78
How do we get the w* for the projection error?
w* = (x · y) / (x · x)
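A sketch of the projection formula: with x as the all-ones vector (the constant model), w* reduces to the mean of y, and the error vector comes out orthogonal to x:

```python
# Sketch: orthogonal projection of y onto span(x), w* = (x . y) / (x . x)
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [1.0, 1.0, 1.0]   # all-ones vector (the constant model)
y = [2.0, 3.0, 7.0]

w_star = dot(x, y) / dot(x, x)                  # 4.0, the mean of y
e = [yi - w_star * xi for xi, yi in zip(x, y)]  # error vector y - w*x
```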
79
What is the equation for the error vector?
||e|| = ||y - wx||
80
What does w*x represent?
the orthogonal projection of y onto span x
81
What is an n x d matrix?
a table of numbers with n rows and d columns
82
When can we add two matrices?
when they have the same dimensions
83
If A(B + C) then
AB + AC
84
If (AB)C then
A(BC)
85
If (A + B)^T then
A^T + B^T
86
If (AB)^T then
B^T A^T
87
How can a vector be explained as a matrix?
it is a matrix with n rows and 1 column
88
How can we think of matrix-vector multiplication?
a linear combination of the columns of A using the weights in v
89
What does the span of the columns of X consist of?
all of the vectors that can be written in the form Xw
90
What does the w represent in Xw?
the weights, also known as the parameter vector
91
What are the normal equations?
X^TXw* = X^Ty
92
What is w* for the normal equations?
(X^TX)^-1X^Ty
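A sketch of solving the normal equations by hand for a two-column design matrix [all ones, x]; the 2x2 system X^TX w* = X^Ty is solved with Cramer's rule, and the made-up data lies exactly on y = 2x:

```python
# Sketch: solve the normal equations X^T X w* = X^T y for a design
# matrix whose columns are [all ones, x]. The 2x2 system is solved
# with Cramer's rule. Data is made up and lies exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
n = len(xs)

# Entries of X^T X (a, b; c, d) and X^T y (p; q)
a, b = float(n), sum(xs)
c, d = sum(xs), sum(x * x for x in xs)
p, q = sum(ys), sum(x * y for x, y in zip(xs, ys))

det = a * d - b * c
w0 = (p * d - b * q) / det   # intercept
w1 = (a * q - p * c) / det   # slope
```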
93
When do the normal equations have a unique solution w*?
when X^TX is full rank (i.e., invertible)
94
What happens if X^TX is not full rank?
there are infinitely many solutions for w*
95
What is the observation vector?
vector of all observed values, y
96
What is a hypothesis vector?
vector of predicted values, h
97
What is the error vector?
the vector of all errors between the observed and predicted values, e
98
What is a design matrix?
a matrix where all the values for each feature are in columns and the first column is all ones
99
What is the equation for the error vector?
e_i = y_i - H(x_i)
100
What is the formula for the norm of a vector?
||v|| = sqrt(v_1^2 + v_2^2 + ... + v_n^2)
101
What is the parameter vector?
a vector with all the parameter values, i.e. slope, intercept
102
What is a function of a vector?
a function of multiple variables
103
What is the gradient with respect to w?
grad R_sq(w) = dR_sq/dw, the vector of partial derivatives of R_sq with respect to each parameter
104
What is multiple linear regression?
linear regression with multiple features
105
What is an augmented feature vector?
Aug(x) = [1, x_1, ..., x_d]^T; the feature vector with a 1 prepended so the intercept can be absorbed into the parameter vector
106
What is the prediction equation using the augmented feature vector?
H(x) = w · Aug(x)
107
What is a hypothesis function for multiple linear regression?
H(x) = w_1 (1/x_2) + w_2 sin(x) + w_3 e^x
108
What is feature engineering?
the process of creating new features out of existing information in our dataset
109
How many parameters can we have?
as many as we want as long as our hypothesis function is linear in params
110
Where is our minimum if the slope of the tangent line is positive?
to the left and we need to decrease t
111
Where is our minimum if the slope of the tangent line is negative?
to the right and we need to increase t
112
What is the equation for gradient descent?
t_1 = t_0 - alpha * (df/dt)(t_0)
113
What is t_0?
our initial guess
114
What is alpha?
the learning rate; the step size
115
How many times do we repeat gradient descent?
as many times as we can until convergence
116
What is gradient descent?
a method for finding the input to a function f that minimizes the function
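The update rule above can be sketched on a simple convex function; f(t) = (t - 3)^2 and the starting guess are made up for illustration:

```python
# Sketch: gradient descent on the convex function f(t) = (t - 3)^2,
# whose derivative is 2(t - 3). The minimum is at t = 3.
def dfdt(t):
    return 2 * (t - 3)

t = 0.0        # initial guess t_0
alpha = 0.1    # learning rate (step size)
for _ in range(200):
    t = t - alpha * dfdt(t)   # t_{i+1} = t_i - alpha * f'(t_i)
```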
117
What is a numerical method?
a technique for approximating the solution to a mathematical problem
118
How can we tell if a function is convex?
if the line segment between any two points on its graph never goes below the graph
119
What are examples of convex functions?
|x - 4|, e^x, (x -3)^24
120
What is an example of a function that is not convex?
sqrt(x-1)
121
What does it mean for a function to be concave?
it is the negative of a convex function
122
What is the second derivative test?
if a function is twice differentiable: when its second derivative is greater than or equal to 0 everywhere it is convex, and when its second derivative is less than or equal to 0 everywhere it is concave
123
What happens to the gradient descent if f(t) is convex and differentiable?
it converges to a global minimum as long as the step size is small enough
124
How can we tell if gradient descent has converged?
when the derivative is 0
125
What happens if the derivative of a convex function is 0 in only one place?
it has a global minimum there
126
Can gradient descent work on nonconvex functions?
yes but it is not guaranteed to find a global minimum
127
What is an experiment?
some process whose outcome is random
128
What are some examples of an experiment?
flipping a coin, rolling a die
129
What is a set?
an unordered collection of items
130
How are sets denoted?
using {}
131
What does |A| denote?
number of elements in set A
132
What is a sample space?
finite or countable set of possible outcomes of an experiment
133
What is a probability distribution?
assignment of probabilities to outcomes in S
134
What must the probability be for each s in S?
0 <= p(s) <= 1
135
What do the probabilities have to sum up to?
1
136
What can be said about the sample spaces and probability distributions for flipping a fair coin and a biased coin?
they have the same sample spaces but different probability distributions
137
What is an Event E?
a subset of the sample space
138
What is a uniform distribution?
it assigns the probability of 1/n to each element of S
139
What is the addition rule if A and B are mutually exclusive?
P(A or B) = P(A u B) = P(A) + P(B)
140
What does it mean for two events to be mutually exclusive?
they cannot happen simultaneously
141
What is the addition rule if two events are not mutually exclusive?
P(A u B) = P(A) + P(B) - P(A and B)
142
What is the multiplication rule?
P(A and B) = P(A n B) = P(A) * P(B|A)
143
What is the complement rule?
P(not A) = 1 - P(A)
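The addition and complement rules can be checked by enumeration on a fair die (a uniform distribution over six outcomes); the events A and B below are made up:

```python
from fractions import Fraction

# Sketch: check the addition and complement rules on a fair die.
S = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event), len(S))

A = {2, 4, 6}   # even
B = {4, 5, 6}   # greater than 3 (NOT mutually exclusive with A)

lhs = P(A | B)                      # P(A u B)
rhs = P(A) + P(B) - P(A & B)        # addition rule, general form
comp = P(S - A)                     # P(not A)
```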
144
What is conditional probability?
P(B|A) means "the probability that B happens, given that A happened."
145
What happens if A and B are independent from one another?
P(B|A) = P(B)
146
What is the intuition behind independency?
A and B are independent if knowing that A happened gives you no additional information about B and vice versa
147
What is Simpson's Paradox?
inverse trend when data is joined versus analyzing numbers individually
148
What does it mean to sample with replacement?
drawing one element uniformly at random and returning it to the list, repeat
149
What does it mean to sample without replacement?
drawing one element uniformly at random, repeat
150
What are the traits of sequences?
list, order matters, repetitions allowed (with replacement), elements in listed order
151
What are the traits of sets?
collection of elements, order does not matter, no repetitions allowed (without replacement), elements in no particular order
152
What is the symbol for permutations?
P(n, k)
153
What are the traits for permutations?
order matters, no repetitions allowed, counts the # of sequences of k distinct elements chosen from n possible elements
154
What are the traits for combinations?
order does not matter, no repetitions allowed, counts the # of sets of size k chosen from n possible elements
155
What is the symbol for combinations?
C(n, k)
156
What is the equation for permutations?
n! / (n - k)!
157
What is the equation for combinations?
n! / k! (n - k)!
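The two counting formulas translate directly into code; for instance P(5, 2) = 20 ordered sequences while C(5, 2) = 10 unordered sets:

```python
import math

# Sketch: P(n, k) and C(n, k) from the factorial formulas.
def perm(n, k):
    return math.factorial(n) // math.factorial(n - k)

def comb(n, k):
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
```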
158
What does Bayes' Theorem follow from?
the multiplication rule and the definition of conditional probability
159
What is the first equation for Bayes' Theorem?
P(B|A) = P(A|B) * P(B) / P(A)
160
What is the final equation for Bayes' Theorem?
P(B|A) = P(A|B) P(B) / (P(A|B) P(B) + P(A|not B) P(not B))
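A worked sketch of Bayes' Theorem with made-up probabilities (B is the event of interest, A the observed evidence; the denominator expands P(A) by total probability):

```python
# Sketch: Bayes' theorem with made-up numbers.
p_b = 0.01                 # prior P(B)
p_a_given_b = 0.9          # P(A|B)
p_a_given_not_b = 0.05     # P(A|not B)

# Total probability: P(A) = P(A|B) P(B) + P(A|not B) P(not B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
p_b_given_a = p_a_given_b * p_b / p_a   # updated belief P(B|A)
```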
161
What does Bayes' Theorem describe?
how to update the probability of one event given that another has occurred
162
What is P(A) in Bayes' Theorem?
prior belief that A happens
163
What is P(A|B) in Bayes' Theorem?
our updated belief that A happens, now that we know B happens
164
What is an example of when updating our beliefs does not matter?
flipping a coin and getting heads does not affect the outcome of your second toss
165
What probability does an event have if both are mutually exclusive and independent?
at least one of them must have a zero probability
166
Can events be independent in the real world?
almost never
167
What is conditional independence?
events that become independent (or dependent) upon learning some new information
168
What is classification?
making predictions based on examples
169
What is Bayes' Theorem for classification?
P(class|features) = P(features|class) * P(class) / P(features)
170
How do we select which classification wins?
whichever has the larger numerator
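Comparing numerators can be sketched in a few lines; all the class priors and likelihood estimates below are made up:

```python
# Sketch: pick the class with the larger Bayes numerator
# P(class) * P(features | class). All estimates are made up.
priors = {"spam": 0.4, "ham": 0.6}
p_features_given = {"spam": 0.7, "ham": 0.1}

numerators = {c: priors[c] * p_features_given[c] for c in priors}
winner = max(numerators, key=numerators.get)
```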
171
How do we estimate each term in classification?
based on the training data
172
What is spam filtering?
classifying email as spam or ham (good, non-spam email)
173
What is the bag of words model?
a model that ignores the location (order) of words within an email and their frequency
174
What is smoothing?
adjusting probability estimates so that previously unseen data does not get probability zero