Final Exam Prep Flashcards
What is SLR?
simple linear regression
What is the equation for simple linear regression?
H(x) = w_0 + w_1x
What is the equation for the constant model?
H(x) = h
What is the ith observation?
(x_i, y_i)
What is H?
the hypothesis function, used to make predictions
What are summary statistics?
summarize a collection of numbers
What are examples of summary statistics?
mean, median
What is a loss function?
quantifies how bad a prediction is for a single data point
What is squared loss?
L_sq(y_i, H(x_i)) = (y_i - H(x_i))^2
What does y_i represent?
actual values
What does H(x_i) represent?
predicted values
What is R?
the average loss for all points
What is another name for R?
risk
What does MSE stand for?
mean squared error
What is the equation for MSE?
R_sq(h) = (1/n) * sum_{i=1}^{n} (y_i - h)^2
What is our goal when calculating the MSE?
to find the h that minimizes R_sq(h)
What is the definition for MSE?
the average squared loss
What is the h in H(x) = h?
it is a parameter
If c(x) = a(x) + b(x), what is the derivative?
d/dx c(x) = d/dx a(x) + d/dx b(x)
What is the value that minimizes MSE?
the mean
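A minimal sketch of the claim above, assuming a small made-up dataset `ys` and a hypothetical helper `mse` (neither comes from the cards). It compares the mean against a grid of other constant predictions:

```python
from statistics import mean


def mse(h, ys):
    """Mean squared error R_sq(h) of the constant prediction h."""
    return sum((y - h) ** 2 for y in ys) / len(ys)


# Hypothetical data, chosen only for illustration.
ys = [2, 3, 5, 8, 12]
h_star = mean(ys)

# Candidate constant predictions 0.0, 0.1, ..., 15.0.
candidates = [i / 10 for i in range(0, 151)]
best = min(candidates, key=lambda h: mse(h, ys))

print(h_star, best, mse(h_star, ys))
```

On this grid, the candidate with the smallest MSE should coincide with the mean of `ys`.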
What is the definition of convexity?
a function is convex if, for every pair of points on its graph, the line segment connecting them lies on or above the graph
What steps does the modeling recipe consist of?
choose a model, choose a loss function, minimize average loss to find optimal model parameters.
What does MAE stand for?
mean absolute error
What does the MAE calculate?
average absolute loss
What is the equation for MAE?
R_abs(h) = (1/n) * sum_{i=1}^{n} |y_i - h|
What is the h that minimizes MAE?
the median
What can we say about our data if the h* that minimizes R_abs(h) is not unique?
n is even (any h between the two middle values minimizes R_abs)
When is the h* that minimizes R_abs(h) guaranteed to be unique?
when n is odd
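A small sketch of the MAE cards above, assuming made-up datasets `odd` and `even` and a hypothetical helper `mae`. With odd n the median is the single best constant prediction; with even n every value between the two middle values gives the same MAE:

```python
from statistics import median


def mae(h, ys):
    """Mean absolute error R_abs(h) of the constant prediction h."""
    return sum(abs(y - h) for y in ys) / len(ys)


# Hypothetical data, chosen only for illustration.
odd = [1, 4, 6, 9, 20]   # n = 5, median is 6
even = [1, 4, 6, 9]      # n = 4, middle values are 4 and 6

# Even n: h = 4, 5 and 6 all lie between the middle values.
mae_at_4 = mae(4, even)
mae_at_5 = mae(5, even)
mae_at_6 = mae(6, even)

# Odd n: moving away from the median in either direction increases the MAE.
mae_at_median = mae(median(odd), odd)
print(mae_at_4, mae_at_5, mae_at_6, mae_at_median)
```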
_____ is sensitive to outliers!
mean
_____ is robust to outliers!
median
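To see the outlier cards in action, here is a short sketch with made-up numbers (an assumption, not course data). One value is replaced by an extreme one and the change in each summary statistic is measured:

```python
from statistics import mean, median

# Hypothetical data, chosen only for illustration.
ys = [10, 12, 13, 15, 20]
ys_outlier = [10, 12, 13, 15, 200]  # last value replaced by an outlier

mean_shift = mean(ys_outlier) - mean(ys)
median_shift = median(ys_outlier) - median(ys)

print(mean_shift, median_shift)
```

The mean moves a long way toward the outlier, while the median does not move at all in this example.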
What is empirical risk minimization?
formal name for minimizing average loss
What minimizes L_infinity loss?
the midrange
What minimizes 0-1 loss?
the mode!
What is a feature?
an attribute of the data (columns)
What type of values can features be?
numerical, categorical, boolean
What happens when we make MSE zero?
we are overfitting to the data
Why is overfitting to our data bad?
because we want our model to generalize well to unseen data and make good predictions in the real world
What is an example of a quadratic regression equation?
H(x) = w_0 + w_1x^2
What is an example of an exponential regression equation?
H(x) = w_0 e^(w_1 x)
What does w_0 represent?
intercept
What does w_1 represent?
slope
What is the equation for the loss surface?
R_sq(w_0, w_1) = (1/n) * sum_{i=1}^{n} (y_i - (w_0 + w_1 x_i))^2
What is the least squares solution for w_1?
w_1 = (sum_{i=1}^{n} (x_i - xbar)(y_i - ybar)) / (sum_{i=1}^{n} (x_i - xbar)^2)
What is the least squares solution for w_0?
w_0 = ybar - w_1 * xbar
What is the resulting line for the least squares solution?
the regression line
What does “fitting to the data” mean?
the process of finding optimal parameters
What equation would we use to make predictions?
H(x) = w_0 + w_1*x
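A minimal sketch of fitting and predicting with simple linear regression, using the closed-form least squares formulas for the slope w_1 and the intercept w_0. The names `fit_line` and `predict` and the data are hypothetical, chosen only for illustration:

```python
def fit_line(xs, ys):
    """Return (w0, w1): the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of products of deviations over sum of squared x deviations.
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: forces the line through the point (x_bar, y_bar).
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
w0, w1 = fit_line(xs, ys)


def predict(x):
    """Hypothesis function H(x) = w0 + w1 * x with the fitted parameters."""
    return w0 + w1 * x


print(w0, w1, predict(10))
```

Because the made-up points lie exactly on a line, the fitted parameters should recover that line's intercept and slope.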
What is r?
the correlation coefficient
What does the correlation coefficient measure?
the strength of the linear association of two variables
What is the range for r?
-1 <= r <= 1
What can we tell about our data if r is negative?
there is a negative association; the points trend downward from left to right
What can we tell about our data if r is positive?
there is a positive association; the points trend upward from left to right
What happens the closer r is to +1 or -1?
the linear association is stronger; the points cluster more tightly around a line
How do we convert x_i to standard units?
(x_i - mean of x) / (SD of x)
How do we calculate r?
r = (1/n) * sum_{i=1}^{n} ((x_i - mean of x) / (SD of x)) * ((y_i - mean of y) / (SD of y))
What happens as y spreads out?
SD of y increases and the slope gets steeper
What happens as x gets more spread out?
SD of x increases and the slope gets shallower
What is the equivalent of finding models that minimize MSE in terms of r?
finding models that maximize r^2
What is the MSE of the regression line, R_sq(w_0*, w_1*), in terms of r?
(SD of y)^2 * (1 - r^2)
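The cards on r can be checked numerically. The sketch below assumes made-up data and hypothetical helpers `sd` and `corr`. It also assumes the standard fact that the least squares slope equals r * (SD of y) / (SD of x), which is what makes the slope steeper as y spreads out and shallower as x spreads out:

```python
from math import sqrt


def sd(vals):
    """Population standard deviation (divides by n)."""
    m = sum(vals) / len(vals)
    return sqrt(sum((v - m) ** 2 for v in vals) / len(vals))


def corr(xs, ys):
    """Correlation coefficient: the mean product of x and y in standard units."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sx, sy = sd(xs), sd(ys)
    return sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
               for x, y in zip(xs, ys)) / n


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

r = corr(xs, ys)
w1 = r * sd(ys) / sd(xs)                          # slope of the regression line
w0 = sum(ys) / len(ys) - w1 * sum(xs) / len(xs)   # intercept

# MSE of the regression line, computed directly and via the r identity.
mse_line = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
mse_from_r = sd(ys) ** 2 * (1 - r ** 2)

print(r, mse_line, mse_from_r)
```

The two MSE values should agree up to floating-point rounding.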
What is a more flexible version of the constant model?
the simple linear regression model
What can be said about the products AB and BA if A and B are two matrices?
in general, AB != BA (matrix multiplication is not commutative)
What is a vector?
an ordered collection of n numbers; an element of R^n
What is another name for length of a vector?
the l_2 norm
What is the equation for the length of a vector?
||v|| = sqrt(v_1^2 + v_2^2 + … + v_n^2)
What do vectors have?
a magnitude and a direction
What is the dot product of two vectors?
u · v = u_1 v_1 + u_2 v_2 + … + u_n v_n
What is the result of the dot product?
a scalar
What is a scalar?
a single number
What is another way we can calculate the dot product?
u · v = ||u|| ||v|| cos(theta), where theta is the angle between u and v
When are two vectors orthogonal?
if and only if their dot product is 0
What does it mean if the dot product of two vectors is 0?
the vectors are orthogonal; the angle between them is 90 degrees
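A short sketch of the dot product, norm, and angle cards, using made-up vectors and hypothetical helpers `dot`, `norm`, and `angle_degrees`. The angle is recovered from u · v = ||u|| ||v|| cos(theta):

```python
from math import acos, degrees, sqrt


def dot(u, v):
    """Dot product: sum of element-wise products (a scalar)."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Length (l_2 norm) of a vector."""
    return sqrt(dot(v, v))


def angle_degrees(u, v):
    """Angle between u and v, from u . v = ||u|| ||v|| cos(theta)."""
    return degrees(acos(dot(u, v) / (norm(u) * norm(v))))


# Hypothetical vectors, chosen so that their dot product is 0.
u = [3, 4]
v = [-4, 3]

print(dot(u, v), norm(u), angle_degrees(u, v))
```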
How do we get the sum of two vectors?
by adding them element-wise
If c is a scalar and we are trying to multiply it with a vector, how do we compute that?
multiply each element by the scalar
What is a linear combination?
any vector of the form a_1v_1 + a_2v_2 + … + a_d v_d
What is the span of a set of vectors?
the set of all vectors that can be created using linear combinations of those vectors
What is the vector in span(x) that is closest to y?
the orthogonal projection of y onto span(x)
What is the equation for the projection error?
e = y - wx
How do we get the w* that minimizes the length of the projection error?
w* = (x · y) / (x · x)
What is the equation for the length of the error vector?
||e|| = ||y - wx||
What does w*x represent?
the orthogonal projection of y onto span(x)
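The projection cards can be sketched as follows, with made-up vectors `x` and `y` and a hypothetical helper `dot`. The code computes w* = (x · y) / (x · x), forms the projection w*x, and checks that the error vector is orthogonal to x:

```python
def dot(u, v):
    """Dot product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical vectors, chosen only for illustration.
x = [1, 2, 2]
y = [3, 0, 3]

w_star = dot(x, y) / dot(x, x)
projection = [w_star * xi for xi in x]              # w* x, the closest vector in span(x)
error = [yi - pi for yi, pi in zip(y, projection)]  # e = y - w* x


def squared_error_length(w):
    """Squared length of the error vector y - w x for any scalar w."""
    return sum((yi - w * xi) ** 2 for xi, yi in zip(x, y))


print(w_star, projection, error, dot(error, x))
```

Any other choice of w gives a longer error vector, which is what "closest" means here.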
What is an n x d matrix?
a table of numbers with n rows and d columns
When can we add two matrices?
when they have the same dimensions
What does A(B + C) equal?
AB + AC
What does (AB)C equal?
A(BC)
What does (A + B)^T equal?
A^T + B^T
What does (AB)^T equal?
B^T A^T
How can a vector be explained as a matrix?
it is a matrix with n rows and 1 column
How can we think of matrix-vector multiplication?
a linear combination of the columns of A using the weights in v
What does the span of the columns of X consist of?
all of the vectors that can be written in the form Xw
What does the w represent in Xw?
the weights, also known as the parameter vector
What are the normal equations?
X^TXw* = X^Ty
What is w* for the normal equations?
w* = (X^T X)^(-1) X^T y
When do the normal equations have a unique solution w*?
when X^T X is invertible (full rank)
What happens if X^T X is not full rank?
there are infinitely many solutions w*
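A minimal sketch of solving the normal equations X^T X w* = X^T y, restricted (as a simplifying assumption) to a design matrix with exactly two columns so that the 2x2 inverse can be written out by hand. The function name `normal_equations_2col` and the data are hypothetical:

```python
def normal_equations_2col(X, y):
    """Solve X^T X w = X^T y for a design matrix X with two columns."""
    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]].
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    # Entries of the vector X^T y = [p, q].
    p = sum(row[0] * yi for row, yi in zip(X, y))
    q = sum(row[1] * yi for row, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        # X^T X is not full rank, so there is no unique w*.
        raise ValueError("X^T X is not invertible; w* is not unique")
    # Apply the inverse of [[a, b], [b, d]] to [p, q].
    return [(d * p - b * q) / det, (a * q - b * p) / det]


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3]
y = [1, 3, 5, 7]
X = [[1, x] for x in xs]   # design matrix: a column of ones, then the feature

w_star = normal_equations_2col(X, y)
print(w_star)
```

For more than two columns, a linear algebra library would normally replace the hand-written inverse.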
What is the observation vector?
vector of all observed values, y
What is a hypothesis vector?
vector of predicted values, h
What is the error vector?
the vector of all errors between the observed and predicted values, e
What is a design matrix?
a matrix with one row per observation and one column per feature, plus a first column of all ones
What is the equation for the error vector?
e = y - h, where the ith entry is e_i = y_i - H(x_i)
What is the formula for the norm of a vector?
||v|| = sqrt(v_1^2 + v_2^2 + … + v_n^2)
What is the parameter vector?
a vector with all the parameter values, i.e. slope, intercept
What is a function of a vector?
a function of multiple variables
What is the gradient with respect to w?
∇R_sq(w) = dR_sq/dw, the vector of partial derivatives of R_sq with respect to each parameter
What is multiple linear regression?
linear regression with multiple features
What is an augmented feature vector?
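the feature vector x with a 1 added as its first entry, Aug(x) = (1, x^(1), x^(2), …, x^(d)); row i of the design matrix is Aug(x_i)^T
What is the hypothesis function in terms of the augmented feature vector?
H(x) = w · Aug(x)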
What is an example of a hypothesis function for multiple linear regression?
H(x) = w_1 (1/x_2) + w_2 sin(x) + w_3 e^x
What is feature engineering?
the process of creating new features out of existing information in our dataset
How many parameters can we have?
as many as we want, as long as our hypothesis function is linear in the parameters
Where is our minimum if the slope of the tangent line is positive?
to the left, so we need to decrease t
Where is our minimum if the slope of the tangent line is negative?
to the right, so we need to increase t
What is the equation for gradient descent?
t_1 = t_0 - alpha * df/dt(t_0)
What is t_0?
our initial guess
What is alpha?
the learning rate; the step size
How many times do we repeat gradient descent?
until convergence, that is, until t changes by less than a small tolerance between steps
What is gradient descent?
a method for finding the input to a function f that minimizes the function
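A minimal sketch of one-dimensional gradient descent with the update t_next = t - alpha * df/dt(t). The function name `gradient_descent`, the stopping tolerance `tol`, the cap `max_iter`, and the example function f(t) = (t - 3)^2 + 1 are all assumptions made for illustration:

```python
def gradient_descent(df, t0, alpha, tol=1e-8, max_iter=10_000):
    """Repeat t <- t - alpha * df(t) until t barely changes."""
    t = t0
    for _ in range(max_iter):
        t_next = t - alpha * df(t)
        if abs(t_next - t) < tol:   # converged: the step is tiny
            return t_next
        t = t_next
    return t                        # gave up after max_iter steps


# Hypothetical convex function f(t) = (t - 3)^2 + 1 with derivative 2(t - 3).
def df(t):
    return 2 * (t - 3)


t_min = gradient_descent(df, t0=0.0, alpha=0.1)
print(t_min)
```

With this step size the iterates move steadily toward the minimizer at t = 3. A much larger alpha (for example 1.5 here) would overshoot and diverge.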
What is a numerical method?
a technique for approximating the solution to a mathematical problem
How can we tell if a function is convex?
if no line segment connecting two points on its graph ever goes below the graph
What are examples of convex functions?
|x - 4|, e^x, (x - 3)^24
What is an example of a function that is not convex?
sqrt(x-1)
What does it mean for a function to be concave?
it is the negative of a convex function
What is the second derivative test?
for a twice-differentiable function f: if f''(t) >= 0 for all t, f is convex; if f''(t) <= 0 for all t, f is concave
What happens to the gradient descent if f(t) is convex and differentiable?
it converges to a global minimum as long as the step size is small enough
How can we tell if gradient descent has converged?
when the derivative is (approximately) 0, so t stops changing
What can we say if a convex function's derivative is 0 in only one place?
that point is its unique global minimum
Can gradient descent work on nonconvex functions?
yes but it is not guaranteed to find a global minimum
What is an experiment?
some process whose outcome is random
What are some examples of an experiment?
flipping a coin, rolling a die
What is a set?
an unordered collection of items
How are sets denoted?
using {}
What does |A| denote?
number of elements in set A
What is a sample space?
finite or countable set of possible outcomes of an experiment
What is a probability distribution?
assignment of probabilities to outcomes in S
What must the probability be for each s in S?
0 <= p(s) <= 1
What do the probabilities have to sum up to?
1
What can be said about the sample spaces and probability distributions for flipping a fair coin and a biased coin?
they have the same sample spaces but different probability distributions
What is an Event E?
a subset of the sample space
What is a uniform distribution?
it assigns the probability of 1/n to each element of S
What is the addition rule if A and B are mutually exclusive?
P(A or B) = P(A u B) = P(A) + P(B)
What does it mean for two events to be mutually exclusive?
they cannot happen simultaneously
What is the addition rule if two events are not mutually exclusive?
P(A u B) = P(A) + P(B) - P(A and B)
What is the multiplication rule?
P(A and B) = P(A n B) = P(A) * P(B|A)
What is the complement rule?
P(not A) = 1 - P(A)
What is conditional probability?
P(B|A) means “the probability that B happens, given that A happened.”
What happens if A and B are independent from one another?
P(B|A) = P(B)
What is the intuition behind independence?
A and B are independent if knowing that A happened gives you no additional information about B and vice versa
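The probability rules above can be checked by brute-force enumeration. This sketch assumes two fair six-sided dice (a uniform distribution over 36 outcomes) and two events chosen for illustration. The helper `prob` and the event functions are hypothetical names:

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of rolls of two fair dice.
S = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event) under the uniform distribution on S."""
    return Fraction(sum(1 for s in S if event(s)), len(S))


def A(s):
    return s[0] == 6           # the first die shows 6


def B(s):
    return s[0] + s[1] == 7    # the two dice sum to 7


def A_and_B(s):
    return A(s) and B(s)


def A_or_B(s):
    return A(s) or B(s)


def not_A(s):
    return not A(s)


p_b_given_a = prob(A_and_B) / prob(A)   # conditional probability P(B|A)

print(prob(A), prob(B), prob(A_and_B), prob(A_or_B), p_b_given_a)
```

In this example P(B|A) equals P(B), so the two events are independent even though they are not mutually exclusive.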
What is Simpson’s Paradox?
a trend that appears within each group reverses when the groups are combined
What does it mean to sample with replacement?
drawing one element uniformly at random and returning it to the list, repeat
What does it mean to sample without replacement?
drawing one element uniformly at random and not returning it to the list, repeat
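A brief sketch of both sampling schemes using Python's standard `random` module. The list `items` and the seed are arbitrary choices for illustration:

```python
import random

items = ["a", "b", "c", "d", "e"]
rng = random.Random(0)   # seeded only so the run is repeatable

# With replacement: each draw is returned, so repeats are possible
# and we can draw more elements than the list contains.
with_replacement = rng.choices(items, k=8)

# Without replacement: each element can be drawn at most once,
# so k cannot exceed len(items).
without_replacement = rng.sample(items, k=3)

print(with_replacement, without_replacement)
```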
What are the traits of sequences?
list, order matters, repetitions allowed (with replacement), elements in listed order
What are the traits of sets?
collection of elements, order does not matter, no repetitions allowed (without replacement), elements in no particular order
What is the symbol for permutations?
P(n, k)
What are the traits for permutations?
order matters, no repetitions allowed, counts the # of sequences of k distinct elements chosen from n possible elements
What are the traits for combinations?
order does not matter, no repetitions allowed, counts the # of sets of size k chosen from n possible elements
What is the symbol for combinations?
C(n, k)
What is the equation for permutations?
n! / (n - k)!
What is the equation for combinations?
n! / (k! (n - k)!)
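The two counting formulas can be checked against explicit enumeration. This sketch assumes Python 3.8 or later (for `math.perm` and `math.comb`) and picks n = 5, k = 3 arbitrarily:

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

n, k = 5, 3
items = range(n)

# Enumerate every sequence (order matters) and every set (order ignored).
num_sequences = len(list(permutations(items, k)))
num_sets = len(list(combinations(items, k)))

# The same counts from the formulas on the cards.
p_formula = factorial(n) // factorial(n - k)
c_formula = factorial(n) // (factorial(k) * factorial(n - k))

print(num_sequences, perm(n, k), p_formula)
print(num_sets, comb(n, k), c_formula)
```

Each set of size k corresponds to k! different sequences, which is why C(n, k) = P(n, k) / k!.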
What does Bayes’ Theorem follow from?
the multiplication rule, or equivalently the definition of conditional probability
What is the first equation for Bayes’ Theorem?
P(B|A) = P(A|B) * P(B) / P(A)
What is the final equation for Bayes’ Theorem?
P(B|A) = P(A|B) * P(B) / (P(B) * P(A|B) + P(not B) * P(A|not B))
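A worked example of the expanded form of Bayes' Theorem, where the denominator P(A) is written as P(B) * P(A|B) + P(not B) * P(A|not B). The three input probabilities are made up for illustration, and `Fraction` is used so the arithmetic is exact:

```python
from fractions import Fraction

# Hypothetical inputs, chosen only for illustration.
p_b = Fraction(1, 100)             # P(B)
p_a_given_b = Fraction(9, 10)      # P(A|B)
p_a_given_not_b = Fraction(1, 20)  # P(A|not B)

# Denominator: P(A) = P(B) P(A|B) + P(not B) P(A|not B).
p_a = p_b * p_a_given_b + (1 - p_b) * p_a_given_not_b

# Bayes' Theorem: P(B|A) = P(A|B) P(B) / P(A).
p_b_given_a = p_a_given_b * p_b / p_a

print(p_a, p_b_given_a)
```

Even though P(A|B) is high in this example, P(B|A) stays fairly small because P(B) is so small to begin with.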
What does Bayes’ Theorem describe?
how to update the probability of one event given that another has occurred
What is P(A) in Bayes’ Theorem, written as P(A|B) = P(B|A) * P(A) / P(B)?
prior belief that A happens
What is P(A|B) in Bayes’ Theorem?
our updated belief that A happens, now that we know B happens
What is an example of when updating our beliefs does not matter?
flipping a coin and getting a head will not affect what your second toss is
What must be true if two events are both mutually exclusive and independent?
at least one of them must have probability zero
Can events be independent in the real world?
almost never
What is conditional independence?
A and B are conditionally independent given C if P(A and B | C) = P(A|C) * P(B|C); events can be dependent overall but independent once C is known, and vice versa
What is classification?
predicting which class (category) something belongs to, based on labeled examples
What is Bayes’ Theorem for classification?
P(class|features) = P(features|class) * P(class) / P(features)
How do we select which classification wins?
the class with the largest numerator P(features|class) * P(class), since the denominator P(features) is the same for every class
How do we estimate each term in classification?
based on the training data
What is spam filtering?
classifying email as spam or ham (good, non-spam email)
What is the bag of words model?
represents an email by which words it contains, ignoring the location and frequency of words
What is smoothing?
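adjusting the estimated probabilities (for example, adding 1 to every count) so that feature values never seen in the training data do not get probability zero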
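The classification cards can be tied together in a small spam filter sketch. It assumes that words are conditionally independent given the class (the naive Bayes assumption, which the cards do not name), uses a bag of words that records only whether each word is present, and smooths by adding 1 to each count and 2 to each total. The training emails, vocabulary, and function names are all made up for illustration:

```python
def train(emails, labels, vocabulary):
    """Estimate P(class) and smoothed P(word present | class) from training data."""
    model = {}
    for c in set(labels):
        # Bag of words: each email becomes the set of words it contains.
        docs = [set(e.split()) for e, lab in zip(emails, labels) if lab == c]
        prior = len(docs) / len(emails)
        # Smoothing: add 1 to the count and 2 to the total (present / absent).
        word_prob = {w: (sum(w in d for d in docs) + 1) / (len(docs) + 2)
                     for w in vocabulary}
        model[c] = (prior, word_prob)
    return model


def classify(email, model, vocabulary):
    """Return the class with the largest numerator P(features|class) * P(class)."""
    words = set(email.split())
    scores = {}
    for c, (prior, word_prob) in model.items():
        score = prior
        for w in vocabulary:
            p = word_prob[w]
            score *= p if w in words else (1 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical training data.
emails = ["win money now", "win prize money", "meeting at noon", "lunch meeting now"]
labels = ["spam", "spam", "ham", "ham"]
vocabulary = ["win", "money", "now", "prize", "meeting", "noon", "lunch"]

model = train(emails, labels, vocabulary)
print(classify("win money", model, vocabulary))
print(classify("lunch meeting", model, vocabulary))
```

Without smoothing, the word "meeting" would have estimated probability 0 in spam, so any email containing it could never be classified as spam no matter what else it said.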