Week 1 Flashcards
How did Arthur Samuel describe machine learning?
The field of study that gives computers the ability to learn without being explicitly programmed. This is an older, informal definition.
What is a more modern definition of machine learning, and who gave it?
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E. Tom Mitchell
Give an example of machine learning.
Playing checkers:
E = the experience of playing many games of checkers.
T = the task of playing checkers.
P = the probability that the computer will win the next game.
What are the two broad classifications of machine learning problems?
supervised learning: we are given a data set and we already know what the correct output should look like.
unsupervised learning: allows us to approach the problem with little or no idea of what our data should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
How are supervised learning algorithms classified?
They are classified into regression and classification. In a regression problem we try to predict results within a continuous output, meaning that we map input variables to some continuous function (for example, predicting a house's price from its size). In a classification problem we instead try to predict results in a discrete output; in other words, we map input variables into discrete categories (for example, deciding whether an email is spam).
How is the data structured in unsupervised learning, and what is the feedback?
We can derive structure by clustering the data based on relationships among the variables in the data. There is no feedback based on the prediction results.
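For instance, clustering. A minimal sketch assuming scikit-learn is available; KMeans and the synthetic two-blob data are illustrative choices, not from the flashcards:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled data: two blobs, but the algorithm gets no labels or feedback.
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Structure is derived purely from relationships in the data itself.
kmeans = KMeans(n_clusters=2, n_init=10).fit(data)
print(kmeans.labels_)
```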
What is univariate linear regression?
Univariate linear regression (linear regression with one variable) is used when we want to predict a single output value y from a single input value x. We are doing supervised learning here, so we already have an idea of what the input/output cause and effect should be.
What is the equation of the hypothesis in univariate linear regression?
y_bar = h_theta(x) = theta_0 + theta_1 * x
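In code, the hypothesis is just the equation of a line. A minimal Python sketch; the function and variable names are my own:

```python
def h(theta_0, theta_1, x):
    """Univariate linear regression hypothesis: y_bar = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# With theta_0 = 1 and theta_1 = 2, the prediction for x = 3 is 7.
print(h(1, 2, 3))
```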
Which variables do we choose in our hypothesis?
We choose the values of the thetas. We try different values of theta_0 and theta_1 to find the values that provide the best possible fit, i.e., the most representative straight line through the data points mapped on the x-y plane.
What is a cost function?
We can measure the accuracy of our hypothesis by using a cost function. It takes an average of the squared differences between the results of the hypothesis on the input x's and the actual outputs y's.
J(theta_0, theta_1) = (1 / 2m) * sum_{i=1}^{m} (y_bar_i - y_i)^2 = (1 / 2m) * sum_{i=1}^{m} (h_theta(x_i) - y_i)^2
This function is otherwise called the "squared error function" or "mean squared error".
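A direct Python translation of J, as a sketch (names are illustrative):

```python
def cost(theta_0, theta_1, xs, ys):
    """Halved mean squared error J(theta_0, theta_1)."""
    m = len(xs)
    return sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A perfect fit gives zero cost: these points lie exactly on y = 1 + 2x.
print(cost(1, 2, [0, 1, 2], [1, 3, 5]))  # 0.0
```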
Why is the mean in the linear regression cost function halved (1/2m instead of 1/m)?
As a convenience for the computation of gradient descent: differentiating the square term produces a factor of 2 that cancels the 1/2.
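Written out (a standard one-step derivation, consistent with the J above; theta_j stands for either parameter):

```latex
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{2m} \sum_{i=1}^{m} 2\,\bigl(h_\theta(x_i) - y_i\bigr)\,
    \frac{\partial h_\theta(x_i)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)\,
    \frac{\partial h_\theta(x_i)}{\partial \theta_j}
```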
In visual terms, what are the interpretations of the data, the hypothesis, and the cost function?
Our training data set is scattered on the x-y plane. Our hypothesis h_theta(x) defines a straight line that we try to pass through this scattered set of data. Our objective is to find the best possible line: the one for which the average vertical distance of the scattered points from the line is least.
Why is the cost function a sum of squares?
It might be easier to think of this as measuring the distance between two (possibly multi-dimensional) values: in this case, the observed value y_i and the estimated value y_bar_i. The sum of squares isn't the only possible cost function, but it has many nice properties (a quick numeric check follows this list):
Negative and positive errors are punished equally.
It has only one global minimum, which guarantees convergence (non-convex alternatives can have more than one local minimum).
It is differentiable everywhere (the absolute value function is not).
A cubic penalty would grow very fast for large values.
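A quick numeric check of the symmetry and growth claims (illustrative only):

```python
# Compare penalties for errors of +2 and -2 under different loss shapes.
for e in (2.0, -2.0):
    print(f"e={e:+}: squared={e**2}, absolute={abs(e)}, cubic={e**3}")
# Squared treats +2 and -2 identically; cubic does not (it even rewards
# negative errors), and absolute is symmetric but not differentiable at 0.
```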
What is gradient descent good for?
Estimating the parameters of the hypothesis function.
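A minimal batch gradient descent sketch for the univariate hypothesis above; the learning rate alpha, the iteration count, and all names are my own choices:

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Fit theta_0 and theta_1 by batch gradient descent on the halved MSE."""
    theta_0 = theta_1 = 0.0
    m = len(xs)
    for _ in range(iters):
        errors = [theta_0 + theta_1 * x - y for x, y in zip(xs, ys)]
        # The factor of 2 from differentiating the square cancels the 1/2 in J.
        grad_0 = sum(errors) / m
        grad_1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# These points lie on y = 1 + 2x, so the estimates should approach (1, 2).
print(gradient_descent([0, 1, 2, 3], [1, 3, 5, 7]))
```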
What are the fields?
The parameters of our hypothesis: theta_0 and theta_1.