Lec 4 | Learning Flashcards
It provides a computer with data, rather than explicit instructions. Using these data, the computer learns to recognize patterns and becomes able to execute tasks on its own.
Machine Learning
It is a task where a computer learns a function that maps inputs to outputs based on a dataset of input-output pairs.
Supervised Learning
This is a supervised learning task where the function maps an input to a discrete output. In other words, it is the task of learning a function that maps an input point to a discrete category.
Classification
- An algorithm that, given an input, chooses the class of the nearest data point to that input.
- One way of solving a classification task: assign the variable in question the value of the closest observation.
Nearest-Neighbor Classification
How do you get around the limitations of nearest-neighbor classification?
One way to get around the limitations of nearest-neighbor classification is by using k-nearest-neighbors classification.
An algorithm that, given an input, chooses the most common class out of the k nearest data points to that input
k-nearest-neighbor classification
What is a drawback of using k-nearest-neighbor classification?
A drawback is that, using a naive approach, the algorithm will have to measure the distance of every single point to the point in question, which is computationally expensive. This can be sped up by using data structures that enable finding neighbors more quickly or by pruning irrelevant observations.
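As a quick illustration, here is a minimal k-nearest-neighbors sketch using scikit-learn; the toy weather data are invented for illustration.

from sklearn.neighbors import KNeighborsClassifier

# Each training point is (humidity, pressure); toy values, invented here.
X_train = [[0.9, 0.3], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y_train = ["Rain", "Rain", "No Rain", "No Rain"]

# k = 3: classify by the most common class among the 3 nearest points.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.predict([[0.85, 0.25]]))  # ['Rain']: 2 of its 3 neighbors are Rain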
Another way of approaching a classification problem is to look at the data as a whole and try to create a decision boundary. In two-dimensional data, we can draw a line between the two types of observations. Every additional data point will be classified based on the side of the line on which it is plotted.
Perceptron Learning
What is the drawback of Perceptron Learning? And how will we compromise?
The drawback to this approach is that data are messy, and it is rare that one can draw a line and neatly divide the observations into two classes without any mistakes. Often, we will compromise, drawing a boundary that separates the observations correctly more often than not, but still occasionally misclassifies them.
What is the perceptron learning rule?
Given data point (x, y), update each weight according to:
w_i = w_i + α(y - h_w(x)) × x_i
or
w_i = w_i + α(actual value - estimate) × x_i
What is an important takeaway from the perceptron learning rule?
The important takeaway from this rule is that for each data point, we adjust the weights to make our function more accurate.
The details, which are not as critical to our point, are that each weight is set to be equal to itself plus some value in parentheses.
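A minimal Python sketch of this update rule; the data point, initial weights, and learning rate α are assumptions for illustration.

def hypothesis(weights, x):
    # Hard-threshold hypothesis h_w(x): 1 if the weighted sum is >= 0, else 0.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0

def perceptron_update(weights, x, y, alpha=0.1):
    # w_i = w_i + alpha * (y - h_w(x)) * x_i for every weight.
    error = y - hypothesis(weights, x)
    return [w + alpha * error * xi for w, xi in zip(weights, x)]

# One update on a single labeled point; x[0] = 1 serves as a bias input.
weights = [-1.0, 0.0, 0.0]
weights = perceptron_update(weights, x=[1.0, 0.9, 0.3], y=1)
print(weights)  # approximately [-0.9, 0.09, 0.03]: nudged toward predicting 1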
It switches from 0 to 1 once the estimated value crosses some threshold.
Threshold function
What is a downside of using a threshold function?
The problem with this type of function is that it is unable to express uncertainty.
A threshold function that jumps directly from 0 to 1, so its output can only equal 0 or 1
hard threshold
A logistic function can yield a real number between 0 and 1, which will express confidence in the estimate.
soft threshold
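A small Python sketch contrasting the two thresholds; the input value is arbitrary.

import math

def hard_threshold(value):
    # Outputs only 0 or 1; it cannot express uncertainty.
    return 1 if value >= 0 else 0

def soft_threshold(value):
    # Logistic function: a real number in (0, 1) expressing confidence.
    return 1 / (1 + math.exp(-value))

print(hard_threshold(0.5))  # 1
print(soft_threshold(0.5))  # ~0.62: moderate confidence in class 1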
Another approach to classification is ____________________. This approach uses an additional vector (support vector) near the decision boundary to make the best decision when separating the data.
Support Vector Machine
A boundary that maximizes the distance between any of the data points. This is a type of boundary, which is as far as possible from the two groups it separates.
Maximum Margin Separator
Give a benefit of a support vector machine.
Support vector machines can represent decision boundaries with more than two dimensions, as well as non-linear decision boundaries.
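A minimal sketch with scikit-learn's support vector classifier; the XOR-style toy data and the RBF kernel are choices made here to illustrate a non-linear boundary.

from sklearn.svm import SVC

# XOR-style toy data: the two classes cannot be split by a single line.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

# An RBF kernel lets the SVM learn a non-linear decision boundary.
model = SVC(kernel="rbf")
model.fit(X, y)
print(model.predict([[0.9, 0.1]]))  # close to (1, 0), so likely class 1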
It is a supervised learning task of learning a function that maps an input point to a continuous value, some real number. This differs from classification in that classification problems map an input to discrete values (Rain or No Rain).
Regression
Functions that express how poorly our hypothesis performs.
A way to quantify the utility lost by any of the decision rules above. The less accurate the prediction, the larger the loss.
Loss functions
This function gains value when the prediction isn’t correct and doesn’t gain value when it is correct
0-1 Loss Function
Give function/code:
0-1 Loss Function
L(actual, predicted) = 0 if actual = predicted, 1 otherwise
Give function/code:
L1 Loss Function
L(actual, predicted) = | actual - predicted |
Give function/code:
L2 Loss Function
L(actual, predicted) = (actual - predicted)^2
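A minimal Python sketch of the three loss functions above:

def loss_0_1(actual, predicted):
    # 0-1 loss: 0 if the prediction is correct, 1 otherwise.
    return 0 if actual == predicted else 1

def loss_l1(actual, predicted):
    # L1 loss: absolute difference between actual and predicted.
    return abs(actual - predicted)

def loss_l2(actual, predicted):
    # L2 loss: squared difference, penalizing large errors more harshly.
    return (actual - predicted) ** 2

print(loss_0_1("Rain", "No Rain"))  # 1
print(loss_l1(4, 5))                # 1
print(loss_l2(5, 2))                # 9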
What do you do if you are interested in quantifying, for each prediction, how much it differed from the observed value?
We do this by taking either the absolute value or the squared value of the observed value minus the predicted value (i.e. how far the prediction was from the observed value).
A model that fits too closely to a particular data set and therefore may fail to generalize to future data
Overfitting
The process of penalizing hypotheses that are more complex to favor simpler, more general hypotheses
Regularization
Where do we use regularization?
We use regularization to avoid overfitting.
Formula
In regularization, we estimate the cost of the hypothesis function h by adding up its loss and a measure of its complexity.
cost(h) = loss(h) + λcomplexity(h)
It is a constant that we can use to modulate how strongly to penalize for complexity in our cost function. The higher ________ is, the more costly complexity is.
Lambda (λ)
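A tiny sketch of how λ modulates the penalty; the loss and complexity numbers are invented.

def cost(loss, complexity, lam):
    # cost(h) = loss(h) + lambda * complexity(h)
    return loss + lam * complexity

# The same hypothesis becomes costlier as lambda grows.
print(cost(loss=1.0, complexity=10, lam=0.0))  # 1.0: complexity is free
print(cost(loss=1.0, complexity=10, lam=0.5))  # 6.0: complexity is penalized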
It splits data into a training set and a test set, such that learning happens on the training set and is evaluated on the test set
holdout cross-validation
Give a way to test if the model is overfitted.
Holdout Cross Validation
What is the downside of holdout cross-validation? And how do you deal with its downside?
The downside of holdout cross-validation is that we don't get to train the model on half the data, since it is used for evaluation purposes.
A way to deal with this is to use k-fold cross-validation.
It splits data into k sets and experiments k times, using each set as a test set once and the remaining data as the training set
k-fold cross-validation
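A minimal sketch of both techniques with scikit-learn; the dataset and the choice of model are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# Holdout cross-validation: train on one half, evaluate on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# k-fold cross-validation: k = 5 experiments, each fold a test set once.
scores = cross_val_score(model, X, y, cv=5)
print("5-fold accuracies:", scores)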
As is often the case with Python, there are multiple libraries that allow us to conveniently use machine learning algorithms. One such library is ________________ .
scikit-learn
It is another approach to machine learning, where after each action, the agent gets feedback in the form of reward or punishment (a positive or a negative numerical value).
Reinforcement Learning
What is the learning process of reinforcement learning?
The learning process starts with the environment providing a state to the agent. Then, the agent performs an action on the state. Based on this action, the environment returns a new state and a reward to the agent. The reward can be positive, making the behavior more likely in the future, or negative (i.e. punishment), making the behavior less likely in the future.
Where can we use Reinforcement Learning?
This type of algorithm can be used to train walking robots, for example, where each step returns a positive number (reward) and each fall a negative number (punishment).
A model for decision-making, representing states, actions, and their rewards
Markov Decision Process
Reinforcement learning can be viewed as a Markov decision process, having the following properties:
- Set of states S
- Set of actions Actions(s)
- Transition model P(s' | s, a)
- Reward function R(s, a, s')
A method for learning a function Q(s, a), an estimate of the value of performing action a in state s.
Q-learning
Give Pseudocode
Q-Learning Overview
- Start with Q(s, a) = 0 for all s, a
- When we take an action and receive a reward:
  - Estimate the value of Q(s, a) based on the current reward and expected future rewards
  - Update Q(s, a) to take into account the old estimate as well as the new estimate
The update rule can be written, in increasing detail, as:
Q(s, a) ← Q(s, a) + α(new value estimate - Q(s, a))
Q(s, a) ← Q(s, a) + α((r + future reward estimate) - Q(s, a))
Q(s, a) ← Q(s, a) + α((r + max_a' Q(s', a')) - Q(s, a))
Q(s, a) ← Q(s, a) + α((r + γ max_a' Q(s', a')) - Q(s, a))
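A minimal Python sketch of this update; the states, actions, α, and γ are invented for illustration.

from collections import defaultdict

Q = defaultdict(float)   # Q(s, a) starts at 0 for every state-action pair
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (assumed values)

def q_update(state, action, reward, next_state, next_actions):
    # Q(s, a) <- Q(s, a) + alpha * ((r + gamma * max_a' Q(s', a')) - Q(s, a))
    future = max((Q[(next_state, a)] for a in next_actions), default=0)
    Q[(state, action)] += ALPHA * ((reward + GAMMA * future) - Q[(state, action)])

q_update(state="s0", action="right", reward=1, next_state="s1",
         next_actions=["left", "right"])
print(Q[("s0", "right")])  # 0.5 after one update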
An algorithm that completely discounts future estimated rewards, instead always choosing the action a in current state s that has the highest Q(s, a).
Greedy Decision-Making
Explore vs. Exploit
A greedy algorithm always ________________, taking the actions that are already established to lead to good outcomes. However, it will always follow the same path to the solution, never finding a better path.
Exploits
Explore vs. Exploit
________________________, on the other hand, means that the algorithm may use a previously unexplored route on its way to the target, allowing it to discover more efficient solutions along the way.
Explore
What can we use to implement the concepts of exploration and exploitation?
ε-greedy
ε means epsilon
In this type of algorithm, we set ε equal to how often we want to move randomly. With probability 1-ε, the algorithm chooses the best move (exploitation). With probability ε, the algorithm chooses a random move (exploration).
ε (epsilon) greedy
It allows us to approximate Q(s, a) using various other features, rather than storing one value for each state-action pair. Thus, the algorithm becomes able to recognize which moves are similar enough so that their estimated value should be similar as well, and use this heuristic in its decision making.
function approximation
Give Pseudocode
ε-greedy
- Set ε equal to how often we want to move randomly.
- With probability 1 - ε, choose estimated best move.
- With probability ε, choose a random move.
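A minimal Python sketch of ε-greedy selection; the Q table passed in is assumed to be a dictionary of state-action values like the one in the Q-learning sketch above.

import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    # With probability epsilon, explore: choose a random move.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: choose the estimated best move.
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
print(epsilon_greedy("s0", ["left", "right"], Q))  # usually "right"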
Downside of using ε-greedy
This approach becomes more computationally demanding when a game has many states and possible actions, such as chess. It is infeasible to generate an estimated value for every possible move in every possible state.
Given input data without any additional feedback, the computer learns patterns.
unsupervised learning
An unsupervised learning task that takes the input data and organizes the set of objects into groups in such a way that similar objects tend to be in the same group
Clustering
What are some Clustering Applications?
- Genetic research
- Image segmentation
- Market research
- Medical imaging
- Social network analysis
An algorithm for clustering data based on repeatedly assigning points to clusters and updating those clusters’ centers.
k-means Clustering
How does k-means Clustering work?
It maps all data points in a space and then randomly places k cluster centers in the space (it is up to the programmer to decide how many). Each cluster center is simply a point in the space. Each point is then assigned to the cluster whose center is closest to it. In an iterative process, each cluster center then moves to the middle of all the points assigned to it, and points are reassigned to the clusters whose centers are now closest to them. When, after repeating the process, each point remains in the same cluster it was in before, we have reached an equilibrium and the algorithm is over, leaving us with points divided between clusters.
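A minimal sketch with scikit-learn's KMeans; the toy points and the choice of k = 2 are assumptions for illustration.

from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups; invented for illustration.
points = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [8.5, 9], [9, 8]]

model = KMeans(n_clusters=2, n_init=10)  # k = 2 cluster centers
labels = model.fit_predict(points)

print(labels)                  # cluster assignment for each point
print(model.cluster_centers_)  # final centers once points stop moving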
CS50 Quiz
Categorize the following: A social network's AI uses existing tagged photos of people to identify when those people appear in new photos.
- This is an example of supervised learning
- This is an example of reinforcement learning
- This is an example of unsupervised learning
- This is not an example of machine learning
This is an example of supervised learning
CS50 Quiz
Imagine a regression AI that makes the following predictions for the following 5 data points. What is the total L2 loss across all of these data points (i.e., the sum of all the individual L2 losses for each data point)?
- The true output is 2 and the AI predicted 4.
- The true output is 4 and the AI predicted 5.
- The true output is 4 and the AI predicted 3.
- The true output is 5 and the AI predicted 2.
- The true output is 6 and the AI predicted 5.
16
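Worked computation, summing the individual L2 losses:
(2 - 4)^2 + (4 - 5)^2 + (4 - 3)^2 + (5 - 2)^2 + (6 - 5)^2 = 4 + 1 + 1 + 9 + 1 = 16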
CS50 Quiz
If Hypothesis 1 has a lower L1 loss and a lower L2 loss than Hypothesis 2 on a set of training data, why might Hypothesis 2 still be a preferable hypothesis?
- Hypothesis 1 might be the result of regularization.
- Hypothesis 1 might be the result of overfitting.
- Hypothesis 1 might be the result of loss.
- Hypothesis 1 might be the result of cross-validation.
- Hypothesis 1 might be the result of regression.
Hypothesis 1 might be the result of overfitting
CS50 Quiz
In the ε-greedy approach to action selection in reinforcement learning, which of the following values of ε makes the approach identical to a purely greedy approach?
- ε = 0
- ε = 0.25
- ε = 0.5
- ε = 0.75
- ε = 1
ε = 0