Lec 4 | Learning Flashcards
It provides a computer with data, rather than explicit instructions. Using these data, the computer learns to recognize patterns and becomes able to execute tasks on its own.
Machine Learning
It is a task where a computer learns a function that maps inputs to outputs based on a dataset of input-output pairs.
Supervised Learning
This is a supervised learning task where the function maps an input to a discrete output. In other words, it is the task of learning a function that maps an input point to a discrete category.
Classification
- An algorithm that, given an input, chooses the class of the nearest data point to that input.
- One way of solving a classification task: assign the variable in question the value of the closest observation.
Nearest-Neighbor Classification
How do you get around the limitations of nearest-neighbor classification?
One way to get around the limitations of nearest-neighbor classification is by using k-nearest-neighbors classification.
An algorithm that, given an input, chooses the most common class out of the k nearest data points to that input
k-nearest-neighbor classification
What is a drawback of using k-nearest-neighbor classification?
A drawback is that, using a naive approach, the algorithm will have to measure the distance of every single point to the point in question, which is computationally expensive. This can be sped up by using data structures that enable finding neighbors more quickly or by pruning irrelevant observations.
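As a quick illustration, here is a minimal k-nearest-neighbors sketch using scikit-learn; the toy weather data are invented for illustration.

from sklearn.neighbors import KNeighborsClassifier

# Each training point is (humidity, pressure); toy values, invented here.
X_train = [[0.9, 0.3], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y_train = ["Rain", "Rain", "No Rain", "No Rain"]

# k = 3: classify by the most common class among the 3 nearest points.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.predict([[0.85, 0.25]]))  # ['Rain']: 2 of its 3 neighbors are Rain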
Another way of approaching a classification problem is to look at the data as a whole and try to create a decision boundary. In two-dimensional data, we can draw a line between the two types of observations. Every additional data point will be classified based on the side of the line on which it is plotted.
Perceptron Learning
What is the drawback of Perceptron Learning? And how will we compromise?
The drawback to this approach is that data are messy, and it is rare that one can draw a line and neatly divide the observations into two classes without any mistakes. Often, we will compromise, drawing a boundary that separates the observations correctly more often than not, but still occasionally misclassifies them.
What is the perceptron learning rule?
Given data point (x, y), update each weight according to:
w_i = w_i + α(y - h_w(x)) × x_i
or
w_i = w_i + α(actual value - estimate) × x_i
What is an important takeaway from the perceptron learning rule?
The important takeaway from this rule is that for each data point, we adjust the weights to make our function more accurate.
The details, which are not as critical to our point, are that each weight is set to be equal to itself plus some value in parentheses.
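A minimal Python sketch of this update rule; the data point, initial weights, and learning rate α are assumptions for illustration.

def hypothesis(weights, x):
    # Hard-threshold hypothesis h_w(x): 1 if the weighted sum is >= 0, else 0.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0

def perceptron_update(weights, x, y, alpha=0.1):
    # w_i = w_i + alpha * (y - h_w(x)) * x_i for every weight.
    error = y - hypothesis(weights, x)
    return [w + alpha * error * xi for w, xi in zip(weights, x)]

# One update on a single labeled point; x[0] = 1 serves as a bias input.
weights = [-1.0, 0.0, 0.0]
weights = perceptron_update(weights, x=[1.0, 0.9, 0.3], y=1)
print(weights)  # approximately [-0.9, 0.09, 0.03]: nudged toward predicting 1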
It switches from 0 to 1 once the estimated value crosses some threshold.
Threshold function
What is a downside of using a threshold function?
The problem with this type of function is that it is unable to express uncertainty.
A threshold function that jumps directly from 0 to 1, so its output can only equal 0 or 1
hard threshold
A logistic function can yield a real number between 0 and 1, which will express confidence in the estimate.
soft threshold
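A small Python sketch contrasting the two thresholds; the input value is arbitrary.

import math

def hard_threshold(value):
    # Outputs only 0 or 1; it cannot express uncertainty.
    return 1 if value >= 0 else 0

def soft_threshold(value):
    # Logistic function: a real number in (0, 1) expressing confidence.
    return 1 / (1 + math.exp(-value))

print(hard_threshold(0.5))  # 1
print(soft_threshold(0.5))  # ~0.62: moderate confidence in class 1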
Another approach to classification is ____________________. This approach uses an additional vector (support vector) near the decision boundary to make the best decision when separating the data.
Support Vector Machine
A boundary that maximizes the distance between any of the data points. This is a type of boundary, which is as far as possible from the two groups it separates.
Maximum Margin Separator
Give a benefit of a support vector machine.
Support vector machines can represent decision boundaries with more than two dimensions, as well as non-linear decision boundaries.
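A minimal sketch with scikit-learn's support vector classifier; the XOR-style toy data and the RBF kernel are choices made here to illustrate a non-linear boundary.

from sklearn.svm import SVC

# XOR-style toy data: the two classes cannot be split by a single line.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

# An RBF kernel lets the SVM learn a non-linear decision boundary.
model = SVC(kernel="rbf")
model.fit(X, y)
print(model.predict([[0.9, 0.1]]))  # close to (1, 0), so likely class 1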
It is a supervised learning task of learning a function that maps an input point to a continuous value, some real number. This differs from classification in that classification problems map an input to discrete values (Rain or No Rain).
Regression
Functions that express how poorly our hypothesis performs.
A way to quantify the utility lost by any of the decision rules above. The less accurate the prediction, the larger the loss.
Loss functions
This function gains value when the prediction isn’t correct and doesn’t gain value when it is correct
0-1 Loss Function
Give function/code:
0-1 Loss Function
L(actual, predicted) = 0 if actual = predicted, 1 otherwise
Give function/code:
L1 Loss Function
L(actual, predicted) = | actual - predicted |
Give function/code:
L2 Loss Function
L(actual, predicted) = (actual - predicted)^2
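A minimal Python sketch of the three loss functions above:

def loss_0_1(actual, predicted):
    # 0-1 loss: 0 if the prediction is correct, 1 otherwise.
    return 0 if actual == predicted else 1

def loss_l1(actual, predicted):
    # L1 loss: absolute difference between actual and predicted.
    return abs(actual - predicted)

def loss_l2(actual, predicted):
    # L2 loss: squared difference, penalizing large errors more harshly.
    return (actual - predicted) ** 2

print(loss_0_1("Rain", "No Rain"))  # 1
print(loss_l1(4, 5))                # 1
print(loss_l2(5, 2))                # 9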
What do you do if you are interested in quantifying, for each prediction, how much it differed from the observed value?
We do this by taking either the absolute value or the squared value of the observed value minus the predicted value (i.e. how far the prediction was from the observed value).
A model that fits too closely to a particular data set and therefore may fail to generalize to future data
Overfitting
The process of penalizing hypotheses that are more complex to favor simpler, more general hypotheses
Regularization
Where do we use regularization?
We use regularization to avoid overfitting.
Formula
In regularization, we estimate the cost of the hypothesis function h by adding up its loss and a measure of its complexity.
cost(h) = loss(h) + λcomplexity(h)
It is a constant that we can use to modulate how strongly to penalize for complexity in our cost function. The higher ________ is, the more costly complexity is.
Lambda (λ)
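A tiny sketch of how λ modulates the penalty; the loss and complexity numbers are invented.

def cost(loss, complexity, lam):
    # cost(h) = loss(h) + lambda * complexity(h)
    return loss + lam * complexity

# The same hypothesis becomes costlier as lambda grows.
print(cost(loss=1.0, complexity=10, lam=0.0))  # 1.0: complexity is free
print(cost(loss=1.0, complexity=10, lam=0.5))  # 6.0: complexity is penalized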
It splits data into a training set and a test set, such that learning happens on the training set and is evaluated on the test set
holdout cross-validation
Give a way to test if the model is overfitted.
Holdout Cross Validation
What is the downside of holdout cross-validation? And how do you deal with its downside?
The downside of holdout cross-validation is that we don't get to train the model on half the data, since it is used for evaluation purposes.
A way to deal with this is to use k-fold cross-validation.
It splits data into k sets and experiments k times, using each set as a test set once and the remaining data as the training set
k-fold cross-validation
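A minimal sketch of both techniques with scikit-learn; the dataset and the choice of model are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# Holdout cross-validation: train on one half, evaluate on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# k-fold cross-validation: k = 5 experiments, each fold a test set once.
scores = cross_val_score(model, X, y, cv=5)
print("5-fold accuracies:", scores)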
As is often the case with Python, there are multiple libraries that allow us to conveniently use machine learning algorithms. One such library is ________________ .
scikit-learn
It is another approach to machine learning, where after each action, the agent gets feedback in the form of reward or punishment (a positive or a negative numerical value).
Reinforcement Learning
What is the learning process of reinforcement learning?
The learning process starts with the environment providing a state to the agent. Then, the agent performs an action on the state. Based on this action, the environment returns a new state and a reward to the agent. The reward can be positive, making the behavior more likely in the future, or negative (i.e. punishment), making the behavior less likely in the future.
Where can we use Reinforcement Learning?
This type of algorithm can be used to train walking robots, for example, where each step returns a positive number (reward) and each fall a negative number (punishment).
A model for decision-making, representing states, actions, and their rewards
Markov Decision Process
Reinforcement learning can be viewed as a Markov decision process, having the following properties:
- Set of states S
- Set of actions Actions(s)
- Transition model P(s' | s, a)
- Reward function R(s, a, s')
A method for learning a function Q(s, a), an estimate of the value of performing action a in state s.
Q-learning
Give Pseudocode
Q-Learning Overview
- Start with Q(s, a) = 0 for all s, a
- When we take an action and receive a reward:
  - Estimate the value of Q(s, a) based on the current reward and expected future rewards
  - Update Q(s, a) to take into account the old estimate as well as the new estimate
The update rule can be written, in increasing detail, as:
Q(s, a) ← Q(s, a) + α(new value estimate - Q(s, a))
Q(s, a) ← Q(s, a) + α((r + future reward estimate) - Q(s, a))
Q(s, a) ← Q(s, a) + α((r + max_a' Q(s', a')) - Q(s, a))
Q(s, a) ← Q(s, a) + α((r + γ max_a' Q(s', a')) - Q(s, a))
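A minimal Python sketch of this update; the states, actions, α, and γ are invented for illustration.

from collections import defaultdict

Q = defaultdict(float)   # Q(s, a) starts at 0 for every state-action pair
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (assumed values)

def q_update(state, action, reward, next_state, next_actions):
    # Q(s, a) <- Q(s, a) + alpha * ((r + gamma * max_a' Q(s', a')) - Q(s, a))
    future = max((Q[(next_state, a)] for a in next_actions), default=0)
    Q[(state, action)] += ALPHA * ((reward + GAMMA * future) - Q[(state, action)])

q_update(state="s0", action="right", reward=1, next_state="s1",
         next_actions=["left", "right"])
print(Q[("s0", "right")])  # 0.5 after one update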
An algorithm that completely discounts future estimated rewards, instead always choosing the action a in current state s that has the highest Q(s, a).
Greedy Decision-Making
Explore vs. Exploit
A greedy algorithm always ________________, taking the actions that are already established to lead to good outcomes. However, it will always follow the same path to the solution, never finding a better path.
Exploits
Explore vs. Exploit
________________________, on the other hand, means that the algorithm may use a previously unexplored route on its way to the target, allowing it to discover more efficient solutions along the way.
Explore
What can we use to implement the concepts of exploration and exploitation?
ε-greedy
ε means epsilon
In this type of algorithm, we set ε equal to how often we want to move randomly. With probability 1-ε, the algorithm chooses the best move (exploitation). With probability ε, the algorithm chooses a random move (exploration).
ε (epsilon) greedy
It allows us to approximate Q(s, a) using various other features, rather than storing one value for each state-action pair. Thus, the algorithm becomes able to recognize which moves are similar enough so that their estimated value should be similar as well, and use this heuristic in its decision making.
function approximation
Give Pseudocode
ε-greedy
- Set ε equal to how often we want to move randomly.
- With probability 1 - ε, choose estimated best move.
- With probability ε, choose a random move.
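A minimal Python sketch of ε-greedy selection; the Q table passed in is assumed to be a dictionary of state-action values like the one in the Q-learning sketch above.

import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    # With probability epsilon, explore: choose a random move.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: choose the estimated best move.
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
print(epsilon_greedy("s0", ["left", "right"], Q))  # usually "right"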
Downside of using ε-greedy
This approach becomes more computationally demanding when a game has many states and possible actions, such as chess. It is infeasible to generate an estimated value for every possible move in every possible state.
Given input data without any additional feedback, the computer learns patterns.
unsupervised learning
An unsupervised learning task that takes the input data and organizes the set of objects into groups in such a way that similar objects tend to be in the same group
Clustering
What are some Clustering Applications?
- Genetic research
- Image segmentation
- Market research
- Medical imaging
- Social network analysis
An algorithm for clustering data based on repeatedly assigning points to clusters and updating those clusters’ centers.
k-means Clustering
How does k-means Clustering work?
It maps all data points in a space and then randomly places k cluster centers in the space (it is up to the programmer to decide how many). Each cluster center is simply a point in the space. Each point is then assigned to the cluster whose center is closest to it. In an iterative process, each cluster center then moves to the middle of all the points assigned to it, and points are reassigned to the clusters whose centers are now closest to them. When, after repeating the process, each point remains in the same cluster it was in before, we have reached an equilibrium and the algorithm is over, leaving us with points divided between clusters.
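A minimal sketch with scikit-learn's KMeans; the toy points and the choice of k = 2 are assumptions for illustration.

from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups; invented for illustration.
points = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [8.5, 9], [9, 8]]

model = KMeans(n_clusters=2, n_init=10)  # k = 2 cluster centers
labels = model.fit_predict(points)

print(labels)                  # cluster assignment for each point
print(model.cluster_centers_)  # final centers once points stop moving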
CS50 Quiz
Categorize the following: A social network's AI uses existing tagged photos of people to identify when those people appear in new photos.
- This is an example of supervised learning
- This is an example of reinforcement learning
- This is an example of unsupervised learning
- This is not an example of machine learning
This is an example of supervised learning
CS50 Quiz
Imagine a regression AI that makes the following predictions for the following 5 data points. What is the total L2 loss across all of these data points (i.e., the sum of all the individual L2 losses for each data point)?
- The true output is 2 and the AI predicted 4.
- The true output is 4 and the AI predicted 5.
- The true output is 4 and the AI predicted 3.
- The true output is 5 and the AI predicted 2.
- The true output is 6 and the AI predicted 5.
16
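Worked computation, summing the individual L2 losses:
(2 - 4)^2 + (4 - 5)^2 + (4 - 3)^2 + (5 - 2)^2 + (6 - 5)^2 = 4 + 1 + 1 + 9 + 1 = 16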
CS50 Quiz
If Hypothesis 1 has a lower L1 loss and a lower L2 loss than Hypothesis 2 on a set of training data, why might Hypothesis 2 still be a preferable hypothesis?
- Hypothesis 1 might be the result of regularization.
- Hypothesis 1 might be the result of overfitting.
- Hypothesis 1 might be the result of loss.
- Hypothesis 1 might be the result of cross-validation.
- Hypothesis 1 might be the result of regression.
Hypothesis 1 might be the result of overfitting
CS50 Quiz
In the ε-greedy approach to action selection in reinforcement learning, which of the following values of ε makes the approach identical to a purely greedy approach?
- ε = 0
- ε = 0.25
- ε = 0.5
- ε = 0.75
- ε = 1
ε = 0