Supervised learning Flashcards

1
Q

What is the difference between the perceptron and a linear pattern associator?

A

In the linear pattern associator, the output neurons use a continuous activation function, like the sigmoid, whereas the perceptron's output is a binary (or bipolar) threshold unit.

2
Q

What does the continuous output allow us to do?

A

Quantify the error (the discrepancy between the actual and the desired output).

3
Q

What is the perceptron convergence theorem?

A

For any linearly separable problem, the perceptron learning rule is guaranteed to find a solution in a finite number of steps.

4
Q

What is the perceptron’s network composed of?

A

N input units that encode the presented pattern with values x_i, and a single output neuron that encodes the response with bipolar (or binary) values.
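
A minimal sketch of this architecture and its learning rule in Python/NumPy (the function name, learning rate, and AND-gate toy data are illustrative assumptions, not from the cards):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Perceptron learning rule for bipolar targets t in {-1, +1}.

    X: (n_patterns, N) matrix of input values x_i; a bias input is appended.
    Returns the learned weights (the last entry is the bias weight).
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append constant bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(Xb, t):
            y = 1.0 if w @ x >= 0 else -1.0        # bipolar threshold output
            if y != target:                        # update only on mistakes
                w += eta * target * x
                mistakes += 1
        if mistakes == 0:                          # per the convergence theorem,
            break                                  # reached in finite steps if
    return w                                       # the problem is separable

# Illustrative use: the linearly separable AND function in bipolar coding
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
```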

5
Q

With what kind of output units can the Delta Rule be used?

A

output units that use a continuous and differentiable output function, like the sigmoid

6
Q

What is the cost function of the delta rule?

A

the mean squared error between desired output and actual output
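
In symbols, a standard way to write this cost (the notation E, W, t, y, and the pattern index μ are assumptions, not given on the card):

```latex
E(W) = \frac{1}{2} \sum_{\mu} \left( t^{\mu} - y^{\mu} \right)^2
```

Up to a constant factor, this is the mean squared error over the training patterns μ.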

7
Q

How are the weights modified during learning?

A

In a direction opposite to that of the gradient of the cost function (gradient descent).
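
As an equation, this is the standard gradient-descent update (η, the learning rate, is an assumed symbol):

```latex
\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}
```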

8
Q

Describe the steps of learning with the delta rule in supervised learning.

A
  • Input neurons are clamped to the input values
  • Activation flows to the output neurons
  • Output neurons’ activations are computed
  • The output pattern is compared with the desired output
  • The discrepancy between the two patterns is computed (error signal)
  • Connection weights are modified (delta rule) in order to reduce the error, i.e., to minimize the cost function E, which depends only on the values of the connection weights W; thus, weights are modified in a direction opposite to that of the gradient of the cost function
  • The procedure is repeated for all examples that form the training set (a learning epoch), and then for many epochs, until the error becomes 0 or stops decreasing (see the sketch after this list)
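
A minimal sketch of this loop for a single sigmoid output unit (the names, learning rate, and initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_delta_rule(X, t, eta=0.5, epochs=1000):
    """Delta rule for one sigmoid output unit, minimizing squared error E."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):                      # repeated for many epochs
        for x, target in zip(X, t):              # one epoch = all examples
            y = sigmoid(w @ x)                   # clamp input, compute output
            delta = (target - y) * y * (1 - y)   # error signal * sigmoid slope
            w += eta * delta * x                 # step opposite to gradient of E
    return w
```
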
9
Q

How can linearly inseparable problems be solved (by what)?

A

multi-layer networks

10
Q

Why are multi-layer networks called universal approximators?

A

Because a network with at least one hidden layer can, at least in principle, approximate any X→Y (input-output) function, provided we properly choose the weight values and the number of hidden units.

11
Q

What is a multi-layer network?

A

One that has one or more intermediate layers of neurons (hidden layers) that use a non-linear activation function (like the sigmoid)

12
Q

What is the error back-propagation algorithm?

A

It’s an extension of the delta rule (the generalized delta rule) that allows learning in multi-layer networks.

13
Q

Describe the steps of error back-propagation.

A
  • Input neurons are clamped to the input values
  • Activation flows from the input neurons to the hidden neurons, and from these to the output neurons
  • The output pattern is compared with the desired output
  • The discrepancy between the two patterns is computed (error signal)
  • For the hidden-to-output connections, the weight changes are computed from the gradient of the error function, exactly as in the delta rule
  • For the hidden units, the error is computed by propagating the output errors backwards: the error term of each output unit is multiplied by the corresponding connection weight, and these weighted error terms are summed over all output units
  • Once each hidden unit has an error term, the delta rule can be applied again to the input-to-hidden connections, since their inputs are known

At the output level: the delta rule.
At the level of each previous layer: the generalized delta rule.
(See the sketch after this list.)
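
A minimal sketch of these steps for a network with one hidden layer of sigmoid units (names, learning rate, and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W_hid, W_out, eta=0.5):
    """One pattern presentation: forward pass, then error back-propagation.

    W_hid: (n_hidden, n_inputs) input-to-hidden weights
    W_out: (n_outputs, n_hidden) hidden-to-output weights
    """
    # Forward pass: activation flows input -> hidden -> output
    h = sigmoid(W_hid @ x)
    y = sigmoid(W_out @ h)

    # Output error terms: delta rule at the output level
    delta_out = (t - y) * y * (1 - y)

    # Hidden error terms: each output delta is multiplied by its weight
    # and the weighted terms are summed (errors propagated backwards)
    delta_hid = (W_out.T @ delta_out) * h * (1 - h)

    # Weight changes, opposite to the gradient of the error function
    W_out += eta * np.outer(delta_out, h)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out
```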

14
Q

What is the difference between a small and large learning rate?

A

small: learning is slow and may get stuck in local minima
large: learning is fast but imprecise (steps may overshoot the minimum)

15
Q

What does the momentum term do?

A
  • adds a fraction of the previous weight update to the current one
  • when the previous update is in the same direction, this increases the size of the step taken towards the minimum; when the gradient changes direction, momentum smooths the variation (see the sketch below)
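
A minimal sketch of the update (α, the momentum coefficient, and the other names are illustrative assumptions):

```python
import numpy as np

def momentum_update(w, grad, velocity, eta=0.1, alpha=0.9):
    """Gradient-descent step with momentum.

    velocity holds the previous weight update; a fraction alpha of it is
    added to the current update, enlarging steps in a consistent direction
    and smoothing the trajectory when the gradient changes direction.
    """
    velocity = alpha * velocity - eta * grad
    w = w + velocity
    return w, velocity
```
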
16
Q

What are 2 necessary conditions to obtain good generalization performance?

A

  • Input variables should contain sufficient information related to the target
  • The training set should contain a sufficient number of training examples that constitute a representative sample of the population

17
Q

Generalization vs overfitting!

A

Generalization: producing an appropriate output for an input pattern that was not included in the training set.
VS.
Overfitting: performance on the training patterns (training set) keeps improving, but generalization performance (on an independent test set) drops.

18
Q

Name 3 solutions for overfitting.

A

limiting the number of hidden units, early stopping, weight decay

19
Q

What is the training set?

A

examples for learning. They are used to find the values of the connection weights

20
Q

What is a validation set?

A

examples for “tuning” learning parameters (known as hyper-parameters, such as learning rate, momentum, weight decay), the number of hidden neurons, and for deciding when to stop learning

21
Q

What is a test set?

A

examples for assessing the performance of the final model

22
Q

When there is not enough data to partition into separate training and test sets, what method do we use?

A

cross-validation

23
Q

Explain k-fold cross validation.

A

One round of cross-validation involves partitioning a sample of data into a training set and a validation set (or test set). Multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.

The dataset is divided into k parts (folds) of equal size (often k = 10). At each cross-validation cycle, one fold is excluded from training and used as the test set. The final performance is the mean across all cycles (i.e., across all k folds); a code sketch follows below.
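
A minimal sketch of the procedure (train_fn and eval_fn are hypothetical placeholders for fitting and scoring a model):

```python
import numpy as np

def k_fold_cv(X, t, train_fn, eval_fn, k=10, seed=0):
    """k-fold cross-validation: each fold is held out once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]                                    # held-out fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
        model = train_fn(X[train_idx], t[train_idx])
        scores.append(eval_fn(model, X[test_idx], t[test_idx]))
    return float(np.mean(scores))                              # mean across cycles
```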

24
Q

What is the ROC (Receiver Operating Characteristic) curve?

A

built by plotting the true positive rate vs. the false positive rate for various values of one parameter (typically the classification threshold).

25
Q

What is the Area Under the Curve (AUC)?

A

ranges between 0 and 1 and represents how good the classifier is (AUC = 1 means that the predictions are 100% correct).
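
A minimal sketch computing both with scikit-learn (the library choice and the toy labels/scores are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data: true binary labels and a classifier's continuous scores
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# ROC curve: true positive rate vs. false positive rate across thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC: the area under that curve (1.0 = perfectly separated classes)
auc = roc_auc_score(y_true, y_score)
```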

26
Q

What is supervised learning?

A

The system is given inputs together with the desired outputs y_1, y_2, …, and its goal is to learn to produce the correct output given a new input. Note that there is an external teacher.

27
Q

What is unsupervised learning?

A

The goal of the system is to build representations from the input that can be used for reasoning, decision making, predicting things, communicating, etc. Note that there is no specific task.

28
Q

What is reinforcement learning?

A

The system can also produce actions which affect the state of the world, and it receives rewards (or punishments) r_1, r_2, …. Its goal is to learn to act in a way that maximises rewards in the long term.