Supervised learning Flashcards

1
Q

What is the difference between the perceptron and a linear pattern associator?

A

In the linear pattern associator, the output neurons use a continuous activation function, like the sigmoid, whereas the perceptron's output is a binary (or bipolar) threshold unit.

2
Q

What does the continuous output allow us to do?

A

Quantify the error (the discrepancy between the actual and the desired output).

3
Q

What is the perceptron convergence theorem?

A

For any linearly separable problem, the perceptron learning rule is guaranteed to find a solution in a finite number of steps.

4
Q

What is the perceptron’s network composed of?

A

N input units that encode the presented pattern with values x_i, and a single output neuron that encodes the response with bipolar (or binary) values.
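
A minimal sketch of this architecture and its learning rule in Python/NumPy (the function name, learning rate, and AND-gate toy data are illustrative assumptions, not from the cards):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Perceptron learning rule for bipolar targets t in {-1, +1}.

    X: (n_patterns, N) matrix of input values x_i; a bias input is appended.
    Returns the learned weights (the last entry is the bias weight).
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append constant bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(Xb, t):
            y = 1.0 if w @ x >= 0 else -1.0        # bipolar threshold output
            if y != target:                        # update only on mistakes
                w += eta * target * x
                mistakes += 1
        if mistakes == 0:                          # per the convergence theorem,
            break                                  # reached in finite steps if
    return w                                       # the problem is separable

# Illustrative use: the linearly separable AND function in bipolar coding
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
```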

5
Q

With what kind of output units can the Delta Rule be used?

A

output units that use a continuous and differentiable output function, like the sigmoid

6
Q

What is the cost function of the delta rule?

A

the mean squared error between desired output and actual output
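
In symbols, a standard way to write this cost (the notation E, W, t, y, and the pattern index μ are assumptions, not given on the card):

```latex
E(W) = \frac{1}{2} \sum_{\mu} \left( t^{\mu} - y^{\mu} \right)^2
```

Up to a constant factor, this is the mean squared error over the training patterns μ.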

7
Q

How are the weights modified during learning?

A

In a direction opposite to that of the gradient of the cost function (gradient descent).
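
As an equation, this is the standard gradient-descent update (η, the learning rate, is an assumed symbol):

```latex
\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}
```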

8
Q

Describe the steps of learning with the delta rule in supervised learning.

A
  • Input neurons are clamped to the input values
  • Activation flows to the output neurons
  • Output neurons’ activations are computed
  • The output pattern is compared with the desired output
  • The discrepancy between the two patterns is computed (error signal)
  • Connection weights are modified (delta rule) in order to reduce the error, i.e., to minimize the cost function E, which depends only on the values of the connection weights W; thus, weights are modified in a direction opposite to that of the gradient of the cost function
  • The procedure is repeated for all examples that form the training set (a learning epoch), and then for many epochs, until the error becomes 0 or stops decreasing (see the sketch after this list)
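
A minimal sketch of this loop for a single sigmoid output unit (the names, learning rate, and initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_delta_rule(X, t, eta=0.5, epochs=1000):
    """Delta rule for one sigmoid output unit, minimizing squared error E."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):                      # repeated for many epochs
        for x, target in zip(X, t):              # one epoch = all examples
            y = sigmoid(w @ x)                   # clamp input, compute output
            delta = (target - y) * y * (1 - y)   # error signal * sigmoid slope
            w += eta * delta * x                 # step opposite to gradient of E
    return w
```
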
9
Q

How can linearly inseparable problems be solved (by what)?

A

multi-layer networks

10
Q

Why are multi-layer networks called universal approximators?

A

Because a network with at least one hidden layer can, at least in principle, approximate any X→Y (input-output) function, provided we properly choose the weight values and the number of hidden units.

11
Q

What is a multi-layer network?

A

One that has one or more intermediate layers of neurons (hidden layers) that use a non-linear activation function (like the sigmoid)

12
Q

What is the error back-propagation algorithm?

A

It’s an extension of the delta rule (the generalized delta rule) that allows learning in multi-layer networks.

13
Q

Describe the steps of error back-propagation.

A
  • Input neurons are clamped to the input values
  • Activation flows from the input neurons to the hidden neurons, and from these to the output neurons
  • The output pattern is compared with the desired output
  • The discrepancy between the two patterns is computed (error signal)
  • For the hidden-to-output connections, the weight changes are computed from the gradient of the error function, exactly as in the delta rule
  • For the hidden units, the error is computed by propagating the output errors backwards: the error term of each output unit is multiplied by the corresponding connection weight, and these weighted error terms are summed over all output units
  • Once each hidden unit has an error term, the delta rule can be applied again to the input-to-hidden connections, since their inputs are known

At the output level: the delta rule.
At the level of each previous layer: the generalized delta rule.
(See the sketch after this list.)
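
A minimal sketch of these steps for a network with one hidden layer of sigmoid units (names, learning rate, and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W_hid, W_out, eta=0.5):
    """One pattern presentation: forward pass, then error back-propagation.

    W_hid: (n_hidden, n_inputs) input-to-hidden weights
    W_out: (n_outputs, n_hidden) hidden-to-output weights
    """
    # Forward pass: activation flows input -> hidden -> output
    h = sigmoid(W_hid @ x)
    y = sigmoid(W_out @ h)

    # Output error terms: delta rule at the output level
    delta_out = (t - y) * y * (1 - y)

    # Hidden error terms: each output delta is multiplied by its weight
    # and the weighted terms are summed (errors propagated backwards)
    delta_hid = (W_out.T @ delta_out) * h * (1 - h)

    # Weight changes, opposite to the gradient of the error function
    W_out += eta * np.outer(delta_out, h)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out
```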

14
Q

What is the difference between a small and large learning rate?

A

small: learning is slow and may get stuck in local minima
large: learning is fast but imprecise (steps may overshoot the minimum)

15
Q

What does the momentum term do?

A
  • adds a fraction of the previous weight update to the current one
  • when the previous update is in the same direction, this increases the size of the step taken towards the minimum; when the gradient changes direction, momentum smooths the variation (see the sketch below)
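
A minimal sketch of the update (α, the momentum coefficient, and the other names are illustrative assumptions):

```python
import numpy as np

def momentum_update(w, grad, velocity, eta=0.1, alpha=0.9):
    """Gradient-descent step with momentum.

    velocity holds the previous weight update; a fraction alpha of it is
    added to the current update, enlarging steps in a consistent direction
    and smoothing the trajectory when the gradient changes direction.
    """
    velocity = alpha * velocity - eta * grad
    w = w + velocity
    return w, velocity
```
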
16
Q

What are 2 necessary conditions to obtain good generalization performance?

A

  • Input variables should contain sufficient information related to the target
  • The training set should contain a sufficient number of training examples that constitute a representative sample of the population

17
Q

Generalization vs overfitting!

A

Generalization: producing an appropriate output for an input pattern that was not included in the training set.
VS.
Overfitting: performance on the training patterns (training set) keeps improving, but generalization performance (on an independent test set) drops.

18
Q

Name 3 solutions for overfitting.

A

limiting the number of hidden units, early stopping, weight decay

19
Q

What is the training set?

A

examples for learning. They are used to find the values of the connection weights

20
Q

What is a validation set?

A

examples for “tuning” learning parameters (known as hyper-parameters, such as learning rate, momentum, weight decay), the number of hidden neurons, and for deciding when to stop learning

21
Q

What is a test set?

A

examples for assessing the performance of the final model

22
Q

When there is not enough data to partition into separate training and test sets, what method do we use?

A

cross-validation

23
Q

Explain k-fold cross validation.

A

One round of cross-validation involves partitioning a sample of data into a training set and a validation set (or test set). Multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.

The dataset is divided into k parts (folds) of equal size (often k = 10). At each cross-validation cycle, one fold is excluded from training and used as the test set. The final performance is the mean across all cycles (i.e., across all k folds); a code sketch follows below.
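
A minimal sketch of the procedure (train_fn and eval_fn are hypothetical placeholders for fitting and scoring a model):

```python
import numpy as np

def k_fold_cv(X, t, train_fn, eval_fn, k=10, seed=0):
    """k-fold cross-validation: each fold is held out once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]                                    # held-out fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
        model = train_fn(X[train_idx], t[train_idx])
        scores.append(eval_fn(model, X[test_idx], t[test_idx]))
    return float(np.mean(scores))                              # mean across cycles
```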

24
Q

What is the ROC (Receiver Operating Characteristic) curve?

A

built by plotting the true positive rate vs. the false positive rate for various values of one parameter (typically the classification threshold).

25
Q

What is the Area Under the Curve (AUC)?

A

ranges between 0 and 1 and represents how good the classifier is (AUC = 1 means that the predictions are 100% correct).
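
A minimal sketch computing both with scikit-learn (the library choice and the toy labels/scores are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data: true binary labels and a classifier's continuous scores
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# ROC curve: true positive rate vs. false positive rate across thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC: the area under that curve (1.0 = perfectly separated classes)
auc = roc_auc_score(y_true, y_score)
```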

26
Q

What is supervised learning?

A

The system is given inputs together with the desired outputs y_1, y_2, …, and its goal is to learn to produce the correct output given a new input. Note that there is an external teacher.

27
Q

What is unsupervised learning?

A

The goal of the system is to build representations from the input that can be used for reasoning, decision making, predicting things, communicating, etc. Note that there is no specific task.

28
Q

What is reinforcement learning?

A

The system can also produce actions which affect the state of the world, and it receives rewards (or punishments) r_1, r_2, …. Its goal is to learn to act in a way that maximises rewards in the long term.