5: Learning Flashcards
What is NETTalk?
???
What is a neural network?
???
What is the PDP model?
Parallel Distributed Processing ???
What is symbolic AI?
???
How do neural networks differ from symbolic AI?
???
What are some advantages of symbolic-based AI systems?
– A symbolic algorithm can execute anything expressed as following a
sequence of formal rules.
– Large amounts of memorised information can be copied and retrieved
accurately ad infinitum.
– Information processing is relatively fast and highly accurate.
What are some disadvantages of symbolic-based AI systems?
– Maybe not everything can be feasibly expressed as following a sequence of
formal rules. The Chinese Room, various solution searches, meaning.
– Symbolic retrieval of memories can be brittle in being all-or-none.
– Many real-world situations are novel and so require adaptation rather than
fast pre-set actions. Example: everyday situations.
Of symbolic and neural network AI systems, which is most similar to the organisation of the brain? How? Comment on the simplicity of neuron organisation.
Neural networks. They are modelled on the organisation of neurons in the brain and allow for parallel rather than serial processing. The brain's individual processing units are much simpler and slower than a computer's, yet its computation in many areas is better, suggesting the brain's organisation is superior.
What constitutes a neural network?
A collection of interconnected neurons (or units). Some receive environmental input and some of the others give output to the environment.
What are hidden units? What are they aka?
Neurons/units in neural networks that have no direct connection to the environmental input or output; they connect only to other units.
How are neurons modelled artificially in neural networks?
Binary threshold unit (BTU): compute the excitation as the weighted sum of the inputs; if the excitation is above a certain threshold, the neuron is "excited" and becomes activated. When activated, the neuron is in the active state and outputs 1 rather than 0.
What is the formula for calculating the output of an artificial neuron (BTU)?
outj = g(Σ w(ij) in(i) - Θ); g(x) = 1 where x > 0; g(x) = 0 where x <= 0
g(x) is the activation function, here being a step function (“stepping” at 0)
Θ is the threshold
j is the jth threshold unit (with a unique Θ)
w(ij) is the weight of the ith input to the jth threshold unit
in(i) is the ith input to the jth threshold unit
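The BTU formula above can be sketched directly in code (a minimal illustration with names of my own, not course code):

```python
def btu_output(inputs, weights, theta):
    """Binary threshold unit: excitation is the weighted sum of the
    inputs minus the threshold theta; the step activation g outputs
    1 when excitation > 0 and 0 otherwise."""
    excitation = sum(w * x for w, x in zip(weights, inputs)) - theta
    return 1 if excitation > 0 else 0

# e.g. two inputs weighted 0.6 each with threshold 0.5:
print(btu_output([1, 0], [0.6, 0.6], 0.5))  # 1
print(btu_output([0, 0], [0.6, 0.6], 0.5))  # 0
```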
What is an activation function?
A normalising function that defines the output of a neuron given its calculated activation, i.e. the weighted sum of its inputs within a threshold unit.
Name and describe 3 activation functions.
- Step function
- output 1 once activation reaches a certain value, 0 otherwise
- Sigmoid
- calculate output as a point on the sigmoid curve
- g(x) = 1/(1 + exp(-x))
- Rectified Linear Unit (ReLU)
- output has threshold activation as with the step function, then increases linearly for further increases in activation
- e.g. with threshold of 0:
when x <= 0, g(x) = 0
when x > 0, g(x) = x
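The three activation functions can be sketched in plain Python (a minimal illustration, not course code; thresholds fixed at 0):

```python
import math

def step(x):
    """Step function: output 1 once activation exceeds 0, else 0."""
    return 1 if x > 0 else 0

def sigmoid(x):
    """Sigmoid: smooth curve from 0 to 1, g(x) = 1 / (1 + exp(-x))."""
    return 1 / (1 + math.exp(-x))

def relu(x):
    """Rectified Linear Unit: 0 up to the threshold (here 0),
    then increasing linearly with activation."""
    return x if x > 0 else 0

print(step(0.5), sigmoid(0.0), relu(2.0))  # 1 0.5 2.0
```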
What is Feedforward Architecture?
???
What is supervised learning?
???
What is recurrent architecture?
???
What are network layers in neural networks?
???
What is the difference between lateral and feedforward connections?
???
For a feedforward-based neural network of n layers, how many are hidden?
n - 2. You can "see" the input and output layers; all other layers connect only to each other or to the input/output layers, so they are hidden.
What is Strictly Layered Architecture?
A neural network system in which there are no lateral connections and each neuron may only connect to others in adjacent layers.
What does it mean for a network to be “fully connected”?
Each neuron is connected to all others it is able to be connected to; which other neurons each neuron can be connected to is limited by the architecture of the network.
What is the concept of Feedforward Pass?
The way in which input patterns go through layers in feedforward networks in series - i.e. layer-by-layer, whereas within each layer the signal is propagated in parallel to all neurons in the layer simultaneously (from the previous layer or input).
What is the concept of generalisation?
???
How does sensibility apply to generalisation?
???
When is generalisation useful in real-world applications?
Where:
- the relationship between input and output is unknown
- little available data
- data contain noise
What is underfitting?
When the model created by an AI system analysing data is too simple to explain the variance in the data, so it cannot fit or generalise correctly.
What is overfitting?
When the model created by an AI system analysing data is too complex in explaining the variance in the data: it pays too much attention to noise and detail, missing the actual underlying patterns in the data.
What is model complexity?
???
What is pruning?
Removing irrelevant neurons (those with no effect on the output) from a neural network to make it less complex.
What is growing?
Systematically and repeatedly adding neurons to a neural network by some approach or algorithm while doing so appears to remain beneficial.
What is an error function?
???
What is weight decay?
???
How do you implement weight decay to regularise the function?
???
What is validation with respect to neural networks?
???
How do you perform validation with neural networks?
???
What is early stopping?
???
What is generalisation error?
???
What does a small generalisation error suggest? Why?
???
How can you find a good neural generaliser?
???
What is a bias unit? Why are they used?
An added input to a neuron fixed at 1, with weight -Θ so that it absorbs the neuron's threshold. The output of the neuron then depends only on the inputs and their weights (the bias weight itself being learnable like any other), allowing adaptation in neurons that can yield greater flexibility in learning.
How do you implement an AND gate with a neuron?
Make Θ = 1.0 and g(x) = 1 when x > 0 and 0 when x <= 0 (equivalently, add a bias input fixed at +1 with weight -1.0). Make both input weights 0.6, so only when both inputs are true does the excitation (1.2) exceed Θ and the neuron give +1.
How do you implement an OR gate with a neuron?
Make Θ = 0.5 and g(x) = 1 when x > 0 and 0 when x <= 0 (equivalently, add a bias input fixed at +1 with weight -0.5 to remove the Θ threshold). Make both input weights 0.6, i.e. bigger than Θ, so if either or both inputs are true the neuron gives +1.
How do you implement a NOT gate with a neuron?
Make Θ = -0.5 and g(x) = 1 when x > 0 and 0 when x <= 0 (equivalently, add a bias input fixed at +1 with weight +0.5). Give the single input a weight of -1, so the neuron outputs 1 only when the input is 0.
What is an input space?
???
What is a hyperplane?
???
What is linear separation?
???
What is Excitation Algebra?
???
What is the Zero Excitation Line?
???
How many dimensions are in an input space for a neuron with n inputs?
The input space here will be n-dimensional.
How can you implement XOR with neurons in neural networks?
???
What does it mean for a neural network to have a 2-1-1 architecture?
Its first (input) layer has 2 nodes, its second has 1 node, and its third has 1 node.
What is a normal vector?
???
What is the idea of Universal Computation?
That any system which can represent the logical elements computers are built from (AND, OR, NOT, etc.) can form any logical expression a digital computer can. This is true of neural nets, but they can also do more, since an output is given for every analogue input, not just digital binary ones. Analogue inputs allow an infinite number of I/O mappings to be stored in a finite number of weights (and neurons). Feedforward networks are capable of any I/O mapping; recurrent networks of any I/S/O mapping (S being state: recurrent networks carry context because neurons can connect back to themselves).
True/false: neurons can’t represent all logical expressions that a digital computer can using a 2-layer architecture
False. They can. They can express logical gates like AND, OR, and NOT, and then build up logical expressions from them.
True/false: a neural net can’t do more than represent logical expressions.
Why?
False. An output is given for every analogue input, not just the digital binary values
True/false: In neural networks, an infinite number of I/O associations can be stored using a finite number of weights. Why?
True.
An output is produced for every point in a continuous (analogue) input space, so a finite set of weights defines a mapping over infinitely many possible inputs.
What is a Perceptron?
A single layer network with step activation (i.e. threshold) units capable of binary response, i.e. 0 or 1.
What is the Learning Algorithm for perceptrons?
???
What is the Convergence theorem for the perceptron learning algorithm?
That learning will converge in finite time if a solution exists.
What is Backpropagation? (aka backprop and BP)
???
How do you calculate the output for a single output unit?
outp = F(inp, w)
inp = input vector for a pattern p, outp = output for inp, w = weight state
What is an input vector?
???
What is a pattern?
???
What is a weight state?
???
What is LMS (Least Mean Squared) error?
???
How do you calculate the error for a single output unit?
E = (0.5) Σ [(outp – tp) ^ 2]
outp = the output for a pattern p, tp = target for pattern p
The 1/2 is to make differentiation easy btw
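The error formula can be sketched as a one-liner (function name my own):

```python
def lms_error(outputs, targets):
    """E = 0.5 * sum over patterns p of (out_p - t_p)^2."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

print(lms_error([1.0, 0.5], [1.0, 0.0]))  # 0.5 * (0 + 0.25) = 0.125
```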
What is weight space?
???
What is error-weight space?
???
What is an error-weight surface?
???
What is an error-weight surface like near a local minimum?
Like a quadratic bowl: viewed in 2D, a series of elliptical contours of equal error; viewed in 3D, an elliptical bowl.
What is Steepest Gradient Descent?
???
What is hill-climbing?
???
How does Steepest Gradient Descent work?
???
Where do gradients arise from?
???
How do you calculate the gradient between 2 x values?
m = ΔE/Δx, since E = y on the graph.
Between 2 x values, x and x + Δx,
m = ΔE/Δx = [E(x + Δx) - E(x)] / Δx
Since E(x) = x ^ 2, m = [(x + Δx) ^ 2 – x ^ 2] / Δx = [2x * Δx + Δx ^ 2] / Δx = 2x + Δx, which tends to 2x as Δx → 0.
So the gradient at any point is 2x.
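The derivation can be checked numerically with a finite-difference sketch (helper name my own):

```python
def numeric_gradient(E, x, dx=1e-6):
    """m = [E(x + dx) - E(x)] / dx, the finite-difference gradient."""
    return (E(x + dx) - E(x)) / dx

# For E(x) = x^2 the derivation gives m = 2x + dx, which tends to 2x:
print(round(numeric_gradient(lambda x: x ** 2, 3.0), 3))  # 6.0
```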
Why is the gradient always in the direction of the error?
???
Why do we move in the direction opposite to the gradient on the error surface to correct the error?
???
What is the learning rate? How is it notated?
???
What is x equivalent to on the error surface?
The neural weights
How do you find the corrective step from the learning rate and gradient?
Δxt = – α (dE/dx)t, x(t+1) = xt + Δxt
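The update rule can be sketched as a loop (function name and test values are my own; here E(x) = x^2, so dE/dx = 2x):

```python
def gradient_descent(grad, x, alpha=0.1, steps=50):
    """Repeatedly apply delta_x(t) = -alpha * (dE/dx)(t); x(t+1) = x(t) + delta_x(t)."""
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# E(x) = x^2 has gradient 2x, so descent approaches the minimum at x = 0:
print(gradient_descent(lambda x: 2 * x, 5.0))  # a value very close to 0
```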
What is gradient descent?
???
What is single layer gradient descent?
???
Why can you do gradient descent for each output separately in single layer feedforward networks?
Each weight leads from one input to one output unit, so a change to a weight connected to unit A will not affect unit B.
How do you define LMS error for Single layer gradient descent?
E = Σ Ep, where Ep = 0.5 * (outp – tp) ^ 2, i.e. E = 0.5 * Σp (outp – tp) ^ 2
How do you calculate the corrective change for a weight to reduce the error on a weight-error surface?
Δwi = – α * (δE / δwi)
How can you find error-weight gradients for weights wi leading to that output unit and then subsequently use these gradients to perform gradient descent?
δE / δwi = Σp (δEp / δoutp) * (δoutp / δexp) * (δexp / δwi)
= Σp (outp – tp) * (outp * (1-outp)) * inip
How can you compute a suggested weight change for backprop?
Suggested change for ith weight: Δwi = – α * (δE / δwi)
δE / δwi= Σp (outp – tp) * (outp * (1-outp)) * inip
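This update can be sketched for a single sigmoid output unit (the function names and the tiny dataset below are my own, not from the course):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def weight_gradients(patterns, targets, weights):
    """dE/dw_i = sum over p of (out_p - t_p) * out_p * (1 - out_p) * in_ip."""
    grads = [0.0] * len(weights)
    for inp, t in zip(patterns, targets):
        out = sigmoid(sum(w * x for w, x in zip(weights, inp)))
        delta = (out - t) * out * (1 - out)
        for i, x in enumerate(inp):
            grads[i] += delta * x
    return grads

def apply_step(weights, grads, alpha=0.5):
    """Suggested change: delta w_i = -alpha * dE/dw_i."""
    return [w - alpha * g for w, g in zip(weights, grads)]

# Two patterns (last component of each is a bias input fixed at 1):
patterns, targets = [[1, 0, 1], [1, 1, 1]], [0, 1]
w = [0.1, 0.1, 0.1]
for _ in range(200):
    w = apply_step(w, weight_gradients(patterns, targets, w))
```

After training, the unit's outputs move towards the targets for each pattern.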
How do you find weights to output unit k in a single or multi layer?
δE / δwjk = Σp (outkp - tkp) * outkp(1 - outkp) * outjp
Note: for a single layer, outjp = injp
How do you find weights to hidden unit j in the final or only hidden layer (single or multiple hidden layers)?
δE / δwij = Σk Σp (outkp - tkp) * outkp(1 - outkp) * wjk * outjp(1 - outjp) * outip
Note: if unit i is an input unit then outip = inip
How do you find weights to hidden unit i in the penultimate hidden layer?
δE / δwui = Σk Σp (outkp - tkp) * outkp(1 - outkp) * wjk * outjp(1 - outjp) * wij * outip(1 - outip) * outup
Note: if unit u is an input unit then outup = inup
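The two formulas above can be sketched for a tiny 2-2-1 network with sigmoid units and a single output (all names and the numeric check are my own assumptions):

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

def forward(inp, W1, W2):
    """W1[i][j] connects input i to hidden j; W2[j] connects
    hidden j to the single sigmoid output unit k."""
    hidden = [sig(sum(inp[i] * W1[i][j] for i in range(len(inp))))
              for j in range(len(W2))]
    out = sig(sum(h * w for h, w in zip(hidden, W2)))
    return hidden, out

def backprop_grads(inp, target, W1, W2):
    """For one pattern p and one output k:
    dE/dw_jk = (out_kp - t_kp) * out_kp(1 - out_kp) * out_jp
    dE/dw_ij = (out_kp - t_kp) * out_kp(1 - out_kp) * w_jk
               * out_jp(1 - out_jp) * in_ip"""
    hidden, out = forward(inp, W1, W2)
    delta_k = (out - target) * out * (1 - out)
    gW2 = [delta_k * h for h in hidden]
    gW1 = [[delta_k * W2[j] * hidden[j] * (1 - hidden[j]) * inp[i]
            for j in range(len(W2))] for i in range(len(inp))]
    return gW1, gW2
```

A finite-difference check of these gradients against the forward pass agrees with the formulas.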
How does the far-right out term in error derivatives change for layers further back in a neural network?
For layers further back, the far R.H.S. outup is replaced with wui * outup(1 - outup) times the output from the previous layer, outtp, and so on.
How does the far-right out term in error derivatives change for weights from the 1st hidden layer of a neural network?
The far R.H.S. out will be the in from the input unit in this case.
What is Multi-layer Training?
???
Why is the error-weight surface in the shape of a trough?
???
True/false: near a minimum, the error-weight surface is almost a quadratic bowl for non-linear sigmoid activation functions, and exactly a quadratic bowl for linear activation functions.
True
What is summation? WHy does it lead to complex surface features?
???
What is momentum and why is it used to aid gradient descent?
Analogous to physical momentum: keep the weight changing in the same direction until overcome by a large change from a large error gradient.
How do you calculate the weight change with momentum?
Δwij(t) = – α (δE/δwij)(t) + β Δwij(t-1)
The momentum coefficient β is between 0 and 1.
t is some measure of time; t comes immediately after t - 1.
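The update can be sketched for a single weight (function name mine; the gradient of E(w) = w^2 serves as a stand-in error gradient):

```python
def momentum_step(grad, w, prev_dw, alpha=0.1, beta=0.9):
    """dw(t) = -alpha * (dE/dw)(t) + beta * dw(t-1); then w(t+1) = w(t) + dw(t)."""
    dw = -alpha * grad(w) + beta * prev_dw
    return w + dw, dw

# Descend E(w) = w^2 (gradient 2w), carrying momentum between steps:
w, dw = 5.0, 0.0
for _ in range(300):
    w, dw = momentum_step(lambda x: 2 * x, w, dw)
# w has been carried down the quadratic bowl towards the minimum at 0
```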
How does momentum help, especially on plateaus?
– Makes bigger transitions when gradients point consistently in one direction.
– Simulates the ball accelerating down a constant incline or down a hill.
– Reduces time for learning when gradients are shallow, e.g. on plateaus.
How does momentum help on ravines?
– Adding a component that points in the previous transition direction damps
oscillations on ravines – as long as momentum coefficient < 1.
– Can speed up travel along the ravine bottom as it does on plateaus.
How does momentum help with local minima?
– May possibly allow gradient descent to shoot over shallow local minima.
– But could also cause gradient descent to shoot over global minimum.
– A momentum coefficient that will allow learning to shoot over local minima
and not the global minimum may not exist.
– In any case, the optimal momentum setting is not known a priori.
– So momentum does not really overcome local minima other than by luck.
How is Steepest gradient descent used?
Steepest gradient descent is used to guide the learning from random initial
weight states to weight states providing outputs closer to the given targets
given suitable neural topologies.
Is back propagation supervised learning? Why?
Yes. There are explicit supervised target output values.
What is the Ravine Problem?
???
Why is the Ravine Problem prevalent?
???
How does the Ravine Problem arise?
???