5: Learning Flashcards
What is NETTalk?
???
What is a neural network?
???
What is the PDP model?
Parallel Distributed Processing ???
What is symbolic AI?
???
How do neural networks differ from symbolic AI?
???
What are some advantages of symbolic-based AI systems?
– A symbolic algorithm can execute anything expressed as following a
sequence of formal rules.
– Large amounts of memorised information can be copied and retrieved
accurately ad infinitum.
– Information processing is relatively fast and highly accurate.
What are some disadvantages of symbolic-based AI systems?
– Maybe not everything can be feasibly expressed as following a sequence of
formal rules. The Chinese Room, various solution searches, meaning.
– Symbolic retrieval of memories can be brittle in being all-or-none.
– Many real-world situations are novel and so require adaptation rather than
fast pre-set actions. Example: everyday situations.
Of symbolic and neural network AI systems, which is most similar to the organisation of the brain? How? Comment on the simplicity of neuron organisation.
Neural networks. They are modelled on the organisation of neurons in the brain and allow for parallel rather than serial processing. The brain's individual processing units are much simpler and slower than a computer's, yet its computation in many areas is better, suggesting the brain's organisation is superior.
What constitutes a neural network?
A collection of interconnected neurons (or units). Some receive environmental input and some of the others give output to the environment.
What are hidden units? What are they aka?
Neurons/units in neural networks that have no direct connection to the environmental input or output; they connect only to other units.
How are neurons modelled artificially in neural networks?
Binary threshold unit (BTU): compute the excitation as the weighted sum of the inputs; if the excitation is above a certain threshold, the neuron is "excited" and becomes activated. When activated, the neuron is in the active state and outputs 1 rather than 0.
What is the formula for calculating the output of an artificial neuron (BTU)?
outj = g(Σ w(ij) in(i) - Θ); g(x) = 1 where x > 0; g(x) = 0 where x <= 0
g(x) is the activation function, here being a step function (“stepping” at 0)
Θ is the threshold
j is the jth threshold unit (with a unique Θ)
w(ij) is the weight of the ith input to the jth threshold unit
in(i) is the ith input to the jth threshold unit
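The BTU formula above can be sketched directly in code (a minimal illustration with names of my own, not course code):

```python
def btu_output(inputs, weights, theta):
    """Binary threshold unit: excitation is the weighted sum of the
    inputs minus the threshold theta; the step activation g outputs
    1 when excitation > 0 and 0 otherwise."""
    excitation = sum(w * x for w, x in zip(weights, inputs)) - theta
    return 1 if excitation > 0 else 0

# e.g. two inputs weighted 0.6 each with threshold 0.5:
print(btu_output([1, 0], [0.6, 0.6], 0.5))  # 1
print(btu_output([0, 0], [0.6, 0.6], 0.5))  # 0
```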
What is an activation function?
A normalising function that defines the output of a neuron given its calculated activation, i.e. the weighted sum of its inputs within a threshold unit.
Name and describe 3 activation functions.
- Step function
- output 1 once activation reaches a certain value, 0 otherwise
- Sigmoid
- calculate output as a point on the sigmoid curve
- g(x) = 1/(1 + exp(-x))
- Rectified Linear Unit (ReLU)
- output has threshold activation as with the step function, then increases linearly for further increases in activation
- e.g. with threshold of 0:
when x <= 0, g(x) = 0
when x > 0, g(x) = x
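The three activation functions can be sketched in plain Python (a minimal illustration, not course code; thresholds fixed at 0):

```python
import math

def step(x):
    """Step function: output 1 once activation exceeds 0, else 0."""
    return 1 if x > 0 else 0

def sigmoid(x):
    """Sigmoid: smooth curve from 0 to 1, g(x) = 1 / (1 + exp(-x))."""
    return 1 / (1 + math.exp(-x))

def relu(x):
    """Rectified Linear Unit: 0 up to the threshold (here 0),
    then increasing linearly with activation."""
    return x if x > 0 else 0

print(step(0.5), sigmoid(0.0), relu(2.0))  # 1 0.5 2.0
```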
What is Feedforward Architecture?
???
What is supervised learning?
???
What is recurrent architecture?
???
What are network layers in neural networks?
???
What is the difference between lateral and feedforward connections?
???
For a feedforward-based neural network of n layers, how many are hidden?
n - 2. You can "see" the input and output layers; all other layers connect only to each other or to the input/output layers, so they are hidden.
What is Strictly Layered Architecture?
A neural network system in which there are no lateral connections and each neuron may only connect to others in adjacent layers.
What does it mean for a network to be “fully connected”?
Each neuron is connected to all others it is able to be connected to; which other neurons each neuron can be connected to is limited by the architecture of the network.
What is the concept of Feedforward Pass?
The way in which input patterns go through layers in feedforward networks in series - i.e. layer-by-layer, whereas within each layer the signal is propagated in parallel to all neurons in the layer simultaneously (from the previous layer or input).
What is the concept of generalisation?
???
How does sensibility apply to generalisation?
???
When is generalisation useful in real-world applications?
Where:
- the relationship between input and output is unknown
- little available data
- data contain noise
What is underfitting?
When the model created by an AI system analysing data is too simple to explain the variance in the data, so it cannot fit or generalise correctly.
What is overfitting?
When the model created by an AI system analysing data is too complex in explaining the variance in the data: it pays too much attention to noise and detail, missing the actual underlying patterns in the data.
What is model complexity?
???
What is pruning?
Removing irrelevant neurons (those with no effect on the output) from a neural network to make it less complex.
What is growing?
Systematically and repeatedly adding neurons to a neural network by some approach or algorithm while doing so appears to remain beneficial.
What is an error function?
???
What is weight decay?
???
How do you implement weight decay to regularise the function?
???
What is validation with respect to neural networks?
???
How do you perform validation with neural networks?
???
What is early stopping?
???
What is generalisation error?
???
What does a small generalisation error suggest? Why?
???
How can you find a good neural generaliser?
???
What is a bias unit? Why are they used?
An added input to a neuron fixed at 1, with weight -Θ so that it absorbs the neuron's threshold. The output of the neuron then depends only on the inputs and their weights (the bias weight itself being learnable like any other), allowing adaptation in neurons that can yield greater flexibility in learning.
How do you implement an AND gate with a neuron?
Make Θ = 1.0 and g(x) = 1 when x > 0 and 0 when x <= 0 (equivalently, add a bias input fixed at +1 with weight -1.0). Make both input weights 0.6, so only when both inputs are true does the excitation (1.2) exceed Θ and the neuron give +1.
How do you implement an OR gate with a neuron?
Make Θ = 0.5 and g(x) = 1 when x > 0 and 0 when x <= 0 (equivalently, add a bias input fixed at +1 with weight -0.5 to remove the Θ threshold). Make both input weights 0.6, i.e. bigger than Θ, so if either or both inputs are true the neuron gives +1.
How do you implement a NOT gate with a neuron?
Make Θ = -0.5 and g(x) = 1 when x > 0 and 0 when x <= 0 (equivalently, add a bias input fixed at +1 with weight +0.5). Give the single input a weight of -1, so the neuron outputs 1 only when the input is 0.
What is an input space?
???
What is a hyperplane?
???
What is linear separation?
???
What is Excitation Algebra?
???
What is the Zero Excitation Line?
???
How many dimensions are in an input space for a neuron with n inputs?
The input space here will be n-dimensional.
How can you implement XOR with neurons in neural networks?
???
What does it mean for a neural network to have a 2-1-1 architecture?
Its first (input) layer has 2 nodes, its second has 1 node, and its third has 1 node.
What is a normal vector?
???
What is the idea of Universal Computation?
That any system which can represent the logical elements computers are built from (AND, OR, NOT, etc.) can form any logical expression a digital computer can. This is true of neural nets, but they can also do more, since an output is given for every analogue input, not just digital binary ones. Analogue inputs allow an infinite number of I/O mappings to be stored in a finite number of weights (and neurons). Feedforward networks are capable of any I/O mapping; recurrent networks of any I/S/O mapping (S being state: recurrent networks carry context because neurons can connect back to themselves).
True/false: neurons can’t represent all logical expressions that a digital computer can using a 2-layer architecture
False. They can. They can express logical gates like AND, OR, and NOT, and then build up logical expressions from them.
True/false: a neural net can’t do more than represent logical expressions.
Why?
False. An output is given for every analogue input, not just the digital binary values
True/false: In neural networks, an infinite number of I/O associations can be stored using a finite number of weights. Why?
True.
An output is produced for every point in a continuous (analogue) input space, so a finite set of weights defines a mapping over infinitely many possible inputs.
What is a Perceptron?
A single layer network with step activation (i.e. threshold) units capable of binary response, i.e. 0 or 1.
What is the Learning Algorithm for perceptrons?
???
What is the Convergence theorem for the perceptron learning algorithm?
That learning will converge in finite time if a solution exists.
What is Backpropagation? (aka backprop and BP)
???
How do you calculate the output for a single output unit?
outp = F(inp, w)
inp = input vector for a pattern p, outp = output for inp, w = weight state
What is an input vector?
???
What is a pattern?
???
What is a weight state?
???
What is LMS (Least Mean Squared) error?
???
How do you calculate the error for a single output unit?
E = (0.5) Σ [(outp – tp) ^ 2]
outp = the output for a pattern p, tp = target for pattern p
The 1/2 is to make differentiation easy btw
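The error formula can be sketched as a one-liner (function name my own):

```python
def lms_error(outputs, targets):
    """E = 0.5 * sum over patterns p of (out_p - t_p)^2."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

print(lms_error([1.0, 0.5], [1.0, 0.0]))  # 0.5 * (0 + 0.25) = 0.125
```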
What is weight space?
???
What is error-weight space?
???
What is an error-weight surface?
???
What is an error-weight surface like near a local minimum?
Like a quadratic bowl: viewed in 2D, a series of elliptical contours of equal error; viewed in 3D, an elliptical bowl.
What is Steepest Gradient Descent?
???
What is hill-climbing?
???
How does Steepest Gradient Descent work?
???
Where do gradients arise from?
???
How do you calculate the gradient between 2 x values?
m = ΔE/Δx, since E = y on the graph.
Between 2 x values, x and x + Δx,
m = ΔE/Δx = [E(x + Δx) - E(x)] / Δx
Since E(x) = x ^ 2, m = [(x + Δx) ^ 2 – x ^ 2] / Δx = [2x * Δx + Δx ^ 2] / Δx = 2x + Δx, which tends to 2x as Δx → 0.
So the gradient at any point is 2x.
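The derivation can be checked numerically with a finite-difference sketch (helper name my own):

```python
def numeric_gradient(E, x, dx=1e-6):
    """m = [E(x + dx) - E(x)] / dx, the finite-difference gradient."""
    return (E(x + dx) - E(x)) / dx

# For E(x) = x^2 the derivation gives m = 2x + dx, which tends to 2x:
print(round(numeric_gradient(lambda x: x ** 2, 3.0), 3))  # 6.0
```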
Why is the gradient always in the direction of the error?
???
Why do we move in the direction opposite to the gradient on the error surface to correct the error?
???
What is the learning rate? How is it notated?
???
What is x equivalent to on the error surface?
The neural weights
How do you find the corrective step from the learning rate and gradient?
Δxt = – α (dE/dx)t, x(t+1) = xt + Δxt
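The update rule can be sketched as a loop (function name and test values are my own; here E(x) = x^2, so dE/dx = 2x):

```python
def gradient_descent(grad, x, alpha=0.1, steps=50):
    """Repeatedly apply delta_x(t) = -alpha * (dE/dx)(t); x(t+1) = x(t) + delta_x(t)."""
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# E(x) = x^2 has gradient 2x, so descent approaches the minimum at x = 0:
print(gradient_descent(lambda x: 2 * x, 5.0))  # a value very close to 0
```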
What is gradient descent?
???
What is single layer gradient descent?
???
Why can you do gradient descent for each output separately in single layer feedforward networks?
Each weight leads from one input to one output unit, so a change to a weight connected to unit A will not affect unit B.
How do you define LMS error for Single layer gradient descent?
E = Σ Ep, where Ep = 0.5 * (outp – tp) ^ 2, i.e. E = 0.5 * Σp (outp – tp) ^ 2
How do you calculate the corrective change for a weight to reduce the error on a weight-error surface?
Δwi = – α * (δE / δwi)
How can you find error-weight gradients for weights wi leading to that output unit and then subsequently use these gradients to perform gradient descent?
δE / δwi = Σp (δEp / δoutp) * (δoutp / δexp) * (δexp / δwi)
= Σp (outp – tp) * (outp * (1-outp)) * inip
How can you compute a suggested weight change for backprop?
Suggested change for ith weight: Δwi = – α * (δE / δwi)
δE / δwi= Σp (outp – tp) * (outp * (1-outp)) * inip
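This update can be sketched for a single sigmoid output unit (the function names and the tiny dataset below are my own, not from the course):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def weight_gradients(patterns, targets, weights):
    """dE/dw_i = sum over p of (out_p - t_p) * out_p * (1 - out_p) * in_ip."""
    grads = [0.0] * len(weights)
    for inp, t in zip(patterns, targets):
        out = sigmoid(sum(w * x for w, x in zip(weights, inp)))
        delta = (out - t) * out * (1 - out)
        for i, x in enumerate(inp):
            grads[i] += delta * x
    return grads

def apply_step(weights, grads, alpha=0.5):
    """Suggested change: delta w_i = -alpha * dE/dw_i."""
    return [w - alpha * g for w, g in zip(weights, grads)]

# Two patterns (last component of each is a bias input fixed at 1):
patterns, targets = [[1, 0, 1], [1, 1, 1]], [0, 1]
w = [0.1, 0.1, 0.1]
for _ in range(200):
    w = apply_step(w, weight_gradients(patterns, targets, w))
```

After training, the unit's outputs move towards the targets for each pattern.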
How do you find weights to output unit k in a single or multi layer?
δE / δwjk = Σp (outkp - tkp) * outkp(1 - outkp) * outjp
Note: for a single layer, outjp = injp
How do you find weights to hidden unit j in the final or only hidden layer (single or multiple hidden layers)?
δE / δwij = Σk Σp (outkp - tkp) * outkp(1 - outkp) * wjk * outjp(1 - outjp) * outip
Note: if unit i is an input unit then outip = inip
How do you find weights to hidden unit i in the penultimate hidden layer?
δE / δwui = Σk Σp (outkp - tkp) * outkp(1 - outkp) * wjk * outjp(1 - outjp) * wij * outip(1 - outip) * outup
Note: if unit u is an input unit then outup = inup
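The two formulas above can be sketched for a tiny 2-2-1 network with sigmoid units and a single output (all names and the numeric check are my own assumptions):

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

def forward(inp, W1, W2):
    """W1[i][j] connects input i to hidden j; W2[j] connects
    hidden j to the single sigmoid output unit k."""
    hidden = [sig(sum(inp[i] * W1[i][j] for i in range(len(inp))))
              for j in range(len(W2))]
    out = sig(sum(h * w for h, w in zip(hidden, W2)))
    return hidden, out

def backprop_grads(inp, target, W1, W2):
    """For one pattern p and one output k:
    dE/dw_jk = (out_kp - t_kp) * out_kp(1 - out_kp) * out_jp
    dE/dw_ij = (out_kp - t_kp) * out_kp(1 - out_kp) * w_jk
               * out_jp(1 - out_jp) * in_ip"""
    hidden, out = forward(inp, W1, W2)
    delta_k = (out - target) * out * (1 - out)
    gW2 = [delta_k * h for h in hidden]
    gW1 = [[delta_k * W2[j] * hidden[j] * (1 - hidden[j]) * inp[i]
            for j in range(len(W2))] for i in range(len(inp))]
    return gW1, gW2
```

A finite-difference check of these gradients against the forward pass agrees with the formulas.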
How does the far-right out term in error derivatives change for layers further back in a neural network?
For layers further back, the far R.H.S. outup is replaced with wui * outup(1 - outup) times the output from the previous layer, outtp, and so on.
How does the far-right out term in error derivatives change for weights from the 1st hidden layer of a neural network?
The far R.H.S. out will be the in from the input unit in this case.
What is Multi-layer Training?
???
Why is the error-weight surface in the shape of a trough?
???
True/false: near a minimum, the error-weight surface is almost a quadratic bowl for non-linear sigmoid activation functions, and exactly a quadratic bowl for linear activation functions.
True
What is summation? WHy does it lead to complex surface features?
???
What is momentum and why is it used to aid gradient descent?
Analogous to physical momentum: keep the weight changing in the same direction until overcome by a large change from a large error gradient.
How do you calculate the weight change with momentum?
Δwij(t) = – α (δE/δwij)(t) + β Δwij(t-1)
The momentum coefficient β is between 0 and 1.
t is some measure of time; t comes immediately after t - 1.
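The update can be sketched for a single weight (function name mine; the gradient of E(w) = w^2 serves as a stand-in error gradient):

```python
def momentum_step(grad, w, prev_dw, alpha=0.1, beta=0.9):
    """dw(t) = -alpha * (dE/dw)(t) + beta * dw(t-1); then w(t+1) = w(t) + dw(t)."""
    dw = -alpha * grad(w) + beta * prev_dw
    return w + dw, dw

# Descend E(w) = w^2 (gradient 2w), carrying momentum between steps:
w, dw = 5.0, 0.0
for _ in range(300):
    w, dw = momentum_step(lambda x: 2 * x, w, dw)
# w has been carried down the quadratic bowl towards the minimum at 0
```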
How does momentum help, especially on plateaus?
– Makes bigger transitions when gradients point consistently in one direction.
– Simulates the ball accelerating down a constant incline or down a hill.
– Reduces time for learning when gradients are shallow, e.g. on plateaus.
How does momentum help on ravines?
– Adding a component that points in the previous transition direction damps
oscillations on ravines – as long as momentum coefficient < 1.
– Can speed up travel along the ravine bottom as it does on plateaus.
How does momentum help with local minima?
– May possibly allow gradient descent to shoot over shallow local minima.
– But could also cause gradient descent to shoot over global minimum.
– A momentum coefficient that will allow learning to shoot over local minima
and not the global minimum may not exist.
– In any case, the optimal momentum setting is not known a priori.
– So momentum does not really overcome local minima other than by luck.
How is Steepest gradient descent used?
Steepest gradient descent is used to guide the learning from random initial
weight states to weight states providing outputs closer to the given targets
given suitable neural topologies.
Is back propagation supervised learning? Why?
Yes. There are explicit supervised target output values.
What is the Ravine Problem?
???
Why is the Ravine Problem prevalent?
???
How does the Ravine Problem arise?
???