class 14 15 Flashcards
What are sequential domains?
Learning in sequential domains is different from learning in static domains. In static domains each sample is independent and identically distributed. In sequential domains there is a dependency among the points of a sequence: an instance at time t depends on the instance at time t-1, so overall the instances at different time steps are dependent on one another.
Explain what static data and sequential data are in detail.
Static data: learn the probability distribution of the output given the input.
P(o|x): we learn a probability distribution because it is robust against noise.
Type of x: a fixed-size tuple.
o: a classification or regression target.
Sequential data: P(o|x)
Type of x: a sequence x(1), x(2), ..., x(t), where each x(t) has a static type.
o: can be either static or a sequence.
What is a sequence, what is sequential transduction?
A sequence is either empty or an ordered pair (h, t), where the head h is a vertex (element) and the tail t is a sequence.
Sequential transduction:
Let X and O be the input and output label spaces. A transduction transforms an input sequence into an output sequence.
A general transduction T is a subset of X* x O*.
Restricting it to a function: T: X* -> O*.
A transduction T(.) is algebraic if it has a limited memory.
If the transduction has a finite (limited) memory k, an instance at time t can only depend on times t, t-1, t-2, ..., t-k.
But sequences can have variable length, therefore we need a fixed-size window: every time you make a prediction, you make it from that window.
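A minimal sketch of such a fixed-window (algebraic) predictor in Python; the window size k and the predict function are illustrative placeholders, not something from the lecture:

```python
import numpy as np

def windowed_predictions(x, predict, k=3):
    """Algebraic (finite-memory) transduction: each output o(t) is computed
    only from the window x(t-k) ... x(t).
    `predict` is any hypothetical function mapping a fixed-size window to one output."""
    outputs = []
    for t in range(len(x)):
        # pad the start of the sequence so the window always has k+1 entries
        window = [x[max(i, 0)] for i in range(t - k, t + 1)]
        outputs.append(predict(np.array(window)))
    return outputs

# usage: a toy "predictor" that just averages the window
seq = [1.0, 2.0, 3.0, 4.0, 5.0]
print(windowed_predictions(seq, predict=np.mean, k=2))
```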
What is causality?
What is recursive state representation?
A transduction T(.) is causal if the output at time t does not depend on future inputs at times t+1, t+2, ...
Recursive State Representation:
A recursive state representation exists only if the transduction T is causal.
The output depends on hidden state variables (state space H):
h(t) = f(h(t-1), x(t), t)
o(t) = g(h(t), x(t), t)
f: H x X -> H
g: H x X -> O
The transduction T is stationary if f(.) and g(.) do not depend on t.
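A small Python sketch of a stationary, causal recursive state representation; the concrete f, g, and initial state h0 below are toy assumptions chosen just to show the update pattern:

```python
def run_transduction(xs, f, g, h0):
    """Stationary, causal recursive state representation:
    h(t) = f(h(t-1), x(t)),  o(t) = g(h(t), x(t)).
    f and g are arbitrary user-supplied functions (assumptions here)."""
    h = h0
    outputs = []
    for x in xs:
        h = f(h, x)              # state update: depends only on the past
        outputs.append(g(h, x))  # output from current state and input
    return outputs

# usage: running sum as the state, output = current state
xs = [1, 2, 3, 4]
print(run_transduction(xs, f=lambda h, x: h + x, g=lambda h, x: h, h0=0))
# -> [1, 3, 6, 10]
```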
Time Shift Operator and Graphical Description
q^-1 is the time-shift operator: applying q^-1 to h(t) gives h(t-1), i.e. a dependency one time step backward.
Graphical description: x(t) feeds into the hidden state h(t), which has a q^-1 self-loop, and h(t) feeds into the output o(t).
Time Unfolding
The unfolded network has a feed-forward structure. Weights are shared (replicated), meaning that the same weights are used at different time steps.
Examples of Sequential Transductions
Sequence classification (n->1), I-O transduction (n->n), sequence generation (1->n), sequence transduction (n->m)
(mappings between input and output objects)
RECURRENT NEURAL NETWORKS
1)Shallow Recurrent Neural Networks
Shallow recurrent neural networks are non-linear.
h(t) = f(U x(t) + W h(t-1) + b)
o(t) = g(V h(t) + c)
With a tanh hidden layer and a softmax output:
h(t) = tanh(U x(t) + W h(t-1) + b)
o(t) = V h(t) + c
y(t) = softmax(o(t))
Loss = Σ_t -log p_model(y(t) | {x(1), ..., x(t)})
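A hedged numpy sketch of this shallow RNN forward pass and loss; the dimensions, random initialization, and toy data are assumptions made only for illustration (note that the same U, W, V are reused at every time step, i.e. the shared weights from the unfolding view):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions (assumptions, not from the slides)
n_in, n_hid, n_out, T = 4, 8, 3, 5

U = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (shared over time)
V = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output
b = np.zeros(n_hid)
c = np.zeros(n_out)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

xs = rng.normal(size=(T, n_in))        # input sequence x(1)..x(T)
targets = rng.integers(0, n_out, T)    # one class label per time step

h = np.zeros(n_hid)
loss = 0.0
for t in range(T):
    h = np.tanh(U @ xs[t] + W @ h + b)   # h(t) = tanh(U x(t) + W h(t-1) + b)
    o = V @ h + c                        # o(t) = V h(t) + c
    y = softmax(o)                       # y(t) = softmax(o(t))
    loss += -np.log(y[targets[t]])       # Loss = Σ_t -log p_model(y(t) | x(1..t))
print(loss)
```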
Some Architectural features for RNN
1) Shortcut connections:
The input is also connected directly to the output (the output depends not only on the hidden state but also on the input):
o(t) = V h(t) + V' x(t) + b, with separate weight matrices for the hidden state and the input.
2) Higher order:
The hidden representation has connections through q^-1 and q^-2 (i.e. to h(t-1) and h(t-2)):
h(t) = W(1) h(t-1) + W(2) h(t-2) + V x(t) + c
3) feedback from output:
The output of the previous time step is fed into the next time step's hidden state:
h(t) = U x(t) + W h(t-1) + Z o(t-1) + c
4) Teacher forcing:
The previous time step's target value (instead of the model's own output) is fed into the next time step's hidden layer.
All of these are causal transductions, which means the output depends only on the past. But:
5) Bidirectional RNN: the possibility to look at the future.
What is a bidirectional RNN and how does it differ from the other RNN variants?
In a bidirectional RNN the hidden representation takes both past and future inputs into account, therefore the transduction is no longer causal, since the output depends on future inputs.
Forward (past) state: h_p(t) = U x(t) + W_p h_p(t-1)
Backward (future) state: h_f(t) = U x(t) + W_f h_f(t+1)
The output at time t combines both states. Example application: DNA sequences, where the context on both sides of a position matters.
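A rough numpy sketch of the two hidden chains of a bidirectional RNN; the matrix names (Up, Wp, Uf, Wf), the tanh nonlinearity, and the concatenation of the two states are assumptions for illustration:

```python
import numpy as np

def bidirectional_hidden(xs, Up, Wp, Uf, Wf):
    """Two hidden chains of a bidirectional RNN:
    forward:  hp(t) = tanh(Up x(t) + Wp hp(t-1))   # looks at the past
    backward: hf(t) = tanh(Uf x(t) + Wf hf(t+1))   # looks at the future
    The output at time t can then combine hp(t) and hf(t)."""
    T, n_hid = len(xs), Wp.shape[0]
    hp = np.zeros((T, n_hid))
    hf = np.zeros((T, n_hid))
    prev = np.zeros(n_hid)
    for t in range(T):                     # left-to-right pass over the past
        prev = np.tanh(Up @ xs[t] + Wp @ prev)
        hp[t] = prev
    nxt = np.zeros(n_hid)
    for t in reversed(range(T)):           # right-to-left pass over the future
        nxt = np.tanh(Uf @ xs[t] + Wf @ nxt)
        hf[t] = nxt
    return np.concatenate([hp, hf], axis=1)

# usage with toy shapes
rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 4))
Up, Uf = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
Wp, Wf = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
print(bidirectional_hidden(xs, Up, Wp, Uf, Wf).shape)  # (6, 10)
```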
What are backpropagation through time and real-time recurrent learning?
We use these to train RNNs. In backpropagation through time we process the full sequence: if the sequence has length 1000, we have to sum the contributions of every time step to compute the full gradient.
In real-time recurrent learning we compute the partial derivatives incrementally as we move forward through the sequence.
In the end both compute the same gradient, just in different ways. For real-time recurrent learning, the memory requirement and the time complexity are larger than for backpropagation through time.
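A compact Python sketch of how backpropagation through time stores all hidden states in the forward pass and then sums gradient contributions backwards over the whole sequence; the tanh/softmax setup, the shapes, and the choice to return only the gradient of W are simplifying assumptions:

```python
import numpy as np

def bptt_grad_W(xs, targets, U, W, V, b, c):
    """Minimal BPTT sketch (assumed tanh hidden units, softmax output,
    cross-entropy loss; only dL/dW is returned to keep it short)."""
    T, n_hid = len(xs), W.shape[0]
    hs = np.zeros((T + 1, n_hid))                 # hs[t] = h(t), hs[0] = h(0) = 0
    ys = []
    for t in range(T):                            # forward pass, storing every h(t)
        hs[t + 1] = np.tanh(U @ xs[t] + W @ hs[t] + b)
        o = V @ hs[t + 1] + c
        e = np.exp(o - o.max())
        ys.append(e / e.sum())
    dW = np.zeros_like(W)
    dh_next = np.zeros(n_hid)                     # gradient flowing from t+1 back to t
    for t in reversed(range(T)):                  # backward pass over the full sequence
        do = ys[t].copy(); do[targets[t]] -= 1.0  # dL/do for softmax + cross-entropy
        dh = V.T @ do + dh_next
        dz = (1.0 - hs[t + 1] ** 2) * dh          # tanh derivative
        dW += np.outer(dz, hs[t])                 # contribution of time step t, summed up
        dh_next = W.T @ dz
    return dW

# usage with toy shapes
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 8, 3, 5
U = rng.normal(scale=0.1, size=(n_hid, n_in)); W = rng.normal(scale=0.1, size=(n_hid, n_hid))
V = rng.normal(scale=0.1, size=(n_out, n_hid)); b = np.zeros(n_hid); c = np.zeros(n_out)
xs = rng.normal(size=(T, n_in)); targets = rng.integers(0, n_out, T)
print(bptt_grad_W(xs, targets, U, W, V, b, c).shape)  # (8, 8)
```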