P3 - Architecture & Machine Learning Models Flashcards
What Are Neural Networks in NLP?
Neural networks process sequential data in NLP to understand language.
e.g.: translation, text generation, and chatbot responses.
How do NNs work?
Nodes in one layer are connected to the next.
The strength of each connection varies - this is called the weight.
NN learns by adjusting the weights of the connections.
(Some NNs have billions of parameters (weights) and hundreds of hidden layers.)
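A minimal sketch (not from the cards) of what "weighted connections between layers" means in practice, using NumPy with made-up layer sizes and random values:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=3)            # 3 input nodes
W1 = rng.normal(size=(3, 4))      # weights: input layer -> hidden layer (4 nodes)
W2 = rng.normal(size=(4, 2))      # weights: hidden layer -> output layer (2 nodes)

hidden = np.tanh(x @ W1)          # each hidden node sums its weighted inputs
output = hidden @ W2              # output nodes do the same with the hidden activations

print(output)                     # "learning" means adjusting W1 and W2
```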
What is the input of a NN?
Each word is first turned into a multidimensional vector using a word embedding algorithm.
This places similar words close together in the vector space (sketched in the example below).
e.g.: Word2Vec was created at Google in 2013 to help with search.
Alternative algorithms:
- GloVe (Global Vectors), Stanford 2014
- fastText, Facebook 2015
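A toy sketch of what an embedding gives you: each word maps to a vector, and similar words sit close together. The vectors below are invented for illustration; a real model (Word2Vec, GloVe, fastText) would learn them from text.

```python
import numpy as np

embeddings = {
    "cat": np.array([0.90, 0.80, 0.10]),
    "dog": np.array([0.85, 0.75, 0.15]),
    "car": np.array([0.10, 0.20, 0.95]),
}

def cosine(a, b):
    # cosine similarity: 1.0 means the vectors point the same way
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["dog"]))  # high: similar words
print(cosine(embeddings["cat"], embeddings["car"]))  # lower: unrelated words
```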
How are the relationships between words represented?
Vectors capture the relationships between words.
Similar vector offsets represent equivalent relationships between pairs of words (e.g. king → queen mirrors man → woman).
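A sketch of the classic analogy test, assuming the gensim library and a small pretrained GloVe model fetched via gensim.downloader (needs network access the first time):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # pretrained word vectors

# "king" - "man" + "woman" should land near "queen" if the royalty/gender
# relationships are captured consistently by the vector offsets.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```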
How is Word2Vec trained using written text?
Each word is compared to those typically found close to it in ordinary texts.
- Continuous Bag of Words (CBOW):
Tries to predict the “central” word in a phrase by looking at those nearby.
- Skip-gram:
Does the opposite - starts with the central word and predicts those likely to be before or after.
A Word2Vec model is trained with one of these two objectives, chosen at training time (see the sketch below).
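A minimal training sketch, assuming the gensim library (gensim 4.x; the cards don't name a library) and a toy corpus of tokenised sentences:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# sg=0 selects the CBOW objective, sg=1 selects skip-gram
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skip_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["cat"][:5])   # first few dimensions of the learned vector
```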
Simplified neural network process
- An input text is split into tokens (words or parts of words)
- A word embedding algorithm converts the tokens into vectors
- The vectors are passed to the neural network.
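A condensed sketch of that pipeline, assuming PyTorch and a made-up three-word vocabulary; real systems use learned tokenisers and much larger embeddings.

```python
import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "sat": 2}                    # token -> id
tokens = torch.tensor([[vocab[w] for w in "the cat sat".split()]])

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)   # tokens -> vectors
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, len(vocab)))

vectors = embed(tokens)           # shape: (1, 3, 8) - one vector per token
logits = net(vectors)             # the network's output for each position
print(logits.shape)               # torch.Size([1, 3, 3])
```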
What is the purpose of comparing the output vector to the correct vector in training a neural network?
It allows calculation of the error value, which is used for backpropagation to adjust the weights of the layers.
How does backpropagation help in training a neural network?
It adjusts the weights of the layers based on the error value, improving accuracy over time.
Why is training a neural network done repeatedly with large amounts of data?
Repeated training refines the weights, improving the model’s accuracy in predicting correct outputs.
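A toy training loop, assuming PyTorch, showing the compare-to-correct-output, backpropagate, adjust-weights cycle described in the cards above. The data is random and purely illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 4)                        # a single layer standing in for a full NN
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 8)                    # input vectors
targets = torch.randn(32, 4)                   # the "correct" output vectors

for step in range(100):                        # repeated training refines the weights
    predictions = model(inputs)
    error = loss_fn(predictions, targets)      # compare output to the correct vectors
    optimiser.zero_grad()
    error.backward()                           # backpropagation: gradient of the error
    optimiser.step()                           # adjust the weights
```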
What problem does a Recurrent Neural Network (RNN) solve compared with a standard NN?
A standard NN processes all its input data simultaneously, losing the order of the words and outputting an aggregate result of all the input vectors. An RNN preserves word order by processing tokens one at a time and feeding the result of the last inner layer back into the first, creating a form of memory.
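A sketch of that feedback loop, assuming PyTorch: tokens are processed one at a time and the hidden state from the previous step is fed back in as "memory".

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)

sequence = torch.randn(5, 8)                   # 5 word vectors, one per time step
hidden = torch.zeros(16)                       # initial memory

for word_vector in sequence:
    # each step sees the current word AND the previous hidden state
    hidden = cell(word_vector.unsqueeze(0), hidden.unsqueeze(0)).squeeze(0)

print(hidden.shape)                            # final state summarises the whole sequence
```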
Why can’t you use standard backpropagation in an RNN?
Because of the feedback process, you must use backpropagation through time (BPTT) to retrace each step where the output was fed back into the hidden layers.
What is the vanishing gradient problem in RNNs?
The influence of earlier words in a sequence diminishes over time, making it difficult for RNNs to learn long-term dependencies.
Why do RNNs struggle with long input sequences?
Due to the vanishing gradient problem, earlier words in a sequence have a much smaller effect on learning compared to later ones.
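A numerical illustration (not from the cards) of why the gradient vanishes: the gradient reaching an early time step is a product of many per-step factors, and if those factors are below 1 the product shrinks exponentially with sequence length.

```python
per_step_factor = 0.9          # an assumed typical |derivative| passed back each step

for steps in (5, 20, 50, 100):
    print(steps, per_step_factor ** steps)
# 5 -> ~0.59, 20 -> ~0.12, 50 -> ~0.005, 100 -> ~0.00003:
# earlier words end up with almost no influence on learning
```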
What is an LSTM network?
A type of RNN designed to handle long-term dependencies using memory cells and gating mechanisms.
What are the three gates in an LSTM?
- Forget gate – Discards irrelevant or outdated information.
- Input gate – Incorporates new information at each time step.
- Output gate – Passes part of the updated cell state to the next layer.
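A minimal sketch of one LSTM step written out by hand (assuming PyTorch) so the three gates are visible; in practice you would use nn.LSTM rather than coding this yourself.

```python
import torch
import torch.nn as nn

hidden_size, input_size = 16, 8
x = torch.randn(1, input_size)                 # current token's vector
h_prev = torch.zeros(1, hidden_size)           # previous hidden state
c_prev = torch.zeros(1, hidden_size)           # previous cell state (the memory cell)

def gate():
    # one linear layer per gate plus one for the candidate cell update
    return nn.Linear(input_size + hidden_size, hidden_size)

forget_g, input_g, output_g, candidate = gate(), gate(), gate(), gate()

combined = torch.cat([x, h_prev], dim=1)
f = torch.sigmoid(forget_g(combined))          # forget gate: what to discard
i = torch.sigmoid(input_g(combined))           # input gate: what new info to keep
o = torch.sigmoid(output_g(combined))          # output gate: what to expose
c_tilde = torch.tanh(candidate(combined))      # candidate new cell content

c_new = f * c_prev + i * c_tilde               # update the memory cell
h_new = o * torch.tanh(c_new)                  # pass part of the cell state onwards
print(h_new.shape)
```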
How do LSTMs improve upon standard RNNs?
Better at retaining important information over longer sequences.
Reduce the impact of vanishing gradients.
Why do LSTMs help prevent the vanishing gradient problem?
The memory cell allows information to bypass repetitive multiplication, preventing the exponential shrinking of gradients.
What are some limitations of LSTMs?
Still process one token at a time, making them inefficient.
More computationally expensive than RNNs.
Typically require padding sequences to a fixed length for batched training, which is inefficient, or splitting long sequences, which is disruptive.
Struggle with handling global context (distant token relationships).
When are LSTMs useful?
When sequence lengths are moderate (e.g., speech recognition).
When computational resources are limited (e.g., edge computing).
For domain-specific models whose pipelines are already optimised around LSTMs.