Machine Learning and Protein Structure Flashcards
Week 10 Lecture 2
What are artificial neural networks?
Artificial neural networks comprise many switching units (artificial neurons) that are connected according to a specific network architecture. The objective of an artificial neural network is to learn how to transform inputs into meaningful outputs.
Activation functions
- tanh
- sigmoid: originally the most popular non-linear activation function used in neural nets
- rectifier (ReLU): now the most commonly used because it works well in deep networks; technically still a non-linear function
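A minimal sketch of the three activation functions listed above, using NumPy (the function names are my own):

```python
import numpy as np

def tanh(x):
    # squashes inputs into the range (-1, 1)
    return np.tanh(x)

def sigmoid(x):
    # squashes inputs into the range (0, 1); historically the default choice
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # rectifier: zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)
```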
What is a transformer?
- Take a set of vectors representing something and transform them into a new set of vectors which have extra contextual information added
- These are sequence-to-sequence methods
- They treat the data as a set and so are permutation invariant (they don’t take order into account, although this can be fixed by adding a position term to each token’s vector encoding)
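A minimal sketch of one common way to add that position term, the sinusoidal encoding from the original transformer paper (function name and shapes are my own choices; it assumes an even embedding dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Each position gets a fixed pattern of sines and cosines that is
    # added to its token vector, giving the model order information.
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# token_vectors: (seq_len, d_model) array of embeddings
# token_vectors = token_vectors + sinusoidal_positional_encoding(seq_len, d_model)
```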
Simple 2-layer neural network
- A network is made smarter by adding layers
- Connecting every node to every node in the adjacent layers means each layer can be computed as a matrix multiplication
- The weight matrices between layers (shown as blue lines in the lecture figure) define the hidden layers
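A minimal sketch of a fully connected 2-layer network as matrix multiplications (the layer sizes and random weights are placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example sizes (my own choices): 4 inputs, 8 hidden units, 2 outputs
W1 = rng.normal(size=(4, 8))   # input -> hidden weight matrix
W2 = rng.normal(size=(8, 2))   # hidden -> output weight matrix

def relu(x):
    return np.maximum(0.0, x)

def two_layer_net(x):
    # Because every node connects to every node in the next layer,
    # each layer is a matrix multiplication followed by a non-linearity.
    hidden = relu(x @ W1)
    return hidden @ W2

output = two_layer_net(rng.normal(size=(1, 4)))   # shape (1, 2)
```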
How do we embed tokens?
- Tokens need to be converted into numbers to use them in the neural network
- Embed them in a high dimensional space and set each token to a vector of numbers
- The numbers are initially random and the embeddings are stored in a lookup table; the same token always maps to the same vector
- The vector has a position in high dimensional space
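A minimal sketch of an embedding lookup table, assuming a toy vocabulary (the tokens and embedding dimension are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat"]    # toy vocabulary
d_model = 8                      # embedding dimension (arbitrary choice)

# Lookup table: one initially random vector per token.
embedding_table = rng.normal(size=(len(vocab), d_model))
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def embed(tokens):
    # The same token always maps to the same row of the table.
    ids = [token_to_id[t] for t in tokens]
    return embedding_table[ids]   # (len(tokens), d_model)

vectors = embed(["the", "cat", "sat"])
```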
How do we measure the similarity between two vectors?
- Represent the tokens as vectors and use Euclidean distance in any number of dimensions
- Cosine similarity is better because:
1. It measures the angle between the two vectors
2. It is independent of the vector lengths
- The smaller the angle, the more similar the vectors are to each other
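A minimal sketch comparing Euclidean distance and cosine similarity for two token vectors:

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Dot product divided by the vector lengths: depends only on the angle,
    # so it is independent of how long the vectors are.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction, different length
print(euclidean_distance(a, b))  # non-zero
print(cosine_similarity(a, b))   # 1.0 (smallest possible angle)
```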
How are word embeddings done?
- Place the words randomly initially in high dimensional space
- Process the words such that similar or related words are close together
Transformer encoder
Input vectors → transformed vectors → final outputs
Scaled dot-product attention
- Calculates the dot product of every pair of vectors in the input
- Q (queries) and K (keys) are tensors built from the same set of vectors: the vectors assigned to the words in the sentence
- The system tries to calculate how similar the words are
- K is transposed and multiplied with Q; the result is then normalised by dividing each dot product by the square root of the number of dimensions
- Run this through a softmax function so that all the values of a row add up to 1
- These values become the attention weights and we use these weights to generate a weighted average of the input tensor V
- The final output is a weighted sum of the original vectors, with the softmax values as the weights
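A minimal NumPy sketch of the steps above (shapes and names are my own; real frameworks implement this more efficiently):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) arrays; for self-attention they are the same vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of every pair of vectors
    weights = softmax(scores, axis=-1)  # each row now sums to 1 (attention weights)
    return weights @ V                  # weighted average of the value vectors

# Self-attention: queries, keys and values are all the same input vectors.
# X = embed(sentence_tokens); output = scaled_dot_product_attention(X, X, X)
```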
Scaled dot-product self-attention
- Dot products are calculated between all pairs of input vectors
- The dot products are divided by the square root of the input vector dimension
- Similarities are normalised so that each row of the similarity matrix adds up to 1 using the softmax function. This is called the attention matrix.
- Each row of the attention matrix is used as the weights in a weighted average of the value input vectors.
- These weighted averages become the new output vectors.
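The whole procedure can be written in one line (standard notation; $d_k$ is the dimension of the key/query vectors, i.e. the scaling factor mentioned above):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V
```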
Multi-head attention
- Scaled dot-product attention on its own has no trainable weights, so it cannot be adjusted
- Q, K, and V can each be run through linear layers, which have weights and can be trained
- SDPA takes the place of the non-linearity function in the original neural nets
- You can have multiple SDPA blocks (heads) that all take in the same input and learn to focus on different aspects; see the sketch below
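A minimal sketch of multi-head attention, assuming per-head projection matrices for Q, K and V (random placeholders here; in practice these weights are trained):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(Q, K, V):
    # scaled dot-product attention, as in the sketch above
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(X, num_heads=2, d_head=4):
    # X: (seq_len, d_model). Each head has its own projection matrices for
    # Q, K and V, so different heads can attend to different aspects of the input.
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(num_heads):
        Wq = rng.normal(size=(d_model, d_head))   # placeholders for learned weights
        Wk = rng.normal(size=(d_model, d_head))
        Wv = rng.normal(size=(d_model, d_head))
        head_outputs.append(sdpa(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.normal(size=(num_heads * d_head, d_model))   # output projection
    return np.concatenate(head_outputs, axis=-1) @ Wo

# out = multi_head_attention(rng.normal(size=(5, 8)))    # shape (5, 8)
```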
Bidirectional Encoder Representations from Transformers (BERT)
- Hide parts of the input sentence and make the transformer return a probability distribution of what the likely missing part of the sentence is
- Training on this task updates the weights, which makes the model represent context better
- This is how the transformer comes to understand context and language
BERT loss
- The transformer is used to encode randomly masked versions of the same text
- Replace the original word/letter/amino acid with a placeholder (mask) token that means “unknown”
- The transformer is trained to predict the tokens that have been masked out correctly
- Scored with a cross-entropy loss function
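A minimal sketch of the random masking step, assuming a simple 15% masking rate and a [MASK] placeholder (the real BERT recipe has extra details not shown here):

```python
import random

MASK = "[MASK]"

def randomly_mask(tokens, mask_prob=0.15, seed=0):
    # Replace a random subset of tokens with a placeholder meaning "unknown";
    # the model is then trained to predict the original tokens at those positions.
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)       # what the model must recover
        else:
            masked.append(tok)
            targets.append(None)      # not scored at this position
    return masked, targets

masked, targets = randomly_mask(["the", "cat", "sat", "on", "the", "mat"])
```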
Cross-entropy loss function
Assesses the probability of the correct word in the probability distribution that the network outputs. If the correct word has a high probability in the distribution you get a low loss value.
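A minimal sketch of cross-entropy loss for a single masked position, assuming the network outputs a probability distribution over the vocabulary:

```python
import numpy as np

def cross_entropy(predicted_probs, correct_index):
    # Low loss when the correct token has high probability, high loss otherwise.
    return -np.log(predicted_probs[correct_index])

# Toy distribution over a 4-token vocabulary; the correct token is index 2.
probs = np.array([0.1, 0.1, 0.7, 0.1])
print(cross_entropy(probs, 2))                            # ~0.36 (low loss)
print(cross_entropy(np.array([0.7, 0.1, 0.1, 0.1]), 2))   # ~2.30 (high loss)
```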
Different ways of using language models
- Single-task supervised training
- Unsupervised pre-training + supervised fine-tuning
- Unsupervised pre-training + supervised training of small downstream classifier
- Future: unsupervised pre-training at scale + prompting (few-shot learning)