Machine Learning and Protein Structure Flashcards
Week 10 Lecture 2
What are artificial neural networks?
Artificial neural networks comprise many switching units (artificial neurons) that are connected according to a specific network architecture. The objective of an artificial neural network is to learn how to transform inputs into meaningful outputs.
Activation functions
- tanh(x)
- sigmoid: originally the most popular non-linear activation function used in neural nets
- rectifier (ReLU): now the most commonly used, as it works well for deep networks; technically still a non-linear function (small NumPy examples below)
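As an illustration (not part of the lecture material), the three activation functions can be written directly in NumPy:

```python
import numpy as np

def tanh(x):
    # hyperbolic tangent: squashes inputs into (-1, 1)
    return np.tanh(x)

def sigmoid(x):
    # logistic sigmoid: squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # rectifier (ReLU): zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), sigmoid(x), relu(x), sep="\n")
```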
What is a transformer?
- Take a set of vectors representing something and transform them into a new set of vectors which have extra contextual information added
- These are sequence-to-sequence methods
- They treat data as sets and so are permutation invariant (they don't take order into account, although this can be fixed by adding a positional term to the vector encoding of the tokens)
Simple 2-layer neural network
- A network is made "smarter" (given more capacity) by adding layers
- Connecting every node to every node in the adjacent layers (fully connected layers) means each layer can be computed as a matrix multiplication (see the sketch below)
- In the lecture diagram, the matrices drawn in blue correspond to the hidden layers
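A minimal sketch of the forward pass through a fully connected 2-layer network; the layer sizes are arbitrary choices for illustration:

```python
import numpy as np

# Minimal sketch of a fully connected 2-layer network forward pass.
# The layer sizes (4 -> 8 -> 2) are arbitrary choices for illustration.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)   # hidden layer -> output layer

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer: matrix multiply + ReLU non-linearity
    return h @ W2 + b2                # output layer: another matrix multiply

print(forward(rng.normal(size=(3, 4))))  # a batch of 3 four-dimensional inputs
```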
How do we embed tokens?
- Tokens need to be converted into numbers to use them in the neural network
- Embed them in a high-dimensional space by assigning each token a vector of numbers
- The numbers are initially random and the embeddings are stored in a lookup table; the same token always maps to the same vector
- The vector has a position in high dimensional space
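A minimal sketch of such a lookup table, with a toy vocabulary and embedding size chosen only for illustration:

```python
import numpy as np

# Toy vocabulary and embedding size, chosen only for illustration.
vocab = {"the": 0, "cat": 1, "sat": 2}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))  # initially random vectors

def embed(tokens):
    # look up each token's row; the same token always maps to the same vector
    return embedding_table[[vocab[t] for t in tokens]]

print(embed(["the", "cat", "sat"]).shape)  # (3, 8): one 8-dimensional vector per token
```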
How do we measure the similarity between two vectors?
- Represent the tokens as vectors and use Euclidean distance in any number of dimensions
- Cosine similarity is better because:
1. It measures the angle between the vectors
2. It is independent of the vector lengths
- The smaller the angle, the more similar the two vectors are to each other (a minimal function is sketched below)
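A minimal cosine similarity function in NumPy (not from the lecture), showing the length-independence:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = a.b / (|a| * |b|); depends only on the angle, not the lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))                       # 1.0: same direction, different length
print(cosine_similarity(a, np.array([-3.0, 0.5, 1.0])))  # much smaller: larger angle between them
```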
How are word embeddings done?
- Place the words randomly initially in high dimensional space
- Train the embeddings so that similar or related words end up close together
Transformer encoder
Input vectors → transformed vectors → final outputs
Scaled dot-product attention
- Calculates the dot product of every pair of vectors in the input
- Q (queries) and K (keys) are tensors built from the same set of vectors: the vectors assigned to the words in the sentence
- The system tries to calculate how similar the words are
- K is transposed and multiplied with Q, and the resulting dot products are normalised by dividing by the square root of the number of dimensions
- Run this through a softmax function so that all the values of a row add up to 1
- These values become the attention weights and we use these weights to generate a weighted average of the input tensor V
- The final outputs are weighted sums of the original vectors, using the softmax values as the weights (written compactly below)
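In the standard notation, the whole computation above is usually summarised as (with d_k the dimension of the query/key vectors):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```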
Scaled dot-product self-attention
- Dot products are calculated between all pairs of input vectors
- The dot products are scaled by dividing by the square root of the input vector dimension
- Similarities are normalised so that each row of the similarity matrix adds up to 1 using the softmax function. This is called the attention matrix.
- Each row of the attention matrix is used as the weights in a weighted average of the value input vectors.
- These weighted averages become the new output vectors.
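A minimal NumPy sketch of these steps, with illustrative sizes; in self-attention Q, K and V are all the same set of input vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V are (n_tokens, d) arrays of vectors
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # dot product of every pair, scaled by sqrt(d)
    weights = softmax(scores, axis=-1)   # attention matrix: each row sums to 1
    return weights @ V                   # weighted averages of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                        # 5 tokens, 16-dimensional vectors
print(scaled_dot_product_attention(X, X, X).shape)  # self-attention: Q = K = V = X -> (5, 16)
```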
Multi-head attention
- Scaled dot-product attention on its own has no trainable parameters and is not adjustable in any way
- Q, K, and V can be run through linear perceptrons which have weights and can be trained
- SDPA takes the place of the non-linearity function in the original neural nets
- You can have multiple SDPA blocks (heads) that all take in the same input and focus on different aspects (sketched below)
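A minimal sketch of the idea, reusing scaled_dot_product_attention from the previous sketch; the number of heads and the dimensions are arbitrary illustrative choices:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        # each head projects the same input X through its own trainable weights
        heads.append(scaled_dot_product_attention(X @ wq, X @ wk, X @ wv))
    # concatenate the heads and combine them with a final trainable projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n_heads, d_model, d_head = 4, 16, 4
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_head, d_model))
X = rng.normal(size=(5, d_model))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (5, 16)
```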
Bidirectional Encoder Representations from Transformers (BERT)
- Hide parts of the input sentence and make the transformer return a probability distribution over the likely missing parts of the sentence
- Updating the weights in this way makes the model understand the context better
- Can make the transformer understand context and language
BERT loss
- The transformer is used to encode randomly masked versions of the same text
- Replace the original word/letter/amino acid with a placeholder (mask) token meaning "unknown"
- The transformer is trained to correctly predict the tokens that have been masked out
- Scored with a cross-entropy loss function
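A minimal sketch of the random masking step, assuming integer token ids and a 15% mask fraction purely for illustration:

```python
import numpy as np

MASK_ID = 0                 # illustrative id for the "unknown" placeholder token
rng = np.random.default_rng(0)

def mask_tokens(token_ids, mask_fraction=0.15):
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_fraction  # pick ~15% of positions at random
    masked = np.where(mask, MASK_ID, token_ids)         # replace them with the mask token
    # the model is trained to predict the original token_ids at the masked positions
    return masked, mask

print(mask_tokens([7, 3, 9, 2, 5, 8, 1, 4, 6, 10]))
```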
Cross-entropy loss function
Assesses the probability of the correct word in the probability distribution that the network outputs. If the correct word has a high probability in the distribution you get a low loss value.
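A minimal numerical illustration (made-up probabilities) of how the loss behaves:

```python
import numpy as np

def cross_entropy(predicted_probs, correct_index):
    # loss = -log(probability assigned to the correct token):
    # high probability on the correct token -> low loss, and vice versa
    return -np.log(predicted_probs[correct_index])

probs = np.array([0.05, 0.80, 0.10, 0.05])  # the network's output distribution
print(cross_entropy(probs, 1))  # correct token has p = 0.80 -> low loss (~0.22)
print(cross_entropy(probs, 3))  # correct token has p = 0.05 -> high loss (~3.0)
```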
Different ways of using language models
- Single-task supervised training
- Unsupervised pre-training + supervised fine-tuning
- Unsupervised pre-training + supervised training of small downstream classifier
- Future: unsupervised pre-training at scale + prompting (few-shot learning)
Single-task supervised training
Train a transformer LM on labelled sequences to predict correct labels
Unsupervised pre-training + supervised fine-tuning
Train a transformer LM using BERT on unlabelled sequences. Then add a new output layer and continue training it on labelled sequences to predict correct labels.
Unsupervised pre-training + supervised training of small downstream classifiers
Train a transformer LM using BERT on unlabelled sequences then freeze the weights. Use the frozen LM outputs on labelled sequences to generate inputs to train a new model to predict labels.
Future: unsupervised pre-training at scale + prompting (few-shot learning)
Train a very large transformer LM autoregressively on very diverse data. Then try to find suitable prompts to induce it to predict correct labels.
Using a pre-trained protein language model
- Input sequence
- A fixed pre-trained language model generates embeddings of each residue
- Average the embeddings to produce a summary vector
- The summary vector is used to train a specialised neural net
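A minimal sketch of this setup; plm_embed is a hypothetical stand-in for a real pre-trained protein language model and just returns placeholder per-residue vectors:

```python
import numpy as np

def plm_embed(sequence, dim=1280):
    # Hypothetical stand-in for a fixed, pre-trained protein language model:
    # it would return one embedding vector per residue of the input sequence.
    rng = np.random.default_rng(abs(hash(sequence)) % (2**32))
    return rng.normal(size=(len(sequence), dim))

def summarise(sequence):
    per_residue = plm_embed(sequence)   # (sequence length, embedding dimension)
    return per_residue.mean(axis=0)     # average over residues -> one fixed-length summary vector

summary = summarise("MKTAYIAKQR")
print(summary.shape)  # this summary vector is the input used to train a small task-specific model
```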
Correlated mutations in proteins
- Residues in close proximity have a tendency to covary, probably in order to maintain a stable microenvironment
- Changes at one site can be compensated for by a mutation in another site
- These spatial constraints leave an evolutionary record within the sequence
- By observing patterns of covarying residues in deep MSAs of homologous sequences, we can infer this structural information
Predicting the 3D structure of proteins by amino acid co-evolution
- We can produce accurate lists of contacting residues from covariation observed in large MSAs
- If we have an efficient way to project this information into 3D space whilst satisfying the physicochemical constraints of protein chains then we have everything we need to predict 3D structure
What does AlphaFold2 do?
- Encodes an MSA using transformer blocks to produce an embedding
- Decodes the MSA embedding to generate 3D coordinates
BERT for training AF
- Mask out random amino acids
- The system has to guess what they are
- Learns the context of amino acids for this given protein and homologues
- Network learns about co-variation
AlphaFold EvoFormer Block
MSA representation tower:
- The network first looks for row-wise relationships between residue pairs within the input sequence, before considering column-wise information that evaluates each residue's importance in the context of the other sequences.
Pair representation tower:
- Evaluates the relationship between every 2 residues (nodes) to refine the proximities or edges between the two.
- It achieves this by triangulating the relationship of each node in a pair relative to a third node.
- The goal is to help the network satisfy the triangle inequality.
How does AlphaFold2 produce a 3D structure?
- The simple neural network reduces the number of input dimensions from 10 to 3 (a sketch follows this list)
- The 3 outputs are weighted sums of the inputs
- The operation is called a projection and is used many times in AlphaFold 2 (dimensions are usually 256 but can be 384 or 128)
- Each layer in the neural network can either reduce or increase dimensions
- Converts the MSA into a 3D structure
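A minimal sketch of such a projection: a single linear layer mapping 10 inputs to 3 outputs, with the real AlphaFold2 sizes (256/384/128) swapped for these illustrative ones:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))   # trainable weights: 10 input dimensions -> 3 output dimensions

def project(x):
    # each of the 3 outputs is a weighted sum of the 10 inputs
    return x @ W

x = rng.normal(size=(10,))
print(project(x))              # 3 numbers; AlphaFold2 applies the same operation at larger sizes
```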
The structure module
- A neural network that takes the refined representations and applies rotations and translations to each amino acid, revealing an initial guess of the 3D protein structure.
- Also applies physical and chemical constraints determined by atomic bonds, angles, and torsional angles.
- The refined representations, together with the output of the structure module, are fed back through the Evoformer and structure module 3 more times, for a total of 4 cycles ("recycling"), before arriving at the final result: predicted 3D atomic coordinates for the protein's structure.
Training AlphaFold2
Things you need:
- Known 3D structures
- MSAs
- Train AlphaFold2 to translate from a given MSA to the correct native 3D coordinates of the protein chain
Limitations of AlphaFold2
- Model quality depends on having good MSAs
- Reliance on evolutionary information means that AF2 cannot predict mutation effects or antibody structures
- Only produces a single maximum likelihood conformation and doesn’t predict conformational change
- Works in terms of MSAs and not single sequences
AlphaFold-Multimer to model multimers
- Co-evolution can be observed between protein chains that are in contact within multimers
- AF can model multimeric structures with properly paired MSAs
- Success rate is about 50% (limited by MSA quality and interface region size)
- Doesn’t work for antibody-antigen docking