Machine Learning and Protein Structure Flashcards
Week 10 Lecture 2
What are artificial neural networks?
Artificial neural networks comprise many switching units (artificial neurons) that are connected according to a specific network architecture. The objective of an artificial neural network is to learn how to transform inputs into meaningful outputs.
Activation functions
- tanh
- sigmoid: originally the most popular non-linear activation function used in neural nets
- rectifier (ReLU): now the most commonly used because it works well in deep networks; technically still a non-linear function
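A minimal sketch of the three activation functions listed above, using NumPy (the function names are my own):

```python
import numpy as np

def tanh(x):
    # squashes inputs into the range (-1, 1)
    return np.tanh(x)

def sigmoid(x):
    # squashes inputs into the range (0, 1); historically the default choice
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # rectifier: zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)
```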
What is a transformer?
- Take a set of vectors representing something and transform them into a new set of vectors which have extra contextual information added
- These are sequence-to-sequence methods
- They treat the data as a set and so are permutation invariant (they don’t take order into account, although this can be fixed by adding a position term to each token’s vector encoding)
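A minimal sketch of one common way to add that position term, the sinusoidal encoding from the original transformer paper (function name and shapes are my own choices; it assumes an even embedding dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Each position gets a fixed pattern of sines and cosines that is
    # added to its token vector, giving the model order information.
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# token_vectors: (seq_len, d_model) array of embeddings
# token_vectors = token_vectors + sinusoidal_positional_encoding(seq_len, d_model)
```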
Simple 2-layer neural network
- A network is made smarter by adding layers
- Connecting every node to every node in the adjacent layers means each layer can be computed as a matrix multiplication
- The weight matrices between layers (shown as blue lines in the lecture figure) define the hidden layers
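A minimal sketch of a fully connected 2-layer network as matrix multiplications (the layer sizes and random weights are placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example sizes (my own choices): 4 inputs, 8 hidden units, 2 outputs
W1 = rng.normal(size=(4, 8))   # input -> hidden weight matrix
W2 = rng.normal(size=(8, 2))   # hidden -> output weight matrix

def relu(x):
    return np.maximum(0.0, x)

def two_layer_net(x):
    # Because every node connects to every node in the next layer,
    # each layer is a matrix multiplication followed by a non-linearity.
    hidden = relu(x @ W1)
    return hidden @ W2

output = two_layer_net(rng.normal(size=(1, 4)))   # shape (1, 2)
```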
How do we embed tokens?
- Tokens need to be converted into numbers to use them in the neural network
- Embed them in a high dimensional space and set each token to a vector of numbers
- The numbers are initially random and the embeddings are stored in a lookup table; the same token always maps to the same vector
- The vector has a position in high dimensional space
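A minimal sketch of an embedding lookup table, assuming a toy vocabulary (the tokens and embedding dimension are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat"]    # toy vocabulary
d_model = 8                      # embedding dimension (arbitrary choice)

# Lookup table: one initially random vector per token.
embedding_table = rng.normal(size=(len(vocab), d_model))
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def embed(tokens):
    # The same token always maps to the same row of the table.
    ids = [token_to_id[t] for t in tokens]
    return embedding_table[ids]   # (len(tokens), d_model)

vectors = embed(["the", "cat", "sat"])
```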
How do we measure the similarity between two vectors?
- Represent the tokens as vectors and use Euclidean distance in any number of dimensions
- Cosine similarity is better because:
1. It measures the angle between the two vectors
2. It is independent of the vector lengths
- The smaller the angle, the more similar the vectors are to each other
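A minimal sketch comparing Euclidean distance and cosine similarity for two token vectors:

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Dot product divided by the vector lengths: depends only on the angle,
    # so it is independent of how long the vectors are.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction, different length
print(euclidean_distance(a, b))  # non-zero
print(cosine_similarity(a, b))   # 1.0 (smallest possible angle)
```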
How are word embeddings done?
- Place the words randomly initially in high dimensional space
- Process the words such that similar or related words are close together
Transformer encoder
Input vectors → transformed vectors → final outputs
Scaled dot-product attention
- Calculates the dot product of every pair of vectors in the input
- Q (queries) and K (keys) are tensors built from the same set of vectors: the vectors assigned to the words in the sentence
- The system tries to calculate how similar the words are
- K is transposed and multiplied with Q; the result is then normalised by dividing each dot product by the square root of the number of dimensions
- Run this through a softmax function so that all the values of a row add up to 1
- These values become the attention weights and we use these weights to generate a weighted average of the input tensor V
- The final output is a weighted sum of the original vectors, with the softmax values as the weights
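A minimal NumPy sketch of the steps above (shapes and names are my own; real frameworks implement this more efficiently):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) arrays; for self-attention they are the same vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of every pair of vectors
    weights = softmax(scores, axis=-1)  # each row now sums to 1 (attention weights)
    return weights @ V                  # weighted average of the value vectors

# Self-attention: queries, keys and values are all the same input vectors.
# X = embed(sentence_tokens); output = scaled_dot_product_attention(X, X, X)
```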
Scaled dot-product self-attention
- Dot products are calculated between all pairs of input vectors
- The dot products are divided by the square root of the input vector dimension
- Similarities are normalised so that each row of the similarity matrix adds up to 1 using the softmax function. This is called the attention matrix.
- Each row of the attention matrix is used as the weights in a weighted average of the value input vectors.
- These weighted averages become the new output vectors.
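The whole procedure can be written in one line (standard notation; $d_k$ is the dimension of the key/query vectors, i.e. the scaling factor mentioned above):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V
```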
Multi-head attention
- Scaled dot-product attention on its own has no trainable weights, so it cannot be adjusted
- Q, K, and V can each be run through linear layers, which have weights and can be trained
- SDPA takes the place of the non-linearity function in the original neural nets
- You can have multiple SDPA blocks (heads) that all take in the same input and learn to focus on different aspects; see the sketch below
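A minimal sketch of multi-head attention, assuming per-head projection matrices for Q, K and V (random placeholders here; in practice these weights are trained):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(Q, K, V):
    # scaled dot-product attention, as in the sketch above
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(X, num_heads=2, d_head=4):
    # X: (seq_len, d_model). Each head has its own projection matrices for
    # Q, K and V, so different heads can attend to different aspects of the input.
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(num_heads):
        Wq = rng.normal(size=(d_model, d_head))   # placeholders for learned weights
        Wk = rng.normal(size=(d_model, d_head))
        Wv = rng.normal(size=(d_model, d_head))
        head_outputs.append(sdpa(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.normal(size=(num_heads * d_head, d_model))   # output projection
    return np.concatenate(head_outputs, axis=-1) @ Wo

# out = multi_head_attention(rng.normal(size=(5, 8)))    # shape (5, 8)
```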
Bidirectional Encoder Representations from Transformers (BERT)
- Hide parts of the input sentence and make the transformer return a probability distribution of what the likely missing part of the sentence is
- Training on this task updates the weights, which makes the model represent context better
- This is how the transformer comes to understand context and language
BERT loss
- The transformer is used to encode randomly masked versions of the same text
- Replace the original word/letter/amino acid with a placeholder (mask) token that means “unknown”
- The transformer is trained to predict the tokens that have been masked out correctly
- Scored with a cross-entropy loss function
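A minimal sketch of the random masking step, assuming a simple 15% masking rate and a [MASK] placeholder (the real BERT recipe has extra details not shown here):

```python
import random

MASK = "[MASK]"

def randomly_mask(tokens, mask_prob=0.15, seed=0):
    # Replace a random subset of tokens with a placeholder meaning "unknown";
    # the model is then trained to predict the original tokens at those positions.
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)       # what the model must recover
        else:
            masked.append(tok)
            targets.append(None)      # not scored at this position
    return masked, targets

masked, targets = randomly_mask(["the", "cat", "sat", "on", "the", "mat"])
```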
Cross-entropy loss function
Assesses the probability of the correct word in the probability distribution that the network outputs. If the correct word has a high probability in the distribution you get a low loss value.
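A minimal sketch of cross-entropy loss for a single masked position, assuming the network outputs a probability distribution over the vocabulary:

```python
import numpy as np

def cross_entropy(predicted_probs, correct_index):
    # Low loss when the correct token has high probability, high loss otherwise.
    return -np.log(predicted_probs[correct_index])

# Toy distribution over a 4-token vocabulary; the correct token is index 2.
probs = np.array([0.1, 0.1, 0.7, 0.1])
print(cross_entropy(probs, 2))                            # ~0.36 (low loss)
print(cross_entropy(np.array([0.7, 0.1, 0.1, 0.1]), 2))   # ~2.30 (high loss)
```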
Different ways of using language models
- Single-task supervised training
- Unsupervised pre-training + supervised fine-tuning
- Unsupervised pre-training + supervised training of small downstream classifier
- Future: unsupervised pre-training at scale + prompting (few-shot learning)