Week 1-4: Alignments and Machine Learning Flashcards
Define: genome
All of the genetic information in an organism, which can be either DNA or RNA. Usually refers to the genetic sequence.
Define: transcriptome
All of the transcribed RNA in an organism. Not all DNA is transcribed.
Define: proteome
The entire complement of proteins that is or can be expressed by a cell, tissue, or organism
Describe: homology (genes/proteins)
Proteins or genes are defined as homologous if they share a common ancestor. Genes or proteins are either homologs or they are not; there is no such thing as percent homology, although the percent identity between two sequences can be calculated.
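Worked example: percent identity is a simple count over an existing pairwise alignment. The sketch below assumes one common convention (identical columns divided by columns where neither sequence has a gap); other tools divide by the full alignment length or by the shorter sequence. The function name and the two aligned sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue (no '-' gap).
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Hypothetical aligned pair: 7 ungapped columns, 6 of them identical.
identity = percent_identity("MKT-AYIA", "MKTLAHIA")
```

Here 6 of 7 ungapped columns match, so the identity is about 85.7%. That value describes sequence similarity; whether the two proteins are homologous remains a yes/no inference.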
Define: Homology
Descent from a common ancestor
Define: Orthology
Descent from a speciation event
Define: Paralogy
Descent from a duplication event
Define: Xenology
Descent from a horizontal transfer event
When can homology be determined as ‘real’?
If a pairwise alignment shows >25% identical amino acids, as the proteins will then have a similar folding pattern. 18-25% identity: possible homology. <18% identity: homology cannot be determined from the alignment.
When are DNA alignments appropriate for use (over protein alignments)?
- To confirm identity of cDNA
- To study non-coding DNA
- To study DNA polymorphisms
Define: Global alignment
Aligns entirety of two sequences, finding global (overall) similarity.
Define: Local alignment
Looks for (local) regions of similarity
PAM Matrices
Derived from multiple alignments of closely related sequences (>85% identity), counting the substitutions observed. 1 PAM (point accepted mutation) = one accepted change per hundred amino acids.
Machine learning: Supervised Learning
- Training data are labelled e.g. malignant or benign
- Goal is to correctly predict the output value given an unlabelled data point
Machine learning: Unsupervised Learning
- Input data are not labelled (output data not attached) - underlying class not important
- Algorithm must find patterns in data
Machine learning: Reinforcement Learning
Algorithm learns through occasional ‘rewards’ or ‘punishments’, adjusting its behaviour based on this feedback
Machine learning: Supervised Learning categories
- Classification: categorical labels e.g. skin cancer types (malignant, non-malignant, melanoma etc.)
- Regression: continuous labels e.g. predicting drug half life
- Ordinal regression: categorical labels that have some natural ordering e.g. predicting cancer stages (0-IV)
Machine learning: Supervised learning: regression
- Predicting continuous output values: gene expression levels, drug half-life or patient life-expectancy (continuous range of output values).
- A function or algorithm that maps a set of known input features to a real valued output value
- Training of a regression algorithm often involves minimising a cost function
Machine learning: Linear regression
- Assumes a linear relationship between continuous input variables and output value
- Can be problematic with highly dimensional data
- Aim is to find the ‘line of best fit’ through the data
Machine learning: Cost function
A way of quantifying the difference between prediction and reality
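Worked example: for linear regression, a common cost function is the mean squared error (MSE), and the ‘line of best fit’ is the line that minimises it. The sketch below assumes a single input feature and uses the closed-form least-squares solution; the function names and data points are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Cost function: average squared difference between reality and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Least-squares slope and intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
cost = mean_squared_error(ys, [slope * x + intercept for x in xs])
```

Because the points lie exactly on a line, the fitted slope is 2, the intercept is 1 and the cost is 0. Real data would leave a non-zero cost.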
Machine learning: Support Vector Machine
A classification (supervised learning) algorithm.
Machine learning: Supervised learning - classification
- Function or algorithm that maps set of known input features to a categorical output value, i.e. predicting categorical output values using known input data values, e.g. cancer status, protein secondary structure, or immune recognition (epitope/non-epitope)
Machine learning: Classification algorithms
Support vector machine and neural networks
Machine learning: Support vector machine
- Classification (supervised learning) algorithm
- SVM maximises separation between classes
- Only the values of support vectors affect decision boundary
- If non-linear separation is needed (for data not linearly separable), can use Kernel function/trick
Machine learning: Kernel trick
- Kernels allow us to use a mapping of the data into higher dimensions without having to explicitly compute the transformation for each data point
- This works because the input vectors only appear as dot products between pairs of data points
Machine learning: Dot product
- ‘Similarity measure’ between two input data points
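- If input data points are far apart, a distance-based similarity measure such as the Gaussian (RBF) kernel will be close to zero; a plain dot product is instead close to zero when the two vectors are nearly orthogonal.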
- Can be replaced with any equation that measures similarity between two data points
A Kernel function only needs to compute ____ between two input data points
similarity
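Worked example: a kernel takes two input data points and returns a single similarity value. The sketch below compares a plain dot product with the Gaussian (RBF) kernel, chosen here as one common example rather than the only option. The `gamma` value and the data points are hypothetical.

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical points: one pair close together, one pair far apart.
near = rbf_kernel([1.0, 2.0], [1.0, 2.1])
far = rbf_kernel([1.0, 2.0], [9.0, -4.0])
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
linear = dot([1.0, 2.0], [3.0, 4.0])
```

With the RBF kernel, identical points give a similarity of 1, nearby points a value close to 1, and distant points a value close to 0.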
Machine learning: Neural Networks
Each node outputs a value based on a combination of input values
Each node ‘learns’ which inputs are important
Can have multiple hidden layers
Often need large amounts of training data for algorithms to perform well
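Worked example: a single node computes a weighted combination of its inputs and passes it through an activation function. The sketch below assumes a sigmoid activation, which is one common choice. The weights, bias and inputs are hypothetical; in a trained network the weights are what the node ‘learns’.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, squashed to (0, 1) by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# A weight of zero means the node ignores that input entirely.
ignored = neuron_output([5.0, -3.0], [0.0, 0.0], 0.0)
# A large positive weight makes the first input dominate the output.
driven = neuron_output([5.0, -3.0], [2.0, 0.1], 0.0)
```

With all weights at zero the output is 0.5 regardless of the inputs; a larger weight on an input makes the output depend more strongly on it.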
Machine learning: hierarchical Clustering
A clustering (unsupervised learning) algorithm
Important for discovering biologically relevant clusters/groups
Input data isn’t labeled
Anatomy of a Machine Learning Problem
- Data collection:
More data is nearly always better. Most biological datasets are ‘small’ in the context of modern machine learning but, despite the relatively small number of samples, often have a large number of features (highly dimensional), e.g. gene expression data. Hence the need for:
- Feature extraction/selection:
Biological observations are converted into computer-friendly representations (features) in a process known as feature extraction. This can be driven by biological insight (if important contributing factors are known), and is followed by feature selection (reducing observations to informative features).
- Model selection and training:
Data is split into training, validation and test datasets. This helps to detect overfitting (good performance during training but poor generalisation to unseen data). The model is trained on the training dataset. Performance is then assessed on the validation dataset and hyperparameters are tuned to improve it. The best model is then evaluated on the test dataset.
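Worked example: the training/validation/test split can be sketched as a shuffle followed by slicing. The 70/15/15 fractions, the seed and the function name are assumptions for illustration. A real biological dataset may also need near-identical samples kept on the same side of the split.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples, then slice into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


# 100 hypothetical sample IDs.
train, val, test = split_dataset(range(100))
```

The test set is used only once, after hyperparameters have been chosen on the validation set, so that it still measures performance on unseen data.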
If performance of a machine learning model is poor, what can be the contributing factors?
Poor algorithm choice.
Insufficient data.
Noisy dataset.
Machine learning evaluation metrics - accuracy
What proportion of predictions are correct?
(TP+TN)/Total number of predictions
Machine learning evaluation metrics - precision
What proportion of our predicted positives are true? TP/(TP+FP)
Machine learning evaluation metrics - recall
What proportion of actual positives did we correctly identify? TP/(TP+FN)
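Worked example: accuracy, precision and recall can all be computed from the four confusion-matrix counts. The counts below are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing was predicted positive
    # or when there are no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for 100 predictions.
accuracy, precision, recall = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts accuracy is 85/100, precision is 40/45 and recall is 40/50. With unbalanced classes, accuracy alone can look good while recall on the rare class is poor.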
Machine learning problem: Overfitting
Excellent performance on the training dataset, but poor generalisation to unseen data
A big problem when training data has many features, or few data points
Machine learning problem: redundancy between test and training sets
Very similar data points in both test and training data, leading to overestimation of model performance
Machine learning problem: Unbalanced classes within the dataset
Can be very difficult to train a model when one class dominates the training data
Machine learning problem: Batch effect
Data collection is not properly randomised
Model won’t generalise to unseen data
Define: Systematics
An attempt to understand the interrelationships of living things
Taxonomy
The science of naming and classifying organisms (evolutionary theory not necessarily involved)
Phylogenetics
The field of systematics that focuses on evolutionary relationships between organisms or genes/proteins (phylogeny).
Define: Taxon
Any named group of organisms (evolutionary theory not necessarily involved)
A group of taxa proposed to have a common ancestral origin is called
clade
The base of a clade is called a ______
node
Define: Topology
Order of branching in a phylogenetic tree
Define: Cladistics
Classifying organisms based on evolutionary relatedness or shared characteristics
True or false:
In a cladogram, branch lengths indicate quantity of evolutionary divergence.
False. In a cladogram, branch lengths are not significant. Only the topology (order of branching) matters.
True or false:
In a phylogram, branch lengths indicate quantity of evolutionary divergence.
True.
In a phylogenetic tree, clades show a common ancestor and _____
all of its descendants.
True or false:
In a phylogenetic tree, terminal nodes beside each other are more closely related than tips further apart.
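False. Relatedness depends on how recently two tips share a common ancestor, not on how close they are drawn; branches can be rotated around any node without changing the tree.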
True or false:
Maximum Parsimony is based on sequence characters rather than distances
True
Bootstrapping
A way of statistically validating a phylogenetic tree.
The alignment columns are resampled with replacement (generally 1000 times), a tree is built from each resampled dataset, and the number (or percentage) of times a node appears is given. If a node is present 700 times out of 1000 (70%), this is commonly interpreted as around 95% probability that it is in the correct position.
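Worked example: the resampling step of a phylogenetic bootstrap draws alignment columns at random, with replacement, until the replicate is as long as the original alignment. The sketch below shows only that step; building a tree from each replicate and counting how often each node appears are omitted. The alignment and function name are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement."""
    n_cols = len(alignment[0])
    # The same column indices are used for every sequence, so each
    # sampled column stays intact across all taxa.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in picks) for seq in alignment]


# Hypothetical alignment of three sequences, six columns each.
alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]
rng = random.Random(42)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
```

Each replicate has the same dimensions as the original alignment, but some columns appear more than once and others not at all.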
Molecular clock
Technique that uses mutation rate of biomolecules to deduce the time taken for two or more species to diverge
Dynamic programming in sequence alignment - steps
- Initialisation
- Scoring the matrix
- Traceback (to get the alignment)
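Worked example: the three steps above can be sketched as a minimal global (Needleman-Wunsch style) alignment. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption for illustration; real protein alignments would use a substitution matrix such as BLOSUM62 and usually affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # 1. Initialisation: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # 2. Scoring the matrix: each cell takes the best of three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align two residues
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # 3. Traceback: walk from the bottom-right cell back to the origin.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, aligned_a, aligned_b = needleman_wunsch("GAT", "GT")
```

For "GAT" against "GT" this gives a score of 1 with the alignment GAT over G-T. A local (Smith-Waterman style) alignment differs mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell.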
PAM250 refers to
250 substitutions per 100 amino acids
BLOSUM62 is derived from
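aligned protein blocks in which sequences with 62% or more identity are clustered together, so substitutions are counted between sequences that have less than 62% identity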
BLASTP uses which default scoring matrix?
BLOSUM62
E value
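The ‘expect’ value: the number of matches with an equal or better score that would be expected to occur by chance in a database of that size. The lower the E value, the more significant the match.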
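Worked example: the E value is usually described with the Karlin-Altschul formula E = K·m·n·e^(-λS), where S is the raw score, m the query length and n the database length. The sketch below applies that formula. The `lam` and `k` values are placeholders for illustration only; the real parameters depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value_from_e(e_value):
    """Probability of at least one chance hit, given the expected count."""
    return 1.0 - math.exp(-e_value)


# Placeholder statistical parameters and hypothetical search sizes.
LAM, K = 0.27, 0.04
weak = expect_value(raw_score=40, query_len=300, db_len=10_000_000, lam=LAM, k=K)
strong = expect_value(raw_score=200, query_len=300, db_len=10_000_000, lam=LAM, k=K)
```

A higher score gives a much smaller E value, and doubling the database size doubles it. For small E the probability of a chance hit is approximately equal to E, which is why the two are often confused.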
Classification supervised learning can be used for
Skin cancer type (melanoma, non-malignant, basal-cell carcinoma etc.)
Regression supervised learning can be used for
Predicting drug half-life
Ordinal regression supervised learning can be used for
Predicting cancer stages (Stage I-IV)
Linear regression can be problematic with _______ data.
highly dimensional