Week 1-4: Alignments and Machine Learning Flashcards
Define: genome
All of the genetic information in an organism; the genome can be DNA or, in some viruses, RNA. Usually refers to the genetic sequence.
Define: transcriptome
All of the transcribed RNA in an organism. Not all DNA is transcribed.
Define: proteome
The entire complement of proteins that is or can be expressed by a cell, tissue, or organism
Describe: homology (genes/proteins)
Proteins or genes are defined as homologous if they share a common ancestor. Genes or proteins are either homologs or they are not; there is no such thing as percent homology, although the percent identity between two sequences can be measured.
Define: Homology
Descent from a common ancestor
Define: Orthology
Descent from a speciation event
Define: Paralogy
Descent from a duplication event
Define: Xenology
Descent from a horizontal transfer event
When can homology be determined as ‘real’?
When a pairwise alignment shows >25% identical amino acids, since such proteins are expected to have a similar folding pattern. 18-25% identity: possible homology. <18% identity: homology cannot be determined from the alignment.
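Worked example: the identity thresholds above can be applied to an existing pairwise alignment with a few lines of Python. This is a minimal sketch, not a standard tool. The function names are hypothetical, "-" is assumed to mark a gap, and identity is assumed to be counted over gap-free columns only (other denominators are also in use and give different percentages).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns containing a gap ('-') are excluded from the count.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def homology_call(identity: float) -> str:
    """Apply the rule-of-thumb thresholds from the card above."""
    if identity > 25.0:
        return "likely homologous"
    if identity >= 18.0:
        return "possible homology"
    return "cannot be determined from alignment"
```

Usage: homology_call(percent_identity("ACDE-FG", "ACDQWFG")) compares six gap-free columns, five of which are identical.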
When are DNA alignments appropriate for use (over protein alignments)?
- To confirm identity of cDNA
- To study non-coding DNA
- To study DNA polymorphisms
Define: Global alignment
Aligns entirety of two sequences, finding global (overall) similarity.
Define: Local alignment
Looks for (local) regions of similarity
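Worked example: the difference between the global and local alignment cards can be seen by scoring the same pair of sequences both ways. The sketch below uses dynamic programming in the style of Needleman-Wunsch (global) and Smith-Waterman (local). The function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, and only the best score is returned, not the alignment itself.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score by dynamic programming.

    local=False: global score, both sequences aligned end to end.
    local=True:  local score, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]
```

Usage: for "GGGGACGT" and "ACGTCCCC" the shared region ACGT gives a high local score, while the global score is dragged down by the unrelated flanking residues.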
PAM Matrices
Built from multiple alignments of closely related sequences (>85% identity), counting the substitutions observed. 1 PAM (point accepted mutation) = one accepted change per hundred amino acids.
Machine learning: Supervised Learning
- Training data are labelled e.g. malignant or benign
- Goal is to correctly predict the output value given an unlabelled data point
Machine learning: Unsupervised Learning
- Input data are not labelled (no output values attached); the underlying class is not important
- Algorithm must find patterns in data
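Worked example: k-means clustering is one common unsupervised algorithm (it is not named in these cards and is used here only as an illustration). The sketch below groups unlabelled one-dimensional values. The function name, the fixed iteration count and the caller-supplied starting centroids are all assumptions made to keep the example small and deterministic.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data.

    Repeats two steps: assign each value to its nearest centroid,
    then move each centroid to the mean of the values assigned to it.
    """
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[k]
                     for k, c in enumerate(clusters)]
    labels = [min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
              for v in values]
    return centroids, labels
```

Usage: kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0]) separates the low values from the high values without ever being told which group each value belongs to.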
Machine learning: Reinforcement Learning
Algorithm learns through occasional ‘rewards’ or ‘punishments’, adjusting its behaviour based on this feedback
Machine learning: Supervised Learning categories
- Classification: categorical labels e.g. skin cancer types (malignant, non-malignant, melanoma etc.)
- Regression: continuous labels e.g. predicting drug half life
- Ordinal regression: categorical labels that have some natural ordering e.g. predicting cancer stages (0-IV)
Machine learning: Supervised learning: regression
- Predicting continuous output values: gene expression levels, drug half-life or patient life-expectancy (continuous range of output values).
- A function or algorithm that maps a set of known input features to a real valued output value
- Training of a regression algorithm often involves minimising a cost function
Machine learning: Linear regression
- Assumes a linear relationship between continuous input variables and output value
- Can be problematic with high-dimensional data
- Aim is to find the ‘line of best fit’ through the data
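Worked example: for a single input variable, the line of best fit can be computed directly with ordinary least squares. This is a minimal sketch under that single-feature assumption. The function name is hypothetical, and real analyses would normally use a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Usage: fit_line([0, 1, 2, 3], [1, 3, 5, 7]) recovers the line y = 2x + 1, because the points lie exactly on it.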
Machine learning: Cost function
A way of quantifying the difference between prediction and reality.
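Worked example: mean squared error (MSE) is one widely used cost function for regression (the cards do not say which cost function the course uses, so MSE is an assumption here). The sketch averages the squared gap between each true value and its prediction. The function name is hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Usage: a perfect prediction gives a cost of 0, and the cost grows as predictions move away from the true values, so training amounts to searching for the parameters with the lowest cost.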
Machine learning: Support Vector Machine
A classification (supervised learning) algorithm.
Machine learning: Supervised learning - classification
- Function or algorithm that maps a set of known input features to a categorical output value, i.e. predicting categorical output values from known input data values, e.g. cancer status, protein secondary structure, or immune recognition (epitope/non-epitope)
Machine learning: Classification algorithms
Support vector machines and neural networks
Machine learning: Support vector machine
- Classification (supervised learning) algorithm
- SVM maximises separation between classes
- Only the values of support vectors affect decision boundary
- If non-linear separation is needed (for data that are not linearly separable), a kernel function (the kernel trick) can be used
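Worked example: the kernel trick works because a kernel returns the dot product the data would have in a higher-dimensional feature space, without ever building that space. The sketch below checks this for a degree-2 polynomial kernel on 2-D points. The kernel choice and all function names are assumptions for illustration, and this is not a full SVM.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel on 2-D points: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def poly2_features(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Usage: for x = (1, 2) and z = (3, 4), poly2_kernel(x, z) and dot(poly2_features(x), poly2_features(z)) agree up to floating-point rounding, so a linear boundary in the 3-D feature space corresponds to a non-linear boundary in the original 2-D space.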