Week 1-4: Alignments and Machine Learning Flashcards

1
Q

Define: genome

A

All of the genetic information in an organism, which can be either DNA or RNA. Usually refers to the genetic sequence.

2
Q

Define: transcriptome

A

All of the transcribed RNA in an organism. Not all DNA is transcribed.

3
Q

Define: proteome

A

The entire complement of proteins that is or can be expressed by a cell, tissue, or organism

4
Q

Describe: homology (genes/proteins)

A

Proteins or genes are defined as homologous if they share a common ancestor. Genes or proteins are either homologs or they are not; there is no such thing as percent homology. However, the percent identity of two sequences can be measured.

5
Q

Define: Homology

A

Descent from a common ancestor

6
Q

Define: Orthology

A

Descent from a speciation event

7
Q

Define: Paralogy

A

Descent from a duplication event

8
Q

Define: Xenology

A

Descent from a horizontal transfer event

9
Q

When can homology be determined as ‘real’?

A

If a pairwise alignment shows >25% identical amino acids, homology is likely, as the proteins will have a similar folding pattern. 18-25% identity: possible homology. <18% identity: homology cannot be determined from the alignment.
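
The thresholds above can be applied to an existing pairwise alignment. This is a minimal sketch, assuming gaps are written as "-" and that identity is counted over aligned (non-gap) columns only; other denominators are also in use. The function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns (an assumption, see lead-in)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def homology_call(identity: float) -> str:
    """Apply the card's thresholds: >25%, 18-25%, <18%."""
    if identity > 25:
        return "likely homologous"
    if identity >= 18:
        return "possible homology"
    return "cannot be determined from alignment"


print(homology_call(percent_identity("ACDE-", "ACDFG")))
```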

10
Q

When are DNA alignments appropriate for use (over protein alignments)?

A
  1. To confirm identity of cDNA
  2. To study non-coding DNA
  3. To study DNA polymorphisms
11
Q

Define: Global alignment

A

Aligns entirety of two sequences, finding global (overall) similarity.

12
Q

Define: Local alignment

A

Looks for (local) regions of similarity

13
Q

PAM Matrices

A
Derived from multiple alignments of closely related sequences (>85% identity), counting the substitutions observed.
1 PAM (point accepted mutation) = one accepted change per hundred amino acids.
14
Q

Machine learning: Supervised Learning

A
  • Training data are labelled e.g. malignant or benign
  • Goal is to correctly predict the output value given an unlabelled data point

15
Q

Machine learning: Unsupervised Learning

A
  • Input data are not labelled (output data not attached) - underlying class not important
  • Algorithm must find patterns in data
16
Q

Machine learning: Reinforcement Learning

A

Algorithm learns through occasional ‘rewards’ or ‘punishments’, adjusting its behaviour based on this feedback

17
Q

Machine learning: Supervised Learning categories

A
  • Classification: categorical labels e.g. skin cancer types (malignant, non-malignant, melanoma etc.)
  • Regression: continuous labels e.g. predicting drug half life
  • Ordinal regression: categorical labels that have some natural ordering e.g. predicting cancer stages (0-IV)
18
Q

Machine learning: Supervised learning: regression

A
  • Predicting continuous output values: gene expression levels, drug half-life or patient life-expectancy (continuous range of output values).
  • A function or algorithm that maps a set of known input features to a real valued output value
  • Training of a regression algorithm often involves minimising a cost function
19
Q

Machine learning: Linear regression

A
  • Assumes a linear relationship between continuous input variables and output value
  • Can be problematic with highly dimensional data
  • Aim is to find the ‘line of best fit’ through the data
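
For a single input variable, the ‘line of best fit’ has a closed-form least-squares solution. This is a minimal sketch, assuming ordinary least squares with one feature; `fit_line` is a hypothetical helper name and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```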
20
Q

Machine learning: Cost function

A

A way of quantifying the difference between prediction and reality
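
One common cost function for regression is the mean squared error (MSE). The card does not name a specific cost function, so MSE is an assumption chosen for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Perfect predictions cost nothing; larger errors are penalised quadratically.
print(mean_squared_error([1.0, 2.0], [2.0, 4.0]))
```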

21
Q

Machine learning: Support Vector Machine

A

A classification (supervised learning) algorithm.

22
Q

Machine learning: Supervised learning - classification

A
  • Function or algorithm that maps set of known input features to a categorical output value, i.e. predicting categorical output values using known input data values, e.g. cancer status, protein secondary structure, or immune recognition (epitope/non-epitope)
23
Q

Machine learning: Classification algorithms

A

Support vector machine and neural networks

24
Q

Machine learning: Support vector machine

A
  • Classification (supervised learning) algorithm
  • SVM maximises separation between classes
  • Only the values of support vectors affect decision boundary
  • If non-linear separation is needed (for data not linearly separable), can use Kernel function/trick
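
A linear SVM, once trained, classifies a point by which side of the decision boundary it falls on. This is a sketch of that prediction step only (no training), assuming a weight vector `w` and bias `b` have already been learned and scaled in the usual way so that support vectors lie at a decision value of +1 or -1. The values of `w` and `b` below are made up.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; zero on the decision boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the boundary."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Separation between the two classes, 2 / ||w||, under the scaling above."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


w, b = (3.0, 4.0), 0.0  # hypothetical learned parameters
print(classify(w, b, (1.0, 1.0)), margin_width(w))
```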
25
Q

Machine learning: Kernel trick

A
  • Kernels allow us to use a mapping of the data into higher dimensions without having to explicitly compute the transformation for each data point
  • This works because input vectors only appear in the SVM calculation as a dot product between pairs of points
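
A small worked check of the idea: for 2-D points, the degree-2 polynomial kernel (x·y)² gives the same number as a dot product in an explicit 3-D feature space, without ever building that space. The choice of this particular kernel and the names `phi` and `poly_kernel` are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Same value as dot(phi(x), phi(y)), computed in the original 2-D space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
print(dot(phi(x), phi(y)), poly_kernel(x, y))
```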
26
Q

Machine learning: Dot product

A
  • ‘Similarity measure’ between two input data points
  • If two input data points are dissimilar, the similarity value will be close to zero (for a raw dot product, this means near-orthogonal vectors)
  • Can be replaced with any equation that measures similarity between two data points
27
Q

A Kernel function only needs to compute ____ between two input data points

A

similarity
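
One widely used similarity function of this kind is the Gaussian (RBF) kernel, shown here as an illustrative sketch. The card does not name a specific kernel, and the `gamma` parameter and its default value are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical points score 1; points far apart score close to 0.
print(rbf_kernel((0.0, 0.0), (0.0, 0.0)), rbf_kernel((0.0, 0.0), (10.0, 10.0)))
```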

28
Q

Machine learning: Neural Networks

A

Each node outputs a value based on a combination of input values
Each node ‘learns’ which inputs are important
Can have multiple hidden layers
Often need large amounts of training data for algorithms to perform well
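
A single node can be sketched as a weighted sum of its inputs passed through an activation function. The sigmoid activation and the weights below are assumptions for illustration; in a real network the weights are learned during training.

```python
import math


def node_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, squashed to (0, 1) by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# A large weight means the node treats that input as important.
weights = [5.0, 0.1]  # hypothetical learned weights
print(node_output([1.0, 0.0], weights, 0.0), node_output([0.0, 1.0], weights, 0.0))
```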

29
Q

Machine learning: hierarchical Clustering

A

A clustering (unsupervised learning) algorithm
Important for discovering biologically relevant clusters/groups
Input data isn’t labeled
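
A minimal sketch of the bottom-up (agglomerative) form of hierarchical clustering on one-dimensional values. Single linkage (distance between the closest members) and stopping at a fixed number of clusters are assumptions; the card does not specify a linkage rule.

```python
def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# No labels are supplied; groups emerge from the distances alone.
print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3))
```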

30
Q

Anatomy of a Machine Learning Problem

A
  1. Data collection:
    More data is nearly always better. However, most biological datasets are ‘small’ in the context of modern machine learning and, despite the relatively small number of samples, often have a large number of features (highly dimensional), e.g. gene expression data. Hence the need for:
  2. Feature extraction/selection:
    Conversion of biological observations into computer-friendly representations (features) in a process known as feature extraction. This can be driven by biological insight (if important contributing factors are known), followed by feature selection (reducing observations to informative features).
  3. Model selection and training:
    Data is split into training, validation and test datasets. This helps to detect and limit overfitting (good performance during training but poor generalisation to unseen data). The model is trained on the training dataset. Performance is then assessed on the validation dataset and hyperparameters are tuned to improve it. The best model is then evaluated on the test dataset.
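
The data split in step 3 can be sketched as a shuffle followed by slicing. The 60/20/20 proportions, the fixed seed, and the name `split_dataset` are assumptions for illustration, not values from the course.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then slice into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
```

Shuffling before slicing matters: a split that follows collection order can separate batches rather than sample them.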
31
Q

If performance of a machine learning model is poor, what can be the contributing factors?

A

Poor algorithm choice.
Insufficient data.
Noisy dataset.

32
Q

Machine learning evaluation metrics - accuracy

A

What proportion of predictions are correct?

(TP + TN) / total number of predictions

33
Q

Machine learning evaluation metrics - precision

A

What proportion of our predicted positives are true? TP / (TP + FP)

34
Q

Machine learning evaluation metrics - recall

A

What proportion of actual positives did we correctly identify? TP / (TP + FN)
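
Accuracy, precision and recall can all be computed from the four confusion-matrix counts. This is a small sketch with made-up counts; it assumes the denominators are non-zero.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Proportion of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Proportion of actual positives that were found."""
    return tp / (tp + fn)


# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN
print(accuracy(40, 45, 5, 10), precision(40, 5), recall(40, 10))
```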

35
Q

Machine learning problem: Overfitting

A

Excellent performance on the training dataset, but poor generalisation to unseen data
A big problem when training data has many features, or few data points

36
Q

Machine learning problem: redundancy between test and training sets

A

Very similar data points in both test and training data, leading to overestimation of model performance

37
Q

Machine learning problem: Unbalanced classes within the dataset

A

Can be very difficult to train a model when one class dominates the training data

38
Q

Machine learning problem: Batch effect

A

Data collection is not properly randomised

Model won’t generalise to unseen data

39
Q

Define: Systematics

A

An attempt to understand the interrelationships of living things

40
Q

Taxonomy

A

The science of naming and classifying organisms (evolutionary theory not necessarily involved)

41
Q

Phylogenetics

A

The field of systematics that focuses on evolutionary relationships between organisms or genes/proteins (phylogeny).

42
Q

Define: Taxon

A

Any named group of organisms (evolutionary theory not necessarily involved)

43
Q

A group of taxa proposed to have a common ancestral origin is called

A

clade

44
Q

The base of a clade is called a ______

A

node

45
Q

Define: Topology

A

Order of branching in a phylogenetic tree

46
Q

Define: Cladistics

A

Classifying organisms based on evolutionary relatedness or shared characteristics

47
Q

True or false:

In a cladogram, branch lengths indicate quantity of evolutionary divergence.

A

False. In a cladogram, branch lengths are not significant. Only the topology (order of branching) matters.

48
Q

True or false:

In a phylogram, branch lengths indicate quantity of evolutionary divergence.

A

True.

49
Q

In a phylogenetic tree, clades show a common ancestor and _____

A

all of its descendants.

50
Q

True or false:

In a phylogenetic tree, terminal nodes beside each other are more closely related than tips further apart.

A

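False. Branches can be rotated around any internal node without changing the tree, so relatedness is read from how recently two tips share a common ancestor, not from how close they sit on the page.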

51
Q

True or false:

Maximum Parsimony is based on sequence characters rather than distances

A

True

52
Q

Bootstrapping

A

A way of statistically validating a phylogenetic tree.
The data is resampled (generally 1000 times), slightly perturbing it each time, and the number (or percentage) of times a node appears is reported. If a node is present in 700 of 1000 replicates, there is around a 95% probability that it is in the correct position.
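
The resampling step can be sketched as drawing alignment columns at random with replacement, which is the usual way a phylogenetic bootstrap perturbs the data. Tree building on each replicate is omitted here; the function names and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, seed=None):
    """Build one replicate by sampling alignment columns with replacement."""
    rng = random.Random(seed)
    n_cols = len(alignment[0])
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column choices are applied to every sequence
    return ["".join(seq[c] for c in columns) for seq in alignment]


def support_percentage(times_recovered, replicates):
    """Bootstrap support for a node, as a percentage of replicates."""
    return 100.0 * times_recovered / replicates


replicate = bootstrap_alignment(["ACGT", "ACGA", "TCGA"], seed=1)
print(replicate, support_percentage(700, 1000))
```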

53
Q

Molecular clock

A

Technique that uses mutation rate of biomolecules to deduce the time taken for two or more species to diverge
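
Under the simplest (strict) clock, divergence time is the distance between two sequences divided by twice the substitution rate, because changes accumulate along both lineages. This sketch assumes a constant rate; the numbers are made up for illustration.

```python
def divergence_time(substitutions_per_site, rate_per_site_per_year):
    """T = K / (2r): K is the distance between two lineages, r the rate per lineage."""
    return substitutions_per_site / (2 * rate_per_site_per_year)


# Hypothetical values: 0.02 substitutions/site, 1e-9 substitutions/site/year
print(divergence_time(0.02, 1e-9))
```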

54
Q

Dynamic programming in sequence alignment - steps

A
  1. Initialisation
  2. Scoring the matrix
  3. Traceback (to get the alignment)
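
The three steps can be sketched with a global (Needleman-Wunsch style) alignment. This is a minimal illustration, assuming a simple match/mismatch score and a linear gap penalty rather than a PAM or BLOSUM matrix; the scoring values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # 1. Initialisation: first row and column hold cumulative gap penalties
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Scoring the matrix: best of diagonal (align), up or left (gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # 3. Traceback: walk from the bottom-right corner back to the origin
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

A local (Smith-Waterman style) alignment follows the same three steps but floors cell scores at zero and starts the traceback from the highest-scoring cell.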
55
Q

PAM250 refers to

A

250 substitutions per 100 amino acids

56
Q

BLOSUM62 is derived from

A

proteins that have no more than 62% identity

57
Q

BLASTP uses which default scoring matrix?

A

BLOSUM62

58
Q

E value

A

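The ‘expect’ value: the number of matches with an equal or better score that would be expected to occur by chance in a search of a database of that size. It is an expected count rather than a probability; the lower the E value, the more significant the match.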
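
The relationship between score and E value can be sketched from the bit score: E = m × n × 2^(-S'), where m is the query length and n the database length. This is a simplified illustration; real BLAST uses effective lengths, and the numbers below are made up.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Each extra bit of score halves E; a larger database raises it.
print(expect_value(50, 100, 1_000_000))
```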

59
Q

Classification supervised learning can be used for

A

Skin cancer type (melanoma, non-malignant, basal-cell carcinoma etc.)

60
Q

Regression supervised learning can be used for

A

Predicting drug half-life

61
Q

Ordinal regression supervised learning can be used for

A

Predicting cancer stages (Stage I-IV)

62
Q

Linear regression can be problematic with _______ data.

A

highly dimensional