Week 1-4: Alignments and Machine Learning Flashcards

Question

Machine learning: Kernel trick

Answer 1

- Kernels allow us to use this mapping into higher dimensions without having to explicitly compute the transformation for each data point
- Input vectors appear only as a dot product, so that dot product can be swapped for a kernel function
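A minimal sketch of the idea, assuming a degree-2 polynomial kernel on 2-D inputs (the card does not name a specific kernel; `phi`, `dot` and `poly_kernel` are illustrative names). The explicit mapping and the kernel give the same number, but the kernel never builds the higher-dimensional vectors.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(y))  # map to 3-D first, then take the dot product
implicit = poly_kernel(x, y)    # computed entirely in the 2-D input space
```

Both values agree up to floating-point rounding.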

Answer 2

- ‘Similarity measure’ between two input data points
- If two input points are far apart, the kernel value will be close to zero (true of e.g. the Gaussian/RBF kernel)
- The dot product can be replaced with any valid kernel function that measures similarity between two data points
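As an illustration of a kernel acting as a similarity measure, here is a sketch assuming the Gaussian (RBF) kernel; the function name and the `gamma` default are illustrative choices, not from the card.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

near = rbf_kernel((0.0, 0.0), (0.1, 0.0))  # close points: value near 1
far = rbf_kernel((0.0, 0.0), (5.0, 5.0))   # distant points: value near 0
```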

Answer 3

similarity

Answer 4

- Each node outputs a value based on a combination of input values
- Each node ‘learns’ which inputs are important
- Can have multiple hidden layers
- Often need large amounts of training data for algorithms to perform well
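A rough sketch of what "a combination of input values" can look like, assuming a weighted sum followed by a sigmoid activation (one common choice; the weights here are made-up numbers, not learned ones).

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias):
    """One node: weighted sum of the inputs plus a bias, then an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def layer_output(inputs, weight_rows, biases):
    """One layer: every node sees the same inputs but has its own weights."""
    return [node_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

inputs = [0.5, -1.0, 2.0]
# First node weights the third input heavily; second node ignores all inputs.
hidden = layer_output(inputs, [[0.1, 0.0, 0.9], [0.0, 0.0, 0.0]], [0.0, 0.0])
```

A large weight means the node treats that input as important; training adjusts the weights. Stacking several such layers gives multiple hidden layers.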

Answer 5

- A clustering (unsupervised learning) algorithm
- Important for discovering biologically relevant clusters/groups
- Input data isn't labeled

Answer 6

1. Data collection: More data is nearly always better. Most biological datasets are 'small' in the context of modern machine learning, yet despite the small number of samples they often have a large number of features (high dimensionality), e.g. gene expression data. Hence the need for:
2. Feature extraction/selection: Biological observations are converted into computer-friendly representations (features) in a process known as feature extraction. This can be driven by biological insight (if important contributing factors are known), and is followed by feature selection (reducing the features to the informative ones).
3. Model selection and training: Data is split into training, validation and test datasets. This guards against undetected overfitting (good performance during training but poor generalisation to unseen data). The model is trained on the training dataset. Performance is then assessed on the validation dataset and hyperparameters are tuned to improve it. The best model is finally evaluated on the test dataset.
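The three-way split in step 3 can be sketched as follows. The 70/15/15 proportions, the function name and the fixed seed are assumptions for illustration; the card does not specify them.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, validation and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out
    return train, val, test

train, val, test = split_dataset(range(100))
```

The test subset is touched only once, after hyperparameters have been chosen on the validation subset.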

Answer 7

Poor algorithm choice. Insufficient data. Noisy dataset.

Answer 8

What proportion of predictions are correct? | (TP + TN) / total number of predictions

Answer 9

What proportion of our predicted positives are true? (TP / (TP + FP))

Answer 10

What proportion of actual positives did we correctly identify? (TP / (TP + FN))
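The three metrics on the cards above (accuracy, precision and recall) can be computed from confusion-matrix counts. A small sketch with made-up labels, where 1 is the positive class; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # hypothetical predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```

Note the parentheses in each denominator: TP / (TP + FP), not TP / TP + FP.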

Answer 11

Excellent performance on the training dataset, but poor generalisation to unseen data. A big problem when training data has many features or few data points.

Answer 12

Very similar data points in both test and training data, leading to overestimation of model performance

Answer 13

Can be very difficult to train a model when one class dominates the training data

Answer 14

Data collection is not properly randomised | Model won’t generalise to unseen data

Answer 15

An attempt to understand the interrelationships of living things

Answer 16

The science of naming and classifying organisms (evolutionary theory not necessarily involved)

Answer 17

The field of systematics that focuses on evolutionary relationships between organisms or genes/proteins (phylogeny).

Answer 18

Any named group of organisms (evolutionary theory not necessarily involved)

Answer 19

Order of branching in a phylogenetic tree

Answer 20

Classifying organisms based on evolutionary relatedness or shared characteristics

Answer 21

False. In a cladogram, branch lengths are not significant. Only the topology (order of branching) matters.

Answer 22

all of its descendants.

Answer 23

A way of statistically validating a phylogenetic tree. The data is resampled with replacement (generally 1000 times), a tree is rebuilt from each resampled dataset, and the number (or percentage) of times a node appears is reported. As a rule of thumb, if a node is present in 700 of 1000 replicates, there is around a 95% probability that it is in the correct position.
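A sketch of the resampling step only, assuming the data is a multiple sequence alignment whose columns are drawn with replacement. Tree building is left out; the toy alignment and function names are illustrative.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Draw alignment columns with replacement to build one replicate."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column indices are used for every sequence, so each sampled
    # column keeps its residues together.
    return ["".join(seq[c] for c in cols) for seq in alignment]

def support_percent(times_recovered, n_replicates):
    """Percentage of replicate trees in which a given node appears."""
    return 100.0 * times_recovered / n_replicates

alignment = ["ACGTACGT", "ACGTTCGT", "ACGAACGA"]  # toy data
rng = random.Random(42)
replicate = bootstrap_alignment(alignment, rng)
```

In a full analysis a tree would be inferred from each replicate, and `support_percent` would be applied per node.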

Answer 24

Technique that uses mutation rate of biomolecules to deduce the time taken for two or more species to diverge
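A back-of-the-envelope sketch, assuming a constant substitution rate on both lineages since the split, so that divergence time = distance / (2 × rate). The numbers are hypothetical.

```python
def divergence_time(distance, rate_per_lineage):
    """Time since two lineages split, under a strict molecular clock.

    distance: substitutions per site between the two sequences
    rate_per_lineage: substitutions per site per year along one lineage
    The factor 2 is there because both lineages accumulate changes.
    """
    return distance / (2.0 * rate_per_lineage)

# Hypothetical values: 2% divergence, 1e-9 substitutions/site/year
t = divergence_time(0.02, 1e-9)  # about 10 million years
```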

Answer 25

1. Initialisation
2. Scoring the matrix
3. Traceback (to get the alignment)
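The three steps can be sketched as below, assuming the card refers to global alignment (Needleman-Wunsch) with a simple match/mismatch score and a linear gap penalty. The scoring values are illustrative defaults, not from the card.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # 1. Initialisation: first row and column hold cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Scoring the matrix: each cell is the best of three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align pair
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # 3. Traceback: walk from the bottom-right cell back to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
```

When several alignments tie for the best score, this traceback returns just one of them.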

Answer 26

250 substitutions per 100 amino acids

Answer 27

proteins that have no more than 62% identity

Answer 28

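The 'expect' value - the number of matches with a score at least this good that would be expected by chance when searching a database of that size (the lower the E-value, the more significant the match)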
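For intuition, a sketch of how an E-value scales, assuming the standard relation between E-value and bit score, E = m × n × 2^(−S'), where m is the query length and n the database length. The function name and example numbers are illustrative.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score`."""
    return query_len * db_len * 2.0 ** (-bit_score)

strong = expect_value(40, 100, 1_000_000)  # high bit score: E far below 1
weak = expect_value(10, 100, 1_000_000)    # low bit score: E far above 1
```

Because it is an expected count, an E-value can exceed 1, and it grows in proportion to the size of the database searched.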

Answer 29

Skin cancer type (melanoma, non-malignant, basal-cell carcinoma etc.)

Answer 30

Predicting drug half-life

Answer 31

Predicting cancer stages (Stage I-IV)

Answer 32

highly dimensional

(62 cards)