Bioinformatics for Stratified Medicine Flashcards

1
Q

What are examples of large datasets?

Must store data such that…

A
  1. MIMIC-III: de-identified health data from ~40K critical care patients (demographics, vital signs, laboratory tests, medications, notes)
  2. The Cancer Genome Atlas (TCGA)
  3. National initiatives: 100K Genomes Project / Genomics England; President Obama’s Precision Medicine Initiative to create a 1 million person research cohort. Includes baseline health exam, clinical data derived from electronic health records (EHRs), healthcare claims, laboratory data
  4. Biological databases: hundreds of thousands of species to explore; millions of written articles in scientific journals

Detailed genetic information:

  • Gene names
  • Phenotype of mutants
  • Location of genes/mutations on chromosomes
  • Linkage (distances between genes)

High-throughput lab technologies:

  • PCR
  • Rapid, inexpensive DNA sequencing (Illumina HiSeq)
  • Microarrays (Affymetrix)
  • Genome-wide SNP chips / SNP arrays (Illumina)

Must store data such that:

  • Minimum data quality is checked
  • Well annotated according to standards
  • Made available to wide public to foster research
2
Q

What is a database?

A

A collection of data that are

  • Structured
  • Searchable (index)
  • updated periodically (release)
  • cross-referenced (hyperlinks)
3
Q

Databases are often categorised as primary or secondary. How do these differ?

A
  • Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.
  • Secondary databases comprise data derived from the results of analysing primary data.
4
Q

What are essential aspects of primary and secondary databases?

A

Primary database-

Synonyms: Archival database

Source of data: Direct submission of experimentally derived data from researchers

Secondary database-

Synonyms: Curated database, knowledgebase

Source of data: Results of analysis, literature research and interpretation, often of data in primary databases

5
Q

What are the challenges of databases?

A

Heterogeneous data sources (need for data fusion);

Complexity of the data (high-dimensionality);

Noisy, uncertain data, dirty data, the discrepancy between data-information-knowledge (various definitions)

Big data sets (when is data big? when manual handling of the data is impossible)

6
Q

What is machine learning?

A

Development of algorithms which can learn from data

7
Q

What is the difference between supervised and unsupervised machine learning?

A

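Supervised (prediction): guided. The algorithm learns from examples with known outcomes (labels).

Unsupervised (discovery): unguided. The algorithm looks for structure in the data without labels.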

8
Q

What issue can arise with supervised learning?

A

Incorrect conclusions may be drawn if an unsuitable dataset is used or the specific question is not decided beforehand.

9
Q

What is a type 1 error?

A

false positive

10
Q

What is a type 2 error?

A

false negative

11
Q

What does sensitivity (also called the true positive rate, the recall, or the probability of detection in some fields) measure?

A

Measures the proportion of positives that are correctly identified as such (i.e. the percentage of sick people who are correctly identified as having the condition).

12
Q

What does specificity (the true negative rate) measure?

A

The proportion of negatives that are correctly identified as such (i.e., the percentage of healthy people who are correctly identified as not having the condition).

13
Q

What are the positive and negative predictive values (PPV and NPV respectively)?

A

The proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively.
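
A minimal sketch of how sensitivity, specificity, PPV and NPV follow from the four confusion-matrix counts. The function name and the counts are hypothetical, chosen only for illustration; false positives are the type 1 errors and false negatives the type 2 errors.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # true positives among all positive results
        "npv": tn / (tn + fn),          # true negatives among all negative results
    }

# Hypothetical counts: 100 sick people and 900 healthy people
metrics = diagnostic_metrics(tp=90, fp=45, tn=855, fn=10)
print(metrics)
```

With these made-up counts the test is sensitive and specific, yet the PPV is noticeably lower, because healthy people greatly outnumber sick people.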

14
Q

What is a Receiver Operating characteristic curve (ROC)?

A

A plot of the trade-off between sensitivity (TPR) and specificity (1 - FPR).

X axis: one minus the specificity (the false positive rate).

Y axis: the sensitivity (the true positive rate).

Every point along the curve corresponds to exactly one cut-off.
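
A rough sketch of how a ROC curve can be traced, one point per cut-off, assuming each sample has a numeric score and a binary label (1 = positive). The scores, labels and function names are hypothetical; the area under the curve is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per cut-off, for binary labels (1 = positive)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve, area)
```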

15
Q

Which machine learning method requires no a priori hypothesis about the real number of clusters (groups) present and no additional information besides the data itself?

A

Discovery/ Unsupervised Learning

16
Q

What is an example of a discovery problem?

A

Patient similarity problem: how can the physician categorise patients to allocate the best treatment?

17
Q

How is similarity among genes expressed?

A

As a mathematical distance

Euclidean distance: the length of the line segment between the two points. Linear associations.

Manhattan distance: the distance between two points measured along axes at right angles.

Correlation distance: measures both linear and non-linear associations.

Genes close in the “expression space” have similar expression profiles
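
A small sketch of the three distances applied to two hypothetical expression profiles. The correlation distance here is assumed to be one minus the Pearson correlation, which captures linear association only; the card does not say which correlation is meant, and other definitions exist.

```python
import math

def euclidean(a, b):
    """Length of the straight line segment between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Distance measured along axes at right angles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def correlation_distance(a, b):
    """1 - Pearson correlation (an assumed definition; others exist)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

# Hypothetical profiles: gene_b is exactly twice gene_a in every sample
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b),
      correlation_distance(gene_a, gene_b))
```

The two profiles are far apart in Euclidean and Manhattan terms but have zero correlation distance, because they rise and fall together.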

18
Q

What are two unsupervised learning techniques / techniques for reducing the dimensionality of such datasets?

A

Principal Component Analysis (PCA)

Hierarchical clustering

19
Q

What is Principal component analysis?

A

An exploratory technique to simplify a dataset

It is a linear transformation that chooses a new coordinate system for the data set such that

  • the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component)
  • the second greatest variance lies on the second axis, and so on
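
A minimal sketch of finding the first principal component, assuming a small dataset held as a list of rows. It uses power iteration on the sample covariance matrix, which is one of several ways to obtain the direction of greatest variance; the data and function name are hypothetical.

```python
import math

def first_principal_component(data, iterations=100):
    """Unit vector along the direction of greatest variance (power iteration)."""
    n, dims = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dims)]
    centred = [[row[j] - means[j] for j in range(dims)] for row in data]
    # Sample covariance matrix of the centred data
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vec = [1.0] * dims
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length
        vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

# Hypothetical expression values for two genes that vary mostly along the diagonal
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]]
pc1 = first_principal_component(data)
print(pc1)
```

Projecting each centred row onto this vector gives a one-dimensional summary of the two-dimensional data.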
20
Q

Principal component analysis can be used to…

A

Reduce number of dimensions in data

Find patterns in high-dimensional data

Visualise data of high dimensionality

21
Q

What are example applications of Principal component analysis?

A
  • Face recognition
  • Image compression
  • Gene expression analysis
22
Q

What is the most common algorithm for unsupervised learning?

A

Hierarchical clustering

23
Q

How is hierarchical clustering achieved?

A

At the beginning, each object (gene) is its own cluster. In each subsequent step, the two closest clusters are merged into one, until only one cluster is left.

  1. Assign each item to its own cluster.
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that there is now one cluster fewer.
  3. Compute the distances (similarities) between the new cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
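
The steps above can be sketched as follows. This is an illustrative version only: it assumes single linkage (cluster distance is the smallest distance between members), which the card does not specify, and it recomputes distances on each pass rather than updating them. The profiles are hypothetical.

```python
def hierarchical_clustering(points, distance):
    """Agglomerative clustering with single linkage; returns the merge history."""
    clusters = [[i] for i in range(len(points))]  # step 1: one cluster per item
    merges = []
    while len(clusters) > 1:
        best = None
        # Steps 2-3: compute cluster-to-cluster distances and find the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # step 4: repeat until one cluster remains
    return merges

def euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

# Hypothetical expression profiles for four genes
profiles = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 5.3]]
history = hierarchical_clustering(profiles, euclidean)
for left, right, dist in history:
    print(left, right, round(dist, 3))
```

The merge history is what a dendrogram draws: each merge becomes a join at the height of its distance.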
24
Q

What are biological pathways?

What are three examples?

A

Groups of genes interacting to produce a certain product or trigger a change in a cell

  1. Signalling pathways
  2. Gene-regulatory pathways
  3. Metabolic pathways
25
Q

What are two ways in which biological pathways can be represented?

A
  1. Pathways as collections of genes (gene sets)
  2. The same gene can belong to multiple pathways

26
Q

What is pathway analysis?

A

One way of inferring biological meaning from lists of genes by looking at overlaps with known gene sets

27
Q

What are the three reasons as to why pathway analysis is useful?

A
  1. Perturbations at the single gene level might not explain the whole picture
    - A single gene mutation might not be enough to perturb an entire pathway because of redundancy
    - Mutations of opposite effect might compensate each other so that the outcome of pathway is not disrupted even if some of its genes are
  2. Shift to a pathway-centered view of biological systems
  3. Can be used to generate hypotheses about the phenomenon studied that can be taken forward for further evaluation
28
Q

What is the fundamental question in pathway analysis?

How can this be answered?

A

Given the overlap between a list of genes and a gene set of interest, what is the probability of obtaining the same or a greater overlap between the two by chance?

Performing enrichment analysis

29
Q

What are prerequisites for pathway analysis?

A
  1. Experiment that generates a gene list
  2. Meaningful categories (gene functions, pathways, etc.) - creates gene set
  3. Association between genes and categories (annotation) - creates gene set
  4. Methods to estimate which gene sets are significantly perturbed by an experiment (enrichment analysis)
30
Q

What are meta databases?

A

Collection of many different resources (GO, pathways, published signatures, etc.) provided for purpose of testing a list of genes against gene sets coming from different sources at once

31
Q

Why can it be useful to interrogate more than one resource to discover enriched pathways?

A

The same pathway can appear quite different according to the pathway resource used.

32
Q

What are two enrichment methods?

A
  1. Overrepresentation analysis
    - Estimates the significance of the overlap between a list and a gene set
    - Choice of genes in the list based on an arbitrary threshold
    - Fisher’s test
  2. Ranked enrichment analysis
    - Genes are sorted according to a meaningful metric
    - No arbitrary threshold needed

33
Q

What is enrichment analysis?

A

A collection of statistical methods for estimating how enriched a gene set is in a list of genes of interest (do genes in the gene list come from a gene set more frequently than would be expected by chance?).

34
Q

What steps are involved in overrepresentation analysis?

A
  • A list of genes is generated by an experiment
  • In most cases genes are chosen according to a certain threshold
    - P-value
    - Fold change
35
Q

In overrepresentation analysis, why is the size of the overlap between genes in our list and genes belonging to a gene set not enough to estimate significance?

A

Because a large overlap could be due to a larger number of genes in the gene list of interest. If for example our gene list contained all known genes, the overlap between that gene list and any pathway would be very high but this would not be particularly meaningful.

36
Q

In Overrepresentation analysis what statistical tests are used to calculate the significance of overlap computed?

A
  • Is the overlap between my list of genes and the gene set I am testing bigger than what I would get by randomly selecting the same number of genes?
  • Usually computed using Fisher’s Exact Test, which takes:
    - Size of the list
    - Size of the gene set
    - Size of the overlap
    - Size of the universe (number of genes tested)
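
A minimal sketch of the overlap test, assuming the one-sided form of Fisher's exact test, which is equivalent to a hypergeometric tail probability. The function name and all the numbers are hypothetical. The same overlap is tested against two universe sizes to show how much the choice of universe matters.

```python
from math import comb

def overlap_p_value(list_size, set_size, overlap, universe):
    """P(overlap >= observed) when list_size genes are drawn at random from the universe."""
    max_overlap = min(list_size, set_size)
    total = comb(universe, list_size)
    # Sum the hypergeometric probabilities of the observed overlap and anything larger
    tail = sum(comb(set_size, k) * comb(universe - set_size, list_size - k)
               for k in range(overlap, max_overlap + 1))
    return tail / total

# Hypothetical numbers: 50-gene list, 100-gene pathway, overlap of 10 genes
p_genome = overlap_p_value(50, 100, 10, universe=20000)
p_array = overlap_p_value(50, 100, 10, universe=500)
print(p_genome, p_array)
```

Against a 20,000-gene universe an overlap of 10 is far more than expected by chance; against a 500-gene universe it is about what random selection would give.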
37
Q

In overrepresentation analysis what strongly affects results?

A

The universe, which can be defined as:

  • The genome
  • Only genes present on the array
  • Only expressed genes
  • Only genes for which pathway annotation is available

If, for example, we are running a small-scale transcriptomic experiment where only 100 genes are tested, no more than 100 genes could come up as significant.

38
Q

What limitation of Overrepresentation analysis does rank-based enrichment overcome?

A

Overrepresentation analysis requires a list of genes to compute significance. The threshold used to select these genes has an impact on the results of the enrichment

39
Q

What steps comprise rank-based enrichment?

A

Takes as input a list of genes ranked (sorted) by:

  • P-value
  • Fold change
  • Other metrics

Are the genes from a gene set overrepresented at the top of the list compared to randomly picked lists of genes of similar sizes?

The enrichment is significant if the genes of the gene set are mainly located at the top of the ranked list

The enrichment is not significant if the genes of the gene set are randomly scattered across the list

40
Q

What is an example of a rank-based enrichment method?

How is this computed?

A

Gene Set Enrichment Analysis (GSEA)

Enrichment Score calculated by screening the list top-down:

  • Increase the statistic whenever a gene in the ranked list belongs to gene set S
  • Decrease the statistic whenever a gene in the ranked list does not belong to gene set S
  • The score equals the maximum deviation from zero
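
A simplified sketch of the running-sum score described above. It assumes equal step sizes (unweighted), whereas GSEA itself weights the steps by each gene's ranking metric and assesses significance by permutation. Gene names and sets are hypothetical, and the gene set is assumed to contain some, but not all, of the ranked genes.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: the maximum deviation from zero."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            running += 1 / hits      # gene belongs to S: step up
        else:
            running -= 1 / misses    # gene does not belong to S: step down
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked list, most significant gene first
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})        # set clustered at the top
es_scattered = enrichment_score(ranked, {"g2", "g5", "g8"})  # set spread across the list
print(es_top, es_scattered)
```

The set concentrated at the top of the list reaches a much larger deviation than the scattered set.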
41
Q

What is network analysis?

A

Another way of inferring biological meaning from gene lists looking at connections between genes

  • Establish links between genes
  • Analyse the structure of the network
42
Q

What is a network in a biological context?

A

Can be represented by a graph composed of nodes and edges.

Nodes can represent biological entities:

  • Genes
  • Proteins
  • Variants
  • Metabolites

Edges represent relationships between nodes:

  • Activation/Repression
  • Physical interaction
  • Binding
  • Cleavage
  • Co-expression

Edges can be directed, as in the case of activation (one molecule activates another but not vice versa), or undirected, as in the case of co-expression (if gene A is correlated with gene B then gene B is correlated with gene A).
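
One possible way to hold such a network in code is an adjacency mapping. In this sketch directed edges are stored one way and undirected edges both ways. All node names and relationships are hypothetical.

```python
from collections import defaultdict

# Hypothetical edges, for illustration only
directed_edges = [("TF1", "geneA", "activation"), ("TF1", "geneB", "repression")]
undirected_edges = [("geneA", "geneB", "co-expression"),
                    ("geneA", "geneC", "co-expression")]

neighbours = defaultdict(set)
for source, target, _relation in directed_edges:
    neighbours[source].add(target)      # stored in one direction only
for node_a, node_b, _relation in undirected_edges:
    neighbours[node_a].add(node_b)      # stored in both directions
    neighbours[node_b].add(node_a)

# Number of nodes reachable in one step from each node
degree = {node: len(linked) for node, linked in neighbours.items()}
print(degree)
```

Nodes with an unusually high degree are candidate hubs.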

43
Q
  1. What is a tool for network analysis?
  2. What is this?

A

  1. Ingenuity
  2. A commercial application for analysing complex omics datasets in a network framework

Multiple modules available:

  • Pathway/Function/Disease analysis (Fisher’s test)
  • Network generation
  • Upstream regulators

Knowledge is based on curated published results.

44
Q

How does Ingenuity transform a list of genes into sets of relevant networks using pre-existing published knowledge?

A

A link is created between two entities if there exists a published experiment that describes this relationship. The nature of the relationship is included in the network.

45
Q

What does Ingenuity take into account in its inference process?

A

Certain recurrent patterns of biological networks like the presence of hubs.

46
Q
  1. What machine learning technique was used to identify diagnostic biomarkers?
  2. What method was used?
  3. What statistical test is typically used?
A
  1. Supervised
  2. Supervised multiclass pathway activity inference method:
    - For each pathway, expression patterns of its constituent genes are summarised into pathway activity
    - Infer a feature as a weighted linear summation of expression of its constituent genes
    - Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regards to disease phenotypes
    - Classification is then performed on the resulting low-dimensional pathway activity profile
  3. Enrichment analysis; typically the rank-based form is used.
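
A tiny sketch of the "weighted linear summation" idea from point 2. The weights here are invented for illustration; in the method described they would come from the optimisation model. Gene names and expression values are likewise hypothetical.

```python
def pathway_activity(expression, weights):
    """Weighted linear summation of the expression of a pathway's constituent genes."""
    return sum(weights[gene] * expression[gene] for gene in weights)

# Hypothetical weights; in the method described they come from an optimisation model
weights = {"geneA": 0.5, "geneB": -0.25, "geneC": 1.0}
# Hypothetical expression values for one sample; geneD is not part of this pathway
sample = {"geneA": 2.0, "geneB": 4.0, "geneC": 1.5, "geneD": 9.0}
activity = pathway_activity(sample, weights)
print(activity)
```

Computing one such value per pathway per sample gives the low-dimensional pathway activity profile on which classification is then performed.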
47
Q

How are hierarchical clustering and classification used to identify gene signatures specific to disease as diagnostic biomarkers?

A
  1. Comprehensive analysis of gene expression in paired lesional and non-lesional psoriatic tissue samples, compared with controls
  2. Establish differences in RNA expression patterns across all tissue types
  3. Ensembles of decision tree predictors were employed to cluster psoriatic samples on the basis of gene expression patterns and reveal gene expression signatures that best discriminate molecular disease subtypes
48
Q

Describe a network module-based method for identifying cancer prognostic signatures.

A
  1. A human protein functional interaction (FI) network was constructed by combining curated and uncurated data sources using a machine learning technique
  2. Modules were derived from a highly reliable gene functional interaction network
  3. A feature is inferred as a weighted linear summation of the expression of its constituent genes
  4. By assigning gene co-expression values as weights for the FI network, network modules were discovered containing genes having similar expression patterns in a disease, and used as features to model disease heterogeneity
  5. Survival curves for the high and low module expression groups were derived, and act as a proof of principle for using module 2 expression as a cross-platform prognostic signature.
49
Q

How can pathway analysis and Gene Set Enrichment Analysis (GSEA) be applied to identify predictive biomarkers/signatures that are associated with time to progression under therapy?

A

Whole-genome gene expression profiling was performed on 42 biopsy samples (from SAKK 19/05 trial) using Affymetrix exon arrays, and associations with the following endpoints: time-to-progression (TTP) under therapy, tumor-shrinkage (TS), and overall survival (OS) were investigated.

Gene set enrichment analysis (GSEA) was performed.

GSEA revealed a significant enrichment of angiogenesis-associated genes among the genes that associate with the TTP under BE therapy endpoint.

50
Q

Gene pathways here are based upon gene expression

True or false

A

True