Bioinformatics for Stratified Medicine Flashcards
What are examples of large datasets?
- MIMIC-III: de-identified health data from ~40K critical care patients
Demographics, vital signs, laboratory tests, medications, notes.
- The Cancer Genome Atlas (TCGA)
- National initiatives: 100K Genomes Project / Genomics England; President Obama’s initiative to create a 1-million-person research cohort (Precision Medicine Initiative), which includes a baseline health exam, clinical data derived from electronic health records (EHRs), healthcare claims and laboratory data
- Biological databases: Hundreds of thousands of species to explore. Millions of written articles in scientific journals
Detailed genetic information: gene names, phenotype of mutants, location of genes/mutations on chromosomes, linkage (distances between genes)
High Throughput lab technologies:
PCR
Rapid inexpensive DNA sequencing (Illumina HiSeq)
Microarrays (Affymetrix)
Genome-wide SNP chips / SNP arrays (Illumina)
Must store data such that:
- Minimum data quality is checked
- Well annotated according to standards
- Made available to wide public to foster research
What is a database?
A collection of data that are
- Structured
- Searchable (index)
- updated periodically (release)
- cross-referenced (hyperlinks)
Databases are often categorised as primary or secondary. How do these differ?
- Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.
- Secondary databases comprise data derived from the results of analysing primary data.
What are essential aspects of primary and secondary databases?
Primary database-
Synonyms: Archival database
Source of data: Direct submission of experimentally derived data from researchers
Secondary database-
Synonyms: Curated database, knowledgebase
Source of data: Results of analysis, literature research and interpretation, often of data in primary databases
What are the challenges of databases?
Heterogeneous data sources (need for data fusion);
Complexity of the data (high-dimensionality);
Noisy, uncertain data, dirty data, the discrepancy between data-information-knowledge (various definitions)
Big data sets (when is data big? when manual handling of the data is impossible)
What is machine learning?
Development of algorithms which can learn from data
What is the difference between supervised and unsupervised machine learning?
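Supervised (prediction): guided, the algorithm learns from data with known outcomes (labels)
Unsupervised (discovery): unguided, the algorithm looks for structure in the data itself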
What issue can arise with supervised learning?
Incorrect conclusions may be drawn if an appropriate dataset and a specific question are not chosen.
What is a type 1 error?
false positive
What is a type 2 error?
false negative
What does sensitivity (also called the true positive rate, recall, or probability of detection in some fields) measure?
Measures the proportion of positives that are correctly identified as such (i.e. the percentage of sick people who are correctly identified as having the condition).
What does Specificity/ true negative rate measure?
The proportion of negatives that are correctly identified as such (i.e., the percentage of healthy people who are correctly identified as not having the condition).
What are the positive and negative predictive values (PPV and NPV respectively) ?
The proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively.
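A minimal sketch of how sensitivity, specificity, PPV and NPV follow from the four counts of a confusion matrix. The function name and the example counts are hypothetical and chosen only for illustration.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all sick people
        "specificity": tn / (tn + fp),  # true negatives among all healthy people
        "ppv": tp / (tp + fp),          # true positives among all positive results
        "npv": tn / (tn + fn),          # true negatives among all negative results
    }


# Hypothetical counts: 100 sick people (90 detected) and 900 healthy people (855 cleared)
metrics = diagnostic_metrics(tp=90, fp=45, tn=855, fn=10)
print(metrics)
```

With these made-up counts the test is sensitive (0.9) and specific (0.95), yet only two thirds of positive results are true positives, because healthy people far outnumber sick people.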
What is a receiver operating characteristic (ROC) curve?
A plot of the trade-off between sensitivity (TPR) and specificity (1 – FPR).
X axis: one minus the specificity (the false positive rate)
Y axis: the sensitivity (the true positive rate)
Every point along the curve corresponds to exactly one cut-off.
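The idea that each cut-off gives one point on the curve can be sketched as follows. This is an illustrative implementation under simple assumptions (binary labels, a sample is called positive when its score is at or above the cut-off); the function names, scores and labels are made up.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per cut-off, for binary labels (1 = positive)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve, auc(curve))
```

Lowering the cut-off moves along the curve from (0, 0) towards (1, 1): sensitivity rises, but so does the false positive rate.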
What machine learning method requires no a priori hypothesis on the real number of clusters (groups) present and requires no additional information besides the data itself?
Discovery/ Unsupervised Learning
What is an example of a discovery problem?
Patient similarity problem: how can the physician categorise patients to allocate the best treatment?
How is similarity among genes expressed?
As a mathematical distance
Euclidean distance: length of a line segment between the two points. Linear associations
Manhattan distance: distance between two points measured along axes at right angles.
Correlation distance: measures both linear and non-linear associations
Genes close in the “expression space” have similar expression profiles
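A small sketch of the three distances applied to two expression profiles. The correlation distance here is taken as 1 minus the Pearson correlation, which is an assumption: Pearson captures linear association, and a rank-based correlation such as Spearman would be one alternative for monotonic non-linear relationships. The gene profiles are invented.

```python
import math


def euclidean(a, b):
    """Length of the straight line segment between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Distance measured along axes at right angles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation_distance(a, b):
    """1 - Pearson correlation (assumed definition); 0 means perfectly correlated."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / (norm_a * norm_b)


# Hypothetical expression of two genes across four samples
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b),
      correlation_distance(gene_a, gene_b))
```

The two profiles differ in magnitude, so the Euclidean and Manhattan distances are large, but they rise and fall together, so the correlation distance is essentially zero. The choice of distance therefore decides which genes count as "close".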
What are two unsupervised learning techniques for exploring or reducing the dimensionality of such datasets?
Principal Component Analysis (PCA)
Hierarchical clustering
What is Principal component analysis?
An exploratory technique to simplify a dataset
It is a linear transformation that chooses a new coordinate system for the data set such that
- greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component)
- the second greatest variance on the second axis and so on
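As a rough illustration of the first axis, the sketch below finds the direction of greatest variance by power iteration on the sample covariance matrix. Power iteration, the function name and the toy data are assumptions made for illustration; it also assumes the starting vector is not orthogonal to the first component. A real analysis would typically use a library routine.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the first principal component.

    rows: list of samples, each a list of feature values.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]

    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        # Repeatedly multiply by the covariance matrix and renormalise
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data projected onto the direction v
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance


# Toy data: the second feature is roughly twice the first
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
direction, variance = first_principal_component(data)
print(direction, variance)
```

For this toy data the direction points roughly along (1, 2), and the variance along it is larger than the variance of either original feature, which is why one axis is enough to summarise most of the data.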
Principal component analysis can be used to…
Reduce number of dimensions in data
Find patterns in high-dimensional data
Visualise data of high dimensionality
What are example applications of Principal component analysis?
- Face recognition
- Image compression
- Gene expression analysis
What is the most common algorithm for unsupervised learning?
Hierarchical clustering
How is hierarchical clustering achieved?
At the beginning, each object (gene) is a cluster. In each of the subsequent steps, the two closest clusters are merged into one cluster until there is only one cluster left.
1. Assigns each item to its own cluster.
2. Finds the closest (most similar) pair of clusters and merges them into a single cluster, so that there is now one less cluster.
3. Computes distances (similarities) between the new cluster and each of the old clusters.
4. Repeats steps 2 and 3 until all items are clustered into a single cluster of size N.
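The merging procedure can be sketched in a few lines. The flashcard does not say how the distance between two clusters is defined, so this sketch assumes single linkage (the distance between the closest pair of members) and Manhattan distance; both choices, the function names and the toy profiles are illustrative only.

```python
def hierarchical_clustering(points, distance):
    """Agglomerative clustering with single linkage (an assumed linkage rule).

    Returns the merges as (cluster_a, cluster_b, distance) tuples, where each
    cluster is a tuple of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]  # each item starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members
                d = min(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Merge the pair, leaving one less cluster, then repeat
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical expression profiles of four genes in two samples
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = hierarchical_clustering(profiles, manhattan)
for merge in merges:
    print(merge)
```

The recorded merge order and distances are the information a dendrogram displays: here genes 0 and 1 join first, then genes 2 and 3, and the two pairs join last.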
What are biological pathways?
What are three examples?
Groups of genes interacting to produce a certain product or trigger a change in a cell
- Signalling pathways
- Gene-regulatory pathways
- Metabolic pathways
What are two ways in which biological pathways can be represented?
1. Pathways as collections of genes (gene sets)
2. The same gene can belong to multiple pathways
What is pathway analysis?
One way of inferring biological meaning from lists of genes by looking at overlaps with known gene sets
What are the three reasons as to why pathway analysis is useful?
1. Perturbations at the single gene level might not explain the whole picture
- A single gene mutation might not be enough to perturb an entire pathway because of redundancy
- Mutations of opposite effect might compensate each other so that the outcome of the pathway is not disrupted even if some of its genes are
2. Shift to a pathway-centered view of biological systems
3. Can be used to generate hypotheses about the phenomenon studied that can be taken forward for further evaluation
What is the fundamental question in concern to pathway analysis?
How can this be answered?
Given the overlap between a list of genes and a gene set of interest, what is the probability of obtaining the same or a greater overlap between the two by chance?
Performing enrichment analysis
What are prerequisites for pathway analysis?
- Experiment that generates a gene list
- Meaningful categories (gene functions, pathways, etc.) - creates gene set
- Association between genes and categories (annotation) - creates gene set
- Methods to estimate which gene sets are significantly perturbed by an experiment (enrichment analysis)
What are meta databases?
Collection of many different resources (GO, pathways, published signatures, etc.) provided for purpose of testing a list of genes against gene sets coming from different sources at once
Why can it be useful to interrogate more than one resource to discover enriched pathways?
The same pathway can appear quite different according to the pathway resource used.
What are two enrichment methods?
1. Overrepresentation analysis
- Estimates the significance of the overlap between a list and a gene set
- Choice of genes in the list based on an arbitrary threshold
- Example: Fisher’s test
2. Ranked enrichment analysis
- Genes are sorted according to a meaningful metric
- No arbitrary threshold needed
What is enrichment analysis?
A collection of statistical methods for estimating how enriched a gene set is in a list of genes of interest (are genes in the gene list coming from a gene set more frequently than would be expected by chance?).
What steps are involved in overrepresentation analysis?
- List of genes generated by an experiment
- In most cases genes are chosen according to a certain threshold, for example on p-value or fold change
In overrepresentation analysis, why is the size of the overlap between genes in our list and genes belonging to a gene set not enough to estimate significance?
Because a large overlap could be due to a larger number of genes in the gene list of interest. If for example our gene list contained all known genes, the overlap between that gene list and any pathway would be very high but this would not be particularly meaningful.
In Overrepresentation analysis what statistical tests are used to calculate the significance of overlap computed?
- Is the overlap between my list of genes and the gene set I am testing bigger than what I would get by randomly selecting the same number of genes?
- Usually computed using Fisher’s Exact Test, from four quantities: size of the list, size of the gene set, size of the overlap, and size of the universe (number of genes tested)
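A minimal sketch of the one-sided Fisher's exact test on these four quantities, written as a hypergeometric upper tail. The function name and the example numbers are hypothetical.

```python
from math import comb


def overlap_p_value(list_size, set_size, overlap, universe_size):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing `overlap` or more gene-set members when drawing
    `list_size` genes at random from a universe of `universe_size` genes,
    of which `set_size` belong to the gene set.
    """
    total = comb(universe_size, list_size)
    max_overlap = min(list_size, set_size)
    favourable = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / total


# Hypothetical experiment: 100 genes in the list, 50 in the gene set, 10 shared
p_genome = overlap_p_value(100, 50, 10, universe_size=20000)
p_small = overlap_p_value(100, 50, 10, universe_size=1000)
print(p_genome, p_small)
```

The same overlap of 10 genes is far more surprising in a universe of 20,000 genes than in one of 1,000, which shows how much the result depends on the universe size passed in.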
In overrepresentation analysis what strongly affects results?
The universe
Possible choices of universe:
- The whole genome
- Only genes present on the array
- Only expressed genes
- Only genes for which pathway annotation is available
If for example we are running a small scale transcriptomic experiment where only 100 genes are tested, no more than 100 genes could have come up as significant
What limitation of Overrepresentation analysis does rank-based enrichment overcome?
Overrepresentation analysis requires a list of genes to compute significance. The threshold used to select these genes has an impact on the results of the enrichment
What steps comprise rank-based enrichment?
Takes as input a list of genes ranked (sorted) by a metric such as p-value, fold change or another metric
Are the genes from a gene set overrepresented at the top of the list compared to randomly picked lists of genes of similar sizes?
The enrichment is significant if the genes of the gene set are mainly located at the top of the ranked list
The enrichment is not significant if the genes of the gene set are randomly scattered across the list
What is an example of a rank-based enrichment method?
How is this computed?
Gene Set Enrichment Analysis (GSEA)
Enrichment Score calculated by scanning the ranked list top-down:
- Increase the statistic whenever a gene in the ranked list belongs to gene set S
- Decrease the statistic whenever a gene in the ranked list does not belong to gene set S
- The score equals the maximum deviation from zero
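The running statistic can be sketched as below: it goes up on a gene that belongs to S, down on a gene that does not, and the score is the largest deviation from zero. This is a simplified, unweighted version with equal step sizes. GSEA itself weights the increments by the ranking metric and assesses significance by permutation, neither of which is shown. Gene names and gene sets are invented, and the function assumes the gene set contains some but not all genes in the list.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (simplified sketch, not full GSEA)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene in the set, step down on a gene outside it
        running += 1 / n_hits if is_hit else -1 / n_misses
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Hypothetical ranked list (most significant first) and two gene sets
ranked = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]
top_set = {"G1", "G2", "G3"}
scattered_set = {"G2", "G5", "G8"}
es_top = enrichment_score(ranked, top_set)
es_scattered = enrichment_score(ranked, scattered_set)
print(es_top, es_scattered)
```

The set concentrated at the top of the list reaches a score close to 1, while the scattered set stays much closer to zero.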
What is network analysis?
Another way of inferring biological meaning from gene lists, by looking at connections between genes
- Establish links between genes
- Analyse the structure of the network
What is a network in a biological context?
Can be represented by a graph composed of nodes and edges.
Nodes can represent biological entities: genes, proteins, variants, metabolites
Edges represent relationships between nodes: activation/repression, physical interaction, binding, cleavage, co-expression
Edges can be directed as in the case of activation (one molecule activates another but not vice versa) or undirected - co-expression (if gene A is correlated with gene B then gene B is correlated with gene A).
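A small sketch of a network stored as an adjacency structure, showing the difference between a directed edge (activation) and an undirected edge (co-expression). The class, its method names and the gene names are hypothetical.

```python
from collections import defaultdict


class Network:
    """Graph of biological entities stored as node -> set of neighbours."""

    def __init__(self):
        self.neighbours = defaultdict(set)

    def add_edge(self, source, target, directed=False):
        self.neighbours[source].add(target)
        self.neighbours[target]  # make sure the target exists as a node
        if not directed:
            # An undirected relationship holds in both directions
            self.neighbours[target].add(source)

    def degree(self, node):
        """Number of outgoing connections; high-degree nodes are candidate hubs."""
        return len(self.neighbours[node])


net = Network()
net.add_edge("TF1", "GeneA", directed=True)  # hypothetical: TF1 activates GeneA
net.add_edge("TF1", "GeneB", directed=True)  # hypothetical: TF1 activates GeneB
net.add_edge("GeneA", "GeneB")               # hypothetical co-expression
print(dict(net.neighbours))
```

Here TF1 points to GeneA but GeneA does not point back, whereas GeneA and GeneB each list the other as a neighbour.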
1. What is a tool for network analysis?
2. What is this?
Ingenuity
- Commercial application for analysing complex omics datasets in a network framework
Multiple modules available:
- Pathway/Function/Disease analysis (Fisher’s test)
- Network generation
- Upstream regulators
Knowledge based on curated published results
How does Ingenuity transform a list of genes into sets of relevant networks using pre-existing published knowledge?
A link is created between two entities if there exists a published experiment that describes this relationship. The nature of the relationship is included in the network.
What does Ingenuity take into account in its inference process?
Certain recurrent patterns of biological networks like the presence of hubs.
1. What machine learning technique was used to identify diagnostic biomarkers?
2. What method was used?
3. What statistical test is typically used?
1. Supervised
2. Supervised multiclass pathway activity inference method:
- For each pathway, the expression patterns of its constituent genes are summarised into a pathway activity
- A feature is inferred as a weighted linear summation of the expression of its constituent genes
- Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regards to disease phenotypes
- Classification is then performed on the resulting low-dimensional pathway activity profile
3. Enrichment analysis; typically a rank-based method is used.
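The weighted linear summation in the pathway activity method can be illustrated with a toy example. The weights below are made up; in the method described they would come from the optimisation model, which is not shown here. Gene names and expression values are also hypothetical.

```python
def pathway_activity(expression, weights):
    """Weighted linear sum of the expression of a pathway's genes for one sample."""
    return sum(weights[gene] * expression[gene] for gene in weights)


# Hypothetical gene weights for one pathway (would be learned by the optimisation model)
weights = {"GeneA": 0.5, "GeneB": -0.25, "GeneC": 1.0}
# Hypothetical expression of one sample; GeneD is not in the pathway and is ignored
sample = {"GeneA": 2.0, "GeneB": 4.0, "GeneC": 3.0, "GeneD": 9.0}
activity = pathway_activity(sample, weights)
print(activity)
```

Each pathway collapses many gene measurements into a single number per sample, so a classifier sees one feature per pathway instead of one per gene.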
How are hierarchical clustering and classification used to identify gene signatures specific to disease as diagnostic biomarkers?
- Comprehensive analysis of gene expression in paired lesional and non-lesional psoriatic tissue samples, compared with controls
- establish differences in RNA expression patterns across all tissue types
- Ensembles of decision tree predictors were employed to cluster psoriatic samples on the basis of gene expression patterns and reveal gene expression signatures that best discriminate molecular disease subtypes
Describe a network module-based method for identifying cancer prognostic signatures.
- A human protein functional interaction (FI) network constructed by combining curated and uncurated data sources using a machine learning technique
- Modules derived from a highly reliable gene functional interaction network
- Infer a feature as a weighted linear summation of expression of its constituent genes
- Assigning gene co-expression values as weights for the FI network, network modules were discovered containing genes having similar expression patterns in a disease, and used as features to model disease heterogeneity
- Survival curves for the high and low module expression groups were derived, and act as a proof of principle for using module 2 expression as a cross-platform prognostic signature.
How can pathway analysis and Gene Set Enrichment Analysis (GSEA) be applied to identify predictive biomarkers/signatures that are associated with time to progression under therapy?
Whole-genome gene expression profiling was performed on 42 biopsy samples (from SAKK 19/05 trial) using Affymetrix exon arrays, and associations with the following endpoints: time-to-progression (TTP) under therapy, tumor-shrinkage (TS), and overall survival (OS) were investigated.
Gene set enrichment analysis (GSEA) was performed
GSEA revealed a significant enrichment of the angiogenesis-associated genes within the genes that associate with TTP under BE therapy endpoint
Gene pathways here are based upon gene expression
True or false
True