QBIO2001 Flashcards

Big data

1
Q

What are biomarkers?

A

Measurable data signatures that are diagnostic of a person’s condition or disease state

2
Q

What is GWAS?

A

Genome-wide association study (GWAS)
An approach that rapidly scans markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease

3
Q

Why are some diseases not detectable by SNPs?

A

• Some diseases cannot be detected through SNPs alone because they result from an interaction between genes and the environment

4
Q

What is something in gene analysis that should be done in the future?

A

• Gene analysis currently has no temporal component; we need to study genes, diseases and the environment comprehensively and dynamically
• Studying the system when it is perturbed is useful
o Weaknesses of the system are exposed
o Connections between system and environment are figured out

5
Q

Why are humans unreliable test subjects? What is the solution to this?

A

They can lie and may not follow the treatment properly

• This is why mice and other animals are used

6
Q

What is an example of an experiment that had to be done with mice, and not humans due to the flaws of human experiments? What were problems with this study?

A

• For example, diet studies
o Mice on a high-fat Western diet had:
 Increased anxiety, impaired short-term memory and laziness
 Weak bone structure
 High blood glucose
o Mice with calorie restriction
 Lived longer
• However, the diet study is unreliable as all mice are genetically identical and belong to the same mouse strain
o Doesn’t take genetic diversity into account

o Calorie restriction can be good but can also be bad depending on genetics, which is why the previous experiment doesn’t work
o Fat, glucose and insulin response also depends on genes

7
Q

What are the three major sub-species of laboratory mice?

A

o Laboratory mice are derived from three major sub-species
 Mus musculus domesticus
 Mus musculus musculus
 Mus musculus castaneus

8
Q

How is genetic diversity introduced in lab mice and why?

A

o The original populations of mice are genetically very different from each other
o A collection of strains is obtained by crossbreeding these populations
o This is useful for introducing genetic diversity into environmental experiments
o You can then see the combined gene + environment result

9
Q

What is the difference in what has to be studied in monogenic vs polygenic diseases?

A

Monogenic vs polygenic diseases
• Monogenic: can look at a SNP and say with high probability whether the disease will occur
• Polygenic: have to look at both the many genes involved and the environment

10
Q

What are the different types of networks and a bit about them?

A

• Cell signaling networks
o Phosphorylation
o Kinase has to recognize substrate and bind to substrate
• Transcriptional networks
o Which gene is regulating expression of which gene
o Genes regulate each other and themselves
o All different transcripts change expression of other transcripts
o Transcription factors have to interact with each other to form protein complexes
o Gene regulatory networks
o Gene regulatory circuitry
• Protein-protein interactions networks
o Proteins interact together to function
• Metabolic networks
o Looking at metabolites
• And more

11
Q

Talk about the insulin cell signaling network

A

• Cell signaling network- insulin
o Insulin receptor
 Recognize and allow insulin to bind, which triggers signaling cascade
o IRS-1, once phosphorylated by the receptor, passes the signal on to different kinases (MEK1, MEK2, ERK)
o The kinases phosphorylate their substrates, which they recognize by motifs
o Kinases eventually control expression of the genes
o GLUT4: glucose transporter held in vesicles that translocate to the membrane and bring glucose from the surface into the cell
 If that pathway is broken, your cell is insulin insensitive

12
Q

What are transcription factors and what can they do?

A

o Transcription factors are proteins that recognize DNA sequences and bind to specific DNA sequences called motifs
 Allow the cell to differentiate

13
Q

What could transcription factors be used for in medicine?

A

 Embryonic stem cells can differentiate into different cell types- all kinds of them
• Could be used to generate tissues
• Can be studied using ChIP sequencing (ChIP-seq)

14
Q

How are protein-protein interaction networks looked at?

A

o Physical interaction networks
 Multiple proteins come together and physically attach to each other
o Cross-link different proteins
o Measured by using mass spectrometry

15
Q

How are metabolic networks organised?

A

o Organized by concept of functions of cells and the metabolites that contribute to the function

16
Q

What is DNA sequencing?

A

Determining the order of the nucleotides (A, C, G, T) in the DNA of an organism, such as an animal or plant

17
Q

What is transcriptome/RNA sequencing?

A

Read the letters (sequence) of the mRNA to find out which mRNAs are present in the cell and how many copies there are; the copy number indicates how highly each gene is expressed
o Can measure transcriptomes: the transcription level of different genes
o Shows which combinations of genes are expressed in different cell types
o Genes can be expressed at different levels (cell-specific genes) or at similar levels (housekeeping genes, which are expressed in all cell types) across cell types

18
Q

What is ChIP sequencing and how is it done?

A

o Measure the DNA sequence that a transcription factor binds to
o Know where transcription factor is binding- what motifs of DNA it binds to
o A transcription factor regulates the gene to which it binds
 Transcription factors are proteins
o If the sequenced reads are aligned back to the genome, you can see exactly where the transcription factor binds
o There will be background noise in sequencing experiments; the signal comes from very sharp peaks
o Data that can be viewed along the genome:
 Histogram of gene expression
 Accessible Chromatin
 Histone modifications
 Transcription factor binding

19
Q

What is a pathway database?

A

o Pathway Database: Computerize current knowledge of molecular and cellular biology in terms of the pathway of interacting molecules or genes

20
Q

What is a genes database?

A

o Genes Database: Maintain gene catalogs of all sequenced organisms and link each gene product to a pathway component

21
Q

What is a ligand database?

A

o Ligand Database: Organize a database of all chemical compounds in living cells and link each compound to a pathway component

22
Q

What are pathway tools?

A

o Pathway Tools: Developed new bioinformatics technologies for functional genomics, such as pathway comparison, pathway reconstruction and pathway design

23
Q

What is clustering a typical procedure for?

A

Creating regulatory networks

24
Q

How can you study protein phosphorylation?

A

 Have control cells and test cells
 Mix lysates 1:1
 Enzymatic digestion
• Break proteins up so can be passed through the mass spectrometer
 Enrichment of phosphorylated peptides
 nanoLC-MS/MS analysis
 Lots of computation
 Signalling dynamics graph
 Do clustering
 Once all phosphorylation sites have been measured, partition them into different patterns
 K-means clustering is used to partition the phosphorylation sites into clusters that have different patterns

25
Q

What does clustering do and what is it often used for?

A

• Partitions samples that have similar patterns into same groups- grouping technique
• Methods of grouping samples (x) that are similar- according to some pre-defined criteria: how do you measure similarity?
• A form of unsupervised learning- no label information (y) is used to tell the algorithm which observations should be grouped together
o The algorithm itself decides which observations go in which cluster
• It is often used for exploratory data analysis- a way of looking for patterns or structure in the data that are of interest

26
Q

So what is the aim of clustering?

A

To group observations that are similar based on predefined criteria

27
Q

What are the issues of clustering?

A

o Data types- counts, ratio, ordinal, categorical and continuous
 Similarity depends on data types
o Missing data
 Replace with appropriate number or remove instances
o Scaling
o (Dis)similarity metric
 Pearson correlation, Spearman correlation, Euclidean, Manhattan…

28
Q

What is a metric?

A

 A metric is a measure of the similarity or dissimilarity between two data objects and it’s used to form data points into clusters

29
Q

What are the two main classes of distance?

A

o Correlation coefficients (compares shape of expression curves)
o Distance metrics
 Manhattan distance
 Euclidean distance

30
Q

What is genetics useful for?

A

• Genetics provides an unbiased tool for discovering factors that trigger or modify disease, without any prior knowledge of the nature or mechanism

31
Q

If a disease develops before the age of one, what are the most prominent causes of death? Is it genetic or environmental?

A

• If it develops before the age of one, the cause of death is most probably genetic, even if there is environmental influence
o Perinatal and congenital conditions are the most prominent causes

32
Q

If a disease develops between ages 1 and 44, what are the most prominent causes of death? Is it genetic or environmental?

A

• If it develops between ages 1 and 44, the cause of death is most probably environmental, even if there is genetic influence
o Suicide is the most prominent cause

33
Q

If a disease develops between ages 45 and 95+, what are the most prominent causes of death? Is it genetic or environmental?

A

• If it develops between ages 45 and 95+, there are both environmental and genetic causes of death
o Heart failure and disease for males
o Dementia and Alzheimer’s disease for females
o Cancer

34
Q

What are the 3 classes of diseases?

A

Monogenic

  • High penetrance
  • Rare
  • Entirely genetic
  • Population genetics are used (GWAS)

Monogenic variable

  • Low penetrance
  • Variable expression
  • Rare frequency and large effect size
  • Mostly genetic with some environment
  • Targeted population genetic studies (GWAS and/or DNA sequencing) are used

Polygenic

  • Common
  • Mostly environment with some genetic
  • Population genetics are used (GWAS)
35
Q

How do we perform classical mendelian linkage analysis in complex diseases?

A
  1. Identify the mode of inheritance and the chromosome (the approach used in the 1980s and 1990s)
    a. Autosomal, x linked, dominant, recessive, mitochondrial
  2. Use polymorphic markers (SNP, restriction map, microsatellites) to narrow down the region
    a. Restriction map- take DNA, cut with enzymes
    b. Microsatellites, analyze with PCR
  3. Confirm with Sanger sequencing
36
Q

How does Sanger sequencing work?

A

a. Synthesise DNA in vitro
b. Chain termination method: a dideoxynucleotide (ddNTP) lacks the 3’ OH group and therefore can’t form a phosphodiester bond, preventing DNA polymerase from continuing. Only add one type of ddNTP to each reaction. Then heat, denature and separate on a gel (polyacrylamide)
i. Chain termination for each nucleotide
c. In gels, bigger fragments move slower and the smallest move fastest
d. Super accurate but slow

37
Q

Talk about two mutations in which mapping and sequencing helped identify their causes

A

• Achondroplasia (Dwarfism)
o Mapping and sequencing
o Mutation in FGFR3
o Find people of different families and see if they have the same mutation
 Older fathers make more mutations
o 80% of cases are sporadic mutations, linked to the age of the father
• Pain: mutations in Nav1.7 causing congenital insensitivity to pain have led to the development of new painkillers targeting this channel
o Dominant- erythromelalgia
 Burning in hands and feet
o Recessive- congenital insensitivity to pain
 Feel no pain

38
Q

What are examples of important life factors/diseases with a genetic basis of disease?

A

• Human height
o 70-80% genetic
o Only about 20% of the genetics that control height have been identified in people without Mendelian disease
• Obesity
o 50% genetics, 50% environment
o Tissue expression of genes at GWAS loci is compared to random gene sets
o Obesity/BMI genes are mostly expressed in the brain
o Variant of KSR2: carriers produce more insulin after they eat, and eat more
 Affects metabolism
• Mental illness
• Lifespan

39
Q

What is a polygenic disease and what are polygenic contributions to disease?

A

o Relatively common genetic changes: the common variant, common disease model
o Rare genetic changes: the multiple rare variants, common disease model
o Clearly more common than monogenic obesity (that is, there is a genetic predisposition to obesity that has a complex genetic architecture)
o Gene combinations, or “epistatic” interactions?
o Need to think about gene-environment interactions

40
Q

What are the aims of the HapMap project?

A

o Provide insight into patterns of genetic variation in the human population
o Guide design and analysis of medical genetic studies
o Increase power and efficiency of association studies to medical traits

41
Q

What is the HapMap project?

A
  • Public resource
  • Catalogue of common genetic variants that occur in humans
  • Genetic data from 4 populations (n=270) with African, Asian and European ancestry
  • Millions of SNPs identified
42
Q

How many alleles do SNPs have?

A

2 (most SNPs are biallelic)

43
Q

What is a haplotype?

A

o Haplotype: a set of SNP alleles along a chromosome that tend to be inherited together

44
Q

What does association mean?

A

o Any relationship between two measured quantities that renders them statistically dependent

45
Q

What is an Odds Ratio?

A

o An odds ratio is a measure of association between an exposure and an outcome. The odds ratio represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
o Odds ratios are most commonly used in case-control studies, but they can also be used in cross-sectional and cohort study designs
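
The definition above can be sketched for a 2x2 table of counts. This is a minimal Python illustration; the function name `odds_ratio` and the counts are made up for the example and are not from the course material.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds of the outcome with the exposure divided by the odds without it."""
    odds_with_exposure = exposed_cases / exposed_controls
    odds_without_exposure = unexposed_cases / unexposed_controls
    return odds_with_exposure / odds_without_exposure


# Hypothetical case-control counts for carriers and non-carriers of a variant.
example = odds_ratio(exposed_cases=30, exposed_controls=70,
                     unexposed_cases=10, unexposed_controls=90)
print(f"odds ratio = {example:.2f}")
```

An odds ratio of 1 means the exposure makes no difference to the odds; values above 1 mean the outcome is more likely with the exposure.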

46
Q

How can you find and order your GWAS results?

A

o Large case/control population
o Blood samples
o Genotype common variants via SNP chip (nearly 15 million SNPs)
o Population analysis
 P values below about 10^-8 count as genome-wide significant (5 × 10^-8 is the usual convention)
o Whole genome disease loci (Manhattan plot)
 Can tell us how obesity works
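
Why the threshold is so small can be sketched with a Bonferroni correction. This Python example assumes roughly one million independent tests, which is a commonly quoted approximation and not a figure from this card; the SNP names and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall error rate near alpha."""
    return alpha / n_tests


def genome_wide_hits(p_values, alpha=0.05, n_tests=1_000_000):
    """Keep only the SNPs whose p-value is below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {snp: p for snp, p in p_values.items() if p < threshold}


# Hypothetical association results (SNP id -> p-value).
results = {"snp_a": 3e-9, "snp_b": 2e-6, "snp_c": 0.04}
hits = genome_wide_hits(results)
print(bonferroni_threshold(0.05, 1_000_000), sorted(hits))
```

On a Manhattan plot, these hits are the points that rise above the horizontal significance line.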

47
Q

How does whole genome/exome sequencing work?

A
•	Genomic library
•	Construct shotgun library
•	Break DNA into fragments
•	Hybridization
•	Pulldown
o	Wash away the fragments that don’t code for protein
•	Captured DNA
•	DNA sequencing
•	Mapping, alignment, variant calling
•	For whole-genome sequencing
o	Don’t hybridise; go straight to DNA sequencing
48
Q

What are three main platforms for sequencing, along with their advantages and disadvantages?

A

-Illumina
 Huge amount of data
 Short ~100 bp reads
 90x coverage of the genome- too much information
 Most commonly used
 Good for exomes, as there is less need to worry about reassembling the reads

-Ion Torrent
 Fast and cheap
 Medium 400 bp reads
 Less coverage of genome

-Pacific Biosciences
 Expensive
 Really long reads
 Less coverage of genome

49
Q

How does Illumina sequencing work?

A

 Sequencing by synthesis:
1. Universal primers are immobilized on a glass surface inside the flow cell
2. Genomic DNA is converted into sequencing templates ready to load into the flow cell
o Genomic DNA is sheared/digested
o Add on a single strand/enzymatically tail
o Add labeled nucleotide
3. Hybridize the DNA templates to the immobilized primers inside the flow cell
o Put a poly-A tail with a dye in it
4. Visualize the template: primer duplexes by illuminating the surface with a laser and imaging with an electronic camera connected to a microscope. Record the positions of all the duplexes on the surface. After imaging, the dye molecules are cleaved and washed away
5. Flow in DNA polymerase and one type of fluorescently labelled nucleotide. The polymerase will catalyze the addition of labelled nucleotide to appropriate primers
o Similar to Sanger sequencing
6. Wash out the polymerase and unincorporated nucleotides
7. Visualise the incorporated labelled nucleotides by illuminating the surface with a laser and imaging with the camera. Record the positions of the incorporated nucleotides
8. Remove the fluorescent label on each nucleotide
9. Repeat the process from step 5 with the next nucleotide (stepping through A, C, G, T) until the desired read length is achieved

50
Q

What does absolute correlation capture?

A

Both positive and negative correlations

51
Q

Do Spearman and Pearson have units? Why/why not?

A

• Spearman and Pearson are unit free

o Pearson is unit free because its denominator (the product of the two standard deviations) removes unit dependence through standardization
o Spearman is unit free because it is calculated on ranks
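
A minimal Python sketch of why correlation is unit free: changing height from centimetres to metres leaves both coefficients unchanged. The data are invented, the helper names are illustrative, and the rank function assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical measurements: same heights in two different units.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_m = [h / 100 for h in height_cm]
weight_kg = [52.0, 58.0, 69.0, 80.0, 85.0]

print(pearson(height_cm, weight_kg), pearson(height_m, weight_kg))
print(spearman(height_cm, weight_kg), spearman(height_m, weight_kg))
```

Taking `abs(pearson(x, y))` gives the absolute correlation, which treats strong negative and strong positive relationships alike.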

52
Q

Is the distance approach unit dependent or independent?

A

Unit dependent
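
A short Python sketch of that unit dependence, using two invented samples measured as (height, weight). Converting height from centimetres to metres changes both distances, which is one reason scaling matters before clustering.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical samples: (height, weight in kg), height in cm and then in m.
sample_a_cm, sample_b_cm = (170.0, 65.0), (180.0, 70.0)
sample_a_m, sample_b_m = (1.70, 65.0), (1.80, 70.0)

print(euclidean(sample_a_cm, sample_b_cm), euclidean(sample_a_m, sample_b_m))
print(manhattan(sample_a_cm, sample_b_cm), manhattan(sample_a_m, sample_b_m))
```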

53
Q

What are the impacts of potential pitfalls in correlation?

A

o Even a perfect correlation doesn’t mean that the independent variable really causes the change, and assuming so can lead to false positive identification

54
Q

What are between cluster dissimilarity measures?

A
o	Single (minimum)
o	Complete (maximum)
o	Distance between centroids
	Centroid is the median of each group
	Robust to outliers
o	Average(mean) linkage
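
The four measures above can be sketched in Python as follows. The function names are illustrative. One labelled assumption: this sketch takes the centroid as the coordinate-wise mean, which is the usual definition, whereas the card describes it as the median; `statistics.median` could be swapped in per coordinate.

```python
import math


def single_linkage(cluster_1, cluster_2):
    """Smallest distance between any pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_1 for q in cluster_2)


def complete_linkage(cluster_1, cluster_2):
    """Largest distance between any pair of points, one from each cluster."""
    return max(math.dist(p, q) for p in cluster_1 for q in cluster_2)


def average_linkage(cluster_1, cluster_2):
    """Mean of all pairwise distances between the two clusters."""
    total = sum(math.dist(p, q) for p in cluster_1 for q in cluster_2)
    return total / (len(cluster_1) * len(cluster_2))


def centroid(cluster):
    """Coordinate-wise mean of the points (assumption, see lead-in)."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_distance(cluster_1, cluster_2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(cluster_1), centroid(cluster_2))


# Hypothetical one-dimensional clusters.
left = [(0.0,), (1.0,)]
right = [(3.0,), (5.0,)]
print(single_linkage(left, right), complete_linkage(left, right),
      average_linkage(left, right), centroid_distance(left, right))
```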
55
Q

What are the different clustering algorithms?

A

Hierarchical

Partitioning

56
Q

How does hierarchical clustering work and what is the result of this? How do you do it in R

A

 Produces trees (dendrograms), built bottom up
 Sorted into less and less specific degrees of similarity, like an evolutionary tree
 Everything is eventually combined into one cluster, with varying degrees of similarity in different branches
 End nodes –> most similar instances
 In R: hc <- hclust(dist(dataset[,]))
• dist() defaults to Euclidean distance
 cutree(hc, k=) cuts the tree into k clusters
 Start with n sample (or m feature) clusters
 At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters
 The distance between clusters is defined by the method used
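
The bottom-up merging can be sketched in plain Python. This is a naive illustration using average linkage and Euclidean distance, not what R's `hclust` does internally; the function names and the points are made up, and stopping at `k` clusters plays the role of cutting the tree.

```python
import math


def average_linkage(points, cluster_1, cluster_2):
    """Mean pairwise distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_1 for j in cluster_2)
    return total / (len(cluster_1) * len(cluster_2))


def agglomerate(points, k):
    """Start with one cluster per sample and merge the closest pair until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a stays valid
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical samples in two dimensions.
samples = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(samples, k=3))
print(agglomerate(samples, k=2))
```

Early merges are never undone, which is the rigidity listed as a disadvantage of hierarchical clustering.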

57
Q

How does the partitioning clustering approach work?

A

 Partition into different groups –> no mixing
 Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups
• Specify number of clusters you need
 Arbitrarily choose k objects as the initial cluster centers
• This is done randomly by the computer
 Until no change, do
• Reassign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
• Update the cluster means (that is) calculate the mean value of the objects for each cluster
• Continues until it no longer changes as it updates
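
The loop above is essentially k-means, and can be sketched in plain Python. The function name, the fixed random seed and the sample points are assumptions made for the illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (cluster index per point, list of cluster centres)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrarily choose k objects as initial centres
    assignment = None
    for _ in range(max_iter):
        # Reassign each object to the cluster whose centre is most similar.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(point, centers[c]))
            for point in points
        ]
        if new_assignment == assignment:  # no change, so stop
            break
        assignment = new_assignment
        # Update each centre to the mean of the objects assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers


# Hypothetical data: two well-separated groups of three points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centres = kmeans(data, k=2)
print(labels, centres)
```

Which group gets label 0 depends on the random starting centres, so only the grouping itself is meaningful.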

58
Q

What are 3 different types of partitioning clustering? Describe them.

A

-K-means
• Is sensitive to outliers, which may substantially distort the distribution of the data
• Assigns each sample to a single cluster

-K-medoids
• Takes the most centrally located object in the cluster as the centre, instead of the mean as k-means does

-Fuzzy c-means clustering
• Very similar to the k-means algorithm in that the objective functions are virtually identical
• But the fuzzy c-means algorithm assigns each sample a vector of membership confidences, one for each cluster
o Randomly initialize the membership matrix M
o Calculate each centroid as the membership-weighted mean of the samples
o Update the membership matrix from M(t) to M(t+1)
o If ||M(t+1)-M(t)|| is below a chosen threshold, stop; otherwise repeat from the centroid step
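
A hedged Python sketch of fuzzy c-means, following the textbook formulation rather than the exact formula on the original slide, which is not reproduced in this card. The fuzzifier `m=2`, the tolerance, the seed and the data are all assumptions for illustration.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-5, max_iter=200, seed=0):
    """Return (membership matrix, centroids); each membership row sums to 1."""
    rng = random.Random(seed)
    # Randomly initialize the membership matrix and normalize each row.
    membership = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        membership.append([value / sum(row) for value in row])
    centroids = []
    for _ in range(max_iter):
        # Centroid j is the mean of the points weighted by membership ** m.
        centroids = []
        for j in range(n_clusters):
            weights = [row[j] ** m for row in membership]
            centroids.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / sum(weights)
                for d in range(len(points[0]))
            ))
        # New membership depends on relative distances to all centroids.
        updated = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centroids]
            updated.append([
                1.0 / sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0))
                          for k in range(n_clusters))
                for j in range(n_clusters)
            ])
        change = max(abs(new - old)
                     for new_row, old_row in zip(updated, membership)
                     for new, old in zip(new_row, old_row))
        membership = updated
        if change < tol:  # membership matrix has stopped changing
            break
    return membership, centroids


# Hypothetical data: two well-separated groups of three points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
memberships, centres = fuzzy_c_means(data, n_clusters=2)
hard_labels = [row.index(max(row)) for row in memberships]
print(hard_labels)
```

Taking the largest membership per sample, as `hard_labels` does, turns the fuzzy result back into a k-means-style hard assignment.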

59
Q

Why is clustering useful?

A

• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space
o We can cluster columns for identification
o We can cluster rows to reduce redundancy in predictive models

60
Q

For a stability-based metric for choosing k…

A

o Try removing time points (or columns) from the data matrix and clustering again; the most stable k value is the one that should be used –> the clustering for a good k should not change

61
Q

Partitioning clustering vs hierarchical clustering

A

Partitioning
Advantages
Optimal for certain criteria
Samples automatically assigned to clusters

Disadvantages
Need initial k
Often require long computation times
All samples are forced into a cluster

Hierarchical clustering-
Advantages
Faster computation
Visual

Disadvantages
Unrelated objects are eventually joined
Rigid, cannot correct later for erroneous decisions made earlier
Hard to define clusters

62
Q

What does Var(measurement) include?

A

o Phenotypic variability (signal of interest)
o Measurement error
o Natural biological variation

63
Q

What is a way to capture variability?

A

Replication

64
Q

What are biological replicates?

A
  • Capture biological variation

- Test multiple people

65
Q

What are technical replicates?

A
  • Capture measurement error

- Measure the same individual multiple times

66
Q

What is the goal of hypothesis testing?

A
  • Goal- to rule out chance (natural biological variation + measurement error) as a plausible explanation for the difference
  • Test a claim using samples obtained from a population
67
Q

How do you go about hypothesis testing?

A
  1. Define null and alternative hypotheses
  2. Select an appropriate test statistic
    a. The two-sample t-test takes variance into account
  3. Obtain p-value and interpret test result
    a. Choose significance level
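
The three steps can be sketched in Python with a two-sample (Welch) t statistic. Because the standard library has no t distribution, this sketch obtains the p-value with a permutation test instead, which is a swapped-in technique and not necessarily what the course used. The measurements are invented.

```python
import random
import statistics


def welch_t(a, b):
    """Difference in means divided by its standard error (uses both variances)."""
    standard_error = (statistics.variance(a) / len(a)
                      + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Share of random relabellings with a statistic at least as extreme."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Null hypothesis: treatment does not change the mean. Hypothetical data.
control = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0]
treated = [6.1, 5.9, 6.2, 6.0, 5.8, 6.3]
alpha = 0.05  # chosen significance level
p_value = permutation_p_value(treated, control)
print("reject null" if p_value < alpha else "fail to reject null")
```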
68
Q

What is a type I error?

A

o Occurs when the sample data appear to show a deviation from the population when, in fact, there is none
o In this case the researcher will wrongly reject the null hypothesis (a false positive)

69
Q

What can type I errors be caused by?

A

o Type I errors can be caused by unusual, unrepresentative samples or by testing across a large number of variables. Just by chance, an extreme value may cause the test statistic to fall in the critical region even though the variable has no association with the measurement
o Can occur due to biological variability
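
Why testing many variables inflates type I errors can be sketched with a small calculation. It assumes the tests are independent, which real omics variables usually are not, so treat the numbers as an illustration only.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha, n_tests):
    """Average number of false positives if no variable has a real effect."""
    return alpha * n_tests


for n in (1, 20, 1000):
    print(n, round(family_wise_error_rate(0.05, n), 3),
          expected_false_positives(0.05, n))
```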

70
Q

What is the probability of a type I error?

A

o The probability of a type I error is equal to α, the significance level, which has typical values of 0.05, 0.01 or 0.001

71
Q

What is a type II error?

A

o A type II error occurs when the sample does not appear to show strong difference from population mean, but in fact, there is a real difference
o In this case, the researcher will fail to reject the null hypothesis

72
Q

Why would a type II error occur?

A

o Type II errors are commonly the result of a small effect and large variability from technical and biological aspects

73
Q

What is power?

A

The probability of detecting a real signal when it is there

74
Q

What is the typical value of power?

A

Power is typically set at 80% (1-β=0.8), where β is the probability of making a type II error
High power is more desirable, as low-power studies are often not reproducible

75
Q

What are factors that affect power of a test?

A

o The size of the effect
 The bigger the effect size, the bigger the power
o The standard deviation of the characteristics
 The smaller the sd, the bigger the power
o Sample size
 The bigger the sample size, the bigger the power is
o The significance level desired
 The bigger the significance level, the bigger the power is
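
The four factors can be seen in a rough power calculation. This Python sketch uses a normal approximation for a two-sided, two-group comparison with equal group sizes; the formula choice and all the numbers are assumptions for illustration, not the course's method.

```python
from statistics import NormalDist


def approx_power(effect_size, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a mean difference of `effect_size`."""
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Size of the true difference measured in standard errors.
    signal = abs(effect_size) / (sd * (2 / n_per_group) ** 0.5)
    return standard_normal.cdf(signal - z_critical)


baseline = approx_power(effect_size=0.5, sd=1.0, n_per_group=64)
print(round(baseline, 2))
print(approx_power(1.0, 1.0, 64) > baseline)               # bigger effect
print(approx_power(0.5, 2.0, 64) < baseline)               # bigger sd
print(approx_power(0.5, 1.0, 128) > baseline)              # bigger sample
print(approx_power(0.5, 1.0, 64, alpha=0.10) > baseline)   # bigger alpha
```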

76
Q

What is a confounding factor?

A

a third factor which is related to both the explanatory variable (observed) and the outcome, and which accounts for some or all of the observed relationship between the two

77
Q

What is the batch effect?

A
A potential confounder that includes:
	Experimenter change
	Test subject changes
	Location change
	Time change
	Technique change
78
Q

How are confounding variables minimized?

A

• Randomization
• Control
• Randomization and control are key elements for ensuring that the statistical inferences are valid and that the bias caused by confounding factors (variables) is minimized
• In completely randomized design, all treatments are randomly allocated among all experimental subjects
• This allows every experimental subject to have equal probability of receiving a treatment
• Balanced experimental design: all treatments have equal sample size
o Equal treatment and control groups in batch to have more power
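
A completely randomized, balanced design can be sketched in a few lines of Python. The function name, the seed and the mouse identifiers are hypothetical.

```python
import random


def randomize_balanced(subjects, treatments, seed=0):
    """Shuffle the subjects and deal them into equally sized treatment groups."""
    if len(subjects) % len(treatments) != 0:
        raise ValueError("subjects cannot be split into equal groups")
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)  # every subject is equally likely to land in any group
    group_size = len(shuffled) // len(treatments)
    return {
        treatment: shuffled[i * group_size:(i + 1) * group_size]
        for i, treatment in enumerate(treatments)
    }


mice = [f"mouse_{i}" for i in range(1, 13)]
design = randomize_balanced(mice, ["control", "high_fat_diet", "calorie_restriction"])
for treatment, group in design.items():
    print(treatment, group)
```

Running the same function separately within each batch would give equal treatment and control numbers per batch.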

79
Q

What is stratification and what does it do?

A
  • Blocking
  • The most common solution is to stratify the group if there is unequal variation in the group
  • Stratification maximizes power and minimizes imbalance in a sample
80
Q

What is a control?

A

a control group is a group of subjects left untreated for the treatment of interest but otherwise experiencing the same conditions as the treated subjects

81
Q

Why is placebo given?

A

o Placebo needs to be given to weed out changes in behavior and mindset as a confounding factor

82
Q

What is a blind experiment?

A

 Don’t let the subject know if they are receiving the treatment or not

83
Q

What is a double blind experiment?

A

 Neither the experimenter nor the subjects know which drug is which: distribution of the drugs is handled by a third party

84
Q

Why is normalization important?

A

So we can minimize the effect caused by unknown confounding factors

85
Q

How do we normalise by mean (or median) centering?

A
o	Find the mean for all replicates
o	Pull the data points so the means are centered to 0
o	Data is then normalized
o	In R
	Normalized x = x - mean(x)
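
Mean and median centering can be sketched in Python as follows. The function names and the replicate values are made up for the example.

```python
import statistics


def mean_center(values):
    """Shift a replicate so that its mean becomes 0."""
    centre = statistics.mean(values)
    return [v - centre for v in values]


def median_center(values):
    """Shift a replicate so that its median becomes 0 (less affected by outliers)."""
    centre = statistics.median(values)
    return [v - centre for v in values]


replicate = [1.0, 2.0, 3.0]
replicate_with_outlier = [1.0, 2.0, 10.0]
print(mean_center(replicate))
print(median_center(replicate_with_outlier))
```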
86
Q

Why would quantile normalization be used?

A

o If the differences between replicates are non-linear

o More robust approach

87
Q

How do you perform quantile normalization?

A

o Rank based normalization
o Find highest value in each replicate
o Find median of the genes
o Centre the three genes onto this line
o Find the second highest, third highest… and repeat this process
o In R:
 preprocessCore package, which has to be installed from Bioconductor
 normalize.quantiles(cbind(x,y))
• Quantile keeps the y values, mean/median centering doesn’t
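
The rank-based procedure can be sketched in plain Python. This is an illustration, not the `preprocessCore` implementation: it averages the values at each rank with the mean (the card mentions the median, which could be swapped in) and it assumes there are no tied values within a replicate.

```python
def quantile_normalize(columns):
    """Give every replicate (column) the same distribution of values."""
    n = len(columns[0])
    sorted_columns = [sorted(column) for column in columns]
    # Reference value for each rank: average across replicates at that rank.
    rank_means = [
        sum(column[i] for column in sorted_columns) / len(columns)
        for i in range(n)
    ]
    normalized = []
    for column in columns:
        order = sorted(range(n), key=lambda i: column[i])
        result = [0.0] * n
        for rank, index in enumerate(order):
            result[index] = rank_means[rank]  # replace value by its rank's reference
        normalized.append(result)
    return normalized


# Hypothetical expression values for three genes in two replicates.
x = [5.0, 2.0, 3.0]
y = [4.0, 1.0, 6.0]
x_norm, y_norm = quantile_normalize([x, y])
print(x_norm, y_norm)
```

After normalization both replicates contain exactly the same set of values, and only the order across genes differs.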

88
Q

What was the context of the melanoma experiment?

A

• Common in Caucasians living in sunny climates
• Of those that metastasise (stage III), about 40% go on to live cancer free, but another 40% succumb to the disease in less than 1 year
• Have gene expression data and clinical data for 79 stage III individuals
o Fresh frozen tissue, from which the RNA signal is a lot higher

• Detailed clinical and pathological data, with mutation status for majority of the patients
• Small clinical data
o Small relative to other cancer types
o Large study of fresh tissue samples focusing on Stage III patients
o Missing values
o Potentially could lead to unstable logistic model
• Vertical (omics) data

89
Q

What were the aims of the melanoma experiment?

A

o New prognostic markers
 To determine whether there are significant biomarker and pathway differences between melanomas of good and bad prognosis after resection of nodal metastatic disease
o New therapeutic targets:
 To identify and validate the principal regulatory pathway abnormalities that characterize metastatic (stage III and IV) melanomas
 To investigate novel genomic drivers of melanoma tumor progression and outcome

90
Q

What data techniques do you use to analyse the genome?

A
  • SNP data
  • Exome seq
  • DNA seq
91
Q

What data techniques do you use to analyse the transcriptome?

A
  • mRNA array
  • microRNA
  • RNA seq
92
Q

What data techniques do you use to analyse the proteome?

A

iTRAQ

93
Q

What data techniques do you use to analyse the metabolome?

A

MS/MS

94
Q

What data techniques do you use to analyse the phenome?

A

Clinical and pathological data

95
Q

What does omics data analysis involve?

A

o Biological question
o Experimental design
o Experiment involving some high throughput biotechnologies
o Pre-processing and quality assessment
o Higher level analysis
o Biological verification and interpretation

96
Q

What does pre-processing and quality assessment include?

A
  • Data wrangling
  • Cleaning data
  • Linked closely to type of biotechnology used
  • Remove the noise
  • Makes a matrix
97
Q

What are the different types of files in raw processing?

A
  • CEL files
  • Various file types
  • Fastq files
98
Q

What are characteristics of the CEL files and what quality assessment pre-processing can you perform on them?

A
	Short-oligonucleotide chip data:
•	Quality assessment 
•	Background correction
•	Probe-level normalization
•	Probe-set summary
99
Q

What are characteristics of the Various file types and what quality assessment pre-processing can you perform on them?

A

 Two Colour array data
• Quality assessment; diagnostic plots
• Background correction
• Array normalization

100
Q

What are characteristics of the Fastq files and what quality assessment pre-processing can you perform on them?

A
	Most commonly used now
	RNA-seq data:
•	Mapping: map to reference
•	Annotation
•	Gene-level summarization 
•	Normalization
101
Q

Probes by sample matrix of:

A

o Log-ratios or log-intensities (microarray)

o Count or RPKM or Total counts (RNA-seq data)
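
RPKM (reads per kilobase of transcript per million mapped reads) can be sketched as a one-line formula. The function name and the numbers below are invented for the example.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million reads.
print(rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000))
```

Dividing by gene length and library size makes values more comparable between genes and between samples than raw counts.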

102
Q

How do you analyse expression data?

A

o Identify D.E. genes, estimation and testing
o Clustering and
o Discrimination
 DE genes are the ones able to discriminate, but may or may not need D.E. genes for discrimination

103
Q

Before analysing the data, what should you do?

A
  • Did the experiment work?
  • What do you do analytically?
  • Identify candidate DE genes between extreme survivability groups (for this particular experiment)
104
Q

How can you test if the experiment worked?

A

a. Do a summary of the data
b. Compare experiment results with other research
c. Could there be contaminants?
d. Just because there are numbers doesn’t mean they’re valid
e. Figure out if you have signal
i. Boxplots good for detecting outliers

105
Q

Generally, what can you analytically do to analyse data?

A

a. Cluster the expression profiles of genes with greatest variance
b. Principal component analysis to investigate the predominant source of variability
c. Data normalizations
i. Allows experimenter to see if experiment worked or if something went wrong in the experimental process
1. In this case study, was found that one technician didn’t do his job properly

106
Q

How do you identify candidate DE genes between extreme survivability groups?

A

a. Merge data
i. Merge different batches together
b. Perform pre-processing, quality assessment and normalization of expression data
c. Exploratory data analysis
d. Examination of patient survivability to identify extreme cases and group the samples accordingly for DE analysis
e. Identify candidate DE genes between extreme survivability groups and generate corresponding heatmap
f. Split into groups
i. For this experiment, group chosen was
1. Survived less than one year, died of melanoma (poor group)
2. Alive, NSR (no sign of recurrence, disease free), for more than 4 years (good group)
a. Now 48 samples in total
g. External validation
h. Survival analysis and prediction (machine learning)
i. Find markers –> DE analysis, network-based biomarkers…
ii. Classify

107
Q

How do you analyse the RNA biomarker?

A

• RNA –> transcriptome (mRNA array, microRNA, RNA-seq) –> PPI network –> clinical and pathological data –> phenome –> phenotype

108
Q

What is single gene level?

A

Gene by gene analysis

109
Q

What is gene set level?

A

Features are subsets of genes (set of nodes)

110
Q

What is network level?

A

Examine a subset of genes (nodes in the network) together with information on the relationships between the genes (edges in the network)

111
Q

How can you rank subnetworks?

A

 For each edge, k, the correlation difference between the two classes (GP, good prognosis, and PP, poor prognosis) was calculated
 Delta_k = PPcor_k - GPcor_k
 For each sub-network, calculate the average absolute difference in hub interactor correlation
 Rank the hub subnetworks based on their average hub difference values or use permutation tests to determine the statistical significance of each hub
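
The ranking by average absolute correlation difference can be sketched in Python. The gene names, expression values and network below are entirely hypothetical, and the permutation-test alternative mentioned above is not shown.

```python
import math


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def hub_score(hub, interactors, expr_pp, expr_gp):
    """Average absolute difference in hub-interactor correlation (PP minus GP)."""
    deltas = [
        pearson(expr_pp[hub], expr_pp[gene]) - pearson(expr_gp[hub], expr_gp[gene])
        for gene in interactors
    ]
    return sum(abs(delta) for delta in deltas) / len(deltas)


def rank_hubs(network, expr_pp, expr_gp):
    """Sort hubs from the largest to the smallest average difference."""
    scores = {hub: hub_score(hub, partners, expr_pp, expr_gp)
              for hub, partners in network.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression per gene across four samples in each class.
expr_pp = {"HUB1": [1, 2, 3, 4], "A": [1, 2, 3, 4], "B": [4, 3, 2, 1],
           "HUB2": [1, 3, 2, 4], "C": [1, 3, 2, 4]}
expr_gp = {"HUB1": [1, 2, 3, 4], "A": [4, 3, 2, 1], "B": [1, 2, 3, 4],
           "HUB2": [1, 3, 2, 4], "C": [1, 3, 2, 5]}
network = {"HUB1": ["A", "B"], "HUB2": ["C"]}

ranking = rank_hubs(network, expr_pp, expr_gp)
print(ranking)
```

Here HUB1 ranks first because its correlations with A and B flip sign between the two classes, while HUB2 behaves almost the same in both.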

112
Q

What is vertical omics data and what are the challenges of analysing this data?

A

o More genetic information on the same patients
o Contrast to horizontal (more patients with the same genetic information)
o Challenges:
 Small number of samples
 Sample mismatch
 Understanding and processing many different platforms
 Correlated information from multiple platforms
 Unbalanced number of variables between CPM and high-throughput omics platforms

113
Q

What are some prediction questions we need to ask and what are solutions to these questions?

A

• What platforms or combinations of platforms are better?

• To identify candidate biomarkers: are there different ways to think about biomarkers?
o Identify biomarkers with differential distributions
o Identify network biomarkers-which may help us understand underlying mechanisms
• Can we trust the biomarker?

o Cross-validation between multiple datasets
• Can we trust the platform?

o Wet lab validation of biotechnologies

114
Q

What can metabolite variation % be explained by?

A
o	Clinical covariates
o	Unexplained variation
o	Top SNP
o	Second top SNP
o	Other genetic factors
115
Q

Why are metabolites useful to analyse?

A

• Metabolites will change based on enzymatic pathways

o Capture immutable change of genome and environmental impact

116
Q

How can you optimise the performance of a mass spectrometer experiment?

A
  • Sensitivity
  • Limit of Detection
  • Reproducibility
  • Biological vs technical variability
  • Internal controls
  • Maintenance protocol
117
Q

Why don’t we fragment lipids?

A

They’re predictable- just look at their parent mass

118
Q

What was the Framingham heart study?

A

• Metabolomic profiling in Framingham heart study-
o Frozen samples of over 3000 people
o Follow-up of events over 12 years

119
Q

After analysing experiments, what should you always do before publishing?

A

• Establish an external cohort to make sure results aren’t an anomaly