QBIO2001 Flashcards

Question

What does clustering do and what is it often used for?

Answer 1

• Partitions samples with similar patterns into the same groups; it is a grouping technique
• Groups samples (x) that are similar according to some pre-defined criteria, which raises the question of how similarity is measured
• A form of unsupervised learning: no label information (y) is used to tell the algorithm which observations should be grouped together
  o The algorithm itself assigns observations to clusters
• Often used for exploratory data analysis, as a way of looking for patterns or structure of interest in the data

Answer 2

To group observations that are similar based on predefined criteria

Answer 3

o Data types: counts, ratio, ordinal, categorical and continuous
  ▪ Similarity depends on data types
o Missing data
  ▪ Replace with an appropriate number or remove instances
o Scaling
o (Dis)similarity metric
  ▪ Pearson correlation, Spearman correlation, Euclidean, Manhattan…

Answer 4

A metric is a measure of the similarity or dissimilarity between two data objects; it is used to form data points into clusters

Answer 5

o Correlation coefficients (compare the shape of expression curves)
o Distance metrics
  ▪ Manhattan distance
  ▪ Euclidean distance
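
A minimal Python sketch of the three measures named on this card, assuming two expression profiles of equal length. The function names and example values are illustrative and do not come from the course material.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation: compares the shape of two profiles, not their scale."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

# Hypothetical profiles: same shape, ten-fold difference in magnitude
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

print(euclidean(gene_a, gene_b))  # large: the profiles are far apart in magnitude
print(manhattan(gene_a, gene_b))  # large for the same reason
print(pearson(gene_a, gene_b))    # close to 1: the shapes are identical
```

The two profiles are far apart by either distance metric yet perfectly correlated, which is why the choice of metric changes which samples end up clustered together.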

Answer 6

• Genetics provides an unbiased tool for discovering factors that trigger or modify disease, without any prior knowledge of the nature or mechanism

Answer 7

• If it develops in less than a year, the cause of death is most probably genetic, even if there is environmental influence
  o Perinatal and congenital conditions are the most prominent causes

Answer 8

• If it develops between ages 1 and 44, the cause of death is most probably environmental, even if there is genetic influence
  o Suicide is the most prominent cause

Answer 9

• If it develops between ages 45 and 95+, there are both environmental and genetic causes of death
  o Heart failure and heart disease for males
  o Dementia and Alzheimer’s for females
  o Cancer

Answer 10

Monogenic
- High penetrance
- Rare
- Entirely genetic
- Population genetics are used (GWAS)
Monogenic variable
- Low penetrance
- Variable expression
- Rare frequency and large effect size
- Mostly genetic with some environment
- Targeted population genetic studies (GWAS and/or DNA sequencing) are used
Polygenic
- Common
- Mostly environment with some genetic
- Population genetics are used (GWAS)

Answer 11

1. Identify the inheritance pattern and chromosome (in the 80’s and 90’s)
   a. Autosomal, X-linked, dominant, recessive, mitochondrial
2. Use polymorphic markers (SNP, restriction map, microsatellites) to narrow down the region
   a. Restriction map: take DNA, cut with enzymes
   b. Microsatellites: analyze with PCR
3. Confirm with Sanger sequencing

Answer 12

a. Synthesise DNA in vitro
b. Chain termination method: a ddNTP lacks the 3’ OH group and therefore can’t form a phosphodiester bond with the next nucleotide, preventing DNA polymerase from continuing. Only add one type of ddNTP to each reaction. Then heat to denature and separate on a polyacrylamide gel
   i. One chain-termination reaction for each nucleotide
c. In gels, bigger fragments move more slowly and the smallest fragments move fastest
d. Super accurate but slow

Answer 13

• Achondroplasia (dwarfism)
  o Mapping and sequencing
  o Mutation in FGFR3
  o Find people from different families and see if they have the same mutation
    ▪ Older fathers make more mutations
  o 80% of cases are sporadic mutations because of the age of the father
• Pain: mutations in Nav1.7 causing congenital insensitivity to pain have led to the development of new painkillers targeting this channel
  o Dominant: erythromelalgia
    ▪ Burning in hands and feet
  o Recessive: congenital insensitivity to pain
    ▪ Feel no pain

Answer 14

• Human height
  o 70-80% genetic
  o Only 20% of the genetics that control height has been identified in people without a Mendelian disease
• Obesity
  o 50% genetics, 50% environment
  o Tissue expression of genes at GWAS loci compared to random gene sets
  o Obesity/BMI genes are expressed in the brain
  o Variant of KSR2: carriers produce more insulin after they eat and eat more
    ▪ Affects metabolism
• Mental illness
• Lifespan

Answer 15

o Relatively common genetic changes: the common variant-common disease model
o Rare genetic changes: the multiple rare variants-common disease model
o Clearly more common than monogenic obesity (that is, there is a genetic predisposition to obesity that has a complex genetic architecture)
o Gene combinations, or “epistatic” interactions?
o Need to think about gene-environment interactions

Answer 16

o Provide insight into patterns of genetic variation in the human population o Guide design and analysis of medical genetic studies o Increase power and efficiency of association studies to medical traits

Answer 17

* Public resource * Catalogue of common genetic variants that occur in humans * Genetic data from 4 populations (n=270) with African, Asian and European ancestry * Millions of SNPs identified

Answer 18

o Haplotype: a set of SNPs along a chromosome

Answer 19

o Any relationship between two measured quantities that renders them statistically dependent

Answer 20

o An odds ratio is a measure of association between an exposure and an outcome. It represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
o Odds ratios are most commonly used in case-control studies, but they can also be used in cross-sectional and cohort study designs
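
A short Python sketch of the definition on this card, assuming counts from a 2x2 exposure-by-outcome table. The counts and the function name are made up for illustration.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds of the outcome with the exposure divided by the odds without it."""
    odds_exposed = exposed_cases / exposed_controls
    odds_unexposed = unexposed_cases / unexposed_controls
    return odds_exposed / odds_unexposed

# Hypothetical case-control counts
or_value = odds_ratio(exposed_cases=30, exposed_controls=70,
                      unexposed_cases=10, unexposed_controls=90)
print(round(or_value, 2))
```

A value above 1 means the outcome is more likely with the exposure, a value below 1 means it is less likely, and a value of 1 means no association.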

Answer 21

o Large case/control population
o Blood samples
o Genotype common variants via SNP chip (nearly 15 million SNPs)
o Population analysis
  ▪ P values at 10^-8 are significant at the genome-wide level
o Whole-genome disease loci (Manhattan plot)
  ▪ Can tell us how obesity works

Answer 22

``` • Genomic library • Construct shotgun library • Break DNA into fragments • Hybridization • Pulldown o Wash away fragments that don’t code for protein • Captured DNA • DNA sequencing • Mapping, alignment, variant calling • For whole genome o Don’t hybridise, go straight to DNA sequencing ```

Answer 23

- Illumina
  ▪ Huge amount of data
  ▪ Short ~100 bp reads
  ▪ 90x coverage of the genome: too much information
  ▪ Most commonly used
  ▪ Good for exomes, as there is no need to worry about reassembly
- Ion Torrent
  ▪ Fast and cheap
  ▪ Medium 400 bp reads
  ▪ Less coverage of genome
- Pacific Biosciences
  ▪ Expensive
  ▪ Really long reads
  ▪ Less coverage of genome

Answer 24

Sequencing by synthesis:
1. Universal primers are immobilized on a glass surface inside the flow cell
2. Genomic DNA is converted into sequencing templates ready to load into the flow cell
   o Genomic DNA is sheared/digested
   o Add on a single strand/enzymatically tail
   o Add labelled nucleotide
3. Hybridize the DNA templates to the immobilized primers inside the flow cell
   o Put a poly-A tail with a dye in it
4. Visualize the template:primer duplexes by illuminating the surface with a laser and imaging with an electronic camera connected to a microscope. Record the positions of all the duplexes on the surface. After imaging, the dye molecules are cleaved and washed away
5. Flow in DNA polymerase and one type of fluorescently labelled nucleotide. The polymerase will catalyze the addition of labelled nucleotide to appropriate primers
   o Similar to Sanger sequencing
6. Wash out the polymerase and unincorporated nucleotides
7. Visualize the incorporated labelled nucleotides by illuminating the surface with a laser and imaging with the camera. Record the positions of the incorporated nucleotides
8. Remove the fluorescent label on each nucleotide
9. Repeat the process from step 5 with the next nucleotide (stepping through A, C, G, T) until the desired read length is achieved

Answer 25

Both positive and negative correlations

Answer 26

• Spearman and Pearson are unit free | o Pearson is unit free due to the denominator, which removes unit dependence through standardization; Spearman applies the same calculation to ranks, which carry no units

Answer 27

Unit dependent

Answer 28

o A perfect correlation doesn’t mean that the independent variable really causes the change in the dependent variable, and assuming so can lead to false positive identification

Answer 29

``` o Single (minimum) o Complete (maximum) o Distance between centroids  Centroid is the median of each group  Robust to outliers o Average(mean) linkage ```
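
A hedged Python sketch of three of the linkage rules listed on this card (single, complete and average), assuming Euclidean distance between points. The two example clusters and the function names are hypothetical. `math.dist` needs Python 3.8 or later.

```python
import math

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every point in cluster_a and every point in cluster_b."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest members (minimum)."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the two furthest members (maximum)."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean of all between-cluster pairwise distances."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical clusters lying on a line
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
```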

Answer 30

Hierarchical | Partitioning

Answer 31

▪ Produces trees or dendrograms, built bottom up
▪ Sorted into less and less specific degrees of similarity, like an evolutionary tree
▪ Everything is eventually combined into one cluster, with varying degrees of similarity sitting in different branches
▪ End nodes --> most similar instances
▪ In R: hclust(dist(dataset[,]))
  • dist() uses Euclidean distance by default
▪ cutree(tree, k=), where tree is the result returned by hclust()
▪ Start with n sample (or m feature) clusters
▪ At each step, merge the two closest clusters using a measure of between-cluster dissimilarity, which reflects the shape of the clusters
▪ The distance between clusters is defined by the linkage method used
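
The bottom-up merging described on this card can be sketched in plain Python. This is an illustrative toy, not the algorithm behind R's hclust: it assumes single linkage with Euclidean distance and stops once k clusters remain (roughly what cutting the tree at k would give). The data points are made up.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Between-cluster dissimilarity: distance of the closest pair of members."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points, k):
    """Start with one cluster per point and merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical samples measured on two features
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerate(points, k=3)
print(clusters)
```

Note the rigidity mentioned elsewhere in the deck: once two clusters are merged, this procedure never splits them again.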

Answer 32

▪ Partition into different groups --> no mixing
▪ Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups
  • Specify the number of clusters you need
▪ Arbitrarily choose k objects as the initial cluster centers
  • This is done randomly by the computer
▪ Until no change, do
  • Reassign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
  • Update the cluster means (that is, calculate the mean value of the objects for each cluster)
  • Continue until the assignment no longer changes as it updates
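
A minimal Python sketch of the k-means loop on this card, assuming Euclidean distance and points stored as tuples. The seed parameter, the `max_iter` cap and the one-dimensional example values are assumptions added for illustration.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrarily choose k objects as initial centers
    labels = None
    for _ in range(max_iter):
        # Reassign each object to the cluster with the nearest mean
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # until no change
            break
        labels = new_labels
        # Update the cluster means
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_point(members)
    return labels, centers

# Hypothetical expression values forming two obvious groups
points = [(1.0,), (1.2,), (0.8,), (9.0,), (9.2,), (8.8,)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Which cluster receives label 0 depends on the random start, so only the grouping, not the label numbers, is meaningful.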

Answer 33

- K-means is sensitive to outliers, which may substantially distort the distribution of the data
  • Assigns each sample to a single cluster
- K-medoids takes the most centrally located object in the cluster instead of the mean, as k-means does
- Fuzzy c-means clustering
  • Very similar to the k-means algorithm in that the objective functions are virtually identical
  • But the fuzzy c-means algorithm assigns each sample a vector of confidences, one for each cluster
    o Randomly initialize the membership matrix
    o Calculate the centroid as follows
    o Update M(t), M(t+1)
    o If ||M(t+1)-M(t)||

Answer 34

• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space o We can cluster columns for identification o We can cluster rows to reduce redundancy in predictive models

Answer 35

o Try removing time points (or columns) of matrices and most stable k value is the one that should be used --> k should not change

Answer 36

Partitioning
Advantages
- Optimal for certain criteria
- Samples automatically assigned to clusters
Disadvantages
- Need initial k
- Often require long computation times
- All samples are forced into a cluster
Hierarchical clustering
Advantages
- Faster computation
- Visual
Disadvantages
- Unrelated objects are eventually joined
- Rigid, cannot correct later for erroneous decisions made earlier
- Hard to define clusters

Answer 37

o Phenotypic variability (signal of interest) o Measurement error o Natural biological variation

Answer 38

Replication

Answer 39

- Capture biological variation | - Test multiple people

Answer 40

- Capture measurement error | - Measure the same individual multiple times

Answer 41

* Goal- to rule out chance (natural biological variation + measurement error) as a plausible explanation for the difference * Test a claim using samples obtained from a population

Answer 42

1. Define null and alternative hypotheses
2. Select an appropriate test statistic
   a. The two-sample t-test takes variance into account
3. Obtain the p-value and interpret the test result
   a. Choose significance level

Answer 43

o Occurs when sample data appears to show deviation from the population when, in fact, there is none o In this case the researcher will reject the null hypothesis

Answer 44

o Type I errors can be caused by unusual, unrepresentative samples or testing across a large number of variables. Just by chance an extreme value may cause it to fall in the critical region even though the variable has no association with the measurement o Can occur due to biological variability
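
A small Python sketch of why testing across many variables inflates type I errors. It assumes the tests are independent and each is run at level alpha. The Bonferroni threshold is shown as one common correction and is not mentioned on the card.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

for n_tests in (1, 20, 1000):
    print(n_tests,
          round(familywise_error_rate(0.05, n_tests), 4),
          bonferroni_threshold(0.05, n_tests))
```

With 20 independent tests at alpha = 0.05, the chance of at least one false positive is already about 64%.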

Answer 45

o The probability of a type I error is equal to α, which has typical values of 0.05, 0.01 or 0.001

Answer 46

o A type II error occurs when the sample does not appear to show strong difference from population mean, but in fact, there is a real difference o In this case, the researcher will fail to reject the null hypothesis

Answer 47

o Type II errors are commonly the result of a small effect and large variability from technical and biological aspects

Answer 48

The probability of discovering a real signal when it is there

Answer 49

Power is typically set at 80% (1-β = 0.8), where β is the probability of making a type II error. High power is more desirable, as low-power studies tend not to be reproducible

Answer 50

o The size of the effect
  ▪ The bigger the effect size, the bigger the power
o The standard deviation of the characteristics
  ▪ The smaller the sd, the bigger the power
o Sample size
  ▪ The bigger the sample size, the bigger the power
o The significance level desired
  ▪ The bigger the significance level, the bigger the power
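
The four relationships on this card can be checked with a rough Python sketch. It assumes a two-sided, two-sample comparison with equal group sizes and a known common standard deviation, uses a normal approximation, and ignores the negligible opposite tail, so the numbers are approximate. The function name and inputs are illustrative.

```python
import math
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # critical value for the chosen alpha
    standard_error = sd * math.sqrt(2 / n_per_group)  # SE of the difference in means
    return NormalDist().cdf(abs(effect) / standard_error - z_crit)

baseline = approx_power(effect=1.0, sd=1.0, n_per_group=10)
print(round(baseline, 3))
print(approx_power(2.0, 1.0, 10) > baseline)              # bigger effect
print(approx_power(1.0, 0.5, 10) > baseline)              # smaller sd
print(approx_power(1.0, 1.0, 20) > baseline)              # bigger sample
print(approx_power(1.0, 1.0, 10, alpha=0.01) < baseline)  # stricter alpha lowers power
```

Under these assumptions, about 16 subjects per group are needed to reach roughly 80% power for an effect of one standard deviation.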

Answer 51

a third factor which is related to both the explanatory variable (observed) and the outcome, and which accounts for some or all of the observed relationship between the two

Answer 52

``` A potential confounder that includes:  Experimenter change  Test subject changes  Location change  Time change  Technique change ```

Answer 53

• Randomization
• Control
• Randomization and control are key elements for ensuring that the statistical inferences are valid and that the bias caused by confounding factors (variables) is minimized
• In a completely randomized design, all treatments are randomly allocated among all experimental subjects
• This allows every experimental subject to have an equal probability of receiving a treatment
• Balanced experimental design: all treatments have equal sample size
  o Equal treatment and control groups in each batch give more power
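
A minimal Python sketch of a completely randomized, balanced allocation as described on this card. The subject names, treatment labels and seed are hypothetical, and the function assumes the number of subjects divides evenly among the treatments.

```python
import random

def randomize_balanced(subjects, treatments, seed=None):
    """Randomly allocate subjects so that every treatment gets the same number."""
    if len(subjects) % len(treatments) != 0:
        raise ValueError("subjects cannot be split evenly among treatments")
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)  # every subject is equally likely to land in any group
    per_group = len(shuffled) // len(treatments)
    return {
        treatment: shuffled[i * per_group:(i + 1) * per_group]
        for i, treatment in enumerate(treatments)
    }

subjects = [f"mouse_{i}" for i in range(1, 13)]
allocation = randomize_balanced(subjects, ["control", "treatment"], seed=42)
for group, members in allocation.items():
    print(group, members)
```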

Answer 54

* Blocking * The most common solution is to stratify the group if there is unequal variation in the group * Stratification maximizes power and minimizes imbalance in a sample

Answer 55

a control group is a group of subjects left untreated for the treatment of interest but otherwise experiencing the same conditions as the treated subjects

Answer 56

o Placebo needs to be given to weed out changes in behavior and mindset as a confounding factor

Answer 57

 Don’t let the subject know if they are receiving the treatment or not

Answer 58

Neither the experimenter nor the subjects know which drug is which: distribution of the drugs is handled by someone else

Answer 59

So we can minimize the effect caused by unknown confounding factors

Answer 60

``` o Find the mean for all replicates o Pull the data points so the means are centered to 0 o Data is then normalized o In R  Normalized x ```
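
The centering steps on this card can be sketched in Python as follows. The function name and replicate values are made up for illustration, and this is not the R call the card refers to.

```python
def mean_center(values):
    """Subtract the replicate mean so the centered values average to 0."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

replicate = [2.0, 4.0, 6.0]
centered = mean_center(replicate)
print(centered)  # [-2.0, 0.0, 2.0]
```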

Answer 61

o If non-linear | o More robust approach

Answer 62

o Rank-based normalization
o Find the highest value in each replicate
o Find the median of the genes
o Centre the three genes onto this line
o Find the second highest, third highest… and repeat this process
o In R:
  ▪ preprocessCore package, which has to be installed from Bioconductor
  ▪ normalize.quantiles(cbind(x,y))
    • Quantile keeps the y values, mean/median centering doesn’t
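
A hedged Python sketch of the rank-based idea on this card. It is not the preprocessCore implementation: it assumes every replicate has the same number of genes, ignores ties, and uses the mean across replicates at each rank as the reference value (the card mentions the median, which could be swapped in). The example values are invented.

```python
def quantile_normalize(columns):
    """Give every replicate (column) the same distribution of values, rank by rank."""
    n_genes = len(columns[0])
    sorted_columns = [sorted(col) for col in columns]
    # Reference value for each rank: average of the values holding that rank
    reference = [
        sum(col[rank] for col in sorted_columns) / len(columns)
        for rank in range(n_genes)
    ]
    normalized = []
    for col in columns:
        order = sorted(range(n_genes), key=lambda i: col[i])  # gene indices, low to high
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized

x = [5.0, 2.0, 3.0]
y = [4.0, 1.0, 6.0]
norm_x, norm_y = quantile_normalize([x, y])
print(norm_x)  # [5.5, 1.5, 3.5]
print(norm_y)  # [3.5, 1.5, 5.5]
```

After normalization both replicates contain exactly the same set of values, and each gene keeps its rank within its replicate.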

Answer 63

• Common in Caucasians living in sunny climates
• Of those that metastasise (stage III), about 40% go on to live cancer free, but another 40% succumb to the disease in less than 1 year
• Have gene expression data and clinical data for 79 stage II individuals
  o Fresh frozen tissue, where the RNA signal is a lot higher
• Detailed clinical and pathological data, with mutation status for the majority of the patients
• Small clinical data
  o Small relative to other cancer types
  o Large study of fresh tissue samples focusing on Stage III patients
  o Missing values
  o Potentially could lead to an unstable logistic model
• Vertical (omics) data

Answer 64

o New prognostic markers  To determine whether there are significant biomarker and pathway differences between melanomas of good and bad prognosis after resection of nodal metastatic disease o New therapeutic targets:  To identify and validate the principal regulatory pathway abnormalities that characterize metastatic (stage III and IV) melanomas  To investigate novel genomic drivers of melanoma tumor progression and outcome

Answer 65

- SNP data - Exome seq - DNA seq

Answer 66

- mRNA array - microRNA - RNA seq

Answer 67

Clinical and pathological data

Answer 68

o Biological question o Experimental design o Experiment involving some high throughput biotechnologies o Pre-processing and quality assessment o Higher level analysis o Biological verification and interpretation

Answer 69

* Data wrangling * Cleaning data * Linked closely to type of biotechnology used * Remove the noise * Makes a matrix

Answer 70

- CEL files - Various file types - Fastq files

Answer 71

```  Short-oligonucleotide chip data: • Quality assessment • Background correction • Probe-level normalization • Probe-set summary ```

Answer 72

 Two Colour array data • Quality assessment; diagnostic plots • Background correction • Array normalization

Answer 73

```  Most commonly used now  RNA-seq data: • Mapping: map to reference • Annotation • Gene-level summarization • Normalization ```

Answer 74

o Log-ratios or log-intensities (microarray) | o Count or RPKM or Total counts (RNA-seq data)
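
For the RNA-seq measures on this card, RPKM (reads per kilobase of transcript per million mapped reads) can be computed as in this small Python sketch. The read counts and gene length are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads, 2000 bp long, in a library of 10 million mapped reads
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(value)  # 25.0
```

Dividing by gene length and library size makes values more comparable between genes of different lengths and samples of different sequencing depth than raw counts are.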

Answer 75

o Identify D.E. genes, estimation and testing o Clustering and o Discrimination  DE genes are the ones able to discriminate, but may or may not need D.E. genes for discrimination

Answer 76

- Did the experiment work? - What do you do analytically? - Identify candidate DE genes between extreme survivability groups (for this particular experiment)

Answer 77

a. Do a summary of the data
b. Compare experiment results with other research
c. Could there be contaminants?
d. Just because there are numbers doesn’t mean they’re valid
e. Figure out if you have signal
   i. Boxplots are good for detecting outliers

Answer 78

a. Cluster the expression profiles of genes with greatest variance
b. Principal component analysis to investigate the predominant source of variability
c. Data normalizations
   i. Allows the experimenter to see if the experiment worked or if something went wrong in the experimental process
      1. In this case study, it was found that one technician didn’t do his job properly

Answer 79

a. Merge data
   i. Merge different batches together
b. Perform pre-processing, quality assessment and normalization of expression data
c. Exploratory data analysis
d. Examination of patient survivability to identify extreme cases and group the samples accordingly for DE analysis
e. Identify candidate DE genes between extreme survivability groups and generate corresponding heatmap
f. Split into groups
   i. For this experiment, the groups chosen were
      1. Survived less than one year, died of melanoma (poor group)
      2. Alive, NSR (no sign of recurrence, disease free) for greater than 4 years (rich group)
         a. Now 48 samples in total
g. External validation
h. Survival analysis and prediction (machine learning)
   i. Find markers --> DE analysis, network-based biomarkers…
   ii. Classify

Answer 80

• RNA --> transcriptome (mRNA array, microRNA, RNA-seq) --> PPI network --> Clinical and pathological data --> phenome --> phenotype

Answer 81

Gene by gene analysis

Answer 82

Features are subsets of genes (set of nodes)

Answer 83

Examine a subset of genes (nodes in the network) together with information on the relationships between the genes (edges in the network)

Answer 84

▪ For each edge k, the correlation difference between the two classes (GP and PP) was calculated
▪ Delta_k = PPcor_k - GPcor_k
▪ For each sub-network, calculate the average absolute difference in hub-interactor correlation
▪ Rank the hub sub-networks based on their average hub difference values, or use permutation tests to determine the statistical significance of each hub
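
The hub score on this card can be sketched in Python as follows, assuming Pearson correlation is the correlation used and that each edge connects the hub to one interactor. All expression values and names are invented for illustration, and the permutation test is left out.

```python
import math

def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def hub_difference(hub_gp, hub_pp, interactors_gp, interactors_pp):
    """Average absolute correlation difference (PP minus GP) over a hub's edges."""
    deltas = []
    for gp_profile, pp_profile in zip(interactors_gp, interactors_pp):
        delta_k = pearson(hub_pp, pp_profile) - pearson(hub_gp, gp_profile)
        deltas.append(abs(delta_k))
    return sum(deltas) / len(deltas)

# Hypothetical hub gene and two interactors, four patients per class
hub_gp = [1.0, 2.0, 3.0, 4.0]
hub_pp = [1.0, 2.0, 3.0, 4.0]
interactors_gp = [[2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0]]
interactors_pp = [[8.0, 6.0, 4.0, 2.0], [2.0, 3.0, 4.0, 5.0]]

score = hub_difference(hub_gp, hub_pp, interactors_gp, interactors_pp)
print(round(score, 6))
```

Here the first edge flips from a correlation of +1 in GP to -1 in PP (absolute difference 2) and the second does not change (difference 0), so the hub's average is about 1.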

Answer 85

o More genetic information on the same patients
o Contrast with horizontal (more patients with the same genetic information)
o Challenges:
  ▪ Small number of samples
  ▪ Sample mismatch
  ▪ Understanding and processing many different platforms
  ▪ Correlated information from multiple platforms
  ▪ Unbalanced number of variables between CPM and high-throughput omics platforms

Answer 86

• What platforms or combinations of platforms are better?
• To identify candidate biomarkers: are there different ways to think about biomarkers?
  o Identify biomarkers with differential distributions
  o Identify network biomarkers, which may help us understand underlying mechanisms
• Can we trust the biomarker?
  o Cross-validation between multiple datasets
• Can we trust the platform?
  o Wet lab validation of biotechnologies

Answer 87

``` o Clinical covariates o Unexplained variation o Top SNP o Second top SNP o Other genetic factors ```

Answer 88

• Metabolites will change based on enzymatic pathways | o Capture immutable change of genome and environmental impact

Answer 89

* Sensitivity * Limit of Detection * Reproducibility * Biological vs technical variability * Internal controls * Maintenance protocol

Answer 90

They're predictable- just look at their parent mass

Answer 91

• Metabolomic profiling in Framingham heart study- o Frozen samples of other 3000 people o Follow up events of 12 years

Answer 92

• Establish an external cohort to make sure results aren’t an anomaly

Big data (119 cards)