Lecture 10 Flashcards
What is the ultimate crime in bioinformatics?
not using existing resources
*Excel
Name a tool for Data Manipulation
BioConductor
What is BioConductor?
-open-source suite of programs for gene expression profiling analysis
-Runs in R statistical language
What can you do with 13 lines in BioConductor?
- GC arrays
- Identify significantly differentially expressed genes
- Display in a heatmap
Why is gene expression profiling important?
- Only 40-60% of genes identified in genome sequencing projects are functionally annotated by sequence similarity (lineage and species-specific genes)
- sequence similarity will not identify novel functions of proteins
- genes involved in regulation, interaction, or integration of pathways
- genes expressed at low levels or showing transient expression (missed)
How were genes involved in regulation, interaction, or integration of pathways traditionally identified?
by genetic (mutant) analysis and biochemical methods
What is functional genomics?
Field aiming to create and apply technologies that take advantage of sequence information to analyze the full complement of genes and proteins encoded by an organism
What are the 4 major approaches used to elucidate possible function of genes?
- Expression pattern for all genes
- Expression and distribution of all proteins
- Knocking out of genes and examination of phenotype and/or gene expression patterns
- Identifying interactions among proteins (two-hybrid analysis and newer bait methods)
What are the four omics approaches? Which one is the only context-independent one?
- Genomics - complete set of genes of an organism or organelles (**context-independent)
- Transcriptome - complete set of mRNA molecules present in cell, tissue, or organ
- Proteome - complete set of protein molec. in cell, tissue, organ
- Metabolome - complete set of metabolites in cell, tissue, organ
What methods of analysis are available for studying the genome?
Systematic DNA sequencing
What methods of analysis are available for studying the transcriptome? (4)
Microarrays, high-throughput northern analysis, ESTs, RNA-seq
What methods of analysis are available for studying the proteome?
-2D gel electrophoresis
-peptide mass spec, BioID
-two-hybrid analysis
-peptide/protein microarray
What methods of analysis are available for studying the metabolome?
- nuclear magnetic resonance (NMR) spectroscopy
- mass spectrometry
- infra-red spectroscopy
Name three profiling technologies
- cDNA Microarrays
- Oligonucleotide Microarrays
- RNA-Seq
How do cDNA microarrays work?
- cDNAs for each gene in genome are spotted onto glass slide
- Each spot represents a specific gene
- Take RNA from the populations to compare and label each with a different fluorophore (different dye colours), incorporated during RT
- Mix samples and hybridize to cDNA microarray
- Analyze colour (if two populations are mixed, yellow means equal abundance; red or green means greater expression in one population)
How do oligonucleotide microarrays work?
- Based on genome sequence, design oligonucleotides (25 nt in length) that match the 3’ end of each transcript
- Oligonucleotides are synthesized on a silicon wafer
- mRNA population is labelled by biotinylation and hybridized to the array
- Scan to see how much labeled RNA is bound to particular probe
- Determine gene expression level according to intensity
What process used to make computer chips is also used in oligonucleotide microarray technology?
Photolithographic process (deprotect nt at each position and add specific nt)
What should be considered for oligonucleotide design in microarrays? (5)
- Should have similar Tm = allow similar hybridization efficiency
- Discriminate against members of same family (sometimes impossible for paralogues)
- Oligos with specific Tm and length
- Free of secondary structure and self-annealing tendency
- unique to one species without homology with another species (discriminate bacterial from human genes)
Why is image processing done?
extract information after hybridization
- input is scanned images of array fluorescence
- grid applied to image
- position and value (with background correction) on slide associated with appropriate identifier
- output: table of IDs and values
*ScanAlyze, Affymetrix
Name 5 cell-type specific expression profiling methods
- Laser Capture Microdissection
- Specific GFP lines–>protoplasting–>FACS
- INTACT / TRAP (translating ribosome affinity purification)
- scRNA-seq
- Spatial transcriptomics
What are the problems of counts for profiling experiments? Why do we need normalization?
Microarray: counts per pixel (CCD)
Numbers (CCD or RNA-seq) are:
-arbitrary
-not comparable between samples
-not linear multiple of abundance of what you want to detect
What does normalization do?
remove trends that correlate with variables not expected to influence gene expression changes
-mean expression level across samples should be similar
What is an MA plot?
A = x axis = Log intensity of expression
M = y axis = log ratio of intensity relative to median value
- expect a cloud shape
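A minimal sketch of how M and A values could be computed, assuming the common two-channel convention (M = log2 red minus log2 green, A = mean of the two log2 intensities). The function name and the intensities are hypothetical and only for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) pairs for paired two-channel intensities."""
    pairs = []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m = log_r - log_g            # M: log2 ratio (y axis)
        a = 0.5 * (log_r + log_g)    # A: mean log2 intensity (x axis)
        pairs.append((m, a))
    return pairs

# Hypothetical intensities for three spots
red = [400.0, 100.0, 1600.0]
green = [100.0, 100.0, 400.0]
ma = ma_values(red, green)
```

Spots with equal intensity in both channels land on M = 0, which is why well-normalized data is expected to form a cloud centred on that line.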
What are the methods of normalization for microarray data? (2)
- RMA Robust Multichip Analysis
- quantile normalization
-better expression estimates but introduces inter-array correlations in coexpression analyses
- GCOS/ MAS5.0 Affymetrix normalization algorithm
- Loess
-locally weighted linear regression to smooth data (cDNA microarray)
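Quantile normalization can be sketched in a few lines. This is a simplified illustration, not the RMA implementation: ties are not averaged, and the function name and sample values are hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns."""
    n = len(columns[0])
    # Sort each sample, then average across samples at each rank
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[i] for col in sorted_cols) / len(columns)
                  for i in range(n)]
    normalized = []
    for col in columns:
        # Indices of this sample ordered from smallest to largest value
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace value by the rank mean
        normalized.append(out)
    return normalized

# Hypothetical intensities: three arrays, four probes each
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
norm = quantile_normalize(samples)
```

After this step every array has exactly the same distribution of values; only the order of genes within each array differs.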
Name two methods to normalize RNA-seq data
- RPKM Reads per KB per Million reads
- FPKM Fragments per KB per Million reads
- TPM transcripts per million
- TMM trimmed mean of M-values
- CPM counts per million
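As a rough sketch, three of these measures can be written out for a single sample as follows. The function names, counts, and gene lengths are made up for illustration; real pipelines use dedicated packages.

```python
def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million reads."""
    total = sum(counts)
    return [c / (length / 1e3) / (total / 1e6)
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

# Hypothetical read counts and gene lengths (bp) for three genes
counts = [100, 300, 600]
lengths = [1000, 2000, 3000]
cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

TPM values always sum to one million within a sample, whereas RPKM values do not, which is one reason TPM is often preferred for comparing proportions between samples.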
Name two gene expression databases
- ArrayExpress
- GEO (gene expression omnibus)
-may also have organism-specific gene expression DBs: Arabidopsis, human (RefExA), mouse, worm, fly
Why are MIAME and MINSEQE metadata important?
Because they contain information relevant to specific experiment that allows correct interpretation of datasets
– source of tissue, age, microarray element identifiers, identifier annotation, fragmentation protocols, library protocols
Complete name of MIAME
Minimum Information About a Microarray Experiment
Complete name of MINSEQE
Minimum Information about a high-throughput Nucleotide SeQuencing Experiment
What are four methods for selecting significant genes?
- Fold-change
- T-test
- S-test
- ANOVA
For Affymetrix chips, which is higher: biological variation or chip-chip variation?
biological variation
For cDNA microarrays, which is higher: biological variation or chip-chip variation?
chip-chip variation, so technical replicates are needed
What factors introduce variability in RNA profiling experiments?
gene and variety (types of sample, treatment, time)
individual sample, chip-array, dye (microarrays)
library prep (RNA-seq)
How can you statistically validate results?
- Include duplicates (or replicates) to approximate baseline distribution
- Independent biological replicates
- Replicates for most variable biological factors
Why is fold-change not a statistical test?
Subject to bias: genes with low expression have higher variance, and fold-change alone gives no measure of significance
Pros/Cons of t-test:
Pro: better than fold-change
Con: low power due to small sample size and unstable error variance
Pros/Cons of S-test:
Pro: small positive constant added to denominator of gene-specific t-test
- genes with small fold-change will not be selected as significant
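The effect of the added constant can be illustrated with a small sketch. It assumes an equal-variance pooled standard error in the denominator; the function name, the value of `s0`, and the expression values are hypothetical.

```python
import math

def s_statistic(group_a, group_b, s0=0.0):
    """t-like statistic with a small constant s0 added to the denominator.

    With s0 = 0 this reduces to the ordinary equal-variance t-statistic.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    ss = (sum((x - mean_a) ** 2 for x in group_a)
          + sum((x - mean_b) ** 2 for x in group_b))
    pooled_se = math.sqrt((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2))
    return (mean_a - mean_b) / (pooled_se + s0)

# Hypothetical log2 expression: tiny difference, tiny variance
control = [5.00, 5.01, 5.02]
treated = [5.10, 5.11, 5.12]
t_plain = s_statistic(control, treated)          # large in magnitude
t_damped = s_statistic(control, treated, s0=0.5)  # much smaller in magnitude
```

Because the variance here is very small, the plain t-statistic is large even though the fold-change is negligible; adding `s0` shrinks it so the gene is no longer flagged.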
What does regularized t-test do?
combines information from gene-specific and global average variance estimates (weighted average in denominator of t-test)
What is B-statistic?
Log posterior odds ratio of differential expression versus non-differential expression
If there are more than 2 conditions for comparison, which statistical test will you use?
ANOVA (with test) or Limma
Why is Bonferroni correction used?
To adjust the significance threshold for multiple testing by dividing the cutoff p-value by the number of genes tested *e.g., with 100 genes, p-value cutoff 0.05 –> 0.0005
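The arithmetic is simple enough to sketch directly. The function names and p-values below are hypothetical.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-adjusted cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]

# A cutoff of 0.05 across 100 genes becomes 0.0005 per gene
cutoff_100 = bonferroni_cutoff(0.05, 100)

# Hypothetical p-values for three genes (cutoff is 0.05 / 3)
flags = bonferroni_significant([0.001, 0.02, 0.04])
```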
How can you select significant genes with multiple testing?
- convert test statistic to p-value
- FWER: family-wise error rate
-prob of accumulating one or more false-positive errors over a number of tests
- FDR: false discovery rate
-post-data measure of confidence
-estimate false positive rates by swapping sample labels and asking how many DEGs are identified
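For comparison, a widely used way to control the FDR is the Benjamini-Hochberg procedure. Note that this is a different technique from the label-swapping estimate described above; it is sketched here only as an illustration, with hypothetical p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for four genes
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Genes whose adjusted value falls below the chosen FDR level (for example 0.05) would be called significant.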
How can you select significant genes for RNA-seq data?
DESeq2: negative binomial distribution
Voom and Limma: after transforming count data to approximate a normal distribution
Can you use RPKM/FPKM/TMM to select DEGs in RNA-seq?
no
What type of error is controlled better with Limma and Voom?
Type 1 error: false positive
Name four ways to organize expression data
- Similar expression profiles
- Groups of genes of interest
- Functional classification according to GO or MIPS categories
- Pathway analysis
When is not normalizing or log-transforming data acceptable?
If you really want to cluster based on expression level
Why is it better to log-transform if you are working with fold-change? What base log is used?
Because ratios less than 1 are no longer compressed (between 0 and 1); they extend as far in the negative direction as their counterparts do in the positive direction.
- norming and median-centering allow shape of change to be better visualized (median less susceptible to outliers than mean)
Base 2 (log2), so you can raise 2 to the log value to get the fold change
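A short illustration of the symmetry, using made-up ratios: a 4-fold increase and a 4-fold decrease sit at very different distances from 1 on the raw scale, but at equal distances from 0 on the log2 scale.

```python
import math

# Hypothetical expression ratios (treatment / control)
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]
log2_ratios = [math.log2(r) for r in ratios]

def fold_change_from_log2(value):
    """Recover the fold change by raising 2 to the log2 value."""
    return 2.0 ** value
```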
Why do you do median/mean-centering?
- Want analysis to be independent of the amount of gene present in reference sample
- remove biases
What assumption is made when median-centering?
Average gene in an experiment is expected to have a ratio of 1.0 (log-ratio of 0)
What does normalization do?
Sets magnitude (sum of squares of values) of row/column vector to 1.0
Divide values of vector by square root of the sum of squares of the values
Describe the process of normalizing data
- Log2 transformation
- Compute median of log2 values
- Subtract the median from each log2 value: log2 value - median = log2-MC
- Compute sum of squares of log2-MC
- Compute log2-MC-N: because the sum of squares is generally not 1, divide each log2-MC value by the square root of the sum of squares so that the sum of squares equals 1
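The steps above can be sketched for a single gene (one row of expression values) as follows. The function name and input values are hypothetical.

```python
import math
import statistics

def log2_median_center_normalize(values):
    """Apply log2 transform, median-centering, then unit-magnitude scaling."""
    logged = [math.log2(v) for v in values]          # log2 transformation
    med = statistics.median(logged)                  # median of log2 values
    centered = [x - med for x in logged]             # log2-MC
    magnitude = math.sqrt(sum(x * x for x in centered))
    if magnitude == 0:
        return centered                              # flat profile: nothing to scale
    return [x / magnitude for x in centered]         # log2-MC-N

# Hypothetical expression values for one gene across five samples
row = log2_median_center_normalize([1.0, 2.0, 4.0, 8.0, 16.0])
sum_of_squares = sum(x * x for x in row)
```

After the last step only the shape of the profile remains, so genes with the same pattern but different absolute levels become directly comparable.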
For what are volcano plots used?
Identify genes that are significantly up- or down-regulated (differentially expressed) relative to mock or control sample
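A volcano plot places log2 fold-change on the x axis and -log10 of the p-value on the y axis. The sketch below only computes the coordinates and a label for each gene; the gene names, values, and cutoffs are hypothetical, and plotting is left out.

```python
import math

def volcano_points(genes, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (name, log2 fold-change, -log10 p, label) for each gene."""
    points = []
    for name, log2_fc, p in genes:
        y = -math.log10(p)
        if p <= p_cutoff and log2_fc >= fc_cutoff:
            label = "up"
        elif p <= p_cutoff and log2_fc <= -fc_cutoff:
            label = "down"
        else:
            label = "ns"  # not significant or fold-change too small
        points.append((name, log2_fc, y, label))
    return points

# Hypothetical results: (gene, log2 fold-change, p-value)
results = [("geneA", 2.5, 0.001), ("geneB", -1.8, 0.01), ("geneC", 0.2, 0.0001),
           ("geneD", 3.0, 0.4)]
points = volcano_points(results)
```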
Name 5 clustering methods
- Hierarchical
- K-means
- SOM: self-organizing map
- Dimensionality reduction methods: PCA, t-SNE, UMAP (LDA)
- SVM
What does SOM stand for and what is it?
Self-Organizing Map
a clustering method
How does hierarchical clustering calculate pairwise distances?
Pearson Correlation Coefficient (PCC)
What are the steps of hierarchical clustering?
- Compute all possible pairwise distances (pearson CC)
- Join closest neighbours, with branch length reflective of distance between them
- Recompute distances. Use average linkage clustering
- node score= average of PCCs between genes in the node and other gene vectors
-If average node PCC is higher than that between two individual genes, join it to the best gene. Otherwise, join two genes with best PCC
- Repeat until all nodes/genes are joined
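The steps above can be sketched as a small average-linkage routine that uses the PCC as the similarity measure. This is a simplified illustration, not an optimized implementation: it recomputes every pairwise score on each pass, it assumes no profile has zero variance, and the gene names and profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Repeatedly join the two nodes with the highest average PCC."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                a, b = keys[i], keys[j]
                # Node score: average PCC over all gene pairs between nodes
                pccs = [pearson(profiles[g], profiles[h])
                        for g in clusters[a] for h in clusters[b]]
                score = sum(pccs) / len(pccs)
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        merged = clusters.pop(a) + clusters.pop(b)
        clusters["(" + a + "," + b + ")"] = merged
        merges.append((a, b, score))
    return merges

# Hypothetical expression profiles across four conditions
profiles = {"g1": [1.0, 2.0, 3.0, 4.0],
            "g2": [1.0, 2.0, 3.0, 5.0],
            "g3": [4.0, 3.0, 2.0, 1.0]}
merges = average_linkage(profiles)
```

The list of merges, with their scores, is what a dendrogram draws: early merges with high PCC give short branches, later merges with low PCC give long ones.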
What is one difference between hierarchical clustering and k-means?
With k-means we specify the number of clusters in advance (an initial estimate), while we do not set that number in hierarchical clustering
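To make the contrast concrete, here is a minimal k-means sketch in which the number of clusters is fixed by the number of starting centroids supplied. It uses squared Euclidean distance and explicit starting centroids (rather than random ones) so the result is reproducible; the points and centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means; k is set by how many initial centroids are given."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its group (keep it if group is empty)
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

# Hypothetical two-condition expression values for six genes
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
final_centroids, groups = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
```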
Name 4 applications of SOMs
- Automatic speech recognition
- Cloud classification from satellite images
- Analysis of electrical signals from brain
- Gene expression data analysis
What questions can you ask after establishing similarly expressed genes?
- Insight into specific biological experiment
- Coexpression analysis by using gene expression databases
- Do promoters of similarly expressed genes contain common cis-elements?
- Is there enrichment of a particular functional category (GO, MIPS, KEGG)–>individual clusters
- Are genes in a given cluster all part of given pathway –>map expression info onto pathways