Lecture 10 Flashcards

Question

Name two methods to normalize RNA-seq data

Answer 1

1. RPKM: reads per kilobase per million reads
2. FPKM: fragments per kilobase per million reads
3. TPM: transcripts per million
4. TMM: trimmed mean of M-values
5. CPM: counts per million
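As a rough illustration of how three of these measures differ, here is a minimal sketch in plain Python. The function names, counts and transcript lengths are made up for the example; real pipelines use dedicated tools, and TMM is not shown because it needs a reference sample.

```python
def cpm(counts):
    """Counts per million: scale each count by the library size."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: correct for depth and for length."""
    total = sum(counts)
    return [c / (length / 1e3) / (total / 1e6)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-correct first, then rescale to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]


# Hypothetical read counts and transcript lengths (bp) for three genes.
counts = [100, 200, 700]
lengths = [1000, 2000, 7000]

print(cpm(counts))
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

In this toy data every gene has the same read density per kilobase, so CPM differs between genes while RPKM and TPM come out equal across genes. TPM values always sum to one million within a sample, which is not guaranteed for RPKM.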

Answer 2

1. ArrayExpress
2. GEO (Gene Expression Omnibus)
There may also be organism-specific gene expression databases: Arabidopsis, human (RefExA), mouse, worm, fly

Answer 3

Because they contain information about the specific experiment that allows datasets to be interpreted correctly: source of tissue, age, microarray element identifiers, identifier annotation, fragmentation protocols, library protocols

Answer 4

Minimum Information About a Microarray Experiment

Answer 5

Minimum Information for a Sequencing Experiment

Answer 6

1. Fold-change 2. T-test 3. S-test 4. ANOVA

Answer 7

biological variation

Answer 8

Chip-to-chip variation, so technical replicates are needed

Answer 9

Gene and variety (type of sample, treatment, time); individual sample; chip/array and dye (microarrays); library prep (RNA-seq)

Answer 10

1. Include duplicates (or replicates) to approximate baseline distribution 2. Independent biological replicates 3. Replicates for most variable biological factors

Answer 11

Subject to bias: genes with low expression have higher variance

Answer 12

Pro: better than t-test Con: low power due to small sample size and unstable error variance

Answer 13

Pro: a small positive constant is added to the denominator of the gene-specific t-test, so genes with a small fold-change will not be selected as significant
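The effect of the added constant can be seen in a small sketch. The name `s0` for the constant, the pooled standard error and the replicate values are assumptions made for this illustration, not taken from the lecture.

```python
import math
from statistics import mean, variance


def pooled_se(a, b):
    """Standard error of the difference in means (pooled variance)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return math.sqrt(pooled_var * (1 / na + 1 / nb))


def t_stat(a, b):
    """Ordinary gene-specific t statistic."""
    return (mean(a) - mean(b)) / pooled_se(a, b)


def damped_stat(a, b, s0):
    """Same statistic with a small positive constant s0 in the denominator."""
    return (mean(a) - mean(b)) / (pooled_se(a, b) + s0)


# Hypothetical gene: tiny change in expression, but very tight replicates.
control = [5.00, 5.01, 5.02]
treated = [5.10, 5.11, 5.12]

print(t_stat(control, treated))
print(damped_stat(control, treated, s0=0.1))
```

With these numbers the ordinary t statistic is large in magnitude only because the variance is tiny, while the damped statistic stays small, so the gene would not be called significant.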

Answer 14

combines information from gene-specific and global average variance estimates (weighted average in denominator of t-test)
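A minimal sketch of that weighted average is shown below. The fixed `weight` parameter and the replicate values are hypothetical; published methods derive the weighting from the data rather than fixing it by hand.

```python
from statistics import mean, variance


def shrunken_variance(gene_var, global_var, weight):
    """Weighted average of a gene's own variance and the global average."""
    return weight * global_var + (1.0 - weight) * gene_var


# Made-up replicate measurements for three genes (illustrative only).
replicates = {
    "geneA": [5.0, 5.0, 5.1],
    "geneB": [2.0, 3.5, 1.0],
    "geneC": [7.2, 7.9, 7.5],
}

gene_vars = {gene: variance(values) for gene, values in replicates.items()}
global_var = mean(list(gene_vars.values()))

# weight=0.5 is an arbitrary choice for the example.
moderated = {gene: shrunken_variance(v, global_var, weight=0.5)
             for gene, v in gene_vars.items()}
print(moderated)
```

Genes with an unusually small variance estimate are pulled up toward the global average and noisy genes are pulled down, which stabilizes the denominator of the t-test when there are few replicates.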

Answer 15

Log posterior odds ratio of differential expression versus non-differential expression

Answer 16

ANOVA (with test) or Limma

Answer 17

To adjust further for multiple testing by dividing the cutoff p-value by the number of genes (a Bonferroni-style correction). For example, with 100 genes a cutoff of p = 0.05 becomes 0.0005
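The arithmetic can be written out as a short sketch. The function names and the example p-values are illustrative assumptions.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Divide the cutoff p-value by the number of tests (genes)."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values pass the adjusted cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# 100 genes tested at an unadjusted cutoff of 0.05.
print(bonferroni_cutoff(0.05, 100))

# Three hypothetical p-values: the adjusted cutoff is 0.05 / 3.
print(bonferroni_significant([0.001, 0.02, 0.04]))
```

The correction becomes very strict as the number of genes grows, which is one reason less conservative error measures are often used for expression data.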

Answer 18

1. Convert the test statistic to a p-value
2. FWER (family-wise error rate): probability of accumulating one or more false-positive errors over a number of tests
3. FDR (false discovery rate): a post-data measure of confidence; false-positive rates are estimated by swapping sample labels and asking how many DEGs are identified
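The label-swapping idea behind the FDR estimate can be sketched as follows. This is a simplified illustration, not a published procedure: it assumes two equal-sized groups, uses the absolute difference in group means as the test statistic, enumerates every relabelling instead of sampling, and uses made-up data.

```python
from itertools import combinations
from statistics import mean


def abs_mean_diff(values, group_a):
    """Absolute difference in means between group A and the remaining samples."""
    a = [values[i] for i in group_a]
    b = [v for i, v in enumerate(values) if i not in group_a]
    return abs(mean(a) - mean(b))


def count_calls(matrix, group_a, threshold):
    """Number of genes (rows) whose statistic reaches the threshold."""
    return sum(abs_mean_diff(row, group_a) >= threshold for row in matrix)


def permutation_fdr(matrix, group_a, threshold):
    """Estimate FDR as (average calls under swapped labels) / (observed calls)."""
    n_samples = len(matrix[0])
    original = set(group_a)
    mirror = set(range(n_samples)) - original  # same split with names swapped
    observed = count_calls(matrix, group_a, threshold)
    if observed == 0:
        return 0.0
    # Needs enough samples that at least one genuinely different relabelling exists.
    null_counts = [
        count_calls(matrix, relabel, threshold)
        for relabel in combinations(range(n_samples), len(group_a))
        if set(relabel) not in (original, mirror)
    ]
    return min(1.0, mean(null_counts) / observed)


# Hypothetical data: rows are genes, columns are 3 control then 3 treated samples.
matrix = [
    [1, 1, 1, 5, 5, 5],              # clearly differential
    [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],  # not differential
]
print(permutation_fdr(matrix, group_a=(0, 1, 2), threshold=3.0))
print(permutation_fdr(matrix, group_a=(0, 1, 2), threshold=1.0))
```

A strict threshold is never reached under swapped labels in this toy data, so the estimated FDR is low; a loose threshold is reached just as often by chance relabellings, so the estimate is high.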

Answer 19

DESeq2: models count data with a negative binomial distribution. Voom and Limma: applied after transforming count data to an approximately normal distribution

Answer 20

Type 1 error: false positive

Answer 21

1. Similar expression profiles 2. Groups of genes of interest 3. Functional classification according to GO or MIPS categories 4. Pathway analysis

Answer 22

If you really want to cluster based on expression level

Answer 23

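Because after log transformation values less than 1 are no longer compressed: they extend in the negative direction exactly as far as their positive counterparts do (in log2, a two-fold increase becomes +1 and a two-fold decrease becomes -1). Normalizing and median-centering allow the shape of change to be better visualized (the median is less susceptible to outliers than the mean). Base 2 (log2) is used so that the value can be used as an exponent of 2 to get the fold change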

Answer 24

1. Want analysis to be independent of the amount of gene present in reference sample 2. remove biases

Answer 25

Average gene in an experiment is expected to have a ratio of 1.0 (log-ratio of 0)

Answer 26

Sets the magnitude (sum of squares of values) of a row/column vector to 1.0. To do this, divide each value of the vector by the square root of the sum of squares of the values

Answer 27

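1. Log2 transformation
2. Compute the median of the log2 values
3. Subtract the median from each log2 value to get log2-MC (median-centered)
4. Compute the sum of squares of the log2-MC values
5. Compute log2-MC-N: because the sum of squares is not yet 1, divide each log2-MC value by the square root of the sum of squares so that the sum of squares equals 1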
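The steps above can be sketched in a few lines of Python. The function name and the example ratios are assumptions made for illustration.

```python
import math
from statistics import median


def log2_median_center_normalize(ratios):
    """Log2-transform, median-center, then scale so the sum of squares is 1."""
    logs = [math.log2(r) for r in ratios]          # step 1
    med = median(logs)                             # step 2
    centered = [x - med for x in logs]             # step 3 (log2-MC)
    sum_sq = sum(x * x for x in centered)          # step 4
    if sum_sq == 0:
        return centered                            # flat profile, nothing to scale
    scale = math.sqrt(sum_sq)
    return [x / scale for x in centered]           # step 5 (log2-MC-N)


# Hypothetical expression ratios for one gene across five samples.
profile = log2_median_center_normalize([0.5, 1, 2, 4, 8])
print(profile)
print(sum(x * x for x in profile))
```

After this transformation two genes with the same pattern of change but different amplitudes produce the same vector, which is why it is useful before clustering on shape.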

Answer 28

Identify genes whose expression is significantly higher or lower than in a mock or control sample

Answer 29

1. Hierarchical 2. K-means 3. SOM: self-organizing map 4. Dimensionality reduction methods: PCA, t-SNE, UMAP (LDA) 5. SVM

Answer 30

Self-Organizing Map, a clustering method

Answer 31

Pearson Correlation Coefficient (PCC)
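A minimal sketch of the coefficient for two expression profiles is given below. The helper name and the example vectors are illustrative; a constant profile has zero spread and would cause a division by zero in this simple version.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Same shape at twice the amplitude, then the opposite shape.
print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

Because the coefficient depends on the shape of the profile and not on its amplitude, it groups genes that rise and fall together even when their absolute expression levels differ.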

Answer 32

1. Compute all possible pairwise distances (Pearson CC)
2. Join the closest neighbours, with branch length reflecting the distance between them
3. Recompute distances using average linkage clustering: the node score is the average of the PCCs between the genes in the node and the other gene vectors. If the average node PCC is higher than that between two individual genes, join the node to the best gene. Otherwise, join the two genes with the best PCC
4. Repeat until all nodes/genes are joined
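The procedure can be approximated with a short average-linkage sketch. This is a simplified variant, not the exact method from the lecture: it scores every pair of clusters by the average PCC over all gene pairs spanning them and joins the best-scoring pair each round. The gene names and profiles are made up, and branch lengths are not computed.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Return the merges as (cluster_a, cluster_b, average_pcc), in join order."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Average PCC over all gene pairs spanning the two clusters.
            pccs = [pearson(profiles[g], profiles[h])
                    for g in clusters[a] for h in clusters[b]]
            score = sum(pccs) / len(pccs)
            if best is None or score > best[0]:
                best = (score, a, b)
        score, a, b = best
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, score))
    return merges


# Hypothetical profiles: g1/g2 rise over time, g3/g4 fall.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 5],
    "g3": [4, 3, 2, 1],
    "g4": [5, 3, 2, 1],
}
for merge in average_linkage(profiles):
    print(merge)
```

The rising genes and the falling genes are each joined first, and the two groups are joined last with a negative average PCC, which would appear as a long branch in the resulting tree.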

Answer 33

With k-means we must choose (estimate) the number of clusters in advance, whereas in hierarchical clustering we do not set that number
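To make the role of the preset cluster number concrete, here is a minimal k-means sketch. The naive initialization (the first `k` points), the fixed iteration count and the data points are assumptions chosen to keep the example deterministic; practical implementations use random or smarter seeding.

```python
import math


def kmeans(points, k, n_iter=20):
    """Basic k-means: k must be supplied by the user up front."""
    centroids = [tuple(p) for p in points[:k]]  # naive deterministic start
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional expression profiles forming two groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Running the same data with a different `k` forces a different partition, whereas a hierarchical tree can be cut at any level after it has been built.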

Answer 34

1. Automatic speech recognition 2. Cloud classification from satellite images 3. Analysis of electrical signals from brain 4. Gene expression data analysis

Answer 35

1. Insight into a specific biological experiment
2. Coexpression analysis using gene expression databases
3. Do promoters of similarly expressed genes contain common cis-elements?
4. Is there enrichment of a particular functional category (GO, MIPS, KEGG) in individual clusters?
5. Are the genes in a given cluster all part of a given pathway? Map expression information onto pathways

(60 cards)