R and DESeq2 Flashcards
Name some benefits of using R
- open source
- reproducible research with R Markdown
- huge community of developers
- custom packages are available
What are the uses of R projects (.Rproj)?
- links all files and outputs to the project directory
- imported data is looked up relative to the project directory, so full file paths are not needed
- can save environment in .RData and reload workspace where you left it
What are some challenges when trying to identify differentially expressed genes? (3)
- distinguish technical variation from treatment variation (ex: technical factors that can’t be controlled in library prep)
- majority of genes don’t change between treatments (hard to perform stats and reach significance)
- only few replicates per treatment, hard to estimate variance (not feasible for price, limited material, and experiment execution)
What is count normalization?
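The determination of size factors that normalize for differences in sequencing depth and RNA composition between samples. Gene length is not corrected by DESeq2 size factors, because each gene is compared with itself across samples.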
Why can’t you use total sampling depth to normalize counts?
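A few highly expressed or differentially expressed genes can dominate a sample's total count, so ratios to the total don’t reflect actual expression. Two samples might express a gene at the same level, but dividing by the total makes it appear lower in the sample where other genes take up more of the reads.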
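A minimal sketch of the problem, using made-up counts (the gene names and numbers are hypothetical, not from any dataset). Genes A to C have identical raw counts in both samples, and only gene D changes:

```python
# Hypothetical counts for two samples; genes A-C are expressed identically,
# gene D is strongly up-regulated in sample 2.
sample1 = {"A": 100, "B": 200, "C": 300, "D": 400}
sample2 = {"A": 100, "B": 200, "C": 300, "D": 4000}


def scale_by_total(counts):
    """Divide every count by the sample's total count (library size)."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


norm1 = scale_by_total(sample1)
norm2 = scale_by_total(sample2)

# Print the total-scaled value of each gene in both samples.
for gene in sample1:
    print(gene, round(norm1[gene], 4), round(norm2[gene], 4))
```

After scaling by the total, gene A looks several-fold lower in sample 2 even though its raw count is unchanged, because gene D inflates that sample's total.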
Why should count normalization be done?
The normalized counts of non-differentially expressed genes should not vary due to sampling depth or RNA composition. We need to determine a size factor for each sample.
This is needed to make accurate comparisons of gene expression between samples.
What are the steps in DESeq2 count normalization? (6)
- take the natural log of each gene count
- average the logs across each row (gene); this is the log of the geometric mean and serves as a pseudo-reference sample
- remove genes with infinite values (a zero count in any sample gives a log of -Inf)
- subtract the pseudo-reference from the log counts (equivalent to the log ratio of counts to the reference)
- calculate the median of these log ratios for each sample
- exponentiate the median to get the sample's size factor, then divide the raw counts by it
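The steps above can be sketched in plain Python. This is an illustration on a tiny made-up count table, not DESeq2's own code; the function name `size_factors` and the gene names are hypothetical:

```python
import math
from statistics import median

# Hypothetical raw counts: each row is a gene, each column a sample.
counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [30, 60],
    "geneD": [0, 10],  # a zero count gives log = -inf, so this gene is skipped
}


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column)."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c == 0 for c in row):
            continue  # drop genes whose log would be infinite
        logs = [math.log(c) for c in row]       # natural log of counts
        reference = sum(logs) / len(logs)       # log of the geometric mean
        for j, value in enumerate(logs):
            log_ratios[j].append(value - reference)  # log(count / reference)
    # Median log ratio per sample, converted back from log scale.
    return [math.exp(median(column)) for column in log_ratios]


factors = size_factors(counts)
normalized = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in counts.items()
}
print(factors)
```

In this toy table sample 2 is simply sequenced twice as deeply as sample 1, so dividing by the size factors gives geneA to geneC the same normalized count in both samples.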
What are 2 limits to FDR-controlling procedures, and what is the solution?
- multiple testing causes false positives
- when FDR correcting, the more tests that are truly negative, the more false negatives (true positives lost after adjustment)
solution:
- remove low-expressed genes before correction; their variance can’t be estimated reliably, so they add tests without adding discoveries
What is the p-value distribution if a gene is not differentially expressed between 2 conditions?
The samples come from the same distribution
The p-value distribution is uniform from 0 to 1
What is the p-value distribution if a gene is differentially expressed between 2 conditions?
The samples come from 2 different distributions
The p-value distribution is skewed towards 0
Most p-values fall below 0.05 (when the test has enough power)
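A small simulation can illustrate both cases. This is only a sketch: it draws normal data and uses a two-sample z-test with a known standard deviation, which is an assumption made to keep the example self-contained and is not the test DESeq2 uses. The helper names and the shift of 2.0 are hypothetical:

```python
import math
import random


def z_test_p(group1, group2, sd=1.0):
    """Two-sided p-value of a two-sample z-test with known, equal sd."""
    n = len(group1)
    diff = sum(group1) / n - sum(group2) / n
    z = diff / (sd * math.sqrt(2 / n))
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(shift, n_tests=2000, n=5, seed=0):
    """Run n_tests comparisons; group 2 has its mean moved by `shift`."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(n_tests):
        g1 = [rng.gauss(0, 1) for _ in range(n)]
        g2 = [rng.gauss(shift, 1) for _ in range(n)]
        pvals.append(z_test_p(g1, g2))
    return pvals


def fraction_below(pvals, cutoff=0.05):
    return sum(p < cutoff for p in pvals) / len(pvals)


null_frac = fraction_below(simulate(shift=0.0))  # same distribution
alt_frac = fraction_below(simulate(shift=2.0))   # two different distributions
print(null_frac, alt_frac)
```

With no shift, roughly 5% of p-values land below 0.05, as expected from a uniform distribution. With the shift, the large majority do.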
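What are the p-values for true positives and false negatives?
true positives: p < 0.05 (gene is differentially expressed and is detected)
false negatives: p > 0.05 (gene is differentially expressed but is missed)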
What is the Benjamini-Hochberg method?
A method to control the FDR and account for the fact that, across many tests, some p-values less than 0.05 happen by chance.
It adjusts p-values by making them larger
With a 0.05 cutoff on adjusted p-values, false positives are expected to make up no more than 5% of all genes called significant
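In independent filtering, how is the filter threshold calculated?
A curve is fit to the number of rejections over the filter quantiles
filter threshold = lowest filter quantile at which the number of rejections is within 1 residual SD of the max of the fit curve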
How does the Benjamini-Hochberg method work?
After ranking p-values from lowest to highest, it sets each adjusted p-value to whichever is the lower value of
- the adjusted p-value of the next higher rank; adjusted p-value(rank+1)
- p-value(rank) * [total # of p-values]/rank
The highest-ranked p-value has no next rank, so it keeps its own value
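The rule above can be sketched as a short function. This is a minimal illustration on made-up p-values; the function name `benjamini_hochberg` is hypothetical, and in practice R's `p.adjust(p, method = "BH")` does this for you:

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from lowest to highest (rank 1..m).
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the highest rank down so the next-higher rank is known.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = pvalues[i] * m / rank
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values
adj = benjamini_hochberg(pvals)
print(adj)
```

Here 0.03 (rank 2) would scale to 0.06, but the adjusted value of rank 3 is lower (0.04 * 4/3, about 0.053), so rank 2 takes that value instead.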
What happens to the number of true positives after the Benjamini-Hochberg method is applied?
fewer true positives are identified