R and Deseq2 Flashcards by Melanie Law

Name some benefits of using R

open source
reproducible research with R Markdown
huge community of developers
custom packages are available

What are the uses of R projects?

links all files and outputs to the project directory
imported data is looked for in the project directory, so no full file path needs to be specified
can save the environment in .RData and reload the workspace where you left off

What are some challenges when trying to identify differentially expressed genes? (3)

distinguishing technical variation from treatment variation (e.g. technical factors that can't be controlled in library prep)
the majority of genes don't change between treatments (hard to perform statistics and reach significance)
only a few replicates per treatment, so variance is hard to estimate (more replicates are often not feasible because of price, limited material, and experiment execution)

What is count normalization?

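The determination of size factors to normalize for differences between samples in sequencing depth and RNA composition. Gene length matters when comparing different genes within a sample, but DESeq2's size factors do not correct for it, since the same gene is compared across samples.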

Why can’t you use total sampling depth to normalize counts?

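A few highly expressed genes can take up a large share of the total reads, so ratios to the total don't reflect actual expression. Two genes might be expressed at the same level in two samples, but dividing by the total can make them appear lower in the sample where a highly expressed gene dominates.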

Why should count normalization be done?

The numerical value of non-differentially expressed genes should not vary due to sampling depth or RNA composition. We need to determine a size factor specific to each sample.

This is needed to make accurate comparison of gene expression between samples

What are the steps in DESeq2 count normalization? (6)

take the natural log of each gene's counts
average the logs in each row (gene); this is the log of the geometric mean and serves as a pseudo-reference sample
remove infinite values (genes with a zero count in any sample, whose log is -Inf)
subtract the row average from the log counts (equivalent to the log ratio of counts to the reference)
calculate the median of these log ratios for each sample
exponentiate the median to convert it from the log scale to the size factor
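
The six steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not DESeq2's own code; the function name `size_factors` and the toy count matrix are made up for the example.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of rows (genes), each a list of raw counts per sample."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        # Steps 1 and 3: a zero count has log = -inf, so skip such genes.
        if any(c == 0 for c in row):
            continue
        logs = [math.log(c) for c in row]
        # Step 2: mean of the logs = log of the geometric mean (pseudo-reference).
        log_ref = sum(logs) / n_samples
        # Step 4: log ratio of each sample's count to the reference.
        for j, value in enumerate(logs):
            log_ratios[j].append(value - log_ref)
    # Steps 5 and 6: median log ratio per sample, converted back with exp.
    return [math.exp(median(col)) for col in log_ratios]

# Hypothetical toy matrix: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped when computing size factors: contains a zero
]
factors = size_factors(counts)
# Dividing each count by its sample's size factor gives normalized counts.
normalized = [[c / s for c, s in zip(row, factors)] for row in counts]
```

In this toy case the second size factor comes out twice the first, so the normalized counts of the first three genes match across the two samples.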

What are 2 limits to FDR-controlling procedures, and what are the solutions?

multiple testing causes false positives
after FDR correction, the more true negatives are tested, the more false negatives there are

solutions:

the variance of low-expressed genes can't be estimated reliably, so these tests add little
remove low-expressed genes

What is the p-value distribution if a gene is not differentially expressed in 2 different conditions?

The samples come from the same distribution

the p-value distribution is uniform from 0 to 1

What is the p-value distribution if a gene is differentially expressed in 2 different conditions?

Samples come from 2 different distributions
the p-value distribution is skewed towards 0
most p-values fall below 0.05

What are the p-values for true positives and false negatives?

true positives: 0 to 0.05

false negatives: > 0.05

What is the Benjamini-Hochberg method?

A method to control the FDR and account for the fact that sometimes p-values less than 0.05 happen by chance.
It adjusts p-values by making them larger
With a cutoff of 0.05 on the adjusted p-values, false positives are expected to make up no more than 5% of all positives on average

In independent filtering how is the filter threshold calculated?

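filter threshold = the lowest quantile at which the number of rejections is within 1 SD of the maximum of the fitted curve (max of fit curve - 1 SD)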

How does the Benjamini-Hochberg method work?

After ranking the p-values from lowest to highest, it sets each adjusted p-value to whichever is lower:
1. the adjusted p-value of the next higher rank; adjusted p-value(rank+1)
2. p-value(rank) * [total # of p-values]/rank
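
The rule above can be sketched as a short Python function. This is a minimal illustration, not the code DESeq2 or R's p.adjust runs; the name `bh_adjust` and the example p-values are made up. Capping adjusted values at 1 is the usual convention and is not stated on the card.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from lowest to highest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the highest rank down to the lowest, so that the
    # adjusted value of the next higher rank is always available.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = pvalues[i] * m / rank
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example with four tests.
pvals = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(pvals)
```

For these four values the adjusted p-values are approximately 0.02, 0.04, 0.04 and 0.02: each one is at least as large as the raw p-value it came from.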

What happens to the number of true positives after the Benjamini-Hochberg method is applied?

fewer true positives are identified

What does FDR do?

Limits the number of false positives reported, but loses some true positives

Why are genes with low read count very noisy?

small changes in counts cause dramatic changes in the calculated fold change; remove these genes from the dataset
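
A quick calculation shows the effect. The counts below are made up for illustration: the same difference of 2 reads doubles a low-count gene but barely moves a high-count gene.

```python
import math

def log2_fold_change(count_a, count_b):
    """log2 ratio of the count in condition B to the count in condition A."""
    return math.log2(count_b / count_a)

# Low-count gene: 2 reads vs 4 reads (2 extra reads).
low = log2_fold_change(2, 4)
# High-count gene: 1000 reads vs 1002 reads (the same 2 extra reads).
high = log2_fold_change(1000, 1002)

print(f"low-count gene:  log2FC = {low:.3f}")
print(f"high-count gene: log2FC = {high:.3f}")
```

The low-count gene shows a log2 fold change of 1 (a doubling), while the high-count gene stays close to 0.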

What does DESeq2 do to counteract loss of true positives when there are lots of non-differentially expressed genes?

Removes tests that are unlikely to show significant differences and determines which quantiles return the largest number of rejections

What is independent filtering?

The removal of genes with very low counts before multiple-testing correction; maximizes the number of positives (rejections)

What is gene dispersion?

A measure of the variability of the gene: the parameter a that links the gene's variance to its mean, Var(K) = u + a * u^2

What are the steps in independent filtering?

genes with low counts are removed (only genes with sample mean > filter threshold are kept)
determine the number of significant genes for different thresholds (expressed as quantiles) and plot the number of significant genes vs. quantiles
fit a curve
determine the filter threshold

Describe properties of read counts (4)

sparse events (i.e. small likelihood p of a read mapping to a specific gene), so the read count of a given gene is likely small
discrete
high number of events n (sampling depth / # of reads)
raw counts can be modeled with a Poisson distribution (mean = variance)

What distribution is suitable for a large sampling depth (n) and a very small number for p?

Poisson

important property: mean = variance
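
A small numerical check can make this concrete. The sketch below compares a binomial distribution with large n and small p against a Poisson distribution with the same mean; the values of n and p are made up for illustration and do not come from real sequencing data.

```python
import math

n = 1_000_000   # hypothetical sampling depth (number of reads)
p = 5e-6        # hypothetical chance that a read maps to one given gene
lam = n * p     # Poisson mean

def binomial_pmf(k):
    """Probability of exactly k reads for the gene under Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    """Probability of exactly k reads for the gene under Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Largest difference between the two distributions for k = 0..15.
max_diff = max(abs(binomial_pmf(k) - poisson_pmf(k)) for k in range(16))

# Binomial mean and variance are nearly equal when p is tiny,
# which matches the Poisson property mean = variance.
binom_mean = n * p
binom_var = n * p * (1 - p)
print(max_diff, binom_mean, binom_var)
```

With these numbers the two distributions are almost indistinguishable, and the binomial variance differs from its mean only by n * p^2.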

What does over-dispersed mean, and what is it caused by?

when the variance of the data increases faster than the mean.
caused by biological variation

What factors can make variance difficult to deal with?

only a few replicates per condition make it hard to estimate variance
at low expression levels, data is very noisy

How does DESeq2 deal with challenges with variation? (3)

models the count matrix using a negative binomial distribution
borrows information across genes to estimate dispersion and ultimately variance (to determine the variance at an expected mean expression); can calculate dispersion for each gene
determines the log2 fold change and its significance (Wald statistic)

What is negative binomial distribution?

a distribution that allows for overdispersion
K ~ NB(u, a)
K = raw count of gene i in sample j, with fitted mean u and gene dispersion a
a = value to estimate to determine gene variance
u = s * q
s = sample-specific size factor
q = number of fragments in sample (expression level)

What is the formula for variance?

Var(K) = u + a * u^2
the dispersion of each gene is fitted to a line, which is assumed to be the true dispersion for any given mean
a is used to determine the log2 fold change and the error of that estimate
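
The formula is easy to evaluate directly. The sketch below uses made-up values for the mean u and the dispersion a, and compares the result with the Poisson case, where the variance equals the mean.

```python
def nb_variance(mean, dispersion):
    """Negative binomial variance: Var(K) = u + a * u^2."""
    return mean + dispersion * mean**2

# Hypothetical dispersion a = 0.1 at two expression levels.
for mean in (10, 100):
    var = nb_variance(mean, 0.1)
    print(f"u = {mean}: Poisson variance = {mean}, NB variance = {var:.1f}")

# With a = 0 the formula reduces to the Poisson case (variance = mean).
poisson_like = nb_variance(100, 0.0)
```

Because of the u^2 term, the extra variance grows faster than the mean: at u = 10 the variance is twice the mean, and at u = 100 it is eleven times the mean.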

On a plot of estimating gene-wise dispersion what happens to genes that are too far away from the best fit line?

the dispersion of those genes won't be adjusted (it is not shrunk towards the line)

Describe the steps in estimating gene-wise dispersion

1. the dispersion of each gene is estimated on its own
2. information from all genes is used to create a best-fit line, which gives a general estimate of what to expect for a given mean
3. the estimates of individual genes are shrunk towards the best-fit line

R and Deseq2 Flashcards

(30 cards)