Lecture 15 - Gene Expression Statistics Flashcards by Lina Zhuge

what is accuracy

how close something is to the real answer

what is precision

how close your replicates are (consistency)

what is good precision characterized by

reproducible results

what is good accuracy characterized by

measurements that correspond to an independently known result

what is the main goal of data preprocessing

removing systematic bias and variation in the data due to non-biological factors
preserving the variation in gene expression that occurs because of biologically relevant changes in transcription

what are two methods of global normalization

Trimmed Mean of M-values (TMM)
use housekeeping genes

what is Trimmed Mean of M-values (TMM)

remove (trim) genes with a large log-fold change between samples, or with very high or very low expression
scale each sample to a common mean computed from the genes that remain after trimming
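
The trimming idea above can be sketched in plain Python. This is a simplified illustration, not the full TMM procedure found in packages such as edgeR: it trims only on log-fold change, not on absolute expression. The function name `trimmed_scale_factor`, the 30% trim fraction, and the toy counts are assumptions made for the example.

```python
import math

def trimmed_scale_factor(sample, reference, trim=0.3):
    """Estimate a scaling factor for `sample` relative to `reference`.

    Simplified: compute per-gene log2 ratios (M-values), drop the top and
    bottom `trim` fraction, and average what is left.
    """
    # Work on proportions so total read count does not dominate the ratios.
    s_total, r_total = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / s_total) / (r / r_total))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # log ratio is undefined for zero counts
    )
    k = int(len(m_values) * trim)          # number of genes trimmed from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))    # convert the mean log2 ratio back to a ratio

# Toy counts (hypothetical): the last gene is strongly up in the sample.
reference = [100, 200, 300, 400, 50]
sample = [100, 200, 300, 400, 1000]
factor = trimmed_scale_factor(sample, reference)
```

In this toy case the four unchanged genes give a factor of about 0.525, because the one highly expressed gene takes up a large share of the sample's reads. The outlier gene itself is trimmed and does not affect the factor.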

why is log fold-change used for TMM [log(A/B)]

proportional changes are generally more important than absolute changes in biological data
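
A small sketch of why the log ratio suits proportional changes: a doubling and a halving become symmetric values (+1 and -1 in log2), and the same ratio gives the same value regardless of absolute expression level. The helper name `log2_fold_change` and the numbers are illustrative assumptions.

```python
import math

def log2_fold_change(a, b):
    """Return log2(a / b) for two positive expression values."""
    return math.log2(a / b)

doubling = log2_fold_change(200, 100)      # expression doubles
halving = log2_fold_change(50, 100)        # expression halves
same_ratio = log2_fold_change(2000, 1000)  # doubling at a higher expression level
```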

what are housekeeping genes

genes whose expression is typically invariant across samples and conditions

what is a plot commonly used to represent gene expression values from two samples

scatter plots

how do log transformations change a scatter plot of gene expression

it results in a more even distribution of data points and the variance is more uniform

what is inferential statistics used for

used to make inferences about a population from a sample

what type of statistics is hypothesis testing a form of

inferential statistics

what is the null hypothesis for gene expression

there is no difference in gene expression between samples

what is the alternate hypothesis for gene expression

there is a difference in expression between samples

how do we determine whether or not to reject the null hypothesis

with statistical tests

what is type I error of hypothesis testing

a false positive; the null hypothesis is rejected but it is actually true

what is type II error in hypothesis testing

a false negative; we fail to reject the null hypothesis but it is actually false

what can p-values be used to estimate in terms of hypothesis testing

the probability of a type I error (false positive)

what does it mean if the p-value is less than a threshold α

if the null hypothesis is true, the chance of seeing a result this extreme (a false positive) is less than α, so we reject the null hypothesis

what is a common form of hypothesis testing

the t-test
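
As a rough sketch, the t statistic for two groups of replicate measurements can be computed with the standard library. This uses Welch's form, which does not assume equal variances, and is only one of several t-test variants. Converting the statistic to a p-value requires the t distribution and is left out. The function name `welch_t` and the expression values are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    # Standard error of the difference between the two means.
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical log-expression values for one gene, three replicates each.
control = [5.0, 6.0, 7.0]
treated = [9.0, 10.0, 11.0]
t_stat = welch_t(treated, control)
```

A large absolute t statistic means the difference between group means is large relative to the variation among replicates.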

what is the issue with p-values and multiple hypothesis testing

when many genes are tested at once, you would expect about 5% of the genes with no real change to pass the p < 0.05 level by chance alone

what is the Bonferroni correction

a new significance threshold: the level of statistical significance (α) divided by the number of measurements (tests)

what does the new threshold created by the Bonferroni correction mean

the chance of any false positive being present is < 0.05
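
A minimal sketch of the arithmetic behind the multiple-testing problem and the Bonferroni threshold. The gene count of 10,000, the p-values, and the function names are illustrative assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

n_genes = 10_000
# Without correction, if no gene truly changes, about this many pass p < 0.05.
expected_false_positives = 0.05 * n_genes
# With correction, each gene must pass this much stricter threshold.
threshold = bonferroni_threshold(0.05, n_genes)

# Four tests, so the corrected threshold is 0.05 / 4 = 0.0125.
calls = bonferroni_significant([0.001, 0.02, 0.04, 0.30])
```

With four tests, only the first p-value (0.001) remains significant. The values 0.02 and 0.04 would have passed an uncorrected 0.05 cutoff.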

what is the issue with the Bonferroni correction

it is very conservative: many true positive cases fail the stricter threshold and are missed (false negatives)

what is the false discovery rate (FDR)

Expected(# false positives / # total positives)

how do we estimate the number of expected false positives for any p-value threshold

approximate the background distribution of negatives based on the null hypothesis

what does a greater difference from the background distribution imply for the p-value

it is smaller

how do you calculate q, the false discovery rate

expected number of false positives / total number of positives

what do we want to find for FDR

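with the m p-values sorted from smallest to largest, the largest i such that p_(i) <= (i/m) * q; the tests ranked 1 through i are called significant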
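
The rule above is the Benjamini-Hochberg step-up procedure. A minimal sketch follows, assuming m is the number of tests and the p-values are ranked from smallest to largest. The function name and example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is called significant."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # Find the largest rank i (1-based) with p_(i) <= (i / m) * q.
    largest_i = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_i = rank
    # Every test ranked at or below largest_i is called significant.
    significant = [False] * m
    for idx in order[:largest_i]:
        significant[idx] = True
    return significant

p_vals = [0.001, 0.008, 0.035, 0.039, 0.60]
calls = benjamini_hochberg(p_vals, q=0.05)
```

In this example the third p-value (0.035) misses its own cutoff of 0.03, but the fourth (0.039) passes its cutoff of 0.04. The largest passing rank is therefore 4 and the first four tests are all called significant, which is why the rule asks for the largest i.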

what is descriptive statistics

finding meaningful patterns in data

what are two commonly used distance metrics

Euclidean Distance and Pearson Correlation Coefficient

how is Euclidean distance calculated

d_xy = the square root of the sum of the squared differences: d_xy = [sum over i of (x_i - y_i)^2]^0.5
for 3D: d_xy = [(x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2]^0.5
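
The same calculation as a short Python sketch that works for any number of dimensions. The function name and the example vectors are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# 3D example: differences are 3, 4 and 0, so the distance is sqrt(9 + 16 + 0).
d = euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0])
```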

what is the calculation for the Pearson Correlation Coefficient (r)

r = sum[(x_i - mean(x)) * (y_i - mean(y))] / [sum((x_i - mean(x))^2) * sum((y_i - mean(y))^2)]^0.5

what does an r value of 0 (Pearson Correlation Coefficient) indicate

X and Y are uncorrelated

how can the Pearson Correlation Coefficient be translated to a rough distance measure

D = 1 - r
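
A minimal sketch of the Pearson correlation coefficient and the distance D = 1 - r, using the standard definition of r (covariance divided by the product of the standard deviations). The function names and the toy expression profiles are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(ss_x * ss_y)

def correlation_distance(x, y):
    """Rough distance measure D = 1 - r, which falls in the range [0, 2]."""
    return 1 - pearson_r(x, y)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same trend as gene_a at a larger scale
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend to gene_a
d_ab = correlation_distance(gene_a, gene_b)
d_ac = correlation_distance(gene_a, gene_c)
```

Profiles with the same shape get a distance of 0 even when their magnitudes differ, and perfectly opposite profiles get the maximum distance of 2. This differs from Euclidean distance, which is sensitive to magnitude.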

what are the two types of hierarchical clustering

agglomerative and divisive

what is agglomerative hierarchical clustering

building up the branches of a tree, beginning with the two most closely related objects

what is divisive hierarchical clustering

building the tree by finding the most dissimilar group first

what is the distance between clusters in UPGMA

the average of all distances between cluster elements

what are three ways distances between clusters are defined

single linkage clustering
complete linkage clustering
centroid clustering

what is single linkage clustering

the minimum distance between a point in X and a point in Y

what is complete linkage clustering

the maximum distance between a point in X and a point in Y

what is centroid clustering

the distance between the average within clusters (centroids)
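
The cluster-distance definitions above (single, complete, and centroid linkage, plus the UPGMA-style average) can be compared side by side. This is a small sketch using Euclidean distance between points. The function names and the two toy clusters are assumptions made for the example.

```python
import math
from itertools import product

def pairwise(X, Y):
    """All distances between a point in X and a point in Y."""
    return [math.dist(p, q) for p, q in product(X, Y)]

def single_linkage(X, Y):
    return min(pairwise(X, Y))      # closest pair of points

def complete_linkage(X, Y):
    return max(pairwise(X, Y))      # farthest pair of points

def average_linkage(X, Y):
    d = pairwise(X, Y)              # UPGMA-style: mean of all pairwise distances
    return sum(d) / len(d)

def centroid(points):
    """Coordinate-wise mean of a cluster."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def centroid_linkage(X, Y):
    return math.dist(centroid(X), centroid(Y))

X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(3.0, 0.0), (5.0, 0.0)]
results = {
    "single": single_linkage(X, Y),
    "complete": complete_linkage(X, Y),
    "average": average_linkage(X, Y),
    "centroid": centroid_linkage(X, Y),
}
```

For these two clusters the single-linkage distance is 2 and the complete-linkage distance is 5. The average and centroid distances both come out to 3.5 here, although in general they need not match.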

how do you get a correlation coefficient from cluster distances

use the log of expression ratios

what range is the distance between clusters in

[0, 2], when the distance is D = 1 - r (since r ranges from -1 to 1)

what are two non-hierarchical clustering methods

k-means clustering
self-organizing maps (SOM)

how is k-means clustering done

randomly pick k points, called centroids
assign all other points to the nearest centroid
recalculate centroids to minimize the sum of squared distances within clusters
iteratively reassign points until convergence
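
The steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation. It assumes Euclidean distance, picks the initial centroids from the data points themselves, and treats "no point changed cluster" as convergence. The function name, the `seed` and `max_iter` parameters, and the toy points are assumptions.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (tuples of floats) into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # randomly pick k points as centroids
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:             # no reassignment: converged
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its points, which minimizes
        # the sum of squared distances within that cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Two well-separated toy groups.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
labels, centroids = k_means(points, k=2)
```

Which group receives label 0 or 1 depends on the random start. On less cleanly separated data, different starting points can also lead to different final clusters, so it is common to run the algorithm several times.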

how do self-organizing maps differ from k-means clustering

initial assignment is not random, you pick the starting points
you start with an initial geometry of 'nodes' (e.g. a 3x2 grid)

what are self-organizing maps useful for

mapping time-series data