Lecture 15 - Gene Expression Statistics Flashcards

1
Q

what is accuracy

A

how close something is to the real answer

2
Q

what is precision

A

how close your replicates are (consistency)

3
Q

what is good precision characterized by

A

reproducible results

4
Q

what is good accuracy characterized by

A

measurements that correspond to an independently known result
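
Worked example (a minimal sketch, not from the lecture): the replicate values and the reference value of 10.0 below are made up. Accuracy is summarized here as the distance of the replicate mean from the known value, and precision as the spread of the replicates.

```python
import statistics

def accuracy_error(replicates, reference):
    """Absolute difference between the replicate mean and the known value."""
    return abs(statistics.mean(replicates) - reference)

def precision_spread(replicates):
    """Sample standard deviation of the replicates (smaller = more precise)."""
    return statistics.stdev(replicates)

reference = 10.0                           # hypothetical independently known value
precise_but_biased = [12.0, 12.1, 11.9]    # tight replicates, far from the reference
accurate_but_noisy = [8.0, 10.0, 12.0]     # centered on the reference, widely spread

for name, reps in [("precise_but_biased", precise_but_biased),
                   ("accurate_but_noisy", accurate_but_noisy)]:
    print(name, accuracy_error(reps, reference), precision_spread(reps))
```

The first set is reproducible but wrong on average; the second is right on average but not reproducible.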

5
Q

what is the main goal of data preprocessing

A
  • removing systematic bias and variation in the data due to non-biological factors
  • preserving the variation in gene expression that occurs because of biologically relevant changes in transcription
6
Q

what are two methods of global normalization

A
  • Trimmed Mean of M-values (TMM)
  • use housekeeping genes
7
Q

what is Trimmed Mean of M-values (TMM)

A
  • trim genes with an extreme log fold-change (M-value) between samples, and genes with very high or very low expression
  • compute a scaling factor from the mean log fold-change of the remaining genes and use it to adjust the sample to a common scale
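
Worked example (a simplified sketch of the idea only): the function name `tmm_like_factor`, the trim fractions, and the counts are assumptions for illustration. It uses an unweighted mean, so it is not the implementation found in any particular package.

```python
import math

def tmm_like_factor(sample, ref, trim_m=0.3, trim_a=0.05):
    """Simplified, unweighted TMM-style scaling factor of `sample` vs `ref`."""
    n_s, n_r = sum(sample), sum(ref)
    genes = []
    for s, r in zip(sample, ref):
        if s > 0 and r > 0:                  # log is undefined for zero counts
            ps, pr = s / n_s, r / n_r        # proportions of each library
            m = math.log2(ps / pr)           # log fold-change (M)
            a = 0.5 * math.log2(ps * pr)     # average log expression (A)
            genes.append((m, a))
    n = len(genes)
    by_m = sorted(range(n), key=lambda i: genes[i][0])
    by_a = sorted(range(n), key=lambda i: genes[i][1])
    cut_m, cut_a = int(n * trim_m), int(n * trim_a)
    # Keep genes that survive both trims (extreme M and extreme A removed).
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    mean_m = sum(genes[i][0] for i in keep) / len(keep)
    return 2 ** mean_m

ref = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # made-up counts
sample = [2 * r for r in ref[:-1]] + [5000]       # last gene is strongly induced
factor = tmm_like_factor(sample, ref)
print(factor)
```

The strongly induced gene is trimmed away, so the factor reflects only the genes that behave alike in both samples.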
8
Q

why is log fold-change used for TMM [log(A/B)]

A
  • proportional changes are generally more important than absolute changes in biological data
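
Worked example (a small sketch with made-up expression values): a doubling gives the same log2 fold-change whether the gene starts at a low or a high level, and a halving gives the same magnitude with the opposite sign.

```python
import math

def log2_fold_change(a, b):
    """log2(A/B): positive when A > B, negative when A < B."""
    return math.log2(a / b)

print(log2_fold_change(200, 100))     # doubling from a low level  → 1.0
print(log2_fold_change(2000, 1000))   # doubling from a high level → 1.0
print(log2_fold_change(100, 200))     # halving                    → -1.0
```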
9
Q

what are housekeeping genes

A

genes whose expression is typically invariant across samples and conditions

10
Q

what is a plot commonly used to represent gene expression values from two samples

A

scatter plots

11
Q

how do log transformations change a scatter plot of gene expression

A

it results in a more even distribution of data points and the variance is more uniform

12
Q

what is inferential statistics used for

A

used to make inferences about a population from a sample

13
Q

what type of statistics is hypothesis testing a form of

A

inferential statistics

14
Q

what is the null hypothesis for gene expression

A

there is no difference in gene expression between samples

15
Q

what is the alternate hypothesis for gene expression

A

there is a difference in expression between samples

16
Q

how do we determine whether or not to reject the null hypothesis

A

with statistical tests

17
Q

what is type I error of hypothesis testing

A

a false positive; the null hypothesis is rejected but it is actually true

18
Q

what is type II error in hypothesis testing

A

a false negative; we fail to reject the null hypothesis but it is actually false

19
Q

what can p-values be used to estimate in terms of hypothesis testing

A

the rate of type I errors (false positives)

20
Q

what does it mean if the p-value is less than a threshold α

A

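if the null hypothesis were true, a result at least this extreme would occur with probability less than α, so we reject the null hypothesis and accept a false positive rate of at most α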

21
Q

what is a common form of hypothesis testing

A

the t-test
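
Worked example (a sketch using only the Python standard library, with made-up expression values): this computes Welch's form of the t statistic, which does not assume equal variances, plus its approximate degrees of freedom. Turning the statistic into a p-value needs the t distribution, which is left out here.

```python
import math
import statistics

def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    vx = statistics.variance(x) / len(x)   # squared standard error of mean(x)
    vy = statistics.variance(y) / len(y)   # squared standard error of mean(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

control = [5.1, 4.9, 5.0, 5.2]   # made-up log expression values
treated = [6.0, 6.2, 5.9, 6.1]
t, df = welch_t(treated, control)
print(t, df)
```

A large absolute t means the difference between group means is large relative to the variability of the replicates.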

22
Q

what is the issue with p-values and multiple hypothesis testing

A

when many genes are tested at once, about 5% of the genes with no real difference are still expected to pass the p < 0.05 level by chance alone
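
Worked example (a simulation sketch; the 10,000 genes and the seed are arbitrary choices): when the null hypothesis is true for every gene, p-values are uniformly distributed, so roughly 5% fall below 0.05.

```python
import random

random.seed(0)                   # fixed seed so the run is repeatable
m, alpha = 10_000, 0.05          # hypothetical number of genes and threshold

# Under the null hypothesis, p-values are uniform on [0, 1].
null_p_values = [random.random() for _ in range(m)]
false_positives = sum(p < alpha for p in null_p_values)
print(false_positives, "of", m, "null genes pass p <", alpha)
```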

23
Q

what is the Bonferroni correction

A

the significance level α divided by the number of tests m; each individual test must pass the threshold α/m

24
Q

what does the new threshold created by the Bonferroni correction mean

A

the chance of even one false positive among all the tests (the family-wise error rate) is < 0.05
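
Worked example (a minimal sketch with made-up p-values): the per-test threshold is α divided by the number of tests.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold for m tests."""
    return alpha / m

def bonferroni_significant(p_values, alpha=0.05):
    """True for each p-value that passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

p_values = [0.0001, 0.004, 0.02, 0.04, 0.3]   # made-up p-values
print(bonferroni_threshold(0.05, len(p_values)))
print(bonferroni_significant(p_values))
```

With five tests the threshold is 0.05 / 5, so 0.02 and 0.04 no longer count as significant even though both are below 0.05.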

25
what is the issue with the Bonferroni correction
it is very conservative: many true positives fail the stricter threshold and are missed (false negatives)
26
what is the false discovery rate (FDR)
Expected(# false positives / # total positives)
27
how do we estimate the number of expected false positives for any p-value threshold
approximate the background distribution of negatives based on the null hypothesis
28
what does a greater difference from the background distribution imply for the p-value
it is smaller
29
how do you calculate q, the false discovery rate
expected number of false positives / total number of positives
30
what do we want to find for FDR
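the largest i such that p_(i) <= (i/m) * q, where p_(i) is the i-th smallest of the m p-values; the hypotheses ranked 1 through i are then rejected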
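
Worked example (a sketch of the step-up rule above, commonly known as the Benjamini-Hochberg procedure; the p-values are made up):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by rising p
    largest_i = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_i = rank              # remember the largest passing rank
    rejected = [False] * m
    for idx in order[:largest_i]:         # reject everything up to that rank
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q=0.05))
```

Only the two smallest p-values pass here. Because the rule looks for the largest passing rank, a p-value can be rejected even if it misses its own threshold, as long as a higher-ranked one passes.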
31
what is descriptive statistics
finding meaningful patterns in data
32
what are two commonly used distance metrics
Euclidean Distance and Pearson Correlation Coefficient
33
how is Euclidean distance calculated
d_xy = the square root of the sum of the squared differences, sqrt(sum_i (x_i - y_i)^2); for 3D: [(x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2]^0.5
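
Worked example (a minimal sketch; the two 3D points are made up):

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))   # sqrt(9 + 16 + 0) → 5.0
```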
34
what is the calculation for the Pearson Correlation Coefficient (r)
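r = sum_i[(x_i - mean(x)) * (y_i - mean(y))] / sqrt(sum_i (x_i - mean(x))^2 * sum_i (y_i - mean(y))^2); r ranges from -1 to 1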
35
what does an r value of 0 (Pearson Correlation Coefficient) indicate
X and Y are uncorrelated (no linear relationship)
36
how can the Pearson Correlation Coefficient be translated to a rough distance measure
D = 1 - r
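
Worked example (a sketch using the standard Pearson formula; the gene profiles are made up): profiles with the same shape give a distance near 0, and opposite profiles give a distance near 2.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_distance(x, y):
    """Rough distance measure D = 1 - r."""
    return 1 - pearson_r(x, y)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, different magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend to gene_a

print(pearson_distance(gene_a, gene_b))   # close to 0
print(pearson_distance(gene_a, gene_c))   # close to 2
```

Unlike Euclidean distance, this measure ignores magnitude: gene_a and gene_b count as near-identical even though their values differ.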
37
what are the two types of hierarchical clustering
agglomerative and divisive
38
what is agglomerative hierarchical clustering
building up the branches of a tree, beginning with the two most closely related objects
39
what is divisive hierarchical clustering
building the tree from the top down: start with all objects in one group and split off the most dissimilar group first
40
what is the distance between clusters in UPGMA
the average of all distances between cluster elements
41
what are three ways distances between clusters are defined
  • single linkage clustering
  • complete linkage clustering
  • centroid clustering
42
what is single linkage clustering
the minimum distance between a point in X and a point in Y
43
what is complete linkage clustering
the maximum distance between a point in X and a point in Y
44
what is centroid clustering
the distance between the average within clusters (centroids)
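
Worked example (a sketch comparing the cluster-distance definitions above, plus the UPGMA-style average; the two small clusters are made up and Euclidean distance between points is assumed):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(X, Y):
    return min(dist(p, q) for p in X for q in Y)       # closest pair

def complete_linkage(X, Y):
    return max(dist(p, q) for p in X for q in Y)       # farthest pair

def average_linkage(X, Y):
    # UPGMA-style: average over all between-cluster pairs
    return sum(dist(p, q) for p in X for q in Y) / (len(X) * len(Y))

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def centroid_linkage(X, Y):
    return dist(centroid(X), centroid(Y))              # distance between averages

X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(X, Y), complete_linkage(X, Y),
      average_linkage(X, Y), centroid_linkage(X, Y))
```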
45
how do you get a correlation coefficient from cluster distances
use the log of expression ratios
46
what range is the distance between clusters in
[0,2] (with D = 1 - r, since r ranges from -1 to 1)
47
what are two non-hierarchical clustering methods
  • k-means clustering
  • self-organizing maps (SOM)
48
how is k-means clustering done
  • randomly pick k points, called centroids
  • assign all other points to the nearest centroid
  • recalculate centroids to minimize the sum of squared distances within clusters
  • iteratively reassign points until convergence
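
Worked example (a minimal sketch of the steps above; the function name `kmeans`, the seed, the iteration cap, and the points are assumptions for illustration). Each centroid is recalculated as the mean of its assigned points, which is the choice that minimizes the within-cluster sum of squared distances.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # step 1: random starting centroids
    assignment = None
    for _ in range(iterations):
        # step 2: assign each point to its nearest centroid (squared distance)
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:     # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # step 3: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.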
49
how do self-organizing maps differ from k-means clustering
  • initial assignment is not random, you pick the starting points
  • you start with an initial geometry of 'nodes' (e.g. a 3x2 grid)
50
what are self-organizing maps useful for
mapping time-series data