Hard questions from exam Flashcards

1
Q

When is it suitable to use a t-test?

A

A t-test is used to compare the means of two groups. If the data follow a normal distribution, the t-test compares the two group means: the t-value is large when the difference between the means is large, the variability within each group is small, and the number of samples is large. The assumptions are that the observations are independent of each other, approximately normally distributed, randomly sampled, and that the groups have homogeneous (equal) variances. A log transformation can make right-skewed data more normally distributed.
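To make the pieces of the t-value concrete, here is a minimal sketch of the two-sample Student t statistic with pooled variance, so it assumes equal variances in both groups. The function name `student_t` and the measurements are made up for illustration.

```python
import math
import statistics


def student_t(a, b):
    """Two-sample Student t statistic using a pooled variance estimate."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    # Pooled variance: weighted average of the two group variances
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    # Standard error of the difference between the two means
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2


# Hypothetical measurements, invented for this example
control = [4.1, 3.9, 4.3, 4.0, 4.2]
treated = [5.0, 5.2, 4.8, 5.1, 4.9]
noisy_treated = [3.0, 7.0, 4.0, 6.5, 4.5]  # same mean as treated, more spread

t, df = student_t(treated, control)
t_noisy, _ = student_t(noisy_treated, control)
print(f"t = {t:.2f} on {df} degrees of freedom")
print(f"t with noisier data = {t_noisy:.2f}")
```

The second call uses a group with the same mean but a much larger spread, and its t-value comes out smaller: more variability within a group lowers t, it does not raise it.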

2
Q

How do you know if your data is normally distributed based on a table?

A

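As a rule of thumb for values that cannot be negative: if the standard deviation is higher than half the mean, the data are probably not normally distributed, because a normal distribution with that mean and spread would place a noticeable share of values below zero.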

3
Q

How do you use Beta when interpreting a table?

A

The (standardized) beta coefficient represents the estimated average change in the outcome in standard deviation units. So a beta coefficient of 0.5 means that every time the independent variable changes by one standard deviation, the estimated outcome variable changes by half a standard deviation, on average.
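A small sketch of where such a number comes from, assuming a single predictor: convert both variables to z-scores and fit a least-squares slope. The helper `standardized_beta` and the data are hypothetical.

```python
import statistics


def standardized_beta(x, y):
    """Slope of y on x after both are converted to z-scores (one predictor only)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    zx = [(v - mx) / sx for v in x]
    zy = [(v - my) / sy for v in y]
    # z-scores have mean 0, so the least-squares slope is sum(zx*zy) / sum(zx^2)
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)


# Hypothetical data, invented for this example
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
beta = standardized_beta(x, y)
print(f"standardized beta = {beta:.2f}")
```

With one predictor the standardized beta equals the Pearson correlation between the two variables. In a table from a multiple regression, each beta is additionally adjusted for the other predictors in the model.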

4
Q

c) What is meant by the concepts FWER and FDR? (1p)

A

FDR – false discovery rate is the expected proportion of false positives among the genes called significant (the gene list)

FWER – family-wise error rate is the probability of having at least one false positive in the gene list
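The two concepts are usually controlled with different corrections. As an illustrative sketch (these methods are not named on the card): Bonferroni controls the FWER and Benjamini-Hochberg controls the FDR. The p-values below are invented.

```python
def bonferroni(pvalues, alpha=0.05):
    """FWER control: reject only p-values at or below alpha / number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= k / m * alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for eight genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
n_fwer = sum(bonferroni(pvals))
n_fdr = sum(benjamini_hochberg(pvals))
print(n_fwer, "genes pass Bonferroni;", n_fdr, "genes pass Benjamini-Hochberg")
```

Controlling the FWER is stricter, so it typically yields a shorter gene list than controlling the FDR at the same alpha.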

5
Q

Describe what type of tools will be used in three of these steps, what are their inputs/outputs, and what assumptions are they based on? (1.5p)

A

Quality control: FastQC -> check the quality of the raw reads
Trimming tools: Cutadapt
Alignment: Bowtie, HISAT2 -> against the human genome
Read count: featureCounts
Differential expression: edgeR
Enrichment analysis: clusterProfiler

6
Q

a.) Name at least two repositories of high-throughput data, and describe what data they store (1p).

A

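SRA (Sequence Read Archive): raw sequencing data files
ENA (European Nucleotide Archive): nucleotide sequencing information, such as raw reads, assemblies and annotated sequences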

7
Q

b.) Describe two online tools to extract biologically meaningful information from this data, and what information can be derived from that data? (2p).

A

KEGG: a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances.
DisGeNET: a collection of genes and variants associated with human diseases.

8
Q

In an enrichment plot, what does the enrichment score mean?

A

The ES reflects the degree to which the genes in a gene set are overrepresented at the top or bottom of the entire ranked list of genes.
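A rough sketch of how such a score can be computed, using the simplified unweighted running sum: step up at every gene in the set, step down otherwise. GSEA itself normally weights the upward steps by each gene's ranking metric. The function name and gene labels are hypothetical, and the sketch assumes the gene set overlaps the ranked list without covering all of it.

```python
def running_enrichment(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    running, total = [], 0.0
    for is_hit in hits:
        # Step up for a gene in the set, step down for a gene outside it
        total += 1 / n_hits if is_hit else -1 / n_miss
        running.append(total)
    # The ES is the running-sum value furthest from zero
    es = max(running, key=abs)
    return es, running


# Hypothetical ranked list (most up-regulated first) and gene set
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
es, curve = running_enrichment(ranked, {"g1", "g2", "g4"})
print(f"ES = {es:.3f}")
```

Because the set's genes cluster at the top of the list, the curve rises quickly and the ES is strongly positive. A set clustered at the bottom would give a negative ES.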

9
Q

In an enrichment plot, what does the leading edge subset mean?

A

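The leading edge subset is the first part of the enrichment plot, up to the peak of the curve, where the genes of the set that are the most over- or underrepresented are located. These are the genes that contribute most to the enrichment score.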
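As a sketch of how the leading edge can be read off, the snippet below finds the peak of a simplified unweighted running sum and collects the set's genes on the leading side of it. The function name and gene labels are hypothetical, and real GSEA weights the steps by the ranking metric.

```python
def leading_edge(ranked_genes, gene_set):
    """Genes of the set located before the peak (positive ES) or after it (negative ES)."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    total, best, peak = 0.0, 0.0, 0
    for i, is_hit in enumerate(hits):
        total += 1 / n_hits if is_hit else -1 / n_miss
        if abs(total) > abs(best):
            best, peak = total, i  # remember where the curve is furthest from zero
    if best >= 0:
        window = ranked_genes[: peak + 1]  # top of the list up to the peak
    else:
        window = ranked_genes[peak + 1:]   # from just after the trough to the bottom
    return [g for g in window if g in gene_set]


# Hypothetical ranked list and gene sets
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
edge_up = leading_edge(ranked, {"g1", "g2", "g4", "g9"})
edge_down = leading_edge(ranked, {"g7", "g9", "g10"})
print(edge_up, edge_down)
```

In the first set, g9 lies after the peak and is therefore left out of the leading edge even though it belongs to the gene set.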

10
Q

In an enrichment plot, what does the ranking score at maximum mean?

A

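It is the value of the ranking score at the top of the curve, that is, at the position in the ranked list where the running enrichment score is furthest from zero and the gene set is most over- or underrepresented.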

12
Q

Describe the main steps in how the GSEA algorithm works (3p)

A

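Calculate an enrichment score (ES) for each gene set by walking down the ranked list of genes.
Estimate the statistical significance of the ES.
Adjust for multiple hypothesis testing by normalizing the ES to account for the size of each gene set (NES), then get the FDR value by comparing each NES against the null distribution.
Use the fold change to look at the direction of the change. Run it twice, with both the positive and the negative ranking -> up- and down-regulated gene sets.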
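The significance and normalization steps can be sketched with a permutation test. This is a simplified illustration, not the GSEA implementation: it uses an unweighted running sum and permutes gene-set membership, whereas GSEA by default permutes the phenotype labels. All names and the gene list are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum ES: the value furthest from zero."""
    n_hits = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hits
    total, best = 0.0, 0.0
    for g in ranked_genes:
        total += 1 / n_hits if g in gene_set else -1 / n_miss
        if abs(total) > abs(best):
            best = total
    return best


def permutation_test(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed ES with the ES of random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    null = [enrichment_score(ranked_genes, set(rng.sample(ranked_genes, size)))
            for _ in range(n_perm)]
    # Use only null scores with the same sign as the observed ES
    same_sign = [s for s in null if (s >= 0) == (observed >= 0)]
    n = max(len(same_sign), 1)
    p_value = sum(abs(s) >= abs(observed) for s in same_sign) / n
    # Normalized ES: observed ES divided by the mean null ES of the same sign
    mean_null = sum(abs(s) for s in same_sign) / n
    nes = observed / mean_null if mean_null else float("nan")
    return observed, p_value, nes


# Hypothetical ranked list of 50 genes with the set's genes at the very top
ranked = [f"g{i}" for i in range(1, 51)]
es, p, nes = permutation_test(ranked, {"g1", "g2", "g3", "g4", "g5"})
print(f"ES = {es:.2f}, p = {p:.3f}, NES = {nes:.2f}")
```

With many gene sets, the resulting NES values would then be compared across sets to estimate an FDR for each one.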
