Old exam from older student Flashcards

1
Q

⦁ State your hypothesis H0 and H1 (1p)

A

H0: there is no difference between the control and treatment groups
H1: there is a difference between the control and treatment groups

2
Q

What data type is age?

A

Quantitative

3
Q

What does a p-value of 0.75 mean?

A

Not statistically significant: if H0 were true, a result at least this extreme would occur by chance about 75% of the time, so we cannot reject H0.

4
Q

What is the Mann-Whitney test and when is it used?

A

Use Mann-Whitney when the study has only two groups and the data are not normally distributed.

The Mann-Whitney U test is a non-parametric statistical test for identifying differences in a variable between two independent groups. It does not depend on a normal distribution and is the non-parametric counterpart of the t-test.
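
As an illustration of the rank-based idea, here is a minimal sketch that computes only the U statistic (no p-value). The function name and the two small samples are made up for the example.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b), giving tied values their average rank."""
    pooled = sorted(list(group_a) + list(group_b))
    # Map each distinct value to its average rank (ranks start at 1).
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(rank_of[v] for v in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical measurements, not taken from the exam.
control = [1.1, 2.3, 2.9, 3.4, 4.0]
treatment = [3.8, 5.2, 6.1, 7.5, 9.9]
u_control, u_treatment = mann_whitney_u(control, treatment)
print(u_control, u_treatment)
```

A U value close to 0 for one group means its values almost all rank below the other group's values; the test then compares the smaller U with its distribution under H0.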

5
Q

⦁ Explain how Type I and Type II errors relate to H0 and H1. (3p)

A

Type I error: the risk of rejecting H0 although H0 is true (false positive). We find a significant result and accept H1 even though there is no real difference.
Type II error: the risk of retaining H0 although H0 is false (false negative). We miss a difference that really exists.

6
Q

⦁ Explain when to use Pearson and Spearman, respectively. (2p)

A

Pearson is used in parametric analysis to find out whether there is a linear relationship between two variables.
Spearman is used in non-parametric analysis to see whether there is a monotonic relationship between two variables, based on their ranks.
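
The difference can be seen on a relationship that is monotonic but not linear. The sketch below uses only the standard library; the helper names and the toy data (y = x cubed) are assumptions for the example.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            result[order[k]] = (i + 1 + j) / 2
        i = j
    return result


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # increases with x, but not along a straight line
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Spearman reaches 1 here because the ordering of y follows the ordering of x exactly, while Pearson stays below 1 because the points do not lie on a line.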

7
Q

⦁ Before we perform a regression, we must start with correlation analysis, explain why. (2p)

A

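We need to find out the relationship between y and each x variable, so that we only keep x variables that are related to y. Then we need to look at the relationship between the chosen x variables: if two x variables are strongly correlated with each other (multicollinearity), they should not both be included. Once that is checked, we can perform the regression.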

8
Q

⦁ Explain and interpret the expression Adj. R2 = 50%. (2p)

A

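Adjusted R2 tells how well our x variables predict the y variable, corrected for the number of x variables in the model. Here it is 50%, meaning the model explains about half of the variation in y, which is only moderate. As a rule of thumb, it should be above 0.7 to be considered a good explanation; below 0.4 is considered not good.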
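
The adjustment can be sketched with the usual formula, Adj. R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of x variables. The numbers below are hypothetical; the exam question does not state R2, n or p.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R2 from plain R2, sample size n and number of predictors p."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical model: R2 = 0.55 with 4 predictors fitted on 50 observations.
adj = adjusted_r2(0.55, n_samples=50, n_predictors=4)
print(round(adj, 2))
```

Adding predictors can only raise plain R2, so the adjusted value is the fairer one to compare between models with different numbers of x variables.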

9
Q

⦁ How do we use b in our interpretation? (1p)

A

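b is the unstandardized regression coefficient. We put it into the regression equation to calculate the predicted y value for a person: y changes by b units for each one-unit increase in x.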

10
Q

⦁ How do we use Beta in our interpretation? (2p)

A

We use beta to rank the x variables by their impact on the y variable. Because beta is standardized, it levels out differences in units and scale among the x variables.

Beta weights can be rank ordered to help you decide which predictor variable is the “best” in multiple linear regression. β is a measure of total effect of the predictor variables, so the top-ranked variable is theoretically the one with the greatest total effect.
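
A minimal sketch of both uses, for a regression with a single x variable: b is used to predict y, and Beta is b rescaled by the standard deviations of x and y. The data and variable names are invented for the example.

```python
from statistics import mean, stdev


def fit_simple_regression(x, y):
    """Least-squares intercept a and slope b for y = a + b * x."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    return a, b


# Hypothetical data: age in years and some outcome score.
age = [20, 30, 40, 50, 60]
score = [12, 15, 17, 24, 22]

a, b = fit_simple_regression(age, score)
predicted_score = a + b * 45               # b: predicted y for a 45-year-old
beta = b * stdev(age) / stdev(score)       # Beta: b in standard-deviation units
print(round(b, 2), round(predicted_score, 2), round(beta, 2))
```

In a multiple regression each x variable gets its own Beta, and since they are all unit-free they can be compared and ranked directly.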

11
Q
  1. Given that H0 (null hypothesis) is true and we perform 20,000 independent hypothesis tests,
    a) What is the range of P-values? (1p)
A

0 to 1

12
Q

b) Please sketch a histogram of the expected distribution of P-values. (2p)

A

H0 is true, as given in the question -> no difference
Frequency on the y-axis and p-value on the x-axis -> when H0 is true, every p-value between 0 and 1 is equally likely -> the histogram is flat (a uniform distribution), with all bars at roughly the same height and no peak

13
Q

b) How many are expected to have a nominal P-value <0.05 ? (1p)

A

5% * 20,000 = 1000
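
This expectation can be checked with a small simulation. The sketch assumes each test produces a z statistic that is standard normal when H0 is true and uses a two-sided p-value; the seed and the bin count are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable
N_TESTS = 20_000
standard_normal = NormalDist()


def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - standard_normal.cdf(abs(z)))


# Under H0 the z statistic of every test is just standard normal noise.
p_values = [two_sided_p(random.gauss(0, 1)) for _ in range(N_TESTS)]

# Histogram with 10 equal-width bins over [0, 1].
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1

n_below_005 = sum(p < 0.05 for p in p_values)
print(bins)
print(n_below_005)
```

Each bin should hold roughly 2,000 p-values and roughly 1,000 should fall below 0.05, even though no test has a real effect behind it.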

14
Q

c) What is meant by the concepts FWER and FDR? (1p)

A

FDR – false discovery rate is the expected proportion of false positives among the results called significant (e.g. in the gene list)
FWER – family-wise error rate is the probability of having at least one false positive in the gene list
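
To make the two concepts concrete, the sketch below applies one common procedure for each: Bonferroni (controls FWER) and Benjamini-Hochberg (controls FDR). These procedures are not named on the card; they are brought in here as standard examples, and the p-values are invented.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """FWER control: reject only if p <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank whose p-value satisfies p_(k) <= k / m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values for 8 genes.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
n_bonferroni = sum(bonferroni_reject(p_values))
n_bh = sum(benjamini_hochberg_reject(p_values))
print(n_bonferroni, n_bh)
```

Controlling FWER is stricter, so it keeps fewer genes; controlling FDR accepts a small share of false positives in exchange for a longer gene list.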

15
Q
  1. You have performed an RNA-seq analysis of 10 patients and 10 controls, on which you will now perform bioinformatic analysis to better understand the molecular genetic status of the patients.
    a. Give an overview of the different basic bioinformatic analysis steps, from getting your raw data to understanding it (1.5p).
A

Use Galaxy -> quality control of the raw data -> trimming -> alignment of the reads against the human genome -> read counting -> annotation
In R: batch effects, statistical testing, gene set analysis (GSEA)

First, pre-process the data to remove noise, which gives better results and quality.
Then align (map) the reads and quantify them.
Use a MultiQC report to get a quality overview of the data.
Then use R to format and filter the count data, perform statistical tests (for example a t-test), and visualize the data (for example with a PCA plot) to understand it.

16
Q

b.) Describe what type of tools will be used in three of these steps, what are their input/outputs, and what assumptions are they based on? (1.5p)

A

Quality control: FastQC -> understand the quality of the sequences
Trimming: Cutadapt
Alignment: Bowtie, HISAT2 -> against the human genome
Read counting: featureCounts
Differential expression: edgeR
Enrichment analysis: clusterProfiler

fastp takes FASTQ reads as input, trims them, and produces a quality report; the output is a set of trimmed FASTQ reads. Noise is random error or variance in a measured variable. In biological experiments, such as genomics, proteomics or metabolomics, noise can arise from human error and/or the variability of the system itself.

HISAT2 aligns the sequencing reads to a reference genome (here the human genome) to find where each read matches. It takes FASTQ reads and a genome index as input and outputs the alignments in SAM format; the aligned reads are then assigned to genes in the read-counting step.

MultiQC collects the alignment statistics and quality reports into a single report.

17
Q
  1. Online bioinformatics tools
    a.) Name at least two repositories of high-throughput data, and describe what data they store (1p).
A

SRA (Sequence Read Archive): raw sequence data files
ENA (European Nucleotide Archive): nucleotide sequencing information

18
Q

b.) Describe two online tools to extract biologically meaningful information from this data, and what information can be derived from that data? (2p).

A

KEGG: a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances.
DisGeNET: a collection of genes and variants associated with human diseases.

UniProt is a website with a large protein database. By searching with a protein's unique ID, one can find its sequence in FASTA format, its 3D structures, and a description of the protein, such as its function.

Swiss-Prot contains the protein entries that have been manually reviewed by professionals. TrEMBL contains entries that have not been reviewed yet: they can be submitted data or data derived from computer analysis that has not been confirmed by experts.

19
Q
  1. Which bioinformatics questions can be answered with PCA and MDS techniques? Compare with differential expression analysis. (3p)
A

PCA and MDS differ in what they are built on: PCA works from the variance and correlation among the variables, whereas MDS works from how distant the samples are from each other. Both are dimensionality-reduction methods, and both can be computed with a matrix decomposition such as SVD.
PCA is a linear projection of the data that constructs a new set of principal components (PCs) summarizing the data. The PCs are uncorrelated with each other and are ordered by how much of the variance they explain.
MDS is similar to PCA, but its new dimensions are constructed from a distance measure between the samples, so that the distances in the plot reflect the distances in the data.

A PCA plot shows clusters of samples based on their similarity. It takes the two components that explain the most variance, PC1 and PC2, and puts them on the x- and y-axis. The axis labels show the percentage of the variance explained by each component. Similarly, an MDS plot shows how similar the objects are based on how close together they lie. This can be used to answer biological questions such as whether gene expression is related to disease, by comparing a patient group with a control group to see if there is a difference between the groups and similarity within the groups. It can also show whether a batch effect, such as the day an experiment was performed, affected the results, by checking whether samples cluster by batch.

Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. PCA and MDS plots visualize the overall structure of the data, while differential expression analysis uses statistical tests, for example a t-test, to determine whether an individual gene differs significantly between groups.

20
Q
  1. What questions can be addressed using set enrichment analysis and GSEA? Describe the main differences between the two, give examples and explain when each of the methods is preferred. (3p)
A

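SEA: Works when a clear list of genes of interest can be derived, for example the significantly differentially expressed genes. It tests whether a gene set, for example a CD4-positive T cell pathway, is over-represented in that list.
GSEA: Works when no genes are clearly differentially expressed, because it uses the ranking of all genes instead of a cut-off list.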
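
The list-based approach (SEA) is commonly implemented as an over-representation test on the overlap between the gene list and the gene set. The sketch below uses a one-sided hypergeometric test, which is an assumption about the method rather than something stated on the card; all counts are hypothetical.

```python
from math import comb


def over_representation_p(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_in_set belong to the gene set."""
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    favourable = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 20,000 genes measured, 100 in the pathway,
# 200 genes called differentially expressed, 10 of them in the pathway.
p = over_representation_p(20_000, 100, 200, 10)
print(p)
```

By chance only about one of the 200 selected genes would be expected in the pathway, so an overlap of 10 gives a very small p-value. GSEA skips the selection step and instead asks whether the pathway genes pile up at the top or bottom of the full ranked gene list.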