Old exam Flashcards

1
Q

Explain the difference between qualitative and quantitative variables. Give an
example of each type of variable.

A

Quantitative data are data about numeric variables (e.g. how many, how much, or how often). Qualitative data are data about categorical variables (e.g. what type); they measure ‘types’ and may be represented by a name, a symbol, or a number code. Examples of qualitative variables are gender and religion; examples of quantitative variables are age, height, and BMI.

2
Q

2. Explain when to use mean (SD) versus median (IQR). (2p)

A

“The mean is typically better when the data follow a symmetric distribution. When the data are skewed, the median is more useful because the mean will be distorted by outliers.”
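The effect of an outlier can be sketched in Python. The values below are made up for illustration (they are not exam data): a single extreme value pulls the mean and SD a long way, while the median and IQR stay where they were.

```python
import statistics

def summarize(values):
    """Return mean, SD, median and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical, roughly symmetric sample
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]
# Same sample, but the largest value is replaced by an extreme outlier
skewed = [4, 5, 5, 6, 6, 6, 7, 7, 80]

sym = summarize(symmetric)
skw = summarize(skewed)
print("symmetric:", sym)
print("skewed:   ", skw)
```

For the symmetric sample, mean (SD) and median (IQR) tell the same story; for the skewed sample only median (IQR) still describes the typical value.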

3
Q

5. Explain when to use Pearson versus Spearman correlation. (2p)

A

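Pearson correlation measures the strength of a linear relationship and is suited to continuous, roughly normally distributed data without strong outliers. Spearman correlation is based on ranks, so it can be used for any monotonic relationship, for ordinal data, and when the data are skewed or contain outliers.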
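A minimal sketch of the difference, using only the Python standard library. The data are an assumption chosen for illustration: y grows exponentially with x, so the relationship is perfectly monotonic but not linear.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]  # monotonic but far from linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print("Pearson: ", round(r_pearson, 3))
print("Spearman:", round(r_spearman, 3))
```

Spearman reports a perfect monotonic association here, while Pearson is clearly lower because the points do not lie on a straight line.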

4
Q
  1. Before we perform a regression, we must start with correlation analysis, explain why. (2p)
A

A correlation analysis provides information on the strength and direction of the linear relationship between two variables, while a simple linear regression analysis estimates parameters in a linear equation that can be used to predict values of one variable based on the other.
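The two analyses can be put side by side in a short Python sketch. The x and y values are hypothetical; the function returns the correlation coefficient (strength and direction) together with the least-squares slope and intercept (the prediction equation).

```python
import math

def correlation_and_line(x, y):
    """Return (r, slope, intercept) for simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)   # strength and direction
    slope = sxy / sxx                # change in y per unit of x
    intercept = my - slope * mx      # predicted y when x is 0
    return r, slope, intercept

# Hypothetical measurements
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r, slope, intercept = correlation_and_line(x, y)
predicted_at_6 = intercept + slope * 6  # prediction for a new x value

print("r =", round(r, 4))
print("y =", round(intercept, 2), "+", round(slope, 2), "* x")
print("predicted y at x = 6:", round(predicted_at_6, 2))
```

If r were close to zero, fitting the line would still be possible, but its predictions would be of little use, which is one reason to look at the correlation first.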

5
Q

Explain what R squared = 50% means in statistical tables

A

In multiple linear regression there are a few parameters we analyze. R is the multiple correlation coefficient: it describes how strongly the observed values of the response variable are correlated with the values predicted by the model. R squared is the proportion of the variability in the response variable that is explained by the regression model. For example, an R squared of 60% means that 60% of the variability observed in the response variable is explained by the model. Adjusted R squared is a modified version that penalises the addition of independent variables, so it only increases when a new predictor improves the model more than would be expected by chance. An R squared (or adjusted R squared) of 50% therefore means that 50% of the variability can be explained by the regression model.
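As a rough sketch, both quantities can be computed in a few lines of Python. The sample size, number of predictors, and the observed and predicted values below are assumptions for illustration only.

```python
def r_squared(observed, predicted):
    """Proportion of variability explained: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical observed values and model predictions
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 1.5, 3.0, 4.5, 4.5]
r2_example = r_squared(observed, predicted)

# Hypothetical model: R squared of 60%, 30 samples, 5 predictors
adj = adjusted_r_squared(0.60, n_samples=30, n_predictors=5)

print("R squared:", round(r2_example, 3))
print("adjusted R squared:", round(adj, 3))
```

With the same R squared, adding more predictors or having fewer samples lowers the adjusted value.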

6
Q

c) How do we use Beta in our interpretation? (2p)

A

Beta weights (standardized coefficients, β) can be rank ordered to help you decide which predictor variable is the “best” in multiple linear regression. Because β is expressed in standard deviation units, it can be compared across predictors measured on different scales, so the predictor with the largest absolute β is theoretically the one with the greatest total effect. The unstandardized beta (B) represents the slope of the line between the predictor variable and the dependent variable. For example, a B of 1.57 for Variable 1 means that for every one-unit increase in Variable 1, the dependent variable increases by 1.57 units, holding the other predictors constant.
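The link between B and β can be sketched as follows. Only the B of 1.57 comes from the answer above; the second predictor and all standard deviations are hypothetical values chosen to show how the ranking can change after standardization.

```python
def standardized_beta(b, sd_predictor, sd_outcome):
    """Convert an unstandardized slope B into a standardized beta weight."""
    return b * sd_predictor / sd_outcome

SD_OUTCOME = 10.0  # hypothetical SD of the dependent variable

# name: (unstandardized B, hypothetical SD of the predictor)
predictors = {
    "Variable 1": (1.57, 2.0),
    "Variable 2": (0.40, 15.0),
}

betas = {
    name: standardized_beta(b, sd_x, SD_OUTCOME)
    for name, (b, sd_x) in predictors.items()
}

# Rank predictors by the absolute size of their beta weight
ranking = sorted(betas, key=lambda name: abs(betas[name]), reverse=True)

for name in ranking:
    print(name, round(betas[name], 3))
```

In this made-up case Variable 1 has the larger B, but Variable 2 has the larger β, because B depends on the units of the predictor while β does not.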

7
Q
  1. Given that H0 is true and we perform 20,000 independent hypothesis tests,

a) What is the range of P-values? (1p)

A

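P-values always lie between 0 and 1. When H0 is true, the P-values are uniformly distributed over that whole range, so every value from 0 to 1 is equally likely. A P-value below 5% does not prove H0; 0.05 is the usual threshold for rejecting it.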

8
Q

b) Please sketch a plot of a histogram of the expected distribution of P-values. (2p)

A

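The histogram is flat (uniform) between 0 and 1, with the P-value on the x-axis and the number of tests on the y-axis, and all bars at about the same height. With 20,000 tests and 20 bins of width 0.05, each bin is expected to hold about 1,000 P-values.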
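The expected shape can be checked with a small simulation. This is a sketch under stated assumptions: each of the 20,000 tests is a two-sided z-test on 10 values drawn from a standard normal distribution, so H0 (mean 0) is true by construction; the seed, sample size, and bin count are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is reproducible

N_TESTS = 20_000
SAMPLE_SIZE = 10
N_BINS = 20
std_normal = NormalDist()

def p_value_under_h0():
    """Two-sided z-test P-value for a sample drawn while H0 is true."""
    sample = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    z = (sum(sample) / SAMPLE_SIZE) * math.sqrt(SAMPLE_SIZE)
    return 2 * (1 - std_normal.cdf(abs(z)))

p_values = [p_value_under_h0() for _ in range(N_TESTS)]

# Count P-values per bin of width 0.05
counts = [0] * N_BINS
for p in p_values:
    counts[min(int(p * N_BINS), N_BINS - 1)] += 1

n_below_005 = sum(p < 0.05 for p in p_values)

# Text histogram: one '#' per 50 tests
for i, c in enumerate(counts):
    print(f"{i / N_BINS:.2f}-{(i + 1) / N_BINS:.2f} {'#' * (c // 50)}")
print("P < 0.05:", n_below_005)
```

Every bar comes out at roughly the same height, and the first bin (P < 0.05) holds about 5% of the tests.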

9
Q

b) How many are expected to have a nominal P-value <0.05 ? (1p)

A

If the null hypothesis is true, about 5% of the tests: 0.05 × 20,000 = 1,000.

10
Q

c) What is meant by the concepts FWER and FDR? (1p)

A

In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests.

The false discovery rate (FDR) is the expected proportion of false discoveries (type I errors) among all hypotheses that are declared significant. FDR control is typically used in high-throughput experiments to correct for multiple comparisons, that is, for random events that falsely appear significant.
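Two common corrections illustrate the difference: Bonferroni controls the FWER and Benjamini-Hochberg controls the FDR. The sketch below implements both in plain Python; the ten P-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m; controls the FWER."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest P-values
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical P-values from 10 tests
p_values = [0.001, 0.008, 0.012, 0.019, 0.035, 0.20, 0.35, 0.50, 0.70, 0.90]

fwer_rejections = bonferroni(p_values)
fdr_rejections = benjamini_hochberg(p_values)

print("Bonferroni rejects:        ", sum(fwer_rejections))
print("Benjamini-Hochberg rejects:", sum(fdr_rejections))
```

Controlling the FWER is stricter, so it keeps fewer findings; controlling the FDR accepts that a small share of the reported findings may be false in exchange for more power.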

11
Q
  1. You have performed an RNA-seq analysis of 10 patients and 10 controls, and you will now perform a bioinformatic analysis to better understand the molecular genetic status of the patients.

a.) Give an overview of the different basic bioinformatic analysis steps from getting your raw data to understanding (1.5p).

A

First, pre-process the raw reads (quality control and trimming) to remove noise and get better results and quality.
Then align (map) the reads to a reference genome and quantify them.
Generate a MultiQC report to get a quality overview of the data.
Finally, use R to format and filter the count data, perform statistical tests (for example a t-test), and visualize the data (for example with a PCA plot) to understand it.
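The formatting and filtering step is usually done in R, as the answer says; the same idea is sketched here in Python to keep it self-contained. Everything in it is an assumption for illustration: the gene names, the counts, the filtering threshold of 10 reads, and the use of counts per million (CPM) with a log2 fold change as a simple first look before formal testing.

```python
import math

# Hypothetical raw read counts: sample -> gene -> count
raw_counts = {
    "patient_1": {"GENE_A": 800, "GENE_B": 198, "GENE_C": 2},
    "patient_2": {"GENE_A": 1600, "GENE_B": 399, "GENE_C": 1},
    "control_1": {"GENE_A": 200, "GENE_B": 800, "GENE_C": 0},
    "control_2": {"GENE_A": 400, "GENE_B": 1599, "GENE_C": 1},
}
MIN_TOTAL_COUNT = 10  # assumed threshold for dropping low-count genes

def counts_per_million(sample_counts):
    """Scale one sample's counts by its library size."""
    total = sum(sample_counts.values())
    return {gene: c / total * 1_000_000 for gene, c in sample_counts.items()}

genes = list(next(iter(raw_counts.values())))

# Filter: keep genes with enough reads summed over all samples
kept_genes = [
    g for g in genes
    if sum(sample[g] for sample in raw_counts.values()) >= MIN_TOTAL_COUNT
]

cpm = {name: counts_per_million(counts) for name, counts in raw_counts.items()}

def group_mean(gene, prefix):
    """Mean CPM of a gene over the samples whose name starts with prefix."""
    values = [cpm[s][gene] for s in cpm if s.startswith(prefix)]
    return sum(values) / len(values)

# log2 fold change patient vs control, with a pseudocount of 1
log2fc = {
    g: math.log2((group_mean(g, "patient") + 1) / (group_mean(g, "control") + 1))
    for g in kept_genes
}

for gene, value in log2fc.items():
    print(gene, round(value, 2))
```

A positive log2 fold change means higher expression in patients; whether the difference is significant still has to be tested statistically.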

12
Q
  1. You have performed an RNA-seq analysis of 10 patients and 10 controls, and you will now perform a bioinformatic analysis to better understand the molecular genetic status of the patients.

b.) Describe what type of tools will be used in three of these steps, what are their input/outputs, and what assumptions are they based on? (1.5p)

A

fastp can be used to read FASTQ files, trim the reads, and produce a quality report; the output is a trimmed FASTQ file. Trimming removes noise, which is a random error or variance in a measured variable. In biological experiments, such as genomics, proteomics or metabolomics, noise could arise from human errors and/or variability of the system itself.

HISAT2 aligns the sequencing reads to an indexed reference genome (for example the human genome) to find where each read matches. It takes FASTQ reads as input and outputs the alignments in SAM format, which a quantification tool can then count per gene.

MultiQC collects the reports and logs from the other tools (for example alignment statistics) and summarises them in a single quality report.

13
Q
  1. Online bioinformatics tools

a.) Name at least two repositories of high-throughput data, and describe what data they store (1p).

A

UniProt is a website with a large protein database. By searching with a protein's unique ID, one can find its FASTA sequence, 3D structures, and a description of the protein, such as its function.

Swiss-Prot contains the proteins that have been manually reviewed by professionals. TrEMBL contains entries that have been submitted or retrieved from computer analysis but have not yet been confirmed by approved experts.

14
Q

Online bioinformatics tools: b.) Describe two online tools to extract biological meaningful information from this data, and what information can be derived from that data? (2p).

A

BLAST: different BLAST programs compare different molecules (protein-protein, nucleotide-nucleotide, all combinations of protein-nucleotide, and also nucleotide-nucleotide via protein translation)

KEGG, GO, DisGeNET, REACTOME

15
Q
  1. Which bioinformatics questions can be answered with PCA and MDS techniques, compared with differential expression analysis. (3p)
A

A PCA plot shows clusters of samples based on their similarity. It uses the two principal components that explain the most variance in the data, PC1 and PC2, as the x and y axes. Each axis label shows the percentage of the total variance that is explained by that component. Similarly, an MDS plot shows how similar the samples are based on how close together they lie. These plots can be used to answer biological questions such as whether gene expression is related to disease, by comparing a patient group with a control group to see if there is a difference between the groups and similarity within the groups. They can also show whether a batch effect, or the day an experiment was performed, affected the results, by checking whether samples from the same batch cluster together.

Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. PCA and MDS plots visualize the overall structure of the data, while differential expression analysis uses statistical tests (for example a t-test) to determine whether the expression of an individual gene differs significantly between the groups.
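A minimal sketch of what lies behind the first axis of a PCA plot, in plain Python. The expression values and sample labels are hypothetical, and power iteration is used here only as a simple way to find the first principal component; dedicated PCA functions in R or Python would normally be used instead.

```python
import math

def center(rows):
    """Subtract each column (gene) mean from the data."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance between the columns of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    return v, eigenvalue

# Hypothetical log-expression of 3 genes (columns) in 6 samples (rows)
labels = ["patient", "patient", "patient", "control", "control", "control"]
expression = [
    [8.1, 7.9, 5.0],
    [8.3, 8.2, 5.2],
    [7.9, 8.0, 4.9],
    [4.0, 4.2, 5.1],
    [4.2, 3.9, 5.0],
    [3.9, 4.1, 4.8],
]

centered = center(expression)
cov = covariance_matrix(centered)
pc1, variance_pc1 = first_principal_component(cov)

total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance_pc1 / total_variance  # share shown on the PC1 axis label

# PC1 coordinate of each sample: projection onto the component
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]

print(f"PC1 explains {explained:.1%} of the variance")
for label, score in zip(labels, scores):
    print(label, round(score, 2))
```

In this made-up data the patients and controls land on opposite sides of PC1, which is the kind of group separation a PCA plot is used to reveal.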

16
Q
  1. What questions can be addressed using set enrichment analysis and GSEA? Describe the main differences between the two, give examples and explain when each of the methods is preferred. (3p)
A

Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes.
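Over-representation of a gene class in a list of selected genes is commonly tested with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation in Python; the function name and all the numbers are hypothetical, and it illustrates the over-representation idea only, not the ranking-based statistic used by the GSEA software.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random,
    without replacement, from n_universe genes of which n_in_set
    belong to the gene set."""
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 1,000 measured genes, 50 of them in a pathway,
# 100 differentially expressed genes, 15 of which are in the pathway
p_enriched = hypergeom_enrichment_p(1000, 50, 100, 15)

# By chance alone, about 100 * 50 / 1000 = 5 pathway genes are expected
p_expected = hypergeom_enrichment_p(1000, 50, 100, 5)

print("P-value for 15 overlapping genes:", p_enriched)
print("P-value for 5 overlapping genes: ", p_expected)
```

A small P-value suggests the pathway is over-represented among the selected genes; with many pathways tested, the P-values also need a multiple-testing correction.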

SSEA (SNP Set Enrichment Analysis) is a tool to analyze pathway enrichment in genome-wide association studies (GWAS). The SSEA algorithm first identifies representative SNPs using adaptive truncated product statistics, ranks the selected SNPs, and tests their significance using a weighted Kolmogorov-Smirnov test.