Old exam Flashcards
Explain the difference between qualitative and quantitative variables. Give an
example of each type of variable.
Quantitative data are data about numeric variables (e.g. how many, how much, or how often). Qualitative data are data about categorical variables (e.g. what type); they measure ‘types’ and may be represented by a name, a symbol, or a number code. Examples of qualitative variables are gender and religion; examples of quantitative variables are age, height, and BMI.
2. Explain when to use mean (SD) versus median (IQR). (2p)
“The mean is typically better when the data follow a symmetric distribution. When the data are skewed, the median is more useful because the mean will be distorted by outliers.”
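A minimal sketch of the point above, using made-up ages in which one extreme value skews the data. The numbers are purely illustrative; the quartiles come from Python's `statistics.quantiles` with its default method, and other software may compute slightly different quartiles.

```python
import statistics

# Hypothetical ages with one extreme value (made up for illustration).
values = [21, 22, 23, 24, 25, 26, 27, 95]

mean = statistics.mean(values)
sd = statistics.stdev(values)                  # sample standard deviation
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
iqr = q3 - q1

print(f"mean = {mean:.1f}, SD = {sd:.1f}")
print(f"median = {median}, IQR = {iqr}")
```

The single outlier pulls the mean and the SD upwards, while the median and the IQR stay close to the bulk of the data.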
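5. Explain when to use Pearson versus Spearman. (2p)
Pearson correlation measures the strength of a linear relationship between two continuous variables, whereas Spearman correlation is rank-based and can be used for any monotonic relationship, even a non-linear one. Spearman is also less sensitive to outliers and works for ordinal data.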
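The difference can be illustrated with a small sketch on data that is monotonic but far from linear. The helper functions below are written out by hand for illustration (they are not taken from any particular library), and the data are artificial.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y = [math.exp(v) for v in x]  # always increasing, but strongly non-linear

print(f"Pearson  r   = {pearson(x, y):.3f}")
print(f"Spearman rho = {spearman(x, y):.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman reports a perfect monotonic relationship, while Pearson is clearly lower because the points do not lie on a straight line.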
- Before we perform a regression, we must start with a correlation analysis. Explain why. (2p)
A correlation analysis provides information on the strength and direction of the linear relationship between two variables, while a simple linear regression analysis estimates parameters in a linear equation that can be used to predict values of one variable based on the other. Checking the correlation first shows whether there is a linear relationship at all, so that fitting a regression line is meaningful.
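As a rough sketch of how the two analyses relate, the function below computes the correlation coefficient and the least-squares line from the same sums. The height and weight values are hypothetical and only serve as an example.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept plus Pearson r for paired data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical measurements (made up for illustration).
height_cm = [150, 160, 165, 170, 180, 185]
weight_kg = [52, 58, 63, 66, 75, 80]

slope, intercept, r = fit_line(height_cm, weight_kg)
print(f"r = {r:.3f}")                                    # strength and direction
print(f"weight = {intercept:.1f} + {slope:.2f} * height")  # prediction equation
```

The correlation only says how strong the linear relationship is; the regression adds the equation that is needed to predict one variable from the other.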
Explain what R squared = 50% means in statistical tables
For multiple linear regression there are a few parameters we analyze. R is the multiple correlation coefficient, which describes how strongly the predictor variables are related to the response variable. R squared is the proportion of the variability in the response that the model explains. For example, an R squared of 60% means that 60% of the variability observed in the target variable is explained by the regression model. Adjusted R squared is a modified version that corrects R squared for the number of independent variables, so it only increases when an added variable improves the model more than would be expected by chance. An R squared of 50% therefore means that 50% of the variability can be explained by the regression model; an adjusted R squared of 50% means the same after correcting for the number of predictors.
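The usual formula for the adjustment is adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies it; the sample size and number of predictors are assumptions chosen so that an R squared of 60% becomes an adjusted R squared of 50%.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R squared for a linear model with an intercept."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical study: 26 observations and 5 predictors (assumed values).
r2 = 0.60
adj = adjusted_r_squared(r2, n_samples=26, n_predictors=5)
print(f"R squared = {r2:.2f}, adjusted R squared = {adj:.2f}")
```

With the same R squared, a model with more predictors or fewer observations gets a lower adjusted value.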
c) How do we use Beta in our interpretation? (2p)
Beta weights (standardized coefficients, β) can be rank ordered by absolute value to help you decide which predictor variable is the “best” in multiple linear regression. Because β is expressed in standard deviation units, the predictors become comparable, so the top-ranked variable is theoretically the one with the greatest effect. The unstandardized beta (B) represents the slope of the line between the predictor variable and the dependent variable, in the original units. For example, if B = 1.57 for Variable 1, then for every one unit increase in Variable 1 the dependent variable increases by 1.57 units, with the other predictors held constant.
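A small sketch of how the two kinds of beta relate: a standardized β is the unstandardized B rescaled by the standard deviations of the predictor and the outcome, β = B * SD(x) / SD(y). The function name and the data are illustrative assumptions.

```python
import statistics

def standardized_beta(b, x, y):
    """Convert an unstandardized slope B to a standardized beta weight."""
    return b * statistics.stdev(x) / statistics.stdev(y)

# Hypothetical data in which y is exactly 2 * x, so the slope B is 2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

beta = standardized_beta(2.0, x, y)
print(f"B = 2.00, standardized beta = {beta:.2f}")
```

B depends on the measurement units of each predictor, which is why the standardized β is the value to compare when ranking predictors.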
- Given that H0 is true and we perform 20,000 independent hypothesis tests,
a) What is the range of P-values? (1p)
A P-value is a probability, so the possible range is 0 to 1. When H0 is true, the P-values are uniformly distributed over this range, so every value between 0 and 1 is equally likely.
b) Please sketch a plot of a histogram of the expected distribution of P-values. (2p)
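The histogram is flat: with the P-value on the x axis (0 to 1) and the number of tests on the y axis, all bars have approximately the same height, because P-values are uniformly distributed when H0 is true. With 20 bins of width 0.05, each bar contains about 1,000 of the 20,000 tests.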
b) How many are expected to have a nominal P-value < 0.05? (1p)
If the null hypothesis is true, 5% of the tests are expected to have P < 0.05, i.e. 0.05 × 20,000 = 1,000 tests.
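The expected count and the shape of the P-value distribution can be checked with a small simulation. This is only a sketch under stated assumptions: each test is a two-sided z-test on 30 values drawn from a standard normal distribution, so H0 (mean = 0) is true in every test. The sample size, the seed, and the choice of test are arbitrary.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is reproducible

N_TESTS = 20_000
SAMPLE_SIZE = 30  # assumed sample size per test
std_normal = NormalDist()

p_values = []
for _ in range(N_TESTS):
    # H0 is true: every sample comes from a normal distribution with mean 0, SD 1.
    sample = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    z = (sum(sample) / SAMPLE_SIZE) * math.sqrt(SAMPLE_SIZE)  # z-test, known SD = 1
    p_values.append(2 * (1 - std_normal.cdf(abs(z))))         # two-sided P-value

n_significant = sum(p < 0.05 for p in p_values)
print(f"{n_significant} of {N_TESTS} tests have P < 0.05")

# Text histogram with 10 equal-width bins on [0, 1]; one '#' per 100 tests.
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1
for i, count in enumerate(bins):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * (count // 100)}")
```

The count of P-values below 0.05 should land close to 1,000, and the bars should all have roughly the same length, which is the flat histogram expected under H0.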
c) What is meant by the concepts FWER and FDR? (1p)
In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests.
The false discovery rate (FDR) is the expected proportion of false discoveries (type I errors) among all tests that are declared significant. Controlling the FDR is an approach to correcting for multiple comparisons that is typically used in high-throughput experiments, where it limits the share of random events that falsely appear significant.
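As an illustration, the sketch below adjusts a few made-up P-values with two common procedures: Bonferroni, which controls the FWER, and Benjamini-Hochberg, which controls the FDR. The text above does not name specific procedures, so these two are chosen here as typical examples.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values (controls the FWER)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P-values from five tests (made up for illustration).
p = [0.001, 0.008, 0.012, 0.030, 0.20]

fwer_adjusted = bonferroni(p)
fdr_adjusted = benjamini_hochberg(p)
print("Bonferroni significant at 0.05:", sum(q < 0.05 for q in fwer_adjusted))
print("BH significant at 0.05:        ", sum(q < 0.05 for q in fdr_adjusted))
```

FWER control is the stricter of the two, so it keeps fewer tests significant; FDR control accepts a small proportion of false discoveries in exchange for more power.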
- You have performed an RNA-seq analysis of 10 patients and 10 controls, on which you will now perform bioinformatic analysis to better understand the molecular genetic status of the patients.
a.) Give an overview of the different basic bioinformatic analysis steps from getting your raw data to understanding (1.5p).
First, pre-process the raw reads: run quality control and trim adapters and low-quality bases to remove noise and improve the results.
Then align (map) the reads to a reference genome and quantify the reads per gene.
Run MultiQC to get a combined quality report of the data and the alignment.
Then use R to format and filter the count data, perform statistical tests (for example a t-test), and visualize the data (for example with a PCA plot) to understand it.
- You have performed an RNA-seq analysis of 10 patients and 10 controls, on which you will now perform bioinformatic analysis to better understand the molecular genetic status of the patients.
b.) Describe what type of tools will be used in three of these steps, what are their inputs/outputs, and what assumptions are they based on? (1.5p)
fastp can be used to read FASTQ files, trim adapters and low-quality bases, and produce a quality report; the output is a trimmed FASTQ file. Noise is a random error or variance in a measured variable. In biological experiments, such as genomics, proteomics or metabolomics, noise could arise from human errors and/or variability of the system itself.
HISAT2 aligns the sequencing reads to an indexed reference genome (for example the human genome) to find where each read matches. Its input is the trimmed FASTQ reads plus the genome index, and its output is a SAM file with the alignment position of each read. Assigning the aligned reads to genes is a separate quantification step.
MultiQC collects the reports and logs from the other tools (for example the fastp report and the alignment statistics) and summarizes them in a single quality report.
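To make the inputs and outputs concrete, the sketch below builds the command lines for these three tools for one single-end sample. The sample name, index prefix, and file names are hypothetical, and the commands are only printed, not executed; real runs normally need further options (for example for paired-end reads or threads).

```python
import shlex

def build_commands(sample, index_prefix):
    """Return the fastp and HISAT2 command lines for one single-end sample."""
    raw = f"{sample}.fastq.gz"              # input: raw reads (FASTQ)
    trimmed = f"{sample}.trimmed.fastq.gz"  # output of fastp, input of HISAT2
    fastp = ["fastp", "-i", raw, "-o", trimmed,
             "-j", f"{sample}.fastp.json", "-h", f"{sample}.fastp.html"]
    hisat2 = ["hisat2", "-x", index_prefix, "-U", trimmed, "-S", f"{sample}.sam"]
    return [fastp, hisat2]

# MultiQC is run once on the directory that holds the reports of all samples.
MULTIQC = ["multiqc", "."]

# Hypothetical sample name and genome index prefix.
commands = build_commands("patient01", "grch38_index") + [MULTIQC]
for command in commands:
    print(shlex.join(command))
```

Each step reads the file written by the previous one: FASTQ in and trimmed FASTQ out for fastp, trimmed FASTQ in and SAM out for HISAT2, and tool reports in and one summary report out for MultiQC.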
- Online bioinformatics tools
a.) Name at least two repositories of high-throughput data, and describe what data they store (1p).
UniProt is a website with a large protein database; by searching with a protein's unique ID (accession) one can find the FASTA sequence, 3D structures, and a description of the protein, such as its function.
Swiss-Prot contains the protein entries that have been manually reviewed by professionals. TrEMBL contains entries that have not yet been reviewed; they can be submitted data or data derived from computational analysis that has not yet been confirmed by expert curators.
Online bioinformatics tools: b.) Describe two online tools to extract biologically meaningful information from this data, and what information can be derived from that data? (2p).
BLAST: different BLAST programs compare different molecules (protein-protein, nucleotide-nucleotide, all combinations of protein-nucleotide, and also nucleotide-nucleotide via protein translation)
KEGG, GO, DisGeNET, REACTOME
- Which bioinformatics questions can be answered with PCA and MDS techniques, compared with differential expression analysis. (3p)
A PCA plot shows clusters of samples based on their similarity. It plots the first two principal components, PC1 and PC2, on the x and y axes; these are the directions that explain the most variation in the data. The axis labels show the percentage of the total variance that each component explains. Similarly, an MDS plot shows how similar the samples are based on how close together they are. This can be used to answer biological questions such as whether gene expression is related to disease, by comparing a patient group with a control group to see if there is a difference between the groups and similarity within the groups. It can also show whether a batch effect, or the day an experiment was performed, affected the results, by checking whether samples from the same batch cluster together.
Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. PCA and MDS plots visualize the overall structure of the data, while differential expression analysis uses statistical tests (for example a t-test) to determine whether the expression of an individual gene differs significantly between the groups.
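A minimal sketch of the per-gene comparison, assuming normalised, log2-transformed expression values and using Welch's t statistic as the test. The gene names and values are made up, and turning the t statistic into a P-value (which needs the t distribution) is left out here.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances allowed)."""
    var_a = statistics.variance(group_a) / len(group_a)
    var_b = statistics.variance(group_b) / len(group_b)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(var_a + var_b)

# Hypothetical log2 expression values; gene names and numbers are made up.
expression = {
    "GENE_A": {"patients": [8.1, 8.4, 7.9, 8.6, 8.2], "controls": [6.0, 6.3, 5.8, 6.1, 6.2]},
    "GENE_B": {"patients": [5.1, 4.8, 5.3, 5.0, 4.9], "controls": [5.0, 5.2, 4.9, 5.1, 4.8]},
}

results = {}
for gene, groups in expression.items():
    # On the log2 scale, the difference in means is the log2 fold change.
    log2_fc = statistics.mean(groups["patients"]) - statistics.mean(groups["controls"])
    t = welch_t(groups["patients"], groups["controls"])
    results[gene] = (log2_fc, t)
    print(f"{gene}: log2 fold change = {log2_fc:.2f}, t = {t:.2f}")
```

A gene with a large fold change and a large absolute t statistic is a candidate for differential expression; because one test is run per gene, the resulting P-values also need a multiple testing correction.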