Key questions Flashcards

1
Q
  • Normal distribution or not
A

Normally distributed data follows the characteristic bell-shaped normal distribution curve and is symmetric around the mean. If the data is not normally distributed, the values are not symmetric around the mean and will not follow the normal distribution curve.

2
Q
  • Research hypothesis
A

A typical research hypothesis takes the form: x is affected by y because ….

3
Q
  • Hypothesis in statistical testing
A

The null hypothesis (H0) states that there is no difference between the groups, and the alternative hypothesis (H1) states that there is a difference between the two groups.

4
Q
  • Type I and Type II error
A

A type I error is a false positive: the researcher rejects the null hypothesis (that there is no difference between the groups) even though it is true. A type II error is a false negative: the researcher accepts (fails to reject) the null hypothesis even though the alternative hypothesis is the correct one.

5
Q
  • Power
A

Statistical power ranges from 0 to 1 and is usually written as 1 − β (beta). It is the probability that a hypothesis test correctly rejects the null hypothesis when the alternative is true. The higher the power, the lower the probability that the results reflect a type II error (falsely accepting the null hypothesis), since β diminishes.

6
Q
  • Parametric versus non-parametric analysis
A

In parametric analysis, the underlying probability distribution of the population is assumed to be known or can be reasonably estimated, and the analysis is based on specific parameters of that distribution, such as the mean and variance. The most common parametric tests include t-tests, ANOVA, and regression analysis. Parametric methods generally require that certain assumptions are met, such as normality and homogeneity of variances.

Non-parametric analysis, on the other hand, is an approach that does not rely on specific assumptions about the distribution of the population, or the parameters of that distribution. Non-parametric methods are used when the data is not normally distributed, or the assumptions for parametric tests are not met. Non-parametric tests include Wilcoxon signed-rank test, Kruskal-Wallis test, Mann-Whitney U test, and Spearman’s rank correlation coefficient.

In summary, parametric analysis is based on specific assumptions about the distribution of the population, whereas non-parametric analysis does not require these assumptions and can be used when the data does not meet the assumptions for parametric tests.

7
Q
  • Descriptive analysis
A

Descriptive analysis is a quantitative statistical summary that shows trends in the data; examples include summary statistics (mean, standard deviation) and visualizations such as scatter plots or PCA.

8
Q
  • Bivariate analysis
A

Bivariate analysis in bioinformatics refers to the analysis of two variables in relation to each other. Here are some examples of bivariate analysis in bioinformatics:

Correlation analysis: This involves analyzing the strength and direction of the linear relationship between two variables. For example, correlation analysis can be used to investigate the relationship between gene expression levels of two different genes in a particular tissue or under certain experimental conditions.

Regression analysis: This involves modeling the relationship between two variables using a linear or nonlinear function. For example, regression analysis can be used to model the relationship between sequence features and protein-protein interaction affinity.

Chi-square test: This is a statistical test used to analyze the association between two categorical variables. For example, the chi-square test can be used to investigate the association between a gene mutation and a disease phenotype.

Bivariate analysis is an important tool for exploring the relationship between different variables in bioinformatics and can provide insights into biological processes and relationships that are not easily observed through individual variable analysis.

9
Q
  • Multivariate analysis
A

Multivariate analysis in bioinformatics refers to the analysis of multiple variables in relation to each other. Here are some examples of multivariate analysis in bioinformatics:

Principal component analysis (PCA): This involves reducing the dimensionality of a dataset by identifying linear combinations of variables that capture the most variation in the data. For example, PCA can be used to identify patterns in gene expression data across multiple samples or conditions.

Cluster analysis: This involves grouping together samples or variables based on similarity in gene expression patterns or other features. For example, cluster analysis can be used to group together samples with similar gene expression profiles in a large gene expression dataset.

Multivariate analysis is an important tool for exploring complex relationships between multiple variables in bioinformatics and can provide insights into biological processes that cannot be easily observed through univariate or bivariate analysis.

10
Q
  • Generalized linear models
A

Generalized linear models extend ordinary (general) linear models to response variables that are not normally distributed (e.g., counts or binary outcomes), by connecting a linear predictor to the mean of the response through a link function. The ordinary general linear model, in contrast, fits a straight line through the observations and assumes normally distributed errors.
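As a sketch of the idea, here is a minimal, hypothetical illustration (not from the course material) of fitting a Poisson GLM with a log link via iteratively reweighted least squares; the function name and synthetic data are assumptions for the example.

```python
import numpy as np

def poisson_glm_irls(X, y, n_iter=100):
    """Fit a Poisson GLM (log link) by iteratively reweighted least
    squares; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta              # linear predictor
        mu = np.exp(eta)            # inverse link: expected counts
        z = eta + (y - mu) / mu     # working response
        W = mu                      # Poisson weights (variance = mean)
        WX = X * W[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (W * z))
    return beta

# Noise-free synthetic counts with mean exp(0.5 + 0.3*x)
x = np.linspace(0.0, 2.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = np.exp(0.5 + 0.3 * x)
print(poisson_glm_irls(X, y))  # recovers approximately [0.5, 0.3]
```

With noise-free data the IRLS iterations converge to the exact coefficients; on real count data they converge to the maximum-likelihood estimate.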

11
Q

Applications of sequence comparisons

A

Sequence comparisons are used for structural alignment, which means finding amino acids with the same function; for finding conserved positions, which means finding evolutionarily conserved amino acids; and for phylogeny, which means finding homologous positions.

12
Q
  • Sequence Evolution
A

Throughout sequence evolution different mutations occur; when they are beneficial, they drive the development of species. On a small scale there are substitutions, insertions and deletions, and codon changes; large-scale mutations affect the chromosomal structure.

13
Q
  • Sequence alignment, global and local, gap, mismatch, insertion/deletions
A

Global sequence alignment aligns two sequences end to end, with the aim of finding as high overall similarity as possible; mismatches and gaps incur penalties, and the alignment with the highest score is the best match. Local alignment looks for smaller regions of high similarity within the sequences; the score is not allowed to go negative, so a local alignment ends where extending it would only lower the score, and the highest-scoring local region is the best match. Gap penalties represent the insertions and deletions that occur through mutations.
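As a sketch of how the global-alignment score is computed, here is a minimal Needleman-Wunsch dynamic-programming scorer with linear gap penalties; the function name and scoring values are illustrative assumptions, not the exact scheme used in the course.

```python
import numpy as np

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via Needleman-Wunsch dynamic programming
    (simplified sketch with a linear gap penalty)."""
    n, m = len(a), len(b)
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = gap * np.arange(n + 1)   # leading gaps in b
    F[0, :] = gap * np.arange(m + 1)   # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i, j] = max(F[i - 1, j - 1] + s,  # (mis)match
                          F[i - 1, j] + gap,    # gap in b
                          F[i, j - 1] + gap)    # gap in a
    return F[n, m]

print(needleman_wunsch_score("ACGT", "ACGT"))  # 4.0 (all matches)
print(needleman_wunsch_score("ACGT", "AGT"))   # 2.0 (3 matches, 1 gap)
```

Local alignment (Smith-Waterman) uses the same recurrence but additionally clamps each cell at zero and takes the best cell anywhere in the matrix.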

14
Q
  • Database searches with sequence
A

There are different BLAST programs depending on whether you want to match nucleotide or protein sequences. BLAST is used to match a query sequence against the sequences in a database.

15
Q
  • BLAST, different BLAST programs to compare different molecules
A

Protein-protein (blastp), nucleotide-nucleotide (blastn), all combinations of protein and nucleotide (blastx, tblastn), and also nucleotide-nucleotide via protein translation (tblastx).

16
Q
  • E-value, score significance
A

The Expect value (E) is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. BLAST hits with an E-value smaller than 1e-50 include database matches of very high quality. Hits with an E-value smaller than 0.01 can still be considered good hits for inferring homology.

Score significance indicates how trustworthy the match is, i.e., that it did not arise from random chance alone.

17
Q
  • Multiple Sequence Alignments, sequence conservation
A

Multiple Sequence Alignment (MSA) is generally the alignment of three or more biological sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred and the evolutionary relationships between the sequences studied. This is because conserved sequences — identical or similar sequences in nucleic acids (DNA and RNA) or proteins — occur across species (orthologous sequences), within a genome (paralogous sequences), or between donor and receptor taxa (xenologous sequences). Conservation indicates that a sequence has been maintained by natural selection.

18
Q

RNA-sequencing analysis pipeline

A

Preprocessing:
1. Raw reads: basic output from the sequencing machine. We’re looking at the mRNA.
2. Quality control (FastQC)
3. Filtering and adapter trimming
4. Quality control again, and then filtering again, until one is satisfied with the results
5. Alignment against a reference genome
6. Quantification: gene counts for the different genes

Downstream analysis:
7. Normalization
8. Batch correction
9. Visualization
10. Differential expression analysis, gene set enrichment, and interpretation of the results

19
Q
  • Raw count matrix
A

A count matrix is a single table containing the counts for all samples, with the genes in rows and the samples in columns.

20
Q
  • Batch effects and biological confounders
A

A batch effect occurs when non-biological factors in an experiment cause changes in the data produced by the experiment — for example, the day the experiment was run or the technician who performed it. A confounder is a variable whose presence affects the variables being studied, so that the results do not reflect the actual relationship.

21
Q
  • Batch correction and covariates
A

Batch effect correction is the procedure of removing variability from your data that is not due to your variable of interest (e.g. cancer type). Batch effects are due to technical differences between your samples, such as the type of sequencing machine or even the technician that ran the sample. Covariates differ from independent variables in that they are not of interest, so they are removed in a batch correction.

22
Q
  • Properties of RNA-seq data
A

Sample quality: RNA-seq data quality can be assessed based on measures such as sequencing depth, mapping rates, and duplication rates, which indicate whether the data is of sufficient quality for downstream analysis.

Gene expression levels: RNA-seq data provides quantitative measurements of gene expression, which can be used to identify differentially expressed genes, calculate expression values for downstream analysis, and compare expression levels across different samples or conditions.

23
Q
  • Differential expression analysis
A

The differential expression analysis typically involves the following steps:

Data normalization: The expression values are adjusted to correct for technical variability such as batch effects, sequencing depth, or sample composition.

Statistical testing: A statistical test is applied to each gene or protein to determine whether its expression levels are significantly different between the conditions. The choice of statistical test depends on the study design and the distribution of the data, but common tests include t-test, ANOVA, or non-parametric tests such as Wilcoxon rank-sum test.

Multiple testing correction: To account for the large number of statistical tests performed, a multiple testing correction method such as the Benjamini-Hochberg procedure is used to control the false discovery rate.

Interpretation: The differentially expressed genes or proteins are annotated and analyzed to identify enriched pathways, gene ontology terms, or functional categories.

24
Q
  • False positives (calculate fraction of false positives)
A

The false positive rate is calculated as FP / (FP + TN), where FP is the number of false positives and TN is the number of true negatives (FP + TN being the total number of negatives). It’s the probability that a false alarm will be raised: that a positive result will be given when the true value is negative.
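The formula amounts to a one-line helper; the counts below are made-up numbers for illustration.

```python
def false_positive_rate(fp, tn):
    """Fraction of all true negatives that are wrongly flagged positive."""
    return fp / (fp + tn)

# Hypothetical example: 5 false alarms among 100 actual negatives
print(false_positive_rate(5, 95))  # 0.05
```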

25
Q
  • Multiple testing correction (FWER, FDR)
A

When doing multiple testing, the number of false discoveries increases. In statistics, the family-wise error rate (FWER) is the probability of making one or more false discoveries (type I errors) when performing multiple hypothesis tests. The FDR is the expected ratio of the number of false positive results to the total number of positive test results.

26
Q
  • Nominal p-values, adjusted p-values, q-values
A

The nominal p value is the p value as calculated from a single test. The adjusted p value compensates for multiple statistical testing to reduce the risk of false positives; it can be determined either with the Bonferroni correction or with the Benjamini-Hochberg procedure, which is less strict.

Bonferroni: divide the alpha level (or, equivalently, multiply each p value) by the number of tests.
The FDR-adjusted p value can also be called the q value.

27
Q
  • High-dimensional data analysis (PCA, MDS, SVD)
A

The main difference between MDS and PCA is that MDS is a non-linear technique that focuses on preserving the pairwise distances between data points, while PCA is a linear technique that focuses on finding the directions of maximum variance in the data.

MDS aims to visualize the data in a lower-dimensional space while preserving the pairwise distances between data points. It can be used to identify clusters or patterns in the data, and to visualize the relationships between different data points or groups. MDS can be used with any pairwise distance metric, such as Euclidean distance or correlation distance.

PCA, on the other hand, aims to identify the directions of maximum variance in the data and project the data onto these directions to create a new set of uncorrelated variables, called principal components. PCA is a linear technique and can only capture linear relationships between variables. It is often used for data compression, feature extraction, and data visualization.
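The PCA side of this comparison can be computed directly from the SVD of the centered data matrix. This is a generic sketch with made-up toy data, not a specific course implementation.

```python
import numpy as np

def pca(X, k=2):
    """PCA via SVD of the centered data matrix (samples x features)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T           # projection onto top-k components
    explained = S**2 / np.sum(S**2)  # fraction of variance per component
    return scores, explained

# Toy data lying almost on a line: PC1 should carry nearly all variance
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]])
scores, explained = pca(X, k=2)
print(explained)  # first value close to 1
```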

28
Q
  • Gene enrichment analysis (Pathway enrichment analysis, gene set enrichment analysis)
A

Both gene enrichment analysis methods test whether a set of genes is over- or under-represented.

Gene set enrichment analysis (GSEA) can be used when there is no predefined set of genes, since it operates on the whole ranked gene list.

29
Q
  • Disease enrichment
A

Disease enrichment is the study of whether up- or downregulated genes are associated with a disease.

30
Q
  • Different annotation databases
A

Annotation databases are databases that label and store information about, for example, proteins and genes. Examples: KEGG has a collection of manually drawn pathway maps, GO stores the functions of genes, and REACTOME is a database of biological pathways.

31
Q
  • Measures of enrichment effect size
A

Effect size is a measure of the magnitude or strength of a relationship between two variables or the difference between two groups in a study.

For correlation-based effect sizes, the degree of association between two variables is measured on a scale of -1 to +1, where:

A value of +1 indicates a perfect positive linear relationship: as one variable increases, the other variable also increases.
A value of -1 indicates a perfect negative linear relationship: as one variable increases, the other variable decreases.
A value of 0 indicates no linear relationship between the two variables.

32
Q
  • Fisher’s exact test
A

Fisher’s exact test is a statistical test applied to a 2x2 contingency table. It can be used, for example, to calculate fold enrichment and the odds ratio and to test whether an association is significant.
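Under the hood, the one-sided (enrichment-direction) version of the test sums hypergeometric tail probabilities over the 2x2 table. A self-contained sketch (the function name is an assumption for illustration):

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    P(count >= a) under the hypergeometric null with fixed margins."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return p

# Classic "lady tasting tea" table: 3 of 4 cups identified correctly
print(fisher_exact_greater(3, 1, 1, 3))  # 17/70 ≈ 0.243
```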

33
Q
  • Measures of enrichment fold enrichment
A

Fold enrichment compares the frequency of genes between a representative reference collection and a patient group. Fold change can be used to see whether genes are up- or downregulated in patient groups compared with control groups, in order to find genes associated with the pathology.

34
Q
  • Measures of enrichment Odds ratio
A

Odds Ratio = Odds of Event A / Odds of Event B. For a 2x2 table with cells a, b, c, d this is (a/b) / (c/d) = ad/bc.

35
Q

Give some examples of measures of enrichment

A

Measures of enrichment can show, for example, the odds that something will happen (odds ratio), how large the effect is (effect size), or how common or rare it is compared with the normal population (fold enrichment).

36
Q

Bonferroni correction

A

In the Bonferroni correction one adjusts the alpha (α) level — the significance threshold for the p value, which most of the time is 0.05 — to take the multiple statistical tests into consideration. The formula for this is αnew = αoriginal / n, where n stands for the number of statistical tests.
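The formula above amounts to a one-liner (names and numbers assumed for illustration):

```python
def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-adjusted significance threshold: alpha / n."""
    return alpha / n_tests

print(bonferroni_alpha(0.05, 625))  # ≈ 8e-05
```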

37
Q

Benjamini Hochberg

A

The Benjamini-Hochberg procedure works as follows:

  1. Order the p values from smallest to largest
  2. Rank the p values
  3. The largest adjusted p value is the same as the largest p value
  4. Each next-largest adjusted value gets the smaller of two options:
    a) the previous adjusted p value
    b) the current p value × (total number of p values / p value rank)
38
Q

What is the difference between filtering and normalization?

A

Filtering: We filter lowly expressed genes to increase the power of the statistical analysis. If a gene is lowly expressed in both the control and the patient group it is mostly not significant, but it will reduce the statistical power, since the more we test the more false positives we’ll find. We therefore remove low-count genes. This also allows the mean of the data to be compared with higher reliability.

Normalization: We adjust the data to account for factors that prevent direct comparison of expression measures. We have to normalize because during sample preparation or sequencing, factors are introduced that prevent direct comparison. For example, we adjust for sequencing depth. Between samples we adjust for sequencing depth and RNA composition, and within a sample we adjust for gene length. Normalization is needed so that the differences in gene composition are accurately reflected.

39
Q

What is the difference between PCA and MDS?

A

PCA is a linear technique that finds the directions of maximum variance, while MDS is typically non-linear and preserves pairwise distances between data points; MDS visualizations usually use 2-3 dimensions.

40
Q

You have measured 13,000 genes in your RNA-seq experiment, and you find that
120 genes have an uncorrected p value <0.001.
How many genes do we expect to have a p value<0.001 under the null hypothesis?
What would be the FDR using this p value cutoff?

A

FDR = FP / (FP + TP)
FP=False positives
TP=True positives

We expect 13,000 x 0.001 = 13 genes to have a p<0.001
This means that 13/120 of the hits are expected to be false: FDR ≈ 11%
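The arithmetic above, as a sketch:

```python
n_genes, p_cutoff, n_hits = 13_000, 0.001, 120
expected_by_chance = n_genes * p_cutoff   # genes expected under the null
fdr = expected_by_chance / n_hits         # fraction of hits expected false
print(expected_by_chance, round(fdr, 3))  # 13.0 0.108
```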

41
Q

If you perform 625 tests, what would the FWER be given an alpha=0.05?
What would be the Bonferroni corrected p value and what would the new FWER be?

A

The formula to estimate the family-wise error rate is:
FWER ≤ 1 − (1 − α)^c
Where:

α = alpha level for an individual test (e.g. 0.05)
c = number of comparisons

FWER = 1 − (1 − 0.05)^625 ≈ 1
Bonferroni-corrected p value = 0.05/625 = 8×10⁻⁵
New FWER = 1 − (1 − 8×10⁻⁵)^625 ≈ 0.05
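The same calculation, sketched in code:

```python
alpha, c = 0.05, 625
fwer = 1 - (1 - alpha) ** c           # family-wise error rate, uncorrected
alpha_bonf = alpha / c                # Bonferroni-corrected threshold
fwer_new = 1 - (1 - alpha_bonf) ** c  # FWER after correction
print(round(fwer, 4), round(fwer_new, 4))  # 1.0 0.0488
```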

42
Q

You have measured the expression of 8 different genes and have gotten the
following p values:
0.001
0.95
0.52
0.04
0.01
0.02
0.78
0.07
Calculate the adjusted p values using the Benjamini-Hochberg method

A

Step 1: Conduct all of your statistical tests and find the p-value for each test.

Step 2: Arrange the p-values in order from smallest to largest, assigning a rank to each one – the smallest p-value has a rank of 1, the next smallest has a rank of 2, etc.

Step 3: Calculate the Benjamini-Hochberg critical value for each p-value, using the formula (i/m)*Q

where:

i = rank of p-value

m = total number of tests

Q = your chosen false discovery rate

Step 4: Find the largest p-value that is less than the critical value. Designate every p-value that is smaller than this p-value to be significant.

  1. Order the p values from smallest to largest:
    0.001 0.01 0.02 0.04 0.07 0.52 0.78 0.95
  2. Rank the p values:
    1 2 3 4 5 6 7 8
  3. The largest adjusted p value is the same as the largest p value
  4. Each next-largest adjusted value gets the smaller of two options:
    a. the previous adjusted p value
    b. the current p value × (total number of p values / p value rank)
    Adjusted: 0.008 0.04 0.05 0.08 0.11 0.69 0.89 0.95

If you are not given a target false discovery rate Q: rank the p values as above, but compute the adjusted p values directly with the formula: current p value × (total number of p values / p value rank).
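The worked example can be checked with a short Benjamini-Hochberg adjustment function (a generic sketch; the function name is an assumption):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values (q values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted

p = [0.001, 0.95, 0.52, 0.04, 0.01, 0.02, 0.78, 0.07]
print([round(q, 3) for q in bh_adjust(p)])
# [0.008, 0.95, 0.693, 0.08, 0.04, 0.053, 0.891, 0.112]
```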

43
Q

High dimensional data analysis

A

t-SNE is a non-linear dimensionality reduction technique. The resulting t-SNE plot can be used to visualize the data in two or three dimensions, and to identify clusters or patterns in the data that may not be apparent in the original high-dimensional space.

K-means is a clustering algorithm commonly used in translational bioinformatics to group similar data points together based on their features or characteristics. The algorithm works by partitioning a dataset into K clusters, where K is a user-specified parameter.

Hierarchical clustering is a clustering algorithm commonly used in translational bioinformatics to group similar data points together based on their features or characteristics. The algorithm works by recursively partitioning the data into clusters based on their pairwise distances or similarities.
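The k-means loop described above can be written in a few lines. This is a bare-bones generic sketch with toy data, not a course implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, move
    each centroid to the mean of its cluster, and repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
labels, _ = kmeans(X, k=2)
print(labels)  # first three points share one label, last three the other
```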

44
Q

What is difference between multivariate analysis and high dimensional data analysis?

A

In summary, while both multivariate analysis and high dimensional data analysis deal with data sets with multiple variables, multivariate analysis typically focuses on analyzing the relationships between a smaller number of variables, while high dimensional data analysis focuses on developing methods to handle the complexity and scale of data sets with a large number of variables.