Key questions Flashcards
- Normal distribution or not
A normal distribution follows the bell-shaped normal distribution curve and is symmetric around the mean. If the data are not normally distributed, they do not follow this curve and are often not symmetric around the mean.
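A rough way to check the symmetry mentioned above is the sample skewness, which is close to 0 for symmetric data. The sketch below is only an illustration: the function name `sample_skewness`, the simulated samples, and the seed are assumptions, and a skewness near 0 does not by itself prove normality.

```python
import random
import statistics

def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (about 0 for symmetric data)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # symmetric
skewed_sample = [rng.expovariate(1.0) for _ in range(5000)]  # right-skewed

print("normal sample skewness:", round(sample_skewness(normal_sample), 2))
print("skewed sample skewness:", round(sample_skewness(skewed_sample), 2))
```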
- Research hypothesis
A typical research hypothesis has the form: "x is affected by y because ….".
- Hypothesis in statistical testing
The null hypothesis (H0) states that there is no difference between the groups, and the alternative hypothesis (H1) states that there is a difference between the groups.
- Type I and Type II error
A Type I error is a false positive: the researcher rejects the null hypothesis (that there is no difference between the groups) even though it is true. A Type II error is a false negative: the researcher fails to reject the null hypothesis even though the alternative hypothesis is true.
- Power
Statistical power ranges from 0 to 1 and is usually written as 1-B (beta), where B is the probability of a Type II error. Power is the probability that a test correctly rejects the null hypothesis when the alternative hypothesis is true. The higher the power, the lower the probability of a Type II error (falsely failing to reject the null hypothesis), since B decreases.
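The relation between the significance level, B, and power can be sketched for one simple case. The example assumes a one-sided, one-sample z-test with a known standard deviation; the function name `z_test_power` and its parameters are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided, one-sample z-test with known sigma.

    effect: true difference between the mean and the H0 value
    sigma:  known population standard deviation
    n:      sample size
    alpha:  accepted Type I error rate
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma       # how far H1 moves the test statistic
    return 1 - std_normal.cdf(z_crit - shift)

for n in (10, 30, 100):
    power = z_test_power(effect=0.5, sigma=1.0, n=n)
    print(f"n={n}: power={power:.3f}, beta={1 - power:.3f}")
```

With no true effect the function returns alpha, because every rejection is then a Type I error; with a real effect, power grows with the sample size.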
- Parametric versus non-parametric analysis
In parametric analysis, the underlying probability distribution of the population is assumed to be known or can be reasonably estimated, and the analysis is based on specific parameters of that distribution, such as the mean and variance. The most common parametric tests include t-tests, ANOVA, and linear regression analysis (the chi-square test for categorical data is usually classed as non-parametric). Parametric methods generally require that certain assumptions are met, such as normality and homogeneity of variances.
Non-parametric analysis, on the other hand, is an approach that does not rely on specific assumptions about the distribution of the population, or the parameters of that distribution. Non-parametric methods are used when the data is not normally distributed, or the assumptions for parametric tests are not met. Non-parametric tests include Wilcoxon signed-rank test, Kruskal-Wallis test, Mann-Whitney U test, and Spearman’s rank correlation coefficient.
In summary, parametric analysis is based on specific assumptions about the distribution of the population, whereas non-parametric analysis does not require these assumptions and can be used when the data does not meet the assumptions for parametric tests.
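One way to see the difference is to compare Pearson's correlation (parametric, measures linear association) with Spearman's rank correlation (non-parametric, Pearson's formula applied to ranks). This is a minimal sketch with hypothetical helper names and made-up data, not a replacement for a statistics library.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up monotonic but non-linear relationship
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because the relationship is monotonic, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's coefficient is lower because the relationship is not linear.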
- Descriptive analysis
Descriptive analysis is a quantitative statistical summary of the data that shows trends. Examples include summary statistics such as the mean and median, scatter plots, and PCA.
- Bivariate analysis
Bivariate analysis in bioinformatics refers to the analysis of two variables in relation to each other. Here are some examples of bivariate analysis in bioinformatics:
Correlation analysis: This involves analyzing the strength and direction of the linear relationship between two variables. For example, correlation analysis can be used to investigate the relationship between gene expression levels of two different genes in a particular tissue or under certain experimental conditions.
Regression analysis: This involves modeling the relationship between two variables using a linear or nonlinear function. For example, regression analysis can be used to model the relationship between sequence features and protein-protein interaction affinity.
Chi-square test: This is a statistical test used to analyze the association between two categorical variables. For example, the chi-square test can be used to investigate the association between a gene mutation and a disease phenotype.
Bivariate analysis is an important tool for exploring the relationship between different variables in bioinformatics and can provide insights into biological processes and relationships that are not easily observed through individual variable analysis.
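As an illustration of the chi-square test described above, the statistic can be computed directly from a contingency table. The counts are made up, the function names are hypothetical, and the p-value shortcut is valid only for a 2 x 2 table (1 degree of freedom).

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_one_df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Made-up counts: rows = mutation present / absent, columns = diseased / healthy
observed = [[30, 10],
            [20, 40]]
stat = chi_square_statistic(observed)
print("chi-square:", round(stat, 2), "p-value:", p_value_one_df(stat))
```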
- Multivariate analysis
Multivariate analysis in bioinformatics refers to the analysis of multiple variables in relation to each other. Here are some examples of multivariate analysis in bioinformatics:
Principal component analysis (PCA): This involves reducing the dimensionality of a dataset by identifying linear combinations of variables that capture the most variation in the data. For example, PCA can be used to identify patterns in gene expression data across multiple samples or conditions.
Cluster analysis: This involves grouping together samples or variables based on similarity in gene expression patterns or other features. For example, cluster analysis can be used to group together samples with similar gene expression profiles in a large gene expression dataset.
Multivariate analysis is an important tool for exploring complex relationships between multiple variables in bioinformatics and can provide insights into biological processes that cannot be easily observed through univariate or bivariate analysis.
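The idea behind PCA, finding the direction that captures the most variation, can be sketched without any library by running power iteration on the covariance matrix. This is a simplified illustration with a hypothetical function name and made-up data; real analyses would normally use a numerical library and compute all components.

```python
def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of greatest variance (first PC)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the variables
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vector = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    return vector

# Made-up data: two variables that roughly follow y = 2x
samples = [(0, 0.1), (1, 1.9), (2, 4.2), (3, 5.8), (4, 8.1)]
pc1 = first_principal_component(samples)
print("first principal component:", [round(x, 3) for x in pc1])
```

The returned direction has a slope close to 2, which matches the trend in the made-up data.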
- Generalized linear models
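Generalized linear models (GLMs) extend ordinary linear regression to response variables that are not normally distributed, such as counts or binary outcomes. The mean of the response is related to a linear combination of the predictors through a link function, so the relationship on the original scale can be non-linear. The general linear model (ordinary linear regression), by contrast, assumes normally distributed errors and fits a straight line through the observations.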
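The role of the link function can be sketched as follows. The coefficients are made up and the names `linear_predictor`, `INVERSE_LINKS`, and `predict_mean` are hypothetical; the sketch only shows how a prediction is formed, not how the coefficients are fitted.

```python
import math

def linear_predictor(coefficients, features):
    """Intercept plus weighted sum of the features (the 'linear' part)."""
    intercept, *slopes = coefficients
    return intercept + sum(b * x for b, x in zip(slopes, features))

# Inverse link functions map the linear predictor back to the mean response
INVERSE_LINKS = {
    "identity": lambda eta: eta,                    # ordinary linear regression
    "logit": lambda eta: 1 / (1 + math.exp(-eta)),  # binary outcomes (logistic)
    "log": lambda eta: math.exp(eta),               # counts (Poisson)
}

def predict_mean(coefficients, features, link):
    """Predicted mean response of a GLM with the given link."""
    return INVERSE_LINKS[link](linear_predictor(coefficients, features))

coefficients = [0.5, 1.2]  # made-up intercept and slope
for link in INVERSE_LINKS:
    print(link, round(predict_mean(coefficients, [1.0], link), 3))
```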
- Applications of sequence comparisons
Structural alignment, which means finding amino acids with the same function; finding conserved positions, which means finding evolutionarily conserved amino acids; and finding phylogeny, which means finding homologous positions.
- Sequence Evolution
Throughout sequence evolution, different mutations occur, and beneficial ones can spread and contribute to the evolution of species. Small-scale mutations include substitutions, insertions, deletions, and codon changes, while large-scale mutations affect the chromosomal structure.
- Sequence alignment, global and local, gap, mismatch, insertion/deletions
Global sequence alignment (e.g. Needleman-Wunsch) is an end-to-end alignment of the whole sequences, where the aim is to find as high a similarity as possible. Matches raise the score, mismatches lower it, and each gap adds a gap penalty; the alignment with the highest total score is the best match. Gaps represent insertions and deletions that arise through mutations. Local alignment (e.g. Smith-Waterman) looks for the most similar subregions instead of aligning the sequences end to end. Mismatches and gaps are still allowed and penalized, but the score is never allowed to drop below zero, so the alignment can start and end anywhere, and the highest-scoring region is the best match.
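The difference between the two approaches can be sketched with a small dynamic-programming scorer. The scoring values (match +1, mismatch -1, gap -2) and the function name `alignment_score` are assumptions for illustration; the sketch returns only the best score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b with a linear gap penalty.

    local=False: global (Needleman-Wunsch style), end-to-end alignment.
    local=True:  local (Smith-Waterman style), scores never drop below 0.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            cell = max(diagonal, up, left)
            if local:
                cell = max(cell, 0)  # restart the alignment instead of going negative
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]

print(alignment_score("ACGT", "AGT"))                             # global
print(alignment_score("TTGATTACAGG", "CCGATTACACC", local=True))  # local
```

In the second call the flanking bases differ, but the local score still reflects the shared GATTACA core.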
- Database searches with sequence
BLAST searches a sequence database for entries that are similar to a query sequence. There are different BLAST programs depending on whether you want to match nucleotide or protein sequences.
- BLAST, different BLAST programs to compare different molecules
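(protein-protein (blastp), nucleotide-nucleotide (blastn), the protein-nucleotide combinations (blastx: translated nucleotide query against a protein database; tblastn: protein query against a translated nucleotide database), and also nucleotide-nucleotide via protein translation (tblastx))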
- E-value, score significance
The Expect value (E) is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. BLAST hits with an E-value smaller than 1e-50 are database matches of very high quality. BLAST hits with an E-value smaller than 0.01 can still be considered good hits for homology matches.
Score significance describes how trustworthy a match is, that is, how unlikely it is to have arisen by random chance.
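How the E-value depends on database size can be sketched with the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The function name is hypothetical, and the sketch ignores the effective-length corrections that BLAST applies in practice.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Made-up search: a 300-residue query against databases of two sizes
for database_length in (1_000_000, 100_000_000):
    e = expect_value(bit_score=50, query_length=300, database_length=database_length)
    print(f"database length {database_length}: E = {e:.2e}")
```

The same bit score gives a larger (less significant) E-value in a larger database, because more chance hits are expected there.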
- Multiple Sequence Alignments, sequence conservation
Multiple Sequence Alignment (MSA) is generally the alignment of three or more biological sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred and the evolutionary relationships between the sequences can be studied. This is possible because conserved sequences are identical or similar sequences in nucleic acids (DNA and RNA) or proteins across species (orthologous sequences), within a genome (paralogous sequences), or between donor and receptor taxa (xenologous sequences). Conservation indicates that a sequence has been maintained by natural selection.
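A simple way to quantify conservation in an existing alignment is the fraction of sequences that share the most common residue in each column. The sketch below assumes the sequences are already aligned, that "-" marks a gap, and uses a hypothetical function name and a made-up alignment; more refined measures weight similar residues as well.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column fraction of sequences carrying the most common residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # gaps do not count as residues
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

# Made-up alignment of three short sequences
aligned = ["ACGT",
           "ACGA",
           "AC-T"]
print([round(s, 2) for s in column_conservation(aligned)])
```

Columns with a score of 1.0 are fully conserved across the aligned sequences.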