Key questions Flashcards

Question

* Multiple testing correction (FWER, FDR)

Answer 1

When many hypothesis tests are performed, the number of false positives expected by chance increases. The family-wise error rate (FWER) is the probability of making one or more false discoveries (type I errors) when performing multiple hypothesis tests. The false discovery rate (FDR) is the expected proportion of false positives among all positive test results.

Answer 2

The nominal p-value is the p-value as calculated for a single test. The adjusted p-value compensates for multiple statistical testing to reduce the risk of false positives. It can be determined by either the Bonferroni correction or the Benjamini-Hochberg procedure, which is less strict. The FDR-adjusted p-value is also called the q-value.

Answer 3

The main difference between MDS and PCA is what each preserves. MDS focuses on preserving the pairwise distances between data points and can be non-linear (for example, non-metric MDS), while PCA is a linear technique that finds the directions of maximum variance in the data.

MDS aims to visualize the data in a lower-dimensional space while preserving pairwise distances. It can be used to identify clusters or patterns and to visualize the relationships between data points or groups. It works with any pairwise distance metric, such as Euclidean distance or correlation distance.

PCA projects the data onto the directions of maximum variance to create a new set of uncorrelated variables, called principal components. Because it is linear, it can only capture linear relationships between variables. It is often used for data compression, feature extraction, and data visualization.

Answer 4

Both gene enrichment analysis methods test whether a set of genes is under- or overrepresented. Gene set enrichment analysis can be used when there is no predefined list of significant genes, because it ranks all genes instead of applying a cutoff.

Answer 5

Disease enrichment analysis tests whether up- or downregulated genes are associated with a disease.

Answer 6

Annotation databases label and store information about biological entities such as proteins and genes. Examples: KEGG holds a collection of manually drawn pathway maps, GO (Gene Ontology) describes the functions of genes, and Reactome is a database of biological pathways.

Answer 7

Effect size is a measure of the magnitude or strength of a relationship between two variables, or of the difference between two groups in a study. One common effect size, the correlation coefficient, measures the degree of association between two variables on a scale of -1 to +1. A value of +1 indicates a perfect positive linear relationship: as one variable increases, the other also increases. A value of -1 indicates a perfect negative linear relationship: as one variable increases, the other decreases. A value of 0 indicates no linear relationship between the two variables.
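As a minimal sketch of the two kinds of effect size mentioned above: the -1 to +1 scale corresponds to the Pearson correlation coefficient, and a difference between two groups is often expressed as a standardized mean difference. Cohen's d is used here as one such measure; it is not named on the card and is an assumption, as are the function names.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences (range -1 to +1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    var1 = sum((a - m1) ** 2 for a in group1) / (n1 - 1)
    var2 = sum((b - m2) ** 2 for b in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Made-up illustrative values, not data from the course.
print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfectly linear, increasing
print(cohens_d([2, 4, 6], [1, 3, 5]))    # group means differ by half a pooled SD
```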

Answer 8

Fisher's exact test is applied to a 2x2 contingency table. The same table can also be used to calculate fold enrichment and the odds ratio, for example.
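A rough sketch of how the test works on a 2x2 table, assuming the usual enrichment layout (rows: gene in the study list or not; columns: gene in the gene set or not). The function name and the choice of a one-sided "greater" test are assumptions made for illustration.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for over-representation.

    Table layout (assumed):
                    in gene set   not in gene set
        in list          a              b
        not in list      c              d
    """
    row1 = a + b          # size of the study list
    col1 = a + c          # size of the gene set
    n = a + b + c + d     # all genes considered
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    total = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return total / comb(n, col1)

# Made-up counts: 3 of 4 listed genes are in the set, 1 of 4 unlisted genes is.
print(fisher_exact_greater(3, 1, 1, 3))  # equals 17/70
```

In practice a statistics library would normally be used instead; this version only shows where the p-value comes from.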

Answer 9

Fold enrichment compares the frequency of genes in a representative sample collection with the frequency in a patient group. Fold change can be used to see whether genes are up- or downregulated in patient groups compared to control groups, to find genes associated with the pathology.
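One common way to compute fold enrichment is as a ratio of two frequencies: the fraction of hits in the group of interest divided by the fraction in the reference collection. The sketch below assumes that definition; the function name and the counts are hypothetical.

```python
def fold_enrichment(hits_in_group, group_size, hits_in_reference, reference_size):
    """Ratio of the hit frequency in the group to the hit frequency in the reference."""
    group_freq = hits_in_group / group_size
    reference_freq = hits_in_reference / reference_size
    return group_freq / reference_freq

# Made-up counts: 20 of 200 genes in the group vs 500 of 20,000 in the reference.
print(fold_enrichment(20, 200, 500, 20_000))  # about 4: four times more frequent
```

A value above 1 means the category is more frequent in the group than in the reference; a value below 1 means it is less frequent.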

Answer 10

Odds Ratio = Odds of Event A / Odds of Event B
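As a small worked sketch, reading "Event A" and "Event B" as the same event observed in two groups (an interpretation, not stated on the card), the odds ratio can be computed from the four cells of a 2x2 table. The function name and counts are hypothetical, and all cells are assumed to be non-zero.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = event occurred in group 1,  b = event did not occur in group 1
    c = event occurred in group 2,  d = event did not occur in group 2
    """
    odds_group1 = a / b
    odds_group2 = c / d
    return odds_group1 / odds_group2

# Made-up counts: odds of 3:1 in group 1 versus 1:3 in group 2.
print(odds_ratio(3, 1, 1, 3))  # about 9: the odds are nine times higher in group 1
```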

Answer 11

Measures of enrichment can show, for example, the odds that something will happen (odds ratio), how large the effect is (effect size), or how common or rare it is compared to the normal population (fold enrichment).

Answer 12

In the Bonferroni correction, one adjusts the alpha (α) level, the significance threshold that p-values are compared against (most often 0.05), to a stricter threshold that takes the multiple statistical tests into consideration. The formula is α_new = α_original / n, where n is the number of statistical tests.
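The correction can be sketched in two equivalent ways: divide alpha by the number of tests, or multiply each p-value by the number of tests (capped at 1) and keep the original alpha. The function names and p-values below are illustrative assumptions.

```python
def bonferroni_alpha(alpha, n_tests):
    """Stricter per-test significance threshold: alpha / n."""
    return alpha / n_tests

def bonferroni_adjust(pvalues):
    """Bonferroni-adjusted p-values: p * n, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

print(bonferroni_alpha(0.05, 625))            # about 8e-05
print(bonferroni_adjust([0.001, 0.01, 0.5]))  # compare these against 0.05
```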

Answer 13

The Benjamini-Hochberg procedure works as follows:
1. Order the p-values from smallest to largest.
2. Rank the p-values.
3. The largest adjusted (FDR) p-value is the same as the largest p-value.
4. Each next-largest adjusted p-value gets the smaller of two options:
a) the previous adjusted p-value
b) the current p-value x (total number of p-values / p-value rank)

Answer 14

Filtering: We filter out lowly expressed genes to increase the power of the statistical analysis. A gene that is lowly expressed in both the control and the patient group is unlikely to be significant, but keeping it still reduces statistical power, because the more tests we run, the more false positives we will find. We therefore remove low-count genes. This also makes the means being compared more reliable.

Normalization: We adjust the data to account for factors that prevent direct comparison of expression measures. Normalization is necessary because sample preparation and sequencing introduce such factors. Between samples we adjust for sequencing depth and RNA composition, and within a sample we adjust for gene length. Normalization is needed so that differences in gene expression are accurately reflected.
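A minimal sketch of both steps, assuming counts-per-million (CPM) as the depth normalization and a simple "at least `min_cpm` in at least `min_samples` samples" filter. The thresholds, function names and counts are hypothetical. RNA composition and gene length adjustments are not covered here.

```python
def counts_per_million(sample_counts):
    """Scale raw counts by the sample's total count (sequencing depth)."""
    total = sum(sample_counts.values())
    return {gene: count * 1_000_000 / total for gene, count in sample_counts.items()}

def filter_low_expression(samples, min_cpm=1.0, min_samples=2):
    """Keep genes with CPM >= min_cpm in at least min_samples samples."""
    cpms = [counts_per_million(s) for s in samples]
    return [
        gene for gene in samples[0]
        if sum(cpm[gene] >= min_cpm for cpm in cpms) >= min_samples
    ]

# Made-up counts: sample 2 was sequenced twice as deeply as sample 1.
sample1 = {"geneA": 600_000, "geneB": 400_000, "geneC": 0}
sample2 = {"geneA": 1_200_000, "geneB": 799_999, "geneC": 1}

print(counts_per_million(sample1)["geneA"], counts_per_million(sample2)["geneA"])
print(filter_low_expression([sample1, sample2]))  # geneC is dropped as low-count
```

After scaling, geneA has the same value in both samples even though its raw count doubled, which is the point of adjusting for sequencing depth.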

Answer 15

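PCA produces a series of principal components, each a single axis ordered by the variance it explains, while MDS directly produces an embedding in a chosen number of dimensions, typically 2-3.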

Answer 16

FDR = FP / (FP + TP)
FP = false positives
TP = true positives

By chance alone we expect 13,000 x 0.001 = 13 genes to have p < 0.001.
This means that 13 of the 120 positive results are expected to be false: 13/120 = 11%.
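The arithmetic above can be checked with a few lines. The sketch assumes, as the card's numbers suggest, that 13,000 genes were tested and 120 had p < 0.001. The function name is hypothetical.

```python
def estimated_fdr(n_tests, p_threshold, n_positives):
    """Expected false positives under the null, divided by observed positives."""
    expected_false_positives = n_tests * p_threshold
    return expected_false_positives / n_positives

fdr = estimated_fdr(13_000, 0.001, 120)
print(f"{fdr:.1%}")  # 10.8%, which rounds to the 11% on the card
```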

Answer 17

The formula to estimate the family-wise error rate is:
FWER ≤ 1 - (1 - α_IT)^c
where α_IT = alpha level for an individual test (e.g. 0.05) and c = number of comparisons.

FWER = 1 - (1 - 0.05)^625 ≈ 1
Bonferroni-corrected threshold = 0.05 / 625 = 8 x 10^-5
New FWER = 1 - (1 - 8 x 10^-5)^625 ≈ 0.05
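A short numerical check of the calculation, using the card's values of 0.05 and 625 comparisons. The function name is an assumption.

```python
def fwer(alpha_individual, n_comparisons):
    """Family-wise error rate estimate: 1 - (1 - alpha)^c."""
    return 1 - (1 - alpha_individual) ** n_comparisons

uncorrected = fwer(0.05, 625)
corrected = fwer(0.05 / 625, 625)

print(uncorrected)  # very close to 1
print(corrected)    # about 0.049, i.e. just under the intended 0.05
```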

Answer 18

Step 1: Conduct all of your statistical tests and find the p-value for each test.
Step 2: Arrange the p-values in order from smallest to largest, assigning a rank to each one: the smallest p-value has a rank of 1, the next smallest has a rank of 2, etc.
Step 3: Calculate the Benjamini-Hochberg critical value for each p-value, using the formula (i/m)*Q, where i = rank of the p-value, m = total number of tests, and Q = your chosen false discovery rate.
Step 4: Find the largest p-value that is less than or equal to its critical value. Designate that p-value and every smaller p-value as significant.

1. Order the p-values from smallest to largest.
2. Rank the p-values.
3. The largest FDR-adjusted p-value is the same as the largest p-value.
4. The next largest adjusted value gets the smaller of two options:
a. the previous adjusted p-value
b. the current p-value x (total number of p-values / p-value rank)

Rank   p-value   Adjusted p-value
1      0.001     0.008
2      0.01      0.04
3      0.02      0.05
4      0.04      0.08
5      0.07      0.11
6      0.52      0.69
7      0.78      0.89
8      0.95      0.95

If the number of tests is not given, rank the p-values as above and use the number of p-values in its place: current p-value x (total number of p-values / p-value rank).
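Steps 1 to 4 can be sketched using the eight p-values from the worked example. The false discovery rate Q = 0.05 is an assumption, since the card does not state one. The function name is hypothetical, and the comparison uses "less than or equal to", as in the standard formulation.

```python
def bh_significant(pvalues, q):
    """Return the p-values declared significant by the critical-value method."""
    m = len(pvalues)
    ranked = sorted(pvalues)                 # Step 2: rank 1 = smallest
    largest_passing_rank = 0
    for rank, p in enumerate(ranked, start=1):
        critical_value = rank / m * q        # Step 3: (i/m)*Q
        if p <= critical_value:
            largest_passing_rank = rank      # Step 4: remember the largest such rank
    return ranked[:largest_passing_rank]

pvals = [0.001, 0.01, 0.02, 0.04, 0.07, 0.52, 0.78, 0.95]
print(bh_significant(pvals, q=0.05))  # [0.001, 0.01]
```

This agrees with the adjusted p-value approach: only the two smallest p-values have adjusted values (0.008 and 0.04) at or below 0.05.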

Answer 19

The resulting t-SNE plot can be used to visualize the data in two or three dimensions, and to identify clusters or patterns that may not be apparent in the original high-dimensional space.

K-means is a clustering algorithm commonly used in translational bioinformatics to group similar data points together based on their features or characteristics. The algorithm works by partitioning a dataset into K clusters, where K is a user-specified parameter.

Hierarchical clustering is another clustering algorithm commonly used in translational bioinformatics for the same purpose. It builds a hierarchy of clusters based on pairwise distances or similarities, either by successively merging the closest clusters (agglomerative) or by recursively splitting the data (divisive).
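To make the K-means description concrete, here is a bare-bones sketch of the assign-then-update loop. It uses the first K points as starting centroids, a simplification assumed here so the result is reproducible; real implementations typically use random or smarter initialization. The function name and the points are made up.

```python
import math

def kmeans(points, k, max_iter=100):
    """Partition points into k clusters; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]   # naive initialization (assumption)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 0, 1, 1]
```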

Answer 20

In summary, while both multivariate analysis and high dimensional data analysis deal with data sets with multiple variables, multivariate analysis typically focuses on analyzing the relationships between a smaller number of variables, while high dimensional data analysis focuses on developing methods to handle the complexity and scale of data sets with a large number of variables.


(44 cards)