Exam Questions Flashcards

1
Q

(EXAM - VL1)

What is S3 in R? Explain.

(2024-2)

A
  • simpler / informal / lightweight / more flexible OOP system
  • generic functions → objects have different behaviours based on their class
  • Unlike S4: does not enforce strict definitions of objects/methods.
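A minimal S3 sketch (the names `area`, `circle`, `square` are made up for illustration): a generic function dispatches on the object's class attribute, so the same call behaves differently per class.

```r
area = function(shape) UseMethod("area")        # generic function
area.circle = function(shape) pi * shape$r^2    # method for class "circle"
area.square = function(shape) shape$side^2      # method for class "square"

c1 = structure(list(r = 2), class = "circle")
s1 = structure(list(side = 3), class = "square")
area(c1)   # dispatches to area.circle
area(s1)   # dispatches to area.square
```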
2
Q

(EXAM - VL2)

Write a function that calculates the waist-to-hip ratio (as a new measurement instead of BMI); return the result, where the user can choose how many decimal places he/she wants, otherwise the default should be 3 digits.

(2024-2, 2020 similar)

A
wth = function(waist, hip, digits = 3) {
  whr = waist / hip
  return(round(whr, digits))
}

(could add a check that hip > 0)

3
Q

(EXAM - VL2)

Write a function that calculates the BMI.
BMI = kg/m²

(2020)

A
bmi = function(w, h, d = 2) {    # w = weight in kg, h = height in m
    bmi = w / (h^2)
    return(round(bmi, d))
}

(could add a check that height is > 0)

4
Q

(EXAM - VL2)

Write a function tmean which calculates a trimmed mean by removing the highest and lowest value of a vector and thereafter calculating the mean of the remaining vector elements. Return the normal mean if only one or two values are given. You can ignore the NA problem, but you are not allowed to use the mean function of R :( (NA problem: bonus points)

(2024-1, 2024-sample)

A
tmean = function(x) {
    n = length(x)
    if (n <= 2) {
        return(sum(x) / n)
    }
    x_sorted = sort(x)            # sort, then drop min and max
    x_trim = x_sorted[-c(1, n)]
    return(sum(x_trim) / length(x_trim))
}
5
Q

(EXAM - VL2)

Write a function gmean which calculates the geometric mean of a vector. You can ignore the NA problem. Below is the formula for the geometric mean.

(2018)

A
gmean = function(v) {
    n = length(v)
    p = prod(v)
    return(p^(1/n))
}
6
Q

(EXAM - VL2)

What is the three-dot (…) operator in R? (2024-1)

(VL2)

A

Three dots, or “ellipsis” argument

Used to allow a function to accept additional arguments without explicitly defining them in the function signature.

→ makes functions more flexible and adaptable to different situations.
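A small sketch (the function name `robust_mean` is made up) showing how `...` forwards any extra arguments to an inner call:

```r
# `...` collects extra arguments and passes them straight to mean()
robust_mean = function(x, ...) {
  mean(x, ...)   # e.g. na.rm = TRUE or trim = 0.1 pass through unchanged
}
robust_mean(c(1, 2, NA))                # NA (no extra arguments)
robust_mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
```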

7
Q

(EXAM - VL2)

Write a function is_even() that returns TRUE if a number is even and FALSE if it’s odd.

(ct)

A
is_even = function(x) {
  return(x %% 2 == 0)
}
8
Q

(EXAM - VL2)

Write a function factorial_custom() that calculates the factorial of a number without using factorial().

(ct)

A
factorial_custom = function(n) {
    if (n == 0) {
      return(1)
    } else {
      result = 1
      for (i in 1:n) {
        result = result * i
      }
      return(result)
    }
}
9
Q

(EXAM - VL2)

Annotate this code:

gentoo=as.data.frame(penguins[penguins$species=="Gentoo",])
dim(gentoo)
cWeight=cut(adelie$body_mass_g,breaks=quantile(adelie$body_mass_g,c(0,1/3,2/3,1),na.rm=TRUE),include.lowest=TRUE)
table(cWeight)
table(cWeight,adelie$sex)
...

(2024-1)

A

gentoo=as.data.frame(penguins[penguins$species=="Gentoo",])
→ filters the penguins data to the Gentoo species and converts the result into a data frame

dim(gentoo)
→ Display number of rows, columns of Gentoo subset

cWeight=cut(adelie$body_mass_g,breaks=quantile(adelie$body_mass_g,c(0,1/3,2/3,1),na.rm=TRUE),include.lowest=TRUE)
→ categorises Adelie penguins’ body mass into 3 weight groups based on quantiles (note: the code switches from the gentoo subset to an adelie one)
→ include.lowest=TRUE ensures the lowest value is included in the first category
→ numeric to categorical

table(cWeight)
→ create frequency table showing count of observations in each weight category

table(cWeight,adelie$sex)
→ create contingency table showing count of observations for each weight category, split by sex

10
Q

(EXAM - VL2)

Fill in the blanks:
~~~
______(palmerpenguins)
______(penguins)
with(penguins,_______(body_mass_g ~ sex*species,col=c("salmon","skyblue")))
__________(body_mass_g ~ sex + species, data = penguins, FUN = mean, na.rm = TRUE)
~~~

(2024-1)

[what does this do?]

A

aggregate - calculates Mean Body Mass by Sex & Species

_library_(palmerpenguins)
_data_(penguins)
with(penguins, _boxplot_(body_mass_g ~ sex*species,col=c("salmon","skyblue")))
_aggregate_(body_mass_g ~ sex + species, data = penguins, FUN = mean, na.rm = TRUE)

[creates a boxplot to visualise the spread of body weights per sex and species]

11
Q

(EXAM - VL2)

The Titanic data set contains data about the Titanic from 1912. Given are different categories and the survival of passengers and crew members.

What does the following R code mean? Explain the commands and the output:

> options(width=70)

(This is now more of a command-assignment task: you have to place commands like names, str, dim, data into empty command fields.)

(2024-sample)

A

→ sets max number of chars / line when displaying output in console.

Why Use It?
- controls how wide printed output appears in console.
- helps format long outputs (e.g., df, lists, matrices) -> prevent wrapping in messy way.
- working with narrow terminal windows / printing wide tables.

12
Q

(EXAM - VL2)

The Titanic data set contains data about the Titanic from 1912. Given are different categories and the survival of passengers and crew members.

What does the following R code mean? Explain the commands and the output:

> ftable(Titanic[1:3,,,])

(This is now more of a command-assignment task: you have to place commands like names, str, dim, data into empty command fields.)

(2024-sample)

A

ftable()
→ creates a flat contingency table (instead of displaying the multi-level format)

Titanic[1:3,,,]
→ selects the first 3 levels of the "Class" variable (i.e., 1st, 2nd, 3rd), keeps all levels of the other dims.

13
Q

(EXAM - VL2)

The Titanic data set contains data about the Titanic from 1912. Given are different categories and the survival of passengers and crew members.

What does the following R code mean? Explain the commands and the output:

> names(dimnames(Titanic))
[1] "Class"    "Sex"      "Age"      "Survived"

(This is now more of a command-assignment task: you have to place commands like names, str, dim, data into empty command fields.)

(2024-sample)

A

dimnames(Titanic)
→ retrieves the dim names (or labels) of Titanic dataset.

names(dimnames(Titanic))
→ extracts just names of these dims, returning:

[1] "Class"    "Sex"      "Age"      "Survived"
→ dataset is structured as a 4D table with these categories

Titanic is a 4D contingency table → has 4 cat dims:
“Class” → Passenger class (1st, 2nd, 3rd, Crew)
“Sex” → Male, Female
“Age” → Child, Adult
“Survived” → No, Yes

14
Q

(EXAM - VL2)

You are analyzing the flipper lengths of Adelie and Chinstrap penguins. Fill in the missing parts (bold) in the R code below:

library(palmerpenguins)  
\_\_\_\_\_\_(penguins)  
adelie_chinstrap = penguins[penguins$species \_\_\_\_\_ c("Adelie", "Chinstrap"), ]  
boxplot(flipper_length_mm ~ \_\_\_\_\_\_\_\_\_\_ * sex, data=adelie_chinstrap, col=c("lightblue", "pink"))  
_aggregate_(flipper_length_mm ~ species, data=adelie_chinstrap, \_\_\_\_\_\_, \_\_\_\_\_\_)  

Annotate code and summarise the findings shortly (2022)
More code, fill in the gaps in code (2022)
describe code (2020)
Fill in the gaps on the code below using these R commands Options, by, colnames, TRUE, dim, read.table, with, (2024-2)

(ct)

A
library(palmerpenguins)  
_data_(penguins)  
adelie_chinstrap = penguins[penguins$species _%in%_ c("Adelie", "Chinstrap"), ]  
boxplot(flipper_length_mm ~ _species_ * sex, _data_=adelie_chinstrap, col=c("lightblue", "pink"))  
aggregate(flipper_length_mm ~ species, data=adelie_chinstrap, _mean_, _na.rm=TRUE_)  
15
Q

(EXAM - VL2)

The dataset mtcars contains data on different car models.

The following R code filters cars with more than 6 cylinders (mtcars$cyl) and calculates the max horsepower (mtcars$hp) by number of gears (mtcars$gears). Fill in the blanks:

\_\_\_\_\_\_(mtcars)  
high_cyl = mtcars\_\_\_\_\_\_   
aggregate(\_\_\_\_\_\_, data=high_cyl, \_\_\_\_\_\_)  

Annotate code and summarise the findings shortly (2022)
More code, fill in the gaps in code (2022)
describe code (2020)
Fill in the gaps on the code below using these R commands Options, by, colnames, TRUE, dim, read.table, with, (2024-2)

(ct)

A
_data_(mtcars)  
high_cyl = mtcars_[mtcars$cyl > 6, ]_ 
dim(high_cyl)  
aggregate(_hp ~ gear_, data=high_cyl, _max_)  
16
Q

(EXAM - VL2)

You were investigating two light schedule treatments (trt 1, trt2) against normal light conditions, 12 hours of continuous light (ctrl), on the daily dry weight increments of plants.

Please explain the R analysis below and the final result.

> data(PlantGrowth) 
> dim(PlantGrowth)  
[1] 30 2  
> head(PlantGrowth,n=3)  
 Weight group  
1 4.17 ctrl  
2 5.58 ctrl  
3 5.18 ctrl  
> with(PlantGrowth, aggregate(weight,by=list(group),max))
  Group.1    x
1 ctrl 6.11  
2 trt1 6.03  
3 trt2 6.31  
> PlantGrowth[PlantGrowth$weight>quantile(PlantGrowth$weight,0.9),]
   weight group
4 6.11 ctrl  
21 6.31 trt2  
28 6.15 trt2  

(2018)

A

with: saves having to write PlantGrowth again and again

> data(PlantGrowth) # loads the dataset
> dim(PlantGrowth)  # displays dims of dataset -> rows, cols
[1] 30 2  
> head(PlantGrowth,n=3)  # displays first 3 rows of the dataset 
 Weight group  
1 4.17 ctrl  
2 5.58 ctrl  
3 5.18 ctrl  
> with(PlantGrowth, aggregate(weight,by=list(group),max))
# aggregate: get summary of numeric data, computes max weight of each treatment group    
Group1 x  
1 ctrl 6.11  
2 trt1 6.03  
3 trt2 6.31  
> PlantGrowth[PlantGrowth$weight>quantile(PlantGrowth$weight,0.9),]
# Identifies plants with weights greater than 90th percentile (i.e., top 10%)   
Weight group  
4 6.11 ctrl  
21 6.31 trt2  
28 6.15 trt2  

Final Result
trt2 has highest recorded weight (6.31).

top 10% of weights include more trt2 plants -> trt2 might have had stronger effect on growth than other conditions.

17
Q

(EXAM - VL3)

Describe how you would transform a quantitative variable into a qualitative one with around equal sized classes. Explain shortly why such approach could be useful. (3 points)

(2024-2, 2024-sample)

A
  • Sort data
  • Define # of bins (e.g., 3 or 4).
  • Divide range into equal-sized intervals (e.g., using quantiles).
  • Label bins (e.g., “Low”, “Medium”, “High”).
    in R:
    cut() function, assign levels with function
    ~~~
    data_cat <- cut(data, breaks = 3, labels = c("Low", "Medium", "High"))
    ~~~
    (breaks = 3 alone gives equal-width intervals; for a roughly equal-sized split → use quantile breaks)

Why it’s useful:
- Simplifies interpretation.
- Facilitates comparisons (e.g., with chi-square tests).
- Handles non-normal data.
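A quantile-based sketch (made-up data) that yields roughly equal-sized classes, matching the approach described above:

```r
set.seed(1)
x = rnorm(90)                      # made-up continuous data
x_cat = cut(x,
            breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
            labels = c("Low", "Medium", "High"),
            include.lowest = TRUE)
table(x_cat)   # roughly 30 observations per class
```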

18
Q

(EXAM - VL3)

1- save
2- saveRDS
3- dev.copy2pdf
4- write.ftable
5- write.table
6- save.image
7- savehistory
?

(2024-2)

A

save – Saves multiple R objects to file in binary format (.RData).

saveRDS – Saves single R object to file in binary format (.rds), allowing selective loading.

dev.copy2pdf – Copies current graphics device output to PDF file.

write.ftable – Writes flat contingency table (ftable) to a text file.

write.table – Exports df or matrix to a text file (CSV-like).

save.image – Saves entire current R workspace (all objects) to .RData file.

savehistory – Saves command history to a file (.Rhistory)

19
Q

(EXAM - VL3)

Explain similarities and differences between the R commands
- data
- read.table
- source
- load (3 points)

(This is now more an a, b, c, d, e assignment task!)

(2024-sample)

A

Similarities:
All these commands are used to import data into R for analysis.

Differences:
- data(): Loads datasets bundled with R or packages
- read.table(): Reads external text files (CSV, tab-delimited ) into R as df .
- source(): Executes R script file
- load(): Loads R objects (saved as .RData or .rda files) into environment.

20
Q

(EXAM - VL3)

Describe what these do:

  • table
  • readRDS
  • data.frame

(2022)

A
  • table - creates a frequency table of categorical variables.
  • readRDS - Reads a single R object saved in .rds format into R.
  • data.frame - creates a data frame, a table-like structure for storing data in R.
21
Q

(EXAM - VL3)

How would you transform a categorical variable into numerical? How such an approach might be useful?

(2024-2)

A

[probably they mean the other way around? but:]

df$category_num = as.numeric(as.factor(df$category))

why useful?
- Enables statistical and machine learning models to process categorical data.
- Helps in finding patterns and relationships in data.
- Allows numerical operations like computing correlations.

22
Q

(EXAM - VL3)

what is
- save
- save.image
- saveRDS

(2022, 2020)

A
  • save: saves R objects to a file, typically in .RData or .rda format
  • save.image: Saves entire current R workspace (all objects) to .RData file.
  • saveRDS: saves a single R object to a file in .rds format.
23
Q

(EXAM - VL3)

Describe how you would transform a quantitative variable into a qualitative one with around equal sized classes. Explain shortly why such approach could be useful. (2024-sample) (3 points)

A

(probably mean the other way around? but here we go)

Using numeric encoding: Assign each category a unique number (e.g., “Low” = 1, “Medium” = 2, “High” = 3).
In R, use as.numeric(factor(variable)) to convert categorical values to numbers.

Using one-hot encoding: Convert each category into a binary column (0 or 1).
In R, use model.matrix(~ variable - 1) for one-hot encoding.

Usefulness:
Enables use of categorical data in machine learning algorithms that require numerical input.
Makes it easier to calculate statistics like correlation or regression when dealing with categorical data.

24
Q

(EXAM - VL5)

Explain Will Rogers Phenomenon (2024-2, 2024-1)

A

Will Rogers: “When the Okies left Oklahoma and moved to California, they raised average intelligence in both states.” (Due to Feinstein et al. (1985).)

CHATGPT:

  • moving an individual from one group to another
  • → can raise the AVG of both groups
  • even though no actual improvement has happened.

🔹 Example:
Imagine two classes:
- Low achievers: Average grade = 50
- High achievers: Average grade = 80

If we move a student with grade 60 from the high achievers to the low achievers:
- The high achievers’ new avg increases (60 was below their previous avg).
- The low achievers’ new avg also increases (60 is above their previous avg).

common in medicine, ecology, and statistics, where reclassification makes both groups look better without actual improvement.
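A toy R demonstration (hypothetical grades): moving a value that lies between the two group means from the higher-mean group to the lower-mean group raises both averages.

```r
low  = c(40, 45, 50, 55)     # mean 47.5
high = c(60, 75, 80, 85)     # mean 75
low2  = c(low, 60)           # the 60 joins the low group...
high2 = setdiff(high, 60)    # ...and leaves the high group
mean(low2)    # up from 47.5
mean(high2)   # up from 75
```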

25
Q

(EXAM - VL7)

A mutant animal grows 3cm per week faster than expected of a normal animal; CI 95%[1.3, 5.4].
Put a checkmark after the correct p value and write an estimate for CI for the other p values:
* 0.2
* 0.1
* 0.05
* 0.01
(2024-2)

A

0.05 → 95% CI [1.3, 5.4] ✓
Others:
* 0.2 → 80% CI
* 0.1 → 90% CI
* 0.01 → 99% CI

As p-values correspond to different confidence levels, we can estimate the CIs for the other p-values:
p=0.2 (80% CI)
* → Narrower interval (80% CIs are narrower than 95% CIs)
* Approximate estimate: [1.8, 4.9]
p=0.1 (90% CI)
* → Slightly wider than the 80% CI, but still narrower than the 95% CI
* Approximate estimate: [1.5, 5.1]
p=0.01 (99% CI)
* → Wider interval than the 95% CI
* Approximate estimate: [0.8, 5.9]
Higher confidence levels require wider intervals to ensure the true value is captured.

CI     z-score (approx.)
80%    1.28
90%    1.645
95%    1.96
99%    2.576
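The z-table above can be applied directly; a sketch assuming a symmetric, normal-based CI (so the computed numbers differ slightly from the rough estimates):

```r
est = (1.3 + 5.4) / 2                    # centre of the 95% CI
se  = (5.4 - 1.3) / (2 * qnorm(0.975))   # half-width / 1.96
ci  = function(level) est + c(-1, 1) * qnorm(1 - (1 - level) / 2) * se
round(ci(0.80), 1)   # 80% CI, narrowest
round(ci(0.90), 1)
round(ci(0.99), 1)   # 99% CI, widest
```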

26
Q

(EXAM - VL11)

(Given: a biplot) Metabolite levels in plants grown in two different temperature conditions (16°C, 6°C) were analysed. N = 243 for each temperature, total number = 486.
PCA was performed on 37 metabolite levels. The biplot shows only 2 PCs. How many PCs are generated in total? (1 point)
How many variables influence the PC ? (1 point)

(2024-2)

I think most of the other questions were about interpreting the biplot. (2022)

A
  • number of PCs = number of variables (metabolites) in the dataset = 37 PCs (if there were fewer samples than variables, there would be n−1 PCs)
  • variables that influence the PCs are the metabolites → 37,
    because each PC is a linear combination of all original variables
27
Q

(EXAM - VL11)

(Given: a biplot) Metabolite levels in plants grown in two different temperature conditions (16°C, 6°C) were analysed. N = 243 for each temperature, total number = 486.
PCA was performed on 37 metabolite levels. The biplot shows only 2 PCs, but 37 PCs are generated, influenced by 37 variables.

Which metabolite is more affected by temp increase?
Which is least sensitive to heat ?

(2024-2)

A
  • long and aligned with group separation → Most affected by T increase
  • short and perpendicular to group separation → Least sensitive to heat.

Which metabolite is more affected by temperature increase?
- in biplot: length of arrow (loading vector) for each met indicates how strongly that met contributes to PCs
- If met has long arrow pointing in direction of separation between two temperature groups (16°C and 6°C) → suggests that this met is strongly influenced by temperature diff.

Which metabolite is least sensitive to heat?
- conversely: mets with short arrows in biplot contribute less to variance captured by PCs → less influenced by T changes.
- if met’s arrow points perp to separation direction of two T groups → this met does not correlate with T diffs.
→ met with shortest arrow or one that does not align with group separation is least sensitive to heat.

28
Q

(EXAM - VL11)

>wine=read.table("wineData.txt",header=T)
>p=prcomp(wine,scale=T)
>biplot(p)
>varimax(p$rotation)

What do the red arrows signify in this figure, write the most common term associated with PCA?

(2024-2)

A
  • loadings (red entries)

[scaled by PC standard deviations and sqrt(number of observations)]

29
Q

(EXAM - VL12)

[shown is a square with a frame, then same image in PCA scoreplot, where it has been turned so it’s balancing on a corner]

Why is the PCA not straight ?

(2024-2)

A

Because the PCs find the directions of greatest variance.

Data are uncorrelated (r=0), but not independent!
Choosing X limits Y -> X and Y are said to carry mutual Information

PCA assumes linear relationships + Gaussian distributions for optimal interpretation.
data non-gaussian -> PCA may fail to fully disentangle dependencies or make dimensions truly independent

30
Q

(EXAM - VL13)

Which of the following statements relates better to tSNE or PCA ?
* Builds on the concept of density sensitive distance metric
* Repeated runs may result in different outcomes
* Needs additional parameter setting
* Allow meaningful assessment of Var ….(i forgot the ending)

(2024-2)

A

tSNE:
- Builds on the concept of density sensitive distance metric - uses probabilistic approach to preserve local neighborhood structures, sensitive to density differences in data
- Repeated runs may result in different outcomes - stochastic algorithm –> results can vary between runs unless seed is fixed
- Needs additional parameter setting - requires tuning parameters like perplexity, learning rate, and iterations (cf PCA fewer parameters to tune)

PCA
- Allow meaningful assessment of variance - *explicitly calculates and retains variance along principal components

31
Q

(EXAM - VL11)

For the matrix, M, with M=[ 5 8 ; 1 3 ]:

a) Which of the following two vectors is an/ are eigenvector(s) of M:
⃗v1=(−1 3 )
⃗v2=(2 −1)

b) What is/are the associated eigenvalue/s?
Document your answer by presenting the necessary calculations!

(3 points)

(2024-sample, 2024-1, 2024-2, 2020)

A

a) A vector v is an eigenvector of a matrix M if:
Mv = λv

Mv1 = [ 5 8 ; 1 3 ] (-1 3) = (5·(-1) + 8·3, 1·(-1) + 3·3) = (19 8)
→ Mv1 ≠ λv1
→ (19 8) is not a scalar multiple of v1, so v1 is not an eigenvector of M.

Mv2 = [ 5 8 ; 1 3 ] (2 -1) = (5·2 + 8·(-1), 1·2 + 3·(-1)) = (2 -1)
→ Mv2 = λv2 with λ = 1
→ v2 is an eigenvector, scaled by an eigenvalue of 1

b)
We already found one eigenvalue associated with v2: λ1 = 1
To calculate all eigenvalues:
Mv =λv
(M - λI)v = 0 (I = identity matrix)
→ det(M - λI) = 0
(5 - λ)(3 - λ) - 8·1 = 0
λ² - 8λ + 7 = 0 → (λ - 1)(λ - 7) = 0

λ1 = 1, λ2 = 7
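The hand calculation can be checked in R:

```r
M  = matrix(c(5, 1, 8, 3), nrow = 2)   # [ 5 8 ; 1 3 ], filled by column
v1 = c(-1, 3)
v2 = c(2, -1)
M %*% v1          # (19, 8): not a multiple of v1
M %*% v2          # (2, -1): equals 1 * v2
eigen(M)$values   # 7 and 1
```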

32
Q

(EXAM - VL11)

Consider the following matrix and vectors:

M =[ 4 2 ; 1 3 ]
⃗v1=(1 1 )
⃗v2=(-2 -2)

If v1 is an eigenvector of M, is v2 also an eigenvector?

[Eigenvector question. But no finding the eigenvalues. just saying “If a is an eigenvector, is b also an eigenvector?” ]
(2022)

A

Yes

any scalar multiple of an eigenvector is also an eigenvector corresponding to the same eigenvalue

If v1 is an eigenvector, then any vector of the form cv1, where c ≠ 0, is also an eigenvector of M corresponding to the same eigenvalue.

33
Q

(EXAM - VL12)

Some example of dimensionality reduction.

Is ICA superior? Did ICA get it right and how would you calculate it if it did?

(2024-2)

A

ICA considered superior in certain contexts because of ability to:
- Separate Independent Sources
- Handle Non-Gaussian Data
- No Orthogonality Constraint

To determine if ICA “got it right,”: evaluate if extracted components are truly independent, either:
- Visual Inspection
- measure stat. indep.: eg mutual information (should be minimized between components), kurtosis (higher kurtosis than gaussian signals),… -> compare w/ ground truth signals (if available)

more details:
ICA considered superior in certain contexts because of ability to:
- Separate Independent Sources: Unlike PCA, which only removes correlations (linear relationships), ICA removes both correlations and higher-order dependencies, making it ideal for separating independent signals (e.g., blind source separation).
- Handle Non-Gaussian Data: ICA assumes that the underlying sources are non-Gaussian, allowing it to separate signals that PCA might fail to distinguish.
- No Orthogonality Constraint: ICA does not require components to be orthogonal, unlike PCA, which makes it more flexible in capturing real-world data structures.

To determine if ICA “got it right”: evaluate if extracted components are truly independent, either by:
- Measuring Statistical Independence: Use metrics like mutual information, kurtosis, or negentropy to assess the independence of the extracted components. Lower mutual information or higher kurtosis indicates better separation.
- Visual Inspection: Plot the separated components and inspect whether they represent meaningful independent signals (e.g., in blind source separation problems like separating mixed audio signals).

**Calculate it if ICA “got it right”?**
Evaluate results - calculate metrics such as:
- Mutual Information: should be minimized between components.
- Kurtosis: independent components often exhibit higher kurtosis than Gaussian signals.
- Compare with ground truth signals (if available) to verify correctness.

34
Q

(EXAM - VL14)

Experiment design: What is an under-powered experiment? Which error type does it raise?

(2024-2)

A
  • Lacks enough statistical power to reliably detect true effect.
  • Statistical power = probability of correctly rejecting H0 when it is false (i.e., detecting a real effect).
  • Raises a Type II error (false negative): failing to detect a true effect.
35
Q

(EXAM - VL15)

Survival question:
Explain cox proportional hazards. How does this relate to time function or its main components ? (don’t entirely remember)

(2024-2)

A

(CHECK - NEED MORE)

Cox proportional-hazards model:

essentially a regression model commonly used for investigating the association between the survival time of patients and one or more predictor variables

Models the hazard, h(t): the probability of dying at a point in time, given survival up to that time point. In standard form:

h(t) = h₀(t) · exp(b₁x₁ + … + bₚxₚ)

where h₀(t) is the baseline hazard and the xᵢ are the predictor variables; the predictors scale the baseline hazard multiplicatively, independent of time (hence "proportional hazards").

36
Q

(EXAM - VL5)

What is Z-score? How is it calculated?

(2024-1)

A

Normalisation procedure:
- A Z-score (or standard score) → how many standard deviations a data point is from the mean of a dataset.
- data transformation → data now have a mean of 0 and an SD of 1
- ~95% of (normally distributed) data lie within z-scores of −1.96 and +1.96

Formula:
z = (x − x̄) / s
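In R, `scale()` does this in one step; a minimal check on made-up data:

```r
x = c(2, 4, 6, 8, 10)           # made-up data
z = (x - mean(x)) / sd(x)       # z-score by the formula
z_alt = as.numeric(scale(x))    # scale() gives the same result
mean(z)   # 0
sd(z)     # 1
```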

37
Q

(EXAM - VL7)

(also QUIZ 5)

You have determined the following correlation coefficient and the confidence intervals.
R = 0.3, CI 95%[0.2, 0.4] p _______?
R = 0.3, CI 95%[0.1, 0.5] p _______?
R = 0.3, CI 95%[0.0, 0.6] p _______?
R = 0.3, CI 95%[-0.1, 0.7] p _______?
(2024-2)

A

R = 0.3, CI 95% [0.2, 0.4] → p < 0.001 (narrow CI, well away from 0)
R = 0.3, CI 95% [0.1, 0.5] → p < 0.05
R = 0.3, CI 95% [0.0, 0.6] → p = 0.05 (CI just touches 0)
R = 0.3, CI 95% [-0.1, 0.7] → p > 0.05, not significant (roughly 0.2)

38
Q

(EXAM - VL)

       | Norm ~ norm | Non-norm ~ cat | Cat ~ cat | Norm ~ cat
Centre |             |                |           |
plot   |             |                |           |
test   |             |                |           |
ES     |             |                |           |

(2024-1, 2022, 2020)

A

see cheat sheet ("Spicker")

39
Q

(EXAM - VL7)

Imagine you’re doing a study comparing Covid vs Flu patients and comparing results dead or alive. What tests & table do you do? What analysis, visualisation methods etc. do you use?

(2024-1)

A

2 categorical variables:
- disease type: covid, flu
- survival: dead, alive

visualisation:
bar plot, association plot, (four fold plot)

table:
2 x 2 contingency table → margin table → independence table

test:
prop.test, chisq.test (fisher if exp. cell counts < 5)
H₀ (Null Hypothesis): Survival rates are independent of disease type.
H₁ (Alternative Hypothesis): Survival rates depend on disease type.
→ p-value and confidence interval

Effect Size:
Cohen’s h (best because it’s a 2x2 table)
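A sketch with made-up counts (hypothetical numbers, not real Covid/flu data) showing the table and test described above:

```r
tab = matrix(c(30, 70, 20, 80), nrow = 2, byrow = TRUE,
             dimnames = list(Disease = c("Covid", "Flu"),
                             Outcome = c("Dead", "Alive")))
prop.table(tab, 1)   # row-wise death/survival proportions
chisq.test(tab)      # test of independence (Yates-corrected for 2x2)
```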

write a report
A chi-square test of independence was performed to examine whether mortality rates of covid and influenza differ. The relationship between the infection and the mortality rate was significant, χ2(1,N=___) = ___, p=____. Patients with _____ were ____ %, more likely to die CI95%[___,___] (___%) than _____ patients (___%),
Cohen’s h = ____.

40
Q

(EXAM - VL7)

Imagine you’re doing a study comparing Covid vs Flu patients and comparing results dead or alive.

Why is a t-test not appropriate? Is it correlational or experimental research? (2024-1)

A

Correlational or Experimental?
- observational study, not an experiment
- no random assignment; the patients are already sick
→ correlational

Why is a t-test not appropriate?
- t-test: compare means of continuous variables (e.g., age, viral load).
- Here both variables categorical
→need tests for categorical data like χ² or logistic regression instead.

41
Q

(EXAM - VL4)

Draw an outline of an assoc plot result from penguins

(2024-1, 2022 but not sure what dataset)

42
Q

(EXAM - VL7)

Write a report for the result of the chisq.test for the relation between passengers travel class and their survival at the Titanic (Often here a template to fill in with values from the previous task.) (2024-sample) (3 points)

We hypothesize that people on the Titanic had better survival chances if they were traveling in more expensive passenger classes.

(2024-sample, 2024-1 not sure which dataset)

> cohensW = function (tab) { pe=prop.table(chisq.test(tab)$expected); po=prop.table(tab); w=sqrt(sum(((po-pe)^2)/pe)); return(w[[1]])}
> data(Titanic)
> summary(Titanic)
Number of cases in table: 2201
Number of factors: 4
Test for independence of all factors:
        Chisq = 1637.4, df = 25, p-value = 0
        Chi-squared approximation may be incorrect
> str(Titanic)
 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
 - attr(*, "dimnames")=List of 4
  ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
  ..$ Sex     : chr [1:2] "Male" "Female"
  ..$ Age     : chr [1:2] "Child" "Adult"
  ..$ Survived: chr [1:2] "No" "Yes"
> tab=Titanic[1:3,,,]
> apply(tab,c(1,4),sum)
     Survived
Class  No Yes
  1st 122 203
  2nd 167 118
  3rd 528 178
> prop.table(apply(tab,c(1,4),sum),1)
     Survived
Class        No       Yes
  1st 0.3753846 0.6246154
  2nd 0.5859649 0.4140351
  3rd 0.7478754 0.2521246
> chisq.test(apply(tab,c(1,4),sum))
        Pearson's Chi-squared test
data:  apply(tab, c(1, 4), sum)
X-squared = 133.05, df = 2, p-value < 2.2e-16
> cohensW(apply(tab,c(1,4),sum))
[1] 0.3179676
A

A chi-square test of independence was performed to examine the relation between passenger class and survival on the Titanic. There was a significant relationship between passenger class and survival, χ²(2, N = 2201) = 133.05, p < 2.2e-16. The survival rate was higher for passengers in 1st class (62.46%) compared to those in 2nd (41.40%) and 3rd class (25.21%). Cohen’s w = 0.318, indicating a moderate effect size.

43
Q

(EXAM - VL10)

Given a decision tree, explain what is going on, what is the probability of x etc.

(2024-1)

A

In short,
- probability of x determined by tracing through decision tree from root to leaf
- look at class distribution at leaf node
Subtypes:
- classification tree: probabilities give likelihood of x belonging to each class.
- regression tree: leaf node gives predicted value for x.

Structure of the Decision Tree:
- made up of nodes and branches.
- Root Node: starting point where data is split based on feature
- Internal Nodes: decision points, data is split based on conditions/features.
- Leaf Nodes: terminal points of tree, final predictions are made.
- Each node splits data based on feature, threshold.
- eg: node may split based on whether “Age > 30” or “Income < 50000”.

Interpret the Decision Tree:
- Start at root node
- Move through internal nodes, following appropriate condition
- reach leaf node -> provides predicted class (classification) or value (regression).
- (if classification, leaf node also provides class probabilities.)

Determine the Probability of x:
classification tree:
probability of observation x belonging to certain class is determined by distribution of target variable in leaf node that x falls into.
Example:
If final leaf node with 70% “Yes”, 30% “No”-> P of x being “Yes” is 0.7 P of “No” is 0.3.

How to Calculate the Probability:
Leaf Node Counts: For each leaf node, count the number of instances of each class.
Class Probabilities: The probability of each class for a given observation x is the proportion of that class in the leaf node. This can be calculated as: P(class_i | x) = (number of instances of class i in the leaf node) / (total number of instances in the leaf node)

Example Walkthrough:
Suppose you have a decision tree for predicting whether a person buys a product, with features like “Age” and “Income”.
The root node splits based on “Age > 30”.
If x’s “Age” is 25, you go to the left child, where the next node might split based on “Income < 50000”.
At the leaf node, you might find that out of 100 people who ended up in this leaf, 60 bought the product (“Yes”) and 40 didn’t (“No”).
The probability that x buys the product would be 60/100= 0.6
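The tracing described above is exactly what R's tree predictors do internally. A minimal sketch using the recommended rpart package (the iris dataset and settings are illustrative, not from the exam):

~~~
# Sketch: class probabilities from a classification tree (rpart)
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
# type = "prob" returns, per observation, the class proportions
# of the leaf node that observation falls into
head(predict(fit, iris, type = "prob"), 3)
~~~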

44
Q

(EXAM - VL14)

Experiment design: How can power be increased in a study?

(2020)

A
  • More samples
  • Larger α (but increases the chance of a type I error)
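Both levers can be seen with power.t.test from the stats package; the effect size and sample sizes below are made up for illustration:

~~~
# Power rises with n, and with a larger significance level alpha
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = 0.05)$power  # more samples
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.10)$power  # larger alpha
~~~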
45
Q

(EXAM - VL14)

Experiment design:
How to increase power without changing number of observations
How might this “fix” cause problems?

(2024-1)

A

Larger α (but increases the chance of a type I error)

46
Q

(EXAM - VL12)

Difference between PCA & ICA, when to choose ICA over PCA (specifically what kind of data)

A

Goal
- PCA: Finds directions (principal components) that maximize variance.
- ICA: Finds statistically independent components in the data.

Assumptions
- PCA: Assumes components are uncorrelated (orthogonal).
- ICA: Assumes components are statistically independent and non-Gaussian.

**Output components**
- PCA: Orthogonal and ranked by explained variance.
- ICA: Not orthogonal and not ranked.

Focus
- PCA: Captures maximum variance in the data.
- ICA: Removes both correlations and higher-order dependencies.

Data type
- PCA: Works well with Gaussian data or when variance is the focus.
- ICA: Works well with non-Gaussian data or when independence is needed.

Applications
- PCA: Dimensionality reduction, visualization, feature extraction.
- ICA: Blind source separation (e.g., separating mixed audio signals).

when to choose ICA?
Use ICA for separating independent signals or when working with non-Gaussian data requiring statistical independence.
If your analysis requires removing not only linear correlations but also higher-order dependencies, ICA is more appropriate.
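A minimal sketch contrasting the two (assumes the CRAN package fastICA is installed; the sources and mixing matrix are made up):

~~~
library(fastICA)
set.seed(1)
S <- cbind(runif(500), rexp(500))          # two non-Gaussian sources
X <- S %*% matrix(c(1, 1, 0.5, 2), 2, 2)   # observed mixed signals
pca <- prcomp(X)              # orthogonal directions of maximal variance
ica <- fastICA(X, n.comp = 2) # statistically independent components (in ica$S)
~~~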

47
Q

(EXAM - VL13)

Give one advantage of UMAP over t-SNE.

A

UMAP handles large-scale data faster thanks to its optimized graph-based computations, making it better suited for datasets with many samples than the computationally intensive t-SNE.

48
Q

(EXAM - VL11)

Given a biplot, which parameters are more related to given variable.

(2024-1)

GO TO OLD EXAM 2018 TO PRACTICE

A

GO TO OLD EXAM 2018 TO PRACTICE

Look at angle between vectors:
- Smaller angles (closer to 0°) → strong positive relationships
- angles near 180° → strong negative relationships.
- Perpendicular vectors (90°) → no relationship

Check vector length:
- Longer vectors → vars with higher variance, stronger contributions to PCs

Projection onto the variable’s vector:
- Observations or variables projected closer to direction of vector are more related to that variable
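To practice reading angles and lengths, a biplot can be generated from any multivariate dataset, e.g.:

~~~
p <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(p)  # near-parallel arrows: strong positive correlation;
           # opposite arrows: negative; perpendicular: ~no correlation
~~~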

49
Q

(EXAM - VL11)

Given a biplot, and a scree plot, which should be the second bar in scree plot.

(2024-1)

GO TO OLD EXAM 2018 TO PRACTICE

A

second bar in scree plot corresponds to eigenvalue of PC2.

This value represents the amount of variance explained by PC2.

Identify it by looking at the biplot, which shows the contributions of variables to each principal component, and match it to the second-largest eigenvalue in the scree plot.

50
Q

(EXAM - VL11)

Looking at a boxplot of data, and the biplot, was analysis scaled?

(2024-1)

GO TO OLD EXAM 2018 TO PRACTICE

A
  • Check for comparable ranges in boxplot and balanced arrow lengths in biplot to confirm scaling.
  • If ranges or arrow lengths differ significantly, data likely not scaled.

Boxplot:
- Look at the ranges of the variables in the boxplot.
- If vars scaled (e.g., standardized), boxplots will show similar ranges (centered around 0 with comparable spreads if standardized).
- If not scaled, variables with larger original ranges will dominate.

Biplot:
- In a PCA biplot, scaling affects both the scores (points) and loadings (arrows).
- If the analysis was scaled, variables with different units or variances will have comparable arrow lengths, and distances between points reflect relative relationships rather than absolute magnitudes.
- Without scaling, variables with larger variances or units will dominate, leading to disproportionately long arrows.

Why scale?
- PCA identifies directions of max variance.
- Features with larger scales dominate variance, bias results.
How to scale?
- Use standardization: mean = 0, sdev = 1.
Effect of scaling:
- Ensures all features contribute equally to PCA.
- Prevents bias toward features with larger numerical ranges.
When to scale?
- Always scale if features have different units or ranges (e.g., height vs. weight).
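In R the choice is a single argument to prcomp; mtcars is used here because its variables have very different ranges:

~~~
p_unscaled <- prcomp(mtcars)                 # covariance matrix: disp and hp dominate
p_scaled   <- prcomp(mtcars, scale. = TRUE)  # correlation matrix: equal footing
# compare how strongly variance concentrates on PC1 in the unscaled case
summary(p_unscaled)$importance[2, 1:3]
summary(p_scaled)$importance[2, 1:3]
~~~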

51
Q

(EXAM - VL14)

Experiment design:
Define/ explain what is meant by “power” in statistical hypothesis testing! (1 point)

(2024-sample, 2024-1)

A

= probability of correctly rejecting H0 when it is false (i.e., detecting a real effect).

52
Q

(EXAM - VL5)

Difference between experimental and correlational research (2022, 2020)

A

Key difference:
- experimental manipulates independent variable to observe effect on dependent variable.
- correlational measures relationship between two or more variables without manipulation.

What does each method establish?
- experimental - Establishes causation.
- correlational - Establishes association, not causation.

  • experimental - uses random assignment and control groups to reduce bias.
  • correlational - commonly uses statistical measures like Pearson's correlation coefficient (r).
53
Q

(EXAM - VL)

Looking at % explained variance from PC1 and PC2 and then saying which Scree bar plot is correct (one correct bar was given and you had to choose between a second bar, A or B)

(2022)

54
Q

(EXAM - VL12)

Explain briefly the main objective of multidimensional scaling! (1 point)

(2024-sample, 2022)

CHEAT SHEET

A

Preserves pairwise distances

as faithfully as possible

in lower dimensions
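A short sketch with classical MDS (cmdscale); the iris data is just an example:

~~~
d   <- dist(scale(iris[, 1:4]))  # pairwise distances in the original 4 dimensions
fit <- cmdscale(d, k = 2)        # 2-D configuration preserving those distances
plot(fit, col = iris$Species)
~~~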

55
Q

(EXAM - VL8 )

What is kurtosis and what happens when we have a positive Kurtosis.

(2020)

A
  • 4th moment of a distribution
  • measures how sharp or flat distribution is
  • positive kurtosis → very sharp distribution
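Excess kurtosis (the usual way it is reported: normal distribution = 0) can be computed by hand from the 4th moment; the simulated samples below are illustrative:

~~~
kurt <- function(x) {            # excess kurtosis via the 4th moment
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}
set.seed(1)
kurt(rnorm(1e5))       # ~0 for a normal distribution
kurt(rt(1e5, df = 5))  # positive: sharper peak, heavier tails
~~~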
56
Q

(EXAM - VL9)

Write a report from this analysis (code shown)
(data set of birth weight of children from smoking and non-smoking mothers)
t- test was performed
CI and p- value

(2020)

A

Tests
categorical- nominal data:
→ prop test (1-2 groups, CI and p, counts >= 5, c~c)
→ chi-square (2+ groups, only p, counts >= 5, c~c)
→ fisher (2 groups (2x2 contingency table), p, CI and odds ratio, counts >= 0, c~c)

→ how is our data distributed?
→ Shapiro-Wilk (test for normal distribution, n <= 50)
with many samples the Shapiro-Wilk test becomes significant very easily
H0 assumes normality
→ Kolmogorov-Smirnov (generalized test for any distribution, if n > 50)
checks if both samples might come from the same distribution
H0 assumes that both samples are equally distributed

normal numerical data:
→ 2 groups: t-test (n~c & n~n) (in the function you can choose paired=TRUE or not)
→ 3+ groups in c: anova (n~c)
parametric tests
correlation: pearson correlation test (n~n) (H0: true correlation =0)

non-normal, ordinal data:
→ 2 groups: wilcox (n~c) (in the function you can choose paired=TRUE or not)
→ 3+ groups: if not matched: kruskal-wallis-test (n~c)
if matched: friedman-test
non parametric tests, working on ranks
correlation: Spearman correlation test (n~non normal, n~c ordinal) (H0: true correlation =0)
Kendall = even more robust against outliers
→ on nominal data, only two ranks allowed, otherwise the correlation doesn’t work
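A hedged sketch of how such a t-test report is produced (the birth-weight numbers are simulated, not the exam's data):

~~~
set.seed(1)
bw <- data.frame(weight = c(rnorm(30, 3300, 500), rnorm(30, 3100, 500)),
                 smoke  = rep(c("no", "yes"), each = 30))
res <- t.test(weight ~ smoke, data = bw)
res$conf.int  # 95% CI of the difference in means
res$p.value
~~~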

57
Q

(EXAM - VL4)

Draw Boxplots

(used the results from the test with smoking( non-smoking mothers)

(2020)

A

draw one now!

58
Q

(EXAM - VL11)

mtcar data.
What would you choose and why: correlation or covariance matrix?

(2020)

A

if variables have different ranges or units:
→ data should be scaled to prevent bias
→ correlation matrix

if variables have same units and ranges:
→ cov matrix

59
Q

(EXAM - VL11)

(mtcar data?)
Describe and name plots: score plot, scree plot, score plot with loading

what has the most positive correlation with non American cars? (I think it was the same example we had in the lab lecture with him)

(2020)

CHECK OLD EXAM and VLS FOR IMAGES AND EXAMPLES (IRIS)

60
Q

(EXAM - VL11)

Why t-test can not be applied to survival time if patients with and without drugs

(2020)

A
  • survival times are not normally distributed (nor do they follow any simple parametric distribution)
  • censoring: the number of patients at risk keeps changing
61
Q

(EXAM - VL5)

Explain briefly the aims of inferential and descriptive statistics. (3 points)

(2024-sample)

A

Descriptive statistics:
- summarize and organize data
- providing measures such as
- center: mean, median, mode,
- spread: standard deviation, SEM
- plots: graphs to describe a dataset

Inferential statistics
- use sample data to make predictions or generalizations about larger population
- often through hypothesis testing (p-value), confidence intervals, effect size, and regression analysis.

62
Q

(EXAM - VL5)

Describe shortly in which situation you would use a boxplot, a mosaicplot and a xyplot to describe the relationship between two variables. (often now table and assign the right plot to the right combination of variables., in case of free text, not too much free text.) (3 points)

(2024-sample)

A

Boxplot
- n~c
- compare dist of numeric variable across cats

Mosaic Plot
- c~c
- visualize relationship/dependency between two cat vars

XY Plot (Scatterplot)
- n~n
- examine corr/trend between 2 numeric vars
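One line each, using ToothGrowth as a stand-in dataset:

~~~
data(ToothGrowth)
boxplot(len ~ supp, data = ToothGrowth)                # n~c
mosaicplot(~ supp + factor(dose), data = ToothGrowth)  # c~c
plot(len ~ dose, data = ToothGrowth)                   # n~n
~~~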

63
Q

(EXAM - VL9)

Shortly explain the measures to characterize the distribution of a numerical variable.
Focus on center, scatter and shape of the distribution (3 points)

(Often now an assignment task like skewness, kurtosis, what does a positive skewness value mean …)

(2024-sample)

A

Characterizing a Numerical Distribution:

**Center:** Describes the typical value.
- Mean (average)
- Median (middle value)
- Mode (most common)

Scatter: (Dispersion): Describes spread.
- Range (min–max)
- SD/Variance (spread around mean)
- IQR (middle 50%).

Shape: Describes symmetry and tail behavior.
Skewness: Asymmetry
- Positive = right tail
- Negative = left tail
Kurtosis: Tailedness
- High = more outliers
- Low = fewer outliers
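All three aspects in a few lines of base R (the exponential sample is just a right-skewed example):

~~~
set.seed(1)
x <- rexp(1000)
c(mean = mean(x), median = median(x))                # center
c(sd = sd(x), IQR = IQR(x), range = diff(range(x)))  # scatter
z <- (x - mean(x)) / sd(x)
c(skewness = mean(z^3), excess.kurtosis = mean(z^4) - 3)  # shape
~~~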

64
Q

(EXAM - VL7)

Explain and summarize the following statistical analysis. You can write your comments directly on the paper. We hypothesize that people at the Titanic had better survival changes if they were traveling in more expensive passenger classes. (2024-sample) (6 points)
~~~
> cohensW = function (tab) { pe=prop.table(chisq.test(tab)$expected); po=prop.table(tab); w=sqrt(sum(((po-pe)^2)/pe)); return(w[[1]])}
> data(Titanic)
> summary(Titanic)
Number of cases in table: 2201
Number of factors: 4
Test for independence of all factors:
Chisq = 1637.4, df = 25, p-value = 0
Chi-squared approximation may be incorrect
> str(Titanic)
‘table’ num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 …
- attr(*, “dimnames”)=List of 4
..$ Class : chr [1:4] “1st” “2nd” “3rd” “Crew”
..$ Sex : chr [1:2] “Male” “Female”
..$ Age : chr [1:2] “Child” “Adult”
..$ Survived: chr [1:2] “No” “Yes”
> tab=Titanic[1:3,,,]
> apply(tab,c(1,4),sum)
Survived
Class No Yes
1st 122 203
2nd 167 118
3rd 528 178
> prop.table(apply(tab,c(1,4),sum),1)
Survived
Class No Yes
1st 0.3753846 0.6246154
2nd 0.5859649 0.4140351
3rd 0.7478754 0.2521246
> chisq.test(apply(tab,c(1,4),sum))
Pearson’s Chi-squared test
data: apply(tab, c(1, 4), sum)
X-squared = 133.05, df = 2, p-value < 2.2e-16
> cohensW(apply(tab,c(1,4),sum))
[1] 0.3179676
~~~

A

(check exam pdf)

- cohensW function: preparation for effect size measurement
- data(Titanic): load data
- summary/str: investigate internal variables (4 factors, 2201 cases)
- contingency table
- tab = Titanic[1:3,,,]: extract data subset (passenger classes only, no crew)
- prop.table(..., 1): proportions row-wise
- counts >= 5 in each cell → chi-square test applicable
- chisq.test: significant (p < 2.2e-16)
- Cohen's w = 0.318 → medium effect size

see summary

65
Q

(EXAM - VL11)

For a dataset consisting of 40 samples containing metabolite level data of 100 metabolites each, you wish to generate a PCA-score plot that shows the 40 samples in the space of the Principal Components (PCs) defined by the 100 metabolites.

How many PCs with non-zero eigenvalue can you expect to obtain from the PCA computations? (1 point)

(2024-sample)

A

39

(min(n − 1, p) = 39: PCA centers the data, so the 40 samples span at most a 39-dimensional subspace of the 100-dimensional metabolite space; the 40th eigenvalue is zero)
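This can be checked directly on simulated data of the same shape:

~~~
set.seed(1)
X <- matrix(rnorm(40 * 100), nrow = 40)  # 40 samples x 100 metabolites
p <- prcomp(X)
sum(p$sdev > 1e-8)  # 39: centering removes one dimension from the 40 samples
~~~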

66
Q

(EXAM - VL11)

The table below shows an excerpt of R-dataframe “mtcars”, holding information on various car types. In total, 32 different cars are contained in the dataset. Each car is characterized by the following 11 variables:
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburettors
(2024-sample)

You wish to perform a PCA-analysis to visualize the different cars and to understand, by which features the cars can be separated. Assume that you want to understand whether or not American-built cars are different from non-American cars.

**What type of PCA would you perform on the data: scaled data (i.e. using the correlation matrix), or unscaled data (using the data as shown in the table, corresponding to using a co-variance matrix as input). Explain! (2 points) **

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

(2024-sample)

A

variables have different ranges and units
→ scaled data (i.e. using the correlation matrix)

67
Q

(EXAM - VL11)

GO TO SAMPLE EXAM AND DO THIS

The table below shows an excerpt of R-dataframe “mtcars”, holding information on various car types. In total, 32 different cars are contained in the dataset. Each car is characterized by the following 11 variables:
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburettors
(2024-sample)

You wish to perform a PCA-analysis to visualize the different cars and to understand, by which features the cars can be separated. Assume that you want to understand whether or not American-built cars are different from non-American cars.

**The Figures below show the result of a PCA-analysis.
i. Label the plots A), B), and C) (one label only!) AND explain briefly, what they show! Choose labels from the set of the following options: “scatter plot”, “scree plot”, “star plot”, “biplot”, “score plot” (3 points) (2024-sample)
ii. How many Principal Components (PCs) would you consider relevant for the characterization of cars given the set of features? Which plot of the three below is relevant for answering this question and why? (3 points) (2024-sample)
iii. Name a feature that
- is negatively correlated with American-built cars, i.e. which feature/s has/have smaller values in American cars than in non-American cars: …………………….
- Is not informative with regard to distinguishing American cars from other car types: …………………….
- Has a relatively large loading on PC2: ………………………. (3 points)

**

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

(2024-sample)

A

GO TO SAMPLE EXAM AND DO THIS

i. Label the plots and explain
A) scree plot
B) score plot
C) biplot

ii.
screeplot is relevant

iii. Name a feature that
- is negatively correlated with American-built cars, i.e. which feature/s has/have smaller values in American cars than in non-American cars: …………………….
- Is not informative with regard to distinguishing American cars from other car types: …………………….
- Has a relatively large loading on PC2: ………………………. (3 points)

68
Q

(EXAM - VL13)

Fill in the following text.

Select from the following terms (more terms than needed to fill in the text, terms may occur several times or not at all): Gaussian, non-Gaussian, Kurtosis, variance, linear, correlation, non-linear, exponential, independence, mutual information, standard deviation. (3.5 points)

“For the analysis of high-dimensional data, a number of projection methods have been established. PCA performs a ……..……………. transformation and projection of the data, while tSNE can create meaningful projections of data# into lower dimensions, even if they can only be separated by a …………………. distance metric. Unlike PCA, in which dimensions are sought that explain the maximal ………………………., ICA (independent component analysis) searches for directions that maximize ……………………. between the new dimensions. ICA yields more meaningful projection results than PCA, if the underlying data is NOT distributed according to a ……………….…… distribution. As a metric, of how different from a ………………….. a distribution is, the …………….…….…. can be used.”

(2024-sample)

A

“For the analysis of high-dimensional data, a number of projection methods have been established. PCA performs a linear transformation and projection of the data, while tSNE can create meaningful projections of data into lower dimensions, even if they can only be separated by a non-linear distance metric. Unlike PCA, in which dimensions are sought that explain the maximal variance, ICA (independent component analysis) searches for directions that maximize independence between the new dimensions. ICA yields more meaningful projection results than PCA, if the underlying data is NOT distributed according to a Gaussian distribution. As a metric, of how different from a Gaussian a distribution is, the kurtosis can be used.”

69
Q

(EXAM - VL14)

Experiment design:
Which experiment-design factors can contribute to low power? Name at least one! (1 points)

(2024-sample)

A
  • Small sample size – Not enough data to detect effects.
  • Larger α (but increases the chance of a type I error)
  • Generally: test-type (parametric/non-parametric)
    chatgpt extra:
  • Weak manipulation – If experimental conditions are too similar, differences won’t stand out.
  • Improper statistical test – Using a test with low sensitivity.
  • Measurement error – Inaccurate data collection reduces precision.
70
Q

(EXAM - VL14)

Experiment design: Explain qualitatively the term “effect size”! (1 point)

(2024-sample)

A

DW:
* magnitude of difference relative to average standard deviation (→relevance of effect, t-statistic→ significance of effect)
Chatgpt:
* - measure of strength/magnitude of relationship, difference, or effect in experiment,
* - independent of sample size.
* - tells how meaningful result is, rather than just whether stat. significant.
Examples:
* - medical trial: ES shows how much drug improves symptoms vs placebo.
* education: ES → how much new teaching method improves perf.

71
Q

(EXAM - VL5)

Shortly explain three approaches to ensure representative sampling for a statistical analysis.

(2018)

A

Simple Random Sampling
- Every member of pop has equal chance of being selected.
- Use when pop is homog., complete list of members available

Stratified Sampling
- Divide pop into subgroups (strata), sample proportionally.
- Use when pop is heterog, -> ensure all subgroups are represented.

Systematic Sampling
- Select every kth member from random list, random start point
- Use when pop is ordered, need evenly dist. sample w/o bias.
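Sketches of the three schemes in base R (population and sample sizes are made up):

~~~
set.seed(1)
pop <- data.frame(id = 1:100, stratum = rep(c("A", "B"), c(60, 40)))
srs   <- pop[sample(nrow(pop), 10), ]                  # simple random sampling
strat <- do.call(rbind, lapply(split(pop, pop$stratum), function(s)
           s[sample(nrow(s), ceiling(0.1 * nrow(s))), ]))  # stratified, 10% per stratum
sys   <- pop[seq(sample(10, 1), nrow(pop), by = 10), ] # systematic, random start
~~~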

72
Q

(EXAM - VL)

Which of the following P value and CI relations for the mean of two groups and the difference between those two means are not consistent? Explain shortly and write down a more consistent P value on the right.

Gene / Mean Control / Mean Treatm. / CI 95% / P-value
G1 / 13 / 11 / 0.5 to 4.0 / 0.02
G2 / 21 / 26 / -9.0 to 0.2 / 0.07
G3 / 22 / 24 / -0.3 to -4.2 / 0.09
G4 / 802 / 744 / 23 to 125 / 0.13

(2018)

A

For G3 and G4, the P-values should be adjusted to values below 0.05, consistent with their confidence intervals excluding zero.

use rule 95% CI … :
- should not include null value if P-value < 0.05,
- should include null value if P-value ≥ 0.05.
(null value eg 0 for mean difference)

Gene  Mean C  Mean T  CI (95%)      P     Consistency?  Correct P
G1    13      11      0.5 to 4.0    0.02  Consistent    —
G2    21      26      -9.0 to 0.2   0.07  Consistent    —
G3    22      24      -0.3 to -4.2  0.09  Inconsistent  Should be < 0.05
G4    802     744     23 to 125     0.13  Inconsistent  Should be < 0.05

Explanation:
- G1: CI (0.5 to 4.0) does not include 0, and p < 0.05 -> consistent.
- G2: CI (-9.0 to 0.2) includes 0, p ≥ 0.05 -> consistent.
- G3: CI (-0.3 to -4.2) does not include 0, but p ≥ 0.05 -> inconsistent because non-zero CI implies statistical significance (P < 0.05). correct p: < 0.05.
- G4: CI (23 to 125) does not include 0, but p ≥ 0.05 -> inconsistent because non-zero CI implies statistical significance (P < 0.05). correct p: < 0.05.


73
Q

(EXAM - VL8)

You would like to describe the PlantGrowth data table presented below. What numeric measures and which graphical visualizations would you use in order to describe this data. Explain in both for the single variables and for the two variables together. (2018)

You were investigating two light schedule treatments (trt 1, trt2) against normal light conditions, 12 hours of continuous light (ctrl), on the daily dry weight increments of plants.
~~~
> data(PlantGrowth)
> dim(PlantGrowth)
[1] 30 2
> head(PlantGrowth,n=3)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
> with(PlantGrowth, aggregate(weight, by=list(group), max))
  Group.1    x
1    ctrl 6.11
2    trt1 6.03
3    trt2 6.31
> PlantGrowth[PlantGrowth$weight > quantile(PlantGrowth$weight, 0.9), ]
   weight group
4    6.11  ctrl
21   6.31  trt2
28   6.15  trt2
~~~

A

Single variables:
Numeric Measures:
- Weight: mean, median, sdev, min, max, quartiles.

Graphical Visualizations:
- weight: histogram or boxplot to visualize distributions
- group: bar plot to show frequency of each group.

Two Variables Together
Numeric Measures:
- Compare summary statistics (e.g., mean, max) of weight across different groups.
- eg :

aggregate(weight ~ group, data = PlantGrowth, FUN = mean)

Graphical Visualizations:
- boxplot to compare dist of weights across groups (weight ~ group). eg:
boxplot(weight ~ group, data = PlantGrowth)
74
Q

(EXAM - VL6)

You found a new species of lady beetles, Coccinella golmensis, which have either black or blue points.
Your hypothesis is that males have blue points more often than females - blue points look nicer and might impress the females and therefore give males with blue points mating advantage.
You collect 35 male beetles with blue and 25 with black points, and you have further found 20 female beetles with blue and also 20 with black points.
Create a contingency table and explain how to calculate expected numbers of your collection, ie assuming that there are no differences in point colors for the two sexes of your species.
How many males with black points would you expect if there would be no effect in color distribution?

(2018)

A
----------------------------------------------------------------
             Black        Blue        Total
----------------------------------------------------------------
Male    |     25     |     35     |     60
Female  |     20     |     20     |     40
----------------------------------------------------------------
Total   |     45     |     55     |    100
----------------------------------------------------------------

To get expected value: Contingency table → Margin table → independence table
Independence table contains expected number: Rowtotal * Columntotal / Total:
Male with black points: 60 * 45 / 100 = 27
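R computes the whole independence table the same way:

~~~
tab <- matrix(c(25, 35, 20, 20), nrow = 2, byrow = TRUE,
              dimnames = list(Sex = c("Male", "Female"),
                              Points = c("Black", "Blue")))
chisq.test(tab)$expected  # row total * column total / grand total per cell
# Male/Black: 60 * 45 / 100 = 27
~~~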

75
Q

(EXAM - VL)

Which measure characterizes the deviation from the expected values in an independence table, and how is it calculated?

(2018)

A

pearson residuals

The formula to calculate the Pearson residuals for every cell of a contingency table is: (observed - expected) / sqrt(expected)

Chi-square values → chi-square statistic
The chi-square statistic is calculated with a similar formula (without the sqrt): the squared Pearson residuals are summed over every cell.
Higher values of this measure are more likely to produce low p-values than lower values.

A larger Chi-Square value indicates a greater deviation from expected values, suggesting a stronger association between sex and point color.

If chi square is small, the observed values are close to the expected values, indicating little to no association.
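Both quantities are available from chisq.test (the beetle table reused as an example):

~~~
tab <- matrix(c(25, 35, 20, 20), nrow = 2, byrow = TRUE)
res <- chisq.test(tab, correct = FALSE)
res$residuals  # Pearson residuals: (observed - expected) / sqrt(expected)
res$statistic  # chi-square statistic: sum of the squared residuals
~~~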

76
Q

(EXAM - VL4)

You found a new species of lady beetles, Coccinella golmensis, which have either black or blue points.
You collect 35 male beetles with blue and 25 with black points, and you have further found 20 female beetles with blue and also 20 with black points.
You create a contingency table:
~~~
----------------------------------------------------------------
             Black        Blue        Total
----------------------------------------------------------------
Male    |     25     |     35     |     60
Female  |     20     |     20     |     40
----------------------------------------------------------------
Total   |     45     |     55     |    100
----------------------------------------------------------------
~~~

**Draw a plot to visualize the relationship between sex and point color for the lady beetles. **

(2018)

A
  • association plot
  • mosaicplot

assocplot: the area of each box reflects the size of the Pearson residual in that cell; boxes above the baseline mark cells with more observations than expected (positive residuals), boxes below the baseline mark cells with fewer than expected (negative residuals).

77
Q

(EXAM - VL7-9)

Explain why reporting as well effect sizes and not only P values is important.

Explain further shortly one effect size measure to report the difference between the means of two groups.

(2018)

A

Why Report Effect Sizes Alongside P-Values?
- P-values indicate whether effect exists but do not show magnitude of effect.
- can be misleading, especially with large sample sizes (small effects can still be statistically significant)
- ES provide info about practical significance of result, helping to understand how meaningful difference or relationship is in real-world terms
- Reporting effect sizes allows for better comparison across studies and supports meta-analyses

Effect Size Measure: Cohen’s d
- commonly ES measure, compare means of two groups
- Formula: d = (μ_1 − μ_2) / σ_pooled (σ_pooled = pooled sdev)
- Interpretation: d = 0.2 → small; 0.5 → medium; 0.8 → large effect
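A hand-rolled version in a few lines (the group data are simulated for illustration):

~~~
cohens_d <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  sp <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))  # pooled SD
  (mean(x) - mean(y)) / sp
}
set.seed(1)
cohens_d(rnorm(50, mean = 0.5), rnorm(50))  # roughly a "medium" effect
~~~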

78
Q

(EXAM - VL9)

Explain the following analysis and give the final result. Did one of your light treatments give significantly higher dry weight gains than your control?
~~~
> head(PlantGrowth,3)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
> mtest = function (x) { return(shapiro.test(x)$p.value) }
> with(PlantGrowth, aggregate(weight, by=list(group), mtest))
  Group.1         x
1    ctrl 0.7474734
2    trt1 0.4519440
3    trt2 0.5642519
> with(PlantGrowth, aggregate(weight, by=list(group), mean))
  Group.1     x
1    ctrl 5.032
2    trt1 4.661
3    trt2 5.526
> aov.res = with(PlantGrowth, aov(weight ~ group))
> summary(aov.res)
            Df Sum Sq Mean Sq F value Pr(>F)
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> with(PlantGrowth, pairwise.t.test(weight, group))
	Pairwise comparisons using t tests with pooled SD
data: weight and group
     ctrl  trt1
trt1 0.194 -
trt2 0.175 0.013
P value adjustment method: holm
~~~

(2018)

A

2 vars:
- weight: Dry weight gains of plants (continuous variable).
- group: Light treatments (ctrl, trt1, and trt2).

Normality Test:
- Shapiro-Wilk test applied: check if weight data in each group follows normal distribution.
- Results: p-values show all groups have p > 0.05 → assumption of normality satisfied.

Group Means:
- aggregate used to calculate mean dry weight gain per group
- trt2 has the highest weight gain, followed by ctrl, then trt1.

ANOVA Test:
- one-way ANOVA performed: test whether significant diffs in mean weight gains across 3 groups.
- Results: F-value = 4.846, p = 0.0159. p < 0.05→ significant diff in mean weight gain between at least two groups.

Post-hoc Pairwise Comparisons:
Pairwise t-tests w/ Holm adjustment to identify which groups differ significantly:
ctrl vs trt1: p = 0.194 (not significant).
ctrl vs trt2: p = 0.175 (not significant).
trt1 vs trt2: p = 0.013 (significant).

FINAL RESULT
- ANOVA indicates significant diff in mean dry weight gains across 3 groups (p = 0.0159).
- Post-hoc tests reveal trt2 has significantly higher weight gains than trt1 (p = 0.013).
- However, neither treatment shows stat. significant diff compared to ctrl
→ Thus, trt2 outperformed trt1, but did not significantly outperform ctrl in terms of dry weight gains.

79
Q

(EXAM - VL11)

Dimensionality reduction/PCA, Principal Component Analysis
For the three Iris species (Iris setosa, Iris virginica, Iris versicolor), four morphological parameters (Petal (dt. Blütenblatt) Length and Width, Sepal (dt. Kelchblatt) Length and Width, all in the same length units – cm) were measured for 50 specimens each. Figure 1 shows the PCA-score plot (PC1 vs PC2) of the dataset. Figure 2 shows the associated loadings.

**Which variable (morphological parameters) is least discriminatory between Iris species and how do you arrive at your answer? **

(2018)

A

variable least discriminatory between Iris species can be identified by analyzing PCA loadings

variable with smallest absolute loadings across PC1 and PC2 is least discriminatory.

80
Q

(EXAM - VL11)

LOOK AT OLD EXAM

Dimensionality reduction/PCA, Principal Component Analysis
For the three Iris species (Iris setosa, Iris virginica, Iris versicolor), four morphological parameters (Petal (dt. Blütenblatt) Length and Width, Sepal (dt. Kelchblatt) Length and Width, all in the same length units – cm) were measured for 50 specimens each. Figure 1 shows the PCA-score plot (PC1 vs PC2) of the dataset. Figure 2 shows the associated loadings.

**Which two parameters are correlated the most between each other and how did you arrive at this answer? **

(2018)

A

LOOK AT OLD EXAM

  • Look for vars with similar directions in loadings plot (i.e., arrows point in nearly the same direction).
  • the closer two arrows are in angle (small angle between them), the stronger their correlation.
  • If two variables have loadings of similar magnitude and sign on both PC1 and PC2, they are likely highly correlated.
81
Q

(EXAM - VL11)

Dimensionality reduction/PCA, Principal Component Analysis

For the three Iris species (Iris setosa, Iris virginica, Iris versicolor), four morphological parameters (Petal (dt. Blütenblatt) Length and Width, Sepal (dt. Kelchblatt) Length and Width, all in the same length units – cm) were measured for 50 specimens each. Figure 1 shows the PCA-score plot (PC1 vs PC2) of the dataset. Figure 2 shows the associated loadings.

**PCA can be based on the covariance matrix (unscaled data) or the correlation matrix (scaled data), where scaling means to transform all data to zero mean, unit standard variation. Which option would you chose for the Iris data and why? **

(2018)

A

The data should be scaled to prevent bias, as the variables have different ranges
→ correlation matrix