Exam Questions Flashcards

1
Q

(EXAM - VL1)

What is S3 in R? Explain.

(2024-2)

A
  • simpler / informal / lightweight / more flexible OOP system
  • generic functions → objects have different behaviours based on their class
  • Unlike S4: does not enforce strict definitions of objects/methods.
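A minimal S3 sketch (the names `area`, `circle`, `square` are made up for illustration): a generic function dispatches on the object's class attribute, so the same call behaves differently per class.

```r
area = function(shape) UseMethod("area")        # generic function
area.circle = function(shape) pi * shape$r^2    # method for class "circle"
area.square = function(shape) shape$side^2      # method for class "square"

c1 = structure(list(r = 2), class = "circle")
s1 = structure(list(side = 3), class = "square")
area(c1)   # dispatches to area.circle
area(s1)   # dispatches to area.square
```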
2
Q

(EXAM - VL2)

Write a function that calculates the waist-to-hip ratio (as a new measurement instead of BMI); return the result, where the user can choose how many decimal places he/she wants, otherwise the default should be 3 digits.

(2024-2, 2020 similar)

A
wth = function(waist, hip, digits = 3) {
  whr = waist / hip
  return(round(whr, digits))
}

(could add a check that hip > 0)

3
Q

(EXAM - VL2)

Write a function that calculates the BMI.
BMI = kg/m²

(2020)

A
bmi = function(w, h, d = 2) {    # w = weight in kg, h = height in m
    bmi = w / (h^2)
    return(round(bmi, d))
}

(could add a check that height is > 0)

4
Q

(EXAM - VL2)

Write a function tmean which calculates a trimmed mean by removing the highest and lowest value of a vector and thereafter calculating the mean of the remaining vector elements. Return the normal mean if only one or two values are given. You can ignore the NA problem, but you are not allowed to use the mean function of R :( (NA problem: bonus points)

(2024-1, 2024-sample)

A
tmean = function(x) {
    n = length(x)
    if (n <= 2) {
        return(sum(x) / n)
    }
    x_sorted = sort(x)            # sort, then drop min and max
    x_trim = x_sorted[-c(1, n)]
    return(sum(x_trim) / length(x_trim))
}
5
Q

(EXAM - VL2)

Write a function gmean which calculates the geometric mean of a vector. You can ignore the NA problem. Below is the formula for the geometric mean.

(2018)

A
gmean = function(v) {
    n = length(v)
    p = prod(v)
    return(p^(1/n))
}
6
Q

(EXAM - VL2)

What is the three-dot (…) operator in R? (2024-1)

(VL2)

A

Three dots, or “ellipsis” argument

Used to allow a function to accept additional arguments without explicitly defining them in the function signature.

→ makes functions more flexible and adaptable to different situations.
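A small sketch (the function name `robust_mean` is made up) showing how `...` forwards any extra arguments to an inner call:

```r
# `...` collects extra arguments and passes them straight to mean()
robust_mean = function(x, ...) {
  mean(x, ...)   # e.g. na.rm = TRUE or trim = 0.1 pass through unchanged
}
robust_mean(c(1, 2, NA))                # NA (no extra arguments)
robust_mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
```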

7
Q

(EXAM - VL2)

Write a function is_even() that returns TRUE if a number is even and FALSE if it’s odd.

(ct)

A
is_even = function(x) {
  return(x %% 2 == 0)
}
8
Q

(EXAM - VL2)

Write a function factorial_custom() that calculates the factorial of a number without using factorial().

(ct)

A
factorial_custom = function(n) {
    if (n == 0) {
      return(1)
    } else {
      result = 1
      for (i in 1:n) {
        result = result * i
      }
      return(result)
    }
}
9
Q

(EXAM - VL2)

Annotate this code:

gentoo=as.data.frame(penguins[penguins$species=="Gentoo",])
dim(gentoo)
cWeight=cut(adelie$body_mass_g,breaks=quantile(adelie$body_mass_g,c(0,1/3,2/3,1),na.rm=TRUE),include.lowest=TRUE)
table(cWeight)
table(cWeight,adelie$sex)
...

(2024-1)

A

gentoo=as.data.frame(penguins[penguins$species=="Gentoo",])
→ filters the penguins data to the Gentoo species and converts the result into a data frame

dim(gentoo)
→ Display number of rows, columns of Gentoo subset

cWeight=cut(adelie$body_mass_g,breaks=quantile(adelie$body_mass_g,c(0,1/3,2/3,1),na.rm=TRUE),include.lowest=TRUE)
→ categorises Adelie penguins’ body mass into 3 weight groups based on quantiles (note: the code switches from the gentoo subset to an adelie one)
→ include.lowest=TRUE ensures the lowest value is included in the first category
→ numeric to categorical

table(cWeight)
→ create frequency table showing count of observations in each weight category

table(cWeight,adelie$sex)
→ create contingency table showing count of observations for each weight category, split by sex

10
Q

(EXAM - VL2)

Fill in the blanks:
~~~
______(palmerpenguins)
______(penguins)
with(penguins,_______(body_mass_g ~ sex*species,col=c("salmon","skyblue")))
__________(body_mass_g ~ sex + species, data = penguins, FUN = mean, na.rm = TRUE)
~~~

(2024-1)

[what does this do?]

A

aggregate - calculates Mean Body Mass by Sex & Species

_library_(palmerpenguins)
_data_(penguins)
with(penguins, _boxplot_(body_mass_g ~ sex*species,col=c("salmon","skyblue")))
_aggregate_(body_mass_g ~ sex + species, data = penguins, FUN = mean, na.rm = TRUE)

[creates a boxplot to visualise the spread of body weights per sex and species]

11
Q

(EXAM - VL2)

The Titanic data set contains data about the Titanic from 1912. Given are different categories and the survival of passengers and crew members.

What does the following R code mean? Explain the commands and the output:

> options(width=70)

(This is now more of a command-assignment task: you have to place commands like names, str, dim, data into empty command fields.)

(2024-sample)

A

→ sets max number of chars / line when displaying output in console.

Why Use It?
- controls how wide printed output appears in console.
- helps format long outputs (e.g., df, lists, matrices) -> prevent wrapping in messy way.
- working with narrow terminal windows / printing wide tables.

12
Q

(EXAM - VL2)

The Titanic data set contains data about the Titanic from 1912. Given are different categories and the survival of passengers and crew members.

What does the following R code mean? Explain the commands and the output:

> ftable(Titanic[1:3,,,])

(This is now more of a command-assignment task: you have to place commands like names, str, dim, data into empty command fields.)

(2024-sample)

A

ftable()
→ creates a flat contingency table (instead of displaying the multi-level format)

Titanic[1:3,,,]
→ selects the first 3 levels of the "Class" variable (i.e., 1st, 2nd, 3rd), keeps all levels of the other dims.

13
Q

(EXAM - VL2)

The Titanic data set contains data about the Titanic from 1912. Given are different categories and the survival of passengers and crew members.

What does the following R code mean? Explain the commands and the output:

> names(dimnames(Titanic))
[1] "Class"    "Sex"      "Age"      "Survived"

(This is now more of a command-assignment task: you have to place commands like names, str, dim, data into empty command fields.)

(2024-sample)

A

dimnames(Titanic)
→ retrieves the dim names (or labels) of Titanic dataset.

names(dimnames(Titanic))
→ extracts just names of these dims, returning:

[1] "Class"    "Sex"      "Age"      "Survived"
→ dataset is structured as a 4D table with these categories

Titanic is a 4D contingency table → has 4 cat dims:
“Class” → Passenger class (1st, 2nd, 3rd, Crew)
“Sex” → Male, Female
“Age” → Child, Adult
“Survived” → No, Yes

14
Q

(EXAM - VL2)

You are analyzing the flipper lengths of Adelie and Chinstrap penguins. Fill in the missing parts (bold) in the R code below:

library(palmerpenguins)  
\_\_\_\_\_\_(penguins)  
adelie_chinstrap = penguins[penguins$species \_\_\_\_\_ c("Adelie", "Chinstrap"), ]  
boxplot(flipper_length_mm ~ \_\_\_\_\_\_\_\_\_\_ * sex, data=adelie_chinstrap, col=c("lightblue", "pink"))  
_aggregate_(flipper_length_mm ~ species, data=adelie_chinstrap, \_\_\_\_\_\_, \_\_\_\_\_\_)  

Annotate code and summarise the findings shortly (2022)
More code, fill in the gaps in code (2022)
describe code (2020)
Fill in the gaps on the code below using these R commands Options, by, colnames, TRUE, dim, read.table, with, (2024-2)

(ct)

A
library(palmerpenguins)  
_data_(penguins)  
adelie_chinstrap = penguins[penguins$species _%in%_ c("Adelie", "Chinstrap"), ]  
boxplot(flipper_length_mm ~ _species_ * sex, _data_=adelie_chinstrap, col=c("lightblue", "pink"))  
aggregate(flipper_length_mm ~ species, data=adelie_chinstrap, _mean_, _na.rm=TRUE_)  
15
Q

(EXAM - VL2)

The dataset mtcars contains data on different car models.

The following R code filters cars with more than 6 cylinders (mtcars$cyl) and calculates the max horsepower (mtcars$hp) by number of gears (mtcars$gears). Fill in the blanks:

\_\_\_\_\_\_(mtcars)  
high_cyl = mtcars\_\_\_\_\_\_   
aggregate(\_\_\_\_\_\_, data=high_cyl, \_\_\_\_\_\_)  

Annotate code and summarise the findings shortly (2022)
More code, fill in the gaps in code (2022)
describe code (2020)
Fill in the gaps on the code below using these R commands Options, by, colnames, TRUE, dim, read.table, with, (2024-2)

(ct)

A
_data_(mtcars)  
high_cyl = mtcars_[mtcars$cyl > 6, ]_ 
dim(high_cyl)  
aggregate(_hp ~ gear_, data=high_cyl, _max_)  
16
Q

(EXAM - VL2)

You were investigating two light schedule treatments (trt 1, trt2) against normal light conditions, 12 hours of continuous light (ctrl), on the daily dry weight increments of plants.

Please explain the R analysis below and the final result.

> data(PlantGrowth) 
> dim(PlantGrowth)  
[1] 30 2  
> head(PlantGrowth,n=3)  
 Weight group  
1 4.17 ctrl  
2 5.58 ctrl  
3 5.18 ctrl  
> with(PlantGrowth, aggregate(weight,by=list(group),max))
  Group.1    x
1 ctrl 6.11  
2 trt1 6.03  
3 trt2 6.31  
> PlantGrowth[PlantGrowth$weight>quantile(PlantGrowth$weight,0.9),]
   weight group
4 6.11 ctrl  
21 6.31 trt2  
28 6.15 trt2  

(2018)

A

with: saves having to write PlantGrowth again and again

> data(PlantGrowth) # loads the dataset
> dim(PlantGrowth)  # displays dims of dataset -> rows, cols
[1] 30 2  
> head(PlantGrowth,n=3)  # displays first 3 rows of the dataset 
 Weight group  
1 4.17 ctrl  
2 5.58 ctrl  
3 5.18 ctrl  
> with(PlantGrowth, aggregate(weight,by=list(group),max))
# aggregate: get summary of numeric data, computes max weight of each treatment group    
Group1 x  
1 ctrl 6.11  
2 trt1 6.03  
3 trt2 6.31  
> PlantGrowth[PlantGrowth$weight>quantile(PlantGrowth$weight,0.9),]
# Identifies plants with weights greater than 90th percentile (i.e., top 10%)   
Weight group  
4 6.11 ctrl  
21 6.31 trt2  
28 6.15 trt2  

Final Result
trt2 has highest recorded weight (6.31).

top 10% of weights include more trt2 plants -> trt2 might have had stronger effect on growth than other conditions.

17
Q

(EXAM - VL3)

Describe how you would transform a quantitative variable into a qualitative one with around equal sized classes. Explain shortly why such approach could be useful. (3 points)

(2024-2, 2024-sample)

A
  • Sort data
  • Define # of bins (e.g., 3 or 4).
  • Divide range into equal-sized intervals (e.g., using quantiles).
  • Label bins (e.g., “Low”, “Medium”, “High”).
    in R:
    cut() function, assign levels with function
    ~~~
    data_cat <- cut(data, breaks = 3, labels = c("Low", "Medium", "High"))
    ~~~
    (breaks = 3 alone gives equal-width intervals; for a roughly equal-sized split → use quantile breaks)

Why it’s useful:
- Simplifies interpretation.
- Facilitates comparisons (e.g., with chi-square tests).
- Handles non-normal data.
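A quantile-based sketch (made-up data) that yields roughly equal-sized classes, matching the approach described above:

```r
set.seed(1)
x = rnorm(90)                      # made-up continuous data
x_cat = cut(x,
            breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
            labels = c("Low", "Medium", "High"),
            include.lowest = TRUE)
table(x_cat)   # roughly 30 observations per class
```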

18
Q

(EXAM - VL3)

1- save
2- saveRDS
3- dev.copy2pdf
4- write.ftable
5- write.table
6- save.image
7- savehistory
?

(2024-2)

A

save – Saves multiple R objects to file in binary format (.RData).

saveRDS – Saves single R object to file in binary format (.rds), allowing selective loading.

dev.copy2pdf – Copies current graphics device output to PDF file.

write.ftable – Writes flat contingency table (ftable) to a text file.

write.table – Exports df or matrix to a text file (CSV-like).

save.image – Saves entire current R workspace (all objects) to .RData file.

savehistory – Saves command history to a file (.Rhistory)

19
Q

(EXAM - VL3)

Explain similarities and differences between the R commands
- data
- read.table
- source
- load (3 points)

(This is now more an a, b, c, d, e assignment task!)

(2024-sample)

A

Similarities:
All these commands are used to import data into R for analysis.

Differences:
- data(): Loads datasets bundled with R or packages
- read.table(): Reads external text files (CSV, tab-delimited ) into R as df .
- source(): Executes R script file
- load(): Loads R objects (saved as .RData or .rda files) into environment.

20
Q

(EXAM - VL3)

Describe what these do:

  • table
  • readRDS
  • data.frame

(2022)

A
  • table - creates a frequency table of categorical variables.
  • readRDS - Reads a single R object saved in .rds format into R.
  • data.frame - creates a data frame, a table-like structure for storing data in R.
21
Q

(EXAM - VL3)

How would you transform a categorical variable into numerical? How such an approach might be useful?

(2024-2)

A

[probably they mean the other way around? but:]

df$category_num = as.numeric(as.factor(df$category))

why useful?
- Enables statistical and machine learning models to process categorical data.
- Helps in finding patterns and relationships in data.
- Allows numerical operations like computing correlations.

22
Q

(EXAM - VL3)

what is
- save
- save.image
- saveRDS

(2022, 2020)

A
  • save: saves R objects to a file, typically in .RData or .rda format
  • save.image: Saves entire current R workspace (all objects) to .RData file.
  • saveRDS: saves a single R object to a file in .rds format.
23
Q

(EXAM - VL3)

Describe how you would transform a quantitative variable into a qualitative one with around equal sized classes. Explain shortly why such approach could be useful. (2024-sample) (3 points)

A

(probably mean the other way around? but here we go)

Using numeric encoding: Assign each category a unique number (e.g., “Low” = 1, “Medium” = 2, “High” = 3).
In R, use as.numeric(factor(variable)) to convert categorical values to numbers.

Using one-hot encoding: Convert each category into a binary column (0 or 1).
In R, use model.matrix(~ variable - 1) for one-hot encoding.

Usefulness:
Enables use of categorical data in machine learning algorithms that require numerical input.
Makes it easier to calculate statistics like correlation or regression when dealing with categorical data.

24
Q

(EXAM - VL5)

Explain Will Rogers Phenomenon (2024-2, 2024-1)

A

Will Rogers: “When the Okies left Oklahoma and moved to California, they raised average intelligence in both states.” (Due to Feinstein et al. (1985).)

CHATGPT:

  • moving an individual from one group to another
  • → can raise the AVG of both groups
  • even though no actual improvement has happened.

🔹 Example:
Imagine two classes:
- Low achievers: Average grade = 50
- High achievers: Average grade = 80

If we move a student with grade 60 from the high achievers to the low achievers:
- The high achievers’ new avg increases (60 was below their previous avg).
- The low achievers’ new avg also increases (60 is above their previous avg).

common in medicine, ecology, and statistics, where reclassification makes both groups look better without actual improvement.
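A toy R demonstration (hypothetical grades): moving a value that lies between the two group means from the higher-mean group to the lower-mean group raises both averages.

```r
low  = c(40, 45, 50, 55)     # mean 47.5
high = c(60, 75, 80, 85)     # mean 75
low2  = c(low, 60)           # the 60 joins the low group...
high2 = setdiff(high, 60)    # ...and leaves the high group
mean(low2)    # up from 47.5
mean(high2)   # up from 75
```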

25
Q

(EXAM - VL7)

A mutant animal grows 3cm per week faster than expected of a normal animal; CI 95%[1.3, 5.4].
Put a checkmark after the correct p value and write an estimate for CI for the other p values:
* 0.2
* 0.1
* 0.05
* 0.01
(2024-2)

A

0.05 → 95% CI [1.3, 5.4] ✓
Others:
* 0.2 → 80% CI
* 0.1 → 90% CI
* 0.01 → 99% CI

As p-values correspond to different confidence levels, we can estimate the CIs for the other p-values:
p=0.2 (80% CI)
* → Narrower interval (80% CIs are narrower than 95% CIs)
* Approximate estimate: [1.8, 4.9]
p=0.1 (90% CI)
* → Slightly wider than the 80% CI, but still narrower than the 95% CI
* Approximate estimate: [1.5, 5.1]
p=0.01 (99% CI)
* → Wider interval than the 95% CI
* Approximate estimate: [0.8, 5.9]
Higher confidence levels require wider intervals to ensure the true value is captured.

CI     z-score (approx.)
80%    1.28
90%    1.645
95%    1.96
99%    2.576
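The z-table above can be applied directly; a sketch assuming a symmetric, normal-based CI (so the computed numbers differ slightly from the rough estimates):

```r
est = (1.3 + 5.4) / 2                    # centre of the 95% CI
se  = (5.4 - 1.3) / (2 * qnorm(0.975))   # half-width / 1.96
ci  = function(level) est + c(-1, 1) * qnorm(1 - (1 - level) / 2) * se
round(ci(0.80), 1)   # 80% CI, narrowest
round(ci(0.90), 1)
round(ci(0.99), 1)   # 99% CI, widest
```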

26
Q

(EXAM - VL11)

(Given: a biplot) Metabolite levels in plants grown in two different temperature conditions (16°C, 6°C) were analysed. N = 243 for each temperature, total number = 486.
PCA was performed on 37 metabolite levels. The biplot shows only 2 PCs. How many PCs are generated in total? (1 point)
How many variables influence the PC ? (1 point)

(2024-2)

I think most of the other questions were about interpreting the biplot. (2022)

A
  • number of PCs = number of variables (metabolites) in the dataset = 37 PCs (if there were fewer samples than variables, there would be n−1 PCs)
  • variables that influence the PCs are the metabolites → 37,
    because each PC is a linear combination of all original variables
27
Q

(EXAM - VL11)

(Given: a biplot) Metabolite levels in plants grown in two different temperature conditions (16°C, 6°C) were analysed. N = 243 for each temperature, total number = 486.
PCA was performed on 37 metabolite levels. The biplot shows only 2 PCs, but 37 PCs are generated, influenced by 37 variables.

Which metabolite is more affected by temp increase?
Which is least sensitive to heat ?

(2024-2)

A
  • long and aligned with group separation → Most affected by T increase
  • short and perpendicular to group separation → Least sensitive to heat.

Which metabolite is more affected by temperature increase?
- in biplot: length of arrow (loading vector) for each met indicates how strongly that met contributes to PCs
- If met has long arrow pointing in direction of separation between two temperature groups (16°C and 6°C) → suggests that this met is strongly influenced by temperature diff.

Which metabolite is least sensitive to heat?
- conversely: mets with short arrows in biplot contribute less to variance captured by PCs → less influenced by T changes.
- if met’s arrow points perp to separation direction of two T groups → this met does not correlate with T diffs.
→ met with shortest arrow or one that does not align with group separation is least sensitive to heat.

28
Q

(EXAM - VL11)

>wine=read.table("wineData.txt",header=T)
>p=prcomp(wine,scale=T)
>biplot(p)
>varimax(p$rotation)

What do the red arrows signify in this figure, write the most common term associated with PCA?

(2024-2)

A
  • loadings (red entries)

[scaled by PC standard deviations and sqrt(number of observations)]

29
Q

(EXAM - VL12)

[shown is a square with a frame, then same image in PCA scoreplot, where it has been turned so it’s balancing on a corner]

Why is the PCA not straight ?

(2024-2)

A

Because the PCs find the directions of greatest variance.

Data are uncorrelated (r=0), but not independent!
Choosing X limits Y -> X and Y are said to carry mutual Information

PCA assumes linear relationships + Gaussian distributions for optimal interpretation.
data non-gaussian -> PCA may fail to fully disentangle dependencies or make dimensions truly independent

30
Q

(EXAM - VL13)

Which of the following statements relates better to tSNE or PCA ?
* Builds on the concept of density sensitive distance metric
* Repeated runs may result in different outcomes
* Needs additional parameter setting
* Allow meaningful assessment of Var ….(i forgot the ending)

(2024-2)

A

tSNE:
- Builds on the concept of density sensitive distance metric - uses probabilistic approach to preserve local neighborhood structures, sensitive to density differences in data
- Repeated runs may result in different outcomes - stochastic algorithm –> results can vary between runs unless seed is fixed
- Needs additional parameter setting - requires tuning parameters like perplexity, learning rate, and iterations (cf PCA fewer parameters to tune)

PCA
- Allow meaningful assessment of variance - *explicitly calculates and retains variance along principal components

31
Q

(EXAM - VL11)

For the matrix, M, with M=[ 5 8 ; 1 3 ]:

a) Which of the following two vectors is an/ are eigenvector(s) of M:
⃗v1=(−1 3 )
⃗v2=(2 −1)

b) What is/are the associated eigenvalue/s?
Document your answer by presenting the necessary calculations!

(3 points)

(2024-sample, 2024-1, 2024-2, 2020)

A

a) A vector v is an eigenvector of a matrix M if:
Mv = λv

Mv1 = [ 5 8 ; 1 3 ] (-1 3) = (5·(-1) + 8·3, 1·(-1) + 3·3) = (19 8)
→ Mv1 ≠ λv1
→ (19 8) is not a scalar multiple of v1, so v1 is not an eigenvector of M.

Mv2 = [ 5 8 ; 1 3 ] (2 -1) = (5·2 + 8·(-1), 1·2 + 3·(-1)) = (2 -1)
→ Mv2 = λv2 with λ = 1
→ v2 is an eigenvector, scaled by an eigenvalue of 1

b)
We already found one eigenvalue associated with v2: λ1 = 1
To calculate all eigenvalues:
Mv =λv
(M - λI)v = 0 (I = identity matrix)
→ det(M - λI) = 0
(5 - λ)(3 - λ) - 8·1 = 0
λ² - 8λ + 7 = 0 → (λ - 1)(λ - 7) = 0

λ1 = 1, λ2 = 7
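The hand calculation can be checked in R:

```r
M  = matrix(c(5, 1, 8, 3), nrow = 2)   # [ 5 8 ; 1 3 ], filled by column
v1 = c(-1, 3)
v2 = c(2, -1)
M %*% v1          # (19, 8): not a multiple of v1
M %*% v2          # (2, -1): equals 1 * v2
eigen(M)$values   # 7 and 1
```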

32
Q

(EXAM - VL11)

Consider the following matrix and vectors:

M =[ 4 2 ; 1 3 ]
⃗v1=(1 1 )
⃗v2=(-2 -2)

If v1 is an eigenvector of M, is v2 also an eigenvector?

[Eigenvector question. But no finding the eigenvalues. just saying “If a is an eigenvector, is b also an eigenvector?” ]
(2022)

A

Yes

any scalar multiple of an eigenvector is also an eigenvector corresponding to the same eigenvalue

If v1 is an eigenvector, then any vector of the form cv1, where c ≠ 0, is also an eigenvector of M corresponding to the same eigenvalue.

33
Q

(EXAM - VL12)

Some example of dimensionality reduction.

Is ICA superior? Did ICA get it right and how would you calculate it if it did?

(2024-2)

A

ICA considered superior in certain contexts because of ability to:
- Separate Independent Sources
- Handle Non-Gaussian Data
- No Orthogonality Constraint

To determine if ICA “got it right,”: evaluate if extracted components are truly independent, either:
- Visual Inspection
- measure stat. indep.: eg mutual information (should be minimized between components), kurtosis (higher kurtosis than gaussian signals),… -> compare w/ ground truth signals (if available)

more details:
ICA considered superior in certain contexts because of ability to:
- Separate Independent Sources: Unlike PCA, which only removes correlations (linear relationships), ICA removes both correlations and higher-order dependencies, making it ideal for separating independent signals (e.g., blind source separation).
- Handle Non-Gaussian Data: ICA assumes that the underlying sources are non-Gaussian, allowing it to separate signals that PCA might fail to distinguish.
- No Orthogonality Constraint: ICA does not require components to be orthogonal, unlike PCA, which makes it more flexible in capturing real-world data structures.

To determine if ICA “got it right”: evaluate if extracted components are truly independent, either by:
- Measuring Statistical Independence: Use metrics like mutual information, kurtosis, or negentropy to assess the independence of the extracted components. Lower mutual information or higher kurtosis indicates better separation.
- Visual Inspection: Plot the separated components and inspect whether they represent meaningful independent signals (e.g., in blind source separation problems like separating mixed audio signals).

**Calculate it if ICA “got it right”?**
Evaluate results - calculate metrics such as:
- Mutual Information: should be minimized between components.
- Kurtosis: independent components often exhibit higher kurtosis than Gaussian signals.
- Compare with ground truth signals (if available) to verify correctness.

34
Q

(EXAM - VL14)

Experiment design: What is an under-powered experiment? Which error type does it raise?

(2024-2)

A
  • Lacks enough statistical power to reliably detect true effect.
  • Statistical power = probability of correctly rejecting H0 when it is false (i.e., detecting a real effect).
  • Raises a Type II error (false negative): failing to detect a true effect.
35
Q

(EXAM - VL15)

Survival question:
Explain cox proportional hazards. How does this relate to time function or its main components ? (don’t entirely remember)

(2024-2)

A

(CHECK - NEED MORE)

Cox proportional-hazards model:

essentially a regression model commonly used for investigating the association between the survival time of patients and one or more predictor variables

Models the hazard, h(t): the probability of dying at a point in time, given survival up to that time point. In standard form:

h(t) = h₀(t) · exp(b₁x₁ + … + bₚxₚ)

where h₀(t) is the baseline hazard and the xᵢ are the predictor variables; the predictors scale the baseline hazard multiplicatively, independent of time (hence "proportional hazards").

36
Q

(EXAM - VL5)

What is Z-score? How is it calculated?

(2024-1)

A

Normalisation procedure:
- A Z-score (or standard score) → how many standard deviations a data point is from the mean of a dataset.
- data transformation → data now have a mean of 0 and an SD of 1
- ~95% of (normally distributed) data lie within z-scores of −1.96 and +1.96

Formula:
z = (x − x̄) / s
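In R, `scale()` does this in one step; a minimal check on made-up data:

```r
x = c(2, 4, 6, 8, 10)           # made-up data
z = (x - mean(x)) / sd(x)       # z-score by the formula
z_alt = as.numeric(scale(x))    # scale() gives the same result
mean(z)   # 0
sd(z)     # 1
```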

37
Q

(EXAM - VL7)

(also QUIZ 5)

You have determined the following correlation coefficient and the confidence intervals.
R = 0.3, CI 95%[0.2, 0.4] p _______?
R = 0.3, CI 95%[0.1, 0.5] p _______?
R = 0.3, CI 95%[0.0, 0.6] p _______?
R = 0.3, CI 95%[-0.1, 0.7] p _______?
(2024-2)

A

R = 0.3, CI 95% [0.2, 0.4] → p < 0.001 (narrow CI, well away from 0)
R = 0.3, CI 95% [0.1, 0.5] → p < 0.05
R = 0.3, CI 95% [0.0, 0.6] → p = 0.05 (CI just touches 0)
R = 0.3, CI 95% [-0.1, 0.7] → p > 0.05, not significant (roughly 0.2)

38
Q

(EXAM - VL)

       | Norm ~ norm | Non-norm ~ cat | Cat ~ cat | Norm ~ cat
Centre |             |                |           |
plot   |             |                |           |
test   |             |                |           |
ES     |             |                |           |

(2024-1, 2022, 2020)

A

see cheat sheet ("Spicker")

39
Q

(EXAM - VL7)

Imagine you’re doing a study comparing Covid vs Flu patients and comparing results dead or alive. What tests & table do you do? What analysis, visualisation methods etc. do you use?

(2024-1)

A

2 categorical variables:
- disease type: covid, flu
- survival: dead, alive

visualisation:
bar plot, association plot, (four fold plot)

table:
2 x 2 contingency table → margin table → independence table

test:
prop.test, chisq.test (fisher if exp. cell counts < 5)
H₀ (Null Hypothesis): Survival rates are independent of disease type.
H₁ (Alternative Hypothesis): Survival rates depend on disease type.
→ p-value and confidence interval

Effect Size:
Cohen’s h (best because it’s a 2x2 table)
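A sketch with made-up counts (hypothetical numbers, not real Covid/flu data) showing the table and test described above:

```r
tab = matrix(c(30, 70, 20, 80), nrow = 2, byrow = TRUE,
             dimnames = list(Disease = c("Covid", "Flu"),
                             Outcome = c("Dead", "Alive")))
prop.table(tab, 1)   # row-wise death/survival proportions
chisq.test(tab)      # test of independence (Yates-corrected for 2x2)
```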

write a report
A chi-square test of independence was performed to examine whether mortality rates of covid and influenza differ. The relationship between the infection and the mortality rate was significant, χ2(1,N=___) = ___, p=____. Patients with _____ were ____ %, more likely to die CI95%[___,___] (___%) than _____ patients (___%),
Cohen’s h = ____.

40
Q

(EXAM - VL7)

Imagine you’re doing a study comparing Covid vs Flu patients and comparing results dead or alive.

Why is a t-test not appropriate? Is it correlational or experimental research? (2024-1)

A

Correlational or Experimental?
- observational study, not an experiment
- no random assignment; the patients are already sick
→ correlational

Why is a t-test not appropriate?
- t-test: compare means of continuous variables (e.g., age, viral load).
- Here both variables categorical
→need tests for categorical data like χ² or logistic regression instead.

41
Q

(EXAM - VL4)

Draw an outline of an assoc plot result from penguins

(2024-1, 2022 but not sure what dataset)

42
Q

(EXAM - VL7)

Write a report for the result of the chisq.test for the relation between passengers travel class and their survival at the Titanic (Often here a template to fill in with values from the previous task.) (2024-sample) (3 points)

We hypothesize that people on the Titanic had better survival chances if they were traveling in more expensive passenger classes.

(2024-sample, 2024-1 not sure which dataset)

> cohensW = function (tab) { pe=prop.table(chisq.test(tab)$expected); po=prop.table(tab); w=sqrt(sum(((po-pe)^2)/pe)); return(w[[1]])}
> data(Titanic)
> summary(Titanic)
Number of cases in table: 2201
Number of factors: 4
Test for independence of all factors:
        Chisq = 1637.4, df = 25, p-value = 0
        Chi-squared approximation may be incorrect
> str(Titanic)
 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
 - attr(*, "dimnames")=List of 4
  ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
  ..$ Sex     : chr [1:2] "Male" "Female"
  ..$ Age     : chr [1:2] "Child" "Adult"
  ..$ Survived: chr [1:2] "No" "Yes"
> tab=Titanic[1:3,,,]
> apply(tab,c(1,4),sum)
     Survived
Class  No Yes
  1st 122 203
  2nd 167 118
  3rd 528 178
> prop.table(apply(tab,c(1,4),sum),1)
     Survived
Class        No       Yes
  1st 0.3753846 0.6246154
  2nd 0.5859649 0.4140351
  3rd 0.7478754 0.2521246
> chisq.test(apply(tab,c(1,4),sum))
        Pearson's Chi-squared test
data:  apply(tab, c(1, 4), sum)
X-squared = 133.05, df = 2, p-value < 2.2e-16
> cohensW(apply(tab,c(1,4),sum))
[1] 0.3179676
A

A chi-square test of independence was performed to examine the relation between passenger class and survival on the Titanic. There was a significant relationship between passenger class and survival, χ²(2, N = 2201) = 133.05, p < 2.2e-16. The survival rate was higher for passengers in 1st class (62.46%) compared to those in 2nd (41.40%) and 3rd class (25.21%). Cohen’s w = 0.318, indicating a moderate effect size.

43
Q

(EXAM - VL10)

Given a decision tree, explain what is going on, what is the probability of x etc.

(2024-1)

A

In short,
- probability of x determined by tracing through decision tree from root to leaf
- look at class distribution at leaf node
Subtypes:
- classification tree: probabilities give likelihood of x belonging to each class.
- regression tree: leaf node gives predicted value for x.

Structure of the Decision Tree:
- made up of nodes and branches.
- Root Node: starting point where data is split based on feature
- Internal Nodes: decision points, data is split based on conditions/features.
- Leaf Nodes: terminal points of tree, final predictions are made.
- Each node splits data based on feature, threshold.
- eg: node may split based on whether “Age > 30” or “Income < 50000”.

Interpret the Decision Tree:
- Start at root node
- Move through internal nodes, following appropriate condition
- reach leaf node -> provides predicted class (classification) or value (regression).
- (if classification, leaf node also provides class probabilities.)

Determine the Probability of x:
classification tree:
probability of observation x belonging to certain class is determined by distribution of target variable in leaf node that x falls into.
Example:
If final leaf node with 70% “Yes”, 30% “No”-> P of x being “Yes” is 0.7 P of “No” is 0.3.

How to Calculate the Probability:
Leaf Node Counts: For each leaf node, count the number of instances of each class.
Class Probabilities: The probability of each class for a given observation x is the proportion of that class in the leaf node. This can be calculated as: P(class_i | x) = (number of instances of class i in the leaf node) / (total number of instances in the leaf node)

Example Walkthrough:
Suppose you have a decision tree for predicting whether a person buys a product, with features like “Age” and “Income”.
The root node splits based on “Age > 30”.
If x’s “Age” is 25, you go to the left child, where the next node might split based on “Income < 50000”.
At the leaf node, you might find that out of 100 people who ended up in this leaf, 60 bought the product (“Yes”) and 40 didn’t (“No”).
The probability that x buys the product would be 60/100= 0.6
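The tracing described above is exactly what R's tree predictors do internally. A minimal sketch using the recommended rpart package (the iris dataset and settings are illustrative, not from the exam):

~~~
# Sketch: class probabilities from a classification tree (rpart)
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
# type = "prob" returns, per observation, the class proportions
# of the leaf node that observation falls into
head(predict(fit, iris, type = "prob"), 3)
~~~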

44
Q

(EXAM - VL14)

Experiment design: How can power be increased in a study?

(2020)

A
  • More samples
  • Larger α (but increases the chance of a type I error)
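Both levers can be seen with power.t.test from the stats package; the effect size and sample sizes below are made up for illustration:

~~~
# Power rises with n, and with a larger significance level alpha
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = 0.05)$power  # more samples
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.10)$power  # larger alpha
~~~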
45
Q

(EXAM - VL14)

Experiment design:
How to increase power without changing number of observations
How might this “fix” cause problems?

(2024-1)

A

Larger α (but increases the chance of a type I error)

46
Q

(EXAM - VL12)

Difference between PCA & ICA, when to choose ICA over PCA (specifically what kind of data)

A

Goal
- PCA: Finds directions (principal components) that maximize variance.
- ICA: Finds statistically independent components in the data.

Assumptions
- PCA: Assumes components are uncorrelated (orthogonal).
- ICA: Assumes components are statistically independent and non-Gaussian.

**Output components**
- PCA: Orthogonal and ranked by explained variance.
- ICA: Not orthogonal and not ranked.

Focus
- PCA: Captures maximum variance in the data.
- ICA: Removes both correlations and higher-order dependencies.

Data type
- PCA: Works well with Gaussian data or when variance is the focus.
- ICA: Works well with non-Gaussian data or when independence is needed.

Applications
- PCA: Dimensionality reduction, visualization, feature extraction.
- ICA: Blind source separation (e.g., separating mixed audio signals).

when to choose ICA?
Use ICA for separating independent signals or when working with non-Gaussian data requiring statistical independence.
If your analysis requires removing not only linear correlations but also higher-order dependencies, ICA is more appropriate.
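A minimal sketch contrasting the two (assumes the CRAN package fastICA is installed; the sources and mixing matrix are made up):

~~~
library(fastICA)
set.seed(1)
S <- cbind(runif(500), rexp(500))          # two non-Gaussian sources
X <- S %*% matrix(c(1, 1, 0.5, 2), 2, 2)   # observed mixed signals
pca <- prcomp(X)              # orthogonal directions of maximal variance
ica <- fastICA(X, n.comp = 2) # statistically independent components (in ica$S)
~~~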

47
Q

(EXAM - VL13)

Give one advantage of UMAP over t-SNE.

A

UMAP handles large-scale data faster thanks to its optimized graph-based computations, making it better suited for datasets with many samples than the computationally intensive t-SNE.

48
Q

(EXAM - VL11)

Given a biplot, which parameters are more related to given variable.

(2024-1)

GO TO OLD EXAM 2018 TO PRACTICE

A

GO TO OLD EXAM 2018 TO PRACTICE

Look at angle between vectors:
- Smaller angles (closer to 0°) → strong positive relationships
- angles near 180° → strong negative relationships.
- Perpendicular vectors (90°) → no relationship

Check vector length:
- Longer vectors → vars with higher variance, stronger contributions to PCs

Projection onto the variable’s vector:
- Observations or variables projected closer to direction of vector are more related to that variable
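To practice reading angles and lengths, a biplot can be generated from any multivariate dataset, e.g.:

~~~
p <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(p)  # near-parallel arrows: strong positive correlation;
           # opposite arrows: negative; perpendicular: ~no correlation
~~~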

49
Q

(EXAM - VL11)

Given a biplot, and a scree plot, which should be the second bar in scree plot.

(2024-1)

GO TO OLD EXAM 2018 TO PRACTICE

A

second bar in scree plot corresponds to eigenvalue of PC2.

This value represents the amount of variance explained by PC2.

Identify it by looking at the biplot, which shows the contributions of variables to each principal component, and match it to the second-largest eigenvalue in the scree plot.

50
Q

(EXAM - VL11)

Looking at a boxplot of data, and the biplot, was analysis scaled?

(2024-1)

GO TO OLD EXAM 2018 TO PRACTICE

A
  • Check for comparable ranges in boxplot and balanced arrow lengths in biplot to confirm scaling.
  • If ranges or arrow lengths differ significantly, data likely not scaled.

Boxplot:
- Look at the ranges of the variables in the boxplot.
- If vars scaled (e.g., standardized), boxplots will show similar ranges (centered around 0 with comparable spreads if standardized).
- If not scaled, variables with larger original ranges will dominate.

Biplot:
- In a PCA biplot, scaling affects both the scores (points) and loadings (arrows).
- If the analysis was scaled, variables with different units or variances will have comparable arrow lengths, and distances between points reflect relative relationships rather than absolute magnitudes.
- Without scaling, variables with larger variances or units will dominate, leading to disproportionately long arrows.

Why scale?
- PCA identifies directions of max variance.
- Features with larger scales dominate variance, bias results.
How to scale?
- Use standardization: mean = 0, sdev = 1.
Effect of scaling:
- Ensures all features contribute equally to PCA.
- Prevents bias toward features with larger numerical ranges.
When to scale?
- Always scale if features have different units or ranges (e.g., height vs. weight).
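In R the choice is a single argument to prcomp; mtcars is used here because its variables have very different ranges:

~~~
p_unscaled <- prcomp(mtcars)                 # covariance matrix: disp and hp dominate
p_scaled   <- prcomp(mtcars, scale. = TRUE)  # correlation matrix: equal footing
# compare how strongly variance concentrates on PC1 in the unscaled case
summary(p_unscaled)$importance[2, 1:3]
summary(p_scaled)$importance[2, 1:3]
~~~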

51
Q

(EXAM - VL14)

Experiment design:
Define/ explain what is meant by “power” in statistical hypothesis testing! (1 point)

(2024-sample, 2024-1)

A

= probability of correctly rejecting H0 when it is false (i.e., detecting a real effect).

52
Q

(EXAM - VL5)

Difference between experimental and correlational research (2022, 2020)

A

Key difference:
- experimental manipulates independent variable to observe effect on dependent variable.
- correlational measures relationship between two or more variables without manipulation.

What does each method establish?
- experimental - Establishes causation.
- correlational - Establishes association, not causation.

  • experimental - uses random assignment and control groups to reduce bias.
  • correlational - commonly uses statistical measures like Pearson's correlation coefficient (r).
53
Q

(EXAM - VL)

Looking at % explained variance from PC1 and PC2 and then saying which Scree bar plot is correct (one correct bar was given and you had to choose between a second bar, A or B)

(2022)

54
Q

(EXAM - VL12)

Explain briefly the main objective of multidimensional scaling! (1 point)

(2024-sample, 2022)

CHEAT SHEET

A

Preserves pairwise distances

as faithfully as possible

in lower dimensions
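A short sketch with classical MDS (cmdscale); the iris data is just an example:

~~~
d   <- dist(scale(iris[, 1:4]))  # pairwise distances in the original 4 dimensions
fit <- cmdscale(d, k = 2)        # 2-D configuration preserving those distances
plot(fit, col = iris$Species)
~~~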

55
Q

(EXAM - VL8 )

What is kurtosis and what happens when we have a positive Kurtosis.

(2020)

A
  • 4th moment of a distribution
  • measures how sharp or flat distribution is
  • positive kurtosis → very sharp distribution
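Excess kurtosis (the usual way it is reported: normal distribution = 0) can be computed by hand from the 4th moment; the simulated samples below are illustrative:

~~~
kurt <- function(x) {            # excess kurtosis via the 4th moment
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}
set.seed(1)
kurt(rnorm(1e5))       # ~0 for a normal distribution
kurt(rt(1e5, df = 5))  # positive: sharper peak, heavier tails
~~~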
56
Q

(EXAM - VL9)

Write a report from this analysis (code shown)
(data set of birth weight of children from smoking and non-smoking mothers)
t- test was performed
CI and p- value

(2020)

A

Tests
categorical- nominal data:
→ prop test (1-2 groups, CI and p, counts >= 5, c~c)
→ chi-square (2+ groups, only p, counts >= 5, c~c)
→ fisher (2 groups (2x2 contingency table), p, CI and odds ratio, counts >= 0, c~c)

→ how is our data distributed?
→ Shapiro-Wilk (test for normal distribution, n <= 50)
with many samples the Shapiro-Wilk test becomes significant very easily
H0 assumes normality
→ Kolmogorov-Smirnov (generalized test for any distribution, if n > 50)
checks if both samples might come from the same distribution
H0 assumes that both samples are equally distributed

normal numerical data:
→ 2 groups: t-test (n~c & n~n) (in the function you can choose paired=TRUE or not)
→ 3+ groups in c: anova (n~c)
parametric tests
correlation: pearson correlation test (n~n) (H0: true correlation =0)

non-normal, ordinal data:
→ 2 groups: wilcox (n~c) (in the function you can choose paired=TRUE or not)
→ 3+ groups: if not matched: kruskal-wallis-test (n~c)
if matched: friedman-test
non parametric tests, working on ranks
correlation: Spearman correlation test (n~non normal, n~c ordinal) (H0: true correlation =0)
Kendall = even more robust against outliers
→ on nominal data, only two ranks allowed, otherwise the correlation doesn’t work
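A hedged sketch of how such a t-test report is produced (the birth-weight numbers are simulated, not the exam's data):

~~~
set.seed(1)
bw <- data.frame(weight = c(rnorm(30, 3300, 500), rnorm(30, 3100, 500)),
                 smoke  = rep(c("no", "yes"), each = 30))
res <- t.test(weight ~ smoke, data = bw)
res$conf.int  # 95% CI of the difference in means
res$p.value
~~~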

57
Q

(EXAM - VL4)

Draw Boxplots

(used the results from the test with smoking( non-smoking mothers)

(2020)

A

draw one now!

58
Q

(EXAM - VL11)

mtcar data.
What would you choose and why: correlation or covariance matrix?

(2020)

A

if variables have different ranges or units:
→ data should be scaled to prevent bias
→ correlation matrix

if variables have same units and ranges:
→ cov matrix

59
Q

(EXAM - VL11)

(mtcar data?)
Describe and name plots: score plot, scree plot, score plot with loading

what has the most positive correlation with non American cars? (I think it was the same example we had in the lab lecture with him)

(2020)

CHECK OLD EXAM and VLS FOR IMAGES AND EXAMPLES (IRIS)

60
Q

(EXAM - VL11)

Why t-test can not be applied to survival time if patients with and without drugs

(2020)

A
  • survival times are not normally distributed (nor do they follow any simple parametric distribution)
  • censoring: the number of patients at risk keeps changing
61
Q

(EXAM - VL5)

Explain briefly the aims of inferential and descriptive statistics. (3 points)

(2024-sample)

A

Descriptive statistics:
- summarize and organize data
- providing measures such as
- center: mean, median, mode,
- spread: standard deviation, SEM
- plots: graphs to describe a dataset

Inferential statistics
- use sample data to make predictions or generalizations about larger population
- often through hypothesis testing (p-value), confidence intervals, effect size, and regression analysis.

62
Q

(EXAM - VL5)

Describe shortly in which situation you would use a boxplot, a mosaicplot and a xyplot to describe the relationship between two variables. (often now table and assign the right plot to the right combination of variables., in case of free text, not too much free text.) (3 points)

(2024-sample)

A

Boxplot
- n~c
- compare dist of numeric variable across cats

Mosaic Plot
- c~c
- visualize relationship/dependency between two cat vars

XY Plot (Scatterplot)
- n~n
- examine corr/trend between 2 numeric vars
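One line each, using ToothGrowth as a stand-in dataset:

~~~
data(ToothGrowth)
boxplot(len ~ supp, data = ToothGrowth)                # n~c
mosaicplot(~ supp + factor(dose), data = ToothGrowth)  # c~c
plot(len ~ dose, data = ToothGrowth)                   # n~n
~~~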

63
Q

(EXAM - VL9)

Shortly explain the measures to characterize the distribution of a numerical variable.
Focus on center, scatter and shape of the distribution (3 points)

(Often now an assignment task like skewness, kurtosis, what does a positive skewness value mean …)

(2024-sample)

A

Characterizing a Numerical Distribution:

**Center:** Describes the typical value.
- Mean (average)
- Median (middle value)
- Mode (most common)

Scatter: (Dispersion): Describes spread.
- Range (min–max)
- SD/Variance (spread around mean)
- IQR (middle 50%).

Shape: Describes symmetry and tail behavior.
Skewness: Asymmetry
- Positive = right tail
- Negative = left tail
Kurtosis: Tailedness
- High = more outliers
- Low = fewer outliers
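All three aspects in a few lines of base R (the exponential sample is just a right-skewed example):

~~~
set.seed(1)
x <- rexp(1000)
c(mean = mean(x), median = median(x))                # center
c(sd = sd(x), IQR = IQR(x), range = diff(range(x)))  # scatter
z <- (x - mean(x)) / sd(x)
c(skewness = mean(z^3), excess.kurtosis = mean(z^4) - 3)  # shape
~~~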

64
Q

(EXAM - VL7)

Explain and summarize the following statistical analysis. You can write your comments directly on the paper. We hypothesize that people at the Titanic had better survival changes if they were traveling in more expensive passenger classes. (2024-sample) (6 points)
~~~
> cohensW = function (tab) { pe=prop.table(chisq.test(tab)$expected); po=prop.table(tab); w=sqrt(sum(((po-pe)^2)/pe)); return(w[[1]])}
> data(Titanic)
> summary(Titanic)
Number of cases in table: 2201
Number of factors: 4
Test for independence of all factors:
Chisq = 1637.4, df = 25, p-value = 0
Chi-squared approximation may be incorrect
> str(Titanic)
‘table’ num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 …
- attr(*, “dimnames”)=List of 4
..$ Class : chr [1:4] “1st” “2nd” “3rd” “Crew”
..$ Sex : chr [1:2] “Male” “Female”
..$ Age : chr [1:2] “Child” “Adult”
..$ Survived: chr [1:2] “No” “Yes”
> tab=Titanic[1:3,,,]
> apply(tab,c(1,4),sum)
Survived
Class No Yes
1st 122 203
2nd 167 118
3rd 528 178
> prop.table(apply(tab,c(1,4),sum),1)
Survived
Class No Yes
1st 0.3753846 0.6246154
2nd 0.5859649 0.4140351
3rd 0.7478754 0.2521246
> chisq.test(apply(tab,c(1,4),sum))
Pearson’s Chi-squared test
data: apply(tab, c(1, 4), sum)
X-squared = 133.05, df = 2, p-value < 2.2e-16
> cohensW(apply(tab,c(1,4),sum))
[1] 0.3179676
~~~

A

(check exam pdf)

- cohensW function: preparation for effect size measurement
- data(Titanic): load data
- summary/str: investigate internal variables (4 factors, 2201 cases)
- contingency table
- tab = Titanic[1:3,,,]: extract data subset (passenger classes only, no crew)
- prop.table(..., 1): proportions row-wise
- counts >= 5 in each cell → chi-square test applicable
- chisq.test: significant (p < 2.2e-16)
- Cohen's w = 0.318 → medium effect size

see summary

65
Q

(EXAM - VL11)

For a dataset consisting of 40 samples containing metabolite level data of 100 metabolites each, you wish to generate a PCA-score plot that shows the 40 samples in the space of the Principal Components (PCs) defined by the 100 metabolites.

How many PCs with non-zero eigenvalue can you expect to obtain from the PCA computations? (1 point)

(2024-sample)

A

39

(min(n − 1, p) = 39: PCA centers the data, so the 40 samples span at most a 39-dimensional subspace of the 100-dimensional metabolite space; the 40th eigenvalue is zero)
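This can be checked directly on simulated data of the same shape:

~~~
set.seed(1)
X <- matrix(rnorm(40 * 100), nrow = 40)  # 40 samples x 100 metabolites
p <- prcomp(X)
sum(p$sdev > 1e-8)  # 39: centering removes one dimension from the 40 samples
~~~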

66
Q

(EXAM - VL11)

The table below shows an excerpt of R-dataframe “mtcars”, holding information on various car types. In total, 32 different cars are contained in the dataset. Each car is characterized by the following 11 variables:
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburettors
(2024-sample)

You wish to perform a PCA-analysis to visualize the different cars and to understand, by which features the cars can be separated. Assume that you want to understand whether or not American-built cars are different from non-American cars.

**What type of PCA would you perform on the data: scaled data (i.e. using the correlation matrix), or unscaled data (using the data as shown in the table, corresponding to using a co-variance matrix as input). Explain! (2 points) **

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

(2024-sample)

A

variables have different ranges and units
→ scaled data (i.e. using the correlation matrix)

67
Q

(EXAM - VL11)

GO TO SAMPLE EXAM AND DO THIS

The table below shows an excerpt of R-dataframe “mtcars”, holding information on various car types. In total, 32 different cars are contained in the dataset. Each car is characterized by the following 11 variables:
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburettors
(2024-sample)

You wish to perform a PCA-analysis to visualize the different cars and to understand, by which features the cars can be separated. Assume that you want to understand whether or not American-built cars are different from non-American cars.

**The Figures below show the result of a PCA-analysis.
i. Label the plots A), B), and C) (one label only!) AND explain briefly, what they show! Choose labels from the set of the following options: “scatter plot”, “scree plot”, “star plot”, “biplot”, “score plot” (3 points) (2024-sample)
ii. How many Principal Components (PCs) would you consider relevant for the characterization of cars given the set of features? Which plot of the three below is relevant for answering this question and why? (3 points) (2024-sample)
iii. Name a feature that
- is negatively correlated with American-built cars, i.e. which feature/s has/have smaller values in American cars than in non-American cars: …………………….
- Is not informative with regard to distinguishing American cars from other car types: …………………….
- Has a relatively large loading on PC2: ………………………. (3 points)

**

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

(2024-sample)

A

GO TO SAMPLE EXAM AND DO THIS

i. Label the plots and explain
A) scree plot
B) score plot
C) biplot

ii.
screeplot is relevant

iii. Name a feature that
- is negatively correlated with American-built cars, i.e. which feature/s has/have smaller values in American cars than in non-American cars: …………………….
- Is not informative with regard to distinguishing American cars from other car types: …………………….
- Has a relatively large loading on PC2: ………………………. (3 points)

68
Q

(EXAM - VL13)

Fill in the following text.

Select from the following terms (more terms than needed to fill in the text, terms may occur several times or not at all): Gaussian, non-Gaussian, Kurtosis, variance, linear, correlation, non-linear, exponential, independence, mutual information, standard deviation. (3.5 points)

“For the analysis of high-dimensional data, a number of projection methods have been established. PCA performs a ……..……………. transformation and projection of the data, while tSNE can create meaningful projections of data# into lower dimensions, even if they can only be separated by a …………………. distance metric. Unlike PCA, in which dimensions are sought that explain the maximal ………………………., ICA (independent component analysis) searches for directions that maximize ……………………. between the new dimensions. ICA yields more meaningful projection results than PCA, if the underlying data is NOT distributed according to a ……………….…… distribution. As a metric, of how different from a ………………….. a distribution is, the …………….…….…. can be used.”

(2024-sample)

A

“For the analysis of high-dimensional data, a number of projection methods have been established. PCA performs a linear transformation and projection of the data, while tSNE can create meaningful projections of data into lower dimensions, even if they can only be separated by a non-linear distance metric. Unlike PCA, in which dimensions are sought that explain the maximal variance, ICA (independent component analysis) searches for directions that maximize independence between the new dimensions. ICA yields more meaningful projection results than PCA, if the underlying data is NOT distributed according to a Gaussian distribution. As a metric, of how different from a Gaussian a distribution is, the kurtosis can be used.”

69
Q

(EXAM - VL14)

Experiment design:
Which experiment-design factors can contribute to low power? Name at least one! (1 points)

(2024-sample)

A
  • Small sample size – Not enough data to detect effects.
  • Larger α (but increases the chance of a type I error)
  • Generally: test-type (parametric/non-parametric)
    chatgpt extra:
  • Weak manipulation – If experimental conditions are too similar, differences won’t stand out.
  • Improper statistical test – Using a test with low sensitivity.
  • Measurement error – Inaccurate data collection reduces precision.
70
Q

(EXAM - VL14)

Experiment design: Explain qualitatively the term “effect size”! (1 point)

(2024-sample)

A

DW:
* magnitude of difference relative to average standard deviation (→relevance of effect, t-statistic→ significance of effect)
Chatgpt:
* - measure of strength/magnitude of relationship, difference, or effect in experiment,
* - independent of sample size.
* - tells how meaningful result is, rather than just whether stat. significant.
Examples:
* - medical trial: ES shows how much drug improves symptoms vs placebo.
* education: ES → how much new teaching method improves perf.

71
Q

(EXAM - VL5)

Shortly explain three approaches to ensure representative sampling for a statistical analysis.

(2018)

A

Simple Random Sampling
- Every member of pop has equal chance of being selected.
- Use when pop is homog., complete list of members available

Stratified Sampling
- Divide pop into subgroups (strata), sample proportionally.
- Use when pop is heterog, -> ensure all subgroups are represented.

Systematic Sampling
- Select every kth member from random list, random start point
- Use when pop is ordered, need evenly dist. sample w/o bias.
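Sketches of the three schemes in base R (population and sample sizes are made up):

~~~
set.seed(1)
pop <- data.frame(id = 1:100, stratum = rep(c("A", "B"), c(60, 40)))
srs   <- pop[sample(nrow(pop), 10), ]                  # simple random sampling
strat <- do.call(rbind, lapply(split(pop, pop$stratum), function(s)
           s[sample(nrow(s), ceiling(0.1 * nrow(s))), ]))  # stratified, 10% per stratum
sys   <- pop[seq(sample(10, 1), nrow(pop), by = 10), ] # systematic, random start
~~~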

72
Q

(EXAM - VL)

Which of the following P value and CI relations for the mean of two groups and the difference between those two means are not consistent? Explain shortly and write down a more consistent P value on the right.

Gene / Mean Control / Mean Treatm. / CI 95% / P-value
G1 / 13 / 11 / 0.5 to 4.0 / 0.02
G2 / 21 / 26 / -9.0 to 0.2 / 0.07
G3 / 22 / 24 / -0.3 to -4.2 / 0.09
G4 / 802 / 744 / 23 to 125 / 0.13

(2018)

A

For G3 and G4, the P-values should be adjusted to values below 0.05, consistent with their confidence intervals excluding zero.

use rule 95% CI … :
- should not include null value if P-value < 0.05,
- should include null value if P-value ≥ 0.05.
(null value eg 0 for mean difference)

Gene  Mean C  Mean T  CI (95%)      P     Consistency?  Correct P
G1    13      11      0.5 to 4.0    0.02  Consistent    —
G2    21      26      -9.0 to 0.2   0.07  Consistent    —
G3    22      24      -0.3 to -4.2  0.09  Inconsistent  Should be < 0.05
G4    802     744     23 to 125     0.13  Inconsistent  Should be < 0.05

Explanation:
- G1: CI (0.5 to 4.0) does not include 0, and p < 0.05 -> consistent.
- G2: CI (-9.0 to 0.2) includes 0, p ≥ 0.05 -> consistent.
- G3: CI (-0.3 to -4.2) does not include 0, but p ≥ 0.05 -> inconsistent because non-zero CI implies statistical significance (P < 0.05). correct p: < 0.05.
- G4: CI (23 to 125) does not include 0, but p ≥ 0.05 -> inconsistent because non-zero CI implies statistical significance (P < 0.05). correct p: < 0.05.


73
Q

(EXAM - VL8)

You would like to describe the PlantGrowth data table presented below. What numeric measures and which graphical visualizations would you use in order to describe this data. Explain in both for the single variables and for the two variables together. (2018)

You were investigating two light schedule treatments (trt 1, trt2) against normal light conditions, 12 hours of continuous light (ctrl), on the daily dry weight increments of plants.
~~~
> data(PlantGrowth)
> dim(PlantGrowth)
[1] 30 2
> head(PlantGrowth,n=3)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
> with(PlantGrowth, aggregate(weight, by=list(group), max))
  Group.1    x
1    ctrl 6.11
2    trt1 6.03
3    trt2 6.31
> PlantGrowth[PlantGrowth$weight > quantile(PlantGrowth$weight, 0.9), ]
   weight group
4    6.11  ctrl
21   6.31  trt2
28   6.15  trt2
~~~

A

Single variables:
Numeric Measures:
- Weight: mean, median, sdev, min, max, quartiles.

Graphical Visualizations:
- weight: histogram or boxplot to visualize distributions
- group: bar plot to show frequency of each group.

Two Variables Together
Numeric Measures:
- Compare summary statistics (e.g., mean, max) of weight across different groups.
- eg :

aggregate(weight ~ group, data = PlantGrowth, FUN = mean)

Graphical Visualizations:
- boxplot to compare dist of weights across groups (weight ~ group). eg:
boxplot(weight ~ group, data = PlantGrowth)
74
Q

(EXAM - VL6)

You found a new species of lady beetles, Coccinella golmensis, which have either black or blue points.
Your hypothesis is that males have blue points more often than females - blue points look nicer and might impress the females and therefore give males with blue points mating advantage.
You collect 35 male beetles with blue and 25 with black points, and you have further found 20 female beetles with blue and also 20 with black points.
Create a contingency table and explain how to calculate expected numbers of your collection, ie assuming that there are no differences in point colors for the two sexes of your species.
How many males with black points would you expect if there would be no effect in color distribution?

(2018)

A
----------------------------------------------------------------
             Black        Blue        Total
----------------------------------------------------------------
Male    |     25     |     35     |     60
Female  |     20     |     20     |     40
----------------------------------------------------------------
Total   |     45     |     55     |    100
----------------------------------------------------------------

To get expected value: Contingency table → Margin table → independence table
Independence table contains expected number: Rowtotal * Columntotal / Total:
Male with black points: 60 * 45 / 100 = 27
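R computes the whole independence table the same way:

~~~
tab <- matrix(c(25, 35, 20, 20), nrow = 2, byrow = TRUE,
              dimnames = list(Sex = c("Male", "Female"),
                              Points = c("Black", "Blue")))
chisq.test(tab)$expected  # row total * column total / grand total per cell
# Male/Black: 60 * 45 / 100 = 27
~~~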

75
Q

(EXAM - VL)

Which measure characterizes the deviation from the expected values in an independence table, and how is it calculated?

(2018)

A

pearson residuals

The formula to calculate the Pearson residuals for every cell of a contingency table is: (observed - expected) / sqrt(expected)

Chi-square values → chi-square statistic
The chi-square statistic is calculated with a similar formula (without the sqrt): the squared Pearson residuals are summed over every cell.
Higher values of this measure are more likely to produce low p-values than lower values.

A larger Chi-Square value indicates a greater deviation from expected values, suggesting a stronger association between sex and point color.

If chi square is small, the observed values are close to the expected values, indicating little to no association.
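Both quantities are available from chisq.test (the beetle table reused as an example):

~~~
tab <- matrix(c(25, 35, 20, 20), nrow = 2, byrow = TRUE)
res <- chisq.test(tab, correct = FALSE)
res$residuals  # Pearson residuals: (observed - expected) / sqrt(expected)
res$statistic  # chi-square statistic: sum of the squared residuals
~~~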

76
Q

(EXAM - VL4)

You found a new species of lady beetles, Coccinella golmensis, which have either black or blue points.
You collect 35 male beetles with blue and 25 with black points, and you have further found 20 female beetles with blue and also 20 with black points.
You create a contingency table:
~~~
----------------------------------------------------------------
             Black        Blue        Total
----------------------------------------------------------------
Male    |     25     |     35     |     60
Female  |     20     |     20     |     40
----------------------------------------------------------------
Total   |     45     |     55     |    100
----------------------------------------------------------------
~~~

**Draw a plot to visualize the relationship between sex and point color for the lady beetles. **

(2018)

A
  • association plot
  • mosaicplot

assocplot: the area of each box reflects the size of the Pearson residual in that cell; boxes above the baseline mark cells with more observations than expected (positive residuals), boxes below the baseline mark cells with fewer than expected (negative residuals).

77
Q

(EXAM - VL7-9)

Explain why reporting as well effect sizes and not only P values is important.

Explain further shortly one effect size measure to report the difference between the means of two groups.

(2018)

A

Why Report Effect Sizes Alongside P-Values?
- P-values indicate whether effect exists but do not show magnitude of effect.
- can be misleading, especially with large sample sizes (small effects can still be statistically significant)
- ES provide info about practical significance of result, helping to understand how meaningful difference or relationship is in real-world terms
- Reporting effect sizes allows for better comparison across studies and supports meta-analyses

Effect Size Measure: Cohen’s d
- commonly ES measure, compare means of two groups
- Formula: d = (μ_1 − μ_2) / σ_pooled (σ_pooled = pooled sdev)
- Interpretation: d = 0.2 → small; 0.5 → medium; 0.8 → large effect
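A hand-rolled version in a few lines (the group data are simulated for illustration):

~~~
cohens_d <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  sp <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))  # pooled SD
  (mean(x) - mean(y)) / sp
}
set.seed(1)
cohens_d(rnorm(50, mean = 0.5), rnorm(50))  # roughly a "medium" effect
~~~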

78
Q

(EXAM - VL9)

Explain the following analysis and give the final result. Did one of your light treatments give significantly higher dry weight gains than your control?
~~~
> head(PlantGrowth,3)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
> mtest = function (x) { return(shapiro.test(x)$p.value) }
> with(PlantGrowth, aggregate(weight, by=list(group), mtest))
  Group.1         x
1    ctrl 0.7474734
2    trt1 0.4519440
3    trt2 0.5642519
> with(PlantGrowth, aggregate(weight, by=list(group), mean))
  Group.1     x
1    ctrl 5.032
2    trt1 4.661
3    trt2 5.526
> aov.res = with(PlantGrowth, aov(weight ~ group))
> summary(aov.res)
            Df Sum Sq Mean Sq F value Pr(>F)
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> with(PlantGrowth, pairwise.t.test(weight, group))
	Pairwise comparisons using t tests with pooled SD
data: weight and group
     ctrl  trt1
trt1 0.194 -
trt2 0.175 0.013
P value adjustment method: holm
~~~

(2018)

A

2 vars:
- weight: Dry weight gains of plants (continuous variable).
- group: Light treatments (ctrl, trt1, and trt2).

Normality Test:
- Shapiro-Wilk test applied: check if weight data in each group follows normal distribution.
- Results: p-values show all groups have p > 0.05 → assumption of normality satisfied.

Group Means:
- aggregate used to calculate mean dry weight gain per group
- trt2 has the highest weight gain, followed by ctrl, then trt1.

ANOVA Test:
- one-way ANOVA performed: test whether significant diffs in mean weight gains across 3 groups.
- Results: F-value = 4.846, p = 0.0159. p < 0.05→ significant diff in mean weight gain between at least two groups.

Post-hoc Pairwise Comparisons:
Pairwise t-tests w/ Holm adjustment to identify which groups differ significantly:
ctrl vs trt1: p = 0.194 (not significant).
ctrl vs trt2: p = 0.175 (not significant).
trt1 vs trt2: p = 0.013 (significant).

FINAL RESULT
- ANOVA indicates significant diff in mean dry weight gains across 3 groups (p = 0.0159).
- Post-hoc tests reveal trt2 has significantly higher weight gains than trt1 (p = 0.013).
- However, neither treatment shows stat. significant diff compared to ctrl
→ Thus, trt2 outperformed trt1, but did not significantly outperform ctrl in terms of dry weight gains.

79
Q

(EXAM - VL11)

Dimensionality reduction/PCA, Principal Component Analysis
For the three Iris species (Iris setosa, Iris virginica, Iris versicolor), four morphological parameters (Petal (dt. Blütenblatt) Length and Width, Sepal (dt. Kelchblatt) Length and Width, all in the same length units – cm) were measured for 50 specimens each. Figure 1 shows the PCA-score plot (PC1 vs PC2) of the dataset. Figure 2 shows the associated loadings.

**Which variable (morphological parameters) is least discriminatory between Iris species and how do you arrive at your answer? **

(2018)

A

variable least discriminatory between Iris species can be identified by analyzing PCA loadings

variable with smallest absolute loadings across PC1 and PC2 is least discriminatory.

80
Q

(EXAM - VL11)

LOOK AT OLD EXAM

Dimensionality reduction/PCA, Principal Component Analysis
For the three Iris species (Iris setosa, Iris virginica, Iris versicolor), four morphological parameters (Petal (dt. Blütenblatt) Length and Width, Sepal (dt. Kelchblatt) Length and Width, all in the same length units – cm) were measured for 50 specimens each. Figure 1 shows the PCA-score plot (PC1 vs PC2) of the dataset. Figure 2 shows the associated loadings.

**Which two parameters are correlated the most between each other and how did you arrive at this answer? **

(2018)

A

LOOK AT OLD EXAM

  • Look for vars with similar directions in loadings plot (i.e., arrows point in nearly the same direction).
  • the closer two arrows are in angle (small angle between them), the stronger their correlation.
  • If two variables have loadings of similar magnitude and sign on both PC1 and PC2, they are likely highly correlated.
81
Q

(EXAM - VL11)

Dimensionality reduction/PCA, Principal Component Analysis

For the three Iris species (Iris setosa, Iris virginica, Iris versicolor), four morphological parameters (Petal (dt. Blütenblatt) Length and Width, Sepal (dt. Kelchblatt) Length and Width, all in the same length units – cm) were measured for 50 specimens each. Figure 1 shows the PCA-score plot (PC1 vs PC2) of the dataset. Figure 2 shows the associated loadings.

**PCA can be based on the covariance matrix (unscaled data) or the correlation matrix (scaled data), where scaling means to transform all data to zero mean, unit standard variation. Which option would you chose for the Iris data and why? **

(2018)

A

The data should be scaled to prevent bias, as the variables have different ranges
→ correlation matrix