Exam Questions Flashcards

1
Q

A suffix array of a string T is a sorted array of all of its suffixes.
Write down the suffix array of T=ACCTTGA.

A

Sorted suffixes: A (7), ACCTTGA (1), CCTTGA (2), CTTGA (3), GA (6), TGA (5), TTGA (4).

Suffix Array: [7, 1, 2, 3, 6, 5, 4]
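
A minimal Python sketch (function name and 1-based positions are our choices): it builds the suffix array naively by sorting all suffixes, which is fine for exam-sized strings but far too slow for genomes:

```python
def suffix_array(t: str) -> list[int]:
    # Pair each suffix with its 1-based start position, sort lexicographically.
    return [pos for _, pos in sorted((t[i:], i + 1) for i in range(len(t)))]

print(suffix_array("ACCTTGA"))  # [7, 1, 2, 3, 6, 5, 4]
```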

2
Q

Show how to search for the signal string P1=ATG with the help of the suffix array. Is P1 in T?

A

Binary search on the suffix array [7, 1, 2, 3, 6, 5, 4] for ATG:
Compare ATG with the middle suffix (CTTGA, position 3). ATG < CTTGA, so keep the left half [7, 1, 2].
Compare ATG with the middle suffix (ACCTTGA, position 1). ATG > ACCTTGA, so keep the right half [2].
Compare ATG with CCTTGA (position 2). ATG < CCTTGA, so the search interval is now empty.
No suffix of T starts with ATG, so P1 is not contained in T.
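
A minimal sketch, assuming the suffix array from the previous card; it uses Python's bisect to find the first suffix that is >= the pattern and checks whether it starts with the pattern (materializing all suffixes is wasteful for large T, but keeps the idea clear):

```python
import bisect

def contains(t: str, sa: list[int], p: str) -> bool:
    suffixes = [t[i - 1:] for i in sa]   # suffixes in sorted order (1-based sa)
    i = bisect.bisect_left(suffixes, p)  # first suffix >= p
    return i < len(suffixes) and suffixes[i].startswith(p)

sa = [7, 1, 2, 3, 6, 5, 4]
print(contains("ACCTTGA", sa, "ATG"))  # False: ATG does not occur in T
```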

3
Q

Describe all steps needed for aligning two sequences with the Needleman-Wunsch algorithm.

A

Initialization:
Create a matrix with dimensions (m+1) x (n+1).
Initialize the first row and column with gap penalties.

Scoring Scheme:
Define a scoring scheme for matches, mismatches, and gap penalties.

Filling the Matrix:
Fill the matrix cell by cell from the neighboring cells, using the recurrence F(i,j) = max( F(i-1,j-1) + s(a_i,b_j), F(i-1,j) + d, F(i,j-1) + d ), where s is the match/mismatch score and d the gap penalty.

Traceback:
Trace back through the matrix to determine the optimal alignment.
Start from the bottom-right corner (position (m,n)).
Move diagonally, up, or left according to which neighboring cell produced the current cell's score.

Alignment Output:
Construct the aligned sequences based on the traceback path, adding gaps as needed.

Score Calculation:
Calculate the alignment score based on the chosen scoring scheme and traceback path.

Output:
Output the aligned sequences and alignment score.
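
A minimal sketch of these steps in Python, assuming a simple scoring scheme (match +1, mismatch -1, gap -2; the scheme used in the course may differ):

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    # Initialization: (m+1) x (n+1) matrix, first row/column = gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Filling: each cell from its diagonal, upper, and left neighbors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # (mis)match
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    # Traceback from (m, n) back to (0, 0).
    aln_a, aln_b, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           F[i][j] == F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            aln_a.append(a[i - 1]); aln_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            aln_a.append(a[i - 1]); aln_b.append('-'); i -= 1
        else:
            aln_a.append('-'); aln_b.append(b[j - 1]); j -= 1
    # Output: aligned sequences and the alignment score.
    return ''.join(reversed(aln_a)), ''.join(reversed(aln_b)), F[m][n]

print(needleman_wunsch("ACCTTGA", "ACTGA"))
```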

4
Q

What is the key property of a (simple) Markov process?

A

The Markov property: the probability of transitioning to a future state depends only on the current state, not on the sequence of events that preceded it. In other words, the process is memoryless.
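
In symbols (a standard formulation, not specific to this course):

```latex
P(X_{t+1} = x \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_1 = x_1)
  = P(X_{t+1} = x \mid X_t = x_t)
```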

5
Q

In an HMM, what is meant by “hidden” layer?

A

In an HMM (Hidden Markov Model), the “hidden” layer refers to the sequence of hidden states that are not directly observable. These states represent underlying or latent variables that influence the observed data.

6
Q

In an HMM, what is meant by “observation”?

A

In an HMM, “observation” refers to the sequence of observable symbols or data points generated by the model. These observations are influenced by the underlying hidden states but are directly accessible or measurable.

7
Q

In an HMM, which of the following are generated by a Markov process:
 the sequence of states
 the sequence of observations
 both
 none

A

Only the sequence of states is generated by a Markov process: transitions between hidden states depend only on the current state. The observations are emitted conditionally on the hidden states, but the observation sequence itself does not satisfy the Markov property (the next observation is not determined by the current observation alone).
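
A minimal simulation sketch (a hypothetical two-state HMM with made-up parameters) showing that states are drawn from a Markov chain while each observation is emitted from the current state:

```python
import random

states = ["H", "L"]                     # hidden states (hypothetical)
trans = {"H": {"H": 0.7, "L": 0.3},     # P(next state | current state)
         "L": {"H": 0.4, "L": 0.6}}
emit  = {"H": {"A": 0.6, "B": 0.4},     # P(observation | current state)
         "L": {"A": 0.2, "B": 0.8}}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

s = random.choice(states)               # initial state (uniform, for simplicity)
path, obs = [], []
for _ in range(10):
    path.append(s)
    obs.append(sample(emit[s]))         # observation depends only on current state
    s = sample(trans[s])                # next state depends only on current state
print("".join(path), "".join(obs))
```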

8
Q

Please explain why sequence similarity searches (BLAST etc) are more sensitive for protein sequences than for DNA/RNA

A

• Degeneracy of the genetic code: many codon changes are synonymous, so DNA sequences diverge without changing the protein; comparing proteins removes this noise.
• Conservation of function across distantly related sequences: protein sequences diverge more slowly than the underlying DNA, so homology remains detectable over larger evolutionary distances.
• Structural constraints and evolutionary pressure: selection preserves residues needed for fold and function, and the chain sequence similarity → structural similarity → functional similarity is best captured at the protein level (e.g. via amino acid substitution matrices such as BLOSUM).
9
Q

Why can modern 'neural network'-based protein secondary structure prediction still be considered a 'sliding window' method?

A

It operates by moving a fixed-size window along the protein sequence and predicting the secondary structure of the central residue from the sequence information within that window. The neural network takes the amino acid sequence within the window as input and outputs the predicted secondary structure of the central residue.
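
A minimal sketch of the window encoding step, assuming a one-hot input encoding and an illustrative window size of 13 (the classifier itself is left abstract):

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

def windows(seq: str, w: int = 13):
    # Yield (one-hot encoded window, index of the central residue).
    half = w // 2
    padded = "-" * half + seq + "-" * half          # pad ends with a blank symbol
    for i in range(len(seq)):
        win = padded[i:i + w]                       # window centered on residue i
        onehot = [[1.0 if c == a else 0.0 for a in AA] for c in win]
        yield onehot, i

for x, i in windows("MKTAYIAKQR"):
    pass  # e.g. feed x to a trained model -> H / E / C for residue i
```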

10
Q

What other protein properties (besides 2ndary structure) can be addressed easily by sliding window methods? Name at least two more and estimate
appropriate window size ranges.

A

Solvent accessibility: the degree to which amino acids are exposed to the solvent; window size range: ~15-25 residues.
Transmembrane helices: window size range ~19-25 amino acids, to capture the typical length (~20 residues) of a membrane-spanning helix.

11
Q

Are sliding window methods useful for the prediction of tertiary structure? If yes, what window size range would be appropriate? If no, why not?

A

No. Tertiary structure prediction involves predicting the spatial arrangement of amino acids in three dimensions, which requires considering long-range interactions between residues that are far apart in the sequence; a fixed local window cannot capture these contacts.
Instead, methods such as homology modeling, ab initio modeling, or molecular dynamics simulations are commonly used for tertiary structure prediction.

12
Q

Homology Modeling vs Threading (Fold recognition) What are the fundamental differences between these two methods?

A

Homology modeling detects templates via sequence similarity and builds the model from a close homolog of known structure.
Fold recognition (threading) evaluates how well the query sequence fits known folds (structural similarity), so it can identify distant structural relationships that homology modeling misses.

13
Q

Homology Modeling vs Threading (Fold recognition) Specify a situation, where one method cannot be applied while the other one can

A

One situation where threading can be applied while homology modeling cannot is when the query protein does not have significant sequence similarity to any protein with a solved structure in the PDB database.

14
Q

Homology Modeling vs Threading (Fold recognition) What is the connection between the applicability of these methods and evolution?

A

Homology modeling relies on the assumption that proteins with similar sequences (i.e., close homologs) share similar structures due to their common evolutionary ancestry. Fold recognition exploits the fact that structure is conserved longer than sequence in evolution, so it can still be applied when sequence similarity has decayed beyond detection.

15
Q

By what method(s) have the structures stored in PDB been determined?

A

X-ray crystallography
Nuclear magnetic resonance (NMR) spectroscopy
Cryo-electron microscopy (cryo-EM)
(AlphaFold models are computational predictions, not experimentally determined structures, and are not what the PDB stores.)

16
Q

Not all structures in PDB are of the same ‚quality‘. What is a major criterion used for assessing the quality of an experimentally-solved structure?

A

A major criterion is the resolution for structures determined by X-ray crystallography. The resolution refers to the level of detail in the electron density map and is measured in angstroms (Å). A smaller Å value (higher resolution) indicates a clearer and more detailed map, which allows more accurate determination of the atomic coordinates and overall structure of the protein.

17
Q

The PDB database is growing fast. If (within a certain time period) the number of PDB entries doubles, will the database be approx. twice as useful for homology modeling or threading? Why/why not?

A

No, not necessarily: doubling the number of entries does not double the usefulness, because the gain depends on:
Quality of the entries
Diversity of the structures (many new entries are redundant with folds already represented)
Relevance to the query proteins
Database curation and annotation
Since most common folds are already covered, additional redundant structures add little for threading; homology modeling benefits mainly if the new entries cover previously unrepresented sequence families.

18
Q

The PDB database contains a structure that is overall very similar to the prediction query. How important is this for tertiary structure prediction via AlphaFold?

A

Importance: Helpful, but not strictly required.
AlphaFold's networks were trained on known PDB structures, and similar structures can be used as templates, but AlphaFold can produce accurate predictions even without an overall-similar structure, relying mainly on the multiple sequence alignment.

19
Q

There are many relatives of the query sequence in the protein sequence database. How important is this for tertiary structure prediction via AlphaFold?

A

Importance: Very important.
AlphaFold builds a multiple sequence alignment (MSA) from these relatives; the coevolution signal in a deep MSA is a key input, so having many relatives in the sequence database typically improves prediction accuracy substantially.

20
Q

It is known from which organism the query sequence is derived. How important is this for tertiary structure prediction via AlphaFold?

A

Importance: Not used.
The organism of origin is not an input to AlphaFold; the prediction is based on the query sequence, the MSA, and (optionally) structural templates.

21
Q

It is known if the protein is membrane-associated. How important is this for tertiary structure prediction via AlphaFold?

A

Importance: Helpful for interpretation, but not an input.
AlphaFold itself does not use this information; knowing that the protein is membrane-associated can, however, help in assessing and refining predictions, e.g. the expected orientation of transmembrane regions or membrane-associated domains.

22
Q

Predicting proteins with a machine learning approach:
S1: the entire dataset is used both for training and for validation.
S2: the human proteins are used for training, the mouse proteins for testing.
S3: protein names starting with letters A-M are used for training, N-Z for testing (irrespective of species).
Which strategy will lead to better-looking 'validation' results?

A

S1: Training and validating on the same data rewards overfitting, so it gives the best-looking (but meaningless) validation results.
S2: A more realistic split; validation measures generalization to a related but unseen species.
S3: Splitting by name introduces a bias via the naming convention; in addition, orthologs share names across species, so related proteins end up on the same side of the split.

Optional question on how to proceed: use cross-validation or randomization, or separate the data into training, validation, and test sets for model training and evaluation; see the sketch below.
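
A minimal sketch of such a proper evaluation, using synthetic stand-in data (model, features, and labels are placeholders, not the course's dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: 200 'proteins' with 10 numeric features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hold out a test set that is never touched during model development.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X_train, y_train, cv=5).mean())  # 5-fold CV, train only

model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # honest estimate, unlike strategy S1
```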

23
Q

Assume a simple linear model y = β0 + β1*x.
You transform x, creating a new predictor x' = x/2, and fit the model with x', thus getting new regression coefficients β0' and β1'. How does the transformation of x affect β0' and β1'?

A

Since x' = x/2, we have x = 2x', so y = β0 + β1*x = β0 + (2*β1)*x'.
Intercept: unchanged (β0' = β0), because x = 0 exactly when x' = 0.
Slope: doubled (β1' = 2*β1), because a one-unit change in x' corresponds to a two-unit change in x.
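
A quick numerical check on illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, 100)  # true beta0 = 3, beta1 = 2

b1, b0 = np.polyfit(x, y, 1)        # fit with x
b1p, b0p = np.polyfit(x / 2, y, 1)  # fit with x' = x/2
print(b0, b1)    # ~3.0, ~2.0
print(b0p, b1p)  # intercept ~3.0 (unchanged), slope ~4.0 (doubled)
```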

24
Q

Homoscedasticity is often not given for the raw predictor but holds after a log transformation (i.e., using x' = log(x)).
Under which circumstances could you use the transformation log(x) in a simple linear model?

A

When it restores homoscedasticity (constant variance of the residuals):
1. x must be strictly positive, so that log(x) is defined; the spread of the residuals should become consistent across the range of the predictor.
2. The relation between x and the response (e.g. the size of mice in the course example) is non-linear but becomes linear after the log transformation.

25
Q

Which of the following models is/are a general linear model?
1. y = ∑ βi * xi
2. log ( y ) = β1 * x1+ β2*x2
3. y = sin( β1 * x1) + cos ( β2 * x2 )
4. y = a + β * x^2

A

1: is a general linear model.
2: is not a general linear model, since the response is log(y) rather than y itself.
3: is not a general linear model, since the parameters appear inside the non-linear sine and cosine functions.
4: is a general linear model: x^2 is a non-linear function of x, but the model is still linear in the parameters a and β (treat x^2 as a new predictor).
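
A short illustration of point 4 on synthetic data: because the model is linear in the parameters, it can be fit with ordinary linear least squares by treating x^2 as the predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 100)
y = 1.5 + 0.8 * x**2 + rng.normal(0, 0.3, 100)  # true a = 1.5, beta = 0.8

# Design matrix [1, x^2]: the model is linear in the parameters (a, beta).
X = np.column_stack([np.ones_like(x), x**2])
a_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(a_hat, beta_hat)  # ~1.5, ~0.8
```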

26
Q

Why does DESeq2 shrink the dispersion estimate towards the average dispersion of similar genes?

A

DESeq2 is used for identifying differentially expressed genes from RNA-seq count data.
With the typical small number of replicates, per-gene dispersion estimates are very noisy. DESeq2 therefore uses "shrinkage estimation": it moves each gene's dispersion estimate towards the trend of genes with similar mean expression, borrowing information across genes.
This stabilizes the estimates, avoids false positives caused by underestimated dispersions, and improves the accuracy and statistical power of the differential expression analysis.

27
Q

What is FDR?

A

The FDR (false discovery rate) is the expected fraction of false positives among the results that are called significant.
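
A minimal sketch of the Benjamini-Hochberg procedure, a standard method for controlling the FDR at level q (p-values here are illustrative):

```python
def benjamini_hochberg(pvals, q=0.05):
    # Sort p-values; find the largest rank k with p_(k) <= k/m * q; reject 1..k.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    return {order[j] for j in range(k)}  # indices of rejected hypotheses

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2, 0.7]))  # {0, 1}
```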

28
Q

What are the loadings of a Principal Component Analysis (PCA)?

A

The contribution of each original parameter to a given PC.

29
Q

What are Eigenvalues?

A

The variance that is explained by each PC.
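
A short numpy sketch connecting the last two cards, on synthetic data: the eigenvectors of the covariance matrix hold the loadings, and the eigenvalues give the variance explained per PC:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.5]])  # correlated data
Xc = X - X.mean(axis=0)                       # center the data

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

print(eigvecs[:, 0])            # loadings: contribution of each variable to PC1
print(eigvals / eigvals.sum())  # fraction of variance explained by each PC
```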

30
Q

What are residuals?

A

The differences between the actual data points and their projections onto the principal components, i.e. the part of the data the PCs do not explain.
Residuals are not loadings in PCA.

31
Q

Which PC explains the most variance and which explains the least variance?

A

The first PC explains the most and the last the least.