Bioinformatics Exam Questions Flashcards

1
Q

During Sanger sequencing, it is commonly observed that base call quality deteriorates toward the terminal regions of sequencing reads. What is the main technical factor contributing to this decline
in data quality at the read termini?

A) Low concentration of chain-terminating nucleotides, leading to incomplete termination events.
B) Increased likelihood of dNTP misincorporation near the end of the sequence.
C) Variability in fragment mass and electrophoretic mobility affecting resolution.
D) Reduced signal intensity due to the lower quantity of fragments.

A

D) Reduced signal intensity due to the lower quantity of fragments.

2
Q

Select all advantages of Sanger sequencing.

A) Enhanced scalability and throughput suitable for high-volume sequencing projects.
B) Superior accuracy in homopolymeric tracts due to lower indel error rates.
C) Ability to generate the longest read lengths among sequencing technologies.
D) Greater cost efficiency when performing large-scale sequencing.

A

B) Superior accuracy in homopolymeric tracts due to lower indel error rates.

3
Q

What is the primary functional role of Illumina adapters within the sequencing process?

A) They incorporate fluorescent markers essential for the detection of base incorporations during
sequencing.
B) They facilitate the binding of DNA fragments to complementary oligonucleotides on the flow cell.
C) They protect DNA fragments from degradation during the sequencing process by providing a
stable binding interface.
D) They allow for the simultaneous sequencing of multiple DNA fragments by providing unique barcode sequences.

A

B) They facilitate the binding of DNA fragments to complementary oligonucleotides on the flow cell.

4
Q

Which issue is generally not associated with standard base call accuracy concerns for Illumina sequencing?

A) Cross-talk between adjacent detection channels.
B) Delayed or incomplete polymerase activity.
C) Variability in the removal of fluorescent terminator molecules.
D) Temperature fluctuations affecting sequencing cycle rates.

A

D) Temperature fluctuations affecting sequencing cycle rates.

5
Q

What is sequencing coverage, and how does it affect downstream genomic analyses?

A) The proportion of the total genome size that has at least one read aligned, affecting assembly
completeness.
B) The mean number of sequencing reads encompassing each nucleotide position, influencing confidence in base accuracy.
C) The evenness of read distribution across the genome, impacting variant detection reliability.
D) The overall fidelity of nucleotide identification during sequencing, affecting error rates.
E) The maximum read length achieved during sequencing runs, affecting the ability to span large
genomic features.

A

B) The mean number of sequencing reads encompassing each nucleotide position, influencing confidence in base accuracy.
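
As a rough illustration of mean coverage, the expected depth can be estimated as total sequenced bases divided by genome size. This is a minimal sketch; the function name and the run parameters are hypothetical, not taken from the exam material.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


# Hypothetical run: 2 million 150 bp reads over a 5 Mb bacterial genome.
print(mean_coverage(2_000_000, 150, 5_000_000))  # 60.0
```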

6
Q

In high-throughput sequencing data, which of the following provides essential per-base error probability metrics for quality control?
A) The N50 contiguity statistic.
B) The sequence in FASTQ file format.
C) Phred scores embedded in FASTQ file format.
D) Coverage depth information within assembled contigs.
E) Per-sequence GC content from the sequencing summary file.

A

C) Phred scores embedded in FASTQ file format.
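
To make the link between Phred scores and error probabilities concrete, here is a small sketch. It assumes the common Phred+33 ASCII encoding of FASTQ quality strings; the function names and the example quality string are illustrative only.

```python
def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = decode_quality("II5+!")  # made-up quality string
print(scores)                     # [40, 40, 20, 10, 0]
for q in scores:
    print(q, phred_to_error_prob(q))
```

A score of 20 corresponds to a 1-in-100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.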

7
Q

What is the immediate biochemical consequence when a ddNTP is incorporated by DNA polymerase?

A It terminates DNA strand synthesis, preventing further elongation.
B It allows the DNA strand to continue synthesizing until the next dNTP is encountered.
C It enhances sequencing accuracy by preventing incorporation of incorrect nucleotides.
D It labels the DNA strand with a fluorescent marker that signals successful base pairing.

A

A It terminates DNA strand synthesis, preventing further elongation.

8
Q

What is the main purpose of performing bridge amplification during high-throughput sequencing?

A To minimize the occurrence of sequencing errors by proofreading base incorporation through
amplification cycles.
B To generate localized clonal clusters of DNA fragments for robust signal detection.
C To enable high-throughput sequencing by amplifying template DNA fragments, ensuring sufficient quantities for downstream enzymatic reactions.
D To selectively amplify longer DNA fragments, allowing improved sequencing coverage.
E To enhance base calling accuracy by reducing background noise during sequencing.

A

B To generate localized clonal clusters of DNA fragments for robust signal detection.

9
Q

Which of the following best defines a contig in the context of genome assembly?

A A contiguous region within the genome that exhibits high sequencing coverage.
B A sequence of nucleotides constructed by overlapping sequencing reads.
C A segment of the genome prone to errors due to consistently low base quality scores.
D A repetitive genomic region that complicates the assembly process.

A

B A sequence of nucleotides constructed by overlapping sequencing reads.

10
Q

What is the most accurate definition of a “k-mer”, and how does it contribute to the construction of
de Bruijn graphs in genome assembly algorithms?

A A substring of fixed length derived from reads; k-mers form the nodes of the de Bruijn graph.
B A short sequence of nucleotides from a reference genome; k-mers represent the edges of the de Bruijn graph.
C A DNA fragment of variable length for constructing nodes and edges between reads in de Bruijn
graphs.
D Sequences with lengths less than five derived from reads to build a de Bruijn graph.

A

A A substring of fixed length derived from reads; k-mers form the nodes of the de Bruijn graph.

11
Q

How is the N50 value defined, and what does it convey about the quality of an assembly?

A The sum of all contig lengths divided by the number of contigs, reflecting the average contig size
across the assembly.
B The length of the shortest contig among the longest contigs (sorted from longest to shortest)
whose cumulative length reaches 50% of the total assembly length.
C The contig length at which 50% of the total reads have been mapped during the assembly process,
reflecting sequencing coverage uniformity.
D The number of contigs whose cumulative length adds up to 50% of the total genome size, reflecting contig distribution in the assembly.

A

B The length of the shortest contig among the longest contigs (sorted from longest to shortest)
whose cumulative length reaches 50% of the total assembly length.
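
The N50 definition can be checked with a few lines of code. This is a minimal sketch with made-up contig lengths; the function name is hypothetical.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length of the contig at which the running total, taken from longest
    to shortest, first reaches half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Made-up assembly: total length 300, and 80 + 70 = 150 reaches 50%.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```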

12
Q

Which of the following genomic characteristics is most likely to increase the complexity of de novo
genome assembly efforts?

A Genomic regions with high GC content.
B Genomes with extensive repetitive elements and high repeat content.
C Sequencing data with minimal depth of coverage across the genome.
D Use of sequencing platforms that generate only short read lengths.

A

B Genomes with extensive repetitive elements and high repeat content.

13
Q

What fundamental principle is the basis for building a de Bruijn graph from sequencing reads?

A Aligning sequencing reads to an existing reference genome to identify overlaps and reconstruct
the target sequence.
B Identifying and utilizing overlapping k-mers to construct nodes and edges within the graph.
C Aggregating short sequencing reads to form longer contigs by prioritizing overlaps between
reads.
D Incorporating base quality scores into the assembly algorithm to weigh k-mers based on read
accuracy.

A

B Identifying and utilizing overlapping k-mers to construct nodes and edges within the graph.

14
Q

In graph-based genome assembly approaches, which of the following actions is typically not incorporated into the graph traversal algorithms?

A Initiating traversal from nodes with optimal coverage levels to avoid erroneous low-coverage
paths.
B Extending the current path in the graph until a termination node, such as a dead-end or circular
path, is encountered.
C Employing a linear search algorithm to examine all nodes in islands for potential paths systematically.
D Implementing backtracking strategies to resolve ambiguous branches or repetitive regions in the
sequencing graph.

A

C Employing a linear search algorithm to examine all nodes in islands for potential paths systematically.

15
Q

Within De Bruijn graph-based genome assembly frameworks, selecting shorter k-mer lengths facilitates which of the following aspects?

A Enhanced resolution in repetitive genomic regions by reducing the number of ambiguous nodes.
B Increased number of overlaps due to the higher frequency of shorter k-mers, which can complicate the identification of unique sequences.
C Improved management of regions with sparse sequencing coverage, as shorter k-mers provide
more reliable read alignment.
D Increased connectivity in the assembly graph, allowing for better handling of sequencing errors
and structural variations.

A

B Increased number of overlaps due to the higher frequency of shorter k-mers, which can complicate the identification of unique sequences.

OR

C Improved management of regions with sparse sequencing coverage, as shorter k-mers provide
more reliable read alignment.

16
Q

When determining the most favorable pathway for contig extraction in genome assembly, which of
the following factors is considered least critical?

A Length of the traversed sequences.
B Uniformity of read coverage across the path.
C Frequency of branching points within the graph.
D Existence of unique, linear sequencing paths.

A

C Frequency of branching points within the graph.

17
Q

What is the primary rationale for employing paired-end sequencing reads to assemble genomes?

A To decrease computational processing time during assembly.
B To supply information regarding inter-contig spacing.
C To rectify low-quality nucleotide bases at read termini.
D To prevent misassembly of repetitive elements.

A

B To supply information regarding inter-contig spacing.

18
Q

What is the main repercussion of retaining untrimmed adapter sequences?

A Causing misalignment of reads, resulting in the generation of shorter contigs.
B Leading to erroneous gene annotations and the omission of open reading frames.
C Inflating estimates of the total genome size.
D Introducing inaccuracies in assembly due to incorrect indels.

A

D Introducing inaccuracies in assembly due to incorrect indels.

19
Q

Select all statements regarding the SPAdes genome assembly tool that are incorrect.

A SPAdes utilizes De Bruijn graph structures for genome assembly.
B SPAdes is capable of processing both short and long sequencing reads during assembly.
C SPAdes constructs a singular k-mer size graph to streamline the assembly procedure.
D SPAdes effectively assembles bacterial genomes even in regions with low sequencing coverage.

A

C SPAdes constructs a singular k-mer size graph to streamline the assembly procedure.

20
Q

In the context of prokaryotic genome assembly, what do bulges in the assembly graph typically represent?

A Errors or misassemblies that cause small divergences in sequence paths.
B Gaps between reads that could not be closed by paired-end reads.
C Repeats in the genome that are unresolved during assembly.
D Regions of high GC content that are difficult to sequence accurately.

A

A Errors or misassemblies that cause small divergences in sequence paths.

21
Q

Which methodological approach does Prokka employ to detect open reading frames (ORFs) within
prokaryotic genomes?

A Utilization of Hidden Markov Models (HMMs).
B Precise alignment of ribosomal RNA gene sequences.
C Identification based on learned codon motifs.
D Alignment with gene databases.

A

C Identification based on learned codon motifs.

22
Q

What does “functional annotation” primarily entail within the scope of gene annotation?

A Inferring the biological roles of genes through sequence similarity analyses.
B Detecting open reading frames (ORFs) within genomes.
C Constructing contigs from cleaned sequencing data.
D Identifying gene start and stop codons using probabilistic modeling techniques.

A

A Inferring the biological roles of genes through sequence similarity analyses.

23
Q

Assemble a genome sequence by using the greedy algorithm and the following reads: ATTAGACCTG,
CCTGCCGGAA, AGACCTGCCG, GCCGGAATAC

A

First merge (1 & 3): ATTAGACCTGCCG

Second merge (2 & 4): CCTGCCGGAATAC

Third merge (5 & 6)

This is our final merge, resulting in ATTAGACCTGCCGGAATAC.
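
The merge order above can be reproduced with a small greedy assembler. This is a sketch under simple assumptions (exact suffix-prefix overlaps only, first pair found wins ties); the function names are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str]) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best_len:
                    best_len, best_i, best_j = overlap(a, b), i, j
        if best_len == 0:
            break  # no overlaps left; remaining sequences stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ['ATTAGACCTGCCGGAATAC']
```

Every merge in this example uses a 7-base overlap, so the tie-breaking rule decides the order of merges but not the final sequence.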

24
Q

Build a De Bruijn graph with k_edge = 5 from the following reads: GATTAC, TACAGATT, AGATTAC, TACCGG,
GGATTA. Then, using your De Bruijn graph, determine the optimal contig.

A

Optimal contig: TACAGATTAC
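
The graph for this exercise can be tabulated programmatically. The sketch below assumes that the edge length of 5 means each 5-mer is an edge joining its 4-base prefix node to its 4-base suffix node; the function name is hypothetical, and the code only builds the graph, it does not choose a path.

```python
from collections import Counter


def de_bruijn_edges(reads: list[str], k: int) -> Counter:
    """Count each k-mer as an edge from its (k-1)-mer prefix to its suffix."""
    edges: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


reads = ["GATTAC", "TACAGATT", "AGATTAC", "TACCGG", "GGATTA"]
edges = de_bruijn_edges(reads, k=5)
for (src, dst), count in sorted(edges.items()):
    print(f"{src} -> {dst}  (seen {count}x)")
```

Following the edges TACA, ACAG, CAGA, AGAT, GATT, ATTA, TTAC spells TACAGATTAC, which passes through the most frequently observed edges (GATT to ATTA is seen three times).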

25
Q

Select all aspects that distinguish the Needleman-Wunsch from the Smith-Waterman algorithm.

A Needleman-Wunsch performs global alignment, ensuring the entire length of both sequences is
aligned.
B Needleman-Wunsch is inherently parallelizable, unlike Smith-Waterman.
C Smith-Waterman is optimized for aligning sequences of significantly different lengths.
D Smith-Waterman allows for multiple optimal local alignments to be identified simultaneously.
E Needleman-Wunsch uses affine gap penalties, whereas Smith-Waterman employs linear gap
penalties.
F Needleman-Wunsch initializes the scoring matrix with gap penalties along the first row and column.

A

A Needleman-Wunsch performs global alignment, ensuring the entire length of both sequences is
aligned.

C Smith-Waterman is optimized for aligning sequences of significantly different lengths.

D Smith-Waterman allows for multiple optimal local alignments to be identified simultaneously.

F Needleman-Wunsch initializes the scoring matrix with gap penalties along the first row and column.

26
Q

How are affine gap penalties characterized within the sequence alignment algorithms?

A Applying a uniform penalty to each gap irrespective of its length.
B Assigning a higher penalty for initiating a gap and a lower penalty for each gap extension.
C Exempting terminal gaps from penalty assignments.
D Imposing progressively larger penalties as the length of the gap increases.

A

B Assigning a higher penalty for initiating a gap and a lower penalty for each gap extension.
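
A short comparison of affine and linear gap costs may help. This sketch assumes one common convention, in which the first gap position pays the opening penalty and each further position pays the extension penalty; other tools charge opening plus extension for the first position. The penalty values are arbitrary.

```python
def affine_gap_penalty(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Opening cost for the first gap position, extension cost for the rest."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def linear_gap_penalty(length: int, per_position: int = 10) -> int:
    """Same cost for every gap position, regardless of gap length."""
    return max(length, 0) * per_position


for n in (1, 2, 5):
    print(n, affine_gap_penalty(n), linear_gap_penalty(n))
```

With these values a single 5-base gap costs 14 under the affine scheme but 50 under the linear one, so affine penalties favor a few long gaps over many short ones.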

27
Q

What is the principal objective of performing a multiple sequence alignment (MSA)?

A Identifying conserved functional domains across diverse species.
B Executing fast local alignments between pairwise sequences.
C Detecting point mutations within individual DNA sequences.
D Comparing tertiary structures of proteins within a singular organism.

A

A Identifying conserved functional domains across diverse species.

28
Q

Perform a Needleman-Wunsch and Smith-Waterman alignment with the following sequences: AATCG
and AACG. Use a scoring scheme of 1 for match, −1 for mismatch, and −2 for gap for both. Show all
possible tracebacks and their respective alignments.

A

NW:
A A T C G
| |   | |
A A - C G

SW:
A A
| |
A A

C G
| |
C G

See the answer key for the numbers in the scoring array.
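
The scores behind these alignments can be recomputed with the same scoring scheme (match 1, mismatch -1, gap -2). The sketch below fills both dynamic-programming matrices but does not perform the traceback; the function names are hypothetical.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score; first row and column hold cumulative gap penalties."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        H[i][0] = i * gap
    for j in range(1, len(b) + 1):
        H[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H[-1][-1]


def sw_best_cells(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Local alignment: scores are floored at 0; returns (max score, cells holding it)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    best = max(max(row) for row in H)
    cells = [(i, j) for i, row in enumerate(H) for j, v in enumerate(row) if v == best]
    return best, cells


print(nw_score("AATCG", "AACG"))       # 2
print(sw_best_cells("AATCG", "AACG"))  # (2, [(2, 2), (5, 4)])
```

The two maximal Smith-Waterman cells correspond to the two local alignments shown above: AA with AA ending at cell (2, 2), and CG with CG ending at cell (5, 4).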

29
Q

Which of the following best describes the main difference between transcriptomics and genomics?

A Genomics studies the DNA sequence while transcriptomics analyzes RNA transcripts.
B Genomics focuses on protein function while transcriptomics deals with gene expression.
C Transcriptomics can predict protein structures while genomics predicts gene sequences.
D Transcriptomics focuses on the non-coding regions of DNA while genomics focuses on exons.

A

A Genomics studies the DNA sequence while transcriptomics analyzes RNA transcripts.

30
Q

Which of the following statements best describes the primary function of the RNA Integrity Number
(RIN) in evaluating RNA samples for applications like next-generation sequencing and microarray
analysis?

A It quantifies the absolute amount of mRNA in a sample.
B It assesses the extent of RNA degradation by analyzing fragmentation patterns.
C It measures RNA purity by detecting contaminants like proteins and DNA.
D It evaluates the efficiency of reverse transcription in cDNA synthesis.

A

B It assesses the extent of RNA degradation by analyzing fragmentation patterns.

31
Q

What is the key benefit of using the Burrows-Wheeler Transform (BWT) in sequence alignment algorithms?

A It enables fast and memory-efficient searching of large genomes.
B It helps detect novel splice sites during RNA-seq analysis.
C It eliminates the need for gap penalties in sequence alignment.
D It provides an efficient way to construct phylogenetic trees.

A

A It enables fast and memory-efficient searching of large genomes.

32
Q

Which of the following describes the search strategy in hash-based alignment?

A Finding a short exact match and then attempting to extend it in both directions.
B Using suffix arrays to find starting indices of k-mers.
C Constructing a sequence alignment by iterating through each possible alignment.
D Immediately calculating the optimal alignment score.

A

A Finding a short exact match and then attempting to extend it in both directions.

33
Q

How do suffix trees improve sequence alignment in comparison to hash-based methods?

A Suffix trees allow for faster exact pattern matching.
B Suffix trees can align sequences in parallel, speeding up the process.
C Suffix trees handle larger genomes with lower memory requirements.
D Suffix trees remove the need for gap penalties in alignment algorithms.

A

A Suffix trees allow for faster exact pattern matching.

34
Q

Given the initial string banana with the sentinel $ appended (banana$), what are the zero-based starting indices of the suffix array?
A [6, 5, 3, 1, 0, 4, 2]
B [6, 5, 3, 1, 0, 2, 4]
C [6, 3, 5, 1, 4, 2]
D [1, 3, 0, 5, 2, 4]

A

A [6, 5, 3, 1, 0, 4, 2]
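
This answer can be verified with a naive suffix-array construction. The sketch assumes the sentinel $ is appended and sorts before every letter, which holds for ASCII ordering; it is far slower than the algorithms real aligners use.

```python
def suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


print(suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
```

In sorted order the suffixes are $, a$, ana$, anana$, banana$, na$, nana$.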

35
Q

Perform the Burrows-Wheeler Transform (BWT) of the string CANADA. Show all intermediate steps.

A

Rotations:
CANADA$
ANADA$C
NADA$CA
ADA$CAN
DA$CANA
A$CANAD
$CANADA

Sorted Rotations:
$CANADA
A$CANAD
ADA$CAN
ANADA$C
CANADA$
DA$CANA
NADA$CA

Transform:
ADNC$AA
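
The same steps can be written as a short function: build all rotations, sort them, and read off the last column. This is a didactic sketch that assumes the sentinel $ sorts before every other character; practical implementations derive the BWT from a suffix array instead.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("CANADA"))  # ADNC$AA
```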

36
Q

Given the Burrows-Wheeler Transform (BWT) GT$ATCTGCGA, determine the original string. Show all
of your work.

A

L column: G₀ T₀ $₀ A₀ T₁ C₀ T₂ G₁ C₁ G₂ A₁
F column: $₀ A₀ A₁ C₀ C₁ G₀ G₁ G₂ T₀ T₁ T₂
Original string: ATGGTCTACG$
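
The walk between the L and F columns (the LF mapping) can be sketched in code. It assumes the sentinel $ sorts first, so row 0 of the sorted rotations begins with $; the function name is hypothetical.

```python
def inverse_bwt(bwt: str, sentinel: str = "$") -> str:
    """Recover the original string (including sentinel) using the LF mapping."""
    # Rank of each character among equal characters in the L column.
    seen: dict[str, int] = {}
    ranks = []
    for ch in bwt:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Row in the F column where each character's block starts.
    first: dict[str, int] = {}
    total = 0
    for ch in sorted(seen):
        first[ch] = total
        total += seen[ch]
    # Start at row 0 (F = sentinel) and walk backwards through the text.
    row, out = 0, []
    while bwt[row] != sentinel:
        out.append(bwt[row])
        row = first[bwt[row]] + ranks[row]
    return "".join(reversed(out)) + sentinel


print(inverse_bwt("GT$ATCTGCGA"))  # ATGGTCTACG$
```

Starting from the row whose F entry is $, the walk reads G, C, A, T, C, T, G, G, T, A from the L column; reversing that sequence gives the original string.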

37
Q

What is the key distinction between pseudoalignment and full alignment?

A Pseudoalignment is slower but more accurate than full alignment.
B Pseudoalignment only determines transcript compatibility.
C Full alignment does not account for sequencing errors.
D Pseudoalignment requires fewer transcripts in the reference.

A

B Pseudoalignment only determines transcript compatibility.

38
Q

Which of the following best explains the purpose of Salmon’s generative model in RNA-seq data
analysis?

A It identifies the precise position of each read on the transcript.
B It generates new reads from known transcripts for quality control.
C It models how sequencing reads are generated from a population of transcripts.
D It adjusts the effective length of each transcript to correct for biases.

A

C It models how sequencing reads are generated from a population of transcripts.

39
Q

What is the main goal of the Expectation-Maximization (EM) algorithm used in Salmon’s inference
process?

A To minimize the variance in transcript abundance estimates.
B To optimize the likelihood of observing the RNA-seq reads.
C To identify the most compatible transcripts for each read.
D To estimate the total number of fragments generated by the sequencing experiment.

A

B To optimize the likelihood of observing the RNA-seq reads.
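
To illustrate how EM maximizes the likelihood of the observed reads, here is a toy version. It is not Salmon's implementation: it assumes every transcript has the same effective length and that each read is described only by the set of transcripts it is compatible with. All names and data are hypothetical.

```python
def em_abundances(compat: list[set[int]], n_transcripts: int, n_iter: int = 50) -> list[float]:
    """Toy EM: compat holds one set of compatible transcript indices per read."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts in
        # proportion to the current abundance estimates.
        for transcripts in compat:
            denom = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += theta[t] / denom
        # M-step: new abundances are the expected read fractions.
        theta = [e / len(compat) for e in expected]
    return theta


# Two reads unique to transcript 0, one unique to transcript 1, one ambiguous.
reads = [{0}, {0}, {0, 1}, {1}]
print(em_abundances(reads, n_transcripts=2))  # approximately [0.667, 0.333]
```

The ambiguous read is shared in a 2:1 ratio, matching the ratio of the uniquely assigned reads.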

40
Q

How does Salmon’s two-phase inference process achieve both speed and accuracy?

A By running the Expectation-Maximization algorithm in both the online and offline phases.
B By performing a quasi-mapping in the online phase and refining results with full alignment in the
offline phase.
C By using quasi-mapping for initial abundances and refining these estimates using more accurate
methods.
D By combining read mapping and assembly-based quantification in both phases.

A

C By using quasi-mapping for initial abundances and refining these estimates using more accurate
methods.

41
Q

What is the significance of the transcript-fragment assignment matrix, 𝑍, in Salmon’s model?

A It records the exact positions of all fragments within each transcript.
B It is used to map each fragment to all possible compatible transcripts.
C It stores the probabilities of fragments originating from each transcript.
D It provides the nucleotide sequences of all transcripts.

A

B It is used to map each fragment to all possible compatible transcripts.

42
Q

Why is the Negative Binomial distribution commonly used in differential gene expression analysis of
RNA-seq data?

A It accounts for the large number of zeros often observed in RNA-seq data.
B It models the overdispersion in the data where the variance exceeds the mean.
C It ensures that only the most highly expressed genes are analyzed.
D It simplifies the analysis by assuming equal variance across all genes.

A

B It models the overdispersion in the data where the variance exceeds the mean.
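
Overdispersion can be expressed numerically. The sketch below assumes one common parameterization of the Negative Binomial, in which the variance equals the mean plus a dispersion term times the squared mean; the dispersion value is made up.

```python
def poisson_variance(mean: float) -> float:
    """Under a Poisson model the variance equals the mean."""
    return mean


def nb_variance(mean: float, dispersion: float) -> float:
    """Negative Binomial variance: mean + dispersion * mean ** 2 (assumed form)."""
    return mean + dispersion * mean ** 2


mean = 100.0
print(poisson_variance(mean))     # 100.0
print(nb_variance(mean, 0.25))    # 2600.0
```

With a dispersion of zero the two models agree; any positive dispersion makes the variance exceed the mean, and the gap widens for highly expressed genes.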

43
Q

In RNA-Seq data analysis, what does a very low p-value suggest when comparing gene expression
between two conditions?

A The observed differences in gene expression are likely due to random chance.
B The observed differences in gene expression are statistically significant.
C The gene expression is identical between the two conditions.
D The statistical model is not suitable for the data.

A

B The observed differences in gene expression are statistically significant.

44
Q

Which of the following best describes the TPM normalization method in RNA-Seq?

A It adjusts for differences in gene length and sequencing depth.
B It normalizes based on the total number of mapped reads, ignoring gene length.
C It uses logarithmic scaling to adjust for gene expression levels.
D It is identical to the FPKM method but includes a correction for GC content.

A

A It adjusts for differences in gene length and sequencing depth.
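
The two adjustments in TPM can be seen in a short calculation: counts are first divided by gene length, then scaled so that each sample sums to one million. This is a sketch with made-up counts and lengths; the function name is hypothetical.

```python
def tpm(counts: list[float], lengths_kb: list[float]) -> list[float]:
    """Transcripts per million: length-normalize first, then depth-normalize."""
    reads_per_kb = [c / length for c, length in zip(counts, lengths_kb)]
    scale = sum(reads_per_kb) / 1_000_000
    return [rpk / scale for rpk in reads_per_kb]


# Three hypothetical genes whose counts scale exactly with their lengths.
values = tpm([10, 20, 30], [1.0, 2.0, 3.0])
print(values)       # three equal values of about 333333.3
print(sum(values))  # about 1,000,000
```

Because the counts here are proportional to gene length, all three genes receive the same TPM, which shows the length correction at work.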

45
Q

Select all factors that could contribute to overdispersion in normalized RNA-seq count data.

A Variability in gene length across the transcriptome.
B Presence of unmodeled confounding covariates such as batch effects.
C Heterogeneous RNA integrity (RIN) scores among samples.
D Differential isoform expression is not accounted for in the analysis.
E Technical variability in sequencing depth among samples.
F Low detection rate of highly expressed genes.
G Uniform gene expression levels across all genes and samples.
H Nonuniform coverage across inserts.

A

B Presence of unmodeled confounding covariates such as batch effects.
C Heterogeneous RNA integrity (RIN) scores among samples.
D Differential isoform expression is not accounted for in the analysis.
E Technical variability in sequencing depth among samples.
H Nonuniform coverage across inserts.

46
Q
A