Exam 1 Homework Flashcards

Question 1

Q

The Human Genome Project started in the year ______________.

Answer

A

1990

Question 2

Q

Dr. Green presented five domains of genome research:

Answer

A

understanding the structure of genomes
understanding the biology of genomes
understanding the biology of disease
advancing the science of medicine
improving the effectiveness of healthcare

Question 3

Q

What is the largest current bottleneck in genomics? Three were mentioned, but all three fall under one broad category.

Answer

A

the data analysis bottleneck

Question 4

Q

Dr. Green talks about “Why the world has changed” in the last 10 years. What are the five areas that
he described?

Answer

A

genomics
electronic health records
technologies
data science
participant partnerships

Question 5

Q

Massively parallel DNA sequencing instruments all have the following steps/characteristics:

Answer

A

a library obtained by either amplification or ligation with custom linkers
library fragments amplified on a solid surface with adapters
direct step-by-step detection of each nucleotide base
lots of reactions detected per instrument run
digital read type that enables direct quantitative comparisons
shorter read lengths than capillary sequencers

Question 6

Q

In the Illumina process, the nucleotides are very specialized. They have two key attributes:

Answer

A

a fluor that is specific to the identity of the nucleotide
the 3’ hydroxyl group is blocked with a chemical blocker

Question 7

Q

__________________ of reads to the reference sequence is the first step to identify variation of all types.

Answer

A

alignment

Question 8

Q

Long read sequencers such as the PacBio instrument are a departure from short read sequencers such
as Illumina. What is the first major requirement for these long read technologies that is different from the short read technologies?

Answer

A

high-molecular-weight genomic DNA, to support very long read lengths

Question 9

Q

A typical workflow of whole exome sequencing analysis consists of the following steps:

Answer

A

raw data QC
preprocessing
mapping
post-alignment processing
variant calling
annotation
prioritization

Question 10

Q

Standard preprocessing procedure includes:

Answer

A

3’ end adapter removal
trimming of low quality bases at the ends of the reads
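
The two preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names are hypothetical, the adapter is matched exactly (real trimmers tolerate mismatches and partial adapters), the quality string is assumed to be Phred+33 encoded, and the cutoff of Q20 is an assumed default.

```python
def trim_adapter(seq, qual, adapter):
    """Remove a 3' adapter and everything after it (exact match only)."""
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx], qual[:idx]
    return seq, qual


def trim_low_quality_ends(seq, qual, min_q=20, offset=33):
    """Strip bases with quality below min_q from both ends of the read."""
    scores = [ord(c) - offset for c in qual]  # assumes Phred+33 encoding
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


# Toy read: 8 genomic bases followed by an adapter fragment
read = "ACGTACGT" + "AGATCGGAAGAGC"
quals = "#IIIIII#" + "I" * 13
seq, qual = trim_adapter(read, quals, "AGATCGGAAGAGC")
seq, qual = trim_low_quality_ends(seq, qual)
print(seq, qual)
```

Adapter removal runs first here so that the low-quality trimming sees the true 3' end of the insert.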

Question 11

Q

Many different tools have been developed for short-read mapping. In general, they use two
algorithms for aligning sequences:

Answer

A

Burrows-Wheeler Transform (BWT)
Smith-Waterman (SW) dynamic programming
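
To make the two ideas concrete, here is a toy Python sketch of each. It is for study purposes only: real aligners build a compressed index on top of the BWT rather than sorting rotations, and they use heavily optimized Smith-Waterman variants. The function names and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    s = text + terminator  # terminator must sort before every other symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(bwt("banana"))
print(smith_waterman_score("TTACGTTT", "ACGT"))
```

The BWT groups identical characters together, which is what makes the indexed reference compact and fast to search; Smith-Waterman instead fills a scoring matrix and is slower but tolerant of mismatches and gaps.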

Question 12

Q

Of the sequence aligners they evaluated, which two were the fastest?

Answer

A

Bowtie 2
BWA

Question 13

Q

After mapping reads to the reference genome, a three-step post-alignment processing procedure is
recommended to minimize the artifacts that may affect the quality of downstream variant calling. It
consists of:

Answer

A

read duplicate removal
indel realignment
base quality score recalibration (BQSR)

Question 14

Q

Variant analysis consists of:

Answer

A

genotyping
variant calling
annotation
prioritization

Question 15

Q

The authors mention three sequencing coverage levels: high, medium, and low. What are the
coverage ranges for these three levels?

Answer

A

low: <5x coverage
medium: 5-20x coverage
high: >20x coverage
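
As a quick check on these ranges, the sketch below estimates average coverage and labels it. The estimate (number of reads times read length, divided by genome size) is the usual back-of-the-envelope formula and is an addition here, not something stated on the card; the function names are hypothetical.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average depth, assuming reads are spread evenly over the genome."""
    return num_reads * read_length / genome_size


def coverage_level(coverage):
    """Label a coverage value using the three ranges from the card."""
    if coverage < 5:
        return "low"
    if coverage <= 20:
        return "medium"
    return "high"


cov = mean_coverage(num_reads=1_000_000, read_length=100, genome_size=10_000_000)
print(cov, coverage_level(cov))
```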

Question 16

Q

What is the formula for a Phred score: Qphred =

Answer

A

-10 log10(error), where error is the estimated probability that the base call is wrong
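
A small Python sketch of the conversion in both directions, assuming the logarithm is base 10 as in the standard Phred definition. The function names are illustrative only.

```python
import math


def phred_from_error(p_error):
    """Phred quality for a given base-calling error probability."""
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


# Each step of 10 in Q is a tenfold drop in error probability
for q in (10, 30, 40):
    print(q, error_from_phred(q))
```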

Question 17

Q

What is the Phred quality value corresponding to a 1% error rate?

Answer

A

20

Question 18

Q

Alignment is more difficult in which regions of the genome?

Answer

A

regions with higher levels of diversity between the reference genome and the sequenced genome

Question 19

Q

Why should per-base quality scores be recalibrated?

Answer

A

The raw Phred-scaled quality scores produced by base-calling algorithms may not accurately reflect the true base-calling error rates, so they need to be recalibrated so that a Phred score of Q more accurately corresponds to an error rate of 10^(-Q/10).
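
The idea behind recalibration can be illustrated with a toy calculation: take all bases that were reported at a given quality, count how many disagree with the reference at sites not known to be variant, and convert that observed error rate back to a Phred score. This is only a sketch of the principle, not the procedure of any specific tool. The function name is hypothetical, and the add-one smoothing is an assumption used here to avoid taking the log of zero.

```python
import math


def empirical_quality(mismatches, observations):
    """Phred score implied by the observed mismatch rate in one bin."""
    # Add-one smoothing (an assumption) keeps the rate strictly above zero
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


# Bases reported as Q30 that actually mismatch about 1% of the time
print(empirical_quality(mismatches=99, observations=9998))
```

If bases reported as Q30 turn out to be wrong about 1 time in 100, their quality is overstated and should be lowered to roughly Q20.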

Question 20

Q

Several probabilistic methods have been developed that use the quality score to provide a posterior
probability for each genotype. What is the name of this value that is estimated for a genotype call?

Answer

A

genotype likelihood
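
The sketch below shows one simple way such values can be computed from base calls and their quality scores at a single site. It is a textbook-style model under stated assumptions, not the method of any specific caller: the sample is diploid, reads are independent, each read comes from either allele with probability 0.5, an error is equally likely to produce any of the three other bases, and the prior over genotypes is uniform. All function names are hypothetical.

```python
from itertools import combinations_with_replacement


def base_prob(observed, allele, err):
    """P(observed base | true allele) under a uniform error model."""
    return 1 - err if observed == allele else err / 3


def genotype_likelihoods(bases, quals):
    """P(data | genotype) for each of the 10 diploid genotypes."""
    likes = {}
    for a1, a2 in combinations_with_replacement("ACGT", 2):
        like = 1.0
        for b, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # Phred score to error probability
            like *= 0.5 * base_prob(b, a1, err) + 0.5 * base_prob(b, a2, err)
        likes[a1 + a2] = like
    return likes


def genotype_posteriors(likes):
    """Normalize likelihoods into posteriors (uniform prior assumed)."""
    total = sum(likes.values())
    return {g: value / total for g, value in likes.items()}


post = genotype_posteriors(genotype_likelihoods("AAAAGGGG", [30] * 8))
print(max(post, key=post.get))
```

With four A reads and four G reads at high quality, the heterozygous genotype dominates because either homozygous genotype would require four sequencing errors.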