Exam 1 Homework Flashcards
The Human Genome Project started in the year ______________.
1990
Dr. Green presented five domains of genome research:
- understanding the structure of genomes
- understanding the biology of genomes
- understanding the biology of disease
- advancing the science of medicine
- improving the effectiveness of healthcare
What is the largest current bottleneck in genomics? Three were mentioned, but all three fall under one broad category.
the data analysis bottleneck
Dr. Green talks about “Why the world has changed” in the last 10 years. What are the five areas that
he described?
- genomics
- electronic health records
- technologies
- data science
- participant partnerships
Massively parallel DNA sequencing instruments all have the following steps/characteristics:
- a library obtained by either amplification or ligation with custom linkers
- library fragments amplified on a solid surface with adapters
- direct step-by-step detection of each nucleotide base
- lots of reactions detected per instrument run
- digital read type that enables direct quantitative comparisons
- shorter read lengths than capillary sequencers
In the Illumina process, the nucleotides are very specialized. They have two key attributes:
- a fluor that is specific to the identity of the nucleotide
- the 3’ hydroxyl group is blocked with a chemical blocker
__________________ of reads to the reference sequence is the first step to identify variation of all types.
alignment
Long read sequencers such as the PacBio instrument are a departure from short read sequencers such
as Illumina. What is the first major requirement for these long read technologies that is different from the short read technologies?
high molecular weight genomic DNA, which is needed to produce very long reads
A typical workflow of whole exome sequencing analysis consists of the following steps:
- raw data QC
- preprocessing
- mapping
- post-alignment processing
- variant calling
- annotation
- prioritization
Standard preprocessing procedure includes:
- 3’ end adapter removal
- trimming of low quality bases at the ends of the reads
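Study sketch: the two preprocessing steps above can be illustrated in a few lines of Python. This is a simplified illustration, not how any particular trimming tool works: the function names, the adapter sequence, and the quality cutoff of 20 are all assumptions made for the example, and the adapter search is an exact match only (real tools tolerate mismatches and partial adapters).

```python
def trim_adapter(read, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter (3' end removal)."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx], quals[:idx]
    return read, quals


def trim_low_quality_ends(read, quals, min_q=20):
    """Drop bases from both ends while their Phred quality is below min_q."""
    start, end = 0, len(read)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return read[start:end], quals[start:end]


# Hypothetical read: 8 genomic bases followed by a made-up adapter sequence.
adapter = "AGATCGGAAG"
read = "ACGTACGT" + adapter
quals = [10, 30, 35, 38, 36, 30, 12, 8] + [30] * len(adapter)

read, quals = trim_adapter(read, quals, adapter)
read, quals = trim_low_quality_ends(read, quals)
print(read, quals)
```

Adapter removal runs first so that the quality trimming only looks at bases that came from the sample.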
Many different tools have been developed for short-read mapping. In general, they use two
algorithms for aligning sequences:
- Burrows-Wheeler Transform (BWT)
- Smith-Waterman (SW) dynamic programming
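Study sketch: a minimal illustration of what the Burrows-Wheeler transform does to a string. It builds and sorts every rotation, which is fine for a flashcard-sized example but is not how aligners construct their indexes in practice. The function names and the "$" end marker are choices made for this example.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform of text, using '$' as the end marker."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

The transform is reversible and tends to group identical characters together, which is what makes it useful for compact, searchable genome indexes.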
Of the sequence aligners they evaluated, which two were the fastest?
- Bowtie 2
- BWA
After mapping reads to the reference genome, a three-step post-alignment processing procedure is
recommended to minimize the artifacts that may affect the quality of downstream variant calling. It
consists of:
- read duplicate removal
- indel realignment
- base quality score recalibration (BQSR)
Variant analysis consists of:
- genotyping
- variant calling
- annotation
- prioritization
The authors mention three sequencing coverage levels: high, medium, and low. What are the
coverage ranges for these three levels?
- low: <5x coverage
- medium: 5-20x coverage
- high: >20x coverage
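Study sketch: the ranges above can be turned into a small classifier. The mean-depth estimate used here (number of reads times read length, divided by the size of the sequenced target) is the usual back-of-the-envelope calculation and is an addition for illustration. The function names and the example numbers are hypothetical, and treating exactly 5x and 20x as "medium" is an assumption about the boundaries.

```python
def mean_coverage(n_reads, read_length, target_size):
    """Approximate mean depth: total sequenced bases divided by target size."""
    return n_reads * read_length / target_size


def coverage_level(depth):
    """Classify depth as low (<5x), medium (5-20x), or high (>20x)."""
    if depth < 5:
        return "low"
    if depth <= 20:
        return "medium"
    return "high"


# Hypothetical run: 1 million reads of 100 bases over a 50 Mb target.
depth = mean_coverage(1_000_000, 100, 50_000_000)
print(depth, coverage_level(depth))
```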
What is the formula for a Phred score: Qphred =
-10 * log10(e), where e is the base-calling error probability
What is the Phred quality value corresponding to a 1% error:
20
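Study sketch: the Phred relationship can be checked in both directions with a few lines of Python. The function names are made up for this example; the arithmetic is just the formula from the card, using a base-10 logarithm.

```python
import math


def phred_from_error(error_probability):
    """Phred quality Q for a given base-calling error probability."""
    return -10 * math.log10(error_probability)


def error_from_phred(q):
    """Error probability for a given Phred quality Q."""
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_from_error(p), 2))

print(error_from_phred(20))
```

Each increase of 10 in Q corresponds to a tenfold drop in the error probability.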
Alignment is more difficult in which regions of the genome?
regions with higher levels of diversity between the reference genome and the sequenced genome
Why should per-base quality scores be recalibrated?
The raw Phred-scaled quality scores produced by base-calling algorithms may not accurately reflect the true base-calling error rates. The raw quality scores therefore need to be recalibrated so that a Phred score of Q more accurately corresponds to an error rate of 10^(-Q/10).
Several probabilistic methods have been developed that use the quality score to provide a posterior
probability for each genotype. What is the name of this value that is estimated for a genotype call?
genotype likelihood
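Study sketch: one simple way to see how base quality scores feed into a genotype likelihood. This is a textbook-style toy model, not the method of any specific variant caller. It assumes a diploid site with one reference and one alternate allele, independent reads, and an error that is spread evenly over the three other bases. The function name and the example pileups are hypothetical.

```python
import math


def genotype_log_likelihoods(bases, quals, ref, alt):
    """Return log10 P(observed bases | genotype) for ref/ref, ref/alt, and alt/alt."""
    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    result = {}
    for name, (allele1, allele2) in genotypes.items():
        log_likelihood = 0.0
        for base, q in zip(bases, quals):
            e = 10 ** (-q / 10)  # error probability from the Phred score
            p1 = 1 - e if base == allele1 else e / 3
            p2 = 1 - e if base == allele2 else e / 3
            # Each read is assumed to come from either allele with probability 0.5.
            log_likelihood += math.log10(0.5 * p1 + 0.5 * p2)
        result[name] = log_likelihood
    return result


# Hypothetical pileup: three reads show A (ref) and three show G (alt), all at Q30.
likelihoods = genotype_log_likelihoods("AAAGGG", [30] * 6, ref="A", alt="G")
print(max(likelihoods, key=likelihoods.get))
```

Combining these likelihoods with a prior for each genotype and normalizing would give the posterior probabilities mentioned in the question.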