L6 Flashcards
Fastq format
A FASTQ file normally uses four lines per sequence.
Line 1 of fastq format
begins with a ‘@’ character and is followed by a sequence identifier and an optional description.
Line 2 of fastq format
contains the raw sequence letters.
Line 3 of the fastq format
begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 of the fastq format
encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
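The four-line layout above can be sketched as a small parser. This is only an illustration, assuming every record occupies exactly four lines (the cards say "normally"); the names `FastqRecord` and `parse_fastq` are made up for this example and do not come from any library.

```python
from typing import Iterator, NamedTuple, Sequence


class FastqRecord(NamedTuple):
    identifier: str  # text after '@' on line 1 (identifier plus optional description)
    sequence: str    # raw sequence letters from line 2
    quality: str     # quality symbols from line 4


def parse_fastq(lines: Sequence[str]) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per record (hypothetical helper)."""
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1} does not begin with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"line {i + 3} does not begin with '+'")
        # Line 4 must have one quality symbol per sequence letter.
        if len(seq) != len(qual):
            raise ValueError(f"record at line {i + 1}: sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


example = [
    "@read1 sample description",
    "GATTACA",
    "+",
    "IIIIHHG",
]
records = list(parse_fastq(example))
```

Here `records[0].sequence` is `"GATTACA"`, and a record whose quality line is shorter or longer than its sequence line is rejected.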
What do genome assembly programs do
stitch together an organism’s chromosomes from fragmented reads of DNA
Read
a DNA “word” that comes out of a sequencer
Contig
a contiguous sequence, with no gaps, formed by several overlapping reads and representing a consensus region of DNA
Supercontig
an ordered and oriented set of contigs, usually linked by mate pairs
N50
the contig length N such that 50% of the assembled bases are contained in contigs of length N or larger.
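A minimal sketch of that definition: sort the contig lengths from longest to shortest and walk down until half of the assembled bases are covered. The function name `n50` and the example lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the length N such that contigs of length >= N hold at least 50% of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that pushes the running total to half or more.
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Made-up contig lengths: 300 bases in total, so the halfway mark is 150.
lengths = [80, 70, 50, 40, 30, 20, 10]
result = n50(lengths)  # 80 + 70 = 150 reaches the halfway mark, so N50 is 70
```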
Coverage
The number of times a genome has been sequenced (the depth of sequencing). C = LN / G, where L is the read length, N is the number of reads, and G is the genome length.
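A worked example of C = LN / G, assuming the usual reading of the symbols (L is the read length, N is the number of reads, G is the genome length). The numbers are made up for illustration, and `reads_needed` is just the same formula rearranged for N.

```python
import math


def coverage(read_length, num_reads, genome_length):
    """C = L * N / G."""
    return read_length * num_reads / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Rearranged formula: N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_length / read_length)


# 1,000,000 reads of 100 bp over a 5,000,000 bp genome.
c = coverage(100, 1_000_000, 5_000_000)   # 20.0, i.e. 20x coverage
n = reads_needed(30, 100, 5_000_000)      # 1,500,000 reads for 30x
```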
Assembly size
Number of nucleotides successfully assembled
A scaffold is made of
contigs and gaps
How can gap length be guessed correctly
by incorporating information from paired ends or mate pairs of different insert sizes
Resequencing
- Allows us to investigate potential SNPs associated with disease.
- Allows us to investigate potential SNPs associated with individual populations.
- Allows us to investigate potential SNPs associated with niche specification.
De novo sequence assembly
assembling reads together, without a reference genome, so that they form a new, previously unknown sequence. Orders of magnitude slower and more memory intensive than mapping assemblers.
Comparative sequence assembly
assembling reads against an existing backbone or reference sequence, building a sequence that is similar but not necessarily identical to the backbone sequence.
In absence of reference genome, what do we rely on
de novo assemblers
What do de novo assemblers rely on
the fact that two reads that overlap significantly in their sequence are likely to represent neighboring segments of a genome (k-mer value).
When do problems arise with de novo assemblers
when overlapping regions belong to repetitive regions.
What is the popular sequencing choice for De novo assembly
PacBio sequencing, due to low costs and long reads that can span repetitive regions
Greedy assembly algorithm
It is used for organisms such as bacteria and single-celled eukaryotes, which have single genomes that are not repetitive. It has some efficiency limitations.
What has the greedy algorithm been superseded by
Graph methods
Steps of greedy algorithm
(1) Calculate pairwise alignments of all fragments.
(2) Choose two fragments with the largest overlap.
(3) Merge chosen fragments.
(4) Repeat steps 2 and 3 until only one fragment is left.
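The four steps can be sketched as below. This is a simplified illustration, not a real assembler: it replaces pairwise alignment with exact suffix-prefix matching, the `min_length` threshold is an assumption added here, and it stops early if no fragments overlap rather than forcing a single fragment.

```python
from itertools import permutations


def overlap(a, b, min_length=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_length)."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_length=1):
    frags = list(fragments)
    while len(frags) > 1:
        # Steps 1 and 2: score every ordered pair and keep the largest overlap.
        best_len, best_a, best_b = 0, None, None
        for i, j in permutations(range(len(frags)), 2):
            size = overlap(frags[i], frags[j], min_length)
            if size > best_len:
                best_len, best_a, best_b = size, i, j
        if best_len == 0:
            break  # nothing overlaps; leave the remaining fragments separate
        # Step 3: merge the chosen pair, keeping the shared region once.
        merged = frags[best_a] + frags[best_b][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_a, best_b)]
        frags.append(merged)
        # Step 4: loop back and repeat.
    return frags


reads = ["ATTAGACC", "GACCTGCC", "TGCCGGAA"]
assembled = greedy_assemble(reads, min_length=3)  # ['ATTAGACCTGCCGGAA']
```

Rescoring every pair on each pass is what makes this approach slow on large read sets, and always taking the largest overlap is what goes wrong when that overlap lies in a repeat.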
Alternative to greedy algorithm
de Bruijn graphs: ask whether the graph is Eulerian
What is an eulerian graph
A graph is considered Eulerian if it is connected and has a closed trail (a walk with no repeated edges) containing all edges of the graph.
What does it mean when a graph is connected
if each node can be reached from every other node.
* Node is balanced if indegree equals outdegree.
* Node is semi-balanced if indegree differs from outdegree by 1.
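* A directed, connected graph is Eulerian (has a walk that uses every edge exactly once) if and only if it has at most 2 semi-balanced nodes and all other nodes are balanced. The walk is closed only when every node is balanced.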
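The balanced / semi-balanced test can be sketched on a small de Bruijn graph. The construction here (each k-mer becomes an edge from its first k-1 letters to its last k-1 letters) is the common textbook one and is assumed rather than taken from the cards; connectivity is checked ignoring edge direction, which is also an assumption. All function names are illustrative.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """One edge per k-mer, from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))
    return edges


def classify_nodes(edges):
    """Return (all nodes, balanced nodes, semi-balanced nodes)."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    nodes = set(indeg) | set(outdeg)
    balanced = {n for n in nodes if indeg[n] == outdeg[n]}
    semi = {n for n in nodes if abs(indeg[n] - outdeg[n]) == 1}
    return nodes, balanced, semi


def is_connected(nodes, edges):
    """Connectivity check that ignores edge direction (assumption of this sketch)."""
    if not nodes:
        return True
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def has_eulerian_walk(edges):
    """At most 2 semi-balanced nodes, all others balanced, and the graph connected."""
    nodes, balanced, semi = classify_nodes(edges)
    return (is_connected(nodes, edges)
            and len(semi) <= 2
            and len(balanced) + len(semi) == len(nodes))


edges = de_bruijn_edges(["ATGGCGTGCA"], k=3)
walkable = has_eulerian_walk(edges)  # True: only AT and CA are semi-balanced
```

In this example AT has one more outgoing edge than incoming and CA has one more incoming than outgoing, so they mark the start and end of the walk; every other node is balanced.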