L6 Flashcards
Fastq format
A FASTQ file normally uses four lines per sequence.
Line 1 of fastq format
begins with a ‘@’ character and is followed by a sequence identifier and an optional description.
Line 2 of fastq format
contains the raw sequence letters.
Line 3 of the fastq format
begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 of the fastq format
encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
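The four-line layout described on the cards above can be sketched as a small parser. This is only an illustration, assuming every record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `parse_fastq` are hypothetical and not taken from any library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str   # text after '@' on line 1, up to the first space
    description: str  # optional text after the identifier (may be empty)
    sequence: str     # line 2: raw sequence letters
    quality: str      # line 4: one quality symbol per sequence letter


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per record (hypothetical helper)."""
    # Drop line endings and skip blank lines.
    cleaned = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@'")
        if not separator.startswith("+"):
            raise ValueError("line 3 must begin with '+'")
        if len(sequence) != len(quality):
            raise ValueError("line 4 must have one symbol per sequence letter")
        identifier, _, description = header[1:].partition(" ")
        yield FastqRecord(identifier, description, sequence, quality)


# Made-up example record, for illustration only.
example = ["@read1 sample description\n", "GATTACA\n", "+\n", "IIIIIII\n"]
records = list(parse_fastq(example))
```

Here `records[0].identifier` is `"read1"` and `records[0].sequence` is `"GATTACA"`; a record whose quality line is shorter than its sequence raises `ValueError`.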
What do genome assembly programs do
stitch together an organism’s chromosomes from fragmented reads of DNA
Read
a DNA “word” that comes out of a sequencer
Contig
a contiguous, gap-free sequence formed by several overlapping reads that represents a consensus region of DNA
Supercontig
an ordered and oriented set of contigs, usually linked by mate pairs
N50
An N50 contig size of N means that 50% of the assembled bases are contained in contigs of length N or larger.
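One way to compute N50 from this definition is to sort the contigs from longest to shortest and add up their lengths until half of the assembled bases are covered. The sketch below assumes the input is a plain list of contig lengths; the function name `n50` and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (hypothetical helper)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating assembled bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50% of the total.
        if running * 2 >= total:
            return length


# Made-up lengths: the total is 290, so half is 145.
# 80 alone is not enough; 80 + 70 = 150 crosses the halfway mark.
lengths = [80, 70, 50, 40, 30, 20]
result = n50(lengths)  # 70
```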
Coverage
The number of times a genome has been sequenced (the depth of sequencing). C = LN / G, where L is the read length, N is the number of reads, and G is the genome length.
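As a quick worked example of C = LN / G, the sketch below takes L as the read length, N as the number of reads, and G as the genome length, all in bases. The function name `coverage` and the input numbers are made up for illustration.

```python
def coverage(read_length, num_reads, genome_length):
    """Return the sequencing depth C = L * N / G (hypothetical helper)."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return read_length * num_reads / genome_length


# Made-up run: 1.5 million reads of 100 bases over a 5 Mb genome.
depth = coverage(read_length=100, num_reads=1_500_000, genome_length=5_000_000)
# 150,000,000 sequenced bases / 5,000,000 genome bases = 30.0, i.e. 30x coverage
```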
Assembly size
Number of nucleotides successfully assembled
A scaffold is made of
contigs and gaps
How can gap length be guessed correctly
by incorporating information from paired ends or mate pairs of different insert sizes
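The idea can be sketched with a simplified model, which is an assumption and not a method stated on the cards. If a mate pair with a known insert size spans two neighbouring contigs, whatever part of the insert is not accounted for on either contig must lie in the gap. The function `estimate_gap` and all numbers below are hypothetical, and the model ignores insert-size variation and mapping errors.

```python
from statistics import median


def estimate_gap(insert_size, contig_a_length, links):
    """Estimate the gap between contig A and contig B (hypothetical helper).

    Each link is (start_on_a, end_on_b): where the sequenced fragment starts
    on contig A and where it ends on contig B, in bases from each contig's start.
    """
    estimates = []
    for start_on_a, end_on_b in links:
        on_a = contig_a_length - start_on_a  # fragment bases lying on contig A
        on_b = end_on_b                      # fragment bases lying on contig B
        estimates.append(insert_size - on_a - on_b)
    # The median damps the effect of a few outlier pairs.
    return median(estimates)


# Made-up data: a 3000-base insert library and a 10000-base contig A.
links = [(9000, 500), (9200, 650), (8800, 300)]
gap = estimate_gap(insert_size=3000, contig_a_length=10000, links=links)
# Per-pair estimates are 1500, 1550 and 1500, so the median is 1500.
```

Libraries with different insert sizes help because short inserts can only span small gaps, while long inserts can bridge larger ones.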
Resequencing
- Allows us to investigate potential SNPs associated with disease.
- Allows us to investigate potential SNPs associated with individual populations.
- Allows us to investigate potential SNPs associated with niche specification.