sequencing & assembly Flashcards

1
Q

stages of genome sequencing

A
  • fragmentation/cloning
    • fragment library, amplification
  • sequencing
    • from both ends to get multiple reads
  • processing
    • base calling, quality assessment, repeat masking
    • trim read ends (quality drops as polymerase affinity decreases)
  • assembly
    • overlapping reads → contigs
    • contigs → scaffolds
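The "overlapping reads → contigs" step can be illustrated with a toy suffix/prefix merge. This is a minimal sketch, assuming error-free reads given in the correct order; the names `overlap` and `merge` and the `min_len` parameter are hypothetical, and real assemblers use far more robust methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Join b onto a using their overlap; return None when no usable overlap exists."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


# Toy reads (made up for illustration), each overlapping the next by 4 bases.
reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG"]
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)

print(contig)  # ATGCGTACCTGATTAG
```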
2
Q

PacBio

A
  • 3rd gen
  • longest reads, ~20,000 bp
  • high error rate, but errors are random
    • multiple reads → consensus
  • consensus accuracy 99.999%
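Because the errors are random, they rarely hit the same position in every read, so a per-position majority vote recovers the true base. A minimal sketch, assuming the reads are already aligned and of equal length; the function name `consensus` and the example reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per column; ties go to the base seen first in that column."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


# Five made-up reads of the same region, three of them with one random error each.
reads = ["ACGTTACG", "ACGATACG", "ACGTTACC", "TCGTTACG", "ACGTTACG"]
print(consensus(reads))  # ACGTTACG
```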
3
Q

Phred scores

A
  • quality score
  • estimated confidence in each base call
  • use to:
    • filter and trim reads
    • create consensus
    • distinguish between variants and errors
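Trimming with quality scores might look like the sketch below. It assumes the scores are already decoded to integers and only trims the 3' end; the function name `trim_3prime` and the default cutoff `min_q=30` are illustrative choices, not a fixed standard.

```python
def trim_3prime(seq, quals, min_q=30):
    """Drop trailing bases whose quality score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# Made-up read whose quality decays towards the end.
seq = "ACGTACGT"
quals = [38, 37, 36, 35, 33, 28, 20, 12]
trimmed_seq, trimmed_quals = trim_3prime(seq, quals)
print(trimmed_seq)  # ACGTA
```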
4
Q

Q value

A
  • given by Phred
  • QV = -10 × log10(Pe)
    • Pe = probability that base call is an error
  • ignore call if QV is lower than 30
    • <99.9% accuracy
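The formula can be checked numerically. A small sketch; the function names `phred_from_error` and `error_from_phred` are hypothetical.

```python
import math


def phred_from_error(pe: float) -> float:
    """QV = -10 * log10(Pe)."""
    return -10 * math.log10(pe)


def error_from_phred(qv: float) -> float:
    """Inverse of the formula: Pe = 10 ** (-QV / 10)."""
    return 10 ** (-qv / 10)


# QV 30 corresponds to 1 error in 1,000 calls, i.e. 99.9% accuracy.
for qv in (10, 20, 30, 40):
    pe = error_from_phred(qv)
    print(f"QV {qv}: Pe = {pe:g}, accuracy = {100 * (1 - pe):g}%")
```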
5
Q

chastity filter

A
  • Illumina base call algorithm
  • assign and filter intensity score for nucleotides at each position
  • chastity = highest score divided by sum of highest and 2nd highest score for that position
  • if less than threshold (0.6), base marked N
  • if higher, assign base call
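The ratio described above can be sketched as follows. This follows the card's description only and is not Illumina's actual software; the function names `chastity` and `call_base` and the intensity values are made up for illustration.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the highest and second highest."""
    top, second = sorted(intensities.values(), reverse=True)[:2]
    total = top + second
    return top / total if total else 0.0


def call_base(intensities, threshold=0.6):
    """Return the brightest base, or 'N' when chastity is below the threshold."""
    if chastity(intensities) < threshold:
        return "N"
    return max(intensities, key=intensities.get)


clear = {"A": 900, "C": 100, "G": 50, "T": 30}       # chastity 0.9
ambiguous = {"A": 500, "C": 450, "G": 40, "T": 10}   # chastity ~0.53
print(call_base(clear), call_base(ambiguous))  # A N
```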
6
Q

factors affecting quality

A
  • end of read deterioration (polymerase affinity)
  • adaptor attached to reads
  • high AT or GC content
    • reduced complexity
  • homopolymeric tracts
    • unsure of length
    • SNPs ignored (assumed to be errors)
7
Q

depth of coverage

A
  • higher depth helps eliminate errors
  • depends on genome complexity, read length, sequencer error rate
  • Human Genome Project (HGP): 12x or greater
    • each base present in 12 reads on average
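Average depth is commonly estimated as (number of reads × read length) / genome size. That formula is not stated on the card and is brought in here as the standard definition. The function names and the example numbers below are illustrative assumptions.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# Made-up example: 1,000,000 reads of 150 bp over a 12.5 Mb genome.
print(mean_coverage(1_000_000, 150, 12_500_000))  # 12.0
print(reads_needed(12, 150, 12_500_000))          # 1000000
```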
8
Q

paired end sequencing

A
  • sequence both fragment ends
  • distance known → filter fragments by size
  • knowing one position anchors the other
  • better read alignment
  • important for repeats
  • improved prediction of structural variations
9
Q

repeats

A
  • fragments with identical repeat regions can be wrongly assembled together
  • sequences in between are lost
  • sequencing may be impossible
10
Q

repeats and paired end reads

A
  • pair of overlapping reads, 1 unique, 1 repetitive
  • map unique read
  • position the second read, since the distance is known
  • enough paired reads allows sequencing across whole repeat region
  • small repeats only
11
Q

mate pairs

A
  • longer inserts than paired ends
    • kb vs 500 bp
  • bridge across repeats or structural rearrangements
  • don’t sequence repeat but don’t lose information
  • fill gaps with paired ends
  • helps resolve correct order of repeat fragments
12
Q

scaffolding

A
  • resolution of conflicting areas
  • order non-overlapping contigs into scaffolds
    • gaps with known or predicted size
    • spanned by N (unknown sequence)
  • bridge contigs with mate pairs
  • de novo assembly - gaps remain
    • need wet-lab work and paired end reads
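Joining ordered contigs with N-filled gaps can be sketched like this. It assumes the contig order is already known (for example from mate pairs); the function name `build_scaffold` and the placeholder length `unknown_gap` are hypothetical.

```python
def build_scaffold(contigs, gap_sizes, unknown_gap=100):
    """Concatenate ordered contigs, filling each gap with Ns.

    gap_sizes[i] is the estimated gap after contigs[i]; None means the size
    is unknown, in which case a placeholder run of unknown_gap Ns is used.
    """
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_sizes):
        n = unknown_gap if gap is None else gap
        parts.append("N" * n)
        parts.append(contig)
    return "".join(parts)


# Three toy contigs: a 3 bp gap of known size, then a gap of unknown size.
scaffold = build_scaffold(["ACGT", "GGCC", "TTAA"], [3, None], unknown_gap=5)
print(scaffold)  # ACGTNNNGGCCNNNNNTTAA
```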
13
Q

rearrangements and paired end reads

A
  • compare to reference genome mapping
  • fragment smaller than the distance it spans on the reference → deletion
  • fragment larger than the distance it spans on the reference → insertion
  • wrong way round → inversion
  • maps to different region → translocation
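These rules can be written as a small classifier. The sketch reads "size" as the known fragment size compared with the distance the pair spans on the reference, and it simplifies "different region" to "different chromosome". The function name, parameters and tolerance are hypothetical.

```python
def classify_pair(chrom1, chrom2, mapped_distance, proper_orientation,
                  expected_size, tolerance):
    """Label a read pair from how it maps to the reference."""
    if chrom1 != chrom2:
        return "translocation"      # mates map to different regions
    if not proper_orientation:
        return "inversion"          # one mate is the wrong way round
    if mapped_distance > expected_size + tolerance:
        return "deletion"           # fragment smaller than the reference span
    if mapped_distance < expected_size - tolerance:
        return "insertion"          # fragment larger than the reference span
    return "concordant"


# Made-up pairs from a library with 500 bp fragments, allowing +/- 50 bp.
print(classify_pair("chr1", "chr1", 2500, True, 500, 50))  # deletion
print(classify_pair("chr1", "chr7", 500, True, 500, 50))   # translocation
```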
14
Q

finishing

A
  • fill in gaps, resequencing, different technology or longer reads
  • design primer probe for PCR to reach end
  • improve consensus
  • expensive
15
Q

limitations of assembly

A
  • next-gen short read lengths
  • AT rich genomes
  • repetitive genomes
  • de novo sequencing
    • no reference
    • use multiple technologies
16
Q

Entamoeba histolytica

A
  • AT rich
  • long runs of Ts
  • many contig ends are T
  • unknown ploidy
  • 1500 contigs, 20 Mb
17
Q

Blumeria graminis

A
  • repeat rich
  • 7000 contigs, 120 Mb
  • contig assembly difficult
  • need longer reads or targeted sequencing