Week 22: (B) Putting the Genome Together Flashcards
What are repetitive sequences?
Short, noncoding sequences that are repeated hundreds of times in tandem
Particularly in the centromere
What are transposons?
jumping genes
ancestral viral bits
Mobile genetic elements – sequences of a few kb that can move about the genome. Thousands of copies in eukaryotes
What part of the sequence creates a problem when we try and put our genomes together?
repetitive sequences
short reads make it hard to overcome repeat regions
What is a contig?
A ‘contiguous’ (continuous) consensus sequence from an assembly
What is a scaffold?
A series of contigs where we have additional information to place them together in the right order and orientation, but the sequence between the contigs is not complete
What is an assembly? (genome assembly)
The set of scaffolds for one genome.
What is an N50?
The contig/scaffold length at which 50% of the assembled bases are in contigs/scaffolds of that length or longer.
Like a median contig length, but the median is measured in terms of the total assembled bases rather than the number of contigs.
Can be used to describe how contiguous an assembly is
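The N50 definition above can be checked with a short calculation. This is a minimal sketch: the function name `n50` and the contig lengths are made up for illustration and do not come from any assembler.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, adding up the bases seen so far.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that takes us to half of the assembly is the N50.
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up contig lengths in kb
print(n50(contigs))  # 70: the 80 kb and 70 kb contigs hold half of the 300 kb
```

Sorting by length first is what makes this differ from an ordinary median: a few long contigs can reach the halfway point on their own.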
What is coverage?
number of reads covering any one position on average
What is read length?
length of read
What is overlap?
number of bases overlapping
Number of bases used to join one read to another
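As a rough illustration of an overlap, the sketch below finds the longest run of bases at the end of one read that exactly matches the start of another, then joins the two reads. The function name, the `min_length` parameter and the reads are assumptions for this example; real assemblers also tolerate mismatches, which this does not.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(longest, min_length - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0  # no exact overlap of at least min_length bases


read_a = "ACGTTGCA"
read_b = "TGCAGGTC"
size = overlap(read_a, read_b)
print(size)                    # 4 (the shared bases are TGCA)
print(read_a + read_b[size:])  # ACGTTGCAGGTC
```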
How do you calculate coverage?
Total number of bases sequenced (number of reads × read length) divided by the total genome length
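The coverage calculation can be written out as below. This is a sketch with made-up run numbers; the function names are hypothetical, and it assumes every read has the same length.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of reads covering each position."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical run: 2 million reads of 150 bp over a 5 Mb genome.
print(coverage(2_000_000, 150, 5_000_000))  # 60.0, i.e. 60X
print(reads_needed(10, 150, 5_000_000))     # 333334 reads for 10X
```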
What is a read?
an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment.
one sequence
How do we overcome repetitive regions? Solution 1
Longer reads are needed to span repeat regions. Illumina reads are up to 300 bp; Sanger reads were up to ~1500 bp
How long can repeats span?
10 bases to tens of thousands
How can we reduce the number of repeats we have to deal with?
Sequencing smaller chunks
The repeat may occur only once in the BAC (bacterial artificial chromosome) but many times in the genome
How do we overcome repetitive regions? Solution 2
Getting the sequence from the ends of long fragments
even though we don’t know what’s in the middle
> If we know how long the fragment is, we know how far apart the two end sequences are
> paired-end reads
How do we sequence the end of a fragment?
Sequence each end with a different primer
What is paired-end sequencing or mate-pair sequencing?
When we sequence each end of a long fragment
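To see why knowing the fragment length helps, the sketch below estimates the unsequenced gap between two contigs from one read pair. It is a simplified illustration: the function and parameter names are hypothetical, read 1 is assumed to map forward in contig A and read 2 to end inside contig B, and real fragment lengths vary around an average rather than being exact.

```python
def estimate_gap(fragment_length, contig_a_length, read1_start, read2_end):
    """Estimate the number of unknown bases between contig A and contig B.

    read1_start: position in contig A where the fragment begins.
    read2_end:   position in contig B where the fragment ends.
    """
    bases_in_a = contig_a_length - read1_start  # fragment bases inside contig A
    bases_in_b = read2_end                      # fragment bases inside contig B
    # Whatever is left of the fragment must lie in the gap between the contigs.
    return fragment_length - bases_in_a - bases_in_b


# Made-up numbers: a 5 kb fragment, 1500 bp of it in contig A and 700 bp in contig B.
print(estimate_gap(5000, 12000, 10500, 700))  # 2800
```

An estimate like this is the extra information that lets contigs be ordered and spaced into a scaffold even though the bases in the gap are unknown.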
What areas are in our genome?
Protein coding regions
repetitive regions
tRNA (many)
rRNA (many)
Transposons
What are examples of repetitive regions?
Microsatellites, telomeres, intron sequences
How would you describe protein coding regions?
Generally not repetitive, but there are some exceptions, e.g. filaggrin and high copy number genes
Why does the size of the fragments matter?
It sets the gap between paired-end reads (mate pairs)
can range from 500 bp to 20 kb
How long are the longest repeats?
~7kb
What graph shows the distance between two paired-end reads?
assemblygram
arcs represent the known number of bases between known sequences of bases
used in Illumina method
What does a coverage of 10X mean?
A coverage of 10X means that each base is on average found in 10 reads. The deeper the coverage, the more clearly any sequence or structure changes can be discerned from sequence error
What is ploidy?
The number of copies of the genome in the organism.
• Bacteria = 1; Human = 2; Potato = 4; Strawberry = 8
The higher the ploidy, the harder it is to accurately assemble.
What happens the deeper the coverage?
The more clearly any sequence or structure changes can be discerned (distinguished) from sequencing error
Calls are more reliable: we can tell a genuine variant from a base that was not sequenced correctly
What is an example of variation between genomes?
e.g. between two human genomes
SNV
single nucleotide variant
How do you know a variant is genuine?
Take a reference sequence from a bacterium
Sequence at a high read depth
If there is a consistent SNV at the same position then it is a genuine variant
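The idea of a consistent SNV at high read depth can be sketched as a simple rule over the bases that reads report at one position. The function name and both thresholds (`min_depth`, `min_fraction`) are assumptions chosen for illustration and suit a single-copy genome such as a bacterium; real variant callers use statistical models and base qualities.

```python
from collections import Counter


def call_snv(reference_base, observed_bases, min_depth=10, min_fraction=0.9):
    """Return the variant base if the reads consistently disagree with the reference."""
    depth = len(observed_bases)
    if depth < min_depth:
        return None  # too few reads to separate a variant from sequencing error
    base, count = Counter(observed_bases).most_common(1)[0]
    if base != reference_base and count / depth >= min_fraction:
        return base  # consistent difference: treat as a genuine variant
    return None      # matches the reference, or the difference is not consistent


print(call_snv("A", "GGGGGGGGGGGG"))  # G    (12 reads, all G)
print(call_snv("A", "AAAAAAAAAGAA"))  # None (one G in 12 reads looks like an error)
print(call_snv("A", "GGG"))           # None (depth of 3 is too shallow)
```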
What are the challenges of short reads and re-sequencing?
Is it a sequencing error, or is the gene really missing?
Hard to tell when genes are small
> duplications: would you see the duplication with small reads?
> inversions and translocations (structural variants)
Because we are mapping against a reference, we do not consider that sequence in the other genome may have come from somewhere else
What are the structural variants?
deletion
duplication
inversion
translocation
What is phasing?
being able to assign different alleles to specific chromosomes (haplotypes)
What happens as ploidy increases when sequencing genomes?
the harder it is to analyse structural and sequence variants
need more data and longer reads
What is an example of a state-of-the-art sequencing technique?
PacBio
What type of sequencing does PacBio do?
single molecule real time sequencing
long reads (10kb+)
high error rate (5-15%)
How does nanopore sequencing work?
The membrane is impermeable to the current, but the pore can pass it through
so ions can flow only through that particular pore
If we add DNA and a particular accessory protein,
the protein will feed the DNA through that pore
What is the outcome when different-sized bases block the pore?
They change the current that can flow through the pore
What does the hairpin adapter do?
It lets the machine go round and sequence one strand, then go round and sequence the other, to check that the two strands agree
What two strategies are there for a Nanopore MinION?
single sequencing
hairpin adapter
What is the graph produced by Nanopore MinION?
Flow electropherogram
What are you measuring in Nanopore MinION?
the raw sequence
How can a base be modified?
- By an epigenetic marker
- Methyl cytosine
What is unique about the Oxford Nanopore MinION?
It can detect epigenetic markers from the raw DNA sequence; most other sequencing machinery cannot do this
What is the issue of sequencing long DNA sequences?
snapping them in half
sticky and easy to break
How do you get round all the errors?
Do many reads of the same region so that the errors are outvoted
This gives around 95-98% accuracy
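How many reads overcome errors can be illustrated with a majority vote at each position. This is a minimal sketch that assumes the reads are already aligned, all the same length, and contain only substitution errors; the function name and the reads are made up, and real long-read errors often include insertions and deletions that this does not handle.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote base at each column of equal-length, pre-aligned reads."""
    columns = zip(*aligned_reads)  # one tuple of bases per position
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


reads = [
    "ACGTACGT",
    "ACGAACGT",  # error at position 4
    "ACGTACCT",  # error at position 7
    "TCGTACGT",  # error at position 1
    "ACGTACGT",
]
print(consensus(reads))  # ACGTACGT
```

Each read here carries at most one error, but because the errors fall at different positions they are outvoted by the other reads.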