Genomes and Genome Sequencing Flashcards
Applications of studying genomics
Research
Health (e.g. diagnostic)
Environment (e.g. pollutants)
Agriculture (e.g. livestock, nutrients)
Health example for genomics
Causes of severe intellectual disability in children (42% of cases linked to DNA compared to 12% using other methods)
Disease example for genomics
Inflammatory Bowel Disease (Crohn’s disease)
more viral DNA = more viruses
viruses were bacteriophages
they infected gut bacteria and affected gut bacteria population -> Crohn’s disease
Disease Outbreak Tracking for genomics (only need one)
Ebola - finding point of origin, watching it change over time
HIV - identified known origin, identified species crossovers
Influenza - track current outbreaks of influenza to inform vaccine choices for the coming winter in the opposite hemisphere / identify crossovers / crossover potential of strains
The third generation of DNA sequencing
Longer reads (longer DNA sequences)
Sanger Sequencing
Chain termination sequencing
Uses ddNTPs (fluorescently labelled, chain-terminating nucleotides)
How does Sanger Sequencing work
polymerase rebuilds double helix using normal nucleotides, then randomly adds a fluorescently labelled base (ddNTP); polymerase stops and the strand is terminated at that point
-> strands of DNA of varying lengths, each ending with a fluorescently-labelled base
(repeated as many times as req. so that a labelled base is substituted at each position along the length)
Then run small pieces on capillary electrophoresis gel
Record fluorescence
Each base is a diff. colour
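A minimal sketch of the read-out step, assuming an idealised run where every fragment length is present exactly once (the function name and the (length, base) input format are made up for illustration): sorting the terminated fragments by size, as electrophoresis does, and reading the terminal label of each gives the sequence.

```python
def read_sanger_fragments(fragments):
    """Read a sequence from chain-terminated fragments.

    fragments: (length_in_bases, terminal_labelled_base) pairs in any order.
    Assumes one fragment per length (an idealised, error-free run).
    """
    # Electrophoresis separates fragments by size: shortest first.
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    # The colour of each terminal base, in size order, spells the sequence.
    return "".join(base for _, base in ordered)


print(read_sanger_fragments([(3, "T"), (1, "G"), (4, "C"), (2, "A")]))
```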
Downsides of Sanger Sequencing
Slow
Expensive
Not high throughput
Errors in repetitive regions (lots of bases similar to each other, next to each other)
Bias in sequencing (certain regions better amplified than others)
Library Preparation
Extract DNA from cells
Fragment DNA (50-1000bp)
Add adaptors (either end of seq.): one will stick to the sequencer (flow cell), the other will be the start point for the seq. reaction
Amplification
Issues with Library Preparation
Bias in amplification
How does Illumina Sequencing work?
Fragments added to the flow cell - bind to flow cell (adapter-flow cell)
Polymerase starts at top (furthest from flow cell) and adds in fluorescently labelled nucleotides (one at a time)
+laser excitation, fluorescence recorded
Benefits of Illumina Sequencing
Fast
Cheap
High throughput
Issues with Illumina Sequencing
Repetitive regions
Amplification
Length restrictions
Third generation sequencing
prevents length restriction
takes out the need for amplification
PacBio SMRT
uses Single Molecule, Real-time Technology
Zero-mode wave-guides
One piece of DNA per well
Polymerase in well adds fluorescently labelled nucleotides (like Illumina) to a single piece of DNA
PacBio Considerations
Higher error rates
No need for amplification
Longer, but not genome-length
Oxford Nanopore MinION
Very small
Membrane with many pores
Feeds single length of DNA through pore; changes in electrical current across the membrane indicate the base, and this is read
Oxford Nanopore MinION Considerations
Does not use fluorescently-labelled nucleotides
Not as accurate as Illumina (99.9%), but close (95%)
Long read (up to 2 million bp)
What is the PromethION?
48 MinIONs
large amounts of sequencing
Single-Cell Sequencing
uses Illumina
BUT with diff. lib preparation - single-cell
Each cell in a ‘GEM’ (gel bead in emulsion) - when gel broken open, all contents labelled with the barcode for that indv. GEM
Can say where DNA comes from -> cell types/spatial transcriptomics
Challenges to genome projects
Sequencing technologies not perfect (e.g. Illumina 99.9% not 100%)
Some DNA harder to seq. than others (e.g. centromere/telomere) - secondary structures
Population representation (variation)
Gaps, errors, lack of variation
Accuracy of assembly
Genomes keep being corrected (diff. versions from same individual)
Alignments
Reference genome available
Compare and align
Assembly
Does not have an available reference genome
Assemble reads into a reference genome
Is a BEST REPRESENTATION not exact
Steps in an alignment
Find an appropriate reference genome - diff. versions
Find fragment matches on reference genome
Steps of the alignment analysis
Base calling
Quality control
Alignment/Mapping
Alignment Post-Processing
Base calling
process of determining bases in the sequencing data
Quality control
Phred score
Q value
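A short sketch of how the Phred score (Q value) relates to error probability, using the standard definition Q = -10 * log10(P). The function names are illustrative, and the quality-string offset of 33 is an assumption about how the FASTQ file was encoded.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a base call with error probability p."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into Phred scores.

    offset=33 is assumed here; check the encoding of your own data.
    """
    return [ord(char) - offset for char in qual]


# Q30 corresponds to a 1 in 1000 error rate, i.e. 99.9% accuracy.
print(phred_to_error_prob(30))
print(decode_quality_string("I5!"))
```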
Mapping vs. Alignment
Mapping = position of the sequence on the reference genome
Alignment = position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)
Alignment
position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)
Options for Alignment Post-Processing
Variant calling
Methylation studies
RNA seq. expression
Structural variants
Mapping
position of the sequence on the reference genome
Ways to align fragment sequence to a reference genome
Brute Force Method - by eye, move along reference a base pair at a time until matches
Alignment Software
What is the “Brute Force” method?
by eye, move along reference a base pair at a time until matches
Considerations with the “Brute Force” method
Easy to do
Very slow
Requires a lot of repetitive computations - inefficient
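The brute-force idea can be sketched in a few lines, assuming exact matching only (the function name is made up for illustration). Every start position on the reference is tried in turn, which is why it is simple but slow.

```python
def brute_force_match(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    # Slide along the reference one base at a time.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            positions.append(start)
    return positions


print(brute_force_match("GATTACAGATTA", "ATTA"))
```

Each read costs roughly (reference length x read length) comparisons, repeated for every read.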
Alignment Software Types
RNA/DNA/bisulphite sequencing
Alignment Software Algorithms
Burrows Wheeler Transform
Suffix Arrays
Considerations with Alignment Software
Works as replacement for BLAST (BLAST-like methods do not scale well)
Trade-off between speed and accuracy
(quicker software may be less accurate)
Some newer tools use kmers (only mapping data)
How do you make a Suffix Array?
all seq. end with a dollar
lining up positional order & number
then line up lexicographically (alphabetically with $ first)
Then take the positional information in lexicographical order of the new list
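The steps above can be sketched as follows. This is a toy construction for short sequences (the function name is illustrative); real aligners build the array far more efficiently.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes, in lexicographic order."""
    # All sequences end with a dollar; "$" sorts before A, C, G and T.
    if not text.endswith("$"):
        text += "$"
    # Pair each suffix with its position, then sort lexicographically.
    suffixes = sorted((text[i:], i) for i in range(len(text)))
    # Keep only the positional information, in the new order.
    return [position for _, position in suffixes]


print(build_suffix_array("GATTACA"))
```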
Alignment to a suffix array
See whether substring (fragment) matches the middle point (higher or lower lexicographically than the list)
If not, cut the list in half and discount the half that cannot contain it (depending on whether the fragment is higher or lower).
Repeat until found location (matches at that point)
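A minimal sketch of this halving search, assuming exact matching and returning just one matching position (function names are illustrative, and the array is built here in a simple, inefficient way so the example is self-contained).

```python
def build_suffix_array(text):
    """Suffix start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_read(text, suffix_array, read):
    """Binary search: return one position where read occurs, or None."""
    low, high = 0, len(suffix_array)
    while low < high:
        middle = (low + high) // 2
        start = suffix_array[middle]
        candidate = text[start:start + len(read)]
        if candidate == read:
            return start
        if candidate < read:
            low = middle + 1   # read sorts later: discount the first half
        else:
            high = middle      # read sorts earlier: discount the second half
    return None


reference = "GATTACA$"
array = build_suffix_array(reference)
print(find_read(reference, array, "TAC"))
```

Each step halves the list, so the number of comparisons grows with the logarithm of the reference length rather than the length itself.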
How do you make a Burrows Wheeler Transform?
Uses rotations
Uses $ symbol
all seq. end with a dollar
lining up positional order & number
then line up lexicographically (alphabetically with $ first)
DOES NOT store positional information
Stores last column (last character in each line)
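A toy sketch of the transform for short sequences (the function name is illustrative). Sorting every rotation in memory is only practical for small examples; it is shown here to make the "last column" idea concrete.

```python
def burrows_wheeler_transform(text):
    """Return the last column of the sorted rotations of text."""
    # All sequences end with a dollar, which sorts first.
    if not text.endswith("$"):
        text += "$"
    # Every rotation of the sequence, lined up lexicographically.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # No positions are stored: only the last character of each line.
    return "".join(rotation[-1] for rotation in rotations)


print(burrows_wheeler_transform("GATTACA"))
```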
Considerations with Burrows Wheeler Transform
More efficient - binary storage (FM index)
Compressed further
Uses last-first principle
Makes substring search quicker (too complex to explain)
SAM format
Sequence Alignment/Map Format
tab-delimited file (columns)
Information about mapping of the read
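A minimal sketch of reading one alignment line, based on the 11 mandatory SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The class and function names are illustrative and the example line is invented; for real files a dedicated library is the safer choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # base-to-base correspondence summary
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line):
    """Parse one tab-delimited alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


def is_unmapped(record):
    """FLAG bit 0x4 marks a read that did not map."""
    return bool(record.flag & 0x4)


# Invented example line, for illustration only.
example = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tGATTACAG\tIIIIIIII"
print(parse_sam_line(example))
```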
Difficulties during alignments
Exact VS Inexact matching
Multi mapping sequences
Exact vs Inexact matches
Will be comparing for differences / checking that they are there (so allow for a mismatch of X% - set limit)
versus
certainty that read from that location
[Software will have default value - but changeable]
Multi mapping sequences
Regions of ref. genome will be identical in more than one place
- repetitive regions
- gene families (have similar sequences)
Alignment visualisation steps
software IGV
reference genome along bottom, reads aligned above, with base differences highlighted
software Tablet
reference at top, shows all bases, highlight differences
depth/coverage (alignments)
amount of reads aligned to that region
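A small sketch of per-position depth, assuming each alignment is given as a 0-based start and a length with no gaps (an illustrative input format, not one produced by a specific tool).

```python
from collections import Counter


def coverage_depth(alignments):
    """Count how many reads cover each reference position.

    alignments: (start, length) pairs, 0-based, ungapped (assumed format).
    """
    depth = Counter()
    for start, length in alignments:
        for position in range(start, start + length):
            depth[position] += 1
    return depth


depth = coverage_depth([(0, 4), (2, 4), (3, 2)])
print([depth[position] for position in range(7)])
```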
biological reasons for sequence alignments
differential gene expression
studying the regulome
Differential gene expression using alignments
Number of reads aligned to that region = level of expression
Studying the regulome
regulatory regions in the genome
ChIP Seq (Chromatin Immunoprecipitation) - studying sequences where proteins are bound (e.g. transcription factors)
BIS Seq - studying methylation
ChIP Seq
Chromatin Immunoprecipitation
looks at regions bound by proteins (e.g. transcription factors)
Fix protein to DNA
Use antibody to pull those bits of DNA out
Unfix DNA
Sequence those bits of DNA
BIS Seq
methylation of base pairs
treat with bisulphite
converts non-methylated Cs to U
sequence and compare to ref. genome
any reference C where a T is seen (U is read as T in DNA) is unmethylated
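A simplified sketch of the comparison step, assuming the read is already aligned to the reference without gaps, on the same strand, starting at position 0 (all simplifications; the function name is illustrative).

```python
def methylation_calls(reference, read):
    """Classify each reference C from a bisulphite-treated, aligned read."""
    calls = {}
    for position, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[position] = "methylated"      # C was protected
        elif read_base == "T":
            calls[position] = "unmethylated"    # C -> U, read as T
        # Any other base is left uncalled (possible error or variant).
    return calls


print(methylation_calls("ACGTCCA", "ACGTTCA"))
```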
Variant Calling
detecting single nucleotide polymorphisms (SNP) or insertions/deletions compared to reference genome
work out biological implications
How to detect variation (variant calling)?
Software e.g. GATK (human), FreeBayes (others)
Uses SAM formatting file
Number of reads at a location
Quality of reads
Certainty of alignment
-> probability
Challenges in variant calling
sequencing error rate (e.g. Illumina 99.9% accurate)
PCR duplications (amplification of an error) - based on location (usually)
Poor coverage
Polyploidy (differences due to different alleles, not functional (phenotype) difference)
Missing regions of reference genome
How does the GATK software overcome variant calling issues?
“gold standards” - sequencing samples with known variants; should see these variants in the sample
How do variant callers work?
x number of reads out of y total are different + read quality + mapping probability + genotype calculation + standards information
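A toy sketch of the "x reads out of y are different" part only. It ignores read quality, mapping probability and standards information, and the genotype thresholds are made-up values, so it is not how GATK or FreeBayes actually compute calls.

```python
from collections import Counter


def alt_allele_fraction(ref_base, pileup_bases):
    """Return (most common non-reference base, its fraction of all reads)."""
    counts = Counter(pileup_bases)
    total = sum(counts.values())
    alt_counts = {base: n for base, n in counts.items() if base != ref_base}
    if not alt_counts:
        return None, 0.0
    alt_base = max(alt_counts, key=alt_counts.get)
    return alt_base, alt_counts[alt_base] / total


def naive_genotype(fraction, het_min=0.2, hom_min=0.8):
    """Crude diploid genotype from the alternate fraction.

    het_min and hom_min are hypothetical cut-offs for illustration.
    """
    if fraction >= hom_min:
        return "homozygous alternate"
    if fraction >= het_min:
        return "heterozygous"
    return "reference"


base, fraction = alt_allele_fraction("A", "AAAAGGGGGG")
print(base, fraction, naive_genotype(fraction))
```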
Different approaches of variant callers depend on…
single individual or multiple individuals
each variant locus independently or as a haplotype
variant locus independent approach to variant calling
variant is unrelated to everything else
haplotype approach to variant calling
looks for consistency of variants within a haplotype
looks for links between variants (e.g. if change at x always a change at y)
how to choose variant calling software
species speciality
e.g. GATK best for humans
FreeBayes better for everything else
Filtering Variants
Make sure that the variant call is certain
Variant Quality Score (like read quality)
Coverage (min. req. for number of reads)
Fraction of reads as an alternate allele - which have diff base
Base quality of alternate allele
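A hedged sketch of applying such filters, using a made-up `Variant` record and made-up thresholds (real pipelines would apply these to VCF fields with tools such as vcflib or vcftools, and the right cut-offs depend on the data).

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal variant record, for illustration only."""
    position: int
    quality: float   # variant quality score
    depth: int       # number of reads covering the site
    alt_reads: int   # reads supporting the alternate allele


def passes_filters(variant, min_quality=30.0, min_depth=10,
                   min_alt_fraction=0.2):
    """Keep a variant only if it clears every (assumed) threshold."""
    if variant.depth == 0:
        return False
    alt_fraction = variant.alt_reads / variant.depth
    return (variant.quality >= min_quality
            and variant.depth >= min_depth
            and alt_fraction >= min_alt_fraction)


candidates = [Variant(100, 50.0, 20, 9), Variant(200, 50.0, 4, 4)]
print([v.position for v in candidates if passes_filters(v)])
```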
Tools for filtering variants
vcflib or vcftools NOT variant calling software itself
Interpreting filtered variants
Location in genome
- coding/non-coding (alter protein product?)
- synonymous/non-synonymous
- what sort of seq. is it binding to - e.g. transcription factor binding (non-coding regions)/stop codon (coding region)
- type of impact (e.g. frameshift/INDEL…etc.)
contigs
Pieces of genome in genome assembly
scaffolds
pieces two (or more) contigs together, in order, across the gap between them
Genome Assembly Output
FASTA formatted sequence
Challenges to Genome Assembly
Common sequences (repetitive - e.g. the word ‘the’ in a book)
Repetitive regions
Gene families/pseudogenes - multiple copies of genes
Sequencing errors
Uneven Coverage
Single end sequencing data
e.g. DNA fragment ~1000bp, first 300bp sequenced (Illumina limit)
Paired-end sequencing data
e.g. DNA fragment ~1000bp
first 300 bp sequenced and last 300 bp sequenced, with gap for middle sequence
Mate pair sequencing data
Similar to paired-end - know that two seq. (contigs) should be near each other
Used for scaffolding
Can have larger middle gap
Up to 20kbp
Long reads sequencing data
Using new tech. - e.g. PacBio/MinION
Up to 2Mbp
Not as accurate
Initial assembly Illumina + long reads for scaffolding
Types of Assemblers
String Graph
de Bruijn Graph
String Graphs
theory for sequence assembly
Look for overlaps in reads - set minimum overlap requirement (e.g. 3 base pairs)
Add nodes and edges
Remove redundancy -> graph
Concept of overlaps
take sequences and see how overlap with each other, based on whether identical
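A minimal sketch of the overlap test between two reads, assuming exact (identical) overlaps only and a minimum overlap of 3 bases as in the example above (the function name is illustrative).

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right.

    Returns 0 if no overlap of at least min_overlap bases exists.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for length in range(longest_possible, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


print(overlap_length("GATTAC", "TACAGG"))
```

In a string graph, each read would be a node and an edge would be added wherever this function returns a non-zero length.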
Concept of graphs
idea of nodes joined with edges
e.g. node = known sequence
edge = overlap in sequences (seem to be lines between sequences)
de Bruijn Graphs
Split sequence into kmers (string of shorter seq. of k length (e.g. 3 = 3bp))
Looks for overlap of kmers, sets minimum overlap of k-1 (e.g. 3-1 = 2).
atc-tcg-cgt…etc
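A small sketch following the description above, with k-mers as nodes and an edge wherever the last k-1 bases of one k-mer equal the first k-1 bases of another (function names are illustrative; the all-against-all comparison is only suitable for tiny examples).

```python
def kmers(sequence, k):
    """Split a sequence into every overlapping substring of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_overlap_graph(sequence, k):
    """Map each k-mer to the k-mers that overlap it by k-1 bases."""
    nodes = sorted(set(kmers(sequence, k)))
    edges = {node: [] for node in nodes}
    for first in nodes:
        for second in nodes:
            # Overlap of k-1: suffix of one equals prefix of the other.
            if first[1:] == second[:-1]:
                edges[first].append(second)
    return edges


print(kmers("ATCGT", 3))
print(kmer_overlap_graph("ATCGT", 3))
```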
de Bruijn Graphs and repeating regions
which way to read the graph
atg cat gta (two atg repeating seq.)
so the atg seq. could line up with same region on genome
How do assemblers use graphs?
Path that goes through each node of graph at least once, with minimal length
->rebuilds genome (contigs)
contigs result when two regions cannot be joined
How to choose an assembler
types of work: single cell genomes/transcriptomics and metagenomics
Sequencing data = length of reads/Illumina (types single-end…etc.)
species - eukaryotic/prokaryotic
Long read data assemblers
e.g. Peregrine
e.g. Shasta
Examples of Assembler Software
e.g. SPAdes (bacterial genome)
A5 - sequencer-specific
ALLPATHS-LG - humans
Canu - long reads
Importance of kmer length
amount of nodes and edges
smaller kmer = more nodes and edges
quality vs contiguity (length of contigs) of data
What is a kmer?
length of DNA that DNA sequence is split into for assembler graph - de Bruijn
e.g. 3 kmer = 3 bp sections
How to determine best kmer length?
Assembly quality
Metric statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is number of genes what expected
- accuracy of assembly
Assembly quality determination
Assembly quality
Metric statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is number of genes what expected (marker genes)
- accuracy of assembly (coverage and contamination)
Consider heterozygosity (diploid vs haploid)
What is the N50?
point at which 50% of genome covered by contigs of x size or larger
e.g. 20 16 12 10 8 5 - N50 = 16
(higher contig value is better)
does not take into account missing regions
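The worked example above can be checked with a short sketch (the function name is illustrative): sort contigs from longest to shortest and add them up until half of the total assembly length is reached.

```python
def n50(contig_lengths):
    """Contig length at which half of the total assembly is covered."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("no contigs given")


# Total = 71, half = 35.5; 20 + 16 = 36 reaches it, so N50 = 16.
print(n50([20, 16, 12, 10, 8, 5]))
```

The total here is the assembly length, not the true genome length, which is why missing regions are not taken into account.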
Presence of marker genes…
looks for orthologues in related species
shows that expected number of genes
BUSCO - relies on evolutionary data (prone to error)
Coverage and contamination…
based on GC content and coverage
GC content different between species (identifier) & different sequencing depth for diff. species
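A minimal sketch of the GC content calculation for a single contig (the function name is illustrative). Contigs whose GC content or coverage sits far from the rest of the assembly are candidates for contamination.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


print(gc_content("GATTACA"))
```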
Assembly annotation
Promoters
Telomeres/centromeres
Levels of genome annotation
Look for start and stop codons - ORF
Compare start/stop location to database of another species - try and find orthologues (BLAST)
Look at transcriptomic data - this is transcribed
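A simplified sketch of the first annotation level, looking for start and stop codons on the forward strand only (the function name and the minimum-length parameter are illustrative; real gene finders also scan the reverse strand and use more evidence).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end) pairs for ORFs in the three forward frames.

    start is the 0-based index of ATG; end is exclusive and includes the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATGACC"))
```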