Sequencing Flashcards
Genome Sequencing Methodology
5-10 times as many anonymous participants as needed provided DNA samples
Samples taken at local sites. DNA extracted from blood
Sequence derived from a composite of the genomes of a fraction of the participants, whose identities are known by nobody
BAC libraries
Bacterial Artificial Chromosomes
Sorted chromosomes from which DNA is isolated
Restriction Enzymes cut specific palindromic sequences
Restriction enzymes cut isolated DNA into multiple fragments
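A minimal sketch of a restriction digest, assuming the EcoRI recognition site GAATTC (cut after the G) as the example enzyme. The function names and the test sequence are illustrative and not part of the original cards.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def is_palindromic(site):
    """A site is palindromic when it equals its own reverse complement."""
    return site == reverse_complement(site)


def digest(dna, site="GAATTC", cut_offset=1):
    """Cut dna at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])  # remainder after the last cut
    return fragments


fragments = digest("AAGAATTCGGGGAATTCTT")
```

Joining the fragments back together gives the original sequence, which is a quick sanity check on any digest routine.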
Creation of BAC libraries
DNA fragments inserted into circular DNA and introduced into bacteria (BACs)
Single sequences called CONTIGS
BAC clones
Dilute solution of bacteria can be cultured on agar plate and the colonies produced are clones
Single colony contains clones of DNA sequence
Clones then used for sequencing
BAC automation
Automated massively parallel creation of BACs
Copied DNA isolated and sequenced
Computational tools applied to obtain the physical map
Production of physical map
Select clones for sequencing (overlapping)
Sequence to at least draft coverage
Merge data
Order and orient with mRNA, paired end reads and other data
Genetic mapping
Produced using a physical map by assessing the location of the genes.
Genes on same chromosome are ‘linked’.
More recently, the position of genes is determined from the frequency with which recombination has occurred between them.
FISH mapping
Fluorescence in situ hybridization
Attach fluorescent labels to DNA sequences
Process chromosomes on glass so location of specific genes within the chromosome can be identified
Sequencing developments
Can do 20kb with 99.5% accuracy
Can sequence mRNA directly
Only suitable for a single strand of DNA
Current sequencing methods
PacBio HiFi - Mid length, Mid accuracy
Illumina - Low length, High accuracy
Oxford Nanopore - High length, Low accuracy
Not available during Human Genome Mapping Project
PacBio HiFi
Polymerase enzyme, nano-sized hole
Single strand of DNA introduced
Fluorescent nucleotides emit light as they are ‘stitched’ into the complementary double strand
Colour of light emission provides accurate sequencing
Illumina Sequencing
Individual pieces of DNA attached to glass surface
Sequencing by synthesis
As each complementary nucleotide is attached, fluorescence is produced
Oxford Nanopore
Double strand of DNA unzipped
Single strand inserted into protein nanopore
Electric current created by flow of ions which is a function of the nucleic acid base
Current as a function of time provides sequence information
Linkage distance
Distance in bp between genes on the same chromosome
Smaller linkage distance = more likely to be inherited together
Make up of Human Genome
Only 2% contains exons
26% introns
Only recently been able to understand role of other sequence information (lots of repetitive sequences)
Sequence reassembly - Reducing computational efforts
Sequencing a large array of overlapping short fragments (contigs) created from the BACs
Short sequences are called reads
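The idea of reassembling overlapping reads can be sketched with a greedy merge. This is a toy illustration under the assumption of error-free reads from one strand; real assemblers use far more robust methods, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (at least min_len), else 0."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


assembled = greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"])
```

The pairwise comparison is quadratic in the number of reads, which is one reason index structures such as tries and suffix trees are used to cut the computational effort.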
Gel electrophoresis
Comparing size of fragments/contigs
Fragments migrate in an applied electric field
Shortest move the fastest
Digital Trees/Trie
Multiway tree often used for storing large sets of words
Trees with a possible branch for every letter of an alphabet
Words end with $
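A minimal sketch of a trie built from nested dictionaries, with `$` marking the end of a stored word as described above. The class and method names are illustrative.

```python
class Trie:
    """Multiway tree: each node maps a letter to a child node."""

    END = "$"  # marks the end of a stored word

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.setdefault(letter, {})
        node[self.END] = {}

    def contains(self, word):
        node = self.root
        for letter in word:
            if letter not in node:
                return False
            node = node[letter]
        return self.END in node


trie = Trie()
for word in ["GAT", "GATTACA", "TAG"]:
    trie.insert(word)
```

Without the `$` marker, the trie could not tell that GAT is a stored word while GATT is only a prefix of one.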
Trie usage
Implementation of sets
Quicker insertion, deletion and find
Quicker than binary trees and hash tables
Spell checkers, completion algorithms, longest-prefix matching, hyphenation
Search finds longest match between words in set and query
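Longest-prefix matching can be sketched as a single walk down the trie, remembering the last point at which a stored word ended. The helper names and the word list are assumptions made for illustration.

```python
END = "$"


def build_trie(words):
    """Build a nested-dictionary trie; END marks where a word stops."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True
    return root


def longest_prefix_match(root, query):
    """Return the longest stored word that is a prefix of query, or None."""
    node = root
    best = None
    for i, letter in enumerate(query):
        if letter not in node:
            break
        node = node[letter]
        if END in node:
            best = query[: i + 1]
    return best


root = build_trie(["car", "card", "care", "cat"])
```

The search cost depends on the length of the query, not on how many words are stored.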
Sequence analysis - Tries
Can store DNA/proteins
Finding next fitting section in DNA reconstruction
Useful for finding errors, only need to search a small sub-tree
DNA gives a 4-way tree, so the tree is deep but does not waste as much memory
Searching for particular sequence motifs
Finding protein coding genes
Ab initio computer approaches
Finding common sequences (start and end of protein coding genes)
Promoter regions - protein binding
Start codons
Stop codons
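A rough sketch of the start/stop codon part of an ab initio search: scan the three forward reading frames for ATG followed by an in-frame stop codon. This assumes a single strand with no introns, so it is closer to a bacterial ORF scan than to a real eukaryotic gene finder. The function name and `min_codons` parameter are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of open reading frames on the given strand.

    min_codons counts the codons before the stop codon, including ATG.
    """
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                # Walk forward codon by codon until an in-frame stop codon.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
```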
Regulatory Region
Promoter - TATA box - Start of 5’ UTR
Transcription and Splicing
Removal of introns in transcribed regions
Results in mRNA
Regulatory Region Function
In this sequence, RNA polymerase will bind to initiate the transcription of the DNA into RNA
Promoter Sequence
First binds the RNA polymerase
Upstream (5’ end) of the transcription initiation site
100-1000 base pairs long
High occurrence of AA, AT, TA and TT dinucleotides (also A+T trinucleotides)
Overrepresentation of GC, GG, CG, AG, GA, TG downstream of promoter
TATA box
30% of human genes
Contains sequence TATAWAW
W = A or T
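The TATAWAW consensus maps directly onto a regular expression, since W stands for A or T. A small sketch, with an invented test sequence and a hypothetical function name:

```python
import re

# W means A or T, so TATAWAW becomes TATA[AT]A[AT].
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(dna):
    """Return (position, matched sequence) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in TATA_BOX.finditer(dna.upper())]


hits = find_tata_boxes("GGCTATAAAAGGCCTATATATCC")
```

A match on its own is only a candidate promoter element; the sequence composition around it would also need checking.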
Benefits of sequencing the mRNA
Start codons, stop codons and exon sequences can be looked for in both the chromosomal DNA and the mRNA
Can find them with tries
Subsequent codons in mRNA are in groups of three for coding amino acids in sequence
Start codon unique
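Reading codons in groups of three from the start codon can be sketched as follows. The codon table here is deliberately partial (only the codons used in the example, taken from the standard genetic code), and `translate` is an illustrative name.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "AAA": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(mrna):
    """Translate from the first AUG, three bases at a time, until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE.get(mrna[i:i + 3], "???")  # ??? = not in this partial table
        if amino == "Stop":
            break
        peptide.append(amino)
    return peptide


peptide = translate("GGAUGUUUGGCAAAUAAGG")
```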
Memory issues with tries/Time issues with tries
A regular trie can be used as a suffix tree, but it would typically use far too much memory to be useful
Use of pointers to the original text
Can build a suffix tree using O(n) memory where n is the length of the text
Also a linear time O(n) algorithm for suffix tree construction (non-trivial)
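To illustrate the memory point, the sketch below counts the nodes of a naive suffix trie, which can grow quadratically with the text length, and contrasts it with storing only integer positions into the original text. The second structure is a suffix array, swapped in here as a simpler stand-in for the "pointers to the original text" idea. It is not a suffix tree, and the sort-based construction shown is not the linear-time algorithm.

```python
def suffix_trie_node_count(text):
    """Build a naive suffix trie (one node per character) and count its nodes."""
    root = {}
    count = 1  # the root
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            if letter not in node:
                node[letter] = {}
                count += 1
            node = node[letter]
    return count


def suffix_array(text):
    """Sorted start positions of all suffixes: n integers pointing into text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


nodes = suffix_trie_node_count("ACGT$")
positions = suffix_array("ACGT$")
```

For a text with no repeated letters, every suffix adds a full path of new nodes, so the trie holds 1 + n(n+1)/2 nodes while the position list holds only n entries.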
When to use suffix trees
Efficient when it is likely that you will need to do multiple searches
Exact word matching
Use with dynamic programming for inexact matching (match with smallest edit distance)
Bioinformatics, Advanced ML
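The dynamic programming part of inexact matching can be sketched with the classic edit distance recurrence (insertions, deletions and substitutions each cost 1). This shows only the distance calculation, not its combination with a suffix tree, and `closest_match` is a hypothetical helper.

```python
def edit_distance(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def closest_match(query, candidates):
    """Return the candidate with the smallest edit distance to query."""
    return min(candidates, key=lambda c: edit_distance(query, c))
```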
Suffix trees with genome sequences
Suffix trees are valuable given the number of repeats present in the genome sequences
With more unique reads in the genome, becomes less efficient
Genome Homology
Human genomes are 99.9% homologous
Variants - Removal of Negative Mutations
100s of new mutations in offspring for each generation
Most mutations neutral in phenotypical effect or removed by negative selection
Many mutations corrected by repair enzyme machinery of the cell
Variants - Mutations causing an advantage
Occasionally mutations give offspring a survival or reproductive advantage (positive selection)
Mutations occurring in the genome
Mutations don’t occur randomly.
Occur in particular regions in the genome known as hotspots
Variant definition
Permanent change in the DNA sequence which makes up a gene
Variant as opposed to gene mutation
Such changes do not always cause disease and can be present in non-coding regions
Allele
Variation of a given gene at the same position (locus) on the chromosome
Can also be present in non-coding regions
Typically multiple alleles at locus between different individuals in population
Polymorphism
Allelic variation determined as the number of alleles present
Phenotypic traits
Derived from the transmission of genes and alleles to an organism’s offspring
SNP
Single nucleotide polymorphism
Most common variation in human genomic DNA
Single nucleotide differs between members of the population/chromosome pairs
4-5 million in each person’s genome
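A minimal sketch of spotting single nucleotide differences, assuming the two sequences are already aligned and of equal length. Real variant calling works from many reads and quality scores; the function name is illustrative.

```python
def find_snps(reference, sample):
    """Return (position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]


snps = find_snps("ACGTACGT", "ACGTTCGT")
```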
Other genomic polymorphisms
Deletions and insertions
Chromosome synteny
Used to describe genes which lie on the same chromosome
More recently, the term is used for the conservation of blocks of gene order between two compared chromosomes
Repetitive Sequences
aka repetitive elements, repeating units, repeats
Make up approximately 50% of the human genome
Dispersed repeats
Recognized as potential source of genetic variation and regulation
Tandem repeat sequences (trinucleotide repeats)
Important in several human diseases
Repeats within an exon region cause protein misfolding when present in high numbers (>40 copies for Huntington’s disease)
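Counting the longest uninterrupted run of a trinucleotide can be sketched with a regular expression. The CAG unit is the repeat associated with Huntington's disease, the threshold of 40 comes from the card above, and the function name and test sequence are invented for illustration.

```python
import re

REPEAT_THRESHOLD = 40  # copy number quoted on the card for Huntington's disease


def longest_repeat_run(dna, unit="CAG"):
    """Return the largest number of back-to-back copies of unit in dna."""
    runs = re.findall(f"(?:{re.escape(unit)})+", dna)
    return max((len(run) // len(unit) for run in runs), default=0)


copies = longest_repeat_run("AT" + "CAG" * 5 + "TT" + "CAG" * 2)
exceeds = copies > REPEAT_THRESHOLD
```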
CpG islands
Sequences containing repeats of CG closer to the 5’ end of the gene sequence (promoter)
At least 200bp long
C+G content > 50%
Observed/expected CpG frequency > 0.6
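The three criteria can be checked with a short function. This sketch assumes the commonly used definition of the observed/expected ratio, namely the CpG count divided by (C count × G count / sequence length), which the cards do not spell out. The function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = (c * g) / n if n else 0.0  # assumed definition of expected CpG count
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length, C+G content and observed/expected thresholds."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and ratio > min_ratio
```

In practice the test is applied to a sliding window along the genome rather than to one fixed sequence.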
Expected frequency of CpG islands
Human genome has 42% GC content
Expected frequency of a CpG = 0.21 × 0.21 ≈ 4.4% (42% GC content means 21% C and 21% G)
Actual frequency is about 1%
Location of alleles or genes in chromosomes
Defined by bands (historically created by G-stain)
BRCA2
Breast/Prostate cancer
BRCA1 and BRCA2 are sequenced from blood samples
Can use suffix trees to detect which of the stable mutations are present
Short specific sequence motifs (mutations) within the flanking base pairs can be mined