Sequencing Flashcards

Question 1

Q

Genome Sequencing Methodology

Answer

A

5-10 times as many anonymous participants as needed provided DNA samples
Samples taken at local sites. DNA extracted from blood
Sequence is a composite of the genomes of a fraction of the participants; nobody knows whose DNA was used

Question 2

Q

BAC libraries

Answer

A

Bacterial Artificial Chromosomes
Sorted chromosomes from which DNA is isolated
Restriction Enzymes cut specific palindromic sequences
Restriction enzymes cut the isolated DNA into multiple fragments
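
A minimal sketch of the cutting step, assuming the EcoRI recognition site GAATTC (a palindromic site cut after the first G). The function names and the example sequence are made up for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut `sequence` at every occurrence of `site`, `cut_offset` bases into the site."""
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


# A palindromic site reads the same on both strands (5' to 3').
print(reverse_complement("GAATTC") == "GAATTC")
# Hypothetical sequence containing two sites.
print(digest("AAGAATTCCCGAATTCTT"))
```

Joining the fragments back together gives the original sequence, which is a quick sanity check on the cut positions.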

Question 3

Q

Creation of BAC libraries

Answer

A

DNA fragments inserted into circular DNA and introduced into bacteria (BACs)
Single contiguous sequences are called contigs

Question 4

Q

BAC clones

Answer

A

Dilute solution of bacteria can be cultured on agar plate and the colonies produced are clones
Single colony contains clones of DNA sequence
Clones then used for sequencing

Question 5

Q

BAC automation

Answer

A

Automated, massively parallel creation of BACs
Copied DNA isolated and sequenced
Computational tools applied to obtain the physical map

Question 6

Q

Production of physical map

Answer

A

Select clones for sequencing (overlapping)
Sequence to at least draft coverage
Merge data
Order and orient with mRNA, paired end reads and other data

Question 7

Q

Genetic mapping

Answer

A

Produced using a physical map by assessing the location of the genes.
Genes on same chromosome are ‘linked’.
More recently, the position of genes is determined from the frequency with which recombination has occurred between them.
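
A small sketch of the recombination-frequency idea. The gene names and offspring counts are hypothetical; the only assumption is that frequency is the number of recombinant offspring divided by the total counted.

```python
def recombination_frequency(recombinant: int, total: int) -> float:
    """Fraction of offspring in which two genes were separated by recombination."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinant / total


# Hypothetical counts: (recombinant offspring, total offspring) per gene pair.
crosses = {("A", "B"): (12, 200), ("A", "C"): (60, 200)}

for genes, (recombinant, total) in crosses.items():
    print(genes, recombination_frequency(recombinant, total))
```

In this made-up data A and B recombine less often than A and C, so A and B would be placed closer together on the map.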

Question 8

Q

FISH mapping

Answer

A

Fluorescence in situ hybridization
Attach fluorescent labels to DNA sequences
Process chromosomes on glass so location of specific genes within the chromosome can be identified

Question 9

Q

Sequencing developments

Answer

A

Can do 20kb with 99.5% accuracy
Can sequence mRNA directly
Only suitable for a single strand of DNA

Question 10

Q

Current sequencing methods

Answer

A

PacBio HiFi - Mid length, Mid accuracy
Illumina - Low length, High accuracy
Oxford Nanopore - High length, Low accuracy

Not available during Human Genome Mapping Project

Question 11

Q

PacBio HiFi

Answer

A

Polymerase enzyme, nano-sized hole
Single strand of DNA introduced
Fluorescent nucleotides emit light as they are ‘stitched’ into the complementary double strand
Colour of light emission provides accurate sequencing

Question 12

Q

Illumina Sequencing

Answer

A

Individual pieces of DNA attached to glass surface
Sequencing by synthesis
As each complementary nucleotide is attached, fluorescence is produced

Question 13

Q

Oxford Nanopore

Answer

A

Double strand of DNA unzipped
Single strand inserted into protein nanopore
Electric current created by flow of ions which is a function of the nucleic acid base
Current as a function of time provides sequence information

Question 14

Q

Linkage distance

Answer

A

Distance in bp between genes on the same chromosome

Smaller linkage distance = more likely to be inherited together

Question 15

Q

Make up of Human Genome

Answer

A

Only 2% contains exons
26% introns
Only recently been able to understand role of other sequence information (lots of repetitive sequences)

Question 16

Q

Sequence reassembly - Reducing computational efforts

Answer

A

Sequencing a large array of overlapping short fragments (contigs) created from the BACs
Short sequences are called reads
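
A toy sketch of reassembling overlapping reads, assuming error-free reads from one strand and a simple greedy merge (always join the pair with the longest overlap). This is an illustration only, not the assembler used in any real project; the reads are made up.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```

Comparing every pair of reads is quadratic in the number of reads, which is one reason index structures such as tries and suffix trees matter for real data.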

Question 17

Q

Gel electrophoresis

Answer

A

Comparing size of fragments/contigs
Fragments migrate in an applied electric field
Shortest move the fastest
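
A very small sketch of the ordering a gel produces, assuming migration depends only on fragment length. The fragment strings are hypothetical.

```python
def gel_order(fragments: list[str]) -> list[str]:
    """Order fragments from farthest migrated (shortest) to least migrated (longest)."""
    return sorted(fragments, key=len)


for fragment in gel_order(["AATTCCCG", "AAG", "AATTCTT"]):
    print(len(fragment), fragment)
```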

Question 18

Q

Digital Trees/Trie

Answer

A

Multiway tree often used for storing large sets of words
Trees with a possible branch for every letter of an alphabet
Words end with $
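
A minimal trie sketch using nested dictionaries, with `$` as the end-of-word marker as on this card. The function names and the word list are illustrative assumptions.

```python
END = "$"  # end-of-word marker


def make_trie(words: list[str]) -> dict:
    """Build a trie where each node is a dict mapping a letter to a child node."""
    root: dict = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = {}  # mark that a complete word ends here
    return root


def contains(trie: dict, word: str) -> bool:
    """Return True only if `word` was inserted as a complete word."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return END in node


trie = make_trie(["cat", "car", "card"])
print(contains(trie, "car"), contains(trie, "ca"))
```

Without the `$` marker, "ca" would be indistinguishable from a stored word because it is a prefix of "cat" and "car".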

Question 19

Q

Trie usage

Answer

A

Implementation of sets
Quicker insertion, deletion and find
Quicker than binary trees and hash tables
Spell checkers, completion algorithms, longest-prefix matching, hyphenation
Search finds longest match between words in set and query
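
A sketch of longest-prefix matching, assuming "longest match" means the longest stored word that is a prefix of the query. Names and words are hypothetical.

```python
def build(words: list[str]) -> dict:
    """Build a dict-based trie; the key "$" marks the end of a stored word."""
    root: dict = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = True
    return root


def longest_prefix_match(trie: dict, query: str):
    """Return the longest stored word that is a prefix of `query`, or None."""
    node, best = trie, None
    for depth, letter in enumerate(query):
        if "$" in node:
            best = query[:depth]  # a stored word ends at this depth
        if letter not in node:
            return best
        node = node[letter]
    if "$" in node:
        best = query
    return best


trie = build(["in", "inter", "internet"])
print(longest_prefix_match(trie, "internal"))
```

The search walks at most one node per query letter, so its cost depends on the query length rather than on how many words are stored.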

Question 20

Q

Sequence analysis - Tries

Answer

A

Can store DNA/proteins
Finding next fitting section in DNA reconstruction
Useful for finding errors, only need to search a small sub-tree
For DNA the trie is 4-way (A, C, G, T), so it is deep but doesn’t waste much memory
Searching for particular sequence motifs
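
A sketch of motif search with a 4-way DNA trie, assuming every length-k window (k-mer) of the sequence is stored with its start position. The function names, the choice of k and the sequence are illustrative.

```python
def build_kmer_trie(sequence: str, k: int) -> dict:
    """Insert every k-mer of `sequence`; the "$" key holds its start positions."""
    root: dict = {}
    for start in range(len(sequence) - k + 1):
        node = root
        for base in sequence[start:start + k]:
            node = node.setdefault(base, {})
        node.setdefault("$", []).append(start)
    return root


def find_motif(trie: dict, motif: str) -> list[int]:
    """Start positions of stored k-mers that begin with `motif` (len(motif) <= k)."""
    node = trie
    for base in motif:
        if base not in node:
            return []
        node = node[base]
    positions, stack = [], [node]
    while stack:  # only the sub-tree below the motif is visited
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                positions.extend(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_kmer_trie("ACGTACGA", k=4)
print(find_motif(trie, "ACG"))
```

Occurrences within the last k-1 bases that do not start a full k-mer are not indexed in this sketch; a suffix trie or suffix tree avoids that limitation.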

Question 21

Q

Finding protein-coding genes

Answer

A

Ab initio
Computer approaches
Finding common sequences (start and end of protein coding genes)
Promoter regions - protein binding
Start codons
Stop codons
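
A simplified sketch of the start/stop codon search, assuming the forward strand only, the start codon ATG and the stop codons TAA, TAG and TGA. Real gene finders also use promoters, splice sites and statistics; the sequence here is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of open reading frames, end exclusive."""
    orfs = []
    for frame in range(3):  # three possible reading frames on this strand
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```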

Question 22

Q

Regulatory Region

Answer

A

Promoter - TATA box - Start of 5’ UTR

Question 23

Q

Transcription and Splicing

Answer

A

Removal of introns in transcribed regions

Results in mRNA

Question 24

Q

Regulatory Region Function

Answer

A

RNA polymerase binds to this sequence to initiate the transcription of the DNA into RNA

Question 25

Q

Promoter Sequence

Answer

A

First binds the RNA polymerase
Located upstream (5’ end) of the transcription initiation site
100-1000 base pairs long
High occurrence of AA,AT,TA and TT dinucleotides (also A+T trinucleotides)

Over-representation of GC, GG, CG, AG, GA and TG dinucleotides downstream of the promoter
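
A sketch of counting dinucleotides to see how A/T-rich a region is. The function names and test sequences are assumptions for illustration; no particular threshold is implied by the card.

```python
from collections import Counter

AT_DINUCLEOTIDES = ("AA", "AT", "TA", "TT")


def dinucleotide_counts(seq: str) -> Counter:
    """Count every overlapping pair of adjacent bases."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def at_dinucleotide_fraction(seq: str) -> float:
    """Fraction of dinucleotides that are AA, AT, TA or TT."""
    counts = dinucleotide_counts(seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[d] for d in AT_DINUCLEOTIDES) / total


print(dinucleotide_counts("TATAAT"))
print(at_dinucleotide_fraction("TATAAT"), at_dinucleotide_fraction("GCGC"))
```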

Question 26

Q

TATA box

Answer

A

30% of human genes
Contains sequence TATAWAW
W = A or T
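
The TATAWAW pattern can be written as a regular expression, with W expanded to the character class `[AT]`. A minimal sketch; the example sequences are made up.

```python
import re

# TATAWAW with W = A or T
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(seq: str) -> list[int]:
    """Return start positions of non-overlapping TATAWAW matches."""
    return [match.start() for match in TATA_BOX.finditer(seq)]


print(find_tata_boxes("GGCTATAAAAGGC"))
print(find_tata_boxes("GGCTATACAGG"))
```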

Question 27

Q

Benefits of sequencing the mRNA

Answer

A

Start codons, stop codons and exon sequences can be looked for in both the chromosomal DNA and the mRNA
Can find them with tries
Subsequent codons in mRNA are in groups of three for coding amino acids in sequence
Start codon unique
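
A sketch of reading an mRNA in groups of three from the start codon, assuming the start codon AUG and the stop codons UAA, UAG and UGA. The mRNA string is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def codons_from_start(mrna: str) -> list[str]:
    """Split the mRNA into codons from the first AUG up to and including a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    codons = []
    for pos in range(start, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


print(codons_from_start("GGAUGGCUUAAGG"))
```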

Question 28

Q

Memory issues with tries/Time issues with tries

Answer

A

A regular trie can be used as a suffix tree, but would typically use far too much memory to be useful
Use of pointers to the original text
Can build a suffix tree using O(n) memory where n is the length of the text
There is also a linear-time O(n) algorithm for suffix tree construction (non-trivial)
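
A sketch showing why a regular trie of all suffixes uses too much memory: it counts the nodes created when every suffix is inserted character by character. This is the naive structure, not the compressed O(n) suffix tree (which stores pointers into the text and is not implemented here).

```python
def suffix_trie_nodes(text: str) -> int:
    """Number of nodes (including the root) in an uncompressed trie of all suffixes."""
    text += "$"  # terminator so that every suffix ends at its own leaf
    root: dict = {}
    count = 1
    for start in range(len(text)):
        node = root
        for char in text[start:]:
            if char not in node:
                node[char] = {}
                count += 1
            node = node[char]
    return count


for sample in ("ab", "abcd", "abcdefgh"):
    print(len(sample), suffix_trie_nodes(sample))
```

For a text with no repeated characters the node count grows quadratically with the length, while repeats let suffixes share nodes.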

Question 29

Q

When to use suffix trees

Answer

A

Efficient when it is likely that you will need to do multiple searches
Exact word matching
Use with dynamic programming for inexact matching (match with smallest edit distance)
Bioinformatics, Advanced ML
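
A sketch of the dynamic-programming part of inexact matching: the standard edit distance (insertions, deletions, substitutions each cost 1). It compares whole strings and does not use a suffix tree; the candidate sequences are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def best_match(query: str, candidates: list[str]) -> str:
    """Candidate with the smallest edit distance to `query`."""
    return min(candidates, key=lambda candidate: edit_distance(query, candidate))


print(edit_distance("kitten", "sitting"))
print(best_match("ACGT", ["TTTT", "ACGA", "GGGG"]))
```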

Question 30

Q

Suffix trees with genome sequences

Answer

A

Suffix trees are valuable given the number of repeats present in the genome sequences
With more unique reads in the genome, becomes less efficient

Question 31

Q

Genome Homology

Answer

A

Genomes of humans are 99.9% homologous

Question 32

Q

Variants - Removal of negative mutations

Answer

A

100s of new mutations in offspring for each generation
Most mutations neutral in phenotypical effect or removed by negative selection
Many mutations corrected by repair enzyme machinery of the cell

Question 33

Q

Variants - Mutations causing an advantage

Answer

A

Occasionally mutations give offspring an advantage in survival or reproduction (positive selection)

Question 34

Q

Mutations occurring in the genome

Answer

A

Mutations don’t occur randomly.

Occur in particular regions in the genome known as hotspots

Question 35

Q

Variant definition

Answer

A

Permanent change in the DNA sequence which makes up a gene

Question 36

Q

Variant as opposed to gene mutation

Answer

A

Such changes do not always cause disease and can be present in non-coding regions

Question 37

Q

Allele

Answer

A

Variation of a given gene at the same position (locus) on the chromosome
Can also be present in non-coding regions
Typically multiple alleles at locus between different individuals in population

Question 38

Q

Polymorphism

Answer

A

Allelic variation determined as the number of alleles present

Question 39

Q

Phenotypic traits

Answer

A

Derived from the transmission of genes and alleles to an organism’s offspring

Question 40

Q

SNP

Answer

A

Single nucleotide polymorphism
Most common variation in human genomic DNA
Single nucleotide differs between members of the population/chromosome pairs
4-5 million in each person’s genome
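
A sketch of spotting single-nucleotide differences, assuming two sequences of equal length that are already aligned (no insertions or deletions). The sequences are made up.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference base, sample base) wherever the bases differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, ref_base, sample_base)
            for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
            if ref_base != sample_base]


print(find_snps("ACGTAC", "ACCTAA"))
```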

Question 41

Q

Other genomic polymorphisms

Answer

A

Deletions and insertions

Question 42

Q

Chromosome synteny

Answer

A

Used to define genes which lie on the same chromosome

More recently, the term is used for the conservation of blocks of order between two compared chromosomes

Question 43

Q

Repetitive Sequences

Answer

A

aka repetitive elements, repeating units, repeats

Make up approximately 50% of the human genome

Question 44

Q

Dispersed repeats

Answer

A

Recognized as potential source of genetic variation and regulation

Question 45

Q

Tandem repeat sequences (trinucleotide repeats)

Answer

A

Important in several human diseases
Expansion of repeats within an exon region causes protein misfolding when present in high numbers (>40 copies for Huntington’s disease)
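
A sketch of counting tandem trinucleotide repeats, assuming the CAG unit (the repeat involved in Huntington's disease) and the >40 copy threshold from this card. The function names and test sequences are illustrative.

```python
import re


def longest_repeat_run(seq: str, unit: str = "CAG") -> int:
    """Largest number of back-to-back copies of `unit` found in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


def exceeds_threshold(seq: str, unit: str = "CAG", threshold: int = 40) -> bool:
    """True when the longest run has more than `threshold` copies."""
    return longest_repeat_run(seq, unit) > threshold


example = "TT" + "CAG" * 5 + "TT" + "CAG" * 2
print(longest_repeat_run(example))
print(exceeds_threshold("CAG" * 41))
```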

Question 46

Q

CpG islands

Answer

A

Sequences containing repeats of CG closer to the 5’ end of the gene sequence (promoter)
At least 200bp long
C+G content > 50%
Observed/expected CpG frequency > 0.6
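
A sketch that applies the three criteria to a sequence. The observed/expected ratio is assumed here to be (number of CG) x (length) / ((number of C) x (number of G)), a commonly used definition; the card itself does not state the formula. Function names and test sequences are made up.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (C+G fraction, observed/expected CpG ratio) for `seq`."""
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count, g_count = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    expected = c_count * g_count / n  # assumed definition of "expected"
    ratio = observed / expected if expected else 0.0
    return (c_count + g_count) / n, ratio


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the length, C+G content and observed/expected thresholds."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and ratio > min_ratio


print(cpg_stats("CG" * 100))
print(looks_like_cpg_island("CG" * 100), looks_like_cpg_island("AT" * 100))
```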

Question 47

Q

Expected frequency of CpG islands

Answer

A

Human genome has 42% GC content
Expected frequency of a CpG = 0.21 ** 2 ≈ 4.4% (C and G each make up about 21% of bases)
Actual frequency is 1%

Question 48

Q

Location of alleles or genes in chromosomes

Answer

A

Defined by bands (historically created by Giemsa staining, G-banding)

Question 49

Q

BRCA2

Answer

A

Breast/Prostate cancer
BRCA1 and BRCA2 are sequenced from blood samples
Can use suffix trees to detect which of the stable mutations are present
Short specific sequence motifs (mutations) within the flanking base pairs can be mined