Genomes and Genome Sequencing Flashcards

Question 1

Q

Applications of studying genomics

Answer

A

Research
Health (e.g. diagnostic)
Environment (e.g. pollutants)
Agriculture (e.g. livestock, nutrients)

Question 2

Q

Health example for genomics

Answer

A

Causes of severe intellectual disability in children (42% of cases linked to DNA compared to 12% using other methods)

Question 3

Q

Disease example for genomics

Answer

A

Inflammatory Bowel Disease (Crohn’s disease)
more viral DNA = more viruses
viruses were bacteriophages
they infected gut bacteria and affected gut bacteria population -> Crohn’s disease

Question 4

Q

Disease Outbreak Tracking for genomics (only need one)

Answer

A

Ebola - finding point of origin, watching it change over time

HIV - identified known origin, identified species crossovers

Influenza - track current outbreaks to inform vaccine choices for the coming winter in the opposite hemisphere; identify crossover and crossover potential of strains

Question 5

Q

The third generation of DNA sequencing

Answer

A

Longer DNA sequences

Question 6

Q

Sanger Sequencing

Answer

A

Chain termination sequencing

Uses ddNTPs (fluorescently labelled dideoxynucleotides that terminate the chain)

Question 7

Q

How does Sanger Sequencing work

Answer

A

polymerase rebuilds the double helix using normal nucleotides, then randomly adds a fluorescently labelled base; the polymerase stops and the strand is terminated at that point

-> strands of DNA of varying lengths, each ending with a fluorescently labelled base
(repeated as many times as required so that every base position in the sequence is substituted)

Then run the fragments through capillary gel electrophoresis (separates them by length)

Record fluorescence

Each base is a diff. colour
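
A minimal sketch of the read-out logic above, assuming an idealised reaction in which a terminated fragment exists for every position. The function names are illustrative only.

```python
def terminated_fragments(new_strand):
    """Every prefix of the synthesised strand, as if termination occurred at each position."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_sequence(fragments):
    """Sort fragments by length (as electrophoresis does) and read each labelled end base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


fragments = terminated_fragments("GATTACA")
# The input order does not matter: the sequence is recovered from fragment length alone.
print(read_sequence(fragments[::-1]))
```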

Question 8

Q

Downsides of Sanger Sequencing

Answer

A

Slow

Expensive

Not high throughput

Errors in repetitive regions (lots of bases similar to each other, next to each other)

Bias in sequencing (certain regions better amplified than others)

Question 9

Q

Library Preparation

Answer

A

Extract DNA from cells

Fragment DNA (50-1000bp)

Add adaptors (one at either end of the sequence): one will stick to the sequencer (flow cell), the other will be the start point for the sequencing reaction

Amplification

Question 10

Q

Issues with Library Preparation

Answer

A

Bias in amplification

Question 11

Q

How does Illumina Sequencing work?

Answer

A

Fragments added to the flow cell - bind to flow cell (adapter-flow cell)

Polymerase starts at the top (furthest from the flow cell) and adds in fluorescently labelled nucleotides (randomly, one at a time)
+ laser excitation, fluorescence recorded

Question 12

Q

Benefits of Illumina Sequencing

Answer

A

Fast

Cheap

High throughput

Question 13

Q

Issues with Illumina Sequencing

Answer

A

Repetitive regions

Amplification

Length restrictions

Question 14

Q

Third generation sequencing

Answer

A

removes the length restriction

removes the need for amplification

Question 15

Q

PacBio SMRT

Answer

A

uses Single Molecule, Real-time Technology

Zero-mode waveguides

One piece of DNA per well

Polymerase in the well adds fluorescently labelled nucleotides (like Illumina) to a single piece of DNA

Question 16

Q

PacBio Considerations

Answer

A

Higher error rates
No need for amplification
Longer, but not genome-length

Question 17

Q

Oxford Nanopore MinION

Answer

A

Very small

Membrane with many pores

Feeds a single length of DNA through a pore; changes in electrical current across the membrane indicate the base, and this is read

Question 19

Q

Oxford Nanopore MinION Consideration

Answer

A

Does not use fluorescently-labelled nucleotides

Not as accurate as Illumina (99.9%), but close (95%)

Long read (up to 2 million bp)

Question 20

Q

What is the PromethION?

Answer

A

Equivalent to 48 MinIONs

large amounts of sequencing

Question 21

Q

Single-Cell Sequencing

Answer

A

uses Illumina
BUT with diff. lib preparation - single-cell

Each cell is in a 'GEM' - when the gel is broken open, all contents are labelled with the barcode for that individual GEM

Can say where DNA comes from -> cell types/spatial transcriptomics

Question 22

Q

Challenges to genome projects

Answer

A

Sequencing technologies not perfect (e.g. Illumina 99.9% not 100%)

Some DNA harder to seq. than others (e.g. centromere/telomere) - secondary structures

Population representation (variation)

Gaps, errors, lack of variation

Accuracy of assembly

Genomes keep being corrected (diff. versions from same individual)

Question 23

Q

Alignments

Answer

A

Reference genome available

Compare and align

Question 24

Q

Assembly

Answer

A

Does not have an available reference genome

Assemble reads into a reference genome

Is a BEST REPRESENTATION not exact

Question 25

Q

Steps in an alignment

Answer

A

Find an appropriate reference genome - diff. versions

Find fragment matches on reference genome

Question 26

Q

Steps of the alignment analysis

Answer

A

Base calling
Quality control
Alignment/Mapping
Alignment Post-Processing

Question 27

Q

Base calling

Answer

A

process of determining bases in the sequencing data

Question 28

Q

Quality control

Answer

A

Phred score

Q value
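
A short sketch of how a Phred (Q) score relates to the chance of a wrong base call, using the standard relation Q = -10 * log10(P). The quality-string offset of 33 is an assumption (it is the common FASTQ encoding, but the deck does not specify one).

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p."""
    return -10 * math.log10(p)


def decode_quality_string(quality, offset=33):
    """Convert a quality string to Phred scores (offset 33 is an assumption)."""
    return [ord(character) - offset for character in quality]


print(phred_to_error_probability(30))   # about a 1 in 1000 chance of a wrong call
print(decode_quality_string("I5!"))
```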

Question 29

Q

Mapping vs. Alignment

Answer

A

Mapping = position of the sequence on the reference genome

Alignment = position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)

Question 30

Q

Alignment

Answer

A

position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)

Question 31

Q

Options for Alignment Post-Processing

Answer

A

Variant calling

Methylation studies

RNA seq. expression

Structural variants

Question 33

Q

Mapping

Answer

A

position of the sequence on the reference genome

Question 34

Q

Ways to align fragment sequence to a reference genome

Answer

A

Brute Force Method - by eye, move along reference a base pair at a time until matches

Alignment Software

Question 35

Q

What is the “Brute Force” method?

Answer

A

by eye, move along reference a base pair at a time until matches

Question 36

Q

Considerations with the “Brute Force” method

Answer

A

Easy to do
Very slow
Requires a lot of repetitive computations - inefficient
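
A minimal sketch of the brute force idea: slide the read along the reference one base at a time and compare. Exact matches only, 0-based positions; the function name is illustrative.

```python
def brute_force_positions(reference, read):
    """Slide the read along the reference one base at a time and report exact matches."""
    positions = []
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            positions.append(start)
    return positions


print(brute_force_positions("ACGTACGTTACG", "ACG"))
```

Every window is compared from scratch, which is where the repetitive computation comes from.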

Question 37

Q

Alignment Software Types

Answer

A

RNA/DNA/bisulphite sequencing

Question 38

Q

Alignment Software Algorithms

Answer

A

Burrows Wheeler Transform

Suffix Arrays

Question 39

Q

Considerations with Alignment Software

Answer

A

Works as replacement for BLAST (BLAST-like methods do not scale well)

Trade-off between speed and accuracy
(quicker software may be less accurate)

Some newer tools use k-mers (give mapping data only)

Question 41

Q

How do you make a Suffix Array?

Answer

A

all seq. end with a dollar ($)
list every suffix alongside its start position (positional order & number)

then sort the suffixes lexicographically (alphabetically with $ first)

Then take the positional information in the lexicographical order of the new list
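
The steps above can be sketched as follows. This is a naive construction for illustration (real aligners use far more efficient methods), and the function name is an assumption.

```python
def suffix_array(text):
    """Return start positions of all suffixes of text + '$', in lexicographic order."""
    text = text + "$"                      # '$' sorts before A, C, G, T in ASCII
    suffixes = [(text[i:], i) for i in range(len(text))]
    return [position for _, position in sorted(suffixes)]


print(suffix_array("GATTACA"))
```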

Question 42

Q

Alignment to a suffix array

Answer

A

Compare the substring (fragment) with the middle entry of the list (is it higher or lower lexicographically?)

If it does not match, cut the list in half and discard the half that cannot contain it.

Repeat until the location is found (matches at that point)
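
A hedged sketch of that halving search, assuming a naively built suffix array and exact matching only. It returns one matching position (0-based) or None; names are illustrative.

```python
def suffix_array(text):
    """Suffix start positions in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_read(reference, read):
    """Binary search the suffix array of reference + '$' for one position of the read."""
    text = reference + "$"
    positions = suffix_array(text)
    low, high = 0, len(positions) - 1
    while low <= high:
        middle = (low + high) // 2
        start = positions[middle]
        prefix = text[start:start + len(read)]
        if prefix == read:
            return start            # match found at this reference position
        if prefix < read:
            low = middle + 1        # read sorts later: discard the lower half
        else:
            high = middle - 1       # read sorts earlier: discard the upper half
    return None


print(find_read("GATTACA", "TAC"))
```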

Question 43

Q

How do you make a Burrows Wheeler Transform?

Answer

A

Uses rotations
Uses $ symbol

all seq. end with a dollar ($)
list every rotation of the sequence (positional order & number)

then sort the rotations lexicographically (alphabetically with $ first)

DOES NOT store positional information

Stores last column (last character in each line)
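
A minimal sketch of the construction above, building every rotation explicitly. This is fine for a toy sequence but is not how production tools do it; the function name is illustrative.

```python
def bwt(text):
    """Burrows Wheeler Transform: last column of the sorted rotations of text + '$'."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("GATTACA"))
```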

Question 44

Q

Considerations with Burrows Wheeler Transform

Answer

A

More efficient - binary storage (FM index)

Compressed further

Uses last-first principle

Makes substring search quicker (too complex to explain)
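
One way to see the last-first principle at work is to rebuild the original text from the stored last column alone. The sketch below assumes the transform was built with a single '$' terminator; it shows only the last-first mapping, not a full FM index search.

```python
def inverse_bwt(last_column):
    """Rebuild the original text from its BWT using the last-first (LF) mapping."""
    # A stable sort of row numbers by character gives the first column: the i-th
    # occurrence of a character in the last column is its i-th occurrence in the first.
    order = sorted(range(len(last_column)), key=lambda i: last_column[i])
    last_to_first = [0] * len(last_column)
    for first_row, last_row in enumerate(order):
        last_to_first[last_row] = first_row

    row = 0                                  # row 0 is the rotation starting with '$'
    characters = []
    for _ in range(len(last_column) - 1):
        characters.append(last_column[row])  # the character just before the current one
        row = last_to_first[row]
    return "".join(reversed(characters))


print(inverse_bwt("ACTGA$TA"))
```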

Question 45

Q

SAM format

Answer

A

Sequence Alignment/Map Format

tab-delimited file (columns)

Information about mapping of the read
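
A small sketch of reading one alignment line into its 11 mandatory columns. The example record is made up for illustration, and the helper name is an assumption; in practice a dedicated library would normally be used.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory SAM columns."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["OPTIONAL"] = values[11:]         # any trailing TAG:TYPE:VALUE fields
    return record


# Made-up record: read1 mapped to chr1 at 1-based position 101.
line = "read1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"])
```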

Question 46

Q

Difficulties during alignments

Answer

A

Exact VS Inexact matching

Multi mapping sequences

Question 47

Q

Exact vs Inexact matches

Answer

A

Will be comparing for difference/ checking that they are there (allow for mismatch of X% - set limit)

versus

certainty that read from that location

[Software will have default value - but changeable]
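
A hedged sketch of inexact matching with a settable mismatch limit. The default of 10% is an arbitrary illustrative value, not any tool's real default, and only substitutions (no insertions or deletions) are considered.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def inexact_positions(reference, read, max_mismatch_fraction=0.1):
    """Report positions where the read matches within the allowed mismatch fraction."""
    allowed = int(max_mismatch_fraction * len(read))
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if mismatches(window, read) <= allowed:
            hits.append(start)
    return hits


reference = "TTACGTTGCAACGG"
read = "ACGTAGCAAC"                       # one base differs from the reference
print(inexact_positions(reference, read, max_mismatch_fraction=0.1))
print(inexact_positions(reference, read, max_mismatch_fraction=0.0))
```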

Question 48

Q

Multi mapping sequences

Answer

A

Regions of ref. genome will be identical in more than one place

repetitive regions
gene families (have similar sequences)

Question 49

Q

Alignment visualisation steps

Answer

A

software IGV

reference genome along bottom, reads aligned above, with base differences highlighted

software Tablet

reference at top, shows all bases, highlight differences

Question 50

Q

depth/coverage (alignments)

Answer

A

number of reads aligned to that region
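
A minimal sketch of per-base depth, assuming each aligned read is summarised as a (start, length) pair with a 0-based start. The representation is an assumption for illustration.

```python
def depth_per_base(reference_length, alignments):
    """Count the reads covering each reference position."""
    depth = [0] * reference_length
    for start, length in alignments:
        for position in range(start, min(start + length, reference_length)):
            depth[position] += 1
    return depth


print(depth_per_base(10, [(0, 5), (2, 5), (4, 5)]))
```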

Question 51

Q

biological reasons for sequence alignments

Answer

A

differential gene expression

studying the regulome

Question 52

Q

Differential gene expression using alignments

Answer

A

Number of reads aligned to that region = level of expression

Question 53

Q

Studying the regulome

Answer

A

regulatory regions in the genome

ChIP Seq Chromatin Immunoprecipitation - studying sequence where proteins bound (e.g. transcription factor)

BIS Seq - studying methylation

Question 54

Q

ChIP Seq

Answer

A

Chromatin Immunoprecipitation
looks at regions bound by proteins (e.g. transcription factors)

Fix protein to DNA

Use antibody to pull those bits of DNA out

Unfix DNA

Sequence those bits of DNA

Question 55

Q

BIS Seq

Answer

A

methylation of base pairs
treat with bisulphite

converts non-methylated Cs to U

sequence and compare to ref. genome

any reference C that reads as a T (U is read as T in DNA) is unmethylated
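
A small sketch of the comparison step, assuming the read is already aligned base-for-base to the reference on the same strand. The function name and the output labels are illustrative.

```python
def methylation_calls(reference, read):
    """For each C in the reference, report what the aligned bisulphite read shows."""
    calls = {}
    for position, (reference_base, read_base) in enumerate(zip(reference, read)):
        if reference_base != "C":
            continue
        if read_base == "C":
            calls[position] = "methylated"      # protected from conversion
        elif read_base == "T":
            calls[position] = "unmethylated"    # converted C -> U, read as T
    return calls


print(methylation_calls("ACGTCCGA", "ACGTTCGA"))
```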

Question 56

Q

Variant Calling

Answer

A

detecting single nucleotide polymorphisms (SNP) or insertions/deletions compared to reference genome

work out biological implications

Question 57

Q

How to detect variation (variant calling)?

Answer

A

Software e.g. GATK (human), FreeBayes (others)

Uses a SAM-format file

Number of reads at a location
Quality of reads
Certainty of alignment
-> probability

Question 58

Q

Challenges in variant calling

Answer

A

sequencing error rate (e.g. Illumina 99.9% accurate)

PCR duplications (amplification of an error) - based on location (usually)

Poor coverage

Polyploidy (differences due to different alleles, not functional (phenotype) difference)

Missing regions of reference genome

Question 59

Q

How does the GATK software overcome variant calling issues?

Answer

A

“gold standards” - sequence a sample with known variants; these variants should be seen in this sample

Question 60

Q

How do variant callers work?

Answer

A

x number of reads out of y total are different
+ read quality
+ mapping probability
+ genotype calculation
+ standards information
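
The first ingredient ("x reads out of y are different") can be sketched as below. This covers only the counting; real callers such as GATK or FreeBayes combine it with the quality and probability terms listed above. The pileup string is made-up example data.

```python
from collections import Counter


def alternate_allele_fraction(reference_base, pileup_bases):
    """Return (x, y, fraction): x reads out of y differ from the reference base."""
    counts = Counter(pileup_bases)
    total = sum(counts.values())
    different = total - counts[reference_base]
    return different, total, different / total


# Ten reads cover this position; the reference base is A.
print(alternate_allele_fraction("A", "AAAGGAAGAA"))
```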

Question 61

Q

Different approaches of variant callers depend on…

Answer

A

single individual or multiple individuals

each variant locus independently or as a haplotype

Question 62

Q

variant locus independent approach to variant calling

Answer

A

variant is unrelated to everything else

Question 63

Q

haplotype approach to variant calling

Answer

A

looks for consistency of variants within a haplotype

looks for links between variants (e.g. if change at x always a change at y)

Question 64

Q

how to choose variant calling software

Answer

A

species speciality
e.g. GATK best for humans
FreeBayes better for everything else

Answer 65

A

Make sure that the variant call is reliable

Variant Quality Score (like read quality)
Coverage (min. req. for number of reads)
Fraction of reads as an alternate allele - which have diff base
Base quality of alternate allele
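
A hedged sketch of applying those four checks to one candidate variant. The dictionary keys and every threshold are hypothetical illustrative values, not recommendations from any tool.

```python
def passes_filters(variant, min_quality=30, min_depth=10,
                   min_alt_fraction=0.2, min_alt_base_quality=20):
    """Apply the four checks above to one candidate variant (a dict)."""
    depth = variant["depth"]
    alt_fraction = variant["alt_reads"] / depth if depth else 0.0
    return (variant["quality"] >= min_quality
            and depth >= min_depth
            and alt_fraction >= min_alt_fraction
            and variant["alt_base_quality"] >= min_alt_base_quality)


well_supported = {"quality": 50, "depth": 40, "alt_reads": 18, "alt_base_quality": 35}
low_coverage = {"quality": 50, "depth": 4, "alt_reads": 2, "alt_base_quality": 35}
print(passes_filters(well_supported), passes_filters(low_coverage))
```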

Answer 66

A

vcflib or vcftools (NOT variant calling software themselves)

Answer 67

A

Location in genome

coding/non-coding (alter protein product?)
synonymous/non-synonymous
what sort of seq. is it binding to - e.g. transcription factor binding (non-coding regions)/stop codon (coding region)
type of impact (e.g. frameshift/INDEL…etc.)

Answer 68

A

Pieces of genome in genome assembly

Answer 69

A

joins contigs together into scaffolds (with gaps between the contigs)

Answer 70

A

FASTA formatted sequence

Answer 71

A

Common sequences (repetitive - e.g. the word ‘the’ in a book)

Repetitive regions

Gene families/pseudogenes - multiple copies of genes

Sequencing errors

Uneven Coverage

Answer 72

A

e.g. DNA fragment ~1000bp, first 300bp sequenced (Illumina limit)

Answer 73

A

e.g. DNA fragment ~1000bp

first 300 bp sequenced and last 300 bp sequenced, with gap for middle sequence

Answer 74

A

Similar to paired-end
Used for scaffolding
Can have larger middle gap
Up to 20kbp

Answer 75

A

Similar to paired-end - know that two seq. (contigs) should be near each other
Used for scaffolding
Can have larger middle gap
Up to 20kbp

Answer 76

A

Using new tech. - e.g. PacBio/MinION
Up to 2Mbp
Not as accurate
Initial assembly Illumina + long reads for scaffolding

Answer 77

A

String Graph

de Bruijn Graph

Answer 78

A

theory for sequence assembly 
Look for overlaps in reads - set minimum overlap requirement (e.g. 3 base pairs)
Add nodes and edges
Remove redundancy 
-> graph
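
A minimal sketch of the overlap and node/edge steps, assuming exact suffix-to-prefix overlaps and a minimum overlap of at least 1. The "remove redundancy" step is not implemented here; reads and names are made up for illustration.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of left equal to a prefix of right (0 if too short)."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def overlap_graph(reads, min_overlap=3):
    """Edges (left, right, overlap) between distinct reads that overlap enough."""
    edges = []
    for left in reads:
        for right in reads:
            if left != right:
                size = overlap_length(left, right, min_overlap)
                if size:
                    edges.append((left, right, size))
    return edges


print(overlap_graph(["ATGCAT", "CATGGA", "GGATTC"]))
```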

Answer 79

A

take sequences and see how overlap with each other, based on whether identical

Answer 80

A

idea of nodes joined with edges

e.g. node = known sequence
edge = overlap in sequences (seem to be lines between sequences)

Answer 81

A

Split sequence into kmers (string of shorter seq. of k length (e.g. 3 = 3bp))

Looks for overlap of kmers, sets minimum overlap of k-1 (e.g. 3-1 = 2).

atc-tcg-cgt…etc
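
A small sketch following the card's description: k-mers are the nodes, and an edge joins two k-mers that overlap by k-1 bases. Many assemblers instead use (k-1)-mers as nodes and k-mers as edges, so treat this layout as one assumption among several.

```python
def kmers(sequence, k):
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_overlap_edges(sequence, k):
    """Edges (a, b) where the last k-1 bases of a equal the first k-1 bases of b."""
    nodes = sorted(set(kmers(sequence, k)))
    return [(a, b) for a in nodes for b in nodes if a[1:] == b[:-1]]


print(kmers("ATCGTC", 3))
print(kmer_overlap_edges("ATCGTC", 3))
```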

Answer 82

A

which way to read the graph
atg cat gta (two atg repeating seq.)

so the atg seq. could line up with same region on genome

Answer 83

A

Path that goes through each node of graph at least once, with minimal length

->rebuilds genome (contigs)

contigs come from when cannot join two regions

Answer 84

A

types of work: single cell genomes/transcriptomics and metagenomics

Sequencing data = length of reads/Illumina (types single-end…etc.)

species - eukaryotic/prokaryotic

Answer 85

A

e.g. Peregrine

e.g. Shasta

Answer 86

A

e.g. SPAdes (bacterial genome)
A5 - sequencer-specific
ALLPATHS-LG - humans
Canu - long reads

Answer 87

A

amount of nodes and edges
smaller kmer = more nodes and edges

quality vs contiguity (length of contigs) of data

Answer 88

A

length of DNA that DNA sequence is split into for assembler graph - de Bruijn

e.g. 3 kmer = 3 bp sections

Answer 89

A

Assembly quality
Metrics/statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is the number of genes what is expected
- accuracy of assembly

Answer 90

A

Assembly quality
Metrics/statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is the number of genes what is expected (marker genes)
- accuracy of assembly (coverage and contamination)
Consider heterozygosity (diploid vs haploid)

Answer 91

A

point at which 50% of genome covered by contigs of x size or larger

e.g. 20 16 12 10 8 5 - N50 = 16

(higher N50 value is better)
does not take into account missing regions
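
The calculation can be sketched as below, using the assembly size (sum of contig lengths) as the total, which is why missing regions are not accounted for. The function name is illustrative.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0


print(n50([20, 16, 12, 10, 8, 5]))   # the example above: total 71, 20 + 16 = 36 >= 35.5
```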

Answer 92

A

looks for orthologues in related species
shows that expected number of genes

BUSCO - relies on evolutionary data (prone to error)

Answer 93

A

based on GC content and coverage

GC content different between species (identifier) & different sequencing depth for diff. species
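
A minimal sketch of the GC content half of that check, computed per contig. Coverage would come from the alignment data and is not shown; the function name is illustrative.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence) if sequence else 0.0


print(gc_content("ATGCGCTA"))
```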

Answer 94

A

Promoters

Telomeres/centromeres

Answer 95

A

Look for start and stop codons - ORF

Compare start/stop location to database of another species - try and find orthologues (BLAST)

Look at transcriptomic data - this is transcribed
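
A hedged sketch of the first step (start and stop codons), scanning the forward strand only and using the standard stop codons TAA, TAG and TGA. The minimum length parameter is an illustrative assumption; real gene finders do considerably more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end) of forward-strand ORFs: ATG up to the first in-frame stop codon."""
    orfs = []
    for start in range(len(sequence) - 2):
        if sequence[start:start + 3] != "ATG":
            continue
        for position in range(start + 3, len(sequence) - 2, 3):
            if sequence[position:position + 3] in STOP_CODONS:
                # Count codons before the stop, including the start codon.
                if (position - start) // 3 >= min_codons:
                    orfs.append((start, position + 3))
                break
    return orfs


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```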