Genomes and Genome Sequencing Flashcards

1
Q

Applications of studying genomics

A

Research
Health (e.g. diagnostic)
Environment (e.g. pollutants)
Agriculture (e.g. livestock, nutrients)

2
Q

Health example for genomics

A

Causes of severe intellectual disability in children (42% of cases linked to DNA compared to 12% using other methods)

3
Q

Disease example for genomics

A

Inflammatory Bowel Disease (Crohn’s disease)
more viral DNA = more viruses
viruses were bacteriophages
they infected gut bacteria and affected gut bacteria population -> Crohn’s disease

4
Q

Disease Outbreak Tracking for genomics (only need one)

A

Ebola - finding point of origin, watching it change over time

HIV - identified known origin, identified species crossovers

Influenza - track current outbreaks to inform vaccine choices for the coming winter in the opposite hemisphere; identify species crossover and crossover potential of strains

5
Q

The third generation of DNA sequencing

A

Longer reads (longer DNA sequences)

6
Q

Sanger Sequencing

A

Chain termination sequencing

Uses ddNTPs (dideoxynucleotides: fluorescently labelled, chain-terminating nucleotides)

7
Q

How does Sanger Sequencing work

A

Polymerase extends the new strand using normal nucleotides, then randomly incorporates a fluorescently labelled ddNTP; extension stops (chain terminates) at that point

-> strands of DNA of varying lengths, each ending with a fluorescently-labelled base
(repeated across many copies so that termination happens at every base position along the length)

Then separate the fragments by size using capillary gel electrophoresis

Record fluorescence

Each base is a diff. colour

8
Q

Downsides of Sanger Sequencing

A

Slow

Expensive

Not high throughput

Errors in repetitive regions (lots of bases similar to each other, next to each other)

Bias in sequencing (certain regions better amplified than others)

9
Q

Library Preparation

A

Extract DNA from cells

Fragment DNA (50-1000bp)

Add adaptors (either end of seq.): one will stick to the sequencer (flow cell), other will be start point for seq. reaction

Amplification

10
Q

Issues with Library Preparation

A

Bias in amplification

11
Q

How does Illumina Sequencing work?

A

Fragments added to the flow cell - bind to flow cell (adapter-flow cell)

Polymerase starts at top (furthest from flow cell) and adds in fluorescently labelled nucleotides (randomly, one at a time)
+ laser excitation, fluorescence recorded

12
Q

Benefits of Illumina Sequencing

A

Fast

Cheap

High throughput

13
Q

Issues with Illumina Sequencing

A

Repetitive regions

Amplification

Length restrictions

14
Q

Third generation sequencing

A

Removes the read-length restriction

Removes the need for amplification

15
Q

PacBio SMRT

A

uses Single Molecule, Real-time Technology

Zero-mode waveguides

One piece of DNA per well

Polymerase in well adds fluorescence like Illumina to single piece of DNA

16
Q

PacBio Considerations

A

Higher error rates
No need for amplification
Longer, but not genome-length

18
Q

Oxford Nanopore MinION

A

Very small

Membrane with many pores

Feeds single length of DNA through pore; changes in electrical current across the membrane indicate the base, and this is read

19
Q

Oxford Nanopore MinION Consideration

A

Does not use fluorescently-labelled nucleotides

Not as accurate as Illumina (99.9%), but close (95%)

Long read (up to 2 million bp)

20
Q

What is the PromethION?

A

48 MinIONs

large amounts of sequencing

21
Q

Single-Cell Sequencing

A

uses Illumina
BUT with diff. lib preparation - single-cell

Each cell in a ‘GEM’ (gel bead-in-emulsion) - when the gel bead is broken open, all contents are labelled with the barcode for that individual GEM

Can say where DNA comes from -> cell types/spatial transcriptomics

22
Q

Challenges to genome projects

A

Sequencing technologies not perfect (e.g. Illumina 99.9% not 100%)

Some DNA harder to seq. than others (e.g. centromere/telomere) - secondary structures

Population representation (variation)

Gaps, errors, lack of variation

Accuracy of assembly

Genomes keep being corrected (diff. versions from same individual)

23
Q

Alignments

A

Reference genome available

Compare and align

24
Q

Assembly

A

Does not have an available reference genome

Assemble reads into a reference genome

Is a BEST REPRESENTATION not exact

25
Q

Steps in an alignment

A

Find an appropriate reference genome - diff. versions

Find fragment matches on reference genome

26
Q

Steps of the alignment analysis

A

Base calling
Quality control
Alignment/Mapping
Alignment Post-Processing

27
Q

Base calling

A

process of determining bases in the sequencing data

28
Q

Quality control

A

Phred score

Q value
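
The Phred score (Q value) relates to the estimated probability P that a base call is wrong as Q = -10 * log10(P), so Q30 means a 1 in 1000 chance of error. A minimal sketch of the conversion, assuming quality strings use the common Phred+33 ASCII encoding (an assumption; the card does not name an encoding, and the function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ-style quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in qual]
```

For example, `phred_to_error_prob(20)` gives an error probability of 0.01 (99% accuracy).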

29
Q

Mapping vs. Alignment

A

Mapping = position of the sequence on the reference genome

Alignment = position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)

30
Q

Alignment

A

position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)

31
Q

Options for Alignment Post-Processing

A

Variant calling

Methylation studies

RNA seq. expression

Structural variants

33
Q

Mapping

A

position of the sequence on the reference genome

34
Q

Ways to align fragment sequence to a reference genome

A

Brute Force Method - by eye, move along reference a base pair at a time until matches

Alignment Software

35
Q

What is the “Brute Force” method?

A

by eye, move along reference a base pair at a time until matches
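
The same idea done by a program rather than by eye can be sketched as below: slide the read along the reference one base at a time and compare at every offset. This is only an illustration of exact matching (the function name is hypothetical), and it shows why the approach needs so many repeated comparisons.

```python
def brute_force_positions(reference, read):
    """Return every 0-based start position where read matches reference exactly."""
    hits = []
    # Try every possible offset, moving along one base at a time.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits
```

Every offset costs up to one comparison per base of the read, so the work grows with (reference length x read length) for each read.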

36
Q

Considerations with the “Brute Force” method

A

Easy to do
Very slow
Requires a lot of repetitive computations - inefficient

37
Q

Alignment Software Types

A

RNA/DNA/bisulphite sequencing

38
Q

Alignment Software Algorithms

A

Burrows Wheeler Transform

Suffix Arrays

39
Q

Considerations with Alignment Software

A

Works as replacement for BLAST (BLAST-like methods do not scale well)

Trade-off between speed and accuracy
(quicker software may be less accurate)

Some newer tools use kmers (only mapping data)

41
Q

How do you make a Suffix Array?

A

sequence ends with a dollar ($)
list every suffix in positional order, numbered by start position

then sort the suffixes lexicographically (alphabetically with $ first)

Then take the positional information in lexicographical order of the new list
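
A minimal sketch of those steps, assuming a simple sort-based construction (fine for short examples; real aligners use far more memory-efficient algorithms). The function name is illustrative.

```python
def build_suffix_array(text):
    """Return start positions of all suffixes of text + '$' in sorted order."""
    if not text.endswith("$"):
        text += "$"  # '$' sorts before A, C, G, T in ASCII
    # Sort the start positions by the suffix that begins at each one.
    return sorted(range(len(text)), key=lambda i: text[i:])
```

For `GATTACA` the sorted suffixes are `$`, `A$`, `ACA$`, `ATTACA$`, `CA$`, `GATTACA$`, `TACA$`, `TTACA$`, so the stored positions are 7, 6, 4, 1, 5, 0, 3, 2.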

42
Q

Alignment to a suffix array

A

Compare the substring (fragment) with the suffix at the middle point of the list (is it higher or lower lexicographically?)

If no match, cut the list in half and discard the half that cannot contain the fragment.

Repeat until location found (matches at that point)
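
The halving search can be sketched as a binary search over the sorted suffixes. This is an illustrative version under simple assumptions (exact matching only, sort-based suffix array, hypothetical function names); it runs two binary searches to find the block of suffixes that start with the fragment.

```python
def build_suffix_array(text):
    """Sort-based suffix array; text is expected to end with '$'."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted 0-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # First binary search: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Second binary search: first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])
```

Each step discards half of the remaining list, so a fragment is located in roughly log2(genome length) comparisons instead of one comparison per position.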

43
Q

How do you make a Burrows Wheeler Transform?

A

Uses rotations
Uses $ symbol

sequence ends with a dollar ($)
list every rotation of the sequence

then sort the rotations lexicographically (alphabetically with $ first)

DOES NOT store positional information

Stores last column (last character in each line)
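
A minimal sketch of the transform, assuming the naive approach of building and sorting every rotation (only practical for short sequences; the function name is illustrative):

```python
def bwt(text):
    """Return the Burrows Wheeler Transform of text (a '$' is appended if missing)."""
    if not text.endswith("$"):
        text += "$"
    # Build every rotation, sort them, and keep the last character of each.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)
```

For `GATTACA` this gives `ACTGA$TA`: identical letters tend to end up next to each other in the last column, which is what makes it compress well.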

44
Q

Considerations with Burrows Wheeler Transform

A

More efficient - binary storage (FM index)

Compressed further

Uses last-first principle

Makes substring search quicker (too complex to explain)
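
A rough sketch of how the last-first principle speeds up substring search ("backward search"). It assumes the BWT string has already been built, and it counts characters naively on each step; a real FM index stores precomputed, sampled count tables instead. Names are illustrative.

```python
from collections import Counter


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern in the original text, using only its BWT."""
    # For each character, the row where its block starts in the sorted first column.
    counts = Counter(bwt_string)
    first_col_start = {}
    total = 0
    for ch in sorted(counts):
        first_col_start[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in the first i characters of the last column (naive).
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)  # half-open range of matching rows
    # Process the pattern from its last character to its first.
    for ch in reversed(pattern):
        if ch not in first_col_start:
            return 0
        lo = first_col_start[ch] + occ(ch, lo)
        hi = first_col_start[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The number of steps depends on the length of the fragment, not on the length of the genome.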

45
Q

SAM format

A

Sequence Alignment/Map Format

tab-delimited file (columns)

Information about mapping of the read
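
A small sketch of reading one alignment line, using the eleven mandatory column names from the SAM specification. The example record in the usage note is invented for illustration, and the parser ignores header lines (those starting with @).

```python
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def parse_sam_line(line):
    """Split one tab-delimited SAM alignment line into a dict of named columns."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    # These columns are integers in the specification.
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    record["OPTIONAL"] = parts[11:]
    return record
```

For a made-up line such as `read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tGATTACAG\tIIIIIIII`, `RNAME` and `POS` give where the read mapped, `MAPQ` the mapping quality and `CIGAR` the base-to-base alignment.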

46
Q

Difficulties during alignments

A

Exact VS Inexact matching

Multi mapping sequences

47
Q

Exact vs Inexact matches

A

Inexact matching: looking for differences/checking that they are there (allow for mismatch of X% - set limit)

versus

Exact matching: certainty that the read comes from that location

[Software will have default value - but changeable]

48
Q

Multi mapping sequences

A

Regions of ref. genome will be identical in more than one place

  • repetitive regions
  • gene families (have similar sequences)
49
Q

Alignment visualisation steps

A

software IGV

reference genome along bottom, reads aligned above, with base differences highlighted

software Tablet

reference at top, shows all bases, highlight differences

50
Q

depth/coverage (alignments)

A

number of reads aligned to that region
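
A minimal sketch of per-base depth, assuming each alignment is given as a hypothetical (start, length) pair with 0-based starts and no gaps:

```python
def depth_per_position(reference_length, alignments):
    """Count how many aligned reads cover each reference position."""
    depth = [0] * reference_length
    for start, length in alignments:
        # Add one to every position the read covers (clipped to the reference).
        for pos in range(start, min(start + length, reference_length)):
            depth[pos] += 1
    return depth
```

With reads at `(0, 5)`, `(3, 5)` and `(4, 4)` on a 10 bp reference, position 4 has depth 3 and the last two positions have depth 0.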

51
Q

biological reasons for sequence alignments

A

differential gene expression

studying the regulome

52
Q

Differential gene expression using alignments

A

Number of reads aligned to that region = level of expression

53
Q

Studying the regulome

A

regulatory regions in the genome

ChIP Seq Chromatin Immunoprecipitation - studying sequence where proteins bound (e.g. transcription factor)

BIS Seq - studying methylation

54
Q

ChIP Seq

A

Chromatin Immunoprecipitation
looks at regions bound by proteins (e.g. transcription factors)

Fix protein to DNA

Use antibody to pull those bits of DNA out

Unfix DNA

Sequence those bits of DNA

55
Q

BIS Seq

A

studies methylation of bases (cytosines)
treat with bisulphite

converts unmethylated Cs to U (methylated Cs stay as C)

sequence and compare to ref. genome

any reference C that reads as a T (U is read as T) is unmethylated
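
The comparison step can be sketched as follows. This is a simplified illustration that assumes the read is already aligned to the forward strand of the reference with no gaps; real tools also handle the reverse strand and sequencing errors. The function name is hypothetical.

```python
def call_methylation(reference, bisulphite_read):
    """Classify each reference C as methylated or unmethylated from one aligned read."""
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, bisulphite_read)):
        if ref_base != "C":
            continue  # only cytosines are informative
        if read_base == "C":
            calls[pos] = "methylated"      # protected from conversion
        elif read_base == "T":
            calls[pos] = "unmethylated"    # C -> U, read as T
        else:
            calls[pos] = "other"           # mismatch or sequencing error
    return calls
```

In practice many reads cover each C, so methylation is reported as the fraction of reads that still show C at that position.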

56
Q

Variant Calling

A

detecting single nucleotide polymorphisms (SNP) or insertions/deletions compared to reference genome

work out biological implications

57
Q

How to detect variation (variant calling)?

A

Software e.g. GATK (human), FreeBayes (others)

Uses SAM-formatted file

Number of reads at a location
Quality of reads
Certainty of alignment
-> probability

58
Q

Challenges in variant calling

A

sequencing error rate (e.g. Illumina 99.9% accurate)

PCR duplications (amplification of an error) - based on location (usually)

Poor coverage

Polyploidy (differences due to different alleles, not functional (phenotype) difference)

Missing regions of reference genome

59
Q

How does the GATK software overcome variant calling issues?

A

“gold standards” - sequence a sample with known variants; should see these variants in this sample

60
Q

How do variant callers work?

A
x number of reads out of y total are different
+ read quality
+ mapping probability
+ genotype calculation
+ standards information
61
Q

Different approaches of variant callers depend on…

A

single individual or multiple individuals

each variant locus independently or as a haplotype

62
Q

variant locus independent approach to variant calling

A

variant is unrelated to everything else

63
Q

haplotype approach to variant calling

A

looks for consistency of variants within a haplotype

looks for links between variants (e.g. if change at x always a change at y)

64
Q

how to choose variant calling software

A

species speciality
e.g. GATK best for humans
FreeBayes better for everything else

65
Q

Filtering Variants

A

Make sure that the variant call is reliable

Variant Quality Score (like read quality)
Coverage (min. req. for number of reads)
Fraction of reads as an alternate allele - which have diff base
Base quality of alternate allele
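
A sketch of how the four criteria might be combined into one pass/fail check. The record keys and every threshold value here are hypothetical choices for illustration, not defaults of vcflib, vcftools or any variant caller.

```python
def passes_filters(variant, min_qual=30.0, min_depth=10,
                   min_alt_fraction=0.2, min_alt_base_qual=20.0):
    """Return True if a variant record passes all four illustrative filters."""
    depth = variant["depth"]
    if depth == 0:
        return False  # no reads, nothing to support the call
    if variant["qual"] < min_qual:            # variant quality score
        return False
    if depth < min_depth:                     # coverage
        return False
    if variant["alt_reads"] / depth < min_alt_fraction:   # fraction of alternate-allele reads
        return False
    if variant["alt_base_qual"] < min_alt_base_qual:      # base quality of alternate allele
        return False
    return True
```

For example, a record with quality 50, depth 40, 18 alternate reads and alternate base quality 35 passes, while the same record with only 2 alternate reads fails on the alternate fraction.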

66
Q

Tools for filtering variants

A

vcflib or vcftools (NOT the variant calling software itself)

67
Q

Interpreting filtered variants

A

Location in genome

  • coding/non-coding (alter protein product?)
  • synonymous/non-synonymous
  • what sort of seq. is it binding to - e.g. transcription factor binding (non-coding regions)/stop codon (coding region)
  • type of impact (e.g. frameshift/INDEL…etc.)
68
Q

contigs

A

Contiguous pieces of assembled sequence in a genome assembly

69
Q

scaffolds

A

Two or more contigs pieced together in order (with gaps between contigs)

70
Q

Genome Assembly Output

A

FASTA formatted sequence

71
Q

Challenges to Genome Assembly

A

Common sequences (repetitive - e.g. the word ‘the’ in a book)

Repetitive regions

Gene families/pseudogenes - multiple copies of genes

Sequencing errors

Uneven Coverage

72
Q

Single end sequencing data

A

e.g. DNA fragment ~1000bp, first 300bp sequenced (Illumina limit)

73
Q

Paired-end sequencing data

A

e.g. DNA fragment ~1000bp

first 300 bp sequenced and last 300 bp sequenced, with gap for middle sequence

75
Q

Mate pair sequencing data

A

Similar to paired-end - know that two seq. (contigs) should be near each other
Used for scaffolding
Can have larger middle gap
Up to 20kbp

76
Q

Long reads sequencing data

A

Using new tech. - e.g. PacBio/MinION
Up to 2Mbp
Not as accurate
Initial assembly with Illumina + long reads for scaffolding

77
Q

Types of Assemblers

A

String Graph

de Bruijn Graph

78
Q

String Graphs

A
theory for sequence assembly 
Look for overlaps in reads - set minimum overlap requirement (e.g. 3 base pairs)
Add nodes and edges
Remove redundancy 
-> graph
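
The overlap and graph-building steps can be sketched as below. This is an illustrative all-against-all comparison with exact overlaps only; it does not implement the redundancy-removal step, and the function names are hypothetical.

```python
def overlap_length(a, b, min_overlap):
    """Longest suffix of a that equals a prefix of b, or 0 if below min_overlap."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_overlap=3):
    """Return edges {(read_a, read_b): overlap} for every sufficient overlap."""
    edges = {}
    for a in reads:          # nodes are the reads
        for b in reads:
            if a == b:
                continue
            length = overlap_length(a, b, min_overlap)
            if length:
                edges[(a, b)] = length   # edge = suffix of a overlaps prefix of b
    return edges
```

With reads `GATTAC`, `TACAGG` and `AGGCTT` and a minimum overlap of 3, the graph has two edges (`GATTAC` -> `TACAGG` and `TACAGG` -> `AGGCTT`), which spell out `GATTACAGGCTT`.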
79
Q

Concept of overlaps

A

take sequences and see how overlap with each other, based on whether identical

80
Q

Concept of graphs

A

idea of nodes joined with edges

e.g. node = known sequence
edge = overlap in sequences (seem to be lines between sequences)

81
Q

de Bruijn Graphs

A

Split sequence into kmers (string of shorter seq. of k length (e.g. 3 = 3bp))

Looks for overlap of kmers, sets minimum overlap of k-1 (e.g. 3-1 = 2).

atc-tcg-cgt…etc
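
A minimal sketch of building such a graph from one sequence. It uses the common formulation in which each node is a (k-1)-mer and each kmer is an edge from its prefix to its suffix, which encodes the k-1 overlap between consecutive kmers; function names are illustrative.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Split a sequence into all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(sequence, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for kmer in kmers(sequence, k):
        graph[kmer[:-1]].append(kmer[1:])   # edge: prefix -> suffix
    return dict(graph)
```

For `ATCGT` with k = 3 the kmers are `ATC`, `TCG`, `CGT`, giving the path `AT` -> `TC` -> `CG` -> `GT`.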

82
Q

de Bruijn Graphs and repeating regions

A

which way to read the graph
atg cat gta (two atg repeating seq.)

so the atg seq. could line up with same region on genome

83
Q

How do assemblers use graphs?

A

Path that goes through each node of graph at least once, with minimal length

->rebuilds genome (contigs)

contigs come from when cannot join two regions

84
Q

How to choose an assembler

A

types of work: single cell genomes/transcriptomics and metagenomics

Sequencing data = length of reads/Illumina (types single-end…etc.)

species - eukaryotic/prokaryotic

85
Q

Long read data assemblers

A

e.g. Peregrine

e.g. Shasta

86
Q

Examples of Assembler Software

A

e.g. SPAdes (bacterial genome)
A5 - sequencer-specific
ALLPATHS-LG - humans
Canu - long reads

87
Q

Importance of kmer length

A

amount of nodes and edges
smaller kmer = more nodes and edges

quality vs contiguity (length of contigs) of data

88
Q

What is a kmer?

A

length of DNA that DNA sequence is split into for assembler graph - de Bruijn

e.g. 3 kmer = 3 bp sections

89
Q

How to determine best kmer length?

A
Assembly quality
Metrics/statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is number of genes what expected
- accuracy of assembly
90
Q

Assembly quality determination

A

Assembly quality
Metrics/statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is number of genes what expected (marker genes)
- accuracy of assembly (coverage and contamination)
Consider heterozygosity (diploid vs haploid)

91
Q

What is the N50?

A

point at which 50% of genome covered by contigs of x size or larger

e.g. 20 16 12 10 8 5 - N50 = 16

(higher contig value is better)
does not take into account missing regions
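
The calculation in the example can be sketched as: sort contigs from largest to smallest, add up their lengths, and report the contig length at which the running total first reaches half of the assembly. A minimal version (function name illustrative):

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:   # reached 50% of the assembled length
            return length
    raise ValueError("no contigs given")
```

For 20, 16, 12, 10, 8, 5 the total is 71; the running total after the first two contigs is 36, which passes 35.5, so the N50 is 16.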

92
Q

Presence of marker genes…

A

looks for orthologues in related species
shows that expected number of genes

BUSCO - relies on evolutionary data (prone to error)

93
Q

Coverage and contamination…

A

based on GC content and coverage

GC content different between species (identifier) & different sequencing depth for diff. species
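
GC content per contig is simple to compute, as sketched below (function name illustrative). Contigs whose GC content or coverage sits well away from the rest of the assembly are candidates for contamination.

```python
def gc_content(sequence):
    """Return the fraction of bases in a sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```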

94
Q

Assembly annotation

A

Promoters

Telomeres/centromeres

95
Q

Levels of genome annotation

A

Look for start and stop codons - ORF

Compare start/stop location to database of another species - try and find orthologues (BLAST)

Look at transcriptomic data - this is transcribed
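
The first level (looking for start and stop codons) can be sketched as a simple ORF scan. This illustration assumes the standard start codon ATG and stop codons TAA/TAG/TGA, and it only scans the three forward-strand reading frames; real annotation tools also check the reverse strand and use gene models. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=1):
    """Return (start, end) pairs for ORFs, end exclusive and including the stop codon."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Walk codon by codon in the same frame until a stop codon.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs
```

For `CCATGAAATAGGG` the only ORF is `ATGAAATAG`, at positions 2 to 11.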