Week 2 Flashcards

1
Q

Information

A

the letters used to write the sequences

Genomes exist in a double- or single-stranded state

2
Q

What is information used for?

A

We use the information in a DNA sequence to align sequences

3
Q

Contig

A

Summary sequence of overlapping DNA fragments

4
Q

Why can we align DNA sequences?

A

How long does a sequence have to be for us to expect it to be unique?

Information theory and Genetics

5
Q

The restriction enzyme HaeIII cuts the sequence GGCC. What is the probability that any random four base pair sequence is a HaeIII site?

A

1/4 chance of getting any given base at each position

1/4 x 1/4 x 1/4 x 1/4 = 1/256

In general: (1/4)^n

n = length of the sequence (here n = 4)
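The arithmetic can be checked with a short Python sketch. It assumes, as the card does, that the four bases are equally likely and independent; the function name `site_probability` is only illustrative.

```python
from fractions import Fraction

def site_probability(site: str) -> Fraction:
    """Chance that a random sequence of len(site) bases equals `site`,
    assuming each base has probability 1/4 at every position."""
    return Fraction(1, 4) ** len(site)

print(site_probability("GGCC"))  # 1/256
```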

6
Q

How many distinct sequence permutations can a stretch n bases long have?

A

How many sequences (permutations) are possible for a given length of a DNA strand (n bases long) if the bases occur at equal frequency in the genome?

How many pieces of distinct information can a stretch of n bases have (potential)?

4^n

7
Q

We can expect a sequence to be unique in a genome when the # of permutations is greater than the # of sequences of n bases in the genome

A

Bacterial genome = 4x10^6 bases

Human genome = 3x10^9 bases (single stranded)

What is the # of sequences of n bases in the genome?

8
Q

What is the number of sequences of n bases in the genome?

A

Because genomes are so long, the number of sites is approximately the length of the genome if the genome is single stranded

The number of sites in double stranded DNA = 2x the length of the genome

Palindromic sequences are an exception.
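One way to read the palindrome exception: a palindromic site is identical to its own reverse complement, so the same occurrence appears on both strands and counting both strands does not double its sites. A minimal Python sketch of that check follows; the helper names are illustrative, not from the course.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_palindromic(seq: str) -> bool:
    """True if the site reads the same on both strands."""
    return seq.upper() == reverse_complement(seq)

print(is_palindromic("GGCC"))               # True
print(is_palindromic("ATAGACATAGGATACAT"))  # False
```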

9
Q

We can expect a sequence to be unique in a genome when the # of permutations is greater than the number of sequences of n bases in the genome.

A

when is 4^n > 8x10^6 (bacteria)

n>=12 in bacteria is unique

when is 4^n > 6x10^9 (humans)

n>=17 in humans is unique
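The two thresholds can be reproduced by searching for the smallest n with 4^n greater than the number of sites. This is a sketch under the card's assumptions (random sequence, double-stranded site counts of 8x10^6 and 6x10^9); `min_unique_length` is a hypothetical helper name.

```python
def min_unique_length(num_sites: int) -> int:
    """Smallest n for which the number of permutations (4**n)
    exceeds the number of n-base sites in the genome."""
    n = 1
    while 4 ** n <= num_sites:
        n += 1
    return n

print(min_unique_length(8 * 10**6))  # 12 (bacteria, both strands)
print(min_unique_length(6 * 10**9))  # 17 (human, both strands)
```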

10
Q

Say I have 9 genomes made of random sequence of 3 x 10^9 bases. And I search each of these with ATAGACATAGGATACAT. How many would you expect to have the sequence?

A

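The query is 17 bases long, so there are 4^17 (about 17 x 10^9) possible permutations, and each double-stranded genome has 6 x 10^9 sites.
6 x 10^9 / 17 x 10^9 = about 1/3 per genome
Therefore, 1/3 x 9 = 3.
I expect about 3 of these random genomes to have this sequence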
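The same estimate in Python, as a rough sketch: it computes the expected number of occurrences per genome (sites divided by permutations) and multiplies by the number of genomes, treating that as an approximation of how many genomes contain the query. The variable names are illustrative.

```python
query = "ATAGACATAGGATACAT"
genome_length = 3 * 10**9        # bases, single stranded
sites = 2 * genome_length        # both strands are searched
num_genomes = 9

# Expected occurrences of one specific 17-mer in one random genome.
per_genome = sites / 4 ** len(query)
expected_genomes = per_genome * num_genomes

print(len(query))               # 17
print(round(per_genome, 2))     # 0.35
print(round(expected_genomes))  # 3
```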

11
Q

How much information is there in DNA?

A

Each letter (A, T, G, C) can be arbitrarily represented as:

A=00
G=01
C=10
T=11

The information content at any position in a DNA sequence is 2 bits

The smallest unit of information is a bit

The total number of bits in a sequence of n bases is the sum of the bits at each position, that is, 2n.
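A small Python sketch of this encoding, using the arbitrary assignment given on the card (A=00, G=01, C=10, T=11). The function name `encode` is illustrative; any other one-to-one assignment would carry the same 2 bits per base.

```python
# Arbitrary 2-bit code from the card.
CODE = {"A": "00", "G": "01", "C": "10", "T": "11"}

def encode(seq: str) -> str:
    """Concatenate the 2-bit code of every base in the sequence."""
    return "".join(CODE[base] for base in seq.upper())

bits = encode("GGCC")
print(bits)       # 01011010
print(len(bits))  # 8, i.e. 2 bits x 4 bases
```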

12
Q

Information theory

A

Trying to pick out an identified object.

We have eight objects. Each object is assigned a distinct binary number that identifies it.

When do I have sufficient information to identify the object?

If you have only one bit of information when you search through the objects, you can only narrow the search to half of them.

13
Q

Minimum amount of information

A

Imin = log2(N)

I >= Imin is sufficient information

N is the number of objects (here, the number of sites in the genome)

12 base sequence = 24 bits > log2(8 x 10^6) = 22.9

17 base sequence = 34 bits > log2(6 x 10^9) = 32.5
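The comparison of sequence information against Imin can be checked numerically. This sketch assumes 2 bits per base and the double-stranded site counts used on the card; the function names are hypothetical.

```python
import math

def min_information_bits(num_sites: float) -> float:
    """Imin = log2(N): bits needed to single out one of N sites."""
    return math.log2(num_sites)

def sequence_bits(n_bases: int) -> int:
    """Information in a sequence of n bases at 2 bits per base."""
    return 2 * n_bases

print(round(min_information_bits(8e6), 1))  # 22.9
print(round(min_information_bits(6e9), 1))  # 32.5
print(sequence_bits(12) > min_information_bits(8e6))  # True
print(sequence_bits(17) > min_information_bits(6e9))  # True
```

One base fewer is not enough in either case: 11 bases give 22 bits and 16 bases give 32 bits, both below the corresponding Imin.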

14
Q

BLAST Basic local alignment search tool

A

GenBank has about 10^13 bases of sequence in the database. Counting both strands (2 x 10^13 sites), Imin = log2(2 x 10^13), which is about 44 bits, so a sequence of 23 bases (46 bits) or more could be expected to be unique, more often than not, in the database.

BLAST takes your query and breaks it into short words (5-base words). It also breaks the database into 5-base words, and these words are indexed with position information.

BLAST then checks whether any query word matches a database word. If so, the BLAST program creates an alignment of the query to the matching sequence in the database.
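The word-indexing idea can be illustrated with a toy Python sketch. This is not BLAST's actual implementation, only a simplified picture of breaking sequences into 5-base words, indexing database words by position, and looking up query words. The sequences and function names are made up for the example.

```python
from collections import defaultdict

def index_words(seq: str, k: int = 5) -> dict:
    """Map every k-base word in `seq` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query: str, db_index: dict, k: int = 5) -> list:
    """Return (query_pos, database_pos) pairs where a query word
    exactly matches an indexed database word."""
    seeds = []
    for q in range(len(query) - k + 1):
        for d in db_index.get(query[q:q + k], []):
            seeds.append((q, d))
    return seeds

database = "TTGACCATAGACATAGGATT"  # toy database sequence
query = "CATAGACA"                 # toy query
seeds = find_seeds(query, index_words(database))
print(seeds)  # [(0, 5), (0, 11), (1, 6), (2, 7), (3, 8)]
```

Matches that line up on the same diagonal, such as (0, 5), (1, 6), (2, 7) and (3, 8), are the kind of evidence an alignment step would then build on.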

15
Q

Two statistics of a BLAST

A

Score: the higher the score, the better.

E value: the lower the E value, the better.

The E value tells you how likely your alignment is to occur by random chance when searching the database. A low E value means a low chance.

16
Q

Cost of DNA sequencing

A

Has fallen over the past 20 years

The large change occurred after massively parallel sequencing was introduced

17
Q

Sanger

A

reads per run: 1

Length of read: 1,000 bases

basic method: termination

expensive/accurate

18
Q

Illumina

A

length of read: 100 bases

reads per run: 10^9-10^10

basic method: real time incorporation

cheap/accurate

19
Q

Pacific Biosciences

A

length of read: 15,000 bases

reads per run: 10^6

basic method: real time incorporation

cheap/low accuracy

20
Q

Nanopore

A

length of read: 100,000 bases

reads per run: 10^6

basic method: current changes

cheap/low accuracy

21
Q

Real time incorporation: Illumina

A

Sequencing proceeds as a series of rounds of synthesis. The nucleotides are modified to have a fluorescent group and a blocked 3'OH.

After cycle one is imaged, the 3' block and the fluorescent group are removed.

The cycle is repeated again and again, taking successive pictures to determine the sequence of the DNA.

22
Q

Real time incorporation: Pacbio

A

The cell is set up so you can observe a single DNA polymerase as it incorporates labeled dNTPs. The label is at the very end of the phosphate chain. As each labeled dNTP is incorporated into the chain, its fluorescence is detected, and the label diffuses away shortly after incorporation. What you observe is the read output.

23
Q

Nanopore

A

The sequence that is in the pore alters the amount of current that can flow through the nanopore in a specific manner that depends on the sequence.

As each base goes through, we see a current change.

A set of bases is in the pore at any one time; the current change we are looking at reflects, for example, the G coming out and the T going in.

24
Q

Illumina basic steps

A

DNA molecule has adaptors added.

Sticks to oligos on a glass slide.

Amplified to make a microdot on the slide.

First sequence read.

Reorientation of the DNA molecule.

Second paired sequence read.

Bridge amplification to clone fragments

25
Q

Pacific Biosciences sequencing video

A

DNA molecule has adaptors added.

DNA isolated
Ligate adaptors to the DNA, creating a circular template; primer and polymerase are added to the library

DNA and DNA polymerase in a SMRT cell.
The SMRT cell contains Zero-Mode Waveguides (ZMWs)
A single molecule of DNA is immobilized at the bottom of the ZMW (smallest detection unit)
As polymerase incorporates nucleotides, light is emitted (after which the fluorescent tag is cleaved)
Nucleotide light emission is used to measure incorporation

As each ZMW is illuminated from below, the wavelength of the light is too large to allow it to pass through the waveguide
Attenuated light from the excitation beam penetrates the lower 20-30 nm of each ZMW
DNA sequence read.
Circular consensus sequencing mode produces highly accurate long reads (HiFi)
Continuous long read sequencing mode generates the longest possible reads

26
Q

Nanopore sequencing video

A

DNA molecule has adaptors added.

Motor molecule binds

Adaptors are important for the binding of a motor molecule. The DNA is released over membranes that each have a single pore in them.

The DNA molecules are guided to the nanopore with the help of tethers that grab onto the motor and help position it over the nanopore.

Sequencing directly enables the detection of base modifications and methylation

Motor protein recognizes adaptor sequence bound to DNA and guides DNA to the nanopore

Tether guides DNA through the nanopore

DNA passes through causing a disruption of ionic current measured in signal trace.

Intact DNA sequences are sequenced in real time by the nanopore, regardless of their length (no runtime, fragments)

DNA-enzyme complexes approach the nanopore and the single stranded DNA is pulled through the aperture of the nanopore. The enzyme ratchets the DNA through the nanopore one nucleotide at a time.

The enzyme binds to a single stranded leader at the end of a single strand, unzipping the double stranded DNA.

Speed of the enzyme can be controlled.

Hair pin structure ensures both strands of DNA can be read in one read.

A nanopore

27
Q

Contig

A

Assembling a summary sequence from fragments of DNA

28
Q

Genome sequencing starts with

A

All genome sequencing starts with random fragmentation.

In the tube you have a random set of fragments from different copies of the genome, of different sizes and starting at different positions.

Suppose you pick enough DNA to sequence that you have, in total, sequenced every nucleotide once (for example, six random sequences of 11 bases in length).

Because the fragments are picked randomly, this is still not a complete genome sequence.

29
Q

Coverage

A

How many times, on average, a base has been sequenced = bases sequenced / length of genome.
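As a worked example of the coverage formula, the sketch below assumes, purely for illustration, a run of 10^9 reads of 100 bases each against a genome of 3 x 10^9 bases. The function name is hypothetical.

```python
def coverage(bases_sequenced: float, genome_length: float) -> float:
    """Average number of times each base has been sequenced."""
    return bases_sequenced / genome_length

reads = 10**9        # assumed number of reads in the run
read_length = 100    # assumed bases per read
genome = 3 * 10**9   # assumed genome length in bases

print(round(coverage(reads * read_length, genome), 1))  # 33.3
```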

30
Q

Why do we have gaps?

A

Bases that are never sequenced are called the zero class

The proportion of the genome in the zero class = e^-m

where m is the expected mean = # of times you sequenced the genome (the coverage)

Expected number of unsequenced bases = e^-m x total length of the genome (ss)

Repetitive sequence also creates gaps in assembly.

High information content is necessary to assemble a sequence
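The zero-class formula can be evaluated for a few coverage values. This sketch assumes reads land at random so that the e^-m (Poisson zero class) expression applies, and uses a single-stranded genome length of 3 x 10^9 bases as an example; the function names are illustrative.

```python
import math

def unsequenced_fraction(m: float) -> float:
    """Proportion of bases never sequenced at mean coverage m (e^-m)."""
    return math.exp(-m)

def expected_unsequenced_bases(m: float, genome_length: float) -> float:
    """Expected number of bases in the zero class."""
    return unsequenced_fraction(m) * genome_length

for m in (1, 6, 10):
    print(m, unsequenced_fraction(m), expected_unsequenced_bases(m, 3e9))
```

At m = 1 roughly 37% of the genome is expected to be missed, which is why sequencing every nucleotide once on average does not give a complete genome.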

31
Q

Scaffold

A

A scaffold is an ordered array of contigs with gaps remaining.

32
Q

How do we generate scaffolds?

A

Paired end reads are used to create a scaffold.

Sanger and Illumina

For a DNA fragment of some length, we can determine a read of DNA sequence at one end and a read at the other end (the reverse complement).

We also know the distance between them.
These two sequences are linked by a defined distance

33
Q

Is the human genome completely sequenced?

A

8% is missing

sequences around the centromere
sequences of tandem rDNA (rRNA genes) repeats
Sequences of areas with repeats

Other genomes may remain incompletely sequenced due to the lack of interest/funds.

34
Q

Reference genomes

A

Drosophila: genome determined from a specific fly stock.

Human: was supposed to be determined from a number of individuals, but most of it ended up coming from a single white male.

A reference genome allows rapid assembly of the genome sequence for a single individual.

So is it easy, once you have a reference genome, just to align your sequence reads with the reference genome?

No. Comparison with other human genomes shows that a large amount of sequence is missing. A more sophisticated human reference needs to be developed.