Week 2 Flashcards
Information
the letters used to write the sequences
Genomes in a double or single stranded states
What is information used for?
We use the information in a DNA sequence to align sequences
Contig
Summary sequence of overlapping DNA fragments
Why can we align DNA sequences?
How long does a sequence have to be for us to expect it to be unique?
Information theory and Genetics
The restriction enzyme HaeIII cuts the sequence GGCC. What is the probability that any random four base pair sequence is a HaeIII site?
1/4 chance of getting any given base at each position
1/4 x 1/4 x 1/4 x 1/4 = 1/256
(1/4)^n
n = length of the sequence being matched (n = 4 for GGCC)
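The (1/4)^n calculation can be checked with a short sketch. It assumes the four bases are equally frequent and independent; the function name site_probability is made up for illustration.

```python
from fractions import Fraction

def site_probability(site: str) -> Fraction:
    """Chance that a random stretch of len(site) bases matches `site` exactly.

    Assumes A, C, G and T each occur with frequency 1/4 at every position.
    """
    return Fraction(1, 4) ** len(site)

p = site_probability("GGCC")   # HaeIII recognition site, n = 4
print(p)          # 1/256
print(float(p))   # 0.00390625
```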
How many distinct permutations can a stretch of sequence n bases long have?
How many sequences (permutations) are possible for a given length of DNA strand (n bases long) if the bases occur at equal frequency in the genome?
How many pieces of distinct information can a stretch of n bases potentially hold?
4^n
We can expect a sequence to be unique in a genome when the # of permutations is greater than the # of sequences of n bases in the genome
Bacterial genome = 4x10^6 bases
Human genome = 3x10^9 bases (single stranded)
What is the # of sequences of n bases in the genome?
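Because genomes are so long compared with n, the number of sites (length - n + 1) is approximately the length of the genome if the genome is single stranded
The number of sites in double stranded DNA = 2x the length of the genome
Palindromic sequences are an exception: they read the same on both strands, so each site is counted only once.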
We can expect a sequence to be unique in a genome when the # of permutations is greater than the number of sequences of n bases in the genome.
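When is 4^n > 8x10^6 (bacteria)?
n >= 12: a sequence of 12 or more bases is expected to be unique in bacteria
When is 4^n > 6x10^9 (humans)?
n >= 17: a sequence of 17 or more bases is expected to be unique in humans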
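A minimal sketch of the uniqueness threshold, assuming random sequence with equal base frequencies and counting both strands (sites = 2 x genome length). The function name min_unique_length is hypothetical.

```python
def min_unique_length(sites: int) -> int:
    """Smallest n for which the number of permutations, 4**n, exceeds `sites`."""
    n = 1
    while 4 ** n <= sites:
        n += 1
    return n

# Double stranded, so sites = 2 x genome length
print(min_unique_length(2 * 4 * 10**6))   # bacteria: 12
print(min_unique_length(2 * 3 * 10**9))   # human: 17
```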
Say I have 9 genomes, each made of a random sequence of 3 x 10^9 bases, and I search each of these with ATAGACATAGGATACAT (17 bases). How many would you expect to have the sequence?
6 x 10^9 sites / 4^17 permutations (about 17 x 10^9) = about 1/3
Therefore, 1/3 X 9 = 3.
I expect about 3 of these random genomes to have this sequence
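The same arithmetic as a rough sketch. It assumes random genomes with equal base frequencies and 6 x 10^9 sites per genome (both strands), and, like the card, it treats the expected number of occurrences per genome as the chance that a genome contains the sequence. The name expected_occurrences is made up for illustration.

```python
def expected_occurrences(sites: int, n: int) -> float:
    """Expected number of times one specific n-base sequence occurs among `sites` sites."""
    return sites / 4 ** n

query = "ATAGACATAGGATACAT"
per_genome = expected_occurrences(6 * 10**9, len(query))

print(len(query))                  # 17
print(round(per_genome, 2))        # 0.35, roughly 1/3
print(round(9 * per_genome, 1))    # 3.1, so about 3 of the 9 genomes
```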
How much information is there in DNA?
A, T, G, C: each letter can be arbitrarily represented as:
A=00
G=01
C=10
T=11
the information content at any position in a DNA sequence is 2 bits
smallest unit of information is a bit
The total bits in a sequence of n bases is the sum of the bits at each position: 2n bits.
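A small sketch of the 2-bit encoding, using the arbitrary assignment from the card (A=00, G=01, C=10, T=11). The names CODE and encode are made up for illustration.

```python
# Arbitrary 2-bit code from the card; any one-to-one assignment would work
CODE = {"A": "00", "G": "01", "C": "10", "T": "11"}

def encode(seq: str) -> str:
    """Concatenate the 2-bit code for each base in `seq`."""
    return "".join(CODE[base] for base in seq.upper())

bits = encode("GGCC")
print(bits)        # 01011010
print(len(bits))   # 8 bits = 2 bits x 4 bases
```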
Information theory
Trying to pick out (identify) one specific object
We have eight objects; each is assigned a distinct binary number that identifies it
When do I have sufficient information to identify the object?
If you have only one bit of information when you search through the objects, you can only narrow them down to half
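The halving idea can be sketched with eight objects labelled by 3-bit numbers. The target label "101" and the function name candidates_after are arbitrary choices for illustration.

```python
def candidates_after(bits_known: int, target: str = "101", n_bits: int = 3) -> list[str]:
    """Labels still consistent with the first `bits_known` bits of `target`."""
    labels = [format(i, f"0{n_bits}b") for i in range(2 ** n_bits)]
    return [label for label in labels if label.startswith(target[:bits_known])]

for known in range(4):
    print(known, len(candidates_after(known)))
# 0 8
# 1 4
# 2 2
# 3 1
```

Each extra bit halves the candidates, so 3 bits (log2 8) are needed to single out one of eight objects.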
Minimum amount of information
Imin = log2 N
I >= Imin is sufficient information
N is the number of objects (for a genome, the number of sites: the genome length, or 2x the length if double stranded)
12-base sequence = 24 bits > log2(8 x 10^6) = 22.9 bits
17-base sequence = 34 bits > log2(6 x 10^9) = 32.5 bits
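A minimal sketch of the Imin calculation, assuming 2 bits per base as above. The function names imin_bits and min_bases are made up for illustration.

```python
import math

def imin_bits(n_sites: int) -> float:
    """Minimum information, in bits, needed to single out one of `n_sites` sites."""
    return math.log2(n_sites)

def min_bases(n_sites: int) -> int:
    """Fewest bases whose information (2 bits each) is at least Imin."""
    return math.ceil(imin_bits(n_sites) / 2)

print(round(imin_bits(8 * 10**6), 1), min_bases(8 * 10**6))   # 22.9 12
print(round(imin_bits(6 * 10**9), 1), min_bases(6 * 10**9))   # 32.5 17
```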
BLAST Basic local alignment search tool
GenBank has about 10^13 bases of sequence in the database. Therefore, Imin is about 44 bits, and a sequence of 23 bases or more (46 bits) could be expected to be unique, more often than not, in the database.
BLAST takes your query and breaks it into short words (5-base words). It breaks the database into 5-base words as well, and these words are indexed with position information. BLAST then checks whether any word from the query matches a word in the database.
If so, the BLAST program creates an alignment of the query to the matching sequences in the database
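The word-indexing step can be illustrated with a toy sketch. This is not the actual BLAST implementation; it only shows the idea of indexing 5-base words by position and looking up query words in that index. The function names and the example sequences are made up.

```python
from collections import defaultdict

def index_words(seq: str, k: int = 5) -> dict:
    """Map every k-base word in `seq` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def word_hits(query: str, subject: str, k: int = 5) -> list:
    """(query position, subject position) pairs where a k-base word matches exactly."""
    subject_index = index_words(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in subject_index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits

# Hypothetical database sequence and query
print(word_hits("ACATAGG", "TTGACATAGGATCC"))   # [(0, 3), (1, 4), (2, 5)]
```

Matching words that line up on the same diagonal, as here, are the starting points from which an alignment would be built.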
Two statistics of a BLAST
Score: the higher the score, the better.
E value: the lower the E value, the better.
The E value tells you how many alignments with a score at least this good you would expect to occur by random chance when searching the database. Low E value, low chance.