Lecture 3 - Probability and Stats for Sequence Analysis Flashcards by Gia Gupta

Why would you want to identify the GC composition of a human genome?

GC composition is the fraction of G and C bases in the genome. GC-rich regions are often hypermethylated, and a higher GC content raises the melting point of DNA because G-C pairs form three hydrogen bonds, versus two for A-T pairs.
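
As a minimal sketch of the "fraction of GC" definition, the function below counts G and C bases in a string. The name `gc_fraction` is an illustrative choice, and the input is assumed to be a plain nucleotide string.

```python
def gc_fraction(seq):
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)


print(gc_fraction("ATGCGC"))  # 4 of the 6 bases are G or C
```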

Given the GC composition of a list of possible source genomes and a sequence read, can you identify which is the most likely genome the read was sequenced from?

First turn the genome into a numerical encoding (for example, nucleotide counts or GC fraction).
Then perform a statistical test.
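
One possible way to make this concrete is to compare likelihoods rather than run a formal test. The sketch below assumes each base of the read is independently G or C with probability equal to the genome's GC fraction; the genome names and GC values are hypothetical.

```python
import math


def log_likelihood(read, gc):
    """Log-probability of the read's GC/AT pattern for a genome with GC fraction gc.

    Assumes independent bases and 0 < gc < 1 (math.log fails otherwise).
    """
    n_gc = sum(1 for b in read.upper() if b in "GC")
    n_at = len(read) - n_gc
    return n_gc * math.log(gc) + n_at * math.log(1.0 - gc)


def most_likely_genome(read, genomes):
    """Return the name of the genome whose GC fraction best explains the read."""
    return max(genomes, key=lambda name: log_likelihood(read, genomes[name]))


# Hypothetical candidate genomes and their GC fractions.
candidates = {"genome_low_gc": 0.30, "genome_mid_gc": 0.50, "genome_high_gc": 0.70}
print(most_likely_genome("GCGCGGCCAT", candidates))
```

With 8 of 10 bases being G or C, the read is best explained by the candidate with the highest GC fraction.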

What are DNA-binding proteins?

-They are proteins that form noncovalent bonds with short stretches of DNA, and there are multiple binding sites within a genome
-The specific loci where DNA-binding proteins (DBPs) bind are called binding sites, and they are near other genes; the pattern bound is called a motif
-When a genome is initially sequenced, the locations of binding sites are unknown
-They form ionic bonds with the DNA

What does the LOGO binding motif show?

It shows the relative frequency of nucleotides at DNA binding sites. In the example, every binding site has an A at the fourth, fifth, and 12th positions and a G at the third position
-Consider a model where the binding motif is a fixed number of bases k and the nucleotides of the motif have 100% frequency

What is the probability that a certain chromosomal sequence is overrepresented in the genome?

Consider a single chromosomal sequence on the forward strand, read from 5' to 3'

What is the null model for DNA?

A random nucleotide generator whose output at each position does not depend on the outputs before or after it, meaning the positions are independent

When generating random nucleotides what is each base at position i considered to be?

the random variable Li

What is the domain of the random variable Li, that is, the values it can take on?

Li ∈ {A, C, G, T}

What are discrete random variables?

They are random variables that can take on a finite (or countably infinite) set of values

If a discrete random variable Li can take on J separate values with probabilities p1, …, pJ, then the sum of all the probabilities is?

1, that is, p1 + … + pJ = 1

What forms the probability distribution of a random variable?

The set of probabilities p1, …, pJ

For the random nucleotide generator, the only constraint we place on nucleotide probabilities is what?

pA + pC + pG + pT = 1
The four probabilities are not constrained to be equal unless that is specified; each position is still drawn i.i.d. from the same distribution
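
A minimal sketch of such a generator, assuming the probabilities are supplied as a dictionary keyed by nucleotide. The function name `random_sequence` and the example probabilities are illustrative, not from the lecture.

```python
import random


def random_sequence(n, probs, seed=None):
    """Draw n nucleotides independently from the given distribution."""
    total = sum(probs.values())
    # The only constraint on the probabilities: they must sum to 1.
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    rng = random.Random(seed)
    bases = list(probs)
    weights = [probs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=n))


# Hypothetical non-uniform distribution: allowed, because it still sums to 1.
gc_rich = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
print(random_sequence(20, gc_rich, seed=0))
```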

What is i.i.d.?

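Independent and identically distributed: every position is drawn independently from the same distribution. In the uniform case, pA = pC = pG = pT = 0.25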
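
With independent positions, the probability of any particular sequence is the product of its per-base probabilities. The sketch below shows this for the uniform case; `sequence_probability` is an illustrative name.

```python
import math


def sequence_probability(seq, probs):
    """Probability of seq when each position is drawn independently from probs."""
    return math.prod(probs[b] for b in seq.upper())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(sequence_probability("ACGT", uniform))  # equals 0.25 ** 4
```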

When are two random variables independent?

When the probability of both occurring together is the product of their individual probabilities. For example, if X is rolling a one on a six-sided die and Y is getting heads in a coin toss, the probability of both occurring is 1/6 × 1/2 = 1/12
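
The die-and-coin example can be checked by enumeration. This sketch assumes all 12 (die, coin) pairs are equally likely, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

die = [1, 2, 3, 4, 5, 6]
coin = ["H", "T"]
outcomes = list(product(die, coin))  # 12 (die, coin) pairs, assumed equally likely


def prob(event):
    """Probability of an event, given as a predicate over (die, coin) outcomes."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))


p_x = prob(lambda o: o[0] == 1)                   # rolling a one
p_y = prob(lambda o: o[1] == "H")                 # tossing heads
p_xy = prob(lambda o: o[0] == 1 and o[1] == "H")  # both together
print(p_x, p_y, p_xy)  # 1/6 1/2 1/12
assert p_xy == p_x * p_y
```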

What is an example of dependent variables?

The height and weight of a person, X and Y

How can you use the complement of A to find P(A)?

P(A) = 1 - P(not A)

How do you find the probability of A or B?

Sum the two probabilities and subtract the probability of their intersection: P(A or B) = P(A) + P(B) - P(A and B)
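
A small worked check of the union rule, together with the complement rule, on a fair six-sided die. The events A (even face) and B (face greater than 3) are chosen only for illustration.

```python
from fractions import Fraction

faces = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of a set of faces on a fair six-sided die."""
    return Fraction(len(event), len(faces))


A = {f for f in faces if f % 2 == 0}  # even face: {2, 4, 6}
B = {f for f in faces if f > 3}       # face greater than 3: {4, 5, 6}

# Union rule: add the two probabilities, subtract the overlap counted twice.
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
# Complement rule: P(A) = 1 - P(not A).
assert prob(A) == 1 - prob(faces - A)
print(prob(A | B))  # 2/3
```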

What is the expectation of a random variable?

The probability-weighted average of its values: E[X] = x1·p1 + … + xJ·pJ

What is the expectation of a random variable scaled by a constant?

Multiply the calculated expectation by the constant: E[cX] = c·E[X]

Can you sum expectations?

Yes: E[X + Y] = E[X] + E[Y], whether or not X and Y are independent

What is the variance of a random variable?

The mean of the squares minus the square of the mean: Var(X) = E[X^2] - (E[X])^2

How do you scale a variance?

Multiply by the square of the constant: Var(cX) = c^2·Var(X)

Can you sum variances?

Yes, when the random variables are independent: Var(X + Y) = Var(X) + Var(Y)
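
The expectation and variance rules on these cards can be checked exactly on a fair die. This is a sketch; the helpers `expectation`, `variance`, and `scale` are illustrative names, and a distribution is represented as a dictionary from value to probability.

```python
from fractions import Fraction


def expectation(dist):
    """E[X] for a distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = mean of squares minus square of the mean."""
    mean = expectation(dist)
    mean_of_squares = sum(x * x * p for x, p in dist.items())
    return mean_of_squares - mean ** 2


def scale(dist, c):
    """Distribution of c*X (c must be nonzero so values stay distinct)."""
    return {c * x: p for x, p in dist.items()}


die = {x: Fraction(1, 6) for x in range(1, 7)}
print(expectation(die), variance(die))  # 7/2 35/12

# Scaling rules: E[cX] = c E[X] and Var(cX) = c^2 Var(X).
assert expectation(scale(die, 3)) == 3 * expectation(die)
assert variance(scale(die, 3)) == 9 * variance(die)

# Sum of two independent dice: expectations and variances both add.
two_dice = {}
for x, px in die.items():
    for y, py in die.items():
        two_dice[x + y] = two_dice.get(x + y, 0) + px * py
assert expectation(two_dice) == 2 * expectation(die)
assert variance(two_dice) == 2 * variance(die)
```

The variance check on the sum relies on the two dice being independent; the expectation check does not.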

The expected number of a given nucleotide in a sequence is computed with an auxiliary variable Xi, which is defined how?

If N is the random variable representing the number of As in a sequence then...
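
The answers to the two cards above are not shown in the deck. A common construction, assumed here, is the indicator variable: Xi = 1 if position i is an A and 0 otherwise, so N = X1 + … + Xn and, by summing expectations, E[N] = n·pA. A minimal sketch under that assumption:

```python
def indicators(seq, base="A"):
    """Xi = 1 if position i holds the given base, else 0 (assumed definition)."""
    return [1 if b == base else 0 for b in seq]


def expected_count(n, p_base):
    """E[N] = n * p_base, from summing E[Xi] = p_base over n positions."""
    return n * p_base


seq = "AACGTAGA"
x = indicators(seq)
print(sum(x))                          # observed number of As: 4
print(expected_count(len(seq), 0.25))  # expected under pA = 0.25: 2.0
```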

Consider the example below when calculating deviation from expectation... A genome is thought to have a 50% GC composition and a sequence of 100 bases is observed with 65 nucleotides that are either G or C. What is the probability of such a deviation?

1. First find the expected number of GC bases in the 100-base sequence: 50 GC nucleotides
2. Use a chi-square (χ²) test
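
A sketch of this calculation using only the standard library. It treats the data as two categories (GC and AT), which gives 1 degree of freedom, and uses the identity that for 1 degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)). The function names are illustrative.

```python
import math


def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_df1(x):
    """P(chi-square with 1 degree of freedom > x)."""
    return math.erfc(math.sqrt(x / 2.0))


observed = [65, 35]  # GC and AT counts in the 100-base sequence
expected = [50, 50]  # counts expected at 50% GC composition
stat = chi_square_stat(observed, expected)
p_value = chi2_sf_df1(stat)
print(stat)     # 9.0
print(p_value)  # roughly 0.0027
```

A p-value this small suggests the observed sequence is unlikely under a 50% GC composition.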

Consider the example below when comparing nucleotide composition... Given a genome of length n with Sa, Sc, Sg, St counts of each nucleotide and a sample sequence of 100 nucleotides, what is the probability that the nucleotide counts deviate from what would be expected if the sample sequence was derived from the genome?

1. Calculate the nucleotide probabilities from the genome counts and find the expected number of each nucleotide in the read
2. Use a chi-square test with multiple degrees of freedom
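
A sketch of the four-nucleotide version, again with the standard library only. It assumes 3 degrees of freedom (four categories minus one) and uses the closed-form tail probability for 3 degrees of freedom; the genome counts and the read are hypothetical.

```python
import math
from collections import Counter


def chi2_sf_df3(x):
    """P(chi-square with 3 degrees of freedom > x), closed form."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


def composition_test(genome_counts, read):
    """Chi-square statistic and p-value for read counts vs genome composition."""
    n = sum(genome_counts.values())
    observed = Counter(read.upper())
    stat = 0.0
    for base, count in genome_counts.items():
        expected = len(read) * count / n  # expected count of this base in the read
        stat += (observed[base] - expected) ** 2 / expected
    return stat, chi2_sf_df3(stat)


# Hypothetical genome counts (Sa, Sc, Sg, St) and a hypothetical 100-base read.
genome_counts = {"A": 3000, "C": 2000, "G": 2000, "T": 3000}
read = "A" * 20 + "C" * 30 + "G" * 30 + "T" * 20
stat, p_value = composition_test(genome_counts, read)
print(stat, p_value)
```

Here the read is GC-rich relative to the genome, so the statistic is large and the p-value is small.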

What is the formula for the degrees of freedom (df) of a chi-square test with multiple rows and columns?

(#rows - 1)(#cols - 1)

(30 cards)