Topic B: Computational Biology Flashcards
Let T=abaaba$ be a string where $ is a terminator character, which is lexicographically prior to all other characters. Write out the Burrows Wheeler transformation of T (BWT).
BWT(T) = abba$aa
Suffix array (1-indexed): [7,6,3,4,1,5,2]
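The answer above can be checked with a short sketch. It builds the BWT naively by sorting all rotations of T, which is fine for a flashcard-sized string but not how it is done for genomes; the function names `bwt` and `suffix_array` are illustrative, not from any library.

```python
def suffix_array(text):
    """Return the 1-indexed start positions of the suffixes of text, in sorted order."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


# '$' sorts before lowercase letters in ASCII, matching the card's assumption.
T = "abaaba$"
print(bwt(T))           # abba$aa
print(suffix_array(T))  # [7, 6, 3, 4, 1, 5, 2]
```

Each BWT character is the one that precedes the corresponding suffix in the suffix array, which is why the two answers line up position by position.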
For genome-wide association studies, how is genomic inflation factor computed?
The genomic inflation factor (λ) is computed by taking the median of the observed chi-squared association statistics for a set of SNPs distributed across the genome and dividing it by the median expected under the null chi-squared distribution. It is used to assess the extent of population stratification (λ > 1 indicates inflation).
λ = observed median of the test statistic distribution / expected median of the test statistic distribution. For 1 degree of freedom, the χ² distribution has an expected median of 0.455.
A λ of 1.05 is considered acceptable. A λ above 1.1 is troubling and indicates some inflation of the test statistics, i.e. p-values that are smaller than expected.
Causes of inflation include: technical batch effects (if samples were not processed in parallel), population stratification, unknown relatedness between samples, and DNA sample quality.
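A minimal sketch of the λ calculation, assuming 1-degree-of-freedom chi-squared statistics and the expected median of 0.455 quoted above. The function name and the example statistics are made up for illustration.

```python
from statistics import median

# Approximate median of the chi-squared distribution with 1 degree of freedom.
EXPECTED_MEDIAN_CHI2_1DF = 0.455


def genomic_inflation(chi2_stats):
    """Return lambda = observed median / expected median under the null."""
    return median(chi2_stats) / EXPECTED_MEDIAN_CHI2_1DF


# Hypothetical per-SNP chi-squared statistics (illustrative values only).
example_stats = [0.02, 0.10, 0.30, 0.50, 0.91, 1.40, 3.20]
lam = genomic_inflation(example_stats)
print(round(lam, 2))
```

The median is used rather than the mean so that a handful of truly associated SNPs with very large statistics do not dominate the estimate.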
What is phylogenetic footprinting?
Phylogenetic footprinting is a powerful technique for finding functional elements from sequence data. Functional elements are thought to be under greater sequence constraint than nonfunctional elements and thus undergo a slower rate of sequence change through time. Phylogenetic footprinting uses comparisons of homologous sequences from closely related organisms to identify "phylogenetic footprints," regions with slower rates of sequence change than background.
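As a rough illustration of the idea, the sketch below scores each column of an already-aligned set of homologous sequences by identity and reports windows that are more conserved than a chosen threshold. Real footprinting tools use evolutionary models and a phylogeny; the function names, window size, threshold, and toy alignment here are all assumptions made for the example.

```python
def column_identity(alignment):
    """Fraction of sequences matching the most common character in each column."""
    scores = []
    for column in zip(*alignment):
        most_common = max(set(column), key=column.count)
        scores.append(column.count(most_common) / len(column))
    return scores


def conserved_windows(alignment, window, threshold):
    """Start indices of windows whose mean column identity is at least threshold."""
    scores = column_identity(alignment)
    return [
        start
        for start in range(len(scores) - window + 1)
        if sum(scores[start:start + window]) / window >= threshold
    ]


# Hypothetical toy alignment of homologous sequences (made up for illustration).
alignment = [
    "ACGTTGACCA",
    "TCGTTGAGTA",
    "GAGTTGATCC",
]
print(conserved_windows(alignment, window=4, threshold=1.0))  # [2, 3]
```

Columns 2 to 6 (0-indexed) are identical in all three sequences, so the two 4-column windows starting at 2 and 3 are reported as candidate footprints.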
How is a scale-free network different from a random network?
A scale-free network is a network whose degree distribution follows a power law, at least asymptotically. There exist high-degree nodes, often called "hubs".
A scale-free ideal network is a random network with a degree distribution following the scale-free ideal gas density distribution.
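A random graph is a graph generated by some random process. In the classic Erdős–Rényi random graph, each pair of nodes is connected independently with the same probability, so node degrees cluster around the mean (a binomial, approximately Poisson, distribution) and hubs are very unlikely.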
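The contrast can be seen in a small simulation. This is a sketch under stated assumptions: the random network is an Erdős–Rényi G(n, p) graph, the scale-free network is grown by preferential attachment (Barabási–Albert style), and both are tuned to a mean degree of about 4. The function names and parameters are illustrative.

```python
import random


def erdos_renyi_degrees(n, p, rng):
    """Degrees of a G(n, p) random graph: each pair is linked with probability p."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree


def preferential_attachment_degrees(n, m, rng):
    """Degrees of a graph grown by preferential attachment with m links per new node."""
    degree = [0] * n
    # Each node appears in `endpoints` once per edge end, so a uniform
    # choice from this list picks nodes in proportion to their degree.
    endpoints = []
    # Start from a fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            degree[i] += 1
            degree[j] += 1
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            degree[new] += 1
            degree[target] += 1
            endpoints += [new, target]
    return degree


rng = random.Random(0)
n = 1000
er = erdos_renyi_degrees(n, 4 / (n - 1), rng)
ba = preferential_attachment_degrees(n, 2, rng)
print("max degree, random graph:", max(er))
print("max degree, preferential attachment:", max(ba))
```

With the same average degree, the preferential-attachment graph ends up with a few nodes of very high degree (the hubs), while the largest degree in the random graph stays close to the mean.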
Why is population substructure a problem for GWAS studies?
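In case-control studies, an observed association could be due to the underlying population structure (allele frequencies that differ between subpopulations sampled unevenly among cases and controls) and not to a disease-associated locus.
Also, the real disease-causing locus might not be found if it is less prevalent in the population from which the case subjects are chosen.
It can be controlled for with EIGENSTRAT, or more generally by running PCA or MDS on the genotypes and adjusting for the top components.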
Provide the formula for the Information Content at a column of a PWM.
Intuitively, what does this measure?
Information Content = 2 + Σ_{x ∈ {A,C,G,T}} P_x log2(P_x), where P_x is the probability of nucleotide x at that column.
The max is 2 bits (the position is fully informative: a single nucleotide always occurs) and the min is 0 bits (not informative at all: all four nucleotides are equally likely). A small pseudo-count is used to avoid 0 probabilities.
It measures how much information is provided by the PWM column, or how different the column is from the background probability.
Put another way, it measures how well the motif can discriminate between a real signal and background noise, or how informative a position in a PWM is.
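A minimal sketch of the per-column calculation, assuming a four-letter alphabet and a uniform background (which is where the constant 2 = log2(4) comes from). The function name and the optional `pseudocount` parameter are illustrative.

```python
from math import log2


def information_content(column, pseudocount=0.0):
    """Information content (bits) of one PWM column, assuming a uniform background.

    column maps each of the four nucleotides to its probability or count.
    """
    total = sum(column.values()) + pseudocount * len(column)
    probs = [(value + pseudocount) / total for value in column.values()]
    # 2 + sum(p * log2 p); terms with p == 0 contribute 0 by convention.
    return 2 + sum(p * log2(p) for p in probs if p > 0)


print(information_content({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))      # 2.0
print(information_content({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))  # 0.0
```

The two calls reproduce the extremes on the card: a column that is always A carries 2 bits, and a column with all four nucleotides equally likely carries 0 bits.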
Because different genomes and different regions of the same genome have different nucleotide compositions, information content may not universally provide an accurate measure of motif specificity. What is an alternative measure? Provide the formula for this alternative measure.
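When we use information content as a measure, we are making an assumption about the background nucleotide frequencies, namely that all nucleotides are equally likely. To take different background probabilities into account, we need a slightly different measure, called relative entropy.
Relative Entropy = Σ_{x ∈ {A,C,G,T}} P_x log2(P_x / B_x), where B_x is the background probability of nucleotide x. With a uniform background (B_x = 1/4) this reduces to the information content.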
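A short sketch of relative entropy for a single PWM column, showing how the score changes with the background. The function name and the AT-rich background frequencies are hypothetical values chosen for the example.

```python
from math import log2


def relative_entropy(probs, background):
    """Relative entropy (bits) of a PWM column against background frequencies."""
    return sum(p * log2(p / background[x]) for x, p in probs.items() if p > 0)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # hypothetical AT-rich genome
column = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}

print(relative_entropy(column, uniform))  # 2.0, same as the information content
print(relative_entropy(column, at_rich))  # log2(1 / 0.4), about 1.32
```

A column that is always A is less surprising in an AT-rich genome, so it scores lower there than against a uniform background.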
Define homolog, ortholog, and paralog.
Homolog: a gene related to a second gene by descent from a common ancestral DNA sequence.
Ortholog: genes in different species that evolved from a common ancestral gene by speciation.
Paralog: genes related by duplication within a genome.
Orthologs typically retain the same function in the course of evolution, whereas paralogs tend to evolve new functions, even if these are related to the original one.
If you were trying to identify an unknown protein, which evidence for identification would you be most confident in:
• The presence of several high scoring sequence matches to one protein obtained from BLAST, where there were also matches to different proteins in the result set, or
• a match to a PFAM-A profile of a protein sequence family, and why?
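We would be more confident in a match to a PFAM-A profile of a protein sequence family. PFAM-A profiles are curated hidden Markov models built from multiple alignments of family members, so a match reflects the conservation pattern of the whole family and carries protein-level functional information. High-scoring BLAST matches to one protein, alongside matches to different proteins, are more ambiguous.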
Wright-Fisher model of genetic drift
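The Wright-Fisher model describes genetic drift in an idealized population of constant size N with non-overlapping generations and random mating. Each generation, the 2N gene copies are drawn by sampling with replacement from the previous generation, so the count of an allele with current frequency p follows a Binomial(2N, p) distribution. With no mutation or selection, the allele frequency drifts at random until the allele is either fixed or lost.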
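As a rough illustration of the Wright-Fisher model named on this card, the sketch below simulates one allele under pure drift, assuming constant population size and no selection, mutation, or migration. The function name, parameters, and starting values are chosen for the example.

```python
import random


def wright_fisher(n_copies, p0, generations, rng):
    """Simulate an allele frequency under pure drift.

    n_copies is the number of gene copies (2N for N diploid individuals).
    Returns the list of allele frequencies, starting with p0.
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Each copy in the next generation picks a parent copy at random,
        # so the new allele count is a Binomial(n_copies, p) draw.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        freqs.append(p)
    return freqs


rng = random.Random(1)
trajectory = wright_fisher(n_copies=100, p0=0.5, generations=200, rng=rng)
print(trajectory[-1])
```

Frequencies of 0 and 1 are absorbing states: once the allele is lost or fixed, every later generation stays there. Smaller values of `n_copies` reach one of these states faster.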
Suggest five features that might be relevant for gene prediction (i.e. predict the genic structure of a novel genome).
promoter sequence, TSS, ORF (AUG), splice sites, stop codon
sequence conservation, similarity to known genes
DNA methylation, histone modification (H3K4me3 for promoters, H3K36me3 for gene body).
What are the computational challenges of NGS?
- comparing a billion reads against the human genome
- calling variants reliably
- running association tests on these variants
- interpreting the effect of variants
- querying and storing large amounts of data