Topic D & E. DNA Sequencing... Function Genomics Flashcards
What are these large regions of chromosomes that maintain homology between grape and poplar termed? Describe an approach you would use to further characterize these regions.
Syntenic Regions
I will look within these highly conserved regions to see what proteins they code for, which would indicate their functionality.
What are the genes connected by lines between grape and poplar known as? Describe a computational approach used to define these types of genes between organisms.
Orthologs
We can use a computational approach such as sequence alignment to test whether two genes from the different organisms are very similar in sequence. Tools: BLASTZ, ClustalW.
If there were lines connecting genes within grape what would these genes be known as? Describe a computational approach used to define these types of genes within an organism.
Paralogs
We can use a homology search within the genome, probably with a hidden Markov model (HMM).
Three years after the human genome was declared essentially finished, gaps in the sequence persist. Describe briefly 3 reasons for the remaining gaps in the euchromatic region of the genome. Do you think it is possible with current technology to close the heterochromatic gaps? Why or why not?
Tandem repeats, non-uniquely mapping reads, and structural variations. We need longer reads to close the gaps.
What sequence features or genetic properties might be associated with these gaps? How might they be causing the gaps?
Repeats: it is hard to determine how long a repeat region is when your reads fall entirely within it.
Heterochromatic regions: it is hard to actually get the sequence because the DNA does not dissociate well.
Acquisition and mapping of fosmid end sequences derived from unrelated individual genomes to the current human reference sequence forms the basis for the human Structural Variation Project. What kinds of important genetic information might one expect to discover from this analysis? Give 3 examples.
CNVs, inversions, translocations, and SNPs
Whole genome shotgun sequencing strategy:
An approach to genome sequencing where the whole genome is sheared into sequenceable fragments and then computationally assembled. All sequencing is done ahead of time using PCR products, to form shotgun libraries of sequence reads.
Clone-by-clone sequencing strategy:
An alternative to WGS where a divide-and-conquer approach is utilized. First, create genomic libraries of clones immortalized in vectors such as BACs. Ideally you want 5-10x redundancy of genomic coverage in your libraries. Then form a tiling path by end sequencing clones and aligning overlapping fragments. In so doing, you will be able to quantify gaps where clones lack coverage. You then sequence individual clones along the tiling path and assemble contigs spanning the genome. Finally, work on finishing the sequence and plugging gaps.
Hybrid sequencing strategies:
A combination of clone-by-clone and WGS, which was used for the mouse and chicken genome projects. Such a compartmentalized shotgun could, for example, break the genome up into chromosomes and then do shotgun sequencing on each chromosome. Probably the best of both worlds, as many genome projects are now adopting a combined approach.
Draft Sequence:
Sequence with an error rate of 10^-3 → Q = 30
Finished Sequence:
Sequence with an error rate of 10^-4 → Q = 40
Segmental Duplications:
Duplicated sequences > 1 kb in length with > 90% sequence similarity
Q-value:
Q = -10 log10(p), where p = the error rate (or probability of an error)
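The Q-value formula on this card can be checked with a few lines of Python. This is a minimal sketch; the function names `phred_quality` and `error_probability` are made up for illustration and do not come from any sequencing library.

```python
import math

def phred_quality(error_prob: float) -> float:
    """Return Q = -10 * log10(p) for a per-base error probability p."""
    if not 0 < error_prob <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_prob)

def error_probability(q: float) -> float:
    """Invert the formula: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# The two error rates quoted on the draft/finished cards.
print(round(phred_quality(1e-3)))  # 30
print(round(phred_quality(1e-4)))  # 40
```

Going the other way, `error_probability(30)` gives roughly 1 error in 1,000 bases.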
Mate-pair sequences:
A pair of sequences derived from the two ends of a single clone. An essential component of shotgun sequencing, as the distance between the pairs gives spatial information and assists in resolving repeats.
BAC end sequences:
Used to establish mate pairs and construct the tiling path in clone-by-clone sequencing.
mRNA sequences:
Messenger RNA. Eukaryotic transcribed sequences that have been processed (i.e., spliced and exported out of the nucleus).
EST sequences:
Expressed Sequence Tag: a sequenced piece of cDNA that may not span the whole cDNA transcript. cDNA library generation uses primers to the poly(A) tail of the mRNA transcript, and a single sequencing trace is usually performed toward the 5′ portion of the gene (all of this is done on the complementary strand).
STS:
Sequence Tagged Site: any sequenced fragment of DNA derived from a library of clones that is placed on the physical map of the genome. Each STS is unique, and its primers, PCR conditions, and product size can be recorded in a database. Fundamental to the HGP.
Microsatellites:
Stretch of repetitive DNA made up of a variable number (several to one hundred or more) of tandem repeats of a small number of nucleotides, e.g. (AG)n or (CAG)n. Highly polymorphic (in n at least) and heterozygous; they occur at a density of around several per hundred kilobases in higher eukaryotes.
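As an illustration of the (AG)n and (CAG)n examples, here is a small Python sketch that scans a sequence for tandem repeats of 2 to 3 bp units using a regular expression. The function name, the thresholds, and the toy sequence are all assumptions chosen for the example; real microsatellite finders handle longer units, imperfect repeats, and both strands.

```python
import re

def find_microsatellites(seq, min_unit=2, max_unit=3, min_repeats=4):
    """Return (start, unit, copies) for perfect tandem repeats of short units."""
    # Capture a short unit, then require it to repeat back-to-back.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

toy = "TTAGAGAGAGAGCCCAGCAGCAGCAGTT"  # hypothetical sequence
print(find_microsatellites(toy))  # [(2, 'AG', 5), (14, 'CAG', 4)]
```

The copy number reported for each hit is the n that varies between individuals at a polymorphic locus.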
SNP
Single Nucleotide Polymorphisms. Useful for mapping phenotype to gene. The highest-resolution polymorphic markers, at about 1 per kb.
Meiotic Linkage Maps:
Linkage maps based on natural meiotic breaks from homologous recombination.
Radiation Hybrid Maps:
Linkage maps based on induced chromosomal breaks from X-ray irradiation. Fragmented chromosomes are then exposed to hamster cell lines, and the fragments either become incorporated into the hamster chromosomes (via homologous recombination) or segregate as minichromosomes.
Cytogenetics:
Study of chromosomes and the related disease states caused by numerical and structural chromosome abnormalities. FISH is especially used in cytogenetics.
FISH
Fluorescence In Situ Hybridization. Hybridize a fluorescent DNA probe to mitotic chromosomes at metaphase. Used in "chromosome painting", where one species' chromosomes are labeled and synteny with another species is sought.
BACs
Bacterial Artificial Chromosomes. A system to clone approximately 100 kb of DNA into bacteria.
Clone-based Physical Maps:
Assembled genomic sequence based on hierarchical sequencing of clone libraries. Contig alignment to chromosomes.
Euchromatin
Open active DNA with genes being actively transcribed. Classically associated with acetylation of histones and HATs
Heterochromatin
Closed inactive DNA, tightly coiled and not actively transcribed. Classically associated with methylation and methyltransferases.
Centromeres
Structures of eukaryotic chromosomes that serve as the attachment point for the spindle apparatus during mitosis. Highly repetitive; the centromere separates the long arm from the short arm in human chromosomes.
Telomeres
Sequences toward the ends of chromosomes that contain mainly simple repeats and duplications. They prevent chromosomes from fusing with each other by forming tertiary structures that protect the termini. Interestingly, they are not fully replicated by the standard DNA polymerases, but are instead extended by their own enzyme, telomerase.
What are the two differences between finished and draft genome sequences?
The finished genome closed many, but not all, of the gaps in the draft sequence; some heterochromatic gaps, and gaps at euchromatic boundary and interior regions, remained. The finished genome increased continuity, with an increase in N50 contig size. It also corrected the order and orientation of draft contigs and eliminated artifactual sequence duplications.
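The N50 contig size mentioned in this answer is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch follows; the contig lengths are invented toy numbers, not data from any real assembly.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs >= L hold half the bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")

draft = [100, 80, 50, 30, 20, 20]  # hypothetical draft contigs (kb)
finished = [180, 50, 30, 20, 20]   # same bases after joining two contigs
print(n50(draft))     # 80
print(n50(finished))  # 180
```

Closing a single gap between the two largest contigs leaves the total size unchanged but more than doubles the N50, which is why N50 is used as a continuity measure.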
Why is sequencing telomeric DNA more difficult than euchromatic DNA?
Telomeric DNA is more condensed and contains many repeating sequences that are hard to assemble with short reads.
You are part of a large consortium that performs a large GWAS study of 10,000 individuals that aims to identify risk factors for coronary artery disease (CAD). You identify four genomic locations that show significant association with CAD. Together these loci explain 2 percent of the heritability of CAD in your population, with relative risks ranging from 1.3 to 1.8. Is this surprising? Name at least three reasons that might explain why the heritability is so low.
Rare alleles, interactions, and environmental factors. Many variants with smaller effects may be acting together, rather than one or two variants with large effect sizes. Also make sure there wasn’t population stratification underlying your study.
Name three or more possible sources of bias introduced by T7 RNA polymerase amplification of mRNA from single cells
idk
Name three or more possible strategies that a cell can use to reduce gene expression noise in vivo, assuming the same steady-state protein concentration.
idk
For your graduate research project, you are interested in studying the highly repetitive genome of Sequoia trees. You need to produce a reference genome sequence. What high throughput sequencing technique would you use and why?
Hierarchical sequencing. You want to use technology that allows for longer reads and paired-end reads because the genome is so highly repetitive.
As a postdoc you identify a novel class of human RNA molecules that are likely not polyadenylated. You want to know how prevalent they are in the human transcriptome. What high throughput sequencing technique would you use and why?
RNA sequencing with rRNA depletion instead of poly(A) selection for library prep, because it captures non-polyadenylated RNAs and can measure relative expression levels of these novel RNAs.
As a PI, you become obsessed with identifying all transcripts that have expressed, overlapping 3 prime UTRs in K562 cells, a human blood model cell line. What high throughput sequencing technique would you use and why?
idk
A pseudogene is a locus that resembles a protein coding gene but lacks the ability to encode a functional protein. Given this observation, what are three possible ways that you could distinguish whether the sequence you identify is a gene or a pseudogene?
idk
Bisulfite sequencing
Bisulfite treatment changes C’s to U’s at unmethylated sites, but C’s are unchanged at methylated sites. The green signals indicate sites where methylation patterns aren’t significantly different in normal versus tumor cells, as the signals in the bottom two panels are similar. The red regions indicate regions that are significantly more methylated, or repressed, in tumor cells than in normal cells.
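The conversion rule on this card can be simulated in a few lines of Python. This is a toy sketch on a single strand, with made-up function names and a made-up sequence; it relies on the fact that the U produced from an unmethylated C is read as T after PCR and sequencing.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the sequenced read-out of a bisulfite-treated strand."""
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")  # unmethylated C -> U, which is read as T
        else:
            out.append(base)  # methylated C (and A, G, T) are unchanged
    return "".join(out)

def call_methylation(reference, converted):
    """Positions where a reference C is still a C after conversion."""
    return [
        i for i, (ref, obs) in enumerate(zip(reference, converted))
        if ref == "C" and obs == "C"
    ]

reference = "ACGTCCGA"  # hypothetical reference; C at positions 1, 4, 5
converted = bisulfite_convert(reference, methylated_positions={1})
print(converted)                               # ACGTTTGA
print(call_methylation(reference, converted))  # [1]
```

Comparing the converted read back to the reference is what lets methylated sites be called: only protected C’s survive as C.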
Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). A minor allele frequency (MAF) filtering approach is often used to identify candidate pathogenic mutations in these studies. However, hard filtering in exome sequencing of Mendelian diseases still leaves a large number (typically around 100 to 1000) of candidate nsSNVs. Please provide at least three different ideas/methods that you can use to predict which of the remaining ones have serious functional consequences and prioritize them for validation.
??
The ENCODE Project has generated hundreds of ChIP-seq experiments spanning 119 transcription factors, histone marks, and other DNA-binding proteins and hundreds of cell lines for public use.
Suppose you have a list of genes that are dysregulated in a particular condition or tissue. Based on the ENCODE database, how do you identify the possible transcription factors regulating these genes?
Look at the ChIP-seq peaks from the ENCODE database for your cell type or tissue of interest and overlay them with RNA-seq data. Check that you are seeing dysregulation of the same genes. Compare ChIP-seq peaks between normal tissue and the tissue of interest to identify significant differences in transcription factor peaks, indicating up- or down-regulation of a transcription factor that may affect expression of your gene of interest.
Suppose that you have identified a GWAS lead SNP (SNP1), which is in a LD block of 5 other SNPs. Explain how you can use the ENCODE data to potentially identify the functional SNPs.
Align multiple sources of binding information. Look at how prevalent the SNP is across similar tissues or conditions compared to the surrounding SNPs. Check whether the SNP is in a functional or non-functional region: is it in an exon of a gene? Look to see if there are any other nearby regulatory markers that may affect this SNP but not the others.
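One way to picture the "overlay the SNPs with binding information" step is a simple interval overlap, sketched below in Python. Everything here is hypothetical: the SNP names and positions, the peak coordinates, and the labels (`TF_A`, `TF_B`, `H3K27ac`) are toy values, all features are assumed to lie on the same chromosome, and intervals are treated as half-open. Real analyses would read ENCODE peak files instead.

```python
def snps_in_peaks(snps, peaks):
    """Map each SNP name to the labels of the peaks that contain it.

    snps:  dict of name -> position
    peaks: list of (start, end, label) half-open intervals
    """
    hits = {}
    for name, pos in snps.items():
        hits[name] = [label for start, end, label in peaks if start <= pos < end]
    return hits

# Hypothetical LD block: a lead SNP and two linked SNPs.
snps = {"SNP1": 1050, "SNP2": 1500, "SNP3": 2200}
# Hypothetical ChIP-seq peaks from a relevant cell type.
peaks = [(1000, 1100, "TF_A"), (2150, 2300, "H3K27ac"), (2180, 2250, "TF_B")]

hits = snps_in_peaks(snps, peaks)
# Rank SNPs by how many regulatory features they overlap.
ranked = sorted(hits, key=lambda name: len(hits[name]), reverse=True)
print(hits)    # {'SNP1': ['TF_A'], 'SNP2': [], 'SNP3': ['H3K27ac', 'TF_B']}
print(ranked)  # ['SNP3', 'SNP1', 'SNP2']
```

In this toy case the lead SNP is not the best-supported candidate, which illustrates why every SNP in the LD block is checked rather than only the lead SNP.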
What can you conclude about chromatin modification with respect to cell types?
Chromatin modifications near promoters seem to be similar irrespective of cell type. Chromatin modifications near non-redundant enhancers seem to be more variable and more cell-type specific.
The prairie dog genome has been predicted to be 1.9 Gb. Using the brand new Illumina HiSeq-2500, 2 X 150 paired-end sequencing with average output of 600 million reads per lane is possible, and on average, 75% of all bases are Q30 or above (1 error in 1,000). Using this system, what level of coverage would one lane give you with all bases of Q30 or better? Show your work for full credit.
(600,000,000 × 2 × 150 × 0.75) / 1,900,000,000 ≈ 71× coverage
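The coverage arithmetic can be reproduced with a short Python sketch. It follows the card in counting 600 million as read pairs, each contributing 2 × 150 bases; the function name `q30_coverage` is made up for this example.

```python
def q30_coverage(read_pairs, read_length, q30_fraction, genome_size):
    """Fold coverage from high-quality bases only."""
    total_bases = read_pairs * 2 * read_length  # two reads per pair
    return total_bases * q30_fraction / genome_size

cov = q30_coverage(
    read_pairs=600_000_000,
    read_length=150,
    q30_fraction=0.75,
    genome_size=1_900_000_000,
)
print(f"{cov:.1f}x")  # 71.1x
```

If the 600 million figure instead counted individual reads rather than pairs, the factor of 2 would drop out and the coverage would halve.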
The sequencing gives you great assembly of gene-rich regions of the genome, but you still have 2,000 scaffolds with a total size of 1.7 Gb and 28 predicted chromosomes (n=28). You decide that an assembled genome is essential to your research. Therefore, you decide to create a BAC library with average insert sizes of 100 kilobases. With this in mind, how many total BAC clones will you need to reach 10X coverage for this genome? Show your work for full credit.
1.7 Gb × 10 (10X coverage) / 100,000 (100 kilobases) = 1.7 × 10^5 clones
BAC: up to 200 kb, more commonly used
YAC: really huge inserts, sometimes 1Mb
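The BAC clone count above is the same kind of calculation, sketched here in Python. The function name is hypothetical, and the sketch uses the 1.7 Gb scaffold total from the card's answer as the genome size.

```python
import math

def clones_for_coverage(genome_size, insert_size, coverage):
    """Number of clones whose inserts sum to `coverage` times the genome."""
    return math.ceil(genome_size * coverage / insert_size)

n_clones = clones_for_coverage(
    genome_size=1_700_000_000,  # 1.7 Gb of scaffolds
    insert_size=100_000,        # 100 kb average BAC insert
    coverage=10,
)
print(n_clones)  # 170000
```

Because the clone count scales inversely with insert size, a vector with larger inserts needs proportionally fewer clones for the same coverage.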
The human and mouse genomes are said to be finished, whereas all other vertebrate genomes currently in NCBI are said to be either high-quality draft sequences or low-coverage draft sequences. What criteria are used to declare a vertebrate genome "finished", and what is meant by "high-quality draft genome" and "low-coverage draft genome"?
Finished has as few gaps as possible by focused strategies, high quality of base calls q>40 (
The finished human genome still contains gaps. Give 4 different reasons for why there are still gaps
Repetitive regions which cannot be resolved by relatively short read sequences
Heterochromatic regions: hard to sequence
Multigene families that have a lot of structural similarities but polymorphisms between individual gene members
Structural variations: inversions, segmental duplications, insertions, deletions
Why are heterochromatin regions hard to sequence?
Constitutive heterochromatin is composed mainly of high copy number tandem repeats known as satellite repeats, minisatellite and microsatellite repeats, and transposon repeats
Describe 2 features that a finished genome provides that are lacking or clearly suboptimal in a high-quality draft genome.
A draft has many more gaps, less continuity, more errors in the order and orientation of contigs, more artifactual sequence duplications, and unresolved segmental duplications and structural variations. A finished genome provides a definitive reference for experimentation.
Describe 2 applications for which a low-coverage draft genome is useful.
SNP calling
Simple sequence motif matching