Chaudhuri Flashcards

1
Q

What is bioinformatics?

A
  • science of collecting and analysing complex biological data, eg. genetic codes
2
Q

What does bioinformatics exist at the interface of?

A
  • computing, biology and maths
3
Q

Why are bioinformatics skills so in demand?

A
  • seq data accum faster than ability to analyse it and even to store it
  • transferable and necessary
4
Q

How quickly has cost of sequencing decreased?

A
  • quicker than Moore’s Law (= computational power ≈ x2 every 18 months)
5
Q

What caused a large decrease in seq cost in 2008?

A
  • Illumina
6
Q

How is Illumina paired-end sequencing carried out? (overview)

A
  • in library prep, fragments of ≈500bp selected
  • bridge amplification results in clusters, each w/ many copies of both strands of fragment
  • sequencing reads gen separately using primers complementary to both adaptors
  • expect those read pairs to map 500bp apart on opp strands
  • 3rd primer used to seq index barcode present in 1 of adaptors to enable identification of sample
7
Q

What does homologous mean?

A
  • same relation, relative position or structure
  • for seqs = derived from common ancestor
8
Q

If 2 seqs have 12/16 of the same bases what can you say about identity and homologous?

A
  • 75% identity

- NOT 75% homologous –> either homologous or not
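
The 12/16 example can be checked with a few lines of Python. This is only a sketch: the two seqs are made up for illustration, and identity is taken here as matching columns divided by alignment length (other definitions exist).

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns where both seqs carry the same base."""
    if len(a) != len(b):
        raise ValueError("aligned seqs must be the same length")
    # a gap character ('-') never counts as a match
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

# hypothetical pair of seqs: first 12 bases identical, last 4 differ
seq1 = "ACGTACGTACGTACGT"
seq2 = "ACGTACGTACGTTGCA"
print(percent_identity(seq1, seq2))  # 12 of 16 columns match
```

The function reports identity only. Whether the seqs are homologous is a separate yes/no inference about common ancestry.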

9
Q

What does aligning 2 seqs tell you?

A
  • how many changes would be req to get those seqs, under assumption that aligned positions share common origin
10
Q

What does introducing gaps when aligning seqs allow?

A
  • max no. matches

- represents insertions/deletions = indels (as imposs to distinguish)

11
Q

What is seq alignment used for in bioinformatics?

A
  • identify homologous seqs w/ common ancestor
  • assess how similar homologous seqs are to infer evo relationships between groups of seqs
  • assemble short reads into contiguous seqs and ultimately seq entire chromosomes/genomes
  • map seq reads to reference genome
12
Q

What is the more likely option when deciding how to align seqs?

A
  • one explained by fewer evolutionary events
13
Q

How is seq alignment decided?

A
  • scoring system
  • matches assigned +ve score
  • mismatches/gaps assigned -ve score
  • sometimes 1 penalty for opening new gap and 2nd lower penalty for extending gap (so 1 bigger gap favoured over several small gaps)
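
A rough sketch of such a scoring system, applied to an alignment that has already been written out with '-' for gaps. The score values (+1, -1, -3 to open, -1 to extend) are arbitrary assumptions picked for illustration, not values from any particular program.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two already-aligned seqs (same length, '-' marks a gap)."""
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # first column of a gap pays the opening penalty, later columns the
            # lower extension penalty (simplification: a gap in one seq directly
            # followed by a gap in the other is treated as one gap)
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

one_long_gap = score_alignment("ACGTTTAC", "ACG---AC")
three_small_gaps = score_alignment("ACGTTTAC", "A-G-T-AC")
print(one_long_gap, three_small_gaps)
```

Both alignments have 5 matches and 3 gap columns, but the single long gap scores higher because it pays the opening penalty only once.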
14
Q

What are scoring matrices and why are they used?

A
  • for nt alignment mismatches usually all treated the same
  • for AAs, scoring matrix used so biochemically conservative AA subs penalised less than subs likely to affect protein structure
  • eg. BLOSUM62, PAM70
  • constructed empirically by examining freq of each AA sub across large collection of protein alignments
  • eg. Ile for Leu is conservative sub, so scores +ve (almost like match)
  • eg. Trp v unique, so subs of Trp heavily penalised
15
Q

What is global alignment, and when is it suitable?

A
  • attempt made to align seqs across entire length
  • assumes seqs equivalent
  • not suitable for aligning full length seq w/ partial seq
  • 1st global alignment algorithm proposed by Needleman and Wunsch
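
A minimal sketch of Needleman-Wunsch style dynamic programming, returning only the best global score. It assumes a simple linear gap penalty and made-up score values; real tools also trace back through the matrix to recover the alignment itself.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # global alignment: leading gaps are penalised, so fill first row/column
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # score of aligning the two seqs across their entire length
    return F[-1][-1]

print(needleman_wunsch("ACGT", "AGT"))
```

Because every position of both seqs must be placed, aligning a full length seq against a short partial seq is dominated by gap penalties.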
16
Q

What is local alignment?

A
  • searches subsequences of full length seq to max alignment score
  • 1st algorithm by Smith and Waterman
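
A minimal sketch of the Smith-Waterman idea, again returning only the score. Score values and the linear gap penalty are assumptions for illustration. Compared with global alignment, cells are never allowed to drop below 0, and the answer is the best cell anywhere in the matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between any subsequences of a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # the 0 lets an alignment restart anywhere instead of carrying
            # a negative score forward
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# hypothetical seqs that share only the subsequence ACGT
print(smith_waterman("AAAACGTAAA", "CCCACGTCCC"))
```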
17
Q

How does BLAST work?

A
  • widely used method of searching database to rapidly identify seqs similar to query seqs
  • user supplies query seq and BLAST searches for similar seqs
  • performs local alignment to identify regions of hit that match query seq
18
Q

What is the output of BLAST?

A
  • E value = P value normalised to database size and length of query seq
  • effectively no. hits expected to be found by chance in this database
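
As a rough illustration of how database size and query length enter the E value, the sketch below uses the standard bit-score relation E = m x n x 2^(-S'), where m is query length, n is total database length and S' is the bit score. The function name and the numbers are hypothetical, and BLAST itself uses effective lengths, so its reported values will differ somewhat.

```python
def expected_chance_hits(query_len: int, db_len: int, bit_score: float) -> float:
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# hypothetical search: 100-residue query, 1,000,000-residue database, hit of 30 bits
small_db = expected_chance_hits(100, 1_000_000, 30)
large_db = expected_chance_hits(100, 2_000_000, 30)
print(small_db, large_db)
```

The same hit gets twice the E value in a database twice the size, which is why E values are only comparable within one database.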
19
Q

What are the difficulties w/ de novo assembly?

A
  • unknown target
  • coverage bias
  • sequencing errors
  • repeats
  • multiple replicons
  • contamination
  • circular genomes
20
Q

Why is genome assembly difficult w/ short reads?

A
  • resolving repeats esp hard –> paired end reads can help, but only for repeats smaller than insert size
21
Q

What is overlap layout consensus seq, and when would it work?

A
  • looks for overlaps between adj reads
  • would work well if genomes non repetitive and seq error free
  • repeats can result in mis-assembly errors
22
Q

What are de Bruijn graphs?

A
  • common approach to assembling short reads, to take account of seq errors and repeats in genome
  • break read up into Kmers
  • K = no. of bases, usually 51/99 (usually odd no.)
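
A toy sketch of breaking reads into K-mers and linking them into a de Bruijn graph. It uses K = 3 on made-up reads purely so the output is readable (real assemblers use much larger K, as noted above), and it ignores reverse complements, which a real assembler would have to handle.

```python
from collections import defaultdict

def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each K-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# two hypothetical overlapping reads from the seq ACGTA
graph = de_bruijn(["ACGT", "CGTA"], 3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Following the edges AC -> CG -> GT -> TA spells out the original seq, without ever comparing the reads to each other directly.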
23
Q

What are the advs of de Bruijn graphs?

A
  • stops assembly errors as allow repeats to be identified
  • each K-mer from unique seq present once in genome, so expect similar coverage (eg. at least 30x) for each Kmer and even more for Kmers from repeat seq
  • Kmers only need to be stored in memory once so less RAM needed
  • removal of rare Kmers corrects for seq errors
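
The rare K-mer idea can be sketched by counting K-mers across all reads and keeping only those seen often enough. The reads, K = 4 and the cut-off of 2 are assumptions chosen for illustration; real data would use higher coverage and a cut-off picked from the K-mer count distribution.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every K-mer across all reads; each distinct K-mer is stored once."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_count):
    """K-mers seen at least min_count times; rarer ones are treated as seq errors."""
    return {kmer for kmer, n in counts.items() if n >= min_count}

# 5 hypothetical error-free reads plus 1 read with a T -> A error at position 4
reads = ["ACGTAC"] * 5 + ["ACGAAC"]
counts = count_kmers(reads, 4)
print(sorted(solid_kmers(counts, 2)))
```

The single error creates 3 K-mers that each appear once, so they fall below the cut-off and drop out of the graph.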
24
Q

How can bubbles be resolved using read pairs in de Bruijn graphs?

A
  • read pairs can provide info which spans repeat seqs, helping resolve order of contigs and close the assembly
  • resolving bubbles is 1 of key functions of genome assembly software
  • if can’t be resolved, results in break in assembly
  • as reads get longer, graph gets simpler
25
Q

How many reading frames are there for DNA?

A
  • 6
  • triplet genetic code, so 3 distinct ways DNA strand can encode a protein and 3 more in reverse direction on complementary strand
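
The 6 frames can be listed explicitly with a short sketch. The frame labels (+1 to +3, -1 to -3) and the example seq are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Split a DNA seq into codons for 3 forward and 3 reverse frames."""
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            sub = strand[offset:]
            # keep complete codons only
            codons = [sub[i:i + 3] for i in range(0, len(sub) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames

for name, codons in six_frames("ATGAAATAG").items():
    print(name, codons)
```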
26
Q

How can you identify gene by looking at ORFs?

A
  • longest ORF likely to be gene
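
A naive sketch of a longest-ORF search over both strands. It assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon and ignores everything else a real gene finder would use (alternative start codons, ribosome binding sites, codon bias). The example seq is made up.

```python
STOPS = {"TAA", "TAG", "TGA"}

def orfs_in_strand(seq):
    """All ATG...stop ORFs in the 3 frames of one strand."""
    found = []
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                found.append(seq[start:i + 3])
                start = None
    return found

def longest_orf(seq):
    """Longest ORF on either strand, or '' if none is found."""
    rc = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return max(orfs_in_strand(seq) + orfs_in_strand(rc), key=len, default="")

# hypothetical seq with a 15 bp ORF and a shorter 6 bp ORF in another frame
print(longest_orf("CCATGAAATTTGGGTAACCATGTAGCC"))
```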
27
Q

The presence of what can identify large genes in bacteria/archaea?

A
  • long ORFs
  • AUG start codon
  • Shine-Dalgarno site
  • Pribnow box
  • characteristic base composition due to biases in codon usage
28
Q

What are strategies for finding genes in euks?

A
  • introns so harder
  • algorithms exist but are hit and miss
  • look for
  • -> Kozak seq
  • -> euk terminator consensus
  • -> polyadenylation signal
  • -> splice donor and acceptor sites can indicate intron presence
  • RNA seq data can reveal which regions present in mature RNAs so assist w/ identification of genes and introns
29
Q

How can genomes be annotated?

A
  • used to be manual
  • pipelines, eg. Prokka and MAKER, provide automated annotation
  • apply no. programs to predict positions of protein coding genes, tRNA and rRNA genes
  • also use BLAST to identify homologues from which functional annotation can be transferred
30
Q

Why is resequencing useful?

A
  • simpler to reseq from species already seq than seq new genome
  • investigate genomic variation w/in pop of species
  • understanding single gene, complex disorders and cancer
  • identifying variants for diagnosis –> may allow personalised medicine or genome editing based cures in future
  • relied on by functional genomic techs, like RNA seq and ChIP seq
31
Q

How are reads mapped to a reference genome?

A
  • each read compared w/ sorted index derived from reference genome seq to identify short identical matches (“seed seqs”)
  • don’t use whole read as seed seq, as looking for differences, only small chunks will match
  • alignment from all seed matches extended to inc rest of read
  • alignment scoring system used to identify best mapping position, accounting for no. matches, mismatches and base qualities (mismatches at low quality bases penalised less than high quality mismatches)
  • each mapped read given mapping score, indicating confidence that read is derived from that position in genome (uniquely mapped reads have high score and ambiguously mapped have low score)
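
The seed step can be sketched with a plain dictionary of K-mer positions. This is a swapped-in simplification: mappers such as BWA and Bowtie2 use compressed indexes rather than a hash table, and they go on to extend and score each candidate. Seed length 4 and the seqs are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_index(ref, k):
    """Map every K-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def candidate_positions(read, index, k):
    """Vote for read start positions using exact matches of short seeds."""
    votes = Counter()
    # non-overlapping seeds: a mismatch spoils 1 seed, the others still match
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return votes

ref = "GATTACAGGCTTAACCGT"
read = "ACAGGCTA"  # ref[4:12] with the last base changed (T -> A)
votes = candidate_positions(read, build_index(ref, 4), 4)
print(votes.most_common(1))
```

The whole read doesn't match the reference exactly, but 1 of its 2 seeds does, which is enough to propose position 4 for extension.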
32
Q

What are most common read mapping softwares?

A
  • BWA and Bowtie2
33
Q

How are mapped reads stored?

A
  • usually BAM file

- contains details of which position on which chromosome mapped to and how good alignment is

34
Q

What is the depth of coverage?

A
  • no. reads which overlap particular position
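
A small sketch of computing depth from mapped read intervals. It assumes reads are given as 0-based (start, end) pairs with the end excluded; the coordinates are made up.

```python
def coverage_depth(reads, genome_length):
    """Number of reads overlapping each position of the genome."""
    depth = [0] * genome_length
    for start, end in reads:
        # clip reads that run off either end of the reference
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth

# 3 hypothetical reads mapped to a 10 bp reference
depth = coverage_depth([(0, 5), (3, 8), (4, 10)], 10)
print(depth)
```

Plotting this list as a bar chart gives the kind of pile up plot described on the next card.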
35
Q

What is a pile up plot?

A
  • bar chart showing variation in coverage depth
36
Q

How can ambiguities be solved by paired reads?

A
  • reads in pair mapped independently, but if 1 maps to multiple positions, position of other can be used to determine correct position
  • if both reads w/in repeat, then mapping ambiguous (then 1 usually chosen and given low mapping score)
  • for some apps, low mapping score reads excluded
37
Q

How can mapped reads be visualised?

A
  • look at overview of whole chromosome, w/ focused region highlighted and ruler to show genomic coords
  • pile up plot of read depth at each position
  • positions w/ mapped reads highlighted
  • can also show equivalent info from 2nd biological sample
38
Q

Why do you need more than 1 read for each base?

A
  • if only 1x coverage would be imposs to distinguish errors from real diffs between seq sample and ref genome
  • so seq each base many times
39
Q

What assumption does having higher coverage make, and is this true?

A
  • errors random

- broadly but not always

40
Q

What does variant detection software do?

A
  • uses probabilistic models to distinguish errors from homozygous and heterozygous SNPs
41
Q

How are SNPs identified by IGV?

A
  • coloured lines mean varies from ref genome
  • homozygous SNPs present in all reads, seq errors only in 1
  • heterozygous SNPs ≈ half of reads match ref and ≈ half show SNP
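
The pattern above can be turned into a crude classifier for a single position. This is a deliberately naive sketch: the cut-offs (90%, 30%, minimum depth 10) are arbitrary assumptions, whereas real variant callers use probabilistic models and base qualities.

```python
from collections import Counter

def classify_site(ref_base, read_bases, min_depth=10):
    """Rough call for 1 position from the bases seen in the reads covering it."""
    depth = len(read_bases)
    if depth < min_depth:
        return "low coverage"
    alt_counts = Counter(b for b in read_bases if b != ref_base)
    if not alt_counts:
        return "matches ref"
    alt, n = alt_counts.most_common(1)[0]
    frac = n / depth
    if frac >= 0.9:  # nearly all reads differ from ref
        return f"homozygous SNP {ref_base}>{alt}"
    if frac >= 0.3:  # roughly half the reads differ
        return f"heterozygous SNP {ref_base}>{alt}"
    return "probable seq error"  # only the odd read differs

print(classify_site("A", "G" * 20))
print(classify_site("A", "A" * 10 + "G" * 10))
print(classify_site("A", "A" * 19 + "G"))
```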
42
Q

How are SNP effects annotated?

A
  • programs, such as SnpEff, predict SNP effects

- eg. intergenic, intron, regulatory, synonymous (doesn’t change encoded protein), non synonymous

43
Q

What are the poss sources of errors in variant callers?

A
  • random or systematic seq errors
  • mapping errors (mapper placed read in incorrect position)
  • sample contamination (contains DNA from another source w/ diff seq)
  • seq contamination (reads from another sample mislabelled)
  • errors or omissions in ref genome
44
Q

What structural variants are poss as result of larger scale chromosomal rearrangements?

A
  • insertions
  • deletions
  • duplications
  • inversions
  • translocations
45
Q

What are the diff methods of detecting structural variants?

A
  • using coverage depth
  • -> duplication where coverage depth higher and deletion where no reads
  • using read pairs
  • -> no structural variation if map to same place on ref and sample
  • -> deletion if further apart on ref
  • -> mobile element insertion if only one maps to ref
  • using split reads (for deletions)
  • -> look for indiv reads that overlap deletion
  • -> can see if read matches 2 positions, so see where gap is
  • using assembly
  • -> might get read w/ 1 end matching ref genome
  • -> then look for overlapping read pairs to assemble and see what insertion could be
46
Q

What are genome wide association studies?

A
  • simplest form of functional genomics
  • identifies regions of genome pot assoc w/ particular phenotype by statistical assoc
  • look for variants over represented
  • datasets often shown in Manhattan plot
  • can be misleading, correlation NOT causation and further studies req to establish mol mechanism underlying phenotype
47
Q

What does transcriptomics tell us?

A
  • tells us which genes expressed under particular conditions
48
Q

How is transcriptomics done?

A
  • traditionally done gene by gene, using N blotting or RT-qPCR
  • N blots use radiolabeled probes to detect specific seqs from whole RNA cell extracts
  • RT-qPCR uses fluorescently labelled primers to quantify levels of specific transcript during PCR amplification
49
Q

What are DNA microarrays used for?

A
  • measuring transcrip levels for whole genome
50
Q

How are DNA microarrays carried out?

A
  • like reverse N blot
  • can be rep 100s times on slide w/ ordered array of diff probes at known positions on slide surface
  • probes can be designed for every gene, so poss to measure global gene expression of sample
  • fluorescently labelled RNA sample added to surface of array
  • if transcript complementary to particular probe, will hybridise and spot lights up
  • measure fluorescence level of particular spot to assess abundance of transcript in sample
51
Q

How are experiments designed using high throughput methods (eg. microarrays, RNA-seq)?

A
  • same principles as any other
  • techs more expensive, so mistakes cost more, temptation to cut corners
  • as always approp controls and rep essential
  • simplest case is comparing 2 conditions (eg. treatment and control)
  • can be done on 2 separate microarrays or use 2 colour microarrays, to allow direct comparison
  • more complex designs such as time courses also poss
  • can use microarrays in which red spot indicates transcript more abundant in sample, green indicates less abundant in sample and yellow indicates similar levels
52
Q

What is the output of a transcriptomic experiment?

A
  • common to focus on top few upreg and downreg genes by choosing arbitrary fold change cut off
53
Q

What are the limitations of microarrays?

A
  • low resolution sequencing tech
  • if get signal for particular probe, know that seq present in sample
  • don’t usually know if that is exact seq
  • don’t know if any seqs present not covered by microarray probes
54
Q

How is RNA-seq now carried out?

A
  • reverse transcrip, fragmentation and amplification to make cDNA library
  • high throughput seq
  • map reads to ref genome
55
Q

How do RNA-seq and microarrays compare?

A
  • both well dev w/ min technical variation
  • RNA-seq has larger dynamic range (greater ability to distinguish diff levels of expression)
  • microarrays only give info for pre selected regions of genome, RNA-seq genome wide and can detect novel transcripts
  • microarrays can have dye bias effects (diff intensity for diff colour dyes)
  • RNA-seq allows detection of diff from ref genome
  • RNA-seq can be done w/o ref genome
56
Q

What is mRNA splicing?

A
  • introns spliced out of pre-mRNA to get exonic seq in mRNA
57
Q

How are reads mapped in RNA-seq?

A
  • map to ref genome to quantify expression of each gene
  • if bacterial genome, can be done using standard read-mapping software
  • not for euk, due to mRNA splicing –> need splice aware read mapper, eg. TopHat
58
Q

How is de novo transcriptome assembly carried out?

A
  • poss in absence of ref genome
  • similar to genome assembly but more complex, as not all transcripts present at same level and some genes may prod multiple diff transcripts
  • most popular software package is Trinity
59
Q

How are experiments designed for transcriptomic studies?

A
  • rep essential, typically 3x biological replicates
  • important to isolate variable of interest –> not diff person doing all WT samples and someone else mutant, as could be diffs in procedure
60
Q

How do you test for differential expression?

A
  • ratio of sample to control signal = fold change
  • usually expressed as log2(fold change) or logFC (fold change of 1/2 = 1/2 as much in mutant as WT)
  • adv is that it’s symmetrical –> so genes upreg or downreg 2 fold have ratio of 2 and 1/2, respectively, but log2(fold change) of +1 and -1
  • interested if logFC signif diff from 0 –> t test
  • usually relatively few replicates, so t test lacks power
  • modern analysis programs, eg. limma (microarray) and DESeq2 (RNA-seq) solve this by taking adv of large no. parallel experiments
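
The symmetry of log2(fold change) is easy to see with made-up expression values. This sketch covers only the fold change itself, not the statistical testing done by limma or DESeq2.

```python
import math

def log2_fold_change(sample: float, control: float) -> float:
    """log2 of the sample/control expression ratio."""
    return math.log2(sample / control)

# hypothetical signal values, with control (WT) = 100 in each case
print(log2_fold_change(200, 100))  # 2 fold upreg
print(log2_fold_change(50, 100))   # 2 fold downreg
print(log2_fold_change(100, 100))  # no change
```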
61
Q

What is the diff between technical and biological replicates?

A
  • technical = doing experiment once and arraying/seq extracted RNA multiple times (tests reproducibility of techniques, not important now, as know good)
  • biological = doing experiment multiple times
62
Q

Why do we need to take multiple tests into account?

A
  • if use signif threshold of 5%, expect to see signif effects in 5% of tests by chance
  • can lead to many false +ves when performing 1000s tests in parallel
63
Q

How are multiple tests taken into account?

A
  • usually “false discovery rate” (FDR) adjustment made to P values, so that expected % of false +ves among results called signif is held at chosen cut off
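
One widely used FDR adjustment is the Benjamini-Hochberg procedure. The card doesn't name a specific method, so treating it as the method here is an assumption. A minimal sketch with made-up P values:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from largest to smallest P value so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# 4 hypothetical tests, all below an unadjusted 5% threshold
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Each P value is scaled by (no. tests / rank), so the same raw P value becomes less convincing as more tests are run in parallel.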
64
Q

What is DNA methylation?

A
  • mostly involves add of methyl group to C5 position of cytosine
  • cat by DNA methyltransferases
  • eg. of epigenetics
65
Q

Why is DNA methylation an important epigenetic mod in higher euks?

A

Involved in many processes:

  • reg of gene expression
  • imprinting
  • X chromosome inactivation
  • silencing of germline specific genes and repeat genes
66
Q

What is DNA methylation used for in bacteria?

A
  • distinguish self DNA from non-self
  • non-self can be digested w/ REs, acting as immune system
  • also important role in controlling bacterial DNA rep, limiting it to single rep per cell cycle
67
Q

What are the 3 poss contexts for methylation, and why is this important?

A
  • downstream context of cytosine critical for determining its methylation status
  • CpG (C adj to G)
  • CHG (C, any base but G, G)
  • CHH (C, 2x not G bases)
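
The 3 contexts depend only on the 2 bases downstream of the cytosine, which a short sketch can make concrete. The function name and the "unknown" label for a cytosine too close to the end of the seq are assumptions.

```python
def methylation_context(seq: str, i: int) -> str:
    """Classify the cytosine at position i as CpG, CHG or CHH (H = not G)."""
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    downstream = seq[i + 1:i + 3]
    if downstream[:1] == "G":
        return "CpG"
    if len(downstream) < 2:
        return "unknown"  # not enough downstream seq to decide
    return "CHG" if downstream[1] == "G" else "CHH"

for s in ("CGA", "CAG", "CAT"):
    print(s, methylation_context(s, 0))
```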
68
Q

Why are CpG dinucleotides underrepresented in genomes?

A
  • methylated cytosine easily deaminates to thymine (unmethylated cytosine deaminates to uracil, which is recognised and repaired), so CpG tends to mutate to TpG over time
69
Q

What is bisulphite sequencing?

A
  • bisulphite treatment converts unmethylated cytosine to uracil
  • methylated cytosines protected and not converted, so detected
  • can be targeted to CpG islands
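
The logic of reading methylation from bisulphite data can be sketched on 1 strand. The sketch assumes complete conversion and no seq errors, and uses the fact that uracil is read as T after amplification. Seq and methylated position are made up.

```python
def bisulphite_convert(seq, methylated_positions):
    """Simulate treatment: unmethylated C is converted and ends up read as T."""
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")  # C -> U, which is seq as T after amplification
        else:
            out.append(base)  # methylated C (and all other bases) unchanged
    return "".join(out)

def call_methylation(ref, converted):
    """Positions where a ref C is still read as C were protected by methylation."""
    return {i for i, (r, c) in enumerate(zip(ref, converted)) if r == "C" and c == "C"}

ref = "ACGTCCGA"
converted = bisulphite_convert(ref, {1})  # hypothetical: only the C at position 1 methylated
print(converted, call_methylation(ref, converted))
```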
70
Q

What is reduced representation bisulphite sequencing?

A
  • targets BS-seq analysis to regions of genome likely to have high CpG content
  • allows us to make most of sequencing run, particularly using lower-yield sequencing platforms
  • exploits REs w/ recognition site containing CpG
71
Q

How is direct methylation sequencing used in PacBio?

A
  • SMRT-seq allows methylated bases to be distinguished, as their presence delays progress of pol
72
Q

How is direct methylation sequencing used in Oxford Nanopore?

A
  • directly detects disruption in electrical current caused by base passing through pore in membrane
  • methylated bases give distinct signal from unmethylated
73
Q

What is ChIP used for?

A
  • isolate DNA bound by specific protein
74
Q

How is ChIP carried out?

A
  • proteins covalently crosslinked to DNA by treating w/ formaldehyde (get protein physically attached to DNA)
  • chromatin sheared by sonication or using endonuclease –> use of exonuclease allows bound DNA to be trimmed to binding site
  • immunoprecipitation and purification of bound DNA using antibody specific to protein of interest
75
Q

What is ChIP-chip (ChIP on chip), and how is it carried out?

A
  • involves identification of ChIP purified DNA using microarray
  • usually tiling microarray, w/ probes designed at regular intervals across region or whole genome
76
Q

What is ChIP-seq, and how is it carried out?

A
  • DNA purified from ChIP can be identified using Illumina
  • reads mapped to ref genome and binding sites identified as peaks in signal
  • 5’ –> 3’ exonuclease used to trim DNA before binding site
  • means offset between reads on forward and reverse strand, allows exact boundaries of binding site to be determined
  • binding site is overlap between forward and reverse peaks
77
Q

What is chromosome conformation capture (3C)?

A
  • 3C similar to ChIP-seq, but cross-links remote regions of DNA instead of DNA and protein
  • allows investigation of long range interactions between diff genomic regions, such as interaction of enhancer elements w/ target gene
78
Q

How did 3C develop further?

A
  • 3C enhanced by self-circularisation (4C) = seq info only req from 1 of interacting loci (need to know seqs of regions of gene interested in)
  • chromosome conformation capture carbon copy (5C) = allows massively parallel analysis of ligation junction through incorp of universal primer seq
  • Hi-C = uses biotin capture of ligation junctions followed by high throughput seq
79
Q

What are the diff types of chromosome conformation capture?

A
  • 3C = 1 to 1
  • 4C = 1 to all
  • 5C = many to many
  • Hi-C = all to all
80
Q

What was the aim of the ENCODE project?

A
  • aimed at identifying all functional elements in human genome
81
Q

What were the findings of ENCODE project?

A
  • ≈80% of human genome has some biochemical function

- controversial as liberal definition of function

82
Q

What are CLIP-seq and RIP-seq similar to?

A
  • essentially same as ChIP, but w/ RNA
83
Q

What are transposon mutant screens used for?

A
  • identify genetic changes assoc w/ particular phenotype

- esp in bacteria, common to exploit transposons to gen random insertion mutations

84
Q

What are custom transposons, and how are they used?

A
  • if transposase gene removed, transposon can still move if transposase supplied, but stable otherwise
  • inc antibiotic resistance gene allows mutants to be selected
  • poss to insert at random into target genome by supplying transposase
  • once transposase removed, mutant strains will harbour stable antibiotic resistance gene at random position w/in genome
  • if transposon inserts into gene, that gene will be inactivated
  • -> if gene essential, mutant won’t survive
  • -> if make millions of mutants, transposons found in every gene poss to disrupt
  • -> so genes w/o insertions likely to be essential
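
The final inference can be sketched as a simple lookup of insertion sites against gene coordinates. Gene names, coordinates and insertion sites below are hypothetical, and a real analysis would also account for gene length and insertion density before calling a gene essential.

```python
def genes_without_insertions(genes, insertion_sites):
    """Genes (name -> (start, end), end excluded) containing no insertion site."""
    candidates = []
    for name, (start, end) in genes.items():
        if not any(start <= site < end for site in insertion_sites):
            candidates.append(name)
    return candidates

# hypothetical annotation and mapped transposon insertion sites
genes = {"geneA": (0, 100), "geneB": (100, 250), "geneC": (250, 400)}
insertion_sites = [10, 55, 300, 399]
print(genes_without_insertions(genes, insertion_sites))
```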
85
Q

What is transposon directed insertion site sequencing (TraDIS)?

A
  • primers recognise inverted repeat seq and seq outwards into flanking chromosomal DNA
86
Q

What happens when there are insertions at 3’ end?

A
  • in C. difficile polA gene
  • -> insertions not found w/in 5’ –> 3’ exonuclease domain but tolerated w/in rest of gene
  • TraDIS can provide info about essential regions w/in gene
87
Q

How does TraDIS work?

A
  • get input pool of random transposon mutants
  • inoculate
  • put through screen to recover = output pool
  • get TraDIS data, showing which genes essential