Chaudhuri Flashcards
What is bioinformatics?
- science of collecting and analysing complex biological data, eg. genetic codes
What does bioinformatics exist at the interface of?
- computing, biology and maths
Why are bioinformatics skills so in demand?
- seq data accum faster than ability to analyse it and even to store it
- transferable and necessary
How quickly has cost of sequencing decreased?
- quicker than Moore’s Law (= computational power ≈ x2 every 18 months)
What caused a large decrease in seq cost in 2008?
- Illumina
How is Illumina paired-end sequencing carried out? (overview)
- in library prep, fragments of ≈500bp selected
- bridge amplification results in clusters, each w/ many copies of both strands of fragment
- sequencing reads gen separately using primers complementary to both adaptors
- expect those read pairs to map 500bp apart on opp strands
- 3rd primer used to seq index barcode present in 1 of adaptors to enable identification of sample
What does homologous mean?
- same relation, relative position or structure
If 2 seqs have 12/16 of the same bases what can you say about identity and homologous?
- 75% identity
- NOT 75% homologous –> either homologous or not
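The 12/16 arithmetic can be checked with a minimal sketch. The function name and the two 16-base seqs are made up for illustration, and the seqs are assumed to be already aligned to the same length.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions where both seqs carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("seqs must be aligned to the same length")
    # A gap character never counts as an identical position.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)


# Hypothetical seqs: 16 positions, 4 of them differ, so 12/16 match.
seq_a = "ACGTACGTACGTACGT"
seq_b = "ACGTACGAACGTTCCA"
print(percent_identity(seq_a, seq_b))
```

The number reported is identity only; whether the seqs are homologous is a separate yes/no conclusion.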
What does aligning 2 seqs tell you?
- how many changes would be req to get those seqs, under assumption that aligned positions share common origin
What does introducing gaps when aligning seqs allow?
- max no. matches
- represents insertions/deletions = indels (impossible to distinguish)
What is seq alignment used for in bioinformatics?
- identify homologous seqs w/ common ancestor
- assess how similar homologous seqs are to infer evo relationships between groups of seqs
- assemble short reads into contiguous seqs and ultimately seq entire chromosomes/genomes
- map seq reads to reference genome
What is the more likely option when deciding how to align seqs?
- one explained by fewest evolutionary events
How is seq alignment decided?
- scoring system
- matches assigned +ve score
- mismatches/gaps assigned -ve score
- sometimes 1 penalty for opening new gap and 2nd lower penalty for extending gap (as 1 bigger gap favoured over several small gaps)
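A minimal sketch of such a scoring system, assuming illustrative values (match +1, mismatch -1, gap open -3, gap extend -1) that are not taken from any particular tool. It scores 2 already-aligned rows and shows why 1 long gap beats 2 short ones.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two aligned rows of equal length; '-' marks a gap."""
    score = 0
    prev_gap = None  # which row held a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        gap = "a" if a == "-" else "b" if b == "-" else None
        if gap is not None:
            # Continuing a gap in the same row is cheaper than opening a new one.
            score += gap_extend if gap == prev_gap else gap_open
        else:
            score += match if a == b else mismatch
        prev_gap = gap
    return score


one_long_gap = score_alignment("ACGTACGT", "AC--ACGT")    # 6 matches, 1 open, 1 extend
two_short_gaps = score_alignment("ACGTACGT", "AC-TA-GT")  # 6 matches, 2 opens
print(one_long_gap, two_short_gaps)
```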
What are scoring matrices and why are they used?
- for nt alignment mismatches usually all treated the same
- for AAs, scoring matrix used so biochemically conservative AA subs penalised less than subs likely to affect protein structure
- eg. BLOSUM62, PAM70
- constructed empirically by examining freq of each AA sub across large collection of protein alignments
- eg. Ile for Leu scores +ve (conservative sub, treated much like a match)
- eg. Trp v unique so most subs heavily penalised
What is global alignment, and when is it suitable?
- attempt made to align seqs across entire length
- assumes seqs equivalent
- not suitable for aligning full length seq w/ partial seq
- 1st global alignment algorithm proposed by Needleman and Wunsch
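A minimal sketch of the dynamic programming idea behind global alignment, in the style of Needleman and Wunsch. It returns only the best score (no traceback), and the scores (match +1, mismatch -1, single linear gap penalty -2) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b, aligned end to end."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Because both seqs must be used in full, a partial seq against a full length seq pays gap penalties for every unaligned base.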
What is local alignment?
- searches subsequences of full length seq to max alignment score
- 1st algorithm by Smith and Waterman
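A minimal sketch of local alignment scoring in the style of Smith and Waterman. The only change from a global recurrence is that a cell can never fall below 0, so an alignment can start and stop anywhere. The scores (match +2, mismatch -1, gap -2) and the example seqs are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between any subsequences of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at this cell.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Only the shared "ACGT" core aligns; the flanking bases are ignored.
print(smith_waterman_score("TTTACGTGGG", "CCACGTAA"))
```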
How does BLAST work?
- widely used method of searching database to rapidly identify seqs similar to query seqs
- user supplies query seq and BLAST searches for similar seqs
- performs local alignment to identify regions of hit that match query seq
What is the output of BLAST?
- E value = P value normalised to database size and length of query seq
- effectively no. hits expected to be found by chance in this database
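A rough sketch of how an E value scales with search space. It uses the standard bit-score form E = m x n x 2^(-S), where m is query length, n is total database length and S is the bit score; this formula is not given in these notes, and the sketch ignores the effective-length corrections that real BLAST applies.

```python
import math


def expected_hits(bit_score, query_len, db_len):
    """E value: hits with at least this bit score expected by chance."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit, given the E value."""
    return 1.0 - math.exp(-e_value)


# Same alignment score, database twice as large: E value doubles.
print(expected_hits(10, 100, 1024), expected_hits(10, 100, 2048))
```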
What are the difficulties w/ de novo assembly?
- unknown target
- coverage bias
- sequencing errors
- repeats
- multiple replicons
- contamination
- circular genomes
Why is genome assembly difficult w/ short reads?
- resolving repeats esp hard –> paired end reads can help, but only for repeats smaller than insert size
What is overlap layout consensus assembly, and when would it work?
- looks for overlaps between adj reads
- would work well if genomes non repetitive and seq error free
- repeats can result in mis-assembly errors
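A minimal sketch of the overlap step, assuming error-free reads and exact suffix/prefix matches. The function names, the minimum overlap of 3 and the toy reads are all made up for illustration; real assemblers tolerate mismatches and use far longer overlaps.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads over their overlap, or return None if they do not overlap."""
    n = overlap_length(left, right, min_overlap)
    return left + right[n:] if n else None


print(merge_reads("ACGTTG", "TTGCCA"))
```

If the overlapping bases come from a repeat, the same suffix matches reads from several places in the genome, which is how mis-assemblies arise.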
What are de Bruijn graphs?
- common approach to assembling short reads, to take account of seq errors and repeats in genome
- break read up into Kmers
- K = no. of bases, usually 51/99 (usually odd no.)
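A minimal sketch of breaking reads into Kmers and linking them into a graph. Here each node is a (K-1)-mer and each Kmer is an edge from its prefix to its suffix, which is one common way to build a de Bruijn graph; K = 3 is used only so the toy example stays readable.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


print(kmers("ACGTA", 3))
print(dict(de_bruijn_edges(["ACGTA"], 3)))
```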
What are the advs of de Bruijn graphs?
- stops assembly errors as allow repeats to be identified
- each Kmer expected once in unique seq, so expect at least 30x coverage for each Kmer and even more for Kmers from repeat seqs
- Kmers only need to be stored in memory once so less RAM needed
- removal of rare Kmers corrects for seq errors
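A minimal sketch of storing each Kmer once with a count, then dropping rare Kmers. The toy reads, K = 4 and the cutoff of 2 are assumptions for illustration; a single read with 1 wrong base produces Kmers seen only once, which the cutoff removes.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each Kmer occurs; each distinct Kmer is stored once."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, min_count):
    """Keep only Kmers seen at least min_count times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# 5 correct copies of a read plus 1 copy with a seq error (T read as A).
reads = ["ACGTAC"] * 5 + ["ACGAAC"]
kept = drop_rare_kmers(count_kmers(reads, 4), min_count=2)
print(kept)
```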
How can bubbles be resolved using read pairs in de Bruijn graphs?
- read pairs can provide info which spans repeat seqs, helping resolve order of contigs and close the assembly
- resolving bubbles is 1 of key functions of genome assembly software
- if can’t be resolved, results in break in assembly
- as reads get longer, graph gets simpler
How many reading frames are there for DNA?
- 6
- triplet genetic code, so 3 distinct ways DNA strand can encode a protein and 3 more in reverse direction on complementary strand
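A minimal sketch listing the codons in all 6 reading frames: 3 offsets on the given strand and 3 on its reverse complement. The function names and the 9-base example are made up for illustration, and the input is assumed to contain only A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complementary strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Codon lists for frames 0-2 on the forward strand, then 0-2 on the reverse."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            seq = strand[offset:]
            # Step in threes and drop any incomplete codon at the end.
            frames.append([seq[i:i + 3] for i in range(0, len(seq) - 2, 3)])
    return frames


for frame in six_frames("ATGAAATAG"):
    print(frame)
```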
How can you identify gene by looking at ORFs?
- longest ORF likely to be gene
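A minimal sketch of a longest-ORF search across all 6 frames, assuming an ORF runs from an ATG to the first in-frame stop codon (TAA, TAG or TGA). The function names and the toy seq are hypothetical, and the longest ORF is only a candidate gene, not proof of one.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def longest_orf(dna):
    """Longest ATG...stop stretch found in any of the 6 reading frames."""
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            i = offset
            while i + 3 <= len(strand):
                if strand[i:i + 3] != "ATG":
                    i += 3
                    continue
                # Walk codon by codon until a stop codon or the end of the strand.
                j = i
                while j + 3 <= len(strand) and strand[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(strand):  # a stop codon was reached
                    orf = strand[i:j + 3]
                    if len(orf) > len(best):
                        best = orf
                i = j + 3
    return best


print(longest_orf("CCATGAAATTTTAGCC"))
```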
The presence of what can identify large genes in bacteria/archaea?
- long ORFs
- AUG start codon
- Shine-Dalgarno site
- Pribnow box
- characteristic base composition due to biases in codon usage
What are strategies for finding genes in euks?
- introns so harder
- algorithms exist but are hit and miss
- look for
- -> Kozak seq
- -> euk terminator consensus
- -> polyadenylation (polyA) signal
- -> splice donor and acceptor sites can indicate intron presence
- RNA seq data can reveal which regions present in mature RNAs so assist w/ identification of genes and introns
How can genomes be annotated?
- used to be manual
- pipelines, eg. Prokka and MAKER, provide automated annotation
- apply no. programs to predict positions of protein coding genes, tRNA and rRNA genes
- also use BLAST to identify homologues from which functional annotation can be transferred
Why is resequencing useful?
- simpler to reseq from species already seq than seq new genome
- investigate genomic variation w/in pop of species
- understanding single gene, complex disorders and cancer
- identifying variants for diagnosis –> may allow personalised medicine or genome editing based cures in future
- relied on by functional genomic techs, like RNA seq and ChIP seq
How are reads mapped to a reference genome?
- each read compared w/ sorted index derived from reference genome seq to identify short identical matches (“seed seqs”)
- don’t use whole read as seed seq, as looking for differences, only small chunks will match
- alignment from all seed matches extended to inc rest of read
- alignment scoring system used to identify best mapping position, accounting for no. matches, mismatches and base qualities (mismatches at low quality bases penalised less than high quality mismatches)
- each mapped read given mapping score, indicating confidence that read is derived from that position in genome (uniquely mapped reads have high score and ambiguously mapped have low score)
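A minimal seed-and-extend sketch of the steps above. A plain dictionary of reference Kmers stands in for the index used by real mappers, extension is ungapped, and candidates are ranked by mismatch count only (no base qualities or mapping score). The names, the seed length of 4 and the toy seqs are all assumptions for illustration.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Map every seed-length substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read, reference, index, seed_len):
    """Return (mismatches, start) for each candidate position, best first."""
    candidates = set()
    # Seeds are small non-overlapping chunks of the read, not the whole read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        for pos in index.get(read[offset:offset + seed_len], []):
            start = pos - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    scored = []
    for start in candidates:
        # Extend the seed match to the rest of the read and count differences.
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        scored.append((mismatches, start))
    return sorted(scored)


reference = "TTGACCGATAGGCTTACGAT"
read = "CGATAGACTT"  # reference[5:15] with 1 base changed (G to A)
hits = map_read(read, reference, build_index(reference, 4), 4)
print(hits)
```

The first seed still matches exactly even though the read differs from the reference, which is why small chunks are used as seeds.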
What are the most common read mapping programs?
- BWA and Bowtie2
How are mapped reads stored?
- usually BAM file
- contains details of which position on which chromosome mapped to and how good alignment is
What is the depth of coverage?
- no. reads which overlap particular position
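A minimal sketch of counting depth per position, assuming each mapped read is given as a hypothetical (start, length) pair with 0-based starts rather than parsed from a real BAM file.

```python
def depth_of_coverage(ref_len, reads):
    """Number of reads overlapping each reference position."""
    depth = [0] * ref_len
    for start, length in reads:
        # Clip reads that run past the end of the reference.
        for pos in range(start, min(start + length, ref_len)):
            depth[pos] += 1
    return depth


# 3 made-up reads on a 10 bp reference.
depth = depth_of_coverage(10, [(0, 5), (3, 5), (4, 4)])
print(depth)
```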