Bioinformatics methods for analysis of bacterial genomes Flashcards
What is bioinformatics?
The use of computational techniques to solve biological problems. It draws on, among other things, programming, maths, statistics, biology, machine learning, multi-omics, and DNA sequencing.
Why are bacterial genomes dynamic?
Because mobile genetic elements can be gained through horizontal gene transfer and lost again.
What is de novo assembly?
De novo assembly is like doing a jigsaw puzzle without the picture on the box.
Reads -> contigs -> scaffolds -> chromosome
What should be considered when assessing the quality of an assembly?
1) Size of the assembly: does it match estimates from other means?
2) Size of the contigs/scaffolds: are they reasonably long?
3) Are the expected “core genes” present in the assembly?
4) What fraction of reads map to the assembly?
5) Does the assembly contain sequences of contaminating organisms?
6) Is the assembly consistent with independently derived data?
What tool is used to assess assembly quality?
QUAST
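Which values are considered when assessing assembly quality?
- N50: The length of the shortest contig (or scaffold) in the set of longest contigs that together cover at least 50% of the total assembly length. It works like a length-weighted median, not an average.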
- Maximum/median/average contig size after removal of the smallest contigs.
- Number of Ns.
- Total length of all contigs.
- Genome coverage: The number of bases in the reference covered by the assembled contigs, usually reported as a percentage of the reference length.
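N50 is easiest to understand by computing it. The following is a minimal sketch, assuming the common definition (sort contigs from longest to shortest and report the length of the contig at which the running total reaches half of the assembly length). The contig lengths are made up for illustration, and this is not how QUAST itself is implemented.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # longest first
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths (in kb) for illustration only.
example = [80, 70, 50, 40, 30, 20, 10]
print(n50(example))                  # 70
print(sum(example) / len(example))   # the plain average, which is smaller
```

The example shows why N50 should not be read as an average: many short contigs pull the mean down, while N50 is dominated by the long contigs.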
What is coverage?
The number of reads that cover (support) a given position in the genome, also called read depth.
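Per-position coverage can be sketched by counting how many reads overlap each position. This is a toy example, assuming reads are already mapped and given as hypothetical 0-based, half-open (start, end) intervals on the reference; real tools work on BAM files and are far more efficient.

```python
def depth_per_position(reads, reference_length):
    """Count how many reads cover each reference position.

    reads: iterable of (start, end) tuples, 0-based, end exclusive.
    """
    depth = [0] * reference_length
    for start, end in reads:
        # Clip the read to the reference so out-of-range reads are safe.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return depth


# Three made-up reads on a reference of length 10.
reads = [(0, 5), (2, 7), (4, 10)]
depth = depth_per_position(reads, 10)
print(depth)                    # [1, 1, 2, 2, 3, 2, 2, 1, 1, 1]
print(sum(depth) / len(depth))  # mean coverage: 1.6
```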
What is reference-guided assembly?
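Assembly in which the reads are mapped to an existing reference genome. It is a slightly different, easier problem than de novo assembly, analogous to knowing roughly what the finished puzzle should look like.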
Output: BAM/SAM file (alignment) or FASTA file (consensus)
What are BAM/SAM files?
They contain reads together with their mapping information.
SAM = Sequence Alignment/Map
BAM = binary SAM
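To make "reads with mapping information" concrete, here is a minimal sketch that parses one SAM alignment line by hand. It relies on the SAM format having eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and on flag bit 0x4 meaning "read unmapped". The example line and the names SamRecord, parse_sam_line and is_unmapped are made up for illustration; in practice a library such as pysam or samtools would be used, and BAM files cannot be read as text like this.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. "8M"
    seq: str     # read sequence


def parse_sam_line(line):
    """Parse the fields of interest from a single SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        seq=fields[9],
    )


def is_unmapped(record):
    """Flag bit 0x4 is set when the read could not be mapped."""
    return bool(record.flag & 0x4)


# A made-up alignment line for illustration.
line = "read1\t0\tchr\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_unmapped(record))
```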
What is BLAST?
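BLAST (Basic Local Alignment Search Tool) finds sequences in a database that are similar to a query sequence. It is available as a web-based tool and as command-line software. The results include the query cover (how much of the query sequence is covered by the alignment) and the % identity.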
1) The query sequence is broken into “words” that will act as seeds in alignment
2) BLAST searches for matches (or synonyms) in target entries in the database
3) If a target entry has two or more matches to “words” from the query, the alignment is extended in both directions looking for additional similarity
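The seed-and-extend idea in the steps above can be sketched in a few lines. This is a heavily simplified toy, not BLAST's actual algorithm: it assumes exact word matches only (no "synonyms" or scoring matrix), extends without gaps or mismatches, and skips the requirement of two or more word hits. The function names and sequences are made up for illustration.

```python
def find_seeds(query, target, word_size):
    """Return (query_pos, target_pos) pairs where a query word occurs in the target."""
    # Index every word of the target by its start position.
    index = {}
    for t in range(len(target) - word_size + 1):
        index.setdefault(target[t:t + word_size], []).append(t)
    # Look up every word of the query in that index.
    seeds = []
    for q in range(len(query) - word_size + 1):
        for t in index.get(query[q:q + word_size], []):
            seeds.append((q, t))
    return seeds


def extend_seed(query, target, q, t, word_size):
    """Extend a seed in both directions while the characters keep matching.

    Returns (query_start, target_start, match_length).
    """
    q_start, t_start = q, t
    while q_start > 0 and t_start > 0 and query[q_start - 1] == target[t_start - 1]:
        q_start -= 1
        t_start -= 1
    q_end, t_end = q + word_size, t + word_size
    while q_end < len(query) and t_end < len(target) and query[q_end] == target[t_end]:
        q_end += 1
        t_end += 1
    return q_start, t_start, q_end - q_start


query = "GATTACA"
target = "CCGATTACATT"
seeds = find_seeds(query, target, word_size=4)
print(seeds)                                     # [(0, 2), (1, 3), (2, 4), (3, 5)]
print(extend_seed(query, target, *seeds[0], 4))  # (0, 2, 7)
```

All four seeds lie on the same diagonal and extend to the same 7-base match, which covers the whole query (100% query cover, 100% identity in this toy case).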
What are some limitations of databases?
1) Databases will have different structure, content, and level of curation
2) Tools only detect what is in the particular database
3) Interpretation requires knowledge of tools and bacteria
4) Annotation software and database used may affect results/outcome
What is multilocus sequence typing (MLST)?
It is used to define groups within a species.
It is useful for surveillance of which strain types are present in a population.
A typical MLST scheme uses 7 loci in housekeeping genes.
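In MLST, each distinct sequence at a locus is given an allele number, and the combination of allele numbers across the loci (the allelic profile) defines the sequence type (ST). A minimal sketch of that lookup follows; the locus names, profiles, and ST labels are entirely hypothetical and do not come from any real MLST scheme.

```python
# Hypothetical 7-locus scheme; real schemes name actual housekeeping genes.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Hypothetical profile table: allele numbers per locus -> sequence type.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): "ST1",
    (1, 2, 1, 1, 3, 1, 1): "ST2",
}


def assign_sequence_type(alleles):
    """Look up the sequence type for a dict of locus -> allele number.

    Returns None when the profile is not in the table (a novel profile).
    """
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)


isolate = {"locus1": 1, "locus2": 2, "locus3": 1, "locus4": 1,
           "locus5": 3, "locus6": 1, "locus7": 1}
print(assign_sequence_type(isolate))  # ST2
```

Isolates that share an ST have identical alleles at all seven loci, which is what makes STs convenient labels for surveillance.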
In Galaxy, how is filtering of low quality reads and bases performed?
By adding fastp to the pipeline
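What filtering low-quality reads means can be illustrated with a small sketch. It assumes FASTQ quality strings in Phred+33 encoding (quality score = ASCII code minus 33) and drops reads that are too short or whose mean quality is too low. The thresholds and function names are hypothetical, and this is not fastp's actual algorithm or its defaults.

```python
def phred_scores(quality_string):
    """Convert a Phred+33 encoded quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def passes_filter(quality_string, min_mean_quality=20, min_length=15):
    """Keep a read only if it is long enough and its mean quality is high enough.

    Both thresholds are illustrative values, not fastp defaults.
    """
    scores = phred_scores(quality_string)
    if len(scores) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean_quality


print(phred_scores("I#"))       # [40, 2]
print(passes_filter("I" * 20))  # True: long enough, quality 40 throughout
print(passes_filter("#" * 20))  # False: quality 2 throughout
print(passes_filter("I" * 5))   # False: too short
```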
Why do we add FastQC in Galaxy?
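For quality control of the reads: FastQC reports read quality metrics (for example before and after filtering) but does not filter reads itself.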
What Galaxy tools are used for taxonomy/contamination and reporting of taxonomy/contamination?
Kraken2 and Kraken