20.02.16 NGS - Bioinformatics Flashcards
What is a bioinformatic pipeline generally split into
- Quality control
- Sequence alignment
- Variant Calling
- Annotation
What is a BCL file
- Binary base call file: base calls are recorded per sequencing cycle.
- Binary file containing the base call and quality for each tile in each cycle
- Raw file produced by Illumina platforms (the MiSeq can also convert these to FASTQ on-instrument).
- Must be converted to FASTQs for bioinformatic analysis
What are FASTQ files
- Text-based format for storing both a nucleotide sequence and its corresponding quality scores.
- Generally the input file for most BI pipelines
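A minimal sketch of reading a FASTQ record, assuming the common four-lines-per-record layout and Phred+33 quality encoding (both are assumptions here, not stated in these notes). The function name read_fastq and the example read are made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line) stops the loop
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: score = ASCII code of the character - 33
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores  # header[1:] drops the leading '@'


# Made-up single-read example: 'I' decodes to 40, '#' decodes to 2
example = StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
print(records)
```

Each quality character maps to one base, so the sequence and score list always have the same length in a well-formed record.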
What are BAM files
- Binary Alignment Map.
- Compressed binary version of the SAM format, storing sequence reads aligned to a reference.
- Formed when a set of FASTQs have been aligned to a reference genome using an alignment algorithm.
- Can be used during analysis to visualise variants or to check quality/coverage
What are CRAM files
- A more highly compressed alternative to BAM (uses reference-based compression). Will become more common as we try to reduce the amount of data stored long-term.
What is a VCF file
- Variant Call Format file is used to show genetic variations in a sample (only the differences from the reference genome).
- Produced when a variant calling program is applied to a BAM file.
- Contains chromosome, position, reference and alternate alleles, quality score and filter status.
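A rough sketch of pulling those fields out of one VCF data line, assuming the standard tab-separated column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name parse_vcf_line and the variant shown are invented for illustration only.

```python
def parse_vcf_line(line):
    """Split one tab-separated VCF data line into its main fixed fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt, qual, filter_status = fields[:7]
    return {
        "chrom": chrom,
        "pos": int(pos),                 # 1-based position on the chromosome
        "ref": ref,
        "alt": alt.split(","),           # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),  # '.' means missing
        "filter": filter_status,         # e.g. PASS, or the name of a failed filter
    }


# Made-up example record, not real data
line = "chr1\t12345\t.\tA\tG\t50.0\tPASS\t."
variant = parse_vcf_line(line)
print(variant)
```

Header lines (those starting with '#') are not handled here and would need to be skipped before calling the function.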
What is an AnnotatedVCF
- A VCF that has had additional information added, e.g. gene name, predicted consequence or population frequency.
Purpose of QC stage of BI pipeline
- To check inherent quality of the data. Allows user to check if there are any major problems with dataset.
- Essential to improve alignment and downstream analysis.
Give some examples of the quality metrics checked in QC stage
- Per base sequence quality, per sequence quality scores, per base N content, sequence duplication levels, overrepresented sequences, k-mer content.
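Two of these metrics (per base sequence quality and per base N content) can be sketched in a few lines. This is an illustration only, assuming Phred+33 quality strings; it is not how any particular QC tool computes them, and the function name per_base_metrics and the reads are made up.

```python
def per_base_metrics(reads):
    """reads: list of (sequence, quality_string) pairs, qualities assumed Phred+33.

    Returns (mean quality per position, fraction of N calls per position).
    """
    length = max(len(seq) for seq, _ in reads)
    mean_quality = []
    n_fraction = []
    for i in range(length):
        # Only reads long enough to cover position i contribute to that position
        covering = [(seq, qual) for seq, qual in reads if len(seq) > i]
        scores = [ord(qual[i]) - 33 for _, qual in covering]
        mean_quality.append(sum(scores) / len(scores))
        n_fraction.append(sum(seq[i] == "N" for seq, _ in covering) / len(covering))
    return mean_quality, n_fraction


# Made-up reads: the second read has a low-quality N call at position 3
reads = [("ACGT", "IIII"), ("ACNT", "II#I")]
mean_quality, n_fraction = per_base_metrics(reads)
print(mean_quality)
print(n_fraction)
```

A drop in mean quality or a spike in N fraction at one position is the kind of pattern the QC stage is meant to flag.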
Give an example of tools used in QC step
- FastQC: checks FASTQ files for various QC metrics. Can run several files in parallel, reducing analysis time.
- Others include PIQA and ShortRead, but these don’t allow parallelization and so are less suited to high-throughput testing.
What happens in Alignment stage
- Short reads produced by sequencing methods are mapped to the corresponding position within the reference genome.
What do alignment algorithms need to do
- Find the corresponding location within the reference genome for each read, ideally assigning it a single unique location.
- Must be able to tolerate some deviation from reference to allow for variation between the genomes.
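The two requirements above can be illustrated with a deliberately naive scan: slide the read along the reference and keep positions within a mismatch budget. This is a teaching sketch, far too slow for real NGS data, and the names find_hits, max_mismatches and the sequences are assumptions for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(read, reference, max_mismatches=1):
    """Return (start, mismatches) for every reference window within the budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        distance = hamming(read, window)
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


# Made-up sequences: the read differs from the reference at one base
reference = "GGGACGTACTTT"
read = "ACGAAC"
hits = find_hits(read, reference, max_mismatches=1)
strict_hits = find_hits(read, reference, max_mismatches=0)
print(hits, strict_hits)
```

With no tolerance the read is not placed at all; allowing one mismatch gives it a single location. Indels are not handled here, since a Hamming comparison only covers substitutions.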
Give examples of alignment algorithms
- BLAST (Basic local alignment search tool), used historically. Heuristic approach that is inefficient for larger NGS data sets.
- BWT (Burrows-Wheeler Transform) based aligners, e.g. BWA and Bowtie. Use filtering and indexing to increase efficiency, reducing the time and memory needed for alignment.
- MAQ or Stampy: hash-based algorithms. Build hash tables to index the sample data before scanning the reference data.
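The hash-based idea (index the sample reads, then scan the reference) can be sketched as below. This is a simplified illustration and not the actual MAQ or Stampy implementation; the seed length k, the function names and the sequences are all assumptions.

```python
from collections import defaultdict


def index_read_seeds(reads, k):
    """Hash table mapping each read's first k bases (its seed) to read names."""
    table = defaultdict(list)
    for name, seq in reads.items():
        table[seq[:k]].append(name)
    return table


def scan_reference(reference, reads, k, max_mismatches=1):
    """Scan reference k-mers, look up seeds, then verify the full read."""
    table = index_read_seeds(reads, k)
    placements = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        for name in table.get(reference[pos:pos + k], []):
            read = reads[name]
            window = reference[pos:pos + len(read)]
            if len(window) < len(read):
                continue  # read would run off the end of the reference
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                placements[name].append(pos)
    return dict(placements)


# Made-up data: r1 matches exactly, r2 matches with one mismatch
reference = "GGGACGTACTTT"
reads = {"r1": "ACGTAC", "r2": "TACTAT"}
placements = scan_reference(reference, reads, k=3)
print(placements)
```

The hash lookup means the expensive full comparison only runs where a seed matches exactly, which is the point of indexing first. In this sketch a mismatch inside the seed itself would cause the read to be missed.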
What does BWT do to align reads
- Filters first to exclude parts of the reference genome where there are no matches (k-mer index or pigeonhole principle).
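For intuition, the transform itself and an exact-match backward search over it can be sketched as follows. This is a toy version, assuming a '$' sentinel and sorted rotations, which only works for tiny strings; real aligners build the index far more efficiently and also handle mismatches. The function names and sequences are made up.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (tiny inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(pattern, bwt_string):
    """Backward search: count exact occurrences of pattern in the original text."""
    counts = {}
    for c in bwt_string:
        counts[c] = counts.get(c, 0) + 1
    # first_pos[c] = number of characters in the text that sort before c
    first_pos, total = {}, 0
    for c in sorted(counts):
        first_pos[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in bwt_string[:i]; a linear scan, kept simple for clarity
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # current range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        lo = first_pos[c] + occ(c, lo)
        hi = first_pos[c] + occ(c, hi)
        if lo >= hi:
            return 0  # range is empty: no part of the reference matches
    return hi - lo


# Made-up reference
reference_bwt = bwt("ACGTACGA")
print(reference_bwt, count_matches("ACG", reference_bwt))
```

Each step narrows the range of candidate positions, so regions of the reference that cannot match are discarded without ever being compared base by base.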
Why is it important to select alignment algorithm based on sequencing platform
- Alignment is most accurate when the error model employed is chosen according to the sequencing technology.
- e.g. Illumina platforms are more prone to mismatches (substitution errors) than to indels.