Bioinformatics Flashcards
Phred
An algorithm that assigns a quality score (Q) to each called base; Phred-scaled scores are used by most major sequencing technologies
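A minimal sketch of how a Phred score relates to error probability, assuming the standard definition Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an estimated error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred quality score Q."""
    return -10 * math.log10(p)
```

Under this definition, Q20 corresponds to a 1 in 100 chance of a wrong call and Q30 to 1 in 1,000.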
Chastity filter
A base-call quality filter used by Illumina, which keeps a base call if the highest (fluorescent) intensity divided by the sum of the highest and second-highest intensities is no less than 0.6
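The ratio on this card can be sketched as follows. This assumes one intensity value per channel (for example, four values for A, C, G, T); the function names and the default threshold argument are illustrative, with 0.6 taken from the card.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_chastity(intensities, threshold=0.6):
    """True if the chastity ratio is no less than the threshold."""
    return chastity(intensities) >= threshold
```

A clean signal such as [80, 20, 5, 1] gives a ratio of 0.8 and passes, while a mixed signal such as [50, 45, 3, 2] gives about 0.53 and fails.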
FastQ
The file format used by next-generation sequencing technologies; each record stores both the sequence and its per-base quality scores
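A minimal sketch of reading FASTQ records, assuming the common four-line layout (header starting with "@", sequence, "+" separator, quality string) and Phred+33 quality encoding. Other encodings and multi-line records are not handled, and the function name is illustrative.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_scores) for each four-line record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Phred+33 (assumed): quality = ASCII code of the character minus 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]
```

For example, the quality character "I" decodes to Q40 and "!" decodes to Q0 under Phred+33.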
Contig
A contiguous sequence, assembled from overlapping sequence reads, in which the base order is known
Gap
A region where sequencing reads from two ends of a fragment are present on two different contigs
Scaffold
A genome sequence that’s been reconstructed from contigs and gaps
RefSeq
A database of NCBI-curated, non-redundant genomic DNA sequences, transcript RNAs, and protein products for major model organisms
Greedy
The simplest algorithm used for genome assembly; it works by repeatedly merging the sequences with the largest overlaps
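A toy sketch of the greedy idea: repeatedly find the pair of sequences with the largest suffix-prefix overlap and merge them. It assumes error-free reads on a single strand and uses a brute-force pairwise search, so it is for illustration only; the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=1):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

Because each step commits to the locally best merge, repeats in the genome can lead this approach to misassemble, which is one reason graph-based methods are used in practice.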
OLC
Overlap-layout-consensus; a de novo genome assembly approach that works by finding the best matches between the suffix of one read and the prefix of another
ABySS
A de novo genome assembly program that uses de Bruijn graph assembly
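A minimal sketch of building a de Bruijn graph from reads, to illustrate the general technique rather than ABySS itself. It assumes the usual construction in which each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix; the function name is illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; repeated k-mers add repeated edges
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

An assembler would then look for paths through this graph to reconstruct contigs.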
Prodigal
An ab initio genome annotation program for bacterial and archaeal genomes
Genscan
An ab initio eukaryotic genome annotation program that uses a known set of genes to build HMMs for prediction and also takes many other parameters into account
Maximal Dependence Decomposition (MDD)
Uses information from large MSAs to model the dependencies between nucleotides at different positions in order to predict donor splice sites in algorithms such as Genscan
ChIP-seq
Chromatin immunoprecipitation followed by sequencing; used to identify the DNA sequences where proteins are bound, commonly transcription factor binding sites
MeDIP-seq
Methylated DNA immunoprecipitation followed by sequencing; a technique related to ChIP-seq used to identify methylated DNA sequences
Effective genome size
Used in the calculation of λBG (background noise) in peak-calling software, and accounts for the variability of mappable regions within a genome
GU-AG introns
A common type of pre-mRNA intron beginning with GU and ending with AG, commonly accompanied by longer conserved sequences
EST
Expressed sequence tag; a short sub-sequence of a cDNA sequence used to identify gene transcripts
RPKM/FPKM
Reads/fragments per kilobase of transcript per million mapped reads; normalizes RNA-seq data for gene length and library size, for single-end and paired-end reads respectively
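The normalization on this card can be sketched as a single function. The parameter names are illustrative; for FPKM the same arithmetic applies with fragment counts in place of read counts.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    length_kb = gene_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (length_kb * millions_mapped)
```

For example, 500 reads on a 2,000 bp gene in a library of 10 million mapped reads gives 500 / (2 * 10) = 25.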
TPM
Transcripts per million; an increasingly preferred method for RNA-seq transcript count normalization, calculated by dividing the read count of each gene by its length in kilobases (giving reads per kilobase, RPK), summing the RPK values across all genes and dividing by 1M to get a scaling factor, then dividing each RPK value by that scaling factor
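A minimal sketch of the TPM calculation: length-normalize each gene's count first, then scale so that the values in a sample sum to one million. The function and parameter names are illustrative.

```python
def tpm(counts, lengths_bp):
    """Return TPM values for parallel lists of read counts and gene lengths (bp)."""
    # Step 1: reads per kilobase (RPK) for each gene
    rpk = [c / (length / 1_000) for c, length in zip(counts, lengths_bp)]
    # Step 2: per-sample scaling factor, the sum of all RPK values divided by 1M
    scale = sum(rpk) / 1_000_000
    # Step 3: divide each RPK by the scaling factor
    return [r / scale for r in rpk]
```

Because length normalization comes before library-size scaling, the TPM values of every sample sum to the same total, which makes proportions easier to compare between samples than with RPKM/FPKM.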
Cufflinks
A program that assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples
StringTie
A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts, using a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus
Pseudoaligner
An RNA-seq mapping program that avoids fully aligning each read, instead matching k-mers from the reads against k-mers from a transcriptome
False discovery rate (FDR)
FDR = (false positives)/(false positives + true positives)
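As a quick worked example of the formula above (function name illustrative):

```python
def false_discovery_rate(false_positives, true_positives):
    """Fraction of all positive calls that are false positives."""
    called = false_positives + true_positives
    return false_positives / called if called else 0.0
```

If 100 genes are called significant and 5 of them are false positives, the FDR is 5 / (5 + 95) = 0.05.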
Fusion gene
Identified by fusion junctions: sections of transcribed RNA that map to an exon from one gene followed by an exon from another gene
Hash table
A dictionary used in indexing programs – data is stored as a collection of key-value pairs, frequently with the k-mer sequence as the key and the sequence position as the value
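A minimal sketch of such an index using a Python dictionary, with each k-mer as the key and the list of positions where it starts as the value. The function name is illustrative.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map each k-mer in the sequence to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)
```

Looking up a k-mer from a read then takes one dictionary access on average, which returns the candidate positions to extend from.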
Seeding
A query sequence is broken up into all possible overlapping 3-letter words (k-mers); each word is scored for similarity against candidate words using a matrix (e.g. BLOSUM62), and the words scoring above a predetermined threshold are kept for database searching
Suffix tree
A data structure that stores all the suffixes of a string, enabling fast string matching for an initial exact match
Suffix array
A data structure that represents all suffixes of a string by sorting them in alphabetical order and storing the starting position of each suffix in that sorted order
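A minimal sketch of a suffix array and an exact-match lookup by binary search. The construction here simply sorts the suffixes, which is simple but far slower than the specialized algorithms real tools use; the function names are illustrative, and the text is assumed to end with a "$" terminator that sorts before every other character.

```python
def suffix_array(text):
    """Starting positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions where pattern occurs, via binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

All suffixes that begin with the pattern sit next to each other in the sorted order, so two binary searches are enough to find every occurrence.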
FM Index
An augmentation of the space-efficient Burrows-Wheeler transform (BWT) with additional data (character occurrence counts and a suffix array, often sampled) that permits very fast exact string matching
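A toy sketch of exact matching with the BWT by backward search. It assumes the text ends with a "$" terminator, and it recounts characters with `str.count` on every step where a real FM index would use a precomputed occurrence table, so it shows the logic rather than the performance. Function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform of text (text must end with '$')."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix (wraps to '$' at 0)
    return "".join(text[i - 1] for i in sa)


def count_occurrences(bwt_text, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    counts = {}
    for ch in bwt_text:
        counts[ch] = counts.get(ch, 0) + 1
    # first_row_start[c]: number of characters in the text strictly smaller than c
    first_row_start, total = {}, 0
    for ch in sorted(counts):
        first_row_start[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_text)  # current range of matching rows, half-open
    for ch in reversed(pattern):
        if ch not in first_row_start:
            return 0
        lo = first_row_start[ch] + bwt_text.count(ch, 0, lo)
        hi = first_row_start[ch] + bwt_text.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The pattern is processed from its last character to its first, and each step narrows the range of sorted suffixes that start with the part of the pattern seen so far.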
E-value
Expect value; indicates the number of alignments with a score at least as good as the given alignment that you would expect to find by chance in a database of that size
Batch correction
A technique that accounts for technical variation across samples, such as differences in reagents, equipment, and date of library preparation or sequencing
Short read mapping
A technique used in RNA-seq read mapping that consists of aligning millions of short reads (35-400 bp) to a single long reference sequence
DESeq2
A program that takes raw read counts, not FPKM/RPKM, as input and normalizes them so that the counts for non-differentially expressed genes are similar between samples
Euclidean distance
In two-dimensional space (such as a plane), Euclidean distance is the length of the shortest path connecting two points
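A minimal sketch of the calculation, written so that it works for points with any number of coordinates (for example, expression profiles across many samples). The function name is illustrative.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For the points (0, 0) and (3, 4) this gives the familiar 3-4-5 right triangle result of 5.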
Bi-clustering
A type of clustering technique used to cluster both rows and columns simultaneously in a dataset, often applied to data matrices where rows represent one type of entity (e.g., genes) and columns represent another type of entity (e.g., experimental conditions)
Homoskedasticity
A constant variance along the range of mean values
Kallisto
A pseudoaligner that assigns each read to the known transcripts it is compatible with, using k-mers stored in a transcriptome de Bruijn graph (T-DBG), rather than determining exactly where in each sequence the read aligns
Overrepresentation analysis
Determines whether functional categories are over- or underrepresented in a gene list of interest in comparison to a reference list
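A minimal sketch of testing one category for overrepresentation. The card does not name a statistical test; this example assumes the hypergeometric test (equivalent to a one-sided Fisher's exact test), which is one common choice. The function and parameter names are illustrative.

```python
from math import comb


def ora_p_value(n_reference, n_category, n_list, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_reference: genes in the reference list
    n_category:  reference genes annotated to the category
    n_list:      genes in the list of interest
    n_overlap:   genes in the list of interest annotated to the category
    """
    upper = min(n_category, n_list)
    favorable = sum(
        comb(n_category, i) * comb(n_reference - n_category, n_list - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_reference, n_list)
```

In practice this is repeated for every category, so the resulting p-values need multiple-testing correction.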
Functional class scoring/Gene set enrichment analysis
Looks for enrichment of specific gene sets or pathways, but uses a ranked gene list as input
Type I Error
False positive; probability = α