Bioinformatics Flashcards
Phred
An algorithm that assigns a quality score (Q) to each base call, used by most major sequencing technologies; Q = -10 log10(P), where P is the estimated probability that the base call is wrong
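As a quick illustration of how a Phred score relates to error probability, here is a minimal sketch of the standard conversion Q = -10 log10(P). The function names are made up for this example and do not come from any particular tool.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score Q to a base-call error probability P."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert a base-call error probability P to a Phred quality score Q."""
    return -10 * math.log10(p)

# Q30 corresponds to a 1-in-1000 chance that the base call is wrong.
print(phred_to_error_prob(30))
print(error_prob_to_phred(0.01))
```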
Chastity filter
A quality filter applied during Illumina base calling, which keeps a base call if its chastity (the highest fluorescent intensity divided by the sum of the highest and second-highest intensities) is no less than 0.6
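The ratio described on this card can be sketched as follows. This is only an illustration for a single sequencing cycle, assuming one intensity value per base channel; the function names and example intensities are hypothetical, and how the filter is applied across cycles is not covered here.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)

def passes_chastity(intensities, threshold=0.6):
    """True if the chastity value is no less than the threshold."""
    return chastity(intensities) >= threshold

# Hypothetical intensities for the four base channels in one cycle.
clean_signal = [900, 100, 50, 30]   # one channel clearly dominates
mixed_signal = [500, 450, 20, 10]   # two channels are nearly equal

print(chastity(clean_signal), passes_chastity(clean_signal))
print(chastity(mixed_signal), passes_chastity(mixed_signal))
```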
FASTQ
The text file format used by next-generation sequencing technologies, storing both each read's sequence and its per-base quality scores
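To make the format concrete, here is a minimal sketch of reading four-line records (header, sequence, separator, quality string). It assumes well-formed, unwrapped records and Phred+33 quality encoding; `parse_fastq` and the example read are hypothetical, not part of any library.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_scores) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line, ignored
        qual = next(it).strip()
        # Assumption: qualities are Phred+33 encoded (ASCII code minus 33).
        scores = [ord(c) - 33 for c in qual]
        yield header.strip()[1:], seq, scores

record = ["@read1", "GATTACA", "+", "IIIIII#"]
for read_id, seq, scores in parse_fastq(record):
    print(read_id, seq, scores)
```

For real data, an established parser such as the one in Biopython is usually a safer choice than a hand-rolled reader.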
Contig
A contiguous sequence, assembled from overlapping sequence reads, where the base order is known
Gap
A region of unknown sequence between two contigs, inferred where sequencing reads from the two ends of a fragment map to two different contigs
Scaffold
A portion of a genome sequence that’s been reconstructed from ordered contigs and the gaps between them
RefSeq
A database containing NCBI-curated, non-redundant genomic DNA sequences, transcript RNA, and protein products for major model organisms
Greedy
The simplest algorithm used for genome assembly; it works by repeatedly merging the pair of sequences with the largest overlap
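The merging loop can be sketched as below. This is a toy illustration that assumes error-free reads from one strand and exact suffix-to-prefix overlaps; `overlap` and `greedy_assemble` are hypothetical helper names, and real assemblers handle sequencing errors, repeats, and reverse complements.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(reads)
    while len(seqs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps; leave the sequences unmerged
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)
    return seqs

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Because each step takes the locally best merge, this approach can mis-assemble repeats, which is one reason graph-based methods are preferred for real genomes.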
OLC
Overlap-layout-consensus; a de novo genome assembly approach that works by finding the best matches between the suffix of one read and the prefix of another, laying out the overlapping reads, and deriving a consensus sequence
ABySS
A de novo genome assembly program that uses de Bruijn graph assembly
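For intuition about de Bruijn graph assembly, here is a minimal sketch that builds the graph from reads: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This illustrates the general idea only and is not ABySS's implementation; the function name and example reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mers that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

print(de_bruijn_edges(["ATGGC", "GGCGT"], k=3))
```

An assembler would then look for paths through this graph that use each edge, rather than comparing every read against every other read.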
Prodigal
An ab initio genome annotation program for bacterial and archaeal genomes
Genscan
An ab initio eukaryotic gene prediction program that trains HMMs on a known set of genes and also takes many other parameters into account
Maximal Dependence Decomposition (MDD)
A method that uses information from large MSAs to model the dependencies between nucleotides at different positions, used to predict donor splice sites in algorithms such as Genscan
ChIP-seq
Chromatin immunoprecipitation followed by sequencing; used to identify genomic sequences where proteins are bound, commonly transcription factor binding sites
MeDIP-seq
Methylated DNA immunoprecipitation sequencing; a technique analogous to ChIP-seq used to identify methylated DNA sequences
Effective genome size
The mappable portion of a genome; used in the calculation of λBG (background noise) in peak-calling software to account for the variability of mappable regions within a genome
GU-AG introns
A common type of pre-mRNA intron beginning with GU and ending with AG, commonly accompanied by longer conserved sequences
EST
Expressed sequence tag; a short sub-sequence of a cDNA sequence used to identify gene transcripts
RPKM/FPKM
Reads/fragments per kilobase of transcript per million mapped reads; normalizes RNA-seq data for gene length and library size, for single-end and paired-end reads respectively
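A worked example may help with the units. The sketch below computes RPKM for one gene; FPKM uses the same arithmetic with fragment counts in place of read counts. The function name and the numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    length_kb = gene_length_bp / 1_000
    library_millions = total_mapped_reads / 1_000_000
    return read_count / length_kb / library_millions

# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million reads.
print(rpkm(500, 2_000, 10_000_000))
```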
TPM
Transcripts per million; an increasingly preferred method for RNA-seq transcript count normalization, calculated by dividing the read count of each gene by its length in kilobases (reads per kilobase, RPK), summing the RPK values across all genes and dividing by 1M to get a scaling factor, then dividing each RPK value by that scaling factor
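The TPM calculation can be sketched in three steps: compute reads per kilobase (RPK) for each gene, derive a per-million scaling factor from the sum of the RPK values, then divide each RPK by that factor. The function name and the example counts are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for parallel lists of counts and gene lengths."""
    # Step 1: reads per kilobase for each gene.
    rpk = [count / (length / 1_000) for count, length in zip(counts, lengths_bp)]
    # Step 2: per-million scaling factor from the sum of all RPK values.
    scale = sum(rpk) / 1_000_000
    # Step 3: divide each RPK by the scaling factor.
    return [value / scale for value in rpk]

# Hypothetical sample: three genes whose counts scale with their lengths.
values = tpm([100, 200, 300], [1_000, 2_000, 3_000])
print(values)
print(sum(values))
```

Because length normalization happens before library-size normalization, TPM values sum to one million in every sample, which makes proportions comparable across samples.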
Cufflinks
A program that assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples
StringTie
A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts, using a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus
Pseudoaligner
An RNA-seq mapping program that avoids fully aligning each read, instead matching k-mers between the reads and the transcriptome
False discovery rate (FDR)
FDR = (false positives)/(false positives + true positives)
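As a small worked example of the formula, assuming the numbers of false and true positives are already known (in practice they usually have to be estimated); the function name and the counts are hypothetical.

```python
def false_discovery_rate(false_positives, true_positives):
    """FDR = false positives / (false positives + true positives)."""
    called = false_positives + true_positives
    # With no positive calls at all, the FDR is reported as 0.0 by convention here.
    return false_positives / called if called else 0.0

# Hypothetical experiment: 100 genes called significant, 5 of them wrongly.
print(false_discovery_rate(false_positives=5, true_positives=95))
```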