Bioinformatics Flashcards
Phred
An algorithm that provides a quality score (Q) to each base for most major sequencing technologies
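A Phred score is conventionally defined as Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative, not part of any Phred tool):

```python
import math

def phred_from_error_prob(p):
    """Phred quality score for a base-call error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Base-call error probability implied by a Phred score q."""
    return 10 ** (-q / 10)
```

Under this definition Q20 corresponds to a 1 in 100 chance of error and Q30 to 1 in 1,000.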
Chastity filter
A purity measure used in Illumina base calling, which keeps a base call if the highest (fluorescent) intensity divided by the sum of the highest and second highest intensities is no less than 0.6
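A minimal sketch of the chastity calculation for one cluster in one cycle, assuming the numerator is the brightest of the four channel intensities (the function names and the example intensities are hypothetical):

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)

def passes_chastity(intensities, threshold=0.6):
    """True if the chastity value is no less than the threshold."""
    return chastity(intensities) >= threshold

clean_call = [900, 100, 50, 20]     # one channel clearly dominates
ambiguous_call = [500, 450, 10, 5]  # two channels compete
```

Here `clean_call` gives 900 / 1000 = 0.9 and passes, while `ambiguous_call` gives 500 / 950 (about 0.53) and fails.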
FastQ
The file format used by next-generation sequencing technologies, including both the sequence and the quality score
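A FastQ record normally spans four lines: an `@` header, the sequence, a `+` separator, and a quality string with one character per base. A minimal parsing sketch, assuming the Phred+33 ASCII encoding (older Illumina files used a +64 offset, which this does not handle); the function name is illustrative:

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, sequence, qualities)."""
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so '!' means Q0 and 'I' means Q40.
    scores = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, scores

record = ["@read1", "ACGT", "+", "II5!"]
read_id, sequence, qualities = parse_fastq_record(record)
```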
Contig
A continuous sequence, assembled from sequence reads, where the base order is known
Gap
A region where sequencing reads from two ends of a fragment are present on two different contigs
Scaffold
A genome sequence that’s been reconstructed from contigs and gaps
RefSeq
A database containing NCBI curated non-redundant genomic DNA sequences, transcript RNA, and protein products for major model organisms
Greedy
The simplest algorithm used for genome assembly, which functions by continuously merging the sequences with the largest overlaps
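A toy sketch of greedy merging, assuming error-free reads on a single strand and exact suffix-prefix overlaps (real assemblers also handle sequencing errors, reverse complements, and contained reads); the function names are illustrative:

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(reads)
    while len(seqs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left; remaining sequences stay separate
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)
    return seqs

contigs = greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"])
```

Because each merge is chosen locally, repeats can lead the greedy approach to a wrong assembly, which is one reason graph-based methods are preferred for real genomes.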
OLC
Overlap-layout-consensus; a de novo genome assembly approach that functions by finding the best matches between the prefix of one read and the suffix of another
ABySS
A de novo genome assembly program that uses de Bruijn graph assembly
Prodigal
An ab initio genome annotation program for bacterial and archaeal genomes
Genscan
An ab initio eukaryotic genome annotation model which uses a known set of genes to create HMMs for prediction, as well as consideration of many other parameters
Maximal Dependence Decomposition (MDD)
Uses information from large MSAs to model the dependencies between nucleotides at different positions in order to predict donor splice sites in algorithms such as Genscan
ChIP-seq
Chromatin immunoprecipitation; used to identify chromosome sequences where proteins are bound, commonly transcription factor binding sites
MeDIP-seq
Methylated DNA immunoprecipitation sequencing; a technique analogous to ChIP-seq used to identify methylated DNA sequences
Effective genome size
Used in the calculation of λBG (background noise) in peak-calling software, and accounts for the variability of mappable regions within a genome
GU-AG introns
A common type of pre-mRNA intron beginning with GU and ending with AG, commonly accompanied by longer conserved sequences
EST
Expressed sequence tag; a short sub-sequence of a cDNA sequence used to identify gene transcripts
RPKM/FPKM
Reads/fragments per kilobase per million reads; normalizes RNA-seq data for gene length and library size, for single-end and paired-end reads respectively
TPM
Transcripts per million; an increasingly preferred method for RNA-seq transcript count normalization, calculated by dividing the read counts by the length of each gene in kilobases (giving reads per kilobase, RPK), summing the RPK values over all genes and dividing by 1M to get a scaling factor, then dividing each RPK value by that scaling factor
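A minimal sketch of the usual RPKM and TPM arithmetic, assuming one read count and one length in base pairs per gene (the function names and example numbers are illustrative):

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million reads in the library."""
    per_million = sum(counts) / 1_000_000
    return [c / (length / 1_000) / per_million
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale."""
    rpk = [c / (length / 1_000) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1_000_000
    return [value / scale for value in rpk]

counts = [100, 200, 300]         # hypothetical read counts for three genes
lengths_bp = [1000, 2000, 3000]  # hypothetical gene lengths
```

TPM values always sum to one million within a sample, which makes proportions comparable between samples; RPKM values have no such fixed total.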
Cufflinks
A program that assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples
StringTie
A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts, using a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus
Pseudoaligner
RNA-seq mapping programs that avoid fully aligning each read, instead matching k-mers across reads and transcriptomes
False discovery rate (FDR)
FDR = (false positives)/(false positives + true positives)
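The same ratio as a small function, with one assumed convention: when nothing is called positive, the rate is reported as 0 rather than raising a division error.

```python
def false_discovery_rate(false_positives, true_positives):
    """Fraction of all positive calls that are false positives."""
    discoveries = false_positives + true_positives
    if discoveries == 0:
        return 0.0  # assumed convention: no discoveries, no false discoveries
    return false_positives / discoveries
```

For example, 5 false positives among 100 positive calls gives an FDR of 0.05.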
Fusion gene
Identified by fusion junctions, a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene
Hash table
A dictionary used in indexing programs – data is stored as a collection of key-value pairs, frequently with the k-mer sequence as the key and the sequence position as the value
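A minimal sketch of a k-mer hash table built with a Python dictionary, with the k-mer as the key and the list of start positions as the value (the function name is illustrative):

```python
from collections import defaultdict

def build_kmer_index(sequence, k):
    """Map each k-mer in sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index

index = build_kmer_index("ACGTACGT", 3)
hits = index.get("ACG", [])  # .get avoids adding missing k-mers to the table
```

Looking up a k-mer takes roughly constant time on average, which is why hash tables are popular for seeding alignments.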
Seeding
A query sequence is broken up into all possible overlapping 3-letter words (k-mers), candidate words are scored for similarity against them using a matrix (e.g. BLOSUM62), and those scoring above a predetermined threshold are kept for database searching
Suffix tree
A data structure that stores all the suffixes of a string, enabling fast string matching for an initial exact match
Suffix array
A data structure that stores all possible suffixes of a string, indexes them, and then sorts the suffixes by alphabetical order
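A naive sketch of a suffix array and an exact-match lookup by binary search. Sorting whole suffixes like this is simple but slow and memory hungry; production tools use specialized construction algorithms. The function names are illustrative.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, sorted by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Positions where pattern occurs in text, found by binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose leading m characters are >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose leading m characters are > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "ana")
```

All suffixes that begin with the pattern sit next to each other in the sorted order, so two binary searches are enough to find the whole block.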
FM Index
An augmentation of the space-efficient BWT with additional data (occurrence counts and a sampled suffix array) that permits very fast exact string matching
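A small sketch of the idea behind FM-index matching: build the BWT from a suffix array, then count pattern occurrences by backward search. It assumes a `$` sentinel that does not occur in the text and sorts before every other symbol, and it recomputes occurrence counts naively where a real index would use precomputed checkpoint tables. The function names are illustrative.

```python
def bwt_via_suffix_array(text):
    """Return (bwt, suffix_array) for text with a '$' sentinel appended."""
    text += "$"  # assumption: '$' is absent from text and sorts first
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to '$'
    return bwt, sa

def count_occurrences(bwt, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c]: number of symbols in the text that sort strictly before c
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total
        total += bwt.count(c)

    def occ(c, i):
        # Occurrences of c in bwt[:i]; computed naively for clarity.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt, sa = bwt_via_suffix_array("banana")
n_hits = count_occurrences(bwt, "ana")
```

The search consumes the pattern from its last character to its first, narrowing a range of suffix array rows at each step; the suffix array is only needed afterwards to turn that range into text positions.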
E-value
Expect value; the number of alignments with a score at least as good as the observed one that you would expect to find by chance when searching a database of that size
Batch correction
A technique that accounts for technical errors across samples such as differences in reagents, equipment, and date of library preparation or sequencing
Short read mapping
A technique used in RNA-seq read mapping, which consists of aligning millions of short reads (35-400 bp) to a single long sequence
DESeq2
A differential expression program that takes raw read counts rather than FPKM/RPKM as input and normalizes them so that counts for non-differentially expressed genes are similar between samples
Euclidean distance
In two-dimensional space (such as a plane), Euclidean distance is the length of the shortest path connecting two points
Bi-clustering
A type of clustering technique used to cluster both rows and columns simultaneously in a dataset, often applied to data matrices where rows represent one type of entity (e.g., genes) and columns represent another type of entity (e.g., experimental conditions)
Homoskedasticity
A constant variance along the range of mean values
Kallisto
A pseudoaligner that assigns each read to the known transcripts it is compatible with by matching its k-mers against a transcriptome de Bruijn graph (T-DBG), rather than determining exactly where in each sequence the read aligns
Overrepresentation analysis
Determines whether functional categories are over- or underrepresented in a gene list of interest in comparison to a reference list
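Overrepresentation is commonly tested with a hypergeometric (one-sided Fisher's exact) test. A minimal sketch for a single category, assuming the gene list is a subset of the reference list; the function and parameter names are illustrative:

```python
from math import comb

def enrichment_p_value(n_reference, n_in_category, n_in_list, n_overlap):
    """P(overlap >= n_overlap) when n_in_list genes are drawn at random
    from a reference of n_reference genes, n_in_category of which
    belong to the functional category."""
    upper = min(n_in_category, n_in_list)
    favourable = sum(
        comb(n_in_category, i) * comb(n_reference - n_in_category, n_in_list - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_reference, n_in_list)
```

In practice the test is repeated for every category, so the resulting p-values need multiple-testing correction, for example by controlling the FDR.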
Functional class scoring/Gene set enrichment analysis
Looks for enrichment of specific gene sets or pathways, but uses a ranked gene list as input
Type I Error
False positive; probability = α
Type II Error
False negative; probability = β
ClinVar
An NCBI-hosted database that aggregates information about genomic variation and its relation to human health
Canonical transcript
The transcript that is, on average, the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt
Online Mendelian Inheritance in Man (OMIM)
An online database containing information on all known human genes and mendelian disorders
Humsavar
Human polymorphisms and disease mutations index; A manually curated text file listing all human missense variants classified as pathogenic, likely pathogenic, benign, likely benign, and uncertain significance
Exome Aggregation Consortium (ExAC) database
A database that collected and reanalyzed the exome sequences of >60,000 individuals from different populations with adult-onset diseases that were sequenced as part of disease-specific and population genetic studies
Genome Aggregation Database (gnomAD)
A database that aggregates exome and genome sequence data from several large-scale sequencing projects
Ensembl
An additional database documenting the transcript variations of specific genes and the genomes of mainly vertebrate species
Database of Short Genetic Variations (dbSNP)
An NCBI-hosted archive of all short sequence variations for a wide range of organisms, including SNPs, INDELs, and multi-base INDELs
Catalogue of Somatic Mutations in Cancer (COSMIC)
A database that combines genome-wide sequencing results from >28,000 tumors, including details like tissue and variation type distribution
Polymorphism Phenotyping v2 (PolyPhen2)
A variant effect prediction algorithm that calculates the probability of a variant being damaging using both 3D structural features (surface accessibility, hydrophobicity, etc.) and sequence-based analysis
Missense3D
An Imperial College hosted variant effect prediction algorithm that uses 3D structural coordinates to perform an in-depth atom-based study of the effect of a missense variant and can therefore provide the user with information on the mechanism by which the variant may disrupt protein folding/function
Sorting Tolerant from Intolerant (SIFT) algorithm
A widely used missense variant effect prediction algorithm based on MSA construction
Rare Exome Variant Ensemble Learner (REVEL)
An ensemble method for predicting the effect of an amino acid substitution by combining many other prediction algorithms, including PolyPhen and SIFT; it performs better than the individual predictors
Variant Effect Predictor (VEP)
An Ensembl-hosted platform that runs different prediction algorithms, including PolyPhen and SIFT, for the user simultaneously
Single Amino Acid Polymorphism (SAAP)
A data analysis pipeline and predictor that performs in-depth analysis of the structural effect of an amino acid substitution on a protein structure using residue conservation and experimental 3D structures
Pseudoexon
The product of mutations in regions important for alternative splicing due to the creation of de-novo splice sites or strengthening of existing weak splice sites, resulting in transcripts subject to premature degradation or production of a modified protein
Pharmacogenetics
The study of how genetic factors affect interindividual variability in drug response
Aspartic acid, glutamic acid
Negatively charged polar amino acids
Arginine, lysine, histidine
Positively charged polar amino acids
Asparagine, glutamine, serine, threonine, tyrosine
Uncharged polar amino acids
Alanine, glycine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan, cysteine
Nonpolar amino acids
Dihedral angles
The amino acid main chain torsion angles, important for determining secondary and tertiary structure
Protein fold classes
α/α: packing of alpha-helices
β/β: one or more beta-sheets
α/β: roughly alternate alpha-helices and beta-sheets, beta-sheets are commonly parallel
α+β: mixed alpha-helices and beta-sheets
RMSD
Root mean square deviation; a metric used to quantify the similarity between protein structures by superimposing them in the orientation which minimizes the value and calculating the square root of the mean squared distance between equivalent positions
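A minimal sketch of the RMSD arithmetic, assuming the two structures have already been superimposed and that atoms are listed in matching order (the optimal superposition itself, e.g. via the Kabsch algorithm, is not shown); the function name is illustrative:

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Because distances are squared before averaging, a few badly placed atoms can dominate the value.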
Enzyme commission (EC) classification
E.C.1(class).2(Functional definition).3(Functional definition).4(substrate specificity)
PDB
Protein data bank; an EBI-hosted database of ~120,000 non-redundant protein structures, generally of high quality
SCOPe
An automated version of SCOP, a database that aimed to organize PDB data by manual structure comparison; uses a hierarchical classification: class, fold, superfamily, family, protein domain and species
CATH
A protein structure database which organizes proteins by partially automatic structural alignment and uses a hierarchical classification: class, architecture, topology, homologous superfamily, sequence family
SWISSPROT
A high quality source of annotation for a selection of protein sequences
UniProtKB
UniProt Knowledge Base; a European-based protein database with 250M sequences and ~600,000 high quality SWISSPROT annotations
TrEMBL
EBI’s protein sequence database; the automatically annotated, unreviewed section of UniProtKB
MGnify
EBI’s metagenomics database containing 350,000 amplicons from 33,000 metagenomes organized by biomes such as human, digestive system, aquatic, soil, skin, wastewater, etc.
Conservative substitutions
An amino acid substitution that maintains the chemical properties of the original amino acids
PAM
Point accepted mutation; an amino acid sequence alignment scoring scheme developed in the 1970s, based on counting the number of times residue types change in closely homologous sequences
BLOSUM62
Blocks substitution matrix; an amino acid sequence alignment scoring scheme developed in the 1990s, derived from conserved protein motifs, which effectively filters out noise
Needleman-Wunsch Algorithm
A general sequence comparison algorithm that maximizes similarity scores to find the best global alignment of any two sequences
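A minimal sketch of the Needleman-Wunsch dynamic programming recurrence that returns only the optimal global alignment score. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions for illustration; protein alignments would normally use a substitution matrix such as BLOSUM62.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against nothing
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]
```

Recovering the alignment itself requires a traceback from the bottom-right cell, which is omitted here.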
Smith-Waterman Algorithm
A sequence comparison algorithm that compares segments of all possible lengths (i.e. local alignments) and chooses whichever maximizes the similarity measure
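A minimal sketch of the Smith-Waterman recurrence that returns only the best local alignment score. Cell values are never allowed to drop below zero, which is what lets an alignment start and end anywhere. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any segments of a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0,                      # start a fresh alignment here
                             diagonal,
                             previous[j] + gap,      # gap in b
                             current[j - 1] + gap)   # gap in a
            best = max(best, current[j])
        previous = current
    return best
```

Only two rows of the matrix are kept because the score alone is returned; a traceback would need the full matrix.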
P-value (p)
The probability of achieving the returned score or a better score by chance; the probability of obtaining a value at least as extreme as the observed result assuming the null hypothesis is correct
CLUSTAL
A multiple sequence alignment program that builds MSAs using guide trees
PROSITE
A database of protein sequence patterns identified using multiple sequence alignments closely linked with SWISSPROT
Hidden Markov Models (HMMs)
A probabilistic method for representing protein families, built from MSAs, that scores residue similarity at each position, allowing for the detection of distant family relationships
PFAM
A database of protein domain family HMMs
InterPro
An expansive database that consolidates information from a wide range of protein databases, such as PROSITE, PFAM, PANTHER, ProDOM (homologous domains), UNIPROT, etc., for the purpose of unifying research approaches and protein terminology
PSIBLAST
An algorithm that builds an MSA and a position-specific scoring matrix (PSSM) from the query sequence and uses these to search the database again, iterating to amplify conserved regions and identify conserved functional sites
TM scores
Template modeling scores; very similar to RMSD, but they remove the need for arbitrary decisions such as the maximum distance between equivalent residues
CASP
Critical assessment of protein structure prediction; a blind trial to evaluate different protein structure prediction methods that occurs every two years; sequences for testing are sent to predictors prior to revealing the correct experimental structures
Energy minimization
The process of finding the conformation of a protein that corresponds to the lowest possible energy state according to a specified energy function or potential
Molecular dynamics
A simulation that numerically solves Newton’s equations of motion to simulate the movement and interactions of atoms in a protein and its surrounding environment
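A toy sketch of how Newton's equations are integrated numerically, using the velocity Verlet scheme on a single particle in one dimension. The harmonic force, unit mass, and time step below are assumptions for illustration; real simulations apply the same update to every atom using forces from a molecular force field.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate one particle's motion; returns (positions, final_velocity)."""
    a = force(x) / mass
    trajectory = [x]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt  # advance position
        a_new = force(x) / mass             # force at the new position
        v = v + 0.5 * (a + a_new) * dt      # advance velocity with averaged acceleration
        a = a_new
        trajectory.append(x)
    return trajectory, v

def spring_force(x):
    """Hypothetical harmonic restoring force with spring constant 1."""
    return -x

positions, final_velocity = velocity_verlet(
    x=1.0, v=0.0, force=spring_force, mass=1.0, dt=0.01, steps=628
)
```

With these settings the particle completes roughly one oscillation and total energy stays close to its starting value, which is a common sanity check for an integrator.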
Phyre2
An online protein structure prediction server that uses a template library of ~200,000 known 3D structures and HMMs for the known structures
Loop modeling
The computational process of resolving INDELs in new structures, involving subdividing the loop into 2 segments and then repeatedly dividing and transforming each segment until the loop is small enough to be solved
pLDDT
Predicted local distance difference test; the per-residue confidence metric used by AlphaFold
PAE
Predicted alignment error; a metric used by AlphaFold to determine how well predicted the distance between two residues is and assess the confidence of domain packing
ClusPro
A powerful ab initio protein docking server; even so, high quality results are still very difficult to obtain
AF2Complex
A neural network model derived from AlphaFold that predicts structures of multimeric protein complexes without the need for paired MSA
AlphaFold Multimer
An extension of AlphaFold2 that has been specifically built to predict protein-protein complexes; it is slightly outdated and has steep memory requirements
GO
Gene ontology; a universal gene/gene product annotation system that details what the product does, why it performs its activity, and where it acts
STRING
A database that tabulates protein interactions for the purpose of inferring function from the protein’s interactome
NetGo
A protein function prediction model that incorporates a variety of different approaches, including protein interaction network information and a protein language model
DeepTMHMM
The current state-of-the-art prediction method for identifying transmembrane structures and signal peptides, based on deep learning
Coiled-coils
Two or three intertwined alpha-helices that manifest as super helical twists with slight distortion as a result of hydrophobic residue packing
Orthologs
Homologous proteins from different species; they are much more likely to preserve function and are likely to have the same EC classification
Paralogs
Homologous proteins in the same species resulting from gene duplication; they are freer to mutate because their functions are redundant