Bioinformatics Flashcards

1
Q

Phred

A

An algorithm that assigns a quality score (Q) to each base call, reflecting the estimated probability that the call is wrong; used with most major sequencing technologies
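A minimal sketch of how a Phred score relates to the error probability, assuming the conventional definition Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The function names are illustrative and do not come from any particular tool.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the estimated error probability P."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred quality score Q."""
    return -10 * math.log10(p)

# Q30 corresponds to roughly 1 error in 1,000 base calls.
print(phred_to_error_prob(30))
print(error_prob_to_phred(0.01))
```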

2
Q

Chastity filter

A

A quality filter used in Illumina base calling: chastity is the highest (fluorescent) intensity divided by the sum of the highest and second-highest intensities, and a base passes if this ratio is no less than 0.6
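The ratio described above can be sketched as follows. This is only an illustration: the input is assumed to be the four channel intensities for one cycle, and the function names are hypothetical rather than part of any Illumina software.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)

def passes_chastity(intensities, threshold=0.6):
    """True if the chastity ratio is no less than the threshold."""
    return chastity(intensities) >= threshold

# One clear signal passes; two nearly equal signals do not.
print(passes_chastity([900, 100, 50, 20]))
print(passes_chastity([500, 450, 10, 5]))
```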

3
Q

FastQ

A

The file format used by next-generation sequencing technologies, including both the sequence and the quality score
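As a rough illustration of the format, the sketch below reads simple four-line FASTQ records (header, sequence, separator, quality string) and decodes the quality characters. It assumes Phred+33 encoding and unwrapped records; `parse_fastq` is a hypothetical helper, not a library function.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_scores) from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Assumption: Phred+33 encoding, so '!' is Q0 and 'I' is Q40.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores

record = "@read1\nACGT\n+\nII5!\n"
for read_id, seq, scores in parse_fastq(record):
    print(read_id, seq, scores)
```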

4
Q

Contig

A

A continuous sequence, assembled from sequence reads, where the base order is known

5
Q

Gap

A

A region where sequencing reads from two ends of a fragment are present on two different contigs

6
Q

Scaffold

A

A genome sequence that’s been reconstructed from contigs and gaps

7
Q

RefSeq

A

A database containing NCBI curated non-redundant genomic DNA sequences, transcript RNA, and protein products for major model organisms

8
Q

Greedy

A

The simplest algorithm used for genome assembly; it works by repeatedly merging the pair of sequences with the largest overlap
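A toy sketch of the greedy idea, assuming error-free reads from a single strand and exact suffix-prefix overlaps. Real assemblers must handle sequencing errors, reverse complements, and repeats; the function names and the `min_overlap` parameter are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_overlap=1):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_i is None or best_len < min_overlap:
            break  # no remaining pair overlaps enough to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATTAGACC", "GACCTGCC", "TGCCGGAA"]))
```

If no pair overlaps by at least `min_overlap`, the remaining sequences are returned unmerged, which loosely corresponds to separate contigs.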

9
Q

OLC

A

Overlap-layout-consensus; a de novo genome assembly approach that works by finding the best matches between the prefix of one read and the suffix of another, laying the reads out according to those overlaps, and deriving a consensus sequence

10
Q

ABySS

A

A de novo genome assembly program that uses de Bruijn graph assembly
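To illustrate the underlying data structure, here is a minimal sketch of building a de Bruijn graph from reads, where each k-mer becomes an edge between its (k-1)-mer prefix and suffix. This is a generic textbook construction and is not ABySS's actual implementation; `de_bruijn_graph` is a hypothetical name.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its prefix node to its suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

print(de_bruijn_graph(["ACGT"], 3))
```

An assembler would then walk paths through this graph to reconstruct contigs.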

11
Q

Prodigal

A

An ab initio genome annotation program for bacterial and archaeal genomes

12
Q

Genscan

A

An ab initio eukaryotic genome annotation model which uses a known set of genes to create HMMs for prediction, as well as consideration of many other parameters

13
Q

Maximal Dependence Decomposition (MDD)

A

Uses information from large MSAs to model the dependencies between nucleotides at different positions in order to predict donor splice sites in algorithms such as Genscan

14
Q

ChIP-seq

A

Chromatin immunoprecipitation followed by sequencing; used to identify the DNA sequences where proteins are bound, commonly transcription factor binding sites

15
Q

MeDIP-seq

A

A technique adjacent to ChIP-seq used to identify methylated DNA sequences

16
Q

Effective genome size

A

Used in the calculation of λBG (background noise) in peak-calling software, and accounts for the variability of mappable regions within a genome

17
Q

GU-AG introns

A

A common type of pre-mRNA intron beginning with GU and ending with AG, commonly accompanied by longer conserved sequences

18
Q

EST

A

Expressed sequence tag; a short sub-sequence of a cDNA sequence used to identify gene transcripts

19
Q

RPKM/FPKM

A

Reads/fragments per kilobase of transcript per million mapped reads; normalizes RNA-seq data for gene length and library size, for single-end and paired-end reads respectively

20
Q

TPM

A

Transcripts per million; an increasingly preferred method for RNA-seq transcript count normalization, calculated by dividing the read count of each gene by its length in kilobases (reads per kilobase, RPK), summing the RPK values across all genes and dividing by 1M to obtain a scaling factor, and then dividing each RPK value by that scaling factor
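The difference between RPKM and TPM is mostly the order of the two normalization steps, which the sketch below tries to make concrete. It assumes one raw count and one length (in base pairs) per gene; the function names are illustrative.

```python
def rpkm(counts, lengths_bp):
    """Normalize for library size (per million reads), then for gene length (per kb)."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (length / 1e3) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Normalize for gene length first (RPK), then scale so the values sum to 1M."""
    rpk = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

counts = [10, 20, 30]
lengths = [1000, 2000, 3000]
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

Because TPM values always sum to one million within a sample, they are easier to compare between samples than RPKM values, whose totals vary.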

21
Q

Cufflinks

A

A program that assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples

22
Q

StringTie

A

A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts, using a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus

23
Q

Pseudoaligner

A

RNA-seq mapping programs that avoid fully aligning each read, instead matching k-mers across reads and transcriptomes

24
Q

False discovery rate (FDR)

A

FDR = (false positives)/(false positives + true positives)

25
Fusion gene
Identified by fusion junctions, a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene
26
Hash table
A dictionary used in indexing programs – data is stored as a collection of key-value pairs, frequently with the k-mer sequence as the key and the sequence position as the value
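A small sketch of such an index, using a Python dictionary as the hash table with each k-mer as the key and its start positions as the value. `build_kmer_index` is a hypothetical helper written for illustration.

```python
from collections import defaultdict

def build_kmer_index(sequence, k):
    """Map every k-mer in the sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index

index = build_kmer_index("ACGACG", 3)
print(index["ACG"])  # positions where the k-mer ACG starts
```

Looking up a k-mer from a read then gives candidate positions in the reference in roughly constant time.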
27
Seeding
A query sequence is broken up into all possible overlapping words of length k (k-mers, e.g. 3-letter words for proteins); candidate words are scored for similarity against each query word using a substitution matrix (e.g. BLOSUM62), and those scoring above a predetermined threshold are kept for database searching
28
Suffix tree
A data structure that stores all the suffixes of a string, enabling fast string matching for an initial exact match
29
Suffix array
A data structure that lists the starting positions of all suffixes of a string, sorted by the alphabetical (lexicographic) order of the suffixes
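A naive sketch of building a suffix array and using binary search on it for exact matching. The construction here simply sorts the suffixes, which is fine for short strings but far too slow for genomes; the function names are illustrative.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, ordered by the suffixes' lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return sorted start positions of exact matches of pattern in text."""
    m = len(pattern)
    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ana"))
```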
30
FM Index
An augmentation of the space-efficient BWT with additional data (a suffix array) that permits very fast exact string matching
31
E-value
Expect value; the number of alignments with a score at least as good as the one observed that you would expect to find by chance when searching a database of that size
32
Batch correction
A technique that accounts for technical variation across samples, such as differences in reagents, equipment, and date of library preparation or sequencing
33
Short read mapping
A technique used in RNA-seq read mapping, consists of aligning millions of short reads (35-400bp) to a single long sequence
34
DESeq2
A program that takes raw read counts (not FPKM/RPKM) and normalizes them so that the counts for non-differentially expressed genes are similar between samples
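A rough sketch of the median-of-ratios idea behind this kind of normalization, written in plain Python. DESeq2 itself is an R package, so none of the names below are its API, and restricting the calculation to genes with nonzero counts in every sample is a simplifying assumption of this sketch.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: one list per gene, holding the raw count for each sample."""
    n_samples = len(counts[0])
    # Keep genes with nonzero counts in every sample so the logarithm is defined.
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for s in range(n_samples):
        log_ratios = [math.log(gene[s]) - m for gene, m in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalize(counts):
    """Divide each raw count by the size factor of its sample."""
    sf = size_factors(counts)
    return [[c / f for c, f in zip(gene, sf)] for gene in counts]

# Sample 2 was sequenced twice as deeply as sample 1; no gene is differentially expressed.
raw = [[10, 20], [100, 200], [50, 100]]
print(size_factors(raw))
print(normalize(raw))
```

Using the median makes the size factor robust to a minority of genes that really are differentially expressed.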
35
Euclidean distance
In two-dimensional space (such as a plane), Euclidean distance is the length of the shortest path connecting two points
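For reference, a short sketch of the calculation, which generalizes from two dimensions to any number of dimensions (for example, expression values across many genes). The function name is illustrative.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared differences between coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))
```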
36
Bi-clustering
A type of clustering technique used to cluster both rows and columns simultaneously in a dataset, often applied to data matrices where rows represent one type of entity (e.g., genes) and columns represent another type of entity (e.g., experimental conditions)
37
Homoskedasticity
A constant variance along the range of mean values
38
Kallisto
A pseudoaligner that determines which known transcripts each read is compatible with by matching its k-mers against a transcriptome de Bruijn graph (T-DBG), rather than determining exactly where in the sequence the read aligns
39
Overrepresentation analysis
Determines whether functional categories are over- or underrepresented in a gene list of interest in comparison to a reference list
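One common way to test for overrepresentation is a one-sided hypergeometric test, sketched below. The choice of test and the parameter names are assumptions made for illustration; the sketch covers overrepresentation only and applies no multiple-testing correction.

```python
from math import comb

def ora_p_value(n_reference, n_in_category, n_in_list, n_overlap):
    """P(overlap >= n_overlap) when drawing n_in_list genes at random from the reference."""
    upper = min(n_in_category, n_in_list)
    favourable = sum(
        comb(n_in_category, i) * comb(n_reference - n_in_category, n_in_list - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_reference, n_in_list)

# 20,000 reference genes, 200 in the category, 100 genes of interest, 8 of them in the category.
print(ora_p_value(20000, 200, 100, 8))
```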
40
Functional class scoring/Gene set enrichment analysis
Looks for enrichment of specific gene sets or pathways, but uses a ranked gene list as input
41
Type I Error
False positive; probability = α
42
Type II Error
False negative; probability = β
43
ClinVar
An NCBI-hosted database that aggregates information about genomic variation and its relation to human health
44
Canonical transcript
The transcript that is, on average, the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt
45
Online Mendelian Inheritance in Man (OMIM)
An online database containing information on all known human genes and Mendelian disorders
46
Humsavar
Human polymorphisms and disease mutations index; a manually curated text file listing all human missense variants classified as pathogenic, likely pathogenic, benign, likely benign, or of uncertain significance
47
Exome Aggregation Consortium (ExAC) database
A database that collected and reanalyzed the exome sequences of >60,000 individuals from different populations with adult-onset diseases that were sequenced as part of disease-specific and population genetic studies
48
Genome Aggregation Database (gnomAD)
A database that aggregates exome and genome sequence data from several large-scale sequencing projects
49
Ensembl
A database documenting the transcript variations of specific genes and the genomes of mainly vertebrate species
50
Database of Short Genetic Variations (dbSNP)
An NCBI-hosted archive of short sequence variations for a wide range of organisms, including SNPs, INDELs, and multi-base INDELs
51
Catalogue of Somatic Mutations in Cancer (COSMIC)
A database that combines genome-wide sequencing results from >28,000 tumors, including details like tissue and variation type distribution
52
Polymorphism Phenotyping v2 (PolyPhen2)
A variant effect prediction algorithm that calculates the probability of a variant being damaging using both 3D structural features (surface accessibility, hydrophobicity, etc.) and sequence-based analysis
53
Missense3D
A variant effect prediction algorithm hosted by Imperial College that uses 3D structural coordinates to perform an in-depth, atom-based study of the effect of a missense variant, and can therefore provide the user with information on the mechanism by which the variant may disrupt protein folding or function
54
Sorting Intolerant from Tolerant (SIFT) algorithm
A widely used missense variant effect prediction algorithm based on MSA construction
55
Rare Exome Variant Ensemble Learner (REVEL)
An ensemble method for predicting the effect of an amino acid substitution by combining many other prediction algorithms, including PolyPhen and SIFT; it performs better than the individual predictors
56
Variant Effect Predictor (VEP)
An Ensembl-hosted platform that runs different prediction algorithms, including PolyPhen and SIFT, for the user simultaneously
57
Single Amino Acid Polymorphism (SAAP)
A data analysis pipeline and predictor that performs in-depth analysis of the structural effect of an amino acid substitution on a protein using residue conservation and experimental 3D structures
58
Pseudoexon
An intronic sequence that is wrongly included in the mature transcript as an exon; the product of mutations in regions important for splicing that create de novo splice sites or strengthen existing weak splice sites, resulting in transcripts subject to premature degradation or production of a modified protein
59
Pharmacogenetics
The study of how genetic factors affect interindividual variability in drug response
60
Aspartic acid, glutamic acid
Negatively charged polar amino acids
61
Arginine, lysine, histidine
Positively charged polar amino acids
62
Asparagine, glutamine, serine, threonine, tyrosine
Uncharged polar amino acids
63
Alanine, glycine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan, cysteine
Nonpolar amino acids
64
Dihedral angles
The amino acid main chain torsion angles, important for determining secondary and tertiary structure
65
Protein fold classes
α/α: packing of alpha-helices; β/β: one or more beta-sheets; α/β: roughly alternating alpha-helices and beta-sheets, where the beta-sheets are commonly parallel; α+β: mixed alpha-helices and beta-sheets
66
RMSD
Root mean square deviation; a metric used to quantify the similarity between protein structures by superimposing them in the orientation that minimizes the value and calculating the square root of the mean squared distance between equivalent positions
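A minimal sketch of the RMSD calculation itself. It assumes the two structures have already been superimposed and that equivalent atoms are listed in the same order; the optimal superposition step is not shown. The function name is illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))
```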
67
Enzyme commission (EC) classification
A four-part number of the form EC 1.2.3.4, where 1 = class, 2 and 3 = functional definition, and 4 = substrate specificity
68
PDB
Protein Data Bank; an EBI-hosted database of ~120,000 non-redundant protein structures, generally of high quality
69
SCOPe
A partly automated extension of SCOP, which aimed to organize PDB data by manual structure comparison; it uses a hierarchical classification: class, fold, superfamily, family, protein domain, and species
70
CATH
A protein structure database that organizes proteins by partially automated structural alignment and uses a hierarchical classification: class, architecture, topology, homologous superfamily, and sequence family
71
SWISSPROT
A high quality source of annotation for a selection of protein sequences
72
UniProtKB
UniProt Knowledge Base; a European-based protein database with 250M sequences and ~600,000 high quality SWISSPROT annotations
73
TrEMBL
EBI's protein sequence database; the automatically annotated, unreviewed section of UniProtKB
74
MGnify
EBI's metagenomics database containing 350,000 amplicons from 33,000 metagenomes organized by biomes such as human, digestive system, aquatic, soil, skin, wastewater, etc.
75
Conservative substitutions
An amino acid substitution that maintains the chemical properties of the original amino acid
76
PAM
Point accepted mutation; an amino acid sequence alignment scoring scheme developed in the 1970s, based on counting the number of times residue types change in closely homologous sequences
77
BLOSUM62
Blocks substitution matrix; an amino acid sequence alignment scoring scheme developed in the 1990s, derived from conserved protein motifs, which effectively filters out noise
78
Needleman-Wunsch Algorithm
A general sequence comparison algorithm that maximizes similarity scores to find the best global alignment of any two sequences
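A compact sketch of the dynamic programming recurrence for the global alignment score. The scoring values (match +1, mismatch -1, linear gap -1) are arbitrary assumptions for illustration, and the traceback that recovers the alignment itself is omitted.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b under a simple scoring scheme."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

print(needleman_wunsch_score("ACGT", "AGT"))
```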
79
Smith-Waterman Algorithm
A sequence comparison algorithm that compares segments of all possible lengths (i.e. local alignments) and chooses whichever maximizes the similarity measure
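A sketch of the local alignment score, keeping only two rows of the matrix in memory. The key feature of local alignment is that cell scores are never allowed to drop below zero, so an alignment can start anywhere. The scoring values are assumptions, and traceback is omitted.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b under a simple scoring scheme."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets a new local alignment start at any cell.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```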
80
P-value (p)
The probability of achieving the returned score or a better score by chance; the probability of obtaining a value at least as extreme as the observed result assuming the null hypothesis is correct
81
CLUSTAL
A multiple sequence alignment program that builds MSAs using guide trees
82
PROSITE
A database of protein sequence patterns identified using multiple sequence alignments; closely linked with SWISSPROT
83
Hidden Markov Models (HMMs)
A probabilistic method for representing protein families, built from an MSA and scoring for residue similarity and position, allowing for the detection of distant family relationships
84
PFAM
A database of protein domain family HMMs
85
InterPro
An expansive database that consolidates information from a wide range of protein databases, such as PROSITE, PFAM, PANTHER, ProDOM (homologous domains), UNIPROT, etc., for the purpose of unifying research approaches and protein terminology
86
PSIBLAST
An algorithm that builds an MSA and a PSSM from the query sequence and its hits, then uses the PSSM to search the database again, iterating to amplify conserved regions and identify conserved functional sites
87
TM scores
Template modeling scores; very similar to RMSD, but they remove the requirement for arbitrary decisions such as the maximum distance between equivalent residues
88
CASP
Critical assessment of protein structure prediction; a blind trial to evaluate different protein structure prediction methods that occurs every two years; sequences for testing are sent to predictors prior to revealing the correct experimental structures
89
Energy minimization
The process of finding the conformation of a protein that corresponds to the lowest possible energy state according to a specified energy function or potential
90
Molecular dynamics
A simulation that numerically solves Newton's equations of motion to simulate the movement and interactions of atoms in a protein and its surrounding environment
91
Phyre2
An online protein structure prediction server that uses a template library of ~200,000 known 3D structures and HMMs for the known structures
92
Loop modeling
The computational process of resolving INDELs in new structures, involving subdividing the loop into 2 segments and then repeatedly dividing and transforming each segment until the loop is small enough to be solved
93
pLDDT
Predicted local distance difference test; the per-residue confidence metric used by AlphaFold
94
PAE
Predicted aligned error; a metric used by AlphaFold to indicate how confidently the relative position of two residues is predicted, and to assess the confidence of domain packing
95
ClusPro
A powerful ab initio protein docking server, although high-quality results are still very difficult to obtain
96
AF2Complex
A neural network model derived from AlphaFold that predicts structures of multimeric protein complexes without the need for paired MSA
97
AlphaFold Multimer
An extension of AlphaFold2 that has been specifically built to predict protein-protein complexes; it is slightly outdated and has steep memory requirements
98
GO
Gene ontology; a universal gene/gene product annotation system that details what the product does, why it performs its activity, and where it acts
99
STRING
A database that tabulates protein interactions for the purpose of inferring function from the protein's interactome
100
NetGO
A protein function prediction model that incorporates a variety of different approaches, including protein interaction network information and a protein language model
101
DeepTMHMM
The current state-of-the-art prediction method for identifying transmembrane structures and signal peptides, based on deep learning approaches
102
Coiled-coils
Two or three intertwined alpha-helices that manifest as super helical twists with slight distortion as a result of hydrophobic residue packing
103
Orthologs
Homologous proteins that come from different species; they are much more likely to preserve function and are likely to have the same EC classification
104
Paralogs
Homologous proteins in the same species that result from gene duplication; they are freer to mutate because their functions are redundant