Bioinformatics Flashcards

1
Q

Phred

A

An algorithm that assigns a quality score (Q) to each base call for most major sequencing technologies, where Q = -10 × log10(P) and P is the estimated probability that the call is wrong
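
A minimal sketch of how a Phred score relates to an error probability, assuming the standard definition Q = -10 × log10(P). The function names are illustrative and are not taken from any particular tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

Under this definition Q20 corresponds to a 1 in 100 chance of a wrong call and Q30 to 1 in 1,000.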

2
Q

Chastity filter

A

A quality filter applied during Illumina base calling: a base call passes if its chastity, the highest (fluorescent) intensity divided by the sum of the highest and second-highest intensities, is no less than 0.6
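
A small sketch of the ratio described on this card, assuming one intensity value per channel for a single cycle. The function names and the example intensities are hypothetical; only the 0.6 threshold comes from the card.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)


def passes_chastity(intensities, threshold=0.6):
    """True if the call is at least as 'pure' as the threshold requires."""
    return chastity(intensities) >= threshold
```

For example, intensities of 900, 100, 50 and 20 give a ratio of 0.9 and pass, while 500, 450, 10 and 5 give roughly 0.53 and fail.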

3
Q

FastQ

A

The text file format used by next-generation sequencing technologies, storing both the sequence and the per-base quality scores of each read
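
A sketch of reading one four-line FASTQ record and decoding its quality string. It assumes the Phred+33 ASCII encoding, which is common but not universal (older files may use a different offset). The function name is illustrative.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, quality_scores)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so '!' is Q0 and 'I' is Q40.
    scores = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, scores
```

Each quality character maps to exactly one base, which is why the two strings must have the same length.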

4
Q

Contig

A

A continuous sequence, assembled from sequence reads, where the base order is known

5
Q

Gap

A

A region where sequencing reads from two ends of a fragment are present on two different contigs

6
Q

Scaffold

A

A genome sequence that’s been reconstructed from contigs and gaps

7
Q

RefSeq

A

A database containing NCBI-curated, non-redundant genomic DNA sequences, transcript RNA, and protein products for major model organisms

8
Q

Greedy

A

The simplest algorithm used for genome assembly; it works by repeatedly merging the pair of sequences with the largest overlap
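
A toy sketch of the greedy idea, assuming error-free reads and exact suffix-prefix overlaps. Real assemblers also handle sequencing errors, reverse complements and reads contained inside other reads, none of which is attempted here. The function names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining sequences stay separate
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)
    return seqs
```

Because each merge is chosen locally, this approach can be misled by repeats, which is one reason graph-based methods are usually preferred for real genomes.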

9
Q

OLC

A

Overlap-layout-consensus; a de novo genome assembly approach that functions by finding the best matches between the prefix of one read and the suffix of another

10
Q

ABySS

A

De novo genome assembly program that uses de Bruijn graph assembly
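
A minimal sketch of building a de Bruijn graph from reads, to illustrate the data structure in general. It is not ABySS's implementation, and the function name is hypothetical. Each k-mer becomes an edge from its first k-1 bases to its last k-1 bases.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

Assembly then amounts to finding paths through this graph, so the cost depends on the number of distinct k-mers rather than on pairwise read comparisons.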

11
Q

Prodigal

A

An ab initio genome annotation program for bacterial and archaeal genomes

12
Q

Genscan

A

An ab initio eukaryotic gene prediction program which uses a known set of genes to train HMMs for prediction, and also takes many other parameters into account

13
Q

Maximal Dependence Decomposition (MDD)

A

Uses information from large MSAs to model the dependencies between nucleotides at different positions in order to predict donor splice sites in algorithms such as Genscan

14
Q

ChIP-seq

A

Chromatin immunoprecipitation followed by sequencing; used to identify genomic sequences where proteins are bound, commonly transcription factor binding sites

15
Q

MeDIP-seq

A

Methylated DNA immunoprecipitation sequencing; a technique analogous to ChIP-seq used to identify methylated DNA sequences

16
Q

Effective genome size

A

Used in the calculation of λBG (background noise) in peak-calling software, and accounts for the variability of mappable regions within a genome

17
Q

GU-AG introns

A

A common type of pre-mRNA intron beginning with GU and ending with AG, commonly accompanied by longer conserved sequences

18
Q

EST

A

Expressed sequence tag; a short sub-sequence of a cDNA sequence used to identify gene transcripts

19
Q

RPKM/FPKM

A

Reads/fragments per kilobase of transcript per million mapped reads; normalize RNA-seq data for gene length and library size, for single-end and paired-end reads respectively
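
A one-function sketch of the RPKM arithmetic, assuming the usual definition (read count divided by gene length in kilobases and by library size in millions of mapped reads). FPKM uses the same formula with fragments in place of reads. The function and parameter names are illustrative.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    length_kb = gene_length_bp / 1_000
    library_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * library_millions)
```

For instance, 500 reads on a 2,000 bp gene in a library of 10 million mapped reads gives 500 / (2 × 10) = 25.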

20
Q

TPM

A

Transcripts per million; an increasingly preferred method for RNA-seq transcript count normalization, calculated by dividing the read counts by the length of each gene in kilobases (giving reads per kilobase, RPK), summing the RPK values across all genes and dividing by 1M to obtain a scaling factor, and then dividing each gene's RPK by that scaling factor
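
A short sketch of the TPM calculation, assuming the usual three steps: length-normalize each gene to reads per kilobase (RPK), derive a per-million scaling factor from the sum of all RPK values, then divide each RPK by that factor. The function and parameter names are illustrative.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for parallel lists of read counts and gene lengths."""
    rpk = [count / (length / 1_000) for count, length in zip(counts, lengths_bp)]
    scaling_factor = sum(rpk) / 1_000_000
    return [value / scaling_factor for value in rpk]
```

Because the length normalization happens before the library-size normalization, TPM values in every sample sum to one million, which makes proportions easier to compare across samples than with RPKM/FPKM.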

21
Q

Cufflinks

A

A program that assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples

22
Q

StringTie

A

A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts, using a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus

23
Q

Pseudoaligner

A

RNA-seq mapping programs that avoid fully aligning each read, instead matching k-mers across reads and transcriptomes

24
Q

False discovery rate (FDR)

A

FDR = (false positives)/(false positives + true positives)

25
Q

Fusion gene

A

Identified by fusion junctions, a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene

26
Q

Hash table

A

A dictionary used in indexing programs – data is stored as a collection of key-value pairs, frequently with the k-mer sequence as the key and the sequence position as the value
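
A minimal sketch of a k-mer hash table, with the k-mer as the key and its positions in the reference as the value. Python's dict is itself a hash table; the function name is illustrative.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every k-mer in the sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)
```

Looking up a k-mer from a read then takes roughly constant time, which is what makes this kind of index useful for finding candidate mapping positions.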

27
Q

Seeding

A

A query sequence is broken up into all possible overlapping 3-letter words (k-mers); each word is scored for similarity against possible matching words using a substitution matrix (e.g. BLOSUM62), and those scoring above a predetermined threshold are kept for database searching

28
Q

Suffix tree

A

A data structure that stores all the suffixes of a string, enabling fast string matching for an initial exact match

29
Q

Suffix array

A

A data structure that indexes all suffixes of a string by their starting positions and stores those positions sorted in alphabetical order of the suffixes
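
A simple sketch of a suffix array and a binary search over it. The construction shown is the naive one (sorting suffixes directly), which is fine for illustration but far too slow and memory-hungry for genome-sized strings. The function names are illustrative.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes, sorted by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via two binary searches on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-letter prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

Since the suffixes are sorted, every occurrence of a pattern sits in one contiguous block of the array, so a lookup needs only a logarithmic number of comparisons.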

30
Q

FM Index

A

An augmentation of the space-efficient BWT (Burrows-Wheeler transform) with additional data (a suffix array) that permits very fast exact string matching
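
A toy sketch of the Burrows-Wheeler transform and the backward search that an FM index performs. It assumes the text does not contain the "$" terminator, builds the BWT naively from sorted rotations, and counts occurrences by rescanning the string; a real FM index precomputes occurrence checkpoints instead. The function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_str, pattern):
    """Number of exact occurrences of pattern, found by backward search."""
    # first_row_start[c]: how many characters in the text sort before c
    first_row_start, total = {}, 0
    for ch in sorted(set(bwt_str)):
        first_row_start[ch] = total
        total += bwt_str.count(ch)
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first_row_start:
            return 0
        # naive rank queries; a real index stores these counts
        lo = first_row_start[ch] + bwt_str[:lo].count(ch)
        hi = first_row_start[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo
```

The search processes the pattern from its last character to its first, narrowing a range of rows at each step, so its cost depends on the pattern length rather than the text length.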

31
Q

E-value

A

Expect value; the number of alignments scoring at least as well as a given alignment that you would expect to find by chance (i.e. false positives) in a database search of that size

32
Q

Batch correction

A

A technique that accounts for technical variation across samples, such as differences in reagents, equipment, and date of library preparation or sequencing

33
Q

Short read mapping

A

A technique used in RNA-seq read mapping that consists of aligning millions of short reads (35-400 bp) to a single long sequence

34
Q

DESeq2

A

A differential expression program that takes raw read counts (not FPKM/RPKM) and normalizes them so that the counts for non-differentially expressed genes are similar between samples

35
Q

Euclidean distance

A

In two-dimensional space (such as a plane), Euclidean distance is the length of the straight line segment connecting two points, i.e. the shortest path between them
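
A short sketch of the calculation: the square root of the sum of squared coordinate differences. The same formula works in any number of dimensions, for example between two expression profiles. The function name is illustrative.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For the points (0, 0) and (3, 4) this gives 5, the hypotenuse of a 3-4-5 right triangle.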

36
Q

Bi-clustering

A

A type of clustering technique used to cluster both rows and columns simultaneously in a dataset, often applied to data matrices where rows represent one type of entity (e.g., genes) and columns represent another type of entity (e.g., experimental conditions)

37
Q

Homoskedasticity

A

A constant variance along the range of mean values

38
Q

Kallisto

A

A pseudoaligner that assigns reads to known transcripts by matching their k-mers against a transcriptome de Bruijn graph (T-DBG), determining which transcripts each read is compatible with rather than exactly where it aligns

39
Q

Overrepresentation analysis

A

Determines whether functional categories are over- or underrepresented in a gene list of interest in comparison to a reference list
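
A sketch of one common way to test for overrepresentation, the one-sided hypergeometric test. The card does not name a specific test, so this choice is an assumption, and the function and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(reference_size, category_size, list_size, overlap):
    """P(at least `overlap` category genes) when drawing `list_size` genes at random.

    reference_size: genes in the reference list
    category_size:  reference genes annotated to the functional category
    list_size:      genes in the list of interest
    overlap:        genes of interest annotated to the category
    """
    upper = min(category_size, list_size)
    favourable = sum(
        comb(category_size, i) * comb(reference_size - category_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(reference_size, list_size)
```

In practice one p-value is computed per category, so the results are usually corrected for multiple testing before interpretation.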

40
Q

Functional class scoring/Gene set enrichment analysis

A

Looks for enrichment of specific gene sets or pathways, but uses a ranked gene list as input

41
Q

Type I Error

A

False positive; probability = α

42
Q

Type II Error

A

False negative; probability = β

43
Q

ClinVar

A

An NCBI-hosted database that aggregates information about genomic variation and its relation to human health

44
Q

Canonical transcript

A

The transcript that is, on average, the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt

45
Q

Online Mendelian Inheritance in Man (OMIM)

A

An online database containing information on all known human genes and Mendelian disorders

46
Q

Humsavar

A

Human polymorphisms and disease mutations index; a manually curated text file listing all human missense variants classified as pathogenic, likely pathogenic, benign, likely benign, or of uncertain significance

47
Q

Exome Aggregation Consortium (ExAC) database

A

A database that collected and reanalyzed the exome sequences of >60,000 individuals from different populations with adult-onset diseases that were sequenced as part of disease-specific and population genetic studies

48
Q

Genome Aggregation Database (gnomAD)

A

A database that aggregates exome and genome sequence data from several large-scale sequencing projects

49
Q

Ensembl

A

A genome database documenting the transcript variations of specific genes and the genomes of mainly vertebrate species

50
Q

Database of Short Genetic Variations (dbSNP)

A

An archive of all short sequence variations for a wide range of organisms, hosted by NCBI; it includes SNPs, INDELs, and multi-base INDELs

51
Q

Catalogue of Somatic Mutations in Cancer (COSMIC)

A

A database that combines genome-wide sequencing results from >28,000 tumors, including details like tissue and variation type distribution

52
Q

Polymorphism Phenotyping v2 (PolyPhen2)

A

A variant effect prediction algorithm that calculates the probability of a variant being damaging using both 3D structural features (surface accessibility, hydrophobicity, etc.) and sequence-based analysis

53
Q

Missense3D

A

An Imperial College-hosted variant effect prediction algorithm that uses 3D structural coordinates to perform an in-depth atom-based study of the effect of a missense variant, and is therefore able to provide the user with information on the mechanism by which the variant may disrupt protein folding/function

54
Q

Sorting Tolerant from Intolerant (SIFT) algorithm

A

A widespread missense variant effect prediction algorithm based on MSA construction

55
Q

Rare Exome Variant Ensemble Learner (REVEL)

A

An ensemble method for predicting the effect of an amino acid substitution by combining many other prediction algorithms, including PolyPhen and SIFT; it performs better than individual predictors

56
Q

Variant Effect Predictor (VEP)

A

An Ensembl-hosted platform that runs different prediction algorithms, including PolyPhen and SIFT, for the user simultaneously

57
Q

Single Amino Acid Polymorphism (SAAP)

A

A data analysis pipeline and predictor that performs in-depth analysis of the structural effect of an amino acid substitution on a protein structure using residue conservation and experimental 3D structures

58
Q

Pseudoexon

A

The product of mutations in regions important for alternative splicing: the creation of de novo splice sites or the strengthening of existing weak splice sites causes intronic sequence to be included as an exon, resulting in transcripts subject to premature degradation or production of a modified protein

59
Q

Pharmacogenetics

A

The study of how genetic factors affect interindividual variability in drug response

60
Q

Aspartic acid, glutamic acid

A

Negatively charged polar amino acids

61
Q

Arginine, lysine, histidine

A

Positively charged polar amino acids

62
Q

Asparagine, glutamine, serine, threonine, tyrosine

A

Uncharged polar amino acids

63
Q

Alanine, glycine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan, cysteine

A

Nonpolar amino acids

64
Q

Dihedral angles

A

The amino acid main chain torsion angles, important for determining secondary and tertiary structure

65
Q

Protein fold classes

A

α/α: packing of alpha-helices
β/β: one or more beta-sheets
α/β: roughly alternating alpha-helices and beta-strands; the beta-sheets are commonly parallel
α+β: mixed alpha-helices and beta-sheets

66
Q

RMSD

A

Root mean square deviation; a metric used to quantify the similarity between protein structures by superimposing them in the orientation which minimizes the value and calculating the square root of the mean squared distance between equivalent positions
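
A minimal sketch of the RMSD arithmetic for two lists of equivalent atom coordinates. It assumes the structures have already been superimposed; finding the optimal superposition is a separate step that is not shown. The function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Because distances are squared before averaging, a few badly placed residues can dominate the value, which is part of the motivation for alternatives such as the TM score.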

67
Q

Enzyme commission (EC) classification

A

EC 1.2.3.4: the first number is the class, the second and third numbers are functional definitions (subclass and sub-subclass), and the fourth number gives the substrate specificity

68
Q

PDB

A

Protein Data Bank; an EBI-hosted database of ~120,000 non-redundant protein structures, generally of high quality

69
Q

SCOPe

A

A partly automated successor to SCOP, which aimed to organize PDB data by manual structure comparison; it uses a hierarchical classification: class, fold, superfamily, family, protein domain and species

70
Q

CATH

A

A protein structure database which organizes proteins by partially automatic structural alignment and uses a hierarchical classification: class, architecture, topology, homologous superfamily, sequence family

71
Q

SWISSPROT

A

A high quality source of annotation for a selection of protein sequences

72
Q

UniProtKB

A

UniProt Knowledge Base; a European-based protein database with 250M sequences and ~600,000 high quality SWISSPROT annotations

73
Q

TrEMBL

A

EBI’s protein sequence database of automatically annotated, unreviewed sequences translated from EMBL nucleotide entries

74
Q

MGnify

A

EBI’s metagenomics database containing 350,000 amplicons from 33,000 metagenomes organized by biomes such as human, digestive system, aquatic, soil, skin, wastewater, etc.

75
Q

Conservative substitutions

A

An amino acid substitution that maintains the chemical properties of the original amino acid

76
Q

PAM

A

Point accepted mutation; an amino acid sequence alignment scoring scheme developed in the 1970s, based on counting the number of times residue types change in closely homologous sequences

77
Q

BLOSUM62

A

Blocks substitution matrix; an amino acid sequence alignment scoring scheme developed in the 1990s, derived from conserved protein motifs (blocks), which effectively filters out noise

78
Q

Needleman-Wunsch Algorithm

A

A general sequence comparison algorithm that maximizes similarity scores to find the best global alignment of any two sequences
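
A compact sketch of the dynamic programming behind global alignment, returning only the optimal score. It assumes a simple match/mismatch scheme and a linear gap penalty, with arbitrary illustrative values; real protein alignments would use a substitution matrix such as BLOSUM62. The function name is illustrative.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):  # aligning a prefix of a against nothing
        score[i][0] = i * gap
    for j in range(1, cols):  # aligning a prefix of b against nothing
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]
```

The alignment itself can be recovered by tracing back from the bottom-right cell; that step is omitted here to keep the sketch short.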

79
Q

Smith-Waterman Algorithm

A

A sequence comparison algorithm that compares segments of all possible lengths (i.e. local alignments) and chooses whichever maximizes the similarity measure
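
A compact sketch of local alignment scoring. Compared with global alignment there are two changes: cell scores are never allowed to drop below zero, and the answer is the highest score anywhere in the table. The scoring values are arbitrary illustrative choices and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any segment of a and any segment of b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # the 0 lets an alignment restart instead of carrying a negative score
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best
```

Only two rows are kept in memory because this sketch reports the score alone; recovering the aligned segments would require the full table and a traceback.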

80
Q

P-value (p)

A

The probability of achieving the returned score or a better score by chance; the probability of obtaining a value at least as extreme as the observed result assuming the null hypothesis is correct

81
Q

CLUSTAL

A

A multiple sequence alignment program that builds MSAs using guide trees

82
Q

PROSITE

A

A database of protein sequence patterns identified using multiple sequence alignments, closely linked with SWISSPROT

83
Q

Hidden Markov Models (HMMs)

A

A probabilistic method for representing protein families, built from an MSA by scoring residue similarity at each position, allowing for the detection of distant family relationships

84
Q

PFAM

A

A database of protein domain family HMMs

85
Q

InterPro

A

An expansive database that consolidates information from a wide range of protein databases, such as PROSITE, PFAM, PANTHER, ProDOM (homologous domains), UNIPROT, etc., for the purpose of unifying research approaches and protein terminology

86
Q

PSIBLAST

A

Position-specific iterated BLAST; an algorithm that builds an MSA and a PSSM (position-specific scoring matrix) with the query sequence and uses this to further search the database, amplifying conserved regions and identifying conserved functional sites through iteration

87
Q

TM scores

A

Template modeling scores; very similar to RMSD, but they remove the requirement for arbitrary decisions such as the maximum distance between equivalent residues

88
Q

CASP

A

Critical assessment of protein structure prediction; a blind trial to evaluate different protein structure prediction methods that occurs every two years; sequences for testing are sent to predictors prior to revealing the correct experimental structures

89
Q

Energy minimization

A

The process of finding the conformation of a protein that corresponds to the lowest possible energy state according to a specified energy function or potential

90
Q

Molecular dynamics

A

A simulation that numerically solves Newton’s equations of motion to simulate the movement and interactions of atoms in a protein and its surrounding environment
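
A one-dimensional toy sketch of numerically integrating Newton's equations of motion. It uses the velocity Verlet scheme, which is one common integrator and is assumed here for illustration; the card does not name a method. A real simulation would use a molecular force field over many atoms, and the function and parameter names are hypothetical.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate one particle; returns the list of positions and the final velocity."""
    acceleration = force(x) / mass
    trajectory = [x]
    for _ in range(steps):
        x = x + v * dt + 0.5 * acceleration * dt * dt
        new_acceleration = force(x) / mass
        v = v + 0.5 * (acceleration + new_acceleration) * dt
        acceleration = new_acceleration
        trajectory.append(x)
    return trajectory, v


def spring_force(x, k=1.0):
    """Harmonic restoring force, standing in for a bonded interaction."""
    return -k * x
```

Starting a unit mass at x = 1 with zero velocity, the particle oscillates and its total energy stays close to the initial value, which is the usual sanity check for an integrator and time step.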

91
Q

Phyre2

A

An online protein structure prediction server that uses a template library of ~200,000 known 3D structures and HMMs for the known structures

92
Q

Loop modeling

A

The computational process of resolving INDELs in new structures, involving subdividing the loop into 2 segments and then repeatedly dividing and transforming each segment until the loop is small enough to be solved

93
Q

pLDDT

A

Predicted local distance difference test; the per-residue confidence metric used by AlphaFold

94
Q

PAE

A

Predicted alignment error; a metric used by AlphaFold to determine how well predicted the distance between two residues is and assess the confidence of domain packing

95
Q

ClusPro

A

A powerful ab initio protein docking server; even so, high-quality results are still very difficult to obtain

96
Q

AF2Complex

A

A neural network model derived from AlphaFold that predicts structures of multimeric protein complexes without the need for paired MSAs

97
Q

AlphaFold Multimer

A

An extension of AlphaFold2 that has been specifically built to predict protein-protein complexes; it is slightly outdated and has steep memory requirements

98
Q

GO

A

Gene ontology; a universal gene/gene product annotation system that details what the product does, why it performs its activity, and where it acts

99
Q

STRING

A

A database that tabulates protein interactions for the purpose of inferring function from the protein’s interactome

100
Q

NetGo

A

A protein function prediction model that incorporates a variety of different approaches, including a protein language model

101
Q

DeepTMHMM

A

The current state-of-the-art prediction method for identifying transmembrane structures and signal peptides, based on deep learning

102
Q

Coiled-coils

A

Two or three intertwined alpha-helices that manifest as super helical twists with slight distortion as a result of hydrophobic residue packing

103
Q

Orthologs

A

Homologous proteins that come from different species; they are much more likely to preserve function and are likely to have the same EC classification

104
Q

Paralogs

A

Homologous proteins in the same species that result from gene duplication; they are freer to mutate because they have redundant functions