Bioinformatics Flashcards

Question

Fusion gene

Answer 1

Identified by fusion junctions, a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene

Answer 2

A dictionary used in indexing programs – data is stored as a collection of key-value pairs, frequently with the k-mer sequence as the key and the sequence position as the value
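
As a rough illustration of the key-value idea above, here is a minimal sketch of a k-mer index. The function name `build_kmer_index` and the example sequence are hypothetical, chosen only for this card; real indexing programs use far more compact encodings.

```python
from collections import defaultdict

def build_kmer_index(sequence, k):
    """Map each k-mer (key) to the 0-based start positions where it occurs (value).

    Hypothetical helper for illustration only.
    """
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)

# Toy sequence, not real data
index = build_kmer_index("GATTACATTA", 3)
print(index["ATT"])
```

Looking up a k-mer is then a single dictionary access, which is why this structure suits the seeding step of an aligner.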

Answer 3

A query sequence is broken up into all possible overlapping 3-letter words (k-mers) and scored for similarity against each other using a matrix (e.g. BLOSUM62), and those scoring above a predetermined threshold are kept for database searching

Answer 4

A data structure that stores all the suffixes of a string, enabling fast string matching for an initial exact match

Answer 5

A data structure that stores all possible suffixes of a string, indexes them, and then sorts the suffixes by alphabetical order
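
A minimal sketch of the idea, assuming a short string where sorting the suffixes directly is affordable. The names `build_suffix_array` and `find_occurrences` are hypothetical, and production tools build the array with much faster algorithms.

```python
def build_suffix_array(text):
    """Return suffix start positions, ordered alphabetically by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary-search the sorted suffixes for every exact match of pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "banana"  # toy example
suffix_array = build_suffix_array(text)
hits = find_occurrences(text, suffix_array, "ana")
```

Because the suffixes are sorted, all matches of a pattern sit in one contiguous block of the array, which the two binary searches delimit.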

Answer 6

An augmentation of the space-efficient BWT with additional data (a suffix array) that permits very fast exact string matching
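
The sketch below shows, under simplifying assumptions, how exact matching over a BWT can work. It only counts matches by backward search; reporting their positions would additionally need the (sampled) suffix array mentioned above. The names `bwt` and `count_matches` are hypothetical, and the naive rotation sort and `str.count` calls stand in for the precomputed tables a real index would use.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; text must end with '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def count_matches(bwt_string, pattern):
    """Count exact occurrences of pattern using backward search."""
    # first_col_start[c]: number of characters in the text smaller than c
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_col_start, total = {}, 0
    for ch in sorted(counts):
        first_col_start[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_string[:i]; computed naively here
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in first_col_start:
            return 0
        lo = first_col_start[ch] + occ(ch, lo)
        hi = first_col_start[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("banana$")  # toy example
n_hits = count_matches(transformed, "ana")
```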

Answer 7

Expect value; the number of hits with a score at least as good as the observed one that you would expect to find by chance in a database search

Answer 8

A technique that accounts for technical errors across samples such as differences in reagents, equipment, and date of library preparation or sequencing

Answer 9

A technique used in RNA-seq read mapping that consists of aligning millions of short reads (35-400 bp) to a single long sequence

Answer 10

A program that uses raw read counts and not FPKM/RPKM to make normalized counts for non-differentially expressed genes similar between samples

Answer 11

In two-dimensional space (such as a plane), Euclidean distance is the length of the shortest path connecting two points
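
For illustration, a small sketch of the calculation for points of any dimension; the function name `euclidean_distance` is hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Classic 3-4-5 right triangle
distance = euclidean_distance((0, 0), (3, 4))
```

In expression analysis the same formula is applied to vectors of expression values, one coordinate per gene or per sample.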

Answer 12

A type of clustering technique used to cluster both rows and columns simultaneously in a dataset, often applied to data matrices where rows represent one type of entity (e.g., genes) and columns represent another type of entity (e.g., experimental conditions)

Answer 13

A constant variance along the range of mean values

Answer 14

A pseudoaligner that maps known transcripts depending on their location in the genome, which is stored in a transcriptome de Bruijn graph (T-DBG), rather than which sequence they align to

Answer 15

Determines whether each functional category in a list is over- or underrepresented in a gene list of interest in comparison to a reference list
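
One common way to test overrepresentation is a one-sided hypergeometric test; the card does not name a specific test, so treat this as an assumed choice. The function name `enrichment_p_value` and its parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(population, category_total, sample_size, category_in_sample):
    """P(X >= category_in_sample) under the hypergeometric distribution.

    population:         genes in the reference list
    category_total:     reference genes annotated to the category
    sample_size:        genes in the list of interest
    category_in_sample: genes of interest annotated to the category
    """
    upper = min(category_total, sample_size)
    favourable = sum(
        comb(category_total, k) * comb(population - category_total, sample_size - k)
        for k in range(category_in_sample, upper + 1)
    )
    return favourable / comb(population, sample_size)

# Toy numbers: all 5 genes of interest fall in a 5-gene category out of 10 genes
p = enrichment_p_value(10, 5, 5, 5)
```

In practice the p-values for all tested categories would also be corrected for multiple testing.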

Answer 16

Looks for enrichment of specific gene sets or pathways, but uses a ranked gene list as input

Answer 17

False positive; probability = α

Answer 18

False negative; probability = β

Answer 19

An NCBI-hosted database that aggregates information about genomic variation and its relation to human health

Answer 20

The transcript that is, on average, the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt

Answer 21

An online database containing information on all known human genes and mendelian disorders

Answer 22

Human polymorphisms and disease mutations index; a manually curated text file listing all human missense variants classified as pathogenic, likely pathogenic, benign, likely benign, or of uncertain significance

Answer 23

A database that collected and reanalyzed the exome sequences of >60,000 individuals from different populations with adult-onset diseases that were sequenced as part of disease-specific and population genetic studies

Answer 24

A database that aggregates exome and genome sequence data from several large-scale sequencing projects

Answer 25

An additional database documenting the transcript variations of specific genes and the genomes of mainly vertebrate species

Answer 26

An archive of all short sequence variations for a wide range of organisms, hosted by NCBI; includes SNPs, INDELs, and multi-base INDELs

Answer 27

A database that combines genome-wide sequencing results from >28,000 tumors, including details like tissue and variation type distribution

Answer 28

A variant effect prediction algorithm that calculates the probability of a variant being damaging using both 3D structural features (surface accessibility, hydrophobicity, etc.) and sequence-based analysis

Answer 29

An Imperial College-hosted variant effect prediction algorithm that uses 3D structural coordinates to perform an in-depth atom-based study of the effect of a missense variant and is therefore able to provide the user with information on the mechanism by which the variant may disrupt protein folding/function

Answer 30

A widely used missense variant effect prediction algorithm based on MSA construction

Answer 31

An ensemble method for predicting the effect of an amino acid substitution by combining many other prediction algorithms, including PolyPhen and SIFT; it performs better than individual predictors

Answer 32

An Ensembl-hosted platform that runs different prediction algorithms, including PolyPhen and SIFT, for the user simultaneously

Answer 33

A data analysis pipeline and predictor that performs in-depth analysis of the structural effect of an amino acid substitution on a protein structure using residue conservation and experimental 3D structures

Answer 34

The product of mutations in regions important for alternative splicing due to the creation of de-novo splice sites or strengthening of existing weak splice sites, resulting in transcripts subject to premature degradation or production of a modified protein

Answer 35

The study of how genetic factors affect interindividual variability in drug response

Answer 36

Negatively charged polar amino acids

Answer 37

Positively charged polar amino acids

Answer 38

Uncharged polar amino acids

Answer 39

Nonpolar amino acids

Answer 40

The amino acid main chain torsion angles, important for determining secondary and tertiary structure

Answer 41

α/α: packing of alpha-helices; β/β: one or more beta-sheets; α/β: roughly alternate alpha-helices and beta-sheets, beta-sheets are commonly parallel; α+β: mixed alpha-helices and beta-sheets

Answer 42

Root mean square deviation; a metric used to quantify the similarity between protein structures by superimposing them in the orientation that minimizes the value and calculating the square root of the mean squared distance between equivalent positions
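
A minimal sketch of the final calculation, assuming the two structures are already optimally superimposed and their atoms are paired in order (the superposition step itself is not shown). The function name `rmsd` and the toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed and atom i in
    coords_a is equivalent to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))

# Toy coordinates: every atom is displaced by exactly 1 unit along z
value = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
```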

Answer 43

E.C.1(class).2(Functional definition).3(Functional definition).4(substrate specificity)

Answer 44

Protein data bank; an EBI-hosted database of ~120,000 non-redundant protein structures, generally of high quality

Answer 45

An automated version of SCOP which aimed to organize PDB data by manual structure comparison and uses a hierarchical classification: class, fold, superfamily, family, protein domain and species

Answer 46

A protein structure database which organizes proteins by partially automatic structural alignment and uses a hierarchical classification: class, architecture, topology, homologous superfamily, sequence family

Answer 47

A high quality source of annotation for a selection of protein sequences

Answer 48

UniProt Knowledge Base; a European-based protein database with 250M sequences and ~600,000 high quality SWISSPROT annotations

Answer 49

EBI's protein sequence database

Answer 50

EBI's metagenomics database containing 350,000 amplicons from 33,000 metagenomes organized by biomes such as human, digestive system, aquatic, soil, skin, wastewater, etc.

Answer 51

An amino acid substitution that maintains the chemical properties of the original amino acids

Answer 52

Point accepted mutation; an amino acid sequence alignment scoring scheme developed in the 1970s, based on counting the number of times residue types change in closely homologous sequences

Answer 53

Blocks substitution matrix; an amino acid sequence alignment scoring scheme developed in the 1990s, derived from conserved protein motifs, which effectively filters out noise

Answer 54

A general sequence comparison algorithm that maximizes similarity scores to find the best global alignment of any two sequences
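
A minimal dynamic-programming sketch of global alignment scoring. The function name `global_alignment_score` is hypothetical, and the simple match/mismatch/linear-gap scores are assumed values standing in for a substitution matrix; only the optimal score is returned, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b (score only, no traceback)."""
    # prev[j]: best score aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

score = global_alignment_score("AC", "AGC")  # A-C against AGC: match, gap, match
```

Every residue of both sequences must be accounted for, so end gaps are penalized just like internal ones.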

Answer 55

A sequence comparison algorithm that compares segments of all possible lengths (i.e. local alignments) and chooses whichever maximizes the similarity measure
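
A minimal sketch of local alignment scoring. The function name `local_alignment_score` and the scoring values are assumptions for illustration; the defining detail is that a cell score is never allowed to drop below zero, so an alignment can start and end anywhere.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 restarts the alignment when the score turns negative
            cell = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

# Toy sequences sharing only the segment "ACG"
score_local = local_alignment_score("TTTACGTTT", "GGACGGG")
```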

Answer 56

The probability of achieving the returned score or a better score by chance; the probability of obtaining a value at least as extreme as the observed result assuming the null hypothesis is correct

Answer 57

A multiple sequence alignment program that builds MSAs using guide trees

Answer 58

A database of protein sequence patterns identified using multiple sequence alignments closely linked with SWISSPROT

Answer 59

The optimal method for representing protein families by MSA by scoring for residue similarity and position, allowing for the detection of distant family relationships

Answer 60

A database of protein domain family HMMs

Answer 61

An expansive database that consolidates information from a wide range of protein databases, such as PROSITE, PFAM, PANTHER, ProDOM (homologous domains), UNIPROT, etc., for the purpose of unifying research approaches and protein terminology

Answer 62

An algorithm that builds an MSA and a PSSM with the query sequence and uses this to further search the database to amplify conserved regions and identify conserved functional sites through iteration

Answer 63

Template modeling score; a structural similarity measure very similar to RMSD, but without the requirement for arbitrary decisions such as the maximum distance between equivalent residues

Answer 64

Critical assessment of protein structure prediction; a blind trial to evaluate different protein structure prediction methods that occurs every two years; sequences for testing are sent to predictors prior to revealing the correct experimental structures

Answer 65

The process of finding the conformation of a protein that corresponds to the lowest possible energy state according to a specified energy function or potential

Answer 66

A simulation that numerically solves Newton's equations of motion to simulate the movement and interactions of atoms in a protein and its surrounding environment

Answer 67

An online secondary structure prediction model that uses a template library of ~200,000 known 3D structures and HMMs for the known structures

Answer 68

The computational process of resolving INDELs in new structures, involving subdividing the loop into 2 segments and then repeatedly dividing and transforming each segment until the loop is small enough to be solved

Answer 69

Predicted local distance difference test; the per-residue confidence metric used by AlphaFold

Answer 70

Predicted alignment error; a metric used by AlphaFold to determine how well predicted the distance between two residues is and assess the confidence of domain packing

Answer 71

A powerful ab initio protein docking server; even so, high-quality results are still very difficult to obtain

Answer 72

A neural network model derived from AlphaFold that predicts structures of multimeric protein complexes without the need for paired MSA

Answer 73

An extension of AlphaFold2 that has been specifically built to predict protein-protein complexes; it is slightly outdated and has steep memory requirements

Answer 74

Gene ontology; a universal gene/gene product annotation system that details what the product does, why it performs its activity, and where it acts

Answer 75

A database that tabulates protein interactions for the purpose of inferring function from the protein's interactome

Answer 76

A protein interaction prediction model that incorporates a variety of different approaches, including a protein language model

Answer 77

The current state-of-the-art prediction method for identifying transmembrane structures and signal peptides, based on deep learning approaches

Answer 78

Two or three intertwined alpha-helices that manifest as super helical twists with slight distortion as a result of hydrophobic residue packing

Answer 79

Homologous proteins that come from different species; they are much more likely to preserve function and are likely to have the same EC classification

Answer 80

Homologous proteins in the same species that result from gene duplication and are freer to mutate as they have redundant functions

(104 cards)