Bioinformatics Flashcards
Flat file
Term for data stored in a plain, ordinary file on the hard disk. Example: RefSeq.
Bioinformatics
Application of information technology to the storage, management and analysis of biological information (Facilitated by the use of computers)
Nanopore seq
A molecule is measured as it passes through the pore. Proteins in the pore pull it through at 800 nucleotides per minute. Read length up to 300 000 → able to do phasing/haplotyping, for example when you are heterozygous at two positions in the genome.
Examples of location descriptors
Location      Description
476           Points to a single base in the presented sequence
340..565      Points to a continuous range of bases bounded by and including the starting and ending bases
<345..500     The exact lower boundary point of a feature is unknown
(102.110)     Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110
(23.45)..600  Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end point is base 600
123^124       Points to a site between bases 123 and 124
145^177       Points to a site anywhere between bases 145 and 177
J00193:hladr  Points to a feature whose location is described in another entry: the feature labeled ‘hladr’ in the entry (in this database) with primary accession ‘J00193’
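A minimal sketch of how the simpler location forms in the table could be recognised with regular expressions. The category names and the function `classify_location` are made up for this example, and compound forms such as (23.45)..600 are deliberately not handled.

```python
import re

# One pattern per simple location form; category names are illustrative only.
PATTERNS = [
    ("single base", re.compile(r"^(\d+)$")),
    ("range", re.compile(r"^(<?\d+)\.\.(>?\d+)$")),
    ("one base within", re.compile(r"^\((\d+)\.(\d+)\)$")),
    ("site between", re.compile(r"^(\d+)\^(\d+)$")),
    ("remote feature", re.compile(r"^([A-Z]+\d+):(\w+)$")),
]

def classify_location(text):
    """Return (category, captured parts), or ("unrecognised", ()) if nothing matches."""
    for name, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return name, match.groups()
    return "unrecognised", ()

for example in ["476", "340..565", "<345..500", "(102.110)", "123^124", "J00193:hladr"]:
    print(example, classify_location(example))
```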
Sequencing file format tips
a) When saving a sequence for use in an email message or pasting into a web page…
b) When retrieving from a database or exchanging between programs…
c) When using the sequence again with the same program…
a) …use an unannotated text format such as FASTA
b) …use an annotated text format such as Genbank
c) …use that program’s annotated binary format (or annotated text if binary not available)
ASN.1 (NCBI)
GBFF (Sanger)
XML
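To make tip a) concrete, here is a small sketch of writing and reading FASTA, the unannotated text format: a header line starting with ">" followed by the sequence on one or more lines. The function names and the line width are assumptions chosen for this example.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as FASTA text, wrapping the sequence at `width` characters."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"

def from_fasta(text):
    """Parse FASTA text into a dict mapping header to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

text = to_fasta("seq1", "ACGTACGTAC", width=4)
print(text)
print(from_fasta(text))
```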
Phred
*base calling
*vector trimming
*end of sequence read trimming
*assigns quality values (qv) of bases in the sequence
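The quality values Phred assigns are defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below converts between the two and trims low-quality bases from the end of a read. The trimming rule and the threshold of 20 are assumptions for illustration, not Phred's actual trimming algorithm.

```python
import math

def error_probability(q):
    """Probability that a base call with quality value q is wrong."""
    return 10 ** (-q / 10)

def quality_value(p_error):
    """Quality value for a given error probability."""
    return -10 * math.log10(p_error)

def trim_read_end(bases, qualities, min_q=20):
    """Drop trailing bases whose quality is below min_q (assumed, simplified rule)."""
    end = len(bases)
    while end > 0 and qualities[end - 1] < min_q:
        end -= 1
    return bases[:end], qualities[:end]

print(error_probability(20))  # Q20 means about 1 error in 100 calls
print(trim_read_end("ACGTAC", [30, 32, 35, 28, 12, 8]))
```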
Phrap
*Phrap uses Phred’s base calling scores to determine the consensus sequences.
*Examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence
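A heavily simplified sketch of the "highest scoring sequence wins" idea. It assumes the reads are already aligned to the same length, with "-" marking positions a read does not cover. The real Phrap also performs the assembly and alignment itself; the function name `consensus` is made up for this example.

```python
def consensus(reads):
    """reads: list of (bases, qualities) tuples, pre-aligned to equal length.

    At each position the base with the highest quality value is used.
    Positions covered by no read become "N".
    """
    length = len(reads[0][0])
    result = []
    for pos in range(length):
        best_base, best_q = "N", -1
        for bases, quals in reads:
            if bases[pos] != "-" and quals[pos] > best_q:
                best_base, best_q = bases[pos], quals[pos]
        result.append(best_base)
    return "".join(result)

reads = [
    ("ACGT--", [30, 30, 10, 30, 0, 0]),
    ("--CTAG", [0, 0, 40, 20, 35, 35]),
]
print(consensus(reads))  # ACCTAG: the low-quality G at position 3 loses to the C
```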
Consed
Graphical interface extension that controls both Phred and Phrap
Poor data at seq end
This is due to the difficulty of resolving larger fragments (~1 kb): it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp.
Cis- and transsplicing for ORF
Cis-splicing - splices out an intron and joins exons from the same site (within one transcript)
Trans-splicing - splices and joins exons from different sites; can also occur between the sense and antisense strands.
Swissprot
SWISS-PROT is an annotated protein sequence database. Continuously updated (daily).
Format follows that of EMBL as closely as possible
Curated protein sequence database
Three differences from other protein sequence databases:
- Strives to provide a high level of annotations
- Minimal level of redundancy
- High level of integration with other databases
Behind a paywall.
TREMBL
Translated EMBL sequences not (yet) in Swissprot. Updated faster than SWISS-PROT.
TREMBL - two parts
1. SP-TREMBL
Will eventually be incorporated into Swissprot
Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT.
2. REM-TREMBL (remaining)
Will NOT be incorporated into Swissprot
Divided into: immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins
Protein searching
3 levels
1. Swissprot - Little noise, annotated entries
2. Swissprot + TREMBL - More noise, all probable entries
3. Translated EMBL - blast or tfasta - Most noisy, all possible entries
PDB
3D structures of proteins. AI can now use the amino acid (AA) sequence to predict the structure model.
>10 000 structures of proteins
Also contains structures of DNA, carbohydrates and protein-DNA complexes
Structures determined principally by X-ray crystallography but other methods are electron microscopy and NMR.
Each entry is identified by a unique 4-character code
4 most used databanks in bioinformatics
Gene Ontology - defines the terms
Pfam - protein families, identifies functional parts in proteins
SMART - visual presentation of protein families
KEGG - pathway database, shows which enzymes work together in a biosynthesis pathway
Problem with flat files:
Wasted storage space
Wasted processing time
Data control problems
Problems caused by changes to data structures
Access to data difficult
Data out of date
Constraints are system based
Limited querying, e.g. finding all single-exon GPCRs (<1000 bp)
Relational databases
A set of tables and links. A language to query the database. A program to manage the data.
Have existed for about 50 years. Mainstream in bioinformatics.
Built on a well-known, proven, and simple underlying mathematical theory. The relational model is very mature, and there is strong knowledge of how to make a relational back-end fast and reliable and how to exploit different technologies.
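A minimal sketch of the three ingredients (tables, links, a query language) using Python's built-in sqlite3 module. The table layout, gene names, and numbers are invented for illustration. The query is the kind that is awkward against flat files: all single-exon GPCRs shorter than 1000 bp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two tables; exon.gene_id is the link (foreign key) to gene.gene_id.
conn.executescript("""
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    family    TEXT NOT NULL,
    length_bp INTEGER NOT NULL
);
CREATE TABLE exon (
    exon_id INTEGER PRIMARY KEY,
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    start   INTEGER NOT NULL,
    stop    INTEGER NOT NULL
);
""")

# Hypothetical example rows.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
    (1, "geneA", "GPCR", 950),
    (2, "geneB", "GPCR", 1400),
    (3, "geneC", "GPCR", 900),
    (4, "geneD", "kinase", 800),
])
conn.executemany("INSERT INTO exon VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 950),
    (2, 2, 1, 1400),
    (3, 3, 1, 400),
    (4, 3, 500, 900),
    (5, 4, 1, 800),
])

QUERY = """
SELECT g.name
FROM gene AS g JOIN exon AS e ON e.gene_id = g.gene_id
WHERE g.family = 'GPCR' AND g.length_bp < 1000
GROUP BY g.gene_id
HAVING COUNT(e.exon_id) = 1
ORDER BY g.name
"""
single_exon_gpcrs = [row[0] for row in conn.execute(QUERY)]
print(single_exon_gpcrs)  # ['geneA']
```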
Pros with databases
+Redundancy can be reduced
+Inconsistency can be avoided
+Conflicting requirements can be balanced
+Standards can be enforced
+Data can be shared
+Data independence
+Integrity can be maintained
+Security restrictions can be applied
Cons with databases
-Size
-Complexity
-Cost
-Additional hardware costs
-Higher impact of failure
-Recovery more difficult
Identity
Extent to which two (nucleotide or amino acid) sequences are invariant
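Identity is usually reported as a percentage over an alignment. A small sketch, assuming the two sequences are already aligned with "-" as the gap character and that columns containing a gap are skipped (other conventions for the denominator exist):

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # assumed convention: gap columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

print(percent_identity("ACGT-A", "ACCTTA"))  # 4 of 5 compared columns match: 80.0
```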
Homology
Similarity attributed to descent from common ancestor
Orthologous
Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function
Paralogous
Homologous sequences within a single species that arose by gene duplication.
Empirical finding
If two biological sequences are sufficiently similar, almost invariably they have similar biological functions and will be descended from a common ancestor.
Scoring matrix
A tool to quantify how well a certain model is represented in the alignment of two sequences, and any result obtained by its application is meaningful exclusively in the context of that model. All subsequent results depend critically on just how this is done and what model lies at the basis for the construction of a specific scoring matrix.
Nucleic acid scoring matrices (examples)
Not used very often
Identity matrix
BLAST matrix
Transition/Transversion matrix
Transition
Mutation that conserves the ring number of the nucleotide (purine <-> purine or pyrimidine <-> pyrimidine)
Transversion
Mutation that does not conserve the ring number of the nucleotide (purine <-> pyrimidine)
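A short sketch of classifying a substitution by ring number and turning that into a transition/transversion scoring matrix. The score values (1, -1, -2) are assumptions picked only to show that transversions are penalised more than transitions.

```python
PURINES = {"A", "G"}      # two rings
PYRIMIDINES = {"C", "T"}  # one ring

def mutation_type(a, b):
    """Classify the substitution a -> b as identity, transition or transversion."""
    if a == b:
        return "identity"
    same_ring_number = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_ring_number else "transversion"

# Illustrative scores only (assumed values).
SCORES = {"identity": 1, "transition": -1, "transversion": -2}

def ts_tv_matrix():
    """Build the full 4 x 4 scoring matrix as a dict keyed by base pair."""
    bases = "ACGT"
    return {(a, b): SCORES[mutation_type(a, b)] for a in bases for b in bases}

matrix = ts_tv_matrix()
print(mutation_type("A", "G"), matrix[("A", "G")])
print(mutation_type("A", "C"), matrix[("A", "C")])
```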
Genetic Code matrix
Used to define the evolutionary distance between two aa by the minimal number of nucleotide changes required.
The probability that an observed aa pair is related by chance rather than inheritance should depend on the number of point mutations needed to transform one codon into the other.
From the matrix it has been seen that the genetic code appears to have evolved to minimize the effects of point mutations. Mutations often give aa with similar properties.
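A sketch of how one entry of such a matrix can be computed: the minimal number of nucleotide changes between any codon of one aa and any codon of the other. It uses the standard genetic code written in the usual T, C, A, G order; the function name is made up for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))

def min_nucleotide_changes(aa1, aa2):
    """Smallest number of point mutations turning a codon of aa1 into a codon of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )

print(min_nucleotide_changes("D", "E"))  # similar acidic residues, one change apart
print(min_nucleotide_changes("M", "Y"))  # all three positions must change
```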
Hydrophobic aliphatic amino acids
Side chains consist of nonpolar methyl or methylene groups.
Usually located on the interior of the protein because of their hydrophobicity.
All except alanine are bifurcated.
For Val and Ile the bifurcation is close to main chain and can therefore restrict the conformation of the polypeptide by steric hindrance.
Hydrophobic-aromatic aa side chains
Only phenylalanine is totally non-polar.
Tyrosine’s phenolic side chain has a hydroxyl substituent and tryptophan has a nitrogen atom in its indole ring system. These residues are almost always found largely buried in the hydrophobic interior of proteins, since they are predominantly non-polar in nature.
But, polar atoms of tyrosine and tryptophan allow hydrogen bonding interaction with other residues or even solvent molecules.
Neutral-polar side chains
Small aliphatic side chains with polar groups that cannot ionize readily.
Serine and threonine possess hydroxyl groups in their side chains and as these polar groups are close to the main chain they can form hydrogen bonds with it. This can influence the local conformation of the polypeptide.
Residues such as serine and asparagine are known to adopt conformations which most other amino acids cannot.
The amino acids asparagine and glutamine possess amide groups in their side chains which are usually hydrogen-bonded whenever they occur in the interior of a protein.
The Ser <-> Thr substitution is the most common in nature.
Acidic amino acids
Aspartate and glutamate have carboxyl side chains and are therefore negatively charged at physiological pH.
The strongly polar nature of these residues means they are often found on the surface of globular proteins, where they can interact with solvent molecules.
Residues can also partake in electrostatic interactions with positively charged basic aa.
Aspartate and glutamate can also take on catalytic roles in the active site of enzymes, well known for their metal ion binding abilities.
Basic amino acids
Histidine has the lowest pKa (around 6) - neutral at around physiological pH.
Occurs often in enzyme active sites as it can function as a very efficient general acid-base catalyst.
Also acts as a metal ion ligand in many cases. Lysine and arginine are more strongly basic and positively charged at physiological pH.
Generally solvated, but occasionally occur inside proteins, where they are involved in electrostatic interactions with negatively charged groups.
Lys and Arg are important for anion-binding proteins because they can interact electrostatically with the ligand.
Conformationally important aa residues
Glycine and proline - unique, appear to influence conformation of the polypeptide.
Gly lacks a side chain and is very flexible in conformation. Occurs abundantly in certain fibrous proteins because of its flexibility and because its small size allows adjacent polypeptide chains to pack together closely.
Proline on the other hand is the most rigid aa because the side chain is covalently linked with main chain nitrogen.
Hydrophobicity matrix
Used when you want to predict which part of a protein passes through a membrane.
An attempt to quantify some physical or chemical attribute of the residues and assign weights based on similarities of the residues in this chosen property
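A sketch of both ideas on this card, using the Kyte-Doolittle hydropathy scale as the chosen property. Scoring a residue pair as the negative absolute difference of their hydropathy values is one simple assumption for "weights based on similarity", not a standard published matrix. The sliding-window average is the usual way to look for membrane-spanning stretches; the window of 5 is kept small here purely for demonstration.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobicity_score(a, b):
    """Assumed similarity weight: 0 for equal hydropathy, more negative the more they differ."""
    return -abs(KYTE_DOOLITTLE[a] - KYTE_DOOLITTLE[b])

def window_averages(sequence, window=5):
    """Mean hydropathy of every window; high values suggest a membrane-spanning region."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

print(hydrophobicity_score("I", "L"), hydrophobicity_score("I", "R"))
print(window_averages("LLLLLKKKKK"))
```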
Dayhoff PAM
A family of matrices that scores aa pairs on the basis of the expected frequency of substitutions of one aa for the other during protein evolution.
PAM - stands for…
Percent accepted mutation (also expanded as point accepted mutation): one accepted point mutation on the path between two sequences per 100 residues
7 steps of constructing a scoring matrix
- Find accepted mutations
- Frequencies of occurrence
- Relative mutabilities
- Mutation probability matrix
- The evolutionary distance
- Relatedness odds
- Log-odds matrix
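The last four steps can be sketched with a toy two-letter alphabet. The PAM1 matrix and the frequencies below are invented for illustration, and the real matrices are 20 x 20. The evolutionary distance is modelled by raising the PAM1 mutation probability matrix to a power, relatedness odds are M[i][j] / f[i], and the log-odds score is 10 log10 of that ratio.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result

def log_odds(m, freqs):
    """Log-odds scores: 10 * log10(M[i][j] / f[i])."""
    n = len(m)
    return [[10 * math.log10(m[i][j] / freqs[i]) for j in range(n)]
            for i in range(n)]

# Toy data (assumed): two residue types, equally frequent, 1% change per PAM.
freqs = [0.5, 0.5]
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]

pam250 = mat_pow(pam1, 250)
print(pam250)                   # off-diagonal entries approach the background frequency
print(log_odds(pam1, freqs))    # strong reward for identity, strong penalty otherwise
print(log_odds(pam250, freqs))  # scores shrink towards zero at large distances
```

In this toy model almost half of the positions differ at PAM250 even though 250 mutations per 100 residues have occurred, which shows why the evolutionary distance exceeds the number of observed differences.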
Properties of aa going into the makeup of PAM matrices
Size
Shape
Local concentrations of electric charge
van der Waals surface
Ability to form salt bridges
Hydrophobic interactions
Hydrogen bonds
What two aspects can cause the evolutionary distance to be unequal in general to the number of observed differences between the sequences?
*Chance that a certain residue may have mutated, then reverted, hiding the effect of the mutation
*Specific residues may have mutated more than once → number of mutations likely to be larger than the number of differences between the two sequences.
PAM matrix; twilight zone
When the PAM distance value between two distantly related proteins nears the value 250 it becomes difficult to tell whether the two proteins are homologous, or if they are two randomly taken proteins that can be aligned by chance.
Low PAM
Closely related sequences. High scores for identity and low scores for substitutions, closer to the identity matrix.
High PAM
Distant sequences. At PAM200 all information is degenerate except for cysteines.
PAM error sources
*Many sequences depart from average composition.
*Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements observed!)
*Errors in PAM1 are magnified in the extrapolation to PAM250.
*Distantly related sequences usually have islands (blocks) of conserved residues → Replacement is not equally probable over entire sequence.
BLOSUM
Blocks substitution matrix. Scores aa pairs based on the frequency of aa substitutions in aligned sequence motifs called blocks that are found in protein families. Reaches broadly the same conclusions as PAM.
BLOSUM method
A. Observed pairs
B. Expected pairs
C. Summary: ratio of observed to expected pairs (A/B)
High BLOSUM: Closely related sequences
Low BLOSUM: Distant sequences
BLOSUM45 <-> PAM250
BLOSUM62 <-> PAM160. BLOSUM62 is the most popular matrix.
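Steps A to C can be sketched on a tiny invented block. The sketch counts observed pairs per column, derives expected pair frequencies from the residue frequencies, and scores each pair as 2 log2(observed / expected). It omits the clustering of similar sequences that gives BLOSUM62 its number, and the block and function name are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_scores(block):
    """block: list of equal-length, ungapped aligned sequences."""
    # A. Observed pairs: count every residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # B. Expected pairs: from the frequency of each residue within the pairs.
    residue_freq = Counter()
    for (a, b), q in observed.items():
        if a == b:
            residue_freq[a] += q
        else:
            residue_freq[a] += q / 2
            residue_freq[b] += q / 2

    # C. Summary: log-odds of observed versus expected.
    scores = {}
    for (a, b), q in observed.items():
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = 2 * math.log2(q / expected)
    return scores

block = ["ASA", "ASA", "ASA", "ASS"]  # toy block, mostly conserved columns
for pair, score in sorted(blosum_scores(block).items()):
    print(pair, round(score))
```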
High BLOSUM
Closely related sequences
Low BLOSUM
Distant sequences
Which is the best matrix to use?
No single matrix is the complete answer for all sequence comparisons. It is probably best to complement the BLOSUM62 matrix with comparisons using PAM250 and Overington structurally derived matrices.
Dotplot
Graphical representation using two orthogonal axes and “dots” for regions of similarity. In a bioinformatics context, two sequences are placed on the axes and a dot is plotted wherever a given threshold is met within a given window.
Dot plotting is the best way to see all of the structures in common between two sequences or to visualize all of the repeated or inverted structures in one sequence.
Causes of noise in dotplots
Nucleic acids: 1 of 4 bases will match at random. Removing self alignments will reduce noise.
Stringency: Window size is considered, percentage of bases matching in the window is set as threshold.
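A small sketch of a windowed dotplot that shows how stringency removes random matches. A dot is recorded at (i, j) when at least `threshold` of the `window` positions starting there match. The function names and the text rendering are assumptions for this example.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return the (i, j) window start positions where enough bases match."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= threshold:
                dots.append((i, j))
    return dots

def render(seq_a, seq_b, dots, window):
    """Draw the plot as text: rows follow seq_a, columns follow seq_b."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    marked = set(dots)
    return "\n".join(
        "".join("*" if (i, j) in marked else "." for j in range(cols))
        for i in range(rows)
    )

a, b = "AACA", "AAGA"
print(len(dotplot(a, b, window=1, threshold=1)))  # 9 dots, mostly chance matches
print(dotplot(a, b, window=2, threshold=2))       # only the shared "AA" remains
print(render(a, b, dotplot(a, b, 1, 1), 1))
```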
Pairwise sequence alignment
Can be global or local. Local alignment looks for the portion that aligns optimally, while global alignment looks at everything (and we are allowed to insert gaps to make it fit).
Works for basically every sequence. However, it cannot align more than two sequences at once, and it does not scale with the size and number of sequences.
Global: Sequences are completely aligned
Local: Only the best sub-regions are aligned. BLAST uses this
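A score-only sketch of the difference, using the classic dynamic programming schemes (Needleman-Wunsch for global, Smith-Waterman for local). The match, mismatch, and gap values are arbitrary assumptions, and BLAST itself uses a faster heuristic rather than this full matrix fill.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b; global by default, local if local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diagonal, up, left)
            if local:
                cell = max(cell, 0)  # local: a poor prefix can simply be dropped
                best = max(best, cell)
            score[i][j] = cell
    # Global reads the bottom-right cell; local takes the best cell anywhere.
    return best if local else score[-1][-1]

a, b = "GGGACGTCCC", "TTACGTAA"
print(alignment_score(a, b))              # global: the mismatching flanks cost points
print(alignment_score(a, b, local=True))  # local: only the shared ACGT is scored
```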
Algorithm
Method or a process followed to solve a problem. A recipe. An algorithm takes the input to a problem (function) and transforms it to the output. A mapping of input to output. A problem can have many algorithms.
Multiple sequence alignment
A process of aligning multiple sequences of nucleic acids or proteins to identify similarities and differences among them.
The sequences being aligned can be DNA, RNA, or proteins, and they may come from different organisms.
The goal of multiple sequence alignment is to identify conserved regions among the sequences, which can provide insight into their evolutionary relationships and functional significance.
With more than 2 sequences, the alignment matrix gains one dimension per sequence (3 sequences form a 3D matrix), which requires much more computational power.
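A rough sketch of why exact multiple alignment gets expensive: a full dynamic programming matrix needs one axis per sequence, so the number of cells grows as (length + 1) to the power of the number of sequences. The function name is made up, and practical tools avoid this by using heuristics.

```python
def dp_cells(length, n_sequences):
    """Cells in a full dynamic programming matrix for n sequences of equal length."""
    return (length + 1) ** n_sequences

for n in (2, 3, 4):
    print(n, dp_cells(100, n))
```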