bioinformatics Flashcards
what is the difference between a database and a databank?
database: a structured collection of data with a tool enabling it to be queried/searched, often split into tables
databank: collection of data, normally in simple text files, without associated query tool
what are different types of databanks
primary databank: raw data, possibly with annotations; simply contains sequence data, but may also have feature information such as splice sites, disulphide (S-S) bridges etc.
secondary databank: derived data, patterns and annotations; generally highly annotated
composite: non-redundant sets of data, derived from primary databanks
gateways: websites which allow access to data and provide links between databanks
what is bioinformatics
the application of computers to problems in biology
what are examples of primary databanks, what do they contain
GenBank, EMBL-ENA, DDBJ: DNA sequences
UniProt: protein primary databank, formed from others such as Swiss-Prot and PIR
PDB: Protein Data Bank, contains protein structural information
ENZYME: enzyme databank, contains enzyme classifications
they simply contain sequence data (DNA or protein)
may also have “feature information” such as splice sites, disulphide bridges etc
DNA databanks may also contain translations; known or predicted
(a protein identified from genome data is hypothetical until verified by experiment)
give examples of secondary databanks, what do they contain
PROSITE, PRINTS, BLOCKS, InterPro
they contain derived information, patterns that characterise a protein family, with detailed annotations
give examples of simple prosite patterns
protein kinase C phosphorylation site: [ST]-x-[RK]
square brackets mean "or" and x means any amino acid; so a phosphorylation site is either serine or threonine, followed by anything, followed by arginine or lysine
N-linked glycosylation site:
N-{P}-[ST]-{P}
curly brackets {} mean "not", so any amino acid except those listed
a match to one of these patterns does not necessarily mean the sequence is a real site; however, all real sites contain the pattern
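a minimal sketch, assuming only the three pattern elements described above (x, square brackets and curly brackets), of how such a pattern could be turned into a regular expression and searched for in a sequence. the function names prosite_to_regex and find_sites are made up for illustration, and full PROSITE syntax has more features than this handles.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern (e.g. 'N-{P}-[ST]-{P}') to a regex."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")                          # x = any amino acid
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")   # {..} = anything but these
        else:
            parts.append(element)                      # [..] or a single letter
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return the 0-based start of every match, including overlapping ones."""
    # The lookahead lets matches overlap; the illustrative sequences are made up.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence)]

print(find_sites("[ST]-x-[RK]", "ASAKTTR"))
print(find_sites("N-{P}-[ST]-{P}", "MNGTAANPSA"))
```

a hit only says the pattern is present; whether the site is really used has to be checked by experiment.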
what is a dotplot
a method for comparing two protein or DNA sequences to see how similar they are
one sequence is written along the horizontal axis and the other along the vertical axis; a dot is placed wherever the residue in a column matches the residue in a row
this can be used for DNA as well, but there is more noise because there are only 4 bases as opposed to 20 amino acids
identical sequences give an unbroken straight line along the main diagonal
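a minimal sketch of a dotplot drawn as text, assuming simple identity matching with no window or filtering. the function names dotplot and render are made up for illustration.

```python
def dotplot(seq_x, seq_y):
    """Return rows of booleans; grid[i][j] is True where seq_y[i] == seq_x[j]."""
    return [[a == b for b in seq_x] for a in seq_y]

def render(seq_x, seq_y):
    """Draw the dotplot as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for residue, row in zip(seq_y, dotplot(seq_x, seq_y)):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

# Two short made-up sequences; shared stretches show up as diagonal runs of '*'.
print(render("VFRBCDCMLLCFG", "ABCDCDCFG"))
```

every position of one sequence is compared with every position of the other, so the work grows with the product of the two lengths.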
how are similarities of DNA/proteins measured
similarity is measured with a scoring scheme, which can be used by methods such as dotplots
simple identity scoring gives 1 for a match and 0 for a mismatch
more complex scoring schemes account for similarity between amino acids, for example serine and threonine are very similar, serine and tryptophan are very different
similarity scores are typically derived from analysis of aligned homologous proteins in which substitutions have been observed
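a minimal sketch contrasting the two kinds of scoring. the numbers in TOY_SIMILARITY are invented for illustration only and are not taken from any real substitution matrix; the function names are also made up.

```python
def identity_score(a, b):
    """Simple identity scoring: 1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0

# Hypothetical similarity values (NOT from a real matrix), keyed by unordered pair.
TOY_SIMILARITY = {
    frozenset("S"): 4, frozenset("T"): 4, frozenset("W"): 10,
    frozenset("ST"): 2,    # serine/threonine: similar, so a mild positive score
    frozenset("SW"): -3,   # serine/tryptophan: very different
    frozenset("TW"): -3,
}

def similarity_score(a, b):
    """Look up the toy similarity of two residues (only S, T and W are defined)."""
    return TOY_SIMILARITY[frozenset((a, b))]

def score_ungapped(seq1, seq2, scorer):
    """Sum per-position scores of two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return sum(scorer(a, b) for a, b in zip(seq1, seq2))

print(score_ungapped("STW", "TTW", identity_score))
print(score_ungapped("STW", "TTW", similarity_score))
```

with identity scoring the S/T position counts for nothing; with a similarity scheme it still contributes a small positive score.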
what are alignments
a way of comparing DNA/amino acid sequences
one sequence is written on a line above the other, positioned so that stretches where the residues are the same line up at the same points
e.g. the sequences ABCDCDCFG and VFRBCDCMLLCFG could be compared 2 ways:
comparison at the BCDC point:
VFRBCDCMLLCFG
  ABCDCDCFG
comparison at the CFG point:
VFRBCDCMLLCFG
    ABCDCDCFG
alignments are made by inserting gaps into the sequences so as to give the optimal score for the comparison
2 types of alignment: global and local
global alignment lines up the two sequences along their whole length
local alignment finds the single region of one sequence that optimally matches a region of the other
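a minimal sketch of dynamic programming alignment scoring, covering both the global case (Needleman-Wunsch style) and the local case (Smith-Waterman style). the scores match=1, mismatch=-1 and gap=-1 are assumptions chosen for illustration, and the function name alignment_score is made up. it returns only the score, not the aligned strings.

```python
def alignment_score(seq1, seq2, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal alignment score by dynamic programming (assumed toy scores)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised, so fill the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align the two residues
                       table[i - 1][j] + gap,        # gap in seq2
                       table[i][j - 1] + gap)        # gap in seq1
            if local:
                cell = max(cell, 0)                  # local: may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    # Global score is the bottom-right cell; local score is the best cell anywhere.
    return best if local else table[-1][-1]

print(alignment_score("VFRBCDCMLLCFG", "ABCDCDCFG"))
print(alignment_score("VFRBCDCMLLCFG", "ABCDCDCFG", local=True))
```

the table has one cell for every pair of positions, which is why this approach becomes slow for long sequences or whole databanks.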
what is problem with dynamic programming methods
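dynamic programming alignment methods, like dotplots, compare every position of one sequence with every position of the other, so they are slow for searching large databanks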
what are examples of comparing sequences quickly
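fast methods: FASTA and BLAST
they are known as heuristics: shortcuts that are much faster but are not guaranteed to find the optimal alignment
an index is created containing every word (short run of residues) in the database
book analogy:
the index of a book shows which pages contain a word and which pages contain it most often
in the same way, the words in the query are looked up in the index to find database sequences that share many words with it, even if there are a few differences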
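a minimal sketch of the word-index idea only, assuming words of 3 letters and exact word matches. the database entries and names (seq1, seq2) and the function names are made up; real FASTA and BLAST go on to extend and score the hits, which is not shown here.

```python
from collections import Counter, defaultdict

def build_word_index(database, k=3):
    """Map every k-letter word to the names of the database entries containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def rank_hits(query, index, k=3):
    """Count how many distinct query words each entry shares, best first."""
    counts = Counter()
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in words:
        for name in index.get(word, ()):
            counts[name] += 1
    return counts.most_common()

# Hypothetical two-entry database, built once and then reused for every query.
database = {"seq1": "VFRBCDCMLLCFG", "seq2": "MKTAYIAKQR"}
index = build_word_index(database)
print(rank_hits("ABCDCDCFG", index))
```

the index is built once, so each search only has to look up the query's words instead of aligning the query against every entry.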
how is DNA sequenced
Sanger method; however, it only lets you sequence relatively short segments of DNA (less than 2000 bp)
newer methods read long stretches, such as 10,000 bp, but not very accurately
to sequence whole genomes, the short sequences are joined together: alignments are used to find where the end of the preceding sequence matches the start of the succeeding sequence
as big an overlap as possible is desired to reduce the chance of errors
however, repeated sequences that are too large may be represented inaccurately, and their size is then underestimated
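a minimal sketch of joining two reads by their overlap, assuming exact matches only and a minimum overlap of 3 bases (both assumptions; real assemblers tolerate sequencing errors). the function names and the reads are made up for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left, right, min_overlap=3):
    """Join two reads through their overlap, or return None if there is none."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]

# The end of the first read (GTAC) matches the start of the second.
print(merge_reads("ATGCGTAC", "GTACCTTA"))

# Two reads taken from different copies of a repeat look identical, so they
# collapse into one copy and the repeat comes out shorter than it really is.
print(merge_reads("ACACAC", "ACACAC"))
```

a short overlap is more likely to occur by chance, which is why a larger overlap gives more confidence that two reads really are neighbours.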
how is machine learning used in bioinformatics
machine learning is a general class of software which learns from examples and is then able to make predictions
you can train this software with lots of examples of real transcription sites, intron/exon boundaries, sequence composition etc.
the machine then learns features of these examples and applies them to make predictions
examples are neural networks and decision trees
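a minimal sketch of learning from examples, using a decision stump (a decision tree with a single question) as the simplest stand-in for the methods named above. the training windows are invented toy data, not real sites, and the function names are made up.

```python
def train_stump(examples):
    """Pick the single (position, base) test that best separates the labels.

    `examples` is a list of (sequence, label) pairs with equal-length DNA
    sequences and boolean labels. Returns the chosen (position, base).
    """
    length = len(examples[0][0])
    best_rule, best_correct = None, -1
    for position in range(length):
        for base in "ACGT":
            # Count examples where "has this base here" agrees with the label.
            correct = sum((seq[position] == base) == label
                          for seq, label in examples)
            if correct > best_correct:
                best_rule, best_correct = (position, base), correct
    return best_rule

def predict(rule, sequence):
    """Apply the learned rule to a new sequence."""
    position, base = rule
    return sequence[position] == base

# Made-up training windows: True marks a toy "site", False marks a non-site.
TRAINING = [
    ("AAGTA", True), ("CCGTC", True), ("TAGTT", True),
    ("AACTA", False), ("CCATC", False), ("TATAT", False),
]

rule = train_stump(TRAINING)
print(rule, predict(rule, "GGGAA"), predict(rule, "GGCAA"))
```

the rule is never written by hand: it is found from the labelled examples, and it is only as good as the examples it was given.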
what is used to determine protein sequences
DNA sequencing has now mostly replaced direct amino acid sequencing for determining protein sequences
what are problems with bioinformatics
some databanks have/may have mistakes
some genes have many names
some different genes have the same name