bioinformatics Flashcards

Question 1

Q

what is the difference between a database and a databank?

Answer

A

database: a structured collection of data with a tool enabling it to be queried/searched, often split into tables
databank: collection of data, normally in simple text files, without associated query tool

Question 2

Q

what are different types of databanks

Answer

A

primary databank: raw data, possibly with annotations; these simply contain sequence data but may also hold feature information such as splice sites, disulphide (S-S) bonds etc.

secondary databank: derived data such as patterns and annotations, generally highly annotated

composite: non-redundant sets of data, derived from primary databanks
gateways: websites which allow access to data and provide links between databanks

Question 3

Q

what is bioinformatics

Answer

A

the application of computers to problems in biology

Question 4

Q

what are examples of primary databanks, what do they contain

Answer

A

GenBank, EMBL-ENA, DDBJ: for DNA

UniProt: primary protein sequence databank, formed from others such as Swiss-Prot and PIR

PDB: Protein Data Bank, contains protein structural information

ENZYME: enzyme databank, contains enzyme classifications

they simply contain sequence data (DNA or protein)

may also have “feature information” such as splice sites, disulphide bridges etc

DNA databanks may also contain translations; known or predicted

(a protein identified from genome data is hypothetical until verified by experiment)

Question 5

Q

give examples of secondary databanks, what do they contain

Answer

A

prosite, prints, blocks, interpro

they contain derived information, patterns that characterise a protein family, with detailed annotations

Question 6

Q

give examples of simple prosite patterns

Answer

A

protein kinase C phosphorylation site: [ST]-x-[RK]

square brackets mean or, and x means any amino acid; so a phosphorylation site is either serine or threonine, followed by anything, followed by arginine or lysine

N-linked glycosylation site:

N-{P}-[ST]-{P}

{} brackets mean not, so anything but these amino acids

a match to one of these patterns does not necessarily mean the site is real; however, all real sites of that type contain the pattern
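
a minimal sketch of how such simple patterns could be searched for, assuming the standard PROSITE notation ([..] for or, {..} for not, x for anything). the function names and the test sequence are made up for illustration, and repeats such as x(2) or anchors are not handled:

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")                          # x = any amino acid
        elif element.startswith("{"):
            parts.append("[^" + element[1:-1] + "]")   # {P} = anything but P
        else:
            parts.append(element)                      # [ST] or a single letter
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return 0-based start positions of every match, including overlapping ones."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]

pkc = "[ST]-x-[RK]"
glyco = "N-{P}-[ST]-{P}"
seq = "MNGTASVKNPSA"   # hypothetical sequence, not a real protein

print(prosite_to_regex(glyco))   # N[^P][ST][^P]
print(find_sites(glyco, seq))    # [1]  (NGTA; the later N is followed by P)
print(find_sites(pkc, seq))      # [5]  (SVK)
```

a hit only says the pattern is present; whether the site is really used still has to be checked by experiment.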

Question 7

Q

what is a dotplot

Answer

A

method for comparing two protein/DNA sequences to see how similar they are

the sequence of one protein goes along the horizontal axis and the sequence of the other along the vertical axis; every time a pair of residues match, a dot is placed

this can be used for DNA as well, however there is a larger amount of noise because there are only 4 bases as opposed to 20 AAs

a straight diagonal line shows a region where the two sequences are the same; an unbroken line down the middle means the whole sequences are the same
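
a small sketch of the idea, using simple identity matching and plain text output. the function names and the letter strings are illustrative assumptions, not part of any standard tool:

```python
def dotplot(seq_x, seq_y):
    """Return a grid where grid[i][j] is True when seq_y[i] equals seq_x[j]."""
    return [[a == b for b in seq_x] for a in seq_y]

def render(seq_x, seq_y):
    """Draw the dotplot as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for residue, row in zip(seq_y, dotplot(seq_x, seq_y)):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

# Made-up letter strings; shared stretches show up as diagonal runs of '*'.
print(render("VFRBCDCMLLCFG", "ABCDCDCFG"))
```

comparing a sequence with itself gives an unbroken main diagonal; isolated stars away from any diagonal are the noise mentioned above.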

Question 8

Q

how are similarities of DNA/proteins measured

Answer

A

similarity scores are used by methods such as dotplots and alignments

simple identity scoring gives 1 for a match and 0 for a mismatch

more complex scoring schemes account for similarity between amino acids, for example serine and threonine are very similar, serine and tryptophan are very different

similarity scores are typically derived from analysis of aligned homologous proteins in which substitutions have been observed
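
a sketch contrasting the two kinds of scoring. the numbers in TOY_SIMILARITY are invented purely for illustration; real schemes (for example the BLOSUM and PAM matrices) are derived from aligned homologous proteins:

```python
# Hypothetical similarity scores, invented for illustration only.
TOY_SIMILARITY = {
    frozenset("ST"): 1,    # serine / threonine: similar
    frozenset("SW"): -3,   # serine / tryptophan: very different
}

def identity_score(a, b):
    """1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0

def similarity_score(a, b, match=4, default=-1):
    """Score a pair using the toy table; unlisted mismatches get `default`."""
    if a == b:
        return match
    return TOY_SIMILARITY.get(frozenset((a, b)), default)

def score_aligned(seq1, seq2, scorer):
    """Sum per-position scores for two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(scorer(a, b) for a, b in zip(seq1, seq2))

print(score_aligned("GSA", "GTA", identity_score))     # 2
print(score_aligned("GSA", "GWA", identity_score))     # 2
print(score_aligned("GSA", "GTA", similarity_score))   # 9
print(score_aligned("GSA", "GWA", similarity_score))   # 5
```

identity scoring cannot tell the two substitutions apart, while the similarity scheme ranks the serine/threonine swap above the serine/tryptophan swap.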

Question 9

Q

what are alignments

Answer

A

a way of comparing DNA/amino acid sequences

one sequence is written on a line above the other, positioned so that stretches where the residues are the same line up at the same points
e.g. the sequences ABCDCDCFG and VFRBCDCMLLCFG could be lined up 2 ways:

lined up on the BCDC match:
VFRBCDCMLLCFG
  ABCDCDCFG

lined up on the CFG match:
VFRBCDCMLLCFG
    ABCDCDCFG

gaps are inserted into the sequences so that more matching stretches line up at once, giving the optimal score under the chosen scoring scheme

2 types of alignment: global and local

global alignment lines up the two sequences along their whole length

local alignment finds the single region of one sequence that optimally matches a region of the other
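
a sketch of how the optimal score can be computed by dynamic programming, in the style of Needleman-Wunsch (global) and Smith-Waterman (local). the scoring values (match 1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; only the score is returned, not the alignment itself:

```python
def alignment_score(seq1, seq2, local=False, match=1, mismatch=-1, gap=-1):
    """Best global alignment score, or best local score when local=True."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the very start.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align the two residues
                       table[i - 1][j] + gap,        # gap in seq2
                       table[i][j - 1] + gap)        # gap in seq1
            if local:
                cell = max(cell, 0)                  # a local alignment may restart
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]

print(alignment_score("ACGT", "ACGT"))                     # 4
print(alignment_score("ACGT", "ACT"))                      # 2  (three matches, one gap)
print(alignment_score("XXBCDCYY", "ZBCDCW", local=True))   # 4  (the shared BCDC)
```

every cell of the table is filled in, so the work grows with the product of the two sequence lengths.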

Question 10

Q

what is problem with dynamic programming methods

Answer

A

dynamic programming alignment methods (and full dotplot comparisons) are slow, because every residue of one sequence is compared with every residue of the other

Question 11

Q

what are examples of methods for comparing sequences quickly

Answer

A

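fast methods: FASTA/BLAST

they are known as heuristics: they give up the guarantee of finding the best alignment in exchange for speed

an index is created containing every word (short run of residues) in the database

book example:
looking a word up in a book's index shows which pages contain the word and which pages contain it the most

in the same way, the words of the query are looked up in the index to find database sequences that share many words with it, even if there are a few differences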
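
a toy sketch of the word-index idea. this is not the actual FASTA or BLAST algorithm (those go on to extend and score the hits); the function names, the word size of 3 and the databank entries are all made up for illustration:

```python
from collections import defaultdict

def build_index(databank, word_size=3):
    """Map every word of length word_size to the (name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in databank.items():
        for pos in range(len(sequence) - word_size + 1):
            index[sequence[pos:pos + word_size]].append((name, pos))
    return index

def rank_hits(query, index, word_size=3):
    """Count word hits per databank entry, entries with the most hits first."""
    counts = defaultdict(int)
    for pos in range(len(query) - word_size + 1):
        for name, _ in index.get(query[pos:pos + word_size], []):
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical databank of three short letter strings.
databank = {"seq1": "VFRBCDCMLLCFG", "seq2": "ABCDCDCFG", "seq3": "MMMMMMMM"}
index = build_index(databank)

print(rank_hits("BCDCFG", index))   # [('seq2', 5), ('seq1', 3)]
```

the index is built once; each query then only touches the entries that share a word with it, so seq3 is never examined.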

Question 12

Q

how is DNA sequenced

Answer

A

Sanger method; however it only lets you sequence relatively short segments of DNA, less than 2000bp

newer methods read long stretches such as 10,000bp, but not very accurately

to sequence whole genomes, many fragments are sequenced and the end of the preceding fragment is matched with the start of the succeeding fragment via alignments; as big an overlap as possible is desired to reduce the chance of errors; however, where a repeated sequence is too large the fragments may be joined in the wrong place, giving an inaccurate representation of large repeated sequences whose size is then underestimated
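
a sketch of joining two reads by their overlap. it assumes exact matching and error-free reads (real assemblers use alignments that tolerate errors); the function names, the minimum overlap of 3 and the reads themselves are made up for illustration:

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left, right, min_overlap=3):
    """Join two reads on their longest overlap, or return None if there is none."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]

# The end of the first read (GTAC) matches the start of the second.
print(merge_reads("ATGGCGTAC", "GTACCTTGA"))        # ATGGCGTACCTTGA

# Repeat problem: both reads lie inside a run of CAT repeats. Taking the
# biggest overlap (9 bases) keeps only 3 copies of CAT, even if the real
# genome had more, so the size of the repeat is underestimated.
print(merge_reads("AAACATCATCAT", "CATCATCATGGG"))  # AAACATCATCATGGG
```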

Question 13

Q

how is machine learning used in bioinformatics

Answer

A

machine learning is a general class of software which learns from examples and is then able to make predictions

you can train this software with examples of real transcription sites, intron/exon boundaries, sequence composition etc; lots of examples are given

the machine then learns features of these examples, and applies those features to make predictions on new data

examples are neural networks and decision trees

Question 14

Q

what is used to determine protein sequences

Answer

A

DNA sequencing (followed by translation of the gene sequence) has now mostly replaced direct amino acid sequencing for determining protein sequences

Question 15

Q

what are problems with bioinformatics

Answer

A

some databank entries contain (or may contain) mistakes

some genes have many names

some different genes have the same name

(15 cards)