bioinformatics Flashcards

1
Q

what is the difference between a database and a databank?

A

database: a structured collection of data with a tool enabling it to be queried/searched, often split into tables
databank: collection of data, normally in simple text files, without associated query tool

2
Q

what are different types of databanks

A

primary databank: raw sequence data, possibly with annotations; may also have feature information such as splice sites, S-S (disulphide) sites etc.

secondary databank: derived data, patterns, annotations; generally highly annotated

composite: non redundant sets of data, derived from primary databases
gateways: websites which allow access to data, links between databanks

3
Q

what is bioinformatics

A

the application of computers to problems in biology

4
Q

what are examples of primary databanks, what do they contain

A

genbank, EMBL-ENA, DDBJ: for DNA

uniprot: protein primary databank, formed from others such as swissprot, PIR

PDB: protein databank, contains protein structural information

ENZYME: enzyme databank, contains enzyme classifications

they simply contain sequence data (DNA or protein)

may also have “feature information” such as splice sites, disulphide bridges etc

DNA databanks may also contain translations; known or predicted

(a protein identified from genome data is hypothetical until verified by experiment)

5
Q

give examples of secondary databanks, what do they contain

A

prosite, prints, blocks, interpro

they contain derived information, patterns that characterise a protein family, with detailed annotations

6
Q

give examples of simple prosite patterns

A

protein kinase C phosphorylation site: [ST]-x-[RK]

square brackets mean or, and x means any amino acid; so a phosphorylation site is either serine or threonine, followed by anything, followed by arginine or lysine

N-linked glycosylation site:

N-{P}-[ST]-{P}

{} brackets mean not, so anything but these amino acids

a sequence that matches one of these patterns is not necessarily a real site, however all of the real sites match the pattern
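The bracket rules above map directly onto regular expressions. The following is a minimal sketch, assuming only the simple pattern elements described on this card (single letters, x, [..] and {..}); the function names and the test sequences are made up for illustration and it does not handle the rest of the PROSITE syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")                          # x = any amino acid
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")   # {P} = anything but P
        else:
            parts.append(element)                      # a letter or an [ST] set reads the same in regex
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return 0-based start positions of every match, including overlapping ones."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]

PKC_SITE = "[ST]-x-[RK]"
N_GLYCOSYLATION = "N-{P}-[ST]-{P}"

# Hypothetical sequences, invented only to exercise the patterns.
print(find_sites(PKC_SITE, "AASAKAA"))
print(find_sites(N_GLYCOSYLATION, "GNGTAPNPSA"))
```

As the card says, a hit only marks a candidate site; it does not prove the site is used in the real protein.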

7
Q

what is a dotplot

A

method for comparing proteins/DNA etc, comparing how similar they are

the amino acid sequence of one protein is plotted along the horizontal axis and the sequence of the other along the vertical axis; every time the two residues match, a dot is placed

this can be used for DNA as well, however there is a larger amount of noise because there are only 4 bases as opposed to 20 AAs

identical sequences give a straight, unbroken diagonal line down the middle
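A dotplot with simple identity matching can be sketched in a few lines. This is only an illustration: the function name and the use of `*` and `.` as plot characters are assumptions, and real tools usually compare windows of residues rather than single residues to cut down noise.

```python
def dotplot(seq_x, seq_y):
    """Return the dot matrix as text rows, one row per residue of seq_y."""
    rows = []
    for residue_y in seq_y:
        # '*' marks a match between the row residue and the column residue
        rows.append("".join("*" if residue_x == residue_y else "." for residue_x in seq_x))
    return rows

# Plotting a sequence against itself shows the unbroken main diagonal,
# plus off-diagonal dots wherever a residue repeats.
for row in dotplot("ABCDCDCFG", "ABCDCDCFG"):
    print(row)
```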

8
Q

how are similarities of DNA/proteins measured

A

similarity is measured with a scoring scheme, which can be used by methods such as dotplots

simple identity scoring gives 1 for a match and 0 for a mismatch

more complex scoring schemes account for similarity between amino acids, for example serine and threonine are very similar, serine and tryptophan are very different

similarity scores typically derived from analysis of aligned homologous proteins where substitutions have been observed
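The difference between the two kinds of scoring scheme can be sketched as follows. The numbers in `TOY_SIMILARITY`, the match score of 4 and the default mismatch of -1 are illustrative values only; real schemes use published substitution matrices derived from aligned homologous proteins.

```python
# Toy similarity values for two amino acid pairs (illustrative only).
TOY_SIMILARITY = {
    frozenset("ST"): 1,    # serine / threonine: very similar
    frozenset("SW"): -3,   # serine / tryptophan: very different
}

def identity_score(a, b):
    """Simple identity scoring: 1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0

def similarity_score(a, b, match=4, default_mismatch=-1):
    """Score a pair using the toy table, falling back to a flat mismatch score."""
    if a == b:
        return match
    return TOY_SIMILARITY.get(frozenset((a, b)), default_mismatch)

def score_ungapped(seq_a, seq_b, score):
    """Sum the pair scores position by position (no gaps)."""
    return sum(score(a, b) for a, b in zip(seq_a, seq_b))

print(score_ungapped("AST", "ATT", identity_score))
print(score_ungapped("AST", "ATT", similarity_score))
```

Under identity scoring the S/T pair counts the same as any other mismatch; under the similarity scheme it still contributes a positive score.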

9
Q

what are alignments

A

a way of comparing DNA/amino acid sequences

one sequence is written on a line above the other so that stretches which are the same sit at the same positions
e.g. the sequences ABCDCDCFG and VFRBCDCMLLCFG could be aligned 2 ways:

comparison at the BCDC point:
VFRBCDCMLLCFG
  ABCDCDCFG

comparison at the CFG point:
VFRBCDCMLLCFG
    ABCDCDCFG

alignment methods insert gaps into the sequences to get the optimal score under the chosen scoring scheme

2 types of alignment: global and local

global alignment lines up the two sequences along their whole length, from end to end

local alignment finds the region of one sequence that optimally matches a region of the other
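The two types of alignment can be scored with the same dynamic programming table. The sketch below is a minimal illustration in the style of the Needleman-Wunsch (global) and Smith-Waterman (local) methods; the function name and the scores of +1 match, -1 mismatch and -1 per gap position are assumptions chosen for simplicity, and it returns only the optimal score, not the alignment itself.

```python
def alignment_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal global (default) or local alignment score by dynamic programming."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised along the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align the two residues
                       table[i - 1][j] + gap,        # gap in seq_b
                       table[i][j - 1] + gap)        # gap in seq_a
            if local:
                cell = max(cell, 0)                  # local: a region may start anywhere
                best = max(best, cell)
            table[i][j] = cell
    # Global score sits in the bottom-right cell; local score is the best cell anywhere.
    return best if local else table[-1][-1]

print(alignment_score("ABCDCDCFG", "VFRBCDCMLLCFG"))
print(alignment_score("ABCDCDCFG", "VFRBCDCMLLCFG", local=True))
```

The table has one cell per pair of positions, which is why the work grows with the product of the two sequence lengths.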

10
Q

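what is the problem with dynamic programming methods

A

dynamic programming methods (alignments), like dotplots, are slow: every position of one sequence is compared with every position of the other, so the work grows with the product of the two lengths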

11
Q

what are examples of methods for comparing sequences quickly

A

fast methods; FASTA/BLAST

they are known as heuristics

indexes are created containing every word in the database

book analogy:
like the index of a book, looking up a word shows which pages contain it and which pages contain it most often

the search picks out database sequences that share many words with the query, even if there are a few differences
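The word-index idea can be sketched as below. This shows only the indexing and look-up step, not how FASTA or BLAST are actually implemented; the function names, the word size of 3 and the tiny example database are assumptions for illustration.

```python
from collections import defaultdict

def build_word_index(database, word_size=3):
    """Map every word in every database sequence to the places it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for position in range(len(sequence) - word_size + 1):
            word = sequence[position:position + word_size]
            index[word].append((name, position))
    return index

def count_word_hits(query, index, word_size=3):
    """Count, per database sequence, how many query words are found in the index."""
    hits = defaultdict(int)
    for position in range(len(query) - word_size + 1):
        word = query[position:position + word_size]
        for name, _ in index.get(word, []):
            hits[name] += 1
    return dict(hits)

# Hypothetical two-entry database; the index is built once and reused for every query.
database = {"seq1": "VFRBCDCMLLCFG", "seq2": "QQQQQQ"}
index = build_word_index(database)
print(count_word_hits("ABCDCDCFG", index))
```

Sequences with many word hits are the candidates worth aligning properly; the rest of the database is never compared in detail, which is where the speed comes from.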

12
Q

how is DNA sequenced

A

sanger method, however it only lets you sequence relatively short segments of DNA; less than 2000bp

newer methods do long stretches such as 10,000bp but not very accurately

to sequence whole genomes, the end of the preceding sequence is matched with the start of the succeeding sequence via alignments; as big an overlap as possible is desired to reduce the chance of errors, however large repeated sequences may be represented inaccurately, and their size is then underestimated
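Joining two reads by their overlap can be sketched as follows. This is a simplified illustration that assumes exact matching and a minimum overlap of 3; the function names and reads are made up, and real assembly uses alignments that tolerate sequencing errors.

```python
def longest_overlap(left_read, right_read, min_length=3):
    """Length of the longest suffix of left_read that equals a prefix of right_read."""
    longest_possible = min(len(left_read), len(right_read))
    # Try the biggest overlap first, since a bigger overlap is less likely to be chance.
    for length in range(longest_possible, min_length - 1, -1):
        if left_read.endswith(right_read[:length]):
            return length
    return 0

def merge_reads(left_read, right_read, min_length=3):
    """Join two reads through their overlap, or return None if they do not overlap."""
    overlap = longest_overlap(left_read, right_read, min_length)
    if overlap == 0:
        return None
    return left_read + right_read[overlap:]

# Hypothetical reads, invented for the example.
print(merge_reads("ATGCGTAC", "GTACCTGA"))
```

A repeat longer than the reads looks the same from inside every read that falls within it, so copies of the repeat can be merged into one and its size underestimated.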

13
Q

how is machine learning used in bioinformatics

A

machine learning is a general class of software which learns from examples and is then able to make predictions

you can train this software with examples of real transcription sites, intron/exon boundaries, sequence composition etc, lots of examples are given

machine then learns features of these examples, and applies features to make predictions

examples are neural networks and decision trees

14
Q

what is used to determine protein sequence

A

DNA sequencing has now mostly replaced amino acid sequencing for determining protein sequence

15
Q

what are problems with bioinformatics

A

some databanks have/may have mistakes

some genes have many names

some different genes have the same name
