Bioinformatics definitions Flashcards
database
a collection of related data
–> data = known facts
–> implicit meaning: suggested/understood without being directly stated
–> assume computerized
–> structured with some degree of interaction
tables in a database
= data types
each table has several
records
a type is described by several
attributes
a specific attribute that makes it possible to uniquely identify each record
= key identifier
related data records are linked through
foreign keys
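
A minimal sketch of records, attributes, a key identifier and a foreign key, using Python's built-in sqlite3 module. The table and column names (organism, protein, organism_id) are hypothetical and only serve as illustration.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Each table is a data type; each column is an attribute.
conn.execute("CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE protein ("
    " protein_id INTEGER PRIMARY KEY,"  # key identifier: unique per record
    " name TEXT,"
    " organism_id INTEGER REFERENCES organism(organism_id))"  # foreign key
)

# Each inserted row is a record.
conn.execute("INSERT INTO organism VALUES (1, 'Homo sapiens')")
conn.execute("INSERT INTO protein VALUES (10, 'hemoglobin alpha', 1)")

# Related records are linked through the foreign key.
rows = conn.execute(
    "SELECT protein.name, organism.name FROM protein"
    " JOIN organism ON protein.organism_id = organism.organism_id"
).fetchall()


def insert_orphan():
    """Try to insert a protein that points at a non-existent organism."""
    try:
        conn.execute("INSERT INTO protein VALUES (11, 'orphan', 99)")
        return True
    except sqlite3.IntegrityError:
        return False
```

Here `rows` holds the joined protein/organism pair, and `insert_orphan()` fails because organism 99 does not exist.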
an alignment
= arrangement of sequences that shows where they are similar and where they differ
–> arrangement that results in an optimal score given a subst matrix and gap penalty
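
A minimal sketch of how an optimal score can be found, assuming a global alignment computed with dynamic programming (Needleman-Wunsch). The match/mismatch values stand in for a substitution matrix and the linear gap penalty is an assumption; none of these numbers come from the cards.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences a and b.

    match/mismatch are a toy stand-in for a substitution matrix;
    gap is a linear penalty per gap position (assumption).
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap       # gap in b
            left = cur[j - 1] + gap  # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]
```

For example, `nw_score("ACGT", "ACT")` corresponds to the arrangement ACGT / AC-T: three matches and one gap.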
similarity
sequences are comparable on a set of criteria (can be measured)
homology
sequences have a common ancestor (is or is not)
gaps
= presentation form for insertions and deletions in alignments
substitution matrices
necessary for biologically relevant calculation of the quality of the alignment
PAM matrix
= point accepted mutations matrix
–> based on a model of prot evolution
–> provides a measure of the probability of an amino acid substitution, based on counting the substitutions actually observed in a large DB of evolutionarily similar sequences
1PAM unit
= evolutionary period required to produce an average of one accepted point mutation per 100 amino acids
PAM 1 matrix
= matrix with probabilities for replacements during 1PAM unit
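
A minimal sketch of what a PAM 1 matrix expresses, on a toy two-letter alphabet with made-up probabilities (a real matrix is 20 x 20). It checks that the average change is 1 per 100 positions and, as a commonly used extension not stated on the card, models n PAM units by multiplying the matrix by itself n times.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy values (assumption): two residues, equally frequent.
freqs = [0.5, 0.5]
pam1 = [[0.99, 0.01],   # row i: probability that residue i stays or is replaced
        [0.01, 0.99]]

# Expected fraction of positions that change during 1 PAM unit.
changed = sum(f * (1.0 - pam1[i][i]) for i, f in enumerate(freqs))

# Replacement probabilities after 2 PAM units.
pam2 = mat_pow(pam1, 2)
```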
mutation score
= log(measured mutation freq / expected mutation freq)
–> neg numbers: observed less freq than expected by chance
–> pos numbers: observed more freq than expected by chance
–> zero: observed as freq as expected
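
A minimal sketch of the log-odds calculation. The logarithm base is an assumption (the card does not state one), and the frequencies in the usage note are made up.

```python
import math


def mutation_score(observed_freq, expected_freq, base=2.0):
    """Log-odds score: log of observed over expected mutation frequency.

    The base is an assumption; it only rescales the score, not its sign.
    """
    return math.log(observed_freq / expected_freq, base)
```

With an expected frequency of 0.01, an observed frequency of 0.02 gives a positive score, 0.005 a negative score, and 0.01 a score of zero.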
BLOSUM
= blocks substitution matrix
–> BLAST uses this matrix
–> based on a much larger dataset than PAM to calculate mutation freq
–> based on the blocks database
blocks database
families of prot with similar biochemical functions
–> family members were aligned and blocks of high similarity were considered
–> within the blocks, sequences with similarity higher than a threshold were clustered
–> BLOSUM62: threshold = 62%
BLAST mechanism
- breaks seq down into short words using a sliding window
–> default word size: typically 3 for prot seq, 11 for nucleotide seq
- for each word in the query seq, BLAST generates a list of neighboring words. Similarity is determined using a scoring matrix (e.g. BLOSUM62), and only words that score above a certain threshold are considered neighbors
- BLAST searches the database for occurrences of neighboring words. Efficiency –> look for matches of short words rather than aligning the entire query
- extension: when a match is found in the database, BLAST attempts to extend the alignment in both directions to see if a high-scoring alignment can be formed. Local alignment in the regions surrounding the match
- if the extended alignment score exceeds a pre-defined threshold, it is reported as a significant match or hit. These are further analyzed and ranked using bit score/E-value.
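
The steps above can be sketched roughly as follows. This is a simplified illustration, not BLAST itself: the scoring function is a toy stand-in for BLOSUM62, the extension is ungapped, and all function names and thresholds are hypothetical.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def score_pair(a, b):
    # Toy stand-in for a substitution matrix such as BLOSUM62 (assumption).
    return 5 if a == b else -1


def words(seq, k=3):
    """Sliding window: every (position, word) of length k."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def neighbors(word, threshold):
    """All words scoring at least `threshold` against `word`."""
    found = []
    for cand in product(ALPHABET, repeat=len(word)):
        if sum(score_pair(a, b) for a, b in zip(word, cand)) >= threshold:
            found.append("".join(cand))
    return found


def find_seeds(query, subject, k=3, threshold=9):
    """Look up neighboring words of the query in an index of the subject."""
    index = {}
    for j, w in words(subject, k):
        index.setdefault(w, []).append(j)
    seeds = []
    for i, w in words(query, k):
        for nb in neighbors(w, threshold):
            for j in index.get(nb, []):
                seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, k=3, x_drop=4):
    """Ungapped extension of a seed in both directions.

    Stops in a direction once the running score falls more than
    x_drop below the best score seen so far.
    """
    best = run = sum(score_pair(query[i + t], subject[j + t]) for t in range(k))
    right = 0
    step = 0
    while i + k + step < len(query) and j + k + step < len(subject):
        run += score_pair(query[i + k + step], subject[j + k + step])
        step += 1
        if run > best:
            best, right = run, step
        elif best - run > x_drop:
            break
    run, left, step = best, 0, 1
    while i - step >= 0 and j - step >= 0:
        run += score_pair(query[i - step], subject[j - step])
        if run > best:
            best, left = run, step
        elif best - run > x_drop:
            break
        step += 1
    return best, (i - left, i + k + right)  # score and query span


query, subject = "MKVLAT", "GGMKVLATGG"
seeds = find_seeds(query, subject)
# Report extensions whose score exceeds a pre-defined (hypothetical) cutoff.
hits = [h for h in (extend(query, subject, i, j) for i, j in seeds) if h[0] >= 20]
```

Indexing the subject by word is what makes the lookup cheap: only positions that share a neighboring word with the query are ever extended.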
Raw score (S)
= the sum of substitution and gap scores
–> little to no meaning, is dependent on the scoring system
–> identical alignments with a diff subst matrix will yield a diff S
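
A minimal sketch of the raw score as a sum over alignment columns, assuming a linear gap penalty and two made-up scoring schemes. It illustrates that the same alignment gets a different S under a different scoring system.

```python
def raw_score(aligned_a, aligned_b, subst, gap_penalty):
    """Sum substitution scores and gap penalties over alignment columns.

    aligned_a / aligned_b are equal-length strings with '-' for gaps;
    a linear gap penalty is an assumption made for simplicity.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += subst(a, b)
    return total


def scheme_1(a, b):
    return 1 if a == b else -1   # hypothetical scoring scheme


def scheme_2(a, b):
    return 5 if a == b else -4   # another hypothetical scheme


s1 = raw_score("ACG-T", "ACGAT", scheme_1, gap_penalty=-2)   # 4 matches, 1 gap
s2 = raw_score("ACG-T", "ACGAT", scheme_2, gap_penalty=-10)  # same alignment
```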
Bit scores (S’)
= normalized raw scores
S’ = (λS - lnK)/ ln2
–> λ & K = scale parameters depending on subst matrix and gap penalties
–> can be used to compare alignment scores from diff searches
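
A minimal sketch of the bit score formula. The λ and K values in the usage note are illustrative only and are not taken from the cards.

```python
import math


def bit_score(raw, lam, k):
    """Normalize a raw score S: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)
```

For example, `bit_score(100, lam=0.267, k=0.041)` converts a raw score of 100 into bits under those illustrative parameters.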
E-value
= expectation value
E = mn 2^(-S’)
–> m = length of the query
–> n = total length of sequences in the database
–> E depends on size of the dataset
–> number of diff alignments with a score equal to or better than S that are expected to occur in a database search by chance
–> the lower the E value the more significant the score
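
A minimal sketch of the E-value formula, with made-up lengths in the usage note. It shows the two properties listed on the card: E grows with database size and shrinks as the bit score rises.

```python
def e_value(bit, m, n):
    """Expected number of chance alignments: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bit)
```

With m = 32 and n = 32, a bit score of 10 gives E = 1; doubling n doubles E, and each extra bit halves it.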
node degree
= total number of edges a node has
node closeness
= how close a node is to all other nodes in the network
node betweenness
= how often a node lies on the shortest paths between other nodes in the network –> node's role as bridge
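
A minimal sketch of the three node measures on a small undirected, unweighted graph stored as an adjacency dict. The normalization of closeness, (reachable nodes) / (sum of distances), is one common choice and an assumption here; betweenness is left unnormalized. The example graph is made up.

```python
from collections import deque


def bfs(graph, source):
    """Distances and number of shortest paths from source to every reachable node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]
    return dist, paths


def degree(graph, node):
    """Total number of edges the node has."""
    return len(graph[node])


def closeness(graph, node):
    """Reachable nodes divided by the sum of their distances (assumed normalization)."""
    dist, _ = bfs(graph, node)
    total = sum(d for other, d in dist.items() if other != node)
    return (len(dist) - 1) / total if total else 0.0


def betweenness(graph, node):
    """Sum over node pairs of the fraction of shortest paths passing through node."""
    info = {s: bfs(graph, s) for s in graph}
    nodes = sorted(graph)
    total = 0.0
    for idx, s in enumerate(nodes):
        for t in nodes[idx + 1:]:
            if node in (s, t):
                continue
            dist_s, paths_s = info[s]
            dist_t, paths_t = info[t]
            if t not in dist_s or node not in dist_s:
                continue  # s cannot reach t or cannot reach node
            if dist_s[node] + dist_t[node] == dist_s[t]:
                total += paths_s[node] * paths_t[node] / paths_s[t]
    return total


# Made-up example: a path A - B - C - D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
```

In this path graph the inner nodes B and C act as bridges, while the end nodes A and D lie on no shortest path between other nodes.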