Sequence Similarity Searching Flashcards
Sequence similarity
similar physicochemical properties - common ancestry - common function
Homology (and similarity)
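homology: sequences share common ancestry (yes or no); similarity: a measurable degree of resemblance (rule of thumb here: >80% similar suggests homology)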
Homologous (2 types and meaning)
Orthologs - speciation event (similar functions)
Paralogs - duplication event (different functions)
Sequence alignment and algorithms
arranges sequences to maximise similarity, reflecting the most likely evolutionary path
Dynamic Programming Algorithms (allow what, 2 types, negative)
exhaustive identification of optimal alignments
too slow for large databases
global - aligns whole sequences end to end (sequences of ~equal length)
local - aligns the most similar local regions (biological relevance, finds conserved patterns)
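A minimal sketch of the two dynamic programming types on the card above, assuming a simple scheme of match +1, mismatch -1 and a per-residue gap penalty of -1 (these values are illustrative, not from the cards). The global version follows the approach usually called Needleman-Wunsch and the local version the one usually called Smith-Waterman; both return only the optimal score.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a against all of b."""
    # Row 0: aligning an empty prefix of a against b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over any pair of local regions of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets a local alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Every cell of the table is filled, which is why the result is optimal and also why the cost grows with the product of the two sequence lengths.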
Scoring alignment
quantification of similarity (what's real vs chance) - scoring matrix
what do gaps and mismatches represent?
gaps - indel events relative to the ancestor; mismatches - substitutions (mutations during replication)
3 types of gap penalties formula
constant: -a
proportional to length: -(a x l)
affine gap: -(a + bl), a >> b; a = gap opening penalty, b = extension penalty, proportional to gap length l - more relevant
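The three formulas can be compared side by side with a short sketch. The default values for a and b are arbitrary illustrations, not recommended settings, and a gap of length 0 is taken to cost nothing.

```python
def constant_gap(length, a=5):
    """Constant penalty: -a for any gap, whatever its length."""
    return -a if length > 0 else 0


def proportional_gap(length, a=2):
    """Penalty proportional to gap length: -(a x l)."""
    return -(a * length)


def affine_gap(length, a=10, b=1):
    """Affine penalty: -(a + b*l), with opening cost a much larger than b."""
    return -(a + b * length) if length > 0 else 0


for l in (1, 3, 10):
    print(l, constant_gap(l), proportional_gap(l), affine_gap(l))
```

With a much larger than b, one long gap costs less than several short ones, which matches the idea that a single indel event can remove or insert several residues at once.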
formula of percentage of identically aligned residues
(number of matches / length) x 100
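A small sketch of the percentage formula. The card does not say which length is meant; this example assumes the alignment length (number of columns, gaps included) and uses "-" as the gap character, both of which are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    # A gap aligned to a gap is not counted as a match.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


print(percent_identity("ACG-T", "ACGAT"))  # 4 matches over 5 columns
```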
in protein alignments, why are aa substitutions not all equal?
protein sequences are under stabilising selection for structure and function
depend on chemical similarity - similar aa substitute more easily
e.g. LEU>ILE (chemically similar, tolerated) vs PRO>TRP (dissimilar, rarely tolerated)
BLOSUM62 substitution matrix
derived from gap-free alignments of short protein motifs (BLOCKS)
higher score - chemically similar, conservative substitution (higher probability of homology)
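To make the scoring concrete, here is a sketch that looks up a handful of BLOSUM62 entries and scores a gap-free pair of peptides. Only six entries of the matrix are included and the helper names are hypothetical; a full analysis would load the complete matrix from a library or file.

```python
# A few BLOSUM62 entries only (the full matrix covers all aa pairs).
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("I", "I"): 4, ("P", "P"): 7, ("W", "W"): 11,
    ("L", "I"): 2,   # conservative: both hydrophobic, similar size
    ("P", "W"): -4,  # non-conservative
}


def sub_score(x, y):
    """Score for substituting residue x with y (the matrix is symmetric)."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]  # KeyError if the pair is not in the subset


def ungapped_score(a, b):
    """Sum of substitution scores over a gap-free alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("gap-free alignment needs sequences of equal length")
    return sum(sub_score(x, y) for x, y in zip(a, b))


print(sub_score("L", "I"), sub_score("P", "W"))
```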
Heuristic Algorithms (vs DPAs) - example
look for short, high-scoring exact matches first (break query into short words, look for matches, then see if they can be extended)
faster than DPAs, but not guaranteed to find the optimal alignment
example: BLAST
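The word-matching step on the card above can be sketched as follows. This assumes exact word matches only and a word size of 3 chosen for illustration; the function name is hypothetical and this is not how any particular BLAST release is implemented.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    # Index every word of the query by its start position.
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    # Scan the subject once and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


print(find_seeds("ACGTAC", "TTACGT"))
```

The subject is scanned once with dictionary lookups instead of filling a full table, which is where the speed advantage over dynamic programming comes from.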
BLAST
all word matches above threshold are extended in both directions without gaps
the resulting ungapped local alignments are high-scoring segment pairs (HSPs)
blastn - nt vs nt (gene)
blastp - protein vs protein (protein)
blastx - translated nt vs protein (which protein a DNA sequence codes for)
tblastn - protein vs translated nt (which DNA sequence encodes a protein)
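The gap-free extension of a word match into an HSP can be sketched as below. The stopping rule (give up once the running score falls more than `drop` below the best score seen) and the +1/-1 scoring are assumptions made for illustration; the cards only say that extension proceeds without gaps.

```python
def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Extend the seed at query[i:i+w], subject[j:j+w] without gaps.

    Returns (best_score, q_start, q_end) so the HSP is query[q_start:q_end].
    """
    score = best = w * match
    q_start, q_end = i, i + w

    # Extend to the right of the seed.
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, q_end = score, qi
        elif best - score > drop:
            break

    # Extend to the left, starting again from the best score so far.
    score = best
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, q_start = score, qi
        elif best - score > drop:
            break
        qi -= 1
        sj -= 1

    return best, q_start, q_end


q, s = "TTACGTACGG", "GGACGTACCC"
score, start, end = extend_seed(q, s, 2, 2)
print(score, q[start:end])
```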
p values
probability of observing an alignment scoring at least as high between 2 unrelated sequences of similar length and composition
significant match - p < 0.05
Expect values (E)
how often a match at a given p value would be expected to occur in the database by chance (i.e. between biologically unrelated sequences) - threshold for significance
E = p x X (should be <= 0.01)
X - total length of all sequences in database / length of aligned sequence
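A worked example of the card's formula, using made-up numbers. This follows the simplified E = p x X relationship from the card, not the statistics a real BLAST run reports; the function name and the input values are assumptions.

```python
def expect_value(p, db_total_length, aligned_length):
    """E = p * X, with X = total database length / length of aligned sequence."""
    x = db_total_length / aligned_length
    return p * x


# Hypothetical numbers: p = 1e-8, a 1,000,000-residue database, a 100-residue alignment.
e = expect_value(1e-8, 1_000_000, 100)
print(e, "significant" if e <= 0.01 else "not significant")
```

The same p value gives a larger E in a larger database, which is why E rather than p is used as the threshold when searching.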