Patterns in Protein Sequences Flashcards
PROSITE aims
Database by SIB (Swiss Institute of Bioinformatics)
Database of protein domains, families and functional sites.
PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns. ProRule increases the discriminatory power of those profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
PROSITE Syntax
- “A”… standard IUPAC one-letter codes
- “x”: position where any amino acid (aa) is accepted
- “[ALT]”: ambiguity; any of Ala, Leu or Thr
- “{ALT}” (curly braces): negative ambiguity; any aa except Ala, Leu or Thr
- “x(2,4)”, “L(3)”: repetition; two to four amino acids of any type, exactly three Leu
- “-”: separator between pattern elements
- “<M”: N-terminal Met
- “A>”: C-terminal Ala
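The PROSITE elements listed above map almost one-to-one onto regular-expression syntax. The following is a minimal sketch of such a translation in Python. The function name `prosite_to_regex` is hypothetical and not part of any PROSITE tool, and the sketch assumes that "<" and ">" appear only at the very start or end of a pattern (it does not handle anchors inside square brackets).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex (illustrative only)."""
    pattern = pattern.rstrip(".")            # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):       # "-" separates pattern elements
        n_term = element.startswith("<")     # "<" anchors the N-terminus
        c_term = element.endswith(">")       # ">" anchors the C-terminus
        element = element.strip("<>")
        # Split off an optional repetition such as "(3)" or "(2,4)"
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, rep = m.group(1), m.group(2)
        if core == "x":                      # any amino acid
            core = "."
        elif core.startswith("{"):           # negative ambiguity {ALT} -> [^ALT]
            core = "[^" + core[1:-1] + "]"
        # single letters and [ALT] are already valid regex
        if rep:
            core += "{" + rep + "}"          # x(2,4) -> .{2,4}, L(3) -> L{3}
        parts.append(("^" if n_term else "") + core + ("$" if c_term else ""))
    return "".join(parts)

# Pattern built from the elements on this card
card_regex = prosite_to_regex("<M-x(2,4)-[ALT]-{ALT}-L(3)-A>")

# N-glycosylation site pattern (PROSITE PS00001), searched in a made-up sequence
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}")
glyco_hit = re.search(glyco_regex, "AANGTWAA")
```

Here `card_regex` comes out as `^M.{2,4}[ALT][^ALT]L{3}A$`, which lines up with the regex notation used on the Regex Syntax card.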
regex tool for the terminal, syntax
Regular Expression Tools
• a powerful language for specifying patterns
• used in different contexts and in different “dialects”
- Linux shells
- text editors
- protein motif data bases (e.g. PROSITE)
- R: grep('ALA', psa.7rsa$AA)
• pattern
- in everyday life
- in biological sequences: Proteins, DNA
Regex Syntax
• characters:
- exact matches: x q ATG SS2010
• metacharacters:
- characters with special meaning:
[ ] . ? * + | { } ( ) \ ^ $
• metasymbols:
- sequences of characters with special meaning:
\s \t \n \w
• “A”… standard IUPAC one-letter codes
• “.”: position where any aa is accepted
• “[ALT]”: ambiguity; any of Ala, Leu or Thr
• “[^ALT]”: negative ambiguity; any aa except Ala, Leu or Thr
• “.{2,4}”, “L{3}” (curly braces): repetition; two to four amino acids of any type, exactly three Leu
• “+”: one or more
• “*”: zero or more
• “^M”: N-terminal Met
• “A$”: C-terminal Ala
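To see the elements above in action, here is a small sketch using Python's `re` module, which is one of the many regex "dialects". The sequence `seq` is a made-up toy example, not a real protein.

```python
import re

seq = "MKTLLLAGHA"   # hypothetical toy sequence, one-letter codes

starts_with_met = re.search(r"^M", seq) is not None   # "^M": N-terminal Met
ends_with_ala = re.search(r"A$", seq) is not None     # "A$": C-terminal Ala

not_alt = re.findall(r"[^ALT]", seq)       # every residue that is not Ala, Leu or Thr
leu_runs = re.findall(r"L+", seq)          # "+": one or more Leu in a row
triple_leu = re.search(r"L{3}", seq)       # exactly three consecutive Leu
gapped = re.search(r"K.{2,4}A", seq)       # K, then two to four residues of any type, then A
optional_w = re.search(r"GW*H", seq)       # "*": zero or more Trp between G and H

print(starts_with_met, ends_with_ala)
print(not_alt, leu_runs)
print(triple_leu.start(), gapped.group(0), optional_w.group(0))
```

Note that `.{2,4}` is greedy: it first tries four residues and only backs off if the rest of the pattern fails to match.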
grep real regex syntax - reading
Command-line Tool grep
• can be used to search text for patterns
• extracts the lines of a file whose content matches the pattern
• is one of the most useful and powerful Linux commands
Syntax: grep [options] pattern file
Example:
[selbig@white LehreSS16]$ grep M ss_aa_matrix.txt
M 48532 331720 58130 227032 137 ..
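The core behaviour of grep (print every line that contains a match) can be sketched in a few lines of Python. This is only an illustration of the idea, not a reimplementation of grep. The helper `grep_lines` and the table rows are hypothetical and are not taken from ss_aa_matrix.txt.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain at least one match of the regex `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up rows standing in for a whitespace-separated text file
rows = [
    "A 10 20 30",
    "M 5 7 9",
    "L 3 9 27",
]

met_rows = grep_lines(r"M", rows)        # comparable in spirit to: grep M file
al_rows = grep_lines(r"^[AL] ", rows)    # lines that start with A or L and a space

for row in met_rows:
    print(row)
```

Like grep, the function reports whole lines rather than the matched substring, and a line is reported once no matter how many matches it contains.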
Regex Disadvantages
- too rigid to pick up divergent sequences
- short patterns may be too unspecific and will produce false positives
- cannot encode the relative frequencies of amino acids at a position (every allowed residue counts equally)
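The first two limitations can be illustrated with a short sketch. All patterns and sequences below are invented for the demonstration and do not correspond to real PROSITE entries or proteins.

```python
import re

# Hypothetical motif: rigid, every position is either required or forbidden
motif = re.compile(r"AG[ST]L{2}K")

member = "MAGSLLKD"      # made-up family member that carries the motif
divergent = "MAGSLIKD"   # same sequence with one conservative Leu -> Ile substitution

member_found = motif.search(member) is not None
divergent_found = motif.search(divergent) is not None   # one substitution breaks the match

# Hypothetical short pattern: three positions only, so chance hits are likely
short_pattern = re.compile(r"N[^P][ST]")
unrelated = "MNGTKLNASQWNRSE"                 # made-up sequence with no known function
chance_hits = short_pattern.findall(unrelated)

print(member_found, divergent_found)
print(chance_hits)
```

The third limitation shows up in `[ST]`: the pattern accepts Ser and Thr equally, even if one of them were far more common at that position in the known family members. Profiles, the other descriptor type PROSITE provides, are the usual way to capture such position-specific preferences.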