Bioinformatics (Week 2) Flashcards
- GUI orientated programs
- Graphical User Interface
- Multifunctional
Graphical User Interface
• Easy to use
• No need to understand basic concepts to use
• NEED to understand basic concepts to utilize properly
• One command for Multiple algorithms or steps
• Visually orientated = quick view of multiple sets of data
• Good if you're looking for patterns
• Publishable-quality images
Multi-functional
• Can contain a suite of programs
• Helpful when working with complex data/intricate question
• Use multiple formats
• Usually platform independent
• Most available for Mac, Win +Linux
• Mostly commercial software
• Can cost hundreds of thousands of Rands + patented code
• Restricted performance + ability
• Graphic rendering computationally intensive
• Graphic nature also limits certain functions or operations
- Command line (CLI) programmes
Program that accepts text input to execute operating system functions.
- Represents bulk of available software
- Thousands of derivatives for specific problems
- Specialist + Multi-functional
- Many are focused on a single task only, while some come as program suites
- Helpful when working with multiple sets of data from varying sources
- Very specific file formats
- Difficult to master
- Lack of GUI is intimidating, and there is little support or help
- True Open Source
- Variations of programs pop-up overnight
- Free!
- Mostly Unix system dependent
- Mac has moderate availability with almost none for Windows
- Great processing usage
- Development focuses on efficient disk and CPU usage
- Difficult to interpret
- Without the aid of visualization software, it is more difficult to properly visualize for publishing or reports
- Data returned all text based + heavily reliant on user editing for analysis
- Homologs, paralog and ortholog
Homologs:
- Protein/gene that shares common ancestor + which has good sequence and/or structure similarity to another
Paralog
- Homolog which arose through gene duplication in same species/chromosome
Ortholog
- Homolog which arose through speciation (found in different species)
- Similarity and homology
Similarity
- Likeness or % identity between 2 sequences
- sharing a number of bases or amino acids
- Does not imply homology
- Quantifiable i.e., CAN say x% similar
Homology
- Shared ancestry
- Derived from a common ancestral sequence
- Implies similarity
- Not Quantifiable i.e., NOT x% homologous
- Global and local alignment
Global alignment:
- Attempts to align complete length of one sequence with complete length of the other
o Needleman-Wunsch (1970) algorithm
Local alignment:
- Attempt to find the longest stretches of highest similarity between the two sequences
o Smith-Waterman (1981)
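The difference between the two approaches can be sketched with a pair of small dynamic-programming functions. This is a minimal illustration, not a production aligner: the scoring values (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for the example, and only the best score is returned, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style: best score aligning ALL of a with ALL of b."""
    # Row 0: aligning an empty prefix of a against b costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]  # bottom-right cell: both sequences fully consumed


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style: best score over ANY pair of sub-stretches."""
    best = 0
    prev = [0] * (len(b) + 1)  # edges are 0: an alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets a poor region be dropped and a new
            # alignment start, which is what makes the alignment local.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best  # best cell anywhere in the table
```

With these assumed scores, `TTTGATTACATTT` against `CCCGATTACACCC` gives a local score of 7 (the shared `GATTACA` core) but a global score of only 1, because the global alignment must also pay for the mismatched flanks.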
- Pairwise alignments
- Describe percent identity 2 sequences share + % similarity
- Score of a pairwise alignment includes positive values for exact matches, + other scores for mismatches and gaps
- Based on a scoring matrix
- PAM and BLOSUM
• PAM10 and BLOSUM80
• PAM250 and BLOSUM30
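As a small worked example of percent identity, the sketch below counts identical columns in an already-aligned pair. It assumes gaps are written as `-` and that the denominator is the full alignment length including gap columns; other conventions (for example dividing by the shorter sequence length) exist and give different percentages. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences have the same residue.

    Assumption: the denominator is the number of alignment columns,
    gap columns included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


# 8 columns, 6 identical (one mismatch, one gap column)
print(percent_identity("ACGTACGT", "ACGAAC-T"))  # 75.0
```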
Scoring matrix
- PAM + BLOSUM scoring matrices provide rules for assigning scores.
- PAM10 and BLOSUM80 = examples of matrices appropriate for comparison of closely related sequences.
- PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins.
Scoring matrix
- look under objective for diagram
- What is BLAST, what is its main purpose and Types of BLAST
• BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database.
MAIN purpose = infer homology
Types of BLAST (Diagram - objective)
- Nucleotide-based BLAST
* exact word match, one word match
- Protein-based BLAST
* neighborhood words, two word matches within 40 residues
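The "exact word match" idea for nucleotide searches can be sketched as a simple word lookup: index every word of the subject, then look up every word of the query. This is only the seeding step under simplifying assumptions; real BLAST then extends each hit into a longer alignment. The function name and the default word size of 11 are assumptions for illustration.

```python
from collections import defaultdict


def word_hits(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index every word of length word_size in the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    # Look up each query word; every exact match is a seed for later extension.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Small word size so the toy sequences produce hits.
print(word_hits("ACGTAC", "TTACGTACTT", word_size=4))  # [(0, 2), (1, 3), (2, 4)]
```

The protein case differs in that it also accepts "neighborhood" words that score well against the query word under a substitution matrix; that step is not shown here.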
- Raw score (S)
- Calculated as the sum of identities, substitution matrix + gap scores.
- Substitution scores are given by a look-up table (PAM, BLOSUM)
- Gap scores calculated from G, the gap opening penalty, and L, the gap extension penalty
- For a gap length of n, gap cost = G + Ln
- Usually a high value for G and lower value for L
- Alignment specific
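The raw score calculation can be sketched as follows. The gap cost follows the formula above (G + Ln); the substitution function is a toy stand-in (match +5, mismatch -4) for a real PAM or BLOSUM look-up table, and the defaults G = 11 and L = 1 are assumed values for illustration. All names are hypothetical.

```python
def gap_cost(n, gap_open=11, gap_extend=1):
    """Cost of one gap of length n: G + L*n."""
    return gap_open + gap_extend * n


def toy_substitution(x, y):
    """Toy stand-in for a PAM/BLOSUM look-up: +5 for a match, -4 otherwise."""
    return 5 if x == y else -4


def raw_score(aligned_a, aligned_b, substitution=toy_substitution,
              gap_open=11, gap_extend=1):
    """Sum substitution scores and subtract the cost of each gap run."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    run_length = 0   # length of the gap run currently being read
    run_side = None  # which sequence ("a" or "b") holds the current gap run
    for x, y in zip(aligned_a, aligned_b):
        side = "a" if x == "-" else "b" if y == "-" else None
        # A gap run ends when we reach a residue pair or the gap switches side.
        if side != run_side and run_length:
            score -= gap_cost(run_length, gap_open, gap_extend)
            run_length = 0
        run_side = side
        if side is None:
            score += substitution(x, y)
        else:
            run_length += 1
    if run_length:  # alignment ended inside a gap run
        score -= gap_cost(run_length, gap_open, gap_extend)
    return score


# 6 matches (6 * 5 = 30) minus one gap of length 2 (11 + 1*2 = 13)
print(raw_score("ACGTTTAC", "ACG--TAC"))  # 17
```

Because G is larger than L, one gap of length 2 (cost 13) is cheaper than two separate gaps of length 1 (cost 24), which is why a high G and low L favour fewer, longer gaps.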
- Bit score (S’)
- Derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account
- Bit score calculated based on frequency of a particular aligned character pair compared to frequency of the same character pair in a random sequence
- Bit scores have been normalized with respect to the scoring system (normalized for "effective length") + used to compare alignment scores from different searches
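The normalization is usually written as S' = (λS − ln K) / ln 2, where λ and K are statistical parameters of the scoring system (the Karlin-Altschul parameters). A minimal sketch, with the function name and the example parameter values below being assumptions for illustration only:

```python
import math


def bit_score(raw, lam, k):
    """Convert a raw score S to a bit score S' = (lam * S - ln K) / ln 2.

    lam and k depend on the scoring matrix and gap penalties used,
    so they must come from the same scoring system as the raw score.
    """
    return (lam * raw - math.log(k)) / math.log(2)


# Illustrative parameter values only (assumed, not taken from these notes).
print(bit_score(100, lam=0.267, k=0.041))
```

Because λ and K absorb the properties of the scoring system, two bit scores can be compared even when the raw scores came from different matrices.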
- E-value
• Significance of each alignment computed as E-value
Number of hits of score ≥ S expected by chance when searching given string in a database of a particular size
- Based on random database of similar size
- Lower means more significant indicating that the observed sequence similarity is unlikely to have arisen purely by chance
- Used to assess statistical significance of alignment
- E value is closely related to the standard P value (P = 1 − e^(−E)); the two are nearly equal when E is small (< 0.01)
- Significant if E < 0.001 (smaller numbers = more significant)
• A sequence alignment that has an E-value of 0.001 means that this similarity has roughly a 1 in 1000 chance of occurring by chance alone
OR
• in a random database of similar size, 0.001 is the expected number of alignments with a similar or better score S
E value depends on
(a) Similarity Score (Bit Score): Higher similarity score (e.g., high % seq id) = smaller E-value
(b) Length of the query: a given Similarity Score is more easily obtained by chance with a longer query sequence, so longer queries = larger E-values
(c) Size of the database: a larger database makes a given Similarity Score easier to obtain by chance, so larger database = larger E-values
- very low E values (< e-100) = homologs or identical genes
- moderate E values (~ e-50) = related genes
- long list of gradually declining E values indicates large gene family
- long regions of moderate similarity are more significant than short regions of high identity
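The three dependencies of the E-value can be seen in the commonly quoted relation E = m · n · 2^(−S'), where m is the query length, n is the database length and S' is the bit score. The sketch below is a simplification under stated assumptions: it uses plain lengths, whereas BLAST itself uses corrected "effective" lengths, and the function names are hypothetical. The P value conversion P = 1 − e^(−E) is included for comparison.

```python
import math


def e_value(bit_score, query_length, database_length):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # expm1 keeps precision when E is very small


# Same bit score, same query, database 1000 times larger: E grows 1000-fold.
small_db = e_value(50, query_length=300, database_length=1_000_000)
large_db = e_value(50, query_length=300, database_length=1_000_000_000)
print(small_db, large_db)

# For small E, the P value is almost the same number.
print(p_value(0.001))
```

Each extra bit of score halves E, while doubling either the query length or the database size doubles it.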