Bioinformatics (Week 2) Flashcards

Question 1

Q

GUI orientated programs
- Graphical User Interface
- Multifunctional

Answer

A

Graphical User Interface
• Easy to use
• Can be run without understanding the basic concepts
• BUT you NEED to understand the basic concepts to use it properly
• One command for Multiple algorithms or steps
• Visually orientated = quick view of multiple sets of data
• Good if you're looking for patterns
• Publishable-quality images

Multi-functional
• Can contain a suite of programs
• Helpful when working with complex data/intricate question
• Use multiple formats
• Usually platform independent
• Most are available for Mac, Windows + Linux
• Mostly commercial software
• Can cost hundreds of thousands of Rands + the code is patented
• Restricted performance + ability
• Graphic rendering is computationally intensive
• Graphic nature also limits certain functions or operations

Question 2

Q

Command line (CLI) programs

Answer

A

A program that accepts text input to execute operating system functions.

Represents bulk of available software
Thousands of derivatives for specific problems
Specialist + multi-functional
Many are focused on a single task only, while some come as program suites
Helpful when working with multiple sets of data from varying sources
Very specific file formats
Difficult to master
Lack of a GUI is intimidating, and there is little support or help
True Open Source
Variations of programs pop up overnight
Free!
Mostly Unix system dependent
Mac has moderate availability, with almost none for Windows
Great processing performance
Development focuses on efficient disk and CPU usage
Difficult to interpret
Without the aid of visualization software, it is more difficult to properly visualize results for publishing or reports
Data returned is all text-based + analysis relies heavily on user editing

Question 3

Q

Homologs, paralogs and orthologs

Answer

A

Homolog
- Protein/gene that shares a common ancestor with another + usually has good sequence and/or structural similarity to it

Paralog
- Homolog that arose through gene duplication within the same genome (same species)

Ortholog
- Homolog that arose through speciation (found in different species)

Question 4

Q

Similarity and homology

Answer

A

Similarity

Likeness or % identity between 2 sequences, i.e., sharing a number of bases or amino acids
Does not imply homology
Quantifiable, i.e., CAN say x% similar

Homology

Shared ancestry
Derived from a common ancestral sequence
Implies similarity
Not Quantifiable i.e., NOT x% homologous
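
Since similarity is quantifiable, it can be computed directly from an alignment. A minimal sketch, assuming the two sequences are already aligned to equal length with "-" as the gap character. The function name percent_identity is hypothetical, and ignoring gapped columns in the denominator is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are left out of the count
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# 5 of the 7 aligned positions are identical
print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

A high value here says the sequences are similar; whether they are homologous is a separate inference.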

Question 5

Q

Global and local alignment

Answer

A

Global alignment:
- Attempts to align the complete length of one sequence with the complete length of the other
o Needleman-Wunsch (1970) algorithm

Local alignment:
- Attempts to find the highest-scoring stretches of similarity between the two sequences
o Smith-Waterman (1981)
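
The difference between the two approaches shows up in the dynamic-programming recurrence. A minimal sketch that returns only the best score (no traceback), assuming a simple scheme of match = +1, mismatch = -1 and a linear gap penalty of -2. These values and the function names are illustrative assumptions, not taken from the cards.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score aligning ALL of a with ALL of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps cost
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]  # bottom-right cell covers both full lengths


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over any pair of sub-stretches."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row: an alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # floor at 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best  # best cell anywhere in the table


# Both sequences contain GATTACA, flanked by unrelated bases
print(global_score("TTTGATTACAGGG", "CCGATTACACC"))
print(local_score("TTTGATTACAGGG", "CCGATTACACC"))
```

The only changes for local alignment are the zero floor, the zero-initialised borders, and taking the maximum over the whole table instead of the final cell.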

Question 6

Q

Pairwise alignments

Answer

A

Describe the percent identity that 2 sequences share + their % similarity
Score of a pairwise alignment includes positive values for exact matches + other scores for mismatches and gaps
Based on a scoring matrix

Question 7

Q

PAM and BLOSUM

• PAM10 and BLOSUM80
• PAM250 and BLOSUM30
• Scoring matrix

Answer

A

PAM + BLOSUM scoring matrices provide rules for assigning scores.
PAM10 and BLOSUM80 = examples of matrices appropriate for comparison of closely related sequences.
PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins.

Scoring matrix
- look under objective for diagram

Question 8

Q

What is BLAST, what is its main purpose, and what are the types of BLAST?

Answer

A

• BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database.

MAIN purpose = infer homology

Types of BLAST (Diagram - objective)

Nucleotide-based BLAST
* seeds on exact word matches; one word match triggers an extension
Protein-based BLAST
* seeds on neighborhood words; two word matches within 40 residues trigger an extension
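
The exact word match step for nucleotide searches can be sketched as a lookup of fixed-length words. This is only an illustration of the seeding idea, not BLAST's actual implementation. The function name exact_word_hits and the small word size w=4 are assumptions chosen to keep the example short.

```python
from collections import defaultdict


def exact_word_hits(query, subject, w=4):
    """Return (query_pos, subject_pos) pairs where a w-letter word matches exactly."""
    # Index every word of length w in the subject by its start position
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)

    # Look up every query word in the index
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


# Only the word ACGT is shared: query position 0, subject position 2
print(exact_word_hits("ACGTAC", "TTACGTTT"))
```

Each hit is a seed that a real search would then try to extend into a longer local alignment.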

Question 9

Q

Raw score (S)

Answer

A

Calculated as the sum of the substitution scores (identities + mismatches) and the gap scores.
Substitution scores are given by a look-up table (PAM, BLOSUM)
Gap scores are calculated from G, the gap opening penalty, and L, the gap extension penalty
For a gap length of n, gap cost = G + Ln
Usually a high value for G and a lower value for L
Alignment specific
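
A worked sketch of the raw score for one fixed alignment, using the gap cost G + Ln from this card. The substitution table below is a small toy table with illustrative values only; a real search would load a full PAM or BLOSUM matrix. The names raw_score and TOY_MATRIX and the defaults G=11, L=1 are assumptions for the example.

```python
# Toy look-up table (illustrative values). Keys are alphabetically sorted pairs.
TOY_MATRIX = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("I", "L"): 2, ("A", "I"): -1, ("A", "L"): -1,
    ("A", "W"): -3, ("I", "W"): -3, ("L", "W"): -2,
}


def raw_score(aligned_a, aligned_b, matrix=TOY_MATRIX, G=11, L=1, gap="-"):
    """Sum substitution scores and subtract G + L*n for every gap of length n."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    open_in = None  # sequence holding the current gap run: "a", "b" or None
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap or y == gap:
            side = "a" if x == gap else "b"
            if side != open_in:
                score -= G  # opening penalty, paid once per gap
                open_in = side
            score -= L      # extension penalty, paid for every gapped column
        else:
            score += matrix[tuple(sorted((x, y)))]
            open_in = None
    return score


# 4 + 2 + 11 - (11 + 1*2) + 4 + 4 = 12
print(raw_score("AIW--LA", "ALWAALA"))
```

Because G is charged once and L per column, one gap of length 2 costs 13 here, while two separate gaps of length 1 would cost 24.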

Question 10

Q

Bit score (S’)

Answer

A

Derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account
Bit score calculated based on frequency of a particular aligned character pair compared to frequency of the same character pair in a random sequence
Bit scores have been normalized with respect to the scoring system (normalized for “effective length”) + can be used to compare alignment scores from different searches
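
The normalization is usually written as S' = (λS − ln K) / ln 2, where λ and K are the statistical parameters of the scoring system. A minimal sketch of that conversion; the function name bit_score and the λ and K values in the example call are illustrative assumptions, since the real values depend on the matrix and gap penalties used.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Assumed, illustrative parameters for one scoring system
print(round(bit_score(100, lam=0.267, K=0.041), 1))
```

Two raw scores from different matrices are not comparable, but their bit scores are, because each has been rescaled by its own λ and K.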

Question 11

Q

E-value

Answer

A

• Significance of each alignment is computed as an E-value
Number of hits with score ≥ S expected by chance when searching with a given query in a database of a particular size

Based on random database of similar size
Lower means more significant indicating that the observed sequence similarity is unlikely to have arisen purely by chance
Used to assess statistical significance of alignment
E value is related to, but not the same as, a P value: P = 1 − e^(−E), so the two are nearly equal only when E is small (E < 0.01)
Significant if E < 0.001 (smaller numbers = more significant)

• An alignment with an E-value of 0.001 means that a similarity this good has roughly a 1 in 1000 chance of occurring by chance alone
OR
• in a database of similar size, 0.001 is the expected number of chance alignments with similar or better S scores
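
The link between an E-value and a probability can be checked numerically. A minimal sketch, assuming chance hits follow a Poisson distribution, so that the probability of at least one chance hit is P = 1 − e^(−E). The function name p_value_from_e is hypothetical.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E
    return -math.expm1(-e_value)


for e in (0.001, 0.01, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For small E the two numbers are almost identical, which is why an E-value of 0.001 can be read as about a 1 in 1000 chance. For large E they diverge, since P can never exceed 1.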

Question 12

Q

11.1 E-value depends on

Answer

A

(a) Similarity Score (Bit Score): higher similarity score (e.g., high % seq id) = smaller E-value
(b) Length of the query: a given Similarity Score is more easily obtained by chance with a longer query sequence, so longer queries = larger E-values
(c) Size of the database: a given Similarity Score is easier to obtain by chance in a larger database, so larger database = larger E-values
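
These three dependencies can be seen in the usual approximation E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the database length. A minimal sketch with made-up lengths; the function name expected_hits is hypothetical, and real searches use corrected "effective" lengths rather than the raw ones assumed here.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


base = expected_hits(40, query_len=100, db_len=1_000_000)
print(base)
print(expected_hits(50, 100, 1_000_000) / base)  # higher score: E shrinks
print(expected_hits(40, 200, 1_000_000) / base)  # longer query: E grows
print(expected_hits(40, 100, 2_000_000) / base)  # larger database: E grows
```

Each extra bit of score halves E, while doubling either the query length or the database size doubles it.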

Very low E values (< 1e-100) = homologs or identical genes
Moderate E values (~ 1e-50) = related genes
Long list of gradually declining E values indicates a large gene family
Long regions of moderate similarity are more significant than short regions of high identity

(12 cards)