sequence alignment Flashcards
1
Q
bioinformatics resources
A
- algorithm
- set of rules to perform an operation
- same one can be used by different programs
- program
- code that implements an algorithm
- may use stored data, but storing data is not its aim (e.g. PSI-BLAST)
- database
- organised searchable source of biological data
- aim is to store data
2
Q
protein evolution
A
- duplication can lead to divergent evolution
- homologous proteins with related sequences and structures
- often but not always related function
- analyse by alignment
3
Q
alignment features
A
- identity (:)
- gap
- insertion or deletion
- substitution
- can be conservative (.)
- same characteristics e.g. hydrophobicity
- end gap
- one sequence longer than the other
4
Q
paralogs
A
- homolog created by gene duplication within a species
- can result in change of function
- original copy can maintain function
- second copy free to mutate and adopt novel function
5
Q
orthologs
A
- homolog created by speciation
- both species now have a single copy of the same gene
- only one copy per species
- less likely to change
- function needs to be retained
6
Q
requirements of a pairwise protein sequence alignment
A
- scoring scheme of residue similarity
- algorithm to establish the alignment
- aim to combine algorithm and scoring scheme to generate the best alignment in biological terms
- potential to be extended to database searching
7
Q
scoring schemes
A
- simplest would be 1 for an identity and 0 for a mismatch
- better to include similarity of residues
- conservative substitutions indicate more recent changes
- residues tend to retain chemical properties so that function is modified, not destroyed
- gaps also indicate increased distance
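The simplest scheme on this card (1 for identity, 0 otherwise) can be sketched in a few lines of Python. This is only an illustration: the function name `identity_score` and the choice to score gap columns as 0 are assumptions, not part of the card.

```python
def identity_score(aligned_a: str, aligned_b: str) -> int:
    """Score two already-aligned sequences: 1 per identical residue, 0 otherwise.

    Gap characters ('-') contribute 0 here (an assumption for this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for x, y in zip(aligned_a, aligned_b)
        if x == y and x != "-"
    )


# Six columns hold identical residues, so the score is 6.
print(identity_score("HEAGAWGHEE", "HEA-AWHE-E"))
```

A scheme like this treats a Leu/Ile swap the same as a Leu/Asp swap, which is why substitution matrices that reward similar residues are preferred.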
8
Q
BLOSUM
A
- blocks substitution matrix
- aligned segments of protein families (blocks)
- blosum62:
- sequences within each block sharing ≥62% pairwise identity are clustered (counted as one) before substitutions are tallied
- most widely used
9
Q
blosum62
A
- substitution matrix
- score for changing one residue to another
- represents chemical similarity
- e.g. Cys: disulfide bond formation means high conservation
- presence in both sequences indicates similarity
- high score (9)
- low negative score if properties change
- e.g. hydrophobic to charged
- empirical
- gaps considered later
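As a rough illustration of how a substitution matrix is used, the sketch below looks up scores in a small hand-copied excerpt of BLOSUM62. The dictionary `BLOSUM62_EXCERPT` and the helper `substitution_score` are hypothetical names for this example; a real program would load the full 20x20 matrix.

```python
# A few BLOSUM62 entries, stored once per unordered residue pair.
BLOSUM62_EXCERPT = {
    ("C", "C"): 9,   # Cys conserved: high score
    ("W", "W"): 11,
    ("A", "A"): 4,
    ("I", "L"): 2,   # hydrophobic to hydrophobic: conservative
    ("D", "E"): 2,   # acidic to acidic: conservative
    ("D", "L"): -4,  # charged to hydrophobic: penalised
}


def substitution_score(a: str, b: str) -> int:
    """Look up the score for aligning residue a with residue b."""
    key = tuple(sorted((a, b)))  # the matrix is symmetric
    return BLOSUM62_EXCERPT[key]


print(substitution_score("C", "C"))
print(substitution_score("L", "D"))
```

Sorting the pair before the lookup relies on the matrix being symmetric, so each pair only needs to be stored once.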
10
Q
affine gap penalty
A
- penalise insertions/deletions
- penalty = o + el
- o = gap opening constant
- e = gap extension constant
- l = length of gap extension
- o>e
- gap introduction is the major event
- extending the gap is minor
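The formula on this card can be written as a small function. The default values for `gap_open` and `gap_extend` below are illustrative assumptions, chosen only so that o > e; some programs count the extension length as l - 1 instead, which this sketch does not do.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Penalty = o + e * l, following the card's definition of l."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * length


# One gap of length 4 costs far less than four separate gaps of length 1.
print(affine_gap_penalty(4))
print(4 * affine_gap_penalty(1))
```

With these example values a single gap of length 4 costs 12.0, while four separate single-residue gaps cost 42.0, which reflects the idea that opening a gap is the major event.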
11
Q
protein domains
A
- protein sequences are formed of domains
- each domain can originate from a different homologous family
- domains are the evolutionary unit
- methods need to take this into account
- don’t have to align whole sequence
12
Q
local vs global alignment methods
A
- different algorithms
- part or all of a query can match part or all of a database sequence
- gaps may be needed to get a suitable alignment
13
Q
dotplot
A
- used to assign identities
- one sequence on each axis
- place a dot wherever the residues match; aligned regions show up as runs of dots along a diagonal
- best path has the highest number of dots
- need closely related sequences
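A dotplot is simple enough to sketch as text output. In this minimal version the function name `dotplot` and the use of `*` and `.` as plot characters are assumptions; it marks single-residue identities only, with no window or threshold filtering.

```python
def dotplot(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_y, with '*' where it matches seq_x."""
    return [
        "".join("*" if x == y else "." for x in seq_x)
        for y in seq_y
    ]


# seq_x runs along the horizontal axis, seq_y down the vertical axis.
for row in dotplot("GATTACA", "GATCACA"):
    print(row)
```

For closely related sequences the matches line up as a long diagonal; repeated residues add scattered off-diagonal dots, which is why the method gets noisy for distant sequences.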
14
Q
needleman wunsch algorithm
A
- dynamic programming
- maximises similarity score to give maximum match
- largest number of residues of one sequence that can be matched with another allowing for all possible insertions/deletions
- finds best global alignment
- iterative matrix method
- 2D array of all possible pairs of residues (bases or amino acids)
- one sequence on each axis
- all possible alignments represented by paths through the array
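The matrix method above can be sketched as follows. This is the commonly taught form of global alignment by dynamic programming, not necessarily the original formulation: it assumes a linear gap penalty rather than an affine one, and the `match`, `mismatch` and `gap` values are illustrative.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one best global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

Each cell depends only on its three neighbours (diagonal, above, left), so the whole array is filled in time proportional to the product of the two sequence lengths. When several paths tie, the traceback here returns just one of them.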
15
Q
NW similarity values
A
- Sij = numerical value assigned to every cell in the array
- depends on similarity of the 2 residues
- value of 1 indicates identity