block 2 Flashcards
Bioinformatics
uses computational methods to study protein sequence, structure and function
what is the point of studying sequences and identifying similarity?
-Similarity indicates conserved function
-Human and mouse genes are more than 80% similar at sequence level
-But these genes are a small fraction of the genome
Most sequences in the genome are not recognizably similar
Comparing sequences helps us understand function
-Locate similar gene in another species to understand your new gene
protein domains
Definition: A distinct, independently folding unit of a protein chain.
Characteristics:
Can exist independently in structure and function.
Typically connected to other domains within the same protein.
Functions:
Examples include protein-protein interaction, nucleic acid binding, and catalytic activity.
Structure:
Tertiary structure: Includes the arrangement of units within domains and how domains fit together.
Quaternary structure: Refers to the association of separate polypeptide chains, not domains within the same chain.
what does a high sequence similarity mean?
-structural similarity and functional similarity
-probably the same fold
InterPro: protein family database
Overview: A large database containing ~22,000 protein families, represented by multiple sequence alignments.
Purpose: Helps identify functional regions (domains) and homologous proteins, offering insights into protein functions.
Key Concept: Proteins are made of functional domains, and combinations of these domains create the diverse range of proteins in nature.
Application: Useful for studying protein family relationships and functional annotations.
SCOP
structural classification of proteins.
-different classes within the database and different combinations of how they come together
-SCOP classifies proteins into a hierarchy using the following categories:
◦ Class: Proteins are grouped into classes based on their secondary structure content, including all alpha proteins, all beta proteins, alpha and beta proteins (a/b), and alpha and beta proteins (a+b).
◦ Fold: Describes the overall shape and arrangement of secondary structures.
◦ Superfamily: Proteins within a superfamily are thought to share a common evolutionary origin.
◦ Family: Proteins within the same family are closely related and share higher sequence similarity.
The SCOP database, along with other databases like CATH, helps in understanding the hierarchy of protein structure, which aids protein classification and the study of structure and function. The classifications are based on experimentally determined protein structures and are used to infer possible functions from the relationship between structure and function.
why does sequence variation occur?
-due to random mutations and natural selection
-On the organism level, mutations might lead to disease or death
A single mutation in a protein is usually harmless, unless it appears in the following:
An active site: Enzymes
A binding site: Receptors, antibodies, signaling proteins
A site promoting toxic aggregation: Sickle-cell anemia
sequence formats
FASTA is the most commonly used
-each record starts with a header line beginning with a greater-than symbol (>) that identifies and describes the sequence, followed by the protein sequence itself on the lines below
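A minimal sketch of reading FASTA-formatted text, to make the header-plus-sequence layout concrete. The function name parse_fasta and the two example records are made up for illustration; real files may need extra handling (comments, very large inputs).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, invented for this example.
example = """>seq1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another made-up record
MKTAYLAKQR
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

The sequence of seq1 is wrapped over two lines in the input and is joined into a single string by the parser.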
sequence alignment and pairwise identity
-To compare two (or more) sequences we need to align them (see below).
One way to quantify similarity between two aligned sequences is by their pairwise sequence identity (with a few caveats):
-Identical amino acids: Contribute to identity
-Similar amino acids: Contribute to similarity
-Gaps: Represent insertions or deletions; how they are counted changes the reported identity
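A small sketch of computing pairwise identity from two already-aligned sequences. It assumes gaps are written as "-" and uses one possible convention for the denominator (every column except those where both sequences have a gap); other tools divide by the shorter sequence length or by ungapped columns only, which is one of the caveats. The function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # a column of two gaps carries no information
        compared += 1
        if a == b:
            identical += 1  # a is not a gap here, since both-gap columns were skipped
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Made-up aligned pair: 5 identical columns out of 7 compared.
print(round(percent_identity("MKT-AYI", "MKTGAYL"), 1))
```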
substitution matrix
-pairwise alignment
-Definition: A scoring system to evaluate sequence alignments by quantifying the likelihood of character substitutions (e.g., amino acids or nucleotides).
Scores: Positive for likely substitutions; negative for rare ones.
Types:
PAM: Based on evolutionary changes in closely related sequences.
BLOSUM: Derived from conserved protein regions (e.g., BLOSUM62).
Uses: Sequence alignment, identifying homologs, and studying evolutionary relationships.
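A sketch of how a substitution matrix is applied when scoring an ungapped alignment. The scores in TOY_SCORES are invented for illustration and are NOT taken from PAM or BLOSUM; they only mimic the pattern of positive scores for likely substitutions and negative scores for rare ones. The function names are hypothetical.

```python
# Toy scores invented for this example; NOT real PAM or BLOSUM values.
TOY_SCORES = {
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("W", "W"): 10,
    ("L", "I"): 2,   # similar residues: mild positive score
    ("L", "W"): -2,  # unlikely substitution: negative score
    ("I", "W"): -3,
}


def pair_score(a, b, matrix):
    """Look up a symmetric score, whichever order the pair is stored in."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the substitution scores over each aligned pair of residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


print(score_ungapped("LIW", "IIW", TOY_SCORES))  # 2 + 4 + 10
print(score_ungapped("LIW", "WLI", TOY_SCORES))  # -2 + 2 + -3
```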
classes of pairwise alignment
global alignment = Tries to align the entire sequence
Aligns all letters from query to target
Suitable for closely related sequences of similar length
local alignment = Aligns regions with the highest similarity
Aligns a substring of the target with a substring of the query
Suitable for more divergent sequences, sequences of different lengths, and sequences that share a conserved region
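The difference between the two classes can be sketched with the standard dynamic-programming scores: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. This is a simplified sketch, assuming a flat match/mismatch score and a linear gap penalty rather than a substitution matrix; the parameter values and function names are assumptions chosen for illustration. It returns only the best score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: every letter of both sequences is aligned."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]  # score in the bottom-right cell


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere for free
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # floor at 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best  # best cell anywhere in the table


# Made-up sequences sharing only the region "ACGT".
print(global_score("GGGACGTCCC", "TTACGTAA"))
print(local_score("GGGACGTCCC", "TTACGTAA"))
```

The local score stays positive because only the shared region is counted, while the global score is dragged down by the unrelated flanking letters.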
sequence identity and homology
Homology vs Sequence Similarity:
Sequence similarity refers to the comparison of two sequences (e.g., DNA, RNA, or protein) to quantify how similar they are. This can be measured using:
Score or pairwise sequence identity (e.g., 76% identical in sequences).
Homology refers to the evolutionary relationship between sequences, i.e., whether two genes or proteins share a common ancestor. This is a hypothesis based on sequence comparison, not something that can be directly measured.
Incorrect statement: “Two sequences are 76% homologous.” This is not meaningful. Homology isn’t quantified as a percentage.
homologs
genes sharing a common origin
orthologs
genes originating from a single ancestral gene in the last common ancestor of the compared genomes (speciation is the key event)
-are homologs
paralogs
genes related via duplication (gene duplication is the key event)
-are homologs
What is the difference between tertiary and quaternary structure?
Tertiary structure describes how units within domains associate and how domains fit together. Quaternary structure describes how separate polypeptide chains associate with each other.
What is a substitution matrix used for?
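A substitution matrix is used in pairwise alignment to score how likely one amino acid is to be substituted for another. The scores come from substitution frequencies observed in related proteins, which tend to reflect biochemical and biophysical similarity. Examples include BLOSUM62 and PAM250.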
What are the two main types of pairwise alignment?
The two main types of pairwise alignment are global alignment, which aligns the entire sequence, and local alignment, which aligns regions of highest similarity.
What is the main purpose of global alignment and local alignment?
Global alignment is used to identify folds and sequence homology, while local alignment is used to identify motifs and gene duplication events.
What is sequence identity?
Sequence identity is the fraction of aligned positions that contain identical amino acids, usually given as a percentage. It is one of the main measures used to infer sequence homology and similarity in folds.
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a heuristic pairwise alignment tool that searches for local similarities between sequences, approximating the Smith-Waterman algorithm.
What is the most common input format for BLAST?
The most common input format for BLAST is the FASTA format.
What does the e-value in BLAST output represent?
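The e-value in BLAST output represents the number of alignments with a score at least as good as the observed one that would be expected by chance in a database of that size. The lower the e-value, the more significant the hit.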
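A rough sketch of how an e-value relates to a hit's bit score, using the commonly quoted approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function and parameter names are hypothetical, and real BLAST output will not match exactly because it uses corrected "effective" lengths.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate e-value: search space size times 2 to the minus bit score."""
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


# Illustrative numbers only: a higher bit score gives a smaller e-value,
# and a larger database gives a larger e-value for the same score.
print(expected_hits(30, 300, 1_000_000))
print(expected_hits(60, 300, 1_000_000))
print(expected_hits(60, 300, 100_000_000))
```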
What is Multiple Sequence Alignment (MSA) used for?
MSA is used to align multiple sequences to identify conserved regions, motifs, and patterns. MSA can also be used to study sequence conservation, gene duplication events, and amino acids that are important for binding or catalysis.
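A minimal sketch of finding conserved columns in an existing multiple sequence alignment. It assumes the sequences are already aligned to equal length with "-" for gaps, and it uses a deliberately simple measure: the fraction of sequences carrying the most common residue in each column. The function name and the three-sequence alignment are made up for illustration.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Per-column fraction of sequences that carry the most common residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):  # iterate over alignment columns
        residues = [r for r in column if r != gap]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))  # gaps lower the score
    return scores


# Hypothetical alignment of three short sequences.
msa = ["MKTAY", "MKSAY", "MRTA-"]
for position, score in enumerate(column_conservation(msa), start=1):
    print(position, round(score, 2))
```

Columns with a score of 1.0 are fully conserved and are candidates for residues that matter for binding or catalysis.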