Exam 1 Flashcards
4 areas of bioinformatics
- annotation
- comparative
- functional
- structural
Annotation
explanation
* ORF (open reading frames)
* functional sites
* structure, function
promoter
where transcription factors and RNA polymerase bind
What do ORF and CDS stand for?
ORF- open reading frame
CDS- coding sequence
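A minimal sketch of what "open reading frame" means in practice: scan the three forward frames of a DNA string for an ATG followed in-frame by a stop codon. The function name `find_orfs` and the `min_codons` cutoff are assumptions made for illustration. Real annotation tools also scan the reverse strand and use better gene models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the given strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```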
p53
- guardian of genome
- Most cancers have p53 mutations
- Informs DNA repair system → apoptosis system
Comparative genomics
genomic features of different organisms are compared
Comparing ORFs → Identifying orthologs → Inferences on structure and function
Comparing functional sites → Inferences on regulatory networks
Where are ultraconserved elements of the human genome located?
overlapping exons in genes involved in RNA processing
or in introns
or nearby genes involved in the regulation of transcription and development
Functional genomics
To describe gene functions and interactions
Genome-wide profiling of: mRNA levels, Protein levels → Co-expression of genes and/or proteins
Identifying protein-protein interactions → Networks of interaction
Multiomic profile
Genome- DNA
Transcriptome- RNA transcripts (transcription)
Proteome- protein
Metabolome- metabolites (metabolomic profiles differ between people, which is why some drugs work on some people and not on others)
Structural genomics
Assign structure to all proteins encoded in a genome
genomic DNA sequences –> protein-coding genes –> obtain protein by expression OR in silico –> protein structure –> biochemical and cellular role
Protein docking site
Giemsa genetics
Staining for identifying chromosomes
Black region: A-T rich → gene poor → transcriptionally inactive
Light region: GC rich → transcriptionally active
cytogenetic banding nomenclature
Chromosome Arm Region Band . Subband (e.g., 7q31.2: chromosome 7, q arm, region 3, band 1, subband 2)
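A small sketch of how a band label splits into its parts. The regular expression and the function name `parse_band` are assumptions for illustration, and the pattern only covers simple labels such as 7q31.2 or Xq28, not the full nomenclature.

```python
import re

# chromosome (1-22, X, Y), arm (p or q), region digit, band digit, optional subband
BAND_PATTERN = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")

def parse_band(label):
    """Split a cytogenetic band label into its named components."""
    match = BAND_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a simple band label: {label!r}")
    chromosome, arm, region, band, subband = match.groups()
    return {
        "chromosome": chromosome,
        "arm": arm,          # p = short arm, q = long arm
        "region": region,
        "band": band,
        "subband": subband,  # None when the label has no ".subband" part
    }

print(parse_band("7q31.2"))
```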
Why can’t humans digest cellulose?
Because humans don’t produce cellulase
chromosome vs chromatin
chromosome- condensed structure that carries the genetic material (DNA)
chromatin- chromosomal material in its decondensed, threadlike state
what do homologous pairs determine?
the same traits
how do peptide bonds work?
join 2 amino acids and release water (condensation reaction)
Examples of Aneuploidy
Edwards syndrome- extra chromosome 18 (trisomy 18)
Properties of the Double Helix
- Hydrogen bonds between the bases and base stacking contribute to helical stability
- Diameter is 20 Å because A/T and G/C base pairs have identical widths (10.85 Å)
- Distance between base pairs is 3.4 Å
- Spaces between the turns of the helix forms major and minor grooves—important sites for DNA/protein interactions
DNA Replication
- Helicase helps form replication fork
- Primase starts by synthesizing a short RNA primer
- DNA polymerase binds to primer and starts replication 5’ to 3’
- Lagging strand: RNA primers are replaced with DNA; ligase joins Okazaki fragments
mRNAs
rRNAs
miRNAs
tRNAs
other small RNAs
mRNAs- code for proteins
rRNAs- form core of ribosome and catalyze protein synthesis
miRNAs- gene expression regulation
tRNAs- adaptors between mRNA (codons) and aa during translation
other small- RNA splicing
What controls where RNA polymerase initiates and terminates transcription?
promoter- DNA sequence that is recognized by RNA polymerase as a starting point
chain elongation- occurs until RNA polymerase reaches a terminator site, at which RNA is released and RNA pol dissociates from DNA
Where does transcription and translation happen?
transcription- nucleus
translation- cytosol
transcription provides ……. of genetic info
amplification
how is transcription in eukaryotes different from prokaryotes?
prokaryotes need sigma factors
RNA polymerase prokaryotes vs eukaryotes
prokaryotes- have a single type
eukaryotes- RNA pol 1 (most rRNA genes)
RNA pol 2 (protein-encoding genes –> makes mRNA)
RNA pol 3: tRNA, 5S rRNA, small structural RNA genes
initiation prokaryotes vs eukaryotes
pro: RNA pol can initiate without helper protein
euk: require general transcription factors
transcription processing prokaryotes vs eukaryotes
pro- transcripts are generally NOT processed
euk- mRNAs are processed
what functions does TF 2H have?
- helicase-like function opens up DNA
- promoter escape: phosphorylates C-terminal end of RNA polymerase –> RNA pol activates and leaves promoter
eukaryote transcription initiation
- RNA polymerase recognizes and binds to promoter
- TF 2D comes with TBP and TAF
- TF 2A/ 2B: help recruit RNA polymerase to promoter region
- TF 2F (escort protein) binds, which recognizes RNA polymerase, and RNA polymerase binds to promoter and stabilizes the structure
- TF 2E/ 2H
TBP vs TAF
TBP: TATA box binding protein
TAF: TBP-associated factor
eukaryote translation initiation and termination
- Small ribosome subunit scans newly synthesized RNA transcript looking for translation initiator site (AUG)
- Once found, large ribosomal subunit binds
- First amino acid (on the initiator tRNA) binds at the P site
- Each new aminoacyl-tRNA binds to the A site. Its amino acid is joined by a peptide bond to the existing chain, and the tRNA then moves to the P site
- tRNA released through E site
- Reaches stop codon (UAA, UAG, or UGA)
- Binding of release factor to A site
- Termination
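The scan-for-AUG, elongate, stop-at-stop-codon sequence above can be sketched as follows. This is a simplified model under stated assumptions: it uses the standard genetic code, starts at the first AUG it finds, ignores ribosome sites and release factors, and expects only A, C, G, and U in the input. The name `translate` is chosen here for illustration.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position ("*" = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG up to (not including) the first in-frame stop."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no initiator codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: UAA, UAG, or UGA
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGGCUUUUUAGCC"))  # MAF
```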
prokaryote translation initiation
Prokaryotic mRNAs lack 5’ caps, so instead, ribosomes search for specific ribosome-binding sequences (Shine-Dalgarno sequence), and there can be multiple binding sites within an mRNA (polycistronic: ribosomes can initiate at any of them)
prokaryote transcription initiation
- RNA polymerase can’t recognize promoter sequence unless sigma factor is attached to it
- RNA polymerase (apoenzyme: not active) + sigma factor = holoenzyme (active). All attached together, closed structure. Needs to be opened up
- Sigma factor leaves RNA polymerase → open structure → RNA polymerase can start transcription
how is RNA polymerase released in prokaryotes?
Rho independent
Have palindromic sequence to form hairpin-like structure
GC (3 bonds) very stable and strong vs weak AU (2 bonds)
Change of forces help to dissociate the newly formed RNA from DNA-RNA hybrid
Rho protein dependent
Stretch of CCCCCC (Rut - Rho utilization site) recognized by rho protein
Rho (a helicase) unwinds the DNA-RNA hybrid, releasing the RNA and RNA polymerase
how is RNA polymerase released in eukaryotes?
- CPSF binds to AAUAAA
- CstF binds G/U-rich sequences
- Cleavage factor cuts the RNA to release the transcript
posttranscriptional modification purpose
help the mRNA pass through the nuclear pore to the cytoplasm
posttranscriptional modifications
7-methylguanosine cap: added at the 5’ end; helps the mRNA navigate through the nuclear pore to the cytoplasm; protects it from exonucleases
3’ poly-A tail: added by poly-A polymerase at the 3’ end (where the transcript was cleaved)
Splicing: introns are removed and exons are joined; the spliced mRNA goes to the cytoplasm for translation
what do anticodons carry?
amino acid (the tRNA bearing the anticodon carries the matching amino acid)
what enzyme is required for transcription of mRNA in eukaryotes?
RNA pol 2
Post translational modification:
addition of functional groups
genomics
proteomics
metabolomics
Genomics -Global analysis of genome structure and function, including gene expression
Proteomics -Global analysis of the proteome, including protein expression and modification
Metabolomics- Global analysis of metabolic processes
RNA probably came first theory
Discovery of catalytic RNA (ribozymes) by Tom Cech and Sidney Altman put an entirely different spin on this question.
RNA can both store genetic info and carry out catalytic functions (ribozymes)- mRNA is a useful starting material because it is already spliced; reverse transcriptase copies it into cDNA (DNA)
exons
- Have nucleotides that are translated into amino acids of proteins
- Separated from one another by introns.
- segments of mRNA spliced back together after the introns are removed
introns
- Intervening segments of DNA
- Are not translated into protein. They are removed when eukaryotic mRNA is processed.
intron-free mRNA is used as a ………….. to make proteins
a template
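A toy sketch of splicing as string surgery: cut out the intron intervals and join what is left. The coordinates are supplied by hand here, which is an assumption made for illustration. In the cell, the spliceosome finds intron boundaries from sequence signals.

```python
def splice(pre_mrna, introns):
    """Remove introns given as 0-based, half-open (start, end) intervals."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)

# The intron GUAAG occupies positions 3-7 of this made-up transcript.
print(splice("AUGGUAAGCCCUAA", [(3, 8)]))  # AUGCCCUAA
```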
how do you know when a gene is expressed?
when a functional product is made from the gene
Microarray
gene-expression analysis
if want to see expression pattern of 3 different genes:
1. collect RNA from tumor and control cells
2. reverse transcription –> cDNA
3. tagged with different colors and mixed together
4. genes (probes) on the plate can only fluoresce if fluorescing cDNA binds to them (a spot can show 1, 2, or no colors); probes indicate A, B, or C
what does it mean if a microarray probe is yellow, when the 2 colors were red and green?
equal amounts of red and green (A and B)
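A rough sketch of how a two-color spot could be called from its red and green intensities. The function name `spot_color` and the 2-fold cutoff are assumptions for illustration, not values from the course. Real analyses normalize the channels and work with log ratios.

```python
def spot_color(red, green, fold=2.0):
    """Classify a spot from the two channel intensities (non-negative numbers)."""
    if red == 0 and green == 0:
        return "none"    # no labeled cDNA bound to this probe
    if green == 0 or red / green >= fold:
        return "red"     # mostly the red-labeled sample
    if red == 0 or green / red >= fold:
        return "green"   # mostly the green-labeled sample
    return "yellow"      # roughly equal amounts of both

print(spot_color(100, 100))  # yellow
```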
2 types of databases
generalized (DNA, proteins and carbs, 3D structures)
specialized (genomes, RNA, protein families, pathways)
NCBI
- National Center for Biotechnology Information
- primary sequence database
- understanding of molecular processing affecting human health and disease
database search
- sequence-based (BLAST, FASTA)
- motif-based
- structure-based
- mass-based protein search
what does BLAST stand for? and what does it do?
- basic local alignment search tool
- produce local alignments: short significant stretches of similarity, irrespective of where they are in the sequence; more useful in searching whole genome
why is BLAST popular?
- Good balance of sensitivity and speed
- Reliable
- Flexible
homology is a ……. relationship
binary- they are either homologous or not
homologous proteins
- have: similar sequence, 3D structures and functions, and share a common ancestral sequence
- Proteins with >30% sequence identity
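A short sketch of how percent identity can be computed from two already-aligned sequences. The denominator used here, the number of alignment columns, is an assumption. Other conventions divide by the shorter sequence or by the ungapped columns, and the resulting percentages differ.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)

# 4 identical columns out of 6
print(percent_identity("ACDE-G", "ACQEFG"))
```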
ortholog
Same protein, different species
Different species, same protein function and structure
Human and chimpanzee histone h1.1= orthologs (same in different species)
retain similar functions during evolution
paralogs
- Same species, result of gene duplication
- same species, different function
Why is it preferable to do sequence comparisons on protein sequences as opposed to DNA?
- Protein sequences are more informative than DNA (20 aa vs 4 nt)
- DNA contains protein coding sequences, untranslated sequences and regulatory sequences and additional sequences with mostly unknown but important function
Global alignments
Trying to see similarity for the entirety
-Attempt to align every residue in every sequence
-Most useful when the sequences in the query set are similar and are of roughly equal length.
-Needleman-Wunsch algorithm
-Best score from among alignments of full-length sequences
Can only use if length is similar. So if there is huge mismatch, cannot use global
Intersequence- how much similarity it has
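A minimal sketch of the Needleman-Wunsch scoring recurrence, returning only the best global score. The +1 match, -1 mismatch, and -1 per-gap-position values are assumptions for illustration. Protein alignments would normally use a substitution matrix such as BLOSUM62 and a traceback to recover the alignment itself.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(
                diagonal,               # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,  # gap in b
                score[i][j - 1] + gap,  # gap in a
            )
    return score[-1][-1]  # every residue of both sequences is accounted for

print(needleman_wunsch("GATTACA", "GATACA"))  # 5 (6 matches, 1 gap)
```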
local alignments
Identify the optimal “best local” alignments and then extend those alignments until search parameters fall below a threshold, T
-More useful for dissimilar sequences and possibly dissimilar sequence lengths
Smith–Waterman algorithm
-Best score from among the partial sequences
Heuristic methods
A speculative explanation serving as a guide in the solution of a problem
FASTA: First fast sequence searching algorithm for comparing a query sequence against a database
BLAST: Improvement of FASTA: Search speed, ease of use, statistical rigor
Smith-Waterman algorithm
Guaranteed to find the optimal local alignment (with respect to the scoring system being used).
Costly: to align two sequences of lengths m and n, O(mn) time and space are required.
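A minimal sketch of the Smith-Waterman recurrence, returning only the best local score. The scoring values (+2 match, -1 mismatch, -1 gap) are assumptions for illustration. The full table is what gives the O(mn) time and space cost, and flooring each cell at 0 is what makes the alignment local.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # (m+1) x (n+1) table
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(
                0,                      # start a fresh local alignment here
                diagonal,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            best = max(best, score[i][j])
    return best

# The shared stretch ACGT scores 4 matches x 2 = 8.
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # 8
```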
Word (K-tuple method)
Heuristic methods that are much more efficient than SW
useful in large-scale database searches where a large proportion of stored sequences will have essentially no significant match with the query sequence
applications/functions of blast
- Identify orthologs/ paralogs
- Identify protein/genes in a genome
- Identify new genes
- Structure/function analysis
- Primer design (Primer-BLAST)
what approach does BLAST use?
- heuristic search that approximates the S-W algorithm, identifies HSPs
- pairwise analysis of a query sequence against an entire database
how does BLAST work?
- removes low-complexity region or sequence repeats in query sequence
- makes a list of word pairs (K-letter word) for each query sequence
- list possible matching words
- identification of exact word match method (scans database of word pairs that meet T score or greater)
- maximum segment pair alignment method (word hits are extended in either direction to generate alignment with score exceeding S)
- list all HSPs in database whose score is high enough to be considered
- determine E value: significance of HSP score
- report every match whose expect score is lower than threshold parameter E
scores are calculated from scoring matrices like BLOSUM62
BLAST word search method
number of words of length W in a query of length L = L-W+1
W=11 for nucleotides
W=3 for proteins
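A quick sketch of the word-list step: slide a window of width W along the query and collect every word, which yields L-W+1 words. The function name `query_words` and the example peptide are made up for illustration.

```python
def query_words(sequence, w):
    """All overlapping words of length w, in order; empty if the query is shorter than w."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]

words = query_words("PQGEFG", 3)
print(words)       # ['PQG', 'QGE', 'GEF', 'EFG']
print(len(words))  # 6 - 3 + 1 = 4
```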
BLAST parameters for evaluating results
score S: calculated from substitution matrix
E-value: expected HSPs of at least S
Alignment: visual assessment of a BLAST result
E value
- expected number of distinct alignments (HSPs), with a score of at least S
- the lower the E value, the more significant/ reliable the score is
- anything less than 10^-4 is generally considered significant
- decreases exponentially as the score S of the match increases
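A small sketch of the exponential relationship between score and E value, using the bit-score form E = m × n × 2^(-S'), where m is the query length, n is the database length, and S' is the bit score. Working with the bit score rather than the raw score S is an assumption made here so that no matrix-specific constants are needed. The lengths in the example are made up.

```python
def expect_value(query_length, database_length, bit_score):
    """E = m * n * 2**(-S'), the expected number of chance hits scoring at least S'."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Each extra bit of score halves the expected number of chance hits.
for bits in (30, 31, 40):
    print(bits, expect_value(100, 1_000_000, bits))
```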
sequence alignments
- verify locations of sequence identities, similarities, and mismatches
- identify gap locations and distributions
- illustrate extent of sequence similarity between query sequence and a “hit”
protein scoring matrices
- estimate of frequency of occurrence and the likelihood of some aas to substitute for other aas
- BLAST default matrix is BLOSUM62
- the higher the score the better the alignment
if gap opening and gap extension penalties differ, what gap penalty scheme should be used?
affine
types of gap penalties
constant: +1 each match, -1 for a whole gap regardless of its length
linear: +1 each match, -1 per gapped position
affine: gap open and extension penalty different
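A sketch comparing the three schemes for a single gap of a given length. The penalty values and the affine convention used here (opening penalty for the first position, extension penalty for each further position) are assumptions for illustration. Some tools instead charge the extension penalty on every position, including the first.

```python
def gap_penalty(length, scheme, gap=-1, gap_open=-5, gap_extend=-1):
    """Penalty for one gap spanning `length` positions under the chosen scheme."""
    if length == 0:
        return 0
    if scheme == "constant":
        return gap                                  # same cost for any length
    if scheme == "linear":
        return gap * length                         # cost grows with every position
    if scheme == "affine":
        return gap_open + gap_extend * (length - 1) # costly to open, cheap to extend
    raise ValueError(f"unknown scheme: {scheme!r}")

for scheme in ("constant", "linear", "affine"):
    print(scheme, gap_penalty(4, scheme))
```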
3 lines for each BLAST alignment
top: query
bottom: hit or subject sequence
line between: letter for identical aas; + for similar aas; blank for mismatches (gaps appear as - in the query or subject line)
types of blast alignments
Needleman-Wunsch (global)
Smith-Waterman; Word (K-tuple) method (local)
what metric is used for MSA?
BLOSUM62 for protein sequence
Specific uses of MSA applications
* Phylogenetic analysis
* Domain identification
* Structure/Function Analysis
* Regulatory element identification
Three guiding principles for MSAs
- All MSA apps are approximate
- To be effective, MSAs are iterative
- Use multiple MSA applications to find robust and stable alignments
Three approaches to stopping corruption in PSI BLAST
- Apply filtering of biased composition regions
- Adjust E value from 0.001 (default) to a lower value such as E = 0.0001.
- Visually inspect the output from each iteration. Remove suspicious hits by unchecking the box.