Bioinformatics Flashcards
An interdisciplinary field that combines biology, computer science,
statistics, mathematics, and engineering to analyze
and interpret biological data, particularly
large datasets such as genomes or protein sequences
Bioinformatics
It is a widely-used format for
representing nucleotide or protein sequences.
FASTA
It consists of a header line starting with ‘>’, followed by the sequence data on subsequent lines.
FASTA
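A minimal Python sketch of reading the FASTA layout described above (a header line starting with '>', then sequence lines). The function name parse_fasta and the two example records are made up for illustration; real files may need extra handling.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record example
example = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2
GGGCCC
"""
records = parse_fasta(example)
```

Sequence lines that belong to the same header are joined, so a sequence wrapped over several lines comes back as one string.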
In sequence alignment, a ________ represents a position where one sequence has an insertion or
deletion relative to another sequence.
Gap
____________ are
introduced to optimize alignment and account for
evolutionary changes
Gap
It is the
sequence for which you are searching for similarities
or matches within a database
Query sequence
It’s the sequence you
submit as the input to a search
Query sequence
It is the
sequence(s) in a database against which the query
sequence is compared during sequence alignment or
similarity searches
Subject sequence
It is a branching
diagram that depicts the evolutionary relationships
among a set of organisms, genes, or species
Phylogenetic tree
It
shows the inferred evolutionary history and
relatedness based on genetic or sequence data
Phylogenetic tree
It is a
unique numerical identifier assigned to each
sequence entry in the NCBI (National Center for
Biotechnology Information) databases.
GI number
It provides a
stable and unique way to refer to a specific sequence
entry.
GI number
It is a
unique identifier assigned to a sequence record in a
public sequence database (like GenBank, EMBL, or
DDBJ)
Accession number
It typically consists of letters
and numbers and is used to reference a specific
sequence entry.
Accession number
Involves
identifying and labeling the features of a genome such as genes, regulatory sequences, and other
functional elements.
Genome annotation
This process helps in
understanding the biological significance of the DNA
sequence.
Genome annotation
In sequence alignment or similarity searches, it is a numerical value that quantifies the level
of similarity or quality of alignment between two
sequences.
Score
Higher scores generally indicate more
significant similarity. (T or F)
TRUE
It is a statistical
measure that estimates the number of different
alignments with scores equivalent to or better than a
given score that would occur by chance in a database
search.
Expect value (E-value)
A ___________ indicates a more significant
match or similarity.
lower E-value
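A rough Python sketch of how an E-value relates to a score, using the Karlin-Altschul form E = K * m * n * exp(-lambda * S). The default values of k and lam below are placeholders chosen only for illustration; in practice they depend on the scoring system, and the function name expect_value is hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.1, lam=0.3):
    """Estimate the number of chance alignments scoring at least raw_score.

    k and lam are illustrative placeholders, not values from any real
    scoring matrix.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Same search space, two different raw scores
low_score_e = expect_value(raw_score=40, query_len=100, db_len=1_000_000)
high_score_e = expect_value(raw_score=80, query_len=100, db_len=1_000_000)
```

Under these assumptions the higher score gives the smaller E-value, which is why a lower E-value indicates a more significant match.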
A field which uses computers to store and analyze
molecular biological information
BIOINFORMATICS
It is about finding and interpreting biological data
online
BIOINFORMATICS
It is a field in which biology, mathematics, statistics, computer
science, information technology, and other health sciences are
merged into a single discipline to process biological data
BIOINFORMATICS
It uses complex machines to read biological data at a much
faster rate than before.
BIOINFORMATICS
There is a marriage between biology and informatics. (T or F)
TRUE
The science of collecting and analyzing complex
biological data
BIOINFORMATICS
Allows the storage and management of large biological data sets
THE CREATION OF DATABASES
Data is being generated at a much greater pace than
its analysis (e.g. Human Genome Project)
THE CREATION OF DATABASES
These are repositories, like a bank of biological
information, designed to collect, archive, visualize, and
organize biological data.
Databases
They enable scientists to describe, interpret,
and retrieve data intelligently.
Databases
A great deal of data has been generated, especially since the
completion of the ________
Human Genome Project
When was the Human Genome Project launched?
1990
Objective of the Human Genome Project
To sequence
the entire human genome, which consists of about 3.2 billion
base pairs.
It was completed in 2003; as a result, there is a
large amount of data that has to be interpreted and analyzed.
Human Genome Project
Aside from the human genome, the genomes of many other organisms have been
completely sequenced, so there is again an enormous amount
of data to be understood, which is why databases have
been created. (T or F)
TRUE
PRINCIPAL COMPONENTS OF BIOINFORMATICS
*THE CREATION OF DATABASES
*THE DEVELOPMENT OF ALGORITHMS AND STATISTICS
*THE USE OF THESE TOOLS FOR THE ANALYSIS AND
INTERPRETATION OF VARIOUS TYPES OF
BIOLOGICAL DATA
Determine relationships among members of large
data sets
THE DEVELOPMENT OF ALGORITHMS AND
STATISTICS
The procedure by which a large data set is organized so that relationships can
be determined is called an
Algorithm
An algorithm is applied in ________
Statistics
The data analyzed include DNA, RNA, and protein sequences, protein
structures, gene expression profiles, and biochemical
pathways
THE USE OF THESE TOOLS FOR THE ANALYSIS AND
INTERPRETATION OF VARIOUS TYPES OF
BIOLOGICAL DATA
Sciences that attempt to describe a living organism
in terms of ‘omics’
BRANCHES OF BIOINFORMATICS
BRANCHES OF BIOINFORMATICS
Genomics
Transcriptomics
Proteomics
Microbiomics
Metabolomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
Involves the description of the sequences of
the entire genome of an organism
Genomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
The study of all RNA molecules in a
living organism
Transcriptomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
The description of the entire
complement of proteins in a living organism.
Proteomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
Studies the sequences, 3D structures, and other
properties of proteins.
Proteomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
It covers the entire set of proteins found in a living organism.
Proteomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
Pertains to microbes such as viruses, fungi,
parasites, and bacteria.
Microbiomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
The genomes of these
microorganisms are described within a specific environmental niche
Microbiomics
IDENTIFY THE BRANCH OF BIOINFORMATICS
Involves the description of the chemical
processes involving metabolites.
Metabolomics
DNA/RNA BIOINFORMATICS APPLICATIONS
● Retrieving DNA sequences from databases
● Computing nucleotide compositions
● Identifying restriction sites
● Designing polymerase chain-reaction (PCR) primers
● Identifying open reading frames (ORFs).
● Predicting elements of DNA/RNA secondary structure
● Finding repeats
● Computing the optimal alignment between two or
more DNA sequences
● Finding polymorphic sites in genes (single nucleotide
polymorphisms, SNPs)
● Assembling sequence fragments
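Two of the applications listed above, computing nucleotide composition and identifying restriction sites, can be sketched in a few lines of Python. The function names and the example sequence are made up for illustration; GAATTC is the EcoRI recognition site, used here as the default.

```python
from collections import Counter


def nucleotide_composition(seq):
    """Count A, C, G and T in a DNA sequence."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq, site="GAATTC"):
    """Return 0-based start positions of a recognition site in seq."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # allow overlapping hits
    return positions


# Hypothetical 20-base example containing two EcoRI sites
dna = "ATGAATTCGGCCGAATTCTA"
composition = nucleotide_composition(dna)
sites = find_sites(dna)
```

This only scans the strand as given; a fuller tool would also handle ambiguous bases and a library of enzymes.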
Identifying open reading frames (ORFs) - An open reading frame is a sequence
that runs from a
start codon to a stop codon
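A simple Python sketch of ORF finding as defined in this card: scan each of the three forward reading frames for ATG and extend to the first in-frame stop codon. It ignores the reverse strand, and the names find_orfs and min_codons are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) tuples on the forward strand.

    start is 0-based, end is exclusive, and the ORF includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Look for the first stop codon in the same frame
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j + 3 - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, seq[i:j + 3]))
                        i = j  # continue scanning after this stop codon
                        break
            i += 3
    return orfs


# Hypothetical example: one ORF (ATG GCT TAA) starting at position 2
orfs = find_orfs("CCATGGCTTAAGG")
```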
WHY DO BIOINFORMATICS?
● It serves to save time when doing real experiments
(e.g., designing primers).
● You might want to do a simulated experiment on a
computer (‘in silico’) instead of in a real environment.
Bioinformatics is very convenient for scientists because it
serves to
Save time when they want to do a real
experiment, since the experiment or research study may start with
a simulation on a computer.
Simulated experiments done on a computer rather than in a
real environment are described as “in silico”. For
example, when you do PCR and want to amplify a
particular DNA fragment, you design primers using
bioinformatics tools or software. (T or F)
TRUE
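As an illustration of the in silico primer step mentioned in this card, here is a small Python sketch that takes the reverse complement of a sequence and estimates a primer melting temperature with the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius. This is only a rule of thumb for short oligos; primer design software uses more refined models. The function names and the 18-base primer are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical 18-base primer, not taken from any real gene
primer = "ATGCATGCATGCATGCAT"
tm = wallace_tm(primer)
```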
Once you have designed a
primer, you can do the actual laboratory experiment, which we
call the ____________
Wet lab
This is where the primer is optimized and
eventually used in the amplification reaction.
Wet lab
APPLICATIONS OF BIOINFORMATICS
● Sequence alignment and analysis
● Mapping and analyzing DNA, RNA, Protein, Amino
Acid, and Lipid sequences
● Creation and visualization of 3-D structure models for
biological molecules of significance, e.g., proteins
● Genome annotation
● Genetic diseases
● Designer Medicine
APPLICATIONS IN VARIOUS FIELDS
● Microbial genome applications
● Molecular medicine
● Personalized medicine
● Gene therapy
● Drug development
● Antibiotic resistance
● Evolutionary studies
● Waste cleanup
● Biotechnology
● Climate change studies
● Alternative energy sources
● Crop improvement
● Forensic analysis
● Bio-weapon creation
● Insect resistance
● Improve nutritional quality
● Veterinary science
The earliest databases
for DNA sequences and proteins were developed by three
groups of scientists from different parts of the world:
● Nucleic Acids (International Nucleotide Sequence
Database)
● Protein (Worldwide Protein Data Bank)
IDENTIFY THE DATABASE
DDBJ (DNA Data Bank of Japan)
Nucleic Acids (International Nucleotide Sequence
Database)
IDENTIFY THE DATABASE
EMBL (European Molecular Biology Lab)
Nucleic Acids (International Nucleotide Sequence
Database)
IDENTIFY THE DATABASE
GenBank (USA)
Nucleic Acids (International Nucleotide Sequence Database)
IDENTIFY THE DATABASE
PDBj (Japan)
Protein (Worldwide Protein Data Bank)
IDENTIFY THE DATABASE
RCSB PDB (USA)
Protein (Worldwide Protein Data Bank)
DNA Data Bank of Japan
DDBJ
Other databases
● Ensembl
● Human Metabolome Database (HMDB)
● Gene Expression Databases - Mostly Microarray data
● Phenotypic Databases
● RNA Databases
● Amino Acid/Protein Databases
● Protein-Protein and other Molecular interactions
● Signal Transduction Pathway Databases
● Metabolic Pathway and Protein Function Databases
● Bacterial DNA Databases
Database that provides data on the genome of
characteristic organisms
Ensembl
Very useful particularly if you want to determine the
boundary of exons and introns in a eukaryotic gene.
Ensembl
GENETIC ANALYSIS APPLICATION
● A disease may arise due to changes in the sequence of
the gene being expressed
● Single Nucleotide Mutation: Sickle Cell Anemia
A consequence of a change that has
occurred in the hemoglobin gene, specifically the one for the beta
portion of hemoglobin.
Sickle cell anemia
A mutation occurred in some individuals such that A is substituted by U, so that the codon became GUG, which codes for Vaseline. (T or F)
FALSE (valine, NOT Vaseline)
In sickle cell anemia, a point
mutation occurred in the codon GAG, which codes for
Glutamic acid
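The sickle cell point mutation from these cards can be written out as a tiny Python sketch. The codon table holds only the two codons needed here, and the function name point_mutate is made up for illustration.

```python
# Only the two mRNA codons discussed in the cards above
CODON_TABLE = {"GAG": "Glu", "GUG": "Val"}


def point_mutate(codon, position, new_base):
    """Return a copy of codon with the base at 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"                               # glutamic acid
mutant_codon = point_mutate(normal_codon, 1, "U")  # A -> U at the middle base

normal_aa = CODON_TABLE[normal_codon]
mutant_aa = CODON_TABLE[mutant_codon]
```

A single-base change in the middle of the codon is enough to swap glutamic acid for valine.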
Genetic characteristic
Genotype
Physical characteristic
Phenotype
Recessive trait
Sickle-Cell Anemia
REVIEW FINDING THE DNA SEQUENCE OF A GENE, OWKI??
OWKI
A way of rearranging sequences of DNA, RNA or
protein to identify regions of similarity
SEQUENCE ALIGNMENT
Sequence alignment is made between
a known sequence (reference sequence)
and unknown sequence (query sequence)
Reference sequence
Known sequence
Query sequence
Unknown sequence
TYPES OF SEQUENCE ALIGNMENT
Pairwise
Multiple
Compare two sequences
Pairwise
Compare more than two sequences
Multiple
Pairwise
○ EMBOSS WATER
○ BLAST
Multiple
○ MUSCLE
○ MAFFT
○ CLUSTAL Omega
TYPES OF PAIRWISE SEQUENCE ALIGNMENT
Global alignment
Local alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Matching the residues (bases or
amino acids) of two sequences across their entire length.
Global alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Matches identical sequences
Global alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
The two sequences are treated as potentially
equivalent
Global alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Comparing two genes with the
same function (in human vs.
mouse)
Global alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Comparing two proteins with similar
functions
Global alignment
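Global alignment across the entire length of two sequences can be sketched with Needleman-Wunsch-style dynamic programming. This is a minimal illustration, not the method of any particular tool; the scoring values (match=1, mismatch=-1, gap=-1) and the example sequences are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # (mis)match
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical example: the second sequence is missing one T
global_score, aligned_a, aligned_b = needleman_wunsch("GATTACA", "GATACA")
```

Every residue of both sequences appears in the output, with a gap character where one sequence has no counterpart.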
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Matching the regions of two sequences
that are most similar to each other
Local alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
The two sequences may or may not be
related
Local alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Used to see whether a substring (a part)
in one sequence aligns well with a substring
(a part) in the other sequence
Local alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Searching for local similarities in
large sequences (e.g., newly
sequenced genomes)
Local alignment
IDENTIFY THE TYPE OF PAIRWISE SEQUENCE ALIGNMENT
Looking for conserved domains or
motifs in two proteins
Local alignment
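Local alignment, finding the best-matching substrings of two sequences, can be sketched with Smith-Waterman-style dynamic programming. The difference from the global case is that cell scores never drop below zero and the traceback starts at the highest-scoring cell. The scoring values and example sequences are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,                                              # start fresh
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # (mis)match
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical example: two otherwise unrelated sequences sharing "ACGT"
local_result = smith_waterman("TTTTACGTGGGG", "CCCACGTAAA")
```

Only the shared region is reported; the unrelated flanking residues are left out of the alignment.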
The residues are colored so that you can
easily see whether there is any variation among
the sequences.
Clustal Omega
When you have a multiple sequence
alignment, you can tell that all of the sequences
are identical at a position by the presence of an __________
Asterisk
If there is a variation, there is no asterisk. (T or F)
TRUE
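The asterisk rule from these cards can be sketched in Python: given sequences that are already aligned to the same length, mark a column with '*' only when every sequence has the same residue there. The function name conservation_line and the three example rows are hypothetical, and other marks that alignment tools may print are not modeled.

```python
def conservation_line(aligned):
    """Return a string with '*' under columns where all residues are identical."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        # A column of gaps only is not counted as conserved
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


# Hypothetical aligned sequences; column 4 varies, so it gets no asterisk
aligned = ["ATGCA", "ATGGA", "ATG-A"]
line = conservation_line(aligned)
```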
MULTIPLE ALIGNMENT TOOLS: Analysis of more than 2 sequences
MUSCLE
MAFFT
Clustal Omega
MUSCLE
Multiple Sequence Comparison by Log
Expectation
MAFFT
Multiple Alignment using Fast Fourier
Transform
It is a multiple sequence alignment tool that
arranges the sequences of DNA, RNA or protein to
identify regions of similarity
MUSCLE (Multiple Sequence Comparison by Log Expectation)
Finds regions of local similarity between sequences just like MUSCLE and MAFFT
NCBI: Basic Local Alignment Search Tool (BLAST)
Compares the amino acid sequences of proteins or the nucleotides of DNA sequences.
NCBI: Basic Local Alignment Search Tool (BLAST)
Compare a query sequence with a library or database
of sequences, and identify library sequences that
resemble the query sequence above a certain
threshold
NCBI: Basic Local Alignment Search Tool (BLAST)
Can be used to infer functional and evolutionary
relationships between sequences as well as help
identify members of gene families
NCBI: Basic Local Alignment Search Tool (BLAST)
Read additional notes about NCBI: Basic Local Alignment Search Tool (BLAST), owki??
OWKIII
Used to infer functional and evolutionary
relationships between sequences as well as help identify members of gene families
BLAST
You supply multiple sequences to be aligned to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences
MULTIPLE ALIGNMENT
Here you supply all the
sequences to the tools that we used, like MUSCLE.
MULTIPLE ALIGNMENT
It aligns the sequences that you
upload and does not necessarily look for
sequences in a database
MULTIPLE ALIGNMENT
Read and analyze the difference between multiple sequence alignment and BLAST, and the summary. OWKI??
OWKIII