Intro To Bioinformatics Flashcards

1
Q

What is bioinformatics?

A

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1970 for the study of informatic processes in biotic systems

  • Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline (NCBI, 2009)
  • In bioinformatics, computer databases are used to store, retrieve and assist in understanding biological information
2
Q

Give examples of biological data in bioinformatics

A
  • DNA (genome): sequence, pathway
  • RNA (transcriptome): structure, interaction
  • Protein (proteome): evolution, mutations
3
Q

What does bioinformatics analysis of DNA involve?

A
  • simple sequence analysis
  • Gene finding
  • regulatory regions
  • whole genome annotations
  • comparative genomics (analysis between species and strains)
4
Q

What does bioinformatics analysis of RNA involve?

A
  • Splice variants
  • Tissue-specific expression
  • structure
  • single gene analysis (various cloning techniques)
  • Experimental data involving thousands of genes simultaneously
  • DNA chips, micro-arrays and expression array analysis
5
Q

What does bioinformatics analysis of proteins involve?

A
  • homology
  • conserved domains/regions
  • structure determination (molecular modeling): 2D, 3D and quaternary structure
  • protein function
  • Analysis often involves 2D gels and mass spectrometers
6
Q

What are the major Nucleotide Sequence databases?

A
  1. GenBank: National Center for Biotechnology Information (NCBI)
    • GenBank is the NIH genetic sequence database. It is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI
  2. EMBL: European Molecular Biology Laboratory
    • The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is the European equivalent of the U.S.'s GenBank
  3. DDBJ: DNA Data Bank of Japan
    • The DNA Data Bank of Japan (DDBJ), based at Japan's National Institute of Genetics, is the third of the trio of major nucleotide sequence databases
7
Q

What are the major protein sequence databases?

A
  1. UniProt: Universal Protein Resource
  2. PIR: Protein Information Resource databases
  3. Swiss-Prot
  4. ExPASy
8
Q

Describe UniProt, the Universal Protein Resource

A

UniProt is a single database that combines the information of the major international protein databases maintained by the European Bioinformatics Institute (EBI), Cambridge, UK; the Protein Information Resource (PIR) at Georgetown University Medical Center (GUMC) and the National Biomedical Research Foundation (NBRF), Washington, D.C.; and the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

9
Q

Describe the PIR: Protein Information Resource Databases

A

PIR grew out of the Atlas of Protein Sequence and Structure (1965-1978), which was edited by Margaret Dayhoff

10
Q

Describe Swiss-Prot

A

Swiss-Prot is the major European protein sequence database, from the Swiss Institute of Bioinformatics

11
Q

Describe ExPASy: Expert Protein Analysis System

A

ExPASy is the Swiss Institute of Bioinformatics (SIB) Resource Portal, which provides access to scientific databases and software tools in different areas of the life sciences, including proteomics, genomics, phylogeny, systems biology, population genetics and transcriptomics

12
Q

What is a query sequence?

A

A sequence, either amino acid or nucleotide, chosen by the user for a BLAST search

  • A query sequence can be typed or pasted into the query window on the search form
  • BLAST searches require a minimum query sequence length of 15 nucleotides or amino acids
  • A query can be given in FASTA format, as a bare sequence, or as an identifier (accession number or GenInfo Identifier, gi)
13
Q

What is an alignment?

A

A presentation of two compared sequences showing the regions of greatest statistical similarity

14
Q

What is the score value?

A

The score value is a measure of the quality of the alignment between the query sequence and the search results

  • The higher the score, the better the alignment
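
As a rough illustration of how a score rewards matching positions and penalises mismatches and gaps, the sketch below scores two sequences that are already aligned. The function name and the scoring values (+1 match, -1 mismatch, -2 gap) are assumptions chosen for illustration; BLAST itself uses substitution matrices and more elaborate gap costs.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two pre-aligned sequences of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # a gap in either sequence is penalised
        elif x == y:
            score += match      # identical residues raise the score
        else:
            score += mismatch   # differing residues lower it
    return score


print(alignment_score("GATTACA", "GATTACA"))  # 7
print(alignment_score("GATTACA", "GAT-ACT"))  # 2
```

The perfect match scores higher than the alignment with a gap and a mismatch, which is the sense in which a higher score means a better alignment.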

15
Q

What is the E-value?

A

The E-value refers to the expectation value

  • The number of different alignments with scores equal to or better than the observed alignment score that are expected to occur in a database search by chance
  • The lower the E value, the better the match
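
One common way to make the E-value concrete is the relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below assumes that relation and ignores the length corrections that BLAST applies in practice; the function and parameter names are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves E; a larger search space raises it.
for s in (20, 30, 40):
    print(s, expect_value(s, query_len=100, db_len=1_000_000))
```

Under this assumption, a hit with a higher bit score in the same search always has a lower E-value, which is why a low E-value indicates a better match.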
16
Q

What is genome annotation?

A
  • Obtaining biological information from unprocessed sequence data
  • The ultimate goal is to create a labeled genome, where biological information is linked to sequence
  • There are two types: structural and functional annotation
17
Q

What is structural genome annotation?

A

The identification of genomic elements (genes and other important sequences)

  • ORFs and their localization
  • gene structure
  • coding regions
  • location of regulatory motifs
18
Q

What is functional genome annotation?

A

Consists of attaching biological function to genomic elements

  • biochemical function
  • biological function
  • regulation and interactions involved
  • expression

The basic level of annotation is using BLAST for finding similarities, and then annotating genomes based on that

19
Q

Functions of gene prediction software include:

A
  • To identify genes within a long DNA sequence
  • A DNA sequence that codes for amino acids shouldn’t contain any stop codons
  • Such a coding region is called an open reading frame
  • Because each DNA strand can be read in three reading frames and there are two DNA strands, the computer must analyze a given DNA sequence in six different reading frames
  • Examples of tools that can be used are Open Reading Frame Finder (ORF Finder) at NCBI and GENSCAN
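
The six reading frames can be sketched as follows: three codon offsets on the given strand and three on its reverse complement. This is a minimal illustration, not how ORF Finder or GENSCAN are implemented; the function names and the frame labels (+1 to +3, -1 to -3) are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Split a DNA sequence into codons for all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            usable = len(seq) - offset
            end = offset + usable - usable % 3  # drop trailing partial codon
            frames[f"{strand}{offset + 1}"] = [
                seq[i:i + 3] for i in range(offset, end, 3)
            ]
    return frames


for name, codons in six_frames("ATGAAATAG").items():
    print(name, codons)
```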
20
Q

What diagnostic features indicate the presence of a gene?

A
  • Open reading frame (ORF)
  • Start codon (Met start codon, ATG)
  • Stop codon (TGA, TAG or TAA)
  • Terminator sequence (prokaryotes)
  • TATA box (eukaryotes)
  • Shine-Dalgarno sequence (prokaryotes)
  • Kozak sequence (eukaryotes)
  • Poly A addition signal (eukaryotes)
  • Intron/exon boundaries.
  • CpG islands.
21
Q

What is the open reading frame (ORF)?

A

ORF analysis is effective for bacterial genomic DNA sequences

  • It is less effective for analyzing the DNA of eukaryotes (particularly higher eukaryotes)
    • This is due to the intron/exon structure of eukaryotic genes
    • More sophisticated gene identification software is often required for eukaryotes

ORF-based methods are sometimes referred to as ab initio, because they attempt to predict genes based only on knowledge of gene structure

22
Q

What are the criteria for judging a good ORF?

A

An ORF should begin with a Start codon (Methionine residue) and end with an in-frame stop codon

  • It must be of reasonable size (the longer the better). Long ORFs are unlikely to occur by chance and thus signify potential genes. Very short amino acid sequences probably do not come from real ORFs
  • It should end with an in-frame stop codon. Many stop codons close together suggest that an ORF is not present

NB: in the lab session, ORF will be used to search for Prokaryotic genes
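
The criteria above (an ATG start, an in-frame stop and a reasonable length) can be sketched as a simple scan. This is an illustration under stated assumptions rather than the method of any particular tool: it looks only at the forward strand, and the `min_codons` cut-off is a hypothetical parameter.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon and must
    contain at least min_codons codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # look for the next ORF
    return orfs


seq = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"   # made-up test sequence
print(find_orfs(seq, min_codons=5))   # [(2, 23, 3)]
print(find_orfs(seq, min_codons=10))  # []
```

Raising the length cut-off discards the short ORF, mirroring the rule that short ORFs are more likely to occur by chance.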

23
Q

Describe sequence alignment software

A

Computer analysis can be used to determine whether a newly sequenced gene is similar to one already known and stored in a database

  • The two most popular sequence alignment programs are BLAST (Basic Local Alignment Search Tool) and FASTA (FAST-All)
  • The BLAST program searches a nucleic acid database to find sequences that match or resemble the one being tested. It works in three steps:
    1. It first looks for similar segments (high-scoring segment pairs, HSPs) between the query sequence and a database sequence
    2. It then evaluates the statistical significance of any matches that were found
    3. Finally, it reports only those matches that satisfy a user-selectable threshold of significance
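
The idea of finding segment pairs and reporting only those above a threshold can be sketched with a toy version. Everything here is a simplifying assumption: seeds are exact word matches, extension stops at the first mismatch, and a raw length cut-off (`min_score`) stands in for the statistical significance test. It is not the BLAST algorithm itself.

```python
def find_hsps(query, subject, word=4, min_score=6):
    """Toy segment-pair search: seed on exact words, extend without gaps.

    Returns (query_start, subject_start, score) tuples, best first.
    The score is simply the length of the exactly matching segment.
    """
    hsps = set()
    for q in range(len(query) - word + 1):
        seed = query[q:q + word]
        s = subject.find(seed)
        while s != -1:
            ql, sl = q, s
            # extend to the left while residues keep matching
            while ql > 0 and sl > 0 and query[ql - 1] == subject[sl - 1]:
                ql -= 1
                sl -= 1
            qr, sr = q + word, s + word
            # extend to the right while residues keep matching
            while (qr < len(query) and sr < len(subject)
                   and query[qr] == subject[sr]):
                qr += 1
                sr += 1
            score = qr - ql
            if score >= min_score:          # report only hits above threshold
                hsps.add((ql, sl, score))
            s = subject.find(seed, s + 1)   # next occurrence of the seed
    return sorted(hsps, key=lambda h: (-h[2], h[0], h[1]))


print(find_hsps("TTGACCTGAA", "CCGACCTGTT"))  # [(2, 2, 6)]
```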

24
Q

What is the emphasis of sequence alignment software?

A

The emphasis of this tool is to find regions of sequence similarity

These can yield clues about the structure and function of this novel sequence and about its evolutionary history and homology with other sequences in the databases

Regions of similarity detected with this type of alignment tool can be either local, where the similarity is confined to one region of the sequences, or global, where similarity is assessed across the full length of the sequences
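
The difference between local and global alignment can be illustrated with the two classic dynamic-programming scores: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. These algorithms are not named on the card and are swapped in here as standard examples; the scoring values and the function name are assumptions.

```python
def align_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score: Smith-Waterman if local, else Needleman-Wunsch."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    if not local:
        # a global alignment must pay for leading gaps
        for i in range(1, rows):
            dp[i][0] = i * gap
        for j in range(1, cols):
            dp[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            dp[i][j] = cell
    return best if local else dp[-1][-1]


a, b = "CCCCGATTACA", "GATTACATTTT"   # share only the region GATTACA
print("local :", align_score(a, b, local=True))
print("global:", align_score(a, b, local=False))
```

The local score picks out the shared region and ignores the unrelated flanks, while the global score is dragged down because every position of both sequences has to be aligned.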

25
Q

What is BLASTP used to research?

A

BLASTP is used to search a protein database using a protein query

26
Q

What is BLASTX used to research?

A

BLASTX is used to search a protein database using a translated nucleotide query

It compares the 6-frame translations of a DNA query to a protein database
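
A six-frame translation of a DNA query can be sketched as below, assuming the standard genetic code. The function names are hypothetical, and codons containing ambiguous bases are shown as "X"; this illustrates the idea and is not BLASTX's own code.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands in all three codon offsets."""
    dna = dna.upper()
    rev = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return {
        f"{sign}{k + 1}": translate(seq[k:])
        for sign, seq in (("+", dna), ("-", rev))
        for k in range(3)
    }


print(six_frame_translation("ATGAAATAG"))
```

Each of the six protein strings would then be compared against the protein database.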

27
Q

What is tblastn used to research?

A

Used to search a translated nucleotide database using a protein query

28
Q

What is tblastx used to research?

A

Used to search a translated nucleotide database using a translated nucleotide query. It compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each comparison is equivalent to a BLASTP search)

29
Q

What is FASTA used to research?

A

Compares a DNA query to DNA database, or a protein query to protein database

30
Q

What is FASTX used to research?

A

Compares a translated DNA query to a protein database

31
Q

What is TFASTA used to research?

A

Compares a protein query to a translated DNA database

32
Q

What input formats does the BLAST query box accept?

A

The BLAST ‘Search’ box accepts a number of different types of input and automatically determines the format. Accepted input types are:

  1. FASTA
  2. Bare Sequence
  3. Identifiers
33
Q

Describe FASTA as a search format

A

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (>) symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is….

  • Blank lines aren’t allowed in the middle of FASTA input
  • Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes
  • However, a single hyphen or dash can be used to represent a gap of indeterminate length
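
Reading FASTA input as described above can be sketched with a small parser. The record used in the example is made up for illustration, and the parser is more lenient than the rules on this card because it silently skips blank lines.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # lenient: ignore blank lines
        if line.startswith(">"):          # description line starts a record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example record
ATGCCGTA
GGTTAACC
>seq2
MKV-LLA
"""
for description, sequence in parse_fasta(example):
    print(description, sequence)
```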
34
Q

Describe Bare sequence as a search format

A

This may be just lines of sequence data, without the FASTA definition line, e.g.

QIKDLLGDGGGGSSSSUHTVRFVRDD

  • It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept Flatfile report
  • Blank lines aren’t allowed in the middle of bare sequence input
35
Q

Describe identifiers as a search format

A
  • Normally these are simply accessions, accession.version identifiers or gi numbers (e.g., P01013, AAA68881.1, 129295)
  • However, a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted
  • These NCBI sequence identifiers have a very specific syntax. The identifier may consist of only one token (i.e., one word)
  • Spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier aren’t allowed)
  • Examples of illegal input are: ACCENSION P01013 AAA68881.1 gi|129295
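
How the three input types might be told apart can be sketched with a rough heuristic. This is an assumption made for illustration, not NCBI's actual detection logic: input starting with ">" is treated as FASTA, a single token containing a digit or a bar as an identifier, and anything else as bare sequence.

```python
import re


def classify_blast_input(text):
    """Guess the input type: 'fasta', 'identifier' or 'bare sequence'."""
    if text.lstrip().startswith(">"):
        return "fasta"
    # One token only: any whitespace (including leading or trailing spaces,
    # following the rule stated on the card) rules out an identifier.
    single_token = re.fullmatch(r"\S+", text) is not None
    if single_token and (re.search(r"\d", text) or "|" in text):
        return "identifier"
    return "bare sequence"


for item in ("P01013", "gi|129295", "QIKDLLGDGGGGSSSSUHTVRFVRDD",
             "1 atgcatgc 11 ggtaggta", ">seq1\nACGT"):
    print(repr(item), "->", classify_blast_input(item))
```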
36
Q

Explain sequence homology

A

Analysis of DNA sequences of homologous genes to provide clues to the evolutionary relationships between organisms

  • Two species that are closely related to each other will have DNA (or amino acid) sequences that are more similar to each other than if they are more distantly related
  • Such sequence analyses can be used to construct family trees of organisms
  • Bacteria from different species can exchange DNA sequences (horizontal transfer), making it more difficult to establish relationships among bacteria based on their DNA sequences
  • Multiple sequence alignment software such as CLUSTAL W and COBALT (Constraint-based Multiple Alignment Tool) is used by scientists to study the phylogenetic relationships between species
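
The idea that closely related species have more similar sequences can be illustrated by computing percent identity between aligned sequences. The three sequences and species names below are made up, and percent identity is only one simple similarity measure; it is not what CLUSTAL W or COBALT compute internally.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of identical positions, ignoring columns with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


aligned = {  # hypothetical, already aligned sequences
    "species_A": "ATGCCGTA",
    "species_B": "ATGCCGTT",
    "species_C": "ATGAAGCT",
}
for x, y in combinations(aligned, 2):
    print(x, y, percent_identity(aligned[x], aligned[y]))
```

In this made-up data, species_A and species_B are the most similar pair, so a tree built from these values would group them together first.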
37
Q

What is a phylogenetic tree?

A

A phylogenetic tree or evolutionary tree is a branching diagram or “tree”

  • It shows the inferred evolutionary relationships among various biological species or other entities
  • It is based upon similarities and differences in the species’ physical and/or genetic characteristics
  • The taxa are joined together at the tree “trunk”, and organisms that have arisen from it are placed at the ends (tips) of the tree “branches”
  • Closely related groups are located on branches close to one another
38
Q

What does BLAST do?

A

BLAST is the most widely used software in bioinformatics research

Its main function is to compare a sequence of interest, the query sequence, to the sequences in a large database

39
Q

Explain the Mascot search engine from Matrix Science

A
  • A powerful database search engine
  • Integrates all of the proven methods of database searching: peptide mass fingerprint, sequence query and MS/MS ion search
  • Uses any nucleic acid or amino acid database in FASTA format to identify, characterize and quantify proteins
40
Q

What are the challenges in bioinformatics?

A

Biological Redundancy and multiplicity:

  • Different sequences with similar structures
  • Organisms with similar genes
  • Multiple functions of single genes
  • Grouping of genes in pathways

Sequence redundancy in genomes

Significance of relationships and similarities

Signal vs Noise

Lack of data