Bioinformatic applications and Genome databases Flashcards
What is Bioinformatics?
It is the integration of mathematical, computer, statistical and biological sciences to analyze biological “big data”.
Bioinformatics is used for a variety of tasks. Some of the most common are:
Bioinformatics uses computer/software-based approaches to:
- Compare related sequences from different species/organisms (alignments)
- Assemble genomes
  Sequence reads → contigs → scaffolds → chromosomes → genomes
- Annotate genomes
  - Identify genes and regulatory elements (promoters, enhancers, terminators)
  - Structural sequences like telomere regions
  - Repetitive sequences (microsatellites)
- Investigate gene expression patterns (transcriptomics)
- Translate ORFs to amino acid sequences for protein analysis
- Predict protein function (domains and motifs)
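The assembly step listed above (sequence reads → contigs) can be illustrated with a toy sketch. It assumes error-free reads that are already in left-to-right order and merges them wherever the end of one read matches the start of the next. The function names and the `min_overlap` parameter are made up for this illustration; real assemblers deal with sequencing errors, repeats and millions of unordered reads.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads into one contig if they overlap by at least `min_overlap` bases."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical reads, chosen so that each one overlaps the next by 4 bases.
reads = ["ATGCGTAC", "GTACCTTA", "CTTAGGA"]
contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is not None:
        contig = merged

print(contig)  # ATGCGTACCTTAGGA
```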
What is GenBank?
- A large repository of digital nucleic acid information and analysis tools, at your fingertips!
- The largest publicly available database of DNA sequences
As sequences are identified and genes are named, each sequence deposited into GenBank is given an accession number that scientists can use to access and retrieve that sequence for analysis. The NCBI is an invaluable source of public-access databases and bioinformatics tools for analyzing genome data.
GenBank is maintained by
the National Center for Biotechnology Information (NCBI), part of the U.S. National Institutes of Health (NIH)
Each sequence deposited in GenBank receives an
accession number - used to access and retrieve a sequence for analysis
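As a rough sketch of how an accession number is used to retrieve a sequence, the snippet below builds a request URL for NCBI's E-utilities `efetch` service using only the Python standard library. The helper names `build_efetch_url` and `fetch_sequence` are invented for this example, and U49845 is used only as an example accession. The download function is defined but not called here, since it needs network access and should respect NCBI's usage guidelines.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build an efetch URL that requests one nucleotide record as plain text."""
    query = urlencode({"db": db, "id": accession, "rettype": rettype, "retmode": "text"})
    return f"{EFETCH_URL}?{query}"


def fetch_sequence(accession: str) -> str:
    """Download the record for `accession` (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# Example accession only; swap in the accession number of the sequence you need.
url = build_efetch_url("U49845")
print(url)
```

Calling `fetch_sequence("U49845")` would return the record as FASTA text, which can then be passed to other analysis tools.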
To identify gene sequences
Genome projects generate tremendous amounts of DNA-sequence information. These data are simply strings of letters (A, T, G, C) and are of little use until they have been analysed and interpreted.
For example, if we assemble contigs, how do we know where a gene and its parts (promoter, regulatory elements, exons, introns) lie within these contigs?
Apart from experimental procedures to determine gene function, bioinformatics approaches can be used to predict function by comparing a sequence with similar sequences that already exist in the database. The standard tool for this is BLAST.
BLAST stands for Basic Local Alignment Search Tool
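To make "local alignment" concrete, here is a small sketch of Smith-Waterman scoring, the classic dynamic-programming method for finding the best-matching region shared by two sequences. This is not how BLAST itself works internally: BLAST uses a faster seed-and-extend heuristic to approximate this kind of search across a whole database. The scoring values (match +2, mismatch -1, gap -2) and the database entries are arbitrary choices for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # A local alignment may restart anywhere, so scores never drop below 0.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# Hypothetical mini "database": rank entries by similarity to the query.
query = "TTACGTTT"
database = {"seq_A": "GGACGTGG", "seq_B": "CCCCCCCC"}
ranked = sorted(database, key=lambda name: local_alignment_score(query, database[name]),
                reverse=True)
print(ranked)  # ['seq_A', 'seq_B']
```

Here `seq_A` ranks first because it shares the region `ACGT` with the query, even though the flanking bases differ. That is the sense in which the alignment is local.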
Protein-encoding sequences can be identified as open reading frames (ORFs)
ORFs contain a start codon (ATG) and a stop codon (TAA, TAG or TGA)
Eukaryotic genes have defined regulatory elements (promoters, terminators, UTRs, polyadenylation signals and CpG islands)
Eukaryotic genes consist of exons and introns (with defined splice sites between these)
Software can identify all of these elements
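A minimal sketch of ORF detection, based on the start and stop codons described above: it scans the three forward reading frames and reports every stretch that begins with ATG and ends at the first in-frame stop codon. The function name and the `min_codons` threshold are assumptions made for this example. It ignores the reverse strand and does not handle introns, so it is closer to a prokaryotic ORF scan than to a full eukaryotic gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 2):
    """Return (frame, start, end, orf) tuples for the three forward reading frames."""
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3  # include the stop codon
                if (end - start) // 3 >= min_codons:
                    orfs.append((frame, start, end, sequence[start:end]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGGCCTAAGG"))  # [(2, 2, 11, 'ATGGCCTAA')]
```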
Translating nucleotide sequence to amino acid sequence
Bioinformatic software can also be used to “translate” ORFs into possible polypeptide sequences as a way to predict the protein encoded by a gene.
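The translation step can be sketched as a codon-by-codon lookup in the standard genetic code. The table below is built from the conventional T, C, A, G ordering, and the function name `translate` is chosen for this example. It assumes the input is a DNA ORF that starts at the first base of its start codon, and it marks any unrecognised codon as `X`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...); '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate a DNA ORF into a one-letter amino acid sequence, stopping at a stop codon."""
    orf = orf.upper()
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE.get(orf[pos:pos + 3], "X")  # 'X' = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA
```

The predicted polypeptide can then be compared against protein databases or searched for known domains and motifs.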
Slides 10-12, Chapter 21, Part 1