Lecture 5 Flashcards
What types of bioinformatic data are there?
- primary
  – sequences, nucleotide, protein
  – phenotypic …
- secondary
  – protein structures, helices, beta-sheets
  – expression data, mRNA, proteins
- tertiary
  – protein domains, protein functions, atom coordinates
  – co-expression and other networks
What are the Goals in Sequence Analysis?
- find similar sequences
- must define similarity or dissimilarity
- infer functions from similarities
- retrieve similar sequences
- align two or more similar sequences
- explore evolution of sequences
Levenshtein Algorithm
The Levenshtein algorithm, also known as the Edit Distance algorithm, calculates the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. It measures the similarity or dissimilarity between strings by counting these edits. It is commonly used in spell-checking, DNA sequence alignment, and natural language processing.
cell[i, j] = min(cell[i-1, j] + 1, cell[i, j-1] + 1, cell[i-1, j-1] + editvalue)
where editvalue is 0 if the two characters match and 1 otherwise.
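The table-filling idea behind the edit distance can be sketched in Python. This is a minimal illustration assuming unit costs for insertion, deletion, and substitution; the function name `levenshtein` and the variable `edit_value` are chosen here for illustration and are not from the lecture.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # cell[i][j] holds the distance between the prefixes a[:i] and b[:j].
    cell = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        cell[i][0] = i  # delete all i characters of a
    for j in range(len(b) + 1):
        cell[0][j] = j  # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Assumed unit cost: 0 for a match, 1 for a substitution.
            edit_value = 0 if a[i - 1] == b[j - 1] else 1
            cell[i][j] = min(
                cell[i - 1][j] + 1,               # deletion
                cell[i][j - 1] + 1,               # insertion
                cell[i - 1][j - 1] + edit_value,  # match or substitution
            )
    return cell[len(a)][len(b)]


print(levenshtein("kitten", "sitting"))
```

The bottom-right cell of the table is the distance between the two full strings.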
What is the Damerau-Levenshtein Algorithm?
The Damerau-Levenshtein distance is a variation of the Levenshtein distance algorithm that also considers transpositions of adjacent characters. It measures the minimum number of edits needed to transform one string into another, including insertions, deletions, substitutions, and transpositions.
- flip operations (swapping two adjacent characters) count as one change
- brid (Old English) → bird (New English) → 1 operation
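A minimal Python sketch of the transposition rule follows. Note that it implements the restricted variant usually called optimal string alignment, which never edits a substring more than once, rather than the unrestricted Damerau-Levenshtein distance; the function name `osa_distance` and the unit costs are assumptions made for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where swapping two adjacent characters costs one edit."""
    cell = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        cell[i][0] = i
    for j in range(len(b) + 1):
        cell[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cell[i][j] = min(
                cell[i - 1][j] + 1,         # deletion
                cell[i][j - 1] + 1,         # insertion
                cell[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition: the last two characters of both prefixes are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cell[i][j] = min(cell[i][j], cell[i - 2][j - 2] + 1)
    return cell[len(a)][len(b)]


print(osa_distance("brid", "bird"))
```

For "brid" and "bird" this counts the r/i swap as a single operation, whereas the plain Levenshtein distance would count two substitutions.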
What is the Needleman-Wunsch Algorithm?
The Needleman-Wunsch algorithm is a dynamic programming algorithm used for aligning two sequences, such as DNA or protein sequences. It finds the optimal alignment by maximizing similarity scores and considering matches, mismatches, insertions, and deletions. It is widely used in bioinformatics and sequence analysis.
- gaps can be scored differently from edits
- an exchange matrix scores the different letter changes
- find global alignment → Needleman-Wunsch
- opening a gap can be penalized differently from extending it → Needleman-Wunsch-Gotoh
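A minimal Python sketch of the global alignment score is shown below. It assumes a simple match/mismatch score instead of a full exchange matrix and a single linear gap penalty (not the separate gap-opening penalty of the Gotoh variant); the function name and the default values `match=1`, `mismatch=-1`, `gap=-2` are illustrative choices, not values from the lecture.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best global alignment of a and b (linear gap penalty)."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    # A global alignment must pay for leading gaps as well.
    for i in range(1, len(a) + 1):
        score[i][0] = i * gap
    for j in range(1, len(b) + 1):
        score[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    return score[len(a)][len(b)]


print(needleman_wunsch_score("ACGT", "AGT"))
```

The alignment itself can be recovered by tracing back from the bottom-right cell along the choices that produced each maximum; that step is omitted here to keep the sketch short.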
What is the Smith-Waterman Algorithm?
The Smith-Waterman algorithm is a dynamic programming algorithm used for local sequence alignment. It identifies the best local alignment by maximizing similarity scores and considering matches, mismatches, and gaps. It is used to find local similarities or regions of interest within longer sequences.
- find best local alignment → Smith-Waterman
- the exchange matrix has smaller penalty values for more similar letters
- example: d/t are both dental sounds, and leucine and isoleucine have similar biophysical properties
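A minimal Python sketch of the local alignment score follows. Compared with a global alignment table, the two changes are that no cell may fall below zero and that the result is the best cell anywhere in the table. The function name and the scores `match=2`, `mismatch=-1`, `gap=-2` are assumptions for illustration; a real analysis would typically use an exchange matrix.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best local alignment between a and b."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment here
                score[i - 1][j - 1] + s,  # extend with a match or mismatch
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("XXXHELLOXXX", "YYHELLOYY"))
```

In the usage line, only the shared region "HELLO" contributes to the score; the unrelated flanking characters are ignored, which is the point of a local alignment.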
What is FASTA?
FASTA is a commonly used file format for representing nucleotide or protein sequences in bioinformatics. It is named after the software package that originally introduced the format, FASTA (Fast All). The FASTA format is widely supported by various bioinformatics tools and databases.
A FASTA file typically consists of one or more sequence entries, each representing a single sequence and its associated metadata. The format follows a specific structure:
Header line: Begins with a greater-than symbol “>” followed by a unique identifier or name for the sequence. Additional information, such as a description or accession number, may be included.
Sequence lines: One or more lines that contain the actual sequence data. The sequence can span multiple lines if needed, but it is often recommended to limit line lengths for easier handling.
The FASTA format is widely used for storing and sharing biological sequence data. It allows for easy identification, retrieval, and analysis of sequences using various bioinformatics tools and databases. FASTA files can contain a single sequence or a collection of sequences, making it a flexible and efficient format for handling genetic or protein data.
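The header-line and sequence-line structure can be read with a few lines of Python. This is a minimal sketch, assuming the identifier is the first whitespace-separated word after ">"; the function name `parse_fasta` and the two example records are made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its sequence, joined across lines."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the first word of the header is the identifier.
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)  # sequence data may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical example input, not real database entries.
example = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2
MKV
""".splitlines()

print(parse_fasta(example))
```

For real work, an established parser such as the one in Biopython is usually the safer choice, since it handles edge cases this sketch ignores.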
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a bioinformatics tool and algorithm used for sequence similarity searching. It compares a query sequence against a database to identify similar sequences and provides measures of similarity and statistical significance. BLAST is widely used in biological research for sequence analysis and annotation.
BLAST employs a heuristic algorithm that rapidly searches a query sequence against a database, aiming to find local alignments with high sequence similarity. It reports a measure of similarity, the alignment score, and a measure of statistical significance, usually the E-value: the number of alignments with at least the observed score that would be expected by chance in a database of that size.
* Poisson distribution of score values → P-value
* E-value = P-value × number of entries in the database
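The relation between P-value and E-value, as stated on this card, can be written as a one-line function. This is a sketch of the card's simplified formula only, not of the statistics that BLAST itself implements; the names `e_value`, `p_value`, and `n_entries` and the example numbers are illustrative.

```python
def e_value(p_value: float, n_entries: int) -> float:
    """Expected number of chance hits, using the simplified relation from the card."""
    return p_value * n_entries


# Hypothetical numbers: a P-value of 1e-6 against a database of 500,000 entries.
print(e_value(1e-6, 500_000))
```

With these numbers the result is about 0.5, so a hit that looks very unlikely for a single comparison is still expected roughly once in every two searches of a database this large.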
What does heuristic mean?
In problem-solving, a heuristic is a practical rule or strategy that aims to find a good solution quickly, even if it is not guaranteed to be optimal. Heuristics simplify complex problems and use informed guesses or rules based on experience to guide decision-making. They strike a balance between efficiency and optimality.
BLAST is the major tool people are using.
How can you plot two sequences?
In a dot plot (looks like a chessboard): one sequence runs along each axis, and
a black cell means the two positions match.
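A dot plot is easy to build by hand. The following Python sketch compares every position of one sequence with every position of the other and prints "#" for a match (the black cells) and "." otherwise; the function names `dotplot` and `render` are chosen here for illustration.

```python
from typing import List


def dotplot(a: str, b: str) -> List[List[bool]]:
    """Matrix with one row per position in a and one column per position in b."""
    return [[x == y for y in b] for x in a]


def render(matrix: List[List[bool]]) -> str:
    """Text rendering of the matrix: '#' marks a match, '.' a mismatch."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )


print(render(dotplot("GATTACA", "GATCACA")))
```

Similar regions show up as diagonal runs of "#", and two identical sequences give an unbroken main diagonal.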
Name a non-alignment-based method for sequence analysis
- calculating codon frequencies for different species (dist, hclust)
- GC content with a sliding calculation window over the sequence
- word frequencies for proteins and nucleotides (3, 4, 5 and more letters) for different species
- enrichment analysis for certain nucleotides or amino acids at certain positions (chisq.test, fisher.test, …)
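Two of these methods, the sliding-window GC content and word frequencies, can be sketched in a few lines of Python. The function names, the `window` and `step` parameters, and the example sequences are assumptions for illustration; the R functions named on the card (dist, hclust, chisq.test, fisher.test) belong to the later comparison steps, which are not shown here.

```python
from collections import Counter
from typing import List


def gc_content_windows(seq: str, window: int, step: int = 1) -> List[float]:
    """Fraction of G and C in each window as it slides along the sequence."""
    seq = seq.upper()
    return [
        sum(base in "GC" for base in seq[start:start + window]) / window
        for start in range(0, len(seq) - window + 1, step)
    ]


def word_frequencies(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


print(gc_content_windows("GGCCAATT", window=4, step=2))
print(word_frequencies("ATATAT", k=2))
```

Word-count vectors computed this way for several species could then be compared with a distance measure and clustered, which is where the dist and hclust steps mentioned above come in.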