2. Big data Flashcards
How to improve a task (memory requirement / speed)?
- fast computer?
next three tied together:
* algorithm - appropriate & efficient
* data structure - efficient to use/access/query, space efficient to store
* parallelization - to split up tasks and run in parallel
Most realistic time complexity for algorithms in sequence bioinformatics ?
exponential; polynomial
Define time complexity
examples?
Time complexity is defined as the amount of time taken by an algorithm to run, as a function of the length of the input.
= how does the running time of an algorithm increase
with the size of the input?
= most useful way to measure effectiveness of an algorithm
eg. constant, logarithmic, linear, polynomial, exponential
Algorithms - classifications? + examples
by complexity
* logarithmic, linear, quadratic, polynomial, etc
… by design
* brute force
* heuristic (zB greedy, seed and extend
* recursion (dynamic programming)
… by data structure
* hash table
* graphs (zB trees)
… by (biological) problem
* pattern search
Brute force algorithm?
- definition
- use
- running time
- try all possibilities to find the optimal solution –> output: the optimal solution
- can only be used for small problems in bioinformatics
- often have exponential running time
Heuristic algorithm?
- definition
- use
don’t guarantee that the optimal solution is found, because not all solutions are evaluated
are a good compromise between speed and accuracy
many different types / designs exist and are used
Two examples for heuristic algorithms
- greedy algorithms
- seed-and-extend = seed-and-filter-and-extend
Greedy logarithm - definition, use ?
- Heuristic logarithm
- at each step, the algorithm takes the best immediate solution while finding an answer –> may not find the best possible overall solution
- used for optimization problems
- used for sequence alignment & assembly, phylogenetic inference, etc
Seed-and-extend
definition, use?
Seed-and-extend = seed-and-filter-and-extend
- observation: true matches contain short, exactly matching kmers (seeds)
- these are easy & fast to find (compared to longer / inexact matches), as can exclude any point mutations
- extend these short matches in a second step, and only these (filter)
- used for alignment
Subsampling of k-mers - name 3 types?
minimizers, syncmers, strobemers
Subsampling - Minimizers?
saves on space /ram
such that the same k-mers will be selected from similar sequences
doi:10.1093/bioinformatics/bth408
Subsampling - syncmers?
within a window, select the k-mer(s) with the smallest s-mer at the start or end
- suited for comparison of seqs with mismatches
https://doi.org/10.7717/peerj.10805
Data structure - definition, common examples?
- “the organization of information such that it can be
used efficiently” - common data structures: hash table, string, list, table, (suffix) tree, graph
Suffix tree - definition, use?
- representation of a string - present the suffixes of a given string in a tree
- edges are labeled with strings, such that each suffix of a string corresponds to exactly one path from the root to a leaf
*allows efficient searching of this text present the suffixes of a given string in a tree
Suffix tree - disadvantages / advantages?
disadvantages
* not so efficient to construct
* not so efficient to store
(–> convert to more space-efficient suffix arrays!)
advantages
* very efficient to search
- search is proportional to length of the pattern