2. Big data Flashcards
How to improve a task (memory requirement / speed)?
- fast computer?
next three tied together:
* algorithm - appropriate & efficient
* data structure - efficient to use/access/query, space efficient to store
* parallelization - to split up tasks and run in parallel
Most realistic time complexity for algorithms in sequence bioinformatics ?
exponential; polynomial
Define time complexity
examples?
Time complexity is defined as the amount of time taken by an algorithm to run, as a function of the length of the input.
= how does the running time of an algorithm increase
with the size of the input?
= most useful way to measure effectiveness of an algorithm
eg. constant, logarithmic, linear, polynomial, exponential
Algorithms - classifications? + examples
by complexity
* logarithmic, linear, quadratic, polynomial, etc
… by design
* brute force
* heuristic (zB greedy, seed and extend
* recursion (dynamic programming)
… by data structure
* hash table
* graphs (zB trees)
… by (biological) problem
* pattern search
Brute force algorithm?
- definition
- use
- running time
- try all possibilities to find the optimal solution –> output: the optimal solution
- can only be used for small problems in bioinformatics
- often have exponential running time
Heuristic algorithm?
- definition
- use
don’t guarantee that the optimal solution is found, because not all solutions are evaluated
are a good compromise between speed and accuracy
many different types / designs exist and are used
Two examples for heuristic algorithms
- greedy algorithms
- seed-and-extend = seed-and-filter-and-extend
Greedy logarithm - definition, use ?
- Heuristic logarithm
- at each step, the algorithm takes the best immediate solution while finding an answer –> may not find the best possible overall solution
- used for optimization problems
- used for sequence alignment & assembly, phylogenetic inference, etc
Seed-and-extend
definition, use?
Seed-and-extend = seed-and-filter-and-extend
- observation: true matches contain short, exactly matching kmers (seeds)
- these are easy & fast to find (compared to longer / inexact matches), as can exclude any point mutations
- extend these short matches in a second step, and only these (filter)
- used for alignment
Subsampling of k-mers - name 3 types?
minimizers, syncmers, strobemers
Subsampling - Minimizers?
saves on space /ram
such that the same k-mers will be selected from similar sequences
doi:10.1093/bioinformatics/bth408
Subsampling - syncmers?
within a window, select the k-mer(s) with the smallest s-mer at the start or end
- suited for comparison of seqs with mismatches
https://doi.org/10.7717/peerj.10805
Data structure - definition, common examples?
- “the organization of information such that it can be
used efficiently” - common data structures: hash table, string, list, table, (suffix) tree, graph
Suffix tree - definition, use?
- representation of a string - present the suffixes of a given string in a tree
- edges are labeled with strings, such that each suffix of a string corresponds to exactly one path from the root to a leaf
*allows efficient searching of this text present the suffixes of a given string in a tree
Suffix tree - disadvantages / advantages?
disadvantages
* not so efficient to construct
* not so efficient to store
(–> convert to more space-efficient suffix arrays!)
advantages
* very efficient to search
- search is proportional to length of the pattern
Suffix trees - applications?
many applications in sequence bioinformatics
- keywords/patterns in text
- (maximal) repeats
- unique substrings
- longest common substring
- inexact matches
Parallelization - why? requires? types?
(parallel computing as opposed to serial computing)
why
* clock speed maxed out
* performance improvements now only possible through parallelization
requires
* special hardware and programs
* important: frequent communication between subtasks?
types
* embarrassing, coarse-grained, fine-grained
multi-core computers - where? why? requires?
- many-core processors are standard now - desktops, laptops, mobile phones, …
- communication between cores is fast / easy
- required rewriting of software (parameters)
(computing) clusters - definition/ consists of, disadvantage, requires, example in linux
definition / consists of:
* multiple computers (that are multi-core); nodes
* a login / head / master node
* linked by a fast local area network
disadvantage:
* expensive – but cheaper than a single computer of comparable size & speed
requires:
* application programs: extra programming / parameters
* requires a system to distribute jobs
example in linux:
SLURM
Heterogeneous computing
CPUs
* 2-8 cores, general-purpose
GPUs
* hundreds of cores, highly specialized - thanks to gaming!
* programmable for general purpose computing since 2007: CUDA (NVIDIA)
integration of CPUs and GPUs
–> performance boost
but: specialized programs / programming skills required
cloud computing - definition, advantages, example providors, requirments
definition: virtual machines on many networked computers
advantages:
- emulation of a physical computer (operating system, configuration, software, etc)
- high flexibility
- saving & duplicating a VM is very easy
computing power rented on demand - example:
- Amazon, Google, Microsoft
- some biological data sets available;
limiting step: data upload may otherwise be the limiting step
requires:
- specialized programming required
MapReduce / Hadoop; Go programming language
Big data - skills needed?
- know, understand, select efficient & appropriate
algorithms - efficient run time for large data sets
- efficient memory usage
- efficient data structures
(space efficient, efficient to access) - be able to work on high performance computers
- remote computing
- parallel computing
- organize data / results / documentation
- learn to work with many different programs
- work under unix/linux, in the terminal
Algorithm - definition
a list set of instructions, used to solve problems or perform tasks
input –>** algorithm **–> output
- a sequence of ordered instructions that must be carried out to solve a well-formulated problem
- algorithms can be stated in English (or German or Spanish) or in pseudocode
- algorithms can be implemented in different ways (e.g., in different programming languages)
- a correct algorithm will solve all instances of a problem
- given a problem, there may be many correct algorithms that can solve it
Connection between data structure and algorithm?
- Algorithms: complexity/efficiency: one factor for this is space requirement, which can be optimised by using the right data structures
- The way the data is accessed by the algorithm differs depending on the data structure