2. Big data Flashcards by Stevie Davies

How to improve a task (memory requirement / speed)?

fast computer?

next three tied together:
* algorithm - appropriate & efficient
* data structure - efficient to use/access/query, space efficient to store
* parallelization - to split up tasks and run in parallel

How well did you know this?

Not at all

Perfectly

Most realistic time complexity for algorithms in sequence bioinformatics ?

exponential; polynomial

How well did you know this?

Not at all

Perfectly

Define time complexity

examples?

Time complexity is defined as the amount of time taken by an algorithm to run, as a function of the length of the input.

= how does the running time of an algorithm increase
with the size of the input?

= most useful way to measure effectiveness of an algorithm

eg. constant, logarithmic, linear, polynomial, exponential

How well did you know this?

Not at all

Perfectly

Algorithms - classifications? + examples

by complexity
* logarithmic, linear, quadratic, polynomial, etc

… by design
* brute force
* heuristic (zB greedy, seed and extend
* recursion (dynamic programming)

… by data structure
* hash table
* graphs (zB trees)

… by (biological) problem
* pattern search

How well did you know this?

Not at all

Perfectly

Brute force algorithm?
- definition
- use
- running time

try all possibilities to find the optimal solution –> output: the optimal solution
can only be used for small problems in bioinformatics
often have exponential running time

How well did you know this?

Not at all

Perfectly

Heuristic algorithm?
- definition
- use

don’t guarantee that the optimal solution is found, because not all solutions are evaluated

are a good compromise between speed and accuracy

many different types / designs exist and are used

How well did you know this?

Not at all

Perfectly

Two examples for heuristic algorithms

greedy algorithms
seed-and-extend = seed-and-filter-and-extend

How well did you know this?

Not at all

Perfectly

Greedy logarithm - definition, use ?

Heuristic logarithm
at each step, the algorithm takes the best immediate solution while finding an answer –> may not find the best possible overall solution
used for optimization problems
used for sequence alignment & assembly, phylogenetic inference, etc

How well did you know this?

Not at all

Perfectly

Seed-and-extend
definition, use?

Seed-and-extend = seed-and-filter-and-extend

observation: true matches contain short, exactly matching kmers (seeds)
these are easy & fast to find (compared to longer / inexact matches), as can exclude any point mutations
extend these short matches in a second step, and only these (filter)
used for alignment

How well did you know this?

Not at all

Perfectly

Subsampling of k-mers - name 3 types?

minimizers, syncmers, strobemers

How well did you know this?

Not at all

Perfectly

Subsampling - Minimizers?

saves on space /ram

such that the same k-mers will be selected from similar sequences

doi:10.1093/bioinformatics/bth408

How well did you know this?

Not at all

Perfectly

Subsampling - syncmers?

within a window, select the k-mer(s) with the smallest s-mer at the start or end
- suited for comparison of seqs with mismatches

https://doi.org/10.7717/peerj.10805

How well did you know this?

Not at all

Perfectly

Data structure - definition, common examples?

“the organization of information such that it can be
used efficiently”
common data structures: hash table, string, list, table, (suffix) tree, graph

How well did you know this?

Not at all

Perfectly

Suffix tree - definition, use?

representation of a string - present the suffixes of a given string in a tree
edges are labeled with strings, such that each suffix of a string corresponds to exactly one path from the root to a leaf
*allows efficient searching of this text present the suffixes of a given string in a tree

How well did you know this?

Not at all

Perfectly

Suffix tree - disadvantages / advantages?

disadvantages
* not so efficient to construct
* not so efficient to store
(–> convert to more space-efficient suffix arrays!)

advantages
* very efficient to search
- search is proportional to length of the pattern

How well did you know this?

Not at all

Perfectly

Suffix trees - applications?

Study These Flashcards

many applications in sequence bioinformatics
- keywords/patterns in text
- (maximal) repeats
- unique substrings
- longest common substring
- inexact matches

Study These Flashcards

Parallelization - why? requires? types?

Study These Flashcards

(parallel computing as opposed to serial computing)

why
* clock speed maxed out
* performance improvements now only possible through parallelization

requires
* special hardware and programs
* important: frequent communication between subtasks?

types
* embarrassing, coarse-grained, fine-grained

multi-core computers - where? why? requires?

Study These Flashcards

many-core processors are standard now - desktops, laptops, mobile phones, …
communication between cores is fast / easy
required rewriting of software (parameters)

(computing) clusters - definition/ consists of, disadvantage, requires, example in linux

Study These Flashcards

definition / consists of:
* multiple computers (that are multi-core); nodes
* a login / head / master node
* linked by a fast local area network

disadvantage:
* expensive – but cheaper than a single computer of comparable size & speed

requires:
* application programs: extra programming / parameters
* requires a system to distribute jobs

example in linux:
SLURM

Heterogeneous computing

Study These Flashcards

CPUs
* 2-8 cores, general-purpose

GPUs
* hundreds of cores, highly specialized - thanks to gaming!
* programmable for general purpose computing since 2007: CUDA (NVIDIA)

integration of CPUs and GPUs
–> performance boost
but: specialized programs / programming skills required

cloud computing - definition, advantages, example providors, requirments

Study These Flashcards

definition: virtual machines on many networked computers
advantages:
- emulation of a physical computer (operating system, configuration, software, etc)
- high flexibility
- saving & duplicating a VM is very easy

computing power rented on demand - example:
- Amazon, Google, Microsoft
- some biological data sets available;

limiting step: data upload may otherwise be the limiting step

requires:
- specialized programming required
MapReduce / Hadoop; Go programming language

Big data - skills needed?

Study These Flashcards

know, understand, select efficient & appropriate
algorithms
efficient run time for large data sets
efficient memory usage
efficient data structures
(space efficient, efficient to access)
be able to work on high performance computers
remote computing
parallel computing
organize data / results / documentation
learn to work with many different programs
work under unix/linux, in the terminal

Algorithm - definition

Study These Flashcards

a list set of instructions, used to solve problems or perform tasks

input –>** algorithm **–> output

a sequence of ordered instructions that must be carried out to solve a well-formulated problem
algorithms can be stated in English (or German or Spanish) or in pseudocode
algorithms can be implemented in different ways (e.g., in different programming languages)
a correct algorithm will solve all instances of a problem
given a problem, there may be many correct algorithms that can solve it

Connection between data structure and algorithm?

- Algorithms: complexity/efficiency: one factor for this is space requirement, which can be optimised by using the right data structures - The way the data is accessed by the algorithm differs depending on the data structure

2. Big data Flashcards

(26 cards)