2. Big data Flashcards

1
Q

How to improve a task (memory requirement / speed)?

A
  • fast computer?

next three tied together:
* algorithm - appropriate & efficient
* data structure - efficient to use/access/query, space efficient to store
* parallelization - to split up tasks and run in parallel

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Most realistic time complexity for algorithms in sequence bioinformatics ?

A

exponential; polynomial

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Define time complexity

examples?

A

Time complexity is defined as the amount of time taken by an algorithm to run, as a function of the length of the input.

= how does the running time of an algorithm increase
with the size of the input?

= most useful way to measure effectiveness of an algorithm

eg. constant, logarithmic, linear, polynomial, exponential

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Algorithms - classifications? + examples

A

by complexity
* logarithmic, linear, quadratic, polynomial, etc

… by design
* brute force
* heuristic (zB greedy, seed and extend
* recursion (dynamic programming)

… by data structure
* hash table
* graphs (zB trees)

… by (biological) problem
* pattern search

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Brute force algorithm?
- definition
- use
- running time

A
  • try all possibilities to find the optimal solution –> output: the optimal solution
  • can only be used for small problems in bioinformatics
  • often have exponential running time
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Heuristic algorithm?
- definition
- use

A

don’t guarantee that the optimal solution is found, because not all solutions are evaluated

are a good compromise between speed and accuracy

many different types / designs exist and are used

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Two examples for heuristic algorithms

A
  • greedy algorithms
  • seed-and-extend = seed-and-filter-and-extend
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Greedy logarithm - definition, use ?

A
  • Heuristic logarithm
  • at each step, the algorithm takes the best immediate solution while finding an answer –> may not find the best possible overall solution
  • used for optimization problems
  • used for sequence alignment & assembly, phylogenetic inference, etc
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Seed-and-extend
definition, use?

A

Seed-and-extend = seed-and-filter-and-extend

  • observation: true matches contain short, exactly matching kmers (seeds)
  • these are easy & fast to find (compared to longer / inexact matches), as can exclude any point mutations
  • extend these short matches in a second step, and only these (filter)
  • used for alignment
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Subsampling of k-mers - name 3 types?

A

minimizers, syncmers, strobemers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Subsampling - Minimizers?

A

saves on space /ram

such that the same k-mers will be selected from similar sequences

doi:10.1093/bioinformatics/bth408

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Subsampling - syncmers?

A

within a window, select the k-mer(s) with the smallest s-mer at the start or end
- suited for comparison of seqs with mismatches

https://doi.org/10.7717/peerj.10805

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Data structure - definition, common examples?

A
  • “the organization of information such that it can be
    used efficiently”
  • common data structures: hash table, string, list, table, (suffix) tree, graph
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Suffix tree - definition, use?

A
  • representation of a string - present the suffixes of a given string in a tree
  • edges are labeled with strings, such that each suffix of a string corresponds to exactly one path from the root to a leaf
    *allows efficient searching of this text present the suffixes of a given string in a tree
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Suffix tree - disadvantages / advantages?

A

disadvantages
* not so efficient to construct
* not so efficient to store
(–> convert to more space-efficient suffix arrays!)

advantages
* very efficient to search
- search is proportional to length of the pattern

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Suffix trees - applications?

A

many applications in sequence bioinformatics
- keywords/patterns in text
- (maximal) repeats
- unique substrings
- longest common substring
- inexact matches

17
Q
A
18
Q

Parallelization - why? requires? types?

A

(parallel computing as opposed to serial computing)

why
* clock speed maxed out
* performance improvements now only possible through parallelization

requires
* special hardware and programs
* important: frequent communication between subtasks?

types
* embarrassing, coarse-grained, fine-grained

19
Q

multi-core computers - where? why? requires?

A
  • many-core processors are standard now - desktops, laptops, mobile phones, …
  • communication between cores is fast / easy
  • required rewriting of software (parameters)
20
Q

(computing) clusters - definition/ consists of, disadvantage, requires, example in linux

A

definition / consists of:
* multiple computers (that are multi-core); nodes
* a login / head / master node
* linked by a fast local area network

disadvantage:
* expensive – but cheaper than a single computer of comparable size & speed

requires:
* application programs: extra programming / parameters
* requires a system to distribute jobs

example in linux:
SLURM

21
Q

Heterogeneous computing

A

CPUs
* 2-8 cores, general-purpose

GPUs
* hundreds of cores, highly specialized - thanks to gaming!
* programmable for general purpose computing since 2007: CUDA (NVIDIA)

integration of CPUs and GPUs
–> performance boost
but: specialized programs / programming skills required

22
Q

cloud computing - definition, advantages, example providors, requirments

A

definition: virtual machines on many networked computers
advantages:
- emulation of a physical computer (operating system, configuration, software, etc)
- high flexibility
- saving & duplicating a VM is very easy

computing power rented on demand - example:
- Amazon, Google, Microsoft
- some biological data sets available;

limiting step: data upload may otherwise be the limiting step

requires:
- specialized programming required
MapReduce / Hadoop; Go programming language

23
Q

Big data - skills needed?

A
  • know, understand, select efficient & appropriate
    algorithms
  • efficient run time for large data sets
  • efficient memory usage
  • efficient data structures
    (space efficient, efficient to access)
  • be able to work on high performance computers
  • remote computing
  • parallel computing
  • organize data / results / documentation
  • learn to work with many different programs
  • work under unix/linux, in the terminal
24
Q

Algorithm - definition

A

a list set of instructions, used to solve problems or perform tasks

input –>** algorithm **–> output

  • a sequence of ordered instructions that must be carried out to solve a well-formulated problem
  • algorithms can be stated in English (or German or Spanish) or in pseudocode
  • algorithms can be implemented in different ways (e.g., in different programming languages)
  • a correct algorithm will solve all instances of a problem
  • given a problem, there may be many correct algorithms that can solve it
25
Q

Connection between data structure and algorithm?

A
  • Algorithms: complexity/efficiency: one factor for this is space requirement, which can be optimised by using the right data structures
  • The way the data is accessed by the algorithm differs depending on the data structure
26
Q
A