Coding University - Data Structures and Algorithms Flashcards
What is ZeroMQ?
A socket-based system, can be used as a queue, pub/sub, etc.
Carries messages across inproc, IPC, TCP, TIPC, multicast.
Smart patterns like pub-sub, push-pull (pipeline), and router-dealer.
What is ActiveMQ?
Apache ActiveMQ is an open source message broker written in Java.
What is MessagePack?
MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it’s faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.
No IDL.
What is Avro?
Apache Avro is a data serialization system. IDL-based.
Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
What is a Bloom filter?
A Bloom filter is a data structure used to quickly test membership in a set where the number and size of possible elements would be very large. Too large to keep in memory.
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; thus a Bloom filter has a 100% recall rate. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
How can you easily generate multiple hashes for the same element?
Double hashing. This method gives you as many hashes as you need:
hash_i(x) = (hash_a(x) + i * hash_b(x)) mod m
In Python:
import mmh3
mmh3.hash64('foo')  # two 64-bit signed ints, in a tuple
Now you have two 64-bit hashes. Substituting successive values of i into the formula yields as many derived hashes as a Bloom filter needs.
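The double-hashing trick above can be sketched as a small Bloom filter. This is a minimal illustration, not a production implementation; the two base hashes are derived here from an md5 digest (an assumption made so the example needs only the standard library, where the text uses mmh3.hash64):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch using the double-hashing trick."""

    def __init__(self, m, k):
        self.m = m      # number of bits in the filter
        self.k = k      # number of derived hash functions
        self.bits = 0   # bit array, stored as a Python int

    def _hashes(self, item):
        # Split one md5 digest into two 64-bit base hashes (assumption:
        # md5 stands in for mmh3.hash64 from the text).
        digest = hashlib.md5(item.encode()).digest()
        h_a = int.from_bytes(digest[:8], "little")
        h_b = int.from_bytes(digest[8:], "little")
        # hash_i(x) = (hash_a(x) + i * hash_b(x)) mod m
        return [(h_a + i * h_b) % self.m for i in range(self.k)]

    def add(self, item):
        for h in self._hashes(item):
            self.bits |= 1 << h

    def might_contain(self, item):
        # True means "possibly in set"; False means "definitely not in set".
        return all(self.bits >> h & 1 for h in self._hashes(item))
```

Usage: after `bf = BloomFilter(1024, 3); bf.add("foo")`, `bf.might_contain("foo")` is always True (100% recall), while absent items return False except with small false-positive probability.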
What is a cache-oblivious algorithm?
A cache-oblivious algorithm does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for all cache line sizes simultaneously, removing the need to tune or optimize for a given machine.
Optimal cache-oblivious algorithms are known for the Cooley–Tukey FFT algorithm, matrix multiplication, sorting, matrix transposition, and several other problems.
How can you augment a splay tree so you can find how many items are between x and y?
Store size of subtrees at each node.
Find x, splay to root. Each splay, insert, and delete must maintain size in node.
Find y, and along the way add up the sizes in the left subtrees, and 1 for each visited left-hand node.
Splay y to root to ensure balance.
In a maximum flow problem, what is the minimum cut?
The minimum cut is the cut of smallest total capacity separating the source from the sink. By the max-flow min-cut theorem, its capacity equals the maximum flow through the graph.
What is the Ford-Fulkerson algorithm?
The Ford–Fulkerson method or Ford–Fulkerson algorithm (FFA) is a greedy algorithm that computes the maximum flow in a flow network. It is called a "method" instead of an "algorithm" because the approach to finding augmenting paths in a residual graph is not fully specified, or is specified in several implementations with different running times. The name "Ford–Fulkerson" is often also used for the Edmonds–Karp algorithm, which is a specialization of Ford–Fulkerson.
What is the running time for the disjoint set data structure?
Due to merging smaller disjoint sets into larger ones (called union by rank) during union, and performing path compression during find, the amortized time per operation is only O(alpha(n)), where alpha(n) is the inverse of the extremely fast-growing Ackermann function. Since the Ackermann function grows so quickly, alpha(n) is less than 5 for all remotely practical values of n. Thus, the amortized running time per operation is effectively a small constant.
The worst-case for find() is Theta(log u) where u is the number of unions, and no finds have been done to allow for path compression yet.
What Python flag turns on optimizations and removes assertions from code?
python -O
Why is doing work in a constructor a bad thing?
It can make your code harder to test.
What should be avoided to ensure testing is easier/possible?
- static methods and properties
- final keyword
- use of new in methods (use dependency injection)
What are some guidelines to keep in mind to not violate the dependency inversion principle?
- No variable should have a concrete class type. An abstract type is better.
- No class should derive from a concrete class.
- No method should override an implemented method of any of its base classes.
These are guidelines and may not be feasible all the time.
What is separate chaining?
In hash table conflict resolution, each bucket is independent and has some sort of linked list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation.
In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that. Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient for a fairly large number of entries per bucket are not needed or desirable. If these cases happen often, the hashing function needs to be fixed.
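The separate-chaining scheme can be sketched as follows (a minimal illustration; the class and method names are made up for the example, and Python lists stand in for the per-bucket linked lists):

```python
class ChainedHashTable:
    """Sketch of separate chaining: each bucket holds a list of
    (key, value) pairs whose keys hashed to the same index."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Constant-time bucket lookup; list operations come on top.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: append to the chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```

With a good hash function and a sensible load factor, each chain stays short, so `get` and `put` are expected constant time.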
What is open addressing?
In hash table conflict resolution, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing", which usually mean separate chaining.)
What is the length of the longest chain in a hash table using separate chaining?
The expected chain length is alpha, the load factor n/m, so a search takes O(1 + alpha) expected time.
Since uniform hashing is difficult to achieve in practice, what is a great alternative?
double hashing
How can you test if a number is odd in bitwise operations?
return (x & 1)
How can you test if a number is even in bitwise operations?
return (x & 1) == 0
What is another name for a breadth-first search traversal?
Level-order traversal.
What is a 2-3-4 tree?
2–3–4 tree (also called a 2–4 tree) is a self-balancing data structure that is commonly used to implement dictionaries. The numbers mean a tree where every node with children (internal node) has either two, three, or four child nodes:
- 2-node has one data element, and if internal has two child nodes;
- 3-node has two data elements, and if internal has three child nodes;
- 4-node has three data elements, and if internal has four child nodes.
What is the complexity of all operations on a splay tree?
O(log n) on average.
A single operation Theta(n) in the worst case.
What is the maximum height of a red-black tree?
2 log n
In a b-tree, how many children are there per node?
A node with k keys has k + 1 children.
root: 1 to 2t-1 keys
non-root: t-1 to 2t-1 keys, i.e. t to 2t children
t could be up to 100, or more.
Leaves are all at the same level.
What does the max degree of a b-tree depend on?
The number of items being stored, and page size based on disk characteristics.
A b-tree’s data is organized to correspond with what?
Pages on disk.
Give an example of how a b-tree might be organized.
1024 children per node.
Store root in memory.
3 nodes accessed gets us 1024^3 disk pages.
4 nodes accessed gets us 1024^4 disk pages.
On descending a b-tree, what’s the rule?
Never step into a minimal node.
On insertion in a b-tree, what’s the rule?
Never step into a full node.
How many nodes are in a compressed trie with k leaves (big-O)?
O(k) nodes with k leaves due to compression.
What is a suffix tree?
A suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.
The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provide one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string’s suffix tree typically requires significantly more space than storing the string itself.
In brief, how does selection sort work?
On each pass, find the minimum item in the unsorted portion and swap it into the leftmost unsorted position.
When can insertion sort run in n log n time?
Load the items into a balanced binary search tree (n inserts at O(log n) each), then do an inorder traversal.
How can you speed up selection sort with a heap?
Replace the unsorted portion with a min-heap. Gives O(log n) removal. Makes n log n overall.
What data structure is well suited for a heap sort and which is bad?
Array - good
Linked list - clumsy
What data structure is well suited for a merge sort and which is just okay?
Linked list - a natural
Array - just okay; the standard merge step cannot easily be done in place.
How can you optimize finding a pivot when the segment to pivot is large (not random choice)?
Choose a median of three.
What is counting sort?
Counting sort is an algorithm for sorting a collection of objects according to keys that are small integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that have each distinct key value, and using arithmetic on those counts to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum and minimum key values, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items. However, it is often used as a subroutine in another sorting algorithm, radix sort, that can handle larger keys more efficiently.
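The counting idea above can be sketched in a few lines (a minimal version that emits keys directly rather than computing output positions, which is enough when sorting bare integer keys):

```python
def counting_sort(items):
    """Sketch of counting sort for integer keys.

    Runs in O(n + q), where q is the range of key values.
    """
    if not items:
        return []
    lo, hi = min(items), max(items)
    counts = [0] * (hi - lo + 1)
    for x in items:
        counts[x - lo] += 1        # tally each distinct key
    out = []
    for key, count in enumerate(counts):
        out.extend([key + lo] * count)
    return out
```

The full position-computing variant (prefix sums over `counts`) is what makes counting sort stable and usable as the per-digit pass inside radix sort.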
What is radix sort?
Radix sort is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by the individual digits which share the same significant position and value.
Two classifications of radix sorts are least significant digit (LSD) radix sorts and most significant digit (MSD) radix sorts. LSD radix sorts process the integer representations starting from the least digit and move towards the most significant digit. MSD radix sorts work the other way around.
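An LSD radix sort can be sketched as repeated stable bucketing by digit, least significant first (a minimal sketch for non-negative integers; bucket lists stand in for the stable counting-sort pass):

```python
def radix_sort(items, base=10):
    """Sketch of an LSD radix sort for non-negative integers."""
    if not items:
        return []
    items = list(items)
    max_val = max(items)
    exp = 1
    while max_val // exp > 0:
        # One stable pass: distribute by the current digit, then collect.
        buckets = [[] for _ in range(base)]
        for x in items:
            buckets[(x // exp) % base].append(x)
        items = [x for bucket in buckets for x in bucket]
        exp *= base
    return items
```

Because each pass is stable, ties on the current digit preserve the order established by previous (less significant) passes, which is what makes the overall sort correct.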
What is the counting sort running time?
O(q + n) where q is the range of key values. If q is in O(n), then linear time.
What radix is most natural to use?
A power of 2 radix.
How would radix sort work for IEEE floating point numbers?
Flip all bits of negative numbers and flip only the sign bit of non-negative numbers, sort the results as unsigned integers, then invert the transformation.
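The bit-flipping trick can be sketched as a sort key (a minimal sketch; it maps each IEEE-754 double to an unsigned integer whose natural ordering matches the float ordering, so any integer sort, radix included, can then be used):

```python
import struct

def float_sort_key(x):
    """Map a double to an unsigned 64-bit int that preserves ordering."""
    # Reinterpret the float's bytes as a big-endian unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits & (1 << 63):
        # Negative: flip every bit, so more-negative values map lower.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Non-negative: flip only the sign bit, lifting positives above
    # all (flipped) negatives.
    return bits | (1 << 63)
```

Usage: `sorted(values, key=float_sort_key)` orders floats correctly; in a real radix sort the same mapping would be applied before the digit passes and inverted afterwards.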
How to choose q for radix sort?
Choose q within a power of 2 of n, which keeps the number of passes small. A good rule is the largest power of 2 not exceeding n.
To save memory, round sqrt(n) down to a power of 2 instead, at the cost of roughly twice as many passes.
What operations are a treap optimized for?
- union
- intersection
- difference
What is the Day–Stout–Warren (DSW) algorithm?
The Day–Stout–Warren (DSW) algorithm is a method for efficiently balancing binary search trees — that is, decreasing their height to O(log n) nodes, where n is the total number of nodes. Unlike a self-balancing binary search tree, it does not do this incrementally during each operation, but periodically, so that its cost can be amortized over many operations.
What is the insertion sort algorithm?
for (i = 0; i < n; ++i) {
    j = i;
    while (j > 0 && a[j - 1] > a[j]) {
        swap(a, j, j - 1);
        j -= 1;
    }
}
Is radix sort stable?
yes
What is the algorithmic time complexity of radix sort?
O(d(n + b)), where d is the number of digits, n the number of items, and b the base. Effectively linear when d is a constant.
Give the code for selection sort.
for (i = 0; i < n; ++i) {
    min_index = i;
    for (j = i; j < n; ++j) {
        if (a[j] < a[min_index]) {
            min_index = j;
        }
    }
    swap(a, i, min_index);
}
All comparison-based sorting is bounded by what complexity?
Omega(n log n)
What do you call a linear ordering of a directed graph of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering?
Topological sort
What is a good method for performing a topological sort?
- Calculate in-degree for each node. O(v + e)
- Go through 0s, add to queue.
- For each item in queue, look at each connection, and decrement in-degree of each, if they got to 0, add to queue, repeat.
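The steps above can be sketched in Python (the adjacency-dict representation, mapping each node to its outgoing neighbors, is an assumption for the example):

```python
from collections import deque

def topological_sort(graph):
    """Kahn's algorithm sketch: repeatedly emit in-degree-0 nodes."""
    in_degree = {node: 0 for node in graph}
    for node in graph:
        for neighbor in graph[node]:
            in_degree[neighbor] += 1

    # Seed the queue with every node that has no incoming edges.
    queue = deque(node for node, deg in in_degree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)

    if len(order) != len(graph):
        raise ValueError("graph has a cycle")
    return order
```

If the queue empties before every node is emitted, some in-degree never reached 0, which means the graph contains a cycle.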
How many possible trees are there that span all nodes in a graph?
Up to n^(n-2) in a complete graph on n vertices (Cayley's formula); exponentially many in general.
What is Prim’s algorithm?
def prim(self):
    """Returns a dictionary of parents of vertices in a minimum spanning tree.

    :rtype: dict
    """
    s = set()
    q = queue.PriorityQueue()
    parents = {}
    start_weight = float("inf")
    weights = {}  # since we can't peek into the queue

    for i in self.get_vertex():
        weight = start_weight
        if i == 0:
            q.put([0, i])
        weights[i] = weight
        parents[i] = None

    while not q.empty():
        v_tuple = q.get()
        vertex = v_tuple[1]
        s.add(vertex)
        for u in self.get_neighbor(vertex):
            if u.vertex not in s:
                if u.weight < weights[u.vertex]:
                    parents[u.vertex] = vertex
                    weights[u.vertex] = u.weight
                    q.put([u.weight, u.vertex])

    return parents
What is the time complexity of Prim’s algorithm on an adjacency matrix?
O(v^2)
What is the time complexity of Prim’s algorithm on an adjacency list and a binary heap?
O(e log v)
derived from:
O((e + v) log v)
What is the time complexity of Prim’s algorithm on an adjacency list and a Fibonacci heap?
O(e + v log v)
What is the pseudocode Kruskal’s algorithm?
KRUSKAL(G):
    A = ∅
    foreach v ∈ G.V:
        MAKE-SET(v)
    foreach (u, v) ∈ G.E ordered by weight(u, v), increasing:
        if FIND-SET(u) ≠ FIND-SET(v):
            A = A ∪ {(u, v)}
            UNION(u, v)
    return A
What is the time complexity of Kruskal’s algorithm?
O(E log V)
or
O(e log e + e α(v) + v)
What is Kruskal’s algorithm?
Kruskal’s algorithm is a minimum-spanning-tree algorithm which finds an edge of the least possible weight that connects any two trees in the forest. It is a greedy algorithm in graph theory as it finds a minimum spanning tree for a connected weighted graph adding increasing cost arcs at each step. This means it finds a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. If the graph is not connected, then it finds a minimum spanning forest (a minimum spanning tree for each connected component).
How can you find the number of connected components?
For each node:
if node not yet visited, increment component count and do DFS.
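The counting loop above can be sketched with an iterative DFS (the adjacency-dict representation of an undirected graph is an assumption for the example):

```python
def count_components(graph):
    """Count connected components of an undirected graph."""
    visited = set()
    components = 0
    for start in graph:
        if start in visited:
            continue
        components += 1      # a new, unvisited node starts a new component
        stack = [start]
        while stack:         # DFS marks everything reachable from start
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            stack.extend(graph[node])
    return components
```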
How can you get a topological sort with DFS?
Do a DFS, and when each node is being marked as complete, add node to a list.
Reverse the list.
How can you check for a cycle with DFS?
For each neighbor of the current node:
- if the neighbor is not yet visited, recurse with DFS
- if the neighbor is already visited and is not the current node's parent, there is a cycle
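That check can be sketched in Python for an undirected graph (the adjacency-dict representation is an assumption for the example):

```python
def has_cycle(graph):
    """DFS cycle detection in an undirected graph: a visited neighbor
    that is not the node we just came from closes a cycle."""
    visited = set()

    def dfs(node, parent):
        visited.add(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                if dfs(neighbor, node):
                    return True
            elif neighbor != parent:   # back edge: cycle found
                return True
        return False

    # Check every component, not just the first one reached.
    return any(dfs(n, None) for n in graph if n not in visited)
```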
How can you get the strongly connected components of a graph?
- DFS - calculate the finish times for each node
- Reverse the edges in the graph
- Call DFS on nodes in reverse graph in reverse order of finishing times.
How do you reverse the edges in a directed graph represented as an adjacency matrix?
Transpose the matrix, so [i, j] becomes [j, i]
How can you find the shortest path on a DAG?
- Topological sort
2. follow the topological sort, relaxing edges
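The relaxation step can be sketched as follows (a minimal single-source version; the representation, mapping each node to a list of `(neighbor, weight)` pairs plus a precomputed topological order, is an assumption for the example):

```python
def dag_shortest_paths(graph, source, order):
    """Single-source shortest paths on a DAG.

    `order` is a topological order of the nodes (e.g. from Kahn's
    algorithm); processing nodes in that order means every edge is
    relaxed after its tail's distance is final.
    """
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    for node in order:
        if dist[node] == float("inf"):
            continue                    # unreachable so far; skip
        for neighbor, weight in graph[node]:
            if dist[node] + weight < dist[neighbor]:
                dist[neighbor] = dist[node] + weight  # relax the edge
    return dist
```

Because each edge is relaxed exactly once, the whole computation is O(V + E), and negative edge weights are handled for free, which is what makes the negate-and-relax trick for longest paths work.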
How to find the longest path on a weighted DAG?
- Set all edges to their negative weight.
- Topological sort
- follow the topological sort, relaxing edges
What is the diameter of a graph?
The greatest distance between any pair of vertices; that is, the length of the longest shortest path. To find the diameter of a graph, first find the shortest path between each pair of vertices. The greatest length of any of these paths is the diameter of the graph.
Under what condition can you not use Dijkstra's algorithm?
When the graph contains a negative edge weight. Dijkstra's greedy choice assumes that a finalized vertex's distance can never improve, which fails once negative edges exist; with a negative cycle, shortest paths are not even well defined.
In plain words, how does Kruskal’s algorithm work?
- Create a set T and list for result
- Make a list of all edges in G
- Sort edges by weight, from least to greatest.
- Iterate edges in sorted order.
- For each edge (u, v), if u and v are not already in the same set (checked with FIND), UNION them and add the edge to the result list.
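The steps above can be sketched with a small union-find (a minimal sketch; vertices are assumed to be numbered 0..num_vertices-1 and edges given as `(weight, u, v)` tuples, both assumptions made for the example):

```python
def kruskal(num_vertices, edges):
    """Kruskal's algorithm sketch with a union-find forest."""
    parent = list(range(num_vertices))

    def find(x):
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for weight, u, v in sorted(edges):      # edges by increasing weight
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # different trees: safe to join
            parent[root_u] = root_v
            mst.append((weight, u, v))
    return mst
```

Sorting the edges dominates, giving the O(E log E) = O(E log V) bound from the previous card.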
What can most dynamic programming problems be expressed as?
Finding the shortest path in a DAG. Formulating it this way often lets you solve the problem in time linear (or near-linear) in the size of the DAG.
What metric can you use to measure the badness of a line in a text justification problem?
(page width - text width)^3
Minimize the sum of the badness of the lines.
How can you tell if a graph is 2-colorable?
If it’s bipartite. All trees are bipartite.
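A 2-colorability check can be sketched as a BFS coloring (the adjacency-dict representation of an undirected graph is an assumption for the example):

```python
from collections import deque

def is_bipartite(graph):
    """BFS 2-coloring: the graph is 2-colorable iff no edge ever
    connects two nodes of the same color."""
    color = {}
    for start in graph:          # handle every component
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in color:
                    color[neighbor] = 1 - color[node]  # opposite color
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    return False                       # odd cycle found
    return True
```

A graph fails this check exactly when it contains an odd-length cycle, which is why all trees (no cycles at all) are bipartite.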
What is it called when you have too many base cases in your recursion?
arm’s length recursion
What is the base case of a recursion?
The code required to give the solution to the smallest subproblem.
What is the formula for n choose k?
n! / (k! (n - k)!)
What is the general outline of a backtracking algorithm?
def solve(conf):
    if no more choices:
        return conf
    choices = get_available_choices(conf)
    for c in choices:
        make choice c
        if solve(conf using c):
            return True
        unmake choice c
    return False
At how many items should you expect a collision when hashing among n buckets?
At about sqrt(n) items the probability of a collision reaches roughly 1/2 (the birthday paradox).
What is n/n^2?
1/n
What does it mean when a problem is NP-Hard?
It is at least as hard as any problem in NP. A problem X is NP-Hard if every problem Y in NP reduces to X.
Is "3-D matching" NP-Complete?
Yes
Is "triple coloring a graph" NP-Complete?
Yes
Is "two coloring a graph" NP-Complete?
No
Is "subset sum" NP-Complete?
Yes
Is "bin packing" NP-Complete?
Yes
Is "vertex cover" NP-Complete?
Yes
Is "set cover" NP-Complete?
Yes
Name some NP-Complete problems.
- tsp
- knapsack problem
- satisfiability
- 3D matching
- tricoloring
- subset sum
- rectangle packing
- bin packing
- vertex cover
- set cover
What is one way of doing approximate traveling salesman?
Select a vertex as root.
Build a MST.
Do a preorder traversal, store nodes in H.
Return H (a Hamiltonian cycle)
How can an LRU cache be implemented with a linked list?
When an item is accessed, it moves to the head of the list.
The trailing items can be overwritten with new items, or removed.
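The linked-list scheme above can be sketched with `collections.OrderedDict`, which plays the role of the list (move an entry to the head on access, drop from the tail on eviction); this is a minimal sketch, and the class name is made up for the example:

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache sketch: OrderedDict keeps entries in recency order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)        # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
```

A hand-rolled version would pair a doubly linked list (for O(1) move-to-head and tail removal) with a hash map from key to list node, which is exactly what OrderedDict does internally.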
What is a skip list?
A data structure that allows fast search within an ordered sequence of elements. Fast search is made possible by maintaining a linked hierarchy of subsequences, with each successive subsequence skipping over fewer elements than the previous one. Searching starts in the sparsest subsequence until two consecutive elements have been found, one smaller and one larger than or equal to the element searched for.
A skip list is built in layers. The bottom layer is an ordinary ordered linked list. Each higher layer acts as an "express lane" for the lists below, where an element in layer i appears in layer i+1 with some fixed probability p (two commonly used values for p are 1/2 and 1/4).
What operations does a skip list support and what is their avg and worst case times?
search: O(log n) O(n)
insert: O(log n) O(n)
delete: O(log n) O(n)
What operations does a van Emde Boas tree support and what are the time complexities?
All are O(log log M), where M = 2^m is the size of the universe of keys that can be stored; equivalently O(log m), where m is the number of bits per key.
Space: O(M)
Search
Insert
Delete
Predecessor
Successor
What are the complexities for treap operations?
For all the basic maintenance operations, they are O(log n) average case and O(n) worst case.
- Search
- Insert
- Delete
For these operations, O(m log(n/m)) for treaps of sizes m and n, with m ≤ n.
- union
- intersection
- difference