coding university - Data structure and algorithms Flashcards
What is ZeroMQ?
A socket-based system, can be used as a queue, pub/sub, etc.
Carries messages across inproc, IPC, TCP, TIPC, multicast.
Smart patterns like pub-sub, push-pull (pipeline), and router-dealer.
What is ActiveMQ?
Apache ActiveMQ is an open source message broker written in Java.
What is MessagePack?
MessagePack is an efficient binary serialization format. Like JSON, it lets you exchange data among multiple languages, but it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.
No IDL.
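As a rough illustration of the size claims above, this sketch hand-encodes only two MessagePack types: positive fixint and fixstr. The name pack_small is hypothetical and is not the API of any MessagePack library; a real library covers the whole format.

```python
def pack_small(value):
    """Encode a small non-negative int or a short string, MessagePack style."""
    if isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 0x7F:
        # positive fixint: the value itself is the single encoded byte
        return bytes([value])
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) <= 31:
            # fixstr: one header byte (0xa0 | length), then the UTF-8 bytes
            return bytes([0xA0 | len(data)]) + data
    raise ValueError("outside the two cases this sketch handles")


print(pack_small(5).hex())      # 05
print(pack_small("foo").hex())  # a3666f6f
```

The string "foo" takes 4 bytes: 3 for the text plus 1 header byte.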
What is Avro?
Apache Avro is a data serialization system. IDL-based.
Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
What is a Bloom filter?
A Bloom filter is a data structure used to quickly test membership in a set where the number and size of possible elements would be very large. Too large to keep in memory.
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives.
How can you easily generate multiple hashes for the same element?
Double hashing. This method gives you as many hashes as you need:
hash_i(x, m) = (hash_a(x) + i * hash_b(x)) mod m
In Python:
import mmh3
mmh3.hash64('foo')  # two 64-bit signed ints, in a tuple
Now you have two 64-bit hashes. Substituting i = 0, 1, 2, ... gives you as many hashes as a Bloom filter needs.
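A minimal Bloom filter sketch built on the double-hashing formula. To stay within the standard library it takes the two base hashes from a SHA-256 digest instead of mmh3; that substitution, the class name, the default sizes, and forcing the second hash to be odd are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m = m                 # number of bit positions
        self.k = k                 # number of hashes per element
        self.bits = bytearray(m)   # one byte per position, for clarity not compactness

    def _indexes(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        hash_a = int.from_bytes(digest[:8], "big")
        # assumption: force the step to be odd so it is never 0 mod an even m
        hash_b = int.from_bytes(digest[8:16], "big") | 1
        # double hashing: hash_i(x) = (hash_a(x) + i * hash_b(x)) mod m
        return [(hash_a + i * hash_b) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True means "possibly in set"; False means "definitely not in set"
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter()
bf.add("foo")
print(bf.might_contain("foo"))  # True: no false negatives
```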
What is a cache-oblivious algorithm?
A cache-oblivious algorithm does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for all cache line sizes simultaneously, removing the need to tune or optimize for a given machine.
Optimal cache-oblivious algorithms are known for the Cooley–Tukey FFT algorithm, matrix multiplication, sorting, matrix transposition, and several other problems.
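A sketch of the divide-and-conquer shape behind cache-oblivious matrix transposition: keep halving the longer dimension until the block is tiny, with no cache-size parameter anywhere. Python lists of lists do not have the contiguous layout the analysis assumes, so this shows the structure only; the function name and the 2x2 base case are assumptions.

```python
def transpose(src, dst, r0=0, r1=None, c0=0, c1=None):
    """Write the transpose of the block src[r0:r1][c0:c1] into dst."""
    if r1 is None:
        r1, c1 = len(src), len(src[0])
    rows, cols = r1 - r0, c1 - c0
    if rows <= 2 and cols <= 2:
        # base case: block is small enough to copy directly
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        # split the longer dimension (rows) in half and recurse
        mid = r0 + rows // 2
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:
        # split the longer dimension (columns) in half and recurse
        mid = c0 + cols // 2
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)


src = [[1, 2, 3], [4, 5, 6]]
dst = [[0] * 2 for _ in range(3)]
transpose(src, dst)
print(dst)  # [[1, 4], [2, 5], [3, 6]]
```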
How can you augment a splay tree so you can find how many items are between x and y?
Store size of subtrees at each node.
Find x, splay to root. Each splay, insert, and delete must maintain size in node.
Find y, and along the way add up the sizes in the left subtrees, and 1 for each visited left-hand node.
Splay y to root to ensure balance.
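The size bookkeeping can be sketched on a plain, unbalanced binary search tree. Splaying is deliberately left out here, so this shows only the subtree-size idea, not a splay tree. All names are hypothetical, and "between" is assumed to mean x <= key <= y.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here


def size(node):
    return node.size if node else 0


def insert(node, key):
    """Insert key (duplicates ignored) and repair sizes on the way back up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def rank(node, key, inclusive):
    """Count stored keys < key, or <= key when inclusive is True."""
    total = 0
    while node:
        if key < node.key or (key == node.key and not inclusive):
            node = node.left
        else:
            # this node and its entire left subtree are counted
            total += 1 + size(node.left)
            node = node.right
    return total


def count_between(root, x, y):
    return rank(root, y, True) - rank(root, x, False)


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
print(count_between(root, 30, 60))  # 4 (30, 40, 50, 60)
```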
In a maximum flow problem, what is the minimum cut?
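A cut splits the vertices into two sets, one containing the source and the other the sink; its capacity is the total capacity of the edges going from the source side to the sink side. The minimum cut is the cut with the smallest capacity, and by the max-flow min-cut theorem that capacity equals the value of the maximum flow through the graph.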
What is the Ford-Fulkerson algorithm?
The Ford–Fulkerson method or Ford–Fulkerson algorithm (FFA) is a greedy algorithm that computes the maximum flow in a flow network. It is called a “method” instead of an “algorithm” as the approach to finding augmenting paths in a residual graph is not fully specified or it is specified in several implementations with different running times. The name “Ford–Fulkerson” is often also used for the Edmonds–Karp algorithm, which is a specialization of Ford–Fulkerson.
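A compact sketch of the Edmonds-Karp specialization, which picks augmenting paths by breadth-first search. The dict-of-dicts graph format, the function name, and the sample network are assumptions made for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """capacity[u][v] is the capacity of edge u -> v. Returns the max flow value."""
    if source == sink:
        raise ValueError("source and sink must differ")
    # residual capacities, with reverse edges starting at 0
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        # find the bottleneck capacity along the path
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # push flow: reduce forward edges, grow reverse edges
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


network = {"s": {"a": 3, "b": 2}, "a": {"t": 2, "b": 1}, "b": {"t": 3}}
print(max_flow(network, "s", "t"))  # 5
```

In this sample the two edges leaving s have total capacity 3 + 2 = 5, and that cut is a minimum cut, matching the flow value.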
What is the running time for the disjoint set data structure?
Because union merges the smaller set into the larger one (called union by rank) and find performs path compression, the amortized time per operation is only O(alpha(n)), where alpha(n) is the inverse of the function f(x) = A(x, x) and A is the extremely fast-growing Ackermann function. Because A grows so quickly, alpha(n) is less than 5 for all remotely practical values of n. Thus, the amortized running time per operation is effectively a small constant.
The worst case for a single find() is Theta(log u), where u is the number of unions performed; this happens when no finds have yet been done to compress the paths.
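A minimal sketch of a disjoint set with both optimizations, union by rank and path compression. The class and method names are assumptions.

```python
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own root
        self.rank = [0] * n           # upper bound on tree height per root

    def find(self, x):
        # first pass: walk up to the root
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # second pass (path compression): point every visited node at the root
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        """Merge the sets holding a and b. Returns False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # union by rank: attach the lower-rank root under the higher-rank root
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


ds = DisjointSet(5)
ds.union(0, 1)
ds.union(1, 2)
print(ds.find(0) == ds.find(2))  # True
print(ds.find(0) == ds.find(3))  # False
```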
What Python flag turns on optimizations and removes assertions from code?
python -O
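One way to see the effect is to run the same snippet in a fresh interpreter with and without the flag. The helper name run and the snippet text are made up for this sketch.

```python
import subprocess
import sys

SNIPPET = "assert False, 'assertion ran'\nprint('assertions skipped, __debug__ =', __debug__)"


def run(flags):
    """Run SNIPPET in a new interpreter; return (exit code, stdout)."""
    result = subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()


print(run([]))      # non-zero exit code: the assert fires
print(run(["-O"]))  # exit code 0: the assert is removed and __debug__ is False
```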
Why is doing work in a constructor a bad thing?
It can make your code harder to test.
What should be avoided to ensure testing is easier/possible?
- static methods and properties
- final keyword
- use of new in methods (use dependency injection)
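A small sketch of the last point, written in Python even though "new" and "final" come from Java-style languages. Both class names are hypothetical. The first class builds its own dependency; the second has it injected, so a test can substitute a fake.

```python
import time


class SessionHardToTest:
    def __init__(self):
        # the constructor does real work with a hard-wired dependency,
        # so a test cannot control the timestamp
        self.started = time.time()

    def elapsed(self):
        return time.time() - self.started


class Session:
    def __init__(self, clock=time.time):
        # the clock is injected; the constructor only stores it
        self.clock = clock
        self.started = None

    def start(self):
        self.started = self.clock()

    def elapsed(self):
        return self.clock() - self.started


# in a test, inject a fake clock that returns scripted values
ticks = iter([100.0, 130.0])
session = Session(clock=lambda: next(ticks))
session.start()
print(session.elapsed())  # 30.0
```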
What are some guidelines to keep in mind to not violate the dependency inversion principle?
- No variable should have a concrete class type. An abstract type is better.
- No class should derive from a concrete class.
- No method should override an implemented method of any of its base classes.
These are guidelines and may not be feasible all the time.
What is separate chaining?
In hash table conflict resolution, each bucket is independent and has some sort of linked list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation.
In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that. Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient for a fairly large number of entries per bucket are not needed or desirable. If these cases happen often, the hashing function needs to be fixed.
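A minimal separate-chaining sketch. Each bucket is a Python list standing in for the linked list, and the table never resizes. Both are simplifications, and the names are assumptions.

```python
class ChainedHashTable:
    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]

    def _bucket(self, key):
        # constant-time step: the hash picks the bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        # list operation: scan the chain for an existing entry
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default


table = ChainedHashTable(buckets=2)
table.put(1, "one")
table.put(3, "three")  # 1 and 3 share a bucket when there are 2 buckets
print(table.get(3))    # three
```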
What is open addressing?
In hash table conflict resolution, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. The name “open addressing” refers to the fact that the location (“address”) of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with “open hashing” or “closed addressing”, which usually mean separate chaining.)
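A minimal open-addressing sketch. It uses linear probing as the probe sequence, which is only one possible choice, and it omits deletion and resizing. The names are assumptions.

```python
class OpenAddressingTable:
    _EMPTY = object()  # marker for a never-used slot

    def __init__(self, capacity=8):
        self.slots = [self._EMPTY] * capacity

    def _probe(self, key):
        # linear probing: start at the hashed-to slot, then step by 1, wrapping around
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def put(self, key, value):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is self._EMPTY or slot[0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is self._EMPTY:
                return default  # unused slot: the key is not in the table
            if slot[0] == key:
                return slot[1]
        return default


table = OpenAddressingTable(capacity=4)
table.put(0, "a")
table.put(4, "b")      # hashes to slot 0 as well, so it lands in slot 1
print(table.get(4))    # b
print(table.slots[1])  # (4, 'b')
```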
What is the length of the longest chain in a hash table using separate chaining?
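O(1 + alpha) is the expected length of a chain, and so the expected cost of a lookup, where alpha is the load factor n/m (n entries in m buckets), assuming keys are spread uniformly. The single longest chain can be longer than this average.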
Since uniform hashing is difficult to achieve in practice, what is a great alternative?
double hashing
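A sketch of a double-hashing probe sequence for open addressing. Deriving both hashes from Python's built-in hash() and requiring a prime table size m are assumptions made here so that the step is never zero and every slot gets visited.

```python
def probe_sequence(key, m):
    """Yield all m slot indexes for key in double-hashing order (m assumed prime)."""
    h1 = hash(key) % m
    h2 = 1 + (hash(key) % (m - 1))  # step size in 1..m-1, never zero
    for i in range(m):
        yield (h1 + i * h2) % m


print(list(probe_sequence(10, 7)))  # [3, 1, 6, 4, 2, 0, 5]
```

Two keys that start at the same slot usually get different step sizes, so they do not keep colliding along one shared path as they would with linear probing.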
How can you test if a number is odd in bitwise operations?
return (x & 1)
How can you test if a number is even in bitwise operations?
return (x & 1) == 0
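Both checks as runnable Python functions. The function names are made up here. Python integers behave like two's complement under &, so negative numbers work too.

```python
def is_odd(x):
    # the lowest bit is 1 for odd numbers
    return (x & 1) == 1


def is_even(x):
    # the lowest bit is 0 for even numbers
    return (x & 1) == 0


print(is_odd(7), is_even(7))    # True False
print(is_odd(-3), is_even(10))  # True True
```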
What is another name for a breadth-first search traversal?
Level-order traversal.
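A short sketch of level-order traversal on a binary tree, using a queue. The TreeNode class and the list-of-levels return shape are assumptions.

```python
from collections import deque


class TreeNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def level_order(root):
    """Return node values grouped by depth, top to bottom, left to right."""
    if root is None:
        return []
    levels = []
    queue = deque([root])
    while queue:
        level = []
        # everything in the queue right now belongs to the same level
        for _ in range(len(queue)):
            node = queue.popleft()
            level.append(node.value)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        levels.append(level)
    return levels


tree = TreeNode(1, TreeNode(2, TreeNode(4)), TreeNode(3))
print(level_order(tree))  # [[1], [2, 3], [4]]
```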
What is a 2-3-4 tree?
A 2–3–4 tree (also called a 2–4 tree) is a self-balancing data structure that is commonly used to implement dictionaries. The numbers mean a tree where every node with children (internal node) has either two, three, or four child nodes:
- 2-node has one data element, and if internal has two child nodes;
- 3-node has two data elements, and if internal has three child nodes;
- 4-node has three data elements, and if internal has four child nodes.
What is the complexity of all operations on a splay tree?
O(log n) amortized per operation.
A single operation can take Theta(n) in the worst case.