ChatGPT Missed ?s Flashcards
IOPS (intro)
Input/Output Operations Per Second: how many read/write operations a storage system completes each second. Measures data-access performance and is a crucial metric for high-throughput big data systems.
Inverse of the 80-20 Pareto Rule (intro)
Previously about 80% of data would be used and 20% not; now it is the reverse.
HDFS federation (storage)
Multiple independent NameNodes, each managing its own portion of the namespace (helps scalability)
Secondary NameNode (storage)
Different from a standby node: it periodically merges the edit log (which would otherwise grow too large) into the fsimage and hands the resulting checkpoint (snapshot) back to the NameNode. It does not take over if the NameNode fails.
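When is replication better than erasure coding? (storage)
When fast recovery of lost data is more important than storage efficiency: a replica can be read directly, while erasure coding must reconstruct the lost block from several surviving blocks.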
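A rough way to see the trade-off is to compare the two schemes by storage overhead, failures tolerated, and blocks read to recover one lost block. This is a minimal sketch; the function names are hypothetical, and 3-way replication and a Reed-Solomon layout with 6 data and 3 parity blocks are used only as illustrative settings.

```python
def replication_cost(copies):
    """Costs of keeping `copies` full copies of every block."""
    return {
        "overhead": float(copies - 1),   # extra storage per unit of data
        "losses_tolerated": copies - 1,  # copies that can disappear
        "blocks_read_to_recover": 1,     # just read a surviving copy
    }


def erasure_cost(data_blocks, parity_blocks):
    """Costs of a Reed-Solomon style layout with data + parity blocks."""
    return {
        "overhead": parity_blocks / data_blocks,
        "losses_tolerated": parity_blocks,
        "blocks_read_to_recover": data_blocks,  # decode from surviving blocks
    }


print(replication_cost(3))
print(erasure_cost(6, 3))
```

Under these assumptions replication costs four times the extra storage but recovers a block with a single read, which is the situation the card describes.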
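Repetition level vs Definition level in Parquet (storage)
repetition - at which repeated field in the value's path the repetition happened (0 means a new record starts)
definition - how many optional/repeated fields in the path are actually defined, which shows whether the value is present or at which level it is null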
3 Parquet compression techniques (storage)
- dictionary encoding - low cardinality
- run-length encoding - long runs of same value
- bit-packing - reduce the number of bits required per value
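The three techniques above often stack: dictionary codes are small integers, which then run-length encode and bit-pack well. The sketch below illustrates the ideas only; it is not Parquet's actual on-disk format, and all function names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code (good for low cardinality)."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse runs of the same code into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [tuple(r) for r in runs]


def bits_needed(codes):
    """Bits per value if every code is packed at the width of the largest one."""
    return max(1, max(codes).bit_length())


column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dictionary_encode(column)
print(dictionary, codes)
print(run_length_encode(codes))
print(bits_needed(codes))
```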
Shuffle vs Sorting Confusion (mapreduce)
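Shuffle moves map output across the network so that all values for a key end up on the same reducer. Sorting orders that data by key (map outputs are sorted, then merged on the reducer) so each reducer processes its keys in order, one key group at a time.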
Why is commutativity important in map reduce operations? (mapreduce)
Data gets reordered during the shuffle phase, so the result must not depend on the order in which values arrive.
Why is associativity important in map reduce operations? (mapreduce)
Intermediate results must be combinable in any grouping without changing the final outcome.
For sum it doesn't matter, but for variance it does: variances of partial groups cannot simply be combined, so each partial result has to carry extra context (count, sum, sum of squares).
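Variance is a useful test case for this card. A minimal sketch, with hypothetical function names: each partition is reduced to a (count, sum, sum of squares) triple, and because merging triples is plain addition, the merge is associative and commutative even though variance itself is not.

```python
def summarize(values):
    """Partial result for one partition: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Merge two partial results; element-wise addition, so any order works."""
    return tuple(x + y for x, y in zip(a, b))


def variance(summary):
    """Population variance recovered from a merged summary."""
    n, total, squares = summary
    mean = total / n
    return squares / n - mean * mean


p1, p2, p3 = summarize([1, 2]), summarize([3, 4]), summarize([5])
left_first = combine(combine(p1, p2), p3)
right_first = combine(p1, combine(p3, p2))
print(left_first == right_first, variance(left_first))
```

This one-pass formula can lose precision on large values with small spread; it is shown here only because it makes the associativity visible.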
Hashing (mapreduce)
Assigns each key a numerical value (typically hash of the key modulo the number of reducers) so that all instances of the same key go to the same reducer, while still attempting to balance the workload.
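A minimal sketch of hash partitioning, assuming string keys. The function name is hypothetical, and `zlib.crc32` is used only because it is deterministic across runs (Python's built-in `hash` is salted for strings); real frameworks use their own hash functions.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


keys = ["apple", "banana", "apple", "cherry", "banana", "apple"]
for key in keys:
    print(key, "-> reducer", partition(key, 4))

# Load per reducer: a very frequent key can still overload one reducer.
print(Counter(partition(k, 4) for k in keys))
```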
What are the (ε, δ)-guarantees in approximation algorithms? (Streaming)
In the context of approximating a stream: ε (epsilon) is the error margin we are willing to accept, and δ (delta) is the probability that the algorithm fails to stay within that margin. The answer is within ε of the truth with probability at least 1 - δ.
reservoir sampling (streaming)
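Keeps a fixed-size uniform sample of a stream of unknown length: when a new value arrives, we decide probabilistically whether to add it to the collection (replacing a random slot) or discard it. With k slots, the i-th value is kept with probability k/i.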
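A minimal sketch of the classic single-pass version (often called Algorithm R). The function name and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


print(reservoir_sample(range(1_000_000), 5))
```

Only the k slots and a counter are held in memory, regardless of how long the stream is.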
Count-min sketch algorithm (streaming)
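Used to estimate how often each item appears in a stream. Keeps a small grid of counters (one row per hash function) instead of one counter per item, so it needs only polylogarithmic space; the estimate is the minimum counter across rows and can overcount but never undercount.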
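A minimal sketch of the structure, assuming the usual sizing of width = ceil(e/ε) and depth = ceil(ln(1/δ)). The class name is hypothetical, and deriving the row hashes from SHA-256 with a row prefix is a simplification chosen for this example.

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, eps=0.01, delta=0.01):
        self.width = math.ceil(math.e / eps)        # counters per row
        self.depth = math.ceil(math.log(1 / delta))  # number of rows / hashes
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, item, row):
        # One hash per row, simulated by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)
print(sketch.estimate("a"), sketch.estimate("z"))
```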
Hyperloglog (streaming)
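Estimates the number of unique elements by hashing each one and tracking, per bucket, the longest run of leading zero bits seen; long runs are rare, so they indicate many distinct values (also achieves polylog space).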
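A simplified sketch of the idea, assuming 2^p registers and a 64-bit hash taken from SHA-1. The class name is hypothetical, and the bias corrections are reduced to the basic small-range case, so treat it as an illustration rather than a production implementation.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Few items seen: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i % 1000}")  # duplicates do not change the estimate
print(round(hll.estimate()))
```

With 1024 registers the typical relative error is a few percent, while the state is about a thousand small integers no matter how many items arrive.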
rollback recovery (streaming)
Saves (checkpoints) state at regular intervals so the system can revert to the last saved state after a failure
5 properties of a streaming algorithm (streaming)
- one pass
- small space for state
- fast state updates
- fast computation
- approximation with a confidence guarantee
Why do NoSQL databases avoid joins and rigid schemas? (nosql)
Joins and other complex queries slow down performance, especially when the data is spread across many nodes.
Rigid schemas make it harder to partition and scale.