ChatGPT Missed ?s Flashcards
IOPS (intro)
Input/Output Operations per second, measures performance of data access in storage systems, crucial metric for high throughput in big data systems
Inverse of the 80-20 Pareto Rule (intro)
Previously ~80% of data was actively used and ~20% was not; in big data systems it is the reverse (~20% used, ~80% rarely touched)
HDFS federation (storage)
Multiple independent NameNodes managing namespace (helps scalability)
Secondary NameNode (storage)
Different from a standby node: it stores and compacts the edit logs (which grow too large) and provides checkpoints (snapshots) to the NameNode
When is replication better than erasure coding?(storage)
when fast access of lost data is more important than storage optimization
Repetition level vs Definition level Parquet (storage)
repetition - how deep in nested structure
definition - defined or null
3 Parquet compression techniques (storage)
- dictionary encoding - low cardinality
- run-length encoding - long runs of same value
- bit-packing - reduce number of bits required per value
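To make the run-length idea concrete, here is a minimal sketch of run-length encoding (this illustrates the technique only, not Parquet's actual on-disk format):

```python
# Run-length encoding sketch: collapse runs of repeated values
# into (value, run_length) pairs; decoding expands them back.

def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

encoded = rle_encode(["a", "a", "a", "b", "b", "a"])
# encoded == [("a", 3), ("b", 2), ("a", 1)]
```

Long runs of identical values (common in sorted, low-cardinality columns) shrink dramatically; that is why columnar formats pair RLE with dictionary encoding.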
Shuffle vs Sorting Confusion (mapreduce)
shuffle moves data into key groups across nodes; sorting happens on the reducers so that keys are processed in order
Why is commutativity important in map reduce operations? (mapreduce)
data gets reordered during the shuffle phase
why is associativity important in map reduce operations? (mapreduce)
intermediate results must be combinable in any order without changing the final outcome
for a sum it doesn't matter, but for variance, order and context matter
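A sketch of the variance point above: partial sums merge trivially, but merging partial variances requires carrying extra state (count, mean, M2). This follows the well-known parallel-variance combine formula; the function names are illustrative:

```python
# Merging two (count, mean, M2) summaries of disjoint data partitions.
# M2 is the sum of squared deviations from the partition's mean.

def summarize(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return (n, mean, m2)

def merge_var_state(a, b):
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return (n, mean, m2)

left, right = [1.0, 2.0], [3.0, 4.0, 5.0]
n, mean, m2 = merge_var_state(summarize(left), summarize(right))
variance = m2 / n  # population variance of [1..5] == 2.0
```

Because the merge carries mean and count alongside M2, the combine step is order-independent, which is exactly what the shuffle phase demands.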
Hashing (mapreduce)
Assigns each key numerical value to ensure all instances with same key go to same reducer, while still attempting to balance workload.
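A minimal sketch of that idea: hash the key, take it modulo the number of reducers, and the same key always lands on the same reducer (a stable hash is used here because Python's built-in `hash()` is salted per process):

```python
import hashlib

def reducer_for(key, num_reducers):
    # Stable hash of the key, mapped into [0, num_reducers).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_reducers

# Same key -> same reducer, every time; different keys spread out.
print(reducer_for("user42", 4))
```

A uniform hash spreads distinct keys roughly evenly, though a single very hot key can still skew one reducer's load.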
What are the (ε, δ)-guarantees in approximation algorithms? (Streaming)
in the context of approximating stream - epsilon is the error margin we are willing to accept, and delta is the probability that the algorithm fails to give a good approximation.
reservoir sampling (streaming)
when a new value arrives into streaming system, we can probabilistically decide whether to add it to collection (replace a slot) or discard it
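This is the classic Algorithm R: the first k items fill the reservoir, and item i is then kept with probability k/(i+1), giving every item in the stream an equal chance of ending up in the sample:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10)
```

Only k items are ever held in memory, which is what makes this viable for unbounded streams.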
Count-min sketch algorithm (streaming)
estimates item frequencies; several hash functions map each item into a small table of counters, so the sketch uses only polylogarithmic space relative to the stream
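A minimal count-min sketch to make that concrete: d hash rows of w counters each; the estimate is the minimum over the rows, so collisions can only cause overestimates. The width and depth below are illustrative (in practice w ≈ e/ε and d ≈ ln(1/δ)):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # One independent-ish hash per row, via a row-specific prefix.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item):
        # Minimum over rows: never underestimates the true count.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
```

Space is depth × width counters regardless of how many distinct items the stream contains.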
Hyperloglog (streaming)
estimates the number of distinct elements by hashing (also achieves polylog space)
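A heavily simplified sketch of the core idea (closer to Flajolet-Martin than full HyperLogLog): hash each element and track the longest run of leading zero bits seen; roughly 2^max_zeros estimates the distinct count. Real HLL averages many such registers to cut the variance:

```python
import hashlib

def rough_distinct_estimate(items, bits=32):
    """Crude cardinality estimate from leading-zero counts of hashes."""
    max_zeros = 0
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        h &= (1 << bits) - 1            # keep a fixed-width hash
        zeros = bits - h.bit_length()   # leading zeros in `bits` bits
        max_zeros = max(max_zeros, zeros)
    return 2 ** max_zeros
```

Duplicates hash identically, so they cannot inflate the estimate; that is why this works for distinct counting rather than total counting.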
rollback recovery (streaming)
logs state at regular intervals to revert to previous state if fails
5 properties of a streaming algorithm (streaming)
- one pass
- small space for state
- fast state updates
- fast computation
- approximation with conf. guarantee
why do NoSQL databases avoid joins and rigid schemas (nosql)
complex queries slow down performance.
schemas make it harder to partition and scale
schemaless indexing (nosql)
efficient querying even in databases that don’t enforce rigid structures
BASE philosophy (nosql)
Basically available, soft-state, eventually consistent.
(avail > consistency)
polyglot persistence (nosql)
using different databases for different tasks within the same application
PACELC (nosql)
beyond the partition tradeoff (if Partitioned: Availability vs Consistency), Else (no partition) there is still a tradeoff between Latency and Consistency
partition tolerance (nosql)
even if network errors partition the cluster, the system can still operate
multi-model nosql databases (nosql)
support multiple types of data models
datalakehouse (data platforms)
a metadata layer on top of a data lake, e.g. Delta Lake, which keeps a log of JSON files tracking table versions over time; checkpoints at regular intervals mean you don't have to scan the entire log to reach a specific version
data steward (data platforms)
enforces data governance policies
3 key aspects to data management in platform (data platforms)
data integration
data quality
metadata
active metadata (data platforms)
uses open APIs to hook into every piece of the data platform and collect real-time information about all the data, enabling automation of governance tasks
data versioning significance in pipelines (data platforms)
tracks changes to datasets; allows rolling back to an earlier version of the data if a transformation error occurs
two best practices for ensuring continuous improvement in DataOps (data platforms)
feedback loops (always monitor data pipeline performance)
automated testing (catch issues early)
5 characteristics of cloud computing according to NIST (cloud computing)
- on demand self service
- broad network access
- resource pooling
- elasticity
- measured service
Define cloud migration (migration)
Moving data and business operations, applications etc from on premises to remote cloud provider server
6 common migration strategies (migration)
- Rehost “lift and shift”
- Replatform “make some adjustments”
- Repurchase “on-prem CRM –> Salesforce”
- Refactor “rewrite applications to be cloud-native… AWS LAMBDA”
- Retire “decommission outdated applications”
- Retain “hybrid - keep some on”
When is the rehost strategy (“lift and shift”) useful? (migration)
No time to redesign
key steps in migration process (migration)
Assessment (review current system)
Planning (design cloud architecture)
Migration
Testing
Optimization
How do cloud providers ensure data privacy and compliance during and after migration (migration)
AWS KMS (Key Management Service) keeps data encrypted; AWS Shield protects against DDoS attacks
encryption (SSL/TLS)
What two things does a hybrid cloud solution blend? (migration)
Control (on premises) and scalability (cloud)
2 tools that monitor cloud costs? (MIGRATION)
AWS Cost Explorer
Azure Cost Management
Example of pub/sub model real world (cloud streaming)
In a stock trading platform, publishers (the stock exchanges) send updates on stock prices, and consumers (traders) receive only the messages for stocks they are interested in
3 common processing techniques applied to event streams (cloud streaming)
filtering
enriching (adding context)
aggregation
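The three techniques above, sketched on a toy event stream (the event shape and field names are made up for illustration):

```python
events = [
    {"user": "a", "type": "click", "ms": 120},
    {"user": "b", "type": "error", "ms": 450},
    {"user": "a", "type": "error", "ms": 300},
]

# Filtering: keep only the events we care about.
errors = [e for e in events if e["type"] == "error"]

# Enriching: add context (here, an illustrative severity field).
for e in errors:
    e["severity"] = "high" if e["ms"] > 400 else "low"

# Aggregation: count errors per user.
counts = {}
for e in errors:
    counts[e["user"]] = counts.get(e["user"], 0) + 1
# counts == {"b": 1, "a": 1}
```

In a real pipeline each step would run continuously over an unbounded stream (often windowed), but the per-event logic is the same.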
What is AWS Kinesis (cloud streaming)
a cloud service that enables real-time data ingestion and processing. Divides streams into shards to be processed in parallel
“priority queue pattern” (cloud streaming)
processes high-priority events faster
pipes and filters pattern (cloud streaming)
breaks complex task into smaller pieces (filters), pipes connect them
Modular (scalable and reusable)
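A small sketch of pipes-and-filters using Python generators: each filter is an independent, reusable stage, and chaining them forms the pipe (stage names here are invented for illustration):

```python
def drop_blank(lines):
    # Filter 1: discard empty lines.
    for line in lines:
        if line.strip():
            yield line

def to_upper(lines):
    # Filter 2: normalize case.
    for line in lines:
        yield line.upper()

def numbered(lines):
    # Filter 3: prefix each line with its position.
    for i, line in enumerate(lines, 1):
        yield f"{i}: {line}"

# The pipe: compose the filters end to end.
pipeline = numbered(to_upper(drop_blank(["hi", "", "there"])))
result = list(pipeline)  # ["1: HI", "2: THERE"]
```

Because each stage only consumes and yields items, stages can be reordered, reused, or scaled independently, which is the point of the pattern.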
real-time system security concern (cloud streaming)
unauthorized access (need policies)
data breaches (encryption)
real world example of company using streaming system (cloud streaming)
Netflix monitors user activity and streaming quality to quickly detect and resolve issues
2 important industries that benefit from real-time analytics (cloud streaming)
Finance - stock price for time-sensitive decisions
Healthcare - detect emergencies and trigger intervention