ChatGPT Missed ?s Flashcards
IOPS (intro)
Input/Output Operations per second, measures performance of data access in storage systems, crucial metric for high throughput in big data systems
Inverse of the 80-20 Pareto Rule (intro)
Previously ~80% of data was actively used and ~20% was not; in big data systems it is the reverse (~20% used, ~80% rarely touched)
HDFS federation (storage)
Multiple independent NameNodes managing namespace (helps scalability)
Secondary NameNode (storage)
Different from a standby node: it stores and compacts the edit logs (which grow too large) and provides checkpoints (snapshots) to the NameNode
When is replication better than erasure coding?(storage)
when fast access of lost data is more important than storage optimization
Repetition level vs Definition level Parquet (storage)
repetition - how deep in nested structure
definition - defined or null
3 Parquet compression techniques (storage)
- dictionary encoding - low cardinality
- run-length encoding - long runs of same value
- bit-packing - reduce number of bits required per value
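To make the run-length idea concrete, here is a minimal sketch of run-length encoding (this illustrates the technique only, not Parquet's actual on-disk format):

```python
# Run-length encoding sketch: collapse runs of repeated values
# into (value, run_length) pairs; decoding expands them back.

def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

encoded = rle_encode(["a", "a", "a", "b", "b", "a"])
# encoded == [("a", 3), ("b", 2), ("a", 1)]
```

Long runs of identical values (common in sorted, low-cardinality columns) shrink dramatically; that is why columnar formats pair RLE with dictionary encoding.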
Shuffle vs Sorting Confusion (mapreduce)
shuffle moves data into key groups across nodes; sorting happens on the reducers so that keys are processed in order
Why is commutativity important in map reduce operations? (mapreduce)
data gets reordered during the shuffle phase
why is associativity important in map reduce operations? (mapreduce)
intermediate results must be combinable in any order without changing the final outcome
for a sum it doesn't matter, but for variance, order and context matter
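A sketch of the variance point above: partial sums merge trivially, but merging partial variances requires carrying extra state (count, mean, M2). This follows the well-known parallel-variance combine formula; the function names are illustrative:

```python
# Merging two (count, mean, M2) summaries of disjoint data partitions.
# M2 is the sum of squared deviations from the partition's mean.

def summarize(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return (n, mean, m2)

def merge_var_state(a, b):
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return (n, mean, m2)

left, right = [1.0, 2.0], [3.0, 4.0, 5.0]
n, mean, m2 = merge_var_state(summarize(left), summarize(right))
variance = m2 / n  # population variance of [1..5] == 2.0
```

Because the merge carries mean and count alongside M2, the combine step is order-independent, which is exactly what the shuffle phase demands.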
Hashing (mapreduce)
Assigns each key numerical value to ensure all instances with same key go to same reducer, while still attempting to balance workload.
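A minimal sketch of that idea: hash the key, take it modulo the number of reducers, and the same key always lands on the same reducer (a stable hash is used here because Python's built-in `hash()` is salted per process):

```python
import hashlib

def reducer_for(key, num_reducers):
    # Stable hash of the key, mapped into [0, num_reducers).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_reducers

# Same key -> same reducer, every time; different keys spread out.
print(reducer_for("user42", 4))
```

A uniform hash spreads distinct keys roughly evenly, though a single very hot key can still skew one reducer's load.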
What are the (ε, δ)-guarantees in approximation algorithms? (Streaming)
in the context of approximating stream - epsilon is the error margin we are willing to accept, and delta is the probability that the algorithm fails to give a good approximation.
reservoir sampling (streaming)
when a new value arrives into streaming system, we can probabilistically decide whether to add it to collection (replace a slot) or discard it
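This is the classic Algorithm R: the first k items fill the reservoir, and item i is then kept with probability k/(i+1), giving every item in the stream an equal chance of ending up in the sample:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10)
```

Only k items are ever held in memory, which is what makes this viable for unbounded streams.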
Count-min sketch algorithm (streaming)
estimates item frequencies; several hash functions map each item into a small table of counters, so the sketch uses only polylogarithmic space relative to the stream
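A minimal count-min sketch to make that concrete: d hash rows of w counters each; the estimate is the minimum over the rows, so collisions can only cause overestimates. The width and depth below are illustrative (in practice w ≈ e/ε and d ≈ ln(1/δ)):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # One independent-ish hash per row, via a row-specific prefix.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item):
        # Minimum over rows: never underestimates the true count.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
```

Space is depth × width counters regardless of how many distinct items the stream contains.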
Hyperloglog (streaming)
estimates the number of distinct elements by hashing (also achieves polylog space)
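A heavily simplified sketch of the core idea (closer to Flajolet-Martin than full HyperLogLog): hash each element and track the longest run of leading zero bits seen; roughly 2^max_zeros estimates the distinct count. Real HLL averages many such registers to cut the variance:

```python
import hashlib

def rough_distinct_estimate(items, bits=32):
    """Crude cardinality estimate from leading-zero counts of hashes."""
    max_zeros = 0
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        h &= (1 << bits) - 1            # keep a fixed-width hash
        zeros = bits - h.bit_length()   # leading zeros in `bits` bits
        max_zeros = max(max_zeros, zeros)
    return 2 ** max_zeros
```

Duplicates hash identically, so they cannot inflate the estimate; that is why this works for distinct counting rather than total counting.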
rollback recovery (streaming)
logs state at regular intervals to revert to previous state if fails
5 properties of a streaming algorithm (streaming)
- one pass
- small space for state
- fast state updates
- fast computation
- approximation with conf. guarantee
why do NoSQL databases avoid joins and rigid schemas (nosql)
complex queries slow down performance.
schemas make it harder to partition and scale
schemaless indexing (nosql)
efficient querying even in databases that don’t enforce rigid structures
BASE philosophy (nosql)
Basically available, soft-state, eventually consistent.
(avail > consistency)
polyglot persistence (nosql)
using different databases for different tasks within the same application
PACELC (nosql)
beyond the partition tradeoff (if Partitioned: Availability vs Consistency), Else (no partition) there is still a tradeoff between Latency and Consistency
partition tolerance (nosql)
even if network errors partition the cluster, the system can still operate
multi-model nosql databases (nosql)
support multiple types of data models
datalakehouse (data platforms)
a metadata layer on top of a data lake, e.g. Delta Lake, which keeps a log of JSON files tracking table versions over time; checkpoints at regular intervals mean you don't have to scan the entire log to reach a specific version
data steward (data platforms)
enforces data governance policies
3 key aspects to data management in platform (data platforms)
data integration
data quality
metadata
active metadata (data platforms)
uses open APIs to hook into every piece of the data platform and collect real-time information about all the data, enabling automation of governance tasks
data versioning significance in pipelines (data platforms)
tracks changes to datasets; allows rolling back to an earlier version of the data if a transformation error occurs
two best practices for ensuring continuous improvement in DataOps (data platforms)
feedback loops (always monitor data pipeline performance)
automated testing (catch issues early)
5 characteristics of cloud computing according to NIST (cloud computing)
- on demand self service
- broad network access
- resource pooling
- elasticity
- measured service
Define cloud migration (migration)
Moving data and business operations, applications etc from on premises to remote cloud provider server
6 common migration strategies (migration)
- Rehost “lift and shift”
- Replatform “make some adjustments”
- Repurchase “on-prem CRM –> Salesforce”
- Refactor “rewrite applications to be cloud-native… AWS LAMBDA”
- Retire “decommission outdated applications”
- Retain “hybrid - keep some on”
When is the rehost strategy (“lift and shift”) useful? (migration)
No time to redesign
key steps in migration process (migration)
Assessment (review current system)
Planning (design cloud architecture)
Migration
Testing
Optimization
How do cloud providers ensure data privacy and compliance during and after migration (migration)
AWS KMS (Key Management Service) keeps data encrypted; AWS Shield protects against DDoS attacks
encryption (SSL/TLS)
What two things does a hybrid cloud solution blend? (migration)
Control (on premises) and scalability (cloud)
2 tools that monitor cloud costs? (MIGRATION)
AWS Cost Explorer
Azure Cost Management
Example of pub/sub model real world (cloud streaming)
In a stock trading platform, publishers (the stock exchanges) send updates on stock prices, and consumers (traders) receive only the messages for stocks they are interested in
3 common processing techniques applied to event streams (cloud streaming)
filtering
enriching (adding context)
aggregation
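The three techniques above, sketched on a toy event stream (the event shape and field names are made up for illustration):

```python
events = [
    {"user": "a", "type": "click", "ms": 120},
    {"user": "b", "type": "error", "ms": 450},
    {"user": "a", "type": "error", "ms": 300},
]

# Filtering: keep only the events we care about.
errors = [e for e in events if e["type"] == "error"]

# Enriching: add context (here, an illustrative severity field).
for e in errors:
    e["severity"] = "high" if e["ms"] > 400 else "low"

# Aggregation: count errors per user.
counts = {}
for e in errors:
    counts[e["user"]] = counts.get(e["user"], 0) + 1
# counts == {"b": 1, "a": 1}
```

In a real pipeline each step would run continuously over an unbounded stream (often windowed), but the per-event logic is the same.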
What is AWS Kinesis (cloud streaming)
a cloud service that enables real-time data ingestion and processing. Divides streams into shards to be processed in parallel
“priority queue pattern” (cloud streaming)
processes high-priority events faster
pipes and filters pattern (cloud streaming)
breaks complex task into smaller pieces (filters), pipes connect them
Modular (scalable and reusable)
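A small sketch of pipes-and-filters using Python generators: each filter is an independent, reusable stage, and chaining them forms the pipe (stage names here are invented for illustration):

```python
def drop_blank(lines):
    # Filter 1: discard empty lines.
    for line in lines:
        if line.strip():
            yield line

def to_upper(lines):
    # Filter 2: normalize case.
    for line in lines:
        yield line.upper()

def numbered(lines):
    # Filter 3: prefix each line with its position.
    for i, line in enumerate(lines, 1):
        yield f"{i}: {line}"

# The pipe: compose the filters end to end.
pipeline = numbered(to_upper(drop_blank(["hi", "", "there"])))
result = list(pipeline)  # ["1: HI", "2: THERE"]
```

Because each stage only consumes and yields items, stages can be reordered, reused, or scaled independently, which is the point of the pattern.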
real-time system security concern (cloud streaming)
unauthorized access (need policies)
data breaches (encryption)
real world example of company using streaming system (cloud streaming)
Netflix monitors user activity and streaming quality to quickly detect and resolve issues
2 important industries that benefit from real-time analytics (cloud streaming)
Finance - stock price for time-sensitive decisions
Healthcare - detect emergencies and trigger intervention