ChatGPT Missed ?s Flashcards

1
Q

IOPS (intro)

A

Input/Output Operations Per Second. Measures how fast a storage system serves data accesses (reads and writes per second); a crucial metric for high-throughput big data systems.

2
Q

Inverse of the 80-20 Pareto Rule (intro)

A

Previously about 80% of stored data was actively used and 20% was not; now the ratio is reversed.

3
Q

HDFS federation (storage)

A

Multiple independent NameNodes, each managing its own portion of the namespace (helps scalability)

4
Q

Secondary NameNode (storage)

A

Not the same as a standby node. It periodically takes the edit log (which would otherwise grow too large), merges it into the filesystem image, and provides the resulting checkpoint (snapshot) back to the NameNode.

5
Q

When is replication better than erasure coding? (storage)

A

When fast recovery of and access to lost data matters more than storage efficiency.

6
Q

Repetition level vs definition level in Parquet (storage)

A

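repetition level - at which repeated field in the nested path a value repeats (whether it starts a new list or continues the current one)
definition level - how many optional or repeated fields along the path are actually defined (shows whether the value is present or null, and at which depth the null occurs)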

7
Q

3 Parquet compression techniques (storage)

A
  1. dictionary encoding - for low-cardinality columns
  2. run-length encoding - for long runs of the same value
  3. bit-packing - reduces the number of bits required per value
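
A minimal sketch of the three ideas on a toy column, in plain Python. This is not Parquet's actual on-disk encoding; the function names and the sample column are assumptions made up for illustration.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def bits_needed(max_value):
    """Bit width needed to bit-pack non-negative integers up to max_value."""
    return max(1, max_value.bit_length())


column = ["US", "US", "US", "DE", "DE", "US"]        # low-cardinality toy column
dictionary, indices = dictionary_encode(column)      # ["US", "DE"], [0, 0, 0, 1, 1, 0]
runs = run_length_encode(indices)                    # [(0, 3), (1, 2), (0, 1)]
width = bits_needed(max(indices))                    # 1 bit per index instead of a full string
```

The three techniques stack: dictionary indices are small integers, which makes both run-length encoding and bit-packing effective.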
8
Q

Shuffle vs Sorting Confusion (mapreduce)

A

Shuffle moves map output across the network so that all records with the same key reach the same reducer. Sorting orders those records by key so that each reducer processes its keys in order.

9
Q

Why is commutativity important in MapReduce operations? (mapreduce)

A

Data gets reordered during the shuffle phase, so the result must not depend on the order in which values arrive.

10
Q

Why does non-associativity matter in MapReduce operations? (mapreduce)

A

Intermediate results must be combinable in any order and grouping without changing the final outcome.

For sum it doesn't matter, but for something like variance, order and context matter: partial results cannot simply be recombined.
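
A small sketch of the problem, using the mean instead of variance because it is shorter; the same reasoning applies to variance. The partitions and helper names are invented for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)


partition_a = [1, 2, 3, 4]
partition_b = [10]

# Naive combiner: averaging per-partition averages gives the wrong answer,
# because "mean" cannot be regrouped freely across partitions.
mean_of_means = mean([mean(partition_a), mean(partition_b)])


# Safe combiner: carry (sum, count) pairs, which merge correctly
# regardless of order or grouping.
def to_partial(xs):
    return (sum(xs), len(xs))


def merge(p, q):
    return (p[0] + q[0], p[1] + q[1])


total, count = merge(to_partial(partition_a), to_partial(partition_b))
true_mean = total / count

print(mean_of_means, true_mean)  # 6.25 vs 4.0
```

The usual fix is to have mappers or combiners emit enough context (here, the count) for the merge step to be associative and commutative.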

11
Q

Hashing (mapreduce)

A

Assigns each key a numerical value (its hash) so that all records with the same key go to the same reducer, while still attempting to balance the workload across reducers.
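
A minimal sketch of a hash partitioner, assuming string keys and a fixed number of reducers. The function name is hypothetical, and CRC32 is used only because it is deterministic across runs (Python's built-in `hash()` for strings is randomized per process).

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Every occurrence of the same key lands on the same reducer.
for key in ["apple", "banana", "apple", "cherry"]:
    print(key, "-> reducer", partition(key, 4))
```

Balance depends on the key distribution: a single very frequent key still goes to one reducer, which is the classic skew problem.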

12
Q

What are the (ε, δ)-guarantees in approximation algorithms? (Streaming)

A

In the context of approximating a stream statistic: ε (epsilon) is the error margin we are willing to accept, and δ (delta) is the probability that the algorithm fails to give an approximation within that margin.
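
One common way to write the guarantee, assuming a relative (multiplicative) error; the symbols $f$ for the true value and $\hat{f}$ for the estimate are names chosen here for illustration. Some algorithms state the error additively instead.

```latex
\Pr\left[\, \lvert \hat{f} - f \rvert > \varepsilon f \,\right] \le \delta
```

Read: with probability at least $1 - \delta$, the estimate is within a factor $(1 \pm \varepsilon)$ of the true value.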

13
Q

reservoir sampling (streaming)

A

When a new value arrives in the stream, we decide probabilistically whether to add it to the fixed-size sample (replacing an existing slot) or discard it, so that every value seen so far is equally likely to be in the sample.
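
A short sketch of the classic single-pass version (often called Algorithm R), assuming a sample of fixed size `k`. The function and parameter names are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000), k=5)
print(sample)
```

Only `k` items are stored at any time, no matter how long the stream runs.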

14
Q

Count-min sketch algorithm (streaming)

A

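Used to estimate item frequencies in a stream. Keeps a small table of counters indexed by several hash functions, so it needs only polylogarithmic space instead of one counter per distinct item (estimates can overcount but never undercount).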
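
A minimal sketch of the data structure. The table dimensions are arbitrary illustrative defaults, and the row hashes are derived from SHA-256 for simplicity rather than the pairwise-independent hash families used in the formal analysis.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash per row, obtained by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate counters, so the minimum is the best estimate.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # at least 3
```

Memory is fixed at `width * depth` counters regardless of how many distinct items appear.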

15
Q

HyperLogLog (streaming)

A

Estimates the number of unique elements by hashing (also achieves polylog space)

16
Q

rollback recovery (streaming)

A

Logs (checkpoints) state at regular intervals so the system can revert to a previous state if it fails
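
A toy sketch of the idea: an operator that checkpoints its state every few events and can roll back to the last checkpoint. The class name, state layout, and checkpoint interval are all assumptions; real systems persist checkpoints to durable storage rather than memory.

```python
import copy


class CheckpointedCounter:
    def __init__(self, checkpoint_every=3):
        self.state = {"count": 0, "offset": 0}
        self.checkpoint_every = checkpoint_every
        self._checkpoint = copy.deepcopy(self.state)

    def process(self, value):
        self.state["count"] += value
        self.state["offset"] += 1
        # Save a snapshot at regular intervals.
        if self.state["offset"] % self.checkpoint_every == 0:
            self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the last checkpoint; return the offset to replay from."""
        self.state = copy.deepcopy(self._checkpoint)
        return self.state["offset"]


op = CheckpointedCounter()
for v in [1, 2, 3, 4]:
    op.process(v)
replay_from = op.rollback()   # simulated failure after the 4th event
print(op.state, replay_from)  # state as of the 3rd event; replay starts at offset 3
```

After a rollback, events from the returned offset onward must be replayed from the source to catch up.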

17
Q

5 properties of a streaming algorithm (streaming)

A
  1. one pass over the data
  2. small space for state
  3. fast state updates
  4. fast computation
  5. approximation with confidence guarantees
18
Q

Why do NoSQL databases avoid joins and rigid schemas? (nosql)

A

Joins are complex queries that slow down performance.

Rigid schemas make it harder to partition and scale.

19
Q

schemaless indexing (nosql)

A

efficient querying even in databases that don’t enforce rigid structures

20
Q

BASE philosophy (nosql)

A

Basically available, soft-state, eventually consistent.

(availability is prioritized over consistency)

21
Q

polyglot persistence (nosql)

A

Using different databases for different tasks within the same application

22
Q

PACELC (nosql)

A

Extends CAP: if there is a Partition, trade off Availability against Consistency; Else (normal operation), there is a further tradeoff between Latency and Consistency.

23
Q

partition tolerance (nosql)

A

The system can still operate when network errors split the nodes into partitions that cannot communicate.

24
Q

multi-model NoSQL databases

A

support multiple types of data models

25
Q

datalakehouse (data platforms)

A

A data lake with a metadata layer on top, for example Delta Lake, which keeps a log of JSON files that track table versions over time. Periodic checkpoints summarize the log so you don't have to scan the entire log to access a specific version.

26
Q

data steward (data platforms)

A

enforces data governance policies

27
Q

3 key aspects to data management in platform (data platforms)

A

data integration
data quality
metadata

28
Q
A
29
Q

active metadata (data platforms)

A

Uses open APIs to hook into every piece of the data platform and collect real-time information about all the data, which allows governance tasks to be automated

30
Q

significance of data versioning in pipelines (data platforms)

A

Tracks changes to datasets and allows rolling back to an earlier version of the data if a transformation error occurs.

31
Q

two best practices for ensuring continuous improvement in DataOps (data platforms)

A

feedback loops (always monitor data pipeline performance)
automated testing (catch issues early)

32
Q

5 characteristics of cloud computing according to NIST (cloud computing)

A
  1. on-demand self-service
  2. broad network access
  3. resource pooling
  4. rapid elasticity
  5. measured service
33
Q

Define cloud migration (migration)

A

Moving data, applications, and business operations from on-premises infrastructure to a remote cloud provider's servers

34
Q

6 common migration strategies (migration)

A
  1. Rehost “lift and shift”
  2. Replatform “make some adjustments”
  3. Repurchase “on-prem CRM -> Salesforce”
  4. Refactor “rewrite applications to be cloud-native… AWS Lambda”
  5. Retire “decommission outdated applications”
  6. Retain “hybrid - keep some on premises”
35
Q

When is the rehost strategy (“lift and shift”) useful? (migration)

A

When there is no time to redesign the application

36
Q

key steps in migration process (migration)

A

Assessment (review current system)
Planning (design cloud architecture)
Migration
Testing
Optimization

37
Q

How do cloud providers ensure data privacy and compliance during and after migration (migration)

A

AWS KMS (Key Management Service) manages encryption keys so data remains secure; AWS Shield protects against DDoS attacks

Encryption in transit (SSL/TLS)

38
Q

What two things does a hybrid cloud solution blend? (migration)

A

Control (on premises) and scalability (cloud)

39
Q

2 tools that monitor cloud costs? (migration)

A

AWS Cost Explorer
Azure Cost Management

40
Q

Example of pub/sub model real world (cloud streaming)

A

In a stock trading platform, publishers (the stock exchanges) send updates on stock prices, and consumers (traders) receive only the messages related to the stocks they are interested in
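
A minimal in-memory sketch of the pattern, with stock symbols as topics. The `Broker` class and its method names are hypothetical and not the API of any real messaging system.

```python
class Broker:
    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Deliver only to subscribers of this topic; publishers never
        # need to know who the consumers are.
        for callback in self._subscribers.get(topic, []):
            callback(message)


broker = Broker()
trader_inbox = []
broker.subscribe("ACME", trader_inbox.append)   # trader follows one stock only

broker.publish("ACME", {"symbol": "ACME", "price": 101.5})
broker.publish("OTHER", {"symbol": "OTHER", "price": 7.2})  # nobody subscribed

print(trader_inbox)  # only the ACME update
```

The decoupling is the point: exchanges and traders interact only through topics on the broker.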

41
Q

3 common processing techniques applied to event streams (cloud streaming)

A

filtering
enriching (adding context)
aggregation
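
The three techniques chained together on a toy event stream, as a sketch. The event fields, the sector lookup table, and the function names are made up for illustration.

```python
def filter_events(events, min_price):
    """Filtering: drop events below a price threshold."""
    for e in events:
        if e["price"] >= min_price:
            yield e


def enrich(events, sector_by_symbol):
    """Enriching: add context (the sector) to each event."""
    for e in events:
        yield {**e, "sector": sector_by_symbol.get(e["symbol"], "unknown")}


def aggregate_by_sector(events):
    """Aggregation: total price per sector."""
    totals = {}
    for e in events:
        totals[e["sector"]] = totals.get(e["sector"], 0) + e["price"]
    return totals


events = [
    {"symbol": "AAA", "price": 10},
    {"symbol": "BBB", "price": 3},
    {"symbol": "CCC", "price": 20},
]
sectors = {"AAA": "tech", "CCC": "tech"}

totals = aggregate_by_sector(enrich(filter_events(events, 5), sectors))
print(totals)  # {'tech': 30}
```

Because each stage is a generator, events flow through one at a time instead of being collected into intermediate lists.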

42
Q

What is AWS Kinesis (cloud streaming)

A

a cloud service that enables real-time data ingestion and processing. Divides streams into shards to be processed in parallel

43
Q

“priority queue pattern” (cloud streaming)

A

processes high-priority events faster
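
A small sketch using Python's `heapq`, assuming that a lower number means a higher priority. The class name and the sample events are illustrative.

```python
import heapq
import itertools


class PriorityEventQueue:
    def __init__(self):
        self._heap = []
        # Tie-breaker so events with equal priority come out in arrival order.
        self._counter = itertools.count()

    def put(self, priority, event):
        heapq.heappush(self._heap, (priority, next(self._counter), event))

    def get(self):
        _priority, _seq, event = heapq.heappop(self._heap)
        return event

    def __len__(self):
        return len(self._heap)


q = PriorityEventQueue()
q.put(5, "page view")
q.put(1, "fraud alert")
q.put(5, "click")

while q:
    print(q.get())  # fraud alert, page view, click
```

In cloud deployments the same effect is often achieved with separate queues per priority level and more consumers on the high-priority queue.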

44
Q

pipes and filters pattern (cloud streaming)

A

breaks complex task into smaller pieces (filters), pipes connect them

Modular (scalable and reusable)

45
Q

real-time system security concern (cloud streaming)

A

unauthorized access (need access policies)
data breaches (need encryption)

46
Q

real world example of company using streaming system (cloud streaming)

A

Netflix uses it to monitor user activity and streaming quality, so it can quickly detect and resolve issues

47
Q

2 important industries that benefit from real-time analytics (cloud streaming)

A

Finance - live stock prices for time-sensitive decisions
Healthcare - detect emergencies and trigger intervention