L8 Flashcards

1
Q

def BD processing

A

a set of techniques or programming models for accessing large-scale data and extracting useful information to support decision-making
-> in BD, processing happens before the data is stored

2
Q

2 Types of data processing

A
  1. centralized data processing
  2. distributed data processing: distributed across different physical locations
3
Q

Batch processing

A
  • computer processes a number of tasks that have been collected in a group, often simultaneously, in non-stop, sequential order
    Pros:
  • good when response time is not important
  • suitable for large data volume
  • fast, inexpensive and accurate
  • offline
  • query-driven as static and more about historical fact finding
    eg) monthly payroll system, credit card billing system
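A minimal sketch of the batch idea, using the payroll example above: records are collected first and then processed together in one run. The `TIMESHEETS` data and the `run_payroll_batch` helper are hypothetical and only for illustration.

```python
from collections import defaultdict

# Hypothetical timesheet records collected over a month:
# (employee, hours worked, hourly rate)
TIMESHEETS = [
    ("ana", 80, 20.0),
    ("ben", 60, 25.0),
    ("ana", 80, 20.0),
]


def run_payroll_batch(records):
    """Process the whole collected group in one pass; return pay per employee."""
    totals = defaultdict(float)
    for employee, hours, rate in records:
        totals[employee] += hours * rate
    return dict(totals)


payroll = run_payroll_batch(TIMESHEETS)
print(payroll)
```

Nothing is computed until the whole group is available, which is why batch fits cases where response time is not important.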
4
Q

Real-time processing

A
  • streams of data
  • processing is done as data is input: filtering, aggregating and preparing data
  • often optimized for analytics and visualization and directly ingested in tools for it -> data-driven
    eg) bank ATMs, control systems, social media
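A minimal sketch of processing data as it is input, assuming a stream of hypothetical ATM withdrawal events. Python generators stand in for a real streaming system here; the `filter_valid` and `running_total` names are illustrative only.

```python
def filter_valid(events):
    """Filtering: drop malformed events as they arrive."""
    for event in events:
        amount = event.get("amount")
        if amount is not None and amount > 0:
            yield event


def running_total(events):
    """Aggregating: emit an updated total after every event, not after a batch."""
    total = 0
    for event in events:
        total += event["amount"]
        yield total


# Hypothetical withdrawals arriving one at a time
incoming = iter([{"amount": 50}, {"amount": None}, {"amount": 20}, {"amount": 100}])
totals = list(running_total(filter_valid(incoming)))
print(totals)
```

Each event updates the result immediately, so a dashboard reading `running_total` would always see the latest value.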
4
Q

Parallel computing

A
  • splits larger tasks into multiple subtasks and executes them at the same time
  • reduces execution time
  • multiple processors working within a single machine
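A minimal sketch of splitting one task into subtasks that run concurrently on a single machine. It uses Python's standard `concurrent.futures` thread pool to keep the example short; for CPU-bound work in CPython, `ProcessPoolExecutor` would be the usual choice for true parallelism. The helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, n_chunks):
    """Split the larger task's input into roughly equal subtasks."""
    if not data:
        return []
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(numbers):
    """The subtask each worker executes."""
    return sum(x * x for x in numbers)


def parallel_sum_of_squares(data, workers=4):
    """Run the subtasks at the same time and combine their partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunk(data, workers)))
    return sum(partials)


result = parallel_sum_of_squares(list(range(1000)))
print(result)
```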
5
Q

Distributed computing

A
  • splits larger tasks into subtasks and executes them on separate machines networked together as a cluster
5
Q

Hadoop Ecosystem

A

system composed of several components that cover data ingestion, processing, analysis, exploration and storage

5
Q

Sqoop

A
  • interface application for transferring structured data between relational databases and Hadoop -> data ingestion
  • can import and export
6
Q

Flume

A
  • collects large amounts of semi-structured and unstructured streaming data from multiple sources
7
Q

Kafka

A
  • streaming platform that handles constant influx of data and processes it incrementally and sequentially
  • used to build real-time streaming pipelines
  • combines messaging, storage and stream processing -> storage and analysis of both historical and real-time data
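A toy in-memory sketch of why combining messaging with storage serves both historical and real-time data: records go into an append-only log, and each consumer tracks its own offset. This is not the Kafka client API; `TopicLog` and `Consumer` are hypothetical classes for illustration.

```python
class TopicLog:
    """Toy append-only log (a stand-in for one topic partition)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store the record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset, in order."""
        return self._records[offset:]


class Consumer:
    """Reads sequentially and remembers how far it has read."""

    def __init__(self, log, start_offset=0):
        self.log = log
        self.offset = start_offset

    def poll(self):
        records = self.log.read_from(self.offset)
        self.offset += len(records)
        return records


log = TopicLog()
log.append("click:home")
log.append("click:cart")

live = Consumer(log, start_offset=2)    # only interested in new records
replay = Consumer(log, start_offset=0)  # re-reads the stored history

log.append("click:checkout")
live_records = live.poll()
replayed_records = replay.poll()
print(live_records, replayed_records)
```

Because the log keeps old records, the same data can feed a real-time consumer and a later historical analysis.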
8
Q

Storm

A
  • real-time big data processing system
  • handles a constant influx of data and easily processes unbounded streams of data
9
Q

Storm vs MapReduce?

A
10
Q

Hadoop MapReduce

A
  • software framework and programming model for writing applications that process data in parallel on clusters of commodity hardware
  • fault-tolerant and reliable
  • divide and conquer -> only for batch workloads
  • disk-based processing
    2 phases:
  • Map: splitting and mapping
  • Reduce: shuffling and reducing
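The two phases can be sketched with the classic word-count example. This is a single-process imitation in plain Python, not Hadoop code; the function names and the way the input is split are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the grouped values for one key into a single result."""
    return key, sum(values)


def word_count(document, n_splits=2):
    lines = document.splitlines()
    # Splitting: divide the input so each split could go to a different mapper
    splits = [" ".join(lines[i::n_splits]) for i in range(n_splits)]
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count("big data\nbig clusters\ndata data")
print(counts)
```

In a real cluster each split and each key group would be handled on a different node, with intermediate results written to disk between the phases.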
11
Q

Spark

A
  • for large-scale data processing
  • good for batch and stream data
  • in-memory processing -> fast
  • supports large-scale data science projects, SQL analytics and ML
  • many languages: python, SQL, Java, R, etc
12
Q

MapReduce vs Spark

A
  • Spark can be up to 100x faster due to in-memory processing
  • MR only batch, while spark both real-time and batch
  • spark good for ML
  • MR low cost, spark high cost
  • spark easy to combine databases
  • MR linear processing -> slow
13
Q

Pig and Hive

A

data-analysis platforms on top of MapReduce

14
Q

YARN, Oozie, Zookeeper

A
  1. YARN (Yet Another Resource Negotiator): takes over resource management and job scheduling from MapReduce
  2. Apache Oozie: schedules and manages workflows in a Hadoop environment so that jobs run in the desired order
  3. Apache ZooKeeper: open-source server that enables reliable distributed coordination
15
Q

Explain Hadoop Ecosystem

A

?

16
Q

BD with AWS

A

AWS has ecosystem with analytical solutions specifically for growing amount of data

17
Q

Application: Clickstream analysis through AWS

A
  1. clickstream data sent to Kinesis Stream
  2. stored and exposed for processing
  3. custom application programmed on Kinesis makes real-time recommendations
  4. output to user who sees personalized content suggestions
18
Q

Application: Data Warehousing through AWS

A
  1. data is uploaded to S3
  2. EMR is used to transform and clean data
  3. cleaned data is loaded back into S3
  4. loaded into Redshift where it is parallelized for fast analytics
  5. analysed and visualized with Quicksight
19
Q

Smart Applications through AWS

A
  1. Amazon Kinesis receives data
    (2. AWS Lambda is used to run code that coordinates the data flow)
  2. an Amazon Machine Learning model is used for real-time predictions
  3. Amazon SNS is used to notify customer support agents
20
Q

Amazon S3

A

Amazon Simple Storage Service
- object storage service offering industry-leading scalability, data availability, security and performance

21
Q

Amazon EMR

A

highly distributed computing framework for processing and storing big data
- Apache Hadoop on EMR allows Hive, Pig and Spark to run on top

22
Q

Amazon Kinesis

A

data-streaming platform that makes it easy to load and analyze streaming data
eg) site clicks, IoT data, data lake

23
Q

AWS Lambda

A

serverless, event-driven computing service to run code for any type of application or backend service without provisioning or managing servers

24
Q

Amazon ML

A

makes it easy for anyone to use predictive analytics, ML functions and GenAI

25
Q

Amazon DynamoDB

A

fully managed NoSQL database service providing fast and predictable performance with scalability

26
Q

Amazon Redshift

A

fast, large-scale and fully managed cloud data warehouse service to analyze structured or semi-structured data efficiently at a lower price