L8 Flashcards
def BD processing
a set of techniques or programming models for accessing large-scale data and extracting useful information that supports and informs decisions
-> in BD, processing often happens before storing the data
2 Types of data processing
- centralized data processing: all data is processed on a single central system
- distributed data processing: data and processing are distributed across different physical locations
Batch processing
- computer processes a number of tasks that have been collected into a group, running them without user interaction in non-stop, sequential order
Pros: - good when response time is not important
- suitable for large data volume
- fast, inexpensive and accurate
- offline
- query-driven: works on static, historical data and is more about fact finding
eg) monthly payroll system, credit card billing system
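A minimal Python sketch of the batch idea above, using hypothetical payroll records that are collected into a group and then processed in one non-stop, sequential run:
```python
# Hypothetical payroll records, collected first and then processed as one batch.
payroll_batch = [
    {"employee": "A", "hours": 160, "rate": 20.0},
    {"employee": "B", "hours": 152, "rate": 25.0},
]

def run_monthly_payroll(batch):
    # Process the whole collected group in one non-stop, sequential pass.
    return [{"employee": r["employee"], "pay": r["hours"] * r["rate"]} for r in batch]

print(run_monthly_payroll(payroll_batch))
```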
Real-time processing
- streams of data
- processing is done as data is input: filtering, aggregating and preparing data
- output is often optimized for analytics and visualization and ingested directly into such tools -> data-driven
eg) bank ATMs, control systems, social media
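A minimal Python sketch of processing data as it arrives, filtering and aggregating per event; the generator below is a hypothetical stand-in for a real stream source:
```python
import random

def event_stream(n=10):
    # Hypothetical stand-in for a real-time source (ATM, sensor, social feed).
    for _ in range(n):
        yield {"user": random.randint(1, 3), "clicks": random.randint(0, 5)}

running_total = 0
for event in event_stream():
    if event["clicks"] == 0:           # filtering as data is input
        continue
    running_total += event["clicks"]   # incremental aggregation
    print(f"user {event['user']} -> running click total: {running_total}")
```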
Parallel computing
- splits up a larger task into multiple subtasks and executes them at the same time
- reduces execution time
- uses multiple processors/cores within a single machine
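A minimal sketch of parallel computing on a single machine with Python's multiprocessing module; the sum-of-squares task is just an example workload:
```python
from multiprocessing import Pool

def subtask(chunk):
    # One subtask: sum of squares over its slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # split the larger task into subtasks
    with Pool(processes=4) as pool:
        partials = pool.map(subtask, chunks)  # subtasks execute at the same time
    print(sum(partials))
```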
Distributed computing
- splits up a larger task into subtasks and executes them on separate machines networked together as a cluster
Hadoop Ecosystem
system comprised of several components that cover the different aspects of data ingestion, processing, analysis, exploration and storage
Sqoop
- interface application for transferring structured data between relational databases and Hadoop -> data ingestion
- can import and export
Flume
- collects large amounts of semi-structured and unstructured streaming data from multiple sources
Kafka
- streaming platform that handles constant influx of data and processes it incrementally and sequentially
- used to build real-time streaming pipelines
- combines messaging, storage and stream processing -> storage and analysis of both historical and real-time data
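A minimal sketch of producing to and consuming from Kafka with the kafka-python package; the broker address and the "clicks" topic name are assumptions for illustration:
```python
from kafka import KafkaConsumer, KafkaProducer

# Append one event to the "clicks" topic (assumes a broker on localhost:9092).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": 1, "page": "/home"}')
producer.flush()

# Read the topic back incrementally, in order, from the beginning.
consumer = KafkaConsumer("clicks", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```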
Storm
- real-time big data processing system
- handles a constant influx of data and easily processes unbounded streams of data
Storm vs MapReduce?
Hadoop MapReduce
- software framework and programming model that can process data in parallel on clusters of commodity hardware
- fault-tolerant and reliable
- divide and conquer -> only for batch workloads
- disk-based processing
2 phases: - Map: splitting and mapping
- Reduce: shuffling and reducing
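A minimal word-count sketch of the two phases above, simulated locally in Python (a real MapReduce job would run map and reduce tasks over HDFS blocks across the cluster):
```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: split each line and emit (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: sort/group the pairs by key; Reduce: sum the counts per word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data batch processing", "big data stream processing"]
print(dict(reduce_phase(map_phase(lines))))
```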
Spark
- for large-scale data processing
- good for batch and stream data
- in-memory processing -> fast
- supports large-scale data science projects, SQL analytics and ML
- many languages: Python, SQL, Java, R, etc
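A minimal PySpark sketch (assumes the pyspark package and a local Spark installation; the events.csv file and its columns are hypothetical):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("l8-demo").getOrCreate()

# Read a (hypothetical) batch source into a distributed, in-memory DataFrame.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("events")

# SQL analytics in the same engine.
spark.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user").show()

spark.stop()
```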
MapReduce vs Spark
- Spark is up to 100x faster due to in-memory processing
- MR only batch, while spark both real-time and batch
- spark good for ML
- MR low cost, spark high cost
- Spark integrates easily with different databases and data sources
- MR linear processing -> slow
Pig and Hive
data-analysis platforms on top of MapReduce (Hive offers SQL-like querying, Pig a high-level scripting language)
YARN, Oozie, Zookeeper
- YARN (Yet Another Resource Negotiator): takes over resource management and job scheduling from MapReduce
- Apache Oozie: manages workflows in the Hadoop environment, running jobs in the desired order
- Apache ZooKeeper: open-source server that enables reliable distributed coordination
Explain Hadoop Ecosystem
?
BD with AWS
AWS offers an ecosystem of analytical solutions specifically for growing amounts of data
Application: Clickstream analysis through AWS
- clickstream data sent to Kinesis Stream
- stored and exposed for processing
- custom application programmed on Kinesis makes real-time recommendations
- output to user who sees personalized content suggestions
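A minimal boto3 sketch of the first step above, sending a click event into Kinesis (assumes configured AWS credentials; the stream name and region are hypothetical):
```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"user_id": 42, "page": "/products/123", "action": "click"}
kinesis.put_record(
    StreamName="clickstream-demo",            # hypothetical stream name
    Data=json.dumps(event).encode(),
    PartitionKey=str(event["user_id"]),
)
```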
Application: Data Warehousing through AWS
- data is uploaded to S3
- EMR is used to transform and clean data
- cleaned data is loaded back into S3
- loaded into Redshift, where it is parallelized for fast analytics
- analysed and visualized with QuickSight
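A minimal boto3 sketch of the load steps above: cleaned data goes back to S3 and is then COPYed into Redshift via the Redshift Data API (bucket, cluster, table and IAM role names are hypothetical):
```python
import boto3

# Upload the cleaned/transformed output back to S3.
s3 = boto3.client("s3")
s3.upload_file("cleaned_events.csv", "demo-warehouse-bucket", "staging/cleaned_events.csv")

# Load the staged file into Redshift, where it is distributed for fast analytics.
redshift = boto3.client("redshift-data")
redshift.execute_statement(
    ClusterIdentifier="demo-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="""
        COPY events
        FROM 's3://demo-warehouse-bucket/staging/cleaned_events.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/demo-redshift-copy'
        CSV IGNOREHEADER 1;
    """,
)
```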
Smart Applications through AWS
- Amazon Kinesis receives data
- AWS Lambda runs code that coordinates the data flow
- an Amazon Machine Learning model makes real-time predictions
- Amazon SNS is used to notify customer support agents
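A minimal sketch of the Lambda piece of this flow: decode a Kinesis record, apply a placeholder prediction, and notify support via SNS (the topic ARN and the scoring logic are hypothetical):
```python
import base64
import json
import boto3

sns = boto3.client("sns")

def lambda_handler(event, context):
    for record in event["Records"]:                    # batch of Kinesis records
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        risk = payload.get("risk_score", 0.0)          # placeholder for a real ML prediction
        if risk > 0.8:
            sns.publish(
                TopicArn="arn:aws:sns:eu-west-1:123456789012:support-alerts",
                Message=json.dumps(payload),
            )
    return {"processed": len(event["Records"])}
```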
Amazon S3
Amazon Simple Storage Service
- object storage service offering industry-leading scalability, data availability, security and performance
Amazon EMR
highly distributed computing framework for processing and storing big data
- runs Apache Hadoop and allows Hive, Pig and Spark to run on top
Amazon Kinesis
data-streaming platform that makes it easy to collect, process and analyze streaming data
eg) site clicks, IoT data, data lake
AWS Lambda
serverless, event-driven computing service to run code for any type of application or backend service without provisioning or managing servers
Amazon ML
makes it easy for anyone to use predictive analytics, ML functions and GenAI
Amazon DynamoDB
fully managed NoSQL database service providing fast and predictable performance with scalability
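A minimal boto3 sketch of DynamoDB writes and reads (assumes an existing table named "user-profiles" with partition key "user_id"; both names are hypothetical):
```python
import boto3

table = boto3.resource("dynamodb").Table("user-profiles")

# Single-item write and read with fast, predictable performance.
table.put_item(Item={"user_id": "42", "plan": "premium", "clicks": 17})
response = table.get_item(Key={"user_id": "42"})
print(response.get("Item"))
```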
Amazon Redshift
fast, petabyte-scale, fully managed cloud data warehouse service to analyze structured or semi-structured data efficiently at a lower price