L8 Flashcards
def BD processing
a set of techniques or programming models for accessing large-scale data and extracting useful information that supports and informs decisions
-> in BD, processing often happens before storing the data
2 Types of data processing
- centralized data processing: all data is processed on a single central system
- distributed data processing: data and processing are distributed across different physical locations
Batch processing
- computer processes a number of tasks that have been collected into a group, running them without user interaction in non-stop, sequential order
Pros: - good when response time is not important
- suitable for large data volume
- fast, inexpensive and accurate
- offline
- query-driven: works on static, historical data and is more about fact finding
eg) monthly payroll system, credit card billing system
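A minimal Python sketch of the batch idea above, using hypothetical payroll records that are collected into a group and then processed in one non-stop, sequential run:
```python
# Hypothetical payroll records, collected first and then processed as one batch.
payroll_batch = [
    {"employee": "A", "hours": 160, "rate": 20.0},
    {"employee": "B", "hours": 152, "rate": 25.0},
]

def run_monthly_payroll(batch):
    # Process the whole collected group in one non-stop, sequential pass.
    return [{"employee": r["employee"], "pay": r["hours"] * r["rate"]} for r in batch]

print(run_monthly_payroll(payroll_batch))
```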
Real-time processing
- streams of data
- processing is done as data is input: filtering, aggregating and preparing data
- output is often optimized for analytics and visualization and ingested directly into such tools -> data-driven
eg) bank ATMs, control systems, social media
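A minimal Python sketch of processing data as it arrives, filtering and aggregating per event; the generator below is a hypothetical stand-in for a real stream source:
```python
import random

def event_stream(n=10):
    # Hypothetical stand-in for a real-time source (ATM, sensor, social feed).
    for _ in range(n):
        yield {"user": random.randint(1, 3), "clicks": random.randint(0, 5)}

running_total = 0
for event in event_stream():
    if event["clicks"] == 0:           # filtering as data is input
        continue
    running_total += event["clicks"]   # incremental aggregation
    print(f"user {event['user']} -> running click total: {running_total}")
```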
Parallel computing
- splits up a larger task into multiple subtasks and executes them at the same time
- reduces execution time
- uses multiple processors/cores within a single machine
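A minimal sketch of parallel computing on a single machine with Python's multiprocessing module; the sum-of-squares task is just an example workload:
```python
from multiprocessing import Pool

def subtask(chunk):
    # One subtask: sum of squares over its slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # split the larger task into subtasks
    with Pool(processes=4) as pool:
        partials = pool.map(subtask, chunks)  # subtasks execute at the same time
    print(sum(partials))
```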
Distributed computing
- splits up a larger task into subtasks and executes them on separate machines networked together as a cluster
Hadoop Ecosystem
system comprised of several components that cover the different aspects of data ingestion, processing, analysis, exploration and storage
Sqoop
- interface application for transferring structured data between relational databases and Hadoop -> data ingestion
- can import and export
Flume
- collects large amounts of semi-structured and unstructured streaming data from multiple sources
Kafka
- streaming platform that handles constant influx of data and processes it incrementally and sequentially
- used to build real-time streaming pipelines
- combines messaging, storage and stream processing -> storage and analysis of both historical and real-time data
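A minimal sketch of producing to and consuming from Kafka with the kafka-python package; the broker address and the "clicks" topic name are assumptions for illustration:
```python
from kafka import KafkaConsumer, KafkaProducer

# Append one event to the "clicks" topic (assumes a broker on localhost:9092).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": 1, "page": "/home"}')
producer.flush()

# Read the topic back incrementally, in order, from the beginning.
consumer = KafkaConsumer("clicks", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```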
Storm
- real-time big data processing system
- handles a constant influx of data and easily processes unbounded streams of data
Storm vs MapReduce?
Hadoop MapReduce
- software framework and programming model that can process data in parallel on clusters of commodity hardware
- fault-tolerant and reliable
- divide and conquer -> only for batch workloads
- disk-based processing
2 phases: - Map: splitting and mapping
- Reduce: shuffling and reducing
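A minimal word-count sketch of the two phases above, simulated locally in Python (a real MapReduce job would run map and reduce tasks over HDFS blocks across the cluster):
```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: split each line and emit (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: sort/group the pairs by key; Reduce: sum the counts per word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data batch processing", "big data stream processing"]
print(dict(reduce_phase(map_phase(lines))))
```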
Spark
- for large-scale data processing
- good for batch and stream data
- in-memory processing -> fast
- supports large-scale data science projects, SQL analytics and ML
- many languages: Python, SQL, Java, R, etc
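A minimal PySpark sketch (assumes the pyspark package and a local Spark installation; the events.csv file and its columns are hypothetical):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("l8-demo").getOrCreate()

# Read a (hypothetical) batch source into a distributed, in-memory DataFrame.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("events")

# SQL analytics in the same engine.
spark.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user").show()

spark.stop()
```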
MapReduce vs Spark
- Spark is up to 100x faster due to in-memory processing
- MR only batch, while spark both real-time and batch
- spark good for ML
- MR low cost, spark high cost
- Spark integrates easily with different databases and data sources
- MR linear processing -> slow
Pig and Hive
data-analysis platforms on top of MapReduce (Hive offers SQL-like querying, Pig a high-level scripting language)
YARN, Oozie, Zookeeper
- YARN (Yet Another Resource Negotiator): takes over resource management and job scheduling from MapReduce
- Apache Oozie: manages workflows in the Hadoop environment, running jobs in the desired order
- Apache ZooKeeper: open-source server that enables reliable distributed coordination
Explain Hadoop Ecosystem
?
BD with AWS
AWS offers an ecosystem of analytical solutions specifically for growing amounts of data
Application: Clickstream analysis through AWS
- clickstream data sent to Kinesis Stream
- stored and exposed for processing
- custom application programmed on Kinesis makes real-time recommendations
- output to user who sees personalized content suggestions
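A minimal boto3 sketch of the first step above, sending a click event into Kinesis (assumes configured AWS credentials; the stream name and region are hypothetical):
```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"user_id": 42, "page": "/products/123", "action": "click"}
kinesis.put_record(
    StreamName="clickstream-demo",            # hypothetical stream name
    Data=json.dumps(event).encode(),
    PartitionKey=str(event["user_id"]),
)
```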
Application: Data Warehousing through AWS
- data is uploaded to S3
- EMR is used to transform and clean data
- cleaned data is loaded back into S3
- loaded into Redshift, where it is parallelized for fast analytics
- analysed and visualized with QuickSight
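A minimal boto3 sketch of the load steps above: cleaned data goes back to S3 and is then COPYed into Redshift via the Redshift Data API (bucket, cluster, table and IAM role names are hypothetical):
```python
import boto3

# Upload the cleaned/transformed output back to S3.
s3 = boto3.client("s3")
s3.upload_file("cleaned_events.csv", "demo-warehouse-bucket", "staging/cleaned_events.csv")

# Load the staged file into Redshift, where it is distributed for fast analytics.
redshift = boto3.client("redshift-data")
redshift.execute_statement(
    ClusterIdentifier="demo-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="""
        COPY events
        FROM 's3://demo-warehouse-bucket/staging/cleaned_events.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/demo-redshift-copy'
        CSV IGNOREHEADER 1;
    """,
)
```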
Smart Applications through AWS
- Amazon Kinesis receives data
- AWS Lambda runs code that coordinates the data flow
- an Amazon Machine Learning model makes real-time predictions
- Amazon SNS is used to notify customer support agents
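A minimal sketch of the Lambda piece of this flow: decode a Kinesis record, apply a placeholder prediction, and notify support via SNS (the topic ARN and the scoring logic are hypothetical):
```python
import base64
import json
import boto3

sns = boto3.client("sns")

def lambda_handler(event, context):
    for record in event["Records"]:                    # batch of Kinesis records
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        risk = payload.get("risk_score", 0.0)          # placeholder for a real ML prediction
        if risk > 0.8:
            sns.publish(
                TopicArn="arn:aws:sns:eu-west-1:123456789012:support-alerts",
                Message=json.dumps(payload),
            )
    return {"processed": len(event["Records"])}
```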
Amazon S3
Amazon Simple Storage Service
- object storage service offering industry-leading scalability, data availability, security and performance
Amazon EMR
highly distributed computing framework for processing and storing big data
- runs Apache Hadoop and allows Hive, Pig and Spark to run on top
Amazon Kinesis
data-streaming platform that makes it easy to collect, process and analyze streaming data
eg) site clicks, IoT data, data lake
AWS Lambda
serverless, event-driven computing service to run code for any type of application or backend service without provisioning or managing servers
Amazon ML
makes it easy for anyone to use predictive analytics, ML functions and GenAI
Amazon DynamoDB
fully managed NoSQL database service providing fast and predictable performance with scalability
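A minimal boto3 sketch of DynamoDB writes and reads (assumes an existing table named "user-profiles" with partition key "user_id"; both names are hypothetical):
```python
import boto3

table = boto3.resource("dynamodb").Table("user-profiles")

# Single-item write and read with fast, predictable performance.
table.put_item(Item={"user_id": "42", "plan": "premium", "clicks": 17})
response = table.get_item(Key={"user_id": "42"})
print(response.get("Item"))
```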
Amazon Redshift
fast, petabyte-scale, fully managed cloud data warehouse service to analyze structured or semi-structured data efficiently at a lower price