L8 Flashcards
1
Q
Definition of Big Data (BD) processing
A
a set of techniques or programming models for accessing large-scale data and extracting useful information to support decision-making
-> in BD, processing happens before storing
2
Q
2 Types of data processing
A
- centralized data processing
- distributed data processing: distributed across different physical locations
3
Q
Batch processing
A
- the computer processes a number of tasks that have been collected into a group, often simultaneously, in a non-stop, sequential order
Pros: - good when response time is not important
- suitable for large data volumes
- fast, inexpensive and accurate
- offline
- query-driven: the data is static and the focus is on historical fact-finding
eg) monthly payroll system, credit card billing system
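A minimal sketch of the payroll example as a batch job, assuming the records have already been collected before processing starts. The timesheet data, field order and function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical timesheet records collected over the month:
# (employee, hours_worked, hourly_rate)
TIMESHEETS = [
    ("alice", 80, 30.0),
    ("bob", 75, 20.0),
    ("alice", 80, 30.0),
    ("bob", 85, 20.0),
]


def run_payroll_batch(records):
    """Process the whole collected batch in one pass and return pay per employee."""
    pay = defaultdict(float)
    for employee, hours, rate in records:
        pay[employee] += hours * rate
    return dict(pay)


if __name__ == "__main__":
    # Nothing is computed until the full batch is available (offline processing).
    print(run_payroll_batch(TIMESHEETS))
```

The result only exists once the whole batch has run, which is why batch processing fits cases where response time is not important.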
4
Q
Real-time processing
A
- streams of data
- processing is done as the data is input: filtering, aggregating and preparing data
- often optimized for analytics and visualization, with results ingested directly into the corresponding tools -> data-driven
eg) bank ATMs, control systems, social media
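A minimal sketch of "processing as data is input", assuming a stream of numeric events (for example hypothetical ATM withdrawal amounts). The function name and the filter threshold are illustrative, not taken from any specific tool.

```python
def running_average(stream, threshold=0.0):
    """Yield an updated average after every event that passes the filter."""
    count = 0
    total = 0.0
    for value in stream:
        if value < threshold:  # filtering: drop invalid events
            continue
        count += 1
        total += value         # aggregating: keep running totals only
        yield total / count    # a result is available right after each event


if __name__ == "__main__":
    for avg in running_average([10, -5, 20, 30]):
        print(avg)
```

Because it is a generator, each event is handled the moment it arrives and the stream never has to be stored in full.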
4
Q
Parallel computing
A
- splitting up a larger task into multiple subtasks and executing them at the same time
- reduces execution time
- multiple processors within a single machine
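A minimal sketch of splitting one task into subtasks and handing them to a pool of workers on a single machine. It uses threads as a stand-in so that it runs anywhere; as an assumption to check for your setup, CPU-bound work in CPython usually needs `ProcessPoolExecutor` (same interface) to actually use several cores. All function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Subtask: handle one slice of the input."""
    return sum(x * x for x in chunk)


def split(items, parts):
    """Split a list into at most `parts` contiguous chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_sum_of_squares(numbers, workers=4, executor_cls=ThreadPoolExecutor):
    """Run the subtasks on a worker pool and combine the partial results."""
    chunks = split(numbers, workers)
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```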
5
Q
Distributed computing
A
- splits up larger tasks into subtasks and executes them on separate machines networked together as a cluster
5
Q
Hadoop Ecosystem
A
a system composed of several components that cover data ingestion, processing, analysis, exploration and storage
5
Q
Sqoop
A
- interface application for transferring structured data between relational databases and Hadoop -> data ingestion
- can import and export
6
Q
Flume
A
- collects large amounts of semi-structured and unstructured streaming data from multiple sources
7
Q
Kafka
A
- streaming platform that handles constant influx of data and processes it incrementally and sequentially
- used to build real-time streaming pipelines
- combines messaging, storage and stream processing -> storage and analysis of both historical and real-time data
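A toy sketch of the idea behind combining messaging and storage: an append-only log where each record gets an offset, so a reader can either replay history or pick up only new records. This is not the Kafka API; the class and method names are invented for illustration.

```python
class ToyLog:
    """Toy append-only log (illustration only, not a Kafka client)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset (position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return all records at or after the given offset, in order."""
        return self._records[offset:]


if __name__ == "__main__":
    log = ToyLog()
    log.append("login")
    log.append("click")
    print(log.read_from(0))  # replay historical data from the start
    log.append("logout")
    print(log.read_from(2))  # read only what arrived after offset 1
```

Reading from offset 0 corresponds to analysing historical data, while reading from the latest offset corresponds to consuming the stream in real time.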
8
Q
Storm
A
- real-time big data processing system
- handles the influx of data and easily processes unbounded streams of data
9
Q
Storm vs MapReduce?
A
10
Q
Hadoop MapReduce
A
- software framework and programming model for writing applications that process data in parallel on clusters of commodity hardware
- fault-tolerant and reliable
- divide and conquer -> only for batch workloads
- disk-based processing
2 phases: - Map: splitting and mapping
- Reduce: shuffling and reducing
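The two phases can be sketched with a single-process word count. This only illustrates the programming model in plain Python and does not use the Hadoop API; on a real cluster each chunk would be mapped on a different node and intermediate results written to disk. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one input split into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the grouped values into one result per key."""
    return key, sum(values)


def word_count(chunks):
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data processing"]))
```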
11
Q
Spark
A
- for large-scale data processing
- good for batch and stream data
- in-memory processing -> fast
- supports large-scale data science projects, SQL analytics and ML
- many languages: Python, SQL, Java, R, etc.
12
Q
MapReduce vs Spark
A
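- Spark can be up to 100x faster for workloads that fit in memory, due to in-memory processing
- MR is batch only, while Spark handles both real-time and batch
- Spark is well suited to ML
- MR is low cost (disk-based), Spark is higher cost (needs a lot of RAM)
- Spark is easy to combine with databases
- MR processes linearly (reads from and writes to disk at each step) -> slow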