Introduction Flashcards
What is Big Data?
Refers to large, complex datasets that exceed the processing capabilities of traditional tools.
What are the 4 V’s of Big Data?
Volume: Large quantity of data; Velocity: Speed of data production, consumption, and analysis; Variety: Structured, unstructured, and multimedia data; Veracity: Trustworthiness and quality of data.
How can Big Data be referred to as a noun and adjective?
As a noun: the data itself, with only a vague boundary between "normal" data and big data; As an adjective: a specific meaning, describing things built for such data (e.g., Big Data tools, Big Data architecture).
Why is there hype around Big Data?
Growth from new data sources; Opportunities for insights; Smarter applications like Google Translate.
What are some success stories of Big Data applications?
Crime prevention, healthcare, finance, astronomy, sports injury prevention.
What are challenges in Big Data acquisition?
Selecting valuable data, filtering, and metadata collection.
What are challenges in Big Data processing?
Parallelization, fault tolerance, scalability.
What frameworks address Big Data processing challenges?
Hadoop and Spark.
What are the three main scenarios for data processing solutions?
Analytics (batch), Interactive (near real-time), Streaming (near real-time).
What is a Data Lake?
A centralized repository for raw data in various formats, processed as needed.
What are NoSQL/NewSQL DBMSs designed for?
Scalability and distributed environments.
What are the types of analytics in Big Data?
Descriptive: Insights into past events; Diagnostic: Explains why events occurred; Predictive: Anticipates future trends; Prescriptive: Recommends actions to leverage or mitigate trends.
What roles exist in Big Data careers?
Data analysts, architects, engineers, scientists.
What skills are required for Big Data careers?
Programming, data management, statistical analysis, domain expertise.
What are the two types of scaling in Big Data infrastructure?
Scale-Up (Vertical) and Scale-Out (Horizontal).
What is SMP architecture and its limitations?
Symmetric MultiProcessing: several processors in one machine share the same memory and I/O; the shared resources become a bottleneck, which limits scalability.
What is MPP architecture and its challenges?
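Massively Parallel Processing: many nodes, each with its own processor and memory, work in parallel over a fast interconnect; challenges are vendor lock-in and limited scalability.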
What is cluster architecture and its advantages/trade-offs?
Many independent machines connected over a network. Advantages: virtually unlimited scalability and no vendor lock-in; Trade-off: slower interconnect speed compared to MPP.
What are the pros and cons of commodity hardware in clusters?
Pros: Cost-effective and scalable; Cons: Failures are common, so the software must detect and recover from them.
What is Lambda Architecture?
Combines Hot Path (real-time processing) and Cold Path (delayed but accurate processing).
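Sketch: the Hot Path / Cold Path split can be illustrated with a toy event counter. This is a minimal single-process illustration, not how any particular framework implements it; the class name `LambdaCounter` and its methods are hypothetical, and in a real system the two paths would run on separate engines.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture: counts events per key (hypothetical example)."""

    def __init__(self):
        self.master_log = []             # append-only record of every event
        self.batch_view = Counter()      # cold path result: accurate but stale
        self.realtime_view = Counter()   # hot path result: fresh, incremental

    def ingest(self, key):
        # Every event is stored for the cold path and also
        # applied immediately on the hot path.
        self.master_log.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # Cold path: recompute the view from the full log, then drop the
        # hot-path counts that the new batch view now covers.
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()

    def query(self, key):
        # Serving: merge the batch view with the events seen since the last batch.
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    counter = LambdaCounter()
    counter.ingest("click")
    counter.ingest("click")
    counter.run_batch()
    counter.ingest("click")
    print(counter.query("click"))
```

A query always merges both views, so results stay current between batch runs while the batch recomputation remains the source of truth.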
What is Kappa Architecture?
Unified stream processing where all events are processed in real-time.
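Sketch: in a Kappa-style design there is a single processing path, and a changed computation is typically handled by replaying the event log through the new logic. The example below is a minimal in-memory illustration under that assumption; `process_stream`, the two update functions, and the sample events are all hypothetical.

```python
def process_stream(events, update, state=None):
    """Fold every event into the state, one at a time, in arrival order."""
    state = {} if state is None else state
    for event in events:
        update(state, event)
    return state


def count_by_user(state, event):
    # First version of the logic: number of events per user.
    state[event["user"]] = state.get(event["user"], 0) + 1


def sum_amount_by_user(state, event):
    # Revised logic: total amount per user.
    state[event["user"]] = state.get(event["user"], 0) + event["amount"]


if __name__ == "__main__":
    # Made-up events standing in for an append-only log.
    log = [
        {"user": "ann", "amount": 5},
        {"user": "bob", "amount": 3},
        {"user": "ann", "amount": 2},
    ]
    print(process_stream(log, count_by_user))
    # Changing the logic means replaying the same log through the same path.
    print(process_stream(log, sum_amount_by_user))
```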
Who introduced MapReduce and what is it used for?
Introduced by Dean & Ghemawat at Google; used for processing large datasets using Map and Reduce functions.
What does the Map function do in MapReduce?
Processes key-value pairs to generate intermediate key-value pairs.
What does the Reduce function do in MapReduce?
Aggregates intermediate values associated with the same key.
How does Hadoop MapReduce handle execution steps?
Input Splitting -> Mapping -> Shuffling & Sorting -> Reducing.
How does MapReduce handle word count as an example task?
Map emits a (word, 1) key-value pair for each word; Reduce sums the counts for each word.
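Sketch: word count with the four execution steps simulated in one Python process. This is a minimal illustration, not Hadoop's API; the names `map_fn`, `reduce_fn`, and `run_word_count` and the `num_splits` parameter are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts that share the same word."""
    return word, sum(counts)


def run_word_count(lines, num_splits=2):
    # Input splitting: divide the input lines into num_splits chunks.
    splits = [lines[i::num_splits] for i in range(num_splits)]

    # Mapping: apply map_fn to every record of every split.
    intermediate = []
    for split in splits:
        for offset, line in enumerate(split):
            intermediate.extend(map_fn(offset, line))

    # Shuffling & sorting: group values by key, then order the keys.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # Reducing: one reduce_fn call per distinct key.
    return dict(reduce_fn(word, groups[word]) for word in sorted(groups))


if __name__ == "__main__":
    print(run_word_count(["big data is big", "data is everywhere"]))
    # → {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network; here everything runs sequentially to keep the steps visible.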