Big Data Technologies Flashcards
What is Big Data?
3 V’s: Volume, Variety, Velocity
Data that is not atomic (i.e. not at the lowest level of granularity).
Data that needs horizontal scaling to handle it - Scale Out not Scale Up
What is the CAP Theorem?
CAP (Brewer's) theorem states: it's impossible for a distributed data store to provide all 3 of these at the same time; you can only have two:
Consistency
Availability
Partition Tolerance: the system keeps working despite a network partition - either a broken network link, or a node whose response is so delayed it is treated as lost.
Explain structured v unstructured data
Structured data is data such as that held in relational DBs, with ACID (Atomicity, Consistency, Isolation, Durability) features.
‘Unstructured’ data is messy data, variety of formats. Data is only really unstructured in counterpoint to structured relational data.
Can adhere to BASE (Basically Available - allow temporary inconsistency so information stays available; Soft state - some data may not persist; Eventual consistency - replicas converge over time).
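A minimal sketch of the BASE idea: a write is accepted on one replica straight away (availability), the other replica is briefly stale (soft state), and a later sync makes them converge (eventual consistency). The replica names and sync step here are illustrative, not any real database's API.

```python
# Toy eventual consistency: write lands on one replica first,
# a background sync later brings the other up to date.
replica_a = {}
replica_b = {}
pending = []  # writes not yet propagated ("soft state")

def write(key, value):
    """Accept the write immediately on one replica (basic availability)."""
    replica_a[key] = value
    pending.append((key, value))

def sync():
    """Propagate queued writes; after this the replicas converge."""
    while pending:
        key, value = pending.pop(0)
        replica_b[key] = value

write("user:1", "Ada")
stale = replica_b.get("user:1")   # None - temporary inconsistency is allowed
sync()
fresh = replica_b.get("user:1")   # now consistent: eventual consistency
```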
What is NoSQL?
Means Not Only SQL, as well as No SQL.
From 2010 on, driven by the idea that relational DBs alone could not handle Big Data.
What are the types of NoSQL DB?
Key-Value: Key Value Pair e.g. Amazon Dynamo
Column Family: Extended Key-Value e.g. Cassandra
Document (store JSON) e.g. MongoDB
Graph DB e.g. Neo4J
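The four data models above can be sketched with plain Python structures. These are illustrative stand-ins only; the real systems (Dynamo, Cassandra, MongoDB, Neo4j) add distribution, indexing, and query engines on top.

```python
import json

# Key-Value: an opaque value looked up by a key.
kv = {"session:42": "opaque-blob"}

# Column Family: row key -> {column: value}; columns can vary per row.
cf = {"user:1": {"name": "Ada", "city": "London"},
      "user:2": {"name": "Alan"}}  # this row has no 'city' column

# Document: nested JSON-like records, queryable by field.
doc = json.loads('{"_id": 1, "name": "Ada", "tags": ["math", "cs"]}')

# Graph: nodes plus typed edges between them.
nodes = {1: "Ada", 2: "Alan"}
edges = [(1, "KNOWS", 2)]
```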
What are NewSQL DBs?
NewSQL systems offer the best of both worlds: the relational data model and ACID transactional consistency of traditional operational databases; the familiarity and interactivity of SQL; and the scalability and speed of NoSQL.
e.g. VoltDB
What is Scale Out v Scale Up?
With large amounts of processing required, instead of getting bigger servers (scale up), use parallel approach of lots of servers (scale out). Instead of adding bigger machines, add more machines.
What is MapReduce (MR) good for?
Solved Problem 1: with lots of servers, the connections between nodes become a bottleneck. MR gives each node its own disk, memory, and CPU (shared nothing) - no issues with message passing between nodes.
Solved Problem 2: disk access is slow. MR spreads the processing across nodes close to the data - much faster and resilient to node loss.
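The MR shape can be shown as a single-process word count: map emits (key, value) pairs, a shuffle groups them by key, reduce folds each group. On a real cluster each phase runs in parallel on the node holding the data; this stdlib sketch just shows the phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "data moves to code"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```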
What is Hadoop?
1st: Processing framework - MapReduce. Splits a task across processors near the data, then assembles the results.
2nd: Storage system - HDFS (Hadoop Distributed File System).
3rd: YARN. Scheduler that allows other frameworks like Spark to run on the cluster - don't need MR.
Batch processing approach.
Is Hadoop any good?
3 Commercial vendors.
MapR has gone bust.
Cloudera and Hortonworks have merged
Hadoop deemed slow and now maybe legacy, but still in use.
What is Pig?
Pig is a high-level query language (Pig Latin) on top of MapReduce. Code is converted to an MR job when run - a replacement for traditional ETL tooling.
If given an unfamiliar file and told to look at it, Pig is good for exploration.
What is Hive?
Designed by Facebook to abstract away from messy MR/Java code.
The Hive language (HQL, Hive Query Language) is converted to an MR job. Allows a schema to be defined on data in HDFS.
Not designed for real-time query response, designed for Big Data queries.
Pig is often used to clean data and prep it for Hive.
What is Spark?
Spark is a cluster computing framework, compatible with Hadoop (runs via YARN), to process iterative jobs and interactive analytics. Doesn’t use MR.
Spark all done in memory => typically faster than running MR jobs.
Written in Scala language
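Spark's advantage on iterative jobs comes from keeping the dataset cached in memory and making many passes over it, instead of re-reading from disk each MapReduce round. A stdlib analogy of that shape (not the PySpark API):

```python
# Dataset loaded once and held in memory - the analogue of a cached RDD.
cached = list(range(10))

total = 0
for _ in range(3):                      # three iterative passes
    total += sum(x * x for x in cached) # no disk re-read between passes
print(total)  # 855 (3 * 285)
```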
What is Apache Airflow?
Airflow is a Workflow Management Platform developed by Airbnb and now open sourced to Apache.
Sort of like a powerful ETL tool/scheduler; uses DAGs (Directed Acyclic Graphs) as one of its core constructs.
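The DAG idea can be sketched with the stdlib's `graphlib`: tasks plus "runs after" edges yield a dependency-respecting execution order. This is not Airflow's own DAG/Operator API, just the underlying concept.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must run after.
dag = {
    "load": {"extract"},
    "transform": {"extract"},
    "report": {"load", "transform"},
}

# A valid execution order: dependencies always come first.
order = list(TopologicalSorter(dag).static_order())
print(order[0], order[-1])  # extract report
```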
What is DataBricks?
A Platform that unifies data science and engineering across the Machine Learning lifecycle from data prep, to experimentation and deployment of ML applications.
Created by founders of Apache Spark