Big Data Technologies Flashcards
What is Big Data?
3 V’s: Volume, Variety, Velocity
Data that is not atomic (i.e. not at the lowest level of granularity).
Data that needs horizontal scaling to handle it - Scale Out not Scale Up
What is the CAP Theorem?
CAP (Brewer's) theorem states: it's impossible for a distributed data store to provide all 3 of these at the same time; you can only have two:
Consistency
Availability
Partition Tolerance: the system keeps working despite a network partition - either a broken network link, or a node whose response is so delayed it is treated as lost.
Explain structured v unstructured data
Structured data is data such as that held in relational DBs, with ACID (Atomicity, Consistency, Isolation, Durability) features.
‘Unstructured’ data is messy data, variety of formats. Data is only really unstructured in counterpoint to structured relational data.
Can adhere to BASE (Basically Available - allow temporary inconsistency so information stays available; Soft state - some data may not persist; Eventual consistency - replicas converge over time).
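A minimal sketch of the BASE idea: a write is accepted on one replica straight away (availability), the other replica is briefly stale (soft state), and a later sync makes them converge (eventual consistency). The replica names and sync step here are illustrative, not any real database's API.

```python
# Toy eventual consistency: write lands on one replica first,
# a background sync later brings the other up to date.
replica_a = {}
replica_b = {}
pending = []  # writes not yet propagated ("soft state")

def write(key, value):
    """Accept the write immediately on one replica (basic availability)."""
    replica_a[key] = value
    pending.append((key, value))

def sync():
    """Propagate queued writes; after this the replicas converge."""
    while pending:
        key, value = pending.pop(0)
        replica_b[key] = value

write("user:1", "Ada")
stale = replica_b.get("user:1")   # None - temporary inconsistency is allowed
sync()
fresh = replica_b.get("user:1")   # now consistent: eventual consistency
```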
What is NoSQL?
Means Not Only SQL, as well as No SQL.
From 2010 on, driven by the idea that relational DBs alone could not handle Big Data.
What are the types of NoSQL DB?
Key-Value: Key Value Pair e.g. Amazon Dynamo
Column Family: Extended Key-Value e.g. Cassandra
Document (store JSON) e.g. MongoDB
Graph DB e.g. Neo4J
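The four data models above can be sketched with plain Python structures. These are illustrative stand-ins only; the real systems (Dynamo, Cassandra, MongoDB, Neo4j) add distribution, indexing, and query engines on top.

```python
import json

# Key-Value: an opaque value looked up by a key.
kv = {"session:42": "opaque-blob"}

# Column Family: row key -> {column: value}; columns can vary per row.
cf = {"user:1": {"name": "Ada", "city": "London"},
      "user:2": {"name": "Alan"}}  # this row has no 'city' column

# Document: nested JSON-like records, queryable by field.
doc = json.loads('{"_id": 1, "name": "Ada", "tags": ["math", "cs"]}')

# Graph: nodes plus typed edges between them.
nodes = {1: "Ada", 2: "Alan"}
edges = [(1, "KNOWS", 2)]
```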
What are NewSQL DBs?
NewSQL systems offer the best of both worlds: the relational data model and ACID transactional consistency of traditional operational databases; the familiarity and interactivity of SQL; and the scalability and speed of NoSQL.
e.g. VoltDB
What is Scale Out v Scale Up?
With large amounts of processing required, instead of getting bigger servers (scale up), use parallel approach of lots of servers (scale out). Instead of adding bigger machines, add more machines.
What is MapReduce (MR) good for?
Solved Problem 1: with lots of servers, the connections between nodes become a bottleneck. MR gives each node its own disk, memory, and CPU (shared nothing) - no issues with message passing between nodes.
Solved Problem 2: disk access is slow. MR spreads the processing across nodes close to the data - much faster and resilient to node loss.
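The MR shape can be shown as a single-process word count: map emits (key, value) pairs, a shuffle groups them by key, reduce folds each group. On a real cluster each phase runs in parallel on the node holding the data; this stdlib sketch just shows the phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "data moves to code"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```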
What is Hadoop?
1st: Processing framework - MapReduce. Splits a task across processors near the data, then assembles the results.
2nd: Storage system - HDFS (Hadoop Distributed File System).
3rd: YARN. Scheduler that allows other frameworks like Spark to run on the cluster - don't need MR.
Batch processing approach.
Is Hadoop any good?
3 Commercial vendors.
MapR has gone bust.
Cloudera and Hortonworks have merged
Hadoop deemed slow and now maybe legacy, but still in use.
What is Pig?
Pig is a high-level query language (Pig Latin) on top of MapReduce. Code is converted to an MR job when run - a replacement for traditional ETL tooling.
If given an unfamiliar file and told to look at it, Pig is good for exploration.
What is Hive?
Designed by Facebook to abstract away from messy MR/Java code.
The Hive language (HQL, Hive Query Language) is converted to an MR job. Allows a schema to be defined on data in HDFS.
Not designed for real-time query response, designed for Big Data queries.
Pig is often used to clean data and prep it for Hive.
What is Spark?
Spark is a cluster computing framework, compatible with Hadoop (runs via YARN), to process iterative jobs and interactive analytics. Doesn’t use MR.
Spark all done in memory => typically faster than running MR jobs.
Written in Scala language
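Spark's advantage on iterative jobs comes from keeping the dataset cached in memory and making many passes over it, instead of re-reading from disk each MapReduce round. A stdlib analogy of that shape (not the PySpark API):

```python
# Dataset loaded once and held in memory - the analogue of a cached RDD.
cached = list(range(10))

total = 0
for _ in range(3):                      # three iterative passes
    total += sum(x * x for x in cached) # no disk re-read between passes
print(total)  # 855 (3 * 285)
```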
What is Apache Airflow?
Airflow is a Workflow Management Platform developed by Airbnb and now open sourced to Apache.
Sort of like a powerful ETL tool/scheduler; uses DAGs (Directed Acyclic Graphs) as one of its core constructs.
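The DAG idea can be sketched with the stdlib's `graphlib`: tasks plus "runs after" edges yield a dependency-respecting execution order. This is not Airflow's own DAG/Operator API, just the underlying concept.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must run after.
dag = {
    "load": {"extract"},
    "transform": {"extract"},
    "report": {"load", "transform"},
}

# A valid execution order: dependencies always come first.
order = list(TopologicalSorter(dag).static_order())
print(order[0], order[-1])  # extract report
```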
What is DataBricks?
A Platform that unifies data science and engineering across the Machine Learning lifecycle from data prep, to experimentation and deployment of ML applications.
Created by founders of Apache Spark