Big Data Technologies Flashcards

1
Q

What is Big Data?

A

3 V’s: Volume, Variety, Velocity

Data that is not atomic (not held at the lowest level of detail).

Data that needs horizontal scaling to handle it: scale out, not scale up.

2
Q

What is the CAP Theorem?

A

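CAP (Brewer's) theorem states that a distributed web service cannot guarantee all 3 of these at the same time; it can only guarantee two:

Consistency (every read sees the most recent write)
Availability (every request gets a non-error response)
Partition Tolerance (the system keeps working when the network pipe between nodes is broken or responses are delayed)

In practice partitions cannot be ruled out, so during a partition the real choice is between consistency and availability.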
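A minimal sketch of that trade-off, assuming a hypothetical two-replica store. The `Replica` class, the `write` function and the `mode` values are invented for illustration and do not come from any real database. During a partition, a CP system rejects a write it cannot replicate, while an AP system accepts it and lets the replicas diverge.

```python
class Replica:
    """Toy in-memory replica; hypothetical, for illustration only."""

    def __init__(self):
        self.data = {}


def write(key, value, local, remote, partitioned, mode):
    """Write to `local` and replicate to `remote` unless partitioned.

    mode="CP": refuse the write during a partition (consistent, not available).
    mode="AP": accept the write locally (available, temporarily inconsistent).
    """
    if partitioned:
        if mode == "CP":
            return False         # unavailable, but the replicas stay in sync
        local.data[key] = value  # available, but the replicas now disagree
        return True
    local.data[key] = value
    remote.data[key] = value
    return True


a, b = Replica(), Replica()
cp_ok = write("x", 1, a, b, partitioned=True, mode="CP")
ap_ok = write("x", 1, a, b, partitioned=True, mode="AP")
```

After the AP write, `a` holds the new value and `b` does not; an eventually consistent system would reconcile the two once the partition heals.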

3
Q

Explain structured v unstructured data

A

Structured data is data with a fixed schema, such as that held in relational DBs, which typically give ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees.

'Unstructured' data is messy data in a variety of formats. Data is only really unstructured in contrast to structured relational data.

Systems that store it can adhere to BASE instead: Basically Available (temporary inconsistency is allowed so that information stays available); Soft state (stored state may change over time, even without new input, as replicas catch up); Eventual consistency (replicas converge once updates stop).

4
Q

What is NoSQL?

A

Read both as "Not Only SQL" and as "No SQL".

From 2010 on, the idea emerged that relational DBs could also handle Big Data.

5
Q

What are the types of NoSQL DB?

A

Key-Value: stores key-value pairs, e.g. Amazon Dynamo

Column Family: extended key-value, e.g. Cassandra

Document: stores JSON-like documents, e.g. MongoDB

Graph: stores nodes and the relationships between them, e.g. Neo4j
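The four models can be pictured with plain Python structures. This is only a rough sketch: every key, name and value below is made up, and real systems add persistence, indexing and distribution on top.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": b"serialized-blob"}

# Column family: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2020-01-01"},
    }
}

# Document: a self-describing, nested JSON-like record.
document = {"_id": 42, "name": "Ada", "tags": ["admin", "dev"]}

# Graph: nodes plus labelled edges between them.
nodes = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
edges = [(1, "FOLLOWS", 2)]


def neighbours(node_id, label):
    """Return ids reachable from node_id over edges with the given label."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == label]
```

The key-value store can only fetch by key, the document and column-family shapes let a query reach inside the record, and the graph shape makes traversals such as `neighbours(1, "FOLLOWS")` the natural operation.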

6
Q

What are NewSQL DBs?

A

NewSQL systems aim to offer the best of both worlds: the relational data model and ACID transactional consistency of traditional operational databases; the familiarity and interactivity of SQL; and the scalability and speed of NoSQL.

e.g. VoltDB

7
Q

What is Scale Out v Scale Up?

A

When a large amount of processing is required, instead of getting bigger servers (scale up), use many servers working in parallel (scale out). Instead of adding bigger machines, add more machines.
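One common way to spread work over many machines is to hash each record key to a node. The sketch below assumes that approach; the function names `node_for` and `partition` and the record keys are hypothetical.

```python
import zlib


def node_for(key, num_nodes):
    """Pick a node by hashing the key (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(keys, num_nodes):
    """Group keys by the node that would store or process them."""
    shards = {n: [] for n in range(num_nodes)}
    for key in keys:
        shards[node_for(key, num_nodes)].append(key)
    return shards


keys = [f"record-{i}" for i in range(1000)]
shards = partition(keys, num_nodes=4)
```

Each node then handles only its own shard. Note that plain modulo hashing moves most keys when the node count changes, which is why production systems often use consistent hashing instead.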

8
Q

What is MapReduce (MR) good for?

A

Solved Problem 1: With lots of servers, the connections between nodes become a problem. MR ensures each node has its own disk, memory and CPU (shared nothing), which minimizes message passing between nodes.

Solved Problem 2: Disk access is slow. MR spreads the data and the processing across many nodes so that disks are read in parallel, which is much faster and is resistant to node loss.
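The programming model itself can be sketched in a few lines of single-process Python, using word count as the usual example. The function names and input chunks are illustrative assumptions; in a real cluster each chunk would be mapped on the node that stores it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


chunks = ["big data big", "data tools"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Because `map_phase` only looks at its own chunk, the chunks can be processed independently, and a chunk lost with a failed node can simply be mapped again elsewhere.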

9
Q

What is Hadoop?

A

1st: Processing framework, MapReduce. Splits a task across processors near the data, then assembles the results.
2nd: Storage system, HDFS (Hadoop Distributed File System).
3rd: YARN. Resource manager and scheduler that lets other frameworks such as Spark run on the cluster, so MR is not required.

Batch processing approach.

10
Q

Is Hadoop any good?

A

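Historically 3 commercial vendors: MapR, Cloudera and Hortonworks.

MapR ran into financial trouble and its assets were acquired by HPE in 2019.
Cloudera and Hortonworks have merged.

Hadoop is deemed slow and is now arguably legacy, but it is still in use.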

11
Q

What is Pig?

A

Pig is a high-level data-flow language (Pig Latin) on top of MapReduce. Scripts are converted to MR jobs when run. It is often used in place of hand-written ETL code.

If you are given a file and told to look at it, Pig is good for exploration.

12
Q

What is Hive?

A

Designed by Facebook to abstract away from messy MR/Java code.

Hive's language (HQL, Hive Query Language) is SQL-like and is converted to MR jobs. Allows a schema to be defined on data in HDFS.

Not designed for real-time query response; designed for batch queries over Big Data.

Pig can be used to clean data and prep it for Hive.

13
Q

What is Spark?

A

Spark is a cluster computing framework, compatible with Hadoop (it can run via YARN), suited to iterative jobs and interactive analytics. It doesn't use MR.

Spark keeps working data in memory where possible, so it is typically faster than running MR jobs.

Written in Scala.

14
Q

What is Apache Airflow?

A

Airflow is a workflow management platform developed by Airbnb and now an open-source Apache project.

Sort of like a powerful ETL scheduler; workflows are defined as DAGs (directed acyclic graphs) of tasks.
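To show what a DAG of tasks means for scheduling, here is a small sketch using only Python's standard library. It is not Airflow's API; the task names are hypothetical, and `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (hypothetical names).
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid run order: every task appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

A scheduler built on this idea runs each task only once everything it depends on has finished. The "acyclic" part matters: a cycle would leave no valid order to run the tasks in.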

15
Q

What is Databricks?

A

A platform that unifies data science and engineering across the Machine Learning lifecycle, from data prep to experimentation and deployment of ML applications.

Founded by the creators of Apache Spark.
