Week 6 - Apache Spark Flashcards

1
Q

What does RDD stand for?

A

Resilient Distributed Dataset

2
Q

Is an RDD read-only?

A

Yes

3
Q

RDDs can only be created through what? (2)

A

1) Data in stable storage

2) Other RDDs
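A minimal Scala sketch of both creation paths (the HDFS path and names are placeholders, not from the course):

import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // 1) From data in stable storage (placeholder path)
    val fromStorage = sc.textFile("hdfs://namenode/logs.txt")

    // 2) From another RDD: transformations return a new RDD,
    //    which is also why RDDs can stay read-only
    val fromOtherRdd = fromStorage.map(_.toUpperCase)

    println(fromOtherRdd.count()) // action forces evaluation

    sc.stop()
  }
}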

4
Q

An RDD is a restricted form of distributed shared ____?

A

Memory (a cached dataset acting as shared memory)

5
Q

An RDD contains which pieces of the dataset?

A

Atomic pieces of the dataset

6
Q

An RDD contains dependencies on?

A

Parent RDDs

for fault tolerance

7
Q

How does an RDD compute its dataset?

A

Based on its parent RDDs (for fault tolerance), using metadata about its partitioning scheme and data placement

8
Q

RDDs are read-only and ____?

A

Partitioned collections of records

9
Q

Two important features of RDDs and Apache Spark

A

1) Fault Tolerance

2) Lazy Evaluation

10
Q

Describe RDD Fault Tolerance

A

Achieved through lineage retrieval: a lost partition can be recomputed from its parent RDDs

11
Q

Describe RDD Lazy Evaluation

A

An RDD is not materialized until an action, such as a reduce-like or persist operation, requires meaningful output

12
Q

What two classes of operations can you do on RDDs?

A

1) Transformations

2) Actions

13
Q

RDD Transformations

A

Build RDDs through operations on other RDDs

1) map, filter, join
2) Lazy operations

14
Q

RDD Actions

A

1) count, collect, save

2) Trigger execution
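A short Scala sketch of the lazy-transformation / eager-action split, assuming a local Spark context (standard API, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

object LazyVsEager {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

    val nums = sc.parallelize(1 to 1000)

    // Transformations: lazily build new RDDs, nothing executes yet
    val evens   = nums.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // Action: triggers execution of the whole lineage
    println(doubled.count()) // 500

    sc.stop()
  }
}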

15
Q

HDFS is?

A

1) Hadoop Distributed File System
2) A distributed file system
3) Can contain text files, log files, errors

16
Q

How to find errors in HDFS files?

A

file.filter(_.contains("ERROR"))
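A runnable Scala version of that one-liner (the path and variable names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object FindErrors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("find-errors").setMaster("local[*]"))

    val file   = sc.textFile("hdfs://namenode/logs/app.log") // placeholder path
    val errors = file.filter(_.contains("ERROR"))            // lazy transformation

    errors.persist()        // cache for repeated queries
    println(errors.count()) // action: triggers the actual scan

    sc.stop()
  }
}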

17
Q

DAG Scheduler

A

Partitions the DAG into efficient stages (think narrow and wide dependencies)

18
Q

Narrow Dependencies

A

Transformation whose output needs input from only one partition (very little communication)

1) map
2) union

19
Q

Wide Dependencies

A

Multiple dependencies: the output needs data from other partitions

1) groupByKey
2) join with inputs not co-partitioned
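A small Scala sketch contrasting the two, assuming the usual sc SparkContext:

// Narrow: each output partition depends on exactly one input partition
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val mapped = pairs.map { case (k, v) => (k, v * 10) } // no shuffle

// Wide: values for one key may live on any partition,
// so a shuffle moves data across the cluster (stage boundary)
val grouped = pairs.groupByKey()
println(grouped.collect().toList)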

20
Q

Should wide dependencies come early or late in the DAG?

A

Late (less data to shuffle by then)

21
Q

Hadoop scheduler

A

Only 2 stages:

1) Map
2) Reduce

22
Q

Hadoop: where is data stored?

A

Assumes all data is on disk (intermediate data has to be on disk)

Why? Fault tolerance

23
Q

Hadoop API

A

Only Map and Reduce procedural programming

24
Q

Hadoop storage

A

Only the HDFS file system (now extended)

25
Q

Does Hadoop use more or less memory than Spark?

A

Less; it stores data locally on disk

26
Q

Apache Spark has which language APIs?

A

1) Java
2) Scala
3) Python

27
Q

4 Main Apache Spark Libraries

A

1) Spark SQL
2) Spark Streaming
3) MLlib (machine learning)
4) GraphX

28
Q

What is a DataFrame?

A

Looks like a table (can run SQL operations on it)
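A minimal Spark SQL sketch in Scala (the column names and data are invented for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame looks like a table: named columns + rows
val people = Seq(("Ann", 34), ("Bob", 28)).toDF("name", "age")

// Register it as a view and run SQL directly against it
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()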

29
Q

Which Spark library is used for Google PageRank and Shortest Path?

A

GraphX

30
Q

Which Spark library is used for streaming?

A

Spark Streaming (process data in real time)

31
Q

Spark DStream

A

An abstraction that represents a streaming data source.

32
Q

What does a Spark DStream do?

A

1) Chops the data into batches of x seconds
2) Processes each batch like an RDD
3) Returns the processed results batch by batch
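A minimal DStream word count in Scala, assuming a text source on a local socket (host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1)) // 1) chop the stream into 1-second batches

val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _) // 2) each batch is processed like an RDD

counts.print() // 3) results come back batch by batch

ssc.start()
ssc.awaitTermination()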

33
Q

Spark DStream batch sizes

A

1) Batches as low as 1/2 second

2) Latency of about 1 second

34
Q

Apache Kafka

A

Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
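A sketch of consuming a Kafka topic from Spark Streaming via the spark-streaming-kafka-0-10 integration, reusing the ssc from the DStream sketch above (broker address, topic, and group id are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092", // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group"
)

// Each record's value becomes one element of the DStream
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("events"), kafkaParams))

stream.map(_.value).print()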

35
Q

DStream - Stateful Operations / Transformations

A

1) Window (timed window, counting, finding first or last)
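A windowed-count sketch in Scala, reusing the lines DStream from the earlier sketch (the window and slide lengths are illustrative; both must be multiples of the batch interval):

// Count lines seen over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = lines.countByWindow(Seconds(30), Seconds(10))
windowedCounts.print()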

36
Q

Apache Hadoop - Apache Hive

A

SQL like - A data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL)

37
Q

Apache Hadoop - Apache Pig

A

A high-level data-flow language and execution framework for parallel computation (less rigid than normal Hadoop)

38
Q

Apache Hadoop - Apache HBase

A

NoSQL database. (based on BigTable)

A scalable, distributed database that supports structured data storage for large tables.

39
Q

Apache Hadoop - Apache Zookeeper

A

Basically Ansible: a high-performance coordination service for distributed architectures

40
Q

Apache Mahout

A

Machine learning algorithms: a distributed linear algebra framework and a mathematically expressive Scala DSL

41
Q

What does GeoSpark do?

A

It takes the RDD layer (generic data processing) and extends it with spatial data processing operations.

42
Q

Spatial query processing layer

A

Out-of-the-box implementations of the de facto spatial queries:

1) Range query
2) KNN (k-nearest-neighbor) query
3) Join query
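A plain-Spark Scala sketch of the range-query idea, assuming the usual sc context (this is not the GeoSpark API, just the concept: keep the points that fall inside a query rectangle):

case class Point(x: Double, y: Double)

val points = sc.parallelize(Seq(Point(1, 2), Point(5, 9), Point(3, 4)))

// Hypothetical query window
val (xMin, yMin, xMax, yMax) = (0.0, 0.0, 4.0, 5.0)

val inRange = points.filter(p =>
  p.x >= xMin && p.x <= xMax && p.y >= yMin && p.y <= yMax)

println(inRange.count()) // points inside the query window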

43
Q

What might cause data skew? (GeoSpark)

A

Creating a grid and inserting data based on the grid coordinates

This creates a load-balancing problem
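A uniform-grid sketch in Scala, reusing the Point/points from the sketch above (the grid extent and cell size are made up); with skewed data, a few cell ids collect most of the points, which is exactly the load-balancing problem:

// Map each point to a cell id of an assumed 10x10 grid over [0, 100) x [0, 100)
val cellSize = 10.0
val numCols  = 10

def cellId(p: Point): Int =
  (p.y / cellSize).toInt * numCols + (p.x / cellSize).toInt

val perCell = points.map(p => (cellId(p), 1)).reduceByKey(_ + _)
perCell.collect().foreach(println) // uneven counts = skew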

44
Q

Load balancing problem

A

Grid problem: data skew. Some boxes are empty, some boxes are too full.

45
Q

Four types of grids

A

1) Uniform grid
2) Quad-tree: based on density
3) KDB-tree: no overlap
4) R-tree: based on clusters (overlap)

46
Q

When do you build a local index? (GeoSpark)

A

When the same computation runs many times, e.g. hundreds of thousands of times in total or thousands of times per partition

47
Q

Spatial Join Query (Geospark)

A

Count the number of points within an area

48
Q

(GeoSpark) Is a join or a filter more expensive?

A

Join

49
Q

What are space-filling curves?

A

1) Map 2D data into one number
2) Lose some detail

A space-filling curve is a geometric/mathematical mapping of 2D space: each 2D cell is assigned a single number, and cells whose numbers are close are also close in space.
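A Z-order (Morton) curve is one common space-filling curve; this Scala sketch interleaves the bits of the x and y cell coordinates into one number (16-bit coordinates assumed):

// Cells with nearby Morton codes tend to be nearby in 2D space
def morton(x: Int, y: Int): Long = {
  var code = 0L
  for (i <- 0 until 16) {
    code |= ((x >> i) & 1L) << (2 * i)     // even bit positions hold x's bits
    code |= ((y >> i) & 1L) << (2 * i + 1) // odd bit positions hold y's bits
  }
  code
}

println(morton(3, 5)) // 39 = binary 100111, the interleaved bits of y=101 and x=011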

50
Q

Spatial indexes

A

1) Similar to hash tables (uniform grid): index based on grid id
2) Quad-tree: partition the space into 4, then partition each quadrant into 4, and so on
3) R-tree: bounding rectangles
4) Voronoi diagram: partitions the space into cells with the property that every point inside a cell is closer to that cell's seed than to any other seed, which is what k-nearest-neighbor queries need