Week 6 - Apache Spark Flashcards
What does RDD stand for?
Resilient Distributed Dataset
Is an RDD read-only?
Yes
RDDs can only be created through? (2)
1) Data in stable storage
2) Other RDDs
RDD is a restricted form of distributed shared ____ what?
Memory (a cached-dataset, shared-memory abstraction)
What does an RDD contain (dataset)?
Atomic pieces of the dataset (partitions)
What does an RDD contain (dependencies)?
Dependencies on parent RDDs
(for fault tolerance)
How does an RDD compute its dataset?
A function for computing the dataset based on its parents (used for fault tolerance),
plus metadata about its partitioning scheme and data placement
RDDs are read-only and?
Partitioned collections of records
Two important features of RDDs and Apache Spark
1) Fault Tolerance
2) Lazy Evaluation
Describe RDD Fault Tolerance
It is achieved through lineage: a lost partition is recomputed by replaying the transformations that produced it from its parent RDDs
Describe RDD Lazy Evaluation
An RDD is not actually computed until a reduce-like or persist action needs it to produce meaningful output (see the sketch after the Actions card below)
What two classes of operations can you do on RDDs?
1) Transformations
2) Actions
RDD Transformations
Build RDDs through operations on other RDDs
1) map, filter, join
2) Lazy operations (nothing executes until an action)
RDD Actions
1) count, collect, save
2) Trigger execution
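A minimal sketch of transformations vs. actions (assumes a SparkContext named sc, e.g. from spark-shell; the file names are hypothetical):

val lines  = sc.textFile("data.txt")          // creates an RDD lazily
val longer = lines.filter(_.length > 80)      // transformation: lazy, nothing runs yet
val words  = longer.flatMap(_.split(" "))     // transformation: still lazy

val n = words.count()                         // action: triggers execution of the whole lineage
words.saveAsTextFile("out")                   // action: re-runs the lineage unless persisted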
HDFS is?
1) Hadoop Distributed File System
2) A distributed file system
3) Contains text files, e.g. logs and error output
How to find errors in HDFS files?
file.filter(_.contains("ERROR"))
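The same filter as a runnable sketch (assumes a SparkContext sc; the HDFS URL and path are hypothetical):

val file   = sc.textFile("hdfs://namenode:9000/logs/app.log")
val errors = file.filter(_.contains("ERROR"))
errors.take(5).foreach(println)   // action: only now is the file actually read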
DAG Scheduler
Partitions the DAG into efficient stages (think narrow vs. wide dependencies)
Narrow Dependencies
Transformation whose output needs input from only one partition (very little communication)
1) map
2) union
Wide Dependencies
Transformations with multiple dependencies: they need data from other partitions
1) groupByKey
2) join with inputs that are not co-partitioned
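A sketch contrasting the two dependency types (assumes a SparkContext sc):

val nums = sc.parallelize(1 to 1000, 4)         // 4 partitions

// Narrow: each output partition reads exactly one input partition (no shuffle)
val doubled = nums.map(_ * 2)

// Wide: groupByKey needs rows from every partition, so it forces a shuffle
// and starts a new stage in the DAG scheduler
val grouped = doubled.map(n => (n % 10, n)).groupByKey()

grouped.count()                                 // action: runs both stages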
Should wide dependencies come early or late in the DAG?
Late (less data to shuffle by that point)
Hadoop scheduler
Only 2 stages:
1) Map
2) Reduce
Hadoop: where is data stored?
Assumes all data is on disk (intermediate data has to be written to disk)
Why? Fault tolerance
Hadoop API
Only Map and Reduce procedural programming
Hadoop storage
Only on the HDFS file system (now extended)
Does Hadoop use more or less memory than Spark?
Less; it stores data locally on disk
Apache Spark has what APIs
1) Java
2) Scala
3) Python
4 Main Apache Spark Libraries
1) Spark SQL
2) Spark Streaming
3) MLlib (machine learning)
4) GraphX
What is a DataFrame?
Looks like a table (can run SQL operations on it)
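A minimal DataFrame sketch (assumes Spark 2.x+; the app name and data are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrameSketch").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")   // looks like a table
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()         // SQL on the DataFrame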
Which Spark library is used for Google PageRank and Shortest Path?
GraphX
What Spark Library is used for Streaming
Spark Streaming (process data in real time)
Spark DStream
An abstraction that represents a streaming data source.
What does a Spark DStream do?
1) Chops the data into batches of x seconds
2) Processes each batch like an RDD
3) Returns the processed RDD results in batches
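A minimal DStream sketch (assumes spark-streaming on the classpath; the socket source, host, and port are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))        // chop the stream into 1-second batches

val lines  = ssc.socketTextStream("localhost", 9999)     // the streaming source
val errors = lines.filter(_.contains("ERROR"))           // processed like an RDD transformation
errors.count().print()                                   // one result per batch

ssc.start()
ssc.awaitTermination()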
Spark DStream batch sizes
1) As low as 1/2 second
2) Latency of about 1 second
Apache Kafka
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
DStream - Stateful Operations / Transformations
1) Window (timed windows, counting, finding first or last); see the windowed sketch below
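A windowed variant of the DStream sketch above (the 30-second window and 10-second slide are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))

val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))  // counts over the last 30s, recomputed every 10s
counts.print()

ssc.start()
ssc.awaitTermination()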
Apache Hadoop - Apache Hive
SQL-like - a data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL)
Apache Hadoop - Apache Pig
A high-level data-flow language and execution framework for parallel computation (less rigid than plain Hadoop MapReduce)
Apache Hadoop - Apache HBase
NoSQL database (based on BigTable).
A scalable, distributed database that supports structured data storage for large tables.
Apache Hadoop - Apache ZooKeeper
A high-performance coordination service for distributed architectures (configuration, naming, synchronization)
Apache Mahout
Machine learning algorithms - a distributed linear algebra framework and mathematically expressive Scala DSL
What does GeoSpark do?
It takes the RDD layer (generic data processing) and extends it with spatial data processing operations.
Spatial query processing layer
Out-of-the-box implementations of the de facto standard spatial queries: range queries, kNN (k-nearest-neighbor) queries, and join queries
What might cause data skew (GeoSpark)?
Creating a uniform grid and inserting data based on the grid coordinates
This creates a load-balancing problem
Load-balancing problem
A grid problem caused by data skew: some boxes are empty, some boxes are too full
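A small sketch of why a uniform grid skews (plain Scala, no Spark; the grid extent, cell count, and point counts are all made-up assumptions):

import scala.util.Random

case class Point(x: Double, y: Double)

// Assign each point a cell id on a cellsPerSide x cellsPerSide uniform grid over [0,10) x [0,10)
def gridId(p: Point, cellsPerSide: Int): Int = {
  val col = math.min((p.x / 10.0 * cellsPerSide).toInt, cellsPerSide - 1)
  val row = math.min((p.y / 10.0 * cellsPerSide).toInt, cellsPerSide - 1)
  row * cellsPerSide + col
}

// Clustered data: 1000 points packed into one corner, 10 spread out
val points = Seq.fill(1000)(Point(Random.nextDouble() * 0.1, Random.nextDouble() * 0.1)) ++
             Seq.fill(10)(Point(Random.nextDouble() * 10, Random.nextDouble() * 10))

val histogram = points.groupBy(gridId(_, 4)).map { case (id, ps) => id -> ps.size }
println(histogram) // one cell holds ~99% of the points: some boxes empty, one far too full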
Four types of grids
1) Uniform grid
2) Quad-tree - based on density
3) KDB-tree - no overlap
4) R-tree - based on clusters (overlap)
When do you build a local index (GeoSpark)?
When the same computation runs many times -
hundreds of thousands of times overall, or thousands of times per partition
Spatial Join Query (GeoSpark)
Count the number of points within an area
(GeoSpark) Is a join or a filter more expensive?
Join
What are space-filling curves?
1) They map 2D data to one number
2) They lose detail
A space-filling curve is a geometrical/mathematical property of 2D data:
partition the space into cells, map each 2D cell to a single number,
and the numbering preserves proximity - cells with close numbers are close in space.
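A sketch of one concrete space-filling curve, the Z-order (Morton) curve (plain Scala; the 16-bit coordinate width is an assumption):

// Spread the low 16 bits of v so there is a gap between consecutive bits
def spread(v: Int): Long = {
  var x = v.toLong & 0xFFFFL
  x = (x | (x << 8)) & 0x00FF00FFL
  x = (x | (x << 4)) & 0x0F0F0F0FL
  x = (x | (x << 2)) & 0x33333333L
  x = (x | (x << 1)) & 0x55555555L
  x
}

// Interleave the bits of the two grid coordinates into one number;
// cells whose codes are close tend to be close in 2D space (with some detail lost)
def mortonCode(x: Int, y: Int): Long = spread(x) | (spread(y) << 1)

println(mortonCode(3, 5)) // 39
println(mortonCode(4, 5)) // 50 - a neighboring cell gets a nearby code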
Spatial indexes
1) Similar to hash tables (uniform grid): index based on grid id
2) Can also use a quad-tree (partition into 4, then partition each quadrant into 4, ...)
3) R-tree - bounding rectangles
4) Voronoi diagram
A Voronoi diagram partitions the space into cells with the mathematical property that every point inside a cell is closer to that cell's seed point than to any other seed, which makes it useful for k-nearest-neighbor queries.