Big Data & Cloud Module I Flashcards
What is erasure coding (EC)?
Erasure coding is a feature of Hadoop 3 that protects data by dividing it into fragments, which are expanded and encoded with redundant parity data and stored across different locations. If a drive fails or data becomes corrupted, the original data can be reconstructed from the fragments stored on the other drives.
What is Cluster Topology?
By cluster topology we mean the type and state of each node in the cluster and the relations between them.
In order to make its placement and scheduling choices, Hadoop must know the cluster topology.
When HDFS might NOT be a good fit?
- When there is a need for low-latency data access
- When we are dealing with lots of small files
What are the main file formats on Hadoop?
We can have traditional file formats (like text, CSV, XML) and Hadoop-specific file formats, namely row-oriented and column-oriented formats. The difference is that row-oriented file formats store data by grouping all the attributes of a record together, while column-oriented file formats store data by grouping the values of each attribute (column) together.
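As a rough illustration (plain Python, not an actual Hadoop file format), here is the same set of records laid out row-wise and column-wise:

```python
# Toy sketch: the same records stored row-oriented vs column-oriented.
records = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob",   "age": 25},
    {"id": 3, "name": "carol", "age": 41},
]

# Row-oriented: all attributes of each record are kept together.
row_layout = [(r["id"], r["name"], r["age"]) for r in records]

# Column-oriented: all values of each attribute are kept together,
# which compresses well and lets a query read only the columns it needs.
column_layout = {
    "id":   [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "age":  [r["age"] for r in records],
}

print(row_layout)
print(column_layout["age"])  # reading one column touches one list only
```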
What is Parquet storage?
Parquet storage is a columnar file format used in big data processing systems like Hadoop. It organizes and stores data in a highly optimized manner, grouping values of each column together for efficient compression and retrieval. This allows for faster data scanning and processing, especially when dealing with large datasets, as only the required columns are read from disk, reducing I/O operations and improving performance.
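A minimal PySpark sketch of this idea, assuming a local Spark installation and a hypothetical `people.parquet` output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 30), (2, "bob", 25)],
    ["id", "name", "age"],
)

# Write the DataFrame as Parquet (columnar, compressed).
df.write.mode("overwrite").parquet("people.parquet")

# Read back only one column: Parquet lets Spark skip the other
# columns on disk (column pruning), reducing I/O.
ages = spark.read.parquet("people.parquet").select("age")
ages.show()
```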
What is YARN?
YARN (Yet Another Resource Negotiator) is the preferred resource negotiator in Hadoop. It manages the cluster's resources and schedules tasks to the different workers. YARN provides two different daemons:
- The Resource Manager: formed by a Scheduler, which allocates resources to the applications, and by the Applications Manager, which accepts job submissions
- The Node Manager: a per-node agent. The NM is responsible for containers and monitors their resource usage, reporting it to the RM
Explain the concept of data locality
Data locality is the practice of moving the computation to the node where the data resides. By bringing computation close to the data, data locality reduces network communication and disk I/O, resulting in faster and more efficient data processing.
What is MapReduce?
MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster
Explain from what parts a MapReduce program is composed and how it works
A MapReduce program is composed of two functions: a map function and a reduce function.
The map function performs filtering and sorting, and the reduce function performs a summary operation.
Together they form the MapReduce system, which orchestrates the process by marshalling the distributed servers, running the tasks in parallel and managing communications.
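A classic example is word count. The following is a plain-Python sketch of the idea (not actual Hadoop code), with the shuffle step simulated locally:

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in a line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: sum all the counts emitted for the same key.
def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["big data on hadoop", "big data on spark"]

# Shuffle/sort step simulated locally: group values by key.
groups = defaultdict(list)
for line in lines:
    for word, one in map_fn(line):
        groups[word].append(one)

result = [reduce_fn(word, counts) for word, counts in groups.items()]
print(result)  # [('big', 2), ('data', 2), ('on', 2), ('hadoop', 1), ('spark', 1)]
```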
Define what combiners are and what is their utility
Combiners in MapReduce are mini-reducers that operate locally on the output of the map phase. They help to optimize data transfer by reducing the amount of data sent across the network. Combiners perform partial aggregation on the intermediate key-value pairs generated by the map phase, thereby reducing the volume of data that needs to be transferred to the reducers. They are used to improve overall performance and reduce network congestion in MapReduce jobs.
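Continuing the word-count sketch above, a combiner would pre-aggregate each mapper's own output before it leaves the node (again plain Python, not the Hadoop API):

```python
from collections import Counter

# Output of a single mapper before the combiner: many (word, 1) pairs.
mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]

# Combiner: a local, partial reduce on the mapper's own output.
combined = list(Counter(word for word, _ in mapper_output).items())

print(combined)  # [('big', 3), ('data', 1)] -- fewer pairs cross the network
```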
Talk about Partitioning in MapReduce
Partitioning in MapReduce is the process of dividing the intermediate key-value pairs generated by the map phase into separate groups or partitions. Each partition corresponds to a specific reducer task. The goal of partitioning is to ensure that all key-value pairs with the same key are sent to the same reducer, enabling efficient and accurate data processing. Partitioning allows for parallelism in the reduce phase by enabling multiple reducers to work on different subsets of data concurrently. It helps in load balancing and ensures that data is distributed evenly across reducers, improving the overall performance and scalability of MapReduce jobs.
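Hadoop's default partitioner hashes the key modulo the number of reducers; a minimal Python sketch of that idea:

```python
# Sketch of hash partitioning: every pair with the same key
# is routed to the same reducer partition.
def partition(key, num_reducers):
    return hash(key) % num_reducers

pairs = [("big", 3), ("data", 2), ("spark", 1), ("big", 5)]
num_reducers = 2

for key, value in pairs:
    print(key, "-> reducer", partition(key, num_reducers))
```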
How does MapReduce cope with Failure?
In MapReduce there are several mechanisms to cope with the failure of a MapReduce job:
- the master (the Application Master when running on YARN) sets each failed task back to idle and reassigns it to a worker when one becomes available
MapReduce was designed around low-end commodity servers, so it is quite resilient to failures.
What are some MapReduce Algorithms?
MapReduce is a framework: a set of tools on top of which a wider product is built.
Each programmer needs to fit their solution into the MapReduce paradigm. Some common classes of algorithms are:
- Filtering algorithms: find the lines or tuples with particular characteristics
- Summarization algorithms: for example, count the number of requests to each subdomain
- Join: combine different inputs on some shared values, e.g. to perform pre-aggregations
- Sort: sort the inputs
What is Apache Spark?
Apache Spark is a data processing framework that can quickly perform processing tasks over very large datasets, and can also distribute data processing across multiple computers
Why we went from MapReduce to Spark?
Because MapReduce has some limitations given the recent changes in technology:
- MapReduce is more attuned to batch processing
- MapReduce is a strict paradigm
- New hardware capabilities are not exploited by MapReduce
- It is too complex for many use cases
Explain the Main Structure of Spark
Spark is built around two major components: RDDs and the DAG. RDD stands for Resilient Distributed Dataset, while DAG stands for Directed Acyclic Graph.
Explain RDD
Resilient Distributed Dataset.
It is the primary data structure in Spark. It is a reliable and memory-efficient abstraction, and it speeds up processing by keeping data in memory as RDDs.
RDDs are immutable collections of objects; they are automatically rebuilt on failure, and they are distributed across the nodes of the cluster.
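A small PySpark example, assuming a local installation of Spark:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD from a local collection; Spark splits it into partitions.
numbers = sc.parallelize(range(10))

# Transformations return new immutable RDDs; the originals never change.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

print(evens.collect())  # an action: materializes the result on the driver
```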
Explain DAG
A DAG is a collection of individual tasks, represented as nodes, connected by edges that define dependencies between the tasks. The graph is directed, meaning that the edges have a direction that indicates the flow of data between tasks. It is also acyclic, meaning that there are no cycles or loops in the graph.
When you perform data processing operations in Spark, such as transformations and actions on RDDs (Resilient Distributed Datasets) or DataFrames, Spark automatically builds a DAG to represent the computation plan. The DAG captures the logical flow of operations that need to be executed to produce the desired output.
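For example (a sketch assuming `sc` is an existing SparkContext), the transformations below only extend the DAG; nothing runs until the action at the end:

```python
# Assuming `sc` is an existing SparkContext.
rdd = sc.parallelize(["a", "b", "a", "c"])

# These transformations only extend the DAG; no data is processed yet.
pairs = rdd.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# toDebugString() shows the lineage (the DAG) recorded for this RDD.
print(counts.toDebugString().decode())

# The action finally triggers execution of the whole plan.
print(counts.collect())
```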
Explain the Spark Architecture
Spark uses a master/slave architecture with one central coordinator (DRIVER) and many distributed workers (EXECUTORS).
Cluster Manager: is responsible for assigning and managing cluster resources (it can be the Spark standalone manager or YARN)
Executor: executes tasks
Driver Program: converts user programs into tasks
Deployment types in Spark
- Cluster mode: the driver process runs directly on a node in the cluster
- Client mode: the driver runs on a machine that does not belong to the cluster
- Local mode: the driver and executors run on the same machine
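As a small sketch, local mode can be requested directly when building the session (cluster vs. client mode is normally selected at submission time with `spark-submit --deploy-mode`):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run on this machine, using as many
# worker threads as there are cores ("local[*]").
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("deployment-demo")
    .getOrCreate()
)

print(spark.sparkContext.master)  # local[*]
spark.stop()
```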
What is RDD partitioning?
RDDs are collections of data so big that they need to be partitioned across different nodes. Spark automatically partitions RDDs and distributes the partitions across the different nodes.
If not specified, Spark sets the number of partitions automatically. If there are too many, there is excessive overhead; if there are too few, some cores will not be used.
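A quick PySpark check, assuming `sc` is an existing SparkContext:

```python
# Assuming `sc` is an existing SparkContext.
rdd = sc.parallelize(range(100), numSlices=8)   # ask for 8 partitions
print(rdd.getNumPartitions())                   # 8

# Too few partitions leave cores idle; repartition() redistributes the
# data (at the cost of a shuffle), while coalesce() can shrink without one.
wider = rdd.repartition(16)
narrower = wider.coalesce(4)
print(wider.getNumPartitions(), narrower.getNumPartitions())  # 16 4
```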
Explain Shuffling in Spark
Shuffling is the mechanism used to re-distribute data across partitions. It’s necessary to compute some operations. It is complex and costly
The main shuffling techniques are:
- Hash shuffle: each map task creates a file for every reducer
- Hash shuffle (evolved): each executor holds a pool of files that map tasks reuse
- Sort shuffle (the default): each mapper keeps its output in memory and spills to disk if necessary
- Tungsten sort: an evolution of sort shuffle that works directly on serialized records
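For instance (a sketch assuming `sc` is an existing SparkContext), a key-based aggregation forces a shuffle because values with the same key may live in different partitions:

```python
# Assuming `sc` is an existing SparkContext.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 4)

# mapValues() is a narrow transformation: no data moves between partitions.
doubled = pairs.mapValues(lambda v: v * 2)

# reduceByKey() is a wide transformation: matching keys must be brought
# together, so Spark shuffles data across partitions (and the network).
totals = doubled.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # [('a', 8), ('b', 12)] in some order
```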
How to configure a Cluster?
- Resource configuration: CPU and memory. Every executor in an application has a fixed number of cores and a fixed heap size (part of which is used as cache)
- CPU tuning: setting the number of executors and the number of cores per executor
- Memory tuning: setting the amount of memory per executor
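These settings are usually passed as Spark configuration properties; a sketch with placeholder values (not tuning recommendations):

```python
from pyspark.sql import SparkSession

# Example resource settings (placeholder values, not recommendations).
spark = (
    SparkSession.builder
    .appName("resource-config-demo")
    .config("spark.executor.instances", "4")   # number of executors
    .config("spark.executor.cores", "2")       # cores per executor
    .config("spark.executor.memory", "4g")     # heap size per executor
    .getOrCreate()
)
```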
Shared Variables in Spark
Shared variables in Spark are special variables that can be shared and used across multiple tasks in a distributed computing environment.
There are two types of shared variables in Spark:
Broadcast Variables: These are read-only variables that are broadcast to all the worker nodes in a cluster. They allow the workers to access the variable's value efficiently without sending a copy of the variable with each task. Broadcast variables are useful when a large dataset or a large lookup table needs to be shared among all the tasks.
Accumulators: These are variables that are used to accumulate values from the tasks running on worker nodes back to the driver program. Accumulators are typically used for aggregating values or collecting statistics from the tasks. They are designed to be only “added” to and provide a convenient way to gather information from distributed tasks without relying on explicit data shuffling.
By using shared variables, Spark avoids unnecessary data transfer and duplication, which can improve the performance and efficiency of distributed computations.
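A short PySpark illustration of both kinds, assuming `sc` is an existing SparkContext:

```python
# Assuming `sc` is an existing SparkContext.

# Broadcast variable: a read-only lookup table shipped once per worker.
country_names = sc.broadcast({"IT": "Italy", "FR": "France"})

# Accumulator: tasks only add to it; the driver reads the total.
unknown = sc.accumulator(0)

def resolve(code):
    if code in country_names.value:
        return country_names.value[code]
    unknown.add(1)
    return "unknown"

codes = sc.parallelize(["IT", "FR", "XX", "IT"])
print(codes.map(resolve).collect())  # the action runs the tasks
print(unknown.value)                 # 1, read back on the driver
```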