Lecture 8 Flashcards
Big Data
Data that cannot be stored and processed on a single device.
2 aspects of big data
- Distributed storage (Distributed File Systems / Sharding)
- Distributed processing (and handling derived data)
Processing on Big-Data
Not as easy as writing an SQL query and expecting fast results.
- Exploration
- Analytics
- Processing
- Publishing
Database architecture
Used by management to monitor business performance.
- Dashboards are built once in software based on information needs.
- View
(REST API / SDK) - Controller
(Database API / SQL) - Model / DB
(Database API / SQL) - Power BI, etc
Data warehousing
Collecting data for reporting purposes.
- Make static snapshots to send to a central data warehouse.
- Extract, transform, load (ETL)
- Staging - preparing data for reporting and integration.
- Takes load off operational systems.
- Enriches information by combining systems.
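The ETL flow above can be sketched in a few lines of plain Python. This is a toy illustration with hypothetical data (the source systems, keys, and field names are made up), showing how extracted rows from two systems are transformed into one schema and loaded into a central store:

```python
# Hypothetical source systems: a CRM and a sales system (extract step).
crm_rows = [{"customer": "Ada", "region": "EU"}]
sales_rows = [{"customer": "Ada", "amount": 120.0}]

def etl(crm, sales):
    # Transform/staging: join the two systems on the customer key.
    by_customer = {r["customer"]: dict(r) for r in crm}
    warehouse = []
    for s in sales:
        row = dict(by_customer.get(s["customer"], {}))
        row.update(s)           # enrich CRM data with sales figures
        warehouse.append(row)   # load into the central warehouse
    return warehouse

print(etl(crm_rows, sales_rows))
# -> [{'customer': 'Ada', 'region': 'EU', 'amount': 120.0}]
```

Because the warehouse holds a static snapshot, this join runs off the operational systems and does not add load to them.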
Scaling database architecture
- Buy bigger machines (scale up)
  - Effectiveness of upgrading hardware is limited, and it is expensive.
  - The machine remains a single point of failure.
- Buy more machines (scale out)
  - Create replicas of instances.
Data Processing
ETL to Big Data
- Relational databases
- ETL
- Big Data
- Cloud Solutions
HDFS
Hadoop Distributed File System
- Storage layer of the Hadoop big data system
- Based on the Google File System (GFS)
- Fault tolerant distributed file system
- Designed to turn a computing cluster (a large collection of loosely connected compute nodes) into a massively scalable pool of storage.
- Provides redundant storage for massive amounts of data
Properties of HDFS
- Made to be resilient and fail-proof: when a data node writes a data block to disk, it also replicates that block to another data node.
- Data nodes can be made rack-aware, since redundancy does not help when both replicas are written to disk drives in the same rack.
- The name node tells the data nodes where to write data.
- The name node also tells your application which data nodes hold the file.
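The name node / data node roles can be illustrated with a toy model. This is not the HDFS API, just a sketch of the idea: a name node assigns each block to several distinct data nodes (round-robin here; real HDFS placement is also rack-aware) and remembers the mapping so clients can locate blocks later. All names are illustrative:

```python
import itertools

class NameNode:
    """Toy name node: tracks which data nodes hold which block."""

    def __init__(self, data_nodes, replication=3):
        self.replication = replication
        self.block_map = {}                    # block id -> list of data nodes
        self._cycle = itertools.cycle(data_nodes)

    def write_block(self, block_id):
        # Pick `replication` data nodes for this block (round-robin sketch).
        targets = [next(self._cycle) for _ in range(self.replication)]
        self.block_map[block_id] = targets
        return targets

    def locate(self, block_id):
        # Tell the application which data nodes hold the block.
        return self.block_map[block_id]

nn = NameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.write_block("blk_001")
print(nn.locate("blk_001"))   # -> ['dn1', 'dn2', 'dn3']
```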
HBASE Column Family
Column families enable efficient sharding and compression, because related columns are stored together.
Streaming Data
Imagine having to process incoming messages from:
- An MMORPG where players are moving around, finding gold and loot.
- Uber drivers all over a country moving around.
We need real-time processing of information.
Apache Kafka
Functions like a distributed publish-subscribe messaging system.
Apache Kafka features
- Durability
- Scalability
- High availability
- High throughput (scalable messaging system)
- Distributed, reliable publish-subscribe system
- Designed as a message queue and implemented as a distributed log service.
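The "message queue on top of a distributed log" idea can be sketched in plain Python. This is not the Kafka API, just an in-memory model of its core concept: a topic is an append-only log, and each consumer tracks its own offset, so many subscribers can read the same messages independently:

```python
from collections import defaultdict

class Topic:
    """Toy publish-subscribe topic backed by an append-only log."""

    def __init__(self):
        self.log = []                      # the distributed log (here: a list)
        self.offsets = defaultdict(int)    # consumer id -> next unread offset

    def publish(self, message):
        self.log.append(message)           # messages are only ever appended

    def poll(self, consumer_id):
        # Return every message this consumer has not yet seen,
        # then advance its offset.
        start = self.offsets[consumer_id]
        messages = self.log[start:]
        self.offsets[consumer_id] = len(self.log)
        return messages

t = Topic()
t.publish("player moved")
t.publish("gold found")
print(t.poll("dashboard"))   # -> ['player moved', 'gold found']
print(t.poll("dashboard"))   # -> []  (offset already advanced)
```

Durability in real Kafka comes from persisting and replicating this log across brokers; the per-consumer offsets are what make the same log serve many independent subscribers.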
Batch processing
Processing of blocks of data that have already been stored over a period of time.
- Often on disk.
- Hadoop and MapReduce
e.g. processing all transactions performed by a financial firm in a week.
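The batch model is what MapReduce formalises. A minimal pure-Python sketch of the classic word-count job (not Hadoop itself; the phase names follow the MapReduce pattern): map emits `(key, 1)` pairs, a shuffle/sort groups them by key, and reduce aggregates each group:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the stored batch.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: bring all pairs with the same key together.
    grouped = sorted(pairs, key=itemgetter(0))
    # Reduce: sum the counts for each key.
    return {word: sum(c for _, c in group)
            for word, group in groupby(grouped, key=itemgetter(0))}

batch = ["buy sell", "sell sell"]
print(reduce_phase(map_phase(batch)))   # -> {'buy': 1, 'sell': 3}
```

In Hadoop the map and reduce tasks run on many nodes in parallel, and the shuffle moves intermediate pairs between them; the logic per record is the same.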
Stream processing
Process data in real-time as they arrive and detect conditions within a small period of time from the point of receiving the data.
- Often in memory.
- Multiple publishers.
- Concurrency
- Kafka and Spark Streaming
e.g. fraud detection, social media sentiment analysis, log monitoring, analysing customer behaviour.
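A fraud-detection-flavoured sketch of stream processing (pure Python, not Kafka or Spark Streaming; the window size and threshold are made-up parameters): events are handled one by one as they arrive, only a small sliding window is kept in memory, and a condition is flagged the moment it holds:

```python
from collections import deque

class SlidingWindow:
    """Toy stream processor: flag when recent spend exceeds a threshold."""

    def __init__(self, size=3, threshold=1000):
        self.window = deque(maxlen=size)   # only recent events stay in memory
        self.threshold = threshold

    def on_event(self, amount):
        self.window.append(amount)
        # Detect the condition within the window, at arrival time.
        return sum(self.window) > self.threshold

w = SlidingWindow(size=3, threshold=1000)
print([w.on_event(a) for a in [200, 300, 400, 900]])
# -> [False, False, False, True]
```

Contrast with the batch example: nothing is stored for later, and the answer is available per event rather than per job run.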
Data Exploration
- Data exploration is about describing the data by means of statistical and visualisation techniques.
- We explore in order to understand the features and bring important features to our models.
Data Exploration with big data
- We cannot load all the data into memory.
- Some operations take too much time to run on a single machine.
Exploring using Pandas
- Pandas is an implementation of the DataFrame data structure.
- PySpark offers the same DataFrame abstraction to distribute computation across a cluster.
- Dask provides another distributed DataFrame alternative.
When to use DataFrame?
The Pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields. They are similar to SQL tables or the spreadsheets you work with in Excel or Calc.
Why use DataFrame?
A DataFrame can store heterogeneous data (columns of different types), while a Series stores homogeneous data (a single type).
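This difference is easy to see in Pandas itself (a short sketch assuming pandas is installed; the column names and values are made up): the DataFrame mixes types across columns, while each individual column is a homogeneous Series with one dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Bob"],      # strings
    "age": [36, 41],             # integers
    "score": [9.5, 7.2],         # floats
})

print(df.dtypes)        # object, int64, float64: a heterogeneous table
print(df["age"].dtype)  # int64: one column is a homogeneous Series
```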
Distributed DataFrame
- Original DataFrame
- Split (the data frame into in-memory manageable chunks)
- Apply (the transformation to each chunk independently)
- Combine (each chunk back into a data frame)
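The split-apply-combine steps above can be sketched with Pandas on a single machine (assuming pandas is installed; the data and the per-chunk transformation are illustrative). PySpark and Dask distribute exactly this pattern, running the apply step on chunks spread across a cluster:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 5]})

# Split: break the frame into manageable chunks, one per key.
chunks = [group for _, group in df.groupby("city")]

# Apply: transform each chunk independently (here: attach the chunk total).
totals = [g.assign(total=g["sales"].sum()) for g in chunks]

# Combine: stitch the chunks back into one data frame.
combined = pd.concat(totals).sort_index()
print(combined)
```

Because each chunk is transformed independently, the apply step needs no coordination between workers, which is what makes the pattern scale.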