data Flashcards

1
Q

Scaling

A

Increasing or decreasing the number of resources (such as nodes or instances) to match the workload.

2
Q

Queue decoupling

A

Placing a queue between components decouples them, which removes the implementation dependencies between them. Benefits: independent releases, streamlined and faster development, and improved testability of computing components.

3
Q

node and instance

A

A single machine unit, such as a web server, that processes data in some way.

4
Q

vertical scaling

A

Take one machine and give it more resources, such as RAM or storage.

5
Q

immutable

A

Data that will not change once it has been written.

6
Q

Map Reduce

A

A programming model used for batch analysis in a wide range of applications, including web analytics, networking, e-commerce, and finance.

7
Q

Map reduce - Map

A

Input to the map phase is in key-value pairs. The phase has two steps: splitting the input, then mapping each split to output (key, value) pairs.

8
Q

Map reduce - Reduce

A

Output of the reduce phase is in key-value pairs. The phase has two steps: shuffle and sort, then the reducer, which combines the key-value pairs that share a key.
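
The map, shuffle-and-sort, and reduce steps can be sketched as a word count in plain Python. This is a single-process illustration only, not Hadoop's API; the function names `map_phase`, `shuffle_and_sort`, and `reduce_phase` and the sample input are made up for this sketch.

```python
from collections import defaultdict

def map_phase(splits):
    """Map: emit a (word, 1) pair for every word in every input split."""
    pairs = []
    for split in splits:
        for word in split.lower().split():
            pairs.append((word, 1))
    return pairs

def shuffle_and_sort(pairs):
    """Shuffle & sort: group the values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: combine the values for each key into one (key, total) pair."""
    return [(key, sum(values)) for key, values in grouped]

# Hypothetical input that has already been divided into two splits.
splits = ["big data is big", "data is useful"]
word_counts = reduce_phase(shuffle_and_sort(map_phase(splits)))
print(word_counts)
```

In a real cluster the map and reduce functions would run on many nodes in parallel; the sketch only shows how the key-value pairs flow between the phases.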

9
Q

What are the four types of analytics?

A
  1. Descriptive Analytics
  2. Diagnostic Analytics
  3. Predictive Analytics
  4. Prescriptive Analytics
10
Q

Descriptive Analytics

A

What has happened? Example: What is the average number of visitors to a website in a day?
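
As a small illustration, the website example is a one-line summary statistic. The visitor counts below are invented for the sketch.

```python
from statistics import mean

# Hypothetical daily visitor counts for four days.
daily_visitors = [120, 150, 90, 140]

# Descriptive analytics: summarize what has already happened.
average_visitors = mean(daily_visitors)
print(average_visitors)
```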

11
Q

Diagnostic Analytics

A

Why did it happen? Example: What is the reason that this patient's heart failed at exactly 12:03 PM?

12
Q

Predictive Analytics

A

What is likely to happen? Example: When will the stock prices for Amazon begin to go down again?

13
Q

Prescriptive Analytics

A

What can we do to make it happen? Example: What is the best route to drive to Alderwood Mall at 5:00 PM?

14
Q

What is Big Data?

A

Collections of datasets whose volume, velocity, and variety are so large that it is difficult to store, manage, process, and analyze the data using traditional databases and data processing tools.

15
Q

How BIG is Big Data?

A

2.5 quintillion bytes of data are generated every day.

16
Q

5 Characteristics of Big Data

A

Volume
Velocity
Variety
Veracity
Value

17
Q

Volume

A

How much data there is.

18
Q

Velocity

A

How fast the data is generated and needs to be processed.

19
Q

Variety

A

The forms the data takes: structured, unstructured, and semi-structured.

20
Q

Veracity

A

Refers to bias, noise, and abnormality in data. It also covers incomplete data, errors, outliers, and missing values.

21
Q

Value

A

Usefulness

22
Q

Analytics Flow for Big Data

A

Data Collection
Data Preparation
Analysis Types
Analysis Modes
Visualizations

23
Q

Data Collection

A

Connectors:
  • Publish-subscribe messaging frameworks
  • Messaging queues
  • Source-sink connectors
  • Database connectors
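
To make the publish-subscribe idea concrete, here is a toy in-memory broker. The `Broker` class, the topic name, and the sample readings are hypothetical; real messaging frameworks add persistence, networking, and delivery guarantees that this sketch leaves out.

```python
from collections import defaultdict

class Broker:
    """A toy in-memory publish-subscribe broker (for illustration only)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of this topic.
        for callback in self.subscribers[topic]:
            callback(message)

storage = []  # stands in for a data store that keeps every reading
alerts = []   # stands in for a monitor that only cares about hot readings

def alert_if_hot(reading):
    if reading["temp"] > 30:
        alerts.append(reading)

broker = Broker()
broker.subscribe("sensor-readings", storage.append)
broker.subscribe("sensor-readings", alert_if_hot)

# The publisher does not know who consumes the data, or how many consumers exist.
broker.publish("sensor-readings", {"temp": 22})
broker.publish("sensor-readings", {"temp": 35})
```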

24
Q

Data Preparation

A

  • Data cleaning
  • Wrangling/munging
  • De-duplication
  • Normalization
  • Sampling
  • Filtering
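
A few of these steps can be sketched on a small in-memory dataset. The records, the field names, and the choice of min-max scaling for normalization are assumptions made for this illustration.

```python
import random

# Hypothetical raw records with a duplicate and a missing value.
raw = [
    {"id": 1, "visits": 10},
    {"id": 2, "visits": 30},
    {"id": 2, "visits": 30},    # duplicate
    {"id": 3, "visits": None},  # missing value
    {"id": 4, "visits": 50},
]

def deduplicate(records):
    """De-duplication: keep the first record seen for each id."""
    seen, unique = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique

def drop_missing(records):
    """Filtering: discard records with no visit count."""
    return [r for r in records if r["visits"] is not None]

def normalize(records):
    """Normalization: min-max scale visits into the range [0, 1]."""
    values = [r["visits"] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero when all values match
    return [{**r, "visits": (r["visits"] - low) / span} for r in records]

prepared = normalize(drop_missing(deduplicate(raw)))

# Sampling: draw 2 records at random (seeded so the run is repeatable).
sample = random.Random(0).sample(prepared, 2)
```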

25
Q

Analysis Types

A
  • Basic statistics
  • Regression
  • Recommendation
  • Classification
  • Clustering
  • Text Analysis
  • Pattern Mining
26
Q

Analysis Modes

A

  • Batch (useful if you do not need results right away)
  • Real-time (useful if you need results right away)
  • Interactive (useful if you need results right away and want the user to be able to change the parameters of the analysis)

27
Q

Visualizations

A

  • Static (use if you just want to display the results)
  • Dynamic (results need to be updated regularly)
  • Interactive (results need to be updated regularly and receive input from the user)

28
Q

Big Data Stack

A

The Big Data Stack is a series of applications used to accomplish the analytics flow for big data.
  • These applications can be stored on one server or computer, or on multiple ones accessed across a network.
  • Hadoop can usually handle most of the tasks, but not always. We will use the computational tasks association to figure out which tasks it can handle and which it can’t.

29
Q

Hadoop

A

Hadoop is an open-source framework for distributed batch processing of massive-scale data using the MapReduce programming model.

30
Q

MapReduce

A

A programming model useful for big data that won’t fit on a single machine. MapReduce’s main advantage is that it runs the computation where the files are stored instead of transferring the data.

31
Q

Big Data Stack Overview

A

Raw Data Sources
Data Access Connectors
Data Storage
Batch Analytics
Real-time Analytics
Interactive Querying
Serving Databases, Web & Visualization Frameworks

32
Q

Data Access Connectors

A

Publish-subscribe messaging, source-sink connectors, database connectors, messaging queues, custom connectors

33
Q

Data Storage

A

Distributed Filesystem (HDFS)
  • Optimized for use with MapReduce.
NoSQL (HBase, MongoDB)
  • Stands for “Not Only SQL”.
  • SQL-type code that has programmatic capabilities.

34
Q

Analytics Patterns

A

Alpha Pattern
Beta Pattern
Gamma Pattern
Delta Pattern

35
Q

Alpha Pattern

A

Batch Analysis, Data Storage - relational or non-relational databases. Examples - Web Analytics, weather monitoring

36
Q

Beta Pattern

A

Real-time analysis. Examples - Internet of Things applications and real-time monitoring applications

37
Q

Gamma Pattern

A

Combines batch and real-time analysis.
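
One way to picture the combination is a query that merges a precomputed batch result with a running real-time result. The page-view events and the `query` helper below are invented for this sketch, and it is only one possible arrangement.

```python
from collections import Counter

# Hypothetical page-view events.
historical_events = ["home", "about", "home"]  # already stored, processed in batch
recent_events = ["home", "pricing"]            # arrived after the last batch run

batch_view = Counter(historical_events)   # recomputed periodically over all stored data
realtime_view = Counter(recent_events)    # updated as each new event arrives

def query(page):
    """Answer a query by combining the batch and real-time results."""
    return batch_view[page] + realtime_view[page]

print(query("home"))
```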

38
Q

Delta Pattern

A

Interactive Querying. Examples - web analytics, advertisement targeting, inventory management and enterprise applications.

39
Q

Analytics Architectures

A

Load Leveling with Queues
Load Balancing with Multiple Consumers
Lambda Architecture
Scheduler-Agent-Supervisor

40
Q

Queue

A

A queue is a data structure that holds items to be processed one element at a time, typically in first-in, first-out order.
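
A minimal sketch using Python's `collections.deque`, assuming first-in, first-out order. The task names are placeholders.

```python
from collections import deque

tasks = deque()

# A producer adds work to the back of the queue, possibly in a burst.
for name in ["task-1", "task-2", "task-3"]:
    tasks.append(name)

# A consumer takes work from the front, one element at a time.
processed = []
while tasks:
    processed.append(tasks.popleft())

print(processed)
```

Because the producer and consumer only share the queue, the consumer can work at its own pace while requests wait in line, which is the idea behind load leveling with queues.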

41
Q

NoSQL - Advantages

A

  • Better horizontal scaling capability
  • Better performance for big data
  • Works well with unstructured data
  • Optimized for real-time performance
  • Designed for fast retrieval

42
Q

NoSQL - Disadvantages

A

  • Still in development (newer than relational DBs)
  • We will see some other disadvantages as we explore their functionality

43
Q

NoSQL - Key-Value Databases

A

- Stores data in the form of key-value pairs
  • Keys are used to uniquely identify the values stored
  • Keys are also used to determine where the value should be stored
- Distributed architecture comprising multiple storage nodes
  • Data is partitioned across storage nodes by key
  • Hash functions are used to determine the partition number for the key
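
The hash-based partitioning can be sketched as follows. The `KeyValueStore` class, the choice of SHA-256, and the fixed count of four partitions are assumptions for illustration; real key-value databases use their own partitioning schemes.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical number of storage nodes

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Hash the key and map the hash to a partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

class KeyValueStore:
    """Toy store where each partition is a dict standing in for a node."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.partitions = [{} for _ in range(num_partitions)]

    def put(self, key, value):
        index = partition_for(key, len(self.partitions))
        self.partitions[index][key] = value

    def get(self, key):
        # The same hash sends the read to the partition that holds the key.
        index = partition_for(key, len(self.partitions))
        return self.partitions[index].get(key)

store = KeyValueStore()
store.put("user:1", "Ada")
store.put("user:2", "Grace")
print(store.get("user:1"), partition_for("user:1"))
```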

44
Q

NoSQL - Document Databases

A

  • Used to store semi-structured data in the form of documents
  • Documents are encoded in a variety of standards, including JSON, XML, BSON, or YAML
  • These are all formats for semi-structured data. We’ll see some examples
  • Semi-structured data: the documents stored are similar to each other, but there is no strict schema

45
Q

Benefit of using document DBs over key-value DBs

A

Querying is more efficient because it can be based on the attribute values in the documents.
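
A minimal sketch of querying by attribute values, using plain Python dicts as stand-ins for JSON documents. The `find` helper and the sample documents are hypothetical, not the API of MongoDB or any other real document database.

```python
# Semi-structured documents: similar to each other, but with no strict schema.
documents = [
    {"_id": 1, "name": "Ada", "city": "Seattle", "tags": ["admin"]},
    {"_id": 2, "name": "Grace", "city": "Lynnwood"},
    {"_id": 3, "name": "Alan", "city": "Seattle", "age": 41},
]

def find(docs, **criteria):
    """Return the documents whose attributes match every criterion."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

seattle_names = [doc["name"] for doc in find(documents, city="Seattle")]
print(seattle_names)
```

A plain key-value store only looks values up by key, so the same question would mean fetching each value and inspecting it in application code.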

46
Q

NoSQL - HBase

A

  • Scales linearly with the addition of new nodes
  • Distributed
  • Uses column families
  • Provides structured data storage for large tables
  • Can store both structured and unstructured data

47
Q

Compaction Types

A

  • Minor: merges the small store files into a single file when their number exceeds a threshold.
  • Major: merges all store files into a single large store file. Outdated and deleted (tombstone-marked) values are removed.
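
The difference between the two types can be sketched with store files modeled as Python dicts. This is a simplified model, not HBase's implementation: the `(timestamp, value)` cell layout, the `TOMBSTONE` marker, and the `max_files` and `small_size` thresholds are all assumptions for illustration.

```python
TOMBSTONE = object()  # marker standing in for a deleted value

def merge(files):
    """Merge store files, keeping the newest (timestamp, value) per row key."""
    merged = {}
    for store_file in files:
        for row_key, (ts, value) in store_file.items():
            if row_key not in merged or ts > merged[row_key][0]:
                merged[row_key] = (ts, value)
    return merged

def minor_compaction(store_files, max_files=3, small_size=2):
    """When there are too many files, merge only the small ones into one."""
    if len(store_files) <= max_files:
        return store_files
    small = [f for f in store_files if len(f) <= small_size]
    large = [f for f in store_files if len(f) > small_size]
    return large + [merge(small)]

def major_compaction(store_files):
    """Merge everything into one file and drop tombstone-marked values."""
    merged = merge(store_files)
    return [{k: cell for k, cell in merged.items() if cell[1] is not TOMBSTONE}]

store_files = [
    {"a": (1, "x"), "b": (1, "y"), "c": (1, "z")},
    {"a": (2, "x2")},        # newer version of row "a"
    {"b": (3, TOMBSTONE)},   # row "b" was deleted
    {"d": (4, "w")},
]

after_minor = minor_compaction(store_files)
after_major = major_compaction(store_files)
```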

48
Q

Bloom Filters in HBase

A

  • Bloom filters determine whether an element may be in a particular set.
  • In HBase, Bloom filters are used to exclude store files that do not need to be looked up while serving read requests for a particular row key.
  • In short, Bloom filters make lookups more efficient by reducing the amount of data to be searched through.
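
A toy Bloom filter shows how store files can be skipped. This is a sketch under assumptions, not HBase's implementation: the bit-array size, the number of hashes, the use of SHA-256, and the store file names are all made up. A Bloom filter can report a false positive, but it never misses a key that was added.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))

# Hypothetical store files and the row keys they hold.
store_files = {
    "store_file_1": ["row-1", "row-2"],
    "store_file_2": ["row-3"],
}

filters = {}
for name, row_keys in store_files.items():
    bloom = BloomFilter()
    for row_key in row_keys:
        bloom.add(row_key)
    filters[name] = bloom

def files_to_read(row_key):
    """Only the files whose filter says 'possibly present' need a lookup."""
    return [name for name, bloom in filters.items() if bloom.might_contain(row_key)]

print(files_to_read("row-3"))
```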

49
Q

Graph Databases

A

A graph structure with nodes and links. Think social media.
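
The social media example can be sketched with an adjacency mapping, where each person is a node and each "follows" relationship is a link. The names and the `friends_of_friends` helper are invented for this illustration and do not reflect any particular graph database's query language.

```python
# Nodes are people; links point from a person to the people they follow.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
    "dave": set(),
    "erin": set(),
}

def friends_of_friends(graph, person):
    """People two links away who are not already followed directly."""
    direct = graph.get(person, set())
    second_degree = set()
    for friend in direct:
        second_degree |= graph.get(friend, set())
    return second_degree - direct - {person}

suggestions = sorted(friends_of_friends(follows, "alice"))
print(suggestions)
```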