data Flashcards

1
Q

Scaling

A

Increasing or decreasing the number of resources (such as nodes or instances) to match the workload.

2
Q

Queue decoupling

A

Decoupling components with a queue removes implementation dependencies between them. Benefits: independent releases, streamlined and faster development, and improved testability of computing components.
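A minimal sketch of the idea, assuming a single producer and a single consumer that share nothing but a queue. The names `producer`, `consumer`, `run_pipeline`, and the `STOP` sentinel are illustrative and do not come from the card.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel that tells the consumer to finish


def producer(q, items):
    """Put work on the queue; knows nothing about who consumes it."""
    for item in items:
        q.put(item)
    q.put(STOP)


def consumer(q, results):
    """Take work off the queue; knows nothing about who produced it."""
    while True:
        item = q.get()
        if item is STOP:
            break
        results.append(item * 2)  # stand-in for real processing


def run_pipeline(items):
    """Wire the two sides together through a shared queue."""
    q = queue.Queue()
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    producer(q, items)
    worker.join()
    return results
```

Because each side only touches the queue, either function can be replaced or tested on its own, which is the independence the card describes.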

3
Q

node and instance

A

A unit of computing, such as a web server, that processes data in some way.

4
Q

vertical scaling

A

Take one machine and give it more resources, such as more RAM or more storage.

5
Q

immutable

A

Data that does not change once it has been written.

6
Q

Map Reduce

A

A programming model used for batch analysis in a wide range of applications, including web analytics, networking, e-commerce, and finance.

7
Q

Map reduce - Map

A

Input to the map phase is in key-value pairs. The phase has two steps: splitting the input, then mapping each split to intermediate key-value pairs.

8
Q

Map reduce - Reduce

A

Output of the reduce phase is in key-value pairs. The phase has two steps: shuffle and sort, which groups the intermediate pairs by key, then the reducer, which combines the values for each key.
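The map and reduce steps can be sketched on a single machine with a word count. This is a plain-Python illustration under stated assumptions, not Hadoop code: the function names are hypothetical, and the input key is assumed to be the line offset.

```python
from collections import defaultdict


def map_phase(lines):
    """Take (offset, line) input pairs and emit a (word, 1) pair per word."""
    pairs = []
    for _offset, line in enumerate(lines):
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle_and_sort(pairs):
    """Group intermediate values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Combine the values for each key into one output key-value pair."""
    return [(key, sum(values)) for key, values in grouped]


def word_count(lines):
    """Run map, shuffle and sort, and reduce in sequence."""
    return reduce_phase(shuffle_and_sort(map_phase(lines)))
```

In a real cluster the map and reduce functions run on many nodes in parallel; the sketch only shows the data flow between the phases.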

9
Q

What are the four types of analytics?

A
  1. Descriptive Analytics
  2. Diagnostic Analytics
  3. Predictive Analytics
  4. Prescriptive Analytics
10
Q

Descriptive Analytics

A

What has happened? Example: What is the average number of visitors to a website in a day?

11
Q

Diagnostic Analytics

A

Why did it happen? Example: What is the reason that this patient's heart failed at exactly 12:03 PM?

12
Q

Predictive Analytics

A

What is likely to happen? Example: When will the stock prices for Amazon begin to go down again?

13
Q

Prescriptive Analytics

A

What can we do to make it happen? Example: What is the best route to drive to Alderwood Mall at 5:00 PM?

14
Q

What is Big Data?

A

Collections of datasets whose volume, velocity, and variety are so large that it is difficult to store, manage, process, and analyze the data using traditional databases and data processing tools.

15
Q

How BIG is Big Data?

A

Roughly 2.5 quintillion bytes of data are generated every day.

16
Q

5 Characteristics of Big Data

A

Volume
Velocity
Variety
Veracity
Value

17
Q

Volume

A

How much data there is.

18
Q

Velocity

A

How fast the data is generated and has to be processed.

19
Q

Variety

A

The forms the data takes: structured, unstructured, and semi-structured.

20
Q

Veracity

A

Refers to bias, noise, and abnormality in data. It also covers incomplete data, errors, outliers, and missing values.

21
Q

Value

A

Usefulness

22
Q

Analytics Flow for Big Data

A

Data Collection
Data Preparation
Analysis Types
Analysis Modes
Visualizations

23
Q

Data Collection

A

Connectors:
 Publish-Subscribe messaging frameworks
 Messaging queues
 Source-sink connectors
 Database connectors

24
Q

Data Preparation

A

Data Cleaning
 Wrangling/Munging
 De-duplication
 Normalization
 Sampling
 Filtering
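A few of these preparation steps can be sketched in plain Python. The helper names are hypothetical, records are assumed to be flat dictionaries, and min-max scaling is only one possible meaning of "normalization".

```python
import random


def deduplicate(records):
    """Drop exact duplicate records while keeping first-seen order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def filter_missing(records, field):
    """Keep only records that have a non-missing value for `field`."""
    return [r for r in records if r.get(field) is not None]


def min_max_normalize(values):
    """Rescale numbers to the range [0, 1] (assumed normalization method)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def sample(records, k, seed=0):
    """Draw k records at random without replacement, seeded for repeatability."""
    return random.Random(seed).sample(records, k)
```

Each helper returns a new list and leaves its input untouched, so the steps can be chained in any order.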

25
Analysis Types
* Basic statistics
* Regression
* Recommendation
* Classification
* Clustering
* Text Analysis
* Pattern Mining
26
Analysis Modes
 Batch (useful if you do not need results right away)
 Real-time (useful if you need results right away)
 Interactive (useful if you need results right away and want the user to be able to change the parameters of the analysis)
27
Visualizations
 Static (use if you just want to display the results)
 Dynamic (results need to be updated regularly)
 Interactive (results need to be updated regularly and receive input from the user)
28
Big Data Stack
The Big Data Stack is a series of applications used to carry out the analytics flow described earlier. These applications can run on a single server or be spread across several machines and accessed over a network. Hadoop can usually handle most of the tasks, but not always. We will use the association between computational tasks and tools to work out which tasks it can handle and which it cannot.
29
Hadoop
Hadoop is an open-source framework for distributed batch processing of massive-scale data using the MapReduce programming model.
30
MapReduce
A programming model useful for big data that won’t fit on a single machine. MapReduce’s strength is that it runs the computation where the files are stored instead of transferring the data to the computation.
31
Big Data Stack Overview
Raw Data Sources
Data Access Connectors
Data Storage
Batch Analytics
Real-time Analytics
Interactive Querying
Serving Databases, Web & Visualization Frameworks
32
Data Access Connectors
Publish-subscribe messaging, source-sink connectors, database connectors, messaging queues, custom connectors
33
Data Storage
Distributed Filesystem (HDFS)
 Optimized for use with MapReduce.
NoSQL (HBase, MongoDB)
 Stands for “Not-Only-SQL”.
 Non-relational data stores that are queried through their own query languages or programmatic APIs.
34
Analytics Patterns
Alpha Pattern
Beta Pattern
Gamma Pattern
Delta Pattern
35
Alpha Pattern
Batch analysis. Data storage: relational or non-relational databases. Examples: web analytics, weather monitoring.
36
Beta Pattern
Real-time analysis. Examples: Internet of Things applications and real-time monitoring applications.
37
Gamma Pattern
Combines batch and real-time analysis.
38
Delta Pattern
Interactive Querying. Examples - web analytics, advertisement targeting, inventory management and enterprise applications.
39
Analytics Architectures
Load Leveling with Queues
Load Balancing with Multiple Consumers
Lambda Architecture
Scheduler-Agent-Supervisor
40
Queue
A queue is a data structure that holds elements waiting to be processed, one element at a time, typically in first-in, first-out order.
41
Advantages of NoSQL
 Better horizontal scaling capability
 Better performance for big data
 Works well with unstructured data
 Optimized for real-time performance
 Designed for fast retrieval
42
Disadvantages of NoSQL
 Still in development (newer than relational DBs)
 We will see some disadvantages too as we explore their functionality
43
NoSQL - Key-Value Databases
-Stores data in the form of key-value pairs
 Keys are used to uniquely identify the values stored
 Keys are also used to determine where the value should be stored
-Distributed architectures comprising multiple storage nodes
 Data is partitioned across storage nodes by key
 Hash functions are used to determine the partition number for the key
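The partitioning idea can be sketched as follows. This is an in-memory illustration, not the design of any particular database: `partition_for` and `KeyValueStore` are hypothetical names, and SHA-256 is an arbitrary choice of stable hash.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class KeyValueStore:
    """Toy store where each dict stands in for one storage node."""

    def __init__(self, num_partitions=4):
        self.partitions = [{} for _ in range(num_partitions)]

    def _node(self, key):
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key, value):
        self._node(key)[key] = value

    def get(self, key, default=None):
        return self._node(key).get(key, default)
```

The same key always hashes to the same partition, so a read only has to contact the one node that can hold the value.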
44
NoSQL - Document Databases
 Used to store semi-structured data in the form of documents
 Documents are encoded in a variety of standards including JSON, XML, BSON, or YAML
 These are all just formats for semi-structured data. We'll see some examples
 Semi-structured data: the documents stored are similar to each other, but there is no strict schema.
45
Benefit of using document DBs over key-value DBs
Querying based on the attribute values in the documents is more efficient.
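A small sketch of what querying by attribute value looks like, using Python dictionaries as stand-ins for JSON documents. The `find` helper and the sample fields are hypothetical and do not represent the query API of any real document database.

```python
def find(collection, **criteria):
    """Return the documents whose attributes match every given value."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Semi-structured documents: similar to each other, but with no strict schema.
people = [
    {"name": "Ada", "city": "Seattle", "age": 36},
    {"name": "Grace", "city": "Seattle"},
    {"name": "Alan", "city": "London", "email": "alan@example.com"},
]
```

For example, `find(people, city="Seattle")` returns the first two documents. A key-value store sees each value as an opaque blob, so the same question would require fetching and inspecting every value by key.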
46
NoSQL - HBase
 Scales linearly with the addition of new nodes
 Distributed
 Uses column families
 Provides structured data storage for large tables
 Can store both structured and unstructured data
47
Compaction Types
 Minor: merges small store files into a single file when their number exceeds a threshold.
 Major: merges all store files into a single large store file. Outdated and deleted values (marked with tombstones) are removed.
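The difference between the two compaction types can be sketched with store files modeled as lists of `(row_key, timestamp, value)` cells. This is a simplified illustration and not how HBase is implemented: the `TOMBSTONE` marker, the function names, and the rule for picking which files to merge are all assumptions.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted value


def minor_compaction(files, threshold=3):
    """When there are more than `threshold` files, merge the smallest ones.

    Every cell is kept, including tombstones and older versions.
    """
    if len(files) <= threshold:
        return files
    by_size = sorted(files, key=len)
    small, rest = by_size[:threshold], by_size[threshold:]
    merged = sorted(
        (cell for store_file in small for cell in store_file),
        key=lambda cell: (cell[0], -cell[1]),  # by row key, newest first
    )
    return rest + [merged]


def major_compaction(files):
    """Merge all files into one, dropping outdated and tombstoned cells."""
    latest = {}
    for store_file in files:
        for row_key, timestamp, value in store_file:
            if row_key not in latest or timestamp > latest[row_key][0]:
                latest[row_key] = (timestamp, value)
    return [[
        (row_key, timestamp, value)
        for row_key, (timestamp, value) in sorted(latest.items())
        if value is not TOMBSTONE
    ]]
```

The minor version only reduces the file count, while the major version is the step that actually discards data that is no longer visible.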
48
Bloom Filters in HBase
 A Bloom filter tests whether an element may be in a particular set. It can report false positives but never false negatives.
 HBase uses Bloom filters to skip store files that cannot contain the requested row key when serving a read request.
 Basically, Bloom filters make the lookup process more efficient by reducing the number of store files that have to be searched.
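A minimal Bloom filter sketch shows how store files can be ruled out before they are read. The class, its default sizes, the seeded SHA-256 hashing, and the `files_to_read` helper are assumptions for illustration and do not reflect HBase's actual implementation.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; may give false positives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[position] for position in self._positions(item))


def files_to_read(row_key, store_files):
    """Return the names of store files whose filter does not rule out the key."""
    return [name for name, bloom in store_files if bloom.might_contain(row_key)]
```

A file is skipped only when its filter answers "definitely absent", so a read never misses data; an occasional false positive just costs one unnecessary file lookup.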
49
Graph Databases
Graph structure with nodes and links. Think social media: people are the nodes and friendships are the links.
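The social media analogy can be sketched with an in-memory graph. The `Graph` class and its methods are hypothetical and only illustrate nodes, links, and one typical traversal; a real graph database provides its own storage and query language.

```python
from collections import defaultdict


class Graph:
    """Undirected graph: nodes are people, links are friendships."""

    def __init__(self):
        self.links = defaultdict(set)

    def add_link(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def friends_of_friends(self, person):
        """People two links away who are not already direct friends."""
        direct = self.links.get(person, set())
        result = set()
        for friend in direct:
            result |= self.links.get(friend, set())
        return result - direct - {person}
```

Queries like this one follow links from node to node, which is the access pattern graph databases are built around.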