data Flashcards

Question

Analysis Types

Answer 1

* Basic statistics
* Regression
* Recommendation
* Classification
* Clustering
* Text Analysis
* Pattern Mining

Answer 2

* Batch (useful if you do not need results right away)
* Real-time (useful if you need results right away)
* Interactive (useful if you need results right away and want the user to be able to change the parameters of the analysis)

Answer 3

* Static (use if you just want to display the results)
* Dynamic (results need to be updated regularly)
* Interactive (results need to be updated regularly and receive input from the user)

Answer 4

The Big Data Stack is a series of applications used to accomplish the analytics flow just mentioned. These applications can be stored on one server/computer or spread across multiple machines and accessed over a network. Hadoop can usually handle most of the tasks, but not always. We will use the computational tasks association to figure out which tasks it can handle and which it can’t.

Answer 5

Hadoop is an open-source framework for distributed batch processing of massive-scale data using the MapReduce programming model.

Answer 6

A programming model useful for big data that won’t fit on a single machine. MapReduce’s magic comes from the fact that it does the computation at the location of the files instead of transferring the data to the computation.
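As a rough illustration of the programming model, here is a minimal single-process word count in Python. It only mimics the map, shuffle, and reduce steps; it is not Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are made up for this sketch. In a real cluster each map call would run on the node that already holds its split of the data.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(splits):
    pairs = []
    for split in splits:  # on a cluster, each split is mapped where it is stored
        pairs.extend(map_phase(split))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two splits standing in for blocks on two machines.
splits = ["big data big", "data stack"]
counts = word_count(splits)
print(counts)
```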

Answer 7

* Raw Data Sources
* Data Access Connectors
* Data Storage
* Batch Analytics
* Real-time Analytics
* Interactive Querying
* Serving Databases, Web & Visualization Frameworks

Answer 8

Publish-subscribe messaging, source-sink connectors, database connectors, messaging queues, custom connectors

Answer 9

* Distributed Filesystem (HDFS): optimized for use with MapReduce.
* NoSQL (HBase, MongoDB): stands for “Not-Only-SQL”. SQL-type code that has programmatic capabilities.

Answer 10

* Alpha Pattern
* Beta Pattern
* Gamma Pattern
* Delta Pattern

Answer 11

Batch Analysis, Data Storage - relational or non-relational databases. Examples - Web Analytics, weather monitoring

Answer 12

Real-time analysis. Examples - Internet of Things applications and real-time monitoring applications.

Answer 13

Combines batch and real-time analysis.

Answer 14

Interactive Querying. Examples - web analytics, advertisement targeting, inventory management and enterprise applications.

Answer 15

* Load Leveling with Queues
* Load Balancing with Multiple Consumers
* Lambda Architecture
* Scheduler-Agent-Supervisor

Answer 16

A queue is a data structure that holds elements waiting to be processed and hands them out one at a time, in first-in, first-out order.
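A small Python sketch of that behavior, assuming the usual first-in, first-out ordering. The task names and burst sizes are invented for illustration: producers add work in uneven bursts, while the consumer takes exactly one element at a time.

```python
from collections import deque

tasks = deque()

# Producers add work in bursts of different sizes (hypothetical task names).
for burst in (["t1", "t2", "t3"], ["t4"]):
    tasks.extend(burst)

# The consumer processes one element at a time, oldest first.
processed = []
while tasks:
    task = tasks.popleft()
    processed.append(task)

print(processed)
```

Because the consumer only ever sees one element at a time, a burst of arrivals lengthens the queue instead of overloading the consumer.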

Answer 17

* Better horizontal scaling capability
* Better performance for big data
* Works well with unstructured data
* Optimized for real-time performance
* Designed for fast retrieval

Answer 18

* Still in development (newer than relational DBs)
* We will see some disadvantages too as we explore their functionality

Answer 19

- Stores data in the form of key-value pairs
  * Keys are used to uniquely identify the values stored
  * Keys are also used to determine where the value should be stored
- Distributed architecture comprising multiple storage nodes
  * Data is partitioned across the storage nodes by key
  * Hash functions are used to determine the partition number for a key
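The key-to-partition idea can be sketched in a few lines of Python. This is a toy, in-memory model and not any particular product's implementation: the `KeyValueStore` class, the `partition_for` helper, the choice of SHA-256, and the partition count of 4 are all assumptions made for the example.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical number of storage nodes


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Hash the key and map the hash onto a partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


class KeyValueStore:
    """Toy store: one dict per storage node, selected by the key's hash."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.nodes = [dict() for _ in range(num_partitions)]

    def put(self, key, value):
        self.nodes[partition_for(key, len(self.nodes))][key] = value

    def get(self, key):
        return self.nodes[partition_for(key, len(self.nodes))].get(key)


store = KeyValueStore()
store.put("user:1", "Ada")
print(partition_for("user:1"), store.get("user:1"))
```

The same key always hashes to the same partition, so a read goes straight to the node that holds the value without searching the others.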

Answer 20

* Used to store semi-structured data in the form of documents
* Documents are encoded in a variety of standards, including JSON, XML, BSON, or YAML
* These are all just forms of semi-structured data languages. We'll see some examples.
* Semi-structured data: the documents stored are similar to each other, but there is no strict schema.
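To make "similar to each other but no strict schema" concrete, here is a short Python sketch using JSON. The two documents, their fields, and the `find` helper are invented for this example; a real document database would provide its own query interface.

```python
import json

# Two hypothetical documents: similar shape, but no shared strict schema.
raw_documents = [
    '{"title": "Intro to Hadoop", "tags": ["batch"], "pages": 12}',
    '{"title": "Streaming basics", "author": "unknown"}',
]
documents = [json.loads(raw) for raw in raw_documents]


def find(docs, field, value):
    """Return the documents whose attribute `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


matches = find(documents, "pages", 12)
print(matches)
```

The second document has no `pages` attribute at all, yet both can live in the same collection and be queried by attribute value.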

Answer 21

The querying is more efficient, since it is based on the attribute values in the documents.

Answer 22

* Scales linearly with the addition of new nodes
* Distributed
* Uses column families
* Provides structured data storage for large tables
* Can store both structured and unstructured data

Answer 23

* Minor: merges the small store files into a single file when the number of files exceeds a threshold.
* Major: merges all store files into a single large store file. Outdated and deleted (tombstone-marked) values are removed.
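The difference between the two compaction types can be sketched with a simplified Python model. This is not HBase's actual algorithm: store files are modeled as dicts mapping a key to a `(timestamp, value)` pair, `TOMBSTONE` is a made-up delete marker, and the rule for picking which small files to merge is an assumption for the example.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted value


def merge(files):
    """Merge store files; for each key the entry with the newest timestamp wins."""
    merged = {}
    for store_file in files:
        for key, (timestamp, value) in store_file.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged


def minor_compaction(store_files, threshold=3):
    """If there are more files than `threshold`, merge the smallest ones into one."""
    if len(store_files) <= threshold:
        return store_files
    ordered = sorted(store_files, key=len)
    small, rest = ordered[:threshold], ordered[threshold:]
    return rest + [merge(small)]  # tombstones are kept at this stage


def major_compaction(store_files):
    """Merge everything into one file and drop tombstone-marked entries."""
    merged = merge(store_files)
    return [{k: v for k, v in merged.items() if v[1] is not TOMBSTONE}]


files = [
    {"a": (1, "x")},
    {"a": (2, TOMBSTONE)},  # "a" was deleted after it was written
    {"b": (3, "y")},
    {"c": (4, "z"), "d": (5, "w")},
]
after_minor = minor_compaction(files)
after_major = major_compaction(files)
print(len(after_minor), sorted(after_major[0]))
```

In this model the minor pass only reduces the number of files, while the major pass is the point at which the deleted key `a` actually disappears.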

Answer 24

* Bloom filters determine whether an element is in a particular set. A Bloom filter can report that an element is definitely not in the set or that it may be in the set.
* Bloom filters work with HBase to exclude store files that do not need to be looked up while serving read requests for a particular row key.
* Basically, Bloom filters make the lookup process more efficient by reducing the amount to be searched through.
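A minimal Bloom filter in Python shows how such a check can rule out store files before they are read. This is a teaching sketch, not HBase's implementation: the bit-array size, the number of hash functions, the use of seeded SHA-256, and the `store_files` contents are all assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))


# Hypothetical store files and the row keys each one holds.
store_files = {"file_a": ["row1", "row2"], "file_b": ["row3"]}

filters = {}
for name, row_keys in store_files.items():
    bloom = BloomFilter()
    for row_key in row_keys:
        bloom.add(row_key)
    filters[name] = bloom


def candidate_files(row_key):
    """Only files whose filter says 'possibly present' need to be read."""
    return [name for name, bloom in filters.items() if bloom.might_contain(row_key)]


print(candidate_files("row1"))
```

A file that holds the row key is never skipped. A file that does not hold it is usually skipped, with a small chance of a false positive that costs one unnecessary read.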

Answer 25

Graph structure with nodes and links. Think social media.