Big Data Flashcards

Question 1

Q

The 3 Vs of big data

Answer

A

volume
velocity
variety

Question 2

Q

The 2 additional dimensions of big data

Answer

A

variability

complexity

Question 3

Q

“Volume” in big data

Answer

A

organizations collect data from a variety of sources, which adds up to very large amounts of data to store and process

Question 4

Q

“Velocity” in big data

Answer

A

data streams in at an unprecedented speed and must be dealt with in a timely manner

Question 5

Q

“Variety” in big data

Answer

A

data comes in all types of formats

Question 6

Q

“Variability” in big data

Answer

A

data flows can be highly inconsistent with periodic peaks

Question 7

Q

“Complexity” in big data

Answer

A

the variety of sources of data makes it difficult to link, match, cleanse, and transform data across systems

Question 8

Q

What does ACID stand for?

Answer

A

Atomicity
Consistency
Isolation
Durability

Question 9

Q

ACID: Atomicity

Answer

A

requires each transaction to be all or nothing
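
A minimal sketch of the all-or-nothing behavior, using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are hypothetical and exist only for illustration. The second transfer fails halfway through, so its debit is rolled back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Abort after the debit: the rollback undoes it.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # neither update is applied
```

After both calls, the balances reflect only the first transfer: the failed one leaves no partial debit behind.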

Question 10

Q

ACID: Consistency

Answer

A

ensures that any transaction will bring the database from one valid state to another

Question 11

Q

ACID: Isolation

Answer

A

ensures that concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially (one after another)

Question 12

Q

ACID: Durability

Answer

A

ensures that once a transaction has been committed, it will remain so even in the event of power loss, crashes or errors

Question 13

Q

What is Hadoop?

Answer

A

a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure

Question 14

Q

the 2 components of Hadoop

Answer

A

the Hadoop Distributed File System (HDFS)

the MapReduce programming paradigm for managing applications on multiple distributed servers
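
The MapReduce idea can be sketched in a single Python process as a word count. This is not the Hadoop API: the function names map_phase, shuffle, and reduce_phase are assumptions chosen for illustration, and a real cluster would run the map and reduce steps on many servers in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big clusters", "data streams"])
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across machines without changing the result.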

Question 15

Q

What is data mining?

Answer

A

-the discovery of useful patterns in the data

-the nontrivial extraction of implicit, previously unknown information from data

-the exploration and analysis of large quantities of data to discover meaningful patterns

Question 16

Q

7 data mining tasks

Answer

A

classification
clustering
association rule discovery
sequential pattern discovery
regression
deviation detection
collaborative filtering

Question 17

Q

What is data mining classification?

Answer

A

assign items in a collection to classes or target categories, using a model learned from already-labeled examples (the training set)

ex: bank loan officer assigning loan applicants as low, medium, or high risk
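
The loan example can be sketched with a nearest-neighbor classifier, one of the simplest classification methods. The applicants, their two features (income in thousands and debt ratio), and the risk labels below are made up for illustration, and nearest-neighbor is only one possible choice of classifier.

```python
# Hypothetical labeled training set: (income_in_thousands, debt_ratio) -> risk class.
TRAINING_SET = [
    ((20, 0.9), "high"),
    ((25, 0.8), "high"),
    ((50, 0.5), "medium"),
    ((55, 0.4), "medium"),
    ((90, 0.1), "low"),
    ((100, 0.2), "low"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(applicant, training_set=TRAINING_SET):
    """Assign the class of the closest labeled example (1-nearest-neighbor)."""
    _, label = min(
        training_set,
        key=lambda example: squared_distance(applicant, example[0]),
    )
    return label

risk_a = classify((22, 0.85))
risk_b = classify((52, 0.45))
risk_c = classify((92, 0.15))
```

In practice the features would be scaled first so that income does not dominate the distance; that step is left out here to keep the sketch short.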

Question 18

Q

What is data mining clustering?

Answer

A

the process of grouping a set of abstract objects into classes of similar objects

we then treat clusters of objects as one group

unlike classification, it is adaptive to changes, and it highlights useful features that distinguish different groups

Question 19

Q

What is K-clustering?

Answer

A

assign each data point to a cluster and position the clusters (i.e. create partitions of the data) in a way that minimizes the distance from the data points to the center of their cluster

theoretically, we could partition the data any way we want and create clusters arbitrarily. However, we want our partitioning to produce the tightest, most compact clusters possible
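
A minimal sketch of this idea, assuming the card refers to k-means style clustering (the card itself does not name the algorithm). The sample points and starting centers are made up. Each round assigns every point to its nearest center, then moves each center to the mean of its points.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points and repositioning centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(p, centers[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [
            mean(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
```

The result depends on the starting centers, so real uses typically try several initializations and keep the tightest partition.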

Question 20

Q

What are association rules in data mining?

Answer

A

if/then statements that help uncover relationships between seemingly unrelated data in a database or information repository
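
An if/then rule such as "if bread, then butter" is usually judged by two measures that the card does not mention: support (how often the items appear together) and confidence (how often the "then" part holds when the "if" part does). The sketch below computes both over a made-up list of shopping baskets.

```python
# Hypothetical transactions: each set is one shopping basket.
TRANSACTIONS = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Of the transactions containing the 'if' items, the fraction that
    also contain the 'then' items."""
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )

# Rule: if bread, then butter.
rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
```

Here bread and butter appear together in 2 of 4 baskets, and butter appears in 2 of the 3 baskets that contain bread.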

Question 21

Q

What is collaborative filtering in data mining?

Answer

A

predict what a person may be interested in on the basis of

past preferences
other people with similar preferences
the preferences of such people for something new
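
The three ingredients above can be sketched as a small user-based recommender. The users, the items, and the choice of Jaccard similarity are assumptions made for illustration; real systems use richer ratings and other similarity measures.

```python
def jaccard(a, b):
    """Similarity of two sets of liked items: overlap divided by union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(likes, user):
    """Rank items the user has not seen by the similarity of the people who like them."""
    seen = likes[user]  # past preferences
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        similarity = jaccard(seen, items)  # people with similar preferences
        if similarity == 0:
            continue
        for item in items - seen:  # their preferences for something new
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical data: the items each person likes.
likes = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "cat": {"E"},
}
suggestions = recommend(likes, "ann")
```

Bob shares two likes with Ann, so his unseen item D is suggested; Cat shares none, so E is not.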

Question 22

Q

What is the repeated clustering approach to collaborative filtering?

Answer

A

an example with movies:

cluster people based on the movies they like
then cluster movies based on being liked by similar clusters of people (even if you did not explicitly like every movie in a cluster, you are now associated with that cluster)
cluster people again based on their preferences for the new clusters of movies

repeat until no more clustering is possible based on certain thresholds

-Google PageRank is another example