Big Data Flashcards

1
Q

The 3 Vs of big data

A

volume
velocity
variety

2
Q

The 2 additional dimensions of big data

A

variability

complexity

3
Q

“Volume” in big data

A

the sheer amount of data: organizations collect data from a variety of sources, so the total quantity to store and process is very large

4
Q

“Velocity” in big data

A

data streams in at an unprecedented speed and must be dealt with in a timely manner

5
Q

“Variety” in big data

A

data comes in all types of formats, from structured to unstructured

6
Q

“Variability” in big data

A

data flows can be highly inconsistent with periodic peaks

7
Q

“Complexity” in big data

A

the variety of sources of data makes it difficult to link, match, cleanse, and transform data across systems

8
Q

What does ACID stand for?

A

Atomicity
Consistency
Isolation
Durability

9
Q

ACID: Atomicity

A

requires each transaction to be all or nothing
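
A minimal sketch of all-or-nothing behavior, using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical and exist only for illustration. If any step of the transfer fails, the whole transaction is rolled back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Abort midway: the debit above must not survive.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # neither update is applied
```

After the failed transfer the balances are the same as before it, because the partial debit was rolled back.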

10
Q

ACID: Consistency

A

ensures that any transaction will bring the database from one valid state to another

11
Q

ACID: Isolation

A

ensures that concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially (one after another)

12
Q

ACID: Durability

A

ensures that once a transaction has been committed, it will remain so even in the event of power loss, crashes or errors

13
Q

What is Hadoop?

A

a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure

14
Q

The 2 components of Hadoop

A

the distributed file system

the MapReduce programming paradigm for managing applications on multiple distributed servers
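
The MapReduce idea can be sketched in a single Python process. This is only an illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the function names are hypothetical, and on a real cluster the map and reduce calls would run on different servers.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big ideas", "data streams"])
```

Because each map call looks at one document and each reduce call looks at one key, the work can be split across machines without the pieces needing to talk to each other.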

15
Q

What is data mining?

A

- the discovery of useful patterns in the data

- the nontrivial extraction of implicit, previously unknown information from data

- the exploration and analysis of large quantities of data to discover meaningful patterns

16
Q

7 data mining tasks

A
classification
clustering
association rule discovery
sequential pattern discovery
regression
deviation detection
collaborative filtering
17
Q

What is data mining classification?

A

assign items in a collection to classes or target categories, using a model learned from already-labeled examples (aka the training set)

ex: bank loan officer assigning loan applicants as low, medium, or high risk
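
A minimal sketch of the loan example, assuming a k-nearest-neighbors classifier (one of many possible classification methods). The features, the numbers, and the classify helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical training set:
# (income in $1000s, debt-to-income ratio in %) -> known risk label
training_set = [
    ((90, 10), "low"), ((85, 15), "low"), ((80, 12), "low"),
    ((55, 30), "medium"), ((50, 35), "medium"), ((60, 28), "medium"),
    ((25, 60), "high"), ((30, 55), "high"), ((20, 65), "high"),
]

def classify(applicant, training_set, k=3):
    """Label an applicant by majority vote of the k closest labeled examples."""
    nearest = sorted(
        training_set, key=lambda item: math.dist(applicant, item[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

risk_a = classify((88, 11), training_set)
risk_b = classify((52, 33), training_set)
risk_c = classify((22, 62), training_set)
```

The classes are fixed in advance by the labels in the training set; the classifier only decides which existing class a new item belongs to.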

18
Q

What is data mining clustering?

A

the process of grouping a set of abstract objects into classes of similar objects

we then treat each cluster of objects as one group

unlike classification, it is adaptive to changes, and it highlights useful features that distinguish different groups

19
Q

What is K-clustering?

A

assign a cluster to each data point and position the clusters (i.e. create partitions of the data) in a way that minimizes the distance from the data points to the center of their cluster

theoretically, we could partition the data any way we want and arbitrarily create clusters. However, we want our partitioning to represent the tightest, most compact clusters possible
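
The description above matches what is usually called k-means clustering. A minimal sketch in plain Python, assuming Euclidean distance, a fixed number of iterations, and starting centers supplied by the caller (real implementations usually pick them randomly and stop on convergence). The points are made up.

```python
import math

def assign(points, centers):
    """Put each point into the cluster whose center is closest."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def mean(cluster):
    """Coordinate-wise average of the points in one cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def k_means(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Move each center to the middle of its cluster; keep it if the cluster is empty.
        centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
    return centers, assign(points, centers)

# Hypothetical 2-D data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(1, 1), (8, 8)])
```

Alternating between assigning points and moving centers is what pulls the partition toward tight, compact clusters.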

20
Q

What are data mining association rules?

A

if/then statements that help uncover relationships between seemingly unrelated data in a database or information repository
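
A minimal sketch of how an if/then rule such as "if diapers, then beer" is commonly scored, using the standard support and confidence measures. The shopping baskets are made up for illustration.

```python
# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the 'if' part, the fraction that also contain the 'then' part."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
```

Here three of the five baskets contain both items, and three of the four baskets with diapers also contain beer, so the rule has support 3/5 and confidence 3/4.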

21
Q

What is data mining collaborative filtering?

A

predict what a person may be interested in on the basis of

  • past preferences
  • other people with similar preferences
  • the preferences of such people for something new
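
The three ingredients above can be sketched as a small user-based recommender. This assumes cosine similarity as the measure of "similar preferences", which is one common choice among several; the users, items, and ratings are made up.

```python
import math

# Hypothetical past preferences: user -> {item: rating from 1 to 5}
ratings = {
    "ann": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "ben": {"movie_a": 5, "movie_b": 5, "movie_d": 4},
    "cat": {"movie_c": 5, "movie_d": 1, "movie_e": 5},
}

def similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank items the user has not rated, weighting other users' ratings by similarity."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

suggestions = recommend("ann", ratings)
```

Ann's tastes are much closer to Ben's than to Cat's, so the item Ben rated highly ends up ranked first for her.
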
22
Q

What is the repeated clustering approach to collaborative filtering?

A

an example with movies:

  • cluster people based on a movie they like
  • then cluster movies based on being liked by similar clusters of people (you become part of such a cluster even if you did not like every movie in it)
  • cluster people based on preferences for new clusters of movies

Google PageRank is another example

repeat until no more clustering is possible based on certain thresholds

23
Q

What is text mining?

A

application of data mining to textual documents

24
Q

What is graph mining?

A

data mining with graph data