Big Data Flashcards
The 3 Vs of big data
volume
velocity
variety
The 2 additional dimensions of big data
variability
complexity
“Volume” in big data
organizations collect very large amounts of data from a variety of sources
“Velocity” in big data
data streams in at an unprecedented speed and must be dealt with in a timely manner
“Variety” in big data
data comes in all types of formats
“Variability” in big data
data flows can be highly inconsistent with periodic peaks
“Complexity” in big data
the variety of sources of data makes it difficult to link, match, cleanse, and transform data across systems
What does ACID stand for?
Atomicity
Consistency
Isolation
Durability
ACID: Atomicity
requires each transaction to be all or nothing
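To make "all or nothing" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the account names, and the `transfer` helper are hypothetical and exist only for illustration. If the transfer fails partway through, the debit that was already applied is rolled back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Abort: the debit above must not survive on its own.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Illustrative in-memory data, not from the flashcards.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

Calling `transfer(conn, "alice", "bob", 500)` returns False and leaves both balances exactly as they were, rather than leaving alice debited with bob never credited.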
ACID: Consistency
ensures that any transaction will bring the database from one valid state to another
ACID: Isolation
ensures that concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially (one after another)
ACID: Durability
ensures that once a transaction has been committed, it will remain so even in the event of power loss, crashes or errors
What is Hadoop?
a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure
the 2 components of Hadoop
the distributed file system
the MapReduce programming paradigm for managing applications on multiple distributed servers
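The MapReduce paradigm can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop's actual API; the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions chosen for illustration. In a real cluster the map and reduce steps would run in parallel on different servers.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["big data", "big clusters"])` counts "big" twice and the other two words once each.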
what is data mining?
- the discovery of useful patterns in the data
- the nontrivial extraction of implicit, previously unknown information from data
- the exploration and analysis of large quantities of data to discover meaningful patterns
7 data mining tasks
classification
clustering
association rule discovery
sequential pattern discovery
regression
deviation detection
collaborative filtering
What is data mining classification?
assign items in a collection to classes or target categories, using a model learned from already-labeled examples (the training set)
ex: bank loan officer assigning loan applicants as low, medium, or high risk
What is data mining clustering?
the process of making a group of abstract objects into classes of similar objects
we then treat clusters of objects as one group
unlike classification, it is adaptive to changes and it highlights useful features that distinguish different groups
what is K-clustering?
assign each data point to a cluster and position the clusters (i.e. create partitions of the data) in a way that minimizes the distance from the data points to the center of their cluster
theoretically, we could partition the data any way we want and create clusters arbitrarily. However, we want our partitioning to produce the tightest, most compact clusters possible
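Assuming "K-clustering" here means k-means clustering, the idea can be sketched with the standard iterative procedure: assign every point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. The function names and the choice of squared Euclidean distance are assumptions made for this illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    """Partition `points` into k clusters; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments are stable, so the partition has converged
        centers = new_centers
    return centers, clusters
```

Each pass can only keep or reduce the total squared distance from points to their centers, which is why the result tends toward tight, compact clusters. The outcome can depend on the starting centers, so in practice the procedure is often run several times.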
What are association rules in data mining?
if/then statements that help uncover relationships between seemingly unrelated data in a database or information repository
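An if/then rule such as "if bread, then butter" is usually judged by two standard measures that the card does not name: support (how often the items appear together) and confidence (how often the "then" part holds when the "if" part does). The sketch below computes both; the basket data and function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the "if" part never occurs, so the rule has no evidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
```

Here bread and butter appear together in 2 of 4 baskets (support 0.5), and butter appears in 2 of the 3 baskets that contain bread (confidence about 0.67).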
What is collaborative filtering in data mining?
predict what a person may be interested in on the basis of
- past preferences
- other people with similar preferences
- the preferences of such people for something new
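The three ingredients above can be sketched as a small user-based recommender. This is one simple way to do it, not the only one: it assumes preferences are stored as sets of liked items and uses Jaccard similarity to find people with similar tastes. The names, the data, and the scoring rule are all illustrative assumptions.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(likes, user, top_n=3):
    """Suggest items `user` has not liked yet, weighted by similar users' likes."""
    seen = likes[user]  # past preferences
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        weight = jaccard(seen, items)  # how similar this person's taste is
        for item in items - seen:      # their preferences for something new
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

# Hypothetical preference data.
likes = {
    "ann": {"Alien", "Blade Runner"},
    "bob": {"Alien", "Blade Runner", "Dune"},
    "cat": {"Amelie", "Dune"},
    "dan": {"Amelie"},
}
```

With this data, `recommend(likes, "ann")` suggests only "Dune", because bob shares ann's tastes and also likes it, while the people who like "Amelie" have nothing in common with ann.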
what is the repeated clustering approach to collaborative filtering?
an example with movies:
- cluster people based on a movie they like
- then cluster movies based on being liked by similar clusters of people (even if you didn’t like a particular movie in the cluster, you are still part of that cluster)
- cluster people based on preferences for new clusters of movies
- repeat until no further clustering is possible, based on certain thresholds
Google PageRank is another example
what is text mining?
application of data mining to textual documents
what is graph mining?
data mining with graph data