Midterm Flashcards
5 V’s of Big Data
Value - Turning big data into value
Velocity - Speed at which data is generated and at which changes occur across diverse data sets
Volume - The amount of data being generated
Variety - Data can be structured as well as unstructured
Veracity - Data reliability and trust
Data Mining
Extraction of interesting patterns or knowledge from huge amounts of data
Web Mining Framework
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
AKA
Data pre-processing
Data Mining
Post-processing
Patterns, Info, Knowledge
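A toy walk-through of the seven steps above. The records, field names, and the 50.0 threshold are all made up for illustration; real pipelines use far richer methods at each step.

```python
# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "spend": 40.0}, {"id": 2, "age": None, "spend": 15.0}]
source_b = [{"id": 3, "age": 52, "spend": 90.0}, {"id": 4, "age": 47, "spend": 85.0}]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [record for source in sources for record in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: discretize age into two groups."""
    return [
        {"group": "under_40" if r["age"] < 40 else "40_plus", "spend": r["spend"]}
        for r in records
    ]

def mine(records):
    """Data mining: average spend per age group (a toy 'pattern')."""
    by_group = {}
    for r in records:
        by_group.setdefault(r["group"], []).append(r["spend"])
    return {group: sum(v) / len(v) for group, v in by_group.items()}

def evaluate(patterns, min_average):
    """Pattern evaluation: keep only groups whose average passes a threshold."""
    return {group: avg for group, avg in patterns.items() if avg >= min_average}

records = integrate(clean(source_a), clean(source_b))
patterns = mine(transform(select(records, ["age", "spend"])))
interesting = evaluate(patterns, min_average=50.0)

# Knowledge presentation: report the surviving patterns.
for group, average in interesting.items():
    print(f"{group}: average spend {average}")
```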
Data Mining on what data?
- Text files
- Database-oriented data sets and applications
- Advanced data sets and advanced applications
Supervised learning (classification)
Supervision: The training data are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc. - try to establish the existence of classes or clusters in the data
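A minimal sketch of clustering without labels, using a simple 1-D k-means as one possible technique (the notes do not name a specific algorithm). The measurements and starting centers are made up.

```python
def k_means_1d(values, centers, iterations=10):
    """Group 1-D values around the given starting centers (simple k-means)."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return clusters, centers

# Hypothetical unlabeled measurements.
measurements = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
clusters, centers = k_means_1d(measurements, centers=[0.0, 5.0])
print(clusters)
```

No class labels are supplied; the two groups emerge from the values alone.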
Classification and label prediction
- construct models based on some training examples
- describe and distinguish classes or concepts for future prediction
- predict the class, classify the new example
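A minimal sketch of classification from labeled training examples, using a 1-nearest-neighbor rule as one possible model (an assumption; the notes do not name a classifier). The training data and labels are made up.

```python
import math

def nearest_neighbor_predict(training_set, new_point):
    """Return the label of the training example closest to new_point."""
    _, closest_label = min(
        training_set,
        key=lambda example: math.dist(example[0], new_point),
    )
    return closest_label

# Hypothetical labeled training data: (features, class label).
training_set = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

# Classify a new, unlabeled example.
prediction = nearest_neighbor_predict(training_set, (2.0, 1.5))
print(prediction)  # small
```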
Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
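A minimal sketch of regression under the linear-model assumption, fitting a line by ordinary least squares. The data points are made up and lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations of a continuous target.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # predict the value at x = 5
print(predicted)  # 11.0
```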
Attribute
A property or characteristic of an object (columns)
Object
A collection of attributes describes an object (rows)
Types of Data sets
Record (Data matrix, documents, transactions)
Graph (World Wide Web, molecular structures)
Ordered (spatial data, temporal data, sequential data, genetic sequence data)
Structured vs unstructured data
Important characteristics of structured data
Dimensionality - Many attributes per object
Sparsity - only presence counts
Resolution - Patterns depend on the scale
Distribution
Types of Attributes
Nominal - ID numbers, gender, zip codes
Ordinal - rankings, grades, height as categories (tall, medium, short)
Numeric Attribute Types:
Interval - measured on a scale of equal-sized units; no true zero-point
Ratio - Inherent zero-point
Properties of Attribute Values
The type of an attribute depends on which of the following properties/operations it possesses:
Distinctness
Order
Differences are meaningful
Ratios are meaningful
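A small lookup sketch of how the four properties above map to attribute types: each type in the order nominal, ordinal, interval, ratio adds one more property. The names `PROPERTIES`, `ATTRIBUTE_TYPES`, and `attribute_type` are made up for this example.

```python
# Properties in the order listed above; each attribute type adds one more.
PROPERTIES = ["distinctness", "order", "differences", "ratios"]

ATTRIBUTE_TYPES = {
    "nominal": PROPERTIES[:1],   # distinctness only
    "ordinal": PROPERTIES[:2],   # + order
    "interval": PROPERTIES[:3],  # + meaningful differences
    "ratio": PROPERTIES[:4],     # + meaningful ratios
}

def attribute_type(properties):
    """Return the attribute type whose property set matches exactly, else None."""
    for name, required in ATTRIBUTE_TYPES.items():
        if set(required) == set(properties):
            return name
    return None

print(attribute_type(["distinctness", "order"]))  # ordinal
```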
Discrete vs Continuous Attributes
Discrete Attribute - Has only a finite or countably infinite set of values
- Sometimes represented as integer variables
- countable
- number of students, shoe size
Continuous attribute - measurable
- height, weight, length
- represented as floating-point variables
Similarity and Dissimilarity Measures
Similarity - numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0,1]
Dissimilarity - numerical measure of how different two data objects are
- Value is lower when objects are more alike
- minimum dissimilarity is often 0
Proximity refers to either similarity or dissimilarity
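A minimal sketch of one dissimilarity and one similarity measure. Euclidean distance and the 1 / (1 + d) conversion are assumptions chosen for illustration; the notes do not fix a particular measure.

```python
import math

def dissimilarity(a, b):
    """Euclidean distance: 0 for identical objects, larger when more different."""
    return math.dist(a, b)

def similarity(a, b):
    """Map the distance into (0, 1]: 1 for identical objects, lower when more different."""
    return 1.0 / (1.0 + dissimilarity(a, b))

print(dissimilarity((0, 0), (3, 4)))  # 5.0
print(similarity((1, 2), (1, 2)))     # 1.0
```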
Cosine Similarity
The cosine of the angle between two document vectors can be used to measure how similar the documents are
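A minimal sketch of the cosine measure, cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||). The two vectors are hypothetical term-frequency counts, and the function assumes neither vector is all zeros.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    return dot / (norm1 * norm2)

# Hypothetical term counts for two documents over the same 10-word vocabulary.
doc1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
doc2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

score = cosine_similarity(doc1, doc2)
print(round(score, 3))  # 0.315
```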
What is frequent pattern analysis?
Frequent pattern: a pattern that occurs frequently in a data set
Motivation: Finding inherent regularities in data
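A minimal sketch of finding frequent patterns by brute-force counting of item pairs. The transactions and the minimum support of 3 are made up; practical miners avoid enumerating every combination.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, size):
    """Count itemsets of the given size and keep those seen at least min_support times."""
    counts = Counter()
    for transaction in transactions:
        # Sort so the same itemset always produces the same tuple.
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    return {itemset: count for itemset, count in counts.items() if count >= min_support}

# Hypothetical market-basket transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

frequent_pairs = frequent_itemsets(transactions, min_support=3, size=2)
for pair, count in sorted(frequent_pairs.items()):
    print(pair, count)
```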