Midterm Flashcards
5 V’s of Big Data
Value - Turning big data into value
Velocity - The speed at which data is generated and at which changes occur across the diverse data sets
Volume - The amount of data being generated
Variety - Data can be structured as well as unstructured
Veracity - Data reliability and trust
Data Mining
Extraction of interesting patterns or knowledge from huge amounts of data
Web Mining Framework
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
AKA
Data pre-processing
Data Mining
Post-processing
Patterns, Info, Knowledge
Data Mining on what data?
- Text files
- Database-oriented data sets and applications
- Advanced data sets and advanced applications
Supervised learning (classification)
Supervision: The training data are accompanied by labels indicating the class of the observations
- New data are classified based on the training set
Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc. - try to establish the existence of classes or clusters in the data
Classification and label prediction
- construct models based on some training examples
- describe and distinguish classes or concepts for future prediction
- predict the class, classify the new example
Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
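A minimal regression sketch (assuming NumPy; the data values are made up): fit a linear model y ≈ a·x + b by least squares, then predict a new value.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # predictor variable
y = np.array([2.1, 3.9, 6.2, 8.1])   # continuous target variable

a, b = np.polyfit(x, y, deg=1)       # slope and intercept of the fitted line
print(a * 5.0 + b)                   # predicted value for x = 5
```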
Attribute
A property or characteristic of an object (columns)
Object
A collection of attributes describes an object (rows)
Types of Data sets
Record (Data matrix, documents, transactions)
Graph ( World Wide Web, molecular structures)
Ordered (spatial data, temporal data, sequential data, genetic sequence data)
Structured vs unstructured data
Important characteristics of structured data
Dimensionality - Many attributes per object
Sparsity - only presence counts
Resolution - Patterns depend on the scale
Distribution
Types of Attributes
Nominal - ID numbers, gender, zip codes
Ordinal - rankings, grades, height in {tall, medium, short}
Numeric Attribute Types:
Interval - measured on a scale of equal-sized units
Ratio - Inherent zero-point
Properties of Attribute Values
The type of an attribute depends on which of the following properties/operations it possesses:
Distinctness
Order
Differences are meaningful
Ratios are meaningful
(Nominal has distinctness; ordinal adds order; interval adds meaningful differences; ratio adds meaningful ratios)
Discrete vs Continuous Attributes
Discrete Attribute - Has only a finite or countably infinite set of values
- Sometimes represented as integer variables
- countable
- number of students, shoe size
Continuous attribute - measurable
- height, weight, length
- represented as floating-point variables
Similarity and Dissimilarity Measures
Similarity - numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0,1]
Dissimilarity - numerical measure of how different two data objects are
- Value is lower when objects are more alike
- minimum dissimilarity is often 0
Proximity refers to a similarity or dissimilarity
Cosine Similarity
The cosine measure can be used to measure the similarity between two document vectors: cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)
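A small sketch of the measure (assuming NumPy; the term-count vectors are made up):

```python
import numpy as np

d1 = np.array([3, 0, 2, 1])   # term counts for document 1
d2 = np.array([1, 1, 2, 0])   # term counts for document 2

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos_sim)                # closer to 1 means more alike
```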
What is frequent pattern analysis?
Frequent pattern: a pattern that occurs frequently in a data set
Motivation: Finding inherent regularities in data
(absolute) support, or support count of X is
The frequency or occurrence count of an itemset X
(relative) support
Is the fraction of transactions that contains X
An itemset X is frequent IF
X’s support is no less than a minimum support threshold
support s is the probability that
a transaction contains X ∪ Y (i.e., both X and Y)
confidence c is the conditional probability that a transaction
having X also contains Y: c = P(Y|X) = support(X ∪ Y) / support(X)
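A worked example for the rule {milk} → {bread} over five made-up transactions:

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"milk", "bread"},
]

n = len(transactions)
count_milk = sum("milk" in t for t in transactions)             # 4
count_both = sum({"milk", "bread"} <= t for t in transactions)  # 3

support = count_both / n              # P(milk and bread) = 3/5 = 0.6
confidence = count_both / count_milk  # P(bread | milk)  = 3/4 = 0.75
print(support, confidence)
```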
Frequent itemsets
- An itemset that contains k items is a k-itemset
- rules that satisfy the minimum support and minimum confidence thresholds are considered strong rules
Basic association rule process
- Find all frequent itemsets - each of these itemsets must occur at least as frequently as the predetermined minimum support count
- Generate strong association rules from the frequent itemsets: These rules must satisfy the minimum support and minimum confidence
Apriori: A candidate generation and test approach
If there is any itemset which is infrequent, its superset should not be generated/tested
- in other words, all subsets of a frequent itemset must be frequent
General apriori method:
- scan dataset to get frequent 1-itemsets
- generate length (k+1) candidate itemsets from length k frequent itemsets
- test the candidates against dataset to obtain support counts
- terminate when no frequent or candidate set can be generated
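A minimal Apriori sketch following these steps (plain Python over small in-memory transaction sets; min_sup is an absolute support count):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # Scan the dataset once to get the frequent 1-itemsets
    items = {item for t in transactions for item in t}
    freq = [{frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_sup}]
    k = 1
    while freq[-1]:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        cands = {a | b for a in freq[-1] for b in freq[-1] if len(a | b) == k + 1}
        # Apriori property: prune candidates with any infrequent k-subset
        cands = {c for c in cands
                 if all(frozenset(s) in freq[-1] for s in combinations(c, k))}
        # Test surviving candidates against the dataset for support
        freq.append({c for c in cands
                     if sum(c <= t for t in transactions) >= min_sup})
        k += 1   # terminates when no frequent or candidate set remains
    return [s for level in freq for s in level]

txns = [frozenset(t) for t in ({"a","b"}, {"a","b","c"}, {"a","c"}, {"b","c"})]
print(apriori(txns, min_sup=2))
```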
Major Tasks in Data Preprocessing
Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data Integration
- Integration of multiple databases, data streams or files
Data reduction
- Dimensionality reduction
- Numerosity reduction
Data transformation and data discretization
- Normalization
Data Cleaning
incomplete, noisy, inconsistent
How to handle missing data?
Ignore the record, or fill it in automatically with a constant like NA, the attribute mean, or the attribute mean for all samples belonging to the same class (the smartest approach)
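A sketch of the class-conditional mean fill (assuming pandas; the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"class": ["a", "a", "b", "b"],
                   "income": [50.0, None, 70.0, None]})

# Fill each missing value with the attribute mean of its own class
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```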
How to handle noisy data?
- Binning - first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, medians, or boundaries
- Regression - smooth by fitting the data to regression functions
- Clustering - detect and remove outliers
- Combined computer and human inspection - detect suspicious values and check them manually
What is data integration?
Combining data from multiple sources into a coherent dataset
Schema integration - integrate metadata from different sources
Handling Redundancy in Data Integration
Object identification - the same real-world entity may appear under different names in different sources
Derivable data - one attribute may be derived from another attribute or table
Redundant attributes may be detected by correlation analysis and covariance analysis
Correlation Analysis (Nominal Data)
chi-squared test
χ² = Σ (Observed − Expected)² / Expected
The larger the χ² value, the more likely the variables are related
CORRELATION DOESN'T IMPLY CAUSALITY
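A sketch of the test (assuming SciPy; the 2x2 contingency counts are made up):

```python
from scipy.stats import chi2_contingency

observed = [[250, 200],   # e.g., likes sci-fi: plays chess / doesn't
            [50, 1000]]   # doesn't like sci-fi
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)  # a large chi2 (small p) suggests the attributes are related
```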
Covariance
How much do attributes change together
Positive covariance - if Cov(A,B) > 0, then A and B both tend to be larger than their expected values
Negative covariance - if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value
Independence - if A and B are independent, then Cov(A,B) = 0 (but Cov(A,B) = 0 does not imply independence)
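A quick sign check (assuming NumPy; the two series are made up), using Cov(A,B) = E[A·B] − E[A]·E[B]:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

cov_ab = np.mean(a * b) - np.mean(a) * np.mean(b)  # Cov(A,B) = E[AB] - E[A]E[B]
print(cov_ab)   # positive here, so A and B tend to rise together
```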
Data Reduction
Obtain a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results
Normalization is
Scaling data so it falls within a smaller, specified range, such as [0.0, 1.0]
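For example, min-max normalization maps values into [0, 1] via v' = (v − min) / (max − min). A sketch (assuming NumPy; the income values are made up):

```python
import numpy as np

v = np.array([12000.0, 73600.0, 98000.0])
v_norm = (v - v.min()) / (v.max() - v.min())  # v' = (v - min) / (max - min)
print(v_norm)
```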
Sampling
Main technique for data reduction
- Used because obtaining or processing the entire set of data of interest is too expensive or time consuming
Types of Sampling
- Simple random sampling - There is an equal probability of selecting any particular item
- Sampling without replacement - Once an object is selected, it is removed from the population
- Sampling with replacement
- Stratified sampling
- Partition data set and draw samples from each partition
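A sketch of the three schemes (assuming pandas; the DataFrame and strata column are made up):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a"] * 90 + ["b"] * 10, "x": range(100)})

simple = df.sample(n=10, replace=False)            # simple random, without replacement
with_rep = df.sample(n=10, replace=True)           # with replacement
stratified = df.groupby("group").sample(frac=0.1)  # ~10% drawn from each partition
print(len(simple), len(with_rep), len(stratified))
```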
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Discretization
- The process of converting a continuous attribute into an ordinal attribute
- A potentially infinite number of values are mapped into a small number of categories
- Discretization is commonly used in classification
Binning
Equal-width binning partitions based on a fixed bin width; equal-frequency binning partitions so each bin holds roughly the same number of values
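A sketch of both schemes (assuming pandas; the values are a small made-up series):

```python
import pandas as pd

v = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(v, bins=3)   # partition by fixed bin width
equal_freq = pd.qcut(v, q=3)      # partition so each bin holds ~equal counts
print(equal_width.value_counts(), equal_freq.value_counts(), sep="\n")
```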
Unsupervised discretization
Finds breaks in the data values
Supervised discretization
Uses class labels to find breaks
Binarization
- Maps a continuous or categorical attribute into one or more binary values
- Typically used for association analysis
- continuous to categorical then categorical to binary
- Association analysis needs asymmetric binary attributes
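A sketch of the categorical-to-binary step via one-hot encoding (assuming pandas; the column is made up):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "medium", "large", "small"]})
binary = pd.get_dummies(df["size"])  # one binary attribute per category
print(binary)
```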