Lecture 1 Flashcards
What is data mining?
Identifying novel, potentially useful, and ultimately understandable patterns in data. (Piatetsky-Shapiro)
What fields is data mining a combination of?
Machine learning, application domain, databases, statistics, visualization
What is the difference between data mining and statistics?
datasets in data mining typically are:
- samples
- larger
- noisier, incomplete, heterogeneous
statistics
- often deals with the whole population
- often is concerned with hypothesis testing
What is big data?
data that is too large to be analyzed with today’s resources
What are some questions that we can answer with data mining techniques?
Is this object a star or galaxy?
Are customers likely to buy bread together with milk?
What is the value for a particular stock going to be in …
What book is a customer likely going to buy?
…
What are the 7 main data mining tasks?
Sequential pattern discovery, outlier detection, association rule discovery, regression, clustering, visualization, classification
Describe the data mining task “classification”
Assigning a category to each object in a data set
Describe the data mining task “visualization”
Visualizing the data. What does the data look like?
Describe the data mining task “clustering”
Determining groups of objects in a data set.
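Example (a minimal sketch, not from the lecture): one common way to determine groups is k-means, shown here for one-dimensional values. The function name `kmeans_1d`, the customer ages, and the starting centers are all assumptions made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around the given starting centers (Lloyd-style k-means)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical customer ages; the starting centers are arbitrary guesses.
ages = [18, 21, 22, 45, 47, 50]
centers, clusters = kmeans_1d(ages, centers=[20, 40])
print(clusters)  # [[18, 21, 22], [45, 47, 50]]
```

No class labels are supplied here; the two groups emerge from the values alone.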
Describe the data mining task “Association Rule Discovery”
Determining which objects belong together in a data set.
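Example (a minimal sketch, not from the lecture): "belonging together" is commonly quantified with support and confidence. The baskets below are invented, and the helper names `count`, `support`, and `confidence` are hypothetical.

```python
# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction also containing consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)


print(support({"bread", "milk"}, baskets))       # 0.6  (3 of 5 baskets)
print(confidence({"bread"}, {"milk"}, baskets))  # 0.75 (3 of the 4 bread baskets)
```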
Describe the data mining task “Outlier Detection”
Determining which objects do not belong with the rest.
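Example (a minimal sketch, not from the lecture): one simple way to flag objects that do not belong is a z-score test, which marks values far from the mean. The threshold of 2.0 and the sensor readings are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor temperatures with one suspicious reading.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 35.0, 20.2]
print(zscore_outliers(readings))  # [35.0]
```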
Describe the data mining task “Sequential Pattern Discovery”
Determining what happens in the data over time.
Describe the data mining task “Regression”
Assigning a numerical value to each object in the data
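Example (a minimal sketch, not from the lecture): a straight-line least-squares fit is one of the simplest ways to assign a numerical value to a new object. The floor areas, prices, and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: floor area and price, chosen to lie exactly on a line.
areas = [50, 60, 70, 80]
prices = [150, 180, 210, 240]
slope, intercept = fit_line(areas, prices)


def predict(area):
    """Predicted price for an unseen floor area."""
    return slope * area + intercept


print(predict(65))  # 195.0
```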
Explain the data mining process
1. Collect data
2. Prepare data
3. Build model
4. Evaluate model
5. Deploy model
How must data be prepared for neural networks?
wants numbers (categorical attributes must be transformed)
likes data to be scaled
does not like noisy data, especially for small datasets
can handle irrelevant or redundant attributes (in decision trees, by contrast, such attributes may lead to large trees)
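Example (a minimal sketch, not from the lecture): two typical preparation steps for neural networks are turning categories into numbers and scaling numeric attributes. One-hot encoding and min-max scaling are used here as assumed choices; the data and function names are hypothetical.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector, one position per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attributes of four customers.
colours = ["red", "green", "red", "blue"]
incomes = [20000, 50000, 80000, 35000]

print(one_hot(colours))        # columns are blue, green, red
print(min_max_scale(incomes))  # [0.0, 0.5, 1.0, 0.25]
```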
How must data be prepared for decision trees?
work better with discrete attributes that have small numbers of possible values
do not care about scaling
Can nearest-neighbour data mining techniques handle noise well?
Yes, nearest-neighbour approaches can handle noise if a certain parameter is adjusted (typically the number of neighbours, k).
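Example (a minimal sketch, not from the lecture): in k-nearest-neighbour classification, the number of neighbours k is a parameter that affects noise sensitivity. With k = 1 a single mislabelled point decides the answer; with a larger k it is outvoted. The training data and the function name `knn_predict` are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest 1-D training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 1-D training data; the point at 2.1 carries a noisy label.
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (2.1, "B"),
         (2.5, "A"), (6.0, "B"), (6.5, "B"), (7.0, "B")]

print(knn_predict(train, 2.2, k=1))  # B (follows the noisy neighbour)
print(knn_predict(train, 2.2, k=3))  # A (the noisy neighbour is outvoted)
```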
How must data be prepared for distance-based approaches?
Distance-based approaches do not work well if the attributes are equally weighted, and they typically work with numerical data only.
Can expectation maximization data mining techniques deal with missing data?
Expectation maximization approaches can deal with missing data, but k-means techniques require substitution of missing data.
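Example (a minimal sketch, not from the lecture): one simple substitution strategy before running k-means is to fill each missing value with the mean of the observed values in that attribute. Mean substitution is an assumed choice here, not necessarily the one the lecture had in mind; the data and the function name `impute_mean` are hypothetical.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


# Hypothetical humidity readings with two missing entries.
humidity = [40.0, None, 50.0, 60.0, None]
print(impute_mean(humidity))  # [40.0, 50.0, 50.0, 60.0, 50.0]
```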
Why are data transformations used in data mining?
Data must be transformed into an appropriate type for specific data mining techniques to work properly.
e.g. neural networks need attributes to be numerical values
What are the two types of data mining techniques?
Predictive (supervised)
Descriptive (unsupervised)
Describe predictive (supervised) data mining techniques
predict (discrete or continuous) class attributes based on other attribute values
this is like learning from a teacher
Describe descriptive (unsupervised) data mining techniques
discover structure of data without prior knowledge of class labels.
Is the following data mining technique descriptive or predictive?
Classification
Predictive
Is the following data mining technique descriptive or predictive?
Visualization
Descriptive
Is the following data mining technique descriptive or predictive?
Clustering
Descriptive
Is the following data mining technique descriptive or predictive?
Association Rule Discovery
Descriptive
Is the following data mining technique descriptive or predictive?
Outlier detection
Predictive or Descriptive
Which data mining technique is the following:
is this object a star or a galaxy?
Classification
Which data mining technique is the following:
What book is this customer likely to buy? Are there additional books we should recommend?
Clustering & Visualization
Which data mining technique is the following:
How many groups of customers are there in the data we collected?
Clustering & Visualization
Which data mining technique is the following:
Are customers likely to buy bread together with milk?
Association rule discovery
Which data mining technique is the following:
Does the traffic we currently see in our network contain any malicious packets?
Outlier detection
What is a class label?
A label which identifies the class of the observation in a data set.
A class is just a way to identify a specific type of observation.
An example might be a data set containing both normal and abnormal readings from a sensor where each sensor reading is a combination of measurements (e.g. temperature, humidity, etc.).
The class label for a given sensor reading, in this case, would be normal or abnormal and the measurements would have to meet certain criteria to be identified as either class.
This would be a two-class situation; data sets can also have more than two classes of observations, or none at all.
What is an application domain?
An application domain is simply the domain under study.
For example, you might be applying data mining techniques to data from biology, industry, or astronomy, to name a few. In these cases, biology, industry, and astronomy would be the respective application domains.