Lecture 1 Flashcards
What is data mining?
Identifying novel, potentially useful, and ultimately understandable patterns in data (Piatetsky-Shapiro).
What fields is data mining a combination of?
Machine learning, application domain, databases, statistics, visualization
What is the difference between data mining and statistics?
datasets in data mining typically are:
- samples
- larger
- noisier, incomplete, heterogeneous
statistics:
- often deals with the whole population
- often is concerned with hypothesis testing
What is big data?
data that is too large to be analyzed with today’s resources
What are some questions that we can answer with data mining techniques?
Is this object a star or galaxy?
Are customers likely to buy bread together with milk?
What is the value for a particular stock going to be in …
What book is a customer likely going to buy?
…
What are the 7 main data mining tasks?
Sequential pattern discovery, outlier detection, association rule discovery, regression, clustering, visualization, classification
Describe the data mining task “classification”
Assigning a category to each object in a data set
Describe the data mining task “visualization”
Visualizing the data. What does the data look like?
Describe the data mining task “clustering”
Determining groups of objects in a data set.
Describe the data mining task “Association Rule Discovery”
Determining which objects belong together in a data set.
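As an illustration of the bread-and-milk question, here is a minimal sketch of how "belonging together" is commonly quantified, using the standard support and confidence measures. The transactions and the function names are hypothetical and not taken from the lecture.

```python
# Hypothetical toy transactions; each basket is a set of item names.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule under test: customers who buy bread also buy milk.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.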
Describe the data mining task “Outlier Detection”
Determining which objects do not belong with the rest.
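One simple way to make "does not belong with the rest" concrete is a z-score rule: flag values that lie far from the mean in units of standard deviation. This is only a sketch of one possible technique; the threshold of 2.0, the sample readings, and the function name are assumptions for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations (assumed cutoff)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical measurements with one value far from the others.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(readings)
```

With these readings, only the value 50 exceeds the threshold.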
Describe the data mining task “Sequential Pattern Discovery”
Determining what happens in the data over time.
Describe the data mining task “Regression”
Assigning a numerical value to each object in the data
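A minimal sketch of regression, assuming the simplest case of a straight line fitted by ordinary least squares to one input attribute. The data points and the name fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Assign a numerical value to a new, unseen object with x = 5.
predicted = slope * 5 + intercept
```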
Explain the data mining process
Collect data, prepare data, build model, evaluate model, deploy model
How must data be prepared for neural networks?
wants numbers (categorical attributes must be transformed)
likes data to be scaled
does not like noisy data, especially for small datasets
can handle irrelevant or redundant attributes, while they may lead to large decision trees
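The first two preparation points above (numeric inputs and scaled data) can be sketched as follows. One-hot encoding and min-max scaling are common choices, not necessarily the ones used in the lecture; the attribute values and function names are hypothetical.

```python
def one_hot(values):
    """Turn a categorical attribute into numeric 0/1 columns,
    one column per category (columns in sorted category order)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def min_max_scale(values):
    """Rescale a numeric attribute linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute carries no information
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attributes: one categorical, one numeric.
colors = ["red", "green", "red", "blue"]
ages = [20, 30, 40, 60]

encoded = one_hot(colors)    # columns: blue, green, red
scaled = min_max_scale(ages)
```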