Lecture 4 Flashcards
1
Q
Big data 5 V’s
A
Volume, variety, velocity, veracity (data are generated by organic, distributed processes, so quality varies, e.g. missing or unclean values), and value
2
Q
Factors impacting choice of method
A
- size of dataset
- types of patterns that exist in dataset
- whether the data meet the underlying assumptions of the method
- how noisy the data are
- the goal of the analysis
3
Q
Steps in a typical data mining effort
A
- develop an understanding of the purpose of the data mining project
- obtain the dataset to be used
- explore, clean, and preprocess the data
- reduce the data dimension, if necessary
- determine the data mining task (a more specific version of step 1)
- partition the data (for supervised tasks)
- choose the data mining algorithm
- use the algorithm to perform the task
- interpret the results
- deploy the model
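The partition step above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `partition` helper, the 60/40 split, and the toy `records` are all assumptions made for the example.

```python
import random


def partition(rows, train_fraction=0.6, seed=0):
    """Shuffle rows reproducibly, then split into training and validation sets.

    Hypothetical helper for illustration; the 60/40 default is an assumption.
    """
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed gives a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy supervised dataset: (predictor, outcome) pairs.
records = [(x, 2 * x + 1) for x in range(10)]
train, valid = partition(records)
```

The model is fitted on `train` only; `valid` is held back so that performance is judged on records the algorithm has not seen.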
4
Q
SEMMA methodology tasks
A
- sample
- explore (visualization and basic description of the data)
- modify (select variables, transform variable representation)
- model (use statistical and ML models)
- assess (accuracy of the model)
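The five SEMMA tasks can be walked through on a toy problem. The sketch below is only an illustration under stated assumptions: the synthetic data (y roughly 3x + 2), the sample size, centering as the "modify" step, and a least-squares line as the "model" are all choices made for this example. For brevity the fit is assessed on the same sample it was trained on, which a real project would avoid.

```python
import random
import statistics

# Hypothetical toy population: y is roughly 3*x + 2 with small uniform noise.
rng = random.Random(42)
population = [(x, 3 * x + 2 + rng.uniform(-0.5, 0.5)) for x in range(100)]

# Sample: draw a subset of records to work with.
sample = rng.sample(population, 30)
xs = [x for x, _ in sample]
ys = [y for _, y in sample]

# Explore: basic descriptive statistics.
summary = {"mean_x": statistics.mean(xs), "mean_y": statistics.mean(ys)}

# Modify: transform the predictor by centering it on its mean.
centered = [x - summary["mean_x"] for x in xs]

# Model: least-squares line on the centered predictor.
slope = sum(c * y for c, y in zip(centered, ys)) / sum(c * c for c in centered)
intercept = summary["mean_y"]  # fitted value where the centered predictor is 0

# Assess: root-mean-squared error of the fitted line.
residuals = [y - (intercept + slope * c) for c, y in zip(centered, ys)]
rmse = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
```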
5
Q
CRISP-DM
A
Cross-Industry Standard Process for Data Mining
- business understanding
- data understanding
- data preparation
- model building
- testing and evaluation
- deployment
(steps 4, 5, and 6 correspond to SEMMA)