Lecture 4 Flashcards

1
Q

Big data 5 V’s

A
Volume
Variety
Velocity
Veracity: data generated by organic, distributed processes, i.e. data quality (missing values, clean values)
Value
2
Q

Factors impacting choice of method

A
  • size of dataset
  • types of patterns that exist in dataset
  • if the data meet some underlying assumptions of the method
  • how noisy the data are
  • the goal of the analysis
3
Q

steps in a typical data mining effort

A
  1. develop an understanding of the purpose of the data mining project
  2. obtain the dataset to be used
  3. explore, clean, and preprocess the data
  4. reduce the data dimension, if necessary
  5. determine the data mining task (a more specific version of step 1)
  6. partition the data (for supervised tasks)
  7. choose the data mining algorithm
  8. use algorithm to perform the task
  9. interpret the results
  10. deploy the model
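
The middle steps of the list above (obtain, clean, partition, fit, interpret) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset, the helper names `partition`, `fit_line`, and `rmse`, the 60/40 split, and the choice of a least-squares line as the "algorithm" are all hypothetical and not part of the lecture.

```python
import random

def partition(rows, train_fraction=0.6, seed=0):
    """Step 6: shuffle the rows and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(train):
    """Steps 7-8: fit y = a + b*x by ordinary least squares (assumed algorithm)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def rmse(model, rows):
    """Step 9: root mean squared error of the fitted line on held-out rows."""
    a, b = model
    return (sum((y - (a + b * x)) ** 2 for x, y in rows) / len(rows)) ** 0.5

# Step 2: obtain the dataset (a made-up toy set with one missing value).
raw = [(x, 2.0 * x + 1.0) for x in range(20)] + [(None, 5.0)]

# Step 3: clean the data by dropping rows with missing values.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Steps 6-9: partition, fit on the training set, evaluate on the validation set.
train, valid = partition(clean)
model = fit_line(train)
error = rmse(model, valid)
```

The validation error is computed only on rows the model never saw during fitting, which is the reason step 6 comes before step 7 for supervised tasks.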
4
Q

SEMMA methodology tasks

A
  • sample
  • explore (visualization and basic description of the data)
  • modify (select variables, transform variable representation)
  • model (use statistical and ML models)
  • assess (accuracy of the model)
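
As a rough illustration, the five SEMMA tasks above map onto code roughly as follows. Everything here is an assumption made for the sketch: the synthetic two-class population, the sample size of 200, the 140/60 split, and the simple midpoint-threshold classifier standing in for a "statistical or ML model".

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of (measurement, label) pairs; class 1 has larger values.
population = (
    [(rng.gauss(0.0, 1.0), 0) for _ in range(500)]
    + [(rng.gauss(4.0, 1.0), 1) for _ in range(500)]
)

# Sample: draw a manageable random subset of the population.
sample = rng.sample(population, 200)

# Explore: basic descriptive statistics of the sampled measurement.
values = [v for v, _ in sample]
mean_v = statistics.mean(values)
std_v = statistics.stdev(values)

# Modify: transform the variable representation by standardizing it.
modified = [((v - mean_v) / std_v, label) for v, label in sample]

# Model: threshold halfway between the two class means, fitted on the first 140 rows.
train, holdout = modified[:140], modified[140:]
mean0 = statistics.mean(z for z, label in train if label == 0)
mean1 = statistics.mean(z for z, label in train if label == 1)
threshold = (mean0 + mean1) / 2

def predict(z):
    """Assign class 1 when the standardized value exceeds the threshold."""
    return 1 if z > threshold else 0

# Assess: accuracy of the model on the rows not used for fitting.
accuracy = sum(predict(z) == label for z, label in holdout) / len(holdout)
```

Each commented stage corresponds to one SEMMA task, so swapping in a different model only changes the "Model" stage while the other four stay the same.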
5
Q

CRISP-DM

A

Cross-Industry Standard Process for Data Mining

  1. business understanding
  2. data understanding
  3. data preparation
  4. model building
  5. testing and evaluation
  6. deployment

(steps 4, 5, and 6 correspond to SEMMA)
