Lecture 4 Flashcards by Ricardo Beenen

Q

Big data 5 V’s

A

Volume
variety
velocity
veracity: data quality (e.g. missing or unclean values), since the data are generated by organic, distributed processes
value
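
As a rough illustration of checking veracity, the sketch below measures the share of missing values per column. The records, the column names, and the helper `missing_share` are hypothetical and are not taken from the lecture.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]


def missing_share(rows, column):
    """Return the fraction of rows in which `column` is missing (None)."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


# Share of missing values for each (assumed) column name.
shares = {col: missing_share(records, col) for col in ("age", "income")}
print(shares)  # {'age': 0.25, 'income': 0.25}
```

A high share for a column is a hint that it needs cleaning or imputation before any modelling.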

Q

Factors impacting choice of method

A

size of dataset
types of patterns that exist in dataset
whether the data meet the underlying assumptions of the method
how noisy the data are
the goal of the analysis

Q

steps in a typical data mining effort

A

develop an understanding of the purpose of the data mining project
obtain the dataset to be used
explore, clean, and preprocess the data
reduce the data dimension, if necessary
determine the data mining task (a more specific version of step 1)
partition the data (for supervised tasks)
choose the data mining algorithm
use algorithm to perform the task
interpret the results
deploy the model
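
The partition, algorithm, and interpretation steps above can be sketched as follows. This is a minimal example under stated assumptions: the toy data, the 60/40 split, and the helpers `partition`, `fit_threshold`, and `accuracy` are all hypothetical, and the one-threshold classifier merely stands in for whatever algorithm is actually chosen.

```python
import random


def partition(rows, train_fraction=0.6, seed=1):
    """Shuffle a copy of `rows` and split it into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(threshold, rows):
    """Share of rows where the rule 'x >= threshold' predicts the label."""
    hits = sum(int(x >= threshold) == y for x, y in rows)
    return hits / len(rows)


def fit_threshold(train_rows):
    """Pick the training x value that maximises training accuracy."""
    candidates = sorted(x for x, _ in train_rows)
    return max(candidates, key=lambda t: accuracy(t, train_rows))


# Hypothetical labelled data: (feature, label) pairs, label is 1 when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]

train, valid = partition(data)            # partition the data
threshold = fit_threshold(train)          # choose and run the algorithm
train_acc = accuracy(threshold, train)    # interpret: fit on training data
valid_acc = accuracy(threshold, valid)    # interpret: performance on unseen data
print(threshold, train_acc, valid_acc)
```

The validation set is kept out of fitting so that its accuracy gives a less optimistic estimate than the training accuracy.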

Q

SEMMA methodology tasks

A

sample
explore (visualization and basic description of the data)
modify (select variables, transform variable representation)
model (use statistical and ML models)
assess (accuracy of the model)
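
The five SEMMA tasks can be walked through in a few lines. This is only a sketch: the synthetic data (y roughly 2x plus noise), the sample size of 30, and the hand-written least-squares fit are assumptions chosen for illustration, not part of the methodology itself.

```python
import random
import statistics

# Hypothetical population: y is roughly 2 * x plus uniform noise.
rng = random.Random(0)
population = [(x, 2 * x + rng.uniform(-1, 1)) for x in range(100)]

# Sample: draw a subset of the population.
sample = rng.sample(population, 30)
xs = [x for x, _ in sample]
ys = [y for _, y in sample]

# Explore: basic description of the sampled data.
summary = {"mean_x": statistics.mean(xs), "min_x": min(xs), "max_x": max(xs)}

# Modify: transform the variable representation by centering x.
mean_x = statistics.mean(xs)
centered = [x - mean_x for x in xs]

# Model: least-squares line on the centered variable.
mean_y = statistics.mean(ys)
slope = sum(c * (y - mean_y) for c, y in zip(centered, ys)) / sum(
    c * c for c in centered
)
intercept = mean_y  # with centered x, the intercept is the mean of y


def predict(x):
    """Predict y for a raw (uncentered) x value."""
    return intercept + slope * (x - mean_x)


# Assess: mean absolute error on the rows that were not sampled.
sampled = set(sample)
holdout = [row for row in population if row not in sampled]
mae = statistics.mean(abs(y - predict(x)) for x, y in holdout)
print(summary, round(slope, 3), round(mae, 3))
```

Assessing on rows outside the sample keeps the error estimate from being flattered by the data the model was fitted on.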

Q

CRISP-DM

A

Cross-Industry Standard Process for Data Mining

business understanding
data understanding
data preparation
model building
testing and evaluation
deployment

(steps 4, 5, and 6 correspond to SEMMA)
