Chapter 1: Data Mining Flashcards

Question 1

Q

What is data mining?

Answer

A

Extracting useful information from large data sets

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.

[Data mining] is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques.

Question 2

Q

Where is data mining used?

Answer

A

Numerous applications:
Military: Accuracy of weapons
Intelligence: Determine which communication is of interest
Medical: Determine likelihood of cancer relapse

In Business:
Prospective Customers: Which are most likely to respond

Customer Risk: Which are most likely to commit fraud

Finance: Which loans are likely to default
Logistic regression can assign a “probability of default” value to each loan
Subscriptions: Which customers are most likely to abandon

CRM (customer relationship management)
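
As a minimal sketch of the "probability of default" idea above, the following fits a one-feature logistic regression by plain gradient descent. The debt-to-income ratios, the default labels, the function names, and the learning-rate settings are all made up for illustration; a real project would normally rely on a statistics or machine-learning library instead.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit p(default) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # prediction error for this loan
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical training data: debt-to-income ratio and whether the loan
# defaulted (1) or was repaid (0).
debt_to_income = [0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.50, 0.60, 0.70, 0.80]
defaulted = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(debt_to_income, defaulted)

def probability_of_default(x):
    """Score a new loan with the fitted model."""
    return sigmoid(w * x + b)

for ratio in (0.15, 0.40, 0.75):
    print(ratio, round(probability_of_default(ratio), 3))
```

The output is a score between 0 and 1 for each loan rather than a yes/no answer, which is what lets a lender rank loans by risk and choose its own cutoff.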

Question 3

Q

Virtuous cycle of data mining

Answer

A

The Virtuous Cycle: Identify > Transform Data > Act > Measure Results (then back to Identify)
1. Identify business opportunities
2. Mine data to transform it into actionable information
3. Act on the information
4. Measure the results

Question 4

Q

Origins of data mining

Answer

A

Data Mining: at the confluence of statistics and machine learning (aka Artificial Intelligence)
Statistics: e.g. linear regression, logistic regression, discriminant analysis, principal component analysis

Question 5

Q

Key differences between classical statistics and data mining

Answer

A

The emphasis that statistics places on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining.
Data mining deals with large data sets in an open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require.
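
To make "might have happened by chance" concrete, here is a minimal sketch of one common inference technique, a permutation test. The technique is chosen here for illustration and is not named by the source; the purchase amounts, group names, and function name are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when the group labels are shuffled at random."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the labels carry no information
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical purchase amounts for customers who did / did not get an offer.
with_offer = [52, 61, 58, 49, 66, 70, 55, 63]
without_offer = [48, 50, 45, 53, 47, 51, 44, 49]

p_value = permutation_p_value(with_offer, without_offer)
print("estimated p-value:", p_value)
```

A small p-value suggests the gap between the groups is unlikely to be a fluke. Note that the test only works because the question (one specific comparison) was fixed in advance, which is the kind of strict limit that open-ended data mining does not have.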

Question 6

Q

Overfitting risk

Answer

A

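Statistics: emphasis on inference (limited data). Data mining: large data sets, open-ended (no strict limits on the question).
Because of this open-ended approach, data mining is vulnerable to overfitting: a model is fit so closely to the available sample that it describes random peculiarities of the data rather than real structure.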
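
The sketch below illustrates overfitting under stated assumptions: the data are synthetic (y = 2x plus random noise), and the two models (a "memorizer" that looks up the nearest training point, and a least-squares line) are hypothetical stand-ins chosen only to show the contrast between training error and holdout error.

```python
import random

rng = random.Random(42)

def make_data(n):
    """Synthetic data for illustration: y = 2x plus random noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2 * x + rng.gauss(0, 1)))
    return data

train, holdout = make_data(50), make_data(50)

def memorizer(x):
    """Overfit model: return the y of the closest training x (1-nearest neighbor)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def fit_line(data):
    """Ordinary least squares for a single predictor."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

def line(x):
    """Simple model: the fitted straight line."""
    return slope * x + intercept

def mse(model, data):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, model in [("memorizer", memorizer), ("line", line)]:
    print(name, "train MSE:", round(mse(model, train), 3),
          "holdout MSE:", round(mse(model, holdout), 3))
```

The memorizer reproduces the training set perfectly because it has stored the noise along with the signal, so its training error says nothing about how it performs on data it has not seen. Comparing training error with error on a holdout set is the usual way to detect this.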

Question 7

Q

Big data V’s

Answer

A

Value: The usefulness of the data (from Big Data Fundamentals, Erl)
Volume: The Amount of Data
Velocity: The speed at which data is generated and changed
Variety: The different types of data being generated
Veracity: The quality and trustworthiness of the data; big data is generally not subject to the controls or quality checks that apply to data collected in a study

Question 8

Q

Better data makes better data science.

Is your data:

Answer

A

Relevant, connected, accurate, and enough to work with

Question 9

Q

Data Science

Answer

A

The process of using names and numbers to predict an answer to a question (names = categories, labels)

Question 10

Q

The 5 questions data science can answer

Answer

A

Is this A or B? - classification
Is this weird? - anomaly detection
How much or how many? - regression
How is this organized? - clustering
What should I do next? - reinforcement learning
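
To make one of these five questions concrete, "Is this weird?" can be sketched with a simple z-score rule. This is only one of many anomaly-detection approaches; the order counts, the function name, and the threshold of 2 standard deviations are assumptions made for illustration.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score (distance from the mean, measured in
    standard deviations) exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical daily order counts; one day is clearly out of line.
daily_orders = [10, 11, 9, 10, 12, 10, 11, 50]
weird_days = find_anomalies(daily_orders)
print(weird_days)
```

The other four questions follow the same pattern: the type of answer wanted (a category, a quantity, a grouping, or a next action) determines which family of algorithms applies.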