Chapter 1: Data Mining Flashcards

1
Q

What is data mining?

A

Extracting useful information from large data sets

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.

[Data mining] is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques.

2
Q

Where is data mining used?

A

Numerous applications:
Military: Accuracy of weapons
Intelligence: Determine which communication is of interest
Medical: Determine likelihood of cancer relapse

In Business:
Prospective Customers: Which are most likely to respond

Customer Risk: Which are most likely to commit fraud

Finance: which loans are likely to default (e.g., logistic regression to assign a “probability of default” value)
Subscriptions: which customers are most likely to abandon (churn)

CRM (customer relationship management)
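
The "probability of default" idea can be sketched as follows. This is a minimal illustration, not a fitted model: the two predictors and all three coefficients are hypothetical values chosen for the example, whereas a real logistic regression would estimate them from historical loan data.

```python
import math

# Hypothetical coefficients, chosen only for illustration; a real model
# would estimate them from historical loan data.
INTERCEPT = -3.0
COEF_DEBT_TO_INCOME = 4.0   # higher debt-to-income raises risk
COEF_YEARS_EMPLOYED = -0.2  # longer employment lowers risk


def probability_of_default(debt_to_income: float, years_employed: float) -> float:
    """Map a linear score to a probability with the logistic function."""
    score = (INTERCEPT
             + COEF_DEBT_TO_INCOME * debt_to_income
             + COEF_YEARS_EMPLOYED * years_employed)
    return 1.0 / (1.0 + math.exp(-score))


low_risk = probability_of_default(debt_to_income=0.1, years_employed=10)
high_risk = probability_of_default(debt_to_income=0.9, years_employed=1)
print(f"low-risk applicant:  {low_risk:.3f}")
print(f"high-risk applicant: {high_risk:.3f}")
```

The logistic function keeps every output strictly between 0 and 1, which is what lets the score be read as a probability and ranked across applicants.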

3
Q

What is the virtuous cycle of data mining?

A

The Virtuous Cycle: Identify > Transform Data > Act > Measure Results (then repeat)
Identify business opportunities
Mine data to transform it into actionable information
Act on the information
Measure the results

4
Q

What are the origins of data mining?

A

Data mining: at the confluence of statistics and machine learning (a branch of artificial intelligence)
Statistics: e.g., linear regression, logistic regression, discriminant analysis, principal component analysis

5
Q

Key differences between classical statistics and data mining

A

The emphasis that statistics places on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining.
Data mining deals with large datasets in an open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require.
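
To make "might have happened by chance" concrete, here is a minimal sketch of one inferential tool, a permutation test. The technique is chosen here for illustration and is not named in the card; the function name, the default of 2000 shuffles, and the sample numbers are all assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a mean difference at least as large as the
    observed one appears when group labels are shuffled at random."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(shuffled_a) / n_a - sum(shuffled_b) / len(shuffled_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up sample data: two clearly separated groups.
p = permutation_p_value([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
print(f"estimated p-value: {p:.4f}")
```

A small p-value says the observed gap rarely arises from random relabeling. The test presumes one question fixed in advance, which is exactly the strict limit that open-ended mining lacks.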

6
Q

Overfitting risk

A

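Statistics emphasizes inference on limited data; data mining works with large data sets in an open-ended way, with no strict limits on the question being asked. Because many patterns are searched without the safeguards of inference, data mining is vulnerable to overfitting: a model fits the available sample so closely that it captures random peculiarities of that sample rather than structure that carries over to new data.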
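
A minimal sketch of what overfitting looks like in practice. The 1-nearest-neighbour "memorizer" and the relationship y = 2x + noise are assumptions picked for the illustration, not something the card specifies.

```python
import random

rng = random.Random(0)


def make_data(n):
    """Noisy samples of the hypothetical relationship y = 2x + noise."""
    xs = [rng.random() for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]


def memorizer(train):
    """1-nearest-neighbour: reproduces every training point exactly."""
    def predict(x):
        return min(train, key=lambda point: abs(point[0] - x))[1]
    return predict


def mean_squared_error(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


train, holdout = make_data(50), make_data(200)
model = memorizer(train)
train_mse = mean_squared_error(model, train)
holdout_mse = mean_squared_error(model, holdout)
print(f"training MSE: {train_mse:.3f}")
print(f"holdout MSE:  {holdout_mse:.3f}")
```

The model scores perfectly on the data it memorized, noise included, and does worse on data it has not seen. Comparing training error against holdout error is the usual way to spot this gap.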

7
Q

What are the V’s of big data?

A

Value: the usefulness of the data (from Big Data Fundamentals, Erl)
Volume: the amount of data
Velocity: the speed at which it is generated and changed
Variety: the different types of data being generated
Veracity: the trustworthiness of the data, which is often not subject to the controls or quality checks that apply to data collected in a study

8
Q

Better data makes better data science.

Is your data:

A

Relevant, connected, accurate, and enough to work with

9
Q

Data Science

A

The process of using names and numbers to predict an answer to a question (names = categories, labels)

10
Q

The 5 questions data science can answer

A
Is this A or B? - classification
Is this weird? - anomaly detection
How much or how many? - regression
How is this organized? - clustering
What should I do next? - reinforcement learning
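
As one worked case from the list above, "Is this weird?" can be sketched with a simple z-score rule. The rule, the threshold of 2.0, the function name, and the sample numbers are all assumptions for illustration; real anomaly detection often uses more robust methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


daily_orders = [10, 11, 9, 10, 12, 10, 11, 50]  # made-up sample data
print(find_anomalies(daily_orders))
```

The function flags values that sit more than `threshold` standard deviations from the mean. A single extreme value also inflates the mean and the standard deviation, so this rule can miss anomalies in small samples.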