Chapter 1: Data Mining Flashcards
what is data mining
Extracting useful information from large data sets
Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
[Data mining] is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques.
where is data mining used
Numerous applications:
Military: Accuracy of weapons
Intelligence: Determine which communication is of interest
Medical: Determine likelihood of cancer relapse
In Business:
Prospective Customers: Which are most likely to respond
Customer Risk: Which are most likely to commit fraud
Finance: Which loans are likely to default
Logistic regression to assign a “probability of default” value
Subscriptions: Which customers are most likely to abandon
CRM
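A minimal sketch of the "probability of default" idea from the Finance card, assuming a single hypothetical predictor (debt-to-income ratio) and made-up toy data. The function names, learning rate, and epoch count are illustrative choices, not part of the original card; in practice a statistics or machine learning library would do the fitting.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=1.0, epochs=5000):
    """Fit p(default) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # predicted probability minus actual label
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: debt-to-income ratio and whether the loan defaulted (1 = yes)
debt_to_income = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(debt_to_income, defaulted)

def probability_of_default(ratio):
    """Score a loan: higher values mean default is estimated to be more likely."""
    return sigmoid(w * ratio + b)

for ratio in (0.1, 0.45, 0.8):
    print(f"ratio={ratio:.2f}  p(default)={probability_of_default(ratio):.2f}")
```

The output is a probability for each loan rather than a yes/no answer, so the lender can choose its own cutoff for what counts as too risky.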
virtuous cycle
Transform Data > Act > Measure Results > Identify
The Virtuous Cycle
Identify Business Opportunities
Mining data to transform the data into actionable information
Acting on the information
Measuring the results
origins of data mining
Data Mining: at the confluence of statistics and machine learning (aka Artificial Intelligence)
Statistics: ex. Linear regression, logistic regression, discriminant analysis, principal component analysis
Key differences between classical stats, data mining:
The statistical emphasis on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining.
Data mining deals with large datasets in an open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require.
Overfitting risk
Statistics - emphasis on inference (limited data); Data mining – large data sets, open-ended (no strict limits on question)
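The overfitting risk noted above can be illustrated with a small sketch. It assumes hypothetical noisy data and uses a 1-nearest-neighbor rule as a stand-in for any model flexible enough to memorize its training set; the names and the 30% noise rate are illustrative only. Comparing accuracy on the training data with accuracy on held-out data is one common way to spot the problem.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_point():
    """Hypothetical data: the label depends on x, but 30% of labels are flipped."""
    x = rng.random()
    label = 1 if x > 0.5 else 0
    if rng.random() < 0.3:
        label = 1 - label  # noise that no model should try to learn
    return x, label

train = [make_point() for _ in range(200)]
holdout = [make_point() for _ in range(200)]

def predict_1nn(x):
    """Predict the label of the closest training point (memorizes the training set)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(data):
    """Fraction of points in `data` whose label is predicted correctly."""
    return sum(predict_1nn(x) == y for x, y in data) / len(data)

train_accuracy = accuracy(train)
holdout_accuracy = accuracy(holdout)
print("training accuracy:", train_accuracy)
print("holdout accuracy: ", holdout_accuracy)
```

The model looks perfect on the data it has already seen because it has memorized the noise, and it does worse on data it has not seen. That gap is the signature of overfitting.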
big data V’s
Value: The usefulness of the data (from Big Data Fundamentals, Erl)
Volume: The Amount of Data
Velocity: The speed at which it is generated and changed
Variety: The different types of data being generated
Veracity: The reliability of the data, which is not subject to the controls or quality checks that apply to data collected in a study
better data make better data science
is your data:
relevant, connected, accurate, enough to work with
Data Science
The process of using names and numbers to predict an answer to a question (Names = Categories, Labels)
The 5 questions data science can answer
Is this A or B? - classification
Is this weird? - finding anomalies
How much or how many? - regression
How is this organized? - clustering
What should I do next? - reinforcement learning
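As one illustration of the "Is this weird?" question, here is a minimal anomaly-detection sketch. It assumes hypothetical transaction amounts and uses a simple z-score rule with a cutoff of 2 standard deviations; both the data and the cutoff are assumptions chosen for the example, and real anomaly detection often uses more robust methods.

```python
import statistics

# Hypothetical daily transaction amounts; one value is clearly out of line
amounts = [52, 48, 50, 47, 53, 49, 51, 250]

mean = statistics.mean(amounts)
spread = statistics.pstdev(amounts)  # population standard deviation

def z_score(value):
    """How many standard deviations `value` lies from the mean."""
    return (value - mean) / spread

THRESHOLD = 2.0  # assumed cutoff; tune for the data at hand

# Flag every amount whose distance from the mean exceeds the cutoff
anomalies = [a for a in amounts if abs(z_score(a)) > THRESHOLD]
print("anomalies:", anomalies)
```

The other four questions follow the same pattern: pick the question first, and the family of algorithms follows from it.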