Lecture 4 - Data mining introduction Flashcards
1
Q
Business Intelligence definitions
A
- Creative process that provides results useful for decision making
- Requires several different skills
2
Q
Data mining
A
- Creative process that provides results useful for decision making
- Requires several different skills, e.g. statistics, ML, programming, etc.
- Ability to cope with huge amounts of data → used frequently for Big Data
3
Q
Big Data & Challenges
A
Current data is big (by reference to the past)
Challenges of Big Data are typically characterised by:
- Volume: amount of data
- Velocity: flow rate, i.e. speed at which data is being generated and changed
- Variety: different types of data being generated, e.g. currency, dates, numbers, text, etc.
- Veracity: data is generated by organic, distributed processes, so its quality varies (e.g. missing or unclean values)
- Value: data has no value for the company unless you can turn it into something useful
4
Q
Methods
A
- Classification and Prediction
  - Regression
  - K-nearest neighbors
  - Decision trees
- Association
  - Association rules
- Clustering
  - k-means
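As an illustration of one method from the list, here is a minimal k-means sketch on 1-D data. The function name `kmeans_1d`, the sample points, and the starting centroids are assumptions made up for this example, not part of the lecture.

```python
# Minimal k-means sketch on 1-D data (illustrative only; data and k are made up).
def kmeans_1d(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 10.0])
print(centroids)  # one centroid per group of points
```

The result depends on the starting centroids, which is one reason real projects usually rely on a tested library implementation rather than a hand-written loop like this one.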
5
Q
Existence of various methods
A
Each method has advantages and disadvantages, depending on several factors:
- Size of the dataset
- Types of patterns that exist in the data
- Whether the data meet the underlying assumptions of the method
- How noisy the data is
- The goal of the analysis
6
Q
Data
A
- A collection of facts, usually obtained as the result of experiences, web page visits, observations, or experiments, or extracted from documents
- Lowest level of abstraction (from which information and knowledge are derived)
- May consist of numbers, words, images, etc.
7
Q
Terminology
A
8
Q
Data & Types of Variables
A
- Data may consist of numbers, words, images, etc
- Classification:
9
Q
NOIR
A
- Nominal data: values are distinct symbols
- Ordinal: order on values, but no distance between values defined
- Interval: quantities are not only ordered but measured in fixed and equal units
- Ratio: quantities for which the measurement scheme defines a zero point
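A short sketch of which summaries make sense at each NOIR level. All variable names and sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean, mode

# Hypothetical sample values, one list per measurement scale.
eye_colour = ["brown", "blue", "brown", "green"]          # nominal
satisfaction = ["low", "high", "medium", "high", "low"]   # ordinal
temps_celsius = [10.0, 20.0, 30.0]                        # interval
weights_kg = [40.0, 80.0]                                 # ratio

# Nominal: only equality and counting make sense, so the mode is the usual summary.
most_common_colour = mode(eye_colour)

# Ordinal: values can be ranked, so a median is meaningful; distances are not.
rank = {"low": 0, "medium": 1, "high": 2}
ordered = sorted(satisfaction, key=rank.get)
median_satisfaction = ordered[len(ordered) // 2]

# Interval: differences are meaningful (fixed, equal units), but ratios are not,
# because zero on the Celsius scale is an arbitrary point.
temp_difference = temps_celsius[1] - temps_celsius[0]
mean_temp = mean(temps_celsius)

# Ratio: the scale has a true zero, so "twice as heavy" is a valid statement.
weight_ratio = weights_kg[1] / weights_kg[0]

print(most_common_colour, median_satisfaction, temp_difference, mean_temp, weight_ratio)
```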
10
Q
Steps in a typical data mining effort
A
- Develop an understanding of the purpose of the data mining project
- Obtain the dataset to be used in the analysis
- Explore, clean, and preprocess the data
- Reduce the data dimension, if necessary
- Determine the data mining task
- Partition the data (for supervised tasks)
- Choose the data mining technique(s)
- Use algorithms to perform the task
- Interpret the results of the algorithm
- Deploy the model
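The partition, technique, algorithm, and interpretation steps above can be sketched on a toy supervised task. The records, the 70/30 split, and the helper `knn_predict` are assumptions for this example; a real project would normally use a dedicated library.

```python
import random

# Made-up labelled records: (feature, label). Purely illustrative.
records = [(x, "low") for x in range(0, 10)] + [(x, "high") for x in range(20, 30)]

# Partition the data: shuffle with a fixed seed, then hold out 30% for validation.
rng = random.Random(0)
shuffled = records[:]
rng.shuffle(shuffled)
cut = len(shuffled) * 7 // 10
train, valid = shuffled[:cut], shuffled[cut:]


def knn_predict(x, training, k=3):
    """Classify x by majority vote among its k nearest training records."""
    nearest = sorted(training, key=lambda rec: abs(rec[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)


# Use the algorithm on the held-out partition, then interpret the result.
correct = sum(knn_predict(x, train) == label for x, label in valid)
accuracy = correct / len(valid)
print(f"validation accuracy: {accuracy:.2f}")
```

The two classes here are deliberately far apart, so the validation accuracy says nothing about how k-nearest neighbors behaves on noisy data.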
11
Q
SEMMA methodology
A
- Includes the 10 data mining steps
- Developed by the software company SAS
- Steps: Sample, Explore, Modify, Model, Assess
12
Q
CRISP-DM
A
- Similar methodology to SEMMA, used by IBM SPSS Modeler
- CRoss Industry Standard Process for Data Mining
- Business understanding
- Data understanding
- Data preparation
- Model Building
- Testing and Evaluation
- Deployment
13
Q
CRISP-DM Steps
A