Introduction Flashcards
What is Data Mining
- Large quantities of data
- data contains interesting patterns
Data Mining helps to
- discover patterns in data
- use the patterns for decision making
Large data sources set the foundation for data mining
- Law enforcement agencies -> terrorist detection
- Facebook -> interest and behavior of its users
- Sloan Digital Sky Survey -> predict type of sky object
What is the problem with the large amount of data available
- A lot of data is collected
- Only a small amount can be looked at by humans
- We are interested in the patterns, and not the data itself
Definitions of Data Mining
Exploration & analysis of large quantities of data in order to discover meaningful patterns
Data Mining methods
- Detect interesting patterns
- Support human decision making with patterns
- Predict the outcome of a future observation based on patterns
Origins of data mining - relation of data mining to other areas
Combination of the following areas:
- Statistics
- Machine Learning / AI
- Database Systems
Origins of data mining - motivating challenges
- Large amount of data
- high dimensionality of data
- heterogeneous and complex data
- exploratory analysis instead of the traditional hypothesize-and-test paradigm
What are the two data mining tasks and what is the ML terminology
Descriptive Tasks (Unsupervised) Goal: Find patterns in the data
Predictive Tasks (Supervised) Goal: Predict unknown values of a variable
Data Mining Tasks
- Cluster Analysis (Descriptive)
- Classification (Predictive)
- Regression (Predictive)
- Association Analysis (Descriptive)
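Example: descriptive vs. predictive task
A minimal sketch of the difference, using only the Python standard library. The toy values, the labels "small" / "large", and the helper names kmeans_1d and predict_1nn are assumptions made for illustration. The descriptive part runs a plain k-means on unlabeled 1-D values; the predictive part uses a 1-nearest-neighbor classifier on labeled values.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Descriptive task: group unlabeled 1-D values around k centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


def predict_1nn(training, x):
    """Predictive task: return the label of the closest labeled example."""
    _, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label


# Unsupervised: no labels, the method finds the groups on its own.
values = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, clusters = kmeans_1d(values, centroids=[1.0, 10.0])
print(centroids, clusters)

# Supervised: labeled examples are used to predict the label of a new value.
training = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
print(predict_1nn(training, 7.0))
```

The clustering step never sees a label, while the classifier cannot work without them; that is the practical meaning of unsupervised vs. supervised.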
The most used methods in practice
- Regression
- Decision Trees
- Clustering
- Random Forests
The steps of the data mining process
1) Data selection
2) Data preprocessing
3) Data transformation
4) Data mining
5) Interpretation / Evaluation of patterns
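Example: the five steps as a toy pipeline
A rough sketch of how the steps could chain together, not a real project setup. The records, the column names, the threshold of 50.0, and the function names are all hypothetical; the "mining" step is just a count per city to keep the example short.

```python
from collections import Counter

# Hypothetical raw data; None marks a missing value.
raw = [
    {"id": 1, "city": "Berlin", "amount": 120.0, "note": "x"},
    {"id": 2, "city": "Hamburg", "amount": None, "note": "y"},
    {"id": 3, "city": "Berlin", "amount": 80.0, "note": "z"},
    {"id": 4, "city": "Munich", "amount": 15.0, "note": ""},
]


def select(rows, columns):
    # 1) Data selection: keep only the attributes that are useful for the task.
    return [{c: r[c] for c in columns} for r in rows]


def preprocess(rows):
    # 2) Data preprocessing: drop records that contain missing values.
    return [r for r in rows if all(v is not None for v in r.values())]


def transform(rows, threshold):
    # 3) Data transformation: binarize the numeric amount.
    return [
        {"city": r["city"], "high_spender": r["amount"] >= threshold}
        for r in rows
    ]


def mine(rows):
    # 4) Data mining: a toy "pattern", the number of high spenders per city.
    return Counter(r["city"] for r in rows if r["high_spender"])


selected = select(raw, ["city", "amount"])
patterns = mine(transform(preprocess(selected), threshold=50.0))

# 5) Interpretation / evaluation: inspect the result and decide whether it is useful.
print(patterns.most_common())
```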
Questions that come up for data selection
- What data is useful for the task?
- What data is available?
- What is the quality of the data?
What does exploration / profiling mean
- Develop initial understanding of the data
- Calculate basic summarization statistics
- Visualize the data
- Identify data problems (outliers, missing values, duplicate records)
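Example: basic data profiling
A minimal sketch of the profiling steps using only the Python standard library. The records are invented for illustration, and the 1.5 * IQR rule is just one common heuristic for flagging outliers, assumed here rather than taken from the cards.

```python
import statistics

# Hypothetical toy records: (customer_id, age); None marks a missing value.
records = [
    (1, 34), (2, 29), (3, None), (4, 31),
    (2, 29),    # exact duplicate of an earlier record
    (5, 120),   # implausible age, a likely outlier
    (6, 27), (7, 38), (8, 33), (9, 30),
]

# Missing values: count records without an age.
missing_age = sum(1 for _, age in records if age is None)

# Duplicate records: count rows that already appeared earlier.
seen = set()
duplicates = 0
for row in records:
    if row in seen:
        duplicates += 1
    seen.add(row)

# Basic summary statistics over the ages that are present.
ages = [age for _, age in records if age is not None]
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
}

# Outliers: values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1
outliers = [a for a in ages if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]

print(summary)
print("missing:", missing_age, "duplicates:", duplicates, "outliers:", outliers)
```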
What does preprocessing and transformation mean
Transform data into a representation that is suitable for the chosen data mining method
Data integration and preparation take 70 - 80 % of the time in a data mining project
Important transformation aspects
- scale of the attributes (nominal, ordinal, numeric)
- number of dimensions (represent relevant information with fewer attributes)
- amount of data (determines hardware requirements)
Transformation methods
- Discretization and binarization
- feature subset selection / dimensionality reduction
- attribute transformation
- aggregation, sampling
- integrate data from multiple sources
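Example: three simple transformations
A short sketch of discretization, binarization (one-hot encoding), and attribute transformation (min-max scaling) in plain Python. The bin boundaries, the labels, and the function names are assumptions chosen for illustration.

```python
def discretize(value, boundaries, labels):
    """Discretization: map a numeric value to a bin label.

    `labels` needs one more entry than `boundaries`; a value falls into the
    first bin whose upper boundary is greater than the value.
    """
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]


def one_hot(value, categories):
    """Binarization: turn one categorical value into a list of 0/1 flags."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(values):
    """Attribute transformation: rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


labels = ["minor", "adult", "senior"]
age_groups = [discretize(age, [18, 65], labels) for age in [15, 34, 70]]
print(age_groups)
print(one_hot("adult", labels))
print(min_max_scale([10, 20, 30]))
```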
Concept of Data mining
Input: Preprocessed Data
Output: Model / Patterns
- Apply data mining method
- Evaluate resulting model
- Iterate (change the parameter settings / use other methods, improve preprocessing, increase quality of training data)
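Example: apply, evaluate, iterate
A minimal sketch of the loop above, assuming a k-nearest-neighbor classifier on 1-D toy data and a fixed train / test split. The data, the labels, and the candidate values for k are invented for illustration; one training label is deliberately noisy so that the parameter setting makes a difference.

```python
from collections import Counter

# Hypothetical labeled data; (3.5, "high") is a deliberately noisy training example.
train = [
    (1.0, "low"), (2.0, "low"), (3.0, "low"), (3.5, "high"),
    (7.0, "high"), (8.0, "high"), (9.0, "high"),
]
test = [(3.4, "low"), (2.5, "low"), (7.5, "high"), (8.5, "high")]


def knn_predict(train, x, k):
    """Apply the method: majority vote among the k nearest training examples."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Evaluate the model: share of test examples that are predicted correctly."""
    correct = sum(1 for x, label in test if knn_predict(train, x, k) == label)
    return correct / len(test)


# Iterate: try several parameter settings and compare the results.
results = {k: accuracy(train, test, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

With k = 1 the noisy example misleads the prediction for 3.4; larger k values outvote it. In a real project the parameter would be chosen on a separate validation set, so that the test set stays untouched for the final evaluation.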
Description of the deployment step
- Use the model in the business context
- Keep iterating to improve the model
How do data scientists spend their days?
- 60 % cleaning data
- 19 % collecting data sets
- 9 % mining data for patterns
Common data mining software
- Python
- RapidMiner
- scikit-learn
- SQL
- Anaconda
Advantages of RapidMiner
- Visual modeling of data mining pipelines
- Faster learning curve for applying data mining methods
Different attribute types
Categorical (qualitative)
- Nominal: Values of the attribute can only be distinguished from one another (equal or unequal)
- Ordinal: Values of the attribute can be ordered (>,