Introduction Flashcards
What is Data Mining
- Large quantities of data
- data contains interesting patterns
Data Mining helps to
- discover patterns in data
- use the patterns for decision making
Large data sources set the foundation for data mining
- Law enforcement agencies -> terrorist detection
- Facebook -> interest and behavior of its users
- Sloan Digital Sky Survey -> predict type of sky object
What is the problem with the large amount of data available
- A lot of data is collected
- Only a small amount can be looked at by humans
- We are interested in the patterns, and not the data itself
Definitions of Data Mining
Exploration & analysis of large quantities of data in order to discover meaningful patterns
Data Mining methods
- Detect interesting patterns
- Support human decision making with patterns
- Predict the outcome of a future observation based on patterns
Origins of data mining - relation of data mining to other areas
Combination of those areas:
- Statistics
- Machine Learning / AI
- Database Systems
Origins of data mining - motivating challenges
- Large amount of data
- High dimensionality of data
- Heterogeneous and complex data
- Exploratory analysis instead of the traditional hypothesize-and-test paradigm
What are the two data mining tasks and what is the ML terminology
- Descriptive tasks (unsupervised learning). Goal: find patterns in the data
- Predictive tasks (supervised learning). Goal: predict unknown values of a variable
Data Mining Tasks
- Cluster Analysis (Descriptive)
- Classification (Predictive)
- Regression (Predictive)
- Association Analysis (Descriptive)
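Example (sketch): the difference between a predictive and a descriptive task can be illustrated with a few lines of plain Python. The toy data, the function names predict_1nn and two_means_1d, and the choice of 1-nearest-neighbour and a two-cluster k-means are assumptions made for illustration only; they are not taken from the cards.

```python
from statistics import mean

# Predictive (supervised): the training data carries known labels.
# Hypothetical toy data: (hours_studied, exam_result)
train = [(1.0, "fail"), (2.0, "fail"), (6.0, "pass"), (7.0, "pass")]

def predict_1nn(x, labelled):
    """Classify x with the label of its nearest training point."""
    nearest = min(labelled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Descriptive (unsupervised): no labels, we only look for structure.
def two_means_1d(values, iterations=10):
    """Tiny k-means with k=2 on one-dimensional data."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the closer of the two centers.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups

label = predict_1nn(5.5, train)                  # classification of a new observation
clusters = two_means_1d([1.0, 2.0, 6.0, 7.0])    # grouping without any labels
```

The classifier needs the labels in train to make a prediction, whereas the clustering function only sees the raw values.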
The most used methods in practice
- Regression
- Decision Trees
- Clustering
- Random Forests
The steps of the data mining process
1) Data selection
2) Data preprocessing
3) Data transformation
4) Data mining
5) Interpretation / Evaluation of patterns
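Example (sketch): the five steps above can be read as a chain of functions, each consuming the output of the previous one. The records, attribute names, and the very simple "pattern" (mean spend of younger vs. older customers) are hypothetical; a real project would use a proper mining method in step 4.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend, favourite_colour)
raw = [
    (1, 25, 120.0, "red"),
    (2, 40, None, "blue"),    # missing value
    (3, 31, 80.0, "green"),
    (3, 31, 80.0, "green"),   # duplicate record
    (4, 58, 200.0, "red"),
]

def select(records):
    """1) Data selection: keep only the attributes assumed relevant to the task."""
    return [(cid, age, spend) for cid, age, spend, _ in records]

def preprocess(rows):
    """2) Data preprocessing: drop rows with missing values and exact duplicates."""
    seen, clean = set(), []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        clean.append(row)
    return clean

def transform(rows):
    """3) Data transformation: rescale age to the range [0, 1]."""
    ages = [age for _, age, _ in rows]
    lo, hi = min(ages), max(ages)
    span = (hi - lo) or 1  # avoid division by zero if all ages are equal
    return [(cid, (age - lo) / span, spend) for cid, age, spend in rows]

def mine(rows):
    """4) Data mining: mean spend in the lower vs. upper half of the age range."""
    younger = [spend for _, age, spend in rows if age < 0.5]
    older = [spend for _, age, spend in rows if age >= 0.5]
    return {
        "younger": mean(younger) if younger else None,
        "older": mean(older) if older else None,
    }

def evaluate(pattern):
    """5) Interpretation / evaluation: turn the numbers into a statement."""
    if None in pattern.values():
        return "not enough data"
    if pattern["older"] > pattern["younger"]:
        return "older customers spend more"
    return "no such pattern"

clean = preprocess(select(raw))
pattern = mine(transform(clean))
finding = evaluate(pattern)
```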
Questions that come up for data selection
- What data is useful for the task?
- What data is available?
- What is the quality of the data?
What does exploration / profiling mean
- Develop initial understanding of the data
- Calculate basic summarization statistics
- Visualize the data
- Identify data problems (outliers, missing values, duplicate records)
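Example (sketch): a first profiling pass can be done with the Python standard library alone. The values and records below are made up, and the 1.5 * IQR rule used to flag outliers is only one common heuristic; visualization is left out of this sketch.

```python
from collections import Counter
from statistics import mean, median, quantiles

# Hypothetical column of measurements; None marks a missing value.
values = [12.0, 15.0, 14.0, None, 13.0, 15.0, 95.0]

# Basic summarization statistics, computed on the values that are present.
present = [v for v in values if v is not None]
summary = {
    "count": len(present),
    "missing": len(values) - len(present),
    "min": min(present),
    "max": max(present),
    "mean": mean(present),
    "median": median(present),
}

# Outliers: values further than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(present, n=4)
iqr = q3 - q1
outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# Duplicate records: rows that occur more than once.
records = [("a", 1), ("b", 2), ("a", 1)]
duplicates = [row for row, n in Counter(records).items() if n > 1]
```

The mean is pulled far above the median by the single large value, which is itself a hint that the column contains an outlier.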
What does preprocessing and transformation mean
Transform data into a representation that is suitable for the chosen data mining method
Data integration and preparation typically take 70-80% of the time in a data mining project
Important transformation aspects
- Scale of attributes (nominal, ordinal, numeric)
- Number of dimensions (represent the relevant information with fewer attributes)
- amount of data (determines hardware requirements)
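Example (sketch): how an attribute is transformed depends on its scale. The helpers below (min_max_scale, one_hot, ordinal_encode) are hypothetical names for three common choices: rescaling numeric values, one-hot encoding nominal values, and mapping ordinal values to their rank. Other encodings may suit a given mining method better.

```python
def min_max_scale(values):
    """Numeric attribute: rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Nominal attribute: one 0/1 column per category (categories sorted by name)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal_encode(values, order):
    """Ordinal attribute: replace each level by its position in the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

ages = min_max_scale([20, 30, 40])
colours = one_hot(["red", "blue", "red"])
sizes = ordinal_encode(["small", "large", "medium"],
                       order=["small", "medium", "large"])
```

One-hot encoding increases the number of dimensions by one column per category, which interacts with the dimensionality aspect listed above.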