Introduction Flashcards

1
Q

What is Data Mining?

A
  • Large quantities of data
  • data contains interesting patterns

Data Mining helps to

  • discover patterns in data
  • use the patterns for decision making
2
Q

Large data sources set the foundation for data mining

A
  • Law enforcement agencies -> terrorist detection
  • Facebook -> interest and behavior of its users
  • Sloan Digital Sky Survey -> predict type of sky object
3
Q

What is the problem with the large amount of data available?

A
  • A lot of data is collected
  • Only a small amount can be looked at by humans
  • We are interested in the patterns, and not the data itself
4
Q

Definitions of Data Mining

A

Exploration & analysis of large quantities of data in order to discover meaningful patterns

5
Q

Data Mining methods

A
  1. Detect interesting patterns
  2. Support human decision making with patterns
  3. Predict the outcome of a future observation based on patterns
6
Q

Origins of data mining - relation of data mining to other areas

A

Combination of those areas:

  • Statistics
  • Machine Learning / AI
  • Database Systems
7
Q

Origins of data mining - motivating challenges

A
  • Large amount of data
  • High dimensionality of data
  • Heterogeneous and complex data
  • Exploratory analysis rather than the hypothesize-and-test paradigm
8
Q

What are the two types of data mining tasks, and what is the corresponding ML terminology?

A
Descriptive Tasks (Unsupervised)
Goal: Find patterns in the data
Predictive Tasks (Supervised)
Goal: Predict unknown values of a variable
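
The difference between the two task types can be sketched with a toy example. This is only an illustration under assumptions: the data, the 1-D two-means grouping, and the nearest-neighbor rule are not from the cards, and all names are hypothetical.

```python
# Hypothetical toy data: one numeric attribute, no labels.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]

def two_means(xs, iterations=10):
    """Descriptive (unsupervised): split values into two groups (1-D k-means, k=2).

    Assumes the values are not all identical.
    """
    c0, c1 = min(xs), max(xs)  # simple initial centroids
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return g0, g1

# Hypothetical labeled data: the target variable ("low"/"high") is known.
labeled = [(1.0, "low"), (1.2, "low"), (5.0, "high"), (5.3, "high")]

def predict(x, training):
    """Predictive (supervised): copy the label of the nearest training example."""
    nearest = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

groups = two_means(values)          # patterns found without any labels
label = predict(4.5, labeled)       # unknown value predicted from labeled data
```

The descriptive function only sees the attribute values, while the predictive function needs examples for which the target is already known.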
9
Q

Data Mining Tasks

A
  • Cluster Analysis (Descriptive)
  • Classification (Predictive)
  • Regression (Predictive)
  • Association Analysis (Descriptive)
10
Q

The most used methods in practice

A
  • Regression
  • Decision Trees
  • Clustering
  • Random Forests
11
Q

The steps of the data mining process

A

1) Data selection
2) Data preprocessing
3) Data transformation
4) Data mining
5) Interpretation / Evaluation of patterns

12
Q

Questions that come up for data selection

A
  • What data is useful for the task?
  • What data is available?
  • How good is the data quality?
13
Q

What does exploration / profiling mean?

A
  • Develop initial understanding of the data
  • Calculate basic summarization statistics
  • Visualize the data
  • Identify data problems (outliers, missing values, duplicate records)
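
A minimal profiling sketch, using only the Python standard library, might look like the following. The records, the column meaning, and the plausible age range of 0 to 120 are assumptions made for illustration; visualization is left out.

```python
from statistics import mean, median

# Hypothetical records: (customer_id, age); None marks a missing value.
records = [(1, 34), (2, 29), (3, None), (4, 41), (2, 29), (5, 230)]

# Basic summarization statistics over the values that are present.
ages = [age for _, age in records if age is not None]
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": mean(ages),
    "median": median(ages),
}

# Data problems: missing values, duplicate records, outliers.
missing = sum(1 for _, age in records if age is None)
duplicates = len(records) - len(set(records))
# The bounds 0..120 are an assumed plausibility check, not a general rule.
outliers = [age for age in ages if not 0 <= age <= 120]

print(summary, missing, duplicates, outliers)
```

Here the large gap between mean and median already hints at the outlier before it is flagged explicitly.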
14
Q

What do preprocessing and transformation mean?

A

Transform data into a representation that is suitable for the chosen data mining method

Data integration and preparation take 70-80% of the time of a data mining project

15
Q

Important transformation aspects

A
  • scale attributes (nominal, ordinal, numeric)
  • number of dimensions (represent relevant information with fewer attributes)
  • amount of data (determines hardware requirements)
16
Q

Transformation methods

A
  • Discretization and binarization
  • feature subset selection / dimensionality reduction
  • attribute transformation
  • aggregation, sampling
  • integrate data from multiple sources
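
Two of the methods above, discretization and binarization, can be sketched as follows. The interval boundaries, labels, and category lists are hypothetical, and binarization is shown here as one-hot encoding, which is one common variant.

```python
def discretize(value, boundaries, labels):
    """Discretization: map a numeric value to an interval label.

    `boundaries` are ascending upper bounds; `labels` has one more entry
    than `boundaries` (the last label covers everything above).
    """
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]

def binarize(value, categories):
    """Binarization: turn a nominal value into 0/1 attributes (one-hot)."""
    return [1 if value == category else 0 for category in categories]

# Hypothetical attribute values.
ages = [15, 42, 70]
age_groups = [discretize(a, [18, 65], ["young", "adult", "senior"]) for a in ages]
color_bits = binarize("green", ["red", "green", "blue"])
```

After these steps, a method that expects categorical input can use `age_groups`, and a method that expects numeric input can use `color_bits`.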
17
Q

Concept of Data mining

A

Input: Preprocessed Data
Output: Model / Patterns

  1. Apply data mining method
  2. Evaluate resulting model
  3. Iterate (change the parameter settings / use other methods, improve preprocessing, increase quality of training data)
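
The apply / evaluate / iterate loop can be sketched with a small example. The cards do not name a specific method here; k-nearest-neighbors, the toy data, and the held-out split below are assumptions chosen only to show iteration over a parameter setting.

```python
from collections import Counter

# Hypothetical preprocessed data: (attribute value, class label).
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.3, "b"),
         (6.0, "b"), (6.5, "b"), (7.0, "b")]
held_out = [(1.2, "a"), (2.2, "a"), (6.2, "b"), (6.8, "b")]

def knn_predict(x, training, k):
    """Step 1, apply the method: majority vote among the k nearest examples."""
    neighbors = sorted(training, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(training, evaluation, k):
    """Step 2, evaluate the model: share of correct predictions on unseen data."""
    hits = sum(1 for x, label in evaluation if knn_predict(x, training, k) == label)
    return hits / len(evaluation)

# Step 3, iterate: try several parameter settings and keep the best one.
results = {k: accuracy(train, held_out, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, best_k)
```

In this toy data the noisy training point at 2.3 misleads the model for k = 1, so a larger k scores better on the held-out examples.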
18
Q

Description of the deployment step

A
  • Use the model in the business context
  • Keep iterating to improve the model

19
Q

How do data scientists spend their days?

A
  • 60% cleaning data
  • 19% collecting data sets
  • 9% mining data for patterns
20
Q

Common data mining software

A
  • Python
  • RapidMiner
  • scikit-learn
  • SQL
  • Anaconda
21
Q

Advantages of RapidMiner

A
  • Visual modeling of data mining pipelines
  • Faster learning curve for applying data mining methods

22
Q

Different attribute types

A

Categorical (qualitative)

  • Nominal: Values of the attribute can only be distinguished from one another (equal or unequal)
  • Ordinal: Values of the attribute can be ordered (>,