Lecture 4 - Data mining introduction Flashcards

1
Q

Business Intelligence definitions

A
  • Creative process that provides results useful for decision making
  • Requires several different skills
2
Q

Data mining

A
  • Creative process that provides results useful for decision making
  • Requires several different skills, e.g., statistics, machine learning (ML), programming, etc.
  • Ability to cope with huge amounts of data → frequently used for Big Data
3
Q

Big Data & Challenges

A

Current data is big (by reference to the past)
Challenges of Big Data are typically characterised by:

  • Volume: amount of data
  • Velocity: flow rate, i.e., the speed at which data is generated and changed
  • Variety: different types of data being generated, e.g., currency, dates, numbers, text, etc.
  • Veracity: data is generated by organic, distributed processes, so its quality varies (e.g., missing or unclean values)
  • Value: data has no value for the company unless you can turn it into something useful
4
Q

Methods

A
  • Classification and Prediction
    • Regression
    • K-nearest neighbors
    • Decision trees
  • Association
    • Association rules
  • Clustering
    • k-means
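
As an illustration of one classification method from the list above, here is a minimal k-nearest-neighbors sketch in plain Python. The function name `knn_predict`, the toy data, and the choice of Euclidean distance with k = 3 are assumptions made for this example, not part of the lecture.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    `train` is a list of (features, label) pairs (hypothetical layout).
    """
    # Sort the training records by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Count the labels of the k closest records and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up toy data: two numeric features per record and a class label.
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"),
    ((7.0, 6.0), "high"),
    ((6.5, 7.5), "high"),
]

print(knn_predict(train, (1.8, 1.2)))  # query close to the "low" group
print(knn_predict(train, (6.2, 6.8)))  # query close to the "high" group
```

The new record simply takes the majority label of its closest neighbors; no model is fitted in advance, which is one reason k-NN can become slow on large datasets.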
5
Q

Existence of various methods

A

Each method has advantages and disadvantages, depending on several factors:

  • Size of the dataset
  • Types of patterns that exist in the data
  • Whether the data meet the underlying assumptions of the method
  • How noisy the data are
  • The goal of the analysis
6
Q

Data

A
  • A collection of facts, usually obtained as the result of experiences, web page visits, observations, or experiments, or extracted from documents
  • Lowest level of abstraction (from which information and knowledge are derived)
  • May consist of numbers, words, images, etc.
7
Q

Terminology

A
8
Q

Data & Types of Variables

A
  • Data may consist of numbers, words, images, etc
  • Classification:
9
Q

NOIR

A
  • Nominal - values are distinct symbols
  • Ordinal - order on values, but no distance between values defined
  • Interval - quantities are not only ordered but measured in fixed and equal units
  • Ratio - quantities for which the measurement scheme defines a zero point
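
The practical consequence of the four scales is which operations are meaningful on each. The sketch below illustrates this with made-up values; the variable names and data are assumptions for this example only.

```python
import statistics

# Hypothetical example values for each measurement scale.
eye_color = ["blue", "brown", "brown", "green"]        # nominal
satisfaction = ["low", "medium", "medium", "high"]     # ordinal
temp_celsius = [10.0, 20.0, 30.0]                      # interval
weight_kg = [50.0, 75.0, 100.0]                        # ratio

# Nominal: only equality and counting make sense, so the mode is a valid summary.
nominal_mode = statistics.mode(eye_color)

# Ordinal: values can be ranked once an order is supplied, so a median exists.
order = {"low": 0, "medium": 1, "high": 2}
ranked = sorted(satisfaction, key=order.get)
ordinal_median = ranked[(len(ranked) - 1) // 2]        # lower median

# Interval: differences are meaningful, but ratios depend on the unit chosen.
def c_to_f(c):
    return c * 9 / 5 + 32

diff_c = temp_celsius[1] - temp_celsius[0]
ratio_c = temp_celsius[1] / temp_celsius[0]
ratio_f = c_to_f(temp_celsius[1]) / c_to_f(temp_celsius[0])

# Ratio: a true zero point makes ratios independent of the unit.
KG_TO_LB = 2.20462
ratio_kg = weight_kg[2] / weight_kg[0]
ratio_lb = (weight_kg[2] * KG_TO_LB) / (weight_kg[0] * KG_TO_LB)

print(nominal_mode, ordinal_median, diff_c)
print(ratio_c, ratio_f)    # the two ratios differ: "twice as hot" is not meaningful
print(ratio_kg, ratio_lb)  # the two ratios agree: "twice as heavy" is meaningful
```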
10
Q

Steps in a typical data mining effort

A
  1. Develop an understanding of the purpose of the data mining project
  2. Obtain the dataset to be used in the analysis
  3. Explore, clean, and preprocess the data
  4. Reduce the data dimension, if necessary
  5. Determine the data mining task
  6. Partition the data (for supervised tasks)
  7. Choose the data mining technique(s)
  8. Use algorithms to perform the task
  9. Interpret the results of the algorithm
  10. Deploy the model
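
Step 6 (partitioning) can be sketched in a few lines of plain Python. The 60/40 split, the fixed seed, and the function name `partition` are assumptions chosen for illustration; the lecture does not prescribe particular proportions.

```python
import random

def partition(records, train_frac=0.6, seed=42):
    """Shuffle the records and split them into training and validation sets."""
    shuffled = list(records)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for a reproducible split
    cut = round(len(shuffled) * train_frac)  # number of records in the training set
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))  # stand-in for 10 rows of a dataset
train_set, valid_set = partition(records)
print(len(train_set), len(valid_set))
```

The model is fitted on the training set only; the validation set is held back so that performance is measured on records the algorithm has not seen.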
11
Q

SEMMA methodology

A
  • Covers the 10 steps of a typical data mining effort
  • Developed by the software company SAS

Sample, Explore, Modify, Model, Assess

12
Q

CRISP-DM

A
  • Similar methodology to SEMMA, used by IBM SPSS Modeler
  • CRoss Industry Standard Process for Data Mining
  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Model Building
  5. Testing and Evaluation
  6. Deployment
13
Q

CRISP-DM Steps

A