Data Mining Flashcards

1
Q

What is Data Mining?

A

The process of extracting information from large databases and using it to make decisions

2
Q

Name two methods of Data Mining

A

Predictive and Descriptive

3
Q

What is the prediction method?

A

use some variables to predict unknown or future values of other variables

4
Q

What is the description method?

A

find human-interpretable patterns that describe the data

5
Q

What are the 4 basic tasks of data mining?

A

Classification, Regression, Clustering and Association rule discovery

6
Q

What is Classification?

A

maps data into predefined groups or classes

7
Q

What is Regression?

A

maps a data item to a real valued prediction variable

8
Q

What is Clustering?

A

maps data into groups or classes which are defined by the data

9
Q

What is Association rule discovery?

A

uncover relationships among data

10
Q

What are the 5 stages of the Data Mining Process?

A
  1. Data Gathering
  2. Data Preparation and Cleansing
  3. Pattern Extraction and Discovery
  4. Visualisation of the data
  5. Analysis and Evaluation of Results
11
Q

Name two types of learning

A

deductive and inductive

12
Q

What is deductive learning?

A

uses existing knowledge to deduce new knowledge. It reasons from general rules to specific cases

13
Q

What is inductive learning?

A

uses many examples to produce a generalisation of the examples that were given

14
Q

Name the 3 types of inductive learning

A

Supervised, Unsupervised and Reinforcement

15
Q

What is Supervised learning?

A

Training examples are input-output pairs with informative output

Classification learning is sometimes called supervised, because, in a sense, the scheme operates under supervision by being provided with the actual outcome for each of the training examples: the play or don’t play judgment, the lens recommendation, the type of iris, the acceptability of the labor contract.

16
Q

What is Unsupervised learning?

A

Training examples are input patterns with no associated output patterns

17
Q

What is Reinforcement learning?

A

Training examples are input-output pairs with evaluative output only

18
Q

Name two types of data values

A

nominal and real

19
Q

What is included in data preparation?

A

data selection, data transformation

20
Q

What is included in data cleansing?

A
Check that the data is free from errors, and check for:
missing data
outliers
duplicates
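
A minimal Python sketch of these checks, under stated assumptions: records are plain dicts, the field names `id` and `amount` are hypothetical, and an outlier is taken to be a value more than 2 sample standard deviations from the mean. The card does not say how outliers should be detected, so that threshold is only an illustrative choice.

```python
from statistics import mean, stdev

def cleansing_report(records, numeric_field, z_threshold=2.0):
    """Return row indices with missing values, duplicate rows, and outliers."""
    # Missing data: any field whose value is None.
    missing = [i for i, r in enumerate(records)
               if any(v is None for v in r.values())]

    # Duplicates: a row whose full content has been seen before.
    seen, duplicates = set(), []
    for i, r in enumerate(records):
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates.append(i)
        seen.add(key)

    # Outliers: z-score above the threshold (an assumed, simple criterion).
    values = [r[numeric_field] for r in records if r[numeric_field] is not None]
    mu, sigma = mean(values), stdev(values)
    outliers = [i for i, r in enumerate(records)
                if r[numeric_field] is not None and sigma > 0
                and abs(r[numeric_field] - mu) / sigma > z_threshold]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}

# Hypothetical data: one missing amount, one extreme amount, one repeated row.
amounts = [10, 12, 11, 10, None, 500, 11, 12, 10, 11]
records = [{"id": i, "amount": a} for i, a in enumerate(amounts)]
records.append(dict(records[0]))  # exact copy of the first row
report = cleansing_report(records, "amount")
```

What to do with the flagged rows (drop, correct, or impute) is a separate decision that depends on the data set.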
21
Q

Name two techniques used for finding patterns in data

A

Classification and Association rules discovery

22
Q

Give an example of where classification is used

A

Fraud Detection

predict fraudulent cases in credit card transactions

23
Q

Describe the Classification steps

A

Given a collection of records (training set) - each record contains a set of attributes, one of the attributes is the class.

Find a model for class attribute as a function of the values of other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

24
Q

Describe the Association rules discovery steps

A

Given a set of records, each of which contains some number of items from a given collection;

Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

Goal: previously unseen dependencies in a collection should be identified properly.

25
Q

Give an example of where Association is used

A

Supermarket shelf management

Goal: Identify items that are bought together by sufficiently many customers.

Method: process all the transaction data collected with barcode scanners to find dependencies among items.

26
Q

What is the minimum coverage?

A

The minimum number of instances that an association must cover before it is considered

27
Q

What is the OneR (one attribute rule) algorithm?

A

finds the one attribute whose rules make the fewest prediction errors

generates rules that only include one attribute plus the class

28
Q

Describe the OneR algorithm

A

For each attribute A:
  For each value V of that attribute, create a rule:
    1. count how often each class appears
    2. find the most frequent class, c
    3. make a rule “if A=V then C=c”
  Calculate the error rate of the rules for A.
Pick the attribute whose rules produce the lowest error rate.
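
The steps above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the toy weather-style rows and the attribute names `outlook`, `windy` and `play` are hypothetical, and ties between attributes are broken by taking the first one found.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, class_name):
    """Return (best_attribute, rules, errors); rules maps value -> predicted class."""
    best = None
    for attr in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_name]] += 1
        # Rule per value: predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: instances that do not belong to the majority class of their value.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
best_attribute, best_rules, best_errors = one_r(rows, ["outlook", "windy"], "play")
```

On this toy data the `outlook` rules misclassify one instance and the `windy` rules misclassify two, so `outlook` is chosen.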

29
Q

What is the Apriori algorithm?

A

finds frequent itemsets using candidate generation. Finds associations between data items

generates association rules that involve several attributes and does not focus on any particular attribute

30
Q

Describe the Apriori algorithm

A
  1. Set a minimum coverage
  2. Find all one-attribute associations, which satisfy the minimum coverage;
  3. Find all two-attribute associations, which satisfy the minimum coverage;
  4. Continue with larger associations until either a specified maximum number of attributes is reached or no more associations with the minimum coverage can be generated;
  5. Set a minimum accuracy (confidence)
  6. Generate rules from each association, which satisfy the minimum accuracy.
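
A compact Python sketch of these steps, under stated assumptions: transactions are sets of item names rather than attribute-value pairs, coverage is counted as a number of transactions, and the function names, basket data and thresholds are all hypothetical. It is meant to show the level-wise search, not to be an efficient implementation.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_coverage, max_size=None):
    """Level-wise search: return {itemset: coverage} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # one-item candidates
    frequent, size = {}, 1
    while current and (max_size is None or size <= max_size):
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_coverage}
        frequent.update(level)
        # Candidate generation: join frequent sets into sets one item larger,
        # keeping only those whose smaller subsets are all frequent.
        size += 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

def generate_rules(frequent, min_accuracy):
    """Return (antecedent, consequent, accuracy) for rules meeting min_accuracy."""
    found = []
    for itemset, cover in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                accuracy = cover / frequent[ante]
                if accuracy >= min_accuracy:
                    found.append((ante, itemset - ante, accuracy))
    return found

# Hypothetical shopping baskets.
baskets = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "bread"},
    {"nappies", "beer", "bread"},
]
frequent = frequent_itemsets(baskets, min_coverage=3)
found_rules = generate_rules(frequent, min_accuracy=0.7)
```

With a minimum coverage of 3, only `nappies`, `beer` and the pair `{nappies, beer}` survive, and both rules between them have accuracy 3/4.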
31
Q

What is the ID3 Algorithm?

A

used to generate a decision tree from a dataset. Several rules can be generated and it only accepts nominal values

32
Q

Describe the ID3 Algorithm

A

Classify examples by sorting them down the tree from the root node to some leaf node

The learned function is represented by a tree

Each node in the tree tests some attribute of an instance

Branches represent values of attributes

Follow tree from root to leaves for output value.
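
How such a tree is followed can be sketched in Python. The tree below is hand-written for illustration and was not produced by ID3; representing it as nested dicts with `attribute` and `branches` keys is an assumption of this sketch.

```python
def classify(tree, instance):
    """Follow branches from the root until a leaf (a plain class label) is reached."""
    node = tree
    while isinstance(node, dict):
        attribute = node["attribute"]                     # attribute tested at this node
        node = node["branches"][instance[attribute]]      # branch for the instance's value
    return node

# Hypothetical decision tree: inner nodes are dicts, leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy",
                  "branches": {"yes": "no", "no": "yes"}},
    },
}
label = classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": "no"})
```

The instance goes down the `sunny` branch, is then tested on `humidity`, and ends at the leaf `yes`.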

33
Q

For ID3, how do you determine which attribute best classifies data?

A

Entropy

34
Q

What is entropy?

A

Entropy is a measure of ‘degree of doubt’

The higher it is, the more doubt there is about the possible conclusions.

The attribute which has the lowest entropy is the most useful determiner.
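
The card gives no formula, so the sketch below assumes the usual Shannon entropy in bits, H = -sum(p_i * log2(p_i)) over the class proportions p_i, and scores an attribute by the weighted average entropy of the class within each of its values. The function names and example rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def attribute_entropy(rows, attribute, class_name):
    """Weighted average entropy of the class after splitting on an attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[class_name])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

# Hypothetical data: attribute "a" separates the classes perfectly, "b" not at all.
examples = [
    {"a": "x", "b": "p", "c": "yes"},
    {"a": "x", "b": "q", "c": "yes"},
    {"a": "y", "b": "p", "c": "no"},
    {"a": "y", "b": "q", "c": "no"},
]
best = min(["a", "b"], key=lambda attr: attribute_entropy(examples, attr, "c"))
```

Here `a` leaves no doubt about the class (entropy 0) while `b` leaves maximal doubt (entropy 1 bit), so `a` would be chosen as the test at this node.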

35
Q

What are the advantages of ID3?

A

It generates a detailed decision tree.

Given training data, it is always able to generate a tree.

It is easily implemented.

The output is easy to understand and interpret.

The process is simple.

Its running time increases only linearly with the complexity of the problem

36
Q

What are the limitations of ID3?

A

Wholly spurious correlations are possible, since the algorithm takes no account of any meaning that the data it works on may have.

The algorithm considers just one attribute at a time.

When inducing rules from large sets of examples in which there are a large number of possible outcomes, then the algorithm can be very sensitive to apparently trivial changes in the set of examples.

The algorithm cannot generate uncertain rules or handle uncertain data

37
Q

Give an example of an Association Rule

A

Transactions item set: {nappies, beer}

IF bought_nappies THEN bought_beer_likely
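
How strongly the data supports such a rule can be checked with a short Python sketch. It assumes coverage means the number of transactions containing both sides of the rule and accuracy means the fraction of antecedent transactions that also contain the consequent. The baskets are hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (coverage, accuracy) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    coverage = len(with_both)
    accuracy = coverage / len(with_antecedent) if with_antecedent else 0.0
    return coverage, accuracy

# Hypothetical baskets: three contain nappies, two of those also contain beer.
baskets = [
    {"nappies", "beer"},
    {"nappies", "beer", "milk"},
    {"nappies", "milk"},
    {"beer"},
]
coverage, accuracy = rule_stats(baskets, {"nappies"}, {"beer"})
```

A rule would normally be kept only if both numbers exceed chosen minimum thresholds.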

38
Q

Give an example of a classification rule

A
IF buy_time in December
and cost > 500
and type_of_item = electronics
and location = overseas
and ...etc...
THEN possibly_fraudulent = yes
39
Q

Name 3 examples of data mining applications

A

Retail/Marketing
Identify buying patterns
associations among customer demographics

Banking
patterns of fraudulent use
Identify loyal customers
spending of customer groups

Insurance
Claims analysis

Medicine
successful therapies