Data Mining Flashcards

1
Q

What is Data Mining

A

The process of extracting information from large databases and using it to make decisions

2
Q

Name two methods of Data Mining

A

Predictive and descriptive

3
Q

What is the prediction method

A

use some variables to predict unknown or future values of other variables

4
Q

What is the description method?

A

Find human-interpretable patterns that describe the data

5
Q

What are the 4 basic tasks of data mining

A

Classification, Regression, Clustering and Association rule discovery

6
Q

What is Classification

A

maps data into predefined groups or classes

7
Q

What is Regression

A

maps a data item to a real valued prediction variable

8
Q

What is Clustering

A

maps data into groups or classes which are defined by the data

9
Q

What is Association rule discovery

A

uncover relationships among data

10
Q

What are the 5 stages of the Data Mining Process

A
Data Gathering
Data Preparation and Cleansing
Pattern Extraction and Discovery
Visualisation of the data
Analysis and Evaluation of Results
11
Q

Name two types of learning

A

deductive and inductive

12
Q

What is deductive learning?

A

Uses existing knowledge to deduce new knowledge. It reasons from general rules to specific cases.

13
Q

What is inductive learning?

A

uses many examples to produce a generalisation of the examples that were given

14
Q

Name the 3 types of inductive learning

A

Supervised, Unsupervised and Reinforcement

15
Q

What is Supervised learning?

A

Training examples are input-output pairs with informative output

Classification learning is sometimes called supervised, because, in a sense, the
scheme operates under supervision by being provided with the actual outcome for
each of the training examples—the play or don’t play judgment, the lens recommendation, the type of iris, the acceptability of the labor contract.

16
Q

What is Unsupervised learning?

A

Training examples are input patterns with no associated output patterns

17
Q

What is Reinforcement learning?

A

Training examples are input-output pairs with evaluative output only

18
Q

Name two types of data values

A

nominal and real

19
Q

What is included in data preparation

A

data selection, data transformation

20
Q

What is included in data cleansing

A
Check that the data is free from errors, and check for:
missing data
outliers
duplicates
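The checks above can be sketched in a few lines of Python. This is only an illustration: the rows, the field names `customer` and `amount`, and the z-score threshold of 2.0 for outliers are all assumptions made for the example, not part of the card.

```python
from statistics import mean, stdev

def cleansing_report(rows, field, z_threshold=2.0):
    """Return row indexes with missing values, duplicate rows, and outliers in `field`."""
    # Missing data: any row with a None value.
    missing = [i for i, r in enumerate(rows) if any(v is None for v in r.values())]

    # Duplicates: rows whose (key, value) pairs were already seen.
    seen, duplicates = set(), []
    for i, r in enumerate(rows):
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates.append(i)
        seen.add(key)

    # Outliers: values more than z_threshold standard deviations from the mean
    # (one common rule of thumb, assumed here for illustration).
    values = [r[field] for r in rows if r[field] is not None]
    mu, sigma = mean(values), stdev(values)
    outliers = [i for i, r in enumerate(rows)
                if r[field] is not None and sigma > 0
                and abs(r[field] - mu) / sigma > z_threshold]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}

# Hypothetical data for the example.
rows = [
    {"customer": "a", "amount": 20},
    {"customer": "b", "amount": 22},
    {"customer": "b", "amount": 22},    # duplicate of the previous row
    {"customer": "c", "amount": None},  # missing value
    {"customer": "d", "amount": 21},
    {"customer": "e", "amount": 23},
    {"customer": "f", "amount": 21},
    {"customer": "g", "amount": 20},
    {"customer": "h", "amount": 250},   # suspiciously large
]
report = cleansing_report(rows, "amount")
print(report)
```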
21
Q

Name two examples used for finding patterns in data

A

Classification and Association rules discovery

22
Q

Give an example of where classification is used

A

Fraud Detection

predict fraudulent cases in credit card transactions

23
Q

Describe the Classification steps

A

Given a collection of records (training set) - each record contains a set of attributes, one of the attributes is the class.

Find a model for class attribute as a function of the values of other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

24
Q

Describe the Association rules discovery steps

A

Given a set of records, each of which contains some number of items from a given collection;

Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

Goal: previously unseen dependencies in a collection should be identified properly.

25
Give an example of where association is used
Supermarket shelf management.
Goal: identify items that are bought together by sufficiently many customers.
Method: process all the transaction data collected with barcode scanners to find dependencies among items.
26
What is the minimum coverage?
The minimum number of instances that an association must cover
27
What is the OneR (one attribute rule) algorithm?
Finds the one attribute that makes the fewest prediction errors, and generates rules that include only that attribute plus the class.
28
Describe the OneR algorithm
For each attribute A:
For each value V of that attribute, create a rule:
1. Count how often each class appears.
2. Find the most frequent class, c.
3. Make a rule "if A=V then C=c".
Calculate the error rate of these rules.
Pick the attribute whose rules produce the lowest error rate.
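The OneR steps above can be sketched roughly as follows. The weather-style records and the names `one_r`, `outlook`, `windy` and `play` are hypothetical, chosen only for illustration; ties between classes are broken arbitrarily.

```python
from collections import Counter, defaultdict

def one_r(records, attributes, class_key):
    """Return (best_attribute, rules, errors) following the OneR steps."""
    best = None
    for attr in attributes:
        # 1. For each value of the attribute, count how often each class appears.
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[attr]][rec[class_key]] += 1
        # 2-3. For each value, the rule predicts the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: instances that do not belong to the majority class of their value.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        # Keep the attribute whose rules make the fewest errors.
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical training set.
records = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
]
best_attr, best_rules, best_errors = one_r(records, ["outlook", "windy"], "play")
print(best_attr, best_errors)
```

On this toy data the rules for `outlook` misclassify one record and the rules for `windy` misclassify two, so `outlook` is picked.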
29
What is the Apriori algorithm?
Finds frequent itemsets using candidate generation, and so finds associations between data items. It generates association rules that can involve several attributes and does not focus on any particular attribute.
30
Describe the Apriori algorithm
1. Set a minimum coverage.
2. Find all one-attribute associations that satisfy the minimum coverage.
3. Find all two-attribute associations that satisfy the minimum coverage.
4. Continue with larger associations until either a specified maximum number of attributes is reached, or no more associations satisfy the minimum coverage.
5. Set a minimum accuracy (confidence).
6. Generate rules from each association that satisfy the minimum accuracy.
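A minimal sketch of these steps, assuming transactions are plain sets of item names. The function names `frequent_itemsets` and `generate_rules`, the shopping baskets, and the thresholds are hypothetical; a production implementation would count candidates more efficiently.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_coverage, max_size=None):
    """Level-wise search for item sets found in at least min_coverage transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]  # one-item candidates
    frequent, size = {}, 1
    while current and (max_size is None or size <= max_size):
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_coverage}
        frequent.update(level)
        # Candidate generation: join frequent sets into sets one item larger,
        # keeping only those whose smaller subsets are all frequent.
        size += 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

def generate_rules(frequent, min_accuracy):
    """Split each frequent item set into IF/THEN parts and keep the accurate rules."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                accuracy = count / frequent[lhs]  # coverage(all) / coverage(IF part)
                if accuracy >= min_accuracy:
                    rules.append((set(lhs), set(itemset - lhs), accuracy))
    return rules

# Hypothetical shopping baskets.
transactions = [
    {"nappies", "beer", "bread"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"bread", "milk"},
    {"nappies", "beer", "milk"},
]
frequent = frequent_itemsets(transactions, min_coverage=3)
found_rules = generate_rules(frequent, min_accuracy=0.7)
for lhs, rhs, accuracy in found_rules:
    print(lhs, "->", rhs, accuracy)
```

With a minimum coverage of 3, the only two-item association that survives here is {nappies, beer}, which yields two rules above the 0.7 accuracy threshold.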
31
What is the ID3 Algorithm?
Used to generate a decision tree from a dataset. Several rules can be generated, and it accepts only nominal values.
32
Describe the ID3 Algorithm
Classify examples by sorting them down the tree from the root node to some leaf node.
The learned function is represented by the tree.
Each node in the tree tests some attribute of an instance.
Branches represent values of attributes.
Follow the tree from root to leaf for the output value.
33
For ID3, how do you determine which attribute best classifies data?
Entropy
34
What is entropy?
Entropy is a measure of 'degree of doubt'. The higher it is, the more doubt there is about the possible conclusions. The attribute whose split leaves the lowest entropy is the most useful determiner.
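As a rough illustration, the sketch below uses the standard Shannon entropy, H = -Σ p·log2(p) over the class proportions p (the card itself does not give a formula, so this is an assumption about what is meant). The records and the helper names `entropy` and `split_entropy` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split_entropy(records, attr, class_key):
    """Weighted average entropy of the class after splitting the records on attr."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[attr], []).append(rec[class_key])
    total = len(records)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# Hypothetical training set.
records = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
]
by_outlook = split_entropy(records, "outlook", "play")
by_windy = split_entropy(records, "windy", "play")
print(by_outlook, by_windy)
```

Here splitting on `outlook` leaves less doubt about the class than splitting on `windy`, so `outlook` would be chosen as the attribute to test first.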
35
What are the advantages of ID3?
It generates a detailed decision tree.
Given training data, it is always able to generate a tree.
It is easily implemented.
The output is easy to understand and interpret.
The process is simple.
Its running time increases only linearly with the complexity of the problem.
36
What are the limitations of ID3?
Wholly spurious correlations are possible, since the algorithm takes no account of any meaning that the data it works on may have.
The algorithm considers just one attribute at a time.
When inducing rules from large sets of examples in which there are a large number of possible outcomes, the algorithm can be very sensitive to apparently trivial changes in the set of examples.
The algorithm cannot generate uncertain rules or handle uncertain data.
37
Give an example of an Association Rule
Transaction item set: {nappies, beer}
IF bought_nappies THEN bought_beer_likely
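To judge how strong such a rule is, its coverage and accuracy can be counted directly from the transactions. The sketch below uses made-up baskets and a hypothetical helper `rule_stats`.

```python
def rule_stats(transactions, if_items, then_items):
    """Return (coverage, accuracy) of the rule IF if_items THEN then_items."""
    if_items, then_items = set(if_items), set(then_items)
    # Transactions where the IF part holds.
    covers_if = [t for t in transactions if if_items <= set(t)]
    # Of those, transactions where the THEN part also holds.
    covers_both = [t for t in covers_if if then_items <= set(t)]
    coverage = len(covers_both)
    accuracy = coverage / len(covers_if) if covers_if else 0.0
    return coverage, accuracy

# Hypothetical baskets.
transactions = [
    {"nappies", "beer", "bread"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"bread", "milk"},
    {"nappies", "beer", "milk"},
]
coverage, accuracy = rule_stats(transactions, {"nappies"}, {"beer"})
print(coverage, accuracy)
```

In this toy data, four baskets contain nappies and three of those also contain beer, so the rule covers 3 transactions with an accuracy of 0.75.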
38
Give an example of a classification rule
```
IF buy_time in December
   and cost > 500
   and type_of_item = electronics
   and location = overseas
   and ...etc...
THEN possibly_fraudulent = yes
```
39
Name 3 examples of data mining applications
Retail/Marketing: identify buying patterns; associations among customer demographics
Banking: patterns of fraudulent use; identify loyal customers; spending of customer groups
Insurance: claims analysis
Medicine: successful therapies