Week 1: Introduction to Data Mining Flashcards

1
Q

Data Mining

A

The goal of data mining is to extract information from data. It can be automated, it extracts meaningful and useful patterns from the data, and those patterns can be used to make predictions. It works on existing data and does not require generating new data.

2
Q

Pattern

A

A regularity in the data that repeats in a recognisable way, allowing non-trivial predictions to be made for new data.

3
Q

Datasets

A

Datasets can be described as 2-D tables consisting of attributes and instances. CSV is a common file type for datasets. Instances are assumed to be independent, but attributes might be related.

In the real world, datasets can be massive, noisy, biased, incomplete (with missing values), or insufficient for analysis.
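
As a rough illustration of the table view, the sketch below parses a tiny made-up CSV with Python's standard csv module. The column names, the values, and the use of an empty field to mark a missing value are all assumptions for the example.

```python
import csv
import io

# Hypothetical CSV content: the header row names the attributes,
# and every following row is one instance.
raw = """outlook,temperature,play
sunny,30,no
rainy,18,yes
overcast,,yes
"""

rows = list(csv.DictReader(io.StringIO(raw)))
attributes = list(rows[0].keys())

# In this toy file an empty string marks a missing value.
missing = [
    (index, name)
    for index, row in enumerate(rows)
    for name in attributes
    if row[name] == ""
]

print(attributes)  # the columns (attributes)
print(len(rows))   # the number of rows (instances)
print(missing)     # (row index, attribute) pairs with no value
```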

4
Q

Attribute

A

Also called a feature or a column. Attributes represent characteristics of instances.

5
Q

Instance

A

Also called an example or a row. An instance is a set of values, one for each attribute.

6
Q

Attribute Types

A

Numeric, Nominal, Ordinal, Cardinal, Interval, Dichotomous.

7
Q

Data Preparation

A

This involves the assembly, integration, cleaning, and transformation of the data for analysis.

8
Q

Feature Engineering

A

The process of transforming raw data and selecting the most suitable attributes for the data mining problem to be solved.

9
Q

Classification Rule

A

An if-then rule in which the antecedent is the condition and the consequent is the resulting action or categorisation.
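
A minimal sketch of one such rule written as code. The attribute names, the threshold, and the default class are hypothetical and chosen only to show where the antecedent and consequent sit.

```python
def classify(instance):
    """Apply one hypothetical if-then rule to an instance (a dict)."""
    # Antecedent: the condition tested on the attribute values.
    if instance["outlook"] == "sunny" and instance["humidity"] > 80:
        # Consequent: the class assigned when the antecedent holds.
        return "no"
    # Default class when the rule does not fire (an assumption of this sketch).
    return "yes"


print(classify({"outlook": "sunny", "humidity": 90}))
print(classify({"outlook": "rainy", "humidity": 90}))
```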

10
Q

Classification

A

This is a type of data mining task where instances are assigned class labels. It’s a supervised learning task, as labelled training data is required to create classification rules and methods.

11
Q

Regression

A

Similar to classification, but the predicted target is a continuous numeric value rather than a discrete class. Regression is a supervised method, as the actual target values of the training data must be known in order to predict values for the test or validation data.

12
Q

Clustering

A

This is an unsupervised learning method that assigns data into clusters based on how similar they are. Often distance metrics are used to assign data points into clusters.
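
The distance-based assignment step might look like the sketch below. It assumes Euclidean distance and fixed, made-up cluster centres; it shows only how points are assigned to a cluster, not a complete clustering algorithm.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Hypothetical cluster centres and data points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5)]

labels = [nearest_centroid(point, centroids) for point in points]
print(labels)  # one cluster index per point
```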

13
Q

Association Rule

A

This is a method that does not involve specific classes or labels. It involves looking at combinations of antecedents and consequents to see whether the antecedent is a strong predictor of the consequent.
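
The card does not say how "strong" is measured. One common choice, assumed here, is to compute support and confidence over a set of transactions; the transactions below are invented for the example.

```python
# Hypothetical transactions, each a set of items.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


# Rule: if bread then butter.
print(support({"bread"}))
print(confidence({"bread"}, {"butter"}))
```

A confidence close to 1 would suggest the antecedent is a strong predictor of the consequent in this data.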

14
Q

Process of Data Mining

A
  1. Objective Specification: define the data mining problem type (supervised, unsupervised).
  2. Data Exploration: visualise the data and confirm that the objective can be achieved with the dataset.
  3. Data Cleaning: fix any problems with the data and confirm there’s enough data for the analysis.
  4. Model Building: select and construct an appropriate model for the data. The data types of the attributes must be considered.
  5. Model Evaluation: calculate and measure the performance of the model in its accuracy and ability to generalise to new data. Overfitting must be avoided.
  6. Repeat: data mining is iterative, so some steps or sequences of steps may need to be repeated.
15
Q

Decision Table

A

This table shows which actions to perform based on the given conditions. It has a set of attributes and a decision label for each unique set of attribute values. These tables must be exhaustive.
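
A decision table can be sketched as a lookup from attribute-value combinations to decision labels. The attributes, values, and labels below are hypothetical; the exhaustiveness check simply confirms that every combination has an entry.

```python
from itertools import product

# Hypothetical attributes and their possible values.
domains = {"outlook": ["sunny", "rainy"], "windy": [True, False]}

# One decision label per unique combination of attribute values.
decision_table = {
    ("sunny", True): "play",
    ("sunny", False): "play",
    ("rainy", True): "stay in",
    ("rainy", False): "play",
}

# Exhaustive: every possible combination appears as a key.
all_combinations = set(product(*domains.values()))
is_exhaustive = all_combinations == set(decision_table)


def decide(outlook, windy):
    """Look up the decision for one combination of attribute values."""
    return decision_table[(outlook, windy)]


print(is_exhaustive)
print(decide("rainy", True))
```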

16
Q

Decision Tree

A

The nodes are specific decisions, usually binary, and the branches represent the possible alternatives. The very top node is the root node. The terminal nodes are leaves.
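
One way to picture the structure is as nested dictionaries, where a prediction follows branches from the root until it reaches a leaf. The tree below is a made-up example and this representation is only one possible choice.

```python
# Internal nodes are dicts with an attribute to test and one branch per value;
# leaves are plain class labels.
tree = {
    "attribute": "outlook",  # root node
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {True: "no", False: "yes"},
        },
        "rainy": "no",  # leaf
    },
}


def predict(node, instance):
    """Follow branches from the given node until a leaf is reached."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


print(predict(tree, {"outlook": "sunny", "windy": False}))
```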

17
Q

Missing Value Problem

A

For decision trees, it is unclear which branch to follow when an attribute value is missing. Possible solutions include ignoring all instances with missing values, filling in missing values with the most common value, and making a probabilistic choice for each missing value based on the other instances.
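
The first two options can be sketched as follows. The data is invented, and None is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical instances; None marks a missing attribute value.
instances = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": None, "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]

# Option 1: ignore every instance that has a missing value.
complete = [row for row in instances if None not in row.values()]

# Option 2: fill each missing value with the most common observed value.
counts = Counter(row["outlook"] for row in instances if row["outlook"] is not None)
most_common = counts.most_common(1)[0][0]
filled = [
    {**row, "outlook": most_common if row["outlook"] is None else row["outlook"]}
    for row in instances
]

print(len(complete))
print(filled[2]["outlook"])
```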

18
Q

Function Tree

A

Each node tests a function of multiple attribute values rather than a single attribute.

19
Q

Regression Tree

A

These trees predict numerical values, with each node having branches on values of attributes or values of functions of attributes. The leaves are predicted values for the corresponding instance.

20
Q

Model Trees

A

Each leaf holds a regression equation that predicts the numeric output value. A model tree is more complex than both linear regression and a regression tree.
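
To contrast the two tree types, the sketch below uses a single split on one hypothetical attribute x. The split point, constants, and coefficients are made up and not fitted to any data.

```python
def regression_tree(x):
    """Regression tree: each leaf is a constant predicted value."""
    if x < 5:
        return 10.0
    return 30.0


def model_tree(x):
    """Model tree: each leaf holds its own linear regression equation."""
    if x < 5:
        return 2.0 * x + 1.0
    return 0.5 * x + 20.0


print(regression_tree(2.0), model_tree(2.0))
print(regression_tree(10.0), model_tree(10.0))
```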

21
Q

Learning Rule

A

Rules are learned incrementally: as more instances are added to the training set, new rules are added and existing rules are refined, for example by adding conjunctive clauses to the antecedent of a classification or association rule.

22
Q

Linear Model

A

Linear models compute weighted sums of attribute values, so all attribute values must be numeric.
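
A weighted sum can be written in a few lines. The weights and values below are arbitrary, and the bias (intercept) term is an assumption of this sketch; the card itself mentions only the weighted sum.

```python
def linear_model(weights, bias, values):
    """Return bias plus the weighted sum of numeric attribute values."""
    if len(weights) != len(values):
        raise ValueError("one weight is needed per attribute value")
    return bias + sum(w * x for w, x in zip(weights, values))


# Two numeric attributes with hypothetical weights.
prediction = linear_model(weights=[2.0, -1.0], bias=0.5, values=[3.0, 4.0])
print(prediction)
```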

23
Q

Instance-Based Representations

A

Instances are memorised, and the distance from each new input to the stored instances is measured. The instance with the lowest distance is then returned as the output. This is good for tasks that involve matching inputs against an existing database. No model needs to be computed.
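
A minimal nearest-instance lookup, assuming numeric attributes and Euclidean distance. The stored instances and their labels are invented for the example.

```python
import math

# Memorised instances: (attribute values, label). No model is built.
stored = [
    ((1.0, 1.0), "a"),
    ((5.0, 5.0), "b"),
    ((9.0, 1.0), "c"),
]


def nearest_instance(query):
    """Return the stored instance with the lowest distance to query."""
    return min(stored, key=lambda item: math.dist(query, item[0]))


values, label = nearest_instance((4.0, 4.5))
print(values, label)
```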

24
Q

Hamming Distance

A

Used for measuring the distance between instances with nominal attributes. For each attribute, the distance is 0 if the values match and 1 if they do not; the total distance is the number of mismatches.
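
The per-attribute 0/1 rule, summed over all attributes, can be sketched as below. The example instances are made up.

```python
def hamming(a, b):
    """Count the attributes on which two instances differ."""
    if len(a) != len(b):
        raise ValueError("instances must have the same number of attributes")
    # Each comparison contributes 0 for a match and 1 for a mismatch.
    return sum(x != y for x, y in zip(a, b))


first = ("sunny", "hot", "high")
second = ("sunny", "mild", "normal")
print(hamming(first, second))
```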