Week 1: Introduction to Data Mining Flashcards

Question 1

Q

Data Mining

Answer

A

The goal of data mining is to extract information from data. It can be automated, can extract meaningful and useful patterns from the data, and can be used to make predictions. It doesn’t require generating new data.

Question 2

Q

Pattern

Answer

A

It’s a series of data that repeats in a recognisable way, allowing for making non-trivial predictions for new data.

Question 3

Q

Datasets

Answer

A

These can be described as 2-D tables that consist of attributes and instances. CSV is a common file format for datasets. Instances are assumed to be independent, but attributes might be related.

In the real world, datasets can be massive, noisy, biased, incomplete (with missing values), or insufficient for analysis.

Question 4

Q

Attribute

Answer

A

Also called a feature or a column. Attributes represent characteristics of instances.

Question 5

Q

Instance

Answer

A

Also called an example or a row. An instance is a set of values, one for each attribute.

Question 6

Q

Attribute Types

Answer

A

Numeric, Nominal, Ordinal, Cardinal, Interval, Dichotomous.

Question 7

Q

Data Preparation

Answer

A

This involves the assembly, integration, cleaning, and transformation of the data for analysis.

Question 8

Q

Feature Engineering

Answer

A

The process of transforming raw data by selecting the most suitable attributes for data mining problems to be solved.

Question 9

Q

Classification Rule

Answer

A

An if-then rule in which the antecedent is a condition and the consequent is the resulting action or categorisation.
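A classification rule can be sketched as a plain conditional. The attribute names (outlook, humidity) and class labels below are hypothetical, chosen only for illustration.

```python
def classify_play(outlook: str, humidity: str) -> str:
    """Apply one hypothetical if-then classification rule."""
    # Antecedent: the condition on the attribute values.
    if outlook == "sunny" and humidity == "high":
        # Consequent: the class assigned when the antecedent holds.
        return "no"
    # Default class when the antecedent does not hold (an assumption of this sketch).
    return "yes"


rule_fires = classify_play("sunny", "high")
rule_skipped = classify_play("rainy", "high")
```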

Question 10

Q

Classification

Answer

A

This is a type of data mining task where instances are assigned class labels. It’s a supervised learning task, as labelled training data is required to create classification rules and methods.

Question 11

Q

Regression

Answer

A

It’s a type of classification without discrete classes: the predicted target is a numeric value. Regression is considered a supervised method, as the actual values of the training data need to be known in order to predict values for the test or validation data.

Question 12

Q

Clustering

Answer

A

This is an unsupervised learning method that assigns data into clusters based on how similar they are. Often distance metrics are used to assign data points into clusters.
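As a minimal sketch of distance-based assignment, the snippet below places each point in the cluster whose centre is closest by Euclidean distance. The points and cluster centres are made-up values, and a full clustering method would also have to choose or update the centres, which is not shown here.

```python
import math


def assign_to_clusters(points, centres):
    """Return, for each point, the index of the nearest cluster centre."""
    labels = []
    for point in points:
        distances = [math.dist(point, centre) for centre in centres]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical 2-D data and fixed cluster centres.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centres = [(1.0, 1.5), (8.5, 9.0)]
labels = assign_to_clusters(points, centres)
```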

Question 13

Q

Association Rule

Answer

A

This is a method that doesn’t involve specific classes or labels. It involves looking at combinations of antecedents and consequents to see if the antecedent is a strong predictor for the consequent to appear.
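One common way to quantify how strongly an antecedent predicts a consequent is the rule's confidence: the fraction of instances containing the antecedent that also contain the consequent. The card does not name this measure, so treat it as an assumption of this sketch; the shopping baskets are invented data.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # The rule never applies, so report zero confidence.
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical market-basket data.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

Here bread appears in three baskets and butter in two of those, so the rule "bread implies butter" holds in two out of three applicable cases.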

Question 14

Q

Process of Data Mining

Answer

A

Objective Specification: define the data mining problem type (supervised, unsupervised).
Data Exploration: visualise the data and confirm that the objective can be achieved with the dataset.
Data Cleaning: fix any problems with the data and confirm there’s enough data for the analysis.
Model Building: select and construct an appropriate model for the data. The data types of the attributes must be considered.
Model Evaluation: calculate and measure the performance of the model in its accuracy and ability to generalise to new data. Overfitting must be avoided.
Repeat: data mining is iterative and some steps or sequences of steps need to be repeated.

Question 15

Q

Decision Table

Answer

A

This table shows which actions to perform based on the given conditions. It has a set of attributes and a decision label for each unique set of attribute values. These tables must be exhaustive.
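A decision table can be sketched as a dictionary keyed by tuples of attribute values. The two boolean attributes (raining, cold) and the decisions are hypothetical; the helper checks the exhaustiveness requirement by enumerating every combination of values.

```python
from itertools import product

# Keys are (raining, cold); values are the decision labels.
decision_table = {
    (True, True): "stay in",
    (True, False): "take umbrella",
    (False, True): "wear coat",
    (False, False): "go out",
}


def is_exhaustive(table, value_sets):
    """True if the table has a row for every combination of attribute values."""
    return all(combo in table for combo in product(*value_sets))


exhaustive = is_exhaustive(decision_table, [(True, False), (True, False)])
decision = decision_table[(True, False)]
```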

Question 16

Q

Decision Tree

Answer

A

The nodes are specific decisions, usually binary, and the branches represent the possible alternatives. The very top node is the root node. The terminal nodes are leaves.
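The structure can be sketched with nested dictionaries: an internal node names the attribute it tests and maps each branch to a subtree, while a leaf is just a class label. The tree below is a hand-written, hypothetical example, not one learned from data.

```python
# Root node tests "outlook"; a plain string is a leaf holding a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": "no",
    },
}


def predict(node, instance):
    """Follow branches from the root until a leaf is reached."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


label = predict(tree, {"outlook": "sunny", "humidity": "normal"})
```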

Question 17

Q

Missing Value Problem

Answer

A

For decision trees, it’s unclear what to do if the attribute value is missing. Possible solutions include ignoring all instances with missing values, setting the most popular choice to fill in the missing values, and making probabilistic choices for each missing value based on the other instances.
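The "most popular choice" option can be sketched as follows. Missing values are represented by None, which is an assumption of this example, and the function expects at least one known value for the attribute.

```python
from collections import Counter


def fill_with_most_common(rows, attribute):
    """Replace missing (None) values of one attribute with its most frequent known value."""
    known = [row[attribute] for row in rows if row[attribute] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    return [
        {**row, attribute: most_common} if row[attribute] is None else row
        for row in rows
    ]


# Hypothetical instances; the third one has a missing outlook.
rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": None, "play": "yes"},
    {"outlook": "rainy", "play": "no"},
]
filled = fill_with_most_common(rows, "outlook")
```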

Question 18

Q

Function Tree

Answer

A

Each node has a function of multiple attribute values.

Question 19

Q

Regression Tree

Answer

A

These trees predict numerical values, with each node having branches on values of attributes or values of functions of attributes. The leaves are predicted values for the corresponding instance.

Question 20

Q

Model Trees

Answer

A

Each leaf holds a regression equation that predicts numeric output values. Model trees are more complex than linear regression and regression trees.
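The difference from a regression tree can be sketched with a single split: the regression tree returns a fixed value at each leaf, while the model tree evaluates a linear equation there. The attribute (size in square metres), the threshold, and all coefficients are invented for illustration.

```python
def regression_tree_predict(size_m2: float) -> float:
    """Regression tree: each leaf is a constant predicted value."""
    if size_m2 < 100:
        return 150.0
    return 400.0


def model_tree_predict(size_m2: float) -> float:
    """Model tree: each leaf is a regression equation over the attributes."""
    if size_m2 < 100:
        return 50.0 + 2.0 * size_m2
    return 100.0 + 1.5 * size_m2


constant_leaf = regression_tree_predict(60.0)
equation_leaf = model_tree_predict(60.0)
```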

Question 21

Q

Learning Rule

Answer

A

Rules are learned by adding new rules and refining existing ones as more instances are added to the training set. Refinement may add conjunctive clauses to the antecedent of an existing classification or association rule.

Question 22

Q

Linear Model

Answer

A

A linear model computes a weighted sum of attribute values; all attribute values must be numeric.
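A minimal sketch of the weighted sum follows. The weights and attribute values are arbitrary, and the bias (intercept) term is an assumption of this example that the card does not mention.

```python
def linear_model(weights, bias, values):
    """Return bias + sum of weight * attribute value over all attributes."""
    if len(weights) != len(values):
        raise ValueError("need exactly one weight per attribute value")
    return bias + sum(w * x for w, x in zip(weights, values))


# 1.0 + 2.0 * 3.0 + 0.5 * 4.0
prediction = linear_model([2.0, 0.5], 1.0, [3.0, 4.0])
```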

Question 23

Q

Instance-Based Representations

Answer

A

The training instances are memorised, and the distance from each new input to every stored instance is measured. The stored instance with the lowest distance is returned as the output. This is good for tasks that involve matching inputs against an existing database. No model needs to be computed.
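The lookup can be sketched as a nearest-neighbour search. Euclidean distance over numeric attributes is an assumption here, and the stored instances and labels are made up.

```python
import math


def nearest_instance(stored, query):
    """Return the stored (features, label) pair closest to the query."""
    return min(stored, key=lambda item: math.dist(item[0], query))


# The "model" is just the memorised instances themselves.
stored = [((1.0, 1.0), "A"), ((5.0, 5.0), "B"), ((9.0, 1.0), "C")]
features, label = nearest_instance(stored, (4.0, 6.0))
```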

Question 24

Q

Hamming Distance

Answer

A

Used for measuring the distance between instances with nominal attributes. For each attribute, the distance is 0 if the two values match and 1 if they don’t; the per-attribute distances are summed over all attributes.
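A short sketch of the calculation, counting the attributes on which two instances disagree. The attribute values are hypothetical.

```python
def hamming_distance(a, b):
    """Count the attributes on which two instances differ."""
    if len(a) != len(b):
        raise ValueError("instances must have the same number of attributes")
    # Each mismatch contributes 1, each match contributes 0.
    return sum(x != y for x, y in zip(a, b))


distance = hamming_distance(
    ("sunny", "hot", "high"),
    ("sunny", "mild", "normal"),
)
```

The two instances agree on the first attribute and differ on the other two, so the distance is 2.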

(24 cards)