Week 1: Introduction to Data Mining Flashcards
Data Mining
The goal of data mining is to extract information from data. It can be automated, it finds meaningful and useful patterns in the data, and it can be used to make predictions. It works on existing data and doesn’t require generating new data.
Pattern
It’s a regularity in the data that repeats in a recognisable way, allowing non-trivial predictions to be made about new data.
Datasets
These can be described as 2-D tables consisting of attributes (columns) and instances (rows). CSV is a common file type for datasets. Instances are assumed to be independent, but attributes might be related.
In the real world, datasets can be massive, noisy, biased, or insufficient for analysis, and can have missing values.
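A minimal sketch of the table view described above, using Python's standard csv module. The column names and values are made up for illustration and are not from the course material.

```python
import csv
import io

# Hypothetical toy dataset: three instances (rows), three attributes (columns).
raw = """outlook,temperature,play
sunny,30,no
overcast,22,yes
rainy,18,yes
"""

reader = csv.DictReader(io.StringIO(raw))
attributes = reader.fieldnames   # the attribute (column) names
instances = list(reader)         # one dict per instance (row)

print(attributes)
print(len(instances), "instances")
```

Note that csv reads every value as a string, so numeric attributes such as temperature would still need converting before analysis.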
Attribute
Also called a feature or column. Attributes represent characteristics of instances.
Instance
Also called an example or row. An instance is a set of values, one for each attribute.
Attribute Types
Numeric, Nominal, Ordinal, Cardinal, Interval, Dichotomous.
Data Preparation
This involves the assembly, integration, cleaning, and transformation of the data for analysis.
Feature Engineering
The process of transforming raw data into the attributes most suitable for the data mining problem to be solved, including selecting which attributes to use.
Classification Rule
This is an if-then rule: the antecedent (the if part) is a condition, and the consequent (the then part) is the resulting action or categorisation.
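As a rough illustration, an if-then rule maps directly onto an if statement. The attribute names, the threshold, and the default class below are hypothetical.

```python
def classify(instance):
    """Apply one hypothetical classification rule to an instance (a dict)."""
    # Antecedent: a conjunction of conditions on attribute values.
    if instance["outlook"] == "sunny" and instance["humidity"] > 75:
        # Consequent: the class assigned when the antecedent holds.
        return "no"
    # Assumed default class when the rule does not fire.
    return "yes"


print(classify({"outlook": "sunny", "humidity": 80}))
```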
Classification
This is a type of data mining task where instances are assigned class labels. It’s a supervised learning task, as labelled training data is required to create classification rules and methods.
Regression
It’s similar to classification, but predicts a numeric value rather than a discrete class. Regression is considered a supervised method, as the actual values for the training data need to be known in order to predict values for the test or validation data.
Clustering
This is an unsupervised learning method that assigns data into clusters based on how similar they are. Often distance metrics are used to assign data points into clusters.
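A sketch of the distance-based assignment step only, assuming the cluster centres are already known and that Euclidean distance is the chosen metric. A full clustering algorithm would also have to find the centres; the points below are invented.

```python
import math


def assign_to_clusters(points, centres):
    """Return, for each point, the index of its nearest centre."""
    return [
        min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))
        for point in points
    ]


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0)]
centres = [(0.0, 0.0), (10.0, 10.0)]   # assumed to be given
labels = assign_to_clusters(points, centres)
print(labels)
```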
Association Rule
This is a rule learned without specific classes or labels. It involves looking at combinations of antecedents and consequents to see whether the presence of the antecedent is a strong predictor of the consequent.
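One common way to quantify "strong predictor" is with support and confidence. These measures are not named on the card, so treat this as an assumed illustration; the transactions are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here bread appears in three baskets and bread with milk in two, so the rule "bread implies milk" holds in two out of three cases.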
Process of Data Mining
- Objective Specification: define the data mining problem type (supervised, unsupervised).
- Data Exploration: visualise the data and confirm that the objective can be achieved with the dataset.
- Data Cleaning: fix any problems with the data and confirm there’s enough data for the analysis.
- Model Building: select and construct an appropriate model for the data. The data types of the attributes must be considered.
- Model Evaluation: measure the model’s accuracy and its ability to generalise to new data. Overfitting must be avoided.
- Repeat: data mining is iterative and some steps or sequences of steps need to be repeated.
Decision Table
This table shows which actions to perform based on the given conditions. It has a set of attributes and a decision label for each unique set of attribute values. These tables must be exhaustive.
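A decision table can be sketched as a lookup keyed by attribute values, with a check that every combination is covered. The attributes and decisions below are hypothetical.

```python
from itertools import product

# Hypothetical attribute domains.
outlooks = ["sunny", "rainy"]
windy = [True, False]

# One decision label for each unique combination of attribute values.
table = {
    ("sunny", True): "play",
    ("sunny", False): "play",
    ("rainy", True): "stay in",
    ("rainy", False): "play",
}


def is_exhaustive(table, *domains):
    """True if the table has an entry for every combination of values."""
    return all(combo in table for combo in product(*domains))


print(is_exhaustive(table, outlooks, windy))
print(table[("rainy", True)])
```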
Decision Tree
The nodes are specific decisions, usually binary, and the branches represent the possible alternatives. The very top node is the root node. The terminal nodes are leaves.
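A minimal sketch of a decision tree as nested dictionaries, with prediction done by walking from the root to a leaf. The tree below is invented for illustration, and the dictionary layout is only one possible representation.

```python
# Internal nodes are dicts that test one attribute; leaves are class labels.
tree = {
    "attribute": "outlook",                      # root node
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {True: "no", False: "yes"},
        },
        "rainy": "no",                           # leaf
        "overcast": "yes",                       # leaf
    },
}


def predict(node, instance):
    """Follow the branch matching the instance's value until a leaf is reached."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


print(predict(tree, {"outlook": "sunny", "windy": False}))
```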
Missing Value Problem
For decision trees, it’s unclear which branch to follow if the attribute value being tested is missing. Possible solutions include ignoring all instances with missing values, filling in missing values with the most common value for that attribute, and making a probabilistic choice for each missing value based on the other instances.
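A sketch of the second option, filling in missing values with the most common value. It assumes missing values are stored as None and that at least one value is present; the data is made up.

```python
from collections import Counter


def fill_most_common(instances, attribute):
    """Return copies of the instances with missing values of `attribute`
    replaced by the attribute's most common observed value."""
    present = [row[attribute] for row in instances if row[attribute] is not None]
    most_common = Counter(present).most_common(1)[0][0]
    return [
        {**row, attribute: most_common} if row[attribute] is None else dict(row)
        for row in instances
    ]


rows = [
    {"outlook": "sunny"},
    {"outlook": "sunny"},
    {"outlook": "rainy"},
    {"outlook": None},      # missing value
]
filled = fill_most_common(rows, "outlook")
print(filled[3])
```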
Function Tree
Each node tests a function of multiple attribute values.
Regression Tree
These trees predict numerical values, with each node having branches on values of attributes or values of functions of attributes. The leaves are predicted values for the corresponding instance.
Model Trees
Each leaf holds a regression equation that predicts a numeric output value. Model trees are more complex than both linear regression and regression trees.
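To contrast the two kinds of tree, here is a hand-written sketch with a single split. The split point and all coefficients are made-up numbers, not fitted values.

```python
def regression_tree_predict(x):
    """Regression tree: each leaf stores a constant predicted value."""
    if x < 5:
        return 10.0
    return 20.0


def model_tree_predict(x):
    """Model tree: each leaf stores a linear regression equation."""
    if x < 5:
        return 2.0 * x + 1.0
    return 0.5 * x + 15.0


print(regression_tree_predict(3), model_tree_predict(3))
```

Both functions split on the same condition; only what sits in the leaf differs.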
Learning Rule
Rules are learned incrementally: as more instances are added to the training set, new rules are added and existing rules are refined. Refining a rule may add conjunctive clauses to the antecedent of an existing classification or association rule.
Linear Model
A linear model computes a weighted sum of attribute values, all of which must be numeric.
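The weighted sum can be sketched in a few lines. The weights, the bias term, and the attribute values below are hypothetical; in practice the weights would be learned from training data.

```python
def linear_predict(values, weights, bias=0.0):
    """Weighted sum of numeric attribute values plus an optional bias term."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per attribute value")
    return bias + sum(w * v for w, v in zip(weights, values))


print(linear_predict([1.0, 2.0], [0.5, -1.0], bias=3.0))  # 0.5 - 2.0 + 3.0 = 1.5
```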
Instance-Based Representations
Instances are memorised, and the distance from each new input to the stored instances is measured. The stored instance with the smallest distance is returned as the output. This is good for tasks that involve matching inputs against an existing database. A model doesn’t need to be computed.
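A minimal nearest-instance lookup, assuming numeric attributes and Euclidean distance. The stored instances and their labels are invented.

```python
import math

# "Training" is just storing the instances: (attribute values, label).
memory = [
    ((1.0, 1.0), "a"),
    ((5.0, 5.0), "b"),
    ((9.0, 1.0), "c"),
]


def nearest(query):
    """Return the stored instance with the smallest distance to the query."""
    return min(memory, key=lambda item: math.dist(query, item[0]))


print(nearest((1.5, 0.5)))
```

No model is built ahead of time; all the work happens when a query arrives.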
Hamming Distance
Used for measuring the distance between instances with nominal attributes. For each attribute, the distance is 0 if the values match and 1 if they don’t; the total distance is the sum over all attributes.
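The per-attribute rule above can be sketched as a count of mismatches. The example instances are made up.

```python
def hamming(a, b):
    """Number of attributes on which two instances disagree."""
    if len(a) != len(b):
        raise ValueError("instances must have the same number of attributes")
    # Each comparison contributes 0 for a match and 1 for a mismatch.
    return sum(x != y for x, y in zip(a, b))


first = ("sunny", "hot", "high")
second = ("sunny", "mild", "normal")
print(hamming(first, second))  # 2: the last two attributes differ
```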