Week 1: Introduction to Data Mining Flashcards
Data Mining
The goal of data mining is to extract information from data. It can be automated, it can extract meaningful and useful patterns from the data, and it can be used to make predictions. It works on existing data and doesn’t require generating new data.
Pattern
A regularity in the data that repeats in a recognisable way, which makes it possible to make non-trivial predictions about new data.
Datasets
These can be described as 2-D tables consisting of attributes and instances. CSV is a common file type for datasets. Instances are assumed to be independent, but attributes might be related.
In the real world, datasets can be massive, noisy, biased, or insufficient for analysis, and they can have missing values.
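As a minimal sketch of the table view described above, the snippet below reads a small CSV string with Python's standard `csv` module. The column names, the values, and the convention that an empty field means a missing value are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV content: the header row names the attributes,
# and every following row is one instance.
raw = """outlook,temperature,play
sunny,30,no
rainy,18,yes
overcast,,yes
"""

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)              # instances (one dict per row)
attributes = reader.fieldnames   # attribute names from the header

# In this sketch an empty string is treated as a missing value.
missing = [
    (index, name)
    for index, row in enumerate(rows)
    for name, value in row.items()
    if value == ""
]

print(attributes)
print(len(rows))
print(missing)
```

Note that `csv` reads every value as a string, so numeric attributes such as the temperature would still need converting before analysis.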
Attribute
Also called a feature or a column. Attributes represent characteristics of instances.
Instance
Also called an example or a row. An instance is a set of values, one for each attribute.
Attribute Types
Numeric, Nominal, Ordinal, Cardinal, Interval, Dichotomous.
Data Preparation
This involves assembly, integration, cleaning, and transformation of the data for analysis.
Feature Engineering
The process of transforming raw data and selecting the attributes most suitable for the data mining problem to be solved.
Classification Rule
An if-then rule in which the antecedent (the "if" part) is a condition and the consequent (the "then" part) is the resulting action or categorisation.
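A classification rule maps directly onto an `if` statement. The sketch below is illustrative only: the attribute names (`outlook`, `humidity`), the threshold, and the class labels are assumptions, not part of the original flashcards.

```python
def classify(instance):
    """Apply one hypothetical if-then classification rule."""
    # Antecedent: a condition on the instance's attribute values.
    if instance["outlook"] == "sunny" and instance["humidity"] > 75:
        # Consequent: the class label assigned when the condition holds.
        return "no"
    # Default label used when the antecedent does not hold.
    return "yes"


print(classify({"outlook": "sunny", "humidity": 80}))
print(classify({"outlook": "rainy", "humidity": 80}))
```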
Classification
This is a type of data mining task where instances are assigned class labels. It’s a supervised learning task, as labelled training data is required to create classification rules and methods.
Regression
It’s similar to classification, but the value to predict is a continuous number rather than a discrete class. Regression is considered a supervised method, as the actual target values of the training data need to be known in order to predict values for the test or validation data.
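One simple form of regression fits a straight line to training data by least squares. The sketch below assumes a single numeric attribute and uses made-up training values; it is one possible illustration, not the only regression method.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data with known target values (supervised).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(train_x, train_y)

# Predict a numeric value for an unseen instance.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```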
Clustering
This is an unsupervised learning method that groups instances into clusters based on how similar they are. Distance metrics are often used to decide which cluster a data point belongs to.
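The distance-based assignment mentioned above can be sketched as follows. This shows only a single assignment step with fixed, hypothetical cluster centres and Euclidean distance; a full clustering algorithm such as k-means would also update the centres and repeat.

```python
import math


def assign_to_clusters(points, centres):
    """Return, for each point, the index of its nearest centre."""
    return [
        min(range(len(centres)), key=lambda k: math.dist(point, centres[k]))
        for point in points
    ]


# Hypothetical 2-D data points and cluster centres.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centres = [(1.0, 1.0), (9.0, 9.0)]

labels = assign_to_clusters(points, centres)
print(labels)
```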
Association Rule
This is a method that doesn’t involve specific classes or labels. It looks at combinations of antecedents and consequents to see whether the presence of the antecedent is a strong predictor that the consequent will also appear.
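How strongly an antecedent predicts a consequent is commonly measured with support and confidence. The flashcard does not name these measures, so treat them as an assumed choice; the transactions below are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also
    contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule: if a basket contains bread, then it contains milk.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```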
Process of Data Mining
- Objective Specification: define the type of data mining problem (supervised or unsupervised).
- Data Exploration: visualise the data and confirm that the objective can be achieved with the dataset.
- Data Cleaning: fix any problems with the data and confirm there’s enough data for the analysis.
- Model Building: select and construct an appropriate model for the data. The data types of the attributes must be considered.
- Model Evaluation: measure the performance of the model, both its accuracy and its ability to generalise to new data. Overfitting must be avoided.
- Repeat: data mining is iterative and some steps or sequences of steps need to be repeated.
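The Model Evaluation step can be sketched by comparing accuracy on training data with accuracy on held-out data. Both "models" below are hypothetical: one memorises the training instances and the other applies a simple threshold. The data and the split are made up for illustration.

```python
def accuracy(predict, instances):
    """Fraction of (value, label) pairs that predict labels correctly."""
    correct = sum(predict(value) == label for value, label in instances)
    return correct / len(instances)


# Hypothetical labelled data, split into training and held-out test sets.
data = [(1, "low"), (2, "low"), (3, "low"), (7, "high"),
        (8, "high"), (9, "high"), (4, "low"), (6, "high")]
train, test = data[:6], data[6:]

# Model A memorises the training instances and guesses "low" otherwise.
memory = dict(train)


def memoriser(value):
    return memory.get(value, "low")


# Model B uses a general threshold rule.
def threshold_rule(value):
    return "high" if value >= 5 else "low"


for name, model in [("memoriser", memoriser), ("threshold", threshold_rule)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

A large gap between training and test accuracy, as with the memorising model here, is a typical sign of overfitting.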
Decision Table
This table shows which actions to perform based on the given conditions. It has a set of attributes and a decision label for each unique set of attribute values. These tables must be exhaustive, meaning every possible combination of attribute values has a decision.
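A decision table can be sketched as a dictionary keyed by combinations of attribute values, together with a check that every combination is covered. The conditions and actions below are hypothetical.

```python
from itertools import product

# Hypothetical conditions and their possible values.
conditions = {
    "raining": [True, False],
    "have_umbrella": [True, False],
}

# One decision for each combination of (raining, have_umbrella).
decision_table = {
    (True, True): "walk",
    (True, False): "take bus",
    (False, True): "walk",
    (False, False): "walk",
}


def is_exhaustive(table, conditions):
    """Check that every combination of condition values has a decision."""
    return all(combo in table for combo in product(*conditions.values()))


def decide(raining, have_umbrella):
    return decision_table[(raining, have_umbrella)]


print(is_exhaustive(decision_table, conditions))
print(decide(True, False))
```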