Discrete & Continuous Data Flashcards
Types of Attribute
Continuous
Ordinal
Nominal
Instances aren’t labelled
Unsupervised ML
Not enough instances are labelled
Semi-supervised ML
Instances are all labelled
Supervised ML
Instances are ordered
Sequence learning
Nominal learners
NB
1-R
DT
Continuous learners
KNN
NP
SVM
Nominal Attributes, but Numeric Learner
(1) For k-NN and NP: Hamming distance
(2) randomly assign numbers to attribute values
• If scale is constant between attributes, this is not as bad an idea as it sounds! (But still undesirable)
• Worse with higher-arity attributes (more attribute values)
• Imposes an attribute ordering which may not exist
(3) one–hot encoding
Hamming distance
the minimum number of substitutions required to change one string into the other (strings of equal length), i.e. the minimum number of errors that could have transformed one string into the other
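A minimal Python sketch of this distance over two instances' attribute values (the function name and example instances are illustrative, not from the cards):

```python
def hamming_distance(a, b):
    """Count the attribute positions where two equal-length instances disagree."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# e.g. two instances over (outlook, temperature, humidity, windy)
print(hamming_distance(["sunny", "hot", "high", "false"],
                       ["sunny", "mild", "high", "true"]))  # 2
```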
one–hot encoding
If a nominal attribute takes m values, replace it with m Boolean attributes
Example:
hot = [1, 0, 0]
mild = [0, 1, 0]
cool = [0, 0, 1]
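A small sketch of this encoding (pure Python; the helper name is made up):

```python
def one_hot(value, values):
    """Replace one nominal value with m Boolean attributes (m = len(values))."""
    return [1 if value == v else 0 for v in values]

temps = ["hot", "mild", "cool"]
print(one_hot("hot", temps))   # [1, 0, 0]
print(one_hot("mild", temps))  # [0, 1, 0]
print(one_hot("cool", temps))  # [0, 0, 1]
```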
Pros & Cons of one-hot encoding
Pro: lets a continuous learner handle nominal attributes
Con: massively increases the feature space (one new attribute per value)
Numeric Attributes, but Nominal Learner
(1) NB
(2) DT
(3) 1-R
Discretisation
Types of Naive Bayes
• Multivariate NB: attributes are nominal, and can take any (fixed) number of values
• Binomial (or Bernoulli) NB: attributes are binary (a special case of multivariate)
• Multinomial NB: attributes are natural numbers, corresponding to frequencies
• Gaussian NB: attributes are real numbers; use the Probability Density Function
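For Gaussian NB, a sketch of the density used as the likelihood (the per-class mean and standard deviation are assumed to have been estimated from the training data; the numbers are illustrative):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of attribute value x under the Gaussian estimated for one class."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# e.g. temperature = 66 for a class with mu = 73, sigma = 6.2
print(gaussian_pdf(66, 73, 6.2))  # ~0.034
```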
Numeric attributes for DT
(1) Binarisation
(2) Range
Binarisation
Each node is labelled with an attribute a_k and has two branches: one branch is a_k ≤ m, the other is a_k > m.
Info Gain/Gain Ratio must be calculated for each non-trivial “split point” for each attribute
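A sketch of that split-point search using entropy-based Info Gain (the attribute values and labels are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Score every non-trivial split point a_k <= m (midpoints between
    sorted distinct values); return the best (info gain, m) pair."""
    xs = sorted(set(values))
    base, n = entropy(labels), len(labels)
    best = (-1.0, None)
    for lo, hi in zip(xs, xs[1:]):
        m = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= m]
        right = [l for v, l in zip(values, labels) if v > m]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        best = max(best, (gain, m))
    return best

print(best_split([64, 65, 68, 69, 70], ["yes", "no", "yes", "yes", "no"]))
```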
Con of Binarisation
leads to arbitrarily large trees
Discretisation
the translation of continuous attributes into nominal attributes
Steps:
- decide on the intervals (how to choose them: out of scope)
- map each continuous value onto a discrete value
Types:
- Unsupervised (does not know/use the class labels)
- Supervised (knows/uses the class labels)
Unsupervised Discretisation
(1) Naive
(2) Equal Size
(3) Equal Frequency
(4) K-Means Clustering
Naive Unsupervised Discretisation
treat each unique value as a discrete nominal value
Pros & Cons of Naive Unsupervised Discretisation
Advantages:
• simple to implement
Disadvantages:
• loss of generality
• no sense of ordering
• describes the training data, but nothing more (overfitting)
Equal Size Unsupervised Discretisation
Identify the upper and lower bounds and partition the overall space into n equal intervals (= equal width)
Example: min = 64, max = 83, n = 3 → width (83 - 64)/3 ≈ 6.3
Intervals: roughly 64-70.3, 70.3-76.7, 76.7-83 (the observed values group as 64-70, 71-75, 80-83)
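A sketch of equal-width binning on values like those above (the value list is illustrative, chosen to match the card's min/max):

```python
def equal_width_bins(values, n):
    """Map each value to its bin index (0..n-1) over n equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [min(int((v - lo) / width), n - 1) for v in values]

values = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83]
print(equal_width_bins(values, 3))
# [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2] -> 64-70 | 71-75 | 80-83
```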
Pros & Cons of Equal Size Unsupervised Discretisation
Advantages:
• simple
Disadvantages:
• badly affected by outliers
• arbitrary n
Equal Frequency Unsupervised Discretisation
sort the values, and identify breakpoints which produce n (roughly) equal-sized partitions = equal frequency
Example (n = 2, 8 sorted instances):
1st bin: 1st-4th instances
2nd bin: 5th-8th instances
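A sketch of equal-frequency binning (sort, then cut into n roughly equal-sized groups; the helper name is made up):

```python
def equal_freq_bins(values, n):
    """Sort the values and split them into n (roughly) equal-sized bins."""
    xs = sorted(values)
    size, rem = divmod(len(xs), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        bins.append(xs[start:end])
        start = end
    return bins

print(equal_freq_bins([71, 64, 69, 75, 70, 65, 72, 68], 2))
# [[64, 65, 68, 69], [70, 71, 72, 75]] -> 1st-4th and 5th-8th instances
```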
Pros & Cons of Equal Frequency Unsupervised Discretisation
Advantages:
• simple
Disadvantages:
• arbitrary n