Discrete & Continuous Data Flashcards
Types of Attribute
Continuous
Ordinal
Nominal
Instances aren’t labelled
Unsupervised ML
Not enough instances are labelled
Semi-supervised ML
Instances are all labelled
Supervised ML
Instances are ordered
Sequence learning
Nominal learners
NB
1-R
DT
Continuous learners
KNN
NP
SVM
Nominal Attributes, but Numeric Learner
(1) For k-NN and NP: Hamming distance
(2) Randomly assign numbers to attribute values
• If scale is constant between attributes, this is not as bad an idea as it sounds! (But still undesirable)
• Worse with higher-arity attributes (more attribute values)
• Imposes an attribute ordering which may not exist
(3) One-hot encoding
Hamming distance
Between two strings of equal length: the minimum number of substitutions required to change one string into the other, i.e. the minimum number of errors that could have transformed one string into the other
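A minimal Python sketch of this definition (the function name is illustrative):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Two instances described by nominal attributes, differing in one value:
hamming_distance(["sunny", "hot", "high"], ["sunny", "mild", "high"])  # 1
```

It works equally on strings: `hamming_distance("karolin", "kathrin")` gives 3.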
One-hot encoding
If a nominal attribute takes m values, replace it with m Boolean attributes.
Example:
hot = [1, 0, 0]
mild = [0, 1, 0]
cool = [0, 0, 1]
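The mapping above can be sketched in Python (names are illustrative):

```python
def one_hot(domain, value):
    """Replace one nominal value with m Boolean attributes (m = len(domain))."""
    return [1 if v == value else 0 for v in domain]

temperature = ["hot", "mild", "cool"]
one_hot(temperature, "mild")  # [0, 1, 0]
```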
Pros & Cons of one-hot encoding
Pro: solves the problem of using nominal attributes with a continuous learner
Con: massively increases the feature space (one new attribute per value)
Numeric Attributes, but Nominal Learner
(1) NB
(2) DT
(3) 1-R
Discretization
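One common discretisation strategy is equal-width binning; a minimal sketch (the bin count k is an assumption, and degenerate ranges where all values are equal are not handled):

```python
def equal_width_bins(values, k):
    """Map each numeric value to one of k equal-width bins (a nominal label)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp the maximum value into the last bin, index k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

equal_width_bins([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)  # [0, 0, 1, 1, 2, 2]
```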
Types of Naive Bayes
• Multivariate NB: attributes are nominal, and can take any (fixed) number of values
• Binomial (or Bernoulli) NB: attributes are binary (special case of multivariate)
• Multinomial NB: attributes are natural numbers, corresponding to frequencies
• Gaussian NB: attributes are real numbers; use a probability density function
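For Gaussian NB, each class-conditional likelihood comes from the normal probability density function; a minimal sketch:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density N(x; mu, sigma^2), used as P(x | class) in Gaussian NB."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Density at the mean of a standard normal:
gaussian_pdf(0.0, 0.0, 1.0)  # ≈ 0.3989
```

In practice mu and sigma are estimated per attribute, per class, from the training data.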
Numeric attributes for DT
(1) Binarisation
(2) Range
Binarisation
Each node is labelled with an attribute a_k and has two branches: one branch is a_k ≤ m, the other is a_k > m.
Info Gain/Gain Ratio must be calculated for each non-trivial "split point" for each attribute.
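A sketch of the split-point search using Info Gain (the names and the choice of midpoints between distinct sorted values as candidates are assumptions; Gain Ratio would further divide by the split info):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Evaluate Info Gain of the binary test a_k <= m vs a_k > m at each
    candidate split point m, and return the best (m, gain)."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best_gain, best_m = 0.0, None
    for lo, hi in zip(distinct, distinct[1:]):
        m = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= m]
        right = [l for v, l in zip(values, labels) if v > m]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - weighted > best_gain:
            best_gain, best_m = base - weighted, m
    return best_m, best_gain

best_split([1, 2, 8, 9], ["no", "no", "yes", "yes"])  # (5.0, 1.0)
```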