Lecture 7 - Features Flashcards
What are the 4 stages of data pre-processing?
- Data cleaning
- Data integration
- Data reduction
- Data transformation
What are features?
Features, also called attributes, are defined as mappings from the instance space to the feature domain.
What are the three main categories of feature statistics?
- Statistics of central tendency
- Statistics of dispersion
- Shape statistics
What are the 3 main statistics of central tendency?
- mean
- median
- mode
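A minimal sketch of the three statistics using Python's standard `statistics` module. The sample values are made up for illustration.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # made-up sample, not from the lecture

mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```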
What are the 2 statistics of dispersion?
- Variance sigma^2
- Standard deviation sigma
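As a small illustration, the population variance and standard deviation can be computed with the standard library. The sample is made up, and the population form (dividing by n) is an assumption here; `statistics.variance` and `statistics.stdev` give the sample form (dividing by n - 1).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample with mean 5

variance = statistics.pvariance(values)  # average squared deviation from the mean
std_dev = statistics.pstdev(values)      # square root of the variance

print(variance, std_dev)
```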
What are other statistics of dispersion?
- range
- midrange point (strictly a central-tendency statistic, used together with the range)
- quantiles
- interquartile range
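The statistics listed above could be computed roughly as follows. The data are made up, and `statistics.quantiles` uses its default "exclusive" interpolation method; other quantile conventions give slightly different quartiles.

```python
import statistics

values = [1, 3, 4, 6, 7, 9, 12, 20]  # made-up sample

value_range = max(values) - min(values)         # largest minus smallest value
midrange = (max(values) + min(values)) / 2      # mean of the two extremes
q1, q2, q3 = statistics.quantiles(values, n=4)  # the three quartiles
iqr = q3 - q1                                   # interquartile range

print(value_range, midrange, (q1, q2, q3), iqr)
```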
The ____ is more sensitive to outliers than the ____
median or mean
mean
median
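A quick way to see this is to replace one value with an extreme outlier and compare how far each statistic moves. The numbers are made up for illustration.

```python
import statistics

clean = [10, 11, 12, 13, 14]
with_outlier = clean[:-1] + [1000]  # replace the largest value with an outlier

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(mean_shift, median_shift)  # the mean moves a lot, the median does not move
```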
What is skewness?
Skewness is defined as m_3/sigma^3, where m_3 is the third central moment and sigma is the standard deviation. A positive value of skewness means that the distribution is right-skewed, i.e. the right tail is longer than the left tail. Negative skewness indicates the opposite.
What is kurtosis?
Kurtosis is defined as m_4/sigma^4, where m_4 is the fourth central moment. People often use excess kurtosis, m_4/sigma^4 - 3, because the normal distribution has kurtosis 3. Positive excess kurtosis means that the distribution is more sharply peaked than the normal distribution.
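A minimal sketch of both shape statistics, assuming the k-th central moment is the average of (x - mean)^k over the sample (population form). The function names and sample data are illustrative only.

```python
import math

def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    sigma = math.sqrt(central_moment(values, 2))
    return central_moment(values, 3) / sigma ** 3

def excess_kurtosis(values):
    sigma = math.sqrt(central_moment(values, 2))
    return central_moment(values, 4) / sigma ** 4 - 3

right_skewed = [1, 1, 2, 2, 3, 10]  # long right tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))      # positive
print(skewness(symmetric))         # zero
print(excess_kurtosis(symmetric))  # negative: flatter than a normal distribution
```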
When can structured features be constructed?
- prior to learning the model
- during learning the model
What is normalisation?
From quantitative to quantitative
Adapts the scale of quantitative features.
What is calibration?
From ordinal, categorical and boolean to quantitative
Adds a scale to features that don’t have one
What is discretisation?
- from quantitative to ordinal
- from quantitative to categorical
What is ordering?
- from ordinal to ordinal
- from categorical to ordinal
- from boolean to ordinal
What is unordering?
from ordinal to categorical
What is grouping?
from categorical to categorical
What is thresholding?
- from quantitative to boolean
- from ordinal to boolean
What is binarisation?
from categorical to boolean
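One common way to binarise, sketched below, is to create one boolean feature per category value. The `binarise` helper and the `is_<value>` naming are hypothetical, chosen only for this example.

```python
def binarise(values):
    """Turn a categorical feature into one boolean feature per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": v == c for c in categories} for v in values]

colours = ["red", "blue", "red"]  # made-up categorical feature
for row in binarise(colours):
    print(row)
```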
Define thresholding in words, not as a table.
Thresholding transforms a quantitative or an ordinal feature into a boolean feature by finding a feature value to split on.
How do we set the threshold for thresholding?
- Supervised thresholding: use the class labels to pick the split that performs best, e.g. the one that optimises a criterion such as information gain
- Unsupervised thresholding: use central tendency statistics like the mean or median
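The two ways of choosing a threshold might be sketched like this. The supervised variant here scores candidate splits by training accuracy and assumes larger values indicate the positive class; that criterion, the function names and the data are assumptions for illustration, not the lecture's method.

```python
import statistics

def threshold(values, t):
    """Boolean feature: True where the value exceeds the threshold."""
    return [x > t for x in values]

def unsupervised_threshold(values):
    """Ignore labels and split on the median."""
    return statistics.median(values)

def supervised_threshold(values, labels):
    """Try midpoints between consecutive distinct values and keep the
    one whose split agrees with the boolean labels most often."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def accuracy(t):
        return sum((x > t) == y for x, y in zip(values, labels)) / len(values)

    return max(candidates, key=accuracy)

ages = [18, 22, 25, 31, 40, 52, 60]                       # made-up feature
buys = [False, False, False, False, True, True, True]    # made-up labels

print(unsupervised_threshold(ages))
print(supervised_threshold(ages, buys))
```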
Describe discretisation
Discretisation transforms a quantitative feature into an ordinal feature by creating bins, where each bin is an interval.
Name and explain 2 types of discretisation
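- supervised: uses class labels; top-down (divisive) methods work by progressively splitting bins, bottom-up (agglomerative) methods work by progressively merging bins
- unsupervised: ignores class labels; equal-frequency discretisation (each bin holds roughly the same number of instances) or equal-width discretisation (each bin spans an interval of the same width)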
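A minimal sketch of two unsupervised binning schemes: equal-width bins, and equal-frequency bins as a common alternative. Function names and data are illustrative. The equal-width version assumes the values are not all identical, and the equal-frequency version breaks ties by rank, so equal values can land in different bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((x - lo) / width), n_bins - 1) for x in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins so that each holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

values = [1, 2, 3, 4, 5, 50]  # made-up data with one extreme value

print(equal_width_bins(values, 3))      # the extreme value stretches the bins
print(equal_frequency_bins(values, 3))  # two values per bin
```

With skewed data, equal-width binning can leave most bins empty, which is one reason to consider the equal-frequency variant.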
Define normalisation
Feature normalisation neutralises the effect of different quantitative features being measured on different scales.
Give two formulas with which we can normalise data
- min-max: x' = (x - min) / (max - min)
- z-scores: z = (x - mu) / sigma
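Both normalisations can be sketched in a few lines. The helper names and the data are made up. The z-score version uses the population standard deviation, which is an assumption; both helpers assume the values are not all equal.

```python
import statistics

def min_max(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_scores(values):
    """Centre on the mean and scale by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

heights_cm = [150, 160, 170, 180, 190]  # made-up feature values

print(min_max(heights_cm))
print(z_scores(heights_cm))
```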
What is PCA?
Principal component analysis is a feature-construction technique. It works by computing the principal components and using them to perform a change of basis on the data.
Can PCA be performed on quantitative features?
yes
What is the idea of PCA?
The idea of PCA is to find correlations between features and create new features that can be represented as linear combinations of the original features.
In PCA, the sum of squared distances of the projected points from the origin is called the ____ of that principal component
eigenvalue
What are principal components?
Principal components are new features constructed as linear combinations of the original features.
Give 2 approaches to extract principal components
- Singular value decomposition
- eigendecomposition
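To make the eigendecomposition route concrete, here is a sketch for 2-D data only, using the closed-form eigenvalues of a symmetric 2x2 covariance matrix. The function name and data are made up, and the sample covariance (dividing by n - 1) is an assumption. Real use would normally rely on a linear algebra library instead.

```python
import math

def pca_2d(points):
    """Return (eigenvalues, (pc1, pc2), centred points) for 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centred) / (n - 1)
    b = sum(x * y for x, y in centred) / (n - 1)
    c = sum(y * y for _, y in centred) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (half_trace + root, half_trace - root)

    # Unit eigenvector belonging to the largest eigenvalue
    if b != 0:
        vx, vy = b, eigenvalues[0] - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)
    pc2 = (-pc1[1], pc1[0])  # orthogonal to pc1
    return eigenvalues, (pc1, pc2), centred

points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
eigenvalues, (pc1, pc2), centred = pca_2d(points)

# Change of basis: project the centred data onto the first component
projected = [x * pc1[0] + y * pc1[1] for x, y in centred]
print(eigenvalues, pc1)
```

Under this convention, the sum of squared projections onto the first component equals (n - 1) times its eigenvalue, so the two differ only by that scaling factor.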
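How does singular value decomposition work?
It factorises the mean-centred data matrix X as X = U S V^T. The rows of V^T are the principal components, with one entry per original feature.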
What is imputation?
Imputation is the process of filling in missing data.
Name 3 imputation techniques
- Mean imputation
- Regression imputation
- Expectation maximisation
What is mean imputation?
Replace each missing value with the per-class mean, median, or mode of that feature.
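A minimal sketch of per-class mean imputation, assuming missing values are encoded as `None` and every class has at least one observed value. The function name and data are made up.

```python
import statistics
from collections import defaultdict

def impute_per_class_mean(values, classes):
    """Replace None with the mean of the observed values in the same class."""
    observed = defaultdict(list)
    for v, c in zip(values, classes):
        if v is not None:
            observed[c].append(v)
    class_means = {c: statistics.mean(vs) for c, vs in observed.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]

values = [1.0, None, 3.0, 10.0, None, 14.0]  # made-up feature with gaps
classes = ["a", "a", "a", "b", "b", "b"]

print(impute_per_class_mean(values, classes))
```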
What is regression imputation?
A regression model is estimated to predict the observed values of a variable based on other variables. The fitted model is then used to fill in the values where that variable is missing.
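As an illustration, the sketch below fits a simple least-squares line on the complete cases and uses it to fill the gaps. A single predictor, a linear model, `None` as the missing-value marker, and the names and data are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(xs, ys):
    """Fit on rows where y is observed, then predict y where it is None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, None, 8.0, 10.0]  # made-up data lying on y = 2x

print(regression_impute(xs, ys))
```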
What is expectation maximisation?
Assuming a multivariate model over all features, use the observed values for maximum-likelihood estimation of the model parameters, then derive expectations for the unobserved feature values, and iterate.
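The loop described above can be sketched for the simplest case: a bivariate Gaussian over two features, where x is fully observed and some y values are missing (encoded as `None`). The model choice, the starting values, the fixed iteration count and all names are assumptions for illustration.

```python
def em_impute(xs, ys, iterations=100):
    """EM for a bivariate Gaussian with some y values missing (None)."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n

    # Initial parameters from the observed y values only
    observed = [y for y in ys if y is not None]
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0  # start by assuming x and y are uncorrelated

    for _ in range(iterations):
        # E-step: expected y and y^2 under the current parameters
        beta = s_xy / s_xx
        cond_var = s_yy - s_xy ** 2 / s_xx
        exp_y, exp_y2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                e = mu_y + beta * (x - mu_x)
                exp_y.append(e)
                exp_y2.append(e * e + cond_var)
            else:
                exp_y.append(y)
                exp_y2.append(y * y)
        # M-step: maximum-likelihood re-estimate of the parameters
        mu_y = sum(exp_y) / n
        s_yy = sum(exp_y2) / n - mu_y ** 2
        s_xy = sum(x * e for x, e in zip(xs, exp_y)) / n - mu_x * mu_y
    return exp_y

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, None]  # made-up data lying on y = 2x

completed = em_impute(xs, ys)
print(completed)  # the last entry converges towards the value on the line
```

Each pass refines the parameter estimates using the current expectations, so the imputed value moves step by step from the initial guess (the observed mean of y) towards the value implied by the relationship with x.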