03_data and features Flashcards
What is data?
- Info output by a sensing device or organ
- includes both useful and irrelevant or redundant info
- must be processed to be meaningful
- info in digital form that can be transmitted or processed
- factual information (eg measurements or statistics)
- used as a basis for reasoning, discussion or calculation
What is the pipeline/process associated with data? (3 steps)
1) data acquisition
2) data storage (used to be a bottleneck)
3) data analysis
Is all existing data technically accessible for analysis?
No, most of it is privately owned
What are two types of data?
1) structured data
2) unstructured data
What is structured data?
preprocessed and formatted data that is easily queryable
eg quantitative data in a table
most data analysis techniques require data to be available in a structured form for easier processing
How is structured data represented?
Always in a database schema
(eg a table in 2 dimensions)
What is unstructured data?
Unprocessed and unformatted data that is not easily queryable
eg qualitative data, textual data, image data, data stream, audio data, video data (with increasing data complexity)
What is quantitative data?
can be measured,
distances can be defined
What are two kinds of quantitative data?
1) continuous data
2) discrete data
What is continuous data?
real-valued numbers;
potentially within a given range
eg
- temperatures
- a person’s height
- prices
What is discrete data?
countable values;
typically whole numbers (integers);
potentially within a given range
eg
- number of people in a room
- inventory counts
What is qualitative (categorical) data?
cannot be measured,
distances not defined
What are two types of qualitative data?
1) nominal data
2) ordinal data
What is nominal data?
Labels for different categories
without ordering
eg
- color of hair
- names of persons
- types of fruit
What is ordinal data?
Labels for different categories
following an inherent ranking scheme
eg
- rank in a competition
- grades
- day of the week
What is feature engineering?
Turning unstructured data into structured data
Why do we need feature engineering?
Before ML methods can be applied to unstructured data, we have to process it and extract useful features from it
What are features?
features are quantitative and independent variables
based on which our ML model learns
What is the process of data analysis?
1) raw data (qualitative)
- feature engineering
2) features (quantitative), x
- input
3) ML model, f(x)
- output
4) target (supervised setup), y
What does feature engineering do?
extract or create features that may provide an ML model with rich info on its task, based on domain knowledge
can be applied to raw data, resulting in quantitative data that can be directly fed into the ML model (features)
What is feature engineering for quantitative data?
create meaningful features through mathematical transformations
What are examples for mathematical transformations for feature engineering with quantitative data?
- arithmetic:
eg the difference of two variables
- aggregation of features:
eg aggregating two business units to one overall result
- geometric transformations:
eg vector calculations to identify common wind speed patterns
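The three kinds of transformation can be sketched as follows; the variable names (revenue of two business units, wind vector components) are hypothetical examples:

```python
import math

# Hypothetical raw quantitative features for one sample.
revenue_unit_a = 120.0
revenue_unit_b = 80.0
wind_u, wind_v = 3.0, 4.0  # wind vector components (east, north)

# Arithmetic: the difference of two variables as a new feature.
revenue_gap = revenue_unit_a - revenue_unit_b

# Aggregation: combine two business units into one overall result.
revenue_total = revenue_unit_a + revenue_unit_b

# Geometric transformation: wind speed as the magnitude of the wind vector.
wind_speed = math.hypot(wind_u, wind_v)
```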
What is feature engineering for qualitative data?
since qualitative/categorical data cannot be fed into ML models directly, they have to be turned into quantitative data first
What are two methods to feature engineer qualitative data?
- label encoding
- one-hot encoding
What is label encoding?
type of feature engineering (qualitative data)
ordinal (ranked) data –> discrete quantitative data
ranking/order of the classes is conserved in a discrete numerical schema and a “distance” can be defined
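A minimal sketch of label encoding, assuming a hypothetical three-level grade scale; the mapping preserves the ranking, so a distance between encoded values is defined:

```python
# Ordinal categories with an inherent ranking (hypothetical grade scale).
grades = ["good", "excellent", "poor", "good"]

# The mapping conserves the order of the classes in a discrete numerical schema.
order = {"poor": 0, "good": 1, "excellent": 2}
encoded = [order[g] for g in grades]
# encoded == [1, 2, 0, 1]
```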
What is one-hot encoding?
type of feature engineering (qualitative data)
nominal (unranked) data –> binary coding of labels
for each possible class in a feature, a binary feature is introduced, only those features that match have a value of one
–> multi-class features are also possible
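A minimal sketch of one-hot encoding, assuming a hypothetical nominal "fruit" feature with three classes; one binary feature per class, and only the matching one is set to 1:

```python
# Nominal categories without ordering (hypothetical fruit types).
classes = ["apple", "banana", "cherry"]
samples = ["banana", "apple", "cherry"]

def one_hot(value, classes):
    # One binary feature per possible class; only the matching one is 1.
    return [1 if value == c else 0 for c in classes]

encoded = [one_hot(s, classes) for s in samples]
# encoded == [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```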
Can one-hot encoding be used with a huge number of classes?
if too many classes present, use label encoding instead
–> curse of dimensionality
How can feature engineering be done to image data?
all pixels are considered in a greyscale way with
0 being a black pixel
maximum value being a white pixel
or 8-bit encoding: 0 is black, 255 is white
What is special for color-images with feature engineering?
RGB: three channels (red green blue)
each channel is a grayscale image in itself
What are methods for feature engineering of color-images? (5)
1) concatenate all channels (RGB) and feed stack into the model
(only works for eg CNNs because they can deal with 2D data)
2) linearize channels and concatenate vectors
(spatial info is somewhat lost, works for models that expect linear input data eg MLPs, K-NNs etc)
3) build a histogram for each channel
(spatial info is fully lost)
4) Visual bag-of-words
(use clusters as features and count their frequencies)
5) histogram of oriented gradients (HOG)
(for each cell, create histogram of gradients as feature)
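Methods 2 and 3 can be sketched on a tiny hypothetical 2x2 RGB image (pixel values are made up for illustration):

```python
# A tiny hypothetical 2x2 RGB image: three channels of 8-bit grayscale values.
red   = [[0, 255], [128, 64]]
green = [[10, 20], [30, 40]]
blue  = [[5, 5], [5, 5]]
channels = [red, green, blue]

# Method 2: linearize each channel and concatenate the vectors
# (spatial info is somewhat lost; suits models expecting linear input).
vector = [px for ch in channels for row in ch for px in row]

# Method 3: a (very coarse, 2-bin) histogram per channel
# (spatial info is fully lost).
def histogram(ch, threshold=128):
    pixels = [px for row in ch for px in row]
    dark = sum(1 for px in pixels if px < threshold)
    return [dark, len(pixels) - dark]

hists = [histogram(ch) for ch in channels]
```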
Which of the methods for feature engineering of image data works best?
depends on task and data set
What is the final data set nomenclature?
Features/attributes (x)
f(x) = y
Targets/Labels, (y) - Ground-Truth
What type of data are weight and height?
continuous
What type of data is “wings”, true or false?
binary
What type of data is number of legs?
discrete
What type of data is “cuteness”?
ordinal
What type of data is “type = bird/cat/dog etc”?
categorical (multi-class)
What are “bird/cat/dog” labels for the ground-truth?
Classes of label “type”
What is data scaling?
to linearly transform your data in order to normalize them
Why do we need to scale data?
1) many ML models are based on a notion of “distance” between samples;
improperly scaled data may jeopardize the learning capability of such models
2) some ML models intrinsically presume that data are distributed following a Gaussian fashion with similar variances along all features;
high variance along one feature leads to bias
How do we scale data? (2)
- normalize feature variance
(to give similar weights to the different features)
- normalize feature mean values
(assumed by a number of ML models)
What is the MinMax scaler?
scale every feature onto a range from 0 to 1 based on the minimum and maximum of the underlying distribution
xi’ = (xi - min(Xi)) / (max(Xi) - min(Xi))
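The formula above can be sketched in a few lines of plain Python (the input values are a hypothetical feature column):

```python
def minmax_scale(xs):
    # x_i' = (x_i - min(X)) / (max(X) - min(X)); maps the feature onto [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

scaled = minmax_scale([10.0, 20.0, 30.0])
# scaled == [0.0, 0.5, 1.0]
```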
What is a disadvantage of the MinMax scaler?
is sensitive to outliers and
does not center the distribution at the origin
What are different scalers?
- MinMax Scaler
- standard Scaler
- robust scaler
What is the Standard scaler?
scale every feature to zero mean and unit variance based on the mean and standard deviation of the underlying distribution
xi’ = (xi - mean(Xi)) / standarddeviation(Xi)
this scaling is centered onto the origin but still prone to outliers to some extent
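A minimal sketch of the standard scaler, again on a hypothetical feature column; after scaling, the feature has mean 0 and standard deviation 1:

```python
import math

def standard_scale(xs):
    # x_i' = (x_i - mean(X)) / std(X); centers the feature at the origin
    # with unit variance (population standard deviation used here).
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

scaled = standard_scale([2.0, 4.0, 6.0])
# the scaled values sum to 0 and the middle value lands on the mean
```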