03_data and features Flashcards
What is data?
- Info output by a sensing device or organ
- includes both useful and irrelevant or redundant info
- must be processed to be meaningful
- info in digital form that can be transmitted or processed
- factual information (eg measurements or statistics)
- used as a basis for reasoning, discussion or calculation
What is the pipeline/process associated with data? (3 steps)
1) data acquisition
2) data storage (used to be a bottleneck)
3) data analysis
Is all existing data technically accessible for analysis?
No, most of it is privately owned
What are two types of data?
1) structured data
2) unstructured data
What is structured data?
preprocessed and formatted data that is easily queryable
eg quantitative data in a table
most data analysis techniques require data to be available in a structured form for easier processing
How is structured data represented?
Always in a database schema
(eg a table in 2 dimensions)
What is unstructured data?
Unprocessed and unformatted data that is not easily queryable
eg qualitative data, textual data, image data, data stream, audio data, video data (with increasing data complexity)
What is quantitative data?
can be measured,
distances can be defined
What are two kinds of quantitative data?
1) continuous data
2) discrete data
What is continuous data?
real-valued numbers;
potentially within a given range
eg
- temperatures
- a person’s height
- prices
What is discrete data?
countable values;
typically whole numbers (integers);
potentially within a given range
eg
- number of people in a room
- inventory counts
What is qualitative (categorical) data?
cannot be measured,
distances not defined
What are two types of qualitative data?
1) nominal data
2) ordinal data
What is nominal data?
Labels for different categories
without ordering
eg
- color of hair
- names of persons
- types of fruit
What is ordinal data?
Labels for different categories
following an inherent ranking scheme
eg
- rank in a competition
- grades
- day of the week
What is feature engineering?
Turning unstructured data into structured data
Why do we need feature engineering?
Before ML methods can be applied to unstructured data, we have to process it and extract useful features from it
What are features?
features are quantitative and independent variables
based on which our ML model learns
What is the process of data analysis?
1) raw data (qualitative)
- feature engineering
2) features (quantitative), x
- input
3) ML model, f(x)
- output
4) target (supervised setup), y
What does feature engineering do?
extract or create features that may provide an ML model with rich info on its task, based on domain knowledge
can be applied to raw data, resulting in quantitative data that can be directly fed into the ML model (features)
What is feature engineering for quantitative data?
create meaningful features through mathematical transformations
What are examples for mathematical transformations for feature engineering with quantitative data?
- arithmetic:
eg the difference of two variables
- aggregation of features:
eg aggregating two business units to one overall result
- geometric transformations:
eg vector calculations to identify common wind speed patterns
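The three kinds of transformation can be sketched as follows; the variable names (revenue of two business units, wind vector components) are hypothetical examples:

```python
import math

# Hypothetical raw quantitative features for one sample.
revenue_unit_a = 120.0
revenue_unit_b = 80.0
wind_u, wind_v = 3.0, 4.0  # wind vector components (east, north)

# Arithmetic: the difference of two variables as a new feature.
revenue_gap = revenue_unit_a - revenue_unit_b

# Aggregation: combine two business units into one overall result.
revenue_total = revenue_unit_a + revenue_unit_b

# Geometric transformation: wind speed as the magnitude of the wind vector.
wind_speed = math.hypot(wind_u, wind_v)
```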
What is feature engineering for qualitative data?
since qualitative/categorical data cannot be fed into ML models directly, they have to be turned into quantitative data first
What are two methods to feature engineer qualitative data?
- label encoding
- one-hot encoding
What is label encoding?
type of feature engineering (qualitative data)
ordinal (ranked) data –> discrete quantitative data
ranking/order of the classes is conserved in a discrete numerical schema and a “distance” can be defined
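A minimal sketch of label encoding, assuming a hypothetical three-level grade scale; the mapping preserves the ranking, so a distance between encoded values is defined:

```python
# Ordinal categories with an inherent ranking (hypothetical grade scale).
grades = ["good", "excellent", "poor", "good"]

# The mapping conserves the order of the classes in a discrete numerical schema.
order = {"poor": 0, "good": 1, "excellent": 2}
encoded = [order[g] for g in grades]
# encoded == [1, 2, 0, 1]
```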
What is one-hot encoding?
type of feature engineering (qualitative data)
nominal (unranked) data –> binary coding of labels
for each possible class in a feature, a binary feature is introduced, only those features that match have a value of one
–> multi-class features are also possible
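A minimal sketch of one-hot encoding, assuming a hypothetical nominal "fruit" feature with three classes; one binary feature per class, and only the matching one is set to 1:

```python
# Nominal categories without ordering (hypothetical fruit types).
classes = ["apple", "banana", "cherry"]
samples = ["banana", "apple", "cherry"]

def one_hot(value, classes):
    # One binary feature per possible class; only the matching one is 1.
    return [1 if value == c else 0 for c in classes]

encoded = [one_hot(s, classes) for s in samples]
# encoded == [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```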
Can one-hot encoding be used with a huge number of classes?
if too many classes present, use label encoding instead
–> curse of dimensionality
How can feature engineering be done to image data?
all pixels are considered in a greyscale way with
0 being a black pixel
maximum value being a white pixel
or 8-bit encoding: 0 is black, 255 is white
What is special for color-images with feature engineering?
RGB: three channels (red green blue)
each channel is a grayscale image in itself
What are methods for feature engineering of color-images? (5)
1) concatenate all channels (RGB) and feed stack into the model
(only works for eg CNNs because they can deal with 2D data)
2) linearize channels and concatenate vectors
(spatial info is somewhat lost, works for models that expect linear input data eg MLPs, K-NNs etc)
3) build a histogram for each channel
(spatial info is fully lost)
4) Visual bag-of-words
(use clusters as features and count their frequencies)
5) histogram of oriented gradients (HOG)
(for each cell, create histogram of gradients as feature)
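Methods 2 and 3 can be sketched on a tiny hypothetical 2x2 RGB image (pixel values are made up for illustration):

```python
# A tiny hypothetical 2x2 RGB image: three channels of 8-bit grayscale values.
red   = [[0, 255], [128, 64]]
green = [[10, 20], [30, 40]]
blue  = [[5, 5], [5, 5]]
channels = [red, green, blue]

# Method 2: linearize each channel and concatenate the vectors
# (spatial info is somewhat lost; suits models expecting linear input).
vector = [px for ch in channels for row in ch for px in row]

# Method 3: a (very coarse, 2-bin) histogram per channel
# (spatial info is fully lost).
def histogram(ch, threshold=128):
    pixels = [px for row in ch for px in row]
    dark = sum(1 for px in pixels if px < threshold)
    return [dark, len(pixels) - dark]

hists = [histogram(ch) for ch in channels]
```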
Which of the methods for feature engineering of image data works best?
depends on task and data set
What is the final data set nomenclature?
Features/attributes (x)
f(x) = y
Targets/Labels, (y) - Ground-Truth
What type of data are weight and height?
continuous
What type of data is “wings”, true or false?
binary
What type of data is number of legs?
discrete
What type of data is “cuteness”?
ordinal
What type of data is “type = bird/cat/dog etc”?
categorical (multi-class)
What are “bird/cat/dog” labels for the ground-truth?
Classes of label “type”
What is data scaling?
to linearly transform your data in order to normalize them
Why do we need to scale data?
1) many ML models are based on a notion of “distance” between samples;
improperly scaled data may jeopardize the learning capability of such models
2) some ML models intrinsically presume that data are distributed following a Gaussian fashion with similar variances along all features;
high variance along one feature leads to bias
How do we scale data? (2)
- normalize feature variance
(to give similar weights to the different features)
- normalize feature mean values
(assumed by a number of ML models)
What is the MinMax scaler?
scale every feature onto a range from 0 to 1 based on the minimum and maximum of the underlying distribution
xi’ = (xi - min(Xi)) / (max(Xi) - min(Xi))
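The formula above can be sketched in a few lines of plain Python (the input values are a hypothetical feature column):

```python
def minmax_scale(xs):
    # x_i' = (x_i - min(X)) / (max(X) - min(X)); maps the feature onto [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

scaled = minmax_scale([10.0, 20.0, 30.0])
# scaled == [0.0, 0.5, 1.0]
```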
What is a disadvantage of the MinMax scaler?
is sensitive to outliers and
does not center the distribution at the origin
What are different scalers?
- MinMax Scaler
- standard Scaler
- robust scaler
What is the Standard scaler?
scale every feature to zero mean and unit variance based on the mean and standard deviation of the underlying distribution
xi’ = (xi - mean(Xi)) / standarddeviation(Xi)
this scaling is centered onto the origin but still prone to outliers to some extent
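A minimal sketch of the standard scaler, again on a hypothetical feature column; after scaling, the feature has mean 0 and standard deviation 1:

```python
import math

def standard_scale(xs):
    # x_i' = (x_i - mean(X)) / std(X); centers the feature at the origin
    # with unit variance (population standard deviation used here).
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

scaled = standard_scale([2.0, 4.0, 6.0])
# the scaled values sum to 0 and the middle value lands on the mean
```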