03_data and features Flashcards

1
Q

What is data?

A

-Info output by sensing device or organ
-includes both useful and irrelevant or redundant info
-must be processed to be meaningful

-info in digital form that can be transmitted or processed

-factual information (eg measurements or statistics)
-used as a basis for reasoning, discussion or calculation

2
Q

What is the pipeline/process associated with data? (3 steps)

A

1) data acquisition
2) data storage (used to be a bottleneck)
3) data analysis

3
Q

Is all existing data technically accessible for analysis?

A

No, most of it is privately owned

4
Q

What are two types of data?

A

1) structured data
2) unstructured data

5
Q

What is structured data?

A

preprocessed and formatted data that is easily queryable

eg quantitative data in a table

most data analysis techniques require data to be available in a structured form for easier processing

6
Q

How is structured data represented?

A

Always in a database schema
(eg a table in 2 dimensions)

7
Q

What is unstructured data?

A

Unprocessed and unformatted data that is not easily queryable

eg qualitative data, textual data, image data, data stream, audio data, video data (with increasing data complexity)

8
Q

What is quantitative data?

A

can be measured,
distances can be defined

9
Q

What are two kinds of quantitative data?

A

1) continuous data
2) discrete data

10
Q

What is continuous data?

A

real-valued numbers;
potentially within a given range

eg
- temperatures
- a person’s height
- prices

11
Q

What is discrete data?

A

distinct, countable values;
usually whole numbers, but can also be real numbers restricted to separate steps;
potentially within a given range

eg
- number of people in a room
- inventory counts

12
Q

What is qualitative (categorical) data?

A

cannot be measured,
distances not defined

13
Q

What are two types of qualitative data?

A

1) nominal data
2) ordinal data

14
Q

What is nominal data?

A

Labels for different categories
without ordering

eg
- color of hair
- names of persons
- types of fruit

15
Q

What is ordinal data?

A

Labels for different categories
following an inherent ranking scheme

eg
- rank in a competition
- grades
- day of the week

16
Q

What is feature engineering?

A

Turning unstructured data into structured data

17
Q

Why do we need feature engineering?

A

Before ML methods can be applied to unstructured data, the data has to be processed and useful features extracted from it

18
Q

What are features?

A

features are quantitative, independent variables
from which our ML model learns

19
Q

What is the process of data analysis?

A

1) raw data (qualitative)
- feature engineering
2) features (quantitative), x
- input
3) ML model, f(x)
- output
4) target (supervised setup), y

20
Q

What does feature engineering do?

A

extract or create features, based on domain knowledge, that may provide an ML model with rich info on its task

can be applied to raw data, resulting in quantitative data that can be directly fed into the ML model (features)

21
Q

What is feature engineering for quantitative data?

A

create meaningful features through mathematical transformations

22
Q

What are examples for mathematical transformations for feature engineering with quantitative data?

A
  • arithmetic:
    eg for difference of two variables
  • aggregation of features:
    eg for aggregation of two business units to one overall result
  • geometric transformations:
    eg to identify common wind speed patterns, vector-calculations
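
The three transformation types above can be sketched as follows. This is only an illustration with made-up values; the variable names (`temp_inside`, `revenue_unit_a`, `wind_u` and so on) are hypothetical and not part of the cards.

```python
import math

# Hypothetical raw measurements (illustrative values only)
temp_inside, temp_outside = 21.5, 9.0
revenue_unit_a, revenue_unit_b = 120.0, 80.0
wind_u, wind_v = 3.0, 4.0  # east-west and north-south wind components

# Arithmetic: difference of two variables
temp_difference = temp_inside - temp_outside

# Aggregation: combine two business units into one overall result
revenue_total = revenue_unit_a + revenue_unit_b

# Geometric: vector calculation turning components into speed and direction
wind_speed = math.hypot(wind_u, wind_v)
wind_direction_deg = math.degrees(math.atan2(wind_v, wind_u))

print(temp_difference, revenue_total, wind_speed)  # 12.5 200.0 5.0
```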
23
Q

What is feature engineering for qualitative data?

A

since qualitative/categorical data cannot be fed into ML models directly, they have to be turned into quantitative data first

24
Q

What are two methods to feature engineer qualitative data?

A
  • label encoding
  • one-hot encoding
25
Q

What is label encoding?

A

type of feature engineering (qualitative data)

ordinal (ranked) data –> discrete quantitative data

ranking/order of the classes is conserved in a discrete numerical schema and a “distance” can be defined
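
A minimal sketch of label encoding, assuming a hypothetical ordinal feature whose category order (`GRADE_ORDER`) is chosen purely for illustration:

```python
# Hypothetical ordinal feature: the category order is an assumption
GRADE_ORDER = ["poor", "fair", "good", "excellent"]

# Map each category to its rank so that the ordering is preserved
grade_to_code = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}


def label_encode(values):
    """Replace each ordinal label with its integer rank."""
    return [grade_to_code[v] for v in values]


encoded = label_encode(["good", "poor", "excellent"])
print(encoded)  # [2, 0, 3]
```

Because the codes follow the ranking, differences such as `3 - 0` can be read as a rough "distance" between classes.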

26
Q

What is one-hot encoding?

A

type of feature engineering (qualitative data)

nominal (unranked) data –> binary coding of labels

for each possible class in a feature, a binary feature is introduced; for a given sample, only the binary feature that matches its class has a value of one, all others are zero

–> multi-class features are also possible
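
A minimal sketch of one-hot encoding in plain Python; the helper name `one_hot_encode` and the fruit classes are hypothetical examples:

```python
def one_hot_encode(values, classes):
    """Return one binary vector per value, with a single 1 at the matching class."""
    return [[1 if value == cls else 0 for cls in classes] for value in values]


# Hypothetical nominal feature with three classes
FRUIT_CLASSES = ["apple", "banana", "cherry"]

encoded = one_hot_encode(["banana", "apple"], FRUIT_CLASSES)
print(encoded)  # [[0, 1, 0], [1, 0, 0]]
```

Each class adds one column, so the number of features grows with the number of classes.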

27
Q

Can one-hot encoding be used with a huge number of classes?

A

if too many classes are present, use label encoding instead

–> curse of dimensionality

28
Q

How can feature engineering be done to image data?

A

each pixel is treated as a greyscale intensity, with
0 being a black pixel and
the maximum value being a white pixel

eg 8-bit encoding: 0 is black, 255 is white

29
Q

What is special for color-images with feature engineering?

A

RGB: three channels (red green blue)

each channel is a grayscale image in itself

30
Q

What are methods for feature engineering of color-images? (5)

A

1) concatenate all channels (RGB) and feed stack into the model
(only works for eg CNNs because they can deal with 2D data)

2) linearize channels and concatenate vectors
(spatial info is partly lost; works for models that expect flat vector input, eg MLPs, k-NN etc)

3) build a histogram for each channel
(spatial info is fully lost)

4) Visual bag-of-words
(use clusters as features and count their frequencies)

5) histogram of oriented gradients (HOG)
(for each cell, create histogram of gradients as feature)
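
Methods 2 and 3 can be sketched without any image library. The tiny 2x2 image, the helper names, and the choice of 4 histogram bins are all assumptions made for illustration:

```python
# Hypothetical 2x2 RGB image: image[row][col] = (red, green, blue), 8-bit values
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def split_channels(img):
    """Return three flat lists (red, green, blue) in row-major order."""
    pixels = [pixel for row in img for pixel in row]
    return [[pixel[c] for pixel in pixels] for c in range(3)]


def linearize(img):
    """Method 2: flatten each channel and concatenate the vectors."""
    red, green, blue = split_channels(img)
    return red + green + blue


def channel_histogram(channel, bins=4, max_value=256):
    """Method 3: count pixel intensities per bin; pixel positions are discarded."""
    width = max_value // bins
    counts = [0] * bins
    for value in channel:
        counts[min(value // width, bins - 1)] += 1
    return counts


flat_features = linearize(image)
hist_features = [channel_histogram(ch) for ch in split_channels(image)]

print(len(flat_features))  # 12
print(hist_features)       # [[2, 0, 0, 2], [2, 0, 0, 2], [2, 0, 0, 2]]
```

The histograms are identical for all three channels here even though the channels differ spatially, which illustrates why this method loses spatial information.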

31
Q

Which of the methods for feature engineering of image data works best?

A

depends on task and data set

32
Q

What is the final data set nomenclature?

A

Features/attributes (x)

f(x) = y

Targets/Labels, (y) - Ground-Truth

33
Q

What type of data are weight and height?

A

continuous

34
Q

What type of data is “wings”, true or false?

A

binary

35
Q

What type of data is number of legs?

A

discrete

36
Q

What type of data is “cuteness”?

A

ordinal

37
Q

What type of data is “type = bird/cat/dog etc”?

A

categorical (multi-class)

38
Q

What are “bird/cat/dog” labels for the ground-truth?

A

Classes of label “type”

39
Q

What is data scaling?

A

linearly transforming the data in order to normalize it

40
Q

Why do we need to scale data?

A

1) many ML models are based on a notion of “distance” between samples;
improperly scaled data may jeopardize the learning capability of such models

2) some ML models intrinsically presume that the data follow a Gaussian distribution with similar variances along all features;
high variance along one feature leads to bias
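
A small sketch of the first point, using three hypothetical samples with two features on very different scales (the values and units are made up):

```python
import math

# Hypothetical samples: (height in metres, income in euros per year)
a = (1.60, 30000.0)
b = (1.95, 30100.0)  # very different height, similar income
c = (1.61, 31000.0)  # nearly identical height, somewhat different income

# Euclidean distance on unscaled data: the income axis dominates
dist_ab = math.dist(a, b)
dist_ac = math.dist(a, c)

print(round(dist_ab, 2), round(dist_ac, 2))  # 100.0 1000.0
```

The height difference between `a` and `b` contributes almost nothing to the distance, so a distance-based model would effectively ignore that feature unless the data are scaled.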

41
Q

How do we scale data? (2)

A
  • normalize feature variance
    (to give similar weights to the different features)
  • normalize feature mean values
    (assumed by a number of ML models)
42
Q

What is the MinMax scaler?

A

scale every feature onto a range from 0 to 1 based on the minimum and maximum of the underlying distribution

xi’ = (xi - min(Xi)) / (max(Xi) - min(Xi))
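
The formula can be sketched for a single feature as below. The function name `min_max_scale` is hypothetical, and mapping a constant feature to 0.0 is an assumption made here to avoid division by zero:

```python
def min_max_scale(values):
    """Scale a list of numbers onto [0, 1] using its minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0 to avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10.0, 15.0, 20.0])
print(scaled)  # [0.0, 0.5, 1.0]
```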

43
Q

What is a disadvantage of the MinMax scaler?

A

is sensitive to outliers and
does not center the distribution on the origin

44
Q

What are different scalers?

A
  • MinMax scaler
  • Standard scaler
  • Robust scaler
45
Q

What is the Standard scaler?

A

scale every feature to zero mean and unit standard deviation, based on the mean and standard deviation of the underlying distribution (the result is not bounded to a fixed range)

xi’ = (xi - mean(Xi)) / std(Xi)

this scaling is centered on the origin but still sensitive to outliers to some extent
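
A minimal sketch of the formula for a single feature. The function name `standard_scale` is hypothetical, and using the population standard deviation is an assumption, since the cards do not say which variant is meant:

```python
import statistics


def standard_scale(values):
    """Centre a feature on zero and divide by its standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: an assumption
    if std == 0:
        # Assumption: a constant feature is mapped to 0.0
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Mean is 5.0 and population standard deviation is 2.0 for these values
scaled = standard_scale([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The scaled values have mean 0 and standard deviation 1; individual values such as 2.0 lie more than one standard deviation away from the mean.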