lecture 3 - feature engineering Flashcards

1
Q

types of feature engineering in the time domain

A
  1. numerical
  2. categorical
  3. mixed

we can look at a combination of values in a window and base the prediction on that

2
Q

numerical time domain

A
  • summarize values of a numerical attribute in a certain window
  • mean, max, min, std
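
A minimal sketch of this kind of window aggregation with pandas; the column name acc_x, the toy values, and the window size of 4 points are assumptions, not from the lecture.

```python
import pandas as pd

# assumed example: a numerical sensor column 'acc_x' sampled at regular intervals
df = pd.DataFrame({'acc_x': [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.2, 0.1]})

# summarize the values inside a rolling window of 4 points (window size = lambda + 1)
window = 4
for func in ['mean', 'max', 'min', 'std']:
    df[f'acc_x_{func}_ws{window}'] = df['acc_x'].rolling(window).agg(func)

print(df)
```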
3
Q

small step sizes

A
  • the smaller the window size, the closer you’ll be to the original data
  • small window size = more data
4
Q

categorical time domain

A
  • generate patterns that combine categorical values over time
  • pattern types: succession and co-occurrence
  • we consider support to identify patterns that occur frequently enough to be considered significant
5
Q

support Θ

A

support(pattern) = [number of time points at which the pattern occurs within the window, summed over the N instances] / [N - λ]
–> the pattern is counted both at the time point itself and in its historical window

  • we extend patterns with sufficient support ( > Θ) to more complex patterns of size k
  • k = number of rows considered as a pattern within a window
  • k grows iteratively: the bigger k, the more complex the pattern. a bigger k also decreases the support.
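
A rough sketch of how the support of a single co-occurrence pattern could be computed; the activity labels, the window size, and the pattern itself are made-up assumptions.

```python
import pandas as pd

# assumed example: one categorical attribute per time point (e.g. activity labels)
activities = pd.Series(['sit', 'walk', 'walk', 'run', 'walk', 'run', 'sit', 'run'])

lam = 2                      # historical window size lambda: rows t-lam .. t
N = len(activities)

def pattern_occurs(values):
    # co-occurrence pattern of size k = 2: 'walk' and 'run' both appear in the window
    return 'walk' in values and 'run' in values

# count the time points (those with a full history) at which the pattern occurs
occurrences = sum(
    pattern_occurs(set(activities.iloc[t - lam:t + 1]))
    for t in range(lam, N)
)

support = occurrences / (N - lam)    # support = occurrences / (N - lambda)
print(support)
```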
6
Q

frequency domain

A

considers periodic behavior

7
Q

frequency domain: fourier transformation

A
  • any sequence of measurements can be represented by a combination of sinusoid functions
  • find which frequencies are present in the window (frequency decomposition)
8
Q

fourier transformation steps

A
  1. assume a base frequency
  2. compute a frequency per second for each value of k
  3. find the amplitudes a(k) associated with the k different frequencies - multiplying each amplitude with its sinusoid function and summing them gives the original signal again.
  4. the summation formula expresses the original signal (x^i_t) as a sum of the sinusoids weighted by their amplitudes
  5. get feature values (highest amplitude frequency, frequency weighted signal average, power spectrum entropy)
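
A minimal sketch of steps 1-4 with numpy's FFT, under assumed values for the sampling rate and window size; the synthetic signal is only there to have something to decompose.

```python
import numpy as np

# assumed values: sampling rate of 10 samples per second, window size lambda = 40
fs = 10
lam = 40
t = np.arange(lam + 1) / fs                      # lambda + 1 = 41 time points
signal = np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.sin(2 * np.pi * 3.0 * t)

# steps 1-2: the k-th frequency is k * fs / (lambda + 1) Hz, for k = 0 .. lambda
freqs = np.arange(lam + 1) * fs / (lam + 1)

# steps 3-4: amplitudes a(k) via the fast fourier transform; the sinusoids
# weighted by these coefficients sum back to the original signal
coefficients = np.fft.fft(signal)
amplitudes = np.abs(coefficients)

# step 5: these amplitudes are then aggregated into feature values (see later cards)
print(freqs[:5], amplitudes[:5])
```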
9
Q

base frequency

A
  • this is the lowest frequency that completes exactly one full sinusoid period within the window
  • we can look at k multiples of this base frequency
  • k * f_0 represents the k-th multiple of the base frequency
  • k is directly related to the number of full sinusoid periods within the given window
  • k runs from 0 to λ
10
Q

frequency per second

A
  • f(k) = (k * Nsec) / (λ + 1), where Nsec is the number of samples per second
  • output in Hz
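
A small worked example under assumed values (λ = 40, a sampling rate of 10 samples per second):

```python
lam, n_sec = 40, 10                  # assumed: lambda = 40 and 10 samples per second
k = 5
freq_hz = (k * n_sec) / (lam + 1)    # k-th multiple of the base frequency
print(freq_hz)                       # ~1.22 Hz; k = 1 gives the base frequency ~0.24 Hz
```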
11
Q

how many frequencies do we need to cover the entire range of frequencies from the base frequency up to the highest multiple of it?

A

λ + 1 different frequencies

12
Q

finding best values for a(k) for a given window of time points

A

best done with fast fourier transform

13
Q

definition: highest amplitude frequency

A

feature that identifies the frequency that has the highest amplitude in the signal

14
Q

definition: frequency weighted signal

A

feature that calculates a weighted average of the frequencies, where the weights are the amplitudes of the frequencies

  • frequencies with high amplitudes get more weight
15
Q

definition: power spectrum entropy

A
  • quantifies the amount of information in the signal
  • calculates the intensity of each frequency component
  • checks whether one or a few discrete frequencies are standing out
  • high value = more complex signal
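
A sketch of how these three aggregate features could be computed from the frequencies and amplitudes of one window (see the earlier fourier sketch); the freqs and amplitudes arrays here are made-up example values.

```python
import numpy as np

# assumed inputs: frequencies and amplitudes of one window, as produced by an FFT
freqs = np.array([0.00, 0.24, 0.49, 0.73, 0.98])
amplitudes = np.array([0.1, 2.0, 0.3, 0.8, 0.2])

# 1. highest amplitude frequency: the frequency whose amplitude is largest
highest_amp_freq = freqs[np.argmax(amplitudes)]

# 2. frequency weighted signal average: the amplitudes act as weights on the frequencies
freq_weighted_avg = np.sum(freqs * amplitudes) / np.sum(amplitudes)

# 3. power spectrum entropy: normalize the power spectrum into a distribution and take
#    its entropy; a few dominant frequencies give a low value, a complex signal a high one
psd = amplitudes ** 2
psd_norm = psd / np.sum(psd)
power_spectrum_entropy = -np.sum(psd_norm * np.log(psd_norm))

print(highest_amp_freq, freq_weighted_avg, power_spectrum_entropy)
```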
16
Q

unstructured data: preprocessing pipeline for text data

A
  1. tokenization
  2. lower case
  3. stemming
  4. stop word removal
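
One possible implementation of this pipeline, sketched with NLTK; the example sentence, the Porter stemmer, and the English stop word list are assumptions, not prescribed by the lecture.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('punkt')        # tokenizer models
nltk.download('stopwords')    # stop word lists

text = "The patients were walking. Walking felt easier than running."

tokens = nltk.word_tokenize(text)                                # 1. tokenization
tokens = [t.lower() for t in tokens]                             # 2. lower case
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]                       # 3. stemming
stop = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stop]    # 4. stop word removal

print(tokens)
```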
17
Q

tokenization

A

identify sentences and words within sentences

18
Q

lower case

A

change the uppercase letters to lowercase

19
Q

stemming

A

identify the stem of each word, reducing words to their stem so that all different variations of, e.g., a verb map to a single term

20
Q

stop word removal

A

remove known stop words as they are not likely to be predictive

21
Q

features/approaches for text data

A
  1. bag of words
  2. TF-IDF (term frequency - inverse document frequency)
  3. topic modeling
22
Q

bag of words

A
  • count occurrences of n-grams within the text, irrespective of order; this count is the value of the attribute
  • does not account for the uniqueness of words: some words occur more frequently than others, and low-frequency words are often more predictive than words that appear in every text
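
A minimal bag-of-words sketch with scikit-learn's CountVectorizer; the toy documents and the choice to include 1-grams and 2-grams are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# assumed toy corpus; each document becomes one instance
docs = ["patient walks every day",
        "patient runs every other day",
        "runs and walks"]

# count 1-grams and 2-grams, irrespective of where they occur in the text
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(counts.toarray())    # these counts are the attribute values per document
```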
23
Q

TF-IDF (term frequency - inverse document frequency)

A
  • does account for uniqueness of words
  • gives more weight to unique words and prevents very frequent words from becoming too dominant
  1. TF = number of occurrences of a certain n-gram (= the value a^j_i)
  2. IDF_j = log([total number of documents or instances in question] / [number of documents that contain the n-gram])
  3. tf_idf = TF * IDF
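
A small sketch that follows the formulas above literally (scikit-learn's TfidfVectorizer uses a slightly smoothed variant); the toy documents and the chosen n-gram are assumptions.

```python
import math

# assumed toy corpus of already tokenized documents
docs = [["walk", "run", "walk"],
        ["walk", "sit"],
        ["run", "run", "sit"]]
N = len(docs)

term = "run"          # the n-gram (here a 1-gram) we compute the score for
doc = docs[0]

tf = doc.count(term)                              # TF: occurrences of the n-gram in doc
n_containing = sum(1 for d in docs if term in d)  # documents that contain the n-gram
idf = math.log(N / n_containing)                  # IDF = log(N / n_containing)
tf_idf = tf * idf

print(tf, idf, tf_idf)    # IDF is 0 when the n-gram occurs in every document
```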
24
Q

IDF score

A
  • 0 = word occurs in all documents
  • high score = word is more rare/unique
25
Q

topic modeling

A

instead of looking at a lot of words, look at the topics the free text is about

  • we assume that the text contains k topics
  • for each of the k topics, every word has a certain weight
  • a topic is defined as a combination of words that have weights assigned to them -> set of weights for each word in the corpus
26
Q

scoring topics

A

topic_k(i) = sum over the m attributes of (count of word m in document i × weight of word m in topic k)

-> for all k topics, a weight is assigned to each and every word in the corpus
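
A minimal sketch of this scoring formula; the vocabulary, the counts, and the topic weights are made-up example values.

```python
import numpy as np

# assumed example: word counts for one document i and weights for k = 2 topics
vocab = ["walk", "run", "sleep"]
word_counts_doc_i = np.array([3, 1, 0])       # counts of each word m in document i

topic_weights = np.array([
    [0.6, 0.3, 0.1],                          # weights of the words in topic 0
    [0.1, 0.1, 0.8],                          # weights of the words in topic 1
])

# topic_k(i) = sum over words m of (count of word m in document i * weight of m in topic k)
topic_scores = topic_weights @ word_counts_doc_i
print(topic_scores)                           # [2.1, 0.4]: the document leans to topic 0
```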

27
Q

overlap (problem, consequence, solution, implication)

A
  • if we use a large window size and shift it by a small step, the next instance will not be very different from the current one
  • the consequence is that the learning process slows down, the data contains many uninformative (near-duplicate) instances, and there is a risk of data leakage
  • this is avoided by allowing only a certain percentage of overlap between consecutive windows (typically 50%)
  • this means we have fewer remaining instances than if we had used 90% overlap
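
A rough sketch of how window start positions could be generated for a given overlap percentage; the window size, the number of data points, and the 50% value are assumptions.

```python
# assumed values: window size lambda = 40 (41 points per window), 200 data points
lam = 40
n_points = 200
overlap = 0.5                               # allow 50% overlap between windows
step = int((lam + 1) * (1 - overlap))       # step size between window starts

starts = list(range(0, n_points - lam, step))
print(len(starts), starts)                  # fewer instances than a 90% overlap would give
```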
28
Q

fourier transformation: λ

A
  • λ defines the window [x_{t-λ},…, x_t]
  • λ = 40 means 41 data points
29
Q

n-gram

A
  • n consecutive words
  • n represents the number of words we consider as a single unit/attribute
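
A minimal sketch of extracting n-grams from a token list; the example tokens are made up.

```python
# extract n-grams (n consecutive words) from a list of tokens
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "patient", "walks", "fast"]       # assumed example tokens
print(ngrams(tokens, 2))    # [('the', 'patient'), ('patient', 'walks'), ('walks', 'fast')]
```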
30
Q

fourier transform: feature values

A
  • every time a fourier transformation is done, you end up with amplitudes.
  • these amplitudes are aggregated into three values
  1. highest amplitude frequency
  2. frequency weighted signal average
  3. power spectrum entropy
31
Q

TF-IDF score

A

gives more weight to n-grams that are unique

32
Q

definition of ‘topic’

A
  • a topic is defined as a combination of words.
  • a topic is represented by assigning each of these words a certain weight.
33
Q

how to find topics

A
  • latent dirichlet allocation (LDA)
  • LDA assumes texts are generated with a poisson distribution for the number of words, and a dirichlet distribution over topics
  • we assume words belong fully to a single topic initially
  • we then update the weights to maximize the probability of observing the texts
  • i.e., topics are found based on distributions of words and topics
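
A sketch of finding topics with scikit-learn's LatentDirichletAllocation on a toy corpus; the documents, the number of topics k = 2, and the random_state are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# assumed toy corpus; in practice LDA runs on the preprocessed (tokenized, stemmed) text
docs = ["patient walks daily", "patient runs and walks",
        "sleep quality was poor", "poor sleep after running"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)            # bag-of-words counts per document

k = 2                                              # assume the corpus contains k topics
lda = LatentDirichletAllocation(n_components=k, random_state=0)
doc_topic = lda.fit_transform(counts)              # topic scores per document

# lda.components_ holds, per topic, a weight for every word in the corpus
words = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[::-1][:3]]
    print("topic", topic_idx, top_words)
```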