lecture 3 - feature engineering Flashcards
types of feature engineering in the time domain
- numerical
- categorical
- mixed
we can look at a combination of values in a window and base the prediction on that
numerical time domain
- summarize values of a numerical attribute in a certain window
- mean, max, min, std
small window and step sizes
- the smaller the window size, the closer the summarized values stay to the original data
- the smaller the step size (i.e., the more the windows overlap), the more instances we get
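The windowed summaries above can be sketched in numpy; the function name and the window/step parameters are illustrative, not from the lecture:

```python
import numpy as np

def window_summaries(values, window_size, step_size):
    """Summarize a numerical attribute per (possibly overlapping) window."""
    feats = []
    for start in range(0, len(values) - window_size + 1, step_size):
        w = values[start:start + window_size]
        feats.append({"mean": np.mean(w), "max": np.max(w),
                      "min": np.min(w), "std": np.std(w)})
    return feats

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# step size 1 gives maximal overlap, hence the most instances
print(window_summaries(series, window_size=4, step_size=1))
```

with step_size equal to window_size the windows do not overlap and far fewer instances result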
categorical time domain
- generate patterns that combine categorical values over time
- pattern types: succession and co-occurrence
- we consider support to identify patterns that occur frequently enough to be considered significant
support Θ
- support(pattern) = [number of time points, over all N instances, at which the pattern occurs within the historical window] / [N - λ]
- a pattern counts if it occurs at the time point itself or anywhere in its historical window
- we extend patterns with sufficient support ( > Θ) to more complex patterns of size k
- k = number of rows considered as a pattern within a window
- k grows as patterns are extended; the bigger k, the more complex the pattern. bigger k also decreases support.
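A minimal sketch of computing support for a succession pattern, assuming a pattern counts when it occurs anywhere in the historical window of size λ (function and variable names are illustrative):

```python
def pattern_support(labels, pattern, lam):
    """Support of a succession pattern: the fraction of the N - lam time
    points whose historical window [t - lam, t] contains the pattern."""
    n, k = len(labels), len(pattern)
    hits = 0
    for t in range(lam, n):
        window = labels[t - lam:t + 1]
        # does the pattern occur as a contiguous succession in the window?
        if any(tuple(window[i:i + k]) == pattern
               for i in range(len(window) - k + 1)):
            hits += 1
    return hits / (n - lam)

labels = ["sit", "walk", "run", "walk", "run", "sit"]
print(pattern_support(labels, ("walk", "run"), lam=2))  # occurs in every window
print(pattern_support(labels, ("run", "sit"), lam=2))   # only in the last window
```

extending the 2-pattern to a 3-pattern (k = 3) can only keep or lower these values, which is why low-support patterns need not be extended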
frequency domain
considers periodic behavior
frequency domain: fourier transformation
- any sequence of measurements can be represented by a combination of sinusoid functions
- find which frequencies are present in the window (frequency decomposition)
fourier transformation steps
- assume a base frequency
- compute a frequency per second for each value of k
- find the amplitudes a(k) associated with the k different frequencies; multiplying each amplitude with its sinusoid function and summing them recovers the original signal
- the summation formula expresses the original signal (x^i_t) as a sum of the amplitude-weighted sinusoids
- get feature values (highest amplitude frequency, frequency weighted signal average, power spectrum entropy)
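The steps above can be illustrated with numpy's FFT: the coefficients the transform returns are the amplitudes a(k), and summing the amplitude-weighted sinusoids back (via the inverse transform) reproduces the window. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
window = rng.normal(size=8)               # one window of λ + 1 = 8 measurements

coeffs = np.fft.rfft(window)              # one amplitude per frequency k
reconstructed = np.fft.irfft(coeffs, n=len(window))

# summing the amplitude-weighted sinusoids recovers the original signal
print(np.allclose(window, reconstructed))  # True
```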
base frequency
- this is the lowest frequency with a complete sinusoid in it
- we can look at k multiples of this base frequency
- k * f_0 represents the k-th multiple of the base frequency
- k is directly related to the number of full sinusoid periods within the given window
- k runs from 0 to λ
frequency per second
- f(k) = (k * N_sec) / (λ + 1), where N_sec is the number of samples per second
- output in Hz
how many frequencies do we need to cover the entire range of frequencies from the base frequency up to the highest multiple of it?
λ + 1 different frequencies
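The mapping from k to a frequency in Hz as a tiny sketch (parameter names are my own; N_sec is the sampling rate in samples per second):

```python
def frequency_hz(k, n_sec, lam):
    """Frequency in Hz of the k-th multiple of the base frequency,
    for a window of lam + 1 samples taken at n_sec samples per second."""
    return k * n_sec / (lam + 1)

# e.g. 10 samples per second and λ = 9 (a 10-sample window):
# k runs from 0 to λ, covering λ + 1 = 10 different frequencies
print([frequency_hz(k, n_sec=10, lam=9) for k in (0, 1, 9)])  # [0.0, 1.0, 9.0]
```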
finding best values for a(k) for a given window of time points
best done with fast fourier transform
definition: highest amplitude frequency
feature that identifies the frequency that has the highest amplitude in the signal
definition: frequency weighted signal
feature that calculates a weighted average of the frequencies, where the weights are the amplitudes of the frequencies
- frequencies with high amplitudes get more weight
definition: power spectrum entropy
- quantifies the amount of information in the signal
- calculates the intensity of each frequency component
- checks whether one or a few discrete frequencies are standing out
- high value = more complex signal
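The three frequency-domain features defined above can be sketched with numpy (the function name and the small entropy smoothing constant are my own choices):

```python
import numpy as np

def frequency_features(window, n_sec):
    """Highest-amplitude frequency, frequency-weighted signal average,
    and power spectrum entropy for one window of samples."""
    amplitudes = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / n_sec)  # frequencies in Hz

    highest = freqs[np.argmax(amplitudes)]               # dominant frequency
    # weighted average of the frequencies, amplitudes as weights
    weighted = np.sum(freqs * amplitudes) / np.sum(amplitudes)

    # normalize the power spectrum into a distribution and take its entropy;
    # one or a few dominant frequencies -> low value, complex signal -> high
    p = amplitudes ** 2 / np.sum(amplitudes ** 2)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return highest, weighted, entropy

t = np.arange(100) / 10.0                  # 10 samples per second
pure = np.sin(2 * np.pi * 2.0 * t)         # a pure 2 Hz sinusoid
print(frequency_features(pure, n_sec=10))
```

for the pure sinusoid the dominant frequency is 2 Hz and the entropy is near zero, since essentially all power sits in one frequency bin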
unstructured data: preprocessing pipeline for text data
- tokenization
- lower case
- stemming
- stop word removal
tokenization
identify sentences and words within sentences
lower case
change the uppercase letters to lowercase
stemming
identify the stem of a word and reduce words to that stem, so that different variations of e.g., verbs map to a single term
stop word removal
remove known stop words as they are not likely to be predictive
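The four steps can be sketched in plain Python; the stemmer and stop word list below are toy stand-ins (a real pipeline would use e.g. NLTK's tokenizer and Porter stemmer):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "is", "was", "she", "to", "of"}  # toy list

def naive_stem(word):
    # toy suffix stripping; a real stemmer handles far more cases
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-zA-Z]+", text)             # tokenization
    tokens = [t.lower() for t in tokens]                # lower case
    tokens = [naive_stem(t) for t in tokens]            # stemming
    return [t for t in tokens if t not in STOP_WORDS]   # stop word removal

print(preprocess("She walked and was walking to the park"))
# ['walk', 'walk', 'park'] - two verb variations map to one term
```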
features/approaches for text data
- bag of words
- TF-IDF (term frequency - inverse document frequency)
- topic modeling
bag of words
- count occurrences of n-grams within the text, irrespective of order; these counts are the attribute values
- does not account for uniqueness of words: some words occur far more frequently than others, and low-frequency words are often more predictive than words that appear in every text
TF-IDF (term frequency - inverse document frequency)
- does account for uniqueness of words
- gives more weight to unique words and prevents very frequent words from becoming too dominant
- TF = number of occurrences of a certain n-gram j in instance i (a number a^j_i)
- IDF_j = log([total number of documents or instances] / [number of documents that contain the n-gram j])
- tf-idf = TF * IDF
IDF score
- 0 = word occurs in all documents
- high score = word is more rare/unique
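Both the bag-of-words counts and the TF-IDF weighting can be sketched together; the input is a list of already-preprocessed token lists and the function name is illustrative:

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: tf-idf} dict per doc."""
    n = len(documents)
    df = Counter()                       # in how many documents each term occurs
    for doc in documents:
        df.update(set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}

    scores = []
    for doc in documents:
        tf = Counter(doc)                # bag-of-words counts a^j_i
        scores.append({term: tf[term] * idf[term] for term in tf})
    return scores

docs = [["sensor", "data", "data"], ["sensor", "noise"]]
# "sensor" occurs in every document, so its IDF (and hence tf-idf) is 0
print(tf_idf(docs))
```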