lecture 3 - feature engineering Flashcards
types of feature engineering in the time domain
- numerical
- categorical
- mixed
we can look at a combination of values in a window and base the prediction on that
numerical time domain
- summarize values of a numerical attribute in a certain window
- mean, max, min, std
small window and step sizes
- the smaller the window size, the closer the summarized values stay to the original data
- the smaller the step size (i.e., the more the windows overlap), the more instances we get
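The windowed summaries above can be sketched in numpy; the function name and the window/step parameters are illustrative, not from the lecture:

```python
import numpy as np

def window_summaries(values, window_size, step_size):
    """Summarize a numerical attribute per (possibly overlapping) window."""
    feats = []
    for start in range(0, len(values) - window_size + 1, step_size):
        w = values[start:start + window_size]
        feats.append({"mean": np.mean(w), "max": np.max(w),
                      "min": np.min(w), "std": np.std(w)})
    return feats

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# step size 1 gives maximal overlap, hence the most instances
print(window_summaries(series, window_size=4, step_size=1))
```

with step_size equal to window_size the windows do not overlap and far fewer instances result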
categorical time domain
- generate patterns that combine categorical values over time
- pattern types: succession and co-occurrence
- we consider support to identify patterns that occur frequently enough to be considered significant
support Θ
- support(pattern) = [number of time points, over all N instances, at which the pattern occurs within the historical window] / [N - λ]
- a pattern counts if it occurs at the time point itself or anywhere in its historical window
- we extend patterns with sufficient support ( > Θ) to more complex patterns of size k
- k = number of rows considered as a pattern within a window
- k grows as patterns are extended; the bigger k, the more complex the pattern. bigger k also decreases support.
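A minimal sketch of computing support for a succession pattern, assuming a pattern counts when it occurs anywhere in the historical window of size λ (function and variable names are illustrative):

```python
def pattern_support(labels, pattern, lam):
    """Support of a succession pattern: the fraction of the N - lam time
    points whose historical window [t - lam, t] contains the pattern."""
    n, k = len(labels), len(pattern)
    hits = 0
    for t in range(lam, n):
        window = labels[t - lam:t + 1]
        # does the pattern occur as a contiguous succession in the window?
        if any(tuple(window[i:i + k]) == pattern
               for i in range(len(window) - k + 1)):
            hits += 1
    return hits / (n - lam)

labels = ["sit", "walk", "run", "walk", "run", "sit"]
print(pattern_support(labels, ("walk", "run"), lam=2))  # occurs in every window
print(pattern_support(labels, ("run", "sit"), lam=2))   # only in the last window
```

extending the 2-pattern to a 3-pattern (k = 3) can only keep or lower these values, which is why low-support patterns need not be extended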
frequency domain
considers periodic behavior
frequency domain: fourier transformation
- any sequence of measurements can be represented by a combination of sinusoid functions
- find which frequencies are present in the window (frequency decomposition)
fourier transformation steps
- assume a base frequency
- compute a frequency per second for each value of k
- find the amplitudes a(k) associated with the k different frequencies; multiplying each amplitude with its sinusoid function and summing them recovers the original signal
- the summation formula expresses the original signal (x^i_t) as a sum of the amplitude-weighted sinusoids
- get feature values (highest amplitude frequency, frequency weighted signal average, power spectrum entropy)
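The steps above can be illustrated with numpy's FFT: the coefficients the transform returns are the amplitudes a(k), and summing the amplitude-weighted sinusoids back (via the inverse transform) reproduces the window. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
window = rng.normal(size=8)               # one window of λ + 1 = 8 measurements

coeffs = np.fft.rfft(window)              # one amplitude per frequency k
reconstructed = np.fft.irfft(coeffs, n=len(window))

# summing the amplitude-weighted sinusoids recovers the original signal
print(np.allclose(window, reconstructed))  # True
```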
base frequency
- this is the lowest frequency with a complete sinusoid in it
- we can look at k multiples of this base frequency
- k * f_0 represents the k-th multiple of the base frequency
- k is directly related to the number of full sinusoid periods within the given window
- k runs from 0 to λ
frequency per second
- f(k) = (k * N_sec) / (λ + 1), where N_sec is the number of samples per second
- output in Hz
how many frequencies do we need to cover the entire range of frequencies from the base frequency up to the highest multiple of it?
λ + 1 different frequencies
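The mapping from k to a frequency in Hz as a tiny sketch (parameter names are my own; N_sec is the sampling rate in samples per second):

```python
def frequency_hz(k, n_sec, lam):
    """Frequency in Hz of the k-th multiple of the base frequency,
    for a window of lam + 1 samples taken at n_sec samples per second."""
    return k * n_sec / (lam + 1)

# e.g. 10 samples per second and λ = 9 (a 10-sample window):
# k runs from 0 to λ, covering λ + 1 = 10 different frequencies
print([frequency_hz(k, n_sec=10, lam=9) for k in (0, 1, 9)])  # [0.0, 1.0, 9.0]
```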
finding best values for a(k) for a given window of time points
best done with fast fourier transform
definition: highest amplitude frequency
feature that identifies the frequency that has the highest amplitude in the signal
definition: frequency weighted signal
feature that calculates a weighted average of the frequencies, where the weights are the amplitudes of the frequencies
- frequencies with high amplitudes get more weight
definition: power spectrum entropy
- quantifies the amount of information in the signal
- calculates the intensity of each frequency component
- checks whether one or a few discrete frequencies are standing out
- high value = more complex signal
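The three frequency-domain features defined above can be sketched with numpy (the function name and the small entropy smoothing constant are my own choices):

```python
import numpy as np

def frequency_features(window, n_sec):
    """Highest-amplitude frequency, frequency-weighted signal average,
    and power spectrum entropy for one window of samples."""
    amplitudes = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / n_sec)  # frequencies in Hz

    highest = freqs[np.argmax(amplitudes)]               # dominant frequency
    # weighted average of the frequencies, amplitudes as weights
    weighted = np.sum(freqs * amplitudes) / np.sum(amplitudes)

    # normalize the power spectrum into a distribution and take its entropy;
    # one or a few dominant frequencies -> low value, complex signal -> high
    p = amplitudes ** 2 / np.sum(amplitudes ** 2)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return highest, weighted, entropy

t = np.arange(100) / 10.0                  # 10 samples per second
pure = np.sin(2 * np.pi * 2.0 * t)         # a pure 2 Hz sinusoid
print(frequency_features(pure, n_sec=10))
```

for the pure sinusoid the dominant frequency is 2 Hz and the entropy is near zero, since essentially all power sits in one frequency bin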
unstructured data: preprocessing pipeline for text data
- tokenization
- lower case
- stemming
- stop word removal
tokenization
identify sentences and words within sentences
lower case
change the uppercase letters to lowercase
stemming
identify the stem of a word and reduce words to that stem, so that different variations of e.g., verbs map to a single term
stop word removal
remove known stop words as they are not likely to be predictive
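The four steps can be sketched in plain Python; the stemmer and stop word list below are toy stand-ins (a real pipeline would use e.g. NLTK's tokenizer and Porter stemmer):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "is", "was", "she", "to", "of"}  # toy list

def naive_stem(word):
    # toy suffix stripping; a real stemmer handles far more cases
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-zA-Z]+", text)             # tokenization
    tokens = [t.lower() for t in tokens]                # lower case
    tokens = [naive_stem(t) for t in tokens]            # stemming
    return [t for t in tokens if t not in STOP_WORDS]   # stop word removal

print(preprocess("She walked and was walking to the park"))
# ['walk', 'walk', 'park'] - two verb variations map to one term
```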
features/approaches for text data
- bag of words
- TF-IDF (term frequency - inverse document frequency)
- topic modeling
bag of words
- count occurrences of n-grams within the text, irrespective of order; these counts are the attribute values
- does not account for uniqueness of words: some words occur far more frequently than others, and low-frequency words are often more predictive than words that appear in every text
TF-IDF (term frequency - inverse document frequency)
- does account for uniqueness of words
- gives more weight to unique words and prevents very frequent words from becoming too dominant
- TF = number of occurrences of a certain n-gram j in instance i (a number a^j_i)
- IDF_j = log([total number of documents or instances] / [number of documents that contain the n-gram j])
- tf-idf = TF * IDF
IDF score
- 0 = word occurs in all documents
- high score = word is more rare/unique
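Both the bag-of-words counts and the TF-IDF weighting can be sketched together; the input is a list of already-preprocessed token lists and the function name is illustrative:

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: tf-idf} dict per doc."""
    n = len(documents)
    df = Counter()                       # in how many documents each term occurs
    for doc in documents:
        df.update(set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}

    scores = []
    for doc in documents:
        tf = Counter(doc)                # bag-of-words counts a^j_i
        scores.append({term: tf[term] * idf[term] for term in tf})
    return scores

docs = [["sensor", "data", "data"], ["sensor", "noise"]]
# "sensor" occurs in every document, so its IDF (and hence tf-idf) is 0
print(tf_idf(docs))
```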