Preprocessing Flashcards
What is the main goal of the Savitzky-Golay filter?
To smooth data while preserving features like peaks and edges better than simple moving averages.
How does the Savitzky-Golay filter compute smoothed values?
By fitting a low-degree polynomial to a subset of the data points (window) using least squares and evaluating the polynomial at the central point.
Do the Savitzky-Golay filter coefficients depend on the data?
No, the coefficients depend only on the window size and the degree of the polynomial.
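A minimal smoothing sketch using SciPy's savgol_filter (assuming SciPy is available; the noisy signal below is synthetic). savgol_coeffs shows that the filter coefficients are determined by the window length and polynomial degree alone:

import numpy as np
from scipy.signal import savgol_filter, savgol_coeffs

# Synthetic noisy signal with a sharp peak.
x = np.linspace(0, 10, 200)
y = np.exp(-(x - 5) ** 2) + 0.05 * np.random.randn(x.size)

# Fit a cubic polynomial over an 11-point sliding window.
smoothed = savgol_filter(y, window_length=11, polyorder=3)

# Coefficients depend only on window size and degree, never on the data.
print(savgol_coeffs(11, 3))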
What does mutual information measure?
The amount of information shared between two variables.
How is mutual information related to entropy?
I(X;Y) = H(X) + H(Y) - H(X,Y), where H denotes entropy.
What is the range of mutual information values?
Non-negative: I(X;Y) ≥ 0, and it equals 0 if X and Y are independent.
How is mutual information used in feature selection?
By ranking features based on their MI with the target variable to select the most informative ones.
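A minimal feature-ranking sketch with scikit-learn (assuming scikit-learn is installed; the synthetic data makes the target depend mostly on the first feature):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                # three candidate features
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)   # target driven by feature 0

mi = mutual_info_classif(X, y, random_state=0)
print(mi)                    # feature 0 should get the highest score
print(np.argsort(mi)[::-1])  # features ranked from most to least informative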
What does the Chi-Square test measure?
The difference between observed and expected frequencies in categorical data.
What is the Chi-Square statistic formula?
χ^2 = SUM( (Oi - Ei)^2 / Ei )
where Oi is the observed frequency and Ei is the expected frequency.
What is the purpose of the Chi-Square test in feature selection?
To assess the independence of a feature and the target variable.
What are the assumptions of the Chi-Square test?
Data must be categorical, and expected frequencies should generally be greater than 5.
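A small worked example with SciPy's chi2_contingency, testing a categorical feature against a class label (the contingency counts are made up for illustration):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: feature categories, columns: class labels (illustrative counts).
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # a small p-value suggests feature and target are not independent
print(expected)       # expected frequencies under the independence assumption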
What does the correlation coefficient measure?
The strength and direction of the linear relationship between two variables.
What is the range of the correlation coefficient?
[-1, 1], where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear relationship.
What is the formula for the Pearson correlation coefficient?
r = SUM((xi - x̄)(yi - ȳ)) / sqrt(SUM((xi - x̄)^2) * SUM((yi - ȳ)^2)), where x̄ and ȳ are the sample means.
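A quick numeric check of the formula in NumPy (a minimal sketch; the data points are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Direct implementation of the formula above.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(r)                        # close to 1: strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value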
What is the goal of PCA?
To reduce the dimensionality of data while retaining as much variance as possible.
How does PCA work?
By projecting data onto a new set of orthogonal axes (principal components) ordered by the amount of variance they capture.
What are eigenvalues and eigenvectors in PCA?
Eigenvalues represent the amount of variance captured by each principal component, and eigenvectors define the directions of the principal components.
How do you decide the number of components to keep in PCA?
Using techniques like the explained variance ratio or a scree plot.
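A minimal PCA sketch with scikit-learn showing the eigenvalues (variance per component) and the cumulative explained variance ratio used to choose how many components to keep (random data, purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA().fit(X)
print(pca.explained_variance_)                 # eigenvalues: variance captured per component
print(pca.explained_variance_ratio_.cumsum())  # e.g. keep enough components to reach ~95%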
What is Chi-Merge used for?
To discretize continuous variables by merging adjacent intervals based on a Chi-Square test.
How does Chi-Merge determine whether to merge intervals?
By calculating the Chi-Square statistic between adjacent intervals and merging them if the statistic is below a threshold.
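A rough Chi-Merge sketch in Python (illustrative only, assuming SciPy; labels are assumed to be integer class indices, and real implementations handle ties, zero counts, and stopping criteria more carefully). Adjacent intervals are merged while the smallest pairwise chi-square statistic stays below the threshold:

import numpy as np
from scipy.stats import chi2_contingency

def chimerge(values, labels, threshold, n_classes):
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    # Start with one interval per distinct value; keep class counts per interval.
    intervals = [[v, np.bincount(labels[values == v], minlength=n_classes)]
                 for v in np.unique(values)]
    while len(intervals) > 1:
        # Chi-square statistic for every adjacent pair of intervals.
        chis = []
        for i in range(len(intervals) - 1):
            table = np.vstack([intervals[i][1], intervals[i + 1][1]]) + 1e-9
            chi2, _, _, _ = chi2_contingency(table, correction=False)
            chis.append(chi2)
        best = int(np.argmin(chis))
        if chis[best] >= threshold:
            break  # all adjacent intervals differ enough: stop merging
        # Merge the most similar adjacent pair.
        intervals[best][1] = intervals[best][1] + intervals[best + 1][1]
        del intervals[best + 1]
    return [iv[0] for iv in intervals]  # lower bounds of the final intervals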
What are frequent patterns, and why are they important in data mining?
Frequent patterns are sets of items, subsequences, or substructures that appear in a dataset with a frequency higher than a specified threshold (support).
They form the foundation for discovering association rules, correlations, and sequential patterns.
What are support and confidence, and how are they used in association rule mining?
Support: The proportion of transactions in which an itemset appears. It measures how frequently the itemset occurs in the dataset.
S(x) = transactions containing x / total transactions
Confidence: The likelihood that items in Y are purchased when X is purchased. It measures the strength of an association rule X → Y.
Confidence(X → Y) = S(X ∪ Y) / S(X)
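A tiny worked example with made-up transactions, computing support and confidence directly from the definitions:

# Toy transactions (illustrative).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
print(support(X))                   # 3/4
print(support(X | Y) / support(X))  # confidence of bread -> milk = 2/3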
Explain the key steps of the Apriori algorithm.
Candidate Generation: Start with single items (1-itemsets) and generate larger k-itemsets iteratively.
Prune Infrequent Itemsets: Eliminate candidates that have subsets not meeting the minimum support.
Count Support: Count occurrences of each itemset in the dataset.
Repeat: Continue generating and pruning (k+1)-itemsets until no new frequent itemsets can be found.
Generate Rules: Use frequent itemsets to derive association rules and calculate their confidence.
Key Insight: Apriori uses the downward closure property (if an itemset is frequent, all its subsets are frequent) to reduce the search space.
Weaknesses:
Computationally expensive due to candidate generation and multiple database scans.
Inefficient for datasets with a large number of items or high-dimensional data.
Memory-intensive as it stores many intermediate candidates.
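A compact, purely illustrative Python sketch of the generate/prune/count loop (not an optimized implementation; the transaction data is made up):

from itertools import combinations

def apriori(transactions, min_support):
    # Returns frequent itemsets (frozensets) mapped to their support counts.
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # Keep only itemsets meeting the minimum support.
        return {c: s for c, s in counts.items() if s / n >= min_support}

    # Frequent 1-itemsets.
    frequent = count({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset must itself be frequent (downward closure).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
print(apriori(transactions, min_support=0.5))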
What is FP-Growth, and how does it address the inefficiencies of the Apriori algorithm?
The FP-Growth (Frequent Pattern Growth) algorithm avoids candidate generation by using a compact FP-Tree (Frequent Pattern Tree) structure.
Steps:
Build an FP-Tree:
Scan the database to count support and order items by descending frequency.
Insert transactions into the tree, sharing paths when possible.
Mine the FP-Tree:
Extract frequent patterns recursively using conditional pattern bases and conditional FP-Trees.
Benefits:
Reduces the number of database scans (only two passes).
Handles dense datasets more efficiently.
Memory-efficient due to compact tree structure.
Describe how an FP-Tree is constructed and mined for frequent patterns.
Constructing the FP-Tree:
Scan the dataset to calculate support for each item.
Remove infrequent items and sort remaining items in descending frequency.
Insert transactions into the tree:
Shared prefixes of transactions are represented as branches.
Each node contains:
An item name.
A count (frequency of the item).
Links to its children; a separate header table links together the nodes that hold the same item.
Mining the FP-Tree:
Start with the least frequent item (bottom of the tree).
Extract its conditional pattern base (prefix paths leading to that item).
Build a conditional FP-Tree from this base.
Repeat recursively for each conditional FP-Tree.
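A construction-only FP-Tree sketch in Python (illustrative; the recursive mining of conditional pattern bases is omitted, and the header table here is a plain item-to-nodes map rather than linked node-links):

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item support and drop infrequent items.
    support = Counter(i for t in transactions for i in t)
    support = {i: c for i, c in support.items() if c >= min_support_count}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> nodes holding that item

    # Pass 2: insert transactions with items sorted by descending support.
    for t in transactions:
        items = sorted((i for i in t if i in support),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1  # shared prefixes only increment counts
    return root, header

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
root, header = build_fp_tree(transactions, min_support_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})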