Preprocessing Flashcards
What is the main goal of the Savitzky-Golay filter?
To smooth data while preserving features like peaks and edges better than simple moving averages.
How does the Savitzky-Golay filter compute smoothed values?
By fitting a low-degree polynomial to a subset of the data points (window) using least squares and evaluating the polynomial at the central point.
Do the Savitzky-Golay filter coefficients depend on the data?
No, the coefficients depend only on the window size and the degree of the polynomial.
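A minimal smoothing sketch using SciPy's savgol_filter (assuming SciPy is available; the noisy signal below is synthetic). savgol_coeffs shows that the filter coefficients are determined by the window length and polynomial degree alone:

import numpy as np
from scipy.signal import savgol_filter, savgol_coeffs

# Synthetic noisy signal with a sharp peak.
x = np.linspace(0, 10, 200)
y = np.exp(-(x - 5) ** 2) + 0.05 * np.random.randn(x.size)

# Fit a cubic polynomial over an 11-point sliding window.
smoothed = savgol_filter(y, window_length=11, polyorder=3)

# Coefficients depend only on window size and degree, never on the data.
print(savgol_coeffs(11, 3))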
What does mutual information measure?
The amount of information shared between two variables.
How is mutual information related to entropy?
I(X;Y) = H(X) + H(Y) - H(X,Y), where H denotes entropy.
What is the range of mutual information values?
Non-negative: I(X;Y) ≥ 0, and it equals 0 if X and Y are independent.
How is mutual information used in feature selection?
By ranking features based on their MI with the target variable to select the most informative ones.
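A minimal feature-ranking sketch with scikit-learn (assuming scikit-learn is installed; the synthetic data makes the target depend mostly on the first feature):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                # three candidate features
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)   # target driven by feature 0

mi = mutual_info_classif(X, y, random_state=0)
print(mi)                    # feature 0 should get the highest score
print(np.argsort(mi)[::-1])  # features ranked from most to least informative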
What does the Chi-Square test measure?
The difference between observed and expected frequencies in categorical data.
What is the Chi-Square statistic formula?
χ^2 = SUM( (Oi - Ei)^2 / Ei )
where Oi is the observed frequency and Ei is the expected frequency.
What is the purpose of the Chi-Square test in feature selection?
To assess the independence of a feature and the target variable.
What are the assumptions of the Chi-Square test?
Data must be categorical, and expected frequencies should generally be greater than 5.
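A small worked example with SciPy's chi2_contingency, testing a categorical feature against a class label (the contingency counts are made up for illustration):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: feature categories, columns: class labels (illustrative counts).
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # a small p-value suggests feature and target are not independent
print(expected)       # expected frequencies under the independence assumption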
What does the correlation coefficient measure?
The strength and direction of the linear relationship between two variables.
What is the range of the correlation coefficient?
[-1, 1], where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear relationship.
What is the formula for the Pearson correlation coefficient?
r = SUM((xi - x̄)(yi - ȳ)) / sqrt(SUM((xi - x̄)^2) * SUM((yi - ȳ)^2)), where x̄ and ȳ are the sample means.
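A quick numeric check of the formula in NumPy (a minimal sketch; the data points are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Direct implementation of the formula above.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(r)                        # close to 1: strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value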
What is the goal of PCA?
To reduce the dimensionality of data while retaining as much variance as possible.
How does PCA work?
By projecting data onto a new set of orthogonal axes (principal components) ordered by the amount of variance they capture.
What are eigenvalues and eigenvectors in PCA?
Eigenvalues represent the amount of variance captured by each principal component, and eigenvectors define the directions of the principal components.
How do you decide the number of components to keep in PCA?
Using techniques like the explained variance ratio or a scree plot.
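A minimal PCA sketch with scikit-learn showing the eigenvalues (variance per component) and the cumulative explained variance ratio used to choose how many components to keep (random data, purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA().fit(X)
print(pca.explained_variance_)                 # eigenvalues: variance captured per component
print(pca.explained_variance_ratio_.cumsum())  # e.g. keep enough components to reach ~95%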
What is Chi-Merge used for?
To discretize continuous variables by merging adjacent intervals based on a Chi-Square test.
How does Chi-Merge determine whether to merge intervals?
By calculating the Chi-Square statistic between adjacent intervals and merging them if the statistic is below a threshold.
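A rough Chi-Merge sketch in Python (illustrative only, assuming SciPy; labels are assumed to be integer class indices, and real implementations handle ties, zero counts, and stopping criteria more carefully). Adjacent intervals are merged while the smallest pairwise chi-square statistic stays below the threshold:

import numpy as np
from scipy.stats import chi2_contingency

def chimerge(values, labels, threshold, n_classes):
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    # Start with one interval per distinct value; keep class counts per interval.
    intervals = [[v, np.bincount(labels[values == v], minlength=n_classes)]
                 for v in np.unique(values)]
    while len(intervals) > 1:
        # Chi-square statistic for every adjacent pair of intervals.
        chis = []
        for i in range(len(intervals) - 1):
            table = np.vstack([intervals[i][1], intervals[i + 1][1]]) + 1e-9
            chi2, _, _, _ = chi2_contingency(table, correction=False)
            chis.append(chi2)
        best = int(np.argmin(chis))
        if chis[best] >= threshold:
            break  # all adjacent intervals differ enough: stop merging
        # Merge the most similar adjacent pair.
        intervals[best][1] = intervals[best][1] + intervals[best + 1][1]
        del intervals[best + 1]
    return [iv[0] for iv in intervals]  # lower bounds of the final intervals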
What are frequent patterns, and why are they important in data mining?
Frequent patterns are sets of items, subsequences, or substructures that appear in a dataset with a frequency higher than a specified threshold (support).
They form the foundation for discovering association rules, correlations, and sequential patterns.
What are support and confidence, and how are they used in association rule mining?
Support: The proportion of transactions in which an itemset appears. It measures how frequently the itemset occurs in the dataset.
S(x) = transactions containing x / total transactions
Confidence: The likelihood that items in Y are purchased when X is purchased. It measures the strength of an association rule X → Y.
Confidence(X → Y) = S(X ∪ Y) / S(X)
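A tiny worked example with made-up transactions, computing support and confidence directly from the definitions:

# Toy transactions (illustrative).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
print(support(X))                   # 3/4
print(support(X | Y) / support(X))  # confidence of bread -> milk = 2/3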
Explain the key steps of the Apriori algorithm.
Candidate Generation: Start with single items (1-itemsets) and generate larger k-itemsets iteratively.
Prune Infrequent Itemsets: Eliminate candidates that have subsets not meeting the minimum support.
Count Support: Count occurrences of each itemset in the dataset.
Repeat: Continue generating and pruning (k+1)-itemsets until no new frequent itemsets can be found.
Generate Rules: Use frequent itemsets to derive association rules and calculate their confidence.
Key Insight: Apriori uses the downward closure property (if an itemset is frequent, all its subsets are frequent) to reduce the search space.
Weaknesses:
Computationally expensive due to candidate generation and multiple database scans.
Inefficient for datasets with a large number of items or high-dimensional data.
Memory-intensive as it stores many intermediate candidates.
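A compact, purely illustrative Python sketch of the generate/prune/count loop (not an optimized implementation; the transaction data is made up):

from itertools import combinations

def apriori(transactions, min_support):
    # Returns frequent itemsets (frozensets) mapped to their support counts.
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # Keep only itemsets meeting the minimum support.
        return {c: s for c, s in counts.items() if s / n >= min_support}

    # Frequent 1-itemsets.
    frequent = count({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset must itself be frequent (downward closure).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
print(apriori(transactions, min_support=0.5))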
What is FP-Growth, and how does it address the inefficiencies of the Apriori algorithm?
The FP-Growth (Frequent Pattern Growth) algorithm avoids candidate generation by using a compact FP-Tree (Frequent Pattern Tree) structure.
Steps:
Build an FP-Tree:
Scan the database to count support and order items by descending frequency.
Insert transactions into the tree, sharing paths when possible.
Mine the FP-Tree:
Extract frequent patterns recursively using conditional pattern bases and conditional FP-Trees.
Benefits:
Reduces the number of database scans (only two passes).
Handles dense datasets more efficiently.
Memory-efficient due to compact tree structure.
Describe how an FP-Tree is constructed and mined for frequent patterns.
Constructing the FP-Tree:
Scan the dataset to calculate support for each item.
Remove infrequent items and sort remaining items in descending frequency.
Insert transactions into the tree:
Shared prefixes of transactions are represented as branches.
Each node contains:
An item name.
A count (frequency of the item).
Links to its children; a separate header table links together the nodes that hold the same item.
Mining the FP-Tree:
Start with the least frequent item (bottom of the tree).
Extract its conditional pattern base (prefix paths leading to that item).
Build a conditional FP-Tree from this base.
Repeat recursively for each conditional FP-Tree.
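A construction-only FP-Tree sketch in Python (illustrative; the recursive mining of conditional pattern bases is omitted, and the header table here is a plain item-to-nodes map rather than linked node-links):

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item support and drop infrequent items.
    support = Counter(i for t in transactions for i in t)
    support = {i: c for i, c in support.items() if c >= min_support_count}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> nodes holding that item

    # Pass 2: insert transactions with items sorted by descending support.
    for t in transactions:
        items = sorted((i for i in t if i in support),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1  # shared prefixes only increment counts
    return root, header

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
root, header = build_fp_tree(transactions, min_support_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})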