preprocessing Flashcards

1
Q

What is the main goal of the Savitzky-Golay filter?

A

To smooth data while preserving features like peaks and edges better than simple moving averages.

2
Q

How does the Savitzky-Golay filter compute smoothed values?

A

By fitting a low-degree polynomial to a subset of the data points (window) using least squares and evaluating the polynomial at the central point.
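
A minimal sketch using SciPy's savgol_filter, which implements this sliding-window least-squares fit (the window length and polynomial degree below are illustrative choices):

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy signal with a sharp peak
x = np.linspace(0, 10, 200)
rng = np.random.default_rng(0)
y = np.exp(-((x - 5) ** 2)) + 0.05 * rng.normal(size=x.size)

# Fit a cubic polynomial to each 11-point window and evaluate it
# at the window's central point
smoothed = savgol_filter(y, window_length=11, polyorder=3)
```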

3
Q

Do the Savitzky-Golay filter coefficients depend on the data?

A

No, the coefficients depend only on the window size and the degree of the polynomial.
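
SciPy makes this visible: savgol_coeffs takes only the window length and polynomial degree and returns a fixed convolution kernel, with no data involved:

```python
from scipy.signal import savgol_coeffs

# Coefficients are fully determined by window size and polynomial degree
coeffs = savgol_coeffs(window_length=11, polyorder=3)
print(coeffs)  # a fixed kernel, reusable on any signal
```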

4
Q

What does mutual information measure?

A

The amount of information shared between two variables.

5
Q

How is mutual information related to entropy?

A

I(X;Y) = H(X) + H(Y) βˆ’ H(X,Y), where H denotes entropy.
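
A small NumPy check of this identity on a made-up 2Γ—2 joint distribution:

```python
import numpy as np

# Joint distribution of two binary variables (rows: X, columns: Y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)  # marginal of X
p_y = p_xy.sum(axis=0)  # marginal of Y

def H(p):
    """Shannon entropy in bits."""
    return -np.sum(p * np.log2(p))

mi = H(p_x) + H(p_y) - H(p_xy.flatten())  # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(mi)  # about 0.278 bits
```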

6
Q

What is the range of mutual information values?

A

Non-negative: I(X;Y) β‰₯ 0, and I(X;Y) = 0 if and only if X and Y are independent.

7
Q

How is mutual information used in feature selection?

A

By ranking features based on their MI with the target variable to select the most informative ones.
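
A sketch with scikit-learn's mutual_info_classif; the Iris dataset is used purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimate MI between each feature and the target, then rank
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]  # most informative features first
print(scores, ranking)
```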

8
Q

What does the Chi-Square test measure?

A

The difference between observed and expected frequencies in categorical data.

9
Q

What is the Chi-Square statistic formula?

A

πœ’^2= βˆ‘(π‘‚π‘–βˆ’πΈπ‘–)^2/𝐸𝑖
where 𝑂𝑖 is the observed frequency and 𝐸𝑖 is the expected frequency.
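
A quick check of the formula against scipy.stats.chisquare, on made-up counts:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([18, 22, 30, 30])
expected = np.array([25, 25, 25, 25])

# Direct application of chi^2 = sum((Oi - Ei)^2 / Ei)
stat = np.sum((observed - expected) ** 2 / expected)
print(stat)                           # 4.32
print(chisquare(observed, expected))  # same statistic, plus a p-value
```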

10
Q

What is the purpose of the Chi-Square test in feature selection?

A

To assess the independence of a feature and the target variable.
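
A sketch with scikit-learn's chi2 scorer and SelectKBest; note that sklearn's chi2 expects non-negative (count-like) features, and Iris is used here only to show the API:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # all features non-negative

# High scores / small p-values suggest the feature and target are dependent
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(selector.scores_, selector.pvalues_)

X_reduced = selector.transform(X)  # keeps the 2 highest-scoring features
```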

11
Q

What are the assumptions of the Chi-Square test?

A

Data must be categorical, and expected frequencies should generally be at least 5 in each cell.

12
Q

What does the correlation coefficient measure?

A

The strength and direction of the linear relationship between two variables.

13
Q

What is the range of the correlation coefficient?

A

[βˆ’1, 1], where βˆ’1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear relationship.

14
Q

What is the formula for the Pearson correlation coefficient?

A

r = βˆ‘(xi βˆ’ xΜ„)(yi βˆ’ Θ³) / √(βˆ‘(xi βˆ’ xΜ„)² Β· βˆ‘(yi βˆ’ Θ³)²)
where xΜ„ and Θ³ are the sample means of x and y.
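
Computing r directly from the formula and checking it against np.corrcoef, on toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 10.1])

# Deviations from the means
dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(r, np.corrcoef(x, y)[0, 1])  # both ~0.9998
```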

15
Q

What is the goal of PCA?

A

To reduce the dimensionality of data while retaining as much variance as possible.

16
Q

How does PCA work?

A

By projecting data onto a new set of orthogonal axes (principal components) ordered by the amount of variance they capture.
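
A minimal sketch with scikit-learn's PCA (random data, used only to show the API):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)         # data projected onto the top 2 axes

print(pca.components_)                # orthogonal directions (rows)
print(pca.explained_variance_ratio_)  # variance captured, in decreasing order
```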

17
Q

What are eigenvalues and eigenvectors in PCA?

A

Eigenvalues represent the amount of variance captured by each principal component, and eigenvectors define the directions of the principal components.
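
The same idea via a direct eigendecomposition of the covariance matrix of centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)  # center the data

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # ascending order for symmetric matrices

# Largest eigenvalue = variance along the first principal component;
# its eigenvector (a column of eigvecs) is that component's direction.
print(eigvals[::-1])
print(eigvecs[:, ::-1])
```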

18
Q

How do you decide the number of components to keep in PCA?

A

Using techniques like the explained variance ratio or a scree plot.
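
One common recipe, sketched via the cumulative explained variance ratio; the 95% cutoff below is an arbitrary example (scikit-learn's PCA also accepts a float n_components with this meaning):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(cumulative, n_components)
```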

19
Q

What is Chi-Merge used for?

A

To discretize continuous variables by merging adjacent intervals based on a Chi-Square test.

20
Q

How does Chi-Merge determine whether to merge intervals?

A

By calculating the Chi-Square statistic between adjacent intervals and merging them if the statistic is below a threshold.
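
A minimal Chi-Merge sketch, assuming class-labeled data and a user-chosen stopping threshold (2.7 below is roughly the χ² critical value at the 0.10 level with 1 degree of freedom); interval bookkeeping is simplified for brevity:

```python
import numpy as np

def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    table = np.array([counts_a, counts_b], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    expected[expected == 0] = 1e-9  # guard against empty rows/columns
    return float(((table - expected) ** 2 / expected).sum())

def chi_merge(values, labels, threshold):
    classes = sorted(set(labels))
    # Start with one interval per distinct value, holding per-class counts
    intervals = [[v, [sum(1 for x, y in zip(values, labels) if x == v and y == c)
                      for c in classes]]
                 for v in sorted(set(values))]
    while len(intervals) > 1:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] >= threshold:  # all adjacent pairs differ enough: stop
            break
        # Merge the most similar adjacent pair of intervals
        intervals[i][1] = [a + b for a, b in
                           zip(intervals[i][1], intervals[i + 1][1])]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]  # lower bounds of the final intervals

print(chi_merge([1, 2, 3, 7, 8, 9], ['a', 'a', 'a', 'b', 'b', 'b'],
                threshold=2.7))  # -> [1, 7]
```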

21
Q

What are frequent patterns, and why are they important in data mining?

A

Frequent patterns are sets of items, subsequences, or substructures that appear in a dataset with a frequency higher than a specified threshold (support).
They form the foundation for discovering association rules, correlations, and sequential patterns.

22
Q

What are support and confidence, and how are they used in association rule mining?

A

Support: The proportion of transactions in which an itemset appears. It measures how frequently the itemset occurs in the dataset.
S(X) = (transactions containing X) / (total transactions)

Confidence: The likelihood that the items in Y are purchased when X is also purchased. It measures the strength of an association rule X → Y.
Confidence(X → Y) = S(X βˆͺ Y) / S(X)
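
Both measures computed by direct counting on toy transactions:

```python
# Toy market-basket transactions
transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk'},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {milk}: confidence = S(X u Y) / S(X)
s_x = support({'bread'})
s_xy = support({'bread', 'milk'})
print(s_x, s_xy, s_xy / s_x)  # 0.75, 0.5, 0.667
```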

23
Q

Explain the key steps of the Apriori algorithm.

A

Candidate Generation: Start with single items (1-itemsets) and generate larger k-itemsets iteratively.
Prune Infrequent Itemsets: Eliminate candidates that have subsets not meeting the minimum support.
Count Support: Count occurrences of each itemset in the dataset.
Repeat: Continue generating and pruning k+1-itemsets until no new frequent itemsets can be found.
Generate Rules: Use frequent itemsets to derive association rules and calculate their confidence.
Key Insight: Apriori uses the downward closure property (if an itemset is frequent, all its subsets are frequent) to reduce the search space.

Weaknesses:
Computationally expensive due to candidate generation and multiple database scans.
Inefficient for datasets with a large number of items or high-dimensional data.
Memory-intensive as it stores many intermediate candidates.
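
A compact, unoptimized Apriori sketch in pure Python following these steps (rule generation from the frequent itemsets is omitted):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets with their support (as fractions)."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) / n for c in candidates}

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = {c: s for c, s in count({frozenset([i]) for i in items}).items()
            if s >= min_support}
    result, prev, k = dict(freq), set(freq), 2

    while prev:
        # Candidate generation: join frequent (k-1)-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune by downward closure: every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        freq = {c: s for c, s in count(candidates).items() if s >= min_support}
        result.update(freq)
        prev, k = set(freq), k + 1
    return result

txns = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
for itemset, s in apriori(txns, min_support=0.6).items():
    print(set(itemset), s)
```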

24
Q

What is FP-Growth, and how does it address the inefficiencies of the Apriori algorithm?

A

The FP-Growth (Frequent Pattern Growth) algorithm avoids candidate generation by using a compact FP-Tree (Frequent Pattern Tree) structure.
Steps:
Build an FP-Tree:
Scan the database to count support and order items by descending frequency.
Insert transactions into the tree, sharing paths when possible.
Mine the FP-Tree:
Extract frequent patterns recursively using conditional pattern bases and conditional FP-Trees.

Benefits:
Reduces the number of database scans (only two passes).
Handles dense datasets more efficiently.
Memory-efficient due to compact tree structure.
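
The same two-pass idea is available off the shelf; a sketch assuming the third-party mlxtend library is installed:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [['a', 'b', 'c'], ['a', 'b'], ['a', 'c'],
                ['b', 'c'], ['a', 'b', 'c']]

# One-hot encode the transactions, then mine without candidate generation
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
print(fpgrowth(onehot, min_support=0.6, use_colnames=True))
```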

25
Q

Describe how an FP-Tree is constructed and mined for frequent patterns.

A

Constructing the FP-Tree:
Scan the dataset to calculate support for each item.
Remove infrequent items and sort remaining items in descending frequency.
Insert transactions into the tree:
Shared prefixes of transactions are represented as branches.
Each node contains:
An item name.
A count (frequency of the item).
Links to its children, plus a node-link to the next node holding the same item (tracked via a header table).
Mining the FP-Tree:
Start with the least frequent item (bottom of the tree).
Extract its conditional pattern base (prefix paths leading to that item).
Build a conditional FP-Tree from this base.
Repeat recursively for each conditional FP-Tree.
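
A construction-only sketch of the tree described above; the recursive mining step (conditional pattern bases and conditional FP-Trees) is omitted for brevity, and the header table is simplified to an item-to-nodes map:

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count item support, keep only frequent items
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> nodes holding it (used when mining)

    # Pass 2: insert transactions, items ordered by descending frequency
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1  # shared prefixes just increment counts
    return root, header

root, header = build_fp_tree(
    [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b']], min_support=2)
print({item: [n.count for n in nodes] for item, nodes in header.items()})
# -> {'a': [3], 'b': [2, 1], 'c': [1, 1]}
```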
