Theory Flashcards

1
Q

Why do we need data mining?

A

Data mining developed primarily in response to the increasing amounts of data being generated and the need to extract meaningful insights from this vast information. Several factors contributed to its development:

  1. Explosion of Data: With the advent of digital technologies, businesses, governments, and organizations began generating huge amounts of data from various sources such as transactions, social media, sensors, and web activity. Traditional methods were insufficient to handle and analyze these massive datasets.
  2. Advances in Computing Power: The rise of more powerful computers, storage systems, and distributed computing made it possible to process large volumes of data quickly and efficiently, enabling the complex calculations required for data mining.
  3. Need for Competitive Advantage: Businesses sought new ways to gain insights into customer behavior, market trends, and operational efficiencies to remain competitive. Data mining helped in identifying patterns, predicting trends, and improving decision-making processes.
  4. Machine Learning and AI Growth: With the development of machine learning algorithms, data mining evolved into a more sophisticated discipline. These algorithms helped in automating the discovery of patterns and trends in data, leading to actionable insights.
  5. Interdisciplinary Research: Data mining drew from various fields like statistics, computer science, artificial intelligence, and database management, which allowed for the creation of more powerful tools and methodologies.
  6. Applications Across Domains: Industries such as finance, healthcare, marketing, and manufacturing realized the potential of data mining in predicting outcomes, improving processes, and uncovering hidden relationships within data, further driving its development.

In essence, data mining arose as a solution to manage and make sense of the data deluge and to provide insights that could be turned into a competitive edge or valuable knowledge.

2
Q

What are some examples of data mining usage?

A

Customer churn
– Given customer information for the past months, understand which customers I lost, or predict which customers I might lose

Credit assessment
– Given a loan application, predict whether the bank should approve the loan

Customer segmentation
– Given information about the customers, identify interesting groups among them

Community detection
– Who is discussing what?

3
Q

What led the Big Data phenomenon to occur? What are the consequences of Big Data?

A

The Big Data phenomenon occurred due to several key factors that have shaped the modern technological landscape. Here’s an overview of the causes and consequences:

Causes:
  1. Explosion of Data Generation:
    • The rise of the internet, mobile devices, social media, and cloud computing has led to unprecedented data generation. Every digital interaction, from emails and social posts to e-commerce transactions, creates data.
    • The growth of Internet of Things (IoT) devices, such as sensors, wearables, and smart appliances, added another layer of continuous data creation from physical objects.
  2. Increased Data Storage Capacity:
    • Advancements in storage technologies, such as cloud storage, distributed databases, and data centers, made it possible to store vast amounts of data cheaply and efficiently. This, combined with improved data compression techniques, enabled the preservation of enormous datasets.
  3. Advances in Data Processing Technologies:
    • Tools like Hadoop, Apache Spark, and NoSQL databases allowed for scalable processing of unstructured and semi-structured data across distributed systems. These technologies made it feasible to analyze vast datasets in a relatively short time, unlocking the potential of Big Data.
  4. Proliferation of Social Media and Digital Platforms:
    • Platforms like Facebook, Twitter, Instagram, and YouTube generate a massive amount of user-generated content. The interactions, likes, shares, and views on these platforms became key sources of data for analytics.
  5. E-commerce and Digital Transactions:
    • Online shopping platforms such as Amazon and Alibaba, along with digital payment systems, track and record every consumer transaction, producing detailed datasets related to purchasing behaviors and trends.
  6. Advancements in Machine Learning and Artificial Intelligence:
    • AI and machine learning require large amounts of data to train models effectively. As these fields advanced, the need for massive datasets to improve algorithms accelerated the Big Data movement.
Consequences:
  1. Improved Decision-Making:
    • Organizations can leverage Big Data analytics to make data-driven decisions, leading to more accurate forecasts, optimized operations, and personalized services. In business, this translates into enhanced customer insights, market analysis, and operational efficiencies.
  2. Personalization and Predictive Analytics:
    • Companies now have the ability to customize products and services to individual preferences. For example, streaming services like Netflix and Spotify use Big Data to personalize content recommendations, and e-commerce platforms use it to suggest products.
    • Predictive analytics can anticipate future trends based on historical data, which helps industries like finance, healthcare, and marketing improve outcomes.
  3. Privacy and Security Concerns:
    • The vast amounts of personal data being collected have led to increased concerns about privacy violations and data breaches. Organizations are now expected to protect sensitive data, leading to stricter regulations like the GDPR (General Data Protection Regulation) in Europe and other privacy laws globally.
  4. Ethical and Bias Issues:
    • With the extensive use of data in decision-making, ethical concerns have surfaced, particularly around how data is collected, processed, and used. Algorithms trained on biased data can perpetuate existing inequalities, especially in areas like hiring, criminal justice, and lending.
  5. Impact on Business Models:
    • Big Data has driven the shift to data-centric business models, where data is a valuable asset. Companies like Google, Amazon, and Facebook capitalize on user data to generate revenue through targeted advertising, changing the nature of competition in the digital economy.
  6. Job Displacement and Creation:
    • While Big Data has created new roles in data science, analytics, and IT infrastructure, it has also led to automation in several industries. Tasks that were traditionally performed by humans are now being handled by AI and data-driven systems, leading to both job displacement and the creation of new tech-driven jobs.
  7. Advances in Healthcare:
    • In healthcare, Big Data has led to significant advancements such as precision medicine, where treatments are tailored to individual patients based on genetic, environmental, and lifestyle factors. It has also helped in epidemic prediction and management, as seen during the COVID-19 pandemic.
  8. Efficiency Gains in Various Industries:
    • Sectors like manufacturing, logistics, and supply chain management benefit from Big Data analytics by optimizing production processes, reducing waste, and improving delivery times.

The Big Data phenomenon arose from the exponential increase in data generated by digital and connected technologies, coupled with advancements in storage and processing capabilities. The consequences are vast, ranging from enhanced decision-making and personalized experiences to concerns around privacy, ethics, and job displacement, fundamentally reshaping industries and society.

4
Q

When can we say a computer learns from experience?

A

A computer program is said to learn from experience E with respect to some class
of tasks T and a performance measure P,
if its performance at tasks in T, as measured by P, improves with experience E.

5
Q

What are the machine learning paradigms?

A

Suppose we have the experience E encoded as a dataset,
D = {x1, x2, x3, …, xn}

• Supervised Learning
– Given the desired outputs t1, t2, …, tn, learns to produce the correct output for a new set of inputs

• Unsupervised learning
– Exploits regularities in D to build a representation to be used for reasoning or prediction

• Reinforcement learning
– Producing actions a1, a2, …, an which affect the environment, and receiving rewards r1, r2, …, rn, learns to act in order to maximize rewards in the long term

6
Q

What is data mining?

A

Data Mining is the non-trivial process of identifying (1) valid, (2) novel, (3) potentially useful, and
(4) understandable patterns in data.

7
Q

How can we identify a good pattern in data?

A

Is it valid?
–The pattern has to be valid with respect to a certainty level (e.g., a rule that holds for 86% of the cases)

Is it novel?
–Is the relation between astigmatism and hard contact lenses already well-known?

Is it useful? Is it actionable?
–The pattern should provide information useful to the bank for assessing credit risk

Is it understandable?

8
Q

What is the general idea of data mining? What are the obstacles to be overcome?

A

Build computer programs that navigate through databases automatically, seeking patterns

However,
–Most patterns will be uninteresting
– Most patterns are spurious, inexact, or contingent on accidental coincidences in the data
– Real data is imperfect, some parts will be garbled, and some will be missing

9
Q

What are the essential characteristics of data mining algorithms?

A

Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful

10
Q

What are the types of data mining models?

A

Descriptive vs. Predictive
Are the models built for gaining insight? (about what already happened)
Or are they built for accurate prediction? (about what might happen)

Prescriptive
Apply descriptive and predictive mining to recommend a course of action

11
Q

What is the structure of a data mining process?

A

Selection
– What data do we need to answer the posed question?

Cleaning
– Are there any errors or inconsistencies in the data we need to eliminate?

Transformation
– Some variables might be eliminated because equivalent to others
– Some variables might be elaborated to create new variables
(e.g., birthday to age, daily measures into weekly/monthly measures, log?)

Mining
– Select the mining approach: classification, regression, association, etc.
– Choose and apply the mining algorithm(s)

Validation
– Are the patterns we discovered sound? According to what criteria?
– Are the criteria sound? Can we explain the result?

Presentation & Narrative
– What did we learn? Is there a story to tell? A take-home message?

12
Q

What is the importance of data preparation and preprocessing?

A

Data preparation accounts for most of the time needed to design an effective data mining pipeline

It can take up to 80%-90% of the overall effort

No quality in data, no quality out (trash in, trash out): quality decisions need quality data

13
Q

Why is data dirty?

A

Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is analyzed.
– Human/hardware/software problems

Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission

Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning

14
Q

What are the major tasks in data preprocessing?

A

Data cleaning
– Fill in missing values, smooth noisy data
– Identify or remove outliers
– Remove duplicates and resolve inconsistencies

Data integration
– Integration of multiple databases, data cubes, or files

Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression

Feature engineering, data transformation, and data discretization
– Normalization
– Creation of new features (e.g., age from birthday)
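
For instance, the feature-creation step (deriving age from a birthday) takes only a few lines of pandas; a minimal sketch, where the column names and the reference date are assumptions:

import pandas as pd

df = pd.DataFrame({"birthday": ["1990-05-01", "2001-11-23", "1985-02-14"]})
df["birthday"] = pd.to_datetime(df["birthday"])
reference = pd.Timestamp("2024-01-01")                     # fixed reference date for reproducibility
df["age"] = (reference - df["birthday"]).dt.days // 365    # approximate age in years
print(df)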

15
Q

What are the reasons for missing values?
How can we handle them?

A

• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
–Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values
–Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
–Replace with all possible values (weighted by their probabilities)

16
Q

What is sampling and why is it important? When is a sample considered to be representative?

A

• Sampling is the main technique employed for data selection
• Often used for both the preliminary investigation of the data and the final data analysis
• We also sample because working on the entire dataset may be too expensive or time consuming
• A sample is representative if it has the same property (of interest) as the original set of data
• If the sample is representative, it will work almost as well as using the entire dataset

17
Q

What are the types of data sampling?

A

• Sampling without replacement
– As each item is selected, it is removed from the population

• Sampling with replacement (Bootstrap)
–Objects are not removed from the population as they are selected for the sample.
– In sampling with replacement, the same object can be picked up more than once

• Stratified sampling
–Split the data into several partitions
–Then draw random samples from each partition
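
A minimal pandas sketch of the three schemes on a toy DataFrame (the "segment" column is an assumed stratification variable):

import pandas as pd

df = pd.DataFrame({"segment": ["A", "A", "A", "B", "B", "C"],
                   "value": [1, 2, 3, 4, 5, 6]})

without_replacement = df.sample(n=3, replace=False, random_state=0)
with_replacement = df.sample(n=3, replace=True, random_state=0)         # bootstrap sample
stratified = df.groupby("segment").sample(frac=0.5, random_state=0)     # sample within each partition

print(without_replacement, with_replacement, stratified, sep="\n\n")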

18
Q

What are some of the data mining tasks?

A

Regression, clustering, classification, association

• Outlier analysis
–An outlier is a data object that does not comply with the general behavior of the data
–It can be considered as noise or exception but is quite useful in rare events analysis

• Trend and evolution analysis
–Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
–Similarity-based analysis

• Text Mining, Topic Modeling, Graph Mining, Data Streams

• Sentiment Analysis, Opinion Mining, etc.

• Other pattern-directed or statistical analyses

19
Q

Are all discovered patterns interesting? How can we determine if a pattern is interesting?

A

• Interestingness measures
– Data Mining may generate thousands of patterns, but typically not all of them are interesting.
– A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

• Objective vs. subjective interestingness measures
–Objective measures are based on statistics and structures of patterns
– Subjective measures are based on user’s belief in the data, e.g., unexpectedness, novelty, etc.

20
Q

What are instances, attributes and concepts?

A

• Instances (observations, cases, records, items, examples)
–The atomic elements of information from a dataset
–Each row in previous table corresponds to an instance

• Attributes (variables, features, independent variables)
–Measures aspects of an instance
–Each instance is composed of a certain number of attributes
–Each column in previous table contains values of an attribute

• Concept (class, target variable, dependent variable)
–Special content inside the data
–Kind of things that can be learned
–Intelligible and operational concept description
–Last column of previous table was the class

21
Q

What are the different types of attributes?

A

• Numeric Attributes
–Real-valued or integer-valued domain
–Interval-scaled when only differences are meaningful (e.g., temperature)
– Ratio-scaled when differences and ratios are meaningful (e.g., Age)

• Categorical Attributes
–Set-valued domain composed of a set of symbols
– Nominal when only equality is meaningful (e.g., domain(Sex) = { M, F})
–Ordinal when both equality (are two values the same?) and inequality (is one value less than another?) are meaningful (e.g., domain(Education) = { High School, BS, MS, PhD})

22
Q

How can we classify attributes based on an alternative view?

A
23
Q

Can missing values have a meaning? Give an example

A

• Missing value may have a meaning in itself

• For example, missing test in a medical examination or an empty field in a questionnaire

• They are usually indicated by out-of-range entries (e.g., max/min float, NaN, null)

• Does absence of value have some significance?
– If it does, “missing” is a separate value
– If it does not, “missing” must be treated in a special way

24
Q

What are the types of missing values?

A

Missing not at random (MNAR)
–Distribution of missing values depends on missing value
– E.g., respondents with high income are less likely to report it

Missing at random (MAR)
–Distribution of missing values depends on observed attributes, but not missing value
– E.g., men less likely than women to respond to question about mental health

Missing completely at random (MCAR)
–Distribution of missing values does not depend on observed attributes or missing value
–E.g., survey questions randomly sampled from larger set of possible questions

Types of Missing Data

In data analysis, missing data is a common problem that can affect the quality of your results. Understanding the type of missing data is crucial because it determines the best strategy for handling it. Missing data is typically classified into three categories:
1. Missing Completely at Random (MCAR):
• Definition: Data is considered MCAR when the probability of a data point being missing is unrelated to the observed or unobserved data. Essentially, there’s no pattern to the missing data, and it happens purely by chance.
• Example: If a survey respondent accidentally skips a question due to a technical glitch or randomly misses a question with no relation to their responses, the missing data would be considered MCAR.
• Detection: MCAR can be tested using statistical tests like Little’s MCAR test. Additionally, examining patterns and correlations between missing values and other variables may indicate randomness.
• Solutions: If data is MCAR, it’s safe to use methods like:
• Listwise deletion: Removing all cases with missing values.
• Mean/median imputation: Replacing missing values with the mean or median of the non-missing data.
• Advanced methods: Using multiple imputation or machine learning models, though not strictly necessary for MCAR.
2. Missing at Random (MAR):
• Definition: Data is considered MAR if the probability of a data point being missing is related to the observed data but not to the missing data itself. In other words, the missingness is related to other known variables but not to the value of the missing data itself.
• Example: In a medical study, if older patients are more likely to skip a specific question but the missingness does not depend on their actual answer to that question, the data is MAR.
• Detection: Analyzing the relationship between missing data and other observed variables can help identify MAR. Techniques like logistic regression can be used to model the probability of missingness as a function of other variables.
• Solutions: When data is MAR, you can use methods like:
• Multiple imputation: Creating several imputed datasets by estimating the missing values based on other observed data, and then averaging the results.
• Maximum likelihood estimation: Estimating model parameters by accounting for the missing data structure.
• Regression imputation: Predicting missing values using observed variables.
3. Missing Not at Random (MNAR):
• Definition: Data is MNAR when the probability of a data point being missing is related to the value of the missing data itself, even after accounting for other variables. This means that the missingness is inherently tied to the data that is missing.
• Example: In a survey about income, high-income respondents might be more likely to skip questions about their earnings due to privacy concerns. Here, the missingness depends directly on the income value.
• Detection: MNAR is challenging to detect since it involves the values of the missing data itself. It often requires domain knowledge, additional data collection, or sensitivity analysis to explore if the data might be MNAR.
• Solutions: Addressing MNAR requires more sophisticated techniques, such as:
• Modeling the missingness: Using selection models or pattern-mixture models to explicitly model the missing data mechanism.
• Data augmentation: Collecting additional data or using follow-up studies to understand the nature of the missingness.
• Sensitivity analysis: Testing different assumptions about the missing data to see how they impact the results.

How to Identify the Type of Missing Data

1.	Visualizations:
•	Use heatmaps, bar plots, or missingness matrices to visualize missing data patterns.
•	Correlation plots between missing indicators and other variables can suggest MCAR or MAR.
2.	Statistical Tests:
•	Little’s MCAR Test: A hypothesis test where the null hypothesis is that data is MCAR. A significant result suggests the data is not MCAR.
•	Logistic Regression for Missingness: Modeling the missingness indicator (e.g., whether a value is missing) against other variables can help detect MAR.
3.	Domain Knowledge:
•	Understanding the context of your data can help infer whether missingness is related to specific variables (MAR) or inherent to the missing value itself (MNAR).

Practical Solutions for Handling Missing Data

1.	MCAR Solutions:
•	Listwise Deletion: If the data is MCAR, removing rows with missing values is unbiased but reduces the dataset size.
•	Mean/Median Imputation: Useful for simple datasets but may underestimate variability.
•	Random Sampling Imputation: Impute missing values using random sampling from observed values.
2.	MAR Solutions:
•	Multiple Imputation: Iteratively predicts missing values using models (like regression) based on observed data, creating multiple versions of the dataset to average over.
•	Maximum Likelihood Estimation: Fits models using the observed data and accounts for missingness without imputing values.
•	Predictive Models: Use models like k-Nearest Neighbors (k-NN) or machine learning algorithms to predict missing values.
3.	MNAR Solutions:
•	Sensitivity Analysis: Explore how different assumptions about missing data affect the results.
•	Pattern-Mixture Models: Model each pattern of missing data separately.
•	Data Collection: If possible, gather additional data to minimize the impact of MNAR.

Summary Table

Type Definition Detection Solutions
MCAR Missingness is random and unrelated to any data. Little’s MCAR test, pattern analysis Listwise deletion, mean imputation, multiple imputation
MAR Missingness is related to observed data but not to missing data itself. Logistic regression, correlation with observed data Multiple imputation, maximum likelihood, regression imputation
MNAR Missingness is related to the value of the missing data itself. Sensitivity analysis, domain knowledge Sensitivity analysis, pattern-mixture models, data augmentation

Understanding the type of missing data is crucial for applying the right strategy to ensure accurate and unbiased results in your analysis.
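
A minimal sketch of the two simplest remedies above (listwise deletion and mean imputation), assuming a small pandas DataFrame with hypothetical columns; under MAR one would typically prefer multiple imputation or maximum likelihood instead:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "income": [30000, 42000, np.nan, 51000]})

# Listwise deletion: drop every row that contains a missing value (unbiased only under MCAR)
complete_cases = df.dropna()

# Mean imputation: replace each missing value with the column mean (reduces variability)
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

print(complete_cases, imputed, sep="\n\n")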

25
Q

How to deal with missing values?

A

First use what you know about the data
–Why data is missing?
–Distribution of missing data

Decide on the best strategy to yield the least biased estimates
–Do Not Impute (DNI)
– Deletion Methods (list-wise deletion, pair-wise deletion)
– Single Imputation Methods (mean/mode substitution, dummy variable, single regression)
– Model-Based Methods (maximum Likelihood, multiple imputation)

26
Q

Explain the Do Not Impute strategy

A

• Do Not Impute (DNI)
–Simply use the default policy of the data mining method
–Works only if the policy exists
–Some methods can work around missing data

27
Q

Explain the Deletion strategy

A

• The handling of missing data depends on the type

• Discarding all the examples with missing values
–Simplest approach
–Allows the use of unmodified data mining methods
–Only practical if there are few examples with missing values.
– Otherwise, it can introduce bias.

28
Q

Explain the List-wise Deletion strategy

A

• Only analyze cases with available data on each variable
• Simple, but reduces the data
• Comparability across analyses
• Does not use all the information
• Estimates may be biased if data not MCAR

29
Q

Explain the Pair-wise Deletion strategy

A

• Delete cases with missing values that affect only the variables of interest

• Example
– When using only the first two variables, the missing values of the third variable are not considered

• Advantage
–Keeps as many cases as possible for each analysis
–Uses all information possible with each analysis

• Disadvantage
–Comparison of results is more difficult because samples are different each time

30
Q

Explain the Imputation strategy

A

• Convert the missing values into a new value
–Use a special value for it
–Add an attribute that indicates if value is missing or not
–Greatly increases the difficulty of the data mining process

• Imputation methods
– Assign a value to the missing one, based on the rest of the dataset
–Use the unmodified data mining methods

31
Q

Explain the Single Imputation strategy

A

• Mean/mode substitution (most common value)
–Replace missing value with sample mean or mode
–Run analyses as if all complete cases
– Advantages: Can use complete case analysis methods
– Disadvantages: Reduces variability

• Dummy variable control
–Create an indicator for missing value (1=value is missing for observation; 0=value is observed for observation)
–Impute missing values to a constant (such as the mean)
–Include missing indicator in the algorithm
–Advantage: uses all available information about missing observation
– Disadvantage: results in biased estimates, not theoretically driven
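
A minimal pandas sketch of the two single-imputation strategies above, using a hypothetical income column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30000, np.nan, 45000, np.nan, 52000]})

# Dummy variable control: 1 = value was missing, 0 = value was observed
df["income_missing"] = df["income"].isna().astype(int)

# Mean substitution: replace the missing values with the sample mean
df["income"] = df["income"].fillna(df["income"].mean())
print(df)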

32
Q

Explain the Model-based Imputation strategy

A

• Extract a model from the dataset to perform the imputation
– Suitable for MCAR and, to a lesser extent, for MAR
– Not suitable for MNAR type of missing data
• For MNAR we need to go back to the source of the data to obtain more information

33
Q

What are some of the reasons behind Inaccurate values?

A

• Data has not been collected for mining it

• Errors and omissions that don’t affect original purpose of data (e.g. age of customer)

• Typographical errors in nominal attributes, thus values need to be checked for consistency

• Typographical and measurement errors in numeric attributes, thus outliers need to be identified

• Errors may be deliberate (e.g. wrong zip codes)

34
Q

Why do we care about data types?

A

They influence the type of statistical analyses and visualization we can perform

Some algorithms and functions work best with specific data types; the type also determines how we check for valid values, deal with missing values, etc.

35
Q

How can we transform categorical data into numerical and vice versa?

A

Transforming Categorical Data into Numerical Data

Converting categorical data into numerical data is often necessary for machine learning algorithms, which typically require numerical input. Here are common methods for this transformation:
1. Label Encoding:
• Definition: Converts each category into a unique integer.
• Use Case: Useful for ordinal data where categories have a meaningful order (e.g., “Low”, “Medium”, “High”).
• Example:

from sklearn.preprocessing import LabelEncoder

data = ['Low', 'Medium', 'High', 'Medium']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)
print(encoded_data)  # Output: [1, 2, 0, 2]

•	Pros: Simple and efficient.
•	Cons: Can introduce ordinal relationships where none exist, leading to incorrect model assumptions.

2.	One-Hot Encoding:
•	Definition: Converts each category into a new binary column (0 or 1), with “1” indicating the presence of that category.
•	Use Case: Ideal for nominal data where categories have no intrinsic order (e.g., “Red”, “Green”, “Blue”).
•	Example:

import pandas as pd

data = ['Red', 'Green', 'Blue']
df = pd.DataFrame(data, columns=['Color'])
one_hot_encoded = pd.get_dummies(df, columns=['Color'])
print(one_hot_encoded)

Output:

Color_Blue Color_Green Color_Red
0 0 0 1
1 0 1 0
2 1 0 0

•	Pros: No ordinal relationship introduced; ideal for non-ordinal data.
•	Cons: Can lead to the “curse of dimensionality” if there are many categories (many new columns).

3.	Binary Encoding:
•	Definition: Converts categories into binary digits and uses fewer columns than one-hot encoding.
•	Use Case: Useful when dealing with high-cardinality features (features with many categories).
•	Example:

import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange']})
encoder = ce.BinaryEncoder()
binary_encoded = encoder.fit_transform(data)
print(binary_encoded)

•	Pros: Reduces dimensionality compared to one-hot encoding.
•	Cons: Still introduces additional columns.

4.	Frequency Encoding:
•	Definition: Replaces each category with its frequency in the dataset.
•	Use Case: Useful for high-cardinality features where you want to retain some information about category distribution.
•	Example:

data = ['A', 'B', 'A', 'C', 'B', 'A']
freq_encoding = pd.Series(data).value_counts().to_dict()
encoded_data = [freq_encoding[val] for val in data]
print(encoded_data)  # Output: [3, 2, 3, 1, 2, 3]

•	Pros: Helps reduce dimensionality.
•	Cons: Can introduce bias if frequency distributions are uneven.

5.	Target Encoding (Mean Encoding):
•	Definition: Replaces categories with the mean of the target variable for each category.
•	Use Case: Useful for categorical features in supervised learning where target leakage is not a concern.
•	Example:

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C'], 'Target': [10, 20, 30, 40]})
means = df.groupby('Category')['Target'].mean()
df['Category_encoded'] = df['Category'].map(means)
print(df)

Output:

Category Target Category_encoded
0 A 10 20.0
1 B 20 20.0
2 A 30 20.0
3 C 40 40.0

•	Pros: Retains useful information about categories’ relationship with the target.
•	Cons: Risk of overfitting, especially with small datasets.

Transforming Numerical Data into Categorical Data

Sometimes, you may want to convert numerical data into categorical data, such as when creating bins for continuous data or segmenting data for analysis. Here are common techniques:
1. Binning (Discretization):
• Definition: Divides numerical data into bins (categories) based on value ranges.
• Use Case: Useful for simplifying numerical data by converting it into discrete intervals.
• Example:

import pandas as pd

data = [23, 45, 12, 67, 34, 89]
bins = [0, 30, 60, 90]
labels = ['Low', 'Medium', 'High']
categorized_data = pd.cut(data, bins=bins, labels=labels)
print(categorized_data)  # Output: ['Low', 'Medium', 'Low', 'High', 'Medium', 'High']

•	Pros: Simplifies analysis by reducing data complexity.
•	Cons: Can lead to loss of information.

2.	Quantile Binning:
•	Definition: Bins data into categories based on quantiles (e.g., quartiles, deciles).
•	Use Case: Useful for dividing data into equally sized groups, often used for ranking.
•	Example:

data = [10, 20, 30, 40, 50]
quantile_bins = pd.qcut(data, q=3, labels=['Low', 'Medium', 'High'])
print(quantile_bins)  # Output: ['Low', 'Low', 'Medium', 'High', 'High']

•	Pros: Ensures bins have approximately the same number of observations.
•	Cons: Bin boundaries may not be intuitive.

3.	Custom Binning:
•	Definition: Uses specific domain knowledge to create bins based on meaningful thresholds.
•	Use Case: Useful when there are well-known cutoff points in your data (e.g., age groups like “Child”, “Adult”, “Senior”).
•	Example:

data = [10, 15, 25, 35, 50]
custom_bins = [0, 18, 35, 60]
custom_labels = ['Child', 'Adult', 'Senior']
custom_categorized = pd.cut(data, bins=custom_bins, labels=custom_labels)
print(custom_categorized)

•	Pros: Tailored to your specific data context.
•	Cons: Requires domain knowledge and can be subjective.

Summary Table

Transformation Technique Use Case Pros Cons
Categorical to Numerical Label Encoding Ordinal data Simple and efficient May introduce artificial order
One-Hot Encoding Nominal data Avoids ordinal assumptions High dimensionality
Binary Encoding High-cardinality features Reduces dimensionality More complex to interpret
Frequency Encoding High-cardinality features Retains frequency information Can introduce bias
Target Encoding Supervised learning Captures relationship with target Risk of overfitting
Numerical to Categorical Binning Simplify continuous data Reduces complexity Loss of information
Quantile Binning Equal-sized category groups Balanced bins Non-intuitive boundaries
Custom Binning Domain-specific thresholds Tailored categorization Requires domain knowledge

The choice of method depends on the type of data and the specific problem you are trying to solve.

36
Q

What are the different types of encoders?

A

• LabelEncoder
–Encodes target labels with values between 0 and n_labels-1

• OneHotEncoder
–Performs a one-hot encoding of categorical features.

• OrdinalEncoder
–Performs an ordinal (integer) encoding of the categorical features
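
A minimal scikit-learn sketch contrasting the three encoders on toy data (the labels and feature values are assumptions):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

y = np.array(["yes", "no", "yes"])                   # target labels
X = np.array([["sunny"], ["overcast"], ["rainy"]])   # one categorical feature

print(LabelEncoder().fit_transform(y))               # [1 0 1] (classes are sorted alphabetically)
print(OrdinalEncoder().fit_transform(X))             # one integer per category
print(OneHotEncoder().fit_transform(X).toarray())    # one 0/1 column per category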

37
Q

What are the pros and cons of one-hot encoding and label encoding?

A

Label Encoder

Map a categorical variable described by n values into a numerical variable with values from 0 to n-1

For example, attribute Outlook would be replaced by a numerical variable with values 0, 1, and 2

Warning

Replacing a label with a number might influence the process in unexpected ways

In the example, by assigning 0 to overcast and 2 to sunny we give a higher weight to the latter
What happens if we then apply a regression model? Would the result change with different assigned values?
If we apply label encoding, we should store the mapping used
for each attribute to be able to map the encoded data into the original ones

One Hot Encoding

Map each categorical attribute with n values into n binary 0/1 variables
Each one describing one specific attribute values

For example, attribute Outlook is replaced by three binary variables Sunny, Overcast, and Rainy

Warning

One hot encoding assigns the same numerical value (1) to all the labels

But it can generate a massive amount of variables when applied to categorical variables with many values

38
Q

How can we verify the influence of a label encoding in a model?

A
39
Q

Why would we want to discretize numerical values?

A

Because the algorithm we are using may not work with continuous ones

40
Q

What are categorical embeddings?

A

Apply deep learning to map categorical variables into Euclidean spaces

Similar values are mapped close to each other in the embedding space thus revealing the intrinsic properties of the categorical variables

41
Q

What is data exploration? What are the key motivations to apply it?

A

Preliminary exploration of the data aimed at identifying their most relevant characteristics

What are the key motivations?
–Help to select the right tool for preprocessing and data mining
– Exploit humans’ abilities to recognize patterns not captured by automatic tools

42
Q

What are the goals of exploratory data analysis?

A

“An approach of analyzing data to summarize their main characteristics without using a statistical model or having formulated a prior hypothesis.”

“Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.”

43
Q

What are the data exploratory techniques?

A

Exploratory Data Analysis (as originally defined by Tukey), was mainly focused on:
– Visualization
–Clustering and anomaly detection (viewed as exploratory techniques)
– Note that, in data mining, clustering and anomaly detection are major
areas of interest, and not thought of as just exploration

44
Q

What are summary statistics? Give examples

A

What are they?
–Numbers that summarize properties of the data

Summarized properties include
– Location, mean, spread, skewness, standard deviation, mode, percentiles, etc.

45
Q

What is the frequency and mode of an attribute?

A

The frequency of an attribute value
–The percentage of time the value occurs in the data set
– For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.

• The mode of an attribute is the most frequent attribute value

• The notions of frequency and mode are typically used with categorical data

46
Q

What are the measurements of location of the data? Explain them.

A

The mean is the most common measure of the location of a set of points

However, the mean is very sensitive to outliers. Thus, the median or a trimmed mean is also commonly used

47
Q

What are percentiles? How are they calculated?

A

For continuous data, the notion of a percentile is very useful

p-th percentile
–Given an ordinal or continuous attribute x and a number p
–p-th percentile is a value xp of x such that p% of the observed values of x are less than xp

For instance, the 50th percentile is the value x50% such that 50% of all values of x are less than x50%
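
For example, with numpy (which uses linear interpolation between observations by default):

import numpy as np

x = np.array([15, 20, 35, 40, 50])
print(np.percentile(x, 50))   # 35.0 (the median)
print(np.percentile(x, 25))   # 20.0 (the first quartile)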

48
Q

What are the trimean and truncated mean?

A

Trimean
– It is the weighted mean of the first, second and third quartile

Truncated Mean
– Discards data above and below a certain percentile
– For example, below the 5th percentile and above the 95th percentile
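
A minimal sketch of both statistics, assuming the usual Tukey weights (Q1 + 2*Q2 + Q3)/4 for the trimean and scipy's trim_mean for the truncated mean:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 100])      # 100 is an outlier; the plain mean is 16.0

q1, q2, q3 = np.percentile(x, [25, 50, 75])
trimean = (q1 + 2 * q2 + q3) / 4              # weighted mean of the three quartiles

truncated = stats.trim_mean(x, 0.25)          # discard the lowest and highest 25% before averaging
print(trimean, truncated)                     # both resist the outlier (4.5 here)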

49
Q

What are the measurements of data spread?

A

The variance is the most common measure of the spread of a set of points

50
Q

What is correlation?

A

Given two attributes, measure how strongly one attribute implies the other, based on the available data

Use correlation measures to estimate how predictive one attribute is of another

51
Q

How can we calculate the correlation of different kinds of attributes?

A

Numerical Variables
– For two numerical variables, we can compute Pearson’s product moment coefficient

Ordinal Variables
–We can compute Spearman’s rank correlation coefficient

Categorical Variables
– We can compute the χ2 (chi-squared) statistic, which tests the hypothesis that A and B are independent

Binary Variables
– Compute the point-biserial correlation
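
A minimal scipy sketch of these measures on toy data (the arrays and the contingency table are assumptions):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(stats.pearsonr(x, y))              # Pearson's product moment coefficient (numerical)
print(stats.spearmanr(x, y))             # Spearman's rank correlation (ordinal)

table = np.array([[20, 10], [15, 25]])   # contingency table of two categorical attributes
print(stats.chi2_contingency(table))     # chi-squared test of independence

binary = np.array([0, 0, 1, 1, 1])
print(stats.pointbiserialr(binary, y))   # point-biserial correlation (binary vs. numerical)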

52
Q

What is the difference between causation and correlation?

A

Correlation does not imply causation
– Just because the value of one attribute is highly predictive of the value of another doesn't mean that forcing the first variable to take on a particular value will cause the second to change

Causality has a direction, while correlation typically doesn’t
– Correlation between high income and owning a Ferrari
– Giving a person a Ferrari doesn’t affect their income
– But increasing their income may make them more likely to buy a Ferrari

Confounding variables can cause attributes to be correlated:
– High heart rate and sweating are correlated with each other since they tend to both happen during
exercise (confounder)
– Causing somebody to sweat by putting them in a sauna won't necessarily raise their heart rate (it does a little, but not as much as exercise)
– And giving them beta-blockers to lower their heart rate might not prevent sweating (it might a little, but again not like stopping exercising)

53
Q

What are outliers? How can we detect them? How can we handle them?

A

What are outliers?
– Data objects that do not comply with the general behavior or model of the data, that is, values that
appear as anomalous
– Most data mining methods consider outliers noise or exceptions.

Outliers may be detected using
– Manual inspection and knowledge of reasonable values.
– Statistical tests that assume a distribution or probability model for the data
– Distance measures where objects that are a substantial distance from any other cluster are considered outliers
– Deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group

How do we manage outliers?
– Outliers are typically filtered out by eliminating the data points containing them
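
A minimal numpy sketch of two simple detection rules, the z-score rule and the 1.5*IQR rule; the thresholds are conventional choices, not prescribed above:

import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])           # 95 looks anomalous

# z-score rule: flag points more than 2 standard deviations from the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 2])                          # [95]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])   # [95]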

54
Q

What is normalization? What are some approaches for doing it?

A

We might need to normalize attributes that have very different scales (e.g., age vs income)

Range normalization converts all values to the range [0,1]

Standard Score Normalization forces variables to have mean of 0 and standard deviation of 1.
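
A minimal numpy sketch of both normalizations (scikit-learn's MinMaxScaler and StandardScaler compute the same quantities):

import numpy as np

age = np.array([18.0, 35.0, 52.0, 70.0])

# Range (min-max) normalization: rescale to [0, 1]
range_norm = (age - age.min()) / (age.max() - age.min())

# Standard score (z-score) normalization: mean 0, standard deviation 1
z_norm = (age - age.mean()) / age.std()

print(range_norm, z_norm, sep="\n")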

55
Q

What is the Standard Score Normalization? How do we calculate it? What does it imply?

A
56
Q

What is the importance of visualization?

A

• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported

• Data visualization is one of the most powerful and appealing techniques for data exploration
–Humans have a well-developed ability to analyze large amounts of information that is presented visually
–Can detect general patterns and trends
–Can detect outliers and unusual patterns

57
Q

What are histograms?

A

• They are a graphical representation of the distribution of data
• They are representations of tabulated frequencies depicted as adjacent rectangles, erected over
discrete intervals (bins).
• Their areas are proportional to the frequency of the observations in the interval.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins

58
Q

What are box plots?

A

A box plot (also known as a box-and-whisker plot) is a graphical representation used in statistics to display the distribution of a dataset. It provides a summary of the dataset through five key summary statistics:

1.	Minimum: The smallest value in the dataset (excluding outliers).
2.	First Quartile (Q1): The 25th percentile, meaning 25% of the data is below this value.
3.	Median (Q2): The 50th percentile or the middle value of the dataset.
4.	Third Quartile (Q3): The 75th percentile, meaning 75% of the data is below this value.
5.	Maximum: The largest value in the dataset (excluding outliers).

In addition to these, outliers (extreme values) are sometimes plotted as individual points beyond the “whiskers,” which extend from Q1 to the minimum and from Q3 to the maximum.

The box represents the interquartile range (IQR), which is the range between Q1 and Q3. It visually shows where the bulk of the data lies, with the median typically marked inside the box. This helps to easily see how the data is skewed or whether it’s symmetrically distributed.

59
Q

How can we visualize more than two dimensions at the same time?

A

Three main approaches

Visualize several combinations of two-dimension plots (e.g., scatter plot matrix)

Visualize all the dimensions at once (e.g., heatmaps, spider plots, and Chernoff faces)

Project the data into a smaller space and visualize the projected data

60
Q

How can we project high-dimensional data into fewer dimensions?

A

When projecting high-dimensional data into fewer dimensions we can either

Find a linear projection, e.g., use Principal Component Analysis

Find a non-linear projection
e.g., use t-distributed Stochastic Neighbor Embeddings (t-SNE)

61
Q

What is Principal Component Analysis? When does it work? What can affect it?

A

• Typically applied to reduce the number of dimensions of data (feature extraction)
• The goal of PCA is to find a projection that captures the largest amount of variation in data
• Given N data vectors from n-dimensions, find k<n orthogonal vectors (the principal components) that can be used to represent data
• Works for numeric data only and it is affected by scale, so data usually need to be rescaled before applying PCA

62
Q

What are the steps for applying PCA?

A

• Steps to apply PCA
–Normalize input data
–Compute k orthonormal (unit) vectors, i.e., principal components
–Each input data point can be written as a linear combination of the k principal component vectors

• The principal components are sorted in order of decreasing “significance” or strength

• Data size can be reduced by eliminating the weak components, i.e., those with low variance.

• Using the strongest principal components, it is possible to reconstruct a good approximation of the original data
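
A minimal scikit-learn sketch of these steps on synthetic data (the dataset size and the number of components kept are assumptions):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 points in 5 dimensions

X_scaled = StandardScaler().fit_transform(X)   # normalize first: PCA is affected by scale
pca = PCA(n_components=2)                      # keep the 2 strongest components
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)           # variance captured by each component
X_approx = pca.inverse_transform(X_reduced)    # approximate reconstruction of the (scaled) data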

63
Q

What is t-distributed stochastic neighbor embedding?

A

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm used for dimensionality reduction, primarily for visualizing high-dimensional datasets in a lower-dimensional space, typically 2D or 3D.

Here’s how t-SNE works:

1.	Input: You start with a high-dimensional dataset where each data point may have many features. For example, you might have a dataset where each sample is represented by 100 dimensions (or features).
2.	Objective: The goal of t-SNE is to reduce the dataset to 2 or 3 dimensions while preserving the structure or relationships between the points as much as possible.
3.	Process:
•	t-SNE computes pairwise similarities between the points in the high-dimensional space.
•	It then tries to find a mapping to a lower-dimensional space where the points that are similar in the high-dimensional space remain close together, while dissimilar points are mapped farther apart.
•	In the high-dimensional space, similarities between points are modeled using Gaussian distributions. In the low-dimensional space, similarities are modeled using t-distributions (which have heavier tails than Gaussian distributions, helping to prevent overcrowding of points).
4.	Result: The result is a 2D or 3D map where points that are close together in the high-dimensional space remain close in the reduced space, allowing for visualization of clusters, groupings, or other patterns.

Key Characteristics:

•	Non-linear dimensionality reduction: Unlike techniques like PCA (Principal Component Analysis), t-SNE is non-linear, meaning it can capture complex relationships between data points.
•	Focus on local structure: t-SNE excels at preserving the local structure of the data (i.e., points that are close in high-dimensional space remain close in lower dimensions). However, it may not always preserve the global structure of the dataset well.
•	Visualization: It is most commonly used to create 2D plots for visualizing high-dimensional data like images, text embeddings, or genomic data.

Applications:

t-SNE is widely used in:

•	Image recognition (e.g., visualizing features learned by deep neural networks).
•	Natural language processing (e.g., visualizing word embeddings).
•	Bioinformatics (e.g., clustering gene expression data).

One important caveat with t-SNE is that it can be computationally intensive and sometimes hard to interpret, especially with very large datasets.

64
Q

What are the steps to apply t-SNE?

A

t-SNE converts distances between data points into joint probabilities, then models the original points by mapping them to low-dimensional map points such that the positions of the map points preserve the structure of the data

  1. Define a probability distribution over pairs of high-dimensional data points so that:
    –Similar data points have a high probability of being picked
    –Dissimilar points have an extremely small probability of being picked
  2. Define a similar distribution over the points in the map space
    – Minimize the Kullback–Leibler divergence between the two distributions with respect to the
    locations of the map points
    – To minimize the score, it applies gradient descent
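
A minimal scikit-learn sketch; the synthetic dataset and the perplexity value are assumptions:

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 points in 50 dimensions

# Map to 2D; perplexity controls the effective number of neighbors considered
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                 # (200, 2)
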
65
Q

What is association rule mining?

A

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in a transaction

66
Q

What are the evaluation metrics of association rule mining? What do they mean?

A

Support
Fraction of transactions that contain both X and Y

Confidence
Measures how often items in Y appear in transactions that contain X

Given: X => Y
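
A minimal worked example on hypothetical market-basket transactions, computing both measures for the rule {bread} => {milk}:

# Each transaction is the set of items bought together
transactions = [{"bread", "milk"}, {"bread"}, {"milk", "beer"},
                {"bread", "milk", "beer"}, {"bread", "milk"}]

X, Y = {"bread"}, {"milk"}
n = len(transactions)

support_xy = sum((X | Y) <= t for t in transactions) / n   # fraction of transactions containing X and Y
support_x = sum(X <= t for t in transactions) / n          # fraction of transactions containing X
confidence = support_xy / support_x                        # conf(X => Y) = sup(X U Y) / sup(X)

print(support_xy, confidence)   # 0.6 0.75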

67
Q

What is the end goal of association rule mining?

A

Given a set of transactions, find all rules with support >= minsup threshold and confidence >= minconf threshold

68
Q

What are the steps taken to mine associations?

A

• Frequent Itemset Generation
–Generate all itemsets whose support >= minsup

• Rule Generation
–Generate high confidence rules from frequent itemsets
–Each rule is a binary partitioning of a frequent itemset

69
Q

What are frequent itemsets?

A

Frequent Itemset
–An itemset whose support is greater than or equal to the minsup threshold

70
Q

What are the extremes in the choice process for minimum sup and confidence?

A
71
Q

How do we set an appropriate minimum sup?

A
72
Q

What is the lift of an association rule? How is it calculated? What the values of lift mean?

A

• Lift is the ratio of the observed joint probability of X and Y to the expected joint probability if they were statistically independent,
lift(X => Y) = sup(X ∪ Y) / (sup(X) · sup(Y)) = conf(X => Y) / sup(Y)

• Lift is a measure of the deviation from stochastic independence
(if it is 1 then X and Y are independent)

• Lift also measures the surprise of the rule. A lift close to 1 means that the support of a rule is expected considering the supports of its components.

• We typically look for values of lift that are much larger (i.e., above expectation) or much smaller (i.e., below expectation) than one.

73
Q

How can we summarize the itemsets obtained?

A

• A frequent itemset X is called maximal if it has no frequent supersets
• The set of all maximal frequent itemsets, given as M = {X | X ∈ F and !∃Y ⊃ X, such that Y ∈ F}
• M is a condensed representation of the set of all frequent itemset F, because we can determine whether any itemset is frequent or not using M
• If there is a maximal itemset Z such that X ⊆ Z, then X must be frequent, otherwise Z cannot be frequent
• However, M alone cannot be used to determine sup(X), we can only use to have a lower-bound, that is, sup(X)>=sup(Z) if X ⊆ Z ∈M.

74
Q

What is the closed frequent itemset? What is the minimal generator itemset?

A

• An itemset X is closed if all supersets of X have strictly less support, that is, sup(X) > sup(Y), for all Y ⊃ X
• That is, C = {X | X ∈ F and !∃Y⊃X, such that sup(X)=sup(Y) }
• The set of all closed frequent itemsets C is a condensed representation, as we can determine whether an itemset X is frequent, as well as the exact support of X using C alone

• A frequent itemset X is a minimal generator if it has no subsets with the same support: G = {X | X ∈ F and !∃Y ⊂ X, such that sup(X) = sup(Y)}
• Thus, all subsets of X have strictly higher support, that is, sup(X) < sup(Y)

75
Q

What are the classifications of the itemsets in the example?

A

Refer to the DM1 image

76
Q

How can we model bipartite graphs as association rules?

A
77
Q

What is the difference between a set and a subsequence? What about a subsequence and a consecutive subsequence?

A
78
Q

Find all the subsequences of the example

A
79
Q

What is sequence pattern mining? What is the difference between it and association rule mining?

A
80
Q

When is a sequence contained in another one? What is the difference between size and length?

A
81
Q

Define the sequence for each customer and calculate the support of the example.

A
82
Q

What is clustering? What is its basic principle?

A

Clustering algorithms group a collection of data points into “clusters” according to some distance measure

Data points in the same cluster should have a small distance from one another

Data points in different clusters should be at a large distance from one another

• A cluster is a collection of data objects
–Similar to one another within the same cluster
–Dissimilar to the objects in other clusters

• Cluster analysis
–Given a set of data points, try to understand their structure
–Finds similarities between data according to the characteristics found in the data
–Groups similar data objects into clusters
–It is unsupervised learning since there are no predefined classes

83
Q

What is a good clustering?

A

• A good clustering consists of high-quality clusters with
–High intra-class similarity
–Low inter-class similarity

• The quality of a clustering result depends on both
–The similarity measure used by the method and
–Its implementation (the algorithms used to find the clusters)
–Its ability to discover some or all the hidden patterns

• Evaluation
–Various measures of intra/inter cluster similarity
–Manual inspection
–Benchmarking on existing labels

84
Q

What are the distance/similarity metrics? How do we calculate them?

A

• Dissimilarity/Similarity metric
– Similarity expressed in terms of distance function, typically a metric, d(i, j)
– Definitions of distance functions are usually very different for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables
–Weights can/should be associated with different variables based on applications and data semantics

• Cluster quality measure
– Separate from distance, there is a “quality” function that measures the “goodness” of a cluster
– It is hard to define “similar enough” or “good enough”
–The answer is typically highly subjective

85
Q

What are the standard distance functions?

A

Euclidean distance is the typical function used to compute the similarity between two examples

Another popular metric is city-block (Manhattan) metric, distance is the sum of absolute differences

The Jaccard distance is a measure of dissimilarity between two sets. It is based on the Jaccard index, which measures the similarity between two sets. The Jaccard index is calculated as one minus the size of the intersection of the sets divided by the size of the union of the sets.

The Hamming distance is a measure of the difference between two equal-length vectors. It counts the number of positions at which the corresponding symbols differ (sometimes divided by the length to give a normalized value).

The edit distance between a string x=x1x2…xn and a string y=y1y2…ym is the smallest number of insertions and deletions of single characters that will transform x into y.

Cosine distance (or cosine dissimilarity) is a measure of how different two non-zero vectors are in terms of their orientation, rather than their magnitude. It is derived from the cosine similarity, which measures the cosine of the angle between two vectors in a multi-dimensional space.
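
A minimal scipy sketch of these distances on toy vectors (edit distance is omitted, as it is not part of scipy.spatial.distance):

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.0, 3.0])

print(distance.euclidean(a, b))   # straight-line distance
print(distance.cityblock(a, b))   # Manhattan distance: sum of absolute differences
print(distance.cosine(a, b))      # 1 - cosine similarity (orientation, not magnitude)

u = np.array([1, 0, 1, 1])        # binary vectors encoding two sets
v = np.array([1, 1, 0, 1])
print(distance.jaccard(u, v))     # 1 - |intersection| / |union|
print(distance.hamming(u, v))     # fraction of positions where the vectors differ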

86
Q

Detail the explanation of the cosine similarity and its importance

A

Why Cosine Similarity is Important

Cosine similarity is widely used in fields like text mining, natural language processing (NLP), and recommendation systems because it focuses on the orientation (direction) of vectors rather than their magnitude. This makes it particularly suited for comparing high-dimensional, sparse datasets like text or word embeddings.

Key Advantages:
1. Magnitude Independence: It ignores the magnitude of the vectors, which is useful for text data where documents may have different lengths but similar content.
• Example: “Hello world” vs. “Hello world, hello again” – their directions may still align despite different word frequencies.
2. High-Dimensional Data: Handles sparse and high-dimensional vectors efficiently, especially in text analysis.
3. Fast Computation: Relatively simple to compute, making it ideal for large-scale systems like search engines and recommendation algorithms.

Applications:

•	Text Similarity: Comparing document vectors in NLP (e.g., TF-IDF or word embeddings).
•	Recommendation Systems: Matching users with similar preferences or products with similar attributes.
•	Clustering: Grouping similar documents or feature vectors based on their direction.

How to Calculate Cosine Similarity

The cosine similarity between two vectors \mathbf{A} and \mathbf{B} is given by the formula:

\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|}

Where:
• \mathbf{A} \cdot \mathbf{B} : Dot product of vectors \mathbf{A} and \mathbf{B} .
• |\mathbf{A}| : Magnitude (norm) of vector \mathbf{A} , calculated as \sqrt{\sum_{i=1}^n A_i^2} .
• |\mathbf{B}| : Magnitude (norm) of vector \mathbf{B} , calculated as \sqrt{\sum_{i=1}^n B_i^2} .

Steps to Calculate:

1.	Compute the Dot Product ( \mathbf{A} \cdot \mathbf{B} ):

\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^n A_i \cdot B_i

This sums up the element-wise product of the two vectors.
2. Compute the Magnitudes ( |\mathbf{A}| and |\mathbf{B}| ):

|\mathbf{A}| = \sqrt{\sum_{i=1}^n A_i^2}, \quad |\mathbf{B}| = \sqrt{\sum_{i=1}^n B_i^2}

3.	Divide the Dot Product by the Product of Magnitudes:

\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|}

The resulting value lies in [-1, 1]; the cosine distance is then 1 - \cos(\theta).
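A short worked example of the three steps above in Python (the vectors A and B are made up for illustration):

import math

A = [3, 2, 0, 5]
B = [1, 0, 0, 0]

# Step 1: dot product
dot = sum(a * b for a, b in zip(A, B))         # 3

# Step 2: magnitudes
norm_A = math.sqrt(sum(a * a for a in A))      # sqrt(38) ≈ 6.164
norm_B = math.sqrt(sum(b * b for b in B))      # 1.0

# Step 3: divide the dot product by the product of the magnitudes
cos_sim = dot / (norm_A * norm_B)              # ≈ 0.487
print(cos_sim)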
87
Q

Why is it important to normalize our attributes when doing clustering?

A

Normalization of attributes is crucial in clustering because it ensures that all features contribute equally to the distance measures used in clustering algorithms, such as K-means or hierarchical clustering. Here are the key reasons why normalization is important:

1.	Equal Weighting of Features: Different attributes can have different scales (e.g., one attribute might range from 1 to 1000, while another ranges from 0 to 1). If features are not normalized, those with larger numerical ranges will dominate the distance calculations, making clustering more biased towards these attributes.
2.	Improved Distance Calculations: Clustering algorithms rely on distance metrics (such as Euclidean distance) to group similar data points together. Without normalization, features with larger scales can disproportionately influence the distance, distorting the true relationships between data points.
3.	Better Interpretation of Clusters: Normalizing ensures that each attribute contributes similarly to the clustering results, leading to clusters that better represent the underlying structure of the data. This improves the interpretability and meaningfulness of the resulting clusters.
4.	Handling Features with Different Units: When attributes have different units (e.g., temperature in Celsius and weight in kilograms), normalization allows these features to be comparable by bringing them to a common scale.
5.	Prevents Bias in Algorithms: Algorithms like K-means can be especially sensitive to the range of input features. Normalization helps prevent bias towards any specific attribute, leading to more balanced and accurate clustering.

Common normalization techniques include min-max scaling (rescaling features to a [0,1] range) or z-score normalization (standardizing data to have a mean of 0 and a standard deviation of 1).
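A minimal sketch of both techniques in plain Python (the attribute names and values are illustrative):

def min_max_scale(values):
    # rescale to the [0, 1] range
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # standardize to mean 0 and standard deviation 1
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

ages = [18, 35, 52, 70]                      # years
incomes = [20000, 45000, 90000, 250000]      # e.g. euros

# After scaling, both attributes contribute comparably to a Euclidean distance
print(min_max_scale(ages))
print(z_score(incomes))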

88
Q

What are the requisites for clustering algorithms?

A

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

89
Q

What is the curse of dimensionality?

A

As the number of dimensions in a dataset increases, distance measures become increasingly meaningless.

Increasing the number of dimensions spreads out the points until, in very high dimensions, they are almost equidistant from each other.

In high dimensions, almost all pairs of points are equally far away from one another.
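A small simulation can illustrate this, assuming NumPy is available: as the dimensionality grows, the ratio between the smallest and the largest pairwise distance approaches 1, so “near” and “far” lose their meaning.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((100, d))                # 100 random points in the unit hypercube
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    dists = dists[np.triu_indices(100, k=1)]     # keep each pair once, drop self-distances
    print(d, round(float(dists.min() / dists.max()), 3))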

90
Q

What is the problem of clustering high-dimensional data? How does it relate to the curse of dimensionality? And how can we deal with high-dimensional data?

A

Clustering high-dimensional data presents several challenges, which are closely tied to the curse of dimensionality. The curse of dimensionality refers to the various issues that arise when working with data in high-dimensional spaces, where the number of dimensions (features or attributes) is large. Here’s how it affects clustering:

1. Distance Measures Become Less Meaningful
  • In high-dimensional spaces, most data points become equidistant from one another. This is because, as the number of dimensions increases, the volume of the space grows exponentially, and points are spread out more uniformly across this space.
  • Impact on Clustering: Since clustering algorithms (e.g., K-means) rely on distance measures (like Euclidean distance) to group similar points together, the distinction between “near” and “far” points diminishes in high dimensions, making it difficult to form meaningful clusters.

2. Increased Sparsity of Data
  • High-dimensional spaces are often sparse, meaning that the density of data points decreases as dimensionality increases. This sparsity can make it hard to detect patterns or clusters because there may be very few neighboring points in any given region of the space.
  • Impact on Clustering: In clustering, algorithms group nearby points into clusters based on density or proximity. When the data is sparse, it becomes harder to find well-defined clusters, leading to noise and poor clustering performance.

3. Overfitting and Noise Sensitivity
  • With a large number of dimensions, irrelevant or noisy features are often introduced. As the dimensionality grows, the likelihood of some dimensions being irrelevant (i.e., not contributing to the clustering task) also increases.
  • Impact on Clustering: Clustering algorithms can easily overfit to these noisy or irrelevant dimensions, creating misleading clusters. The clusters may reflect the noise rather than the true structure of the data.

4. Increased Computational Complexity
  • The computational cost of clustering algorithms grows with the number of dimensions, as calculating distances or running iterative algorithms (e.g., K-means) becomes more expensive in high-dimensional spaces.
  • Impact on Clustering: High-dimensional data can make clustering algorithms slow or even infeasible to run efficiently, particularly for large datasets.

Addressing the Curse of Dimensionality in Clustering:

To handle the curse of dimensionality in clustering, several techniques can be applied:

•	Dimensionality reduction: Techniques like Principal Component Analysis (PCA) or t-SNE can be used to reduce the number of dimensions while retaining most of the variance or structure in the data (see the sketch after this answer).
•	Feature selection: Identify and use only the most relevant features for clustering, removing irrelevant or redundant dimensions.
•	Distance-based alternatives: Some clustering algorithms use distance measures that are more robust in high dimensions, such as cosine similarity or density-based clustering methods (e.g., DBSCAN).

In summary, the curse of dimensionality makes clustering high-dimensional data challenging due to the loss of meaningful distances, data sparsity, noise sensitivity, and increased computational cost. Reducing the dimensionality or using alternative clustering approaches helps mitigate these issues.
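A hedged sketch of the dimensionality-reduction idea from the list above, assuming scikit-learn is available (the data, the number of components, and the number of clusters are illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((500, 200))                           # 500 points with 200 (mostly noisy) features

X_reduced = PCA(n_components=10).fit_transform(X)    # project onto 10 principal components
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_reduced)
print(labels[:20])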

91
Q

How can we use brute force for association rule generation? Suggest pseudocode to do it.

A

Brute-force approach
–List all possible association rules
–Compute the support and confidence for each rule
–Prune rules that fail the minsup and minconf thresholds
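Since the card asks for pseudocode, here is a minimal brute-force sketch in Python (min_sup, min_conf, and the toy transactions are illustrative):

from itertools import chain, combinations

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def brute_force_rules(transactions, min_sup, min_conf):
    transactions = [frozenset(t) for t in transactions]
    items = frozenset(chain.from_iterable(transactions))
    rules = []
    # list all possible association rules: every itemset L with at least two items...
    for k in range(2, len(items) + 1):
        for L in map(frozenset, combinations(items, k)):
            sup_L = support(L, transactions)
            if sup_L < min_sup:                      # prune rules failing minsup
                continue
            # ...and every non-empty proper subset f of L as a rule f => L \ f
            for j in range(1, len(L)):
                for f in map(frozenset, combinations(L, j)):
                    conf = sup_L / support(f, transactions)
                    if conf >= min_conf:             # prune rules failing minconf
                        rules.append((set(f), set(L - f), sup_L, conf))
    return rules

transactions = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}]
for lhs, rhs, sup, conf in brute_force_rules(transactions, 0.5, 0.7):
    print(lhs, "=>", rhs, round(sup, 2), round(conf, 2))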

92
Q

How can we use the level-wise (Apriori) approach to calculate association rules? Give pseudocode for doing it

A

If an itemset is frequent, then all of its subsets must also be frequent

• The Apriori principle holds due to the following (anti-monotone) property of the support measure: the support of an itemset never exceeds the support of its subsets
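A compact level-wise (Apriori-style) sketch in Python that uses the principle above to prune candidates (helper names and the toy data are illustrative):

from itertools import combinations

def apriori(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    # level 1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # candidate generation: join frequent (k-1)-itemsets...
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ...and prune candidates with an infrequent (k-1)-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # count supports with one pass over the database
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori(transactions, min_count=3))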

93
Q

How can we use the Eclat algorithm (tidset approach) to calculate association rules? Give pseudocode for doing it

A

• Leverages the tidsets directly for support computation.

• The itemset support is computed by intersecting the tidsets of suitably chosen subsets.

• Given t(X) and t(Y) for any two frequent itemsets X and Y, then t(XY) = t(X) ∩ t(Y)

• And sup(XY) = |t(XY)|

Eclat can be further improved by using diffsets (differences of tidsets).
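A minimal Eclat-style sketch in Python built on vertical tidsets (names and the toy data are illustrative; the diffset optimization is omitted):

def eclat(transactions, min_count):
    # vertical database: 1-itemset -> tidset (set of transaction ids)
    vertical = {}
    for tid, t in enumerate(transactions):
        for item in t:
            vertical.setdefault(frozenset([item]), set()).add(tid)
    vertical = {X: tids for X, tids in vertical.items() if len(tids) >= min_count}

    frequent = {}

    def mine(level):
        # level: same-size frequent itemsets sharing a common prefix, mapped to their tidsets
        items = sorted(level, key=sorted)        # fix an order so each pair is tried once
        for i, X in enumerate(items):
            frequent[X] = len(level[X])          # sup(X) = |t(X)|
            next_level = {}
            for Y in items[i + 1:]:
                tidset = level[X] & level[Y]     # t(XY) = t(X) ∩ t(Y)
                if len(tidset) >= min_count:
                    next_level[X | Y] = tidset
            if next_level:
                mine(next_level)

    mine(vertical)
    return frequent

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print({tuple(sorted(s)): n for s, n in eclat(transactions, min_count=2).items()})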

94
Q

How can we use the FPGrowth algorithm (frequent pattern tree approach) to calculate association rules? Give pseudocode for doing it

A

• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
– Highly condensed, but complete for frequent pattern mining
–Avoid costly database scans

• Use an efficient, FP-tree-based frequent pattern mining method

• A divide-and-conquer methodology: decompose mining tasks into smaller ones
• Avoid candidate generation: sub-database test only

• Major Steps to mine FP-tree
–Construct the frequent pattern tree
–For each frequent item i compute the projected FP-tree
–Recursively mine conditional FP-trees and grow frequent patterns obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the patterns

• We start from the same transaction database used in the previous examples and sort all the transactions based on the frequencies of their items
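A compact FPGrowth-style sketch in Python following the major steps above (a simplified illustration, not the exact course pseudocode):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_count):
    # first scan: count item frequencies and keep only the frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}

    root, header = FPNode(None, None), defaultdict(list)
    # second scan: insert each transaction with items sorted by descending frequency
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)       # node-link in the header table
            node = node.children[item]
            node.count += 1
    return header, freq

def fpgrowth(transactions, min_count, suffix=()):
    header, freq = build_tree(transactions, min_count)
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):   # least frequent first
        pattern = (item,) + suffix
        patterns[pattern] = freq[item]
        # conditional pattern base: the prefix path of every node holding `item`,
        # repeated once per count of that node
        cond_base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_base.extend([path] * node.count)
        # recursively mine the conditional (projected) FP-tree
        patterns.update(fpgrowth(cond_base, min_count, pattern))
    return patterns

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(fpgrowth(transactions, min_count=2))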

95
Q

What are the benefits of FPGrowth?

A

• Preserve complete information for frequent pattern mining
• Reduce irrelevant info—infrequent items are gone
• The more frequently occurring items are more likely to be shared
• Never larger than the original database (not counting node-links and the count field)
• No candidate generation, no candidate test, no repeated scan of entire database

96
Q

What do we mean by rule generation? How can we efficiently generate associations rules? Give an example.

A

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L \ f satisfies the minimum confidence requirement

• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)

To generate rules more efficiently, keep in mind that:

• Confidence does not have an anti-monotone property

• c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)

• However, the confidence of rules generated from the same itemset has an anti-monotone property

• For L = {A,B,C,D}: c(ABC ⇒ D) >= c(AB ⇒ CD) >= c(A ⇒ BCD)

• Confidence is anti-monotone in the number of items on the right-hand side of the rule

Refer to the DIQ1 figure for an example
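A hedged sketch of this rule-generation step in Python; it assumes the frequent itemsets and their support counts have already been computed (for instance by one of the algorithms above), and all names are illustrative. For brevity it checks every candidate consequent; a full implementation would skip supersets of consequents that already failed minconf, exploiting the anti-monotonicity described above.

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: dict mapping frozenset itemsets to their support counts
    rules = []
    for L, sup_L in frequent.items():
        if len(L) < 2:
            continue
        # every non-empty proper subset of L can serve as a consequent
        for size in range(1, len(L)):
            for rhs in map(frozenset, combinations(L, size)):
                lhs = L - rhs
                conf = sup_L / frequent[lhs]     # c(lhs => rhs) = sup(L) / sup(lhs)
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

frequent = {frozenset("A"): 4, frozenset("B"): 4, frozenset("C"): 4,
            frozenset("AB"): 3, frozenset("AC"): 3, frozenset("BC"): 3,
            frozenset("ABC"): 2}
for lhs, rhs, conf in generate_rules(frequent, min_conf=0.7):
    print(lhs, "=>", rhs, round(conf, 2))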

97
Q

Give some different clustering methods

A

• Hierarchical vs. point assignment

• Numeric and/or symbolic data

• Deterministic vs. probabilistic

• Exclusive vs. overlapping

• Hierarchical vs. flat

• Top-down vs. bottom-up