Theory Flashcards

1
Q

Why do we need data mining?

A

Data mining developed primarily in response to the increasing amounts of data being generated and the need to extract meaningful insights from this vast information. Several factors contributed to its development:

  1. Explosion of Data: With the advent of digital technologies, businesses, governments, and organizations began generating huge amounts of data from various sources such as transactions, social media, sensors, and web activity. Traditional methods were insufficient to handle and analyze these massive datasets.
  2. Advances in Computing Power: The rise of more powerful computers, storage systems, and distributed computing made it possible to process large volumes of data quickly and efficiently, enabling the complex calculations required for data mining.
  3. Need for Competitive Advantage: Businesses sought new ways to gain insights into customer behavior, market trends, and operational efficiencies to remain competitive. Data mining helped in identifying patterns, predicting trends, and improving decision-making processes.
  4. Machine Learning and AI Growth: With the development of machine learning algorithms, data mining evolved into a more sophisticated discipline. These algorithms helped in automating the discovery of patterns and trends in data, leading to actionable insights.
  5. Interdisciplinary Research: Data mining drew from various fields like statistics, computer science, artificial intelligence, and database management, which allowed for the creation of more powerful tools and methodologies.
  6. Applications Across Domains: Industries such as finance, healthcare, marketing, and manufacturing realized the potential of data mining in predicting outcomes, improving processes, and uncovering hidden relationships within data, further driving its development.

In essence, data mining arose as a solution to manage and make sense of the data deluge and to provide insights that could be turned into a competitive edge or valuable knowledge.

2
Q

What are some examples of data mining usage?

A

Customer churn
– Given customer information for the past months, understand which customers I lost, or predict which customers I might lose

Credit assessment
– Given a loan application, predict whether the bank should approve the loan

Customer segmentation
– Given information about the customers, identify interesting groups among them

Community detection
– Who is discussing what?

3
Q

What led the Big Data phenomenon to occur? What are the consequences of Big Data?

A

The Big Data phenomenon occurred due to several key factors that have shaped the modern technological landscape. Here’s an overview of the causes and consequences:

Causes:
  1. Explosion of Data Generation:
    • The rise of the internet, mobile devices, social media, and cloud computing has led to unprecedented data generation. Every digital interaction, from emails and social posts to e-commerce transactions, creates data.
    • The growth of Internet of Things (IoT) devices, such as sensors, wearables, and smart appliances, added another layer of continuous data creation from physical objects.
  2. Increased Data Storage Capacity:
    • Advancements in storage technologies, such as cloud storage, distributed databases, and data centers, made it possible to store vast amounts of data cheaply and efficiently. This, combined with improved data compression techniques, enabled the preservation of enormous datasets.
  3. Advances in Data Processing Technologies:
    • Tools like Hadoop, Apache Spark, and NoSQL databases allowed for scalable processing of unstructured and semi-structured data across distributed systems. These technologies made it feasible to analyze vast datasets in a relatively short time, unlocking the potential of Big Data.
  4. Proliferation of Social Media and Digital Platforms:
    • Platforms like Facebook, Twitter, Instagram, and YouTube generate a massive amount of user-generated content. The interactions, likes, shares, and views on these platforms became key sources of data for analytics.
  5. E-commerce and Digital Transactions:
    • Online shopping platforms such as Amazon and Alibaba, along with digital payment systems, track and record every consumer transaction, producing detailed datasets related to purchasing behaviors and trends.
  6. Advancements in Machine Learning and Artificial Intelligence:
    • AI and machine learning require large amounts of data to train models effectively. As these fields advanced, the need for massive datasets to improve algorithms accelerated the Big Data movement.
Consequences:
  1. Improved Decision-Making:
    • Organizations can leverage Big Data analytics to make data-driven decisions, leading to more accurate forecasts, optimized operations, and personalized services. In business, this translates into enhanced customer insights, market analysis, and operational efficiencies.
  2. Personalization and Predictive Analytics:
    • Companies now have the ability to customize products and services to individual preferences. For example, streaming services like Netflix and Spotify use Big Data to personalize content recommendations, and e-commerce platforms use it to suggest products.
    • Predictive analytics can anticipate future trends based on historical data, which helps industries like finance, healthcare, and marketing improve outcomes.
  3. Privacy and Security Concerns:
    • The vast amounts of personal data being collected have led to increased concerns about privacy violations and data breaches. Organizations are now expected to protect sensitive data, leading to stricter regulations like the GDPR (General Data Protection Regulation) in Europe and other privacy laws globally.
  4. Ethical and Bias Issues:
    • With the extensive use of data in decision-making, ethical concerns have surfaced, particularly around how data is collected, processed, and used. Algorithms trained on biased data can perpetuate existing inequalities, especially in areas like hiring, criminal justice, and lending.
  5. Impact on Business Models:
    • Big Data has driven the shift to data-centric business models, where data is a valuable asset. Companies like Google, Amazon, and Facebook capitalize on user data to generate revenue through targeted advertising, changing the nature of competition in the digital economy.
  6. Job Displacement and Creation:
    • While Big Data has created new roles in data science, analytics, and IT infrastructure, it has also led to automation in several industries. Tasks that were traditionally performed by humans are now being handled by AI and data-driven systems, leading to both job displacement and the creation of new tech-driven jobs.
  7. Advances in Healthcare:
    • In healthcare, Big Data has led to significant advancements such as precision medicine, where treatments are tailored to individual patients based on genetic, environmental, and lifestyle factors. It has also helped in epidemic prediction and management, as seen during the COVID-19 pandemic.
  8. Efficiency Gains in Various Industries:
    • Sectors like manufacturing, logistics, and supply chain management benefit from Big Data analytics by optimizing production processes, reducing waste, and improving delivery times.

The Big Data phenomenon arose from the exponential increase in data generated by digital and connected technologies, coupled with advancements in storage and processing capabilities. The consequences are vast, ranging from enhanced decision-making and personalized experiences to concerns around privacy, ethics, and job displacement, fundamentally reshaping industries and society.

4
Q

When can we say a computer learns from experience?

A

A computer program is said to learn from experience E with respect to some class
of tasks T and a performance measure P,
if its performance at tasks in T, as measured by P, improves with experience E.

5
Q

What are the machine learning paradigms?

A

Suppose we have the experience E encoded as a dataset,
D = {x1, x2, x3, …, xn}

• Supervised Learning
– Given the desired outputs t1, t2, …, tn, learns to produce the correct output for a new set of inputs

• Unsupervised learning
– Exploits regularities in D to build a representation to be used for reasoning or prediction

• Reinforcement learning
– Producing actions a1, a2, …, an which affect the environment, and receiving rewards r1, r2, …, rn, learns to act in order to maximize rewards in the long term

6
Q

What is data mining?

A

Data Mining is the non-trivial process of identifying (1) valid, (2) novel, (3) potentially useful, and
(4) understandable patterns in data.

7
Q

How can we identify a good pattern in data?

A

Is it valid?
–The pattern has to be valid with respect to a certainty level (e.g., a rule that holds for 86% of the cases)

Is it novel?
–Is the relation between astigmatism and hard contact lenses already well-known?

Is it useful? Is it actionable?
–The pattern should provide information useful to the bank for assessing credit risk

Is it understandable?

8
Q

What is the general idea of data mining? What are the obstacles to be overcome?

A

Build computer programs that navigate through databases automatically, seeking patterns

However,
–Most patterns will be uninteresting
– Most patterns are spurious, inexact, or contingent on accidental coincidences in the data
– Real data is imperfect, some parts will be garbled, and some will be missing

9
Q

What are the essential characteristics of data mining algorithms?

A

Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful

10
Q

What are the types of data mining models?

A

Descriptive vs. Predictive
Are the models built for gaining insight? (about what already happened)
Or are they built for accurate prediction? (about what might happen)

Prescriptive
Apply descriptive and predictive mining to recommend a course of action

11
Q

What is the structure of a data mining process?

A

Selection
– What data do we need to answer the posed question?

Cleaning
– Are there any errors or inconsistencies in the data we need to eliminate?

Transformation
– Some variables might be eliminated because equivalent to others
– Some variables might be elaborated to create new variables
(e.g., birthday to age, daily measures into weekly/monthly measures, log?)

Mining
– Select the mining approach: classification, regression, association, etc.
– Choose and apply the mining algorithm(s)

Validation
– Are the patterns we discovered sound? According to what criteria?
– Are the criteria sound? Can we explain the result?

Presentation & Narrative
– What did we learn? Is there a story to tell? A take-home message?

12
Q

What is the importance of data preparation and preprocessing?

A

Data preparation accounts for most of the time needed to design an effective data mining pipeline

It can take up to 80%-90% of the overall effort

No quality in data, no quality out (trash in, trash out): quality decisions need quality data

13
Q

Why is data dirty?

A

Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is analyzed.
– Human/hardware/software problems

Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission

Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning

14
Q

What are the major tasks in data preprocessing?

A

Data cleaning
– Fill in missing values, smooth noisy data
– Identify or remove outliers
– Remove duplicates and resolve inconsistencies

Data integration
– Integration of multiple databases, data cubes, or files

Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression

Feature engineering, data transformation, and data discretization
– Normalization
– Creation of new features (e.g., age from birthday)
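
For instance, the feature-creation step (deriving age from a birthday) takes only a few lines of pandas; a minimal sketch, where the column names and the reference date are assumptions:

import pandas as pd

df = pd.DataFrame({"birthday": ["1990-05-01", "2001-11-23", "1985-02-14"]})
df["birthday"] = pd.to_datetime(df["birthday"])
reference = pd.Timestamp("2024-01-01")                     # fixed reference date for reproducibility
df["age"] = (reference - df["birthday"]).dt.days // 365    # approximate age in years
print(df)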

15
Q

What are the reasons for missing values?
How can we handle them?

A

• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
–Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values
–Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
–Replace with all possible values (weighted by their probabilities)

16
Q

What is sampling and why is it important? When is a sample considered to be representative?

A

• Sampling is the main technique employed for data selection
• Often used for both the preliminary investigation of the data and the final data analysis
• We also sample because working on the entire dataset may be too expensive or time consuming
• A sample is representative if it has the same property (of interest) as the original set of data
• If the sample is representative, it will work almost as well as using the entire dataset

17
Q

What are the types of data sampling?

A

• Sampling without replacement
– As each item is selected, it is removed from the population

• Sampling with replacement (Bootstrap)
–Objects are not removed from the population as they are selected for the sample.
– In sampling with replacement, the same object can be picked up more than once

• Stratified sampling
–Split the data into several partitions
–Then draw random samples from each partition
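
A minimal pandas sketch of the three schemes on a toy DataFrame (the "segment" column is an assumed stratification variable):

import pandas as pd

df = pd.DataFrame({"segment": ["A", "A", "A", "B", "B", "C"],
                   "value": [1, 2, 3, 4, 5, 6]})

without_replacement = df.sample(n=3, replace=False, random_state=0)
with_replacement = df.sample(n=3, replace=True, random_state=0)         # bootstrap sample
stratified = df.groupby("segment").sample(frac=0.5, random_state=0)     # sample within each partition

print(without_replacement, with_replacement, stratified, sep="\n\n")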

18
Q

What are some of the data mining tasks?

A

Regression, clustering, classification, association

• Outlier analysis
–An outlier is a data object that does not comply with the general behavior of the data
–It can be considered as noise or exception but is quite useful in rare events analysis

• Trend and evolution analysis
–Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
–Similarity-based analysis

• Text Mining, Topic Modeling, Graph Mining, Data Streams

• Sentiment Analysis, Opinion Mining, etc.

• Other pattern-directed or statistical analyses

19
Q

Are all discovered patterns interesting? How can we determine if a pattern is interesting?

A

• Interestingness measures
– Data Mining may generate thousands of patterns, but typically not all of them are interesting.
– A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

• Objective vs. subjective interestingness measures
–Objective measures are based on statistics and structures of patterns
– Subjective measures are based on user’s belief in the data, e.g., unexpectedness, novelty, etc.

20
Q

What are instances, attributes and concepts?

A

• Instances (observations, cases, records, items, examples)
–The atomic elements of information from a dataset
–Each row in previous table corresponds to an instance

• Attributes (variables, features, independent variables)
–Measures aspects of an instance
–Each instance is composed of a certain number of attributes
–Each column in previous table contains values of an attribute

• Concept (class, target variable, dependent variable)
–Special content inside the data
–Kind of things that can be learned
–Intelligible and operational concept description
–Last column of previous table was the class

21
Q

What are the different types of attributes?

A

• Numeric Attributes
–Real-valued or integer-valued domain
–Interval-scaled when only differences are meaningful (e.g., temperature)
– Ratio-scaled when differences and ratios are meaningful (e.g., Age)

• Categorical Attributes
–Set-valued domain composed of a set of symbols
– Nominal when only equality is meaningful (e.g., domain(Sex) = { M, F})
–Ordinal when both equality (are two values the same?) and inequality (is one value less than another?) are meaningful (e.g., domain(Education) = { High School, BS, MS, PhD})

22
Q

How can we classify attributes based on an alternative view?

A
23
Q

Can missing values have a meaning? Give an example

A

• Missing value may have a meaning in itself

• For example, missing test in a medical examination or an empty field in a questionnaire

• They are usually indicated by out-of-range entries (e.g., max/min float, NaN, null)

• Does absence of value have some significance?
– If it does, “missing” is a separate value
– If it does not, “missing” must be treated in a special way

24
Q

What are the types of missing values?

A

Missing not at random (MNAR)
–Distribution of missing values depends on missing value
– E.g., respondents with high income are less likely to report it

Missing at random (MAR)
–Distribution of missing values depends on observed attributes, but not missing value
– E.g., men less likely than women to respond to question about mental health

Missing completely at random (MCAR)
–Distribution of missing values does not depend on observed attributes or missing value
–E.g., survey questions randomly sampled from larger set of possible questions

Types of Missing Data

In data analysis, missing data is a common problem that can affect the quality of your results. Understanding the type of missing data is crucial because it determines the best strategy for handling it. Missing data is typically classified into three categories:
1. Missing Completely at Random (MCAR):
• Definition: Data is considered MCAR when the probability of a data point being missing is unrelated to the observed or unobserved data. Essentially, there’s no pattern to the missing data, and it happens purely by chance.
• Example: If a survey respondent accidentally skips a question due to a technical glitch or randomly misses a question with no relation to their responses, the missing data would be considered MCAR.
• Detection: MCAR can be tested using statistical tests like Little’s MCAR test. Additionally, examining patterns and correlations between missing values and other variables may indicate randomness.
• Solutions: If data is MCAR, it’s safe to use methods like:
• Listwise deletion: Removing all cases with missing values.
• Mean/median imputation: Replacing missing values with the mean or median of the non-missing data.
• Advanced methods: Using multiple imputation or machine learning models, though not strictly necessary for MCAR.
2. Missing at Random (MAR):
• Definition: Data is considered MAR if the probability of a data point being missing is related to the observed data but not to the missing data itself. In other words, the missingness is related to other known variables but not to the value of the missing data itself.
• Example: In a medical study, if older patients are more likely to skip a specific question but the missingness does not depend on their actual answer to that question, the data is MAR.
• Detection: Analyzing the relationship between missing data and other observed variables can help identify MAR. Techniques like logistic regression can be used to model the probability of missingness as a function of other variables.
• Solutions: When data is MAR, you can use methods like:
• Multiple imputation: Creating several imputed datasets by estimating the missing values based on other observed data, and then averaging the results.
• Maximum likelihood estimation: Estimating model parameters by accounting for the missing data structure.
• Regression imputation: Predicting missing values using observed variables.
3. Missing Not at Random (MNAR):
• Definition: Data is MNAR when the probability of a data point being missing is related to the value of the missing data itself, even after accounting for other variables. This means that the missingness is inherently tied to the data that is missing.
• Example: In a survey about income, high-income respondents might be more likely to skip questions about their earnings due to privacy concerns. Here, the missingness depends directly on the income value.
• Detection: MNAR is challenging to detect since it involves the values of the missing data itself. It often requires domain knowledge, additional data collection, or sensitivity analysis to explore if the data might be MNAR.
• Solutions: Addressing MNAR requires more sophisticated techniques, such as:
• Modeling the missingness: Using selection models or pattern-mixture models to explicitly model the missing data mechanism.
• Data augmentation: Collecting additional data or using follow-up studies to understand the nature of the missingness.
• Sensitivity analysis: Testing different assumptions about the missing data to see how they impact the results.

How to Identify the Type of Missing Data

1.	Visualizations:
•	Use heatmaps, bar plots, or missingness matrices to visualize missing data patterns.
•	Correlation plots between missing indicators and other variables can suggest MCAR or MAR.
2.	Statistical Tests:
•	Little’s MCAR Test: A hypothesis test where the null hypothesis is that data is MCAR. A significant result suggests the data is not MCAR.
•	Logistic Regression for Missingness: Modeling the missingness indicator (e.g., whether a value is missing) against other variables can help detect MAR.
3.	Domain Knowledge:
•	Understanding the context of your data can help infer whether missingness is related to specific variables (MAR) or inherent to the missing value itself (MNAR).

Practical Solutions for Handling Missing Data

1.	MCAR Solutions:
•	Listwise Deletion: If the data is MCAR, removing rows with missing values is unbiased but reduces the dataset size.
•	Mean/Median Imputation: Useful for simple datasets but may underestimate variability.
•	Random Sampling Imputation: Impute missing values using random sampling from observed values.
2.	MAR Solutions:
•	Multiple Imputation: Iteratively predicts missing values using models (like regression) based on observed data, creating multiple versions of the dataset to average over.
•	Maximum Likelihood Estimation: Fits models using the observed data and accounts for missingness without imputing values.
•	Predictive Models: Use models like k-Nearest Neighbors (k-NN) or machine learning algorithms to predict missing values.
3.	MNAR Solutions:
•	Sensitivity Analysis: Explore how different assumptions about missing data affect the results.
•	Pattern-Mixture Models: Model each pattern of missing data separately.
•	Data Collection: If possible, gather additional data to minimize the impact of MNAR.

Summary Table

Type Definition Detection Solutions
MCAR Missingness is random and unrelated to any data. Little’s MCAR test, pattern analysis Listwise deletion, mean imputation, multiple imputation
MAR Missingness is related to observed data but not to missing data itself. Logistic regression, correlation with observed data Multiple imputation, maximum likelihood, regression imputation
MNAR Missingness is related to the value of the missing data itself. Sensitivity analysis, domain knowledge Sensitivity analysis, pattern-mixture models, data augmentation

Understanding the type of missing data is crucial for applying the right strategy to ensure accurate and unbiased results in your analysis.
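
A minimal sketch of the two simplest remedies above (listwise deletion and mean imputation), assuming a small pandas DataFrame with hypothetical columns; under MAR one would typically prefer multiple imputation or maximum likelihood instead:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "income": [30000, 42000, np.nan, 51000]})

# Listwise deletion: drop every row that contains a missing value (unbiased only under MCAR)
complete_cases = df.dropna()

# Mean imputation: replace each missing value with the column mean (reduces variability)
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

print(complete_cases, imputed, sep="\n\n")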

25
Q

How to deal with missing values?

A

First use what you know about the data
–Why data is missing?
–Distribution of missing data

Decide on the best strategy to yield the least biased estimates
–Do Not Impute (DNI)
– Deletion Methods (list-wise deletion, pair-wise deletion)
– Single Imputation Methods (mean/mode substitution, dummy variable, single regression)
– Model-Based Methods (maximum Likelihood, multiple imputation)

26
Q

Explain the Do Not Impute strategy

A

• Do Not Impute (DNI)
–Simply use the default policy of the data mining method
–Works only if the policy exists
–Some methods can work around missing data

27
Q

Explain the Deletion strategy

A

• The handling of missing data depends on the type

• Discarding all the examples with missing values
–Simplest approach
–Allows the use of unmodified data mining methods
–Only practical if there are few examples with missing values.
– Otherwise, it can introduce bias.

28
Q

Explain the List-wise Deletion strategy

A

• Only analyze cases with available data on each variable
• Simple, but reduces the data
• Comparability across analyses
• Does not use all the information
• Estimates may be biased if data not MCAR

29
Q

Explain the Pair-wise Deletion strategy

A

• Delete cases with missing values that affect only the variables of interest

• Example
– When using only the first two variables, the missing values of the third variable are not considered

• Advantage
–Keeps as many cases as possible for each analysis
–Uses all information possible with each analysis

• Disadvantage
–Comparison of results is more difficult because samples are different each time

30
Q

Explain the Imputation strategy

A

• Convert the missing values into a new value
–Use a special value for it
–Add an attribute that indicates if value is missing or not
–Greatly increases the difficulty of the data mining process

• Imputation methods
– Assign a value to the missing one, based on the rest of the dataset
–Use the unmodified data mining methods

31
Q

Explain the Single Imputation strategy

A

• Mean/mode substitution (most common value)
–Replace missing value with sample mean or mode
–Run analyses as if all complete cases
– Advantages: Can use complete case analysis methods
– Disadvantages: Reduces variability

• Dummy variable control
–Create an indicator for missing value (1=value is missing for observation; 0=value is observed for observation)
–Impute missing values to a constant (such as the mean)
–Include missing indicator in the algorithm
–Advantage: uses all available information about missing observation
– Disadvantage: results in biased estimates, not theoretically driven
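
A minimal pandas sketch of the two single-imputation strategies above, using a hypothetical income column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30000, np.nan, 45000, np.nan, 52000]})

# Dummy variable control: 1 = value was missing, 0 = value was observed
df["income_missing"] = df["income"].isna().astype(int)

# Mean substitution: replace the missing values with the sample mean
df["income"] = df["income"].fillna(df["income"].mean())
print(df)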

32
Q

Explain the Model-based Imputation strategy

A

• Extract a model from the dataset to perform the imputation
– Suitable for MCAR and, to a lesser extent, for MAR
– Not suitable for MNAR type of missing data
• For MNAR we need to go back to the source of the data to obtain more information

33
Q

What are some of the reasons behind Inaccurate values?

A

• Data has not been collected for mining it

• Errors and omissions that don’t affect original purpose of data (e.g. age of customer)

• Typographical errors in nominal attributes, thus values need to be checked for consistency

• Typographical and measurement errors in numeric attributes, thus outliers need to be identified

• Errors may be deliberate (e.g. wrong zip codes)

34
Q

Why do we care about data types?

A

They influence the type of statistical analyses and visualization we can perform

Some algorithms and functions work best with specific data types; the type also determines how we check for valid values, deal with missing values, etc.

35
Q

How can we transform categorical data into numerical and vice versa?

A

Transforming Categorical Data into Numerical Data

Converting categorical data into numerical data is often necessary for machine learning algorithms, which typically require numerical input. Here are common methods for this transformation:
1. Label Encoding:
• Definition: Converts each category into a unique integer.
• Use Case: Useful for ordinal data where categories have a meaningful order (e.g., “Low”, “Medium”, “High”).
• Example:

from sklearn.preprocessing import LabelEncoder

data = ['Low', 'Medium', 'High', 'Medium']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)
print(encoded_data)  # Output: [1, 2, 0, 2]

•	Pros: Simple and efficient.
•	Cons: Can introduce ordinal relationships where none exist, leading to incorrect model assumptions.

2.	One-Hot Encoding:
•	Definition: Converts each category into a new binary column (0 or 1), with “1” indicating the presence of that category.
•	Use Case: Ideal for nominal data where categories have no intrinsic order (e.g., “Red”, “Green”, “Blue”).
•	Example:

import pandas as pd

data = ['Red', 'Green', 'Blue']
df = pd.DataFrame(data, columns=['Color'])
one_hot_encoded = pd.get_dummies(df, columns=['Color'])
print(one_hot_encoded)

Output:

Color_Blue Color_Green Color_Red
0 0 0 1
1 0 1 0
2 1 0 0

•	Pros: No ordinal relationship introduced; ideal for non-ordinal data.
•	Cons: Can lead to the “curse of dimensionality” if there are many categories (many new columns).

3.	Binary Encoding:
•	Definition: Converts categories into binary digits and uses fewer columns than one-hot encoding.
•	Use Case: Useful when dealing with high-cardinality features (features with many categories).
•	Example:

import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange']})
encoder = ce.BinaryEncoder()
binary_encoded = encoder.fit_transform(data)
print(binary_encoded)

•	Pros: Reduces dimensionality compared to one-hot encoding.
•	Cons: Still introduces additional columns.

4.	Frequency Encoding:
•	Definition: Replaces each category with its frequency in the dataset.
•	Use Case: Useful for high-cardinality features where you want to retain some information about category distribution.
•	Example:

data = ['A', 'B', 'A', 'C', 'B', 'A']
freq_encoding = pd.Series(data).value_counts().to_dict()
encoded_data = [freq_encoding[val] for val in data]
print(encoded_data)  # Output: [3, 2, 3, 1, 2, 3]

•	Pros: Helps reduce dimensionality.
•	Cons: Can introduce bias if frequency distributions are uneven.

5.	Target Encoding (Mean Encoding):
•	Definition: Replaces categories with the mean of the target variable for each category.
•	Use Case: Useful for categorical features in supervised learning where target leakage is not a concern.
•	Example:

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C'], 'Target': [10, 20, 30, 40]})
means = df.groupby('Category')['Target'].mean()
df['Category_encoded'] = df['Category'].map(means)
print(df)

Output:

Category Target Category_encoded
0 A 10 20.0
1 B 20 20.0
2 A 30 20.0
3 C 40 40.0

•	Pros: Retains useful information about categories’ relationship with the target.
•	Cons: Risk of overfitting, especially with small datasets.

Transforming Numerical Data into Categorical Data

Sometimes, you may want to convert numerical data into categorical data, such as when creating bins for continuous data or segmenting data for analysis. Here are common techniques:
1. Binning (Discretization):
• Definition: Divides numerical data into bins (categories) based on value ranges.
• Use Case: Useful for simplifying numerical data by converting it into discrete intervals.
• Example:

import pandas as pd

data = [23, 45, 12, 67, 34, 89]
bins = [0, 30, 60, 90]
labels = ['Low', 'Medium', 'High']
categorized_data = pd.cut(data, bins=bins, labels=labels)
print(categorized_data)  # Output: ['Low', 'Medium', 'Low', 'High', 'Medium', 'High']

•	Pros: Simplifies analysis by reducing data complexity.
•	Cons: Can lead to loss of information.

2.	Quantile Binning:
•	Definition: Bins data into categories based on quantiles (e.g., quartiles, deciles).
•	Use Case: Useful for dividing data into equally sized groups, often used for ranking.
•	Example:

data = [10, 20, 30, 40, 50]
quantile_bins = pd.qcut(data, q=3, labels=['Low', 'Medium', 'High'])
print(quantile_bins)  # Output: ['Low', 'Low', 'Medium', 'High', 'High']

•	Pros: Ensures bins have approximately the same number of observations.
•	Cons: Bin boundaries may not be intuitive.

3.	Custom Binning:
•	Definition: Uses specific domain knowledge to create bins based on meaningful thresholds.
•	Use Case: Useful when there are well-known cutoff points in your data (e.g., age groups like “Child”, “Adult”, “Senior”).
•	Example:

data = [10, 15, 25, 35, 50]
custom_bins = [0, 18, 35, 60]
custom_labels = ['Child', 'Adult', 'Senior']
custom_categorized = pd.cut(data, bins=custom_bins, labels=custom_labels)
print(custom_categorized)

•	Pros: Tailored to your specific data context.
•	Cons: Requires domain knowledge and can be subjective.

Summary Table

Transformation Technique Use Case Pros Cons
Categorical to Numerical Label Encoding Ordinal data Simple and efficient May introduce artificial order
One-Hot Encoding Nominal data Avoids ordinal assumptions High dimensionality
Binary Encoding High-cardinality features Reduces dimensionality More complex to interpret
Frequency Encoding High-cardinality features Retains frequency information Can introduce bias
Target Encoding Supervised learning Captures relationship with target Risk of overfitting
Numerical to Categorical Binning Simplify continuous data Reduces complexity Loss of information
Quantile Binning Equal-sized category groups Balanced bins Non-intuitive boundaries
Custom Binning Domain-specific thresholds Tailored categorization Requires domain knowledge

The choice of method depends on the type of data and the specific problem you are trying to solve.

36
Q

What are the different types of encoders?

A

• LabelEncoder
–Encodes target labels with values between 0 and n_labels-1

• OneHotEncoder
–Performs a one-hot encoding of categorical features.

• OrdinalEncoder
–Performs an ordinal (integer) encoding of the categorical features
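
A minimal scikit-learn sketch contrasting the three encoders on toy data (the labels and feature values are assumptions):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

y = np.array(["yes", "no", "yes"])                   # target labels
X = np.array([["sunny"], ["overcast"], ["rainy"]])   # one categorical feature

print(LabelEncoder().fit_transform(y))               # [1 0 1] (classes are sorted alphabetically)
print(OrdinalEncoder().fit_transform(X))             # one integer per category
print(OneHotEncoder().fit_transform(X).toarray())    # one 0/1 column per category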

37
Q

What are the pros and cons of one-hot encoding and label encoding?

A

Label Encoder

Map a categorical variable described by n values into a numerical variable with values from 0 to n-1

For example, attribute Outlook would be replaced by a numerical variable with values 0, 1, and 2

Warning

Replacing a label with a number might influence the process in unexpected ways

In the example, by assigning 0 to overcast and 2 to sunny we give a higher weight to the latter
What happens if we then apply a regression model? Would the result change with different assigned values?
If we apply label encoding, we should store the mapping used
for each attribute to be able to map the encoded data into the original ones

One Hot Encoding

Map each categorical attribute with n values into n binary 0/1 variables
Each one describing one specific attribute values

For example, attribute Outlook is replaced by three binary variables Sunny, Overcast, and Rainy

Warning

One hot encoding assigns the same numerical value (1) to all the labels

But it can generate a massive amount of variables when applied to categorical variables with many values

38
Q

How can we verify the influence of a label encoding in a model?

A
39
Q

Why would we want to discretize numerical values?

A

Because the algorithm we are using may not work with continuous ones

40
Q

What are categorical embeddings?

A

Apply deep learning to map categorical variables into Euclidean spaces

Similar values are mapped close to each other in the embedding space thus revealing the intrinsic properties of the categorical variables

41
Q

What is data exploration? What are the key motivations to apply it?

A

Preliminary exploration of the data aimed at identifying their most relevant characteristics

What are the key motivations?
–Help to select the right tool for preprocessing and data mining
– Exploit humans’ abilities to recognize patterns not captured by automatic tools

42
Q

What are the goals of exploratory data analysis?

A

“An approach of analyzing data to summarize their main characteristics without using a statistical model or having formulated a prior hypothesis.”

“Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.”

43
Q

What are the data exploratory techniques?

A

Exploratory Data Analysis (as originally defined by Tukey), was mainly focused on:
– Visualization
–Clustering and anomaly detection (viewed as exploratory techniques)
– Note that, in data mining, clustering and anomaly detection are major
areas of interest, and not thought of as just exploration

44
Q

What are summary statistics? Give examples

A

What are they?
–Numbers that summarize properties of the data

Summarized properties include
– Location, mean, spread, skewness, standard deviation, mode, percentiles, etc.

45
Q

What is the frequency and mode of an attribute?

A

The frequency of an attribute value
–The percentage of time the value occurs in the data set
– For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.

• The mode of an attribute is the most frequent attribute value

• The notions of frequency and mode are typically used with categorical data

46
Q

What are the measurements of location of the data? Explain them.

A

The mean is the most common measure of the location of a set of points

However, the mean is very sensitive to outliers. Thus, the median or a trimmed mean is also commonly used

47
Q

What are percentiles? How are they calculated?

A

For continuous data, the notion of a percentile is very useful

p-th percentile
–Given an ordinal or continuous attribute x and a number p
–p-th percentile is a value xp of x such that p% of the observed values of x are less than xp

For instance, the 50th percentile is the value x50% such that 50% of all values of x are less than x50%
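
For example, with numpy (which uses linear interpolation between observations by default):

import numpy as np

x = np.array([15, 20, 35, 40, 50])
print(np.percentile(x, 50))   # 35.0 (the median)
print(np.percentile(x, 25))   # 20.0 (the first quartile)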

48
Q

What are the trimean and truncated mean?

A

Trimean
– It is the weighted mean of the first, second and third quartile

Truncated Mean
– Discards data above and below a certain percentile
– For example, below the 5th percentile and above the 95th percentile
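
A minimal sketch of both statistics, assuming the usual Tukey weights (Q1 + 2*Q2 + Q3)/4 for the trimean and scipy's trim_mean for the truncated mean:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 100])      # 100 is an outlier; the plain mean is 16.0

q1, q2, q3 = np.percentile(x, [25, 50, 75])
trimean = (q1 + 2 * q2 + q3) / 4              # weighted mean of the three quartiles

truncated = stats.trim_mean(x, 0.25)          # discard the lowest and highest 25% before averaging
print(trimean, truncated)                     # both resist the outlier (4.5 here)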

49
Q

What are the measurements of data spread?

A

The variance is the most common measure of the spread of a set of points

50
Q

What is correlation?

A

Given two attributes, measure how strongly one attribute implies the other, based on the available data

Use correlation measures to estimate how predictive one attribute is of another

51
Q

How can we calculate the correlation of different kinds of attributes?

A

Numerical Variables
– For two numerical variables, we can compute Pearson’s product moment coefficient

Ordinal Variables
–We can compute Spearman’s rank correlation coefficient

Categorical Variables
– We can compute the χ2 (chi-squared) statistic, which tests the hypothesis that A and B are independent

Binary Variables
– Compute the point-biserial correlation
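
A minimal scipy sketch of these measures on toy data (the arrays and the contingency table are assumptions):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(stats.pearsonr(x, y))              # Pearson's product moment coefficient (numerical)
print(stats.spearmanr(x, y))             # Spearman's rank correlation (ordinal)

table = np.array([[20, 10], [15, 25]])   # contingency table of two categorical attributes
print(stats.chi2_contingency(table))     # chi-squared test of independence

binary = np.array([0, 0, 1, 1, 1])
print(stats.pointbiserialr(binary, y))   # point-biserial correlation (binary vs. numerical)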

52
Q

What is the difference between causation and correlation?

A

Correlation does not imply causation
– Just because the value of one attribute is highly predictive of the value of another doesn't mean that forcing the first variable to take on a particular value will cause the second to change

Causality has a direction, while correlation typically doesn’t
– Correlation between high income and owning a Ferrari
– Giving a person a Ferrari doesn’t affect their income
– But increasing their income may make them more likely to buy a Ferrari

Confounding variables can cause attributes to be correlated:
– High heart rate and sweating are correlated with each other since they tend to both happen during
exercise (confounder)
– Causing somebody to sweat by putting them in a sauna won't necessarily raise their heart rate (it does a little, but not as much as exercise)
– And giving them beta-blockers to lower their heart rate might not prevent sweating (it might a little, but again not like stopping exercising)

53
Q

What are outliers? How can we detect them? How can we handle them?

A

What are outliers?
– Data objects that do not comply with the general behavior or model of the data, that is, values that
appear as anomalous
– Most data mining methods consider outliers noise or exceptions.

Outliers may be detected using
– Manual inspection and knowledge of reasonable values.
– Statistical tests that assume a distribution or probability model for the data
– Distance measures where objects that are a substantial distance from any other cluster are considered outliers
– Deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group

How do we manage outliers?
– Outliers are typically filtered out by eliminating the data points containing them
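
A minimal numpy sketch of two simple detection rules, the z-score rule and the 1.5*IQR rule; the thresholds are conventional choices, not prescribed above:

import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])           # 95 looks anomalous

# z-score rule: flag points more than 2 standard deviations from the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 2])                          # [95]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])   # [95]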

54
Q

What is normalization? What are some approaches for doing it?

A

We might need to normalize attributes that have very different scales (e.g., age vs income)

Range normalization converts all values to the range [0,1]

Standard Score Normalization forces variables to have mean of 0 and standard deviation of 1.
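
A minimal numpy sketch of both normalizations (scikit-learn's MinMaxScaler and StandardScaler compute the same quantities):

import numpy as np

age = np.array([18.0, 35.0, 52.0, 70.0])

# Range (min-max) normalization: rescale to [0, 1]
range_norm = (age - age.min()) / (age.max() - age.min())

# Standard score (z-score) normalization: mean 0, standard deviation 1
z_norm = (age - age.mean()) / age.std()

print(range_norm, z_norm, sep="\n")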

55
Q

What is the Standard Score Normalization? How do we calculate it? What does it imply?

A
56
Q

What is the importance of visualization?

A

• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported

• Data visualization is one of the most powerful and appealing techniques for data exploration
–Humans have a well-developed ability to analyze large amounts of information that is presented visually
–Can detect general patterns and trends
–Can detect outliers and unusual patterns

57
Q

What are histograms?

A

• They are a graphical representation of the distribution of data
• They are representations of tabulated frequencies depicted as adjacent rectangles, erected over
discrete intervals (bins).
• Their areas are proportional to the frequency of the observations in the interval.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins

58
Q

What are box plots?

A

A box plot (also known as a box-and-whisker plot) is a graphical representation used in statistics to display the distribution of a dataset. It provides a summary of the dataset through five key summary statistics:

1.	Minimum: The smallest value in the dataset (excluding outliers).
2.	First Quartile (Q1): The 25th percentile, meaning 25% of the data is below this value.
3.	Median (Q2): The 50th percentile or the middle value of the dataset.
4.	Third Quartile (Q3): The 75th percentile, meaning 75% of the data is below this value.
5.	Maximum: The largest value in the dataset (excluding outliers).

In addition to these, outliers (extreme values) are sometimes plotted as individual points beyond the “whiskers,” which extend from Q1 to the minimum and from Q3 to the maximum.

The box represents the interquartile range (IQR), which is the range between Q1 and Q3. It visually shows where the bulk of the data lies, with the median typically marked inside the box. This helps to easily see how the data is skewed or whether it’s symmetrically distributed.

59
Q

How can we visualize more than two dimensions at the same time?

A

Three main approaches

Visualize several combinations of two-dimension plots (e.g., scatter plot matrix)

Visualize all the dimensions at once (e.g., heatmaps, spider plots, and Chernoff faces)

Project the data into a smaller space and visualize the projected data

60
Q

How can we project high-dimensional data into fewer dimensions?

A

When projecting high-dimensional data into fewer dimensions we can either

Find a linear projection, e.g., use Principal Component Analysis

Find a non-linear projection
e.g., use t-distributed Stochastic Neighbor Embeddings (t-SNE)

61
Q

What is Principal Component Analysis? When does it work? What can affect it?

A

• Typically applied to reduce the number of dimensions of data (feature extraction)
• The goal of PCA is to find a projection that captures the largest amount of variation in data
• Given N data vectors from n-dimensions, find k<n orthogonal vectors (the principal components) that can be used to represent data
• Works for numeric data only and it is affected by scale, so data usually need to be rescaled before applying PCA

62
Q

What are the steps for applying PCA?

A

• Steps to apply PCA
–Normalize input data
–Compute k orthonormal (unit) vectors, i.e., principal components
–Each input data point can be written as a linear combination of the k principal component vectors

• The principal components are sorted in order of decreasing “significance” or strength

• Data size can be reduced by eliminating the weak components, i.e., those with low variance.

• Using the strongest principal components, it is possible to reconstruct a good approximation of the original data
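
A minimal scikit-learn sketch of these steps on synthetic data (the dataset size and the number of components kept are assumptions):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 points in 5 dimensions

X_scaled = StandardScaler().fit_transform(X)   # normalize first: PCA is affected by scale
pca = PCA(n_components=2)                      # keep the 2 strongest components
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)           # variance captured by each component
X_approx = pca.inverse_transform(X_reduced)    # approximate reconstruction of the (scaled) data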

63
Q

What is t-distributed stochastic neighbor embedding?

A

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm used for dimensionality reduction, primarily for visualizing high-dimensional datasets in a lower-dimensional space, typically 2D or 3D.

Here’s how t-SNE works:

1.	Input: You start with a high-dimensional dataset where each data point may have many features. For example, you might have a dataset where each sample is represented by 100 dimensions (or features).
2.	Objective: The goal of t-SNE is to reduce the dataset to 2 or 3 dimensions while preserving the structure or relationships between the points as much as possible.
3.	Process:
•	t-SNE computes pairwise similarities between the points in the high-dimensional space.
•	It then tries to find a mapping to a lower-dimensional space where the points that are similar in the high-dimensional space remain close together, while dissimilar points are mapped farther apart.
•	In the high-dimensional space, similarities between points are modeled using Gaussian distributions. In the low-dimensional space, similarities are modeled using t-distributions (which have heavier tails than Gaussian distributions, helping to prevent overcrowding of points).
4.	Result: The result is a 2D or 3D map where points that are close together in the high-dimensional space remain close in the reduced space, allowing for visualization of clusters, groupings, or other patterns.

Key Characteristics:

•	Non-linear dimensionality reduction: Unlike techniques like PCA (Principal Component Analysis), t-SNE is non-linear, meaning it can capture complex relationships between data points.
•	Focus on local structure: t-SNE excels at preserving the local structure of the data (i.e., points that are close in high-dimensional space remain close in lower dimensions). However, it may not always preserve the global structure of the dataset well.
•	Visualization: It is most commonly used to create 2D plots for visualizing high-dimensional data like images, text embeddings, or genomic data.

Applications:

t-SNE is widely used in:

•	Image recognition (e.g., visualizing features learned by deep neural networks).
•	Natural language processing (e.g., visualizing word embeddings).
•	Bioinformatics (e.g., clustering gene expression data).

One important caveat with t-SNE is that it can be computationally intensive and sometimes hard to interpret, especially with very large datasets.

64
Q

What are the steps to apply t-SNE?

A

t-SNE converts distances between data points into joint probabilities, then models the original points by mapping them to low-dimensional map points such that the positions of the map points preserve the structure of the data

  1. Define a probability distribution over pairs of high-dimensional data points so that:
    –Similar data points have a high probability of being picked
    –Dissimilar points have an extremely small probability of being picked
  2. Define a similar distribution over the points in the map space
    – Minimize the Kullback–Leibler divergence between the two distributions with respect to the
    locations of the map points
    – To minimize the score, it applies gradient descent
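
A minimal scikit-learn sketch; the synthetic dataset and the perplexity value are assumptions:

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 points in 50 dimensions

# Map to 2D; perplexity controls the effective number of neighbors considered
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                 # (200, 2)
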
65
Q

What is association rule mining?

A

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in a transaction

66
Q

What are the evaluation metrics of association rule mining? What do they mean?

A

Support
Fraction of transactions that contain both X and Y

Confidence
Measures how often items in Y appear in transactions that contain X

Given: X => Y
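
A minimal worked example on hypothetical market-basket transactions, computing both measures for the rule {bread} => {milk}:

# Each transaction is the set of items bought together
transactions = [{"bread", "milk"}, {"bread"}, {"milk", "beer"},
                {"bread", "milk", "beer"}, {"bread", "milk"}]

X, Y = {"bread"}, {"milk"}
n = len(transactions)

support_xy = sum((X | Y) <= t for t in transactions) / n   # fraction of transactions containing X and Y
support_x = sum(X <= t for t in transactions) / n          # fraction of transactions containing X
confidence = support_xy / support_x                        # conf(X => Y) = sup(X U Y) / sup(X)

print(support_xy, confidence)   # 0.6 0.75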

67
Q

What is the end goal of association rule mining?

A

Given a set of transactions, find all rules with support >= minsup threshold and confidence >= minconf threshold

68
Q

What are the steps taken to mine associations?

A

• Frequent Itemset Generation
–Generate all itemsets whose support >= minsup

• Rule Generation
–Generate high confidence rules from frequent itemsets
–Each rule is a binary partitioning of a frequent itemset

69
Q

What are frequent itemsets?

A

Frequent Itemset
–An itemset whose support is greater than or equal to the minsup threshold

70
Q

What are the extremes in the choice process for minimum sup and confidence?

A
71
Q

How do we set an appropriate minimum sup?

A
72
Q

What is the lift of an association rule? How is it calculated? What the values of lift mean?

A

• Lift is the ratio of the observed joint probability of X and Y to the expected joint probability if they were statistically independent,
lift(X => Y) = sup(X ∪ Y) / (sup(X) · sup(Y)) = conf(X => Y) / sup(Y)

• Lift is a measure of the deviation from stochastic independence
(if it is 1 then X and Y are independent)

• Lift also measures the surprise of the rule. A lift close to 1 means that the support of a rule is expected considering the supports of its components.

• We typically look for values of lift that are much larger (i.e., above expectation) or much smaller (i.e., below expectation) than one.

73
Q

How can we summarize the itemsets obtained?

A

• A frequent itemset X is called maximal if it has no frequent supersets
• The set of all maximal frequent itemsets, given as M = {X | X ∈ F and !∃Y ⊃ X, such that Y ∈ F}
• M is a condensed representation of the set of all frequent itemset F, because we can determine whether any itemset is frequent or not using M
• If there is a maximal itemset Z such that X ⊆ Z, then X must be frequent, otherwise Z cannot be frequent
• However, M alone cannot be used to determine sup(X), we can only use to have a lower-bound, that is, sup(X)>=sup(Z) if X ⊆ Z ∈M.

74
Q

What is the closed frequent itemset? What is the minimal generator itemset?

A

• An itemset X is closed if all supersets of X have strictly less support, that is, sup(X) > sup(Y), for all Y ⊃ X
• That is, C = {X | X ∈ F and !∃Y⊃X, such that sup(X)=sup(Y) }
• The set of all closed frequent itemsets C is a condensed representation, as we can determine whether an itemset X is frequent, as well as the exact support of X using C alone

• A frequent itemset X is a minimal generator if it has no subsets with the same support: G = {X | X ∈ F and !∃Y ⊂ X, such that sup(X) = sup(Y)}
• Thus, all subsets of X have strictly higher support, that is, sup(X) < sup(Y)

75
Q

What are the classifications of the itemsets in the example?

A

Refer to the DM1 image

76
Q

How can we model bipartite graphs as association rules?

A
77
Q

What is the difference between a set and a subsequence? What about a subsequence and a consecutive subsequence?

A
78
Q

Find all the subsequences of the example

A
79
Q

What is sequence pattern mining? What is the difference between it and association rule mining?

A
80
Q

When is a sequence contained in another one? What is the difference between size and length?

A
81
Q

Define the sequence for each customer and calculate the support of the example.

A
82
Q

What is clustering? What is its basic principle?

A

Clustering algorithms group a collection of data points into “clusters” according to some distance measure

Data points in the same cluster should have a small distance from one another

Data points in different clusters should be at a large distance from one another

• A cluster is a collection of data objects
–Similar to one another within the same cluster
–Dissimilar to the objects in other clusters

• Cluster analysis
–Given a set of data points, try to understand their structure
–Finds similarities between data according to the characteristics found in the data
–Groups similar data objects into clusters
–It is unsupervised learning since there are no predefined classes

83
Q

What is a good clustering?

A

• A good clustering consists of high-quality clusters with
–High intra-class similarity
–Low inter-class similarity

• The quality of a clustering result depends on both
–The similarity measure used by the method and
–Its implementation (the algorithms used to find the clusters)
–Its ability to discover some or all the hidden patterns

• Evaluation
–Various measures of intra/inter cluster similarity
–Manual inspection
–Benchmarking on existing labels

84
Q

What are the distance/similarity metrics? How do we calculate them?

A

• Dissimilarity/Similarity metric
– Similarity expressed in terms of distance function, typically a metric, d(i, j)
– Definitions of distance functions are usually very different for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables
–Weights can/should be associated with different variables based on applications and data semantics

• Cluster quality measure
– Separate from distance, there is a “quality” function that measures the “goodness” of a cluster
– It is hard to define “similar enough” or “good enough”
–The answer is typically highly subjective

85
Q

What are the standard distance functions?

A

Euclidean distance is the typical function used to compute the similarity between two examples

Another popular metric is city-block (Manhattan) metric, distance is the sum of absolute differences

The Jaccard distance is a measure of dissimilarity between two sets. It is based on the Jaccard index, which measures the similarity between two sets. The Jaccard index is calculated as one minus the size of the intersection of the sets divided by the size of the union of the sets.

The Hamming distance is a measure of the difference between two equal-length vectors. It counts the number of positions at which the corresponding symbols differ (sometimes divided by the length to give a normalized value).

The edit distance between a string x=x1x2…xn and a string y=y1y2…ym is the smallest number of insertions and deletions of single characters that will transform x into y.

Cosine distance (or cosine dissimilarity) is a measure of how different two non-zero vectors are in terms of their orientation, rather than their magnitude. It is derived from the cosine similarity, which measures the cosine of the angle between two vectors in a multi-dimensional space.
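
A minimal scipy sketch of these distances on toy vectors (edit distance is omitted, as it is not part of scipy.spatial.distance):

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.0, 3.0])

print(distance.euclidean(a, b))   # straight-line distance
print(distance.cityblock(a, b))   # Manhattan distance: sum of absolute differences
print(distance.cosine(a, b))      # 1 - cosine similarity (orientation, not magnitude)

u = np.array([1, 0, 1, 1])        # binary vectors encoding two sets
v = np.array([1, 1, 0, 1])
print(distance.jaccard(u, v))     # 1 - |intersection| / |union|
print(distance.hamming(u, v))     # fraction of positions where the vectors differ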

86
Q

Detail the explanation of the cosine similarity and its importance

A

Why Cosine Similarity is Important

Cosine similarity is widely used in fields like text mining, natural language processing (NLP), and recommendation systems because it focuses on the orientation (direction) of vectors rather than their magnitude. This makes it particularly suited for comparing high-dimensional, sparse datasets like text or word embeddings.

Key Advantages:
1. Magnitude Independence: It ignores the magnitude of the vectors, which is useful for text data where documents may have different lengths but similar content.
• Example: “Hello world” vs. “Hello world, hello again” – their directions may still align despite different word frequencies.
2. High-Dimensional Data: Handles sparse and high-dimensional vectors efficiently, especially in text analysis.
3. Fast Computation: Relatively simple to compute, making it ideal for large-scale systems like search engines and recommendation algorithms.

Applications:

•	Text Similarity: Comparing document vectors in NLP (e.g., TF-IDF or word embeddings).
•	Recommendation Systems: Matching users with similar preferences or products with similar attributes.
•	Clustering: Grouping similar documents or feature vectors based on their direction.

How to Calculate Cosine Similarity

The cosine similarity between two vectors \mathbf{A} and \mathbf{B} is given by the formula:

\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|}

Where:
• \mathbf{A} \cdot \mathbf{B} : Dot product of vectors \mathbf{A} and \mathbf{B} .
• |\mathbf{A}| : Magnitude (norm) of vector \mathbf{A} , calculated as \sqrt{\sum_{i=1}^n A_i^2} .
• |\mathbf{B}| : Magnitude (norm) of vector \mathbf{B} , calculated as \sqrt{\sum_{i=1}^n B_i^2} .

Steps to Calculate:

1.	Compute the Dot Product ( \mathbf{A} \cdot \mathbf{B} ):

\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^n A_i \cdot B_i

This sums up the element-wise product of the two vectors.
2. Compute the Magnitudes ( |\mathbf{A}| and |\mathbf{B}| ):

|\mathbf{A}| = \sqrt{\sum_{i=1}^n A_i^2}, \quad |\mathbf{B}| = \sqrt{\sum_{i=1}^n B_i^2}

3.	Divide the Dot Product by the Product of Magnitudes:

\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|}

The resulting value lies in [-1, 1]; the cosine distance is then 1 - \cos(\theta).
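A short worked example of the three steps above in Python (the vectors A and B are made up for illustration):

import math

A = [3, 2, 0, 5]
B = [1, 0, 0, 0]

# Step 1: dot product
dot = sum(a * b for a, b in zip(A, B))         # 3

# Step 2: magnitudes
norm_A = math.sqrt(sum(a * a for a in A))      # sqrt(38) ≈ 6.164
norm_B = math.sqrt(sum(b * b for b in B))      # 1.0

# Step 3: divide the dot product by the product of the magnitudes
cos_sim = dot / (norm_A * norm_B)              # ≈ 0.487
print(cos_sim)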
87
Q

Why is it important to normalize our attributes when doing clustering?

A

Normalization of attributes is crucial in clustering because it ensures that all features contribute equally to the distance measures used in clustering algorithms, such as K-means or hierarchical clustering. Here are the key reasons why normalization is important:

1.	Equal Weighting of Features: Different attributes can have different scales (e.g., one attribute might range from 1 to 1000, while another ranges from 0 to 1). If features are not normalized, those with larger numerical ranges will dominate the distance calculations, making clustering more biased towards these attributes.
2.	Improved Distance Calculations: Clustering algorithms rely on distance metrics (such as Euclidean distance) to group similar data points together. Without normalization, features with larger scales can disproportionately influence the distance, distorting the true relationships between data points.
3.	Better Interpretation of Clusters: Normalizing ensures that each attribute contributes similarly to the clustering results, leading to clusters that better represent the underlying structure of the data. This improves the interpretability and meaningfulness of the resulting clusters.
4.	Handling Features with Different Units: When attributes have different units (e.g., temperature in Celsius and weight in kilograms), normalization allows these features to be comparable by bringing them to a common scale.
5.	Prevents Bias in Algorithms: Algorithms like K-means can be especially sensitive to the range of input features. Normalization helps prevent bias towards any specific attribute, leading to more balanced and accurate clustering.

Common normalization techniques include min-max scaling (rescaling features to a [0,1] range) or z-score normalization (standardizing data to have a mean of 0 and a standard deviation of 1).
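A minimal sketch of both techniques in plain Python (the attribute names and values are illustrative):

def min_max_scale(values):
    # rescale to the [0, 1] range
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # standardize to mean 0 and standard deviation 1
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

ages = [18, 35, 52, 70]                      # years
incomes = [20000, 45000, 90000, 250000]      # e.g. euros

# After scaling, both attributes contribute comparably to a Euclidean distance
print(min_max_scale(ages))
print(z_score(incomes))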

88
Q

What are the requisites for clustering algorithms?

A

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

89
Q

What is the curse of dimensionality?

A

As the number of dimensions in a dataset increases, distance measures become increasingly meaningless.

Increasing the number of dimensions spreads out the points until, in very high dimensions, they are almost equidistant from each other.

In high dimensions, almost all pairs of points are equally far away from one another.
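A small simulation can illustrate this, assuming NumPy is available: as the dimensionality grows, the ratio between the smallest and the largest pairwise distance approaches 1, so “near” and “far” lose their meaning.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((100, d))                # 100 random points in the unit hypercube
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    dists = dists[np.triu_indices(100, k=1)]     # keep each pair once, drop self-distances
    print(d, round(float(dists.min() / dists.max()), 3))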

90
Q

What is the problem of clustering high-dimensional data? How does it relate to the curse of dimensionality? And how can we deal with high-dimensional data?

A

Clustering high-dimensional data presents several challenges, which are closely tied to the curse of dimensionality. The curse of dimensionality refers to the various issues that arise when working with data in high-dimensional spaces, where the number of dimensions (features or attributes) is large. Here’s how it affects clustering:

1. Distance Measures Become Less Meaningful
  • In high-dimensional spaces, most data points become equidistant from one another. This is because, as the number of dimensions increases, the volume of the space grows exponentially, and points are spread out more uniformly across this space.
  • Impact on Clustering: Since clustering algorithms (e.g., K-means) rely on distance measures (like Euclidean distance) to group similar points together, the distinction between “near” and “far” points diminishes in high dimensions, making it difficult to form meaningful clusters.

2. Increased Sparsity of Data
  • High-dimensional spaces are often sparse, meaning that the density of data points decreases as dimensionality increases. This sparsity can make it hard to detect patterns or clusters because there may be very few neighboring points in any given region of the space.
  • Impact on Clustering: In clustering, algorithms group nearby points into clusters based on density or proximity. When the data is sparse, it becomes harder to find well-defined clusters, leading to noise and poor clustering performance.

3. Overfitting and Noise Sensitivity
  • With a large number of dimensions, irrelevant or noisy features are often introduced. As the dimensionality grows, the likelihood of some dimensions being irrelevant (i.e., not contributing to the clustering task) also increases.
  • Impact on Clustering: Clustering algorithms can easily overfit to these noisy or irrelevant dimensions, creating misleading clusters. The clusters may reflect the noise rather than the true structure of the data.

4. Increased Computational Complexity
  • The computational cost of clustering algorithms grows with the number of dimensions, as calculating distances or running iterative algorithms (e.g., K-means) becomes more expensive in high-dimensional spaces.
  • Impact on Clustering: High-dimensional data can make clustering algorithms slow or even infeasible to run efficiently, particularly for large datasets.

Addressing the Curse of Dimensionality in Clustering:

To handle the curse of dimensionality in clustering, several techniques can be applied:

•	Dimensionality reduction: Techniques like Principal Component Analysis (PCA) or t-SNE can be used to reduce the number of dimensions while retaining most of the variance or structure in the data (see the sketch after this answer).
•	Feature selection: Identify and use only the most relevant features for clustering, removing irrelevant or redundant dimensions.
•	Distance-based alternatives: Some clustering algorithms use distance measures that are more robust in high dimensions, such as cosine similarity or density-based clustering methods (e.g., DBSCAN).

In summary, the curse of dimensionality makes clustering high-dimensional data challenging due to the loss of meaningful distances, data sparsity, noise sensitivity, and increased computational cost. Reducing the dimensionality or using alternative clustering approaches helps mitigate these issues.
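A hedged sketch of the dimensionality-reduction idea from the list above, assuming scikit-learn is available (the data, the number of components, and the number of clusters are illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((500, 200))                           # 500 points with 200 (mostly noisy) features

X_reduced = PCA(n_components=10).fit_transform(X)    # project onto 10 principal components
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_reduced)
print(labels[:20])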

91
Q

How can we use brute force for association rule generation? Suggest pseudocode to do it.

A

Brute-force approach
–List all possible association rules
–Compute the support and confidence for each rule
–Prune rules that fail the minsup and minconf thresholds
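Since the card asks for pseudocode, here is a minimal brute-force sketch in Python (min_sup, min_conf, and the toy transactions are illustrative):

from itertools import chain, combinations

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def brute_force_rules(transactions, min_sup, min_conf):
    transactions = [frozenset(t) for t in transactions]
    items = frozenset(chain.from_iterable(transactions))
    rules = []
    # list all possible association rules: every itemset L with at least two items...
    for k in range(2, len(items) + 1):
        for L in map(frozenset, combinations(items, k)):
            sup_L = support(L, transactions)
            if sup_L < min_sup:                      # prune rules failing minsup
                continue
            # ...and every non-empty proper subset f of L as a rule f => L \ f
            for j in range(1, len(L)):
                for f in map(frozenset, combinations(L, j)):
                    conf = sup_L / support(f, transactions)
                    if conf >= min_conf:             # prune rules failing minconf
                        rules.append((set(f), set(L - f), sup_L, conf))
    return rules

transactions = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}]
for lhs, rhs, sup, conf in brute_force_rules(transactions, 0.5, 0.7):
    print(lhs, "=>", rhs, round(sup, 2), round(conf, 2))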

92
Q

How can we use the level-wise (Apriori) approach to calculate association rules? Give pseudocode for doing it

A

If an itemset is frequent, then all of its subsets must also be frequent

• The Apriori principle holds due to the following (anti-monotone) property of the support measure: the support of an itemset never exceeds the support of its subsets
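A compact level-wise (Apriori-style) sketch in Python that uses the principle above to prune candidates (helper names and the toy data are illustrative):

from itertools import combinations

def apriori(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    # level 1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # candidate generation: join frequent (k-1)-itemsets...
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ...and prune candidates with an infrequent (k-1)-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # count supports with one pass over the database
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori(transactions, min_count=3))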

93
Q

How can we use the Eclat algorithm (tidset approach) to calculate association rules? Give pseudocode for doing it

A

• Leverages the tidsets directly for support computation.

• The itemset support is computed by intersecting the tidsets of suitably chosen subsets.

• Given t(X) and t(Y) for any two frequent itemsets X and Y, then t(XY) = t(X) ∩ t(Y)

• And sup(XY) = |t(XY)|

Eclat can be further improved by using diffsets (differences of tidsets).
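A minimal Eclat-style sketch in Python built on vertical tidsets (names and the toy data are illustrative; the diffset optimization is omitted):

def eclat(transactions, min_count):
    # vertical database: 1-itemset -> tidset (set of transaction ids)
    vertical = {}
    for tid, t in enumerate(transactions):
        for item in t:
            vertical.setdefault(frozenset([item]), set()).add(tid)
    vertical = {X: tids for X, tids in vertical.items() if len(tids) >= min_count}

    frequent = {}

    def mine(level):
        # level: same-size frequent itemsets sharing a common prefix, mapped to their tidsets
        items = sorted(level, key=sorted)        # fix an order so each pair is tried once
        for i, X in enumerate(items):
            frequent[X] = len(level[X])          # sup(X) = |t(X)|
            next_level = {}
            for Y in items[i + 1:]:
                tidset = level[X] & level[Y]     # t(XY) = t(X) ∩ t(Y)
                if len(tidset) >= min_count:
                    next_level[X | Y] = tidset
            if next_level:
                mine(next_level)

    mine(vertical)
    return frequent

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print({tuple(sorted(s)): n for s, n in eclat(transactions, min_count=2).items()})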

94
Q

How can we use the FPGrowth algorithm (frequent pattern tree approach) to calculate association rules? Give pseudocode for doing it

A

• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
– Highly condensed, but complete for frequent pattern mining
–Avoid costly database scans

• Use an efficient, FP-tree-based frequent pattern mining method

• A divide-and-conquer methodology: decompose mining tasks into smaller ones
• Avoid candidate generation: sub-database test only

• Major Steps to mine FP-tree
–Construct the frequent pattern tree
–For each frequent item i compute the projected FP-tree
–Recursively mine conditional FP-trees and grow frequent patterns obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the patterns

• We start from the same transaction database used in the previous examples and sort all the transactions based on the frequencies of their items
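A compact FPGrowth-style sketch in Python following the major steps above (a simplified illustration, not the exact course pseudocode):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_count):
    # first scan: count item frequencies and keep only the frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}

    root, header = FPNode(None, None), defaultdict(list)
    # second scan: insert each transaction with items sorted by descending frequency
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)       # node-link in the header table
            node = node.children[item]
            node.count += 1
    return header, freq

def fpgrowth(transactions, min_count, suffix=()):
    header, freq = build_tree(transactions, min_count)
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):   # least frequent first
        pattern = (item,) + suffix
        patterns[pattern] = freq[item]
        # conditional pattern base: the prefix path of every node holding `item`,
        # repeated once per count of that node
        cond_base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_base.extend([path] * node.count)
        # recursively mine the conditional (projected) FP-tree
        patterns.update(fpgrowth(cond_base, min_count, pattern))
    return patterns

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(fpgrowth(transactions, min_count=2))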

95
Q

What are the benefits of FPGrowth?

A

• Preserve complete information for frequent pattern mining
• Reduce irrelevant info—infrequent items are gone
• The more frequently occurring items are more likely to be shared
• Never larger than the original database (not counting node-links and the count field)
• No candidate generation, no candidate test, no repeated scan of entire database

96
Q

What do we mean by rule generation? How can we efficiently generate associations rules? Give an example.

A

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L \ f satisfies the minimum confidence requirement

• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)

To generate rules more efficiently, keep in mind that:

• Confidence does not have an anti-monotone property

• c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)

• However, the confidence of rules generated from the same itemset has an anti-monotone property

• For L = {A,B,C,D}: c(ABC ⇒ D) >= c(AB ⇒ CD) >= c(A ⇒ BCD)

• Confidence is anti-monotone in the number of items on the right-hand side of the rule

Refer to the DIQ1 figure for an example
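A hedged sketch of this rule-generation step in Python; it assumes the frequent itemsets and their support counts have already been computed (for instance by one of the algorithms above), and all names are illustrative. For brevity it checks every candidate consequent; a full implementation would skip supersets of consequents that already failed minconf, exploiting the anti-monotonicity described above.

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: dict mapping frozenset itemsets to their support counts
    rules = []
    for L, sup_L in frequent.items():
        if len(L) < 2:
            continue
        # every non-empty proper subset of L can serve as a consequent
        for size in range(1, len(L)):
            for rhs in map(frozenset, combinations(L, size)):
                lhs = L - rhs
                conf = sup_L / frequent[lhs]     # c(lhs => rhs) = sup(L) / sup(lhs)
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

frequent = {frozenset("A"): 4, frozenset("B"): 4, frozenset("C"): 4,
            frozenset("AB"): 3, frozenset("AC"): 3, frozenset("BC"): 3,
            frozenset("ABC"): 2}
for lhs, rhs, conf in generate_rules(frequent, min_conf=0.7):
    print(lhs, "=>", rhs, round(conf, 2))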

97
Q

Give some different clustering methods

A

• Hierarchical vs. point assignment

• Numeric and/or symbolic data

• Deterministic vs. probabilistic

• Exclusive vs. overlapping

• Hierarchical vs. flat

• Top-down vs. bottom-up