Theory Flashcards
Why do we need data mining?
Data mining developed primarily in response to the increasing amounts of data being generated and the need to extract meaningful insights from this vast information. Several factors contributed to its development:
- Explosion of Data: With the advent of digital technologies, businesses, governments, and organizations began generating huge amounts of data from various sources such as transactions, social media, sensors, and web activity. Traditional methods were insufficient to handle and analyze these massive datasets.
- Advances in Computing Power: The rise of more powerful computers, storage systems, and distributed computing made it possible to process large volumes of data quickly and efficiently, enabling the complex calculations required for data mining.
- Need for Competitive Advantage: Businesses sought new ways to gain insights into customer behavior, market trends, and operational efficiencies to remain competitive. Data mining helped in identifying patterns, predicting trends, and improving decision-making processes.
- Machine Learning and AI Growth: With the development of machine learning algorithms, data mining evolved into a more sophisticated discipline. These algorithms helped in automating the discovery of patterns and trends in data, leading to actionable insights.
- Interdisciplinary Research: Data mining drew from various fields like statistics, computer science, artificial intelligence, and database management, which allowed for the creation of more powerful tools and methodologies.
- Applications Across Domains: Industries such as finance, healthcare, marketing, and manufacturing realized the potential of data mining in predicting outcomes, improving processes, and uncovering hidden relationships within data, further driving its development.
In essence, data mining arose as a solution to manage and make sense of the data deluge and to provide insights that could be turned into a competitive edge or valuable knowledge.
What are some examples of data mining usage?
Customer churn
– Given customer information for the past months, understand which customers I lost, or predict which customers I might lose
Credit assessment
– Given a loan application, predict whether the bank should approve the loan
Customer segmentation
– Given information about the customers, identify interesting groups among them
Community detection
–Who is discussing what?
What led the Big Data phenomenon to occur? What are the consequences of Big Data?
The Big Data phenomenon occurred due to several key factors that have shaped the modern technological landscape. Here’s an overview of the causes and consequences:
Causes:
Explosion of Data Generation:
- The rise of the internet, mobile devices, social media, and cloud computing has led to unprecedented data generation. Every digital interaction, from emails and social posts to e-commerce transactions, creates data.
- The growth of Internet of Things (IoT) devices, such as sensors, wearables, and smart appliances, added another layer of continuous data creation from physical objects.
Increased Data Storage Capacity:
- Advancements in storage technologies, such as cloud storage, distributed databases, and data centers, made it possible to store vast amounts of data cheaply and efficiently. This, combined with improved data compression techniques, enabled the preservation of enormous datasets.
Advances in Data Processing Technologies:
- Tools like Hadoop, Apache Spark, and NoSQL databases allowed for scalable processing of unstructured and semi-structured data across distributed systems. These technologies made it feasible to analyze vast datasets in a relatively short time, unlocking the potential of Big Data.
Proliferation of Social Media and Digital Platforms:
- Platforms like Facebook, Twitter, Instagram, and YouTube generate a massive amount of user-generated content. The interactions, likes, shares, and views on these platforms became key sources of data for analytics.
E-commerce and Digital Transactions:
- Online shopping platforms such as Amazon and Alibaba, along with digital payment systems, track and record every consumer transaction, producing detailed datasets related to purchasing behaviors and trends.
Advancements in Machine Learning and Artificial Intelligence:
- AI and machine learning require large amounts of data to train models effectively. As these fields advanced, the need for massive datasets to improve algorithms accelerated the Big Data movement.
Consequences:
Improved Decision-Making:
- Organizations can leverage Big Data analytics to make data-driven decisions, leading to more accurate forecasts, optimized operations, and personalized services. In business, this translates into enhanced customer insights, market analysis, and operational efficiencies.
Personalization and Predictive Analytics:
- Companies now have the ability to customize products and services to individual preferences. For example, streaming services like Netflix and Spotify use Big Data to personalize content recommendations, and e-commerce platforms use it to suggest products.
- Predictive analytics can anticipate future trends based on historical data, which helps industries like finance, healthcare, and marketing improve outcomes.
Privacy and Security Concerns:
- The vast amounts of personal data being collected have led to increased concerns about privacy violations and data breaches. Organizations are now expected to protect sensitive data, leading to stricter regulations like the GDPR (General Data Protection Regulation) in Europe and other privacy laws globally.
Ethical and Bias Issues:
- With the extensive use of data in decision-making, ethical concerns have surfaced, particularly around how data is collected, processed, and used. Algorithms trained on biased data can perpetuate existing inequalities, especially in areas like hiring, criminal justice, and lending.
Impact on Business Models:
- Big Data has driven the shift to data-centric business models, where data is a valuable asset. Companies like Google, Amazon, and Facebook capitalize on user data to generate revenue through targeted advertising, changing the nature of competition in the digital economy.
Job Displacement and Creation:
- While Big Data has created new roles in data science, analytics, and IT infrastructure, it has also led to automation in several industries. Tasks that were traditionally performed by humans are now being handled by AI and data-driven systems, leading to both job displacement and the creation of new tech-driven jobs.
Advances in Healthcare:
- In healthcare, Big Data has led to significant advancements such as precision medicine, where treatments are tailored to individual patients based on genetic, environmental, and lifestyle factors. It has also helped in epidemic prediction and management, as seen during the COVID-19 pandemic.
Efficiency Gains in Various Industries:
- Sectors like manufacturing, logistics, and supply chain management benefit from Big Data analytics by optimizing production processes, reducing waste, and improving delivery times.
The Big Data phenomenon arose from the exponential increase in data generated by digital and connected technologies, coupled with advancements in storage and processing capabilities. The consequences are vast, ranging from enhanced decision-making and personalized experiences to concerns around privacy, ethics, and job displacement, fundamentally reshaping industries and society.
When can we say that a computer learns from experience?
A computer program is said to learn from experience E with respect to some class of tasks T and a performance measure P, if its performance at tasks in T, as measured by P, improves because of experience E.
For example, a spam filter learns if its accuracy (P) at classifying emails (T) improves as it processes more labeled messages (E).
What are the machine learning paradigms?
Suppose we have the experience E encoded as a dataset,
D = x1,x2,x3,…,xn
• Supervised Learning
–Given the desired outputs t1, t2, …, tn, learns to produce the correct output for a new set of inputs
• Unsupervised learning
–Exploits regularities in D to build a representation to be used for reasoning or prediction
• Reinforcement learning
– By producing actions a1, a2, …, an which affect the environment, and receiving rewards r1, r2, …, rn, learns to act so as to maximize rewards in the long term (a minimal sketch of the paradigms follows)
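A minimal sketch of the first two paradigms, assuming scikit-learn and a tiny invented dataset (the numbers and model choices are illustrative, not from the course):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Invented toy dataset D = x1, ..., x4 with two attributes per instance
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
t = np.array([0, 0, 1, 1])  # desired outputs, available only in the supervised setting

# Supervised learning: given the desired outputs, learn to predict the output for new inputs
clf = LogisticRegression().fit(X, t)
print(clf.predict([[1.2, 2.1]]))  # a point close to the first group

# Unsupervised learning: exploit regularities in D without any targets
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# Reinforcement learning is different: it needs an environment that returns rewards
# after each action, rather than a fixed dataset, so it is not shown here.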
What is data mining?
Data Mining is the non-trivial process of identifying (1) valid, (2) novel, (3) potentially useful, and
(4) understandable patterns in data.
How can we identify a good pattern in data?
Is it valid?
–The pattern has to be valid with respect to a certainty level (e.g., a rule that holds for 86% of the cases)
Is it novel?
–Is the relation between astigmatism and hard contact lenses already well-known?
Is it useful? Is it actionable?
–The pattern should provide information useful to the bank for assessing credit risk
Is it understandable?
What is the general idea of data mining? What are the obstacles to be overcome?
Build computer programs that navigate through databases automatically, seeking patterns
However,
–Most patterns will be uninteresting
– Most patterns are spurious, inexact, or contingent on accidental coincidences in the data
– Real data is imperfect: some parts will be garbled, and some will be missing
What are the essential characteristics of data mining algorithms?
Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful
What are the types of data mining models?
Descriptive vs. Predictive
Are the models built for gaining insight? (about what already happened)
Or are they built for accurate prediction? (about what might happen)
Prescriptive
Apply descriptive and predictive mining to recommend a course of action
What is the structure of a data mining process?
Selection
– What data do we need to answer the posed question?
Cleaning
– Are there any errors or inconsistencies in the data we need to eliminate?
Transformation
– Some variables might be eliminated because they are equivalent to others
– Some variables might be transformed or combined to create new variables
(e.g., birthday to age, daily measures into weekly/monthly measures, log transform)
Mining
– Select the mining approach: classification, regression, association, etc.
– Choose and apply the mining algorithm(s)
Validation
– Are the patterns we discovered sound? According to what criteria?
– Are the criteria sound? Can we explain the result?
Presentation & Narrative
– What did we learn? Is there a story to tell? A take-home message?
What is the importance of data preparation and preprocessing?
Data preparation accounts for most of the time needed to design an effective data mining pipeline
It can take up to 80%-90% of the overall effort
No quality in the data, no quality out (trash in, trash out): quality decisions need quality data
Why is data dirty?
Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is analyzed.
– Human/hardware/software problems
Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
What are the major tasks in data preprocessing?
Data cleaning
– Fill in missing values, smooth noisy data
– Identify or remove outliers
– Remove duplicates and resolve inconsistencies
Data integration
– Integration of multiple databases, data cubes, or files
Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
Feature engineering, data transformation, and data discretization
– Normalization
– Creation of new features (e.g., age from birthday); see the sketch after this list
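A small pandas sketch of the feature-engineering bullets above (age from birthday and min-max normalization); the column names and dates are invented for illustration:
import pandas as pd

df = pd.DataFrame({
    'birthday': pd.to_datetime(['1990-05-01', '1985-11-23', '2000-02-29']),
    'income': [30000, 52000, 41000],
})

# Feature creation: derive age (in whole years) from the birthday
reference_date = pd.Timestamp('2024-01-01')  # fixed date for reproducibility
df['age'] = (reference_date - df['birthday']).dt.days // 365

# Normalization: rescale a numeric attribute into [0, 1] (min-max)
df['income_norm'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())
print(df)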
What are the reasons for missing values?
How can we handle them?
• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
–Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
–Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
–Replace with all possible values (weighted by their probabilities)
What is sampling and why is it important? When a sample is considered to be representative?
• Sampling is the main technique employed for data selection
• Often used for both the preliminary investigation of the data and the final data analysis
• We also sample because working on the entire data set is too expensive or time consuming
• A sample is representative if it has the same property (of interest) as the original set of data
• If the sample is representative, it will work almost as well as using the entire data set
What are the types of data sampling?
• Sampling without replacement
– As each item is selected, it is removed from the population
• Sampling with replacement (Bootstrap)
–Objects are not removed from the population as they are selected for the sample.
– In sampling with replacement, the same object can be picked up more than once
• Stratified sampling
–Split the data into several partitions
–Then draw random samples from each partition (see the sketch below)
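A brief pandas sketch of the three sampling schemes, assuming a DataFrame with an invented 'segment' column used as the stratification partition (groupby sampling requires pandas >= 1.1):
import pandas as pd

df = pd.DataFrame({
    'segment': ['A', 'A', 'A', 'B', 'B', 'C'],
    'value': [1, 2, 3, 4, 5, 6],
})

# Sampling without replacement: each row can be selected at most once
without_repl = df.sample(n=3, replace=False, random_state=0)

# Sampling with replacement (bootstrap): the same row may be selected more than once
bootstrap = df.sample(n=6, replace=True, random_state=0)

# Stratified sampling: split into partitions by 'segment', then sample within each partition
stratified = df.groupby('segment', group_keys=False).sample(frac=0.5, random_state=0)

print(without_repl, bootstrap, stratified, sep='\n\n')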
What are some of the data mining tasks?
Regression, clustering, classification, association
• Outlier analysis
–An outlier is a data object that does not comply with the general behavior of the data
–It can be considered as noise or exception but is quite useful in rare events analysis
• Trend and evolution analysis
–Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
–Similarity-based analysis
• Text Mining, Topic Modeling, Graph Mining, Data Streams
• Sentiment Analysis, Opinion Mining, etc.
• Other pattern-directed or statistical analyses
Are all discovered patterns interesting? How can we determine if a pattern is interesting?
• Interestingness measures
– Data Mining may generate thousands of patterns, but typically not all of them are interesting.
– A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
–Objective measures are based on statistics and structures of patterns
– Subjective measures are based on user’s belief in the data, e.g., unexpectedness, novelty, etc.
What are instances, attributes and concepts?
• Instances (observations, cases, records, items, examples)
–The atomic elements of information from a dataset
–Each row in the previous table corresponds to an instance
• Attributes (variables, features, independent variables)
–Measures aspects of an instance
–Each instance is composed of a certain number of attributes
–Each column in the previous table contains values of an attribute
• Concept (class, target variable, dependent variable)
–Special content inside the data
–Kind of things that can be learned
–Intelligible and operational concept description
–The last column of the previous table was the class
What are the different types of attributes?
• Numeric Attributes
–Real-valued or integer-valued domain
–Interval-scaled when only differences are meaningful (e.g., temperature)
– Ratio-scaled when differences and ratios are meaningful (e.g., Age)
• Categorical Attributes
–Set-valued domain composed of a set of symbols
– Nominal when only equality is meaningful (e.g., domain(Sex) = { M, F})
–Ordinal when both equality (are two values the same?) and order (is one value less than another?) are meaningful (e.g., domain(Education) = { High School, BS, MS, PhD }); see the sketch below
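A short pandas sketch (an illustration, not from the course) of declaring nominal vs. ordinal categorical attributes with the examples above:
import pandas as pd

# Nominal: only equality is meaningful
sex = pd.Categorical(['M', 'F', 'F'], categories=['M', 'F'], ordered=False)
print(sex == 'F')

# Ordinal: equality and order are both meaningful
education = pd.Categorical(
    ['BS', 'High School', 'PhD'],
    categories=['High School', 'BS', 'MS', 'PhD'],
    ordered=True,
)
print(education < 'MS')  # the comparison uses the declared order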
How can we classify attributes based on an alternative view?
Can missing values have a meaning? Give an example
• Missing value may have a meaning in itself
• For example, missing test in a medical examination or an empty field in a questionnaire
• They are usually indicated by out-of-range entries (e.g., max/min float, NaN, null)
• Does absence of value have some significance?
– If it does, “missing” is a separate value
– If it does not, “missing” must be treated in a special way
What are the types of missing values?
Missing not at random (MNAR)
–Distribution of missing values depends on missing value
– E.g., respondents with high income are less likely to report it
Missing at random (MAR)
–Distribution of missing values depends on observed attributes, but not missing value
– E.g., men less likely than women to respond to question about mental health
Missing completely at random (MCAR)
–Distribution of missing values does not depend on observed attributes or missing value
–E.g., survey questions randomly sampled from larger set of possible questions
Types of Missing Data
In data analysis, missing data is a common problem that can affect the quality of your results. Understanding the type of missing data is crucial because it determines the best strategy for handling it. Missing data is typically classified into three categories:
1. Missing Completely at Random (MCAR):
• Definition: Data is considered MCAR when the probability of a data point being missing is unrelated to the observed or unobserved data. Essentially, there’s no pattern to the missing data, and it happens purely by chance.
• Example: If a survey respondent accidentally skips a question due to a technical glitch or randomly misses a question with no relation to their responses, the missing data would be considered MCAR.
• Detection: MCAR can be tested using statistical tests like Little’s MCAR test. Additionally, examining patterns and correlations between missing values and other variables may indicate randomness.
• Solutions: If data is MCAR, it’s safe to use methods like:
• Listwise deletion: Removing all cases with missing values.
• Mean/median imputation: Replacing missing values with the mean or median of the non-missing data.
• Advanced methods: Using multiple imputation or machine learning models, though not strictly necessary for MCAR.
2. Missing at Random (MAR):
• Definition: Data is considered MAR if the probability of a data point being missing is related to the observed data but not to the missing data itself. In other words, the missingness is related to other known variables but not to the value of the missing data itself.
• Example: In a medical study, if older patients are more likely to skip a specific question but the missingness does not depend on their actual answer to that question, the data is MAR.
• Detection: Analyzing the relationship between missing data and other observed variables can help identify MAR. Techniques like logistic regression can be used to model the probability of missingness as a function of other variables.
• Solutions: When data is MAR, you can use methods like:
• Multiple imputation: Creating several imputed datasets by estimating the missing values based on other observed data, and then averaging the results.
• Maximum likelihood estimation: Estimating model parameters by accounting for the missing data structure.
• Regression imputation: Predicting missing values using observed variables.
3. Missing Not at Random (MNAR):
• Definition: Data is MNAR when the probability of a data point being missing is related to the value of the missing data itself, even after accounting for other variables. This means that the missingness is inherently tied to the data that is missing.
• Example: In a survey about income, high-income respondents might be more likely to skip questions about their earnings due to privacy concerns. Here, the missingness depends directly on the income value.
• Detection: MNAR is challenging to detect since it involves the values of the missing data itself. It often requires domain knowledge, additional data collection, or sensitivity analysis to explore if the data might be MNAR.
• Solutions: Addressing MNAR requires more sophisticated techniques, such as:
• Modeling the missingness: Using selection models or pattern-mixture models to explicitly model the missing data mechanism.
• Data augmentation: Collecting additional data or using follow-up studies to understand the nature of the missingness.
• Sensitivity analysis: Testing different assumptions about the missing data to see how they impact the results.
How to Identify the Type of Missing Data
1. Visualizations:
• Use heatmaps, bar plots, or missingness matrices to visualize missing data patterns.
• Correlation plots between missing indicators and other variables can suggest MCAR or MAR.
2. Statistical Tests:
• Little’s MCAR Test: A hypothesis test where the null hypothesis is that data is MCAR. A significant result suggests the data is not MCAR.
• Logistic Regression for Missingness: Modeling the missingness indicator (e.g., whether a value is missing) against other variables can help detect MAR (a small sketch of this check follows the list).
3. Domain Knowledge:
• Understanding the context of your data can help infer whether missingness is related to specific variables (MAR) or inherent to the missing value itself (MNAR).
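A hedged sketch of the “logistic regression for missingness” check, assuming a DataFrame where 'income' has missing values and 'age' and 'gender' are fully observed (all names and values are invented):
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'age':    [25, 37, 58, 44, 63, 29, 51, 40],
    'gender': ['F', 'M', 'M', 'F', 'M', 'F', 'M', 'F'],
    'income': [28000, np.nan, 61000, np.nan, np.nan, 31000, 55000, 39000],
})

# Missingness indicator: 1 if 'income' is missing, 0 otherwise
is_missing = df['income'].isna().astype(int)
observed = pd.get_dummies(df[['age', 'gender']], drop_first=True)

model = LogisticRegression().fit(observed, is_missing)
# Coefficients far from zero suggest the missingness depends on observed attributes (evidence for MAR)
print(dict(zip(observed.columns, model.coef_[0])))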
Practical Solutions for Handling Missing Data
1. MCAR Solutions:
• Listwise Deletion: If the data is MCAR, removing rows with missing values is unbiased but reduces the dataset size.
• Mean/Median Imputation: Useful for simple datasets but may underestimate variability.
• Random Sampling Imputation: Impute missing values using random sampling from observed values.
2. MAR Solutions:
• Multiple Imputation: Iteratively predicts missing values using models (like regression) based on observed data, creating multiple versions of the dataset to average over.
• Maximum Likelihood Estimation: Fits models using the observed data and accounts for missingness without imputing values.
• Predictive Models: Use models like k-Nearest Neighbors (k-NN) or machine learning algorithms to predict missing values.
3. MNAR Solutions:
• Sensitivity Analysis: Explore how different assumptions about missing data affect the results.
• Pattern-Mixture Models: Model each pattern of missing data separately.
• Data Collection: If possible, gather additional data to minimize the impact of MNAR.
Summary Table
MCAR – Missingness is random and unrelated to any data. Detection: Little’s MCAR test, pattern analysis. Solutions: listwise deletion, mean imputation, multiple imputation.
MAR – Missingness is related to observed data but not to the missing data itself. Detection: logistic regression, correlation with observed data. Solutions: multiple imputation, maximum likelihood, regression imputation.
MNAR – Missingness is related to the value of the missing data itself. Detection: sensitivity analysis, domain knowledge. Solutions: sensitivity analysis, pattern-mixture models, data augmentation.
Understanding the type of missing data is crucial for applying the right strategy to ensure accurate and unbiased results in your analysis.
How to deal with missing values?
First use what you know about the data
–Why data is missing?
–Distribution of missing data
Decide on the best strategy to yield the least biased estimates
–Do Not Impute (DNI)
– Deletion Methods (list-wise deletion, pair-wise deletion)
– Single Imputation Methods (mean/mode substitution, dummy variable, single regression)
– Model-Based Methods (maximum Likelihood, multiple imputation)
Explain the Do Not Impute strategy
• Do Not Impute (DNI)
–Simply use the default policy of the data mining method
–Works only if the policy exists
–Some methods can work around missing data
Explain the Deletion strategy
• The handling of missing data depends on the type
• Discarding all the examples with missing values
–Simplest approach
–Allows the use of unmodified data mining methods
–Only practical if there are few examples with missing values.
– Otherwise, it can introduce bias.
Explain the List-wise Deletion strategy
• Only analyze cases with available data on each variable
• Simple, but reduces the data
• Comparability across analyses
• Does not use all the information
• Estimates may be biased if data not MCAR
Explain the Pair-wise Deletion strategy
• Delete cases with missing values that affect only the variables of interest
• Example
– When using only the first two variables, the missing values of the third variable are not considered
• Advantage
–Keeps as many cases as possible for each analysis
–Uses all information possible with each analysis
• Disadvantage
–Comparison of results is more difficult because samples are different each time
Explain the Imputation strategy
• Convert the missing values into a new value
–Use a special value for it
–Add an attribute that indicates if value is missing or not
–Greatly increases the difficulty of the data mining process
• Imputation methods
– Assign a value to the missing one, based on the rest of the dataset
–Use the unmodified data mining methods
Explain the Single Imputation strategy
• Mean/mode substitution (most common value)
–Replace missing value with sample mean or mode
–Run analyses as if all complete cases
– Advantages: Can use complete case analysis methods
– Disadvantages: Reduces variability
• Dummy variable control
–Create an indicator for missing value (1=value is missing for observation; 0=value is observed for observation)
–Impute missing values to a constant (such as the mean)
–Include missing indicator in the algorithm
–Advantage: uses all available information about missing observation
– Disadvantage: results in biased estimates, not theoretically driven (see the sketch below)
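A compact sketch of mean substitution combined with a missing-value indicator, using scikit-learn's SimpleImputer and its add_indicator option on an invented column:
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[20.0], [np.nan], [35.0], [np.nan], [50.0]])

# Mean substitution plus an extra 0/1 column flagging which entries were missing
imputer = SimpleImputer(strategy='mean', add_indicator=True)
print(imputer.fit_transform(X))
# First column: the attribute with missing entries replaced by the mean (35.0)
# Second column: 1 where the original value was missing, 0 otherwise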
Explain the Model-based Imputation strategy
• Extract a model from the dataset to perform the imputation
– Suitable for MCAR and, to a lesser extent, for MAR (see the sketch below)
– Not suitable for MNAR missing data
• For MNAR we need to go back to the source of the data to obtain more information
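A minimal sketch of model-based imputation with scikit-learn's (still experimental) IterativeImputer, which models each incomplete attribute as a function of the others; the data are invented:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the estimator)
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
])

# Each missing entry is estimated from the other attributes over several refinement rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))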
What are some of the reasons behind Inaccurate values?
• Data has not been collected for mining it
• Errors and omissions that don’t affect original purpose of data (e.g. age of customer)
• Typographical errors in nominal attributes, thus values need to be checked for consistency
• Typographical and measurement errors in numeric attributes, thus outliers need to be identified
• Errors may be deliberate (e.g. wrong zip codes)
Why do we care about data types?
They influence the type of statistical analyses and visualization we can perform
Some algorithms and functions work best with specific data types; knowing the type also guides basic operations such as checking for valid values, dealing with missing values, etc.
How can we transform categorical data into numerical data and vice versa?
Transforming Categorical Data into Numerical Data
Converting categorical data into numerical data is often necessary for machine learning algorithms, which typically require numerical input. Here are common methods for this transformation:
1. Label Encoding:
• Definition: Converts each category into a unique integer.
• Use Case: Useful for ordinal data where categories have a meaningful order (e.g., “Low”, “Medium”, “High”).
• Example:
from sklearn.preprocessing import LabelEncoder
data = ['Low', 'Medium', 'High', 'Medium']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)
print(encoded_data)  # Output: [1 2 0 2] (classes are sorted alphabetically: High=0, Low=1, Medium=2)
• Pros: Simple and efficient.
• Cons: Can introduce ordinal relationships where none exist, leading to incorrect model assumptions.
2. One-Hot Encoding:
• Definition: Converts each category into a new binary column (0 or 1), with “1” indicating the presence of that category.
• Use Case: Ideal for nominal data where categories have no intrinsic order (e.g., “Red”, “Green”, “Blue”).
• Example:
import pandas as pd
data = ['Red', 'Green', 'Blue']
df = pd.DataFrame(data, columns=['Color'])
one_hot_encoded = pd.get_dummies(df, columns=['Color'])
print(one_hot_encoded)  # newer pandas versions may display True/False instead of 0/1
Output:
Color_Blue Color_Green Color_Red
0 0 0 1
1 0 1 0
2 1 0 0
• Pros: No ordinal relationship introduced; ideal for non-ordinal data.
• Cons: Can lead to the “curse of dimensionality” if there are many categories (many new columns).
3. Binary Encoding:
• Definition: Converts categories into binary digits and uses fewer columns than one-hot encoding.
• Use Case: Useful when dealing with high-cardinality features (features with many categories).
• Example:
import category_encoders as ce  # third-party package (pip install category_encoders)
data = ['Apple', 'Banana', 'Orange']
encoder = ce.BinaryEncoder()
binary_encoded = encoder.fit_transform(data)
print(binary_encoded)
• Pros: Reduces dimensionality compared to one-hot encoding.
• Cons: Still introduces additional columns.
4. Frequency Encoding:
• Definition: Replaces each category with its frequency in the dataset.
• Use Case: Useful for high-cardinality features where you want to retain some information about category distribution.
• Example:
import pandas as pd
data = ['A', 'B', 'A', 'C', 'B', 'A']
freq_encoding = pd.Series(data).value_counts().to_dict()
encoded_data = [freq_encoding[val] for val in data]
print(encoded_data)  # Output: [3, 2, 3, 1, 2, 3]
• Pros: Helps reduce dimensionality.
• Cons: Can introduce bias if frequency distributions are uneven.
5. Target Encoding (Mean Encoding):
• Definition: Replaces categories with the mean of the target variable for each category.
• Use Case: Useful for categorical features in supervised learning where target leakage is not a concern.
• Example:
import pandas as pd
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C'], 'Target': [10, 20, 30, 40]})
means = df.groupby('Category')['Target'].mean()
df['Category_encoded'] = df['Category'].map(means)
print(df)
Output:
Category Target Category_encoded
0 A 10 20.0
1 B 20 20.0
2 A 30 20.0
3 C 40 40.0
• Pros: Retains useful information about categories’ relationship with the target.
• Cons: Risk of overfitting, especially with small datasets.
Transforming Numerical Data into Categorical Data
Sometimes, you may want to convert numerical data into categorical data, such as when creating bins for continuous data or segmenting data for analysis. Here are common techniques:
1. Binning (Discretization):
• Definition: Divides numerical data into bins (categories) based on value ranges.
• Use Case: Useful for simplifying numerical data by converting it into discrete intervals.
• Example:
import pandas as pd
data = [23, 45, 12, 67, 34, 89]
bins = [0, 30, 60, 90]
labels = ['Low', 'Medium', 'High']
categorized_data = pd.cut(data, bins=bins, labels=labels)
print(categorized_data)  # Output: ['Low', 'Medium', 'Low', 'High', 'Medium', 'High'] (67 falls in the (60, 90] bin)
• Pros: Simplifies analysis by reducing data complexity.
• Cons: Can lead to loss of information.
2. Quantile Binning:
• Definition: Bins data into categories based on quantiles (e.g., quartiles, deciles).
• Use Case: Useful for dividing data into equally sized groups, often used for ranking.
• Example:
data = [10, 20, 30, 40, 50]
quantile_bins = pd.qcut(data, q=3, labels=['Low', 'Medium', 'High'])
print(quantile_bins)  # Output: ['Low', 'Low', 'Medium', 'High', 'High']
• Pros: Ensures bins have approximately the same number of observations.
• Cons: Bin boundaries may not be intuitive.
3. Custom Binning:
• Definition: Uses specific domain knowledge to create bins based on meaningful thresholds.
• Use Case: Useful when there are well-known cutoff points in your data (e.g., age groups like “Child”, “Adult”, “Senior”).
• Example:
data = [10, 15, 25, 35, 50]
custom_bins = [0, 18, 35, 60]
custom_labels = ['Child', 'Adult', 'Senior']
custom_categorized = pd.cut(data, bins=custom_bins, labels=custom_labels)
print(custom_categorized)  # Output: ['Child', 'Child', 'Adult', 'Adult', 'Senior'] (bins are right-inclusive)
• Pros: Tailored to your specific data context.
• Cons: Requires domain knowledge and can be subjective.
Summary Table
Categorical to Numerical:
• Label Encoding – Use case: ordinal data. Pros: simple and efficient. Cons: may introduce artificial order.
• One-Hot Encoding – Use case: nominal data. Pros: avoids ordinal assumptions. Cons: high dimensionality.
• Binary Encoding – Use case: high-cardinality features. Pros: reduces dimensionality. Cons: more complex to interpret.
• Frequency Encoding – Use case: high-cardinality features. Pros: retains frequency information. Cons: can introduce bias.
• Target Encoding – Use case: supervised learning. Pros: captures relationship with target. Cons: risk of overfitting.
Numerical to Categorical:
• Binning – Use case: simplify continuous data. Pros: reduces complexity. Cons: loss of information.
• Quantile Binning – Use case: equal-sized category groups. Pros: balanced bins. Cons: non-intuitive boundaries.
• Custom Binning – Use case: domain-specific thresholds. Pros: tailored categorization. Cons: requires domain knowledge.
The choice of method depends on the type of data and the specific problem you are trying to solve.
What are the different types of encoders?
• LabelEncoder
–Encodes target labels with values between 0 and n_labels-1
• OneHotEncoder
–Performs a one-hot encoding of categorical features.
• OrdinalEncoder
–Performs an ordinal (integer) encoding of the categorical features (see the sketch below)
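A short scikit-learn sketch contrasting OrdinalEncoder and OneHotEncoder on a single categorical feature (the Outlook values are used here purely as an illustration):
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [['Sunny'], ['Overcast'], ['Rainy'], ['Sunny']]  # one categorical feature

ordinal = OrdinalEncoder().fit(X)
print(ordinal.transform(X))   # one integer column; categories are ordered alphabetically by default

onehot = OneHotEncoder(sparse_output=False).fit(X)  # use sparse=False on older scikit-learn versions
print(onehot.transform(X))    # one 0/1 column per distinct value
print(onehot.categories_)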
What are the pros and cons of one-hot encoding and label encoding?
Label Encoder
Maps a categorical variable described by n values into a numerical variable with values from 0 to n-1
For example, attribute Outlook would be replaced by a numerical variable with values 0, 1, and 2
Warning
Replacing a label with a number might influence the process in unexpected ways
In the example, by assigning 0 to overcast and 2 to sunny we give a higher weight to the latter
What happens if we then apply a regression model? Would the result change with different assigned values?
If we apply label encoding, we should store the mapping used for each attribute, so that we can map the encoded data back to the original values (see the sketch below)
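A tiny sketch of keeping that mapping with scikit-learn's LabelEncoder: classes_ holds the value-to-code correspondence and inverse_transform maps the encoded data back to the original values:
from sklearn.preprocessing import LabelEncoder

outlook = ['Sunny', 'Overcast', 'Rainy', 'Sunny']
encoder = LabelEncoder().fit(outlook)

codes = encoder.transform(outlook)
print(codes)  # e.g. [2 0 1 2], since classes_ is sorted alphabetically
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))  # the mapping to store
print(encoder.inverse_transform(codes))  # back to the original labels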
One Hot Encoding
Maps each categorical attribute with n values into n binary 0/1 variables
Each one describing one specific attribute value
For example, attribute Outlook is replaced by three binary variables Sunny, Overcast, and Rainy
Warning
One hot encoding assigns the same numerical value (1) to all the labels
But it can generate a massive number of variables when applied to categorical attributes with many values
How can we verify the influence of a label encoding in a model?
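One hedged way to check it (an illustration, not the course's prescribed procedure): train the same model with two different, arbitrary integer codings of the same attribute and compare the fitted predictions; all data below are invented:
import numpy as np
from sklearn.linear_model import LinearRegression

outlook = ['Sunny', 'Overcast', 'Rainy', 'Sunny', 'Rainy', 'Overcast']
y = np.array([3.0, 7.0, 5.0, 2.5, 5.5, 7.5])  # made-up target values

coding_a = {'Overcast': 0, 'Rainy': 1, 'Sunny': 2}
coding_b = {'Overcast': 2, 'Rainy': 0, 'Sunny': 1}  # a different arbitrary assignment

Xa = np.array([[coding_a[v]] for v in outlook])
Xb = np.array([[coding_b[v]] for v in outlook])

pred_a = LinearRegression().fit(Xa, y).predict(Xa)
pred_b = LinearRegression().fit(Xb, y).predict(Xb)

# If this prints False, the arbitrary numeric coding is influencing the regression model
print(np.allclose(pred_a, pred_b))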