Chapter 1 Continued Flashcards
What is the purpose of data transformation?
Data transformation aims to map the values of an attribute to new replacement values.
What are the common techniques used in data transformation?
Common techniques for data transformation include data normalization, standardization, and conversion.
Why might you want to combine attributes during data transformation?
Combining attributes can create more useful ratios or relationships between them.
How does scaling data benefit data mining algorithms?
Scaling attributes to the same approximate scale improves the performance of many data mining algorithms and results in better models.
What is the purpose of data conversion in data transformation?
Data conversion can involve converting categorical data to numeric values and discretizing continuous data, making it more intuitive and improving algorithm performance.
Why is feature scaling important in data mining?
Feature scaling is essential in data mining because variables with widely varying ranges can lead to biases in the results, favoring attributes with larger ranges.
What is the primary goal of feature scaling in data mining?
The main objective of feature scaling is to ensure that all variables or features are within the same scale, preventing attributes with large ranges from dominating those with smaller ranges.
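The dominance effect described above can be illustrated with a small sketch. The two applicants, the attribute ranges, and the helper names below are hypothetical and chosen only for illustration; the sketch measures how much of the squared Euclidean distance between two records comes from income, before and after min-max scaling.

```python
def min_max(value, low, high):
    """Rescale value from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)

# Hypothetical applicants: (age in years, income in dollars).
a = (25, 50_000)
b = (60, 51_000)

# Assumed attribute ranges, used only for this illustration.
AGE_RANGE = (18, 70)
INCOME_RANGE = (12_000, 98_000)

def income_share(p, q):
    """Fraction of the squared Euclidean distance contributed by income."""
    age_sq = (p[0] - q[0]) ** 2
    income_sq = (p[1] - q[1]) ** 2
    return income_sq / (age_sq + income_sq)

def scale(record):
    """Apply min-max scaling to both attributes of a record."""
    return (min_max(record[0], *AGE_RANGE), min_max(record[1], *INCOME_RANGE))

income_share_raw = income_share(a, b)
income_share_scaled = income_share(scale(a), scale(b))
print(f"raw: {income_share_raw:.4f}, scaled: {income_share_scaled:.4f}")
```

On the raw values, income accounts for almost all of the distance even though the two applicants differ far more in age; after scaling, the age difference drives the distance.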
What is feature scaling?
Feature scaling is the process of bringing all variables or features onto the same scale so that attributes with large ranges do not dominate those with smaller ranges.
What are the two common methods of feature scaling?
Normalization and standardization are two common methods of feature scaling.
What is normalization?
Normalization scales the values of a feature to a range between 0 and 1.
What is standardization?
Standardization scales the values to have a mean of 0 and a standard deviation of 1.
When is normalization useful?
Normalization is useful when the distribution of the feature is not Gaussian.
When is standardization useful?
Standardization is useful when the distribution of the feature is Gaussian.
Why are normalization and standardization used?
Both techniques are used to improve the performance of machine learning algorithms by ensuring that all features have equal importance.
How is Min-Max Normalization calculated for a value like $73,000 in the income range of $12,000 to $98,000?
To normalize $73,000 using Min-Max Normalization, you calculate (73,000 - 12,000) / (98,000 - 12,000) = 61,000 / 86,000, which is approximately 0.709.
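A minimal sketch of the same calculation, assuming the minimum and maximum of the attribute are known in advance. The function name `min_max_normalize` is illustrative and does not come from any particular library.

```python
def min_max_normalize(x, x_min, x_max):
    """Map x from the range [x_min, x_max] onto [0, 1]."""
    if x_max == x_min:
        raise ValueError("x_min and x_max must differ")
    return (x - x_min) / (x_max - x_min)

# Income example from the card: range $12,000 to $98,000.
normalized_income = min_max_normalize(73_000, 12_000, 98_000)
print(round(normalized_income, 3))
```

The minimum of the range maps to 0 and the maximum maps to 1, so every value inside the range lands between the two.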
What is the formula for Z-Score Standardization?
The formula for Z-Score Standardization is (x - mean) / standard deviation (sd), where x represents the data point.
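The formula can be sketched with Python's standard `statistics` module. The income values are hypothetical, and the choice of the population standard deviation (`pstdev`) rather than the sample version (`stdev`) is an assumption, since the card does not say which one is intended.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mean) / sd for x in values]

# Hypothetical incomes, for illustration only.
incomes = [12_000, 35_000, 50_000, 73_000, 98_000]
standardized = z_scores(incomes)
print([round(z, 2) for z in standardized])
```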
What is data conversion?
Changing data from one format to another
What are some DM techniques that can handle categorical variables without transforming them?
Naïve Bayes and decision trees
Other techniques (such as neural nets and regression) require only numeric inputs.
What is data conversion encoding?
Ordinal to Numeric
How is a single categorical variable with m categories typically transformed?
m-1 dummy variables
Why is data conversion important?
To make data usable across different systems or applications
Why do we need to convert nominal fields into numeric values for techniques like neural nets and regression?
These techniques require only numeric inputs
How can ordinal data be converted to numbers?
Preserving natural order
What values do the dummy variables take?
0 or 1
What is the purpose of using dummy variables when converting a categorical predictor with k ≥ 3 possible values?
To create k - 1 dummy variables and use the unassigned category as the reference category.
What is an example of converting ordinal data to numbers?
Grade: A → 4.0, A- → 3.7, B+ → 3.3, B → 3.0
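An order-preserving conversion like this one can be sketched as a lookup table. The grade points follow the common 4.0 scale, which is an assumption here, and `GRADE_POINTS` and `encode_grades` are illustrative names.

```python
# Assumed grade-point mapping on the common 4.0 scale.
GRADE_POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}

def encode_grades(grades):
    """Replace each ordinal grade with its numeric value."""
    return [GRADE_POINTS[g] for g in grades]

encoded = encode_grades(["B+", "A", "B", "A-"])
print(encoded)
```

Because higher grades map to larger numbers, comparisons on the encoded values agree with the natural order of the original grades.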
What does a value of 1 represent in a dummy variable?
The "yes" category
What is discretization?
Transforming continuous attributes into discrete ones
In the data mining field, many learning methods, such as association rules, can handle only discrete attributes.
What is the problem when creating dummy variables for nominal to numeric data conversion with many values?
Too many variables
How many dummy variables can be created for a nominal variable with few values?
Two dummy variables for a variable with three values, such as Color (m - 1 in general)
Why is discretization necessary in data mining?
To handle learning methods that can only handle discrete attributes
What is the solution to reduce the number of variables when converting nominal to numeric data with many values?
Combine similar values and create dummy variables for the new values
Can you give an example of discretization?
Transforming the age attribute into child and adult categories
Color=Red, Yellow or Green.
What does it mean if C_red = 0 and C_yellow = 0?
The color is green
What is the purpose of C_red dummy variable?
To represent if the color is red or not
What is the purpose of C_yellow dummy variable?
To represent if the color is yellow or not
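The Color example can be sketched as follows, with Green as the reference category so that three colors need only two dummy variables. The function name `dummy_encode` is hypothetical.

```python
def dummy_encode(value, categories, reference):
    """Return m - 1 dummy variables (0 or 1) for a categorical value.

    The reference category is represented by all dummies being 0.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return {
        f"C_{c.lower()}": int(value == c)
        for c in categories
        if c != reference
    }

COLORS = ["Red", "Yellow", "Green"]
red = dummy_encode("Red", COLORS, reference="Green")
yellow = dummy_encode("Yellow", COLORS, reference="Green")
green = dummy_encode("Green", COLORS, reference="Green")
print(red, yellow, green)
```

Green has no dummy variable of its own; it is identified by C_red = 0 and C_yellow = 0.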
How many common methods are there for binning numerical predictors?
Four
What is the purpose of binning numerical variables?
To discretize data into categories
What are the 3 categories created using equal width binning?
Low, Medium, High
What is the purpose of k-means clustering?
To identify a partition of the data into bins
Why would we need to partition numerical predictors into bins?
To use them as categorical predictors for certain algorithms
What is equal width binning?
Partitioning data into equal width categories.
Dataset: {1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44}
Using equal frequency binning, we have n = 12, k = 3, and n/k = 4. The partition is
Low: Contains the first four data values, all X = 1.
Medium: Contains the next four data values, X = {1,2, 2, 11}.
High: Contains the last four data values, X = {11, 12,12, 44}.
Note that one of the medium data values equals a data value in the low category, and another equals a data value in the high category. This violates what should be a self-evident heuristic: Equal data values should belong to the same category.
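The equal frequency partition above can be reproduced with a short sketch. It assumes that n is divisible by k, as in this example, and the function name is illustrative.

```python
def equal_frequency_bins(values, k):
    """Split the sorted values into k bins of n / k values each."""
    ordered = sorted(values)
    if len(ordered) % k != 0:
        raise ValueError("this sketch assumes n is divisible by k")
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size] for i in range(k)]

X = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
low, medium, high = equal_frequency_bins(X, 3)
print(low, medium, high)

# Values that end up in more than one bin.
split_values = (set(low) & set(medium)) | (set(medium) & set(high))
print(split_values)
```

The values 1 and 11 each fall into two different bins, which is the problem the note describes.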
Suppose we have the following tiny data set, which we would like to discretize into 3 categories:
X = {1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44}.
Using equal width binning, we partition X into the following categories of equal width:
Low: 0 ≤ X < 15, which contains all the data values except one.
Medium: 15 ≤ X < 30, which contains no data values at all.
High: 30 ≤ X < 45, which contains a single outlier.
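A minimal sketch of equal width binning on the same data set, using the bounds 0 and 45 from the intervals above. The function and label names are illustrative.

```python
def equal_width_bin(x, lower, upper, k):
    """Return the index of the equal-width bin that contains x."""
    if not lower <= x < upper:
        raise ValueError(f"{x} is outside [{lower}, {upper})")
    width = (upper - lower) / k
    return int((x - lower) // width)

X = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
LABELS = ["Low", "Medium", "High"]

counts = {label: 0 for label in LABELS}
for x in X:
    counts[LABELS[equal_width_bin(x, 0, 45, 3)]] += 1
print(counts)
```

Counting the values per bin makes the weakness visible: almost everything lands in Low, Medium is empty, and the single outlier sits alone in High.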
What is the last method of Binning Numerical values?
The partition identified by k-means clustering
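As a rough sketch of binning by clustering, here is a plain one-dimensional k-means loop applied to the same tiny data set. The initial centroids (smallest value, middle value, largest value) are an assumption made to keep the run deterministic; other starting points can give other partitions.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Basic 1-D k-means: assign to the nearest centroid, then recompute means."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster happens to be empty.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

X = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
ordered = sorted(X)
initial = [ordered[0], ordered[len(ordered) // 2], ordered[-1]]  # assumed start
clusters, centers = kmeans_1d(X, initial)
print(clusters)
```

With this initialization, equal values stay together: the 1s and 2s form one bin, the 11s and 12s another, and 44 is on its own.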
Data Modeling
Data modeling refers to a group of processes in which datasets are analyzed to uncover relationships or patterns among the variables.
Goal of Data Modeling
The goal of data modeling is to use past data to generate predictions on new data and make inferences about these relationships.
It helps in understanding how dependent variables change concerning independent variables.
Data modeling involves two cultures of reaching conclusions from data. What are they?
One culture assumes data is generated by a stochastic model, while the other employs algorithmic models and treats the data mechanism as unknown.
Stochastic Model
A stochastic model is a tool for estimating probability distributions of potential outcomes by allowing for random variation in one or more inputs over time. It is a common approach in data modeling.
Algorithmic Modeling
Algorithmic modeling, a rapidly developing field, is used in data modeling on both large and small data sets. It is an alternative approach that treats data mechanisms as unknown and can provide more accurate results.
Elaborate on:
Data Modeling Techniques
Data modeling encompasses various techniques originating from the two cultures of modeling. No single learning algorithm is universally superior. It is common practice to experiment with multiple models to find the one best suited to a specific problem.
What is important to consider when selecting a modeling technique?
The types of variables involved and relevant business considerations. Model selection is driven by the problem’s specific requirements and characteristics.
Define:
Occam’s Razor Principle
When presented with two models that predict equally well, it is often wise to choose the simpler of the two. This principle, known as Occam’s Razor, suggests that simpler models are preferred unless there is a clear advantage to using a more complex one.
What are the key differences between traditional programming and machine learning?
Traditional Programming:
Input: Data and Program
Process: Computer processes data using the program.
Output: Result or output based on the program’s logic.
Machine Learning:
Input: Data and Output (desired result)
Process: Computer learns from the data to generate a program (model).
Output: The model can predict or generate new outputs based on input data.
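The contrast can be sketched with a toy approval task. In the first function a person writes the rule; in the second, the rule (a single income threshold) is derived from labeled examples. All numbers and names are hypothetical, and the threshold search is a deliberately simple stand-in for a real learning algorithm.

```python
# Traditional programming: the program (rule) is written by hand.
def rule_based(income):
    return income >= 50_000  # hypothetical hand-picked threshold

# Machine learning: the program (model) is derived from data and desired outputs.
def learn_threshold(examples):
    """Pick the income threshold with the best accuracy on the examples."""
    candidates = sorted({income for income, _ in examples})

    def accuracy(threshold):
        hits = sum((income >= threshold) == label for income, label in examples)
        return hits / len(examples)

    return max(candidates, key=accuracy)

# Hypothetical labeled examples: (income, approved?).
examples = [
    (20_000, False),
    (35_000, False),
    (48_000, True),
    (60_000, True),
    (90_000, True),
]
threshold = learn_threshold(examples)

def learned_model(income):
    return income >= threshold

print(threshold, learned_model(50_000), learned_model(30_000))
```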
What is the definition of machine learning according to Tom Mitchell?
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
What are the three components of a machine learning task: T, P, and E?
E: Experience
P: Performance measure
T: class of tasks
What is the name of the system that improves performance from experience according to Herbert Simon?
A computer system learns from data, which represent some “past experiences” of an application domain.
Machine learning can simply be identified as algorithms that improve their performance at some task with experience.
Describe:
The machine learning process
Training data is fed to a learning algorithm; learning produces a trained model, which is then used to produce results.
What is the data used for the machine learning task of predicting loan approval?
Loan application data
What is the performance measure used for the machine learning task of predicting loan approval?
Accuracy
How is the model used to classify future loan applications?
By assigning them to either Yes (approved) or No (not approved)
What is the machine learning task of predicting credit card approval?
A classification problem
What are the features used for the machine learning task of predicting credit card approval?
Information about an applicant, such as age, marital status, annual salary, outstanding debts, credit rating, etc.
What are the labels used for the machine learning task of predicting credit card approval?
Two categories, approved and not approved
How is the model trained for the machine learning task of predicting credit card approval?
By using a supervised learning algorithm that learns from a set of labeled examples
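A minimal sketch of this setup, using a 1-nearest-neighbor classifier as a stand-in for whichever supervised algorithm is actually used. The applicants, their two features (annual salary and outstanding debts, both in thousands of dollars), and the labels are hypothetical.

```python
import math

def predict_1nn(train, features):
    """Return the label of the training example closest to features."""
    nearest = min(train, key=lambda example: math.dist(example[0], features))
    return nearest[1]

# Hypothetical labeled examples: ((salary, debts), label).
train = [
    ((30, 20), "not approved"),
    ((45, 30), "not approved"),
    ((80, 10), "approved"),
    ((95, 5), "approved"),
]

# Held-out labeled examples used to measure accuracy.
test = [
    ((85, 8), "approved"),
    ((35, 25), "not approved"),
]

correct = sum(predict_1nn(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(accuracy)
```

Accuracy here is the fraction of held-out applications whose predicted label matches the known label. In practice the features would usually be scaled first, since the classifier relies on distances.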
List:
Different types of machine learning
- Supervised Machine Learning
- Unsupervised Machine Learning
- Semi-Supervised Learning
- Reinforcement Learning
What is the difference between supervised and unsupervised learning in terms of data labels?
Supervised learning uses data with corresponding labels or desired outputs, while unsupervised learning uses data with no labels or outputs.
What is the goal of semi-supervised learning?
To combine labeled and unlabeled data for training and improve the performance of the model.
What is the main idea of reinforcement learning?
To learn from interactions with an environment and receive rewards or punishments for the actions taken.
What is the key characteristic of supervised learning?
In supervised learning, we know the correct labels during training, which helps the algorithm learn from historical examples.
How can we categorize algorithms within supervised learning?
We can divide them into classification and regression based on the nature of the output they produce.
What do supervised machine learning techniques automatically learn based on historical examples or instances?
A model of the relationship between a set of descriptive features and a target feature
What is a regression problem?
Predicting a real valued number
What is an example of a regression problem?
Predicting final exam scores based on homework scores
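A regression of this kind can be sketched with ordinary least squares on a single predictor. The homework and final exam scores below are made up for illustration and happen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical scores, for illustration only.
homework = [60, 70, 80, 90]
final_exam = [65, 72, 79, 86]

slope, intercept = fit_line(homework, final_exam)
predicted = intercept + slope * 85  # predicted final score for homework = 85
print(slope, intercept, predicted)
```

The output is a real-valued number rather than a category, which is what distinguishes regression from classification.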
What is supervised machine learning?
Learning with labeled data
What is the goal of supervised machine learning?
Predicting an output variable associated with each input item
What are some examples of supervised machine learning tasks?
Email classification, image classification, regression for predicting real-valued outputs
How does supervised machine learning work?
Using input features to predict a target variable
What is labeled data?
Data that has known and assigned outcomes or labels
What is unsupervised machine learning?
Discovering patterns in unlabeled data
What is an example of unsupervised learning?
Clustering similar data points
What is a typical example of unsupervised learning?
Clustering
What is an example of an application for unsupervised machine learning?
Categorizing or grouping customers into different types
In the context of E-commerce:
What are some potential customer groups that can be discovered through unsupervised learning?
Power users, quick browser-type users, careful researcher type users
How can unsupervised learning be used to discover different customer groups?
By analyzing data of how people interact with the site
Why is tailoring site offerings to different groups useful?
Improved user experience and increased likelihood of purchase
What is a classic example of unsupervised learning?
Problem with no labeled examples
What is an example of an unsupervised learning problem related to web security?
Flagging abnormal access to a web server
Why is it difficult to use supervised learning for the web security problem?
For security reasons, you might want to be notified if a website user is making requests that could be a cyber attack or that are otherwise very different from typical user behavior on your site.
Lack of reliable training labels, since there can be many different types of hacking or intrusion attempts to break into the server or exploit it in some way.
What approach do we need for outlier detection?
Unsupervised approach
What does outlier detection not assume?
Future attacks will be of the same form as previous attacks
What does outlier detection assume?
Features of attacks will look different from average user’s behavior
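One simple way to act on this assumption is to flag observations that lie far from typical behavior, without using any attack labels. The sketch below uses a z-score rule on requests per minute; the traffic numbers, the threshold of 3, and the function name are all assumptions made for illustration, and real systems typically use richer features.

```python
import statistics

def flag_outliers(baseline, observations, threshold=3.0):
    """Flag observations more than threshold standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    if sd == 0:
        raise ValueError("baseline has no variation")
    return [x for x in observations if abs(x - mean) / sd > threshold]

# Hypothetical requests per minute from typical users (no labels needed).
baseline = [12, 15, 11, 14, 13, 12, 16, 14]

# New traffic to check; 240 requests per minute is far from typical behavior.
flagged = flag_outliers(baseline, [13, 15, 240, 12])
print(flagged)
```

Nothing in the sketch describes what an attack looks like; it only models average behavior and reports what deviates from it.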
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data to discover patterns that relate data attributes with a target attribute. Unsupervised learning uses unlabeled data to explore the data and find some intrinsic structures in them.
What are the two types of supervised learning problems?
The two types of supervised learning problems are regression and classification. Regression predicts numerical values, while classification predicts categorical values or labels.
What are some examples of unsupervised learning techniques?
Some examples of unsupervised learning techniques are clustering, association, link prediction, and data reduction.
Clustering groups data according to “distance”, association finds frequent co-occurrences, link prediction discovers relationships in data, and data reduction projects features to fewer features.
What is the target attribute in supervised learning?
The target attribute in supervised learning is the output variable that we want to predict based on the input data. It can be either numerical or categorical.
Fraud Detection is an application in which of the following?
(A) Unsupervised Learning: Regression
(B) Supervised Learning: Classification
(C) Unsupervised Learning: Clustering
(D) Reinforcement Learning
B
Customer Segmentation is an application in which of the following?
(A) Supervised Learning: Classification
(B) Unsupervised Learning: Clustering
(C) Unsupervised Learning: Regression
(D) Reinforcement Learning
B
Customer Retention is an application in which of the following?
(A) Unsupervised Learning: Regression
(B) Supervised Learning: Classification
(C) Unsupervised Learning: Clustering
(D) Reinforcement Learning
B
Image classification is a popular problem in the computer vision field. Here, the goal is to predict what class an image belongs to. In this set of problems, we are interested in finding the class label of an image.
A. Supervised Learning
B. Unsupervised Learning
A