Chapter 1 Continued Flashcards

Question

What values do the dummy variables take?

Answer 1

To create k - 1 dummy variables and use the unassigned category as the reference category.

Answer 2

Grade: A → 4.0, A- → 3.7, B+ → 3.3, B → 3.0

Answer 3

"yes" category

Answer 4

Transforming continuous attributes into discrete ones ## Footnote In the data mining field, many learning methods, such as association rules, can handle only discrete attributes.

Answer 5

Too many variables

Answer 6

Two dummy variables

Answer 7

To support learning methods that can handle only discrete attributes

Answer 8

Combine similar values and create dummy variables for the new values

Answer 9

Transforming the age attribute into child and adult categories
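
As a minimal sketch of this kind of discretization, the function below maps a numeric age to one of the two categories. The cutoff of 18 and the name `discretize_age` are assumptions made for illustration; the card does not say where the boundary lies.

```python
def discretize_age(age, adult_cutoff=18):
    """Map a numeric age to 'child' or 'adult'.

    adult_cutoff is a hypothetical threshold chosen for this example.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    return "adult" if age >= adult_cutoff else "child"


ages = [4, 17, 18, 42]
labels = [discretize_age(a) for a in ages]
print(labels)  # ['child', 'child', 'adult', 'adult']
```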

Answer 10

The color is green

Answer 11

To represent if the color is red or not

Answer 12

To represent if the color is yellow or not
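
The color cards above can be illustrated with a short sketch: three categories (red, yellow, green) need only two dummy variables, and green, the reference category, is the case where both are 0. The function name `dummy_encode` and the `is_` column names are assumptions made for this example.

```python
def dummy_encode(value, categories, reference):
    """Return k - 1 indicator variables for one categorical value.

    The reference category gets no indicator of its own; it is the
    case where every indicator equals 0.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {
        f"is_{cat}": int(value == cat)
        for cat in categories
        if cat != reference
    }


colors = ["red", "yellow", "green"]
encoded = {c: dummy_encode(c, colors, reference="green") for c in colors}
for color, row in encoded.items():
    print(color, row)
# red {'is_red': 1, 'is_yellow': 0}
# yellow {'is_red': 0, 'is_yellow': 1}
# green {'is_red': 0, 'is_yellow': 0}
```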

Answer 13

To discretize data into categories

Answer 14

Low, Medium, High

Answer 15

To identify a partition

Answer 16

To use them as categorical predictors for certain algorithms

Answer 17

Partitioning data into equal width categories.

Answer 18

Low: Contains the first four data values, all X = 1. Medium: Contains the next four data values, X = {1, 2, 2, 11}. High: Contains the last four data values, X = {11, 12, 12, 44}. ## Footnote Note that one of the medium data values equals a data value in the low category, and another equals a data value in the high category. This violates what should be a self-evident heuristic: Equal data values should belong to the same category.
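
The equal-frequency split described above can be reproduced with a short sketch. The twelve values in `X` are reconstructed from the three categories listed on the card; the function name `equal_frequency_bins` is an assumption made for illustration.

```python
def equal_frequency_bins(values, labels):
    """Sort the values and cut them into len(labels) bins of equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), len(labels))
    if remainder:
        raise ValueError("number of values must divide evenly into the bins")
    return {
        label: ordered[i * size:(i + 1) * size]
        for i, label in enumerate(labels)
    }


X = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
bins = equal_frequency_bins(X, ["Low", "Medium", "High"])
for label, members in bins.items():
    print(label, members)
# Low [1, 1, 1, 1]
# Medium [1, 2, 2, 11]
# High [11, 12, 12, 44]
```

The value 1 lands in both Low and Medium, and 11 lands in both Medium and High, which is the violation the footnote points out.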

Answer 19

Low: 0 ≤ X < 15, which contains all the data values except one. Medium: 15 ≤ X < 30, which contains no data values at all. High: 30 ≤ X < 45, which contains a single outlier.
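
A minimal sketch of the equal-width split, using the interval edges 0, 15, 30, 45 from the card. The data values in `X` are assumed to be the same twelve values used in the equal-frequency card, and the function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, edges, labels):
    """Assign each value to the half-open interval [lo, hi) that contains it."""
    bins = {label: [] for label in labels}
    for v in values:
        for label, lo, hi in zip(labels, edges, edges[1:]):
            if lo <= v < hi:
                bins[label].append(v)
                break
        else:
            raise ValueError(f"value {v} falls outside all intervals")
    return bins


X = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
width_bins = equal_width_bins(X, edges=[0, 15, 30, 45],
                              labels=["Low", "Medium", "High"])
counts = {label: len(members) for label, members in width_bins.items()}
print(counts)  # {'Low': 11, 'Medium': 0, 'High': 1}
```

The counts show why equal-width binning is sensitive to outliers: a single large value stretches the range so that almost everything falls into one bin.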

Answer 20

The partition identified by k-means clustering
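
To see what a k-means partition might look like on this kind of data, here is a minimal one-dimensional sketch of Lloyd's algorithm with k = 3. The data values are assumed to be the twelve values from the binning cards, and the starting centers (1, 12, 44) are chosen by hand for this example; a real k-means run would typically use random or k-means++ initialization.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Plain Lloyd's algorithm on scalar data.

    Repeats two steps until the centers stop moving: assign each value
    to its nearest center, then move each center to the mean of its
    assigned values.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers


X = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
clusters, centers = kmeans_1d(X, centers=[1, 12, 44])
print(clusters)  # [[1, 1, 1, 1, 1, 2, 2], [11, 11, 12, 12], [44]]
```

Unlike the equal-frequency and equal-width splits, this partition keeps equal values together and isolates the outlier in its own group.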

Answer 21

Data modeling refers to a group of processes in which datasets are analyzed to uncover relationships or patterns among the variables.

Answer 22

The goal of data modeling is to use past data to generate predictions on new data and make inferences about these relationships. ## Footnote It helps in understanding how dependent variables change with respect to independent variables.

Answer 23

One culture assumes data is generated by a stochastic model, while the other employs algorithmic models and treats the data mechanism as unknown.

Answer 24

A stochastic model is a tool for estimating probability distributions of potential outcomes by allowing for random variation in one or more inputs over time. It is a common approach in data modeling.

Answer 25

Algorithmic modeling, a rapidly developing field, is used in data modeling on both large and small data sets. It is an alternative approach that treats the data mechanism as unknown and can, in some cases, provide more accurate predictions.

Answer 26

Data modeling encompasses various techniques originating from the two cultures of modeling. No single learning algorithm is universally superior. It is common practice to experiment with multiple models to find the one best suited to a specific problem.

Answer 27

Types of variables involved and relevant business considerations. Model selection is driven by the problem's specific requirements and characteristics.

Answer 28

When presented with two models that predict equally well, it is often wise to choose the simpler of the two. This principle, known as Occam's Razor, suggests that simpler models are preferred unless there is a clear advantage to using a more complex one.

Answer 29

Traditional Programming:
Input: Data and Program
Process: Computer processes data using the program.
Output: Result or output based on the program’s logic.
Machine Learning:
Input: Data and Output (desired result)
Process: Computer learns from the data to generate a program (model).
Output: The model can predict or generate new outputs based on input data.

Answer 30

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Answer 31

E: experience; T: class of tasks; P: performance measure

Answer 32

A computer system learns from data, which represent some “past experiences” of an application domain.

Answer 33

Algorithms that improve their performance at some task with experience.

Answer 34

Flow from training data to algorithm, then to learning, resulting in a trained model, and finally producing results.

Answer 35

Loan application data

Answer 36

By assigning them to either Yes (approved) or No (not approved)

Answer 37

A classification problem

Answer 38

Information about an applicant, such as age, marital status, annual salary, outstanding debts, credit rating, etc.

Answer 39

Two categories, approved and not approved

Answer 40

By using a supervised learning algorithm that learns from a set of labeled examples
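
As a rough illustration of learning from labeled examples, the sketch below uses a one-nearest-neighbor rule to label a new loan application Yes or No. The four training rows are invented toy data (annual salary and outstanding debt, both in thousands), and nearest neighbor is only one of many supervised algorithms that could be used here; the cards do not name a specific one.

```python
import math


def nearest_neighbor_predict(training, query):
    """Return the label of the training example closest to the query."""
    _, label = min(training, key=lambda ex: math.dist(ex[0], query))
    return label


# Hypothetical labeled examples: ((salary, debt), approved?)
training = [
    ((85.0, 5.0), "Yes"),
    ((60.0, 10.0), "Yes"),
    ((30.0, 25.0), "No"),
    ((25.0, 40.0), "No"),
]

print(nearest_neighbor_predict(training, (70.0, 8.0)))   # Yes
print(nearest_neighbor_predict(training, (28.0, 30.0)))  # No
```

In practice the features would usually be rescaled first so that no single attribute dominates the distance.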

Answer 41

1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Learning
4. Reinforcement Learning

Answer 42

Supervised learning uses data with corresponding labels or desired outputs, while unsupervised learning uses data with no labels or outputs.

Answer 43

To combine labeled and unlabeled data for training and improve the performance of the model.

Answer 44

To learn from interactions with an environment and receive rewards or punishments for the actions taken.

Answer 45

In supervised learning, we know the correct labels during training, which helps the algorithm learn from historical examples.

Answer 46

We can divide them into classification and regression based on the nature of the output they produce.

Answer 47

A model of the relationship between a set of descriptive features and a target feature

Answer 48

Predicting a real valued number

Answer 49

Predicting final exam scores based on homework scores
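
A minimal sketch of this regression example, fitting a straight line by ordinary least squares. The homework and final exam scores are made-up toy values, and simple linear regression is an assumption; the card does not specify a model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: homework score -> final exam score
homework = [60, 70, 80, 90]
final = [65, 72, 79, 86]

slope, intercept = fit_line(homework, final)
predicted = slope * 85 + intercept
print(f"slope={slope:.2f} intercept={intercept:.1f}")  # slope=0.70 intercept=23.0
print(f"predicted final for homework 85: {predicted:.1f}")  # 82.5
```

The output is a real-valued number rather than a category, which is what distinguishes regression from classification.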

Answer 50

Learning with labeled data

Answer 51

Predicting an output variable associated with each input item

Answer 52

Email classification, image classification, regression for predicting real-valued outputs

Answer 53

Using input features to predict a target variable

Answer 54

Data that has known and assigned outcomes or labels

Answer 55

Discovering patterns in unlabeled data

Answer 56

Clustering similar data points

Answer 57

Clustering

Answer 58

Categorizing or grouping customers into different types

Answer 59

Power users, quick browser-type users, careful researcher-type users

Answer 60

By analyzing data of how people interact with the site

Answer 61

Improved user experience and increased likelihood of purchase

Answer 62

Problem with no labeled examples

Answer 63

Flagging abnormal access to a web server

Answer 64

Lack of reliable training labels ## Footnote There can be many different types of hacking or intrusion attempts to break into the server or exploit it in some way.

Answer 65

Unsupervised approach

Answer 66

Future attacks will be of the same form as previous attacks

Answer 67

Features of attacks will look different from an average user's behavior

Answer 68

Supervised learning uses labeled data to discover patterns that relate data attributes to a target attribute. Unsupervised learning uses unlabeled data to explore the data and find intrinsic structure in it.

Answer 69

The two types of supervised learning problems are regression and classification. Regression predicts numerical values, while classification predicts categorical values or labels.

Answer 70

Some examples of unsupervised learning techniques are clustering, association, link prediction, and data reduction. ## Footnote Clustering groups data according to “distance”, association finds frequent co-occurrences, link prediction discovers relationships in data, and data reduction projects features to fewer features.

Answer 71

The target attribute in supervised learning is the output variable that we want to predict based on the input data. It can be either numerical or categorical.

(102 cards)