5 - Preparing to Model the Data Flashcards by Kaman Hung

What is the first phase discussed in Chapter 3 of the Data Science Methodology?

Problem Understanding Phase

What are the important tasks in the Setup Phase?

Partitioning the data
Validating the data partition
Balancing the data
Establishing baseline model performance

What is the primary reason data science does not use the statistical inference paradigm?

Statistical significance can occur without practical significance in large sample sizes
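
To see why, note that a test statistic for a fixed, tiny difference grows with the square root of the sample size. A minimal sketch, assuming two equal-sized groups with a known common standard deviation; the function name and the numbers are illustrative and not taken from the source:

```python
import math

def z_statistic(mean_diff, sd, n_per_group):
    """z statistic for a difference in means between two equal-sized groups."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error

# Same negligible effect (0.01 standard deviations), growing sample size.
for n in (100, 10_000, 1_000_000):
    print(n, round(z_statistic(0.01, 1.0, n), 2))
```

At one million records per group the statistic exceeds 7, far past the usual 1.96 cutoff, even though the effect itself is too small to matter in practice.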

What is data dredging?

Uncovering spurious results due to random variation rather than real effects

What technique helps avoid data dredging?

Cross-validation

What are the two common methods of cross-validation?

Twofold cross-validation
K-fold cross-validation
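
A minimal sketch of how k-fold cross-validation can partition record indices, using only the standard library. The function name k_fold_indices and the seed are illustrative assumptions, not part of the source:

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test_idx in enumerate(folds):
        # Training indices are everything outside the current test fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_records=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Each fold serves once as the test set while the remaining folds form the training set.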

What does the training data set contain in a twofold cross-validation?

Records with complete data

What is the purpose of the test data set in cross-validation?

To evaluate predictions against true target values

What percentage of data is typically used for training in complex data sets?

75–90%

What command is used to partition data in Python?

train_test_split()
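
A minimal sketch of a partition with scikit-learn's train_test_split(), assuming pandas and scikit-learn are installed. The toy data frame, its column names, the 25% test size, and the seed are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; column names are illustrative only.
bank = pd.DataFrame({
    "age": range(20, 60),
    "response": ["yes" if i % 4 == 0 else "no" for i in range(40)],
})

# Hold out 25% of the records for testing; random_state makes the split repeatable.
bank_train, bank_test = train_test_split(bank, test_size=0.25, random_state=7)
print(bank_train.shape, bank_test.shape)
```

Fixing random_state plays the same role as setting a seed: rerunning the code reproduces the same partition.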

What function is used in R to set the random number generator seed?

set.seed()

What statistical test is used for numerical variables to check for differences between training and test sets?

Two-sample t-test
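
A minimal sketch of the test statistic, computed with the standard library only. It uses Welch's form of the two-sample t statistic (an assumption; the source does not say which variant is intended), and the ages are made-up values:

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical ages from a training and a test partition.
train_age = [34, 45, 29, 51, 38, 42, 47, 33]
test_age = [36, 44, 31, 49, 40]
t = welch_t(train_age, test_age)
print(round(t, 3))
```

A statistic close to zero gives no evidence that the partitions differ on this variable. In practice a library routine such as scipy.stats.ttest_ind would also supply the p-value.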

What is the purpose of balancing the training data set?

To provide a rich selection of records for each category

What is resampling?

Sampling at random and with replacement from a data set

What should the test data set never be?

Balanced

What is the formula to calculate the number of resampled records needed?

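x = (p * records - rare) / (1 - p)

Here p is the desired proportion of rare records, records is the number of records in the data set, and rare is the current number of rare records.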
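
The count follows from requiring the rare class to make up proportion p once x resampled records are added: (rare + x) / (records + x) = p, which solves to x = (p * records - rare) / (1 - p). A minimal sketch with illustrative numbers (not taken from the source); the function name is hypothetical:

```python
import math

def records_to_resample(records, rare, p):
    """Rare records to add so that (rare + x) / (records + x) is about p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    x = (p * records - rare) / (1 - p)
    # Round up to a whole record; never return a negative count.
    return max(0, math.ceil(x))

# Illustrative: 1,000 records, 100 of them rare, target 25% rare.
x = records_to_resample(records=1000, rare=100, p=0.25)
print(x, (100 + x) / (1000 + x))
```

With these numbers, 200 resampled records bring the rare class to 300 of 1,200 records, which is 25%.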

In Python, what command is used to check the count of a specific response in the training data set?

value_counts()

What command is used in Python to concatenate two data sets?

pd.concat()

What does the output of bank_train_rebal['response'].value_counts() show after rebalancing?

The count of each response value ('yes' and 'no') in the rebalanced training set
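
The Python rebalancing steps named in these cards can be sketched end to end as follows, assuming pandas is installed. The toy data, the intermediate variable names, the sample size, and the seed are illustrative assumptions:

```python
import pandas as pd

# Hypothetical training set: 2 'yes' and 8 'no' responses.
bank_train = pd.DataFrame({"response": ["yes"] * 2 + ["no"] * 8})
print(bank_train["response"].value_counts())

# Resample 'yes' records at random, with replacement.
to_resample = bank_train.loc[bank_train["response"] == "yes"]
our_resample = to_resample.sample(n=2, replace=True, random_state=1)

# Append the resampled records to the original training set.
bank_train_rebal = pd.concat([bank_train, our_resample])
print(bank_train_rebal["response"].value_counts())
```

Only the training set is rebalanced this way; the test set is left untouched.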

What is the purpose of validating your partition?

To ensure the training and test sets do not differ systematically

What is the desired proportion of fraudulent transactions in a balanced training set example?

25%

True or False: The test set should be balanced to improve model performance.

False

Fill in the blank: The command used in R to find the number of ‘yes’ responses is _______.

table()

What percentage of the training data set had ‘yes’ responses before resampling?

11%

This percentage indicates the initial distribution of ‘yes’ responses in the training dataset.

What percentage of 'yes' responses do we aim for after resampling?

30%

This target percentage is set to increase the representation of rare records in the training data.

How many 'yes' records do we need to resample to achieve the target percentage?

850

This number is calculated based on the total records and desired proportion.

What R command is used to identify the indices of records with 'yes' responses?

which()

This command returns the row numbers corresponding to records meeting the specified condition.

What does the command sample(x = to.resample, size = 850, replace = TRUE) do?

It randomly samples 850 records with replacement from the specified indices.

The replace = TRUE input allows for the same record to be selected multiple times.

What is the purpose of the rbind() command in the context of resampling?

To append the resampled records to the original training data set.

rbind() combines two datasets by stacking them vertically.

What is the output of the table command after rebalancing?

A table showing the count and proportion of 'yes' and 'no' responses.

This helps confirm the effectiveness of the resampling process.

What must a model's accuracy exceed to be considered useful in the fraud detection scenario?

99%

This is the accuracy of an 'all negative' model that classifies all records as non-fraudulent.

What are the two baseline models for binary classification?

All Positive Model
All Negative Model

These models serve as a reference for evaluating the performance of more complex models.

What is the accuracy of the All Positive Model?

p

p represents the proportion of positive records in the dataset.

What is the accuracy of the All Negative Model?

1 - p

This represents the proportion of negative records in the dataset.
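
A minimal sketch of the two binary baselines, using only the standard library. The function name and the labels are illustrative assumptions; the data mimic a case with 1% positive records:

```python
def baseline_accuracies(labels, positive):
    """Accuracy of the All Positive and All Negative baseline models."""
    p = sum(1 for y in labels if y == positive) / len(labels)
    return {"all_positive": p, "all_negative": 1 - p}

# Illustrative data: 1 fraudulent record in 100.
labels = ["fraud"] + ["ok"] * 99
result = baseline_accuracies(labels, positive="fraud")
print(result)
```

With so few positive records, the All Negative Model already reaches 99% accuracy, which is the bar a useful model has to clear.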

What is the Biggest Category Model in k-nary classification?

Assign all predictions to the largest category.

Its accuracy is p_max, the highest proportion of any single category.
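
A minimal sketch of this baseline for more than two categories, using only the standard library. The function name and the labels are illustrative assumptions:

```python
from collections import Counter

def biggest_category_baseline(labels):
    """Return the most common category and the accuracy of always predicting it."""
    category, count = Counter(labels).most_common(1)[0]
    return category, count / len(labels)

labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 2  # illustrative labels
baseline = biggest_category_baseline(labels)
print(baseline)
```

Here always predicting "a" is correct for 5 of the 10 records, so any candidate model should beat 50% accuracy.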

What is the y = ȳ model used for in regression?

It predicts the mean response ȳ for every record, so a regression model's estimates can be compared against it.

This model serves as a baseline for evaluating regression performance.
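
A minimal sketch of the mean-response baseline for regression, using only the standard library. Scoring it with mean squared error is an assumption (the source does not name an error measure), and the function name and values are illustrative:

```python
import statistics

def mean_baseline_mse(y_train, y_test):
    """Mean squared error when every prediction is the training mean."""
    y_bar = statistics.mean(y_train)
    return sum((y - y_bar) ** 2 for y in y_test) / len(y_test)

y_train = [10.0, 12.0, 14.0]  # illustrative values
y_test = [11.0, 13.0]
baseline_mse = mean_baseline_mse(y_train, y_test)
print(baseline_mse)
```

A regression model is only worth keeping if its error on the test set is lower than this baseline error.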

What is the optimal benchmark for calibrating model performance?

The current gold standard model performance.

This benchmark is based on established literature or proprietary business models.

What are the four tasks that should be undertaken during the Setup Phase?

Partitioning the data
Validating the data partition
Balancing the data
Establishing baseline model performance

These steps are essential for preparing data for analysis.

True or false: There is no baseline model for k-nary classification.

False

The Biggest Category Model serves as a baseline for k-nary classification.

What is balancing in the context of data preparation?

The process of adjusting the dataset to ensure a more even distribution of response categories.

Balancing is particularly important for handling rare events in classification tasks.

What is resampling?

The process of selecting records from a dataset to create a new sample.

This technique can be used to adjust the proportions of different response categories in the dataset.