5 - Preparing to Model the Data Flashcards
What is the first phase discussed in Chapter 3 of the Data Science Methodology?
Problem Understanding Phase
What are the important tasks in the Setup Phase?
- Partitioning the data
- Validating the data partition
- Balancing the data
- Establishing baseline model performance
What is the primary reason data science does not use the statistical inference paradigm?
With very large samples, even tiny effects can be statistically significant without being practically significant
What is data dredging?
Uncovering spurious results due to random variation rather than real effects
What technique helps avoid data dredging?
Cross-validation
What are the two common methods of cross-validation?
- Twofold cross-validation
- K-fold cross-validation
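A minimal sketch of how k-fold cross-validation splits records, assuming the standard definition: the data is divided into k similar subsets, and each subset serves once as the test set while the other k - 1 form the training set. The helper name `k_fold_indices` and the record count are hypothetical, chosen only for illustration.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Yield (train_idx, test_idx) pairs, one pair per fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k roughly equal subsets
    for i in range(k):
        test_idx = folds[i]                     # fold i is held out for testing
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical example: 10 records, 5 folds
splits = list(k_fold_indices(n_records=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Every record lands in exactly one test fold, so each record is used for evaluation once and for training k - 1 times.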
What does the training data set contain in a twofold cross-validation?
Records with complete data
What is the purpose of the test data set in cross-validation?
To evaluate predictions against true target values
What percentage of data is typically used for training in complex data sets?
75–90%
What command is used to partition data in Python?
train_test_split()
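As an illustration, the partition might look like the sketch below. It assumes `train_test_split()` is imported from scikit-learn's `sklearn.model_selection`; the toy data frame, the 25% test size, and the seed value are hypothetical choices, not values from the chapter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real data set
bank = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "response": ["no", "no", "yes", "no", "no", "yes", "no", "no"],
})

# Hold out 25% of the records for testing; random_state fixes the shuffle
# so the same partition is produced on every run
bank_train, bank_test = train_test_split(bank, test_size=0.25, random_state=7)

print(bank_train.shape, bank_test.shape)
```

Setting `random_state` plays the same role in Python that `set.seed()` plays in R: it makes the random partition reproducible.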
What function is used in R to set the random number generator seed?
set.seed()
What statistical test is used for numerical variables to check for differences between training and test sets?
Two-sample t-test
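One possible way to run this check in Python is with `scipy.stats.ttest_ind`, sketched below. The age values are hypothetical toy data, and the 0.05 threshold is a conventional assumption rather than something the flashcards specify.

```python
import pandas as pd
from scipy import stats

# Hypothetical values of one numerical variable in each partition
train_age = pd.Series([25, 32, 47, 51, 38, 29, 60, 44])
test_age = pd.Series([31, 45, 39, 52])

# Two-sample t-test: do the partition means differ?
t_stat, p_value = stats.ttest_ind(train_age, test_age)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# A large p-value gives no evidence of a systematic difference
if p_value > 0.05:
    print("No significant difference detected for this variable")
```

In practice the same test would be repeated for each numerical variable of interest.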
What is the purpose of balancing the training data set?
To provide a rich selection of records for each category
What is resampling?
Sampling at random and with replacement from a data set
What should the test data set never be?
Balanced
What is the formula to calculate the number of resampled records needed?
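x = (p * records - rare) / (1 - p), where p is the desired proportion of rare records, records is the number of records in the data set, and rare is the current number of rare records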
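The count of resampled records follows from requiring that, after adding x rare records, the rare proportion equals p: (rare + x) / (records + x) = p, which solves to x = (p * records - rare) / (1 - p). A small sketch of the arithmetic is below; the function name `records_to_resample` and the example numbers are hypothetical.

```python
def records_to_resample(records, rare, p):
    """Number of rare records to add so that rare records make up proportion p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return (p * records - rare) / (1 - p)

# Hypothetical example: 1000 records, 100 of them rare, target proportion 25%
x = records_to_resample(records=1000, rare=100, p=0.25)
print(round(x))

# Check: proportion of rare records after adding x resampled rare records
print((100 + x) / (1000 + x))
```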
In Python, what command is used to check the count of a specific response in the training data set?
value_counts()
What command is used in Python to concatenate two data sets?
pd.concat()
What does the output of bank_train_rebal['response'].value_counts() show after rebalancing?
The count of each response value in the rebalanced training set, including the new number of ‘yes’ responses
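The rebalancing steps can be sketched with pandas as follows. The toy training set, the number of resampled records, and the intermediate names `to_resample` and `our_resample` are assumptions made for illustration; only `pd.concat()`, `value_counts()`, and the name `bank_train_rebal` come from the cards above.

```python
import pandas as pd

# Hypothetical toy training set: 2 'yes' responses out of 10 records
bank_train = pd.DataFrame({"response": ["yes"] * 2 + ["no"] * 8})
print(bank_train["response"].value_counts())

# Isolate the rare records and sample from them at random, with replacement
to_resample = bank_train.loc[bank_train["response"] == "yes"]
n_extra = 6  # 6 extra 'yes' records bring the total to 8 of 16 (50%)
our_resample = to_resample.sample(n=n_extra, replace=True, random_state=7)

# Append the resampled records to the original training set
bank_train_rebal = pd.concat([bank_train, our_resample])
print(bank_train_rebal["response"].value_counts())
```

Only the training set is rebalanced this way; the test set is left untouched.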
What is the purpose of validating your partition?
To ensure the training and test sets do not differ systematically
What is the desired proportion of fraudulent transactions in a balanced training set example?
25%
True or False: The test set should be balanced to improve model performance.
False
Fill in the blank: The command used in R to find the number of ‘yes’ responses is _______.
table()
What percentage of the training data set had ‘yes’ responses before resampling?
11%
This percentage is the initial proportion of ‘yes’ responses in the training data set.