Data Science using Python and R - 5 Flashcards
What is the first phase discussed in Chapter 3 of the Data Science Methodology?
Problem Understanding Phase
What are the important tasks in the Setup Phase?
- Partitioning the data
- Validating the data partition
- Balancing the data
- Establishing baseline model performance
What is the primary reason data science does not use the statistical inference paradigm?
With very large sample sizes, even tiny effects can be statistically significant without being practically significant
What is data dredging?
Uncovering spurious results due to random variation rather than real effects
What technique helps avoid data dredging?
Cross-validation
What are the two common methods of cross-validation?
- Twofold cross-validation
- K-fold cross-validation
What does the training data set contain in a twofold cross-validation?
Records with complete data, including the target variable
What is the purpose of the test data set in cross-validation?
To evaluate predictions against true target values
What percentage of data is typically used for training in complex data sets?
75–90%
What command is used to partition data in Python?
train_test_split()
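A minimal sketch of partitioning with train_test_split() from scikit-learn. The data frame, its column names, the 25% test size, and the seed value 7 are illustrative assumptions, not values from the book. The random_state argument plays a role similar to set.seed() in R: it makes the partition reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data set (hypothetical values) standing in for the full data.
bank = pd.DataFrame({
    "age": list(range(20, 60)),
    "response": ["yes" if i % 5 == 0 else "no" for i in range(40)],
})

# Hold out 25% of the records for testing; random_state fixes the shuffle
# so the same partition is produced on every run.
bank_train, bank_test = train_test_split(bank, test_size=0.25, random_state=7)

print(bank_train.shape, bank_test.shape)
```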
What function is used in R to set the random number generator seed?
set.seed()
What statistical test is used for numerical variables to check for differences between training and test sets?
Two-sample t-test
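One way to run this check in Python is scipy.stats.ttest_ind, sketched below. The two lists of ages are made-up values, and the 0.05 cutoff is a conventional choice assumed here, not one stated on the card.

```python
from scipy import stats

# Hypothetical values of one numerical variable in each partition.
train_age = [30, 35, 40, 45, 50, 55, 33, 47]
test_age = [32, 38, 42, 48, 52]

# Two-sample t-test for a difference in means between the partitions.
t_stat, p_value = stats.ttest_ind(train_age, test_age)

# A large p-value gives no evidence that the partitions differ on this variable.
partition_looks_valid = p_value > 0.05
print(t_stat, p_value, partition_looks_valid)
```

In practice the check would be repeated for several variables, not just one.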
What is the purpose of balancing the training data set?
To provide a rich selection of records for each category
What is resampling?
Sampling at random and with replacement from a data set
What should the test data set never be?
Balanced
What is the formula to calculate the number of resampled records needed?
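x = (p * records - rare) / (1 - p), where p is the desired proportion of rare records, records is the number of records in the data set, and rare is the current number of rare records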
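The resampling count follows from requiring (rare + x) / (records + x) = p and solving for x, which gives x = (p * records - rare) / (1 - p). A small worked sketch follows. The totals of 100,000 records and 1,000 rare records are assumed for illustration, chosen to match a 1% fraud rate and a 25% target. The function name is hypothetical.

```python
import math

def resample_count(p, records, rare):
    """Number of rare records to add so they make up proportion p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    x = (p * records - rare) / (1 - p)
    # No resampling is needed if the rare class already meets the target.
    return max(0, math.ceil(x))

# Assumed example: 100,000 records, 1,000 of them rare, 25% desired.
x = resample_count(p=0.25, records=100_000, rare=1_000)
new_share = (1_000 + x) / (100_000 + x)
print(x, new_share)
```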
In Python, what command is used to check the count of a specific response in the training data set?
value_counts()
What command is used in Python to concatenate two data sets?
pd.concat()
What does the output of bank_train_rebal['response'].value_counts() show after rebalancing?
The new counts of each response value ('yes' and 'no') in the rebalanced training set
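The Python rebalancing steps (isolate the rare records, resample them with replacement, concatenate, then recount) can be sketched with pandas as follows. The ten-row data frame, the resample size of 2, and the seed are toy assumptions. Only the names bank_train_rebal and response come from the cards.

```python
import pandas as pd

# Toy training set (hypothetical data): 2 'yes' and 8 'no' records.
bank_train = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44, 35, 53],
    "response": ["yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no"],
})

# Isolate the rare records to resample from.
to_resample = bank_train.loc[bank_train["response"] == "yes"]

# Draw 2 extra records at random, with replacement.
our_resample = to_resample.sample(n=2, replace=True, random_state=7)

# Append the resampled records to the original training set and recount.
bank_train_rebal = pd.concat([bank_train, our_resample])
counts = bank_train_rebal["response"].value_counts()
print(counts)
```

With these toy numbers the share of 'yes' records rises from 2 of 10 to 4 of 12.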
What is the purpose of validating your partition?
To ensure the training and test sets do not differ systematically
What is the desired proportion of fraudulent transactions in a balanced training set example?
25%
True or False: The test set should be balanced to improve model performance.
False
Fill in the blank: The command used in R to find the number of ‘yes’ responses is _______.
table()
What percentage of the training data set had ‘yes’ responses before resampling?
11%
This percentage indicates the initial distribution of ‘yes’ responses in the training dataset.
What percentage of ‘yes’ responses do we aim for after resampling?
30%
This target percentage is set to increase the representation of rare records in the training data.
How many ‘yes’ records do we need to resample to achieve the target percentage?
850
This number is calculated from the total number of records, the current number of 'yes' records, and the desired proportion.
What command is used to identify the indices of records with ‘yes’ responses?
which()
This command returns the row numbers corresponding to records meeting the specified condition.
What does the command sample(x = to.resample, size = 850, replace = TRUE) do?
It randomly samples 850 records with replacement from the specified indices.
The replace = TRUE argument allows the same record to be selected multiple times.
What is the purpose of the rbind() command in the context of resampling?
To append the resampled records to the original training data set.
rbind() combines two datasets by stacking them vertically.
What is the output of the table command after rebalancing?
A table showing the count and proportion of ‘yes’ and ‘no’ responses.
This helps confirm the effectiveness of the resampling process.
What must a model’s accuracy exceed to be considered useful in the fraud detection scenario?
99%
This is the accuracy of an ‘all negative’ model that classifies all records as non-fraudulent.
What are the two baseline models for binary classification?
- All Positive Model
- All Negative Model
These models serve as a reference for evaluating the performance of more complex models.
What is the accuracy of the All Positive Model?
p
p represents the proportion of positive records in the dataset.
What is the accuracy of the All Negative Model?
1 - p
This represents the proportion of negative records in the dataset.
What is the Biggest Category Model in k-nary classification?
Assign all predictions to the largest category.
Its accuracy is p_max, the highest proportion of any single category.
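The three classification baselines can be computed directly from the class proportions. The sketch below uses a made-up label list with 1 fraudulent record in 100, so p = 0.01. The function name and label strings are hypothetical.

```python
from collections import Counter

# Toy labels (hypothetical): 1 fraudulent record out of 100, so p = 0.01.
labels = ["fraud"] + ["ok"] * 99

def baseline_accuracies(labels, positive):
    n = len(labels)
    # Proportion of positive records: accuracy of the All Positive Model.
    p = sum(1 for y in labels if y == positive) / n
    # Share of the most frequent category: accuracy of the Biggest Category Model.
    biggest = Counter(labels).most_common(1)[0][1] / n
    return {"all_positive": p, "all_negative": 1 - p, "biggest_category": biggest}

acc = baseline_accuracies(labels, positive="fraud")
print(acc)
```

For a binary target, the Biggest Category Model coincides with whichever of the All Positive and All Negative models is more accurate.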
What is the y = ȳ model used for in regression?
It compares estimates against the mean response.
This model serves as a baseline for evaluating regression performance.
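A minimal sketch of this comparison, assuming mean absolute error as the error measure (the card does not name one). The response values and model predictions are made up for illustration.

```python
from statistics import mean

# Hypothetical true responses and predictions from some regression model.
y_true = [10, 12, 14, 16, 18]
y_model = [11, 12, 13, 17, 18]

def mean_absolute_error(actual, predicted):
    return sum(abs(a - b) for a, b in zip(actual, predicted)) / len(actual)

# Baseline: predict the mean response for every record.
y_bar = mean(y_true)
baseline_mae = mean_absolute_error(y_true, [y_bar] * len(y_true))
model_mae = mean_absolute_error(y_true, y_model)

# The model is only useful if it beats the baseline error.
print(baseline_mae, model_mae, model_mae < baseline_mae)
```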
What is the optimal benchmark for calibrating model performance?
The current gold standard model performance.
This benchmark is based on established literature or proprietary business models.
True or false: There is no baseline model for k-nary classification.
False
The Biggest Category Model serves as a baseline for k-nary classification.
What is balancing in the context of data preparation?
The process of adjusting the dataset to ensure a more even distribution of response categories.
Balancing is particularly important for handling rare events in classification tasks.
What is resampling?
The process of sampling records at random and with replacement from a dataset to create a new sample.
This technique can be used to adjust the proportions of different response categories in the dataset.