Data Science using Python and R - 5 Flashcards

1
Q

What is the first phase discussed in Chapter 3 of the Data Science Methodology?

A

Problem Understanding Phase

2
Q

What are the important tasks in the Setup Phase?

A
  • Partitioning the data
  • Validating the data partition
  • Balancing the data
  • Establishing baseline model performance
3
Q

What is the primary reason data science does not use the statistical inference paradigm?

A

Statistical significance can occur without practical significance in large sample sizes

4
Q

What is data dredging?

A

Uncovering spurious results due to random variation rather than real effects

5
Q

What technique helps avoid data dredging?

A

Cross-validation

6
Q

What are the two common methods of cross-validation?

A
  • Twofold cross-validation
  • K-fold cross-validation
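
As a rough illustration of the k-fold idea, the sketch below deals shuffled row indices into k folds and lets each fold act once as the test set. The helper name `k_fold_indices`, the seed, and the record count are assumptions made for this example, not something the cards specify.

```python
import random

def k_fold_indices(n_records, k, seed=7):
    """Split row indices 0..n_records-1 into k roughly equal folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)      # shuffle reproducibly
    return [indices[i::k] for i in range(k)]  # deal indices into k folds

folds = k_fold_indices(n_records=10, k=5)
for i, test_fold in enumerate(folds):
    # Each fold serves once as the test set; the others form the training set.
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(f"fold {i}: test={sorted(test_fold)} train size={len(train)}")
```
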
7
Q

What does the training data set contain in a twofold cross-validation?

A

Records with complete data, including the target variable

8
Q

What is the purpose of the test data set in cross-validation?

A

To evaluate predictions against true target values

9
Q

What percentage of data is typically used for training in complex data sets?

A

75–90%

10
Q

What command is used to partition data in Python?

A

train_test_split()
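
train_test_split() comes from scikit-learn. The sketch below is a standard-library stand-in, not that function: it only illustrates what a seeded random partition does. The helper name `partition`, the 75% training fraction, and the seed are assumptions chosen for the example.

```python
import random

def partition(records, train_fraction=0.75, seed=7):
    """Randomly split records into (train, test) lists."""
    shuffled = records[:]                    # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed gives a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))                   # stand-in for 100 data rows
train, test = partition(records)
print(len(train), len(test))
```

Fixing the seed plays the same role as set.seed() in R: rerunning the code reproduces the same partition.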

11
Q

What function is used in R to set the random number generator seed?

A

set.seed()

12
Q

What statistical test is used for numerical variables to check for differences between training and test sets?

A

Two-sample t-test
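
A minimal sketch of the test statistic, using Welch's unequal-variance form as one common variant of the two-sample t-test. The ages are made-up values and the function name `welch_t` is hypothetical. A statistic close to zero gives no evidence that the training and test sets differ on that variable. Turning it into a p-value needs the t distribution, which this sketch leaves out.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Two-sample t statistic without assuming equal variances (Welch)."""
    se = sqrt(variance(sample_a) / len(sample_a) +
              variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

train_age = [34, 45, 29, 52, 41, 38, 47, 33]   # made-up training values
test_age = [36, 44, 31, 50, 40, 39]            # made-up test values
t = welch_t(train_age, test_age)
print(f"t = {t:.3f}")
```
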

13
Q

What is the purpose of balancing the training data set?

A

To provide a rich selection of records for each category

14
Q

What is resampling?

A

Sampling at random and with replacement from a data set

15
Q

What should the test data set never be?

A

Balanced

16
Q

What is the formula to calculate the number of resampled records needed?

A

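x = (p * records - rare) / (1 - p)

Here x is the number of resampled records to add, p is the desired proportion of rare records, records is the current number of records, and rare is the current number of rare records.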
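
The number of records to add follows from requiring (rare + x) / (records + x) = p and solving for x, which gives x = (p * records - rare) / (1 - p). The sketch below checks this on made-up counts; the function name `records_to_resample` is hypothetical.

```python
def records_to_resample(records, rare, p):
    """Number of rare records to add so that they make up proportion p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return (p * records - rare) / (1 - p)

# Made-up example: 1000 records, 100 of them rare, target proportion 25%.
x = records_to_resample(records=1000, rare=100, p=0.25)
print(x)                         # 200.0
print((100 + x) / (1000 + x))    # 0.25
```
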

17
Q

In Python, what command is used to check the count of a specific response in the training data set?

A

value_counts()

18
Q

What command is used in Python to concatenate two data sets?

A

pd.concat()

19
Q

What does the output of bank_train_rebal['response'].value_counts() show after rebalancing?

A

The new count of each response value (‘yes’ and ‘no’) in the rebalanced training set
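
The rebalancing steps on these cards (count the responses, resample the rare records with replacement, append them, count again) can be sketched with the standard library alone. Here `Counter` stands in for value_counts(), list concatenation stands in for pd.concat(), and the data and variable names are made up for illustration.

```python
import random
from collections import Counter

rng = random.Random(7)                     # fixed seed for a repeatable resample

# Made-up training responses: 90 'no' and 10 'yes' (10% rare).
train = ["no"] * 90 + ["yes"] * 10

# 20 extra 'yes' records bring the share to (10 + 20) / (100 + 20) = 25%.
n_extra = 20

to_resample = [r for r in train if r == "yes"]
our_resample = rng.choices(to_resample, k=n_extra)   # sampling with replacement
train_rebal = train + our_resample                   # append the resampled records

print(Counter(train))
print(Counter(train_rebal))
```
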

20
Q

What is the purpose of validating your partition?

A

To ensure the training and test sets do not differ systematically

21
Q

What is the desired proportion of fraudulent transactions in a balanced training set example?

22
Q

True or False: The test set should be balanced to improve model performance.

A

False

The test data set should never be balanced.

23
Q

Fill in the blank: The command used in R to find the number of ‘yes’ responses is _______.

24
Q

What percentage of the training data set had ‘yes’ responses before resampling?

A

11%

This percentage indicates the initial distribution of ‘yes’ responses in the training dataset.

25
Q

What percentage of ‘yes’ responses do we aim for after resampling?

A

30%

This target percentage is set to increase the representation of rare records in the training data.

26
Q

How many ‘yes’ records do we need to resample to achieve the target percentage?

A

850

This number is calculated based on the total records and desired proportion.

27
Q

What command is used to identify the indices of records with ‘yes’ responses?

A

which()

This command returns the row numbers corresponding to records meeting the specified condition.

28
Q

What does the command sample(x = to.resample, size = 850, replace = TRUE) do?

A

It randomly samples 850 records with replacement from the specified indices.

The replace = TRUE input allows for the same record to be selected multiple times.

29
Q

What is the purpose of the rbind() command in the context of resampling?

A

To append the resampled records to the original training data set.

rbind() combines two datasets by stacking them vertically.

30
Q

What is the output of the table command after rebalancing?

A

A table showing the count and proportion of ‘yes’ and ‘no’ responses.

This helps confirm the effectiveness of the resampling process.

31
Q

What must a model’s accuracy exceed to be considered useful in the fraud detection scenario?

A

99%

This is the accuracy of an ‘all negative’ model that classifies all records as non-fraudulent.

32
Q

What are the two baseline models for binary classification?

A
  • All Positive Model
  • All Negative Model

These models serve as a reference for evaluating the performance of more complex models.

33
Q

What is the accuracy of the All Positive Model?

A

p

p represents the proportion of positive records in the dataset.

34
Q

What is the accuracy of the All Negative Model?

A

1 - p

This represents the proportion of negative records in the dataset.

35
Q

What is the Biggest Category Model in k-nary classification?

A

Assign all predictions to the largest category.

Its accuracy is p_max, the highest proportion of any single category.
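
A short sketch of the baseline accuracies described on these cards, computed from a list of true labels. The labels are made up (1% positive, echoing the fraud scenario) and the function name `baseline_accuracies` is hypothetical.

```python
from collections import Counter

def baseline_accuracies(labels, positive):
    """Accuracy of the naive baseline classifiers on a list of true labels."""
    counts = Counter(labels)
    n = len(labels)
    p = counts[positive] / n                 # proportion of positive records
    return {
        "all_positive": p,                   # predict positive for every record
        "all_negative": 1 - p,               # predict negative for every record
        "biggest_category": max(counts.values()) / n,
    }

labels = ["fraud"] * 1 + ["ok"] * 99         # made-up data: 1% positive
result = baseline_accuracies(labels, positive="fraud")
print(result)
```

With these labels the All Negative Model is already about 99% accurate, which is why a useful fraud model has to beat that figure.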

36
Q

What is the y = ȳ model used for in regression?

A

It predicts the mean response ȳ for every record, so a model's estimates can be compared against it.

This model serves as a baseline for evaluating regression performance.
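
A minimal sketch of that comparison, assuming root mean squared error as the error measure (the cards do not name one). The responses and model estimates are made-up numbers.

```python
from math import sqrt
from statistics import mean

def rmse(actual, predicted):
    """Root mean squared error between true values and estimates."""
    return sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))

y = [3.0, 5.0, 4.0, 8.0]                     # made-up responses
y_bar = mean(y)                              # baseline predicts this for every record
baseline = rmse(y, [y_bar] * len(y))
model = rmse(y, [3.5, 4.5, 4.5, 7.0])        # made-up model estimates

print(f"baseline RMSE = {baseline:.3f}, model RMSE = {model:.3f}")
print("model beats baseline:", model < baseline)
```
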

37
Q

What is the optimal benchmark for calibrating model performance?

A

The current gold standard model performance.

This benchmark is based on established literature or proprietary business models.

38
Q

What are the four tasks that should be undertaken during the Setup Phase?

A
  • Partitioning the data
  • Validating the data partition
  • Balancing the data
  • Establishing baseline model performance

These steps are essential for preparing data for analysis.

39
Q

True or false: There is no baseline model for k-nary classification.

A

False

The Biggest Category Model serves as a baseline for k-nary classification.

40
Q

What is balancing in the context of data preparation?

A

The process of adjusting the dataset to ensure a more even distribution of response categories.

Balancing is particularly important for handling rare events in classification tasks.

41
Q

What is resampling?

A

The process of selecting records from a dataset to create a new sample.

This technique can be used to adjust the proportions of different response categories in the dataset.