Data Science using Python and R - 3 Flashcards

1
Q

What data sets are used in the bank marketing analysis?

A

bank_marketing_training and bank_marketing_test data sets

These data sets are adapted from the bank-additional-full.txt data set from the UCI Machine Learning Repository.

2
Q

What are the four predictors used in the analysis?

A
  • age
  • education
  • previous_outcome
  • days_since_previous

The target response is whether contacts subscribe to a term deposit account.

3
Q

How many records are in the bank_marketing_training data set?

A

26,874 records

4
Q

How many records are in the bank_marketing_test data set?

A

10,255 records

5
Q

What is the first phase of the Data Science Methodology?

A

Problem Understanding Phase

6
Q

What is one objective of the bank marketing analysis?

A

Learn about potential customers’ characteristics

7
Q

What is another objective of the bank marketing analysis?

A

Develop a profitable method of identifying likely positive responders

8
Q

What is a method to learn about potential customers?

A

Use Exploratory Data Analysis

9
Q

Which classification models can be developed for the analysis?

A
  • Decision Trees
  • Random Forests
  • Naïve Bayes Classification
  • Neural Networks
  • Logistic Regression
10
Q

What is the purpose of adding an index field?

A

Acts as an ID field and tracks the sort order of records

11
Q

What is the command to read a CSV file in Python?

A

pd.read_csv()

12
Q

How do you create an index field in Python?

A

bank_train['index'] = pd.Series(range(0, 26874))
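
A minimal sketch of the index step, assuming a small made-up DataFrame in place of the real training file (the rows and values are illustrative, not the book's data). Sizing the range with `len()` is a small variation that avoids hard-coding 26,874.

```python
import pandas as pd

# Stand-in for the result of pd.read_csv(...); these values are made up.
bank_train = pd.DataFrame({"age": [56, 37, 41],
                           "education": ["illiterate", "high.school", "unknown"]})

# One index value per record, numbered from 0 in the current row order.
bank_train["index"] = pd.Series(range(0, len(bank_train)))
print(bank_train)
```

Because the index is stored as an ordinary column, the original record order can be recovered after any later sort.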

13
Q

What function in R provides the number of records in a data set?

A

dim()

14
Q

What is the misleading value in the days_since_previous field?

A

999

15
Q

What value should replace the misleading field value of 999 in Python?

A

np.nan (spelled np.NaN in older code; NumPy 2.0 removed that alias)
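
A short sketch of the replacement, assuming a made-up column of four values. It uses the pandas `replace()` method with a one-entry dictionary; other approaches would work equally well.

```python
import numpy as np
import pandas as pd

# Made-up values; 999 is the code standing in for "missing".
bank_train = pd.DataFrame({"days_since_previous": [999, 6, 999, 3]})

# Swap the 999 code for NaN so summaries and plots ignore it.
bank_train["days_since_previous"] = (
    bank_train["days_since_previous"].replace({999: np.nan})
)

# The mean now uses only the two real values, 6 and 3.
print(bank_train["days_since_previous"].mean())
```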

16
Q

What command is used to create a histogram in Python?

A

plot(kind='hist')

17
Q

How do you change misleading field values in R?

A

bank_train$days_since_previous <- ifelse(test = bank_train$days_since_previous == 999, yes = NA, no = bank_train$days_since_previous)

18
Q

What is the purpose of re-expressing categorical data as numeric?

A

To provide information on the relative differences among categories

19
Q

What issue arises if categorical data is left unchanged?

A

Data science algorithms would not recognize the ordering of categories

20
Q

What is the command to view the first six records in R?

A

head()

21
Q

Fill in the blank: The bank marketing data sets are used for a _______ campaign.

A

phone-based direct marketing

22
Q

What is the goal of transforming data values into numeric values?

A

To ensure that one value is larger than another while preserving relative differences among various categories.

23
Q

What is the numeric value assigned to ‘illiterate’ in the education variable?

A

0

24
Q

What is the numeric value assigned to ‘high.school’ in the education variable?

A

12

25
Q

What Python command is used to replicate the education variable?

A

bank_train['education_numeric'] = bank_train['education']

26
Q

In Python, how do you replace categorical values with numeric ones in a DataFrame?

A

bank_train.replace(dict_edu, inplace=True)
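
A hedged sketch of how the copy-then-replace steps might fit together. Only the three mappings stated in these cards are included (illiterate, high.school, unknown); the remaining education levels would follow the same pattern. The final `pd.to_numeric` call is an added safeguard, not part of the card's answer.

```python
import numpy as np
import pandas as pd

# Made-up records covering the three categories named in the cards.
bank_train = pd.DataFrame({"education": ["illiterate", "high.school", "unknown"]})

# Copy the field so the original categories are kept.
bank_train["education_numeric"] = bank_train["education"]

# Outer key = column to change; inner dict = old value -> new value.
dict_edu = {"education_numeric": {"illiterate": 0,
                                  "high.school": 12,
                                  "unknown": np.nan}}
bank_train = bank_train.replace(dict_edu)

# Make sure the result is stored as numbers rather than generic objects.
bank_train["education_numeric"] = pd.to_numeric(bank_train["education_numeric"])
print(bank_train)
```

Nesting the mapping under the column name limits the replacement to education_numeric, so the original education field is left untouched.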

27
Q

What R function is used to replace values in a variable according to specified rules?

A

revalue() (from the plyr package)

28
Q

Fill in the blank: The command used in Python to calculate the z-score is _______.

A

stats.zscore()

29
Q

What is the purpose of standardizing numeric fields?

A

To ensure the field mean equals 0 and the field standard deviation equals 1.
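
To illustrate, here is a small sketch that standardizes a made-up age column by hand instead of calling `stats.zscore()`. It assumes the population standard deviation (`ddof=0`), which is the default used by SciPy's `stats.zscore()`.

```python
import pandas as pd

# Made-up ages.
bank_train = pd.DataFrame({"age": [20, 30, 40, 50, 60]})

# z = (value - mean) / standard deviation, with ddof=0 (population SD).
age = bank_train["age"]
bank_train["age_z"] = (age - age.mean()) / age.std(ddof=0)

# After standardizing, the mean is 0 and the SD is 1 (up to rounding).
print(bank_train["age_z"].mean(), bank_train["age_z"].std(ddof=0))
```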

30
Q

What is considered an outlier in the context of z-values?

A

A data value with a z-value greater than 3 or less than -3.

31
Q

How do you identify outliers using Python?

A

bank_train.query('age_z > 3 | age_z < -3')
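
A minimal sketch of the query on made-up data. The age_z values below are invented for illustration rather than computed from the ages, and the name `age_outliers` is hypothetical.

```python
import pandas as pd

# Made-up records with an invented standardized age column.
bank_train = pd.DataFrame({"age": [25, 40, 95, 33],
                           "age_z": [-1.2, 0.1, 3.6, -0.5]})

# Keep rows whose z-value lies outside the -3 to 3 band.
age_outliers = bank_train.query("age_z > 3 | age_z < -3")
print(age_outliers)
```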

32
Q

What command is used in R to sort a data set by a specific variable?

A

order(), used to index the rows; for example, bank_train_sort <- bank_train[order(-bank_train$age_z), ]

33
Q

What is the default behavior of the scale() function in R?

A

It centers and scales the variable to calculate the z-score.

34
Q

What does the command bank_train$education_numeric <- as.numeric(levels(edu.num))[edu.num] do in R?

A

Converts factor levels of edu.num to numeric and assigns them to education_numeric.

35
Q

What is the numeric value assigned to ‘unknown’ in the education variable?

A

Missing (np.NaN in Python, NA in R)

36
Q

What is the mean number of contacts per customer in the example?

37
Q

What does the replace() function do in Python?

A

Replaces values in a DataFrame according to a specified dictionary.

38
Q

True or False: Outliers should always be removed from the dataset.

A

False. Outliers may contain valuable information, so investigate them before deciding whether to keep or modify them.

39
Q

What does the command bank_train.sort_values(['age_z'], ascending=False) do in Python?

A

Sorts the DataFrame by the age_z variable in descending order.
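
A brief sketch of the sort on made-up data. The name `bank_train_sort` is borrowed from the R card on viewing sorted records; saving the result under that name in Python is an assumption.

```python
import pandas as pd

# Made-up records with an invented standardized age column.
bank_train = pd.DataFrame({"age": [25, 40, 95, 33],
                           "age_z": [-1.2, 0.1, 3.6, -0.5]})

# sort_values returns a new DataFrame; bank_train itself is unchanged.
bank_train_sort = bank_train.sort_values(["age_z"], ascending=False)

# Largest z-values first; head(10) would show the top ten of a real file.
print(bank_train_sort.head(2))
```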

40
Q

What is the first step to reexpress categorical field values using Python?

A

Create a dictionary for converting categorical values to numeric values.

41
Q

Fill in the blank: In R, the function _______ is used to center a variable by subtracting its mean.

A

scale() (called with scale = FALSE so that it only centers)

42
Q

How can you view the first 10 records of a sorted dataset in R?

A

bank_train_sort[1:10, ]

43
Q

What is the purpose of the z-score?

A

To measure how many standard deviations a data value is from the mean.

44
Q

What are the two main objectives of the bank_marketing analysis?

A
  1. Understanding potential customers
  2. Developing profitable models
45
Q

What are the three ways to learn about potential customers?

A
  • Analyze existing data
  • Conduct surveys
  • Use focus groups
46
Q

How can we accomplish the objective of developing profitable models for identifying likely positive responders?

A

By using statistical techniques and machine learning algorithms

47
Q

Why might it be a good idea to add an index field to the data set?

A
  • To uniquely identify each record
  • To facilitate data manipulation
48
Q

Why is the field days_since_previous essentially useless until we handle the 999 code?

A

Because 999 is a code for missing data, not an actual number of days; left in place, it distorts the histogram and summary statistics of the field

49
Q

Why was it important to reexpress education as a numeric field?

A

To enable quantitative analysis and modeling

50
Q

If a data value has a z-value of 1, how may we interpret this value?

A

It is one standard deviation above the mean

51
Q

What is the rough rule of thumb for identifying outliers using z-values?

A

Values with z-scores greater than 3 or less than -3 are considered outliers

52
Q

Should outliers be automatically removed or changed? Why or why not?

A

No, because outliers may contain valuable information

53
Q

What should we do with outliers we have identified?

A

Investigate their cause and decide whether to keep or modify them

54
Q

What is the first step to work with the bank_marketing_training data set?

A

Derive an index field and add it to the data set

55
Q

What should be done for the days_since_previous field regarding the value 999?

A

Change it to the appropriate code for missing values

56
Q

What should be done to the education field?

A

Reexpress the field values as numeric values

57
Q

What is the task for the age field?

A

Standardize the field age and print the first 10 records

58
Q

What should be done to identify outliers in the age_z field?

A

Obtain a listing of all records that are outliers

59
Q

How should jobs with less than 5% of records be handled?

A

Combine them into a single category called ‘other’
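
One way this might be done in pandas, sketched on made-up job values. The column name `job` and the helper names `job_share` and `rare_jobs` are assumptions for illustration.

```python
import pandas as pd

# Made-up job values: 'admin' dominates, the other two are rare.
jobs = ["admin"] * 48 + ["student"] + ["housemaid"]
bank_train = pd.DataFrame({"job": jobs})

# Share of records in each job category.
job_share = bank_train["job"].value_counts(normalize=True)

# Categories that account for less than 5% of records.
rare_jobs = job_share[job_share < 0.05].index

# Relabel the rare categories as 'other'; leave the rest unchanged.
is_rare = bank_train["job"].isin(rare_jobs)
bank_train["job"] = bank_train["job"].where(~is_rare, "other")
print(bank_train["job"].value_counts())
```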

60
Q

What should the default predictor be renamed to?

A

credit_default

61
Q

How should the month variable be modified?

A

Change values to 1–12 but keep it as categorical
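
A possible sketch of this recoding, assuming the raw month values are lowercase three-letter names (an assumption about the file, not stated in the card). The dictionary name `month_to_number` is hypothetical.

```python
import pandas as pd

# Assumed raw coding: three-letter lowercase month names.
month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
month_to_number = {name: i for i, name in enumerate(month_order, start=1)}

# Made-up records.
bank_train = pd.DataFrame({"month": ["may", "nov", "mar"]})

# Map names to 1-12, then store as a category so models do not
# treat the numbers as a continuous quantity.
bank_train["month"] = bank_train["month"].map(month_to_number).astype("category")
print(bank_train["month"])
```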

62
Q

For the duration field, what are the tasks to be completed?

A
  • Standardize the variable
  • Identify outliers and the most extreme outlier
63
Q

What should be done for the campaign field?

A
  • Standardize the variable
  • Identify outliers and the most extreme outlier
64
Q

What does the Nutrition_subset data set contain?

A

Weight in grams, amount of saturated fat, and cholesterol for 961 foods

65
Q

What should be done with the saturated fat data?

A
  • Sort by saturated fat
  • List the five food items highest in saturated fat
66
Q

What is the problem with comparing food items of different sizes?

A

The comparison may not be valid, because a larger food item naturally contains more fat

67
Q

How can saturated_fat_per_gram be derived?

A

By dividing the amount of saturated fat by the weight in grams

68
Q

What should be done after deriving saturated_fat_per_gram?

A
  • Sort by saturated_fat_per_gram
  • List the five food items highest in saturated fat per gram
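
The derive-and-rank steps for saturated fat per gram can be sketched as follows. The foods and numbers are made up, and the column names `food_item`, `weight_in_grams`, and `saturated_fat` are assumptions about how the file is laid out.

```python
import pandas as pd

# Made-up foods; only saturated_fat_per_gram is a name taken from the cards.
nutrition = pd.DataFrame({
    "food_item": ["cheesecake", "butter pat", "whole milk"],
    "weight_in_grams": [1110.0, 5.0, 244.0],
    "saturated_fat": [119.9, 2.5, 5.1],
})

# Fat per gram puts large and small items on the same footing.
nutrition["saturated_fat_per_gram"] = (
    nutrition["saturated_fat"] / nutrition["weight_in_grams"]
)

# Highest values first; head(5) would list the top five of the full file.
top = nutrition.sort_values("saturated_fat_per_gram", ascending=False)
print(top[["food_item", "saturated_fat_per_gram"]].head(5))
```

In this toy data the largest item has the most total fat, yet the small butter pat ranks first once weight is taken into account.
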
69
Q

What is the task for cholesterol_per_gram?

A
  • Derive the variable
  • Sort and list the five food items highest in cholesterol per gram
70
Q

What should be done for saturated_fat_per_gram regarding outliers?

A
  • Standardize the field
  • List high-end outliers and count low-end outliers
71
Q

What should be done for cholesterol_per_gram regarding outliers?

A

Standardize the field and list high-end outliers

72
Q

What is the first step for the adult_ch3_training data set?

A

Add a record index field

73
Q

What should be checked for the education field?

A

Determine if any outliers exist

74
Q

What are the tasks for the age field?

A
  • Standardize the variable
  • Identify outliers and the most extreme outlier
75
Q

What is the flag for capital-gain?

A

capital-gain-flag equals 0 when capital-gain equals zero, and 1 otherwise
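
A minimal sketch of deriving the flag on made-up values. The DataFrame name `adult` is hypothetical.

```python
import pandas as pd

# Made-up capital gain values.
adult = pd.DataFrame({"capital-gain": [0, 2174, 0, 14084]})

# 0 when capital gain is zero, 1 otherwise.
adult["capital-gain-flag"] = (adult["capital-gain"] != 0).astype(int)
print(adult)
```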

76
Q

What should be done for records with age at least 80?

A

Construct a histogram of age and analyze the results