3 - Data Preparation Flashcards by Kaman Hung

What data sets are used in the bank marketing analysis?

bank_marketing_training and bank_marketing_test data sets

These data sets are adapted from the bank-additional-full.txt data set from the UCI Machine Learning Repository.

What are the four predictors used in the analysis?

age
education
previous_outcome
days_since_previous

The target response is whether contacts subscribe to a term deposit account.

How many records are in the bank_marketing_training data set?

26,874 records

How many records are in the bank_marketing_test data set?

10,255 records

What is the first phase of the Data Science Methodology?

Problem Understanding Phase

What is one objective of the bank marketing analysis?

Learn about potential customers’ characteristics

What is another objective of the bank marketing analysis?

Develop a profitable method of identifying likely positive responders

What is a method to learn about potential customers?

Use Exploratory Data Analysis

Which classification models can be developed for the analysis?

Decision Trees
Random Forests
Naïve Bayes Classification
Neural Networks
Logistic Regression

What is the purpose of adding an index field?

Acts as an ID field and tracks the sort order of records

What is the command to read a CSV file in Python?

pd.read_csv()
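
As a minimal sketch of the call, the snippet below reads CSV text from an in-memory buffer so that it runs without a file on disk. The three rows and the name bank_sample are made up for illustration; in practice you would pass the path of the training file instead.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file (illustrative values only).
csv_text = """age,education,previous_outcome,days_since_previous
56,basic.4y,nonexistent,999
41,university.degree,success,6
33,high.school,failure,999
"""

# pd.read_csv accepts a file path or any file-like object.
bank_sample = pd.read_csv(io.StringIO(csv_text))

print(bank_sample.shape)  # (rows, columns)
```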

How do you create an index field in Python?

bank_train['index'] = pd.Series(range(0,26874))
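
The hard-coded 26874 matches the record count of the training set. A sketch that avoids the hard-coded number, assuming a small made-up DataFrame in place of the real bank_train, could look like this:

```python
import pandas as pd

# Toy stand-in for bank_train (values are illustrative).
bank_train = pd.DataFrame({"age": [56, 41, 33, 29]})

# Number the records 0, 1, 2, ... using the DataFrame's own length,
# so the line still works if the record count changes.
bank_train["index"] = pd.Series(range(0, len(bank_train)))

print(bank_train)
```

Because the index is stored as an ordinary column, it travels with each record and can be used to restore the original order after sorting.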

What function in R provides the number of records in a data set?

dim()

What is the misleading value in the days_since_previous field?

999

What value should replace the misleading field value of 999 in Python?

np.nan (the np.NaN alias was removed in NumPy 2.0)
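
One way to make the replacement, sketched on a made-up four-row column rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for bank_train (values are illustrative).
bank_train = pd.DataFrame({"days_since_previous": [999, 6, 999, 3]})

# Recode the 999 placeholder as a true missing value.
bank_train["days_since_previous"] = bank_train["days_since_previous"].replace(
    {999: np.nan}
)

# Missing values are skipped by summaries such as mean().
print(bank_train["days_since_previous"].mean())
```

Once the 999 codes are recoded, summaries and histograms describe only the genuine day counts.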

What command is used to create a histogram in Python?

plot(kind = 'hist')

How do you change misleading field values in R?

bank_train$days_since_previous <- ifelse(test = bank_train$days_since_previous == 999, yes = NA, no = bank_train$days_since_previous)

What is the purpose of re-expressing categorical data as numeric?

To provide information on the relative differences among categories

What issue arises if categorical data is left unchanged?

Data science algorithms would not recognize the ordering of categories

What is the command to view the first six records in R?

head()

Fill in the blank: The bank marketing data sets are used for a _______ campaign.

phone-based direct marketing

What is the goal of transforming data values into numeric values?

To ensure that one value is larger than another while preserving relative differences among various categories.

What is the numeric value assigned to ‘illiterate’ in the education variable?

What is the numeric value assigned to ‘high.school’ in the education variable?

What Python command is used to replicate the education variable?

bank_train['education_numeric'] = bank_train['education']

In Python, how do you replace categorical values with numeric ones in a DataFrame?

bank_train.replace(dict_edu, inplace=True)
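
A minimal sketch of the recoding step on made-up rows. The numeric codes are assumptions chosen for illustration (rough years of schooling), not values taken from these cards, and the name edu_codes is hypothetical. The sketch uses Series.map in place of DataFrame.replace; for a dictionary that covers every category, both give the same numeric column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bank_train (values are illustrative).
bank_train = pd.DataFrame(
    {"education": ["illiterate", "basic.4y", "high.school", "unknown"]}
)

# Hypothetical codes (rough years of schooling); not taken from the cards.
edu_codes = {
    "illiterate": 0,
    "basic.4y": 4,
    "high.school": 12,
    "unknown": np.nan,  # 'unknown' is treated as missing
}

# map() looks each category up in the dictionary and returns a numeric column.
bank_train["education_numeric"] = bank_train["education"].map(edu_codes)

print(bank_train)
```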

What R function is used to replace values in a variable according to specified rules?

revalue() (from the plyr package)

Fill in the blank: The command used in Python to calculate the z-score is _______.

stats.zscore() (with stats imported from scipy)

What is the purpose of standardizing numeric fields?

To ensure the field mean equals 0 and the field standard deviation equals 1.

What is considered an outlier in the context of z-values?

A data value with a z-value greater than 3 or less than -3.

How do you identify outliers using Python?

bank_train.query('age_z > 3 | age_z < -3')
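
The standardize-then-filter steps can be sketched as follows, on made-up ages. The z-values are computed directly with pandas here instead of stats.zscore(); using ddof=0 matches the default of stats.zscore().

```python
import pandas as pd

# 19 typical ages plus one extreme value (made-up data).
bank_train = pd.DataFrame({"age": list(range(25, 44)) + [98]})

# z = (value - mean) / standard deviation, using the population
# standard deviation (ddof=0).
age = bank_train["age"]
bank_train["age_z"] = (age - age.mean()) / age.std(ddof=0)

# Rule of thumb: flag records more than 3 standard deviations from the mean.
outliers = bank_train.query("age_z > 3 | age_z < -3")

print(outliers)
```

Only the record with age 98 is flagged; whether to keep, change, or drop it is a separate decision.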

What command is used in R to sort a data set by a specific variable?

order()

What is the default behavior of the scale() function in R?

It centers and scales the variable to calculate the z-score.

What does the command bank_train$education_numeric <- as.numeric(levels(edu.num))[edu.num] do in R?

Converts factor levels of edu.num to numeric and assigns them to education_numeric.

What is the numeric value assigned to 'unknown' in the education variable?

Missing (np.nan in Python, NA in R)

What is the mean number of contacts per customer in the example?

2.6

What does the replace() function do in Python?

Replaces values in a DataFrame according to a specified dictionary.

True or False: Outliers should always be removed from the dataset.

False

What does the command bank_train.sort_values(['age_z'], ascending=False) do in Python?

Sorts the DataFrame by the age_z variable in descending order.
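
A short sketch of the sort, assuming three made-up records whose age_z values are invented rather than computed:

```python
import pandas as pd

# Toy records; the age_z values are invented for illustration.
bank_train = pd.DataFrame({"age": [30, 72, 45], "age_z": [-0.9, 1.3, -0.4]})

# sort_values returns a new, sorted DataFrame and leaves bank_train unchanged.
bank_train_sort = bank_train.sort_values(["age_z"], ascending=False)

# The largest z-values now come first.
print(bank_train_sort.head(2))
```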

What is the first step to reexpress categorical field values using Python?

Create a dictionary for converting categorical values to numeric values.

Fill in the blank: In R, the function _______ is used to center a variable by subtracting its mean.

scale()

How can you view the first 10 records of a sorted dataset in R?

bank_train_sort[1:10, ]

What is the purpose of the z-score?

To measure how many standard deviations a data value is from the mean.

What are the two main objectives of the bank_marketing analysis?

1. Understanding potential customers
2. Developing profitable models

What are the three ways to learn about potential customers?

Analyze existing data
Conduct surveys
Use focus groups

How can we accomplish the objective of developing profitable models for identifying likely positive responders?

By using statistical techniques and machine learning algorithms

Why might it be a good idea to add an index field to the data set?

To uniquely identify each record
To facilitate data manipulation

Why is the field days_since_previous essentially useless until we handle the 999 code?

Because 999 is a placeholder code, not an actual number of days, so it distorts summaries such as the mean and the histogram until it is recoded as missing

Why was it important to reexpress education as a numeric field?

To enable quantitative analysis and modeling

If a data value has a z-value of 1, how may we interpret this value?

It is one standard deviation above the mean

What is the rough rule of thumb for identifying outliers using z-values?

Values with z-scores greater than 3 or less than -3 are considered outliers

Should outliers be automatically removed or changed? Why or why not?

No, because outliers may contain valuable information

What should we do with outliers we have identified?

Investigate their cause and decide whether to keep or modify them

What is the first step to work with the bank_marketing_training data set?

Derive an index field and add it to the data set

What should be done for the days_since_previous field regarding the value 999?

Change it to the appropriate code for missing values

What should be done to the education field?

Reexpress the field values as numeric values

What is the task for the age field?

Standardize the field age and print the first 10 records

What should be done to identify outliers in the age_z field?

Obtain a listing of all records that are outliers

How should jobs with less than 5% of records be handled?

Combine them into a single category called 'other'
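
One possible way to do this in Python, sketched on made-up job counts. The column name job2 is hypothetical, chosen so that the original job field is left untouched.

```python
import pandas as pd

# Made-up job counts: 100 records in total.
jobs = ["admin."] * 50 + ["technician"] * 46 + ["student"] * 3 + ["unknown"] * 1
bank_train = pd.DataFrame({"job": jobs})

# Share of records for each job category.
job_share = bank_train["job"].value_counts(normalize=True)

# Categories that fall below the 5% threshold.
rare_jobs = job_share[job_share < 0.05].index

# Keep common jobs as they are; relabel the rare ones as 'other'.
is_rare = bank_train["job"].isin(rare_jobs)
bank_train["job2"] = bank_train["job"].where(~is_rare, "other")

print(bank_train["job2"].value_counts())
```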

What should the default predictor be renamed to?

credit_default
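
A sketch of the rename on a made-up two-row DataFrame:

```python
import pandas as pd

# Toy stand-in for bank_train (values are illustrative).
bank_train = pd.DataFrame({"age": [41, 33], "default": ["no", "unknown"]})

# rename() returns a new DataFrame; columns not listed keep their names.
bank_train = bank_train.rename(columns={"default": "credit_default"})

print(list(bank_train.columns))
```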

How should the month variable be modified?

Change values to 1–12 but keep it as categorical
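
A sketch of one way to do this, assuming the months are stored as lowercase three-letter labels such as 'may'. The label format and the column name month_num are assumptions.

```python
import pandas as pd

# Toy stand-in for bank_train (values are illustrative).
bank_train = pd.DataFrame({"month": ["may", "jun", "nov", "may"]})

# Assumed label format: lowercase three-letter names in calendar order.
labels = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
month_codes = {label: number for number, label in enumerate(labels, start=1)}

# Map labels to 1-12, then store the result as a categorical so that
# models treat the numbers as category labels rather than quantities.
bank_train["month_num"] = bank_train["month"].map(month_codes).astype("category")

print(bank_train["month_num"])
```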

For the duration field, what are the tasks to be completed?

Standardize the variable
Identify outliers and the most extreme outlier

What should be done for the campaign field?

Standardize the variable
Identify outliers and the most extreme outlier

What does the Nutrition_subset data set contain?

Weight in grams, amount of saturated fat, and cholesterol for 961 foods

What should be done with the saturated fat data?

Sort by saturated fat
List the five food items highest in saturated fat

Why might comparing food items of different sizes be misleading?

The comparison may not be valid, because a larger serving naturally contains more saturated fat

How can saturated_fat_per_gram be derived?

By dividing the amount of saturated fat by the weight in grams

What should be done after deriving saturated_fat_per_gram?

Sort by saturated_fat_per_gram
List the five food items highest in saturated fat per gram
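
The derive-and-sort steps might be sketched as below. The three foods, their values, and the column names weight_in_grams and saturated_fat are made up for illustration and may differ from the real Nutrition_subset file.

```python
import pandas as pd

# Made-up foods; column names are assumptions.
nutrition = pd.DataFrame({
    "food": ["butter, 1 tbsp", "cheesecake, 1 cake", "whole milk, 1 cup"],
    "weight_in_grams": [14.0, 1110.0, 244.0],
    "saturated_fat": [7.0, 120.0, 5.0],
})

# Saturated fat per gram of food, so items of different sizes are comparable.
nutrition["saturated_fat_per_gram"] = (
    nutrition["saturated_fat"] / nutrition["weight_in_grams"]
)

# Highest values first; head(5) keeps at most five rows.
top_per_gram = nutrition.sort_values(
    "saturated_fat_per_gram", ascending=False
).head(5)

print(top_per_gram[["food", "saturated_fat_per_gram"]])
```

In this toy data the whole cheesecake has the most saturated fat in total, but the butter ranks first per gram, which is why the per-gram field changes the ranking.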

What is the task for cholesterol_per_gram?

Derive the variable
Sort and list the five food items highest in cholesterol per gram

What should be done for saturated_fat_per_gram regarding outliers?

Standardize the field
List high-end outliers and count low-end outliers

What should be done for cholesterol_per_gram regarding outliers?

Standardize the field and list high-end outliers

What is the first step for the adult_ch3_training data set?

Add a record index field

What should be checked for the education field?

Determine if any outliers exist

What are the tasks for the age field?

Standardize the variable
Identify outliers and the most extreme outlier

What is the flag for capital-gain?

capital-gain-flag equals 0 when capital gain is zero, and 1 otherwise
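
A sketch of deriving the flag, assuming the field is named capital-gain and using made-up values in a toy DataFrame called adult:

```python
import numpy as np
import pandas as pd

# Toy stand-in for adult_ch3_training (values are illustrative).
adult = pd.DataFrame({"capital-gain": [0, 2174, 0, 14084]})

# np.where(condition, value_if_true, value_if_false):
# 0 when the capital gain is zero, 1 otherwise.
adult["capital-gain-flag"] = np.where(adult["capital-gain"] == 0, 0, 1)

print(adult)
```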

What should be done for records with age at least 80?

Construct a histogram of age and analyze the results

(76 cards)