Data Science using Python and R - 3 Flashcards
What data sets are used in the bank marketing analysis?
bank_marketing_training and bank_marketing_test data sets
These data sets are adapted from the bank-additional-full.txt data set from the UCI Machine Learning Repository.
What are the four predictors used in the analysis?
- age
- education
- previous_outcome
- days_since_previous
The target response is whether contacts subscribe to a term deposit account.
How many records are in the bank_marketing_training data set?
26,874 records
How many records are in the bank_marketing_test data set?
10,255 records
What is the first phase of the Data Science Methodology?
Problem Understanding Phase
What is one objective of the bank marketing analysis?
Learn about potential customers’ characteristics
What is another objective of the bank marketing analysis?
Develop a profitable method of identifying likely positive responders
What is a method to learn about potential customers?
Use Exploratory Data Analysis
Which classification models can be developed for the analysis?
- Decision Trees
- Random Forests
- Naïve Bayes Classification
- Neural Networks
- Logistic Regression
What is the purpose of adding an index field?
Acts as an ID field and tracks the sort order of records
What is the command to read a CSV file in Python?
pd.read_csv()
How do you create an index field in Python?
bank_train['index'] = pd.Series(range(0,26874))
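A minimal sketch of reading the data and adding the index field. The small DataFrame below is a made-up stand-in for bank_marketing_training (the real data would come from pd.read_csv()), and the row count is taken from the data rather than hard-coded.

```python
import pandas as pd

# Hypothetical stand-in for bank_marketing_training; the real data set
# would be loaded with pd.read_csv() instead.
bank_train = pd.DataFrame({
    "age": [56, 37, 40, 25],
    "days_since_previous": [999, 6, 999, 3],
})

# Build the index from the actual number of rows, so the same line works
# for any data set size.
bank_train["index"] = pd.Series(range(0, bank_train.shape[0]))

# shape returns (rows, columns), comparable to dim() in R.
print(bank_train.shape)
```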
What function in R provides the number of records in a data set?
dim()
What is the misleading value in the days_since_previous field?
999
What value should replace the misleading field value of 999 in Python?
np.nan (written np.NaN in older NumPy releases; that alias was removed in NumPy 2.0)
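A short sketch of the replacement on made-up values. It uses the lowercase np.nan spelling, which works across NumPy versions.

```python
import numpy as np
import pandas as pd

# Made-up values; 999 is the code that must not be read as a day count.
bank_train = pd.DataFrame({"days_since_previous": [999, 6, 999, 3]})

# Swap the 999 code for a true missing value.
bank_train["days_since_previous"] = (
    bank_train["days_since_previous"].replace(999, np.nan)
)

# Count how many records were recoded as missing.
n_missing = int(bank_train["days_since_previous"].isna().sum())
print(n_missing)
```

After this step, summaries such as the mean or a histogram skip the missing entries instead of being pulled toward 999.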
What command is used to create a histogram in Python?
plot(kind = 'hist')
How do you change misleading field values in R?
bank_train$days_since_previous <- ifelse(test = bank_train$days_since_previous == 999, yes = NA, no = bank_train$days_since_previous)
What is the purpose of re-expressing categorical data as numeric?
To provide information on the relative differences among categories
What issue arises if categorical data is left unchanged?
Data science algorithms would not recognize the ordering of categories
What is the command to view the first six records in R?
head()
Fill in the blank: The bank marketing data sets are used for a _______ campaign.
phone-based direct marketing
What is the goal of transforming data values into numeric values?
To ensure that one value is larger than another while preserving relative differences among various categories.
What is the numeric value assigned to ‘illiterate’ in the education variable?
0
What is the numeric value assigned to ‘high.school’ in the education variable?
12
What Python command is used to replicate the education variable?
bank_train['education_numeric'] = bank_train['education']
In Python, how do you replace categorical values with numeric ones in a DataFrame?
bank_train.replace(dict_edu, inplace=True)
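One way to sketch the re-expression on made-up rows. The dictionary holds only the three mappings stated in these cards, and the sketch swaps in Series.map for the DataFrame-level replace() call, which is an assumption made to keep the result numeric.

```python
import numpy as np
import pandas as pd

# Made-up rows using education levels named in the cards.
bank_train = pd.DataFrame(
    {"education": ["illiterate", "high.school", "unknown", "high.school"]}
)

# Only the mappings given in the cards; a full dictionary would need an
# entry for every education level in the data.
edu_map = {"illiterate": 0, "high.school": 12, "unknown": np.nan}

# map() looks each category up in the dictionary. Unlike replace(), any
# level missing from the dictionary silently becomes NaN.
bank_train["education_numeric"] = bank_train["education"].map(edu_map)

print(bank_train)
```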
What R function is used to replace values in a variable according to specified rules?
revalue()
Fill in the blank: The command used in Python to calculate the z-score is _______.
stats.zscore()
What is the purpose of standardizing numeric fields?
To ensure the field mean equals 0 and the field standard deviation equals 1.
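A hand-rolled sketch of standardization on made-up ages. The choice of ddof=0 here mirrors the default of scipy.stats.zscore, while R's scale() divides by the sample standard deviation (ddof=1), so the two give slightly different z-values.

```python
import pandas as pd

# Made-up ages for illustration.
bank_train = pd.DataFrame({"age": [20, 30, 40, 50, 60]})

age = bank_train["age"]

# z = (value - mean) / standard deviation.
# ddof=0 uses the population standard deviation.
bank_train["age_z"] = (age - age.mean()) / age.std(ddof=0)

print(bank_train)
```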
What is considered an outlier in the context of z-values?
A data value with a z-value greater than 3 or less than -3.
How do you identify outliers using Python?
bank_train.query('age_z > 3 | age_z < -3')
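A sketch of the outlier query on made-up ages, with one deliberately extreme value so that the rule has something to flag.

```python
import pandas as pd

# Made-up ages: twenty ordinary values plus one extreme one.
bank_train = pd.DataFrame({"age": list(range(25, 45)) + [98]})

# Standardize first (population standard deviation, an assumption).
age = bank_train["age"]
bank_train["age_z"] = (age - age.mean()) / age.std(ddof=0)

# Keep the records whose z-value lies beyond 3 in either direction.
outliers = bank_train.query("age_z > 3 | age_z < -3")

print(outliers)
```

Note that a very small sample cannot produce a z-value beyond 3 at all, so the rule is only meaningful with enough records.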
What command is used in R to sort a data set by a specific variable?
order()
What is the default behavior of the scale() function in R?
It centers and scales the variable to calculate the z-score.
What does the command bank_train$education_numeric <- as.numeric(levels(edu.num))[edu.num] do in R?
Converts factor levels of edu.num to numeric and assigns them to education_numeric.
What is the numeric value assigned to ‘unknown’ in the education variable?
Missing (np.nan in Python, NA in R)
What is the mean number of contacts per customer in the example?
2.6
What does the replace() function do in Python?
Replaces values in a DataFrame according to a specified dictionary.
True or False: Outliers should always be removed from the dataset.
False
What does the command bank_train.sort_values(['age_z'], ascending=False) do in Python?
Sorts the DataFrame by the age_z variable in descending order.
What is the first step to reexpress categorical field values using Python?
Create a dictionary for converting categorical values to numeric values.
Fill in the blank: In R, the function _______ is used to center a variable by subtracting its mean.
scale()
How can you view the first 10 records of a sorted dataset in R?
bank_train_sort[1:10, ]
What is the purpose of the z-score?
To measure how many standard deviations a data value is from the mean.
What are the two main objectives of the bank_marketing analysis?
- Understanding potential customers
- Developing profitable models
What are the three ways to learn about potential customers?
- Analyze existing data
- Conduct surveys
- Use focus groups
How can we accomplish the objective of developing profitable models for identifying likely positive responders?
By building classification models such as decision trees, random forests, naïve Bayes, neural networks, or logistic regression
Why might it be a good idea to add an index field to the data set?
- To uniquely identify each record
- To facilitate data manipulation
Why is the field days_since_previous essentially useless until we handle the 999 code?
Because 999 is a code for a missing value, not an actual number of days, so leaving it in place distorts summaries and plots of the field
Why was it important to reexpress education as a numeric field?
To enable quantitative analysis and modeling
If a data value has a z-value of 1, how may we interpret this value?
It is one standard deviation above the mean
What is the rough rule of thumb for identifying outliers using z-values?
Values with z-scores greater than 3 or less than -3 are considered outliers
Should outliers be automatically removed or changed? Why or why not?
No, because outliers may contain valuable information
What should we do with outliers we have identified?
Investigate their cause and decide whether to keep or modify them
What is the first step to work with the bank_marketing_training data set?
Derive an index field and add it to the data set
What should be done for the days_since_previous field regarding the value 999?
Change it to the appropriate code for missing values
What should be done to the education field?
Reexpress the field values as numeric values
What is the task for the age field?
Standardize the field age and print the first 10 records
What should be done to identify outliers in the age_z field?
Obtain a listing of all records that are outliers
How should jobs with fewer than 5% of the records be handled?
Combine them into a single category called ‘other’
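A possible sketch of grouping rare job values, using made-up counts. The new column name job_grouped is hypothetical.

```python
import pandas as pd

# Made-up job counts: 100 records in total.
jobs = (
    ["admin."] * 50 + ["technician"] * 40
    + ["student"] * 4 + ["unemployed"] * 3 + ["housemaid"] * 3
)
bank_train = pd.DataFrame({"job": jobs})

# Proportion of records in each job category.
share = bank_train["job"].value_counts(normalize=True)

# Categories that account for less than 5% of the records.
rare = share[share < 0.05].index

# Keep common jobs as they are; relabel the rare ones as "other".
bank_train["job_grouped"] = bank_train["job"].where(
    ~bank_train["job"].isin(rare), "other"
)

print(bank_train["job_grouped"].value_counts())
```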
What should the default predictor be renamed to?
credit_default
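Renaming can be sketched with DataFrame.rename on a made-up frame:

```python
import pandas as pd

# Made-up rows; only the column name matters here.
bank_train = pd.DataFrame({"age": [56, 37], "default": ["no", "unknown"]})

# rename() returns a new DataFrame with the column relabeled.
bank_train = bank_train.rename(columns={"default": "credit_default"})

print(list(bank_train.columns))
```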
How should the month variable be modified?
Change values to 1–12 but keep it as categorical
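A sketch of the month recoding, assuming the months are stored as lowercase three-letter abbreviations. The column name month_num is hypothetical.

```python
import pandas as pd

# Made-up rows; the abbreviation format is an assumption.
bank_train = pd.DataFrame({"month": ["may", "jun", "nov", "may"]})

abbrevs = ["jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec"]

# Map each abbreviation to its month number, 1 through 12.
month_map = {name: number for number, name in enumerate(abbrevs, start=1)}

# Wrap the numbers in a Categorical so they are treated as labels,
# not as quantities.
bank_train["month_num"] = pd.Categorical(
    bank_train["month"].map(month_map), categories=list(range(1, 13))
)

print(bank_train.dtypes)
```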
For the duration field, what are the tasks to be completed?
- Standardize the variable
- Identify outliers and the most extreme outlier
What should be done for the campaign field?
- Standardize the variable
- Identify outliers and the most extreme outlier
What does the Nutrition_subset data set contain?
Weight in grams, amount of saturated fat, and cholesterol for 961 foods
What should be done with the saturated fat data?
- Sort by saturated fat
- List the five food items highest in saturated fat
What is the importance of comparing food items of different sizes?
Comparing raw amounts may not be valid, because a larger portion naturally contains more fat than a smaller portion of the same food
How can saturated_fat_per_gram be derived?
By dividing the amount of saturated fat by the weight in grams
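A sketch of the derivation on made-up rows. The column names and all values below are assumptions for illustration, not taken from the Nutrition_subset data.

```python
import pandas as pd

# Made-up foods and values; column names are assumed.
nutrition = pd.DataFrame({
    "food_item": ["butter, 1 stick", "butter, 1 pat",
                  "cheesecake, whole", "apple, 1 medium"],
    "weight_in_grams": [113.0, 5.0, 1110.0, 138.0],
    "saturated_fat": [57.1, 2.5, 119.9, 0.1],
})

# Saturated fat per gram puts foods of different sizes on the same footing.
nutrition["saturated_fat_per_gram"] = (
    nutrition["saturated_fat"] / nutrition["weight_in_grams"]
)

# Rank by the raw amount and by the per-gram amount.
by_total = nutrition.sort_values("saturated_fat", ascending=False)
by_gram = nutrition.sort_values("saturated_fat_per_gram", ascending=False)

print(by_total.head(5))
print(by_gram.head(5))
```

In this toy data the largest item tops the raw ranking, while the per-gram ranking is led by a much smaller item.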
What should be done after deriving saturated_fat_per_gram?
- Sort by saturated_fat_per_gram
- List the five food items highest in saturated fat per gram
What is the task for cholesterol_per_gram?
- Derive the variable
- Sort and list the five food items highest in cholesterol per gram
What should be done for saturated_fat_per_gram regarding outliers?
- Standardize the field
- List high-end outliers and count low-end outliers
What should be done for cholesterol_per_gram regarding outliers?
Standardize the field and list high-end outliers
What is the first step for the adult_ch3_training data set?
Add a record index field
What should be checked for the education field?
Determine if any outliers exist
What are the tasks for the age field?
- Standardize the variable
- Identify outliers and the most extreme outlier
What is the flag for capital-gain?
capital-gain-flag equals 0 when capital gain is zero, and 1 otherwise
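The flag can be sketched with np.where on made-up values. The column name capital-gain is assumed to match the data set.

```python
import numpy as np
import pandas as pd

# Made-up capital gain amounts.
adult = pd.DataFrame({"capital-gain": [0, 2174, 0, 14084]})

# 0 when capital gain is zero, 1 otherwise.
adult["capital-gain-flag"] = np.where(adult["capital-gain"] == 0, 0, 1)

print(adult)
```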
What should be done for records with age at least 80?
Construct a histogram of age and analyze the results