Data Science using Python and R - 4 Flashcards
What is the primary purpose of exploratory data analysis (EDA)?
To explore data without a priori hypotheses and uncover relationships.
What is a hypothesis test (HT)?
A method to test specific hypotheses about data using statistical methods.
What does EDA allow the user to do?
Explore relationships, derive new variables, and use binning to increase predictive value.
What is the relationship explored in the bar graphs discussed?
The relationship between a categorical predictor and the target variable.
What does ‘previous_outcome’ refer to?
The result of a previous marketing campaign with the same customer.
What is the advantage of normalized bar graphs?
They allow easier comparison of response proportions between categories.
What are two best practices when working with bar graphs?
- Supplement unclear bar graphs with normalized versions
- Provide non-normalized graphs to indicate original distribution.
How do you create a contingency table in Python?
Using the crosstab() command.
What should the response variable represent in a contingency table?
The rows.
How do you calculate column percentages in Python?
Using the sum() and div() commands.
What is a histogram?
A graphical representation of a frequency distribution for a numerical variable.
What is the benefit of using a normalized histogram?
It helps distinguish response patterns more clearly.
What is the best practice for histograms?
- Use non-normalized histograms for original distributions
- Use normalized histograms for response patterns.
What command is used in R to create a contingency table?
The table() command.
What is the purpose of the addmargins() command in R?
To add row and column totals to a contingency table.
What is the significance of the geom_bar() function in ggplot2?
It specifies that a bar chart should be created.
Fill in the blank: EDA is often preferred when clients have _______ about the data.
no salient a priori notions
True or False: A normalized bar graph shows the original distribution of data.
False
What is the main drawback of a normalized histogram?
It does not indicate the original distribution of the data.
What is the purpose of using a non-normalized histogram?
To obtain the original distribution of the data values.
What is the benefit of using a normalized histogram?
To help better distinguish the response patterns.
What Python package is used for constructing histograms?
matplotlib
What command in Python creates a stacked histogram?
plt.hist()
In the plt.hist() command, what does the parameter ‘stacked = True’ do?
It stacks the two variables in the histogram.
What does the ‘bins’ parameter specify in a histogram?
The number of bins in the histogram.
What is the purpose of the column_stack() function in Python?
To combine the heights of the two variables’ bars into one array.
How do you calculate the normalized proportions in a histogram?
By dividing each row by the sum across that row.
What is the command to create a contingency table in Python?
pd.crosstab()
Fill in the blank: The command ‘cut()’ is used in Python to _______.
bin the values into categories.
What is the significance of using ‘right = False’ in the cut() command?
It excludes the right-hand cutpoint from the category.
What R command is used to create a histogram with an overlay?
ggplot() + geom_histogram()
What does the ‘position = “fill”’ input do in R’s geom_histogram()?
It normalizes the histogram.
True or False: The age group 27 to 60 has a high response proportion.
False
What is the recommended best practice for binning?
Use binning based on predictive value.
What is the main advantage of creating categorical variables through binning?
Some algorithms work better with categorical rather than numeric variables.
What does the ‘aes(fill = response)’ command do in R?
It adds an overlay to the histogram based on the response variable.
What visual representation is generated by the command ‘crosstab_02.plot(kind=”bar”, stacked = True)’?
A stacked bar graph of age binned with response overlay.
What does the command ‘prop.table()’ do in R?
It calculates the proportions of the contingency table.
What is a key observation made about the age groups in relation to response rates?
Both the older and the younger groups have a much higher response rate than the middle group.
What is exploratory data analysis (EDA)?
EDA is an approach to analyzing data sets to summarize their main characteristics, often using visual methods.
When should analysts use exploratory data analysis (EDA) rather than hypothesis testing?
EDA should be used when the objective is to explore data and find patterns without predefined hypotheses.
What are some examples of what EDA allows the user to do?
- Identify trends
- Discover patterns
- Detect anomalies
- Generate hypotheses
Which graph do we use to explore the relationship between a categorical predictor and the target variable?
Bar graph
What are (non-normalized) bar graphs useful for?
They are useful for displaying the frequency of categorical data.
State one advantage and one disadvantage of using a normalized bar graph.
Advantage: Easier comparison across categories.
Disadvantage: Can obscure the actual counts.
What are the two best practices when working with bar graphs for EDA?
- Clearly label axes
- Use appropriate scales
What does a contingency table help us to do?
It helps to summarize the relationship between two categorical variables.
Explain the two best practices when working with contingency tables in EDA.
- Ensure proper variable representation
- Report counts and percentages
What is a histogram?
A histogram is a graphical representation of the distribution of numerical data.
Describe one advantage and one disadvantage of using a normalized histogram.
Advantage: Facilitates comparison between different data sets.
Disadvantage: Can mislead if the total counts differ significantly.
What are the best practices for working with histograms in EDA?
- Choose appropriate bin sizes
- Clearly label axes
Why might it be useful for the analyst to bin a numeric variable?
Binning can simplify the analysis and help highlight trends.
Why do we use the binning method shown in this chapter rather than automatic binning methods?
Manual binning allows for more control and better understanding of the data distribution.
True or False: Data scientists should use automatic methods of data analysis without caution.
False
What is the purpose of creating a bar graph of the previous_outcome variable?
To visualize the distribution of previous outcomes.
What is the purpose of creating a normalized bar graph of the previous_outcome variable?
To compare the proportions of responses across different previous outcomes.
What should be included when comparing a contingency table with bar graphs?
Counts and percentages for each category.
What is the relationship between age and response demonstrated in a histogram?
It shows how age distribution correlates with the response variable.
What is the purpose of binning the age variable?
To group ages into categories for easier analysis.
What should be included in a contingency table of job with response?
Counts and column percentages.
What is the significance of combining job categories based on response percentages?
It simplifies analysis and highlights significant trends.
How do you define a new categorical variable from the duration variable?
By identifying cutoff points that separate low and high response values.
What should be done after identifying outliers in the capital-loss variable?
Construct a bar graph for the outlier records.
What is the effect of deleting outliers at the EDA stage?
It changes the character of the data set and can lead to misleading conclusions.
What does the capital-loss-flag variable represent?
It equals 0 when capital-loss equals 0, and 1 otherwise.
What is the rationale for combining certain categories in a contingency table?
To reduce complexity and improve interpretability.
What is the purpose of renaming variables before further analysis?
To maintain clarity and track changes in variable representations.
What should be analyzed when creating a histogram of the education variable?
The relationship between education levels and income.
What is the significance of binning the age variable into specific ranges?
It helps to analyze trends and patterns specific to age groups.
What type of data visualization is suggested for the sex predictor?
Both non-normalized and normalized bar graphs.
What does the normalized bar graph of occupation with a sex overlay illustrate?
The distribution of sex across different occupations.
What is a contingency table with sex for the rows and occupation for the columns used for?
To compare the distribution of sex across different occupations.