Data Science using Python and R - 4 Flashcards

1
Q

What is the primary purpose of exploratory data analysis (EDA)?

A

To explore data without a priori hypotheses and uncover relationships.

2
Q

What is a hypothesis test (HT)?

A

A method to test specific hypotheses about data using statistical methods.

3
Q

What does EDA allow the user to do?

A

Explore relationships, derive new variables, and use binning to increase predictive value.

4
Q

What is the relationship explored in the bar graphs discussed?

A

The relationship between a categorical predictor and the target variable.

5
Q

What does ‘previous_outcome’ refer to?

A

The result of a previous marketing campaign with the same customer.

6
Q

What is the advantage of normalized bar graphs?

A

They allow easier comparison of response proportions between categories.
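As a rough illustration of that comparison, the sketch below draws a count-based and a normalized bar graph with pandas. The data values are made up, and the names `bank`, `counts`, and `crosstab_norm` are assumptions chosen for this example; only the column names come from the cards.

```python
import pandas as pd

# Made-up toy data; only the column names come from the flashcards.
bank = pd.DataFrame({
    "previous_outcome": ["failure"] * 6 + ["nonexistent"] * 10 + ["success"] * 4,
    "response": ["no"] * 5 + ["yes"] + ["no"] * 9 + ["yes"] + ["no"] + ["yes"] * 3,
})

# Predictor categories as rows, so that each category becomes one bar.
counts = pd.crosstab(bank["previous_outcome"], bank["response"])

# Divide each row by its row total so that every bar sums to 1.
crosstab_norm = counts.div(counts.sum(axis=1), axis=0)

# Counts show the original distribution; proportions show the response pattern.
ax_counts = counts.plot(kind="bar", stacked=True, title="Counts")
ax_norm = crosstab_norm.plot(kind="bar", stacked=True, title="Normalized")
```

In the normalized version every bar has the same height, so the share of "yes" responses can be read off directly even for small categories.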

7
Q

What are two best practices when working with bar graphs?

A
  • Supplement unclear bar graphs with normalized versions
  • Provide non-normalized graphs to indicate original distribution.
8
Q

How do you create a contingency table in Python?

A

Using the pandas crosstab() command, pd.crosstab().
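A minimal sketch of that command follows. The toy data and the names `bank` and `crosstab_01` are assumptions for illustration; the response variable is placed in the rows.

```python
import pandas as pd

# Made-up toy data; the values are for illustration only.
bank = pd.DataFrame({
    "previous_outcome": ["success", "failure", "failure",
                         "success", "nonexistent", "nonexistent"],
    "response": ["yes", "no", "no", "yes", "no", "yes"],
})

# First argument supplies the rows (response), second the columns (predictor).
crosstab_01 = pd.crosstab(bank["response"], bank["previous_outcome"])
print(crosstab_01)
```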

9
Q

What should the response variable represent in a contingency table?

A

The rows.

10
Q

How do you calculate column percentages in Python?

A

By dividing the contingency table by its column totals, using the pandas sum() and div() commands.
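One way this could look in practice is sketched below. The data are made up, and `bank`, `crosstab_01`, `col_totals`, and `col_pct` are hypothetical names.

```python
import pandas as pd

# Made-up toy data; the values are for illustration only.
bank = pd.DataFrame({
    "previous_outcome": ["failure"] * 6 + ["nonexistent"] * 10 + ["success"] * 4,
    "response": ["no"] * 5 + ["yes"] + ["no"] * 9 + ["yes"] + ["no"] + ["yes"] * 3,
})

# Response in the rows, predictor in the columns.
crosstab_01 = pd.crosstab(bank["response"], bank["previous_outcome"])

# sum(axis=0) adds down each column, giving one total per column.
col_totals = crosstab_01.sum(axis=0)

# div(..., axis=1) matches the totals to the column labels;
# multiplying by 100 turns proportions into percentages.
col_pct = (crosstab_01.div(col_totals, axis=1) * 100).round(1)
print(col_pct)
```

Each column of `col_pct` sums to 100, so the "yes" row gives the response rate within each category of the predictor.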

11
Q

What is a histogram?

A

A graphical representation of a frequency distribution for a numerical variable.

12
Q

What is the benefit of using a normalized histogram?

A

It helps distinguish response patterns more clearly.

13
Q

What is the best practice for histograms?

A
  • Use non-normalized histograms for original distributions
  • Use normalized histograms for response patterns.
14
Q

What command is used in R to create a contingency table?

A

The table() command.

15
Q

What is the purpose of the addmargins() command in R?

A

To add row and column totals to a contingency table.
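For readers working in Python, pandas offers a comparable feature through the `margins` argument of `pd.crosstab()`. The sketch below is an illustration with made-up data, not something taken from the cards; `bank` and `table_with_totals` are hypothetical names.

```python
import pandas as pd

# Made-up toy data; the values are for illustration only.
bank = pd.DataFrame({
    "previous_outcome": ["success", "failure", "failure",
                         "success", "nonexistent", "nonexistent"],
    "response": ["yes", "no", "no", "yes", "no", "yes"],
})

# margins=True appends a row and a column of totals, labeled "All" by default.
table_with_totals = pd.crosstab(
    bank["response"], bank["previous_outcome"], margins=True
)
print(table_with_totals)
```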

16
Q

What is the significance of the geom_bar() function in ggplot2?

A

It specifies that a bar chart should be created.

17
Q

Fill in the blank: EDA is often preferred when clients have _______ about the data.

A

no salient a priori notions

18
Q

True or False: A normalized bar graph shows the original distribution of data.

A

False

19
Q

What is the main drawback of a normalized histogram?

A

It does not indicate the original distribution of the data.

20
Q

What is the purpose of using a non-normalized histogram?

A

To obtain the original distribution of the data values.

21
Q

What is the benefit of using a normalized histogram?

A

To help better distinguish the response patterns.

22
Q

What Python package is used for constructing histograms?

A

matplotlib

23
Q

What command in Python creates a stacked histogram?

A

plt.hist()

24
Q

In the plt.hist() command, what does the parameter ‘stacked = True’ do?

A

It stacks the bars of the two variables on top of one another in each bin, rather than drawing them side by side.
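A small sketch of a stacked histogram is given below. The ages are randomly generated stand-ins rather than real data, and `age_yes` and `age_no` are hypothetical names for the ages of responders and non-responders.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit when working in a notebook
import matplotlib.pyplot as plt

# Randomly generated stand-in ages, not real data.
rng = np.random.default_rng(42)
age_yes = rng.integers(18, 90, size=150)   # ages where response is "yes"
age_no = rng.integers(25, 65, size=850)    # ages where response is "no"

# Passing a list of two arrays with stacked=True piles the second
# series on top of the first within each of the 10 bins.
n, bins, patches = plt.hist(
    [age_yes, age_no],
    bins=10,
    stacked=True,
    label=["response = yes", "response = no"],
)
plt.xlabel("age")
plt.ylabel("count")
plt.legend()
```

With `bins=10`, the returned `bins` array holds the 11 bin edges shared by both series.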

25
Q

What does the ‘bins’ parameter specify in a histogram?

A

The number of bins in the histogram.

26
Q

What is the purpose of NumPy's column_stack() function in Python?

A

To combine the heights of the two variables’ bars into one array.

27
Q

How do you calculate the normalized proportions in a histogram?

A

By dividing each row by the sum across that row.
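The row-wise normalization can be sketched as follows. As a simplification, this example gets the per-bin counts from `np.histogram()` rather than from a plotting call; the ages are randomly generated stand-ins, and all variable names are assumptions.

```python
import numpy as np

# Randomly generated stand-in ages, not real data.
rng = np.random.default_rng(0)
age_yes = rng.integers(18, 90, size=200)
age_no = rng.integers(18, 90, size=800)

# Shared bin edges so both groups are counted over the same 10 bins.
bin_edges = np.linspace(18, 90, 11)
n_yes, _ = np.histogram(age_yes, bins=bin_edges)
n_no, _ = np.histogram(age_no, bins=bin_edges)

# One row per bin, one column per response group.
n_table = np.column_stack((n_yes, n_no))

# Divide each row by its row total to get within-bin proportions.
n_norm = n_table / n_table.sum(axis=1, keepdims=True)
```

Each row of `n_norm` sums to 1, so its two entries can be drawn as the lower and upper segments of a normalized bar for that bin.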

28
Q

What is the command to create a contingency table in Python?

A

pd.crosstab()

29
Q

Fill in the blank: The pandas command ‘cut()’ is used in Python to _______.

A

bin the values into categories.

30
Q

What is the significance of using ‘right = False’ in the cut() command?

A

It excludes the right-hand cutpoint from each category, so each bin includes its left-hand cutpoint but not its right-hand one.
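The effect of `right = False` is easiest to see at the boundaries. In the sketch below, the cutpoints and labels are assumptions picked to mirror the age groups mentioned elsewhere in this deck, and the ages are made up.

```python
import pandas as pd

# Made-up ages, chosen to sit on and around the cutpoints.
ages = pd.Series([18, 26, 27, 45, 60, 61, 85])

# Assumed cutpoints and labels. With right=False the bins are
# [0, 27), [27, 60.01), and [60.01, 100).
age_binned = pd.cut(
    ages,
    bins=[0, 27, 60.01, 100],
    labels=["Under 27", "27 to 60", "Over 60"],
    right=False,
)
print(pd.concat([ages, age_binned], axis=1, keys=["age", "age_binned"]))
```

Because the right-hand cutpoint is excluded, an age of exactly 27 falls into "27 to 60" rather than "Under 27".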

31
Q

What R command is used to create a histogram with an overlay?

A

ggplot() + geom_histogram()

32
Q

What does the ‘position = "fill"’ input do in R’s geom_histogram()?

A

It normalizes the histogram.

33
Q

True or False: The age group 27 to 60 has a high response proportion.

A

False

34
Q

What is the recommended best practice for binning?

A

Use binning based on predictive value.

35
Q

What is the main advantage of creating categorical variables through binning?

A

Some algorithms work better with categorical rather than numeric variables.

36
Q

What does the ‘aes(fill = response)’ command do in R?

A

It adds an overlay to the histogram based on the response variable.

37
Q

What visual representation is generated by the command ‘crosstab_02.plot(kind="bar", stacked = True)’?

A

A stacked bar graph of age binned with response overlay.

38
Q

What does the command ‘prop.table()’ do in R?

A

It calculates the proportions of the contingency table.
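On the Python side, a similar result can be obtained with the `normalize` argument of `pd.crosstab()`. This is offered as a rough counterpart for comparison, not as something the cards themselves describe; the data are made up and the names are hypothetical.

```python
import pandas as pd

# Made-up toy data; the values are for illustration only.
bank = pd.DataFrame({
    "previous_outcome": ["failure"] * 6 + ["nonexistent"] * 10 + ["success"] * 4,
    "response": ["no"] * 5 + ["yes"] + ["no"] * 9 + ["yes"] + ["no"] + ["yes"] * 3,
})

# normalize="columns" divides each cell by its column total;
# "index" would use row totals and "all" the grand total.
col_props = pd.crosstab(
    bank["response"], bank["previous_outcome"], normalize="columns"
)
print(col_props.round(3))
```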

39
Q

What is a key observation made about the age groups in relation to response rates?

A

Both the older and the younger groups have a much higher response rate than the middle group.

40
Q

What is exploratory data analysis (EDA)?

A

EDA is an approach to analyzing data sets to summarize their main characteristics, often using visual methods.

41
Q

When should analysts use exploratory data analysis (EDA) rather than hypothesis testing?

A

EDA should be used when the objective is to explore data and find patterns without predefined hypotheses.

42
Q

What are some examples of what EDA allows the user to do?

A
  • Identify trends
  • Discover patterns
  • Detect anomalies
  • Generate hypotheses
43
Q

Which graph do we use to explore the relationship between a categorical predictor and the target variable?

A

A bar graph, with the target variable shown as an overlay.

44
Q

What are (non-normalized) bar graphs useful for?

A

They are useful for displaying the frequency of categorical data.

45
Q

State one advantage and one disadvantage of using a normalized bar graph.

A

Advantage: Easier comparison across categories.
Disadvantage: Can obscure the actual counts.

46
Q

What are the two best practices when working with bar graphs for EDA?

A
  • Supplement unclear bar graphs with normalized versions
  • Provide non-normalized graphs to indicate the original distribution
47
Q

What does a contingency table help us to do?

A

It helps to summarize the relationship between two categorical variables.

48
Q

Explain the two best practices when working with contingency tables in EDA.

A
  • Place the response variable in the rows
  • Report column percentages alongside the counts
49
Q

What is a histogram?

A

A histogram is a graphical representation of the distribution of numerical data.

50
Q

Describe one advantage and one disadvantage of using a normalized histogram.

A

Advantage: Makes the response patterns easier to distinguish.
Disadvantage: Does not indicate the original distribution of the data.

51
Q

What are the best practices for working with histograms in EDA?

A
  • Use non-normalized histograms to show the original distribution
  • Use normalized histograms to show the response patterns
52
Q

Why might it be useful for the analyst to bin a numeric variable?

A

Binning can simplify the analysis and help highlight trends.

53
Q

Why do we use the binning method shown in this chapter rather than automatic binning methods?

A

Manual binning allows for more control and better understanding of the data distribution.

54
Q

True or False: Data scientists should use automatic methods of data analysis without caution.

A

False

55
Q

What is the purpose of creating a bar graph of the previous_outcome variable?

A

To visualize the distribution of previous outcomes.

56
Q

What is the purpose of creating a normalized bar graph of the previous_outcome variable?

A

To compare the proportions of responses across different previous outcomes.

57
Q

What should be included when comparing a contingency table with bar graphs?

A

Counts and percentages for each category.

58
Q

What is the relationship between age and response demonstrated in a histogram?

A

It shows how age distribution correlates with the response variable.

59
Q

What is the purpose of binning the age variable?

A

To group ages into categories for easier analysis.

60
Q

What should be included in a contingency table of job with response?

A

Counts and column percentages.

61
Q

What is the significance of combining job categories based on response percentages?

A

It simplifies analysis and highlights significant trends.

62
Q

How do you define a new categorical variable from the duration variable?

A

By identifying cutoff points that separate low and high response values.

63
Q

What should be done after identifying outliers in the capital-loss variable?

A

Construct a bar graph for the outlier records.

64
Q

What is the effect of deleting outliers at the EDA stage?

A

It changes the character of the data set and can lead to misleading conclusions.

65
Q

What does the capital-loss-flag variable represent?

A

It equals 0 when capital-loss equals 0, and 1 otherwise.
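A minimal sketch of deriving such a flag with pandas is shown below. The data frame name `adult` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up values; most records have no capital loss.
adult = pd.DataFrame({"capital-loss": [0, 0, 1902, 0, 2415]})

# The comparison yields True/False; astype(int) converts that to 1/0.
adult["capital-loss-flag"] = (adult["capital-loss"] != 0).astype(int)
print(adult)
```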

66
Q

What is the rationale for combining certain categories in a contingency table?

A

To reduce complexity and improve interpretability.

67
Q

What is the purpose of renaming variables before further analysis?

A

To maintain clarity and track changes in variable representations.

68
Q

What should be analyzed when creating a histogram of the education variable?

A

The relationship between education levels and income.

69
Q

What is the significance of binning the age variable into specific ranges?

A

It helps to analyze trends and patterns specific to age groups.

70
Q

What type of data visualization is suggested for the sex predictor?

A

Both non-normalized and normalized bar graphs.

71
Q

What does the normalized bar graph of occupation with a sex overlay illustrate?

A

The distribution of sex across different occupations.

72
Q

What is a contingency table with sex for the rows and occupation for the columns used for?

A

To compare the distribution of sex across different occupations.