C1 Intro to Probability and Data Analysis with R M4 Exploratory Data Analysis | Inference Intro Flashcards

1
Q

What is the arithmetic average called?

A

Mean

The mean is calculated by adding all values and dividing by the number of values.

2
Q

What term refers to the midpoint of a data set?

A

Median

The median divides the data into two equal halves.

3
Q

What is the most frequent observation in a data set called?

A

Mode

A data set can have more than one mode or none at all.

4
Q

What measure indicates variability around the mean?

A

Standard deviation

Standard deviation quantifies the amount of variation or dispersion in a set of values.

5
Q

What is the formula for calculating the range of a data set?

A

Max - Min

Maximum value - Minimum value found in the data set

The range provides a measure of how spread out the values are.

6
Q

What does the interquartile range represent?

A

The interquartile range represents the range of the middle 50% of the distribution.

It is the distance between the first quartile (25th percentile) and third quartile (75th percentile)

IQR = Q3 - Q1 where Q1 and Q3 are the 25th and 75th percentiles

The IQR is the length of the box in a box plot.

7
Q

Fill in the blank: The three commonly used measures of center are mean, median, and _______.

A

Mode

Mode is essential in understanding the frequency of data points.

8
Q

True or False: The range is defined as the maximum value minus the minimum value.

A

True

This measure provides a quick sense of the spread of the data.

9
Q

What are the three commonly used measures of center?

A
  • mean (the arithmetic average)
  • median (the midpoint)
  • mode (the most frequent observation)
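
The three measures of center can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the sample `values` is hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical sample, chosen only for illustration
values = [2, 3, 3, 5, 7, 10]

center_mean = mean(values)        # sum of the values divided by their count
center_median = median(values)    # midpoint; averages the two middle values here
center_modes = multimode(values)  # list of every most frequent value

print("mean:  ", center_mean)
print("median:", center_median)
print("modes: ", center_modes)
```

`multimode` returns a list because a data set can have more than one mode.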
10
Q

What are the three commonly used measures of spread?

A
  • standard deviation (variability around the mean)
  • range (max-min)
  • interquartile range (middle 50% of the distribution)
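
As a rough sketch, the three measures of spread can be computed with Python's standard library. The sample is hypothetical. Quartile conventions differ between tools, so the IQR reported by `statistics.quantiles` (default "exclusive" method) may not match other software exactly.

```python
from statistics import quantiles, stdev

# Hypothetical sample, chosen only for illustration
values = [2, 4, 4, 5, 7, 9, 11]

value_range = max(values) - min(values)  # max - min
q1, q2, q3 = quantiles(values, n=4)      # the three quartile cut points
iqr = q3 - q1                            # width of the middle 50%
sd = stdev(values)                       # sample standard deviation (n - 1 denominator)

print("range:", value_range)
print("IQR:  ", iqr)
print("sd:   ", round(sd, 3))
```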
11
Q

Which of the following cannot be determined from a boxplot?

A

Modality. Box plots do not display modality; histograms do.

Histograms vs Box Plots

modality: whether the distribution is unimodal, bimodal, uniform, etc.

12
Q

What are box plots and dot plots used for?

A

To highlight outliers and display the median and interquartile range

Box Plots vs Histograms
13
Q

What is a common tool for visualizing the relationship between two numerical variables?

A

Scatter plot

The primary purpose of a scatter plot is to visualize the relationship between two numerical variables.

A scatter plot provides a case-by-case view of data for two numerical variables.

In a scatter plot, the explanatory variable is placed on the x-axis and the response variable on the y-axis.

14
Q

What can we only talk about when using observational data?

A

Correlation, not causation.

15
Q

What does a strong relationship in a scatter plot indicate?

A

Little scatter around the curve.

16
Q

What is a naive approach to handling outliers in data analysis?

A

Immediately excluding them.

17
Q

What is the purpose of histograms in data visualization?

A

To visualize the distribution of a single numerical variable.

In a histogram, the height of each bar represents the number of cases that fall into the corresponding interval.
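
To make the idea of bar heights concrete, here is a small sketch that counts how many cases fall into each interval and prints a text histogram. The data and the bin width of 2 are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical measurements and an assumed bin width
values = [1.2, 2.5, 2.9, 3.1, 3.3, 4.8, 5.0, 7.4]
bin_width = 2

# Assign each value to the interval [k * width, (k + 1) * width)
counts = Counter(int(v // bin_width) for v in values)

# One row per interval; the number of '#' marks is the bar height
for k in sorted(counts):
    low, high = k * bin_width, (k + 1) * bin_width
    print(f"[{low}, {high}): {'#' * counts[k]}")
```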

18
Q

What is a dot plot used for?

A

Visualizing individual values.

19
Q

What does a box plot display?

A

A box plot summarizes a data set using five statistics while also plotting unusual observations.
In a box plot, the median and the interquartile range of the data are prominently displayed.

A vertical box plot
20
Q

What does an intensity map reveal?

A

Spatial distribution trends in the data.

When we encounter geographic data, we should create an intensity map, where colors are used to show higher and lower values of a variable.

US intensity map for poverty

The intensity maps are not generally very helpful for getting precise values in any given area, but they are very helpful for seeing geographic trends and generating interesting research questions or hypotheses.

21
Q

What is the meaning of skewness of a distribution?

A

Skewness refers to the direction of the tail in a distribution (left or right)

  • left skewed
  • symmetric
  • right skewed

Distributions are skewed toward the side of the longer tail.

22
Q

What is modality?

A

Modality is an important aspect of shape to describe a distribution.
Modality refers to the number of peaks in a distribution.

A distribution might be unimodal with one prominent peak, bimodal with two prominent peaks, or uniform with no prominent peaks. With more than two prominent peaks a distribution is usually said to be multimodal.

23
Q

Definition

Sample statistics

A

Sample statistics are point estimates for the unknown population parameters.

They are measurements calculated from a sample that is representative of the total population.

24
Q

Compare the mean and median according to the skewness of a distribution.

A

left skewed: mean < median
symmetric: mean ≈ median
right skewed: mean > median
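
The pattern can be checked on small examples. The three samples below are hypothetical and built so that one has a long right tail, one is its mirror image, and one is symmetric.

```python
from statistics import mean, median

# Hypothetical samples, chosen only for illustration
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]          # long tail to the right
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]   # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("left skewed", left_skewed),
                   ("symmetric", symmetric),
                   ("right skewed", right_skewed)]:
    print(f"{name:12} mean = {mean(data):6.2f}   median = {median(data):6.2f}")
```

The extreme value in the tail pulls the mean toward it, while the median barely moves.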

Skewness vs measures of center
25
Q

Variance

A

Variance is roughly the average squared deviation from the mean
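
One common way to write this down is the sample variance. The notation is an assumption of this sketch rather than the card's: a sample x_1, ..., x_n with sample mean x̄. Dividing by n - 1 instead of n is one reason the definition says "roughly" the average.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```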

26
Q

Standard deviation

A

Standard deviation is the square root of the variance, and standard deviation has the same units as the data.

Standard Deviation

The standard deviation represents the typical deviation of observations from the mean.
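
A short sketch of the calculation, step by step, on a hypothetical sample. It uses the sample (n - 1) denominator and compares the manual result with Python's `statistics.variance` and `statistics.stdev`.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical sample, chosen only for illustration
values = [1, 3, 5, 7, 9]

center = mean(values)                                # 5
squared_devs = [(x - center) ** 2 for x in values]   # squared deviations from the mean
manual_variance = sum(squared_devs) / (len(values) - 1)
manual_sd = sqrt(manual_variance)                    # back in the units of the data

print("variance:", manual_variance, "| library:", variance(values))
print("sd:      ", round(manual_sd, 3), "| library:", round(stdev(values), 3))
```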

27
Q

Variability vs Diversity

A

Distributions in which more observations cluster around the center are less variable; distributions in which more observations lie away from the center are more variable.

A set with more data at the ends of the distribution (away from the center) is more variable. Diversity in a set refers to the number of different discrete values, not how far or close to the center of a distribution these values are.

28
Q

True or False

In a left skewed distribution the median tends to be greater than the mean

A

True

For positive-valued data, the ratio mean/median (mean divided by median) can serve as a rough indicator of skewness: a ratio below 1 suggests left skew, and a ratio above 1 suggests right skew.

29
Q

What is an outlier in a data distribution?

A

An outlier is an observation that appears extreme relative to the rest of the data.

Examining data for outliers serves many useful purposes, including
1. Identifying strong skew in the distribution.
2. Identifying possible data collection or data entry errors.
3. Providing insight into interesting properties of the data.

30
Q

What is the purpose of whiskers in a box plot?

A

Extending out from the box, the whiskers attempt to capture the data outside of the box.

However, their reach is never allowed to be more than 1.5 × IQR beyond the box, that is, below Q1 or above Q3. They capture everything within this reach.

a vertical box plot

In a sense, the box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data.
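
The 1.5 × IQR rule can be sketched as follows. The sample is hypothetical, and the quartiles come from `statistics.quantiles` with its default method, so other software may place the box slightly differently.

```python
from statistics import quantiles

# Hypothetical sample; 30 sits far from the rest of the data
values = [3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 30]

q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Maximum reach of the whiskers: 1.5 * IQR beyond each end of the box
lower_reach = q1 - 1.5 * iqr
upper_reach = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still within reach
within = [v for v in values if lower_reach <= v <= upper_reach]
whisker_low, whisker_high = min(within), max(within)

# Anything beyond the reach would be drawn as an individual dot
beyond = [v for v in values if v < lower_reach or v > upper_reach]

print("whiskers:", whisker_low, "to", whisker_high)
print("plotted as dots:", beyond)
```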

31
Q

True or False

Outliers are represented as dots beyond the whiskers in a box plot

A

True
Any observation lying beyond the whiskers is labeled with a dot.

a vertical box plot

The purpose of labeling these points, instead of extending the whiskers to the minimum and maximum observed values, is to help identify any observations that appear to be unusually distant from the rest of the data.

32
Q

robust statistics

A

We define robust statistics as measures on which extreme observations have little effect.

The median is a more robust statistic of center than the mean. The IQR, which is based on quartiles, is a more robust statistic of spread than the standard deviation, which is calculated using the mean.

Robust statistics are most useful for describing skewed distributions or those with extreme observations, while non-robust statistics like the mean and standard deviation are useful for describing symmetric distributions.
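
A small sketch of what "little effect" means in practice. The data are hypothetical: the second sample is the first with its largest value replaced by an extreme one, as might happen with a data entry error. The helper `iqr` is defined here for illustration.

```python
from statistics import mean, median, quantiles, stdev

def iqr(data):
    """Interquartile range using the default quartile method of statistics.quantiles."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

# Hypothetical samples: same data, but the largest value becomes extreme
clean = [10, 12, 13, 14, 15, 16, 18]
with_extreme = clean[:-1] + [180]

for label, stat in [("mean", mean), ("sd", stdev), ("median", median), ("IQR", iqr)]:
    print(f"{label:6} clean = {stat(clean):7.2f}   with extreme = {stat(with_extreme):7.2f}")
```

The median and IQR are identical for both samples, while the mean and standard deviation change substantially.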

33
Q

What is a transformation of data?

A

A transformation is a rescaling of the data using a function. When data are very strongly skewed, we sometimes transform them, so that they are easier to model.

We might want to reduce skew to assist in modeling or we might want to straighten a nonlinear relationship in a scatterplot, so that we can model the relationship with simpler methods.

34
Q

What is the most commonly used transformation?

A

The most commonly used transformation is the natural log transformation, which is often applied when much of the data cluster near zero relative to larger values in the dataset and all observations are positive.
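
As an illustration, the hypothetical sample below is positive, clusters near zero relative to its largest values, and is strongly right skewed. After a natural log transformation the values are evenly spaced, so the mean and median agree.

```python
from math import log
from statistics import mean, median

# Hypothetical positive, strongly right-skewed sample
values = [1, 10, 100, 1000, 10000]
logged = [log(v) for v in values]   # natural log of each observation

print("raw:    mean =", round(mean(values), 1), " median =", median(values))
print("logged: mean =", round(mean(logged), 3), " median =", round(median(logged), 3))
```

Real data rarely become perfectly symmetric; this sample was constructed to make the effect easy to see.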

35
Q

Hypothesis testing framework

A

1) We start with a null hypothesis that represents the status quo.
2) We also have an alternative hypothesis that represents our research question, in other words, what we're testing for.
3) We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or using theoretical methods.
A) If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis.
B) If they do, then we reject the null hypothesis in favor of the alternative.
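
The framework above can be sketched with a simulation. Everything specific here is an assumption made for illustration: a hypothetical experiment of 20 coin flips with 16 heads observed, a null hypothesis that the coin is fair, a one-sided alternative that it favors heads, a cutoff `alpha` of 0.05, and a fixed random seed so the run is repeatable.

```python
import random

rng = random.Random(42)   # fixed seed (assumption) so the result is repeatable

n_flips = 20
observed_heads = 16       # hypothetical observed result
n_sims = 10_000           # number of simulated experiments
alpha = 0.05              # assumed cutoff for "convincing evidence"

def simulate_heads(n):
    """Count heads in n flips of a fair coin, i.e. under the null hypothesis."""
    return sum(rng.random() < 0.5 for _ in range(n))

# How often does the null hypothesis alone produce a result at least this extreme?
at_least_as_extreme = sum(
    simulate_heads(n_flips) >= observed_heads for _ in range(n_sims)
)
p_value = at_least_as_extreme / n_sims

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print("estimated p-value:", p_value, "->", decision)
```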

36
Q

What is a contingency table?

A

A contingency table is a type of table used in statistics to display the frequency distribution of two categorical variables. It helps to summarize the relationship between these variables by showing how the categories of one variable relate to the categories of another.
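
A minimal sketch of how such a table is built from raw observations. The `(group, answer)` pairs are hypothetical; each cell of the table is the count of one combination of categories.

```python
from collections import Counter

# Hypothetical survey responses as (group, answer) pairs
responses = [
    ("student", "yes"), ("student", "no"), ("student", "yes"),
    ("staff", "no"), ("staff", "no"), ("staff", "yes"), ("student", "yes"),
]

counts = Counter(responses)                    # one count per (group, answer) cell
rows = sorted({group for group, _ in responses})
cols = sorted({answer for _, answer in responses})

# Print the table with row totals
print(f"{'':10}" + "".join(f"{c:>6}" for c in cols) + f"{'total':>7}")
for r in rows:
    row_counts = [counts[(r, c)] for c in cols]
    print(f"{r:10}" + "".join(f"{n:>6}" for n in row_counts) + f"{sum(row_counts):>7}")
```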

37
Q

What is the purpose of a Contingency Table?

A

Purpose of a Contingency Table:

Relationship Evaluation: It allows researchers to evaluate whether there is an association or relationship between the two categorical variables.

Conditional Distribution: It helps in calculating conditional distributions, which show the distribution of one variable given the levels of the other variable.

Data Visualization: It provides a clear visual representation of the data, making it easier to identify patterns or trends.

Statistical Analysis: It serves as a basis for various statistical tests, such as the Chi-square test, to determine if the observed frequencies differ significantly from expected frequencies.

38
Q

What is a graphical way to represent a single categorical variable?

A

A graphical way of representing data with a single categorical variable is a bar plot.

Also, we usually consider the relative frequencies when evaluating the distributions of categorical variables. We can make a bar plot of these relative frequencies, which looks just like the original bar plot but has the relative frequencies instead of the counts on the y-axis.
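
A small sketch of counts versus relative frequencies for a single categorical variable, with a text bar for each category. The data are hypothetical.

```python
from collections import Counter

# Hypothetical observations of one categorical variable
colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]

counts = Counter(colors)
total = sum(counts.values())
relative = {category: n / total for category, n in counts.items()}

# Same bars either way; only the scale on the axis differs
for category, n in counts.most_common():
    print(f"{category:6} {'#' * n:6} count = {n}   relative frequency = {relative[category]:.2f}")
```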

39
Q

How are bar plots different than histograms?

A

First, bar plots are used for displaying distributions of categorical variables, while histograms are used for numerical variables.

Second, the x-axis in a histogram is a number line, so the order of the bars cannot be changed. In a bar plot, the categories can be listed in any order, though some orderings make more sense than others, especially for ordinal variables.

Bar Plot vs Histogram
40
Q

What type of plot is useful to visualize conditional frequency distributions of categorical data?

A

A segmented bar plot is useful for visualizing conditional frequency distributions of categorical variables.

Segmented Bar Plot

In other words, the distribution of the levels of one variable, the response variable, conditioned on the levels of the other, the explanatory variable.

41
Q

True or False

A mosaic plot can display both the marginal distribution and the conditional distribution of categorical variables.

A

True


Mosaic plots are useful only for categorical variables.

42
Q

p-value

A

The p-value is the probability of observing data at least as favorable to the alternative hypothesis as the data actually observed, under the assumption that the null hypothesis is true.
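
As a worked example under assumed numbers: suppose the null hypothesis is that a coin is fair, the alternative is that it favors heads, and 16 heads are observed in 20 flips (all hypothetical). The p-value is the probability of 16 or more heads when the coin is fair, which can be computed exactly by counting outcomes.

```python
from math import comb

n_flips = 20
observed_heads = 16   # hypothetical observed result

# Number of equally likely flip sequences with at least the observed number of heads
favorable = sum(comb(n_flips, k) for k in range(observed_heads, n_flips + 1))
possible = 2 ** n_flips

p_value = favorable / possible
print(f"p-value = {favorable}/{possible} = {p_value:.4f}")
```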

43
Q

True or False

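In a right skewed distribution the median tends to be greater than the mean

A

False

In a right skewed distribution the mean tends to be greater than the median, because the long right tail pulls the mean upward.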

44
Q

What is suggested of a contingency table when it shows that there are considerable differences between proportions in their categories?

A

It suggests a relationship between categorical variables.

Unions and quality in public schools

35/290 ≈ 12% of Republicans, 146/341 ≈ 43% of Democrats, and 69/341 ≈ 20% of Independents think that teachers belonging to unions or bargaining associations helped the quality of public school education in the United States. Since there are considerable differences between these proportions, the results of the survey suggest a relationship between opinion on teachers belonging to unions or bargaining associations and political party affiliation.
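
The percentages above can be reproduced from the quoted counts. This sketch takes the numerators and denominators exactly as they appear in the card text; the dictionary name `helped` is just a label for this example.

```python
# (number answering "helped", group total), as quoted in the card text
helped = {
    "Republicans": (35, 290),
    "Democrats": (146, 341),
    "Independents": (69, 341),
}

# Conditional proportion of "helped" within each party, as a whole percent
percentages = {party: round(100 * k / n) for party, (k, n) in helped.items()}

for party, pct in percentages.items():
    print(f"{party:13} {pct}%")
```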