Statistics Flashcards

Question 1

Q

Binary Categorical Variables

Answer

A

Categorical variables can also be binary (dichotomous) variables. Binary variables are nominal categorical variables that contain only two mutually exclusive categories. Examples of binary variables are whether a person is pregnant, or whether a house’s price is above or below a particular price.

Question 2

Q

dichotomous variables

Answer

A

Categorical variables can also be binary (dichotomous) variables. Binary variables are nominal categorical variables that contain only two mutually exclusive categories. Examples of binary variables are whether a person is pregnant, or whether a house’s price is above or below a particular price.

Question 3

Q

Ordinal categorical variables

Answer

A

Categorical variables consist of data that can be grouped into distinct categories, and are either ordinal or nominal. Ordinal categorical variables are groups that contain an inherent ranking, such as ratings of plays or responses to a survey question with a point scale (e.g., on a scale from 1-7, how happy are you right now?).
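A minimal sketch of what "inherent ranking" means in practice: ordinal categories can be sorted and compared once the order is stated. The rating scale and responses below are hypothetical.

```python
# Hypothetical ordered rating scale, listed from lowest to highest.
LEVELS = ["poor", "fair", "good", "excellent"]
rank = {level: position for position, level in enumerate(LEVELS)}

ratings = ["good", "poor", "excellent", "fair", "good"]  # made-up responses

# Sorting and taking a maximum are meaningful because the categories are ranked.
sorted_ratings = sorted(ratings, key=rank.__getitem__)
highest = max(ratings, key=rank.__getitem__)
print(sorted_ratings, highest)
```

For a nominal variable such as hair color, no such ranking exists, so sorting by category would be arbitrary.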

Question 4

Q

Nominal categorical variables

Answer

A

Nominal categorical variables are made of categories without an inherent order. Examples of nominal variables are species of ants and people’s hair color.

Question 5

Q

Quantitative variables

Answer

A

Quantitative variables are amounts or counts; for example, age, number of children, and income are all quantitative variables.

Question 6

Q

Categorical Variables

Answer

A

Categorical variables represent groupings; for example, type of pet, agreement rating, and brand of shoes are all categorical variables.

Question 7

Q

Categorical Data

Answer

A

Categorical Data refers to data represented by words rather than numbers. Examples of categorical data are tree species and survey responses (Agree, Neutral, Disagree).

Question 8

Q

Messy Data

Answer

A

Messy data is data that violates one of the tidy dataset rules (1. Each variable forms a column; 2. Each observation forms a row; 3. Each type of observational unit forms a table).

Question 9

Q

Tabular Data

Answer

A

Tabular data is organized into rows, or observations, along the vertical axis, and columns, also referred to as variables or features, along the horizontal axis.

Question 10

Q

Tidy Data Rules

Answer

A

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
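As a rough illustration of the rules above, the sketch below reshapes a messy table into a tidy one using plain Python. The plant names, years, and heights are made up; the messy version spreads one variable (year) across column names, which breaks the first rule.

```python
# Messy: the variable "year" is hidden in the column names (hypothetical values).
messy_rows = [
    {"plant": "A", "height_2021": 10.0, "height_2022": 14.5},
    {"plant": "B", "height_2021": 8.0, "height_2022": 11.0},
]


def tidy(rows):
    """Return one row per (plant, year) observation with a single height column."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith("height_"):
                year = int(key.split("_")[1])
                tidy_rows.append({"plant": row["plant"], "year": year, "height": value})
    return tidy_rows


tidy_rows = tidy(messy_rows)
for r in tidy_rows:
    print(r)
```

After reshaping, each variable (plant, year, height) has its own column and each measurement has its own row.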

Question 11

Q

Sample Set of Data

Answer

A

A sample set of data is a dataset that is representative of the entire population of interest. Random sampling is the best way to make sure the sample is representative of the whole population but does not guarantee a representative sample, especially if the sample is too small.
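A small sketch of simple random sampling with Python's standard library. The population of 1,000 numbered units and the sample size of 100 are assumptions chosen only for illustration.

```python
import random

population = list(range(1, 1001))  # hypothetical population of 1,000 unit IDs

rng = random.Random(42)                  # fixed seed so the draw is reproducible
sample = rng.sample(population, k=100)   # simple random sample without replacement

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

The sample mean will usually land near the population mean, but nothing forces it to; rerunning with a much smaller `k` tends to show larger gaps.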

Question 12

Q

Structurally Missing Data

Answer

A

Structurally Missing Data is data that is expected to be missing.
For example, there are structurally missing data in the ‘Litters’ and ‘Pups/Litter’ columns for all the male dogs in the table below because we would not expect male dogs to have puppies.

Question 13

Q

Missing at Random Data (MAR Data)

Answer

A

Missing at Random (MAR) data is missing because of some random characteristic about the person or thing being studied. Often, this type of data is reliably missing based on the value of another variable in the dataset.

In the table below, the bacterial cell counts for all the stool samples are ‘NaN’. If we looked into this, we might find that there were too many bacterial cells to count in all those samples. Therefore, the bacterial cell counts for stool samples would be MAR data.
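One way to look for this pattern is to compute the share of missing values within each group of another variable. The records below are hypothetical and only mirror the stool-sample scenario described above.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing cell count.
records = [
    {"sample_type": "stool", "cell_count": None},
    {"sample_type": "stool", "cell_count": None},
    {"sample_type": "saliva", "cell_count": 1200},
    {"sample_type": "saliva", "cell_count": 950},
    {"sample_type": "skin", "cell_count": 300},
]


def missing_rate_by_group(rows, group_key, value_key):
    """Fraction of rows with a missing value, computed separately per group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(records, "sample_type", "cell_count")
print(rates)
```

A missing rate that differs sharply between groups suggests the values are not missing completely at random.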

Question 14

Q

Missing Completely at Random (MCAR) data

Answer

A

Missing Completely at Random (MCAR) data has no detectable underlying reason causing the values to be missing.

The table below has MCAR data. The # of fruits is missing for some plants, but the missing fruit data seems unrelated to the height of the plant. Short and tall plants are both missing fruit data. In addition, we are missing the height for one of our plants!

Question 15

Q

distribution

Answer

A

A distribution is a function that shows all possible values of a variable and how frequently each value occurs.
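A minimal sketch of an empirical distribution: count how often each value occurs, then convert the counts to proportions. The observations are made up.

```python
from collections import Counter

values = [2, 3, 3, 4, 4, 4, 5, 5, 6]  # hypothetical observations

counts = Counter(values)  # how often each value occurs
n = len(values)
relative = {v: c / n for v, c in sorted(counts.items())}  # proportions sum to 1

for v in sorted(counts):
    print(v, counts[v], "#" * counts[v])  # crude text histogram
```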

Question 16

Q

normal distribution

Answer

A

A normal distribution is bell-shaped (or hill-shaped) and symmetrical. It is a very common pattern in data.

Plot of a normal distribution. The bars in the plots become larger and then smaller moving left to right, forming a bell shape.

Question 17

Q

mean

Answer

A

The mean, also called the average, describes the center of a numeric distribution by adding all values and dividing by the count.
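The calculation can be sketched in two equivalent ways, by hand and with the standard library. The data values are hypothetical.

```python
import statistics

values = [4, 8, 15, 16, 23, 42]  # hypothetical data

manual_mean = sum(values) / len(values)  # add all values, divide by the count
library_mean = statistics.mean(values)   # same calculation via the standard library
print(manual_mean, library_mean)
```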

Question 18

Q

standard deviation

Answer

A

The standard deviation describes the spread of values in a numeric distribution by measuring the typical distance of the values from the mean. Formally, it is the square root of the average squared distance from the mean.
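A short sketch of the calculation on made-up data, showing both the population form (divide by n) and the sample form (divide by n - 1). Which form applies depends on whether the data are the whole population or a sample from it.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = sum(values) / len(values)
squared_deviations = [(v - mean) ** 2 for v in values]

# Population standard deviation: average the squared deviations over n.
population_sd = math.sqrt(sum(squared_deviations) / len(values))
# Sample standard deviation: divide by n - 1 instead.
sample_sd = math.sqrt(sum(squared_deviations) / (len(values) - 1))
print(population_sd, sample_sd)
```

The standard library functions `statistics.pstdev` and `statistics.stdev` compute the same two quantities.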

Question 19

Q

skewed distribution

Answer

A

A skewed distribution is asymmetrical, with a steep change in frequency on one side and a flatter, trailing change in frequency on the other. It is either right-skewed (positively skewed, tail on the right) or left-skewed (negatively skewed, tail on the left).

Question 20

Q

median

Answer

A

The median is the middle value when all values are arranged from smallest to largest. It is also referred to as the 50th percentile or the second quartile (Q2).
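A minimal sketch of finding the middle value. It assumes the common convention of averaging the two middle values when the count is even; the helper name `median_of` and the inputs are hypothetical.

```python
def median_of(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_of([7, 1, 5])
even_case = median_of([7, 1, 5, 3])
print(odd_case, even_case)
```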

Question 21

Q

interquartile range (IQR)

Answer

A

The IQR is the difference between Q3 and Q1, marking the range for just the middle 50% of the data.

The first quartile (Q1) marks 25%.
The second quartile (Q2, the median) marks 50%.
The third quartile (Q3) marks 75%.
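The quartiles and IQR can be sketched with `statistics.quantiles` from the standard library. The data are made up, and the choice of `method="inclusive"` is an assumption; other quartile conventions can give slightly different cut points.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data

# method="inclusive" assumes the data include the smallest and largest values
# of interest; the default "exclusive" method can place the cut points differently.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # width of the middle 50% of the data
print(q1, q2, q3, iqr)
```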

Question 22

Q

outliers

Answer

A

Outliers are extreme values that are distant from the rest of the distribution. Just as with skewness, outliers tend to influence the mean more heavily than the median.

Question 23

Q

Robust Measures

Answer

A

Because the median and IQR are NOT heavily influenced by extreme values, we say they are robust. Robust statistics are often a better choice to measure the center and spread of a distribution that is skewed or has outliers.
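A quick sketch of robustness on made-up numbers: adding a single extreme value moves the mean a lot and the median only a little.

```python
import statistics

typical = [10, 12, 13, 14, 15]   # hypothetical values
with_outlier = typical + [100]   # same data plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)
print(mean_shift, median_shift)
```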

Question 24

Q

mode

Answer

A

The mode is defined as the value with the highest frequency, but we can also think of the mode as the value where the peak of the distribution occurs. While not great for computations, the mode can help us identify interesting features in a variable.

Question 25

Q

bimodal

Answer

A

We would call a distribution bimodal when it has two modes.
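A simplified sketch using `statistics.multimode`, which returns every value tied for the highest frequency. The data are hypothetical. Note that in practice a distribution is often called bimodal when it has two clear peaks even if they are not exactly the same height, which this exact-tie check would not detect.

```python
import statistics

values = [1, 2, 2, 2, 3, 4, 5, 5, 5, 6]  # hypothetical data with two peaks

modes = statistics.multimode(values)  # every value tied for the highest frequency
is_bimodal = len(modes) == 2
print(modes, is_bimodal)
```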

Question 26

Q

Aggregate Data

Answer

A

Aggregating data means separating it into groups and then summarizing each group with a statistic such as the mean.
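A minimal sketch of this split-then-summarize idea in plain Python. The species and weights are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (species, weight) observations.
rows = [("cat", 4.0), ("dog", 20.0), ("cat", 5.0), ("dog", 30.0), ("dog", 25.0)]

# Step 1: separate the observations by group.
groups = defaultdict(list)
for species, weight in rows:
    groups[species].append(weight)

# Step 2: summarize each group with its mean.
group_means = {species: mean(weights) for species, weights in groups.items()}
print(group_means)
```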

Question 27

Q

scatter plot

Answer

A

A scatter plot shows a cloud of points comparing two quantitative variables, with one variable on each axis.

Question 28

Q

correlation coefficient

Answer

A

We can describe a relationship more precisely by measuring the correlation coefficient. This number ranges from -1 to +1 and tells us two things about a linear relationship:

Direction: A positive coefficient means that higher values in one variable are associated with higher values in the other. A negative coefficient means higher values in one variable are associated with lower values of the other.
Strength: The farther the coefficient is from 0, the stronger the relationship and the more the points in a scatter plot look like a line.
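Assuming the coefficient meant here is the Pearson correlation coefficient (the usual choice for linear relationships), it can be sketched as follows. The helper name `pearson_r` and all data values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly decreasing line
r_noisy = pearson_r(xs, [2, 1, 4, 3, 5])  # increasing trend with some scatter
print(r_up, r_down, r_noisy)
```

The sign of each result gives the direction, and its distance from 0 gives the strength.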

Question 29

Q

Descriptive analysis

Answer

A

In descriptive analyses, we calculate measures of central tendency and spread to summarize major patterns in a dataset.

Examples of measures of central tendency include: mean, median, mode.

Examples of measures of spread include: range, interquartile range, standard deviation, variance.

Descriptive analyses also often include plots that help visualize measures of central tendency and spread. Common examples are box plots and histograms.

One limit of descriptive analysis is that the conclusions we draw cannot be extended beyond the data we directly analyzed.

For example, if we do a descriptive analysis on a dataset of household water usage in one region, we might find that the mean water usage is increasing over time. However, we would not be able to conclude anything about the mean water usage in other regions.

Question 30

Q

Exploratory Analysis

Answer

A

Exploratory data analysis looks for relationships between variables within a dataset. Exploratory analyses might reveal correlations between variables or group subsets of data based on shared characteristics.

Question 31

Q

Correlation and Causation

Answer

A

Correlation between variables does not necessarily mean a causal relationship exists between those variables.

For example, the divorce rate in Maine and margarine consumption are correlated, but margarine consumption does not cause divorces and divorce does not cause margarine consumption.

Question 32

Q

Inferential Analysis

Answer

A

Inferential analysis lets us draw conclusions about an entire population based on results from a subset or sample of that population. A/B testing, where we test which online feature performs better with a sample of a population, is a popular business application of inferential analysis.

Requirements for Inferential Analysis:
Inferential analysis is a powerful tool. As a result, several rules need to be followed for the analysis to be valid:

The sample selected must be “big enough” in comparison to the population. 10% is a good rule-of-thumb.
The sample should be randomly selected and representative of the total population.
Only test one hypothesis at a time. Manipulating more than one variable makes it impossible to tell which variable influenced the outcome.

Question 33

Q

Causal Analysis

Answer

A

Causal analysis coupled with careful experimental design lets us go beyond correlation and actually assign causation.

Key factors of good experimental design are:

Control: only one variable is changed at a time and the rest are kept from influencing the outcome of the experiment.
Randomization: subjects are randomly selected and randomly assigned to treatment groups.
Replication: many subjects are included in the experiment and the experiment is repeated with the same results.
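The random-assignment half of randomization can be sketched with a shuffle. The subject labels, group sizes, and seed below are assumptions for illustration only.

```python
import random

subjects = [f"subject_{i:02d}" for i in range(1, 21)]  # hypothetical subject IDs

rng = random.Random(0)   # fixed seed so the assignment is reproducible
shuffled = subjects[:]   # copy, so the original order is left untouched
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment = shuffled[:half]   # first half of the shuffled list
control = shuffled[half:]     # remaining subjects
print(treatment)
print(control)
```

Because chance alone decides who lands in each group, other characteristics of the subjects tend to balance out between treatment and control.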

Question 34

Q

Causal Analysis with Observational Data

Answer

A

Sometimes we need to know why something happened but we cannot perform the necessary experiments because they are too expensive, unethical, or otherwise impossible. In such cases, we may be able to do causal analysis on observational data but it requires meeting strict assumptions and applying advanced techniques.

For example, climate scientists apply advanced causal analysis techniques to determine whether global climate change impacts local weather systems since planet-scale experiments are impossible.

Question 35

Q

Predictive analysis

Answer

A

Predictive analysis takes advantage of supervised machine learning techniques to estimate the likelihood of future outcomes.

For example, recommendation algorithms use the preferences of many other people together with your previous choices to predict what you are most likely to enjoy.

Question 36

Q

Principal Component Analysis (PCA)

Answer

A

Principal Component Analysis (PCA) compresses the variables into principal components that can be plotted against each other. After PCA, we can use k-means clustering to look for trends in the data. We see that the penguins fall into three distinct clusters in the PCA plot!