Statistics Flashcards

1
Q

Binary Categorical Variables

A

Categorical variables can also be binary, or dichotomous. Binary variables are nominal categorical variables with exactly two mutually exclusive categories. Examples include whether a person is pregnant and whether a house’s price is above or below a particular value.

2
Q

dichotomous variables

A

Categorical variables can also be binary, or dichotomous. Binary variables are nominal categorical variables with exactly two mutually exclusive categories. Examples include whether a person is pregnant and whether a house’s price is above or below a particular value.

3
Q

Ordinal categorical variables

A

Categorical variables consist of data that can be grouped into distinct categories, and are either ordinal or nominal. Ordinal categorical variables are groups with an inherent ranking, such as ratings of plays or responses to a survey question with a point scale (e.g., on a scale from 1 to 7, how happy are you right now?).

4
Q

Nominal categorical variables

A

Nominal categorical variables are made of categories without an inherent order. Examples of nominal variables are species of ants and people’s hair color.

5
Q

Quantitative variables

A

Quantitative variables are amounts or counts; for example, age, number of children, and income are all quantitative variables.

6
Q

Categorical Variables

A

Categorical variables represent groupings; for example, type of pet, agreement rating, and brand of shoes are all categorical variables.

7
Q

Categorical Data

A

Categorical Data refers to data represented by words rather than numbers. Examples of categorical data are tree species and survey responses (Agree, Neutral, Disagree).

8
Q

Messy Data

A

Messy data is data that violates one of the tidy dataset rules (1. Each variable forms a column; 2. Each observation forms a row; 3. Each type of observational unit forms a table).

9
Q

Tabular Data

A

Tabular data is organized into rows, or observations, along the vertical axis, and columns, also referred to as variables or features, along the horizontal axis.

10
Q

Tidy Data Rules

A
  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
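As a rough illustration of rules 1 and 2, the sketch below reshapes a small "messy" table into a tidy one using only the Python standard library. The table, its column names (`country`, `year`, `cases`), and its values are hypothetical and are not taken from the cards.

```python
# Hypothetical messy table: "year" is a variable, but its values
# (2019, 2020) are spread across column names instead of one column.
wide_rows = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]

# Tidy version: one column per variable, one row per observation.
tidy_rows = []
for row in wide_rows:
    for year in ("2019", "2020"):
        tidy_rows.append(
            {"country": row["country"], "year": int(year), "cases": row[year]}
        )

for row in tidy_rows:
    print(row)
```

Each tidy row now describes a single observation (one country in one year), which is what makes later grouping and summarizing straightforward.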
11
Q

Sample Set of Data

A

A sample is a subset of data drawn from the population of interest, and it should be representative of that entire population. Random sampling is the best way to get a representative sample, but it does not guarantee one, especially if the sample is too small.
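A minimal sketch of random sampling, assuming a simulated population of heights (the numbers 170 and 10 are made up for illustration). It draws a small and a large random sample so their means can be compared with the population mean.

```python
import random
from statistics import mean

# Fixed seed so the sketch is repeatable; the population is simulated.
rng = random.Random(0)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Draw random samples of two different sizes without replacement.
small_sample = rng.sample(population, 5)
large_sample = rng.sample(population, 1000)

print("population mean:", round(mean(population), 2))
print("small sample mean:", round(mean(small_sample), 2))
print("large sample mean:", round(mean(large_sample), 2))
```

The large sample's mean will usually land closer to the population mean than the small sample's, which is why very small random samples may still be unrepresentative.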

12
Q

Structurally Missing Data

A

Structurally Missing Data is data that is expected to be missing.
For example, there are structurally missing data in the ‘Litters’ and ‘Pups/Litter’ columns for all the male dogs in the table below because we would not expect male dogs to have puppies.

13
Q

Missing at Random Data (MAR Data)

A

Missing at Random (MAR) data is missing for a reason tied to some other observed characteristic of the person or thing being studied, not to chance alone. Often, this type of data is reliably missing based on the value of another variable in the dataset.

In the table below, the bacterial cell counts for all the stool samples are ‘NaN’. If we looked into this, we might find that there were too many bacterial cells to count in all those samples. Therefore, the bacterial cell counts for stool samples would be MAR data.
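One way to check whether missingness depends on another variable is to compute the fraction of missing values per group. The sketch below does this for a hypothetical set of samples; the sample types and counts are invented and do not reproduce the table the card refers to.

```python
from math import isnan

# Hypothetical (sample_type, bacterial_cell_count) pairs; NaN marks a missing count.
samples = [
    ("stool", float("nan")),
    ("stool", float("nan")),
    ("saliva", 120.0),
    ("skin", 45.0),
    ("saliva", 98.0),
]

# Fraction of missing counts within each sample type.
missing_rate = {}
for sample_type in sorted({t for t, _ in samples}):
    counts = [c for t, c in samples if t == sample_type]
    missing_rate[sample_type] = sum(isnan(c) for c in counts) / len(counts)

print(missing_rate)
```

A missing rate that differs sharply between groups suggests the values are not missing completely at random.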

14
Q

Missing Completely at Random (MCAR) data

A

Missing Completely at Random (MCAR) data has no detectable underlying reason causing the values to be missing.

The table below has MCAR data. The # of fruits is missing for some plants, but the missing fruit data seems unrelated to the height of the plant. Short and tall plants are both missing fruit data. In addition, we are missing the height for one of our plants!

15
Q

distribution

A

A distribution is a function that shows all possible values of a variable and how frequently each value occurs.
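For a small dataset, an empirical distribution can be sketched by counting how often each value occurs. The household data below is hypothetical.

```python
from collections import Counter

# Hypothetical sample: number of children in eight households.
children = [0, 2, 1, 2, 3, 2, 1, 0]

# Count how often each distinct value occurs.
frequencies = Counter(children)

# Print a simple text histogram, one row per value.
for value in sorted(frequencies):
    print(value, "#" * frequencies[value])
```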

16
Q

normal distribution

A

A normal distribution is bell-shaped (or hill-shaped) and symmetrical. It is a very common pattern in data.

Plot of a normal distribution. The bars in the plots become larger and then smaller moving left to right, forming a bell shape.

17
Q

mean

A

The mean, also called the average, describes the center of a numeric distribution by adding all values and dividing by the count.
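The calculation can be sketched in a few lines of Python. The ages are made-up values; `statistics.mean` is the standard-library equivalent of the manual formula.

```python
from statistics import mean

# Hypothetical ages of five people.
ages = [23, 31, 27, 45, 34]

# Add all values and divide by the count.
manual_mean = sum(ages) / len(ages)

# The standard library gives the same result.
library_mean = mean(ages)

print(manual_mean, library_mean)
```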

18
Q

standard deviation

A

The standard deviation describes the spread of values in a numeric distribution by measuring roughly how far values typically lie from the mean. More precisely, it is the square root of the average squared distance from the mean.
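A minimal sketch of the computation on made-up values. It uses the population form (dividing by the number of values); the sample form divides by one less than that, which is what `statistics.stdev` does.

```python
from math import sqrt
from statistics import pstdev

# Hypothetical measurements.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: find the mean.
center = sum(values) / len(values)

# Step 2: square each value's distance from the mean.
squared_distances = [(v - center) ** 2 for v in values]

# Step 3: average the squared distances and take the square root.
population_sd = sqrt(sum(squared_distances) / len(values))

print(population_sd, pstdev(values))
```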

19
Q

skewed distribution

A

A skewed distribution is asymmetrical, with a steep change in frequency on one side and a flatter, trailing change in frequency on the other. It is either right-skewed (positively skewed, tail on the right) or left-skewed (negatively skewed, tail on the left).

20
Q

median

A

The median is the middle value when all values are arranged from smallest to largest. It is also referred to as the 50th percentile or the second quartile (Q2).
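A short sketch with made-up numbers. With an odd count the median is the single middle value; with an even count, `statistics.median` averages the two middle values.

```python
from statistics import median

# Hypothetical values; median() sorts them internally.
odd_count = [7, 1, 5, 3, 9]
even_count = [7, 1, 5, 3]

median_odd = median(odd_count)    # middle of 1, 3, 5, 7, 9
median_even = median(even_count)  # average of 3 and 5

print(median_odd, median_even)
```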

21
Q

interquartile range (IQR)

A

The IQR is the difference between Q3 and Q1, marking the range for just the middle 50% of the data.

The first quartile (Q1) marks 25%.
The second quartile (Q2, the median) marks 50%.
The third quartile (Q3) marks 75%.
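The quartiles and IQR can be sketched with `statistics.quantiles`. The data is hypothetical, and the `"inclusive"` method is just one of several conventions for computing quartiles, so other tools may report slightly different cut points for the same data.

```python
from statistics import quantiles

# Hypothetical, already-sorted values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# The IQR spans the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```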

22
Q

outliers

A

Outliers are extreme values that are distant from the rest of the distribution. Like skewness, outliers tend to influence the mean more heavily than the median.

23
Q

Robust Measures

A

Because the median and IQR are NOT heavily influenced by extreme values, we say they are robust. Robust statistics are often a better choice to measure the center and spread of a distribution that is skewed or has outliers.
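The effect is easy to see on a small made-up dataset: adding one extreme value moves the mean a long way while the median barely changes.

```python
from statistics import mean, median

# Hypothetical incomes in thousands.
incomes = [30, 32, 35, 38, 40]

# The same data with a single extreme value added.
with_outlier = incomes + [1000]

print("mean:  ", mean(incomes), "->", round(mean(with_outlier), 2))
print("median:", median(incomes), "->", median(with_outlier))
```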

24
Q

mode

A

The mode is defined as the value with the highest frequency, but we can also think of the mode as the value where the peak of the distribution occurs. While not great for computations, the mode can help us identify interesting features in a variable.
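Because the mode only needs counts, it also works for categorical data. A small sketch with hypothetical hair colors:

```python
from statistics import mode

# Hypothetical categorical observations.
hair_colors = ["brown", "black", "brown", "blond", "brown", "black"]

# The most frequent value.
most_common_color = mode(hair_colors)

print(most_common_color)
```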

25
Q

bimodal

A

We would call a distribution bimodal when it has two modes.
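As a simplified sketch, `statistics.multimode` returns every value tied for the highest frequency. The commute times are made up. In practice, a distribution is usually called bimodal when it shows two distinct peaks, even if the peaks are not exactly the same height, so an exact tie is only an illustrative stand-in.

```python
from statistics import multimode

# Hypothetical commute times in minutes, with two equally common values.
commute_minutes = [10, 10, 10, 15, 20, 40, 45, 45, 45, 50]

# All values that share the highest frequency.
modes = multimode(commute_minutes)
is_bimodal = len(modes) == 2

print(modes, is_bimodal)
```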

26
Q

Aggregate Data

A

Aggregating data means separating it into groups and then summarizing each group with a single statistic, such as the mean.
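A minimal sketch of this separate-then-summarize pattern using only the standard library. The species and weights are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (species, weight_kg) observations.
rows = [("cat", 4.0), ("dog", 20.0), ("cat", 5.0), ("dog", 30.0), ("dog", 25.0)]

# Separate: collect the weights for each species.
weights_by_species = defaultdict(list)
for species, weight in rows:
    weights_by_species[species].append(weight)

# Summarize: one mean per group.
mean_weight = {s: mean(w) for s, w in weights_by_species.items()}

print(mean_weight)
```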

27
Q

scatter plot

A

A scatter plot is a cloud of points that shows the relationship between two quantitative variables, with one variable on each axis.

28
Q

correlation coefficient

A

We can describe a relationship more precisely by measuring the correlation coefficient. This number ranges from -1 to +1 and tells us two things about a linear relationship:

Direction: A positive coefficient means that higher values in one variable are associated with higher values in the other. A negative coefficient means higher values in one variable are associated with lower values of the other.
Strength: The farther the coefficient is from 0, the stronger the relationship and the more the points in a scatter plot look like a line.
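The sketch below computes the Pearson correlation coefficient, the most common such measure, from its definition. The card does not name a specific coefficient, so treating it as Pearson's is an assumption. The function name `pearson_r` and the data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Hypothetical data: scores that rise or fall in a perfect line with hours.
hours = [1, 2, 3, 4, 5]
scores_rising = [52, 54, 56, 58, 60]
scores_falling = [60, 58, 56, 54, 52]

print(pearson_r(hours, scores_rising))   # close to +1
print(pearson_r(hours, scores_falling))  # close to -1
```

Real data rarely falls on a perfect line, so coefficients strictly between -1 and +1 are the norm.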

29
Q

Descriptive analysis

A

In descriptive analyses, we calculate measures of central tendency and spread to summarize major patterns in a dataset.

Examples of measures of central tendency include: mean, median, mode.

Examples of measures of spread include: range, interquartile range, standard deviation, variance.

Descriptive analyses also often include plots that help visualize measures of central tendency and spread. Common examples are box plots and histograms.

One limit of descriptive analysis is that the conclusions we draw cannot be extended beyond the data we directly analyzed.

For example, if we do a descriptive analysis on a dataset of household water usage in one region, we might find that the mean water usage is increasing over time. However, we would not be able to conclude anything about the mean water usage in other regions.

30
Q

Exploratory Analysis

A

Exploratory data analysis looks for relationships between variables within a dataset. Exploratory analyses might reveal correlations between variables or group subsets of data based on shared characteristics.

31
Q

Correlation and Causation

A

Correlation between variables does not necessarily mean a causal relationship exists between those variables.

For example, the divorce rate in Maine and margarine consumption are correlated, but margarine consumption does not cause divorces and divorce does not cause margarine consumption.

32
Q

Inferential Analysis

A

Inferential analysis lets us draw conclusions about an entire population based on results from a subset or sample of that population. A/B testing, where we test which online feature performs better with a sample of a population, is a popular business application of inferential analysis.

Requirements for Inferential Analysis:
Inferential analysis is a powerful tool. As a result, several rules need to be followed for the analysis to be valid:

The sample selected must be “big enough” in comparison to the population. 10% is a good rule-of-thumb.
The sample should be randomly selected and representative of the total population.
Only test one hypothesis at a time. Manipulating more than one variable makes it impossible to tell which variable influenced the outcome.

33
Q

Causal Analysis

A

Causal analysis coupled with careful experimental design lets us go beyond correlation and actually assign causation.

Key factors of good experimental design are:

Control: only one variable is changed at a time and the rest are kept from influencing the outcome of the experiment.
Randomization: subjects are randomly selected and randomly assigned to treatment groups.
Replication: many subjects are included in the experiment and the experiment is repeated with the same results.

34
Q

Causal Analysis with Observational Data

A

Sometimes we need to know why something happened but we cannot perform the necessary experiments because they are too expensive, unethical, or otherwise impossible. In such cases, we may be able to do causal analysis on observational data but it requires meeting strict assumptions and applying advanced techniques.

For example, climate scientists apply advanced causal analysis techniques to determine whether global climate change impacts local weather systems since planet-scale experiments are impossible.

35
Q

Predictive analysis

A

Predictive analysis takes advantage of supervised machine learning techniques to estimate the likelihood of future outcomes.

For example, recommendation algorithms use the preferences of many other people together with your previous choices to predict what you are most likely to enjoy.

36
Q

Principal Component Analysis (PCA)

A

Principal Component Analysis (PCA) compresses the variables into a smaller number of principal components that can be plotted against each other. After PCA, we can use k-means clustering to look for groupings in the data. In the penguin example, the penguins fall into three distinct clusters in the PCA plot.