Statistics Flashcards
Binary Categorical Variables
Categorical variables can also be binary, or dichotomous, variables. Binary variables are nominal categorical variables that contain only two mutually exclusive categories. Examples of binary variables are whether a person is pregnant, or whether a house’s price is above or below a particular value.
Dichotomous variables
Dichotomous variables are another name for binary variables: nominal categorical variables that contain only two mutually exclusive categories. Examples are whether a person is pregnant, or whether a house’s price is above or below a particular value.
Ordinal categorical variables
Categorical variables consist of data that can be grouped into distinct categories, and are either ordinal or nominal. Ordinal categorical variables are groups that contain an inherent ranking, such as ratings of plays or responses to a survey question with a point scale (e.g., on a scale from 1 to 7, how happy are you right now?).
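Because ordinal categories carry a ranking, operations that depend on order (sorting, taking a median category) are meaningful for them. The sketch below illustrates this with made-up survey responses; the `RANK` mapping and the response values are assumptions for illustration only.

```python
# Hypothetical survey responses on an agreement scale (illustrative values only).
responses = ["Agree", "Disagree", "Neutral", "Agree", "Disagree"]

# Ordinal variable: the categories have an inherent rank, so we state it explicitly.
RANK = {"Disagree": 0, "Neutral": 1, "Agree": 2}

# Sorting by rank is meaningful for ordinal data.
sorted_responses = sorted(responses, key=RANK.get)

# The median category is well defined because the order is.
median_response = sorted_responses[len(sorted_responses) // 2]

print(sorted_responses)
print(median_response)
```

For a nominal variable such as hair color, no such rank mapping exists, so sorting would only reflect an arbitrary choice like alphabetical order.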
Nominal categorical variables
Nominal categorical variables are made of categories without an inherent order. Examples of nominal variables are species of ants and people’s hair color.
Quantitative variables
Quantitative variables are amounts or counts; for example, age, number of children, and income are all quantitative variables.
Categorical Variables
Categorical variables represent groupings; for example, type of pet, agreement rating, and brand of shoes are all categorical variables.
Categorical Data
Categorical data refers to data represented by labels, typically words, rather than by numeric measurements. Examples of categorical data are tree species and survey responses (Agree, Neutral, Disagree).
Messy Data
Messy data is data that violates one of the tidy dataset rules (1. Each variable forms a column; 2. Each observation forms a row; 3. Each type of observational unit forms a table).
Tabular Data
Tabular data is organized into rows, or observations, along the vertical axis, and columns, also referred to as variables or features, along the horizontal axis.
Tidy Data Rules
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
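As a rough illustration of the first two rules, the sketch below reshapes a small "messy" table, in which one variable is hidden in the column headers, into a tidy one. The data, the column names, and the helper `to_tidy` are all hypothetical and exist only for this example.

```python
# Hypothetical "messy" table: one row per plant, with each week's height
# spread across separate columns (the variable "week" is hidden in the headers).
messy = [
    {"plant": "A", "week1": 2.0, "week2": 3.5},
    {"plant": "B", "week1": 1.5, "week2": 2.5},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Reshape wide rows into long rows: one observation per row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on each output row
            tidy_rows.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy_rows


tidy = to_tidy(messy, id_key="plant", var_name="week", value_name="height_cm")
for record in tidy:
    print(record)
```

After reshaping, each variable (plant, week, height) has its own column and each measurement has its own row.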
Sample Set of Data
A sample set of data is a subset of the population of interest that is ideally representative of the entire population. Random sampling is the best way to make the sample representative, but it does not guarantee a representative sample, especially if the sample is too small.
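A minimal sketch of simple random sampling, assuming a made-up population of ages generated only for illustration. It draws one small and one larger sample so their means can be compared with the population mean; the seed, sizes, and variable names are assumptions.

```python
import random

# Hypothetical population: 1,000 ages generated purely for illustration.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.randint(18, 90) for _ in range(1000)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Simple random samples drawn without replacement.
small_sample = rng.sample(population, 10)
large_sample = rng.sample(population, 300)

print("population mean:", mean(population))
print("small sample mean:", mean(small_sample))
print("large sample mean:", mean(large_sample))
```

Small samples tend to give more variable estimates than large ones, although any single draw can land close to or far from the population value.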
Structurally Missing Data
Structurally Missing Data is data that is expected to be missing.
For example, there are structurally missing data in the ‘Litters’ and ‘Pups/Litter’ columns for all the male dogs in the table below because we would not expect male dogs to have puppies.
Missing at Random Data (MAR Data)
Missing at Random (MAR) data is missing for reasons related to some other characteristic of the person or thing being studied, rather than purely by chance. Often, this type of data is reliably missing based on the value of another variable in the dataset.
In the table below, the bacterial cell counts for all the stool samples are ‘NaN’. If we looked into this, we might find that there were too many bacterial cells to count in all those samples. Therefore, the bacterial cell counts for stool samples would be MAR data.
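One simple way to look for this pattern is to compare the share of missing values across the groups of another variable. The sketch below does that on invented records; the field names, the values, and the helper `missing_rate_by_group` are hypothetical and do not reproduce the table referred to above.

```python
import math
from collections import defaultdict

# Hypothetical records (illustrative only): cell counts are NaN for stool samples.
samples = [
    {"sample_type": "stool", "cell_count": math.nan},
    {"sample_type": "stool", "cell_count": math.nan},
    {"sample_type": "saliva", "cell_count": 120.0},
    {"sample_type": "saliva", "cell_count": 95.0},
    {"sample_type": "skin", "cell_count": 40.0},
]


def missing_rate_by_group(rows, group_key, value_key):
    """Return the fraction of NaN values of value_key within each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if math.isnan(row[value_key]):
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(samples, "sample_type", "cell_count")
print(rates)
```

Missing rates that differ sharply between groups suggest the missingness depends on that variable; similar rates across groups are more in line with values missing completely at random.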
Missing Completely at Random (MCAR) data
Missing Completely at Random (MCAR) data has no detectable underlying reason causing the values to be missing.
The table below has MCAR data. The # of fruits is missing for some plants, but the missing fruit data seems unrelated to the height of the plant. Short and tall plants are both missing fruit data. In addition, we are missing the height for one of our plants!
Distribution
A distribution is a function that shows all possible values of a variable and how frequently each value occurs.
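For a small dataset, an empirical distribution can be sketched by counting how often each value occurs and dividing by the number of observations. The observations below are invented for illustration.

```python
from collections import Counter

# Hypothetical observations: number of children per household (illustrative only).
children = [0, 2, 1, 2, 3, 0, 2, 1]

# Absolute frequency of each observed value.
counts = Counter(children)

# Relative frequency: the share of observations taking each value.
n = len(children)
relative = {value: counts[value] / n for value in sorted(counts)}

for value, share in relative.items():
    print(value, counts[value], share)
```

The relative frequencies sum to 1, which is what lets them be read as the proportion of observations at each value.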