Descriptive Statistics Flashcards
What are descriptive statistics
Simple summaries of the qualities of a dataset. They give quick insight into the data, and most of them fall into three camps: measures of central tendency, measures of variability, and measures of frequency distribution
What are the measures of central tendency in descriptive statistics
The measures of central tendency in descriptive statistics are the mean, median, and mode. They describe the central portions of the data
What are the measures of variability in descriptive statistics
The measures of variability in descriptive statistics are the variance, standard deviation, range, and interquartile range. They describe the spread of the data
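As a rough illustration of the four measures named above, the sketch below uses Python's standard `statistics` module on a made-up list of values. It assumes the data is the whole population (hence `pvariance` and `pstdev`), and the quartiles use the module's default convention; other conventions can give slightly different interquartile ranges.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, for illustration only

variance = statistics.pvariance(data)   # population variance
std_dev = statistics.pstdev(data)       # population standard deviation
value_range = max(data) - min(data)     # range: largest value minus smallest

# Cut points that split the sorted data into four equal-probability groups
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # interquartile range

print("variance:", variance)
print("standard deviation:", std_dev)
print("range:", value_range)
print("interquartile range:", iqr)
```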
What are the measures of frequency distributions in descriptive statistics
The measures of frequency distributions in descriptive statistics are counts and histograms. They describe the occurrences of the different observations
What are the limitations of using descriptive statistics
Descriptive statistics often boil some component of the data down to a single value, which gives a simplified insight but can be misleading and hide underlying information. It is important to understand their limitations and use them appropriately
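One way to see this limitation is to compare two datasets that share the same mean and median but look nothing alike. The values below are invented for illustration; the standard deviation is included only to show what the single "centre" value hides.

```python
import statistics

steady = [5, 5, 5, 5, 5]        # hypothetical: no variation at all
spread_out = [1, 3, 5, 7, 9]    # hypothetical: same centre, wide spread

for name, data in (("steady", steady), ("spread_out", spread_out)):
    print(
        name,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "std dev:", statistics.pstdev(data),
    )
```

Both lists report the same mean and median, so either number on its own would suggest the datasets are alike.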
What is the arithmetic mean μ
The arithmetic mean μ describes the average across the population. It is calculated by summing all the samples $x_i$ and dividing by the total number of samples $n$: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$. Here $x_i$ is the i-th of the $n$ samples in the population and Σ is the summation operator; its subscript gives the iterator and its superscript gives the upper limit
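A minimal sketch of that calculation in plain Python. The function name `arithmetic_mean` and the sample heights are made up for illustration.

```python
def arithmetic_mean(samples):
    """Sum every sample and divide by the number of samples."""
    n = len(samples)
    if n == 0:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(samples) / n


heights_cm = [150, 160, 170, 180]   # hypothetical observations
print(arithmetic_mean(heights_cm))  # 165.0
```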
Why is the axis along which we perform operations important
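The axis matters because it determines which values are averaged together. If each row is a sample and each column is a feature, taking the mean down a column gives the average of one feature across all samples, whereas taking the mean along the horizontal axis (across a row) mixes different features of a single sample and may not make sense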
What is the formula for taking the mean along the axis
$\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, where:
$\mu_j$ is the mean value of the j-th feature/variable
$n$ is the number of samples/observations in the dataset
$x_{ij}$ is the value of the j-th feature/variable for the i-th sample/observation
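The per-feature mean can be sketched with a small table stored as a list of rows. The layout (rows are samples, columns are features) and the height and weight values are assumptions made for this example.

```python
# Rows are samples (index i), columns are features (index j).
# Hypothetical data: column 0 is height in cm, column 1 is weight in kg.
dataset = [
    [170.0, 65.0],
    [180.0, 80.0],
    [160.0, 50.0],
]

n = len(dataset)

# zip(*dataset) regroups the values column by column, so each
# `column` holds one feature across all samples.
feature_means = [sum(column) / n for column in zip(*dataset)]
print(feature_means)  # [170.0, 65.0]

# Averaging across a row instead mixes centimetres with kilograms,
# which yields a number but not a meaningful one.
row_means = [sum(row) / len(row) for row in dataset]
print(row_means)  # [117.5, 130.0, 105.0]
```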
What are the pros of taking the arithmetic mean
The pros of taking the arithmetic mean are that you don’t need to sort the data, it treats all samples equally, and it is commonly used, so many are familiar with what it represents
What are the cons of taking the arithmetic mean
The arithmetic mean is sensitive to outliers, requires iterating over all samples, and is not suitable for categorical data
What is the median
A description of the centre of the population given by the middle value of the ordered list of observed values. About 50% of the observations lie above it and about 50% below
How is the median calculated for a list x of length n
First sort x in ascending order, then:
If n is odd:
$\text{Median} = x_{(n-1)/2}$
If n is even:
$\text{Median} = \frac{x_{n/2-1} + x_{n/2}}{2}$
Here we index from 0, but you may see indexing from 1
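The two cases can be sketched in Python, which also indexes from 0. The function name `median` here is just for illustration; the standard library offers `statistics.median` for real use.

```python
def median(values):
    """Return the middle value of the data, sorting a copy first."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: (n - 1) / 2 equals n // 2, the single middle index
        return ordered[mid]
    # Even n: average the two middle values at n/2 - 1 and n/2
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```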
What are the pros of using the median
Robust to a few outliers. Identifies the middle of the dataset. When combined with the mean, it gives a sense of the skew in our data. Once the data is sorted, there’s no need to iterate over the entire set
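A small sketch of the robustness and skew points, using invented salary figures. Adding one extreme value drags the mean far more than the median, and the gap between the two hints at the skew.

```python
import statistics

salaries = [30, 32, 35, 38, 40]    # hypothetical values, in thousands
with_outlier = salaries + [400]    # one extreme observation added

for label, data in (("original", salaries), ("with outlier", with_outlier)):
    mean = statistics.mean(data)
    med = statistics.median(data)
    # A mean well above the median suggests a long tail of high values
    print(f"{label}: mean={mean:.1f}, median={med:.1f}, gap={mean - med:.1f}")
```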
What are the cons of using the median
The data must be sorted, which can be expensive. The calculation differs depending on whether n is odd or even. It is not suitable for categorical data
What is the mode
The mode is a measure of central tendency that identifies the most frequently occurring value in the dataset
What type of data is the mode best suited for
The mode is best suited for categorical data or discrete variables
What are the pros of using the mode
The mode works well for categorical data and can give some insight into continuous data if we aggregate it well. It identifies the most common observation, and there is no need to sort the data
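One possible reading of "aggregating well" for continuous data is to group values into bins and take the most common bin. The sketch below assumes a bin width of 10 and made-up ages; a different width could produce a different modal bin.

```python
from collections import Counter

ages = [21.4, 22.9, 23.1, 35.0, 36.2, 24.8, 29.5]  # hypothetical values
bin_width = 10  # assumed width; the choice affects the result

# Map each value to the lower edge of its bin, e.g. 23.1 -> 20
binned = Counter(int(age // bin_width) * bin_width for age in ages)

modal_bin, modal_count = binned.most_common(1)[0]
print(f"most common bin: {modal_bin} to {modal_bin + bin_width}, count {modal_count}")
```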
What are the cons of using the mode
The cons of using the mode are that it must be counted or aggregated, there may be multiple modes, and it is not always a good reflection of the dataset as a whole
How is the mode calculated
The mode is calculated by finding the value that occurs most frequently in the dataset
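A minimal sketch of that counting step using `collections.Counter`. The helper name `modes` is made up; it returns a list because several values can tie for most frequent.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest count."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes(["cat", "dog", "cat", "bird"]))  # ['cat']
print(modes([1, 1, 2, 2, 3]))                # [1, 2]
```

The standard library's `statistics.mode` and `statistics.multimode` cover the same ground.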
What is frequency distribution
Frequency distribution is a way to describe the frequency of occurrence for observations within the population
What insights can frequency distribution provide
A frequency distribution provides insight into questions such as how observations are spread across categories (for example, animals or age groups) and where values are concentrated
What is the purpose of frequency distribution
To give a summary of the number of observations and the frequency of each value
What type of data is best suited for mean
Continuous numerical data
What is the purpose of frequency distribution counts
Frequency distribution counts provide insights into the number of times each observation occurs within a population
What information can frequency distribution counts provide
Frequency distribution counts can provide information on the relative frequency of each observation, the most common observations, and the spread or dispersion of the data
How are frequency distribution counts calculated
Frequency distribution counts are calculated by counting the number of times each observation appears within a population and presenting the results in a table or graph
What is the difference between frequency distribution counts and frequency distribution percentages
Frequency distribution counts represent the actual number of times each observation appears within a population, while frequency distribution percentages represent the proportion of times each observation appears within a population
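The difference can be sketched by building both from the same made-up observations. The counts are raw tallies, and the percentages divide each tally by the total.

```python
from collections import Counter

observations = ["cat", "dog", "cat", "bird", "cat", "dog"]  # hypothetical data

counts = Counter(observations)   # how many times each value appears
total = sum(counts.values())     # total number of observations

# Convert each count into a share of the total, as a percentage
percentages = {value: 100 * count / total for value, count in counts.items()}

# Print a small frequency table, most common value first
for value, count in counts.most_common():
    print(f"{value}: {count} ({percentages[value]:.1f}%)")
```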
Why is it important to understand frequency distribution counts
Understanding frequency counts can provide valuable insight into the characteristics of a population and can be used to make informed decisions based on the data