Descriptive Statistics Flashcards
What are descriptive statistics
Simple descriptions of the qualities of a dataset. They can be used as a quick insight into a dataset and most of these descriptions fall into three camps: measures of the central tendency, measures of variability, and measures of frequency distributions
What are the measures of central tendency in descriptive statistics
The measures of central tendency in descriptive statistics are the mean, median, and mode. They describe the central portions of the data
What are the measures of variability in descriptive statistics
The measures of variability in descriptive statistics are the variance, standard deviation, range, and interquartile range. They describe the spread of the data
What are the measures of frequency distributions in descriptive statistics
The measures of frequency distributions in descriptive statistics are counts and histograms. They describe the occurrences of the different observations
What are the limitations of using descriptive statistics
Descriptive statistics often boil down some component of the data into a singular value, which provides a simplified insight but can be misleading and hide underlying information. It is important to understand their limitations and use them appropriately
What is the arithmetic mean μ
The arithmetic mean μ is a description of the average across the population. It is calculated by taking the sum of all the samples xi and dividing it by the total number of samples n, expressed as: μ = Σ(xi) / n, where xi is a sample from the n samples in the population, Σ is the symbol used as a "sum" operator, the subscript denotes the iterator, and the superscript denotes its limit
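The formula above can be sketched in Python (the function name is illustrative):

```python
def arithmetic_mean(samples):
    # mu = sum of all samples divided by the number of samples n
    return sum(samples) / len(samples)

print(arithmetic_mean([2, 4, 6, 8]))  # 5.0
```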
Why is the axis along which we perform operations important
The axis along which we perform operations is important because an operation that is meaningful along one axis may be meaningless along another. For example, averaging a single feature down the rows (across all samples) makes sense, while averaging unrelated features across a single row usually does not
What is the formula for taking the mean along the axis
μj = (1/n) * Σi=1 to n (xij), where:
μj is the mean value of the j-th feature/variable
n is the number of samples/observations in the dataset
xij is the value of the j-th feature/variable for the i-th sample/observation
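A minimal pure-Python sketch of this per-feature mean, assuming the dataset is stored as a list of rows where each row is one sample and each column j is one feature:

```python
def column_means(data):
    # mean of each feature j, averaged over the n samples (rows)
    n = len(data)
    n_features = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(n_features)]

data = [[1.0, 10.0],
        [3.0, 30.0]]
print(column_means(data))  # [2.0, 20.0]
```

With a library such as NumPy this is the same idea as taking the mean along the sample axis.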
What are the pros of taking the arithmetic mean
The pros of taking the arithmetic mean are that you don’t need to sort the data, it treats all samples equally, and it is commonly used, so many are familiar with what it represents
What are the cons of taking the arithmetic mean
The cons of taking the arithmetic mean are that it is sensitive to outliers, it requires iterating over all samples, and it is not suitable for categorical data
What is the median
A description of the centre of the population by the middle value from the ordered list of the observed values. 50% of the observations will be above it, and 50% below
How is the median calculated for a list x of length n
If n is odd:
Median = x[(n - 1)/2]
If n is even:
Median = (x[n/2 - 1] + x[n/2]) / 2
Here we index from 0, but you may see indexing from 1
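Both cases, with 0-based indexing, can be sketched as:

```python
def median(values):
    xs = sorted(values)            # the median requires sorted data
    n = len(xs)
    if n % 2 == 1:                 # odd: the single middle element
        return xs[(n - 1) // 2]
    # even: average of the two middle elements
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```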
What are the pros of using the median
The median is the most robust of these measures to a few outliers, it identifies the middle of the dataset, and when combined with the mean it gives a sense of the skew in our data; once the data is sorted, there is no need to iterate over the entire set
What are the cons of using the median
Must sort the data, which can be expensive. Different approach based on if n is odd or even. Not for categorical data
What is the mode
The mode is a measure of central tendency that identifies the most frequently occurring value in the dataset
What type of data is the mode best suited for
The mode is best suited for categorical data or discrete variables
What are the pros of using the mode
The mode is good for categorical data, and it can give some insight into continuous data if we aggregate (bin) it well; it identifies the most common observation, and there is no need to sort the data
What are the cons of using the mode
The cons of using the mode are that the values must be counted or aggregated, there may be multiple modes, and it is not always a good reflection of the dataset as a whole
How is the mode calculated
The mode is calculated by finding the value that occurs most frequently in the dataset
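One straightforward way to do this counting in Python is with the standard library's Counter (if several values tie for most frequent, this sketch returns the first one encountered):

```python
from collections import Counter

def mode(values):
    counts = Counter(values)             # frequency of each observation
    return counts.most_common(1)[0][0]   # the most frequently occurring value

print(mode(["cat", "dog", "cat", "bird"]))  # cat
```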
What is frequency distribution
Frequency distribution is a way to describe the frequency of occurrence for observations within the population
What insights can frequency distribution provide
Frequency distribution provides insight into questions such as how observations are spread across categories (e.g. animal species or age groups) and whether values are concentrated in certain areas
What is the purpose of frequency distribution
To give a summary of the number of observations and the frequency of each value
What type of data is best suited for mean
Continuous numerical data
What is the purpose of frequency distributions counts
Frequency distribution counts provide insights into the number of times each observation occurs within a population
What information can frequency distribution counts provide
Frequency distribution counts can provide information on the relative frequency of each observation, the most common observations, and the spread or dispersion of the data
How are frequency distribution counts calculated
Frequency distributions counts are calculated by counting the number of times each observation appears within a population and presenting the results in a table or graph
What is the difference between frequency distribution counts and frequency distribution percentages
Frequency distribution counts represent the actual number of times each observation appears within a population, while frequency distribution percentages represent the proportion of times each observation appears within a population
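Both counts and percentages can be derived from the same tally; a small Python sketch:

```python
from collections import Counter

obs = ["cat", "dog", "cat", "cat", "bird"]
counts = Counter(obs)                                    # raw counts per observation
total = sum(counts.values())
percentages = {k: v / total for k, v in counts.items()}  # proportions per observation

print(counts["cat"], percentages["cat"])  # 3 0.6
```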
Why is it important to understand frequency distribution counts
Understanding frequency counts can provide valuable insight into the characteristics of a population and can be used to make informed decisions based on the data
What is a histogram in the context of frequency distribution
A histogram is a way of visually representing a frequency distribution for a continuous feature. It divides the observed range into fixed-sized sub-partitions called “bins” and counts the number of observations that fall within each bin
How are values represented in a histogram
Values are represented on the x-axis of a histogram, while the y-axis shows the frequency (count) of observations that fall within each bin
What is the difference between a closed and an open interval in a histogram
A closed interval includes its endpoints, while an open interval does not. In notation, a closed interval is denoted with square brackets [], while an open interval is denoted with parentheses ().
Intervals may mix the two, e.g. [0,10], (10,20], so adjacent bins do not double-count their shared boundary
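To make the binning convention concrete, here is a minimal histogram sketch using (low, high] bins, with the first bin also closed on the left so the minimum value is not dropped (the function name and signature are illustrative):

```python
import math

def histogram(values, low, high, n_bins):
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # bins are (low, high]; the very first bin also includes its left endpoint
        idx = 0 if v == low else math.ceil((v - low) / width) - 1
        counts[idx] += 1
    return counts

print(histogram([0, 10, 10.5, 20], low=0, high=20, n_bins=2))  # [2, 2]
```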
What is range
The range is the difference between the highest and lowest values in a sample
What does the range tell us
The range provides a basic description of how spread out a dataset is
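As a one-line sketch (the function name is illustrative):

```python
def value_range(samples):
    # difference between the highest and lowest observed values
    return max(samples) - min(samples)

print(value_range([3, 7, 1, 9]))  # 8
```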
What are the pros of using the range
The range is simple to calculate and easy to understand
What are the cons of using the range
The range is sensitive to outliers and may not provide a complete picture of the variability in the dataset
What does population variance describe
Population variance describes, on average, how the samples in a population vary from the mean
What is the formula for population variance
σ^2 = Σ(xi - μ)^2 / N
Where xi is a sample
μ is the mean of all samples
N is the total number of samples in the population
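The formula above translates directly to Python:

```python
def population_variance(samples):
    n = len(samples)
    mu = sum(samples) / n                           # population mean
    return sum((x - mu) ** 2 for x in samples) / n  # average squared deviation

print(population_variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0
```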
Can population variance have a negative value
No, it is always non-negative
How does a larger population variance affect the variability of the samples
A larger population variance indicates greater variability in the samples from the mean
What is the difference between population variance and sample variance
Population variance uses all individuals in the population to calculate the variance, while sample variance only uses a subset of individuals from the population
Why is sample variance used instead of population variance in some cases
Sample variance is used when we don’t have access to all individuals in the population and need to estimate the variance based on a sample
What is the formula for sample variance
The formula for sample variance is s^2 = ∑(xi - x̄)^2 / (n-1), where xi is the value of the ith observation, x̄ is the mean of the observations, and n is the number of observations in the sample.
Why is the denominator in the sample variance formula (n-1) instead of n
The denominator is (n-1) instead of n to correct for the bias that results from using a sample to estimate the population variance
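This (n-1) denominator is known as Bessel's correction; a minimal sketch:

```python
def sample_variance(samples):
    n = len(samples)
    xbar = sum(samples) / n  # sample mean
    # divide by (n - 1), not n, to correct the bias of estimating
    # the population variance from a sample (Bessel's correction)
    return sum((x - xbar) ** 2 for x in samples) / (n - 1)

print(sample_variance([2, 4, 6]))  # 4.0
```

This matches the stdlib's `statistics.variance`, while `statistics.pvariance` divides by n.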
What is the unit of measurement for sample variance
The unit of measurement for sample variance is the square of the unit of measurement for the observations in the sample. For example, if the observations are measured in inches, the unit of measurement for sample variance would be square inches
What is the unit of measurement for population variance
The unit of measurement for population variance is the square of the unit of measurement for the data. For example, if the data is measured in meters, the variance will be measured in square meters
What is standard deviation
A measure of the amount of variation or dispersion of a set of values from its mean. It is calculated as the square root of the variance of the dataset, which is the average of the squared differences from the mean
What is the difference between variance and standard deviation
Variance is the average of squared differences from the mean, while standard deviation is the square root of the variance
Why is standard deviation more useful than variance
Standard deviation is more useful than variance because it is in the same units as the original data making it easier to interpret and compare
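For a population, the standard deviation is simply the square root of the population variance:

```python
import math

def population_stdev(samples):
    n = len(samples)
    mu = sum(samples) / n
    variance = sum((x - mu) ** 2 for x in samples) / n
    return math.sqrt(variance)  # back in the same units as the data

print(population_stdev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```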
What is a binary label in machine learning
A binary label is a label that can take on only two possible values, such as “yes” or “no”, “true” or “false”, or 0 or 1. Binary labels are commonly used in classification tasks where the goal is to predict a binary outcome.
What are some examples of binary labels in machine learning?
Examples of binary labels include predicting whether an email is spam or not spam, whether a credit card transaction is fraudulent or not fraudulent, or whether a patient is at high risk or low risk for a particular disease.
How do you train a model using binary labels?
To train a model using binary labels, you typically need a dataset that includes examples of both positive and negative outcomes. You would then use this data to train a classification algorithm, such as logistic regression or a decision tree, to predict the binary outcome based on the input features.
What are some common evaluation metrics for models trained with binary labels?
Common evaluation metrics for binary classification models include accuracy, precision, recall, and F1 score. These metrics are used to measure the performance of the model in terms of its ability to correctly classify positive and negative examples.
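These four metrics can all be computed from the confusion-matrix counts; a minimal sketch assuming 0/1 labels and at least one true and one predicted positive (no divide-by-zero handling):

```python
def binary_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)                  # of predicted positives, how many were right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(acc, prec, rec, f1)
```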
What is supervised learning
Supervised learning is a type of machine learning in which the model is trained on a labelled dataset. The labelled data consists of input features and corresponding output values, or "labels", that the model is trying to predict. The goal of supervised learning is to train the model to accurately predict the output values for new, unseen input data
What are some examples of supervised learning
Predicting prices of a house based on its features, classifying emails as spam or not spam, and identifying the type of flower based on its measurements
What is unsupervised learning
Unsupervised learning is a type of machine learning in which the model is trained on an unlabelled dataset. The goal of unsupervised learning is to identify patterns or structure in the data without the use of predefined output values. This can be useful for tasks such as clustering, dimensionality reduction, and anomaly detection
What are some examples of unsupervised learning
Clustering similar customer profiles based on their purchasing behaviour, reducing the dimensionality of high-dimensional data for visualisation purposes, and identifying outliers in a dataset
What is the main difference between supervised and unsupervised learning
The use of labelled versus unlabelled data. In supervised learning, the model is trained on labelled data to predict output values. In unsupervised learning the model is trained on unlabelled data to identify patterns or structure in data
What is feature scaling
A technique used in machine learning to normalise the range of values for each feature in a dataset. This is important because individual features may have different distributions, ranges or units of measurements, which can cause issues for some machine learning algorithms
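Min-max scaling is one common feature-scaling technique (standardisation to zero mean and unit variance is another); a minimal sketch for a single feature:

```python
def min_max_scale(values):
    lo, hi = min(values), max(values)
    # rescales the feature to the range [0, 1];
    # assumes the feature is not constant (hi != lo)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```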
Why is data cleaning and preparation important in machine learning
Data cleaning and preparation is important because machine learning algorithms require high-quality, well-structured data to produce accurate and reliable results. Dirty or unstructured data, such as missing values, error, or outliers, can significantly impact the performance of a machine learning model
What are some common techniques used for data cleaning and preparation
Removing or imputing missing values, correcting errors in data entry or collection, standardising the formatting and structure of data, and identifying and removing outliers
How can you deal with multivariate data in machine learning
Dealing with multivariate data in machine learning requires careful consideration of the relationships between features. This may involve identifying causative variables, removing redundant or irrelevant features, and normalising the range of values for each feature using techniques such as feature scaling
What are some best practices for data cleaning and preparation in machine learning
Best practices for data cleaning and preparation in machine learning include thoroughly documenting data sources and cleaning processes, validating data quality and completeness, exploring the data for patterns and outliers and using established standards and frameworks for data management and preprocessing
What is missing data
Missing data refers to the absence of a value for a particular observation or variable in a dataset. This can occur for various reasons, such as data collection errors, data entry mistakes, or intentional omissions
Why is missing data a problem in machine learning
Missing data can cause problems in machine learning because many algorithms are designed to work with complete, high-quality data. Incomplete or missing data can lead to biased, inaccurate results and can even cause the model to fail completely
What are some techniques for imputing missing data
Techniques for imputing missing data include mean or median imputation, regression imputation, and k-nearest neighbour imputation. Each technique has its own strengths and weaknesses, and the choice of technique depends on the specific characteristics of the dataset and the research question being addressed
What are some best practices for handling missing data in machine learning
Best practices for handling missing data include carefully documenting the reasons for missing data, exploring the patterns of missing data in the dataset, considering both removal and imputation techniques, validating the quality of imputed data, and performing sensitivity analyses to assess the impact of missing data on the results
What is mean/median imputation
This technique involves replacing missing values with the mean or median value of the variable across all other observations. This is a simple and quick technique, but it can be biased if the missing values are not randomly distributed across the variable. Mean or median imputation is best suited for variables with a relatively symmetric distribution
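A minimal sketch of mean imputation, assuming missing entries are represented as None:

```python
def mean_impute(values):
    # mean of the observed (non-missing) entries only
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    # fill each missing entry with that mean
    return [mean if v is None else v for v in values]

print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```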
What is regression imputation
This technique involves using a regression model to predict the missing values based on other variables in the dataset. This approach assumes that the missing values are related to other variables in the dataset, and can be useful when there is a strong correlation between variables. However, regression imputation can be computationally intensive and may require careful model selection and validation
When is mean/median imputation best suited for
Mean or median imputation is best suited for variables with relatively symmetric distribution
What is k-nearest neighbour imputation
This technique involves identifying the k nearest observations to the observation with missing values and using their values to impute the missing data. This approach assumes that observations with similar values for other variables will have similar values for the variable with missing data.
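A simplified sketch of the idea (not a production implementation such as scikit-learn's KNNImputer): it assumes numeric features, missing values marked as None in a single target column, and distance measured over the remaining columns.

```python
def knn_impute(rows, target_col, k=2):
    # rows with an observed value in the target column are candidate neighbours
    complete = [r for r in rows if r[target_col] is not None]
    out = []
    for r in rows:
        if r[target_col] is not None:
            out.append(list(r))
            continue
        # squared Euclidean distance over the non-target columns
        def dist(other):
            return sum((a - b) ** 2
                       for j, (a, b) in enumerate(zip(r, other))
                       if j != target_col)
        neighbours = sorted(complete, key=dist)[:k]
        # impute with the mean of the k nearest neighbours' values
        imputed = sum(nb[target_col] for nb in neighbours) / k
        filled = list(r)
        filled[target_col] = imputed
        out.append(filled)
    return out

rows = [[1.0, 10.0], [2.0, 20.0], [1.5, None]]
print(knn_impute(rows, target_col=1, k=2))
```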
When is regression imputation best suited for
It is best suited when there is a strong relationship or correlation between the variable with missing values and other variables in the dataset