Descriptive Statistics Flashcards

1
Q

What are descriptive statistics

A

Simple descriptions of the qualities of a dataset. They provide quick insight into the data, and most of these descriptions fall into three camps: measures of central tendency, measures of variability, and measures of frequency distributions

2
Q

What are the measures of central tendency in descriptive statistics

A

The measures of central tendency in descriptive statistics are the mean, median, and mode. They describe the central portions of the data

3
Q

What are the measures of variability in descriptive statistics

A

The measures of variability in descriptive statistics are the variance, standard deviation, range, and interquartile range. They describe the spread of the data

4
Q

What are the measures of frequency distributions in descriptive statistics

A

The measures of frequency distributions in descriptive statistics are counts and histograms. They describe the occurrences of the different observations

5
Q

What are the limitations of using descriptive statistics

A

Descriptive statistics often boil some aspect of the data down to a single value, which provides a simplified insight but can be misleading and hide underlying information. It is important to understand their limitations and use them appropriately

6
Q

What is the arithmetic mean μ

A

The arithmetic mean μ describes the average across the population. It is calculated by taking the sum of all the samples xi and dividing it by the total number of samples n, expressed as: μ = (1/n) * Σi=1 to n (xi), where xi is one of the n samples in the population and Σ is the “sum” operator, whose subscript denotes the iterator and whose superscript denotes the limit
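
As a minimal sketch in Python (the numbers below are made up):

samples = [2.0, 4.0, 6.0, 8.0]        # hypothetical samples x1..xn
mean = sum(samples) / len(samples)    # (2 + 4 + 6 + 8) / 4 = 5.0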

7
Q

Why is the axis along which we perform operations important

A

The axis along which we perform operations is important because, when samples are rows and features are columns, taking the mean down each column (across samples) gives a meaningful per-feature average, whereas taking the mean along the horizontal axis (across a sample's different features) may not make sense

8
Q

What is the formula for taking the mean along the axis

A

μj = (1/n) * Σi=1 to n (xij), where:
μj is the mean value of the j-th feature/variable
n is the number of samples/observations in the dataset
xij is the value of the j-th feature/variable for the i-th sample/observation
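
A short sketch of the same idea, assuming NumPy is available and rows are samples while columns are features (the array is made up):

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])      # 3 samples (rows), 2 features (columns)

feature_means = X.mean(axis=0)   # mean of each feature across samples -> [2., 20.]
sample_means = X.mean(axis=1)    # mean across features for each sample -> often not meaningful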

9
Q

What are the pros of taking the arithmetic mean

A

The pros of taking the arithmetic mean are that you don’t need to sort the data, it treats all samples equally, and it is commonly used, so many are familiar with what it represents

10
Q

What are the cons of taking the arithmetic mean

A

The cons of taking the arithmetic mean are that it is sensitive to outliers, it must iterate over all samples, and it is not suitable for categorical data

11
Q

What is the median

A

A description of the centre of the population given by the middle value of the ordered list of observed values. 50% of the observations will be above it, and 50% below

12
Q

How is the median calculated for a list x of length n

A

If n is odd:
Median = x[(n-1)/2]
If n is even:
Median = (x[n/2 - 1] + x[n/2]) / 2
Here we index from 0, but you may see indexing from 1
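
A small Python sketch of the two cases (0-indexed, as above):

def median(values):
    x = sorted(values)                        # the median needs the ordered list
    n = len(x)
    if n % 2 == 1:
        return x[(n - 1) // 2]                # odd: the single middle value
    return (x[n // 2 - 1] + x[n // 2]) / 2    # even: average of the two middle values

median([5, 1, 3])      # 3
median([5, 1, 3, 7])   # (3 + 5) / 2 = 4.0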

13
Q

What are the pros of using the median

A

The median is robust to a few outliers and identifies the middle of the dataset; combined with the mean, it gives a sense of the skew in the data; and once the data is sorted there is no need to iterate over the entire set

14
Q

What are the cons of using the median

A

The data must be sorted, which can be expensive; the approach differs depending on whether n is odd or even; and it is not suitable for categorical data

15
Q

What is the mode

A

The mode is a measure of central tendency that identifies the most frequently occurring value in the dataset

16
Q

What type of data is the mode best suited for

A

The mode is best suited for categorical data or discrete variables

17
Q

What are the pros of using the mode

A

The mode is good for categorical data (and gives some insight into continuous data if we aggregate it well), it identifies the most common observation, and there is no need to sort the data

18
Q

What are the cons of using the mode

A

The cons of using the mode are that the values must be counted or aggregated, there may be multiple modes, and it is not always a good reflection of the dataset as a whole

19
Q

How is the mode calculated

A

The mode is calculated by finding the value that occurs most frequently in the dataset
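
For example, using Python's standard library (the observations are made up):

import statistics

observations = ["cat", "dog", "cat", "bird", "cat", "dog"]
statistics.mode(observations)        # "cat" - occurs most frequently (3 times)
statistics.multimode([1, 1, 2, 2])   # [1, 2] - a dataset can have more than one mode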

20
Q

What is frequency distribution

A

Frequency distribution is a way to describe the frequency of occurrence for observations within the population

21
Q

What insights can frequency distribution provide

A

Frequency distributions provide insight into questions such as how animals or age groups are distributed, and whether values are concentrated in certain areas

22
Q

What is the purpose of frequency distribution

A

To give a summary of the number of observations and the frequency of each value

23
Q

What type of data is best suited for mean

A

Continuous numerical data

24
Q

What is the purpose of frequency distribution counts

A

Frequency distribution counts provide insights into the number of times each observation occurs within a population

25
Q

What information can frequency distribution counts provide

A

Frequency distribution counts can provide information on the relative frequency of each observation, the most common observations, and the spread or dispersion of the data

26
Q

How are frequency distribution counts calculated

A

Frequency distribution counts are calculated by counting the number of times each observation appears within a population and presenting the results in a table or graph
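
A minimal sketch using Python's Counter (the observations are made up); the result can then be presented as a table or chart:

from collections import Counter

observations = ["dog", "cat", "dog", "bird", "dog", "cat"]
counts = Counter(observations)   # Counter({'dog': 3, 'cat': 2, 'bird': 1})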

27
Q

What is the difference between frequency distribution counts and frequency distribution percentages

A

Frequency distribution counts represent the actual number of times each observation appears within a population, while frequency distribution percentages represent the proportion of times each observation appears within a population
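
A short sketch converting made-up counts into percentages:

from collections import Counter

counts = Counter(["dog", "cat", "dog", "bird", "dog", "cat"])   # dog: 3, cat: 2, bird: 1
total = sum(counts.values())                                    # 6 observations
percentages = {value: 100 * n / total for value, n in counts.items()}
# {'dog': 50.0, 'cat': 33.3..., 'bird': 16.6...}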

28
Q

Why is it important to understand frequency distribution counts

A

Understanding frequency counts can provide valuable insight into the characteristics of a population and can be used to make informed decisions based on the data

29
Q

What is a histogram in the context of frequency distribution

A

A histogram is a way of visually representing a frequency distribution for a continuous feature. It divides the observed range into fixed-sized sub-partitions called “bins” and counts the number of observations that fall within each bin

30
Q

How are values represented in a histogram

A

Values are represented on the x-axis of a histogram, while the y-axis shows the frequency (count) of observations that fall within each bin

31
Q

What is the difference between a closed and an open interval in a histogram

A

A closed interval includes its endpoints, while an open interval does not. In notation, a closed interval is denoted with square brackets [], while an open interval is denoted with parentheses ().
E.g. [0,10], (10,20]: here the second interval is half-open, excluding 10 but including 20
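
A sketch using NumPy's histogram with made-up data; note that NumPy's bins are left-closed, the mirror image of the (10,20] example above, but the idea is the same:

import numpy as np

data = [1, 3, 7, 8, 9, 12, 15, 18, 19, 19]
counts, edges = np.histogram(data, bins=[0, 10, 20])
# counts -> [5, 5]; bins are [0, 10) and [10, 20] (only the last bin includes its right edge)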

32
Q

What is range

A

The range is the difference between the highest and lowest values in a sample
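
For example, with a made-up sample:

data = [3, 7, 1, 9, 4]
value_range = max(data) - min(data)   # 9 - 1 = 8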

33
Q

What does the range tell us

A

The range provides a basic description of how spread out a dataset is

34
Q

What are the pros of using the range

A

The range is simple to calculate and easy to understand

35
Q

What are the cons of using the range

A

The range is sensitive to outliers and may not provide a complete picture of the variability in the dataset

36
Q

What does population variance describe

A

Population variance describes, on average, how the samples in a population vary from the mean

37
Q

What is the formula for population variance

A

σ^2 = Σ(xi - μ)^2 / N
Where xi is a sample
μ is the mean of all samples
N is the total number of samples in the population
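
A worked Python example with made-up data (statistics.pvariance computes the same quantity):

import statistics

population = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = sum(population) / len(population)                                      # 5.0
sigma_squared = sum((x - mu) ** 2 for x in population) / len(population)    # 32 / 8 = 4.0
statistics.pvariance(population)                                            # also 4.0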

38
Q

Can population variance have a negative value

A

No, it is always non-negative

39
Q

How does a larger population variance affect the variability of the samples

A

A larger population variance indicates greater variability in the samples from the mean

40
Q

What is the difference between population variance and sample variance

A

Population variance uses all individuals in the population to calculate the variance, while sample variance only uses a subset of individuals from the population

41
Q

Why is sample variance used instead of population variance in some cases

A

Sample variance is used when we don’t have access to all individuals in the population and need to estimate the variance based on a sample

42
Q

What is the formula for sample variance

A

The formula for sample variance is s^2 = ∑(xi - x̄)^2 / (n-1), where xi is the value of the ith observation, x̄ is the mean of the observations, and n is the number of observations in the sample.
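
A sketch with made-up values; Python's statistics.variance also divides by n - 1:

import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_bar = sum(sample) / len(sample)
s_squared = sum((x - x_bar) ** 2 for x in sample) / (len(sample) - 1)   # 32 / 7 ≈ 4.57
statistics.variance(sample)                                             # same value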

43
Q

Why is the denominator in the sample variance formula (n-1) instead of n

A

The denominator is (n-1) instead of n to correct for the bias that results from using a sample to estimate the population variance

44
Q

What is the unit of measurement for sample variance

A

The unit of measurement for sample variance is the square of the unit of measurement for the observations in the sample. For example, if the observations are measured in inches, the unit of measurement for sample variance would be square inches

45
Q

What is the unit of measurement for population variance

A

The unit of measurement for population variance is the square of the unit of measurement for the data. For example, if the data is measured in meters, the variance will be measured in square meters

46
Q

What is standard deviation

A

A measure of the amount of variation or dispersion of a set of values from its mean. It is calculated as the square root of the variance of the dataset, which is the average of the squared differences from the mean
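
For example, with made-up data:

import math, statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
math.sqrt(statistics.pvariance(data))   # population standard deviation = 2.0
statistics.stdev(data)                  # sample standard deviation (uses n - 1) ≈ 2.14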

47
Q

What is the difference between variance and standard deviation

A

Variance is the average of squared differences from the mean, while standard deviation is the square root of the variance

48
Q

Why is standard deviation more useful than variance

A

Standard deviation is more useful than variance because it is in the same units as the original data, making it easier to interpret and compare

49
Q

What is a binary label in machine learning

A

A binary label is a label that can take on only two possible values, such as “yes” or “no”, “true” or “false”, or 0 or 1. Binary labels are commonly used in classification tasks where the goal is to predict a binary outcome.

50
Q

What are some examples of binary labels in machine learning?

A

Examples of binary labels include predicting whether an email is spam or not spam, whether a credit card transaction is fraudulent or not fraudulent, or whether a patient is at high risk or low risk for a particular disease.

51
Q

How do you train a model using binary labels?

A

To train a model using binary labels, you typically need a dataset that includes examples of both positive and negative outcomes. You would then use this data to train a classification algorithm, such as logistic regression or a decision tree, to predict the binary outcome based on the input features.
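
A minimal sketch assuming scikit-learn is available; the features and labels below are invented:

from sklearn.linear_model import LogisticRegression

X = [[0.2, 1.0], [0.9, 3.0], [0.1, 0.5], [0.8, 2.5]]   # input features
y = [0, 1, 0, 1]                                       # binary labels (e.g. 0 = not spam, 1 = spam)

model = LogisticRegression()
model.fit(X, y)               # learn to predict the binary label from the features
model.predict([[0.7, 2.0]])   # predicted class for a new, unseen sample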

52
Q

What are some common evaluation metrics for models trained with binary labels?

A

Common evaluation metrics for binary classification models include accuracy, precision, recall, and F1 score. These metrics are used to measure the performance of the model in terms of its ability to correctly classify positive and negative examples.
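
A worked example with scikit-learn's metric functions and made-up predictions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy_score(y_true, y_pred)    # 0.8  - 4 of 5 labels correct
precision_score(y_true, y_pred)   # 1.0  - no false positives
recall_score(y_true, y_pred)      # 0.67 - one positive example was missed
f1_score(y_true, y_pred)          # 0.8  - harmonic mean of precision and recall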

53
Q

What is supervised learning

A

Supervised learning is a type of machine learning in which the model is trained on a labelled dataset. The labelled data consists of input features and corresponding output values, or “labels”, that the model is trying to predict. The goal of supervised learning is to train the model to accurately predict the output values for new, unseen input data

54
Q

What are some examples of supervised learning

A

Predicting the price of a house based on its features, classifying emails as spam or not spam, and identifying the type of flower based on its measurements

55
Q

What is unsupervised learning

A

Unsupervised learning is a type of machine learning in which the model is trained on an unlabelled dataset. The goal of unsupervised learning is to identify patterns or structure in the data without the use of predefined output values. This can be useful for tasks such as clustering, dimensionality reduction, and anomaly detection

56
Q

What are some examples of unsupervised learning

A

Clustering similar customer profiles based on their purchasing behaviour, reducing the dimensionality of high-dimensional data for visualisation purposes, and identifying outliers in a dataset
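
As a sketch of the clustering example, assuming scikit-learn (the profiles are made up and the cluster ids are arbitrary):

from sklearn.cluster import KMeans

X = [[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 9.1]]   # 2-D customer profiles, no labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
kmeans.labels_   # e.g. [0, 0, 1, 1] - similar profiles grouped together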

57
Q

What is the main difference between supervised and unsupervised learning

A

The use of labelled versus unlabelled data. In supervised learning, the model is trained on labelled data to predict output values. In unsupervised learning, the model is trained on unlabelled data to identify patterns or structure in the data

58
Q

What is feature scaling

A

A technique used in machine learning to normalise the range of values for each feature in a dataset. This is important because individual features may have different distributions, ranges, or units of measurement, which can cause issues for some machine learning algorithms
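
A minimal sketch of one common form, standardisation to zero mean and unit variance (the values are made up):

import statistics

feature = [10.0, 20.0, 30.0, 40.0]
mu = statistics.mean(feature)
sigma = statistics.pstdev(feature)
scaled = [(x - mu) / sigma for x in feature]   # roughly [-1.34, -0.45, 0.45, 1.34]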

59
Q

Why is data cleaning and preparation important in machine learning

A

Data cleaning and preparation is important because machine learning algorithms require high-quality, well-structured data to produce accurate and reliable results. Dirty or unstructured data, such as missing values, errors, or outliers, can significantly impact the performance of a machine learning model

60
Q

What are some common techniques used for data cleaning and preparation

A

Removing or imputing missing values, correcting errors in data entry or collection, standardising the formatting and structure of the data, and identifying and removing outliers
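
A sketch of a few of these steps, assuming pandas and an invented table:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 200],
                   "city": [" London", "Paris", None, "Paris"]})

df["city"] = df["city"].str.strip()   # standardise formatting
df = df.dropna()                      # remove rows with missing values
df = df[df["age"] < 120]              # remove an implausible outlier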

61
Q

How can you deal with multivariate data in machine learning

A

Dealing with multivariate data in machine learning requires careful consideration of the relationships between features. This may involve identifying causative variables, removing redundant or irrelevant features, and normalising the range of values for each feature using techniques such as feature scaling

62
Q

What are some best practices for data cleaning and preparation in machine learning

A

Best practices for data cleaning and preparation in machine learning include thoroughly documenting data sources and cleaning processes, validating data quality and completeness, exploring the data for patterns and outliers and using established standards and frameworks for data management and preprocessing

63
Q

What is missing data

A

Missing data refers to the absence of a value for a particular observation or variable in a dataset. This can occur for various reasons, such as data collection errors, data entry mistakes, or intentional omissions

64
Q

Why is missing data a problem in machine learning

A

Missing data can cause problems in machine learning because many algorithms are designed to work with complete, high-quality data. Incomplete or missing data can lead to biased or inaccurate results, and can even cause the model to fail completely

65
Q

What are some techniques for imputing missing data

A

Techniques for imputing missing data include mean or median imputation, regression imputation, and k-nearest neighbour imputation. Each technique has its own strengths and weaknesses, and the choice of technique depends on the specific characteristics of the dataset and the research question being addressed

66
Q

What are some best practices for handling missing data in machine learning

A

Best practices for handling missing data include carefully documenting the reasons for missing data, exploring the patterns of missing data in the dataset, considering both removal and imputation techniques, validating the quality of imputed data, and performing sensitivity analyses to assess the impact of missing data on the results

67
Q

What is mean/median imputation

A

This technique involves replacing missing values with the mean or median value of the variable across all other observations. It is simple and quick, but it can be biased if the missing values are not randomly distributed across the variable. Mean or median imputation is best suited for variables with relatively symmetric distributions
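
A sketch with pandas and an invented variable, showing how the median is less affected by an outlier:

import pandas as pd

s = pd.Series([3.0, 5.0, None, 7.0, 100.0])
s.fillna(s.mean())     # fills the gap with 28.75, pulled up by the outlier
s.fillna(s.median())   # fills the gap with 6.0, more robust here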

68
Q

What is regression imputation

A

This technique involves using a regression model to predict the missing values based on other variables in the dataset. This approach assumes that the missing values are related to other variables in the dataset, and can be useful when there is a strong correlation between variables. However, regression imputation can be computationally intensive and may require careful model selection and validation
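
A minimal sketch of the idea using scikit-learn's LinearRegression on invented data; in practice model selection and validation matter:

import numpy as np
from sklearn.linear_model import LinearRegression

x_known = np.array([[1.0], [2.0], [3.0]])   # predictor values where the target is observed
y_known = np.array([2.1, 3.9, 6.0])         # observed values of the variable with gaps

reg = LinearRegression().fit(x_known, y_known)
reg.predict(np.array([[4.0]]))              # predicted fill-in for the missing value (≈ 7.9)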

69
Q

What is mean/median imputation best suited for

A

Mean or median imputation is best suited for variables with relatively symmetric distributions

70
Q

What is the K-nearest neighbor imputation

A

This technique involves identifying the k nearest observations to the observation with missing values and using their values to impute the missing data. This approach assumes that observations with similar values for other variables will have similar values for the variable with missing data.
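
A sketch with scikit-learn's KNNImputer and invented data:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, np.nan],
              [1.1, 10.0],
              [0.9, 12.0],
              [8.0, 50.0]])   # second feature is missing for the first observation

KNNImputer(n_neighbors=2).fit_transform(X)
# the missing entry becomes the mean of its 2 nearest neighbours' values: (10 + 12) / 2 = 11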

71
Q

When is regression imputation best suited

A

It is best suited when there is a strong relationship or correlation between the variable with missing values and other variables in the dataset