Descriptive Statistics Flashcards
What are descriptive statistics
Simple descriptions of the qualities of a dataset. They can be used as a quick insight into a dataset and most of these descriptions fall into three camps: measures of the central tendency, measures of variability, and measures of frequency distributions
What are the measures of central tendency in descriptive statistics
The measures of central tendency in descriptive statistics are the mean, median, and mode. They describe the central portions of the data
What are the measures of variability in descriptive statistics
The measures of variability in descriptive statistics are the variance, standard deviation, range, and interquartile range. They describe the spread of the data
What are the measures of frequency distributions in descriptive statistics
The measures of frequency distributions in descriptive statistics are counts and histograms. They describe the occurrences of the different observations
What are the limitations of using descriptive statistics
Descriptive statistics often boil down some component of the data into a singular value, which provides a simplified insight but can be misleading and hide underlying information. It is important to understand their limitations and use them appropriately
What is the arithmetic mean μ
The arithmetic mean μ is a description of the average across the population. It is calculated by taking the sum of all the samples xi and dividing it by the total number of samples n, expressed as: μ = Σ(xi) / n, where xi is a sample from the n samples in the population, Σ is the symbol used as a "sum" operator, the subscript denotes the iterator, and the superscript denotes its limit
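The formula above can be sketched in Python (the function name is illustrative):

```python
def arithmetic_mean(samples):
    # mu = sum of all samples divided by the number of samples n
    return sum(samples) / len(samples)

print(arithmetic_mean([2, 4, 6, 8]))  # 5.0
```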
Why is the axis along which we perform operations important
The axis along which we perform operations is important because an operation that is meaningful along one axis may be meaningless along another. For example, averaging a single feature down the rows (across all samples) makes sense, while averaging unrelated features across a single row usually does not
What is the formula for taking the mean along the axis
μj = (1/n) * Σi=1 to n (xij), where:
μj is the mean value of the j-th feature/variable
n is the number of samples/observations in the dataset
xij is the value of the j-th feature/variable for the i-th sample/observation
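A minimal pure-Python sketch of this per-feature mean, assuming the dataset is stored as a list of rows where each row is one sample and each column j is one feature:

```python
def column_means(data):
    # mean of each feature j, averaged over the n samples (rows)
    n = len(data)
    n_features = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(n_features)]

data = [[1.0, 10.0],
        [3.0, 30.0]]
print(column_means(data))  # [2.0, 20.0]
```

With a library such as NumPy this is the same idea as taking the mean along the sample axis.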
What are the pros of taking the arithmetic mean
The pros of taking the arithmetic mean are that you don’t need to sort the data, it treats all samples equally, and it is commonly used, so many are familiar with what it represents
What are the cons of taking the arithmetic mean
The cons of taking the arithmetic mean are that it is sensitive to outliers, it requires iterating over all samples, and it is not suitable for categorical data
What is the median
A description of the centre of the population by the middle value from the ordered list of the observed values. 50% of the observations will be above it, and 50% below
How is the median calculated for a list x of length n
If n is odd:
Median = x[(n - 1)/2]
If n is even:
Median = (x[n/2 - 1] + x[n/2]) / 2
Here we index from 0, but you may see indexing from 1
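Both cases, with 0-based indexing, can be sketched as:

```python
def median(values):
    xs = sorted(values)            # the median requires sorted data
    n = len(xs)
    if n % 2 == 1:                 # odd: the single middle element
        return xs[(n - 1) // 2]
    # even: average of the two middle elements
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```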
What are the pros of using the median
The median is the most robust of these measures to a few outliers, it identifies the middle of the dataset, and when combined with the mean it gives a sense of the skew in our data; once the data is sorted, there is no need to iterate over the entire set
What are the cons of using the median
Must sort the data, which can be expensive. Different approach based on if n is odd or even. Not for categorical data
What is the mode
The mode is a measure of central tendency that identifies the most frequently occurring value in the dataset
What type of data is the mode best suited for
The mode is best suited for categorical data or discrete variables
What are the pros of using the mode
The mode is good for categorical data, and it can give some insight into continuous data if we aggregate (bin) it well; it identifies the most common observation, and there is no need to sort the data
What are the cons of using the mode
The cons of using the mode are that the values must be counted or aggregated, there may be multiple modes, and it is not always a good reflection of the dataset as a whole
How is the mode calculated
The mode is calculated by finding the value that occurs most frequently in the dataset
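One straightforward way to do this counting in Python is with the standard library's Counter (if several values tie for most frequent, this sketch returns the first one encountered):

```python
from collections import Counter

def mode(values):
    counts = Counter(values)             # frequency of each observation
    return counts.most_common(1)[0][0]   # the most frequently occurring value

print(mode(["cat", "dog", "cat", "bird"]))  # cat
```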
What is frequency distribution
Frequency distribution is a way to describe the frequency of occurrence for observations within the population
What insights can frequency distribution provide
Frequency distribution provides insight into questions such as how observations are spread across categories (e.g. animal species or age groups) and whether values are concentrated in certain areas
What is the purpose of frequency distribution
To give a summary of the number of observations and the frequency of each value
What type of data is best suited for mean
Continuous numerical data
What is the purpose of frequency distributions counts
Frequency distribution counts provide insights into the number of times each observation occurs within a population
What information can frequency distribution counts provide
Frequency distribution counts can provide information on the relative frequency of each observation, the most common observations, and the spread or dispersion of the data
How are frequency distribution counts calculated
Frequency distributions counts are calculated by counting the number of times each observation appears within a population and presenting the results in a table or graph
What is the difference between frequency distribution counts and frequency distribution percentages
Frequency distribution counts represent the actual number of times each observation appears within a population, while frequency distribution percentages represent the proportion of times each observation appears within a population
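Both counts and percentages can be derived from the same tally; a small Python sketch:

```python
from collections import Counter

obs = ["cat", "dog", "cat", "cat", "bird"]
counts = Counter(obs)                                    # raw counts per observation
total = sum(counts.values())
percentages = {k: v / total for k, v in counts.items()}  # proportions per observation

print(counts["cat"], percentages["cat"])  # 3 0.6
```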
Why is it important to understand frequency distribution counts
Understanding frequency counts can provide valuable insight into the characteristics of a population and can be used to make informed decisions based on the data
What is a histogram in the context of frequency distribution
A histogram is a way of visually representing a frequency distribution for a continuous feature. It divides the observed range into fixed-sized sub-partitions called “bins” and counts the number of observations that fall within each bin
How are values represented in a histogram
Values are represented on the x-axis of a histogram, while the y-axis shows the frequency (count) of observations that fall within each bin
What is the difference between a closed and an open interval in a histogram
A closed interval includes its endpoints, while an open interval does not. In notation, a closed interval is denoted with square brackets [], while an open interval is denoted with parentheses ().
Intervals may mix the two, e.g. [0,10], (10,20], so adjacent bins do not double-count their shared boundary
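To make the binning convention concrete, here is a minimal histogram sketch using (low, high] bins, with the first bin also closed on the left so the minimum value is not dropped (the function name and signature are illustrative):

```python
import math

def histogram(values, low, high, n_bins):
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # bins are (low, high]; the very first bin also includes its left endpoint
        idx = 0 if v == low else math.ceil((v - low) / width) - 1
        counts[idx] += 1
    return counts

print(histogram([0, 10, 10.5, 20], low=0, high=20, n_bins=2))  # [2, 2]
```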
What is range
The range is the difference between the highest and lowest values in a sample
What does the range tell us
The range provides a basic description of how spread out a dataset is
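As a one-line sketch (the function name is illustrative):

```python
def value_range(samples):
    # difference between the highest and lowest observed values
    return max(samples) - min(samples)

print(value_range([3, 7, 1, 9]))  # 8
```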
What are the pros of using the range
The range is simple to calculate and easy to understand
What are the cons of using the range
The range is sensitive to outliers and may not provide a complete picture of the variability in the dataset
What does population variance describe
Population variance describes, on average, how the samples in a population vary from the mean
What is the formula for population variance
σ^2 = Σ(xi - μ)^2 / N
Where xi is a sample
μ is the mean of all samples
N is the total number of samples in the population
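The formula above translates directly to Python:

```python
def population_variance(samples):
    n = len(samples)
    mu = sum(samples) / n                           # population mean
    return sum((x - mu) ** 2 for x in samples) / n  # average squared deviation

print(population_variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0
```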
Can population variance have a negative value
No, it is always non-negative
How does a larger population variance affect the variability of the samples
A larger population variance indicates greater variability in the samples from the mean
What is the difference between population variance and sample variance
Population variance uses all individuals in the population to calculate the variance, while sample variance only uses a subset of individuals from the population
Why is sample variance used instead of population variance in some cases
Sample variance is used when we don’t have access to all individuals in the population and need to estimate the variance based on a sample
What is the formula for sample variance
The formula for sample variance is s^2 = ∑(xi - x̄)^2 / (n-1), where xi is the value of the ith observation, x̄ is the mean of the observations, and n is the number of observations in the sample.
Why is the denominator in the sample variance formula (n-1) instead of n
The denominator is (n-1) instead of n to correct for the bias that results from using a sample to estimate the population variance
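This (n-1) denominator is known as Bessel's correction; a minimal sketch:

```python
def sample_variance(samples):
    n = len(samples)
    xbar = sum(samples) / n  # sample mean
    # divide by (n - 1), not n, to correct the bias of estimating
    # the population variance from a sample (Bessel's correction)
    return sum((x - xbar) ** 2 for x in samples) / (n - 1)

print(sample_variance([2, 4, 6]))  # 4.0
```

This matches the stdlib's `statistics.variance`, while `statistics.pvariance` divides by n.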
What is the unit of measurement for sample variance
The unit of measurement for sample variance is the square of the unit of measurement for the observations in the sample. For example, if the observations are measured in inches, the unit of measurement for sample variance would be square inches
What is the unit of measurement for population variance
The unit of measurement for population variance is the square of the unit of measurement for the data. For example, if the data is measured in meters, the variance will be measured in square meters
What is standard deviation
A measure of the amount of variation or dispersion of a set of values from its mean. It is calculated as the square root of the variance of the dataset, which is the average of the squared differences from the mean
What is the difference between variance and standard deviation
Variance is the average of squared differences from the mean, while standard deviation is the square root of the variance
Why is standard deviation more useful than variance
Standard deviation is more useful than variance because it is in the same units as the original data making it easier to interpret and compare
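For a population, the standard deviation is simply the square root of the population variance:

```python
import math

def population_stdev(samples):
    n = len(samples)
    mu = sum(samples) / n
    variance = sum((x - mu) ** 2 for x in samples) / n
    return math.sqrt(variance)  # back in the same units as the data

print(population_stdev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```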
What is a binary label in machine learning
A binary label is a label that can take on only two possible values, such as “yes” or “no”, “true” or “false”, or 0 or 1. Binary labels are commonly used in classification tasks where the goal is to predict a binary outcome.
What are some examples of binary labels in machine learning?
Examples of binary labels include predicting whether an email is spam or not spam, whether a credit card transaction is fraudulent or not fraudulent, or whether a patient is at high risk or low risk for a particular disease.
How do you train a model using binary labels?
To train a model using binary labels, you typically need a dataset that includes examples of both positive and negative outcomes. You would then use this data to train a classification algorithm, such as logistic regression or a decision tree, to predict the binary outcome based on the input features.
What are some common evaluation metrics for models trained with binary labels?
Common evaluation metrics for binary classification models include accuracy, precision, recall, and F1 score. These metrics are used to measure the performance of the model in terms of its ability to correctly classify positive and negative examples.
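These four metrics can all be computed from the confusion-matrix counts; a minimal sketch assuming 0/1 labels and at least one true and one predicted positive (no divide-by-zero handling):

```python
def binary_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)                  # of predicted positives, how many were right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(acc, prec, rec, f1)
```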
What is supervised learning
Supervised learning is a type of machine learning in which the model is trained on a labelled dataset. The labelled data consists of input features and corresponding output values, or "labels", that the model is trying to predict. The goal of supervised learning is to train the model to accurately predict the output values for new, unseen input data
What are some examples of supervised learning
Predicting prices of a house based on its features, classifying emails as spam or not spam, and identifying the type of flower based on its measurements
What is unsupervised learning
Unsupervised learning is a type of machine learning in which the model is trained on an unlabelled dataset. The goal of unsupervised learning is to identify patterns or structure in the data without the use of predefined output values. This can be useful for tasks such as clustering, dimensionality reduction, and anomaly detection
What are some examples of unsupervised learning
Clustering similar customer profiles based on their purchasing behaviour, reducing the dimensionality of high-dimensional data for visualisation purposes, and identifying outliers in a dataset
What is the main difference between supervised and unsupervised learning
The use of labelled versus unlabelled data. In supervised learning, the model is trained on labelled data to predict output values. In unsupervised learning the model is trained on unlabelled data to identify patterns or structure in data
What is feature scaling
A technique used in machine learning to normalise the range of values for each feature in a dataset. This is important because individual features may have different distributions, ranges or units of measurements, which can cause issues for some machine learning algorithms
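Min-max scaling is one common feature-scaling technique (standardisation to zero mean and unit variance is another); a minimal sketch for a single feature:

```python
def min_max_scale(values):
    lo, hi = min(values), max(values)
    # rescales the feature to the range [0, 1];
    # assumes the feature is not constant (hi != lo)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```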
Why is data cleaning and preparation important in machine learning
Data cleaning and preparation is important because machine learning algorithms require high-quality, well-structured data to produce accurate and reliable results. Dirty or unstructured data, such as missing values, error, or outliers, can significantly impact the performance of a machine learning model
What are some common techniques used for data cleaning and preparation
Removing or imputing missing values, correcting errors in data entry or collection, standardising the formatting and structure of data, and identifying and removing outliers
How can you deal with multivariate data in machine learning
Dealing with multivariate data in machine learning requires careful consideration of the relationships between features. This may involve identifying causative variables, removing redundant or irrelevant features, and normalising the range of values for each feature using techniques such as feature scaling
What are some best practices for data cleaning and preparation in machine learning
Best practices for data cleaning and preparation in machine learning include thoroughly documenting data sources and cleaning processes, validating data quality and completeness, exploring the data for patterns and outliers and using established standards and frameworks for data management and preprocessing
What is missing data
Missing data refers to the absence of a value for a particular observation or variable in a dataset. This can occur for various reasons, such as data collection errors, data entry mistakes, or intentional omissions
Why is missing data a problem in machine learning
Missing data can cause problems in machine learning because many algorithms are designed to work with complete, high-quality data. Incomplete or missing data can lead to biased, inaccurate results and can even cause the model to fail completely
What are some techniques for imputing missing data
Techniques for imputing missing data include mean or median imputation, regression imputation, and k-nearest neighbour imputation. Each technique has its own strengths and weaknesses, and the choice of technique depends on the specific characteristics of the dataset and the research question being addressed
What are some best practices for handling missing data in machine learning
Best practices for handling missing data include carefully documenting the reasons for missing data, exploring the patterns of missing data in the dataset, considering both removal and imputation techniques, validating the quality of imputed data, and performing sensitivity analyses to assess the impact of missing data on the results
What is mean/median imputation
This technique involves replacing missing values with the mean or median value of the variable across all other observations. This is a simple and quick technique, but it can be biased if the missing values are not randomly distributed across the variable. Mean or median imputation is best suited for variables with a relatively symmetric distribution
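A minimal sketch of mean imputation, assuming missing entries are represented as None:

```python
def mean_impute(values):
    # mean of the observed (non-missing) entries only
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    # fill each missing entry with that mean
    return [mean if v is None else v for v in values]

print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```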
What is regression imputation
This technique involves using a regression model to predict the missing values based on other variables in the dataset. This approach assumes that the missing values are related to other variables in the dataset, and can be useful when there is a strong correlation between variables. However, regression imputation can be computationally intensive and may require careful model selection and validation
When is mean/median imputation best suited for
Mean or median imputation is best suited for variables with relatively symmetric distribution
What is k-nearest neighbour imputation
This technique involves identifying the k nearest observations to the observation with missing values and using their values to impute the missing data. This approach assumes that observations with similar values for other variables will have similar values for the variable with missing data.
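A simplified sketch of the idea (not a production implementation such as scikit-learn's KNNImputer): it assumes numeric features, missing values marked as None in a single target column, and distance measured over the remaining columns.

```python
def knn_impute(rows, target_col, k=2):
    # rows with an observed value in the target column are candidate neighbours
    complete = [r for r in rows if r[target_col] is not None]
    out = []
    for r in rows:
        if r[target_col] is not None:
            out.append(list(r))
            continue
        # squared Euclidean distance over the non-target columns
        def dist(other):
            return sum((a - b) ** 2
                       for j, (a, b) in enumerate(zip(r, other))
                       if j != target_col)
        neighbours = sorted(complete, key=dist)[:k]
        # impute with the mean of the k nearest neighbours' values
        imputed = sum(nb[target_col] for nb in neighbours) / k
        filled = list(r)
        filled[target_col] = imputed
        out.append(filled)
    return out

rows = [[1.0, 10.0], [2.0, 20.0], [1.5, None]]
print(knn_impute(rows, target_col=1, k=2))
```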
When is regression imputation best suited for
It is best suited when there is a strong relationship or correlation between the variable with missing values and other variables in the dataset