Review Flashcards
Discrete metrics
Discrete metrics are metrics that count distinct, individual events or items. These metrics do not deal with fractional or continuous values.
Ex:
On-time delivery: This metric records whether each delivery was on time (yes or no). It is discrete because every delivery falls into exactly one of two countable categories, “on-time” or “not on-time.”
Continuous metrics
Continuous metrics take on any value within a range and are not limited to whole numbers.
Ex:
Delivery time and package weight are continuous because they can take any value within a range (e.g., 1.5 hours, 10.2 kg).
Categorical (Nominal) Data
Data that is divided into categories that are mutually exclusive, meaning items can only belong to one category, and the categories do not have any order or ranking.
Key Characteristics: No quantitative value, no inherent order.
Ex:
Customer region: North America, Europe, Asia
Favorite fruit: Apple, Banana, Orange
Ordinal Data
Data that can be ranked or ordered based on a relative scale, but the differences between data points are not necessarily equal. There is no measurable distance between categories.
Key Characteristics: Order matters, but the intervals between data points are not consistent or measurable.
Ex:
Customer satisfaction rating: Poor, Average, Good, Excellent
Education level: High school, Bachelor’s degree, Master’s degree, Doctorate
Interval Data
Data where the differences between values are consistent and measurable, but there is no true zero point. Ratios between values are not meaningful.
Key Characteristics: Equal intervals between data points, but no true zero point (so ratios like “twice as much” don’t make sense).
Ex:
Temperature: 20°C, 30°C, 40°C (the difference between 20°C and 30°C is the same as between 30°C and 40°C, but 0°C does not mean “no temperature”)
Time of day: 2:00 PM, 4:00 PM, 6:00 PM (the intervals between times are consistent, but time does not have a true zero point)
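A quick check of why ratios fail on an interval scale, as a minimal sketch. The helper name `celsius_to_kelvin` is made up for illustration. Kelvin has a true zero, so it serves as the ratio-scale comparison.

```python
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15

# Differences are meaningful on an interval scale: both gaps are 10 degrees.
gap_low = 30 - 20
gap_high = 40 - 30
print(gap_low == gap_high)  # True

# Ratios are not: 40°C looks like "twice" 20°C only because of where 0°C sits.
naive_ratio = 40 / 20
kelvin_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)
print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # 1.068
```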
Ratio Data
Data that has a meaningful zero point, and both differences and ratios between data points are meaningful. This type of data allows for all arithmetic operations, including calculating ratios.
Key Characteristics: Equal intervals between data points, true zero point (which represents the absence of the variable).
Ex:
Height: 150 cm, 180 cm (height has a true zero, and you can say someone is twice as tall as someone else)
Salary: $30,000, $60,000 (you can say one person earns twice as much as another, and $0 represents no salary)
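Hierarchy of data
Nominal, ordinal, interval, ratio (from the lowest to the highest level of measurement; each level keeps the properties of the levels before it)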
Conditional Probability
Conditional probability is the probability of an event A occurring given that another event B has already occurred. This is denoted P(A | B) and, when P(B) > 0, is computed as P(A | B) = P(A ∩ B) / P(B).
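A small worked example, assuming a single roll of a fair six-sided die. The events A and B and the helper `prob` are chosen only for illustration.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

# P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 2/3
```

Knowing that the roll is greater than 3 raises the chance of an even roll from 1/2 to 2/3, because two of the three remaining outcomes (4 and 6) are even.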
Joint Probability
P(A ∩ B) = P(A | B) × P(B); if A and B are independent, this reduces to P(A ∩ B) = P(A) × P(B)
Joint probability is the probability that two (or more) events both occur.
Union Probability
P(A ∪ B) = P(A) + P(B) - P(A ∩ B); if A and B are mutually exclusive, this reduces to P(A ∪ B) = P(A) + P(B)
Union probability is the probability that at least one of two (or more) events occurs, that is, that event A, event B, or both occur.
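Joint and union probabilities can be checked by counting outcomes. This is a sketch that assumes one roll of a fair six-sided die; the events and the helper `prob` are illustrative. When events can overlap, the overlap has to be subtracted once from the union, and the product rule for the joint probability holds only if the events are independent.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

p_joint = prob(A & B)  # both A and B occur: outcomes {4, 6}
p_union = prob(A | B)  # A, B, or both occur: outcomes {2, 4, 5, 6}

# Inclusion-exclusion: subtract the overlap that P(A) + P(B) counts twice.
p_union_formula = prob(A) + prob(B) - p_joint

# A and B are independent only if P(A ∩ B) equals P(A) × P(B).
is_independent = p_joint == prob(A) * prob(B)

print(p_joint)          # 1/3
print(p_union)          # 2/3
print(p_union_formula)  # 2/3
print(is_independent)   # False
```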
Normalization Methods
Z-score normalization
Min-Max normalization
Normalization by decimal scaling
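Min-max normalization and decimal scaling can be sketched as follows, using their standard textbook definitions: min-max maps values to [0, 1] with (x - min) / (max - min), and decimal scaling divides by the smallest power of 10 that brings every absolute value below 1. The function names and sample values are hypothetical.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization needs at least two distinct values")
    return [(x - lo) / (hi - lo) for x in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer that makes max(|x|) < 1."""
    max_abs = max(abs(x) for x in values)
    # Number of digits in the integer part of the largest absolute value.
    j = len(str(int(max_abs))) if max_abs >= 1 else 0
    return [x / 10**j for x in values]

values = [120, 450, 980]  # hypothetical sample data
scaled_min_max = min_max(values)
scaled_decimal = decimal_scaling(values)
print(scaled_decimal)  # [0.12, 0.45, 0.98]
```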
Z-score Normalization
A technique that rescales data to have a mean of 0 and a standard deviation of 1, using z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the data. Each transformed value expresses how many standard deviations the original value lies from the mean, which makes values measured on different scales directly comparable.
Uses:
- Standardizing Data
- Outlier Detection - Extreme z-scores (z < -3 or z > 3) flag potential outliers
- Handling Features on Different Scales - Features (variables) are often measured on different scales. For instance, one feature may represent age in years (ranging from about 0 to 100), while another represents income in dollars (ranging from 10,000 to 1,000,000). If the features are left on different scales, a scale-sensitive algorithm (for example, a distance-based one) can be dominated by the feature with the larger numerical range, leading to suboptimal performance.
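A minimal sketch of z-score normalization on two features with very different ranges. The sample values and the function name `z_scores` are hypothetical. The sketch assumes the population standard deviation (`statistics.pstdev`); some tools use the sample standard deviation instead, which gives slightly different z-scores.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / std for x in values]

ages = [20, 30, 40, 50, 60]                            # years (hypothetical)
incomes = [10_000, 40_000, 55_000, 90_000, 1_000_000]  # dollars (hypothetical)

age_z = z_scores(ages)
income_z = z_scores(incomes)

# After scaling, both features have mean 0 and standard deviation 1.
print([round(z, 2) for z in age_z])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```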
Median
Middle value of a sorted data set (the average of the two middle values when the number of values is even)
Mode
Most frequently occurring value (a data set can have more than one mode)
Mean
Sum of all values divided by the number of values (the arithmetic average)
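All three measures are available in Python's standard `statistics` module. This sketch uses a hypothetical data set with an even number of values and two modes; `multimode` requires Python 3.8 or later.

```python
import statistics

data = [2, 3, 3, 5, 7, 7, 9, 10]  # hypothetical sample, already sorted

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # average of the two middle values (5 and 7)
modes = statistics.multimode(data)  # every value tied for most frequent

print(mean)    # 5.75
print(median)  # 6.0
print(modes)   # [3, 7]
```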