Principles of Statistics Flashcards
What does analysing data with statistics do?
- Framework to uncover hidden patterns 🏗️
- Objective Perspective 🎯
- Test Hypotheses🧪
- Confident Decisions: Rely on Data > Assumptions e.g. lead time changes💪
How have you applied statistical testing when analysing data?
- Descriptive stats: mean, median etc.
- Inferential stats: hypothesis testing (Pearson’s correlation coefficient) and regression
- Assess Model: RMSE, MAE
What is Hypothesis Testing?
- Inferential stats method 📈
- Assess a hypothesis about a larger population based on a sample 👥 🎛️
- 2 competing hypotheses - null (no significant correlation) and alternative (a significant correlation)❌🔀
- See whether the observed result could simply be due to chance🔭🍀
What is Inferential Statistics?
- Field of Statistics🌾
- Analytical tools to draw conclusions about a whole population 🌍 based on a sample 🔬
What is Pearson’s correlation test?
Type of hypothesis test that determines whether a linear relationship exists between 2 variables (e.g. lead time and stock holding)
What is a t test?
Hypothesis test that compares the means of 2 groups
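To make the comparison of two means concrete, here is a minimal sketch of the t statistic. It assumes Welch's version of the test (which does not require equal variances); the supplier figures and the function name are made up for illustration. Turning the statistic into a p-value needs a t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_statistic(group_a, group_b):
    """Return Welch's t statistic for the difference between two sample means."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    # variance() is the sample variance (divides by n - 1)
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean_a - mean_b) / standard_error


# Hypothetical stock-holding figures for two groups of suppliers
short_lead = [120, 135, 128, 140, 132]
long_lead = [150, 162, 158, 149, 171]

t_stat = welch_t_statistic(short_lead, long_lead)
print(f"t statistic: {t_stat:.2f}")  # negative: the first group has the lower mean
```

The further the statistic is from 0, the less plausible it is that both groups share the same mean.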
What was the significance level that the P value was tested against?
5% significance level (p < 0.05)
What is a p-value?
- Statistical Measure 📏
- DETERMINES if the results are statistically significant⭐⭐⭐⭐⭐⭐⭐
- A low p value (< 0.05) = reject the null hypothesis and conclude the alternative that there is an effect/relationship/difference
- A high p value (> 0.05) = fail to reject the null hypothesis: not enough evidence of an effect/relationship/difference between 2 variables
Interpret the P value results of the Pearson’s Correlation Test
- P Value < 0.05
- Reject Null
- Conclude Alternative
- WAS a significant relationship between Lead Time & Stock Holding
Interpret the correlation coefficient of the Pearson’s correlation test
- Strength and direction of relationship
- -1 to 1
- Positive value, far from 1 (close to 0)
- Weak Positive relationship
- Could infer from the sample: a relationship did exist between lead time and stock holding in the Frozen Warehouse (Inferential Stats example)
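The coefficient and its p-value can be sketched in plain Python. This is an illustration under assumptions, not the method used in the project: the data is made up, and the p-value comes from a permutation test (shuffling one variable many times) rather than the usual t-distribution formula that `scipy.stats.pearsonr` reports.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient, between -1 and 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical lead times (days) and stock holding (units)
lead_time = [5, 7, 10, 12, 14, 18, 21, 25, 28, 30]
stock_holding = [200, 180, 260, 190, 240, 210, 280, 200, 250, 230]

r = pearson_r(lead_time, stock_holding)
p_value = permutation_p_value(lead_time, stock_holding)
print(f"r = {r:.2f}, p = {p_value:.3f}")
print("reject null" if p_value < 0.05 else "fail to reject null")
```

The sign of `r` gives the direction, its distance from 0 gives the strength, and the p-value says whether a relationship that strong could plausibly appear by chance.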
Have you encountered a situation where a statistical method did not yield the desired results? How did you rectify it?
- Regression = high error & poor fit
- Due to small sample size, DQ issues or weak relationship
- Frozen Suppliers did not adhere to lead times
- Summer build stock (irrespective of lead time)
- Customer demand, supplier shortages, warehouse space (not considered by model)
- External factors: historical data may be better
- Time series: identify patterns
What is linear regression?
- Stats method
- Predicts an outcome based on another
- By fitting a line of best fit to the data
- The equation of the line allows the model to make predictions
- E.g. if the lead time was 30 days, find 30 on the x axis, go up to the line and read off the corresponding y value (stock holding) as the prediction
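A minimal sketch of fitting the line of best fit and reading off a prediction, assuming ordinary least squares with a single predictor. The data points and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (gradient, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    gradient = covariance / spread_x
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


def predict(x_value, gradient, intercept):
    """Read the y value off the fitted line at x_value."""
    return gradient * x_value + intercept


# Hypothetical lead times (days) and stock holding (units)
lead_time_days = [10, 15, 20, 25, 35, 40]
stock_holding = [1200, 1350, 1300, 1500, 1650, 1600]

gradient, intercept = fit_line(lead_time_days, stock_holding)
prediction_30 = predict(30, gradient, intercept)
print(f"Predicted stock holding at 30 days: {prediction_30:.0f}")
```

Here 30 days sits inside the range the line was fitted on, so the prediction is an interpolation rather than an extrapolation.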
When did you use linear regression?
- To predict stock holding from lead time
- Lead Time as the independent variable (x axis)
- Stock Holding as the dependent variable (y axis)
What was the independent variable in your regression model?
Lead time on the x axis
What was the dependent variable on your regression model?
Stock holding on the y axis
What evaluation metrics did you use to determine the accuracy and effectiveness of your models?
- ROOT MEAN SQUARED ERROR - measures the typical size of the difference between actual and predicted values, penalising large errors more (lower value is better)
- MEAN ABSOLUTE ERROR - the average absolute difference between actual and predicted values (lower value is better)
- R SQUARED - most common - shows how much data variation is explained by the model. 0 - 1. 1 = 100% of the variation is explained by the model. Closer to 1 = better fit💯✅
- Plotted predicted stock/lead time - not a straight 45° line, so not performing well
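The three metrics above can be sketched from their standard definitions. The actual and predicted values below are made up for illustration.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: large errors are penalised more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variation in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    # Can drop below 0 if the model is worse than always predicting the mean
    return 1 - unexplained / total


# Hypothetical actual and predicted stock holding
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 230]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"MAE:  {mae(actual, predicted):.2f}")
print(f"R^2:  {r_squared(actual, predicted):.2f}")
```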
What is R SQUARED and interpret your results
- Number that shows how well the line (LR Model) fits the data🔢
- Tells me how much of a difference in stock holding can be explained by lead time⏱️
- My R-squared was no bigger than 0.05, which means only 5% of the differences in stock holding can be explained by lead time⚄
- Additionally, Training and Test scores were both low, which could suggest the model was too simple to capture the patterns in the data (underfitting)🤺🧪⚪️
What does over fitting mean?
- Model is too complex
- Fits the training data too well
- Cannot generalise to new data that differs from the training data
What is a limitation of R squared?
Sensitive to outliers, and my data had a few that could have influenced the score
Why did you choose those error metrics?
Used MAE and RMSE together: RMSE is sensitive to outliers, so comparing the two can show more insight. E.g. RMSE much bigger than MAE = a few large errors (outliers) exist that could throw the model off
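A small illustration of why the pair is informative, using two made-up sets of prediction errors that have the same MAE. Only the set with one large error shows a gap between RMSE and MAE.

```python
from math import sqrt


def mae_from_errors(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse_from_errors(errors):
    return sqrt(sum(e ** 2 for e in errors) / len(errors))


# Hypothetical prediction errors (actual minus predicted)
even_errors = [10, -10, 10, -10]   # every error is the same size
outlier_errors = [1, -1, 1, -37]   # small errors plus one large miss

for name, errors in [("even", even_errors), ("outlier", outlier_errors)]:
    print(name, "MAE:", mae_from_errors(errors), "RMSE:", round(rmse_from_errors(errors), 2))
```

Both sets have an MAE of 10, but squaring the single large miss pushes the RMSE of the second set well above its MAE.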
What is a time series forecast?
Type of predictive analysis that predicts future values based on historical data collected at specific intervals. It analyses past trends, patterns and seasonal variations to make these predictions.
What tool did you use for the time series forecast model and why?
Python
- flexibility: exponential smoothing levels
- experiment with different models
How do you know if your forecast is accurate?
Root Mean Sq Error - margin of error between actual and predicted values
Mean Absolute Error - also measures error between actual/predicted values
Also use confidence intervals in the chart to see how confident the model is
What do the 4 plots on the decomposition plot show?
Observed - actual data
Trend - the long term upward or downward direction of the data
Seasonality - repeating patterns within specific time periods
Residual (noise) - random fluctuations that cannot be explained by trend or seasonality
What does a decomposition plot do?
Breaks down the data into underlying components
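A minimal sketch of how the components can be separated, assuming a classical additive decomposition (observed = trend + seasonal + residual) with a centred moving average for the trend. The series is made up, and the function names are hypothetical; in practice a library routine such as `seasonal_decompose` in statsmodels is the usual choice.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the window does not fit at the edges."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points of the window
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend, seasonal and residual components."""
    trend = centered_moving_average(series, period)
    detrended = [y - t if t is not None else None for y, t in zip(series, trend)]
    # Average the detrended values at each position in the cycle
    seasonal_means = []
    for position in range(period):
        values = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == position]
        seasonal_means.append(sum(values) / len(values))
    # Centre the seasonal effects so they sum to zero over one cycle
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[i % period] - offset for i in range(len(series))]
    residual = [y - t - s if t is not None else None
                for y, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual


# Hypothetical quarterly series: rising trend plus a repeating pattern
pattern = [5, -5, 3, -3]
series = [100 + 2 * i + pattern[i % 4] for i in range(12)]

trend, seasonal, residual = decompose_additive(series, period=4)
print("trend:   ", trend)
print("seasonal:", seasonal[:4])
```

Because the example series contains no noise, the residual comes out as (near) zero wherever the trend is defined.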
What does my decomposition plot show specifically?
Observed: Peaks and troughs show fluctuation in stock levels over time
Trend: downward trend until 2020, steep upward trend until 2021, then levels off a little
Seasonal: annual pattern - stock levels rising in the second quarter and gradually decreasing throughout the year - working capital management at year-end and ice cream stock building
Residual: random fluctuations do exist, which could be due to supplier issues, manual adjustments to orders, or changes in space allocated to shelves in shops
What time series forecast model did you use to forecast?
- Naive - assumes values would be equal to the most recently observed data value (establishes baseline for comparison)
- Holt Linear - trend - identifies underlying trend in the data to make future predictions (does not capture autocorrelation, which is the relationship between a variable’s current and past values)
- Holt Winters - trend and seasonality
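The first two models can be sketched by hand to show what each assumes. This is an illustration only: the smoothing values `alpha` and `beta` are arbitrary assumptions, and libraries such as statsmodels provide ready-made `Holt` and `ExponentialSmoothing` classes, the latter adding the seasonal component used by Holt Winters.

```python
def naive_forecast(series, horizon):
    """Every future value equals the most recent observation."""
    return [series[-1]] * horizon


def holt_linear_forecast(series, horizon, alpha=0.8, beta=0.2):
    """Holt's linear trend method: track a level and a trend, then extend them."""
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical stock levels rising steadily
stock = [10, 12, 14, 16]

print("Naive:      ", naive_forecast(stock, 2))
print("Holt linear:", [round(v, 2) for v in holt_linear_forecast(stock, 2)])
```

On a steadily rising series the naive forecast stays flat at the last value, while Holt Linear continues the slope, which is why the naive model works as a baseline.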
What parameters did you use to customise the holt winters model?
- Seasonal periods: 52 for weekly
- Trend/Seasonality: add
- Smoothing level: how much weight is given to the most recent observation versus older ones when forecasting future values (0.10 = low, so older observations keep more influence)
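To show what a smoothing level of 0.10 does, here is a sketch using simple exponential smoothing (level only, no trend or seasonality, which is a simplification of Holt Winters). The series and function names are hypothetical.

```python
def smoothing_weights(alpha, n_terms):
    """Weight on the observation k steps back: alpha * (1 - alpha) ** k."""
    return [alpha * (1 - alpha) ** k for k in range(n_terms)]


def simple_exponential_smoothing(series, alpha):
    """Return the final smoothed level, used as the one-step-ahead forecast."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


# Hypothetical series with a sudden jump at the end
stock = [100, 100, 100, 100, 200]

slow = simple_exponential_smoothing(stock, alpha=0.10)
fast = simple_exponential_smoothing(stock, alpha=0.90)

print("Weights at alpha 0.10:", [round(w, 3) for w in smoothing_weights(0.10, 3)])
print("Forecast at alpha 0.10:", round(slow, 1))
print("Forecast at alpha 0.90:", round(fast, 1))
```

A low smoothing level reacts slowly to the jump because the weights decay slowly across older observations; a high one follows the latest value closely.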
What was the outcome of the time series forecast?
- Holt winters 🏆
- Forecast: higher stock than previous years
- BUT below 4.3 million unit maximum
- Winter Stock Build (not considered by model)
- 5 - 7% error: < 5% would be better, but still better than “finger in the air”
What could your stakeholders use the time series forecast model for?
Stock prediction used to:
- 🎯 Optimise Targets and KPIs
- ⚠️ Foresee potential issues and correct them
- ⚡ Maintain efficient warehouse operations
Explain the Linear Regression Equation
y = mx + c
y = predicted stock holding
m = gradient that measures the slope of the line
x = value we change i.e. lead time days
c = intercept: the y value where the line crosses the y axis (when x = 0)
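A worked example of the equation, using made-up numbers: a gradient of 20 units per day and an intercept of 1000 units are assumptions for illustration only, not the fitted values from the project.

```latex
\[
y \;=\; \underbrace{20}_{\text{gradient}} \times \underbrace{30}_{x\ \text{(lead time, days)}} \;+\; \underbrace{1000}_{\text{intercept}} \;=\; 1600
\]
```

So with these illustrative values, a 30 day lead time gives a predicted stock holding of 1600 units.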
What is Descriptive Statistics?
- Summarise & Describe a dataset that’s a representation of a population
- Overview of characteristics of the data e.g. central tendency (mean, mode, median) or variability (SD, min, max)
- Understanding of the data - foundation for inferential statistics
What are Limitations of Linear Regression?
- Need a linear relationship between variables (strong relationship = better model)
- Predictions are limited to the model’s training range (e.g. 100 days would not produce reliable values as the model has not been trained on such values)
- Predictions are not fact
Why did you want to forecast stock holding?
- Action needed before year end W/C metrics?
- Reduce risk of reaching capacity and impacting operations
- We pay for space, need to use it efficiently, useful to predict stock holding
In Time Series Forecasting, why do you Resample the Data?
- Adjust frequency
- Smooth Gaps
- Reduce Noise
- Easier to identify patterns
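Adjusting frequency can be sketched as grouping daily readings into weekly averages. This version uses only the standard library and made-up data; with a pandas DataFrame the usual route is its `resample` method.

```python
from collections import defaultdict
from datetime import date, timedelta


def resample_weekly_mean(daily):
    """daily: list of (date, value). Returns {(iso_year, iso_week): mean value}."""
    buckets = defaultdict(list)
    for day, value in daily:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(value)
    return {week: sum(values) / len(values) for week, values in sorted(buckets.items())}


# Hypothetical daily stock levels for two full weeks, starting on a Monday
start = date(2024, 1, 1)
daily = [(start + timedelta(days=i), 100 + i) for i in range(14)]

weekly = resample_weekly_mean(daily)
for week, average in weekly.items():
    print(week, average)
```

Fourteen daily points become two weekly points, which smooths day-to-day noise and matches a weekly seasonal period.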