Module 8: Forecasting with Pandas Flashcards
What is forecasting and what are the two types (cross-sectional and time series)
In this module, we will create a basic forecasting model. This will include exploratory analysis, obtaining statistical summaries, and applying common forecasting methods to predict variable behavior.
Forecasting is about predicting future events as accurately as possible. Predictions (in the form of a time series) are an important aid for effective and efficient planning. Good forecasts capture patterns / relationships in the historical data, without replicating past events that are unlikely to reoccur.
There are two types of forecasting:
1. Cross-sectional: an observational analysis from a population, or a representative subset, at a specific point in time.
2. Time-series forecasting: uses only information on the variables to be forecast, and makes no attempt to discover additional factors which affect their behaviour
What does the predictability of an event rely on?
The predictability of an event depends on: 1) how well we understand the factors that contribute to it; 2) how much data is available; and 3) whether the forecasts themselves can affect the thing we are trying to forecast.
E.g., when forecasting currency exchange rates, we have only a limited understanding of the factors that affect them, and the forecasts themselves influence the rates: published predictions change how people trade. Something that complex is unlikely to be forecast well.
Thus, two conditions need to be met to even pursue a forecasting model:
1) Historical data is available
2) Reasonable to assume that some aspects of the past will continue into the future
Terms: trend, seasonality, cycles
In describing time series we use several terms:
- Trend: exists when there is a long-term increase or decrease in the data; it does not have to be linear.
- Seasonality: occurs when a time series is affected by seasonal factors such as the time of year, the day of the week, or another calendar period. Seasonality is always of a fixed and known period.
- Cycles: occur when the data exhibits rises and falls that are not of a fixed period. Fluctuations due to economic conditions are often called the business cycle.
Cycles, unlike seasons, have an unknown and unfixed length and are usually longer and more variable than that of a seasonal variation.
Before creating a forecasting model, we will have to determine whether the data exhibits a trend, seasonality, or cycles; the synthetic sketch below illustrates these components.
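To make these terms concrete, here is a small synthetic sketch (my own illustration, not part of the module's data) that builds a series with an explicit linear trend and a fixed weekly seasonality:

import numpy as np
import pandas as pd

# Hypothetical sketch: a daily series with a linear trend, weekly seasonality, and noise.
days = pd.date_range('2001-01-01', periods=365, freq='D')
t = np.arange(len(days))
trend = 0.05 * t                             # long-term increase
seasonality = 2 * np.sin(2 * np.pi * t / 7)  # fixed, known 7-day period
noise = np.random.randn(len(days))           # irregular component
syntheticSeries = pd.Series(trend + seasonality + noise, index=days)
tmp = syntheticSeries.plot(title='Synthetic Series: Trend + Weekly Seasonality + Noise')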
Data assumptions: stationarity and autocorrelation
The biggest assumption a time-series model makes is stationarity, where one assumes a time series’ statistical properties (mean, variance, growth rate) are not varying over time. This is a sound assumption if we have explained away cyclic properties and trends as separate factors.
With the stationarity assumption, it is possible to apply regular summary statistics to time-series data. The most commonly used bivariate statistic is the correlation coefficient, which measures the strength of the linear relationship between two variables.
Non-stationarity tells us that there is some sort of trend; we may not know why or what it is yet, but we at least know that there is something to forecast. With a purely stationary series, nothing changes over time, so forecasting is useless.
Based on this, autocorrelation measures the linear relationship between lagged values of a time series, that is, how a series (like daily temperatures or stock prices) relates to itself over time. The "lag" is how many steps back you look: with a lag of 1, you check whether today's value is related to yesterday's value; with a lag of 2, to the value from two days ago; and so on. Autocorrelation coefficients are numbers that show the strength and direction of these relationships, one coefficient per lag.
Time series that show no autocorrelation are called white noise; for white noise we expect each autocorrelation to be close to zero (not exactly zero, but close).
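pandas can compute the lag-k autocorrelation directly via Series.autocorr. A minimal sketch on made-up data (values near zero are what we expect for white noise):

import numpy as np
import pandas as pd

# Lag-1 and lag-7 autocorrelation of a random, white-noise-like series.
s = pd.Series(np.random.randn(200))
print(s.autocorr(lag=1))  # close to zero for white noise
print(s.autocorr(lag=7))  # also close to zero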
First step: exploratory analysis
As a first step, we should explore the data by observing a graph.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create a range of times starting from '1/1/1', going forward hourly for 100 units.
longTimeRange = pd.date_range('1/1/1', periods=100, freq='H')
# Create a random number for each time unit as a Series.
aTimeSeries = pd.Series(np.random.randn(len(longTimeRange)), index=longTimeRange)
aTimeSeries
2001-01-01 00:00:00 -1.565657
2001-01-01 01:00:00 -0.562540
2001-01-01 02:00:00 -0.032664
2001-01-01 03:00:00 -0.929006
2001-01-01 04:00:00 -0.482573
2001-01-01 05:00:00 -0.036264
2001-01-01 06:00:00 1.095390
2001-01-01 07:00:00 0.980928
2001-01-01 08:00:00 -0.589488
2001-01-01 09:00:00 1.581700
…
2001-01-04 18:00:00 -0.734297
2001-01-04 19:00:00 -0.728505
2001-01-04 20:00:00 0.838775
2001-01-04 21:00:00 0.266893
2001-01-04 22:00:00 0.721194
2001-01-04 23:00:00 0.910983
2001-01-05 00:00:00 -1.020903
2001-01-05 01:00:00 -1.413416
2001-01-05 02:00:00 1.296608
2001-01-05 03:00:00 0.252275
Freq: H, Length: 100, dtype: float64
For time series data, the graph to start with is a time plot. Observations are plotted against the time of observation, with consecutive observations joined by straight lines:
# A time plot
aTimePlot = aTimeSeries.plot(style="-o", title="A Time Plot (Random Data)")
aTimePlot.set_ylabel("Values (No Units)")
aTimePlot.set_xlabel("Time (Days)")
tmp = aTimePlot.plot()
A seasonal plot is similar to a time plot except that the data is plotted against individual seasons, that is to say the same data is shown using a different time horizon or unit of time.
aTimeFrame = aTimeSeries.to_frame(name="value")  # convert series to data frame

# pivot_table is used to make each season its own column. And, plot.
# Recall the melt function. This is a complementary function for reshaping data.
aSeasonalFrame = aTimeFrame.pivot_table(values="value",
                                        aggfunc='sum',
                                        index=aTimeFrame.index.hour,   # index for units (could be aggregates)
                                        columns=aTimeFrame.index.day)  # index for creating seasons
aSeasonalFrame.columns.name = 'Seasons (Day #)'

aSeasonalPlot = aSeasonalFrame.plot(style="-o",
                                    title="A Seasonal Plot (Same Random Data as before)")
aSeasonalPlot.set_ylabel("Values (No Units)")
aSeasonalPlot.set_xlabel("Time (Hours)")
tmp = aSeasonalPlot.plot()
A lag plot is a scatter plot comparing time series data points against themselves with a fixed delay or sequence shift.
# pandas has a shorthand notation for constructing this type of plot.
aLagPlot = pd.plotting.lag_plot(series=aTimeSeries, lag=1)
aLagPlot.set_ylabel("Value at $t+1$ (No Units)")
aLagPlot.set_xlabel("Value at $t$ (No Units)")
aLagPlot.set_title("A Lag Plot with 1 period of lag (Same Random Data as before)")
None
White noise and autocorrelation
White noise is a type of time series data where each value is completely random and does not follow any predictable pattern from one value to the next. Imagine it as a series of random numbers that don’t have any relationship with each other.
To check if a time series is white noise, we use something called the autocorrelation function (ACF). The ACF measures how much each value in the series is related to previous values.
For white noise:
* Autocorrelations should be close to zero. This means that knowing one value in the series gives no useful information about other values.
* Most of the autocorrelation values should lie within certain limits. Specifically, for a series of length T, we expect about 95% of them to fall within ±2/√T; for a series of length 50, that is roughly ±0.28.
So, if the autocorrelations are mostly zero and within the expected range, the time series is considered white noise.
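Here is a minimal sketch checking that bound on simulated white noise (my own example; Series.autocorr computes the lag-k coefficient):

import numpy as np
import pandas as pd

T = 50
s = pd.Series(np.random.randn(T))
bound = 2 / np.sqrt(T)  # about 0.28 for T = 50

# Count how many of the first 20 lags fall inside the white-noise bound.
acfs = [s.autocorr(lag=k) for k in range(1, 21)]
inside = sum(abs(a) < bound for a in acfs)
print(f"{inside} of {len(acfs)} autocorrelations within ±{bound:.2f}")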
One way to generate white noise is via random data:
# Let's define and create a randomly generated stationary time series and look at the summary.
longTimeRange = pd.date_range('1/1/1', periods=30000, freq='H')
stationaryTimeSeries = pd.Series(np.random.randn(len(longTimeRange)), index=longTimeRange)

aWNTimePlot = stationaryTimeSeries.plot(style="-o", title="White Noise Time Plot")
aWNTimePlot.set_ylabel("Values (No Units)")
aWNTimePlot.set_xlabel("Time (Hours)")
None
The summary statistics of a stationary process are stable. What this means is that a subsample of the time series will yield virtually the same summary statistics:
# Print to see statistical summary
stationaryTimeSeries.describe()
count 30000.000000
mean -0.004257
std 1.000675
min -3.745356
25% -0.685543
50% -0.001547
75% 0.670007
max 3.961734
dtype: float64
# Print to see statistical summary
stationaryTimeSeries.sample(n=10000).describe()
count 10000.000000
mean 0.006428
std 1.000344
min -3.745356
25% -0.674085
50% 0.001151
75% 0.674976
max 3.961734
Look at the count of the second summary: it describes a 10,000-point sample of the first series.
Example Explained
1. Full White Noise Series Summary:
   - This shows the summary statistics for the entire time series with 30,000 data points.
   - Mean: -0.004257
   - Standard Deviation: 1.000675
   - Minimum: -3.745356
   - Maximum: 3.961734
   - And the other percentiles (25%, 50%, 75%)
2. White Noise Random-Subset Series Summary:
   - This shows the summary statistics for a smaller random sample of 10,000 data points taken from the same series.
   - Mean: 0.006428 (a bit different from the full series, but still close)
   - Standard Deviation: 1.000344 (similar to the full series)
   - Minimum and maximum values are the same as in the full series.
   - And the other percentiles
Key Point
Because the time series is stationary, the summary statistics from the full dataset and the random subset are very similar. This shows that the statistical properties of the series don’t change, confirming its stationary nature. It is impossible to do forecasting on white noise.
Lag and autocorrelation plots
Lag Plot
How do we know that the summaries will remain stable? We compare different lags of the time series. A skew in a lag plot would indicate correlation, violating stationarity.
"""
Scatterplots of different lags further affirm this conclusion.
As you can see, there is no linear relationship visible in
any of the lagged scatterplots.
"""
plt.subplot(3, 3, 1)
pd.plotting.lag_plot(series=stationaryTimeSeries, lag=100)
plt.subplot(3, 3, 3)
pd.plotting.lag_plot(series=stationaryTimeSeries, lag=1000)
plt.subplot(3, 3, 7)
pd.plotting.lag_plot(series=stationaryTimeSeries, lag=10)
plt.subplot(3, 3, 9)
pd.plotting.lag_plot(series=stationaryTimeSeries, lag=10000)
plt.suptitle('White Noise Lag Plots for Lags of Different Orders of Magnitude.')
plt.show()
Autocorrelation Plot
By plotting the correlation for every lag period, we are able to solve the problem of detecting portions of the time series violating stationarity.
"""
Actual calculation of correlation for each lag value.
Looking at the prior graph, we can affirm that the observed series is white noise.
"""
aWNTimePlot = pd.plotting.autocorrelation_plot(stationaryTimeSeries)
aWNTimePlot.set_title("An Autocorrelation Graph of White Noise")
aWNTimePlot.set_ylabel("Correlation")
aWNTimePlot.set_xlabel("Lag (Hours)")
None
Since all lag periods have a correlation of virtually zero, we know for sure our white noise is stationary.
In our example, we knew we were dealing with white noise. In practice, we would normally start our analysis with an autocorrelation plot to find non-stationary subsets; we could then examine those with lag plots to find further patterns.
Here is how to read an autocorrelation plot:
- The horizontal axis is the lag. Remember that autocorrelation compares the time series with a shifted copy of itself; at a lag of 25 days, for example, each value is correlated with the value from 25 days earlier.
- -1 means the series and its shifted copy are completely negatively correlated.
- 1 means the two are totally positively correlated.
- 0 means no obvious pattern.
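For contrast with the white-noise plot above, here is a sketch (made-up seasonal data, not from the module) where the autocorrelation plot shows a clear repeating pattern, peaking at multiples of the 24-hour season:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A strongly seasonal series: a 24-hour sine wave plus a little noise.
hours = pd.date_range('2001-01-01', periods=500, freq='H')
t = np.arange(len(hours))
seasonalSeries = pd.Series(np.sin(2 * np.pi * t / 24) + 0.2 * np.random.randn(len(t)),
                           index=hours)

aSeasonalACFPlot = pd.plotting.autocorrelation_plot(seasonalSeries)
aSeasonalACFPlot.set_title('Autocorrelation of a Seasonal Series (peaks every 24 lags)')
plt.show()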
Forecasting methods: average, naïve, seasonal naïve, and drift
How to calculate Drift
Now that we know about data summaries and assumptions, we can discuss how to forecast. Here we will cover different methods and approaches to predict the next item in a series.
- Average method
This method uses the average of a data series for forecasting. The forecasts of all future values are equal to the mean/average of the historical data.
timerange = pd.date_range('7/7/7', periods=70, freq='H')  # Fixed frequency of hours
randomTimeSeries = pd.Series(np.random.randn(len(timerange)), index=timerange) + 10

# Average forecast
randomTimeSeries.mean()
output:
10.085853406843622
Formally, given observations y_1, ..., y_T, the average-method forecast is ŷ_{T+h|T} = (y_1 + ... + y_T) / T, where T is the number of time periods spanned and h is the number of periods ahead being predicted.
- Naïve method
This is only appropriate for time series data. All forecasts are simply set to be the value of the last observation. This method works well for economic and financial time series.
# Need to state the amount to grab by stating the unit time
# Naïve forecast
randomTimeSeries.last(offset="H")
output:
2007-07-09 21:00:00 8.87057
Freq: H, dtype: float64
- Seasonal naïve method
A similar method also exists for highly seasonal data. In this case, we set each forecast to be equal to the last observed value from the same season of the year. Thus, we are only looking at the last periodic interval rather than the last value of the time series.
'''
Technically, the last season of a time series is just grabbing values for another (larger)
frequency unit, but then resampling down to the original frequency unit.
In this case, let our season be every 3-hour unit in a day.
This is a dummy example, but a longer hourly TimeSeries could have been offset using days as the
frequency of a season, i.e., "D".
'''

# Seasonal forecast
randomTimeSeries.last(offset="3H")
2007-07-09 19:00:00 11.960911
2007-07-09 20:00:00 9.088409
2007-07-09 21:00:00 8.870570
Freq: H, dtype: float64
- Drift method
This method is a variation on the naïve method where we extrapolate the trend by drawing a line through the first and last observations. The amount of change over time (called the drift) is set to the average change seen in the historical data.
Unlike the prior methods, this estimates growth rather than a future value directly. The growth estimate is then applied to the last known value to obtain a forecast: with observations y_1, ..., y_T, the forecast h periods ahead is ŷ_{T+h|T} = y_T + h * (y_T - y_1) / (T - 1).
h = 1
numerator = randomTimeSeries.last("H").iloc[0] - randomTimeSeries.first("H").iloc[0]
denominator = np.size(randomTimeSeries) - 1
trendSlope = numerator / denominator

# We shift to only show valid forecasts for respective time ranges.
# Then we grab the last value, our forecast.
# Try playing around with h.

# Let's show the forecast
(randomTimeSeries + h * trendSlope).shift(h).last(offset="H")
2007-07-09 21:00:00 9.037562
Freq: H, dtype: float64
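The shift-based display above only shows forecasts inside the existing index. As an alternative sketch (my own, reusing trendSlope and the same drift definition), the h-step forecast can be placed on an explicit future index:

# Drift forecasts for h steps past the end of the series.
h = 5
lastValue = randomTimeSeries.iloc[-1]
futureIndex = pd.date_range(randomTimeSeries.index[-1], periods=h + 1, freq='H')[1:]
driftForecast = pd.Series(lastValue + trendSlope * np.arange(1, h + 1), index=futureIndex)
print(driftForecast)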
Adjustments and transformation: their purpose, and terms such as calendar, population, and inflation
Adjustments are usually applied prior to transformations and are meant to normalize the data for usability with other data sources:
- Calendar adjustments address variation in seasonal data that is due to simple calendar effects (e.g., months do not have the same number of days). In such cases, it is easier to remove the variation before fitting a forecasting model; a small sketch follows below.
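As a sketch of such an adjustment (hypothetical monthly data, not the module's dataset), dividing monthly totals by the number of days in each month removes month-length effects so they cannot masquerade as seasonality:

import numpy as np
import pandas as pd

# Hypothetical monthly totals over two years.
months = pd.date_range('2001-01-31', periods=24, freq='M')
monthlyTotals = pd.Series(np.random.randn(24) * 10 + 300, index=months)

# Average per day instead of total per month (via DatetimeIndex.days_in_month).
perDay = monthlyTotals / monthlyTotals.index.days_in_month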
# Download the dataset
csvUrl = "https://archive.ics.uci.edu/ml/machine-learning-databases/00409/Daily_Demand_Forecasting_Orders.csv"
## Uncomment if cell is broken. NOTE: The source will block you if you download too often.

# Read the dataset into a pandas DataFrame
df = pd.read_csv(csvUrl, sep=";")

# Fixed frequency of business days. Arbitrary date.
timerange = pd.date_range('1/10/0', periods=len(df.iloc[:, 12]), freq='B')
demandSeries = pd.Series(np.array(df.iloc[:, 12]), index=timerange)
# Plotting data
plt.subplot(3, 1, 1)
# Note how the cycling is incorrectly represented as seasonality.
demandPlot = demandSeries.asfreq(freq='30T', method='pad').plot(style="-o")
demandPlot.set_ylabel("Orders (#)")
demandPlot.set_xlabel("Time (Business Days)")
demandPlot.plot()

plt.subplot(3, 1, 2)
demandPlot2 = demandSeries.plot(style="-o")
demandPlot2.set_ylabel("Orders (#)")
demandPlot2.set_xlabel("Time (Business Days)")
demandPlot2.plot()

plt.subplot(3, 1, 3)
# Note how more of the trend is coming through.
demandPlot3 = demandSeries.asfreq(freq='M', method='pad').plot(style="-o")
demandPlot3.set_ylabel("Orders (#)")
demandPlot3.set_xlabel("Time (from top to bottom: every 30 minutes / business day / month)")
demandPlot3.plot()

plt.suptitle('# of Total Orders vs. Time')
plt.show()
Here the same data is displayed in three different ways: at 30-minute intervals, by business day, and by month. The trend only starts to become evident in the last one, by month; the shorter periods, however, show more of the seasonality.
- Population: any data affected by population changes can be adjusted to give per-capita data. That is, consider the data per 1,000 people, for example, rather than the total.
# Get GDP data
csvUrl = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
csvFile = download_file(csvUrl)  # helper assumed defined earlier in the course materials
csvFile = "gdp.csv"
df = pd.read_csv(csvFile)
canDFgdp = df[df["Country Name"] == "Canada"]

# Get population data
csvUrl = "https://raw.githubusercontent.com/datasets/population/master/data/population.csv"
csvFile = download_file(csvUrl)
csvFile = "population.csv"
df2 = pd.read_csv(csvFile)
canDFpop = df2[df2["Country Name"] == "Canada"]

# Read time correctly (work on copies to avoid chained-assignment warnings)
canDFgdp = canDFgdp.copy()
canDFpop = canDFpop.copy()
canDFgdp["Year"] = pd.to_datetime(canDFgdp["Year"], format='%Y')
canDFpop["Year"] = pd.to_datetime(canDFpop["Year"], format='%Y')

# Calculate GDP per person
canDFgdpPerPerson = canDFgdp.merge(canDFpop, on="Year")
canDFgdpPerPerson["GDP Per Person (USD)"] = canDFgdpPerPerson["Value_x"].divide(canDFgdpPerPerson["Value_y"])
canDFgdpPerPerson["GDP (USD)"] = canDFgdpPerPerson["Value_x"]
canDFgdpPerPerson["Population Size (# of people)"] = canDFgdpPerPerson["Value_y"]
# Plot
# GDP
canDFgdpPerPersonPlot = canDFgdpPerPerson.plot(x="Year", y="GDP (USD)")
canDFgdpPerPersonPlot.set_title("GDP vs. Time")
canDFgdpPerPersonPlot.set_ylabel("Dollars (USD)")
canDFgdpPerPersonPlot.set_xlabel("Time (Years)")
canDFgdpPerPersonPlot.plot()

# Population
canDFgdpPerPersonPlot = canDFgdpPerPerson.plot(x="Year", y="Population Size (# of people)")
canDFgdpPerPersonPlot.set_title("Population vs. Time")
canDFgdpPerPersonPlot.set_ylabel("Population Size (# of people)")
canDFgdpPerPersonPlot.set_xlabel("Time (Years)")
canDFgdpPerPersonPlot.plot()

# GDP per person
# NOTE: Pay attention to the scales (top-left corner of plot), and not just the y-values
canDFgdpPerPersonPlot = canDFgdpPerPerson.plot(x="Year", y="GDP Per Person (USD)")
canDFgdpPerPersonPlot.set_title("GDP Per Person vs. Time")
canDFgdpPerPersonPlot.set_ylabel("GDP Per Person (USD)")
canDFgdpPerPersonPlot.set_xlabel("Time (Years)")
canDFgdpPerPersonPlot.plot()
The point is to use multiplication or division to obtain comparable data or appropriate units, which pandas allows.
Transformations
When the standard deviation fluctuates across a time series, the series can be difficult to decompose. A transformation might be useful in this situation; a common choice is the logarithmic transformation:
# Applying a log transformation
logTimeSeries = np.log(randomTimeSeries)

randomPlot = randomTimeSeries.plot(title="Random Time Series")
randomPlot.set_xlabel("Time (hours)")
randomPlot.set_ylabel("Values (No Unit)")

randomLogPlot = logTimeSeries.plot(title="Logarithm-Transformed Random Time Series")
randomLogPlot.set_xlabel("Time (hours)")
randomLogPlot.set_ylabel("Logarithmic Values (No Unit)")
Changes in a log value correspond to relative (percentage) changes on the original scale.
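A quick numerical check of that claim (a sketch reusing randomTimeSeries from earlier): differences of the log series approximate the percentage changes of the original series.

# Differences of logs ≈ percent changes on the original scale (for small changes).
logDiff = np.log(randomTimeSeries).diff()
pctChange = randomTimeSeries.pct_change()
print(logDiff.head())
print(pctChange.head())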