Module 9: Time Series Flashcards
What is a time series?
A time series is a set of observations taken at different points in time. It can have a fixed frequency of measurement points (e.g., monthly) or an irregular frequency (data points recorded whenever data is available).
The time dimension in pandas can be expressed in multiple ways:
Marking method    Example
Timestamps        December 13, 2017 at 11:22 EST
Fixed periods     monthly
Intervals         2015-04-03 03:12 to 2015-04-14 11:11
Elapsed time      45 mins. 32:05 secs.
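As a rough illustration of the four marking methods above, the sketch below maps each one to a pandas object. The variable names are made up for this example, and the timestamp is timezone-naive for simplicity (the table's example carries an EST zone).

```python
import pandas as pd

# A single point in time (timezone-naive here; an assumption for brevity)
stamp = pd.Timestamp("2017-12-13 11:22")

# A fixed period: the whole month of December 2017
month = pd.Period("2017-12", freq="M")

# An interval bounded by two timestamps
interval = pd.Interval(pd.Timestamp("2015-04-03 03:12"),
                       pd.Timestamp("2015-04-14 11:11"))

# Elapsed time, independent of any calendar date
elapsed = pd.Timedelta(minutes=45)

print(stamp)                                   # 2017-12-13 11:22:00
print(month.start_time.date(), month.end_time.date())
print(stamp in interval)                       # False: the stamp is outside the interval
print(stamp + elapsed)                         # 2017-12-13 12:07:00
```

Timestamps and elapsed times can be added together, while periods and intervals describe spans that a timestamp either falls inside or not.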
In pandas, time ranges can be used as an index: if a Series is created with an index built from a list of datetime objects, it becomes a time series.
import numpy as np
import pandas as pd
"""
Creating a datetime index.
start : str or datetime-like, optional (left bound for generating dates)
end : str or datetime-like, optional (right bound for generating dates)
periods : integer, optional (number of periods to generate)
freq : str or DateOffset, default 'D' (calendar daily)
Frequency strings can have multiples, e.g. '5H'.
"""
# Note: pandas 2.2+ prefers the lowercase alias 'h'; the uppercase 'H' is deprecated there.
timerange = pd.date_range('7/7/7', periods=7, freq='H')  # Fixed frequency of hours
timerange
DatetimeIndex(['2007-07-07 00:00:00', '2007-07-07 01:00:00',
               '2007-07-07 02:00:00', '2007-07-07 03:00:00',
               '2007-07-07 04:00:00', '2007-07-07 05:00:00',
               '2007-07-07 06:00:00'],
              dtype='datetime64[ns]', freq='H')
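The parameters listed in the docstring above can be combined in more than one way. The sketch below, with illustrative variable names, builds the same index once from start and end and once from start and periods, using a frequency multiple. It uses the lowercase 'h' alias, which should work across pandas versions.

```python
import pandas as pd

# Bounded on both sides: every 5 hours from start up to (and not beyond) end
by_end = pd.date_range(start="2007-07-07", end="2007-07-08", freq="5h")

# Bounded on the left only: exactly 5 entries, 5 hours apart
by_periods = pd.date_range(start="2007-07-07", periods=5, freq="5h")

print(by_end)
print(by_end.equals(by_periods))  # True: both describe the same 5 timestamps
```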
Creating an arbitrary time series with random numbers
randomTimeSeries = pd.Series(np.random.randn(len(timerange)), index=timerange) + 1
randomTimeSeries
2007-07-07 00:00:00 0.795292
2007-07-07 01:00:00 1.478943
2007-07-07 02:00:00 0.480561
2007-07-07 03:00:00 0.444270
2007-07-07 04:00:00 2.965781
2007-07-07 05:00:00 2.393406
2007-07-07 06:00:00 1.092908
Freq: H, dtype: float64
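The index does not have to come from pd.date_range. As a minimal sketch with made-up values, a plain list of datetime objects is enough for pandas to build a DatetimeIndex, although no frequency is attached when the index is built this way.

```python
from datetime import datetime

import pandas as pd

# Irregularly spaced observations (no reading at 02:00)
dates = [datetime(2007, 7, 7, 0), datetime(2007, 7, 7, 1), datetime(2007, 7, 7, 3)]
ts = pd.Series([1.0, 2.0, 3.0], index=dates)

print(type(ts.index).__name__)  # DatetimeIndex
print(ts.index.freq)            # None: no frequency is set on this index
print(ts.loc[pd.Timestamp("2007-07-07 01:00")])
```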
Viewing time with different granularities
Unlike other index types, a unit of measure is not implicitly associated with the index.
Time can be viewed at different granularities (e.g., weeks, days, hours). It is important to know how to transform time scales in order to align data points.
You can easily convert into different frequencies
# asfreq() provides us with easy conversion methods to express our index with different frequencies.
"""
method : {'backfill'/'bfill', 'pad'/'ffill'}, default None:
Method to use for filling holes when switching frequencies (note this does not fill NaNs that were already present)
'pad' / 'ffill': use LAST valid observation to fill (propagate) forward to next valid observation
'backfill' / 'bfill': use NEXT valid observation to fill (propagate) backwards to last valid observation
"""
randomTimeSeries.asfreq(freq='30Min')
2007-07-07 00:00:00 0.795292
2007-07-07 00:30:00 NaN
2007-07-07 01:00:00 1.478943
2007-07-07 01:30:00 NaN
2007-07-07 02:00:00 0.480561
2007-07-07 02:30:00 NaN
2007-07-07 03:00:00 0.444270
2007-07-07 03:30:00 NaN
2007-07-07 04:00:00 2.965781
2007-07-07 04:30:00 NaN
2007-07-07 05:00:00 2.393406
2007-07-07 05:30:00 NaN
2007-07-07 06:00:00 1.092908
Freq: 30T, dtype: float64
You can add value repetition forward (method='pad')
# You can also fill backwards using method='bfill'
# We can get a time series with or without value repetition forward or backward
# NOTE: This suits values that are continuous rather than discrete bursts
randomTimeSeries.asfreq(freq='30Min', method='pad')
2007-07-07 00:00:00 0.795292
2007-07-07 00:30:00 0.795292
2007-07-07 01:00:00 1.478943
2007-07-07 01:30:00 1.478943
2007-07-07 02:00:00 0.480561
2007-07-07 02:30:00 0.480561
2007-07-07 03:00:00 0.444270
2007-07-07 03:30:00 0.444270
2007-07-07 04:00:00 2.965781
2007-07-07 04:30:00 2.965781
2007-07-07 05:00:00 2.393406
2007-07-07 05:30:00 2.393406
2007-07-07 06:00:00 1.092908
Freq: 30T, dtype: float64
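Because the random values above make the fill direction hard to see, here is a small sketch with fixed, made-up values that compares no filling, forward filling, and backward filling side by side.

```python
import pandas as pd

hourly = pd.Series([10.0, 20.0, 30.0],
                   index=pd.date_range("2007-07-07", periods=3, freq="h"))

no_fill = hourly.asfreq("30min")                    # new half-hour slots are NaN
forward = hourly.asfreq("30min", method="ffill")    # 10, 10, 20, 20, 30
backward = hourly.asfreq("30min", method="bfill")   # 10, 20, 20, 30, 30

print(pd.DataFrame({"none": no_fill, "ffill": forward, "bfill": backward}))
```

Forward filling repeats the last known value, while backward filling pulls the next known value into the gap.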
Performing arithmetic on time series
We can also perform arithmetic on the values. Index selection and subsetting work the way we have seen for DataFrames, because a datetime index is a type of pandas index and supports the regular index methods.
For example, we can apply simple operations to the time series, as shown below.
# Time series behave like NumPy ndarrays, Series, and DataFrames
# (i.e., operations apply element-wise when selecting and transforming).
2 * randomTimeSeries
2007-07-07 00:00:00 1.590585
2007-07-07 01:00:00 2.957887
2007-07-07 02:00:00 0.961123
2007-07-07 03:00:00 0.888539
2007-07-07 04:00:00 5.931561
2007-07-07 05:00:00 4.786812
2007-07-07 06:00:00 2.185816
Freq: H, dtype: float64
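Selection and subsetting can be sketched the same way. The example below uses made-up values (0 through 6) so the results are easy to follow; it shows label-based slicing with date strings and a boolean mask combined with arithmetic.

```python
import pandas as pd

idx = pd.date_range("2007-07-07", periods=7, freq="h")
ts = pd.Series(range(7), index=idx, dtype="float64")

# Label-based slicing with date strings includes both endpoints
window = ts.loc["2007-07-07 02:00":"2007-07-07 04:00"]

# Boolean masks and element-wise arithmetic work as on any Series
doubled_large = 2 * ts[ts > 3]

print(window)
print(doubled_large)
```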
But changing frequencies to smaller periods adds a lot of NaN values (when no value padding is used)
# So be mindful. The values might not transform the way you expect
# (i.e., NaN + 3 = NaN and not 3).
# Arithmetic also aligns on index labels: a timestamp present in only one operand yields NaN,
# even when both series were padded.
randomTimeSeries.asfreq(freq='30Min', method='pad') + randomTimeSeries.asfreq(freq='50Min', method='pad')
2007-07-07 00:00:00 1.590585
2007-07-07 00:30:00 NaN
2007-07-07 00:50:00 NaN
2007-07-07 01:00:00 NaN
2007-07-07 01:30:00 NaN
2007-07-07 01:40:00 NaN
2007-07-07 02:00:00 NaN
2007-07-07 02:30:00 0.961123
2007-07-07 03:00:00 NaN
2007-07-07 03:20:00 NaN
2007-07-07 03:30:00 NaN
2007-07-07 04:00:00 NaN
2007-07-07 04:10:00 NaN
2007-07-07 04:30:00 NaN
2007-07-07 05:00:00 4.786812
2007-07-07 05:30:00 NaN
2007-07-07 05:50:00 NaN
2007-07-07 06:00:00 NaN
dtype: float64
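If NaN results from misaligned timestamps are not wanted, there are a couple of possible workarounds. The sketch below, using small made-up series, shows Series.add with fill_value (which treats a missing operand as 0) and reindexing one series onto the other's grid before adding. Which one is appropriate depends on what the data means.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2007-07-07 00:00", periods=3, freq="30min"))
b = pd.Series([10.0, 20.0],
              index=pd.date_range("2007-07-07 00:00", periods=2, freq="60min"))

# Plain addition: 00:30 exists only in `a`, so the result there is NaN
plain = a + b

# Treat the missing operand as 0 instead of NaN
filled = a.add(b, fill_value=0)

# Or put `b` on the same grid as `a` first, carrying values forward
aligned = a + b.reindex(a.index, method="ffill")

print(pd.DataFrame({"plain": plain, "filled": filled, "aligned": aligned}))
```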
Finally, index entries can be duplicated, just like in any other index
newTimerange = timerange.append(timerange)
newTimeSeries = pd.Series(np.random.randn(len(newTimerange)), index=newTimerange)
newTimeSeries
2007-07-07 00:00:00 0.281746
2007-07-07 01:00:00 0.769023
2007-07-07 02:00:00 1.246435
2007-07-07 03:00:00 1.007189
2007-07-07 04:00:00 -1.296221
2007-07-07 05:00:00 0.274992
2007-07-07 06:00:00 0.228913
2007-07-07 00:00:00 1.352917
2007-07-07 01:00:00 0.886429
2007-07-07 02:00:00 -2.001637
2007-07-07 03:00:00 -0.371843
2007-07-07 04:00:00 1.669025
2007-07-07 05:00:00 -0.438570
2007-07-07 06:00:00 -0.539741
dtype: float64
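Duplicate timestamps change how lookups behave. As a minimal sketch with made-up values, the example below checks for duplicates, shows that selecting a duplicated timestamp returns several rows, and collapses the duplicates by averaging (one possible choice among several).

```python
import pandas as pd

idx = pd.date_range("2007-07-07", periods=3, freq="h")
dup = pd.Series([1.0, 2.0, 3.0, 5.0, 6.0, 7.0], index=idx.append(idx))

print(dup.index.is_unique)   # False

# Selecting a duplicated timestamp returns every matching row
both = dup.loc[idx[0]]

# One way to get back to a unique index: average the duplicates
collapsed = dup.groupby(level=0).mean()

print(both)
print(collapsed)
```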
However, unlike regular indexes, time ranges can be relabeled by shifting dates (i.e., adding a time offset) or by resampling (i.e., reconstructing a time series from itself).
Shifting time values with timedelta, without changing the index structure.
# Pay attention to the freq value. All values are shifted, but the frequency is still the same.
timerange + pd.Timedelta("3Min")
DatetimeIndex(['2007-07-07 00:03:00', '2007-07-07 01:03:00',
               '2007-07-07 02:03:00', '2007-07-07 03:03:00',
               '2007-07-07 04:03:00', '2007-07-07 05:03:00',
               '2007-07-07 06:03:00'],
              dtype='datetime64[ns]', freq='H')
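The same relabeling can be applied to a whole Series, and resampling can be sketched alongside it. The example below uses made-up values; the 2-hour mean is just one possible way to resample.

```python
import pandas as pd

idx = pd.date_range("2007-07-07", periods=4, freq="h")
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Relabel: move every timestamp 3 minutes later; the values are untouched
relabeled = ts.copy()
relabeled.index = ts.index + pd.Timedelta("3min")

# The same relabeling using Series.shift with a freq argument
shifted = ts.shift(3, freq="min")

# Resample: rebuild the series on a coarser 2-hour grid, averaging each bin
two_hourly = ts.resample("2h").mean()

print(relabeled)
print(two_hourly)
```

Shifting keeps one value per original observation, whereas resampling produces a new series whose length depends on the target frequency.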