Pandas Time Series Flashcards
from datetime import datetime
datetime(year=2015, month=7, day=4)
manually build a date using the datetime type
datetime.datetime(2015, 7, 4, 0, 0)
from dateutil import parser
date = parser.parse("4th of July, 2015")
date
using the dateutil module, you can parse dates from a variety of string formats
date.strftime('%A')
Once you have a datetime object, you can do things like printing the day of the week:
datetime(year=1976, month=9, day=13).strftime('%A' + ' %B')
'Monday September'
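As a small illustration of the formatting directives used above, the sketch below formats a date with strftime and parses one back with strptime. The variable names are chosen here for illustration only, and the weekday and month names assume the default (English) locale.

```python
from datetime import datetime

# Build a date by hand and format it with strftime directives.
d = datetime(year=2015, month=7, day=4)
weekday = d.strftime("%A")           # full weekday name
label = d.strftime("%A %B %d, %Y")   # weekday, month name, day, year

# strptime is the inverse: it parses a string using the same directives.
parsed = datetime.strptime("2015-07-04", "%Y-%m-%d")
```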
import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date
The NumPy team added a set of native time series data types to NumPy: the datetime64 dtype encodes dates as 64-bit integers
numpy datetime
date + np.arange(12)
Once we have this date formatted, however, we can quickly do vectorized operations on it
np.datetime64('2015-07-04 12:00')
Here is a minute-based datetime
NumPy will infer the desired unit from the input
np.datetime64('2015-07-04 12:59:59.50', 'ns')
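To make the unit inference concrete, this short sketch inspects the dtype NumPy picks for each input. The variable names are illustrative only.

```python
import numpy as np

day = np.datetime64("2015-07-04")                      # day resolution inferred
minute = np.datetime64("2015-07-04 12:00")             # minute resolution inferred
nano = np.datetime64("2015-07-04 12:59:59.50", "ns")   # unit forced to nanoseconds

units = [str(x.dtype) for x in (day, minute, nano)]

# Integer arithmetic is interpreted in the value's own unit (days here).
next_days = day + np.arange(3)
```

Because the unit is part of the dtype, adding integers to a day-based value steps through consecutive days.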
Code  Meaning      Time span (relative)  Time span (absolute)
Y     Year         ± 9.2e18 years        [9.2e18 BC, 9.2e18 AD]
M     Month        ± 7.6e17 years        [7.6e17 BC, 7.6e17 AD]
W     Week         ± 1.7e17 years        [1.7e17 BC, 1.7e17 AD]
D     Day          ± 2.5e16 years        [2.5e16 BC, 2.5e16 AD]
h     Hour         ± 1.0e15 years        [1.0e15 BC, 1.0e15 AD]
m     Minute       ± 1.7e13 years        [1.7e13 BC, 1.7e13 AD]
s     Second       ± 2.9e11 years        [2.9e11 BC, 2.9e11 AD]
ms    Millisecond  ± 2.9e8 years         [2.9e8 BC, 2.9e8 AD]
The following table, drawn from the NumPy datetime64 documentation, lists the available format codes along with the relative and absolute timespans that they can encode
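The spans in such a table follow from the 64-bit integer storage: a signed 64-bit counter covers about 2**63 ticks on either side of the epoch. The sketch below is a rough back-of-the-envelope estimate, not NumPy's own computation; the helper name span_in_years and the Julian-year constant are assumptions made for illustration.

```python
# Signed 64-bit integers cover roughly +/- 2**63 ticks of the chosen unit.
TICKS = 2**63
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, used only as an estimate

def span_in_years(seconds_per_tick):
    """Approximate one-sided time span, in years, for a given tick length."""
    return TICKS * seconds_per_tick / SECONDS_PER_YEAR

span_day = span_in_years(24 * 3600)   # 'D' unit, about 2.5e16 years
span_ns = span_in_years(1e-9)         # 'ns' unit, about 292 years
```

The finer the unit, the shorter the range: nanosecond resolution, which Pandas uses by default, spans only a few centuries.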
Pandas TIMESTAMP
import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date
Timestamp('2015-07-04 00:00:00')
NumPy-style vectorized operations can be applied directly to this Pandas object
date + pd.to_timedelta(np.arange(12), 'D')
DatetimeIndex(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
               '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
               '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
              dtype='datetime64[ns]', freq=None)
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data
Where the Pandas time series tools really become useful is when you begin to index data by timestamps
data['2014-07-04':'2015-07-04']
data['2015']
With a time-indexed Series, we can use any of the usual Series indexing patterns, passing values that can be coerced into dates:
Additional date-only indexing operations are also available, such as passing a year to obtain a slice of all data from that year:
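The two indexing patterns can be checked on the same four-element Series. This sketch uses .loc for both selections, which is an editorial choice for explicitness rather than something the notes require.

```python
import pandas as pd

index = pd.DatetimeIndex(["2014-07-04", "2014-08-04", "2015-07-04", "2015-08-04"])
data = pd.Series([0, 1, 2, 3], index=index)

# Slicing with date strings includes both endpoints.
span = data.loc["2014-07-04":"2015-07-04"]

# A bare year selects every entry that falls within that year.
year_2015 = data.loc["2015"]
```

Note that, unlike positional slicing, the date slice keeps the entry at the end label.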
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                        '2015-Jul-6', '07-07-2015', '20150708'])
dates
passing a series of dates by default yields a DatetimeIndex
dates.to_period('D')
Any DatetimeIndex can be converted to a PeriodIndex with the to_period() function with the addition of a frequency code; here we’ll use ‘D’ to indicate daily frequency:
dates - dates[0]
A TimedeltaIndex is created, for example, when a date is subtracted from another:
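The three index types mentioned in these cards can be seen side by side in a minimal sketch. The dates here are written in a single ISO format to keep the example independent of how mixed formats are parsed.

```python
import pandas as pd

dates = pd.to_datetime(["2015-07-03", "2015-07-04", "2015-07-06"])

periods = dates.to_period("D")   # PeriodIndex with daily frequency
deltas = dates - dates[0]        # TimedeltaIndex of offsets from the first date
```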
pd.date_range('2015-07-03', '2015-07-10')
pd.date_range('2015-07-03', periods=8)
pd.date_range('2015-07-03', periods=8, freq='H')
pd.date_range() accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day:
Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods:
The spacing can be modified by altering the freq argument, which defaults to D. For example, here we will construct a range of hourly timestamps:
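The three ways of calling pd.date_range() described here can be compared directly. The sketch spells the hourly code as lowercase h, which recent pandas releases prefer over H; that spelling is an assumption about the installed version, not part of the notes.

```python
import pandas as pd

by_end = pd.date_range("2015-07-03", "2015-07-10")          # start and end, daily
by_count = pd.date_range("2015-07-03", periods=8)           # start and length, daily
hourly = pd.date_range("2015-07-03", periods=8, freq="h")   # hourly spacing

# Both daily ranges describe the same eight dates.
same = by_end.equals(by_count)
```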
pd.period_range('2015-07', periods=8, freq='M')
pd.timedelta_range(0, periods=10, freq='H')
To create regular sequences of Period or Timedelta values, the very similar pd.period_range() and pd.timedelta_range() functions are useful.
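A quick check of what these two functions return, as a sketch. The start is written as the string "0h" and the hourly code as lowercase h; both are spelling choices made here, not taken from the notes.

```python
import pandas as pd

months = pd.period_range("2015-07", periods=8, freq="M")       # monthly periods
hours = pd.timedelta_range(start="0h", periods=10, freq="h")   # hourly durations
```

The first yields eight consecutive monthly periods, the second ten durations from zero to nine hours.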
Code Description Code Description
D Calendar day B Business day
W Weekly
M Month end BM Business month end
Q Quarter end BQ Business quarter end
A Year end BA Business year end
H Hours BH Business hours
T Minutes
S Seconds
L Milliseconds
U Microseconds
N Nanoseconds
Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we saw the D (day) and H (hour) codes above, we can use such codes to specify any desired frequency spacing. The following table summarizes the main codes available:
Code Description Code Description
MS Month start BMS Business month start
QS Quarter start BQS Business quarter start
AS Year start BAS Business year start
The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. Adding an S suffix to any of these marks them at the beginning instead:
Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:
Q-JAN, BQ-FEB, QS-MAR, BQS-APR, etc.
A-JAN, BA-FEB, AS-MAR, BAS-APR, etc.
In the same way, the split-point of the weekly frequency can be modified by adding a three-letter weekday code:
W-SUN, W-MON, W-TUE, W-WED, etc.
3 letter codes
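The effect of the three-letter suffixes can be seen in a small sketch. It uses W-MON and QS-MAR, two of the codes listed above; the start dates are arbitrary.

```python
import pandas as pd

# Weekly frequency anchored on Mondays instead of the default Sundays.
mondays = pd.date_range("2015-07-01", periods=3, freq="W-MON")

# Quarter starts, anchored so that one quarter begins in March.
quarters = pd.date_range("2015-01-01", periods=4, freq="QS-MAR")
```

Since neither start date falls on the anchored grid, each range begins at the first matching date on or after it.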
pd.timedelta_range(0, periods=9, freq="2H30T")
On top of this, codes can be combined with numbers to specify other frequencies. For example, for a frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes as follows:
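The same combined frequency can be checked in a short sketch. It spells the codes as h and min, the forms recent pandas releases prefer over H and T; that choice is an assumption about the installed version.

```python
import pandas as pd

# Four steps spaced 2 hours 30 minutes apart, starting from zero.
steps = pd.timedelta_range(start="0h", periods=4, freq="2h30min")
```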
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())
All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the pd.tseries.offsets module. For example, we can create a business day offset directly as follows:
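Offset objects can also be added to individual timestamps, as in this sketch. The chosen date, 3 July 2015, is a Friday, so the effect of skipping the weekend is visible.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

friday = pd.Timestamp("2015-07-03")
next_business_day = friday + BDay()    # skips Saturday and Sunday
three_later = friday + BDay(3)         # three business days ahead
```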
The accompanying pandas-datareader package (installable via conda install pandas-datareader) knows how to import financial data from a number of available sources, including Yahoo Finance, Google Finance, and others. Here we will load Google's closing price history:
from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016',
                       data_source='yahoo')
goog.head()
goog = goog['Close']  # keep only the closing price column
datareader
import matplotlib.pyplot as plt
goog.plot(alpha=0.5, style='-')
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
           loc='upper left');
At each point, resample() reports the average of the previous year, while asfreq() reports the value at the end of the year: resample() is a data aggregation, asfreq() a data selection.
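The aggregation-versus-selection distinction can be reproduced without downloading any data. The sketch below uses a synthetic daily series and weekly (W) frequency in place of stock prices and business-year frequency; both substitutions are assumptions made to keep the example small.

```python
import pandas as pd

# Two full weeks of daily values, Monday 2015-07-06 through Sunday 2015-07-19.
idx = pd.date_range("2015-07-06", periods=14, freq="D")
s = pd.Series(range(14), index=idx)

weekly_mean = s.resample("W").mean()  # aggregate: average over each week
weekly_pick = s.asfreq("W")           # select: value at each week end (Sunday)
```

Here the weekly means are 3 and 10, while asfreq simply picks the Sunday values 6 and 13.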
fig, ax = plt.subplots(2, sharex=True)
data = goog.iloc[:10]
data.asfreq('D').plot(ax=ax[0], marker='o')
data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')
data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')
ax[1].legend(["back-fill", "forward-fill"]);
For up-sampling, resample() and asfreq() are largely equivalent, though resample() has many more options available. In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values. Just as with the fillna() method discussed previously, asfreq() accepts a method argument to specify how values are imputed. Here, we will resample the business day data at a daily frequency (i.e., including weekends):
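The three up-sampling behaviours can be compared on a two-point series. The prices below are hypothetical values on a Friday and the following Monday, chosen so that the weekend gap is the only thing being filled.

```python
import pandas as pd

# Hypothetical prices on a Friday and the following Monday.
idx = pd.to_datetime(["2015-07-03", "2015-07-06"])
prices = pd.Series([1.0, 2.0], index=idx)

empty = prices.asfreq("D")                     # weekend rows are NaN
forward = prices.asfreq("D", method="ffill")   # carry Friday's value forward
backward = prices.asfreq("D", method="bfill")  # pull Monday's value back
```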
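fig, ax = plt.subplots(3, sharey=True)
# apply a frequency to the data
goog = goog.asfreq('D', method='pad')
goog.plot(ax=ax[0])
goog.shift(900).plot(ax=ax[1])
# tshift() was removed in pandas 2.0; use goog.shift(900, freq='D') there
goog.tshift(900).plot(ax=ax[2])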
Another common time series-specific operation is shifting of data in time. Pandas has two closely related methods for computing this: shift() and tshift(). In short, the difference between them is that shift() shifts the data, while tshift() shifts the index. In both cases, the shift is specified in multiples of the frequency. Note that tshift() was deprecated in pandas 1.1 and removed in 2.0; shift() with a freq argument shifts the index in the same way.
Here we apply both shift() and tshift() with a shift of 900 days:
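The difference between shifting the data and shifting the index shows up clearly on a three-element series. This sketch uses shift() with a freq argument for the index shift, since tshift() is not available in pandas 2.0 and later; the series itself is made up for illustration.

```python
import pandas as pd

idx = pd.date_range("2015-07-01", periods=3, freq="D")
s = pd.Series([1.0, 2.0, 3.0], index=idx)

moved_data = s.shift(1)             # values move down one row; index unchanged
moved_index = s.shift(1, freq="D")  # index moves one day later; values unchanged
```

Shifting the data introduces a missing value at the start, whereas shifting the index keeps every value and only relabels the dates.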
For example, we use shifted values to compute the one-year return on investment for Google stock over the course of the dataset:
?????????????????
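One possible way to compute such a one-year return is sketched below. This is a hedged reconstruction, not the original code: the linearly rising prices are synthetic stand-ins for the stock data, and the index shift is written with shift(..., freq="D") rather than tshift().

```python
import numpy as np
import pandas as pd

# Synthetic daily prices standing in for real stock data (an assumption).
idx = pd.date_range("2015-01-01", periods=731, freq="D")
price = pd.Series(100.0 + np.arange(731), index=idx)

# Relabel each price with the date 365 days earlier, so that dividing by
# the original series compares every day with the price one year later.
future = price.shift(-365, freq="D")
roi = 100 * (future / price - 1)   # percent return; NaN where no overlap
```

The division aligns the two series on their dates, so the result is defined only where a price exists both on a given day and 365 days after it.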
rolling = goog.rolling(365, center=True)
data = pd.DataFrame({'input': goog,
                     'one-year rolling_mean': rolling.mean(),
                     'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)
Here is the one-year centered rolling mean and standard deviation of the Google stock prices
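The behaviour of a centered rolling window is easier to see on a tiny series. The sketch below uses a window of 3 instead of 365, on made-up values.

```python
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Each result is computed from the value itself and one neighbour on each side.
centered_mean = s.rolling(3, center=True).mean()
centered_std = s.rolling(3, center=True).std()
```

The first and last entries are NaN because a full window is not available at the edges.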
data = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
data.head()
We will specify that we want the Date as an index, and we want these dates to be automatically parsed:
weekly = data.resample('W').sum()
We can gain more insight by resampling the data to a coarser grid. Let’s resample by week:
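Resampling to a coarser grid can be tried without the CSV file. The sketch below uses hypothetical hourly counts of one per hour over two full weeks, so the expected totals are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts (all ones) for two full Monday-to-Sunday weeks.
idx = pd.date_range("2015-07-06", periods=14 * 24, freq="h")
counts = pd.Series(np.ones(len(idx), dtype=int), index=idx)

daily = counts.resample("D").sum()    # 24 per day
weekly = counts.resample("W").sum()   # 168 per week, labeled by the Sunday
```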
daily = data.resample('D').sum()  # daily totals, used for the rolling window
daily.rolling(50, center=True,
              win_type='gaussian').sum(std=10).plot(style=[':', '--', '-']);
The jaggedness of the result is due to the hard cutoff of the window. We can get a smoother version of a rolling mean using a window function, for example a Gaussian window. The following code specifies both the width of the window (we chose 50 days) and the width of the Gaussian within the window (we chose 10 days):
weekend = np.where(data.index.weekday < 5, 'Weekday', 'Weekend')
numpy.where(condition[, x, y])
Return elements, either from x or y, depending on condition.
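Applied to a DatetimeIndex, np.where() turns the weekday test into a label array. The four dates in this sketch are chosen to run from a Friday to a Monday.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2015-07-03", periods=4, freq="D")  # Fri, Sat, Sun, Mon

# weekday is 0 for Monday through 6 for Sunday, so values below 5 are weekdays.
labels = np.where(idx.weekday < 5, "Weekday", "Weekend")
```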