Midterm Flashcards
What are CSV files?
CSV - Comma Separated Values
Header Row, separated by commas
Data Rows, separated by commas
Which field would you expect to see in a CSV file of stock data?
- # of employees
- Date/Time
- Company Name
- Price of the stock
- Company’s hometown
- Date/time
- Price of the stock
What does real stock data look like?
Header: Date, Open, High, Low, Close, Volume, Adjusted Close
Close - closing price reported at exchange
Volume - volume sold
Adjusted Close - Number data provider provides based on stock splits and dividend payments. The rate of return looking back with adjusted close should be larger.
What is a data frame?
Columns represent the stock symbols ( Separate dataframes can have different dimensions of data AdjClose, Volume, Close, etc..)
Rows represent time
What pandas code would allow you to print the first or last 5 rows of the DataFrame df?
df = pd.read_csv(“data/AAPL.csv”)
First 5 rows:
print df.head()
Last 5 rows:
print df.tail()
Last n rows:
print df.tail(n)
How do you review specific rows in a data frame betwen random values? For example between rows 10 to 20?
df = pd.read_csv(“data/AAPL.csv”)
print df[10:21]
Note that the second number is not inclusive in the range
How do you compute the max closing price for a stock using pandas?
df = pd.read_csv(“data/{}.csv”.format(symbol))
max value = df [‘Close’].max()
How do you compute the mean volume for a symbol?
df = pd.read_csv(“data/{}.csv”.format(symbol))
Mean = df[‘Volume’].mean()
How would you plot the adjusted close of the following data?
df = pd.read_csv(“data/AAPL.csv”)
print df [‘Adj Close’]
df [‘Adj Close’].plot()
plt.show()
Select the ‘High’ column from the dataframe and then plot it.
df = pd.read_csv(“data/XXX.csv”)
print df [‘High’]
df [‘High’].plot()
plt.show()
How do you plot two columns, such as ‘Close’ and ‘Adj Close’
df = pd.read_csv(“data/AAPL.csv”)
df [[ ‘Close’, ‘Adj Close’] ]. plot()
How many days were US stocks traded at NYSE in 2014?
252
What is S&P 500 and what is SPY?
S&P 500 - Stock Market Index based on 500 large American companies listed on the NYSE or NASDAQ. Essentially a weighted mean of the stock prices of the companies
SPY - SPDR S&P 500 - An ETF (Exchange-Traded Fund) that tracks the S&P 500 index
How do you create an empty data frame (df1) with a given datetime range?
start_date = ‘2010-01-22’
end_date = ‘2010-01-26’
dates = pd.date_range(start_date, end_date)
df1 = pd.DataFrame( index = dates)
Using an empty data frame (df1) in a specified daterange, how do you join df1 to a data frame for SPY (dfSPY)
df1 = pd.DataFrame( index = dates )
Ensure first that the SPY dataframe is indexed with the date column, not the numbered column. Additionally, ensure that na values are interpreted is a “not a number” and not as strings
dfSPY = pd.read_csv( “data/SPY.csv”, index_col = “Date”, parse_dates = True, na_values=[‘nan’] )
Join the two dataframes using DataFrame.join()
df1 = df1.join(dfSPY)
How do you drop NaN values on a data frame (df1)?
df1 = df1.dropna()
How do you drop NaN values when combining two dataframes (ie. df1 and dfSPY)?
df1.join ( dfSPY, how = ‘inner’ )
What is the default operation for the “how” parameter in the dataframe.join function?
The default option is left which indicates that the calling dataframe’s index will be used. Therefore, any dates from the calling dataframe will be preserved, potentially yielding NaN values if not shared by the other dataframe.
How can you read in multiple stocks into one dataframe though they may contain the same column names?
symbols = [‘GOOG’, ‘IBM’, ‘GLD’]
for symbol in symbols:
df_temp = pd.read_csv(“data/{}.csv”.format(symbol), index_col = ‘Date’, parse_dates = True, usecols=[‘Date’, ‘Adj Close’], na_values = [‘nan’])
Rename columns
df_temp = df_temp_rename( columns = {‘ Adj Close’ : symbol})
df1 = df1.join(df_temp)
In a dataframe (df) containing multiple symbols,
how would you drop dates in which SPY did not trade?
if symbol == ‘SPY’:
df = df.dropna( subset = [SPY])
How do you select the piece of data indicating 2010 - 02 - 13 to 2010 - 02 - 15 and only GOOG and GLD?
df = df.ix [‘2010-02-13’ : ‘2010-02-15’, [ ‘GOOG’, ‘GLD’] ]
What is the best way to normalize price data so that all prices start at 1.0?
df1 = df1 / df1[0]
OR
df1 = df1 / df.ix[0]
Slice and plot SPY and IBM over the daterange ‘2010-03-01’ to ‘2010-04-01’
start_index = ‘2010-03-01’
end_index = ‘2010-04-01’
columns = [‘SPY’, ‘IBM’]
plot_data(df.ix [start_index: end_index, columns], title=”title”)
df. plot()
plt. show()
…
def plot_data(df, title=”title”):
ax = df.plot(title = title, fontsize = 2)
ax. set_xlabel(“Date”)
ax. set_ylabel(“Price”)
plt. show()
How do you normalize a dataframe df?
df = df / df.ix [0 :]
How do you return the number of rows in an array, a?
a.shape[0]
How do you return the number of columns in an array, a?
a.shape[1]
How do you get the number of items in an array, a?
a.size
How do you get the sum of elements of an array, a?
a.sum()
How do you get the sum of each column of an array, a?
a.sum(axis = 0)
How do you get the sum of each row of an array, a?
a.sum(axis =1)
How do you get the location of the maximum value of an array, a?
a.argmax()
In an array a, how would you get the entire row of every other column up to the 3rd column?
a[:, 0:3:2]
where 0 indicates start at first column
3 indicates end before 3rd column
2 indicates choose every second element
How do you index an array, a, with another array, b?
a = np.random(10, size = 5)
indices = np.array( [1, 1, 2, 3]
a = [7, 6, 8, 5, 9]
a [indices] = [6, 6, 8, 5]