Pandas Flashcards
dimension of a Series
1
dimension of a DataFrame
2
create a Series object from a list, index it using another list
ser=pd.Series(data=list1, index=list2)
create a Series using a NumPy array
import numpy as np
arr=np.array([1,2,3,4])
ser=pd.Series(arr)
create a Series using a dictionary with keys as the index
ser=pd.Series(my_dict) #my_dict is any dict; avoid naming it dict, which shadows the builtin
from the Series ser access the element with index 'k'
ser['k'] #just like a dict
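The Series cards above can be tried together in one short sketch. The values and labels here are made up for illustration; only the pd.Series calls come from the cards.

```python
import numpy as np
import pandas as pd

# Illustrative data (not part of the original cards)
values = [10, 20, 30]
labels = ["a", "b", "k"]

# From a list, indexed by another list
ser_from_list = pd.Series(data=values, index=labels)

# From a NumPy array: a default integer index 0..n-1 is generated
ser_from_array = pd.Series(np.array([1, 2, 3, 4]))

# From a dict: the keys become the index
ser_from_dict = pd.Series({"a": 10, "b": 20, "k": 30})

print(ser_from_list.ndim)   # number of dimensions of a Series
print(ser_from_list["k"])   # label-based access, like a dict lookup
```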
Describe pd.DataFrame
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
What arguments does pd.DataFrame take?
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
from DataFrame df grab the columns 'name' and 'age'
mind the two brackets: the inner pair is a list of column labels
df[['name', 'age']]
from DataFrame df grab the row with index 'B' as a Series
df.loc['B']
from DataFrame df grab columns 'one', 'three' intersections with rows 'B', 'D'
df.loc[['B', 'D'], ['one', 'three']]
from DataFrame df, grab the element at row position 3, column position 2
df.iloc[3,2]
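A minimal sketch of the selection cards above, using a small made-up DataFrame (the column names 'one', 'two', 'three' and row labels 'A' to 'D' are assumptions chosen to match the card wording):

```python
import pandas as pd

# Illustrative frame: 4 rows labeled A..D, 3 columns
df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [5, 6, 7, 8], "three": [9, 10, 11, 12]},
    index=["A", "B", "C", "D"],
)

cols = df[["one", "three"]]                    # list of labels -> DataFrame
row_b = df.loc["B"]                            # single row label -> Series
block = df.loc[["B", "D"], ["one", "three"]]   # rows x columns by label
cell = df.iloc[3, 2]                           # row position 3, column position 2

print(type(cols).__name__, type(row_b).__name__, block.shape, cell)
```

Note that .loc always works with labels and .iloc always works with integer positions, which is why df.iloc[3, 2] returns a single value rather than a row.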
create a new column Total which shows the sum of the columns 'C', 'D', and 'E'
df['Total']=df['C'] + df['D'] + df['E']
in df, delete the row with index 'F'
df.drop('F', axis=0, inplace=True)
in df, delete the column with index 'Total'
df.drop('Total', axis=1, inplace=True)
in df, create a new column named 'Sex' and assign it as the index
df['Sex']=['Men', 'Women'] #the list length must match the number of rows (two here)
df.set_index('Sex', inplace=True)
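The column and row edits above, as a runnable sketch on made-up data. This version reassigns the result instead of passing inplace=True; both styles work, and the choice here is only for readability.

```python
import pandas as pd

# Illustrative frame with rows F and G (values are invented)
df = pd.DataFrame({"C": [1, 2], "D": [3, 4], "E": [5, 6]}, index=["F", "G"])

# New column computed element-wise from existing columns
df["Total"] = df["C"] + df["D"] + df["E"]

# drop returns a new DataFrame unless inplace=True is passed
without_total = df.drop("Total", axis=1)   # remove a column
without_f = df.drop("F", axis=0)           # remove a row

# Add a column, then promote it to the index
people = pd.DataFrame({"age": [30, 25]})
people["Sex"] = ["Men", "Women"]
people = people.set_index("Sex")

print(df["Total"].tolist(), list(without_f.index), people.index.name)
```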
Where does pandas beat NumPy?
NumPy’s ndarray data structure provides essential features for the type of
clean, well-organized data typically seen in numerical computing tasks. While it
serves this purpose very well, its limitations become clear when we need more flexibility
(attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (groupings, pivots,
etc.), each of which is an important piece of analyzing the less structured data available
in many forms in the world around us. Pandas, and in particular its Series and
DataFrame objects, builds on the NumPy array structure and provides efficient access
to these sorts of “data munging” tasks that occupy much of a data scientist’s time.
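Two of the points above (missing data and groupings) can be seen in a few lines. This is only a sketch on invented data, not a benchmark or a complete comparison.

```python
import numpy as np
import pandas as pd

raw = np.array([1.0, np.nan, 3.0, 4.0])

# Plain NumPy: a single NaN propagates through the mean
numpy_mean = raw.mean()

# pandas: labels attached to the data, NaN skipped by default, grouping built in
df = pd.DataFrame({"group": ["x", "y", "x", "y"], "value": raw})
group_means = df.groupby("group")["value"].mean()

print(numpy_mean)
print(group_means)
```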
What does this DataFrame call return: df['x']
The column labeled 'x' as a Series, not the row with index 'x'
What is the difference between slicing using an explicit index and using an implicit index (series['a':'c'] and series[0:2])?
With explicit (label) indexing the final label is included in the slice; with implicit (positional) indexing the stop position is excluded.
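A quick check of the inclusive versus exclusive behavior, using an invented four-element Series:

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

by_label = series["a":"c"]    # explicit index: 'c' is included
by_position = series[0:2]     # implicit index: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```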
Suppose my Series 'data' has an explicit integer index. What does each of the following expressions use?
data[1] #indexing
data[2:3] #slicing
Indexing uses the explicit index (the label 1); slicing uses the implicit, positional index. Use data.loc[...] and data.iloc[...] to make the choice explicit.
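The ambiguity is easiest to see when the integer labels differ from the positions. The Series below is an invented example with labels 1, 3, 5; .loc and .iloc are shown next to the plain bracket forms for comparison.

```python
import pandas as pd

data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

item = data[1]              # indexing: label 1
plain_slice = data[1:3]     # slicing: positions 1 and 2
label_slice = data.loc[1:3]     # always labels, end label included
position_slice = data.iloc[1:3] # always positions, stop excluded

print(item)
print(list(plain_slice.index))
print(list(label_slice.index))
print(list(position_slice.index))
```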
State different methods of creating a multi-index
- pass a list of two or more index arrays to the constructor
- pass a dictionary with appropriate tuples as keys
- explicit constructors on pd.MultiIndex:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
how do we name the different levels of a MultiIndex?
assign a list to df.index.names, or pass the names parameter directly to the MultiIndex constructor
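All the construction routes listed above produce the same index, which the sketch below checks. The data values and the level names 'letter' and 'number' are invented for illustration.

```python
import pandas as pd

# Explicit constructors
from_arrays = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])
from_product = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

# A list of index arrays passed to the DataFrame constructor
df = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)

# A dictionary with tuples as keys
ser = pd.Series({("a", 1): 10, ("a", 2): 20, ("b", 1): 30, ("b", 2): 40})

# Naming the levels: after the fact, or in the constructor
df.index.names = ["letter", "number"]
named = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["letter", "number"])

print(df.index.names, named.names)
```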
DataFrame health_data has a two-level column MultiIndex:
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
Create, in two different ways, a slice with only the HR data
health_data.xs('HR', level=1, axis=1, drop_level=False)
OR
idx = pd.IndexSlice
health_data.loc[:, idx[:, 'HR']]
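To confirm that both approaches select the same columns, here is a sketch with a tiny stand-in for health_data. The numbers and the row index [1, 2] are invented; only the column MultiIndex matches the card.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
# Placeholder measurements: 2 rows x 6 columns
health_data = pd.DataFrame(np.arange(12).reshape(2, 6), index=[1, 2], columns=columns)

# Way 1: cross-section on the second column level, keeping that level
hr_xs = health_data.xs("HR", level=1, axis=1, drop_level=False)

# Way 2: IndexSlice selects every subject, then 'HR'
idx = pd.IndexSlice
hr_loc = health_data.loc[:, idx[:, "HR"]]

print(list(hr_xs.columns))
print(list(hr_loc.columns))
```

With drop_level=False the 'type' level is kept, so the two results have the same column labels; without it, xs would return columns labeled by subject only.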
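Suppose df has a MultiIndex with level 0 being ['a', 'c', 'b']. What could be a caveat to slicing:
df['a':'b'] ?
How can it be fixed?
This raises an error: partial slicing on a MultiIndex requires the index to be lexicographically sorted, and ['a', 'c', 'b'] is not.
Fix: df = df.sort_index()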
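The failure and the fix can be reproduced with a small invented frame. The sketch catches KeyError because the error pandas raises here (UnsortedIndexError) is a subclass of it.

```python
import pandas as pd

# Level 0 is deliberately out of order: a, c, b
index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
df = pd.DataFrame({"value": range(6)}, index=index)

try:
    df["a":"b"]
    raised = False
except KeyError:
    # pandas.errors.UnsortedIndexError is a KeyError subclass
    raised = True

# Sorting the index makes partial slicing possible
df_sorted = df.sort_index()
sliced = df_sorted["a":"b"]

print(raised)
print(sliced)
```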