Pandas Flashcards

Question 1

Q

dimension of a Series

Question 2

Q

dimension of a DataFrame

Question 3

Q

create a Series object from a list, index it using another list

Answer

A

ser=pd.Series(data=list1, index=list2)

Question 4

Q

create a Series using a numpy array

Answer

A

import numpy as np
arr=np.array([1,2,3,4])
ser=pd.Series(arr)

Question 5

Q

create a Series using a dictionary with keys as the index

Answer

A

ser=pd.Series(d) #d is a dictionary; avoid naming it dict, which shadows the built-in

Question 6

Q

from the Series ser access the element with index ‘k’

Answer

A

ser['k']
#just like a dict
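
A minimal sketch tying the Series cards above together. The labels and values ('k', 'm', 'n', 10, 20, 30) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: values from one list, labels from another.
list1 = [10, 20, 30]
list2 = ['k', 'm', 'n']
ser_from_lists = pd.Series(data=list1, index=list2)

# From a NumPy array: a default integer index 0..n-1 is generated.
ser_from_array = pd.Series(np.array([1, 2, 3, 4]))

# From a dict: the keys become the index.
ser_from_dict = pd.Series({'k': 10, 'm': 20})

# Label-based access reads like a dict lookup.
value = ser_from_dict['k']
```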

Question 7

Q

Describe pd.DataFrame

Answer

A

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Question 8

Q

What arguments does pd.DataFrame take ?

Answer

A

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Question 9

Q

from DataFrame df grab the columns ‘name’ and ‘age’

Answer

A

mind the two brackets

df[['name', 'age']]

Question 10

Q

from DataFrame df grab the row with index ‘B’ as a Series

Answer

A

df.loc['B']

Question 11

Q

from DataFrame df grab columns ‘one’, ‘three’ intersections with rows ‘B’, ‘D’

Answer

A

df.loc[['B', 'D'], ['one', 'three']]

Question 12

Q

from DataFrame df, grab the element at row position 3 and column position 2

Answer

A

df.iloc[3,2]
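
A small sketch of label-based (loc) versus position-based (iloc) selection. The frame below, with rows 'A' to 'D' and columns 'one' to 'three', is an assumed example and not part of the original cards.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame filled with 0..11.
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['A', 'B', 'C', 'D'],
                  columns=['one', 'two', 'three'])

row_b = df.loc['B']                           # one row, returned as a Series
block = df.loc[['B', 'D'], ['one', 'three']]  # rows B, D crossed with two columns
cell = df.iloc[3, 2]                          # fourth row, third column (0-based)
```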

Question 13

Q

create a new column Total which shows the sum of the columns ‘C’, ‘D’, and ‘E’

Answer

A

df['Total']=df['C'] + df['D'] + df['E']

Question 14

Q

in df, delete the row with index ‘F’

Answer

A

df.drop('F', axis=0, inplace=True)

Question 15

Q

in df, delete the column named ‘Total’

Answer

A

df.drop('Total', axis=1, inplace=True)

Question 16

Q

in df, create a new column named ‘Sex’ and assign it as the index

Answer

A

df['Sex']=['Men', 'Women']

df.set_index('Sex', inplace=True)

Question 17

Q

Where does pandas beat numpy ?

Answer

A

NumPy’s ndarray data structure provides essential features for the type of
clean, well-organized data typically seen in numerical computing tasks. While it
serves this purpose very well, its limitations become clear when we need more flexibility
(attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (groupings, pivots,
etc.), each of which is an important piece of analyzing the less structured data available
in many forms in the world around us. Pandas, and in particular its Series and
DataFrame objects, builds on the NumPy array structure and provides efficient access
to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

Question 18

Q

What does this dataframe call return : df['x']

Answer

A

The column Series, and not the ROW, with index ‘x’

Question 19

Q

What is the difference between slicing using an explicit index and using an implicit index (series['a':'c'] and series[0:2]) ?

Answer

A

the final index is included in the slice in the case of explicit indexing, and excluded from the slice in the case of implicit indexing
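
A short sketch of the difference, using an assumed four-element Series labelled 'a' to 'd':

```python
import pandas as pd

ser = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = ser['a':'c']    # explicit (label) slice: the endpoint 'c' is included
by_position = ser[0:2]     # implicit (positional) slice: position 2 is excluded
```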

Question 20

Q

Suppose my Series ‘data’ has explicit integer indexing, what does each of the following expressions yield ?

data[1] #indexing
data[2:3] #slicing

Answer

A

indexing will use the explicit index and slicing will use the implicit index
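
A sketch of why this is confusing, and how loc and iloc make the intent explicit. The Series with labels 1, 3, 5 is an assumed example.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

item = data[1]                # indexing: explicit label 1
plain_slice = data[1:3]       # slicing: implicit positions 1 and 2
label_slice = data.loc[1:3]   # loc always uses labels (endpoint included)
pos_slice = data.iloc[1:3]    # iloc always uses positions (endpoint excluded)
```

Using loc and iloc removes the ambiguity, since each one always refers to the same kind of index.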

Question 21

Q

State different methods of creating a multi-index

Answer

A

pass a list of two or more index arrays to the constructor
pass a dictionary with appropriate tuples as keys
explicit constructors of pd.MultiIndex:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
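
As a quick check, the three explicit constructors above build the same index, and a dictionary keyed by tuples produces a MultiIndex as well. The values 10 to 40 are placeholders.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Passing a dict with tuple keys: pandas builds the MultiIndex for us.
ser = pd.Series({('a', 1): 10, ('a', 2): 20, ('b', 1): 30, ('b', 2): 40})
```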

Question 22

Q

how do we name the different levels of a multiIndex

Answer

A

Assign a list to df.index.names, or pass the names parameter directly to the MultiIndex constructor

Question 23

Q

DataFrame health_data has two level column MultiIndexing:

columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])

Create in two different ways a slice with only the HR data

Answer

A

health_data.xs('HR', level=1, axis=1, drop_level=False)

OR

idx = pd.IndexSlice
health_data.loc[:, idx[:, 'HR']]
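
A runnable sketch comparing the two approaches. The row labels and the random values are assumptions made only to have something to slice.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])

# Hypothetical measurements: 4 visits, random values.
rng = np.random.default_rng(0)
health_data = pd.DataFrame(rng.normal(size=(4, 6)),
                           index=[1, 2, 3, 4], columns=columns)

# Cross-section on the 'type' level, keeping that level in the result.
via_xs = health_data.xs('HR', level='type', axis=1, drop_level=False)

# IndexSlice: all subjects, only 'HR'.
idx = pd.IndexSlice
via_loc = health_data.loc[:, idx[:, 'HR']]
```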

Question 24

Q

Suppose df has a multiIndex with level 0 being ['a', 'c', 'b'], what could be a caveat to slicing :

df['a':'b'] ?

How can it be fixed ?

Answer

A

This raises an error (UnsortedIndexError, a subclass of KeyError), because partial slicing requires the MultiIndex to be sorted.

Fix : df = df.sort_index()
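
A sketch reproducing the problem and the fix on an assumed six-row frame with a single column 'x':

```python
import pandas as pd

# Level 0 is deliberately out of order: a, c, b.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
df = pd.DataFrame({'x': range(6)}, index=index)

try:
    df['a':'b']
    raised = False
except KeyError:   # pandas.errors.UnsortedIndexError subclasses KeyError
    raised = True

# After sorting, the partial slice works.
df_sorted = df.sort_index()
sliced = df_sorted['a':'b']
```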

Question 25

Q

What does this line of code do ?

data.mean(axis=1, level='type')

Answer

A

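Groups the columns by the ‘type’ level of the column MultiIndex, then computes the mean across the columns of each group, row by row. Note that the level argument of mean was removed in pandas 2.0; an equivalent that still works is data.T.groupby(level='type').mean().T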
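
The level argument of mean is not available in pandas 2.0 and later, so the sketch below gets the same result by transposing and grouping on the level. The frame, filled with 0 to 11, is an assumed example.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
data = pd.DataFrame(np.arange(12).reshape(2, 6), columns=columns)

# Transpose so the column levels become row levels, group on 'type',
# average, then transpose back.
by_type = data.T.groupby(level='type').mean().T
```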

Question 26

Q

How does pandas concatenate dataframes with similar indices ?

Answer

A

Pandas preserves indices, even if it means repeating them

Question 27

Q

How do we raise an error if the dataframes we want to concatenate x, y have overlapping indices ? what if we want to ignore the indices instead and replace them with a default index ?

Answer

A

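pd.concat([x, y], verify_integrity=True) #raises ValueError on overlapping indices
#to discard the indices and build a default integer index instead
pd.concat([x, y], ignore_index=True)
#OR keep the sources apart by naming them (multiIndexing)
pd.concat([x, y], keys=['x', 'y'])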
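
A sketch of the three options on two assumed frames x and y that share the index [0, 1]:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

# Default: indices are preserved, so 0 and 1 each appear twice.
plain = pd.concat([x, y])

# verify_integrity=True turns the overlap into an error.
try:
    pd.concat([x, y], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([x, y], ignore_index=True)

# keys=... labels each source, producing a MultiIndex.
keyed = pd.concat([x, y], keys=['x', 'y'])
```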

Question 28

Q

concatenate df1 and df2

Answer

A

pd.concat([df1, df2], join='outer') #default, union of cols
#df1.append(df2) did the same in older pandas (not in-place); append was removed in pandas 2.0

pd.concat([df1, df2], join='inner') #intersect cols
#join_axes was removed in pandas 1.0; reindex the columns instead
pd.concat([df1, df2.reindex(columns=df1.columns)]) #keep only df1 cols
pd.concat([df1.reindex(columns=df2.columns), df2]) #keep only df2 cols
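
A sketch of the column handling on two assumed frames that share only columns 'B' and 'C'. Since join_axes is gone from current pandas, the last case uses reindex, which is one possible substitute.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

outer = pd.concat([df1, df2], join='outer')   # union of columns, gaps become NaN
inner = pd.concat([df1, df2], join='inner')   # only the shared columns

# Keep exactly df1's columns by reindexing df2 before concatenating.
only_df1_cols = pd.concat([df1, df2.reindex(columns=df1.columns)])
```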

Question 29

Q

Merge df1 and df2 on df1.name and df2.employee

Answer

A

pd.merge(df1, df2, left_on='name', right_on='employee')
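
A sketch with two assumed frames whose key columns have different names. The 'group' and 'hire_date' columns are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                    'hire_date': [2004, 2008, 2012]})

merged = pd.merge(df1, df2, left_on='name', right_on='employee')

# Both key columns survive the merge; drop the redundant one if desired.
tidy = merged.drop(columns='employee')
```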

Question 30

Q

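How do we specify the set arithmetic for joins ?

Answer

A

with the how keyword: 'inner' (default, intersection of keys), 'outer' (union of keys), 'left' (keys of the left frame), 'right' (keys of the right frame)

pd.merge(df6, df7, how='left')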
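
A sketch of the four how options on two assumed frames that share only the name 'Mary':

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

inner = pd.merge(df6, df7, how='inner')   # keys present in both frames
outer = pd.merge(df6, df7, how='outer')   # keys present in either frame
left = pd.merge(df6, df7, how='left')     # every key of df6
right = pd.merge(df6, df7, how='right')   # every key of df7
```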

Question 31

Q

df has columns A, B. What does df.mean() return ?

Answer

A

a Series holding the mean of each column (one value for A, one for B)

Question 32

Q

Find the mean of each row of df, a dataframe with cols A and B

Answer

A

df.mean(axis='columns') #same as axis=1

Question 33

Q

Get a statistical description of the data in df

Answer

A

df.dropna().describe()

Question 34

Q

what are the three steps involved in a groupby operation in pandas ?

Answer

A

split - apply - combine

Question 35

Q

group the rows of df by key then find min, max, and mean for each key for columns ['data1', 'data2']

Answer

A

df.groupby('key')[['data1', 'data2']].aggregate(['min', 'max', 'mean'])
#aggregate takes a string, a function or a list thereof

Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:

df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
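
A sketch of both aggregation styles. The six-row frame with keys A, B, C is an assumed example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [5, 0, 3, 3, 7, 9]})

# A list of aggregations gives one sub-column per (column, function) pair.
stats = df.groupby('key')[['data1', 'data2']].aggregate(['min', 'max', 'mean'])

# A dict applies a different aggregation to each column.
per_column = df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
```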

Question 36

Q

from DataFrame df, keep all groups in which the standard deviation of column ‘data2’ is larger than 4

Answer

A

df.groupby('key').filter(lambda x: x['data2'].std() > 4)
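
A sketch on an assumed frame where only groups B and C have a 'data2' standard deviation above 4:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [5, 0, 3, 3, 7, 9]})

# The function receives each group as a DataFrame and returns True to keep it.
kept = df.groupby('key').filter(lambda x: x['data2'].std() > 4)
```

The result holds the original rows of the surviving groups, not one row per group.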

Question 37

Q

in df, center the data by subtracting the (key’s) group-wise mean

hint:transform

Answer

A

df.groupby('key').transform(lambda x: x - x.mean())
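
A sketch of group-wise centering on an assumed frame. Unlike aggregate, transform returns a result with the same number of rows as the input.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [5, 0, 3, 3, 7, 9]})

# x is one column of one group; subtract that group's mean from each value.
centered = df.groupby('key').transform(lambda x: x - x.mean())
```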

Question 38

Q

group and sum the elements of df following the list : L = [0, 1, 0, 1, 2, 0]

Answer

A

L = [0, 1, 0, 1, 2, 0]
df.groupby(L).sum()

Another method is to provide a dictionary that maps index values to the group keys:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()

Similar to mapping, you can pass any Python function that will input the index value and output the group:
df2.groupby(str.lower).mean()

Further, any of the preceding key choices can be combined to group on a multi-index:
df2.groupby([str.lower, mapping]).mean()
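
A sketch of the list, dictionary, and function keys on an assumed frame. Only the numeric columns are kept for the list example, so the sums are unambiguous.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [5, 0, 3, 3, 7, 9]})

# A list as long as the frame: row i goes to group L[i].
L = [0, 1, 0, 1, 2, 0]
by_list = df[['data1', 'data2']].groupby(L).sum()

# A dict maps index values to group names.
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
by_mapping = df2.groupby(mapping).sum()

# A function is called on each index value.
by_function = df2.groupby(str.lower).mean()
```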

Question 39

Q

what motivated the creation of Pivot Tables ?

Answer

A

The need for a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.
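
A sketch of the idea on a tiny made-up frame (not the real Titanic data): grouping by two keys and unstacking gives the same two-dimensional grid that pivot_table builds directly.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; the column names mirror the Titanic dataset.
df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1],
})

via_groupby = df.groupby(['sex', 'class'])['survived'].mean().unstack()
via_pivot = df.pivot_table('survived', index='sex', columns='class')
```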

Question 40

Q

1/ Find the ratio of survival by sex and class using the titanic dataframe with both GroupBy and Pivot syntax

2/ Look at age as a third dimension using pivot_table syntax, bin the age into [0,18,80] using pd.cut

Answer

A

import seaborn as sns
titanic = sns.load_dataset('titanic')

1/

#GROUPBY
titanic.groupby( ['sex', 'class'])['survived'].mean().unstack()

#PIVOT
titanic.pivot_table('survived', index='sex', columns='class', aggfunc='mean') #mean is the default; aggfunc can also take a column-to-function mapping (dict)

2/

age=pd.cut(titanic['age'], [0,18,80])
titanic.pivot_table('survived', ['sex', age], 'class')

Question 41

Q

At times it’s useful to compute totals along each grouping of pandas.pivot_table. This can be done via the ?

Answer

A

margins keyword :

titanic.pivot_table('survived', index='sex', columns='class', margins=True)
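
A sketch of margins on a tiny made-up frame (not the real Titanic data). The extra row and column are labelled 'All' by default.

```python
import pandas as pd

# Hypothetical passengers; the column names mirror the Titanic dataset.
df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1],
})

table = df.pivot_table('survived', index='sex', columns='class', margins=True)

# The bottom-right cell is the overall survival rate.
overall = table.loc['All', 'All']
```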

Question 42

Q

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

use the pandas .str accessor to capitalize the strings in a vectorized fashion. Why would numpy or list comprehension fail at efficiently achieving this ?

Answer

A

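names=pd.Series(data)

names.str.capitalize()
#a list comprehension such as [s.capitalize() for s in data] raises AttributeError on the None entry; the .str methods skip missing values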
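
A sketch contrasting the list comprehension with the .str accessor on the data from the question:

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# The plain-Python version stops at the missing value.
try:
    [s.capitalize() for s in data]
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True

# The vectorized version passes the missing value through untouched.
names = pd.Series(data)
capitalized = names.str.capitalize()
```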

Question 43

Q

from the Series names, slice the first 3 characters of each name into a new Series

Answer

A

names.str[0:3]

Question 44

Q

create a date in numpy’s datetime64

Answer

A

import numpy as np
date=np.array('2015-07-04', dtype=np.datetime64)

we can now do vectorized operations on 'date'
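
A sketch of one such vectorized operation: adding an integer array to the date yields consecutive days. The choice of three days is arbitrary.

```python
import numpy as np

date = np.array('2015-07-04', dtype=np.datetime64)

# Integers are interpreted in the date's own unit, here days.
next_days = date + np.arange(3)
as_text = [str(d) for d in next_days]
```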

Question 45

Q

One detail of the datetime64 and timedelta64 objects is that they are built on a fundamental time unit. Because the datetime64 object is limited to 64-bit precision, the range of encodable times is 2^64 times this fundamental unit. In other words, datetime64 imposes a trade-off between : ?

Answer

A

time resolution and maximum time span

Question 46

Q

what’s the use of the pd.Period object ?

Answer

A

It encapsulates the granularity (‘D’, ‘M’) for arithmetic
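
A sketch of what that means in practice: adding 1 to a Period moves it forward by one unit of its own frequency. The dates are arbitrary examples.

```python
import pandas as pd

month = pd.Period('2015-07', freq='M')
day = pd.Period('2015-07-04', freq='D')

next_month = month + 1   # one month later
next_day = day + 1       # one day later
```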
