Pandas Flashcards

1
Q

dimension of a Series

A

1

2
Q

dimension of a DataFrame

A

2

3
Q

create a Series object from a list, index it using another list

A

ser=pd.Series(data=list1, index=list2)

4
Q

create a Series using a NumPy array

A

import numpy as np
import pandas as pd
arr = np.array([1, 2, 3, 4])
ser = pd.Series(arr)

5
Q

create a Series using a dictionary with keys as the index

A

ser = pd.Series(d)  # d is a dict; avoid naming it dict, which shadows the built-in

6
Q

from the Series ser access the element with index ‘k’

A
ser['k']
#just like a dict
7
Q

Describe pd.DataFrame

A

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

8
Q

What arguments does pd.DataFrame take ?

A

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

9
Q

from DataFrame df grab the columns ‘name’ and ‘age’

A

mind the two brackets

df[['name', 'age']]

10
Q

from DataFrame df grab the row with index ‘B’ as a Series

A

df.loc['B']

11
Q

from DataFrame df grab the intersection of columns ‘one’ and ‘three’ with rows ‘B’ and ‘D’

A

df.loc[['B', 'D'], ['one', 'three']]

12
Q

from DataFrame df, grab the element at row position 3, column position 2

A

df.iloc[3,2]
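
The label-based and position-based selections from the last few cards can be compared side by side. This is a minimal sketch; the frame, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with row labels A..D and three columns
df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [5, 6, 7, 8], "three": [9, 10, 11, 12]},
    index=["A", "B", "C", "D"],
)

row_b = df.loc["B"]                             # one row, returned as a Series
by_label = df.loc[["B", "D"], ["one", "three"]]  # rows and columns chosen by label
by_position = df.iloc[3, 2]                      # row position 3 ("D"), column position 2 ("three")
```

Here .loc always works with labels and .iloc always works with integer positions, so the two never get confused even when the labels are themselves integers.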

13
Q

create a new column Total which shows the sum of the columns ‘C’, ‘D’, and ‘E’

A

df['Total'] = df['C'] + df['D'] + df['E']

14
Q

in df, delete the row with index ‘F’

A

df.drop('F', axis=0, inplace=True)

15
Q

in df, delete the column named ‘Total’

A

df.drop('Total', axis=1, inplace=True)

16
Q

in df, create a new column named ‘Sex’ and assign it as the index

A

df['Sex'] = ['Men', 'Women']

df.set_index('Sex', inplace=True)

17
Q

Where does pandas beat numpy ?

A

NumPy’s ndarray data structure provides essential features for the type of
clean, well-organized data typically seen in numerical computing tasks. While it
serves this purpose very well, its limitations become clear when we need more flexibility
(attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (groupings, pivots,
etc.), each of which is an important piece of analyzing the less structured data available
in many forms in the world around us. Pandas, and in particular its Series and
DataFrame objects, builds on the NumPy array structure and provides efficient access
to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

18
Q

What does this DataFrame call return: df['x']

A

The column Series, and not the ROW, with index ‘x’

19
Q

What is the difference between slicing with an explicit index and slicing with an implicit index (series['a':'c'] vs. series[0:2])?

A

the final index is included in the slice in the case of explicit indexing, and excluded from the slice in the case of implicit indexing
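
A small sketch of the difference, assuming a hypothetical Series with string labels:

```python
import pandas as pd

# Hypothetical Series with an explicit string index
series = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

explicit = series["a":"c"]  # label slice: the end label 'c' is included
implicit = series[0:2]      # positional slice: position 2 is excluded
```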

20
Q

Suppose my Series ‘data’ has explicit integer indexing, what does each of the following expressions yield ?

data[1]    # indexing
data[2:3]  # slicing

A

indexing will use the explicit index and slicing will use the implicit index
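
The ambiguity is easiest to see on a concrete Series. The sketch below uses made-up values and odd integer labels so that labels and positions never coincide; .loc and .iloc are shown as the unambiguous alternatives.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = data[1]             # plain indexing uses the explicit index (label 1)
by_position = data[1:3]        # plain slicing uses the implicit index (positions 1 and 2)

label_slice = data.loc[1:3]    # always labels, end label included
position_slice = data.iloc[1:3]  # always positions, end excluded
```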

21
Q

State different methods of creating a multi-index

A
  • pass a list of two or more index arrays to the constructor
  • pass a dictionary with appropriate tuples as keys
  • use an explicit pd.MultiIndex constructor:
    pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
    pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
    pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
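
All of the routes listed above lead to the same index. The following sketch builds it each way; the data values and the column name "value" are placeholders.

```python
import pandas as pd

# Explicit constructors
from_arrays = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])
from_product = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

# A list of index arrays passed straight to a constructor
df = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)

# A dictionary with tuples as keys
ser = pd.Series({("a", 1): 10, ("a", 2): 20, ("b", 1): 30, ("b", 2): 40})
```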
22
Q

how do we name the different levels of a multiIndex

A

Set df.index.names, or pass the names parameter directly to the MultiIndex constructor

23
Q

DataFrame health_data has two level column MultiIndexing:

columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])

Create in two different ways a slice with only the HR data

A

health_data.xs('HR', level=1, axis=1, drop_level=False)

OR

idx = pd.IndexSlice
health_data.loc[:, idx[:, 'HR']]
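
To check that both approaches select the same columns, here is a sketch with a small made-up health_data frame (the years and the numbers are placeholders, not real measurements):

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
# Placeholder values 0..11, two rows labelled by hypothetical years
health_data = pd.DataFrame(
    np.arange(12).reshape(2, 6), index=[2013, 2014], columns=columns
)

# Cross-section on the 'type' level, keeping that level in the result
hr_xs = health_data.xs("HR", level="type", axis=1, drop_level=False)

# IndexSlice: every subject, only the 'HR' type
idx = pd.IndexSlice
hr_loc = health_data.loc[:, idx[:, "HR"]]
```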

24
Q

Suppose df has a MultiIndex with level 0 being ['a', 'c', 'b']. What could be a caveat to slicing:

df['a':'b'] ?

How can it be fixed ?

A

This raises an error, because partial slicing requires the MultiIndex to be lexicographically sorted.

Fix: df = df.sort_index()
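
A sketch of the problem and the fix, using a hypothetical one-column frame and .loc for the row slice. The unsorted slice is expected to fail with a KeyError subclass, which is why it is wrapped in try/except here.

```python
import pandas as pd

# Level 0 is deliberately out of order: a, c, b
index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
df = pd.DataFrame({"value": range(6)}, index=index)

try:
    df.loc["a":"b"]          # partial slice on an unsorted MultiIndex
    unsorted_slice_failed = False
except KeyError:
    unsorted_slice_failed = True

df_sorted = df.sort_index()  # lexicographic sort of the index
sliced = df_sorted.loc["a":"b"]
```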

25
Q

What does this line of code do ?

data.mean(axis=1, level='type')

A

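Groups the columns by the ‘type’ level and computes the mean of each group, row by row. Note that the level argument of mean was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions, group on the level explicitly with groupby.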
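
Because the level argument is no longer available in pandas 2.0 and later, the same result can be obtained by grouping on the column level. This is one possible equivalent, not the only one; the frame and its values are invented for the example.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["Bob", "Guido"], ["HR", "Temp"]], names=["subject", "type"]
)
# Placeholder values
data = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], columns=columns
)

# Transpose so the column levels become row levels, group on 'type',
# average within each group, then transpose back
by_type = data.T.groupby(level="type").mean().T
```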

26
Q

How does pandas concatenate dataframes with similar indices ?

A

Pandas preserves indices, even if it means repeating them

27
Q

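How do we raise an error if the dataframes x and y that we want to concatenate have overlapping indices? What if we want to ignore the indices instead and replace them with a default index?

A
pd.concat([x, y], verify_integrity=True)  # raises ValueError on overlapping indices
pd.concat([x, y], ignore_index=True)      # discards the indices and builds a new default index
# OR keep the indices and name the data sources (MultiIndex)
pd.concat([x, y], keys=['x', 'y'])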
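
The options for handling repeated indices can be tried on two tiny frames. The frames x and y below are made up so that their indices overlap completely.

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

repeated = pd.concat([x, y])  # default: indices are preserved, so 0 and 1 repeat

try:
    pd.concat([x, y], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True   # overlapping index values were reported

renumbered = pd.concat([x, y], ignore_index=True)  # fresh default index
labelled = pd.concat([x, y], keys=["x", "y"])      # MultiIndex: (source, original index)
```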
28
Q

concatenate df1 and df2

A

pd.concat([df1, df2], join='outer')  # default: union of columns
# df1.append(df2) did the same (not in place, returned a new dataframe) but was removed in pandas 2.0

pd.concat([df1, df2], join='inner')  # intersection of columns
# join_axes was removed in pandas 1.0; reindex the result instead
pd.concat([df1, df2]).reindex(columns=df1.columns)  # keep only df1 columns
pd.concat([df1, df2]).reindex(columns=df2.columns)  # keep only df2 columns
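
A sketch of the join options on two hypothetical frames that share only column B. Since join_axes is no longer accepted by recent pandas versions, the "keep only df1 columns" case is written with reindex, which is one possible replacement.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

outer = pd.concat([df1, df2], join="outer")  # union of columns, gaps filled with NaN
inner = pd.concat([df1, df2], join="inner")  # only the shared column B
only_df1_cols = pd.concat([df1, df2]).reindex(columns=df1.columns)
```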

29
Q

Merge df1 and df2 on df1.name and df2.employee

A

pd.merge(df1, df2, left_on='name', right_on='employee')

30
Q

How do we specify the set arithmetic used for joins?

A

With the how keyword (‘inner’ by default, or ‘outer’, ‘left’, ‘right’):

pd.merge(df6, df7, how='left')
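
The effect of the key arguments and of how can be seen on two small frames. The names, salaries, and groups below are invented; the frames are built so that one person appears on each side only.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Bob", "Jake", "Lisa"], "salary": [70000, 80000, 120000]})
df2 = pd.DataFrame({"employee": ["Bob", "Jake", "Sue"], "group": ["Accounting", "Engineering", "HR"]})

# Key columns have different names on each side
inner = pd.merge(df1, df2, left_on="name", right_on="employee")               # default how='inner'
left = pd.merge(df1, df2, left_on="name", right_on="employee", how="left")    # every row of df1
outer = pd.merge(df1, df2, left_on="name", right_on="employee", how="outer")  # rows from both sides

# The redundant key column can be dropped afterwards
inner_clean = inner.drop("employee", axis=1)
```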

31
Q

df has columns A and B. What does df.mean() return?

A

the mean of each column

32
Q

Find the row-wise mean of df, a dataframe with columns A and B

A

df.mean(axis='columns')

33
Q

Get a statistical description of the data in df

A

df.dropna().describe()

34
Q

what are the three steps involved in a groupby operation in pandas ?

A

split - apply - combine

35
Q

group the rows of df by key then find min, max, and mean for each key for columns [ ‘data1’, ‘data2’ ]

A
df.groupby('key')[['data1', 'data2']].aggregate(['min', 'max', 'mean'])
#aggregate takes a string, a function or a list thereof

Another useful pattern is to pass a dictionary mapping column names to the operations to be applied to each column:

df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
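
Both aggregation styles can be checked on a small frame. The frame below is hypothetical; only the column names key, data1, and data2 come from the card.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": list("ABCABC"), "data1": range(6), "data2": [5, 0, 3, 3, 7, 9]}
)

# A list of aggregations applied to every selected column
stats = df.groupby("key")[["data1", "data2"]].aggregate(["min", "max", "mean"])

# A dictionary mapping each column to its own aggregation
per_column = df.groupby("key").aggregate({"data1": "min", "data2": "max"})
```

The first result has two-level columns such as ('data1', 'mean'), while the dictionary form returns one column per entry.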

36
Q

from DataFrame df, keep all groups in which the standard deviation of column ‘data2’ is larger than 4

A

df.groupby('key').filter(lambda x: x['data2'].std() > 4)
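
A sketch of the filter on a made-up frame. The function receives each group as a DataFrame and must return a single boolean; groups for which it returns False are dropped entirely.

```python
import pandas as pd

# Hypothetical data: group A has a small spread in data2, groups B and C a large one
df = pd.DataFrame(
    {"key": list("ABCABC"), "data1": range(6), "data2": [5, 0, 3, 3, 7, 9]}
)

kept = df.groupby("key").filter(lambda x: x["data2"].std() > 4)
```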

37
Q

in df, center the data by subtracting the (key’s) group-wise mean

hint:transform

A

df.groupby('key').transform(lambda x: x - x.mean())
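
Unlike an aggregation, transform returns a result with the same shape as its input. The sketch below centers a hypothetical frame within each key group and selects the data columns explicitly.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": list("ABCABC"), "data1": range(6), "data2": [5, 0, 3, 3, 7, 9]}
)

# Subtract each group's mean from the members of that group
centered = df.groupby("key")[["data1", "data2"]].transform(lambda x: x - x.mean())
```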

38
Q

group and sum the elements of df following the list : L = [0, 1, 0, 1, 2, 0]

A

L = [0, 1, 0, 1, 2, 0]
df.groupby(L).sum()

Another method is to provide a dictionary that maps index values to the group keys:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()

Similar to mapping, you can pass any Python function that will input the index value and output the group:
df2.groupby(str.lower).mean()

Further, any of the preceding key choices can be combined to group on a multi-index:
df2.groupby([str.lower, mapping]).mean()

39
Q

what motivated the creation of Pivot Tables ?

A

The need for a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.

40
Q

1/ Find the ratio of survival by sex and class using the titanic dataframe with both GroupBy and Pivot syntax

2/ Look at age as a third dimension using pivot_table syntax, bin the age into [0,18,80] using pd.cut

A

import seaborn as sns
titanic = sns.load_dataset('titanic')

1/

#GROUPBY
titanic.groupby( ['sex', 'class'])['survived'].mean().unstack()
#PIVOT
titanic.pivot_table('survived', index='sex', columns='class', aggfunc='mean')  # mean is the default; aggfunc can also be a dict mapping columns to functions

2/

age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

41
Q

At times it’s useful to compute totals along each grouping of pandas.pivot_table. This can be done via the ?

A

margins keyword :

titanic.pivot_table('survived', index='sex', columns='class', margins=True)
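
The GroupBy form, the pivot_table form, and the margins keyword can be compared without downloading the Titanic data. The six-row passengers frame below is a made-up stand-in that only borrows the column names sex, class, and survived.

```python
import pandas as pd

# Tiny synthetic stand-in for the titanic frame (not real data)
passengers = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male", "male", "female"],
        "class": ["First", "Third", "First", "Third", "Third", "First"],
        "survived": [1, 0, 0, 0, 1, 1],
    }
)

by_groupby = passengers.groupby(["sex", "class"])["survived"].mean().unstack()
by_pivot = passengers.pivot_table("survived", index="sex", columns="class", aggfunc="mean")

# margins=True adds an 'All' row and an 'All' column with the totals
with_totals = passengers.pivot_table("survived", index="sex", columns="class", margins=True)
```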

42
Q

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

use the pandas .str accessor to capitalize the strings in a vectorized fashion. Why would NumPy or a list comprehension fail to achieve this efficiently?

A

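names = pd.Series(data)

names.str.capitalize()

NumPy offers no equally simple vectorized access for arrays of strings, and a list comprehension such as [s.capitalize() for s in data] raises AttributeError on the None entry. The .str methods pass missing values through.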
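
The contrast between the .str accessor and a plain list comprehension can be sketched as follows, using the data list from the card:

```python
import pandas as pd

data = ["peter", "Paul", None, "MARY", "gUIDO"]
names = pd.Series(data)

capitalized = names.str.capitalize()  # the missing value stays missing

try:
    [s.capitalize() for s in data]    # None has no capitalize method
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True
```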

43
Q

from the Series names, slice the first 3 characters of each name into a new Series

A

names.str[0:3]

44
Q

create a date in numpy’s datetime64

A

import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)

we can now do vectorized operations on date
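
As a brief illustration of such a vectorized operation, adding an integer array to the date yields consecutive days (the array length 3 is arbitrary):

```python
import numpy as np

date = np.array("2015-07-04", dtype=np.datetime64)

# Integers are interpreted in the date's own unit, which is days here
next_days = date + np.arange(3)
```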

45
Q

One detail of the datetime64 and timedelta64 objects is that they are built on a fundamental time unit. Because the datetime64 object is limited to 64-bit precision, the range of encodable times is 2^64 times this fundamental unit. In other words, datetime64 imposes a trade-off between : ?

A

time resolution and maximum time span

46
Q

what’s the use of the pd.Period object ?

A

It represents a span of time at a fixed granularity (‘D’, ‘M’), so that arithmetic is carried out in units of that granularity