pandas Flashcards

1
Q

info about the df

A

df.info()

2
Q

dimension

A

df.shape

3
Q

3,1 entry of df

A

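df.iloc[3, 1] (positions are zero-based, so this is the 4th row, 2nd column; the 3rd row, 1st column is df.iloc[2, 0])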

4
Q

3rd entry of column called A

A

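df.A[2] (looks up the index label 2, so it is the 3rd entry only with the default integer index; df.A.iloc[2] is always the 3rd entry by position)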

5
Q

1st-3rd rows, 1st-3rd columns

A

df.iloc[0:3] for the rows, df.iloc[:, 0:3] for the columns, df.iloc[0:3, 0:3] for both. Note the leading colon needed to select columns.
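
A minimal sketch of the three slicing forms, assuming a small made-up 4x4 frame (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical 4x4 frame used only for illustration.
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("ABCD"))

first_rows = df.iloc[0:3]      # rows at positions 0, 1, 2; all columns
first_cols = df.iloc[:, 0:3]   # all rows; columns A, B, C
block = df.iloc[0:3, 0:3]      # rows 0-2 and columns A-C together
```

As with ordinary Python slices, the end position is excluded.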

6
Q

replace something in list of strings

A

temp_names = [word.replace(".", "_") for word in list(df)]

7
Q

add to front of list

A

a.insert(0,x)

8
Q

sort list

A

sorted(mylist), or to modify the list mylist.sort()
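
A short sketch of the difference, using a made-up list: sorted() returns a new list, while list.sort() modifies the list in place and returns None.

```python
mylist = [3, 1, 2]

new_list = sorted(mylist)   # new sorted list; mylist itself is untouched
snapshot = list(mylist)     # copy taken before the in-place sort

result = mylist.sort()      # sorts mylist in place and returns None
```

A common slip is writing mylist = mylist.sort(), which replaces the list with None.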

9
Q

drop element of list

A

a.pop(5) removes (and returns) the element at index 5

10
Q

list of lists

A

a[2][3]

11
Q

selection from lists

A

[x for x in nums if x>=0]

12
Q

insert items into list

A

r=[1,2,3,4]
r[1:1] =[9,8]
r
[1, 9, 8, 2, 3, 4]

13
Q

sample dict

looping over dict key-value pairs

looping over keys

looping over values

A

ratings = {'4+': 4433, '9+': 987}

for rating, count in ratings.items():

for rating in ratings.keys():

for count in ratings.values():
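
The three loop headers can be put together into a runnable sketch. The dict is the sample one from this card; the variable names pairs, keys and total are illustrative only.

```python
ratings = {"4+": 4433, "9+": 987}

# Key-value pairs
pairs = []
for rating, count in ratings.items():
    pairs.append((rating, count))

# Keys (iterating over the dict directly yields the same keys)
keys = []
for rating in ratings.keys():
    keys.append(rating)

# Values
total = 0
for count in ratings.values():
    total += count
```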

14
Q

repeat list

A

a = [2,0]*4

15
Q

append vs extend

A

x = [1,2]
x.append([3,4]) gives [1,2,[3,4]]
x.extend([3,4]) gives [1,2,3,4]
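
A runnable sketch of the same comparison. It uses two separate lists, since calling both methods on the same list one after the other would stack the results.

```python
x = [1, 2]
x.append([3, 4])   # the whole list is added as ONE element

y = [1, 2]
y.extend([3, 4])   # each item of the argument is added separately
```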

16
Q

zip and lists

A

x=[1,2,3]
y=[4,5,6]
list(zip(x,y))
[(1, 4), (2, 5), (3, 6)]

17
Q

convert string to list

A

list('hello')

18
Q

get multiple values from list

A

lst=[1,5,8,9]
indices=[1,3]
[value for (i, value) in enumerate(lst) if i in set(indices) ]
Out[35]: [5, 9]
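
Two alternatives, sketched for comparison (they are not from the card itself). The enumerate filter returns values in list order, while direct indexing and operator.itemgetter return them in the order of the indices.

```python
from operator import itemgetter

lst = [1, 5, 8, 9]
indices = [3, 1]   # deliberately out of order to show the difference

wanted = set(indices)
by_filter = [value for i, value in enumerate(lst) if i in wanted]  # list order
by_index = [lst[i] for i in indices]                               # index order
by_getter = list(itemgetter(*indices)(lst))                        # index order
```

Note that itemgetter returns a tuple only when given two or more indices; with a single index it returns the bare item.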

19
Q

count number of occurrences in list

A

y = [1,2,3,1,4]

y.count(1)
2

20
Q

get index of item in list

A

first index:
["foo", "bar", "baz"].index('bar')

all indices:
indexes = [i for i, x in enumerate(xs) if x == 'foo']

21
Q

unique elements of list

A

mynewlist = list(set(mylist))
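
One caveat worth a sketch: a set does not keep the original order. If order matters, dict.fromkeys is a common alternative (shown here as an addition, not part of the card), since dicts preserve insertion order in Python 3.7+.

```python
mylist = [3, 1, 3, 2, 1]

unique_any_order = list(set(mylist))           # order is not guaranteed
unique_in_order = list(dict.fromkeys(mylist))  # keeps first-seen order
```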

22
Q

Example creating df

A

In [9]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": 1.0,
   ...:         "B": pd.Timestamp("20130102"),
   ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
   ...:         "D": np.array([3] * 4, dtype="int32"),
   ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
   ...:         "F": "foo",
   ...:     }
   ...: )
   ...:

In [10]: df2
Out[10]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

23
Q

Types of columns

A

df.dtypes

24
Q

Summary each column

A

df.describe()

25
Q

Transpose

A

df.T

26
Q

Sort by an index or by a column

A

In [22]: df.sort_index(axis=1, ascending=False)
Out[22]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
...

In [23]: df.sort_values(by="B")
Out[23]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
...

27
Q

Select row with value of index

A

In [27]: df.loc[dates[0]]
Out[27]:
A 0.469112
B -0.282863
C -1.509059
D -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

28
Q

Select by multiple indices

A

NOTE: For label slicing, both endpoints are included:

df.loc["20130102":"20130104", ["A", "B"]]
Out[29]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
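
A small sketch contrasting label slicing (end included) with positional slicing (end excluded). The frame below is made up for illustration; only the date index mirrors the one used on these cards.

```python
import numpy as np
import pandas as pd

# Hypothetical data on a six-day date index.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130104", ["A", "B"]]   # 3 rows: both endpoints kept
by_position = df.iloc[1:3, 0:2]                        # 2 rows: end position dropped
```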

29
Q

For getting fast access to a scalar

A

In [31]: df.at[dates[0], "A"]
Out[31]: 0.4691122999071863

This returns the same value as the .loc call below, but .at is faster for a single scalar:

In [30]: df.loc[dates[0], "A"]
Out[30]: 0.4691122999071863

30
Q

Select multiple rows and columns by position

A

In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

31
Q

Select with lists of integer position locations

A

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

32
Q

Fast access to scalar with numerical position

A

In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858

The .iloc call below returns the same value, but is slower:
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858

33
Q

Selecting rows satisfying a condition

A

In [39]: df[df["A"] > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

34
Q

Selecting rows where value in a list

A

df2[df2["E"].isin(["two", "four"])]

In [41]: df2 = df.copy()

In [42]: df2["E"] = ["one", "one", "two", "three", "four", "three"]

In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2["E"].isin(["two", "four"])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

35
Q

Setting values by index or position

A

Setting values by label:

df.at[dates[0], "A"] = 0

Setting values by position:

df.iat[0, 1] = 0

36
Q

Setting a column equal to values of a numpy array

A

Setting by assigning with a NumPy array:

df.loc[:, "D"] = np.array([5] * len(df))

37
Q

display all columns of df

A

with pd.option_context('display.max_rows', 5, 'display.max_columns', None):
    print(my_df)

38
Q

filter

A

df[df['x'] == 300]

39
Q

Can assign to multiple entries at one time, like in R

A

In [3]: df.loc[df.AAA >= 5, "BBB"] = -1

In [4]: df
Out[4]:
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50

40
Q

drop columns

A

df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

dfnew = df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Note that drop() defaults to inplace=False, so it does not modify df; it returns a new DataFrame.

Without the columns= keyword, drop() removes rows by default. To drop columns by label instead, pass axis=1:

df.drop('A + B', axis=1)

https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
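
A minimal sketch of the points above, rebuilding the 3x4 frame shown on this card. The variable name same is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 4, 8], "B": [1, 5, 9], "C": [2, 6, 10], "D": [3, 7, 11]})

dfnew = df.drop(columns=["B", "C"])   # returns a new frame; df is unchanged
same = df.drop(["B", "C"], axis=1)    # equivalent spelling using axis=1
```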

41
Q

convert to categorical

A

df_cleaned['reltoref'] = df_cleaned['reltoref'].astype('category')

42
Q

Sort according to value closest to a particular value

A

df
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

aValue = 43.0

df.loc[(df.CCC - aValue).abs().argsort()]

   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50

43
Q

Select columns; also select one entry by column and row label

A

df['a']

mycolumns = ['a', 'b']
df[mycolumns]

Or

df[['a', 'b']]

df['B']['Z']

44
Q

Create column

A

df['A + B'] = df['A'] + df['B']

45
Q

Select row

A

df.loc['X']

df.iloc[0]

46
Q

Select two columns and two rows by names

A

df[['A', 'B']].loc[['X', 'Y']]

47
Q

subset of the DataFrame where the value in column C is less than 1

A

df['C'] < 1 gives a Boolean mask:

X     True
Y    False
Z    False
Name: C, dtype: bool

Index with the mask to get the subset: df[df['C'] < 1]
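
A short sketch of the two steps, mask then subset. The row labels X, Y, Z come from the card; the values are made up so that only X satisfies the condition.

```python
import pandas as pd

# Hypothetical values chosen so that only row X has C < 1.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "C": [0.5, 1.5, 2.5]}, index=["X", "Y", "Z"])

mask = df["C"] < 1    # Boolean Series, one entry per row
subset = df[mask]     # keeps only the rows where the mask is True
```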

48
Q

Select subset based on some condition

A

df[(df['C'] > 0) & (df['A'] > 0)]

49
Q

R: arrange(df, col1, col2)
R: arrange(df, desc(col1))

A

pandas: df.sort_values(['col1', 'col2'])
pandas: df.sort_values('col1', ascending=False)

50
Q

R: filter(df, col1 == 1, col2 == 1)

A

df.query('col1 == 1 & col2 == 1')

51
Q

R: distinct(select(df, col1))

R: distinct(select(df, col1, col2))

A

df[['col1']].drop_duplicates()

df[['col1', 'col2']].drop_duplicates()

52
Q

select based on dtype

A

For example, to select bool columns
In [329]: df.select_dtypes(include=[bool])

To select string columns you must use the object dtype:
In [332]: df.select_dtypes(include=['object'])

53
Q

R: mutate(df, c=a-b)

A

df.assign(c=df['a'] - df['b'])

54
Q

R: summarise(gdf, avg=mean(col1, na.rm=TRUE))
R: summarise(gdf, total=sum(col1))

A

df.groupby('col1').agg({'col1': 'mean'})
df.groupby('col1').sum()

55
Q

Drop rows with NA

Drop columns with NA

A

Drop rows:
df.dropna()

Drop columns:
df.dropna(axis=1)

56
Q

fill the missing values within a particular column with the average value from that column

A

df['A'].fillna(df['A'].mean())
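
A minimal sketch with a made-up column. Note that fillna returns a new Series, so the result has to be assigned back if the DataFrame itself should change.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

col_mean = df["A"].mean()          # mean() skips NaN by default: (1 + 3) / 2
filled = df["A"].fillna(col_mean)  # new Series; df is not modified yet
df["A"] = filled                   # assign back to update the frame
```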

57
Q

Group by examples

A

df.groupby('Organization').mean()
# Mean of the other columns that are numerical, per group

df.groupby('Organization').sum()

df.groupby('Organization').std()
# The standard deviation of the numerical columns (e.g. a sales column), per group

df.groupby('Organization').count()

df.groupby('Organization').describe()

df.groupby('Organization').max()

df.groupby('Organization').min()
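
A small sketch of group-wise aggregation. The Organization column name comes from the card; the Sales column and all values are assumptions made up for illustration. Selecting the column before aggregating keeps the result to the one column of interest.

```python
import pandas as pd

# Hypothetical data: two organizations with made-up sales figures.
df = pd.DataFrame({
    "Organization": ["Google", "Google", "Microsoft"],
    "Sales": [200, 120, 340],
})

means = df.groupby("Organization")["Sales"].mean()   # one mean per group
stds = df.groupby("Organization")["Sales"].std()     # sample std (NaN for a single row)
counts = df.groupby("Organization")["Sales"].count()
```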

58
Q

Concatenate data frames along rows or columns

A

Rows
pd.concat([df1,df2,df3])

Columns
pd.concat([df1,df2,df3],axis=1)

59
Q

Merge

A

pd.merge(leftDataFrame, rightDataFrame, how='inner', on='id')

df.join() does the same kind of combination, but joins on the index by default.
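
A sketch comparing the two on made-up frames. The id key comes from the card; the other column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical frames sharing an "id" key.
left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [20, 30, 40]})

# merge matches rows on the "id" column.
merged = pd.merge(left, right, how="inner", on="id")

# join matches rows on the index, so "id" is moved into the index first.
joined = left.set_index("id").join(right.set_index("id"), how="inner")
```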

60
Q

Unique and counts of unique

A

df['col2'].unique()
Gets the unique items; only works on a Series.

df['col2'].nunique()
Gets the number of unique items.

61
Q

Counts

A

df['col2'].value_counts()

62
Q

Map

A

The apply method applies a function (here a user-defined one called exponentify) to each element of the Series:

df['col2'].apply(exponentify)
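
The card does not define exponentify, so the body below (raising each value to its own power) is an assumption made purely for illustration, as are the sample values.

```python
import pandas as pd

def exponentify(x):
    # Hypothetical definition: the card does not show the real one.
    return x ** x

df = pd.DataFrame({"col2": [1, 2, 3]})

result = df["col2"].apply(exponentify)                 # called once per element
same_with_lambda = df["col2"].apply(lambda x: x ** x)  # inline equivalent
```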

63
Q

Sort rows by value of a column

A

df.sort_values('col2')

64
Q

Pipe / chaining

A

df_chain = (
    pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
    .fillna('')
)

Note the parentheses: they let the method chain continue across several lines.

65
Q

df[0:3]

A

Gets the first 3 rows. Other slices of consecutive rows work the same way.

66
Q

Label slicing and endpoints

A

df.loc["20130102":"20130104", ["A", "B"]]

                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

For label slicing, both endpoints are included.

67
Q

Lists of integer positions

A

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

68
Q

Select like %in% does in R

A

df2[df2["E"].isin(["two", "four"])]

Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

69
Q

to and from pickle

A

df.to_pickle("myfile.pkl"), pd.read_pickle("myfile.pkl")

70
Q
A
71
Q

Bitwise Boolean

A

In [14]: s = pd.Series(range(5))

In [15]: s == 4
Out[15]:
0 False
1 False
2 False
3 False
4 True
dtype: bool

72
Q

In and isin

A

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

s = pd.Series(range(5), index=list("abcde"))

2 in s
Out[17]: False

'b' in s
Out[18]: True

If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():

s.isin([2])
Out[19]:
a    False
b    False
c     True
d    False
e    False
dtype: bool

s.isin([2]).any()
Out[20]: True

For DataFrame, likewise, in applies to the column axis, testing for membership in the list of column names.

73
Q

Create series from a scalar

A

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64

74
Q

Change/assign categories/levels vs R

A

In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use categories to change the categories after creation time.

75
Q

Category type default

A

The category type is unordered by default; you need a CategoricalDtype for ordered factors.

As a convenience, you can use the string 'category' in place of a CategoricalDtype when you want the default behavior of the categories being unordered, and equal to the set of values present in the array. In other words, dtype='category' is equivalent to dtype=CategoricalDtype().
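
A brief sketch of both cases. The labels low/medium/high and the name size_type are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["low", "high", "medium", "low"])

# Default: unordered, categories inferred from the values present.
unordered = s.astype("category")

# Ordered factor: spell out the categories and their order explicitly.
size_type = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
ordered = s.astype(size_type)

largest = ordered.max()   # meaningful only because the dtype is ordered
```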

76
Q

Categorical series and describe

A

In [53]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])

In [54]: df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})

In [55]: df.describe()
Out[55]:
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2

In [56]: df["cat"].describe()
Out[56]:
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

77
Q

Categorical series and unique

A

The result of unique() is not always the same as Series.cat.categories, because Series.unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it only includes values that are actually present.
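
A small sketch of that difference, using a made-up categorical Series in which category "b" is declared but never appears:

```python
import pandas as pd

# Hypothetical data: "b" is a declared category with no occurrences.
s = pd.Series(pd.Categorical(["c", "a", "c"], categories=["a", "b", "c"]))

present = list(s.unique())          # values actually present, in order of appearance
declared = list(s.cat.categories)   # every declared category, in declared order
```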