pandas Flashcards

1
Q

info about the df

A

df.info()

2
Q

dimension

A

df.shape

3
Q

entry at row position 3, column position 1 of df (positions are zero-based)

A

df.iloc[3,1]

4
Q

3rd entry of column called A

A

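df["A"].iloc[2] (df.A[2] looks up the index label 2, which is the 3rd entry only with the default integer index)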

5
Q

1st-3rd rows, 1st-3rd columns

A

df.iloc[0:3] for the rows, df.iloc[:, 0:3] for the columns; note the leading colon needed to get the columns
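A small sketch of the same slices on a made-up 4x4 frame (the data and column names are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical 4x4 frame, used only to show the shapes of the slices.
df = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
    columns=["A", "B", "C", "D"],
)

first_rows = df.iloc[0:3]      # rows at positions 0, 1, 2; all columns
first_cols = df.iloc[:, 0:3]   # all rows; columns at positions 0, 1, 2
block = df.iloc[0:3, 0:3]      # both selections in one call
```

The end position of an iloc slice is excluded, as with ordinary Python slices.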

6
Q

replace something in list of strings

A

temp_names = [word.replace(".", "_") for word in list(df)]

7
Q

add to front of list

A

a.insert(0,x)

8
Q

sort list

A

sorted(mylist), or to modify the list mylist.sort()

9
Q

drop element of list

A

a.pop(5)

10
Q

list of lists

A

a[2][3]

11
Q

selection from lists

A

[x for x in nums if x>=0]

12
Q

insert items into list

A

r=[1,2,3,4]
r[1:1] =[9,8]
r
[1, 9, 8, 2, 3, 4]

13
Q

sample dict

looping over dict key-value pairs

looping over keys

looping over values

A

ratings = {'4+': 4433, '9+': 987}

for rating, count in ratings.items():

for rating in ratings.keys():

for count in ratings.values():

14
Q

repeat list

A

a = [2,0]*4

15
Q

append vs extend

A

x = [1,2]
x.append([3,4]) gives [1,2,[3,4]]
x.extend([3,4]) gives [1,2,3,4]

16
Q

zip and lists

A

x=[1,2,3]
y=[4,5,6]
list(zip(x,y))
[(1, 4), (2, 5), (3, 6)]

17
Q

convert string to list

A

list('hello')

18
Q

get multiple values from list

A

lst=[1,5,8,9]
indices=[1,3]
[value for (i, value) in enumerate(lst) if i in set(indices) ]
Out[35]: [5, 9]
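A possibly simpler alternative is to index directly with each requested position. This is only a sketch; the reordered request is an assumption added to show how it differs from filtering with enumerate, which always returns values in list order:

```python
lst = [1, 5, 8, 9]
indices = [1, 3]

# Direct lookup: keeps the order (and any repeats) of `indices`.
picked = [lst[i] for i in indices]

# Hypothetical reordered request, to show the difference from filtering.
reordered = [lst[i] for i in [3, 1]]
```

Here picked is [5, 9] and reordered is [9, 5].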

19
Q

count number of occurrences in list

A

y = [1,2,3,1,4]

y.count(1)
2

20
Q

get index of item in list

A

first index:
["foo", "bar", "baz"].index('bar')

all indices:
indexes = [i for i, x in enumerate(xs) if x == 'foo']

21
Q

unique elements of list

A

mynewlist = list(set(mylist))
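Note that a set does not keep the original order. If order matters, one common option is dict.fromkeys; a minimal sketch with made-up values:

```python
mylist = [3, 1, 3, 2, 1]

# set() removes duplicates but gives no ordering guarantee.
unique_any_order = list(set(mylist))

# dict keys keep insertion order, so first occurrences stay in place.
unique_in_order = list(dict.fromkeys(mylist))
```

Here unique_in_order is [3, 1, 2].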

22
Q

Example creating df

A

In [9]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": 1.0,
   ...:         "B": pd.Timestamp("20130102"),
   ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
   ...:         "D": np.array([3] * 4, dtype="int32"),
   ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
   ...:         "F": "foo",
   ...:     }
   ...: )
   ...:

In [10]: df2
Out[10]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

23
Q

Types of columns

A

df.dtypes

24
Q

Summary each column

A

df.describe()

25
Transpose
df.T
26
Sort by an index or by a column
In [22]: df.sort_index(axis=1, ascending=False)
Out[22]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
....

In [23]: df.sort_values(by="B")
Out[23]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
....
27
Select row with value of index
In [27]: df.loc[dates[0]]
Out[27]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
28
Select by multiple indices
NOTE: For label slicing, both endpoints are included:
df.loc["20130102":"20130104", ["A", "B"]]
Out[29]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
29
For getting fast access to a scalar
df.at[dates[0], "A"]
Out[31]: 0.4691122999071863
df.loc below returns the same value, but df.at above is faster for a single scalar:
In [30]: df.loc[dates[0], "A"]
Out[30]: 0.4691122999071863
30
Select multiple rows and columns by integer position
In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
31
Select with Lists of integer position locations:
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
32
Fast access to scalar with numerical position
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858
Below is the same, but slower:
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858
33
Selecting rows satisfying a condition
In [39]: df[df["A"] > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
34
Selecting rows where value in a list
df2[df2["E"].isin(["two", "four"])]

In [41]: df2 = df.copy()
In [42]: df2["E"] = ["one", "one", "two", "three", "four", "three"]
In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2["E"].isin(["two", "four"])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
35
Setting values by index or position
Setting values by label: df.at[dates[0], "A"] = 0
Setting values by position: df.iat[0, 1] = 0
36
Setting a column equal to values of a numpy array
Setting by assigning with a NumPy array: df.loc[:, "D"] = np.array([5] * len(df))
37
display all columns of df
with pd.option_context('display.max_rows', 5, 'display.max_columns', None):
    print(my_df)
38
filter
df[df['x'] == 300]
39
Can assign to multiple entries at one time, like in R
In [3]: df.loc[df.AAA >= 5, "BBB"] = -1
In [4]: df
Out[4]:
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50
40
drop columns
df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

dfnew = df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Note that drop has a default of inplace=False, so it does not modify df; it returns a new DataFrame.
Note also that drop defaults to dropping rows (axis=0), not columns. To operate on columns, pass columns=... as above, or pass axis=1:
df.drop('A + B', axis=1)
https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
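The behaviors described in this card can be checked with a short sketch. The frame is rebuilt from the values shown in the card; the variable names same and rows_dropped are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 4, 8], "B": [1, 5, 9], "C": [2, 6, 10], "D": [3, 7, 11]})

dfnew = df.drop(columns=["B", "C"])   # returns a new frame; df is untouched
same = df.drop(["B", "C"], axis=1)    # equivalent spelling with axis=1
rows_dropped = df.drop([0])           # default axis=0: drops the row labeled 0
```

After this runs, df still has all four columns, because inplace defaults to False.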
41
convert to categorical
df_cleaned['reltoref'] = df_cleaned['reltoref'].astype('category')
42
Sort according to value closest to a particular value
df
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
aValue = 43.0
df.loc[(df.CCC - aValue).abs().argsort()]
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
43
Select columns; also select one entry by column and row label
df['a']
mycolumns = ['a', 'b']
df[mycolumns]
Or df[['a', 'b']]
df['B']['Z'] selects the entry in column 'B' at row label 'Z'
44
Create column
df['A + B'] = df['A'] + df['B']
45
Select row
By label: df.loc['X']
By position: df.iloc[0]
46
Select two columns and two rows by names
df[['A', 'B']].loc[['X', 'Y']]
47
subset of the DataFrame where the value in column C is less than 1
df['C'] < 1 gives a boolean mask:
X     True
Y    False
Z    False
Name: C, dtype: bool
Index with the mask to get the subset: df[df['C'] < 1]
48
Select subset based on some condition
df[(df['C'] > 0) & (df['A']> 0)]
49
R: arrange(df, col1, col2); R: arrange(df, desc(col1))
pandas: df.sort_values(['col1', 'col2'])
pandas: df.sort_values('col1', ascending=False)
50
R: filter(df, col1 == 1, col2 == 1)
df.query('col1 == 1 & col2 == 1')
51
R: distinct(select(df, col1)); R: distinct(select(df, col1, col2))
df[['col1']].drop_duplicates()
df[['col1', 'col2']].drop_duplicates()
52
select based on dtype
For example, to select bool columns:
In [329]: df.select_dtypes(include=[bool])
To select string columns you must use the object dtype:
In [332]: df.select_dtypes(include=['object'])
53
R: mutate(df, c=a-b)
df.assign(c=df['a']-df['b'])
54
R: summarise(gdf, avg=mean(col1, na.rm=TRUE)); R: summarise(gdf, total=sum(col1))
df.groupby('col1').agg({'col1': 'mean'})
df.groupby('col1').sum()
55
Drop rows with NA; drop columns with NA
Rows: df.dropna()
Columns: df.dropna(axis=1)
56
fill the missing values within a particular column with the average value from that column
df['A'].fillna(df['A'].mean())
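fillna returns a new Series and leaves the column unchanged unless the result is assigned back. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

filled = df["A"].fillna(df["A"].mean())     # mean() skips NaN: (1 + 3) / 2 = 2
still_missing = bool(df["A"].isna().any())  # True: df itself is not modified yet

df["A"] = filled                            # assign back to keep the result
```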
57
Group by examples
df.groupby('Organization').mean()  # mean of the other columns that are numerical
df.groupby('Organization').sum()
df.groupby('Organization').std()  # the standard deviation of the sales column
df.groupby('Organization').count()
df.groupby('Organization').describe()
df.groupby('Organization').max()
df.groupby('Organization').min()
58
Concatenate data frames along rows or columns
Rows: pd.concat([df1, df2, df3])
Columns: pd.concat([df1, df2, df3], axis=1)
59
Merge
pd.merge(leftDataFrame, rightDataFrame, how='inner', on='id')
leftDataFrame.join(rightDataFrame) combines frames in a similar way, but matches on the index and defaults to how='left'.
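A sketch comparing the two on small made-up frames (the data and the column names x and y are assumptions):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [20, 30, 40]})

# merge matches rows on the 'id' column.
merged = pd.merge(left, right, how="inner", on="id")

# join matches rows on the index, so 'id' is moved into the index first.
joined = left.set_index("id").join(right.set_index("id"), how="inner")
```

Both keep only ids 2 and 3; the difference is that merged has id as a column while joined has it as the index.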
60
Unique and counts of unique
df['col2'].unique()   # gets the unique items; only works on a Series
df['col2'].nunique()  # gets the number of unique items
61
Counts
df['col2'].value_counts()
62
Map
The apply method allows you to easily apply the exponentify function to each element of the Series: df['col2'].apply(exponentify)
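The card does not show exponentify itself, so the definition below is an assumption made only so that the call can run:

```python
import pandas as pd

def exponentify(x):
    # Hypothetical definition: raise each value to its own power.
    return x ** x

# Made-up data for illustration.
df = pd.DataFrame({"col2": [1, 2, 3]})

result = df["col2"].apply(exponentify)  # called once per element
```

With this definition result holds 1, 4, 27.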
63
Sort rows by value of a column
df.sort_values('col2')
64
Pipe / chaining
df_chain = (
    pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
    .fillna('')
)
Note the parentheses, which let the chain span several lines.
65
df[0:3]
Gets the first 3 rows. Any other slice of consecutive row positions works the same way.
66
Label slicing and endpoints
df.loc["20130102":"20130104", ["A", "B"]]
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
67
Lists of integer positions
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
68
Select like %in% does in R
df2[df2["E"].isin(["two", "four"])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
69
to and from pickle
df.to_pickle("myfile.pkl"), pd.read_pickle("myfile.pkl")
70
71
Bitwise Boolean
In [14]: s = pd.Series(range(5))
In [15]: s == 4
Out[15]:
0    False
1    False
2    False
3    False
4     True
dtype: bool
72
In and isin
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
s = pd.Series(range(5), index=list("abcde"))
2 in s
Out[17]: False
'b' in s
Out[18]: True
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():
s.isin([2])
Out[19]:
a    False
b    False
c     True
d    False
e    False
dtype: bool
s.isin([2]).any()
Out[20]: True
For DataFrame, likewise, in applies to the column axis, testing for membership in the list of column names.
73
Create series from a scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
pd.Series(5.0, index=["a", "b", "c", "d", "e"])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
74
Change/assign categories/levels vs R
In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use categories to change the categories after creation time.
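One way to change the labels after creation is the cat accessor's rename_categories method. A minimal sketch; the data and the new labels are made up:

```python
import pandas as pd

# Hypothetical categorical Series.
s = pd.Series(["a", "b", "a"], dtype="category")

# Map old category labels to new ones after the Series exists.
renamed = s.cat.rename_categories({"a": "first", "b": "second"})
```

The call returns a new Series, so s keeps its original labels.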
75
Category type default
The category type is unordered by default; ordered factors need a CategoricalDtype with ordered=True. As a convenience, you can use the string 'category' in place of a CategoricalDtype when you want the default behavior of the categories being unordered, and equal to the set values present in the array. In other words, dtype='category' is equivalent to dtype=CategoricalDtype().
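A sketch of the difference between the default and an ordered dtype. The values and the name size_type are assumptions for illustration:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up values.
s = pd.Series(["low", "high", "medium", "low"])

unordered = s.astype("category")  # default: unordered categories

# Hypothetical ordered dtype with an explicit category order.
size_type = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
ordered = s.astype(size_type)

smallest = ordered.min()  # min and max follow the declared category order
largest = ordered.max()
```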
76
Categorical series and describe
In [53]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [54]: df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
In [55]: df.describe()
Out[55]:
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
In [56]: df["cat"].describe()
Out[56]:
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
77
Categorical series and unique
The result of unique() is not always the same as Series.cat.categories, because Series.unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it only includes values that are actually present.
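A small sketch of that difference, using made-up values where the category "b" is declared but never appears:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["c", "a", "c"], categories=["a", "b", "c"]))

present = list(s.unique())         # values in order of appearance
declared = list(s.cat.categories)  # every declared category
```

Here present is ['c', 'a'] while declared is ['a', 'b', 'c'].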