pandas Flashcards

Question

Transpose

Answer 1

In [22]: df.sort_index(axis=1, ascending=False)
Out[22]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
....

In [23]: df.sort_values(by="B")
Out[23]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
....

Answer 2

In [27]: df.loc[dates[0]]
Out[27]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

Answer 3

NOTE: For label slicing, both endpoints are included:
df.loc["20130102":"20130104", ["A", "B"]]
Out[29]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
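The inclusive-endpoint rule is easiest to see next to a positional slice. The following is a minimal sketch; the frame and its integer values are made up for illustration and are not the data shown on the card.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: six consecutive days, four integer columns.
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based slice: both endpoints are part of the result (3 rows).
by_label = df.loc["20130102":"20130104", ["A", "B"]]

# Position-based slice: the stop position is excluded (2 rows).
by_position = df.iloc[1:3, 0:2]

print(len(by_label), len(by_position))
```

The label slice keeps 2013-01-04, while the positional slice 1:3 stops before position 3.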

Answer 4

df.at[dates[0], "A"]
Out[31]: 0.4691122999071863

This returns the same scalar as the .loc call below, but .at is faster:

In [30]: df.loc[dates[0], "A"]
Out[30]: 0.4691122999071863

Answer 5

In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

Answer 6

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

Answer 7

In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858

The .iloc call below returns the same scalar, but is slower:

In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858

Answer 8

In [39]: df[df["A"] > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Answer 9

df2[df2["E"].isin(["two", "four"])]

In [41]: df2 = df.copy()
In [42]: df2["E"] = ["one", "one", "two", "three", "four", "three"]
In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2["E"].isin(["two", "four"])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

Answer 10

Setting values by label:
df.at[dates[0], "A"] = 0
Setting values by position:
df.iat[0, 1] = 0

Answer 11

Setting by assigning with a NumPy array: df.loc[:, "D"] = np.array([5] * len(df))

Answer 12

with pd.option_context('display.max_rows', 5, 'display.max_columns', None): print(my_df)
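The point of option_context is that the display settings apply only inside the with block. A minimal sketch, assuming a made-up my_df with 100 rows:

```python
import pandas as pd

# Hypothetical frame that is long enough to be truncated when printed.
my_df = pd.DataFrame({"x": range(100), "y": range(100)})

default_max_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    # Inside the block the temporary settings are active.
    inside = pd.get_option("display.max_rows")
    print(my_df)

# After the block the previous setting is restored automatically.
after = pd.get_option("display.max_rows")
```

This avoids changing the global display options with pd.set_option and having to reset them afterwards.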

Answer 13

df[df['x'] == 300]

Answer 14

In [3]: df.loc[df.AAA >= 5, "BBB"] = -1
In [4]: df
Out[4]:
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50

Answer 15

df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

dfnew = df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Note that drop defaults to inplace=False, so it does not modify df; it returns a new DataFrame.

Note also that when labels are passed without a keyword, this method defaults to dropping rows, not columns. To operate on columns, pass the axis=1 argument (or use columns= as above):
df.drop('A + B', axis=1)

See:
https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
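The three behaviours described above (columns= keyword, axis=1, and the row default) can be checked side by side. This is a small sketch that rebuilds the 3x4 frame from the card with np.arange:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# Drop columns by keyword; df itself is left untouched.
dfnew = df.drop(columns=["B", "C"])

# Equivalent spelling using axis=1.
same_result = df.drop(["B", "C"], axis=1)

# Without columns= or axis=1, the labels are looked up in the row index.
without_first_row = df.drop(0)

print(dfnew)
```

To keep the result, assign it to a name as shown; the original df still has all four columns.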

Answer 16

df_cleaned['reltoref'] = df_cleaned['reltoref'].astype('category')

Answer 17

df
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

aValue = 43.0
df.loc[(df.CCC - aValue).abs().argsort()]
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
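argsort returns integer positions, so the .loc form above lines up only because the index is the default 0..n-1. A hedged variant that uses .iloc instead (a substitution for illustration, not the card's own code) would also work with other indexes:

```python
import pandas as pd

df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
aValue = 43.0

# Absolute distance of each CCC value from the target: 57, 7, 73, 93.
distance = (df["CCC"] - aValue).abs()

# argsort gives the positions that would sort the distances ascending.
order = distance.argsort()

# Reorder rows by position so the closest row comes first.
closest_first = df.iloc[order.to_numpy()]
print(closest_first)
```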

Answer 18

df['a']
nycolumns = ['a', 'b']
df[nycolumns]
Or
df[['a', 'b']]
df['B']['Z']

Answer 19

df['A + B'] = df['A'] + df['B']

Answer 20

df.loc['X']
df.iloc[0]

Answer 21

df[['A', 'B']].loc[['X', 'Y']]

Answer 22

df['C'] < 1
X     True
Y    False
Z    False
Name: C, dtype: bool

Answer 23

df[(df['C'] > 0) & (df['A']> 0)]

Answer 24

pandas: df.sort_values(['col1', 'col2'])
pandas: df.sort_values('col1', ascending=False)

Answer 25

df.query('col1 == 1 & col2 == 1')

Answer 26

df[['col1']].drop_duplicates()
df[['col1', 'col2']].drop_duplicates()

Answer 27

For example, to select bool columns:
In [329]: df.select_dtypes(include=[bool])
To select string columns you must use the object dtype:
In [332]: df.select_dtypes(include=['object'])

Answer 28

df.assign(c=df['a']-df['b'])

Answer 29

df.groupby('col1').agg({'col1': 'mean'})
df.groupby('col1').sum()

Answer 30

Drop rows that contain missing values:
df.dropna()
Drop columns that contain missing values:
df.dropna(axis=1)

Answer 31

df['A'].fillna(df['A'].mean())
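fillna returns a new Series rather than changing the column, so the result has to be assigned back if it should persist. A minimal sketch with a made-up three-row column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value; the mean of 1.0 and 3.0 is 2.0.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# The mean ignores NaN by default, so it can be used as the fill value.
filled = df["A"].fillna(df["A"].mean())

# The original column is unchanged until the result is assigned back.
missing_before = int(df["A"].isna().sum())
df["A"] = filled
missing_after = int(df["A"].isna().sum())
```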

Answer 32

df.groupby('Organization').mean()  # Mean of the other columns that are numerical
df.groupby('Organization').sum()
df.groupby('Organization').std()  # The standard deviation of the sales column
df.groupby('Organization').count()
df.groupby('Organization').describe()
df.groupby('Organization').max()
df.groupby('Organization').min()

Answer 33

Rows:
pd.concat([df1, df2, df3])
Columns:
pd.concat([df1, df2, df3], axis=1)

Answer 34

pd.merge(leftDataFrame, rightDataFrame, how='inner', on='id')
DataFrame.join does the same kind of operation, but matches on the index by default.
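To make the merge/join comparison concrete, here is a small sketch. The two frames and their column names (left_val, right_val) are invented for illustration:

```python
import pandas as pd

# Hypothetical frames sharing an 'id' column; only ids 2 and 3 appear in both.
leftDataFrame = pd.DataFrame({"id": [1, 2, 3], "left_val": ["a", "b", "c"]})
rightDataFrame = pd.DataFrame({"id": [2, 3, 4], "right_val": ["x", "y", "z"]})

# merge matches rows on the 'id' column.
merged = pd.merge(leftDataFrame, rightDataFrame, how="inner", on="id")

# join matches rows on the index, so 'id' is moved into the index first.
joined = leftDataFrame.set_index("id").join(
    rightDataFrame.set_index("id"), how="inner"
)

print(merged)
print(joined)
```

Both results contain the rows for ids 2 and 3; the difference is whether id ends up as a column or as the index.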

Answer 35

df['col2'].unique()
Gets the unique items; only works on a Series.
df['col2'].nunique()
Gets the number of unique items.

Answer 36

df['col2'].value_counts()

Answer 37

The apply method allows you to easily apply the exponentify function to each element of the Series: df['col2'].apply(exponentify)
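The card does not show how exponentify is defined. The sketch below assumes it is a plain function that returns e raised to its argument; that definition is a guess for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed definition: the card only names the function.
def exponentify(x):
    return np.exp(x)

df = pd.DataFrame({"col2": [0.0, 1.0, 2.0]})

# apply calls exponentify once per element of the Series.
result = df["col2"].apply(exponentify)

# For a NumPy ufunc like exp, calling it on the whole Series gives the same values.
vectorized = np.exp(df["col2"])
print(result)
```

apply is the general tool for arbitrary Python functions; when a vectorized equivalent exists it is usually the faster choice.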

Answer 38

df.sort_values('col2')

Answer 39

df_chain = (
    pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
    .fillna('')
)
Note the parentheses.
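The parentheses matter because they let each chained method sit on its own line without backslashes. A self-contained sketch of the same pattern, reading from an in-memory CSV with invented columns instead of the URL above:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a downloaded file; one sales value is missing.
csv_text = "date,sales\n2021-01-01,10\n2021-01-02,\n2021-01-03,30\n"

df_chain = (
    pd.read_csv(io.StringIO(csv_text))
    .fillna(0)                                # replace the missing value
    .rename(columns={"sales": "revenue"})     # each step returns a new DataFrame
    .sort_values("revenue", ascending=False)  # so the next method can follow
)
print(df_chain)
```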

Answer 40

Gets the first 3 rows. Slicing with consecutive index labels works similarly.

Answer 41

df.loc["20130102":"20130104", ["A", "B"]]
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

Answer 42

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

Answer 43

df2[df2["E"].isin(["two", "four"])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

Answer 44

df.to_pickle("myfile.pkl"), pd.read_pickle("myfile.pkl")

Answer 45

In [14]: s = pd.Series(range(5))
In [15]: s == 4
Out[15]:
0    False
1    False
2    False
3    False
4     True
dtype: bool

Answer 46

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

s = pd.Series(range(5), index=list("abcde"))
2 in s
Out[17]: False
'b' in s
Out[18]: True

If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():

s.isin([2])
Out[19]:
a    False
b    False
c     True
d    False
e    False
dtype: bool
s.isin([2]).any()
Out[20]: True

For DataFrame, likewise, in applies to the column axis, testing for membership in the list of column names.
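The DataFrame case mentioned above has no example on the card, so here is a short sketch covering both Series and DataFrame. The small df with columns x and y is invented for illustration.

```python
import pandas as pd

s = pd.Series(range(5), index=list("abcde"))

label_hit = "b" in s              # tests the index labels
value_as_label = 2 in s           # 2 is a value, not a label
value_hit = s.isin([2]).any()     # tests the values

# Hypothetical two-column frame: 'in' looks at the column names.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
column_hit = "x" in df
cell_value_as_column = 1 in df

print(label_hit, value_as_label, bool(value_hit), column_hit, cell_value_as_column)
```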

Answer 47

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

pd.Series(5.0, index=["a", "b", "c", "d", "e"])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Answer 48

In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use categories to change the categories after creation time.

Answer 49

The category type is unordered by default; an ordered factor needs a CategoricalDtype. As a convenience, you can use the string 'category' in place of a CategoricalDtype when you want the default behavior of the categories being unordered, and equal to the set values present in the array. In other words, dtype='category' is equivalent to dtype=CategoricalDtype().
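A short sketch of the difference between the 'category' string and an explicit ordered CategoricalDtype. The size labels are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data: T-shirt sizes with a natural order.
sizes = pd.Series(["small", "large", "medium", "small"])

# The string 'category' gives unordered categories inferred from the values.
unordered = sizes.astype("category")

# An explicit dtype fixes the category order and marks it as ordered.
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
ordered = sizes.astype(size_type)

print(unordered.cat.ordered)          # False
print(ordered.max())                  # large
print((ordered > "small").tolist())   # [False, True, True, False]
```

Comparisons such as > and reductions such as max() follow the declared category order, not alphabetical order.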

Answer 50

In [53]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [54]: df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
In [55]: df.describe()
Out[55]:
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2

In [56]: df["cat"].describe()
Out[56]:
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

Answer 51

The result of unique() is not always the same as Series.cat.categories, because Series.unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it only includes values that are actually present.
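Both guarantees can be seen on a small Series whose declared categories include a value that never occurs. The letters used here are arbitrary:

```python
import pandas as pd

# Categories a-d are declared, but 'd' never appears in the data.
s = pd.Series(list("babc")).astype(pd.CategoricalDtype(list("abcd")))

# unique(): order of first appearance, only values that are present.
present = list(s.unique())

# cat.categories: every declared category, in declared order.
declared = list(s.cat.categories)

print(present)
print(declared)
```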

(77 cards)