Pandas Primer Flashcards

Question

* What can be used with pandas vectorization?

Answer 1

1. built-in pd.Series methods
2. operations that are compatible with NumPy arrays, for example basic math operations or Boolean conditions
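As a rough illustration of both kinds of vectorized operation, here is a minimal sketch. The Series name and its values are made up for the example.

```python
import pandas as pd

# Hypothetical data, only for illustration.
prices = pd.Series([10.0, 25.0, 40.0])

# Built-in Series method: works on the whole column at once.
total = prices.sum()

# NumPy-compatible arithmetic: applied element-wise, no Python loop.
scaled = prices * 1.5

# Boolean condition: yields a Boolean Series that can be used as a mask.
is_expensive = prices > 20
expensive = prices[is_expensive]
```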

Answer 2

applying an operation to the entire column array at once, instead of to individual column elements

Answer 3

1. Use the apply method
2. Why?
    * It's the 2nd fastest row iteration approach of the 4
    * It can work with any input function

Answer 4

Apply vectorization on the underlying NumPy arrays (by calling df[column\_name].to\_numpy()) for an even greater speedup.
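A minimal sketch of that idea, assuming a small dataframe with two made-up numeric columns. Any actual speedup depends on the data size and is not measured here.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, only for illustration.
df = pd.DataFrame({"width": [1.0, 2.0, 3.0], "height": [4.0, 5.0, 6.0]})

# Extract the underlying NumPy arrays, then do the math on them directly.
width = df["width"].to_numpy()
height = df["height"].to_numpy()
df["area"] = width * height
```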

Answer 5

its powerful data manipulation functions

Answer 6

**most** do not modify the input dataframe and only return the output in a new dataframe

Answer 7

Either:

1. Look for the parameter inplace in the method API
2. Or reassign your dataframe to the method output. (e.g. df = df.fillna(0))

Answer 8

fillna(new\_val, inplace = True). By default, inplace = False.
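The two options can be sketched side by side. The column name and values below are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

# Default behaviour: a new dataframe is returned and df itself is unchanged.
filled = df.fillna(0)
still_missing = int(df["score"].isna().sum())

# Option 1: reassign the method output.
df_reassigned = df.fillna(0)

# Option 2: modify the dataframe in place.
df_inplace = df.copy()
df_inplace.fillna(0, inplace=True)
```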

Answer 9

* Wide
    * every row represents a unique observation and every column represents a feature (e.g. rows are distinct countries and the columns are relevant attributes of each country)
* Long
    * there is one column for the observation ID, one column for attribute name, and one for attribute value. (e.g. screenshot)
* Tradeoffs
    * Long is often easier to implement, as addition of a new feature does not change the table structure
    * Long is harder to understand.
* When they should be used:
    * The long format is useful when you are curating data and do not yet know what the final structure will be. When your data is ready for analysis, the wide format is preferred.

Answer 10

Use .melt with four parameters:

* **id\_vars**: names of the columns with the observation IDs
* **value\_vars**: names of the feature columns
* **var\_name**: name of the new column that will contain the feature names
* **value\_name**: name of the new column that will contain the feature values.

E.g. df\_wide.melt(id\_vars = ["country"], value\_vars = ["population\_in\_million", "gdp\_percapita"], var\_name = "attribute", value\_name = "value")
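To make the call above concrete, here is a small runnable sketch. The country names and numbers are placeholder values, not real data.

```python
import pandas as pd

# Placeholder wide-format data: one row per country.
df_wide = pd.DataFrame({
    "country": ["A", "B"],
    "population_in_million": [10, 20],
    "gdp_percapita": [1000, 2000],
})

# Wide -> long: each (country, attribute) pair becomes its own row.
df_long = df_wide.melt(
    id_vars=["country"],
    value_vars=["population_in_million", "gdp_percapita"],
    var_name="attribute",
    value_name="value",
)
```

Two countries with two attributes each produce four rows in the long table.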

Answer 11

Use .pivot\_table with three parameters:

* **index**: name of the column with the ids.
* **columns**: name of the column that contains the feature names.
* **values**: name of the column that contains the feature values.

E.g. df\_long.pivot\_table(index = "country", columns = "attribute", values = "value")
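A minimal sketch of the reverse direction, again with placeholder values. Note that pivot\_table aggregates with the mean by default, so duplicate (index, column) pairs would be averaged rather than raising an error.

```python
import pandas as pd

# Placeholder long-format data: one row per (country, attribute) pair.
df_long = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "attribute": ["population_in_million", "gdp_percapita"] * 2,
    "value": [10, 1000, 20, 2000],
})

# Long -> wide: one row per country, one column per attribute.
df_wide = df_long.pivot_table(index="country", columns="attribute", values="value")
```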

Answer 12

groupby() E.g. in screenshot

Answer 13

* takes as input all the rows in a group and outputs one value
* Examples
    * .count()
    * .max()
    * .min()
    * .sum()

Answer 14

Count non-NA cells for each column or row.

Answer 15

.count() chained on at the end E.g. df.groupby("state").count()

Answer 16

* Call .agg, which takes a mapping from column name to aggregation functions
* E.g. df.groupby("state").agg({"city" : "count", "population" : ["sum", "max"]})
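The count and agg calls from the last two cards can be tried on a toy dataframe. The states, cities, and population figures below are made up.

```python
import pandas as pd

# Toy data, only for illustration.
df = pd.DataFrame({
    "state": ["CA", "CA", "NY"],
    "city": ["LA", "SF", "NYC"],
    "population": [4.0, 1.0, 8.0],
})

# One aggregation for every column: non-NA count per group.
counts = df.groupby("state").count()

# Different aggregations per column via a mapping.
# The result has two-level column labels such as ("population", "sum").
summary = df.groupby("state").agg({"city": "count", "population": ["sum", "max"]})
```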

Answer 17

[split-apply-combine pattern](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

* Splitting the data into groups based on some criteria
    * done via .groupby() itself
* Applying a function to each group independently
    * done by calling apply() along with the specified input functions (which can be aggregation, transformation, filtration or a combination of them).
* Combining the results into a data structure
    * done automatically on the returned values of apply.
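The three steps can be sketched in one chained expression. The data and the custom function (the spread between the largest and smallest value in each group) are hypothetical choices for the example.

```python
import pandas as pd

# Toy data, only for illustration.
df = pd.DataFrame({
    "state": ["CA", "CA", "NY"],
    "population": [4.0, 1.0, 8.0],
})

spread = (
    df.groupby("state")["population"]      # split: one group per state
      .apply(lambda s: s.max() - s.min())  # apply: custom function on each group
)                                          # combine: results assembled into one Series
```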

Answer 18

pandas.Series(data=None, index=None, dtype=None)

* data
    * array-like, Iterable, dict, or scalar value
    * Contains data stored in Series. If data is a dict, argument order is maintained.
* index
    * array-like or Index (1d)
    * Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
* dtype
    * str, numpy.dtype, or ExtensionDtype, optional
    * Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.
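A short sketch of how the three parameters interact, using made-up values:

```python
import pandas as pd

# No index given: a default RangeIndex is used.
s_default = pd.Series([10, 20, 30])

# Dict data, no index: the dict keys become the index.
s_dict = pd.Series({"a": 1, "b": 2})

# Dict data plus an explicit index: the Series is reindexed,
# so labels missing from the dict ("c") become NaN.
s_reindexed = pd.Series({"a": 1, "b": 2}, index=["b", "c"])

# Explicit dtype instead of the inferred integer type.
s_typed = pd.Series([1, 2, 3], dtype="float64")
```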

Answer 19

remove it with a call to **dropna()**

Answer 20

All the values will be NaN

Answer 21

* .agg() can only aggregate data in each column separately
* .apply is a more general version of agg that can handle multi-column operations while also performing filtration

Answer 22

* check that the columns which you perform groupby on have no empty values (np.nan). These will be ignored during groupby, resulting in potential loss of data.
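The effect is easy to reproduce on a toy dataframe. The dropna=False option shown at the end is not mentioned on the card; it is a groupby parameter in recent pandas versions that keeps the missing-key group.

```python
import numpy as np
import pandas as pd

# Toy data with one missing group key.
df = pd.DataFrame({"state": ["CA", np.nan, "CA"], "population": [1, 2, 3]})

# Default: the row with a missing key is silently left out.
default_sum = df.groupby("state")["population"].sum()

# Check for missing keys before grouping.
n_missing_keys = int(df["state"].isna().sum())

# Alternative (not from the card): keep the missing key as its own group.
kept_sum = df.groupby("state", dropna=False)["population"].sum()
```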

Answer 23

* pd.concat([df1, df2])
* pd.concat([df1, df3], axis = 1)

Answer 24

* Concat Along rows -\> adds more rows
* Concat Along columns -\> adds more columns
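Both directions can be checked on three tiny dataframes. The names df1, df2, df3 follow the card above; their contents are invented.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
df3 = pd.DataFrame({"b": [5, 6]})

# Along rows (default axis=0): same column, more rows.
# The original index labels are kept, so they repeat here.
rows = pd.concat([df1, df2])

# Along columns (axis=1): same rows, more columns, aligned on the index.
cols = pd.concat([df1, df3], axis=1)
```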

Answer 25

* Left join keeps all rows of the left table and adds entries from the right table where the join columns match.
* Right join is like a left join but with the roles of the tables reversed.
* Outer join returns all rows from both tables, matched where possible.
* Inner join returns only the rows where the two joined columns contain the same value.

Answer 26

* Usually use the merge method.
* If dfs are indexed, use the join() method because it's faster
* df1.merge(df2, left\_on = "col1", right\_on = "col1", how = "left")
* If the column to merge on is an index, we need to use "left\_index = True" (same for right\_index vs right\_on).
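To see how the join type changes the result, the sketch below merges two small made-up tables with each value of how and records the row counts. It also shows a merge where the right-hand key is the index.

```python
import pandas as pd

# Made-up tables sharing the key column "col1"; keys 2 and 3 overlap.
left = pd.DataFrame({"col1": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"col1": [2, 3, 4], "y": ["p", "q", "r"]})

# Row count of the merged result for each join type.
sizes = {
    how: len(left.merge(right, left_on="col1", right_on="col1", how=how))
    for how in ["left", "right", "inner", "outer"]
}

# When the key of the right table is its index, use right_index=True.
right_indexed = right.set_index("col1")
merged_on_index = left.merge(right_indexed, left_on="col1", right_index=True, how="left")
```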

Answer 27

df.set\_index('some\_col', inplace=True)

Answer 28

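* Joins on the index by default; the calling dataframe can use a column instead via the on parameter, but the other dataframe is always matched on its index.
* It's faster than the merge method because joining by indexes is faster than by column names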
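A minimal sketch of an index-based join, combining set\_index from the previous card with join. The column names and values are hypothetical, and no timing comparison against merge is made here.

```python
import pandas as pd

# Hypothetical tables with a shared key column.
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [10, 20]})

# Move the key into the index of both tables.
left_indexed = left.set_index("key")
right_indexed = right.set_index("key")

# join defaults to a left join on the index.
joined = left_indexed.join(right_indexed)
```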

Answer 29

groupby(by\_label\_or\_list\_of\_labels)[about\_label\_or\_list\_of\_labels].aggregate\_function()

Answer 30

* 1 situation: passing in multiple groups. (I.e. when you pass a list of column labels to groupby)
* pass as\_index=False to groupby (see screenshot)
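The difference can be sketched with two grouping columns. The data is invented for the example.

```python
import pandas as pd

# Toy data, only for illustration.
df = pd.DataFrame({
    "state": ["CA", "CA", "NY"],
    "year": [2020, 2020, 2021],
    "population": [1, 2, 3],
})

# Default: the group labels become a MultiIndex on the result.
nested = df.groupby(["state", "year"])["population"].sum()

# as_index=False: the group labels stay as ordinary columns.
flat = df.groupby(["state", "year"], as_index=False)["population"].sum()
```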

Answer 31

* Useful when you need to either:
    * Segment and sort data values into bins.
    * Go from a continuous variable to a categorical variable.
* could convert ages to groups of age ranges, then groupby that new column
* df['new\_col\_name'] = pandas.cut(df['some\_column'], bins=num\_bins, labels=list\_of\_bin\_labels)

Answer 32

* Use the cut method to make a new column of discrete bins
* groupby the new bins
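The two steps might look like the sketch below. The ages, the spend column, the bin edges, and the labels are all assumptions made for the example; explicit bin edges are passed here instead of a bin count so that the groups are predictable.

```python
import pandas as pd

# Made-up data, only for illustration.
df = pd.DataFrame({"age": [5, 17, 30, 70], "spend": [1.0, 2.0, 3.0, 4.0]})

# Step 1: turn the continuous age column into labelled bins.
# Bins are right-inclusive by default: (0, 18], (18, 65], (65, 120].
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
)

# Step 2: group by the new categorical column.
# observed=True limits the result to categories that occur in the data.
mean_spend = df.groupby("age_group", observed=True)["spend"].mean()
```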

(56 cards)