Pandas Primer Flashcards

1
Q

How do you read a comma-delimited file in pandas?

A

df = pd.read_csv(filepath)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

How do you create a dataframe in pandas? (2)

A
  • pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns = [‘col1’, ‘col2’, ‘col3’])
  • With a dictionary:
    • df = pd.DataFrame({“ID” : [1, 2, 3], “First Name” : [“John”, “Jim”, “Joe”], “Last Name” : [“Smith”, “Hendry”, “Wilson”]})
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How do you show a dataframe in pandas?

A

display(df)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How do you access a cell in a dataframe?

A
  • .loc[‘cobra’]
  • .iloc[row_or_col_index]
  • .loc[row_label,col_label]
  • .iloc[row_index,col_index]
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What selection features does pandas support? (2)

A
  • Slicing in .loc and .iloc with start_index:end_index
  • array indexing
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What’s the difference between .loc and .iloc?

A
  • loc is label-based, -> use row and column labels.
  • iloc is integer position-based -> use integer position values
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

how do you show the first few rows of a dataframe?

A

df.head(some_num_rows)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What’s one interesting thing about pandas slicing?

A
  • When you want an entire row or column, instead of including a “:”, can omit the row or column index entirely (along with the comma) by not using a loc or iloc function at all
  • E.g.
    • display(df.loc[:, “Last Name”]) is equivalent to display(df[“Last Name”])
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How do you set an entry in a DataFrame?

A

df.loc[1,”Last Name”] = “some_val”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How do you set a row in pandas?

A

df.loc[3,:] = (100, “Andrew”, “Moore”)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What happens if you try to set a row and the input index doesn’t exist?

A

new row is appeneded to the end

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How do we select a subset of the rows that satisfy some conditions from a dataframe?

A

df[(df[“First Name”] == “Jim”) & (df[“Last Name”] == “Kilter”)]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Given this dataframe, how do you find rows where Last Name has 6 characters?

A

df[df[“Last Name”].str.len() == 6]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Given this dataframe, how do you find rows where First Name contains the substring “Jo”?

A

df[df[“First Name”].str.contains(“Jo”)]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Given this dataframe, how do you find rows where First Name is either “Jim” or “Kim”?

A

df[df[“First Name”].isin([“Jim”, “Kim”])]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What do you do if you want to find rows that do not satisfy a certain condition?

A

Use the negation symbol ~

E.g. df[~df[“First Name”].isin([“Jim”, “Kim”])]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q
  • What’s one trick we can use to speed up the selecting of pandas rows?
  • What does it do?
A
  • Use a query string to select rows, which
    • df.query(‘(First Name == “John”) & (Last Name == “Smith”)’)
  • can avoid the creation of the intermediate boolean index and reduce runtime / memory usage:
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What’s important to remember about querying? (3)

A
  1. The returned object of a query is a view of the original data frame.
  2. Modifying the view will not affect the original data frame, but will yield a warning.
  3. Unlike Numpy, Pandas preserves the original row index after filtering. E.g. if you make a copy of a slice that has rows 2 and 5 and try to select index 1 from that, an error will be thrown.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

How do you copy a dataframe?

A

.copy()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

If our dataframe has no row with index 0, what do we do?

A

Call .reset_index(drop = True)

E.g. df_copy_reset_index = df_copy.reset_index(drop = True)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

How can we iterate over rows of a dataframe, from slowest to fastest?

A
  • Use .iloc along with row index.
  • Use iterrows method.
  • Use apply with axis=1.
  • Fastest: Use Pandas vectorization
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

How can we iterate over columns of a dataframe? (iteration syntax and how you index the column)

A
  1. Call .columns to get the list of column names and iterate over it
  2. Use .iloc along with the column indexes
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What’s important to note about iterrows()?

A
  • the row returned is only a copy of the data
  • so you cannot update the data frame during .iterrows()
24
Q

What is the default of the apply method?

A
  • To loop through the columns (same as axis=0)
  • To loop through rows, use axis=1
25
Q
  • What can be used with pandas vectorization?
A
  1. built-in pd.Series methods
  2. operations that are compatible with Numpy arrays, for example basic math operations or Boolean conditons
26
Q
  • What is the idea of pandas vectorization?
A

operating a procedure on the entire column array at once, instead of on individual column elements

27
Q

If we can’t vectorize an operation we want to apply to all rows, what should we do? Why?

A
  1. Use the apply method
  2. Why?
    • It’s the 2nd fastest row iteration approach of the 4
    • It can work with any input function
28
Q

If we have a vectorizable operation we want to apply over a set of rows, and the columns are numerical, what should we do?

A

apply vectorization on the underlying Numpy arrays (by calling df[column_name].to_numpy()) for an even greater speedup.

29
Q

What’s One primary advantage of Pandas over other data table packages?

A

its powerful data manipulation functions

30
Q

What’s important to remember about most pandas dataframe methods?

A

most do not modify the input dataframe and only return the output in a new dataframe

31
Q

How do you modify a pandas dataframe in-place? (2)

A

Either:

  1. Look for the parameter inplace in the method API
  2. Or reassign your dataframe to the method output. (e.g. df = df.fillna(0))
32
Q

How do you replace missing entries in a pandas dataframe?

A

fillna(new_val, inplace = True)

By default inplace = False I think

33
Q
  • What are the main forms of input data? (2)
  • Describe them.
  • What are the tradeoffs?
  • When should they be used?
A
  • Wide
    • every row represents a unique observation and every column represents a feature (e.g. rows are distinct countries and the columns are relevant attributes of each country)
  • Long
    • there is one column for the observation ID, one column for attribute name, and one for attribute value. (e.g. screenshot)
  • Tradeoffs
    • Long is often easier to implement, as addition of a new feature does not change the table structure
    • Long is harder to understand.
  • When they should be used:
    • The long format is useful when you are curating data and do not yet know what the final structure will be. When your data is ready for analysis, the wide format is preferred.
34
Q

How do we convert from wide to long?

A

Use .melt with four parameters:

  • id_vars: names of the columns with the observation IDs
  • value_vars: names of the feature columns
  • var_name: name of the new column that will contain the feature names
  • value_name: name of the new column that will contain the feature values.

E.g. df_wide.melt(id_vars = [“country”], value_vars = [“population_in_million”, “gdp_percapita”], var_name = “attribute”, value_name = “value”)

35
Q

How do we convert from long to wide?

A

Use .pivot_table with three parameters:

  • index: name of the column with the ids.
  • columns: name of the column that contains the feature names.
  • values: name of the column that contains the feature values.

E.g. df_long.pivot_table(index = “country”, columns = “attribute”, values = “value”)

36
Q

How do you partition a dataframe into groups based on the values at some columns?

A

groupby()

E.g. in screenshot

37
Q
  • What’s an aggregation function?
  • What are some examples? (4)
A
  • takes as input all the rows in a group and outputs one value
  • Examples
    • .count()
    • .max()
    • .min()
    • .sum()
38
Q

What is .count()?

A

Count non-NA cells for each column or row.

39
Q

What can be used with .groupby?

A

.count() chained on at the end

E.g. df.groupby(“state”).count()

40
Q

How do we apply multiple aggregation functions to each group?

A
  • Call .agg, which takes a mapping from column name to aggregation functions
  • E.g. df.groupby(“state”).agg({“city” : “count”, “population” : [“sum”, “max”]})
41
Q

What’s an important technique in manipulating pandas dataframes? Describe it.

A

split-apply-combine pattern

  • Splitting the data into groups based on some criteria
    • done via .groupby() itself
  • Applying a function to each group independently
    • done by calling apply() along with the specified input functions (which can be aggregation, transformation, filtration or a combination of them).
  • Combining the results into a data structure
    • done automatically on the returned values of apply.
42
Q

How do you define a series in pandas? Use 3 args

A

pandas. Series(data=None, index=None, dtype=None)
* data
* array-like, Iterable, dict, or scalar value

Contains data stored in Series. If data is a dict, argument order is maintained.

  • index
    • array-like or Index (1d)

Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.

  • dtype
    • str, numpy.dtype, or ExtensionDtype, optional

Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.

43
Q

What should we do if we have a row with only NAN values?

A

remove it with a call to dropna()

44
Q

What happens to a row if the operation in .appy() doesn’t return a value for that row?

A

All the values will be NaN

45
Q

What’s a key difference between .agg and .apply?

A
  • .agg() can only aggregate data in each column separately
  • . apply is a more general version of agg that can handle multi-column operations while also performing filtration
46
Q

What’s important to know about groupby?

A
  • check that the columns which you perform groupby on have no empty values (np.nan). These will be ignored during groupby, resulting in potential loss of data.
47
Q
  • How do you concatenate a df along rows?
  • What about columns?
A
  • pd.concat([df1, df2])
  • pd.concat([df1, df3], axis = 1)
48
Q

What does it mean to concatenate along?

A
  • Concat Along rows -> adds more rows
  • Concat Along columns -> adds more columns
49
Q

What are the joins supported by pandas (4) and their functions?

A
  • Left join keeps all rows of the left table, add entries from right table that match the corresponding columns.
  • Right join is the like a left join but with the roles of the tables reversed.
  • Outer join returns all rows from both left and right join.
  • Inner join return the rows where the two joined columns contain the same value
50
Q

How do you join two dataframes in pandas?

A
  • Usually use the merge method.
  • if dfs are indexed, use the join() method because it’s faster
  • df1.merge(df2, left_on = “col1”, right_on = “col1”, how = “left”)
  • If the column to merge on is an index, we need to use “left_index = True” ( same for right_index vs right_on).
51
Q

How do you set the index in pandas?

A

df.set_index(‘some_col’, inplace=True)

52
Q

What do we know about the .join() method? (2)

A
  • Only supports joining by indexes.
  • It’s faster than the merge method because joining by indexes is faster than by column names
53
Q

Describe the groupby syntax

A

groupby(by_label_or_list_of_labels)[about_label_or_list_of_labels].aggregate_function()

54
Q
  • When would you get a multi-index DF from groupby?
  • How can you make that not happen?
A
  • 1 situation: passing in multiple groups. (I.e. when you pass a list of column labels to groupby)
  • pass as_index=False to groupby (see screenshot)
55
Q
  • When is the cut method used? (2)
  • What’s an example of useful usage for us?
  • What’s the syntax?
A
  • Useful when you need to either:
    • Segment and sort data values into bins.
    • Go from a continuous variable to a categorical variable.
  • could convert ages to groups of age ranges, then groupby that new column
  • df[‘new_col_name’] = pandas.cut(df[‘some_column’], bins=num_bins, labels=list_of_bin_labels)
56
Q

What should you do if you have too many groups in your groupby?

A
  • Use the cut method to make a new column of discrete bins
  • groupby the new bins