Pandas Flashcards by Drew Moses

PD: In normal Pandas style, what is a row and what is a column?

Each row is one observation (an item or data point), and each column is a field or feature.

PD: Why do most pandas methods return a df rather than editing the original df in place?

It allows for method chaining, i.e. calling several methods one after another in a single expression, which improves readability of code.
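
Example (a minimal sketch; the column names and data are made up for illustration). Because each call returns a new DataFrame, the steps can be chained and the original df is left untouched:

```python
import pandas as pd

# Hypothetical data: one missing value in column A.
df = pd.DataFrame({"A": [3, 1, 2, None], "B": [10, 20, 30, 40]})

# Each method returns a new DataFrame, so the calls read top to bottom.
result = (
    df.dropna()                 # drop the row with the missing A value
      .sort_values("A")         # sort the remaining rows by A, ascending
      .reset_index(drop=True)   # renumber the rows 0..n-1
)
```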

PD: Do most pandas methods edit a df in place, or return a new df?

Return a new df

PD: How do you read a table or tables from a web page?

pd.read_html("sampleurl.com", …)

(Returns a list of DataFrames, one per table found on the page.)

PD: How do you read a table from a .csv?

pd.read_csv(path, sep=" ", …)

(sep defaults to a comma, so it is only needed for other delimiters.)

PD: How to combine the rows of 2 dfs with the same column headers?

pd.concat([df1,df2])
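
Example (a small sketch with made-up frames). By default pd.concat keeps each frame's original index labels; passing ignore_index=True, an optional extra not mentioned on the card, renumbers the rows:

```python
import pandas as pd

# Two hypothetical frames with the same column headers.
df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
df2 = pd.DataFrame({"A": [3], "B": ["z"]})

# Stack the rows of df2 under df1 and renumber the index 0..n-1.
combined = pd.concat([df1, df2], ignore_index=True)
```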

PD: How to view the first 5 rows?

df.head()

PD: How to view the last 5 rows?

df.tail()

PD: How do we remove columns that aren’t useful to us from our dataframe?

to_drop = ["all", "col", "names", "to", "drop"]

df.drop(columns=to_drop, inplace=True)

(Alternatively, leave out inplace and reassign: df = df.drop(columns=to_drop))
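
Example (a minimal sketch of the reassignment style; the column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with two columns we do not need.
df = pd.DataFrame({"keep": [1, 2], "tmp1": [0, 0], "tmp2": ["a", "b"]})

to_drop = ["tmp1", "tmp2"]

# drop() returns a new DataFrame; df itself still has all three columns.
trimmed = df.drop(columns=to_drop)
```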

PD: What are the two ways to exclude columns that aren't useful to us?

Drop them using df.drop(), or don't import them from the CSV in the first place, e.g. with the usecols argument of pd.read_csv()
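
Example of the second approach (a sketch; the CSV text is made up and read from memory via io.StringIO so that no file is needed):

```python
import io
import pandas as pd

# Hypothetical CSV content with a "notes" column we do not want.
csv_text = "id,name,notes\n1,Ann,x\n2,Bob,y\n"

# usecols limits the import to the listed columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "name"])
```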

PD: If I write df["a"], what data structure comes out?

A pandas series

PD: How do I tell if column "A" in a dataframe contains all unique values?

df["A"].is_unique

PD: How do I set a column, such as column "A", to be the index of a dataframe?

df = df.set_index("A")

If you want a unique index for row names, first check that the column contains unique values using df["A"].is_unique
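
Example (a short sketch with made-up data) combining the uniqueness check with set_index:

```python
import pandas as pd

# Hypothetical frame whose column A holds unique labels.
df = pd.DataFrame({"A": ["x", "y", "z"], "B": [1, 2, 3]})

# Only promote A to the index if its values are unique.
if df["A"].is_unique:
    df = df.set_index("A")
```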

PD: What is a convenient way to standardize a column containing multiple formats or data types?

Regular expressions, or regex.

See the Pythonic Data Cleaning article for an example: https://realpython.com/python-data-cleaning-numpy-pandas/
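
Example (a sketch, not taken from the article; the messy year strings are invented). One possible approach is to pull the first four-digit number out of each string with Series.str.extract:

```python
import pandas as pd

# Hypothetical column with years recorded in inconsistent formats.
raw = pd.Series(["1879 [1878]", "[1868?]", "c. 1901", "n.d."])

# Capture the first run of four digits; rows with no match become NaN.
year = raw.str.extract(r"(\d{4})", expand=False)

# Convert the extracted text to numbers (NaN stays NaN).
year = pd.to_numeric(year)
```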

PD: How to find a series containing booleans for whether each value in column "A" is null or not in a dataframe?

df["A"].isnull()

PD: How to find the total number of nulls in column "A" in a dataframe?

df["A"].isnull().sum()
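
Example (a minimal sketch with made-up data). Summing works because True counts as 1 and False as 0:

```python
import pandas as pd

# Hypothetical column with two missing values.
df = pd.DataFrame({"A": [1.0, None, 3.0, None]})

mask = df["A"].isnull()   # boolean Series, True where the value is missing
n_missing = mask.sum()    # number of True entries
```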

PD: How to rename cols in a dataframe?

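df.rename(columns={"oldName": "newName"}, inplace=True)

(rename takes a mapping of old names to new names, or a function. To replace every column name with a list, assign it directly: df.columns = listOfNewColNames)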
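
Example (a sketch with hypothetical column names) showing a rename through an old-to-new mapping, and a wholesale replacement by assigning a list to df.columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})  # hypothetical columns

# Rename selected columns with an old -> new mapping.
renamed = df.rename(columns={"a": "A"})

# Replace every column name at once by assigning a list of the same length.
relabeled = df.copy()
relabeled.columns = ["X", "Y"]
```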

PD: What does it mean to do a split-apply-combine in pandas?

Say the rows of our dataframe fall into a few different “categories,” and we’d like to find a summary statistic or statistics within each of the categories. We can separate all of the categories into groups (split), then apply one or more summary functions to each group (apply), and lastly combine the summary statistics for each group into a new data frame where each row is now one of the categories (combine).

PD: How can we, in one line of code, do all 3 parts of split-apply-combine?

df.groupby("colNameToGroupBy")["colNameToFindSummaryStatsOn"].agg(["list", "of", "summary", "stat", "functions"])

Video for reference: https://www.youtube.com/watch?v=qy0fDqoMJx8
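
Example (a sketch with invented data; it is not taken from the video). The result has one row per group and one column per summary function:

```python
import pandas as pd

# Hypothetical data: weights grouped by species.
df = pd.DataFrame({
    "species": ["cat", "cat", "dog", "dog", "dog"],
    "weight": [4.0, 6.0, 10.0, 20.0, 30.0],
})

# Split by species, apply three summary functions, combine into one frame.
summary = df.groupby("species")["weight"].agg(["mean", "max", "count"])
```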

PD: How to sort the rows of a data frame based on column “A”, going from low to high

df.sort_values("A") (ascending, or low to high, is default)

PD: How to sort the rows of a data frame based on column “A”, going from high to low

df.sort_values("A", ascending=False)

PD: How to reset indices of DataFrame to row numbers, moving the index to a new column.

df.reset_index()

PD: How do we display the data types of each column in our df?

df.dtypes

PD: How to sample a subset of rows?

df.sample(n=5) or df.sample(frac=0.1)

(With no arguments, df.sample() returns a single random row. Pass replace=True to sample with replacement.)

PD: How to extract specific rows meeting a logical criterion?

df[mask], where mask is a boolean Series or array with one value per row, e.g. df[df["A"] > 5]. For specifics and examples, see cheat sheet
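
Example (a minimal sketch with made-up data). Conditions are combined with & and |, and each one needs its own parentheses:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 8], "B": ["x", "y", "x"]})  # hypothetical data

# One condition: rows where A is greater than 3.
big = df[df["A"] > 3]

# Two conditions joined with & (element-wise "and").
both = df[(df["A"] > 3) & (df["B"] == "x")]
```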

PD: How to get n rows with the largest values of a specific column?

df.nlargest(n,"nameOfColumn")

PD: How to get n rows with the smallest values of a specific column?

df.nsmallest(n,"nameOfColumn")

PD: What are the 2 ways to get a column from a df, given that the column name has no spaces or special characters?

df["colName"] OR df.colName

PD: How to select multiple specific columns from a dataframe?

df[["colName1", "colName2", "colName3"]]

PD: How, at a high level, can we pick several columns each matching specific criteria?

Regular expressions, for example via df.filter(regex="pattern"). (For how, see cheat sheet and also study regexes.)

PD: What is the high-level difference between loc and iloc?

loc takes a subset of rows, columns or both using *labels*. iloc takes a subset of rows, columns or both using *positions*.
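
Example (a small sketch; the index labels and data are hypothetical). Both selections below return the same two values, one by label and one by position:

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]}, index=["x", "y", "z"])

# loc: label-based; a label slice includes both endpoints.
by_label = df.loc["x":"y", "A"]

# iloc: position-based; a position slice excludes the end, as in plain Python.
by_position = df.iloc[0:2, 0]
```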

PD: Would iloc more commonly be used for subsetting rows or columns? Why?

Rows. iloc subsets by position, which would be more common for rows, as we might want rows 10 through 20, for example. (But it's worth noting that head() and tail() can also do a lot of this functionality.)

PD: Would loc more commonly be used for subsetting rows or columns? Why?

Columns. loc subsets by label, which would be more common for columns, which typically have column names as labels.

PD: How to find the number of unique values in column "A" of a df?

df["A"].nunique()

PD: How to display the number of appearances for each unique value in column A of a df?

df["A"].value_counts()

PD: How to display the dimensions of a dataframe?

df.shape

PD: How to display the number of rows in a dataframe without using df.shape?

len(df)

PD: How to get basic summary statistics for each column in a df?

df.describe()

PD: How to get the mean, or max, or sum, of column A of a df? (The same pattern works for many summary stats.)

df["A"].mean() or df["A"].max() or df["A"].sum()

PD: How to drop all rows where any column has NAs or null data?

df.dropna()

PD: How to fill in all NAs with a value?

df.fillna(value)

PD: How do we group our df by column A, and then find the mean value of column B within each group?

df.groupby("A")["B"].mean() (So you can use the ["B"] notation, or .B notation too, to extract that column for all groups, and then an aggregation function like mean() to run the aggregation for all groups. This makes the syntax very similar to if you were finding the mean of B in a single df rather than a groupby object)

PD: How to get the size of each group in groupby object g?

g.size()
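
Example (a minimal sketch with invented data) covering both the per-group mean and the group sizes:

```python
import pandas as pd

df = pd.DataFrame({"A": ["u", "u", "v"], "B": [1.0, 3.0, 10.0]})  # hypothetical

g = df.groupby("A")

means = g["B"].mean()   # mean of B within each group of A
sizes = g.size()        # number of rows in each group
```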

PD: How to make a new column C in df that is the product of columns A and B?

df["C"] = df["A"] * df["B"]

PD: At a high level, how to discretize a column into several equal-sized buckets based on quantiles?

pd.qcut()
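
Example (a sketch; the data and the bucket labels are made up). Eight values split into four quantile buckets give two values per bucket:

```python
import pandas as pd

s = pd.Series(range(1, 9))  # hypothetical values 1..8

# q=4 requests quartiles; labels are optional names for the buckets.
buckets = pd.qcut(s, q=4, labels=["q1", "q2", "q3", "q4"])
```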

PD: Given 2 data frames adf and bdf with common column A, what is the general syntax to do a join on column A, be it a left outer, right outer, inner or full outer join?

pd.merge(adf, bdf, how="type", on="A"), where type is "left", "right", "inner" or "outer", corresponding to the 4 join types we learned in SQL.
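
Example (a sketch with invented frames) comparing three of the join types on the same inputs:

```python
import pandas as pd

# Hypothetical frames sharing column A; keys 2 and 3 appear in both.
adf = pd.DataFrame({"A": [1, 2, 3], "x": ["a", "b", "c"]})
bdf = pd.DataFrame({"A": [2, 3, 4], "y": ["p", "q", "r"]})

inner = pd.merge(adf, bdf, how="inner", on="A")  # keys present in both
left = pd.merge(adf, bdf, how="left", on="A")    # every key from adf
outer = pd.merge(adf, bdf, how="outer", on="A")  # keys from either frame
```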

PD: Given 2 data frames adf and bdf with common column A, how do we find all rows in adf with a value in column A that is also present in bdf?

adf[adf.A.isin(bdf.A)]

PD: Given 2 data frames adf and bdf with common column A, how do we find all rows in adf with a value in column A that is NOT present in bdf?

adf[~adf.A.isin(bdf.A)]
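
Example (a minimal sketch with invented frames) of both isin filters. Unlike a merge, these keep only the columns of adf:

```python
import pandas as pd

# Hypothetical frames sharing column A.
adf = pd.DataFrame({"A": [1, 2, 3], "x": ["a", "b", "c"]})
bdf = pd.DataFrame({"A": [2, 3, 4]})

in_both = adf[adf.A.isin(bdf.A)]    # rows of adf whose A also appears in bdf
only_a = adf[~adf.A.isin(bdf.A)]    # rows of adf whose A is absent from bdf
```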

PD: How does pandas store strings?

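By default, pandas has traditionally stored strings in columns with the object dtype, not a str dtype! (A dedicated "string" extension dtype, pd.StringDtype, also exists.)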

Pandas Flashcards

(49 cards)