Pandas Pro Flashcards
What is chaining?
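Chaining links several method calls into a single expression, so each step operates on the result of the previous one and no intermediate variables are needed.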
import numpy as np  # needed for np.where

df_updated = (
    df
    .query("release_year > 2018")  # Keep only movies and shows released after 2018
    .loc[:, ["title", "release_year", "duration"]]  # Keep only these columns
    .assign(over_three_hours=lambda dataframe: np.where(dataframe["duration"] > 180, "Yes", "No"))  # New column: "Yes" if duration > 180, else "No"
    .groupby(by=["release_year", "over_three_hours"])  # Group by the given columns
    .count()  # Count non-null values per column for each release_year / over_three_hours group
)
df_updated
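A minimal, self-contained sketch of the same chain, using a small made-up DataFrame (the titles and durations are hypothetical). It swaps .count() for .size(), which is an assumption about what is wanted: .count() reports non-null counts for every remaining column, while .size() gives a single row count per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies/shows DataFrame
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "release_year": [2017, 2019, 2019, 2020, 2019],
    "duration": [95, 200, 110, 185, 100],
})

summary = (
    df
    .query("release_year > 2018")                     # drop the 2017 title
    .assign(over_three_hours=lambda d: np.where(d["duration"] > 180, "Yes", "No"))
    .groupby(["release_year", "over_three_hours"])    # one group per year/flag pair
    .size()                                           # number of rows in each group
    .reset_index(name="n_titles")                     # back to a flat DataFrame
)
print(summary)
```

Because every step returns a new object, the original df is left unchanged and the whole pipeline can be read top to bottom.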
nlargest and nsmallest
Instead of using sort_values to find the largest or smallest values in your data, consider using nlargest and nsmallest. When you need only a few rows, these functions are typically faster and more memory-efficient than sorting the whole DataFrame, which makes them a good choice for large datasets.
df.nsmallest(3, "age")  # Youngest 3 passengers
df.nlargest(3, "age")  # Oldest 3 passengers
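As a quick check, the sketch below compares nlargest with the sort_values(...).head(n) pattern it replaces. The names and ages are made up for illustration.

```python
import pandas as pd

# Hypothetical passenger data with distinct ages (no ties)
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "age": [34, 7, 61, 22, 45],
})

top3 = df.nlargest(3, "age")                                  # three oldest
top3_sorted = df.sort_values("age", ascending=False).head(3)  # same rows via a full sort
youngest2 = df.nsmallest(2, "age")                            # two youngest

print(top3)
print(youngest2)
```

Both approaches keep the original index labels, so the two results can be compared directly with top3.equals(top3_sorted).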
Filtering data with .query() method
Pandas' query method lets you filter your data using logical expressions written as a string. You can prefix a name with @ to refer to a Python variable inside the query, which makes it a convenient and powerful tool for filtering data.
df["embark_town"].unique()  # ['Southampton', 'Cherbourg', 'Queenstown', nan]
embark_towns = ["Southampton", "Queenstown"]  # Only want to select these towns
df.query("age > 21 & fare > 250 & embark_town == @embark_towns").head()
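The following sketch shows the same kind of filter written both as a query string and as a boolean mask, so the two can be compared. The rows are invented for illustration, and the query uses in @embark_towns, which reads a little more explicitly than == against a list.

```python
import pandas as pd

# Hypothetical passenger rows
df = pd.DataFrame({
    "age": [30, 19, 45, 52],
    "fare": [300.0, 400.0, 260.0, 80.0],
    "embark_town": ["Southampton", "Queenstown", "Cherbourg", "Queenstown"],
})
embark_towns = ["Southampton", "Queenstown"]

# @embark_towns refers to the Python variable defined above
via_query = df.query("age > 21 and fare > 250 and embark_town in @embark_towns")

# Equivalent boolean-mask version
via_mask = df[
    (df["age"] > 21)
    & (df["fare"] > 250)
    & df["embark_town"].isin(embark_towns)
]
print(via_query)
```

Only the first row passes all three conditions; the others fail on age, town, and fare respectively.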
pd.cut function
The cut function bins continuous values into discrete categories. This is useful for visualizing your data or for turning a continuous variable into a categorical one.
# pd.cut closes each bin on the right by default, so these bins mean:
# Child - above 0, up to 10 years
# Teen - above 10, up to 19 years
# Young - above 19, up to 24 years
# Adult - above 24, up to 59 years
# Elderly - above 59 years
bins = [0, 10, 19, 24, 59, float('inf')]
labels = ["Child", "Teen", "Young", "Adult", "Elderly"]
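import matplotlib.pyplot as plt  # needed for plt.show()

df["age"].hist()  # Distribution of the raw ages
plt.show()
df["age_category"] = pd.cut(df["age"], bins=bins, labels=labels)
sorted_df = df.sort_values(by="age_category")  # Sort so the categories plot in order
sorted_df["age_category"].hist()
plt.show()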
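By default pd.cut includes the right edge of each bin and excludes the left one, so an age of exactly 0 falls outside the first bin and an age of 10 lands in "Child". If whole-year ranges such as 0 to 9 and 10 to 19 are wanted, one option is right=False with adjusted edges. The edges below are an assumption chosen for illustration, not the ones used above.

```python
import pandas as pd

ages = pd.Series([0, 9, 10, 19, 20, 24, 25, 59, 60])

# Hypothetical edges: with right=False each bin is [left, right)
bins = [0, 10, 20, 25, 60, float("inf")]
labels = ["Child", "Teen", "Young", "Adult", "Elderly"]

age_category = pd.cut(ages, bins=bins, labels=labels, right=False)

# Show each age next to the category it was assigned
print(pd.DataFrame({"age": ages, "age_category": age_category}))
```

With these edges, 0 and 9 are "Child", 10 and 19 are "Teen", 20 and 24 are "Young", 25 and 59 are "Adult", and 60 is "Elderly".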
Avoid using inplace
The inplace parameter in Pandas lets you modify a DataFrame directly, but it can make your code harder to read and debug: the method returns None, so the call cannot be chained, and the original data is overwritten. Instead, assign the result of the operation to a variable.
# Using inplace to remove the first row of the DataFrame directly (avoid)
# df.drop(0, inplace=True)
# Preferred: assign the result
df = df.drop(0)
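A small sketch of why inplace=True gets in the way of chaining: the call returns None, so nothing can follow it. The DataFrame here is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# With inplace=True the method mutates its target and returns None
df_copy = df.copy()
result = df_copy.drop(0, inplace=True)
print(result)  # None

# Without inplace, the result is a new DataFrame that can be chained further
trimmed = df.drop(0).reset_index(drop=True)
print(trimmed)
```

Here df itself still has all three rows, because only df_copy was modified in place.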
Avoid unnecessary apply
The apply function can be a powerful tool, but with axis=1 it calls a Python function once per row, which can be slow and memory-intensive on large DataFrames. Avoid apply when a direct, vectorized expression accomplishes the same goal.
columns = ['space_ship', 'galaxy', 'speed',
           'maneuverability', 'pilot_skill', 'resource_management']
# Calculate the win probability for all rows at once with vectorized column arithmetic
df['win_prob'] = (df['speed'] * df['maneuverability'] * df['pilot_skill']) / df['resource_management']
# Using .apply() (row by row, slower)
# df['win_prob'] = df.apply(lambda row: (row['speed'] * row['maneuverability'] * row['pilot_skill']) / row['resource_management'], axis=1)
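To confirm that the vectorized expression and the apply version give the same numbers, here is a sketch on randomly generated data. The value range and row count are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 rows of positive random values
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(1, 10, size=(1000, 4)),
    columns=["speed", "maneuverability", "pilot_skill", "resource_management"],
)

# Vectorized: whole columns are multiplied and divided at once
vectorized = (df["speed"] * df["maneuverability"] * df["pilot_skill"]) / df["resource_management"]

# Row-wise: the lambda runs once per row
row_wise = df.apply(
    lambda row: (row["speed"] * row["maneuverability"] * row["pilot_skill"]) / row["resource_management"],
    axis=1,
)

print(np.allclose(vectorized, row_wise))
```

Timing the two expressions (for example with timeit) on your own data is the most reliable way to see how large the difference is.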
df.sample(n)
Returns n randomly selected rows from the DataFrame.
df.shape
Returns the DataFrame's dimensions as a (rows, columns) tuple, for example:
(2823, 25)
df.describe()
Returns summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column.
df.info()
Prints each column's data type and non-null count, along with the DataFrame's overall memory usage.
df.memory_usage()
Returns the memory consumed by each column, in bytes.
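For columns that hold Python objects such as strings, the default report counts only the references to them. A rough sketch of the difference, using a made-up DataFrame with an explicitly object-typed string column (an assumption made so the effect is visible):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one integer column, one object-dtype string column
df = pd.DataFrame({
    "n": np.arange(1000),
    "s": pd.Series(["some text"] * 1000, dtype=object),
})

shallow = df.memory_usage()        # object columns: size of the references only
deep = df.memory_usage(deep=True)  # also counts the Python string objects

# Converting a repetitive string column to category usually shrinks it
as_category = df["s"].astype("category")

print(shallow)
print(deep)
print(as_category.memory_usage(deep=True))
```

The result is a Series indexed by column name, with an extra "Index" entry for the row index.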
df.iloc[row_num]
Selects a row by its integer position (0-based), not by its index label.
For example:
df.iloc[0]
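The difference between position and label shows up as soon as the index is not 0, 1, 2, and so on. A short sketch with a made-up index:

```python
import pandas as pd

# Hypothetical DataFrame whose index labels are 10, 20, 30
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"]}, index=[10, 20, 30])

first_by_position = df.iloc[0]   # row at position 0
first_by_label = df.loc[10]      # row whose index label is 10
last_row = df.iloc[-1]           # negative positions count from the end
first_two = df.iloc[0:2]         # slice of positions 0 and 1

print(first_by_position["name"], last_row["name"])
```

Here df.iloc[0] and df.loc[10] return the same row, while df.loc[0] would raise a KeyError because no row is labelled 0.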
df[['col1', 'col2']]
Selects the listed columns and returns them as a new DataFrame.
df.isnull()
Returns a boolean DataFrame of the same shape, with True wherever a value is missing.
df.dropna()
Returns a new DataFrame without the rows that contain a missing value in any column.
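The two methods are often used together: count the missing values first, then decide which rows to drop. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing town
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "town": ["A", "B", None],
})

missing_per_column = df.isnull().sum()      # number of missing values in each column
complete_rows = df.dropna()                 # drop rows with any missing value
rows_with_age = df.dropna(subset=["age"])   # drop rows only where age is missing

print(missing_per_column)
print(complete_rows)
```

The subset argument is useful when only some columns must be complete, since a plain dropna() here discards two of the three rows.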