Pandas Pro Flashcards
What is chaining?
Performing multiple operations in a single expression, passing each intermediate result to the next method
import numpy as np

df_updated = (df
.query("release_year > 2018") # Keep only movies and shows released after 2018
.loc[:, ["title", "release_year", "duration"]] # Keep only these columns
.assign(over_three_hours=lambda dataframe: np.where(dataframe["duration"] > 180, "Yes", "No")) # Create new column over_three_hours flagging duration > 180
.groupby(by=["release_year", "over_three_hours"]) # Group by the given columns
.count() # Count titles per release_year and over_three_hours combination
)
df_updated
nlargest and nsmallest
Instead of using sort_values to find the largest or smallest values in your data, consider using nlargest and nsmallest. These functions are faster and more memory-efficient than a full sort when you only need a handful of rows, making them a great choice for large datasets.
df.nsmallest(3, "age") # Youngest 3 passengers
df.nlargest(3, "age") # Oldest 3 passengers
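To see the difference for yourself, here is a minimal timing sketch on synthetic data (the DataFrame is made up for illustration and results vary by machine):
import time
import numpy as np
import pandas as pd

big = pd.DataFrame({"age": np.random.default_rng(0).integers(0, 100, 1_000_000)})

start = time.perf_counter()
big.sort_values("age").head(3) # full sort, then take the top rows
print(f"sort_values + head: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
big.nsmallest(3, "age") # partial selection, no full sort
print(f"nsmallest: {time.perf_counter() - start:.4f}s")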
Filtering data with .query() method
Pandas' query method allows you to filter your data using logical expressions. You can also use the @ symbol to refer to Python variables inside the query, making it a convenient and powerful tool for filtering data.
df["embark_town"].unique() # ['Southampton', 'Cherbourg', 'Queenstown', nan]
embark_towns = ["Southampton", "Queenstown"] # Only want to select these towns
df.query("age > 21 & fare > 250 & embark_town == @embark_towns").head()
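Note that comparing a column against a list with == inside query() is interpreted by Pandas as a membership test; the in operator expresses the same filter more explicitly:
df.query("age > 21 & fare > 250 & embark_town in @embark_towns").head()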
pd.cut Method
The cut function is a useful tool for binning your data into discrete categories. This can be useful for visualizing your data or for transforming continuous variables into categorical ones.
# Child - 0 to 9 years
# Teen - 10 to 19 years
# Young - 19 to 24 years
# Adult - 25 to 59 years
# Elderly - over 59 years
import pandas as pd
import matplotlib.pyplot as plt

bins = [0, 10, 19, 24, 59, float("inf")]
labels = ["Child", "Teen", "Young", "Adult", "Elderly"]
df["age"].hist()
plt.show()
df["age_category"] = pd.cut(df["age"], bins=bins, labels=labels)
sorted_df = df.sort_values(by="age_category")
sorted_df["age_category"].hist()
plt.show()
Avoid using inplace
Using inplace to remove the first row of the DataFrame directly:
The inplace parameter in Pandas allows you to perform operations directly on your dataframe, but it can be dangerous to use, as it makes your code harder to read and debug. Instead, use the standard approach of assigning the result of the operation to a variable.
# df.drop(0, inplace=True)
df = df.drop(0)
Avoid unnecessary apply
Calculate the win probability element-wise for each row using the specified formula
The apply function can be a powerful tool, but it can also be slow and memory-intensive. Avoid apply when a direct, vectorized alternative exists.
columns = ["space_ship", "galaxy", "speed",
           "maneuverability", "pilot_skill", "resource_management"]
df["win_prob"] = (df["speed"] * df["maneuverability"] * df["pilot_skill"]) / df["resource_management"]
# Using .apply()
# df["win_prob"] = df.apply(lambda row: (row["speed"] * row["maneuverability"] * row["pilot_skill"]) / row["resource_management"], axis=1)
10 Essential Pandas Functions Every Data Scientist Should Know
This article covers ten Pandas functions that are both important and handy for every data scientist.
By Aryan Garg, KDnuggets on November 10, 2023 in Data Science
In today's data-driven world, data analysis and insights help you get the most out of your data and make better decisions. From a company's perspective, they provide a competitive advantage and personalize the whole process.
This tutorial will explore Pandas, one of the most powerful Python libraries, and discuss the functions of this library that are most important for data analysis. Beginners can also follow this tutorial thanks to its simplicity and efficiency. If you don't have Python installed on your system, you can use Google Colaboratory.
Importing Data
You can download the dataset from that link.
import pandas as pd
df = pd.read_csv("kaggle_sales_data.csv", encoding="Latin-1") # Load the data
df.head() # Show first five rows
Data Exploration
In this section, we will discuss various functions that help you learn more about your data, such as viewing it, computing the mean or min/max, or getting summary information about the dataframe.
- Data Viewing
df.head(): It displays the first five rows of the sample data
df.tail(): It displays the last five rows of the sample data
df.sample(n): It displays n randomly sampled rows from the data
df.sample(6)
df.shape: It displays the sample data’s rows and columns (dimensions).
(2823, 25)
It signifies that our dataset has 2823 rows, each containing 25 columns.
- Statistics
This section contains the functions that help you perform statistics like average, min/max, and quartiles on your data.
df.describe(): Get the basic statistics of each column of the sample data
df.info(): Get the information about the various data types used and the non-null count of each column.
df.corr(): This gives you the correlation matrix between all the numeric columns in the data frame.
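Note that on Pandas 2.0 and later, corr() no longer drops non-numeric columns automatically and will raise an error on mixed frames, so you may need to restrict it explicitly:
correlations = df.corr(numeric_only=True) # consider only numeric columns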
df.memory_usage(): It will tell you how much memory is being consumed by each column.
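By default, memory_usage() reports only a shallow estimate for object (string) columns; pass deep=True to measure their actual memory footprint:
df.memory_usage(deep=True)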
- Data Selection
You can also select the data of any specific row, column, or even multiple columns.
df.iloc[row_num]: It will select a particular row based on its integer position
For ex-,
df.iloc[0]
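A quick contrast with loc may help here: iloc selects by integer position, while loc selects by index label (the two coincide on a default RangeIndex). A small sketch:
df.iloc[0] # first row, by position
df.loc[0] # row whose index label is 0
df.iloc[0:3] # positions 0-2 (end exclusive)
df.loc[0:3] # labels 0-3 (end inclusive)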
df[col_name]: It will select the particular column
For ex-,
df["SALES"]
df[['col1', 'col2']]: It will select the specified columns
For ex-,
df[["SALES", "PRICEEACH"]]
- Data Cleaning
These functions are used to handle missing data. Some rows in the data contain null and garbage values, which can hamper the performance of a trained model, so it is always better to correct or remove these missing values.
df.isnull(): This will identify the missing values in your dataframe.
df.dropna(): This will remove the rows containing missing values in any column.
df.fillna(val): This will fill the missing values with val given in the argument.
df['col'].astype(new_data_type): It can convert the data type of the selected column to a different data type.
For ex-,
df["SALES"].astype(int)
We are converting the data type of the SALES column from float to int.
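Putting these together, a typical cleaning pass might look like the sketch below (the choice of subset column and the median fill are illustrative, not prescribed by the article):
df.isnull().sum() # count missing values per column
df = df.dropna(subset=["SALES"]) # drop only rows where SALES is missing
df["PRICEEACH"] = df["PRICEEACH"].fillna(df["PRICEEACH"].median()) # impute with the median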
- Data Analysis
Here, we will use some helpful functions in data analysis, like grouping, sorting, and filtering.
Aggregation Functions:
You can group a column by its name and then apply some aggregation functions like sum, min/max, mean, etc.
df.groupby("col_name_1").agg({"col_name_2": "sum"})
For ex-,
df.groupby("CITY").agg({"SALES": "sum"})
It will give you the total sales of each city.
If you want to apply multiple aggregations at a single time, you can write them like this.
For ex-,
aggregation = df.agg({"SALES": "sum", "QUANTITYORDERED": "mean"})
Output:
SALES 1.003263e+07
QUANTITYORDERED 3.509281e+01
dtype: float64
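Pandas also supports named aggregation, which lets you name the output columns directly; a sketch using the same columns:
summary = df.groupby("CITY").agg(
    total_sales=("SALES", "sum"),
    avg_quantity=("QUANTITYORDERED", "mean"),
)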
Filtering Data:
We can filter the data in rows based on a specific value or a condition.
For ex-,
df[df["SALES"] > 5000]
Displays the rows where the value of sales is greater than 5000
You can also filter the dataframe using the query() function. It will also generate a similar output as above.
For ex,
df.query("SALES > 5000")
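When combining several conditions with boolean indexing, wrap each condition in parentheses and use & / | rather than and / or (the year value below is illustrative):
df[(df["SALES"] > 5000) & (df["YEAR_ID"] == 2004)]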
Sorting Data:
You can sort the data based on a specific column, in either ascending or descending order.
For ex-,
df.sort_values("SALES", ascending=False) # Sorts the data in descending order
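You can also sort on several columns at once, with a separate direction for each:
df.sort_values(["CITY", "SALES"], ascending=[True, False]) # A-Z by city, highest sales first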
Pivot Tables:
We can create pivot tables that summarize the data using specific columns. This is very useful in analyzing the data when you only want to consider the effect of particular columns.
For ex-,
pd.pivot_table(df, values="SALES", index="CITY", columns="YEAR_ID", aggfunc="sum")
Let me break this down for you.
values: It contains the column for which you want to populate the table’s cells.
index: The column used in it will become the row index of the pivot table, and each unique category of this column will become a row in the pivot table.
columns: It contains the headers of the pivot table, and each unique element will become the column in the pivot table.
aggfunc: This is the same aggregator function we discussed earlier.
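As a concrete sketch of how these parameters fit together, here is a tiny hand-built example (the numbers are made up for illustration):
import pandas as pd

mini = pd.DataFrame({
    "CITY": ["NYC", "NYC", "Paris", "Paris"],
    "YEAR_ID": [2004, 2005, 2004, 2005],
    "SALES": [100, 150, 80, 120],
})
print(pd.pivot_table(mini, values="SALES", index="CITY", columns="YEAR_ID", aggfunc="sum"))
# YEAR_ID  2004  2005
# CITY
# NYC       100   150
# Paris      80   120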
Combining Data Frames
We can combine several data frames either vertically or horizontally. concat stacks the given data frames and returns a single combined data frame.
For ex-,
combined_df = pd.concat([df1, df2])
You can merge two data frames based on a common column. It is useful when you want to combine two data frames that share a common identifier.
For ex-,
merged_df = pd.merge(df1, df2, on="common_col")
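A minimal sketch of both operations on toy frames (the column names are illustrative):
import pandas as pd

df1 = pd.DataFrame({"common_col": [1, 2], "a": ["x", "y"]})
df2 = pd.DataFrame({"common_col": [2, 3], "b": ["p", "q"]})

combined_df = pd.concat([df1, df2]) # stacks rows vertically
merged_df = pd.merge(df1, df2, on="common_col") # inner join; keeps only common_col == 2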
Applying Custom Functions
You can apply custom functions to either a row or a column, according to your needs.
For ex-,
def cus_fun(x):
    return x * 3

df["Sales_Tripled"] = df["SALES"].apply(cus_fun)
We have written a custom function that triples the sales value for each row. Note that Series.apply takes no axis argument; axis matters only when applying a function to a whole DataFrame with df.apply(), where axis=0 applies the function to each column and axis=1 applies it to each row.
In the earlier method, you had to write a separate function and then call it from the apply() method. A lambda function lets you define the custom function inside the apply() call itself. Let's see how we can do that.
df["Sales_Tripled"] = df["SALES"].apply(lambda x: x * 3)
Applymap
We can also apply a custom function to every element of the dataframe in a single line of code. A point to remember is that it applies to all the elements of the dataframe, not to a single row or column.
For ex-,
df = df.applymap(lambda x: str(x))
It will convert all the elements of the dataframe to strings.
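One caveat: applymap was deprecated in Pandas 2.1 and renamed to DataFrame.map, so on recent versions prefer:
df = df.map(lambda x: str(x)) # element-wise, same behavior as applymap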
Time Series Analysis
In mathematics, time series analysis means analyzing data collected over a specific time interval, and Pandas has functions to perform this type of analysis.
Conversion to DateTime Objects:
We can convert the date column into a datetime format for easier data manipulation.
For ex-,
df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"])
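Once the column is a datetime, the .dt accessor exposes its components, which is handy for grouping by period (the new column names are illustrative):
df["ORDER_YEAR"] = df["ORDERDATE"].dt.year
df["ORDER_MONTH"] = df["ORDERDATE"].dt.month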
Calculate Rolling Average
Using this method, we can create a rolling window over the data. The window can be any size; with daily data, a window of 5 averages each value with the four preceding days. It can help you smooth out fluctuations in your data and identify patterns over time.
For ex-
rolling_avg = df["SALES"].rolling(window=5).mean()
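Note that the first window - 1 rows come out as NaN because a full window is not yet available; min_periods relaxes this:
rolling_avg = df["SALES"].rolling(window=5, min_periods=1).mean() # allow partial windows at the start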
Cross Tabulation
We can perform cross-tabulation between two columns of a table. The result is a frequency table showing how often each combination of categories occurs, which can help you understand the distribution of categories across different regions.
For ex-,
Getting a cross-tabulation between the COUNTRY and DEALSIZE.
cross_tab = pd.crosstab(df["COUNTRY"], df["DEALSIZE"])
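crosstab can also report proportions instead of raw counts via the normalize parameter:
cross_tab_pct = pd.crosstab(df["COUNTRY"], df["DEALSIZE"], normalize="index") # row-wise proportions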
Handling Outliers
Outliers can distort an analysis. A common approach is the IQR rule: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers.
Q1 = df["SALES"].quantile(0.25)
Q3 = df["SALES"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df["SALES"] < lower_bound) | (df["SALES"] > upper_bound)]
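Having flagged the outliers, two common follow-ups are dropping them or capping them at the bounds; a sketch (the new column name is illustrative):
df_no_outliers = df[(df["SALES"] >= lower_bound) & (df["SALES"] <= upper_bound)] # drop outliers
df["SALES_CAPPED"] = df["SALES"].clip(lower_bound, upper_bound) # or cap them at the bounds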