Dataframes Flashcards
How to import data from delimited text files?
Comma seperated files:
* pd.read_csv(“filename.csv”)
Tab seperated files:
* pd.read_csv(“filename.tsv”, delimiter=”\t”)
General Text files:
* pd.read_csv(“filename.txt”, delimiter=””)
How to view the top n entries in the dataframe?
df.head(n)
How to retrieve the number of rows and columns in the dataframe?
Rows:
df.shape[0]
Columns:
df.shape[1]
Both Rows & Columns:
df.shape
or
df.info() – Shows number of records or data entries, data types and memory usage
How to print the names of all the columns in the dataframe?
df.columns
How to check the indexing of the dataframe?
df.index
What does RangeIndex(start=0, stop=2622, step=1) mean?
It means that the index starts at 0 and stops before 2622. It does not include 2622.
This means that the index runs from 0 to 2622 which is 2622 rows in total.
Total Rows = stop - start = 2622 - 0 = 2,622 rows.
So the dataframe has 2622 rows.
How to group by and sort data in a dataframe?
Basic Syntax:
df.groupby(‘column_name’, as_index=False).agg({‘column_name’:’sum’, ‘column_name1’:’sum’})
df.sorted_values(‘column_name’, ascending=False)
Sorting without assignment:
df_grouped.sort_values(‘column_to_sort’, ascending=False, inplace=True)
Grouping by multiple columns:
df.groupby([‘col1’, ‘col2’], as_index=False).agg({‘col_to_sum’:’sum’})
as_index=False → Keeps grouped column as a column instead of an index
OR
df.groupby([‘col1’, ‘col2’]).agg({‘col_to_sum’: ‘sum’}).reset_index()
Use **reset_index() **if you forgot to set as_index=False or need to convert an index to a column later.
What Does dtype(‘O’) mean in Pandas?
O stands for Object data type.
It is used to represent strings or mixed types in a dataframe.
How to check the data type of a column in pandas?
df[‘col1’].dtype
How to convert string values in a column to numeric, handling errors?
df[‘col’] = pd.to_numeric(df[‘col’], errors=’coerce’)
What is the equivalent of the DISTINCT COUNT function in pandas?
df[‘column_name’].nunique()
This returns the count of unique values in a column.
df.nunique()
If you want the distinct count for multiple columns
Example:
data = {‘Category’: [‘A’, ‘B’, ‘A’, ‘C’, ‘B’, ‘B’, ‘C’]}
df = pd.DataFrame(data)
distinct_count = df[‘Category’].nunique()
print(distinct_count) # Output: 3
What’s the difference between .nunique() and .value_counts().count() in Pandas?
.nunique() → Directly counts the number of unique values in a column (more efficient).
.value_counts().count() → First creates a frequency table, then counts the unique entries (less efficient).
✅ Use .nunique() when you just need the count of unique values. It is more direct and efficient for just getting the number of unique values.
✅ Use .value_counts() if you also need the frequency of each unique value. It is less efficient but useful if you also want frequency counts of each unique value.
How to find the sum and average of a column?
Sum: df[‘column_name’].sum()
Average: df[‘column_name’].mean()
Find the average revenue per order
df[‘revenue’].sum() / df[‘order_id’].nunique()
Alternate approach:
If the table has multiple line items per order, grouping by Order_ID first ensures that you sum up the revenue for each order before calculating the average.
df.groupby([‘order_id’], as_index = False)[‘revenue’].sum().mean()
groupby() ensures we sum revenue per order first before averaging.
How to set a specific column in the dataset as the index column, while reading data from a delimited text file for example?
pd.read_csc(‘file.tsv’, delimiter=’|’, index_col=’column_name’)
How to view the first and last entries in the dataframe?
df.head()
df.tail()
How to summarize the basics stats about the data in the dataframe?
**df.describe()
By default, this function only describes numeric columns
df.describe(include=’all’)
To include summary all columns in the dataframe
df.describe(include=’object’)
To include only object columns
Note: A column with the data type “object” typically contains text (strings), but it can also store mixed data types, such as numbers, lists, dictionaries, or even Python objects.
How to summarize a specific column in the dataframe?
df[‘column_name’].describe()
How to get general information about the dataframe?
df.info()
How to check the number of missing values in each column in a dataframe?
df.isnull().sum()
or
df[‘column_name’].isnull().sum()
How to find the name of the nth column in the dataframe?
df.columns[n-1]
Example:
What is the name of the 105th column?
df.columns[104]
How to find the datatype of the nth column in the dataframe?
df[‘column_name’].dtypes
How to find the value in a specifc location (row and column) in the dataframe?
df.values[row_number][column_number]
or
.loc[row_label, column_label]
Label-based indexing - Labels (index, column names)
or
.iloc[row_index, col_index]
** Position-based indexing - Integer positions (0-based)**
Excludes the end index
How do you group by a column, sum another column, and keep the grouping column as a regular column (not index)?
A:
df.groupby(“item_name”, as_index=False).agg({“quantity”: sum})
✅ as_index=False → keeps item_name as a column
✅ .agg({“quantity”: sum}) → aggregates using sum
Bonus – Sort by highest total:
df.groupby(“item_name”, as_index=False).agg({“quantity”: sum}).sort_values(by=”quantity”, ascending
What’s the difference between these two group by statements in pandas?
- chipo.groupby(“item_name”)[“quantity”].sum()
- chipo.groupby(“item_name”).agg({“quantity”: “sum”})
Returns a Series:
chipo.groupby(“item_name”)[“quantity”].sum()
Result: Series
Index: item_name
Values: Sum of quantity
You cannot use sort_values(by=”quantity”) because there’s no column name — just values.
Returns a DataFrame:
chipo.groupby(“item_name”).agg({“quantity”: “sum”})
Result: DataFrame
Index: item_name
Column: quantity
You can sort with: .sort_values(by=”quantity”)
Quick Tip:
Use [“quantity”].sum() when you want a quick Series.
Use .agg(…) when you want a DataFrame and more flexibility (like sorting by column name or adding multiple aggregations).
What is a lambda function and when do you use it?
A lambda function is a small, anonymous function in Python used for short, simple operations.
🧾 Syntax:
lambda arguments: expression
It behaves like a regular function, but it’s:
Defined in one line
Not given a name
Often used inside methods like apply(), map(), or filter()
🔍 Example:
square = lambda x: x * x
square(4) # Output: 16
Used in pandas:
df[“new_col”] = df[“old_col”].apply(lambda x: x + 10)
💡 Use when:
You need a quick, one-time-use function
Especially handy inside pandas’ apply() for column-wise or row-wise logic
How do you use a lambda function to convert strings like “$2.39” in a column to floats?
df[“item_price”] = df[“item_price”].apply(lambda x: float(x.replace(‘$’, ‘’)))
.apply(lambda x: …) → applies the function to each value
x.replace(‘$’, ‘’) → removes the $ symbol
float(…) → converts the cleaned string to a float
⚠️ Why x[1:-1] is risky:
lambda x: float(x[1:-1])
Removes both the first and last characters
Might chop off a digit (e.g., “$2.39” → “2.3” instead of “2.39”)
🔄 Safer Alternative:
lambda x: float(x[1:])
Removes only the first character ($)
Still assumes there’s no unwanted character at the end
💡 Best practice: Use .replace() for clarity and flexibility.
What’s the difference between value_counts() and count() in pandas? What does chipo[“item_name”].value_counts().count() do?
A:
value_counts()
➤ Returns the count of unique values in a Series.
➤ Output: A Series with unique values as the index and their counts as values.
✅ Example: chipo[“item_name”].value_counts() tells you how many times each menu item was ordered.
count()
➤ Returns the number of non-null entries in a Series.
➤ If used after value_counts(), it tells you how many unique values exist.
So, what does chipo[“item_name”].value_counts().count() do?
✅ It returns the number of unique items ordered — i.e., how many different kinds of item_name exist in the data.