Dataframes Flashcards by Divya Michael

How to import data from delimited text files?

Comma seperated files:
* pd.read_csv(“filename.csv”)
Tab seperated files:
* pd.read_csv(“filename.tsv”, delimiter=”\t”)
General Text files:
* pd.read_csv(“filename.txt”, delimiter=””)

How well did you know this?

Not at all

Perfectly

How to view the top n entries in the dataframe?

df.head(n)

How well did you know this?

Not at all

Perfectly

How to retrieve the number of rows and columns in the dataframe?

Rows:
df.shape[0]
Columns:
df.shape[1]
Both Rows & Columns:
df.shape
or
df.info() – Shows number of records or data entries, data types and memory usage

How well did you know this?

Not at all

Perfectly

How to print the names of all the columns in the dataframe?

df.columns

How well did you know this?

Not at all

Perfectly

How to check the indexing of the dataframe?

df.index

How well did you know this?

Not at all

Perfectly

What does RangeIndex(start=0, stop=2622, step=1) mean?

It means that the index starts at 0 and stops before 2622. It does not include 2622.

This means that the index runs from 0 to 2622 which is 2622 rows in total.

Total Rows = stop - start = 2622 - 0 = 2,622 rows.

So the dataframe has 2622 rows.

How well did you know this?

Not at all

Perfectly

How to group by and sort data in a dataframe?

Basic Syntax:
df.groupby(‘column_name’, as_index=False).agg({‘column_name’:’sum’, ‘column_name1’:’sum’})

df.sorted_values(‘column_name’, ascending=False)

Sorting without assignment:
df_grouped.sort_values(‘column_to_sort’, ascending=False, inplace=True)

Grouping by multiple columns:
df.groupby([‘col1’, ‘col2’], as_index=False).agg({‘col_to_sum’:’sum’})
as_index=False → Keeps grouped column as a column instead of an index

df.groupby([‘col1’, ‘col2’]).agg({‘col_to_sum’: ‘sum’}).reset_index()
Use **reset_index() **if you forgot to set as_index=False or need to convert an index to a column later.

How well did you know this?

Not at all

Perfectly

What Does dtype(‘O’) mean in Pandas?

O stands for Object data type.
It is used to represent strings or mixed types in a dataframe.

How well did you know this?

Not at all

Perfectly

How to check the data type of a column in pandas?

df[‘col1’].dtype

How well did you know this?

Not at all

Perfectly

How to convert string values in a column to numeric, handling errors?

df[‘col’] = pd.to_numeric(df[‘col’], errors=’coerce’)

How well did you know this?

Not at all

Perfectly

What is the equivalent of the DISTINCT COUNT function in pandas?

df[‘column_name’].nunique()
This returns the count of unique values in a column.

df.nunique()
If you want the distinct count for multiple columns

Example:
data = {‘Category’: [‘A’, ‘B’, ‘A’, ‘C’, ‘B’, ‘B’, ‘C’]}
df = pd.DataFrame(data)

distinct_count = df[‘Category’].nunique()
print(distinct_count) # Output: 3

How well did you know this?

Not at all

Perfectly

What’s the difference between .nunique() and .value_counts().count() in Pandas?

.nunique() → Directly counts the number of unique values in a column (more efficient).

.value_counts().count() → First creates a frequency table, then counts the unique entries (less efficient).

✅ Use .nunique() when you just need the count of unique values. It is more direct and efficient for just getting the number of unique values.
✅ Use .value_counts() if you also need the frequency of each unique value. It is less efficient but useful if you also want frequency counts of each unique value.

How well did you know this?

Not at all

Perfectly

How to find the sum and average of a column?

Sum: df[‘column_name’].sum()
Average: df[‘column_name’].mean()

How well did you know this?

Not at all

Perfectly

Find the average revenue per order

df[‘revenue’].sum() / df[‘order_id’].nunique()

Alternate approach:
If the table has multiple line items per order, grouping by Order_ID first ensures that you sum up the revenue for each order before calculating the average.

df.groupby([‘order_id’], as_index = False)[‘revenue’].sum().mean()

groupby() ensures we sum revenue per order first before averaging.

How well did you know this?

Not at all

Perfectly

How to set a specific column in the dataset as the index column, while reading data from a delimited text file for example?

pd.read_csc(‘file.tsv’, delimiter=’|’, index_col=’column_name’)

How well did you know this?

Not at all

Perfectly

How to view the first and last entries in the dataframe?

Study These Flashcards

df.head()
df.tail()

How to summarize the basics stats about the data in the dataframe?

Study These Flashcards

**df.describe()
By default, this function only describes numeric columns

df.describe(include=’all’)
To include summary all columns in the dataframe

df.describe(include=’object’)
To include only object columns

Note: A column with the data type “object” typically contains text (strings), but it can also store mixed data types, such as numbers, lists, dictionaries, or even Python objects.

How to summarize a specific column in the dataframe?

Study These Flashcards

df[‘column_name’].describe()

How to get general information about the dataframe?

Study These Flashcards

df.info()

How to check the number of missing values in each column in a dataframe?

Study These Flashcards

df.isnull().sum()
or
df[‘column_name’].isnull().sum()

How to find the name of the nth column in the dataframe?

Study These Flashcards

df.columns[n-1]

Example:
What is the name of the 105th column?
df.columns[104]

How to find the datatype of the nth column in the dataframe?

Study These Flashcards

df[‘column_name’].dtypes

How to find the value in a specifc location (row and column) in the dataframe?

Study These Flashcards

df.values[row_number][column_number]
or
.loc[row_label, column_label]
Label-based indexing - Labels (index, column names)
or
.iloc[row_index, col_index]
** Position-based indexing - Integer positions (0-based)**
Excludes the end index

How do you group by a column, sum another column, and keep the grouping column as a regular column (not index)?

Study These Flashcards

df.groupby(“item_name”, as_index=False).agg({“quantity”: sum})
✅ as_index=False → keeps item_name as a column
✅ .agg({“quantity”: sum}) → aggregates using sum

Bonus – Sort by highest total:

df.groupby(“item_name”, as_index=False).agg({“quantity”: sum}).sort_values(by=”quantity”, ascending

What's the difference between these two group by statements in pandas? - chipo.groupby("item_name")["quantity"].sum() - chipo.groupby("item_name").agg({"quantity": "sum"})

Returns a Series: chipo.groupby("item_name")["quantity"].sum() Result: Series Index: item_name Values: Sum of quantity You cannot use sort_values(by="quantity") because there’s no column name — just values. --------------------------------------------------------------------------------------------- Returns a DataFrame: chipo.groupby("item_name").agg({"quantity": "sum"}) Result: DataFrame Index: item_name Column: quantity You can sort with: .sort_values(by="quantity") Quick Tip: Use ["quantity"].sum() when you want a quick Series. Use .agg(...) when you want a DataFrame and more flexibility (like sorting by column name or adding multiple aggregations).

What is a lambda function and when do you use it?

A lambda function is a small, anonymous function in Python used for short, simple operations. 🧾 Syntax: lambda arguments: expression It behaves like a regular function, but it's: Defined in one line Not given a name Often used inside methods like apply(), map(), or filter() 🔍 Example: square = lambda x: x * x square(4) # Output: 16 Used in pandas: df["new_col"] = df["old_col"].apply(lambda x: x + 10) 💡 Use when: You need a quick, one-time-use function Especially handy inside pandas' apply() for column-wise or row-wise logic

How do you use a lambda function to convert strings like "$2.39" in a column to floats?

df["item_price"] = df["item_price"].apply(lambda x: float(x.replace('$', ''))) .apply(lambda x: ...) → applies the function to each value x.replace('$', '') → removes the $ symbol float(...) → converts the cleaned string to a float ⚠️ Why x[1:-1] is risky: lambda x: float(x[1:-1]) Removes both the first and last characters Might chop off a digit (e.g., "$2.39" → "2.3" instead of "2.39") 🔄 Safer Alternative: lambda x: float(x[1:]) Removes only the first character ($) Still assumes there's no unwanted character at the end 💡 Best practice: Use .replace() for clarity and flexibility.

What’s the difference between value_counts() and count() in pandas? What does chipo["item_name"].value_counts().count() do?

A: value_counts() ➤ Returns the count of unique values in a Series. ➤ Output: A Series with unique values as the index and their counts as values. ✅ Example: chipo["item_name"].value_counts() tells you how many times each menu item was ordered. count() ➤ Returns the number of non-null entries in a Series. ➤ If used after value_counts(), it tells you how many unique values exist. So, what does chipo["item_name"].value_counts().count() do? ✅ It returns the number of unique items ordered — i.e., how many different kinds of item_name exist in the data.

Dataframes Flashcards

(28 cards)