Dataframes Flashcards

1
Q

How to import data from delimited text files?

A

Comma seperated files:
* pd.read_csv(“filename.csv”)
Tab seperated files:
* pd.read_csv(“filename.tsv”, delimiter=”\t”)
General Text files:
* pd.read_csv(“filename.txt”, delimiter=””)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

How to view the top n entries in the dataframe?

A

df.head(n)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How to retrieve the number of rows and columns in the dataframe?

A

Rows:
df.shape[0]
Columns:
df.shape[1]
Both Rows & Columns:
df.shape
or
df.info() – Shows number of records or data entries, data types and memory usage

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How to print the names of all the columns in the dataframe?

A

df.columns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How to check the indexing of the dataframe?

A

df.index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What does RangeIndex(start=0, stop=2622, step=1) mean?

A

It means that the index starts at 0 and stops before 2622. It does not include 2622.

This means that the index runs from 0 to 2622 which is 2622 rows in total.

Total Rows = stop - start = 2622 - 0 = 2,622 rows.

So the dataframe has 2622 rows.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How to group by and sort data in a dataframe?

A

Basic Syntax:
df.groupby(‘column_name’, as_index=False).agg({‘column_name’:’sum’, ‘column_name1’:’sum’})

df.sorted_values(‘column_name’, ascending=False)

Sorting without assignment:
df_grouped.sort_values(‘column_to_sort’, ascending=False, inplace=True)

Grouping by multiple columns:
df.groupby([‘col1’, ‘col2’], as_index=False).agg({‘col_to_sum’:’sum’})
as_index=False → Keeps grouped column as a column instead of an index

OR

df.groupby([‘col1’, ‘col2’]).agg({‘col_to_sum’: ‘sum’}).reset_index()
Use **reset_index() **if you forgot to set as_index=False or need to convert an index to a column later.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What Does dtype(‘O’) mean in Pandas?

A

O stands for Object data type.
It is used to represent strings or mixed types in a dataframe.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How to check the data type of a column in pandas?

A

df[‘col1’].dtype

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How to convert string values in a column to numeric, handling errors?

A

df[‘col’] = pd.to_numeric(df[‘col’], errors=’coerce’)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is the equivalent of the DISTINCT COUNT function in pandas?

A

df[‘column_name’].nunique()
This returns the count of unique values in a column.

df.nunique()
If you want the distinct count for multiple columns

Example:
data = {‘Category’: [‘A’, ‘B’, ‘A’, ‘C’, ‘B’, ‘B’, ‘C’]}
df = pd.DataFrame(data)

distinct_count = df[‘Category’].nunique()
print(distinct_count) # Output: 3

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What’s the difference between .nunique() and .value_counts().count() in Pandas?

A

.nunique() → Directly counts the number of unique values in a column (more efficient).

.value_counts().count() → First creates a frequency table, then counts the unique entries (less efficient).

✅ Use .nunique() when you just need the count of unique values. It is more direct and efficient for just getting the number of unique values.
✅ Use .value_counts() if you also need the frequency of each unique value. It is less efficient but useful if you also want frequency counts of each unique value.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How to find the sum and average of a column?

A

Sum: df[‘column_name’].sum()
Average: df[‘column_name’].mean()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Find the average revenue per order

A

df[‘revenue’].sum() / df[‘order_id’].nunique()

Alternate approach:
If the table has multiple line items per order, grouping by Order_ID first ensures that you sum up the revenue for each order before calculating the average.

df.groupby([‘order_id’], as_index = False)[‘revenue’].sum().mean()

groupby() ensures we sum revenue per order first before averaging.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How to set a specific column in the dataset as the index column, while reading data from a delimited text file for example?

A

pd.read_csc(‘file.tsv’, delimiter=’|’, index_col=’column_name’)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How to view the first and last entries in the dataframe?

A

df.head()
df.tail()

17
Q

How to summarize the basics stats about the data in the dataframe?

A

**df.describe()
By default, this function only describes numeric columns

df.describe(include=’all’)
To include summary all columns in the dataframe

df.describe(include=’object’)
To include only object columns

Note: A column with the data type “object” typically contains text (strings), but it can also store mixed data types, such as numbers, lists, dictionaries, or even Python objects.

18
Q

How to summarize a specific column in the dataframe?

A

df[‘column_name’].describe()

19
Q

How to get general information about the dataframe?

20
Q

How to check the number of missing values in each column in a dataframe?

A

df.isnull().sum()
or
df[‘column_name’].isnull().sum()

21
Q

How to find the name of the nth column in the dataframe?

A

df.columns[n-1]

Example:
What is the name of the 105th column?
df.columns[104]

22
Q

How to find the datatype of the nth column in the dataframe?

A

df[‘column_name’].dtypes

23
Q

How to find the value in a specifc location (row and column) in the dataframe?

A

df.values[row_number][column_number]
or
.loc[row_label, column_label]
Label-based indexing - Labels (index, column names)
or
.iloc[row_index, col_index]
** Position-based indexing - Integer positions (0-based)**
Excludes the end index

24
Q

How do you group by a column, sum another column, and keep the grouping column as a regular column (not index)?

A

A:

df.groupby(“item_name”, as_index=False).agg({“quantity”: sum})
✅ as_index=False → keeps item_name as a column
✅ .agg({“quantity”: sum}) → aggregates using sum

Bonus – Sort by highest total:

df.groupby(“item_name”, as_index=False).agg({“quantity”: sum}).sort_values(by=”quantity”, ascending

25
Q

What’s the difference between these two group by statements in pandas?

  • chipo.groupby(“item_name”)[“quantity”].sum()
  • chipo.groupby(“item_name”).agg({“quantity”: “sum”})
A

Returns a Series:

chipo.groupby(“item_name”)[“quantity”].sum()
Result: Series

Index: item_name
Values: Sum of quantity

You cannot use sort_values(by=”quantity”) because there’s no column name — just values.

Returns a DataFrame:

chipo.groupby(“item_name”).agg({“quantity”: “sum”})
Result: DataFrame

Index: item_name
Column: quantity

You can sort with: .sort_values(by=”quantity”)

Quick Tip:
Use [“quantity”].sum() when you want a quick Series.

Use .agg(…) when you want a DataFrame and more flexibility (like sorting by column name or adding multiple aggregations).

26
Q

What is a lambda function and when do you use it?

A

A lambda function is a small, anonymous function in Python used for short, simple operations.

🧾 Syntax:

lambda arguments: expression
It behaves like a regular function, but it’s:
Defined in one line
Not given a name

Often used inside methods like apply(), map(), or filter()

🔍 Example:

square = lambda x: x * x
square(4) # Output: 16
Used in pandas:

df[“new_col”] = df[“old_col”].apply(lambda x: x + 10)
💡 Use when:
You need a quick, one-time-use function

Especially handy inside pandas’ apply() for column-wise or row-wise logic

27
Q

How do you use a lambda function to convert strings like “$2.39” in a column to floats?

A

df[“item_price”] = df[“item_price”].apply(lambda x: float(x.replace(‘$’, ‘’)))
.apply(lambda x: …) → applies the function to each value

x.replace(‘$’, ‘’) → removes the $ symbol

float(…) → converts the cleaned string to a float

⚠️ Why x[1:-1] is risky:

lambda x: float(x[1:-1])
Removes both the first and last characters

Might chop off a digit (e.g., “$2.39” → “2.3” instead of “2.39”)

🔄 Safer Alternative:

lambda x: float(x[1:])
Removes only the first character ($)

Still assumes there’s no unwanted character at the end

💡 Best practice: Use .replace() for clarity and flexibility.

28
Q

What’s the difference between value_counts() and count() in pandas? What does chipo[“item_name”].value_counts().count() do?

A

A:

value_counts()
➤ Returns the count of unique values in a Series.
➤ Output: A Series with unique values as the index and their counts as values.
✅ Example: chipo[“item_name”].value_counts() tells you how many times each menu item was ordered.

count()
➤ Returns the number of non-null entries in a Series.
➤ If used after value_counts(), it tells you how many unique values exist.

So, what does chipo[“item_name”].value_counts().count() do?
✅ It returns the number of unique items ordered — i.e., how many different kinds of item_name exist in the data.