Cleaning dataframes Flashcards

Question 1

Q

How do you drop rows with missing values but leave the columns they’re in intact?

Answer

A

df_cleaned = df.dropna()

axis=0 is the default and specifies that rows should be dropped.

dropna(): Removes rows that contain at least one missing value (NaN).
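A minimal sketch of the row-wise behaviour described above, using a small made-up plant table (the column names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical plant measurements; NaN marks a missing value.
df = pd.DataFrame({
    "species": ["oak", "ash", "elm"],
    "height": [12.0, np.nan, 9.5],
    "leaf_area": [30.1, 22.4, np.nan],
})

# Keep only the rows that have no missing values.
# dropna() returns a new DataFrame; df itself is unchanged.
df_cleaned = df.dropna()

print(df_cleaned)
```

Only the "oak" row survives here, and all three columns are kept.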

Question 2

Q

How do you drop columns with missing values?

Answer

A

df_cleaned = df.dropna(axis=1)
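As a rough illustration, the same call on a small made-up table (column names and values are assumptions) keeps every row but discards any column that contains a NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the three columns contain a missing value.
df = pd.DataFrame({
    "species": ["oak", "ash", "elm"],
    "height": [12.0, np.nan, 9.5],
    "leaf_area": [30.1, 22.4, np.nan],
})

# axis=1 drops columns instead of rows.
df_cleaned = df.dropna(axis=1)

print(df_cleaned)
```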

Question 3

Q

How do you remove missing values from a dataframe without creating a new one?

Answer

A

df.dropna(inplace=True)

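Use inplace=True if you want to modify the DataFrame itself rather than create a new one. The call then returns None, so do not assign its result.

The operation is done ‘in place’. BE CAREFUL WHEN USING IN PLACE: the dropped rows are gone from df and cannot be recovered unless you kept a copy.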
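One cautious way to use the in-place form is to keep a copy first. A small sketch with made-up data (the name backup is just an illustrative choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [12.0, np.nan, 9.5]})

# Keep a copy so the original rows can still be recovered.
backup = df.copy()

# Modifies df directly; the call itself returns None.
result = df.dropna(inplace=True)

print(len(df), len(backup), result)
```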

Question 4

Q

What attribute lets you see what kind of data you have in your dataframe?

Answer

A

df.dtypes
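A short sketch of what df.dtypes reports, using a made-up table (column names are assumptions). It gives one dtype per column:

```python
import pandas as pd

df = pd.DataFrame({
    "count": [1, 2, 3],
    "height": [1.5, 2.0, 3.25],
    "flowering": [True, False, True],
})

# A Series mapping each column name to its dtype.
print(df.dtypes)
```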

Question 5

Q

Common datatypes in python and what they mean

Answer

A

* int: Integer values (whole numbers without decimals).
  * Example: 5, -2, 1000
* float: Floating-point numbers (decimals or real numbers).
  * Example: 3.14, -2.7, 100.00
* str: Strings (text or sequences of characters).
  * Example: "Hello", "123", 'a'
* bool: Boolean values representing True or False.
  * Example: True, False
* category: Categorical data, used for columns with a limited number of possible values (like factors in R).
* int64 and float64: The 64-bit integer and floating-point dtypes that pandas uses by default for numeric columns.
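To make the category and int64 entries concrete, here is a hedged sketch using astype on made-up data (the column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["oak", "ash", "oak"],
    "count": ["3", "5", "7"],  # numbers stored as text
})

# A text column with few distinct values can be stored as a category.
df["species"] = df["species"].astype("category")

# Text that holds whole numbers can be converted to 64-bit integers.
df["count"] = df["count"].astype("int64")

print(df.dtypes)
```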

Question 6

Q

When might you need to convert between data types?

Question 7

Q

What is a variable?

Answer

A

(Column) A variable is a specific characteristic or attribute being measured. For instance, in a plant study, variables might include height, leaf area, or the number of stomatal complexes. A variable contains all values that measure the same underlying attribute. Each column in a dataset typically represents a variable, and each variable can have many values across different observations.

Question 8

Q

What is an observation?

Answer

A

(Row) An observation represents a single entity or case in the dataset, often corresponding to a single row. In a plant dataset, an observation could be one individual plant, one measurement site, or one experimental condition (like one sample of rice). Each observation has a set of values for each variable measured.

Question 9

Q

In what two ways are values organised?

Answer

A

Every value belongs to both a variable and an observation, and each value must have its own cell.

Question 10

Q

What is melting?

Answer

A

Melting data is a data transformation technique that converts a wide-format dataset into a long-format dataset. This process is often used to make datasets more compatible with statistical and plotting tools that prefer long data formats, e.g. seaborn in Python.

Question 11

Q

How do you melt data in python?

Answer

A

df.melt()

pd.melt(df, id_vars=['Identifying column'], value_vars=['a', 'b', 'c'], var_name='column', value_name='value')

id_vars are the columns that identify each observation; they are kept as they are. value_vars are the columns that get stacked into rows: their names go into the new var_name column and their values into the new value_name column. For example, if the columns are 'Species', 'height' and 'weight' and we are comparing species, we would use 'Species' as the identifier variable (id_vars) and 'height' and 'weight' as the value_vars.
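The Species, height and weight example can be sketched as follows. The values and the new column names measurement and value are made up for illustration:

```python
import pandas as pd

# Wide format: one column per measured variable.
wide = pd.DataFrame({
    "Species": ["oak", "ash"],
    "height": [12.0, 9.5],
    "weight": [300.0, 210.0],
})

# Long format: one row per (Species, measurement) pair.
long = wide.melt(
    id_vars=["Species"],
    value_vars=["height", "weight"],
    var_name="measurement",
    value_name="value",
)

print(long)
```

Each of the two species now appears twice, once per measurement, giving four rows.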

Question 12

Q

What is casting data?

Answer

A

Casting data is the opposite of melting: it transforms long-format data back into a wide-format dataset. This operation is especially useful when you want to rearrange data so that each variable has its own column again, making it easier to compare variables side by side or perform certain analyses.

Question 13

Q

How do you cast data in python?

Answer

A

df.pivot()
df2 = pd.pivot(df, index=['id', 'date'], columns='element', values='value')
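A small sketch of casting with pivot, assuming a long table with made-up id, element and value columns (a single index column is used here to keep it short):

```python
import pandas as pd

# Long format: one row per (id, element) pair.
long = pd.DataFrame({
    "id": ["p1", "p1", "p2", "p2"],
    "element": ["height", "weight", "height", "weight"],
    "value": [12.0, 300.0, 9.5, 210.0],
})

# Each distinct value in 'element' becomes its own column.
wide = long.pivot(index="id", columns="element", values="value")

# Turn the 'id' index back into an ordinary column.
wide = wide.reset_index()

print(wide)
```

Note that pivot raises an error if the same index and column combination appears more than once.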

Question 14

Q

Briefly summarise variables, observations and values

Question 15

Q

When dropping data using df.drop(1), does it default to rows or columns?

Answer

A

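Rows. The 1 is the index label of the row to drop. If you want to drop columns you must specify this with df = df.drop(columns=['COLUMN_NAME']).

With the default index, labels count from 0, so the first row has index 0 and the second has index 1. drop returns a new DataFrame, so assign the result to keep it.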
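A brief sketch of both forms on a made-up table with the default 0, 1, 2 index:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["oak", "ash", "elm"],
    "height": [12.0, 9.5, 7.0],
})

# Drops the row whose index label is 1 (the second row here).
without_row = df.drop(1)

# Drops a column by name.
without_col = df.drop(columns=["height"])

print(without_row)
print(without_col)
```

The remaining rows keep their original labels (0 and 2); they are not renumbered automatically.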

Question 16

Q

How would you organise data in one column into ascending order (increasing as you go down)?

Answer

A

df.sort_values(by='Column name', ascending=True)

Question 17

Q

How would you organise data in one column into descending order (decreasing as you go down)?

Answer

A

df.sort_values(by='Column name', ascending=False)
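Both sort directions can be sketched on a small made-up table (the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["oak", "ash", "elm"],
    "height": [12.0, 9.5, 7.0],
})

# Smallest height first.
ascending = df.sort_values(by="height", ascending=True)

# Largest height first.
descending = df.sort_values(by="height", ascending=False)

print(ascending)
print(descending)
```

sort_values returns a new DataFrame, so df keeps its original order unless you reassign it.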

Question 18

Q

What's the fastest way to find the mean, standard deviation, minimum and maximum of a column?

Answer

A

df.describe()

For a single column: df['Column name'].describe()
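A minimal sketch of reading individual statistics out of the summary, using a made-up height column:

```python
import pandas as pd

df = pd.DataFrame({"height": [2.0, 4.0, 6.0]})

# describe() returns count, mean, std, min, quartiles and max.
summary = df["height"].describe()

print(summary)
print(summary["mean"], summary["std"], summary["min"], summary["max"])
```

The std reported here is the sample standard deviation (it divides by n - 1).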

Question 19

Q

How would you quickly get a list of the column names?

Answer

A

df.columns
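df.columns returns a pandas Index rather than a plain list. A short sketch, with made-up column names, of turning it into a list or into one comma-separated string:

```python
import pandas as pd

df = pd.DataFrame({"species": ["oak"], "height": [12.0], "weight": [300.0]})

# A plain Python list of the column names.
names = df.columns.tolist()

# The same names joined into a single string.
names_text = ", ".join(names)

print(names)
print(names_text)
```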

Question 20

Q

What does .groupby do?

Answer

A

Function: Used to split data into groups based on a column or set of columns, enabling grouped operations (e.g., calculating statistics within each group).

Example: df.groupby('column_name').mean() calculates the mean for each group defined by unique values in column_name.

Works best with data in long format.
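A small sketch of a grouped mean on made-up long-format data (column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["oak", "oak", "ash", "ash"],
    "height": [10.0, 14.0, 8.0, 10.0],
})

# Split by species, then average the heights within each group.
means = df.groupby("species")["height"].mean()

print(means)
```

Selecting the height column before calling mean() avoids trying to average any non-numeric columns.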

Question 21

Q

How do you split values in one column into two?

Answer

A

Series.str.split()

Series.str.split(pat=None, n=-1, expand=False)

Used when you need to separate values contained in a single string column into multiple columns. pat is the delimiter to split the string on (whitespace if None). n is the maximum number of splits (-1 means no limit). expand: if True, returns a DataFrame with separate columns; if False, returns a Series of lists.
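A hedged sketch of splitting one column into two, assuming a made-up sample column whose values look like "oak_1":

```python
import pandas as pd

df = pd.DataFrame({"sample": ["oak_1", "ash_2"]})

# Split on the underscore, at most once, and spread the parts over columns.
parts = df["sample"].str.split(pat="_", n=1, expand=True)

# Store the two parts under new, illustrative column names.
df[["species", "replicate"]] = parts

print(df)
```

The pieces are still strings, so replicate holds "1" and "2" rather than numbers.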

Question 22

Q

How do you rearrange column order?

Answer

A

Get the current column order as a list, build a new list in the order you want, then index the DataFrame with it:

cols = df.columns.tolist()
cols

cols_new = [cols[1], cols[3], cols[2], cols[0]]
cols_new

df = df[cols_new]
df

Question 23

Q

How do you sort by multiple columns in ascending order?

Answer

A

df_sorted = df.sort_values(by=['Name of column', 'Name of other column'], ascending=[True, True])
df_sorted
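A sketch of how the second column breaks ties in the first, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["oak", "ash", "oak", "ash"],
    "height": [14.0, 10.0, 10.0, 8.0],
})

# Sort by species first; within each species, sort by height.
df_sorted = df.sort_values(by=["species", "height"], ascending=[True, True])

print(df_sorted)
```

The ascending list has one entry per column in by, so the directions can also be mixed.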

Question 24

Q

What is transposing data and how do you do it?

Answer

A

df.transpose()

Rows become columns and columns become rows. Transposing twice returns the original layout. df.T is a shorthand for the same operation.
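A minimal sketch of transposing a made-up table whose rows are labelled by species:

```python
import pandas as pd

df = pd.DataFrame(
    {"height": [12.0, 9.5], "weight": [300.0, 210.0]},
    index=["oak", "ash"],
)

# Row labels become column names and column names become row labels.
transposed = df.transpose()

# Transposing again gives back the original layout.
restored = transposed.transpose()

print(transposed)
```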

Question 25

Q

How do you save a pandas dataframe to an Excel file?

Answer

A

df.to_excel('output.xlsx', index=False)

'output.xlsx' is the name of the file to write; index=False leaves the row index out of the file.
