Cleaning dataframes Flashcards

1
Q

How do you drop rows with missing values but leave the columns they’re in intact?

A

df_cleaned = df.dropna()

axis=0 (the default) specifies that rows should be dropped.

  • dropna(): Removes rows that contain at least one missing value (NaN).
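
A minimal sketch of the row-dropping behaviour, using a small made-up dataframe (the column names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two of the three rows contain a NaN
df = pd.DataFrame({"height": [1.2, np.nan, 3.1],
                   "weight": [10.0, 12.5, np.nan]})

# axis=0 is the default, so rows with any NaN are dropped
df_cleaned = df.dropna()

print(df_cleaned)                # only the fully populated row remains
print(list(df_cleaned.columns))  # both columns are still present
```

The original df is left unchanged; only df_cleaned has the rows removed.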
2
Q

How do you drop columns with missing values?

A

df_cleaned = df.dropna(axis=1)
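
As a rough illustration, with a made-up dataframe (the names are assumptions), axis=1 removes every column that contains at least one NaN and keeps all rows:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: only the "height" column contains a NaN
df = pd.DataFrame({"species": ["a", "b", "c"],
                   "height": [1.2, np.nan, 3.1]})

# axis=1 drops columns (not rows) that contain missing values
df_cleaned = df.dropna(axis=1)

print(df_cleaned)
```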

3
Q

How do you remove missing values from a dataframe without creating a new one?

A

df.dropna(inplace=True)

Use inplace=True if you want to modify the DataFrame in place without creating a new one.

Be careful when using inplace: the operation overwrites df directly, returns None, and the changes cannot be undone.
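
One way to stay safe is to keep a copy before modifying in place. A small sketch, assuming a made-up dataframe (the name backup is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value
df = pd.DataFrame({"height": [1.2, np.nan, 3.1]})

# Keep a copy first, because the in-place change cannot be undone
backup = df.copy()

# With inplace=True the method returns None and changes df itself
result = df.dropna(inplace=True)

print(result)       # None
print(len(df))      # rows left after dropping
print(len(backup))  # the copy still has every row
```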

4
Q

What attribute lets you see what kind of data you have in your dataframe?

A

df.dtypes
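
A short sketch of what this looks like on a made-up dataframe (column names are assumptions). Note there are no parentheses, because dtypes is an attribute rather than a method:

```python
import pandas as pd

# Hypothetical example data with three different kinds of column
df = pd.DataFrame({"count": [1, 2, 3],
                   "height": [1.5, 2.5, 3.5],
                   "flag": [True, False, True]})

# One dtype per column, returned as a Series indexed by column name
print(df.dtypes)
```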

5
Q

Common datatypes in python and what they mean

A

* int: Integer values (whole numbers without decimals).
  * Example: 5, -2, 1000
* float: Floating-point numbers (decimals or real numbers).
  * Example: 3.14, -2.7, 100.00
* str: Strings (text or sequences of characters).
  * Example: "Hello", "123", 'a'
* bool: Boolean values representing True or False.
  * Example: True, False
* category: Categorical data, often used for columns with a limited number of possible values (like factors).
* int64 and float64: The 64-bit integer and floating-point types that pandas uses by default for numeric columns.
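
As an illustration of the category type, here is a minimal sketch that converts a made-up text column with astype (the data is an assumption for the example):

```python
import pandas as pd

# Hypothetical example data: a column with only two distinct values
crops = pd.Series(["rice", "wheat", "rice"])

# Convert to the category dtype; pandas stores the distinct values once
crops_cat = crops.astype("category")

print(crops_cat.dtype)
print(list(crops_cat.cat.categories))
```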

6
Q

When might you need to convert different data-types?

A
7
Q

What is a variable?

A

(Column) A variable is a specific characteristic or attribute being measured. For instance, in a plant study, variables might include height, leaf area, or the number of stomatal complexes. A variable contains all values that measure the same underlying attribute. Each column in a dataset typically represents a variable, and each variable can have many values across different observations.

8
Q

What is an observation?

A

(Row) An observation represents a single entity or case in the dataset, often corresponding to a single row. In a plant dataset, an observation could be one individual plant, one measurement site, or one experimental condition (like one sample of rice). Each observation has a set of values for each variable measured.

9
Q

In what two ways are values organised?

A

Every value belongs to a variable and an observation, and each value must have its own cell.

10
Q

What is melting?

A

Melting is a data transformation that converts a wide-format dataset into a long-format dataset. It is often used to make datasets more compatible with statistical and plotting tools that prefer long data formats, e.g. seaborn in Python.

11
Q

How do you melt data in python?

A

df.melt()

df.melt(id_vars=['Identifiable column'], value_vars=['a', 'b', 'c'], var_name='column', value_name='value')

id_vars are the identifier columns that stay as they are; value_vars are the columns that get stacked into rows. For example, if the columns are 'Species', 'height' and 'weight', and we are comparing species, we would use 'Species' as the identifier variable (id_vars) and 'height' and 'weight' as the value_vars. The old column names end up in the var_name column and the measurements in the value_name column.
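
The Species example can be sketched like this, using made-up measurements (the values and the name 'measurement' are assumptions for illustration):

```python
import pandas as pd

# Hypothetical wide-format data: one row per species
wide = pd.DataFrame({"Species": ["A", "B"],
                     "height": [10, 20],
                     "weight": [1.5, 2.5]})

# Keep Species as the identifier; stack height and weight into rows
long = wide.melt(id_vars=["Species"],
                 value_vars=["height", "weight"],
                 var_name="measurement",
                 value_name="value")

print(long)  # one row per species per measurement
```

Each species now appears once per measured variable, which is the long format.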

12
Q

What is casting data?

A

Casting data is the opposite of melting: it transforms long-format data back into a wide-format dataset. This operation is especially useful when you want to rearrange data so that each variable has its own column again, making it easier to compare variables side by side or perform certain analyses.

13
Q

How do you cast data in python?

A

df.pivot()
df2 = pd.pivot(df, index=['id', 'date'], columns='element', values='value')
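
A small sketch of casting, assuming made-up long-format data with a single 'id' identifier (the values are invented for the example):

```python
import pandas as pd

# Hypothetical long-format data: one row per id per element
long = pd.DataFrame({"id": [1, 1, 2, 2],
                     "element": ["tmax", "tmin", "tmax", "tmin"],
                     "value": [30, 20, 28, 18]})

# Each distinct value in "element" becomes its own column
wide = long.pivot(index="id", columns="element", values="value")

print(wide)
```

pivot needs each index/column combination to appear only once; duplicates raise an error.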

14
Q

Briefly summarise variables, observations and values

A
15
Q

When dropping data using df.drop(1), does it default to rows or columns?

A

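Rows. df.drop(1) drops the row whose index label is 1. To drop columns you must say so explicitly: df = df.drop(columns=['COLUMN_NAME'])

drop() matches index labels, not positions. With the default index (0, 1, 2, ...), the first row has label 0 and the second has label 1, so df.drop(1) removes the second row.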
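
A minimal sketch of both cases, using a made-up dataframe with the default index (the column names are assumptions):

```python
import pandas as pd

# Hypothetical example data with the default index 0, 1, 2
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Drops the row labelled 1, which is the second row here
no_row = df.drop(1)

# Dropping a column has to be requested explicitly
no_col = df.drop(columns=["b"])

print(no_row)
print(no_col)
```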

16
Q

How would you organise data in one column into ascending order (increasing as you go down)?

A

df.sort_values(by='Column name', ascending=True)

17
Q

How would you organise data in one column into descending order (decreasing as you go down)?

A

df.sort_values(by='Column name', ascending=False)
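
Both directions can be sketched on a made-up dataframe (the column names and values are assumptions):

```python
import pandas as pd

# Hypothetical example data in no particular order
df = pd.DataFrame({"name": ["x", "y", "z"], "height": [3, 1, 2]})

# ascending=True is the default; sort_values returns a new dataframe
asc = df.sort_values(by="height", ascending=True)
desc = df.sort_values(by="height", ascending=False)

print(asc)
print(desc)
```

The whole row moves with the sorted value, so each name stays attached to its height.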

18
Q

What's the fastest way to find the mean, std, minimum value and maximum value in a column?

A

df.describe()

For a single column: df['Column name'].describe()
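
A quick sketch on a made-up column (the name 'height' and the values are assumptions), showing how to pull individual statistics out of the result:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"height": [1.0, 2.0, 3.0, 4.0]})

# describe() returns count, mean, std, min, quartiles and max
stats = df["height"].describe()

print(stats)
print(stats["mean"], stats["min"], stats["max"])
```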

19
Q

How would you quickly get a list of the column names?

A

df.columns

This returns an Index of the column names; use df.columns.tolist() for a plain Python list.

20
Q

What does .groupby do?

A

Function: Used to split data into groups based on a column or set of columns, enabling grouped operations (e.g., calculating statistics within each group).

Example: df.groupby('column_name').mean() calculates the mean for each group defined by the unique values in column_name.

Works best with data in long format.
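
A minimal sketch of a grouped mean on made-up long-format data (the column names are assumptions). Selecting the numeric column before aggregating avoids problems with non-numeric columns:

```python
import pandas as pd

# Hypothetical long-format data: several measurements per species
df = pd.DataFrame({"species": ["a", "a", "b"],
                   "height": [1.0, 3.0, 5.0]})

# Split by species, then take the mean height within each group
means = df.groupby("species")["height"].mean()

print(means)
```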

21
Q

How do you split values in one column into two?

A

Series.str.split()

Series.str.split(pat=None, n=-1, expand=False)

Used when you need to separate values contained in a single string column into multiple columns. pat is the delimiter to split on (whitespace if None). n is the maximum number of splits (-1 means all). expand: if True, returns a DataFrame with separate columns; if False, returns a Series of lists.
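
A small sketch, assuming a made-up column where two pieces of information are joined by an underscore (the names 'sample', 'crop' and 'plot' are hypothetical):

```python
import pandas as pd

# Hypothetical example data: crop name and plot number in one string
df = pd.DataFrame({"sample": ["rice_1", "wheat_2"]})

# expand=True gives one column per piece instead of a list per row
parts = df["sample"].str.split(pat="_", n=1, expand=True)

# Assign the two pieces to new, named columns
df[["crop", "plot"]] = parts

print(df)
```

The pieces are still strings, so the plot numbers would need converting if they are to be used as numbers.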

22
Q

How do you rearrange column order?

A

# Get the current column order as a list
cols = df.columns.tolist()
cols

# Build the new order by picking columns by position
cols_new = [cols[1], cols[3], cols[2], cols[0]]
cols_new

# Select the columns in the new order
df = df[cols_new]
df

23
Q

How do you sort by multiple columns in ascending order?

A

df_sorted = df.sort_values(by=['Name of column', 'Name of other column'], ascending=[True, True])
df_sorted
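
A sketch of how the second column acts as a tie-breaker, using made-up data (the column names are assumptions):

```python
import pandas as pd

# Hypothetical example data with a repeated name
df = pd.DataFrame({"name": ["b", "a", "a"], "score": [1, 3, 2]})

# Sort by name first; rows with the same name are then sorted by score
df_sorted = df.sort_values(by=["name", "score"], ascending=[True, True])

print(df_sorted)
```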

24
Q

What is transposing data and how do you do it?

A

df.transpose()

Rows become columns and columns become rows. Transposing twice gives back the original layout.
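
A minimal sketch on a made-up dataframe with named rows (the labels are assumptions), so the switch is easy to see:

```python
import pandas as pd

# Hypothetical example data with row labels r1 and r2
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# Row labels become column names and column names become row labels
t = df.transpose()

print(t)
```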

25
Q

Saving a pandas dataframe to an excel file

A

df.to_excel('output.xlsx', index=False)

'output.xlsx' is the name of the file to write; index=False leaves the row index out of the file.
