Cleaning dataframes Flashcards
How do you drop rows with missing values but leave the columns they’re in intact?
df_cleaned = df.dropna()
axis=0 (the default) specifies that rows should be dropped.
- dropna(): Removes rows that contain at least one missing value (NaN).
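As a rough illustration of the row-wise behaviour, here is a minimal sketch. The column names and values are made up for the example and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two of the three rows contain a missing value.
df = pd.DataFrame({
    "height": [10.0, np.nan, 12.5],
    "weight": [1.2, 0.9, np.nan],
})

# Rows with at least one NaN are dropped; both columns are kept.
df_cleaned = df.dropna()
print(df_cleaned)
```

The original df is left unchanged here, because dropna() returns a new DataFrame by default.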
How do you drop columns with missing values?
df_cleaned = df.dropna(axis=1)
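A small sketch of the column-wise version, again with made-up column names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: only the "height" column has a missing value.
df = pd.DataFrame({
    "species": ["a", "b", "c"],
    "height": [10.0, np.nan, 12.5],
})

# Columns with at least one NaN are dropped; all rows are kept.
df_cleaned = df.dropna(axis=1)
print(df_cleaned)
```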
How do you drop missing values without creating a new dataframe?
df.dropna(inplace=True)
Use inplace=True if you want to modify the DataFrame in place without creating a new one.
The operation is done ‘in place’. BE CAREFUL WHEN USING INPLACE: the original data is overwritten and the changes cannot be undone unless you kept a copy.
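One way to stay safe is to keep a copy before modifying in place. The sketch below assumes a small made-up DataFrame; the name backup is just illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value.
df = pd.DataFrame({"height": [10.0, np.nan, 12.5]})

# Keep a copy first, since the in-place change cannot be reversed.
backup = df.copy()

# With inplace=True the call modifies df itself and returns None.
result = df.dropna(inplace=True)

print(result)       # None
print(len(df))      # rows left in the modified DataFrame
print(len(backup))  # rows in the untouched copy
```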
What attribute lets you see what kind of data you have in each column of your dataframe?
df.dtypes
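A quick sketch of what this looks like in practice, using invented columns. Note that dtypes is an attribute, so it is written without brackets.

```python
import pandas as pd

# Hypothetical example data with one column per basic type.
df = pd.DataFrame({
    "count": [1, 2, 3],
    "height": [1.5, 2.0, 3.25],
    "flag": [True, False, True],
})

# dtypes lists the data type of every column.
print(df.dtypes)
```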
Common datatypes in Python and what they mean
* int: Integer values (whole numbers without decimals).
* Example: 5, -2, 1000
* float: Floating-point numbers (decimals or real numbers).
* Example: 3.14, -2.7, 100.00
* str: Strings (text or sequences of characters).
* Example: "Hello", "123", 'a'
* bool: Boolean values representing True or False.
* Example: True, False
* category: Categorical data, often used for columns with a limited number of possible values (like factors).
* int64 and float64: The 64-bit integer and floating-point types that pandas normally uses for numeric columns.
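The category type from the list above can be seen with a short sketch. The species names are invented for the example.

```python
import pandas as pd

# Hypothetical example data: a text column with only a few distinct values.
df = pd.DataFrame({"species": ["rice", "wheat", "rice", "maize"]})

# Convert the column to the categorical type.
df["species"] = df["species"].astype("category")

print(df["species"].dtype)                 # the column's dtype is now category
print(list(df["species"].cat.categories))  # the distinct values it can take
```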
When might you need to convert between data types?
What is a variable?
(Column) A variable is a specific characteristic or attribute being measured. For instance, in a plant study, variables might include height, leaf area, or the number of stomatal complexes. A variable contains all values that measure the same underlying attribute. Each column in a dataset typically represents a variable, and each variable can have many values across different observations.
What is an observation?
(Row) An observation represents a single entity or case in the dataset, often corresponding to a single row. In a plant dataset, an observation could be one individual plant, one measurement site, or one experimental condition (like one sample of rice). Each observation has a set of values for each variable measured.
What two ways are values organised?
Every value belongs to both a variable and an observation, and each value must have its own cell.
What is melting?
Melting data is a data transformation technique that converts a wide-format dataset into a long-format dataset. This process is often used to make datasets more compatible with statistical and plotting tools that prefer long data formats, e.g. seaborn in Python.
How do you melt data in Python?
df.melt()
df.melt(id_vars=['Identifiable column'], value_vars=['a', 'b', 'c'], var_name='column', value_name='value')
id_vars are the columns that identify each observation; they are kept as they are. value_vars are the measured columns that get stacked into rows. For example, if our columns are ‘Species’, ‘height’ and ‘weight’, and we are measuring differences between species, we would use ‘Species’ as our identifier variable (id_vars) and ‘height’ and ‘weight’ as our value_vars. The old column names go into a new column named by var_name, and their values go into a new column named by value_name.
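The Species, height and weight example can be sketched as follows. The data values and the new column name measurement are made up for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one row per species, one column per measurement.
df = pd.DataFrame({
    "Species": ["rice", "wheat"],
    "height": [100, 80],
    "weight": [5, 4],
})

# Melt to long format: one row per (species, measurement) pair.
long_df = df.melt(
    id_vars=["Species"],               # identifier column, kept as it is
    value_vars=["height", "weight"],   # columns to stack into rows
    var_name="measurement",            # new column holding the old column names
    value_name="value",                # new column holding the values
)
print(long_df)
```

The two rows and two measured columns become four rows, each holding a single value.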
What is casting data?
Casting data is the opposite of melting: it transforms long-format data back into a wide-format dataset. This operation is especially useful when you want to rearrange data so that each variable has its own column again, making it easier to compare variables side by side or perform certain analyses.
How do you cast data in Python?
df.pivot()
df2 = pd.pivot(df, index=['id', 'date'], columns='element', values='value')
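A minimal sketch of casting with the same argument names as above. The ids, dates and values are invented example data, and the final reset_index() step is an optional extra that turns the index back into ordinary columns.

```python
import pandas as pd

# Hypothetical long-format data: one row per (id, date, element) combination.
long_df = pd.DataFrame({
    "id": ["p1", "p1", "p2", "p2"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-01"],
    "element": ["height", "weight", "height", "weight"],
    "value": [100, 5, 80, 4],
})

# Cast to wide format: each distinct value in "element" becomes its own column.
wide_df = pd.pivot(long_df, index=["id", "date"], columns="element", values="value")

# Optional: move "id" and "date" from the index back into regular columns.
wide_df = wide_df.reset_index()
print(wide_df)
```

Note that pivot raises an error if the same index and column combination appears more than once, so each (id, date, element) combination must be unique.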
Briefly summarise variables, observations and values
When dropping data using df.drop(1), does it default to rows or columns?
Rows. df.drop(1) drops the row whose index label is 1. If you want to drop columns you must specify this, e.g. df = df.drop(columns=['COLUMN_NAME'])
It is based on index labels: with the default index the first row has label 0 and the second has label 1, so df.drop(1) removes the second row.
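A short sketch of both kinds of drop, using made-up data. The last part assumes a custom index to show that drop() looks up index labels rather than row positions.

```python
import pandas as pd

# Hypothetical example data with the default index (0, 1, 2).
df = pd.DataFrame({"name": ["a", "b", "c"], "height": [1, 2, 3]})

# Drops the row whose index label is 1 (the second row here).
without_row = df.drop(1)

# Drops a column by name.
without_col = df.drop(columns=["height"])

# With a custom index, drop() expects a label from that index.
df_custom = df.set_index("name")
without_b = df_custom.drop("b")

print(without_row)
print(without_col)
print(without_b)
```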