Combining datasets Flashcards
What are the three main ways to combine data?
concat()
merge()
join()
What is concat and what do all the components mean within it?
concat() is used to append one (or more) dataframes one below the other (or next to each other, depending on whether the axis option is set to 0 or 1).
The function takes the form pd.concat([dataframes], axis, join, keys...).
- [dataframes] is the list of dataframes you want to concatenate.
- axis specifies the axis to concatenate along.
- join is the type of join (inner or outer). The default for pd.concat() is outer.
- keys allows you to add labels to the resulting dataframe so you can determine where the data came from.
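A minimal sketch of the join and keys options, using two made-up dataframes (df1, df2 and their column names are assumptions for illustration only):

```python
import pandas as pd

# Two small hypothetical dataframes that share only column "b"
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Defaults: axis=0 (stack rows) and join="outer" (keep every column, fill gaps with NaN)
outer = pd.concat([df1, df2])

# join="inner" keeps only the columns present in every dataframe
inner = pd.concat([df1, df2], join="inner")

# keys adds an outer index level recording which dataframe each row came from
labelled = pd.concat([df1, df2], keys=["first", "second"])

print(outer.shape)             # (4, 3)
print(list(inner.columns))     # ['b']
print(labelled.index.nlevels)  # 2
```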
What kind of indexing does python use?
Python uses zero-based indexing, so the first item is at position 0.
*In Python’s pandas library, an index is a label that identifies each row in a DataFrame.
Types of joins
What does df.merge do?
Joins two dataframes on the columns they share (or on the columns named with on=), using an inner join by default (as in keeping only the rows whose key values appear in both dataframes).
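As a rough illustration of the inner-join default, here is a sketch with two invented dataframes (left, right and the id/height/weight columns are hypothetical):

```python
import pandas as pd

# Hypothetical data: only ids 2 and 3 appear in both dataframes
left = pd.DataFrame({"id": [1, 2, 3], "height": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "weight": [5, 6, 7]})

# Default how="inner": only the overlapping ids survive
inner = left.merge(right, on="id")

# how="outer" keeps every id from either side, filling gaps with NaN
outer = left.merge(right, on="id", how="outer")

print(inner["id"].tolist())  # [2, 3]
print(len(outer))            # 4
```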
What does df.join() do?
Joins on indexes by default and gives a left join (keeping every row of the calling dataframe), e.g. df1.join(df2). Pass how='outer' to show all the rows from both dfs.
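A small sketch of joining on the index, assuming two made-up dataframes with partly overlapping row labels. In pandas, DataFrame.join uses how="left" unless told otherwise, so the outer result has to be requested explicitly:

```python
import pandas as pd

# Hypothetical dataframes indexed by label; only "b" is shared
df1 = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"y": [10, 20]}, index=["b", "d"])

# Default how="left": keeps the rows of df1, with NaN where df2 has no match
left = df1.join(df2)

# how="outer": keeps the row labels from both dataframes
outer = df1.join(df2, how="outer")

print(list(left.index))     # ['a', 'b', 'c']
print(sorted(outer.index))  # ['a', 'b', 'c', 'd']
```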
What's the difference between df.loc and df.iloc?
df.iloc[start row:end row, start column:end column]
* .loc: Uses label-based indexing, meaning you specify rows and columns using their labels (names). A label slice includes its end label.
* .iloc: Uses integer-based indexing, meaning you specify rows and columns by their numerical positions. A position slice excludes its end position.
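To make the contrast concrete, a short sketch on an invented dataframe (the row labels r1 to r3 and the columns are assumptions):

```python
import pandas as pd

# Hypothetical dataframe with string row labels
df = pd.DataFrame(
    {"Age": [25, 17, 40], "Soil": ["Clay", "Loam", "Sand"]},
    index=["r1", "r2", "r3"],
)

# .loc selects by label; the slice "r1":"r2" includes both ends
by_label = df.loc["r1":"r2", "Age"]

# .iloc selects by position; the slice 0:2 stops before position 2
by_position = df.iloc[0:2, 0]

print(by_label.tolist())     # [25, 17]
print(by_position.tolist())  # [25, 17]
print(df.iloc[0, 1])         # Clay
```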
What does it mean to filter dataframes with conditional masks?
You apply a condition to the rows and keep only the rows that satisfy it. For example, to remove anyone under 18 in the column Age:
df_old = df[df['Age'] >= 18]
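A sketch of what the mask looks like on its own, using a made-up dataframe (the Name column and the values are assumptions). Note that >= 18 keeps 18-year-olds, whereas > 18 would drop them:

```python
import pandas as pd

# Hypothetical data with one person on each side of the cut-off and one exactly on it
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [17, 18, 42]})

# The condition produces a boolean Series, one True/False per row
mask = df["Age"] >= 18

# Indexing with the mask keeps only the rows marked True
adults = df[mask]

print(mask.tolist())            # [False, True, True]
print(adults["Name"].tolist())  # ['Bob', 'Cy']
```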
What do keys do?
It is often useful to add a label to our data, so that we know which dataset it originated from:
df = pd.concat([infected, control], keys=["infected", "control"], axis=0)
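One way to use those labels afterwards is to select rows by key. A sketch, assuming tiny stand-in versions of the infected and control dataframes (the cells column and its values are invented):

```python
import pandas as pd

# Stand-in data; the real infected/control dataframes are not shown in the notes
infected = pd.DataFrame({"cells": [5, 7]})
control = pd.DataFrame({"cells": [1, 2]})

df = pd.concat([infected, control], keys=["infected", "control"], axis=0)

# Each key becomes the outer level of the row index
print(df.index.tolist())
# [('infected', 0), ('infected', 1), ('control', 0), ('control', 1)]

# Selecting on the outer level returns just the rows from that source
print(df.loc["control"]["cells"].tolist())  # [1, 2]
```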
How to find specific values in a dataset
.isin
Another useful way to filter dataframes is to extract rows that contain values within a specified list. To do this, we use the isin command. For example, we could select the rows from the count dataframe that contain the soils 'Clay' or 'Loam' using count[count['Soil'].isin(['Clay', 'Loam'])].
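A runnable sketch of the same idea, assuming a small invented count dataframe (the Worms column and all values are hypothetical). Prefixing the mask with ~ inverts it:

```python
import pandas as pd

# Hypothetical stand-in for the count dataframe
count = pd.DataFrame(
    {"Soil": ["Clay", "Sand", "Loam", "Silt"], "Worms": [3, 1, 4, 2]}
)

# Keep rows whose Soil value is in the list
selected = count[count["Soil"].isin(["Clay", "Loam"])]

# ~ flips the boolean mask, keeping rows whose Soil value is NOT in the list
excluded = count[~count["Soil"].isin(["Clay", "Loam"])]

print(selected["Soil"].tolist())  # ['Clay', 'Loam']
print(excluded["Soil"].tolist())  # ['Sand', 'Silt']
```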
How do you concatenate vertically?
axis = 0 (the default)
How do you concatenate data horizontally?
axis = 1
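A quick way to see the difference is to compare shapes. A sketch with three invented one-column dataframes (top, bottom and side are hypothetical names):

```python
import pandas as pd

# Hypothetical data: top and bottom share a column, side has a different one
top = pd.DataFrame({"a": [1, 2]})
bottom = pd.DataFrame({"a": [3, 4]})
side = pd.DataFrame({"b": [5, 6]})

# axis=0 stacks rows: 2 + 2 rows, still one column
vertical = pd.concat([top, bottom], axis=0)

# axis=1 places columns side by side, aligned on the row index
horizontal = pd.concat([top, side], axis=1)

print(vertical.shape)    # (4, 1)
print(horizontal.shape)  # (2, 2)
```

Vertical concatenation keeps each dataframe's original row labels (here 0, 1, 0, 1); passing ignore_index=True renumbers them.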