DataFrames Flashcards
Array
- one-dimensional
- ordered collection
- contains only one data type
Each item has an index and a value.
Three main components of tables
rows, columns, index
Rows of a table
“entry” or “observation”
Columns of a table
Each column of a table represents some attribute that entries (rows) have.
Index of a table
- The first column
- Meaningful or arbitrary
- Unique values
- Identify rows
Series
- The most basic pandas object
- Has two sections: the index and the values
- Under the hood, the values of a Series are typically stored as a NumPy array
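A minimal sketch of the two sections of a Series. The names and scores below are made-up illustration data, not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three made-up scores, labeled by name.
scores = pd.Series([90, 85, 77], index=["ana", "ben", "cy"])

index_part = scores.index        # the index (labels)
values_part = scores.to_numpy()  # the values, as a NumPy array

print(list(index_part))
print(type(values_part))
print(scores["ben"])  # look up a value by its index label
```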
DataFrames
- Pandas table object
- Contains an Index, Rows, and Columns
- Each column is a Series
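A small sketch showing that selecting one column of a DataFrame gives back a Series that shares the DataFrame's index. The movies table here is invented for illustration.

```python
import pandas as pd

# Hypothetical movies table (titles, years, and studios are made up).
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Year": [1945, 2003, 2010],
    "Studio": ["Fox", "Fox", "Universal"],
})

year_col = movies["Year"]  # selecting one column returns a Series

print(type(year_col))
print(movies.shape)  # (number of rows, number of columns)
```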
How do you read a DataFrame?
pd.read_csv(filepath)
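A sketch of pd.read_csv that runs without a file on disk. It assumes the CSV text below (made up for illustration) and wraps it in io.StringIO, which read_csv accepts in place of a filepath.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file.
csv_text = "Title,Year,Studio\nFilm A,1945,Fox\nFilm B,2003,Universal\n"

# read_csv accepts a file path or any file-like object.
movies = pd.read_csv(io.StringIO(csv_text))

print(movies)
```

With a real file, the call would be pd.read_csv("some_file.csv"), where the file name is whatever path you have.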
df.loc[]
- Accesses rows/columns by label
- Loc slicing is right-inclusive
- Syntax: df.loc[A:B, C:D]
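A sketch of what "right-inclusive" means, using a tiny made-up table. For contrast it also shows df.iloc, which slices by position and excludes the right endpoint; iloc is not covered in these cards and is included only as a comparison.

```python
import pandas as pd

# Hypothetical table with string row labels.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

# Label slicing: BOTH endpoints are included ("y" and "b" are kept).
by_label = df.loc["x":"y", "a":"b"]

# Position slicing: the right endpoint is excluded, like normal Python slices.
by_position = df.iloc[1:2, 0:1]

print(by_label)
print(by_position)
```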
filtering using df.loc[]
Ex: movies.loc[movies["Year"] < 1950]
- You can also filter by more than one condition using | (or) and & (and):
condition1 = movies["Year"] >= 2000
condition2 = movies["Studio"] == "Fox"
filtered_or = movies.loc[condition1 | condition2]
filtered_and = movies.loc[condition1 & condition2]
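A runnable sketch of the filtering pattern above, on a made-up movies table. It also shows the one-line form, where each condition needs its own parentheses because & and | bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical movies table (all values are made up).
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Year": [1945, 2003, 2010, 1999],
    "Studio": ["Fox", "Fox", "Universal", "Universal"],
})

condition1 = movies["Year"] >= 2000     # boolean Series, one entry per row
condition2 = movies["Studio"] == "Fox"

filtered_or = movies.loc[condition1 | condition2]   # either condition holds
filtered_and = movies.loc[condition1 & condition2]  # both conditions hold

# Same "and" filter written inline; the parentheses are required.
inline_and = movies.loc[(movies["Year"] >= 2000) & (movies["Studio"] == "Fox")]

print(filtered_or)
print(filtered_and)
```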
How do you assign columns to a DataFrame?
Using indexing/loc:
- df["column"] = data
Using df.assign():
- new_df = df.assign(label=data)
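A sketch contrasting the two ways of adding a column. The column names "Decade" and "IsOld" are hypothetical, chosen only for illustration. Indexing changes df itself, while df.assign() leaves df alone and returns a new DataFrame.

```python
import pandas as pd

# Hypothetical one-column table.
df = pd.DataFrame({"Year": [1945, 2003]})

# Indexing: adds the column to df in place.
df["Decade"] = df["Year"] // 10 * 10

# assign: returns a NEW DataFrame; df itself does not get "IsOld".
new_df = df.assign(IsOld=df["Year"] < 1950)

print(df.columns.tolist())
print(new_df.columns.tolist())
```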
How do you sort DataFrames?
df.sort_values()
Ex: movies.sort_values("Studio", ascending=True)
Ex: movies.sort_values("Year", ascending=False)
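A sketch of sort_values on a made-up movies table. It shows that the sorted result is a new DataFrame (the original keeps its order) and that a list of columns sorts by the first column, then breaks ties with the next.

```python
import pandas as pd

# Hypothetical movies table (all values are made up).
movies = pd.DataFrame({
    "Title": ["Film B", "Film A", "Film C"],
    "Year": [2003, 1945, 2010],
    "Studio": ["Fox", "Fox", "Universal"],
})

newest_first = movies.sort_values("Year", ascending=False)

# Sort by Studio first, then by Year within each studio.
by_studio_then_year = movies.sort_values(["Studio", "Year"])

print(newest_first)
print(by_studio_then_year)
```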
df.groupby()
Groups rows by certain column(s); returns a GroupBy object, which becomes a new df once you aggregate it (e.g. with .count())
Ex: df.groupby([col1, col2, …])
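A sketch of grouping by a single column on a made-up movies table. Calling groupby on its own does not produce a table yet; the result appears once an aggregation such as .count() is applied.

```python
import pandas as pd

# Hypothetical movies table (all values are made up).
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Year": [1945, 2003, 2010],
    "Studio": ["Fox", "Fox", "Universal"],
})

grouped = movies.groupby("Studio")      # a GroupBy object, not a DataFrame
per_studio = grouped["Title"].count()   # aggregate: number of titles per studio

print(type(grouped))
print(per_studio)
```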
Ways of grouping by 2 columns
- using df.groupby()
Ex: movies.groupby(["Year", "Studio"])["Title"].count().to_frame()
- using df.pivot_table()
Ex: pt = movies.pivot_table(values="Title", index="Year", columns="Studio", aggfunc="count")
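A sketch comparing the two approaches on the same made-up movies table. groupby gives a "long" result with one row per (Year, Studio) pair that occurs; pivot_table gives a "wide" grid with years as rows and studios as columns, with NaN where a pair never occurs.

```python
import pandas as pd

# Hypothetical movies table (all values are made up).
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Year": [2000, 2000, 2000, 2001],
    "Studio": ["Fox", "Fox", "Universal", "Fox"],
})

# Long form: the index is a (Year, Studio) MultiIndex.
counts = movies.groupby(["Year", "Studio"])["Title"].count().to_frame()

# Wide form: one row per Year, one column per Studio.
pt = movies.pivot_table(values="Title", index="Year", columns="Studio", aggfunc="count")

print(counts)
print(pt)
```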
Merging DataFrames
pd.merge()
Inner Join:
- This will only include rows with a match in both DataFrames.
Ex: pd.merge(adf, bdf, how="inner", on="x1")
Outer Join:
- This will retain all rows in both DataFrames.
Ex: pd.merge(adf, bdf, how="outer", on="x1")
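Left Join:
- This will use all rows from the first DataFrame.
Ex: pd.merge(adf, bdf, how="left", on="x1")
Right Join:
- This will use all rows from the second DataFrame.
Ex: pd.merge(adf, bdf, how="right", on="x1")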
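A sketch of all four joins side by side. The contents of adf and bdf are made up for illustration: they share the key column x1, with "C" appearing only in adf and "D" only in bdf. Rows without a match are filled with NaN.

```python
import pandas as pd

# Hypothetical tables sharing the key column "x1".
adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": [True, False, True]})

inner = pd.merge(adf, bdf, how="inner", on="x1")  # keys in both: A, B
outer = pd.merge(adf, bdf, how="outer", on="x1")  # keys in either: A, B, C, D
left = pd.merge(adf, bdf, how="left", on="x1")    # all keys from adf: A, B, C
right = pd.merge(adf, bdf, how="right", on="x1")  # all keys from bdf: A, B, D

for name, result in [("inner", inner), ("outer", outer), ("left", left), ("right", right)]:
    print(name)
    print(result)
```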