2. Data Frames Flashcards
Data Frames
multi-dimensional data
like an Excel sheet, a df is 2-dimensional: to get a value you need two points of reference (row and column)
automatic conversion to float
if a column contains any NaN value, even an integer column will be converted to float
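A minimal sketch of the conversion described above. The column name `points` and the values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the same integer column with and without a missing value.
complete = pd.DataFrame({"points": [10, 20, 30]})
with_gap = pd.DataFrame({"points": [10, np.nan, 30]})

print(complete["points"].dtype)  # an integer dtype
print(with_gap["points"].dtype)  # a float dtype, because NaN is itself a float
```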
df basic attributes (shared with Series)
df.index
df.values: a NumPy array
df.dtypes: the dtype of every column
df.shape => tuple (rows, columns)
df-specific attributes
df.columns
df.axes
df.info() => non-null count per column
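The attributes listed above can be tried on a small frame. This is only a sketch: the frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration.
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Salary": [100, 200, 300]})

print(df.index)    # row labels (a RangeIndex here)
print(df.values)   # the underlying data as a NumPy array
print(df.dtypes)   # one dtype per column
print(df.shape)    # tuple of (rows, columns)
print(df.columns)  # column labels
print(df.axes)     # list holding the row index and the column index
df.info()          # prints a summary, including non-null counts per column
```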
df methods (shared with Series)
df.sum() => total of each column (summing down the rows); axis defaults to 0
rev.sum(axis="columns") or rev.sum(axis=1) => sum across the columns, giving one total per row
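A short sketch of the two summing directions. The name `rev` follows the notes; its columns and values are hypothetical.

```python
import pandas as pd

# Hypothetical revenue table; column names are made up.
rev = pd.DataFrame({"NY": [1, 2, 3], "LA": [10, 20, 30]})

col_totals = rev.sum()                # axis=0 (default): one total per column
row_totals = rev.sum(axis="columns")  # same as axis=1: one total per row

print(col_totals)
print(row_totals)
```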
extracting column in df
df["Col1"]
df.Col1 => returns a Series
but this doesn't work for a column name containing a space
so avoid it
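A small sketch comparing the two syntaxes, assuming a made-up frame in which one column name contains a space.

```python
import pandas as pd

# Hypothetical frame; "Start Date" has a space in its name.
df = pd.DataFrame({"Col1": [1, 2], "Start Date": ["2020-01-01", "2021-06-15"]})

a = df["Col1"]        # bracket syntax
b = df.Col1           # attribute syntax, same result for this name
c = df["Start Date"]  # a name with a space can only be reached with brackets

print(type(a).__name__)
```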
extracting multiple columns
nba[["Name", "Team"]] => multiple columns
nba[mylist] => basically, pass a list inside the brackets of a Series or df
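A sketch of both forms, using a tiny hypothetical stand-in for the `nba` data set (the rows are invented).

```python
import pandas as pd

# Hypothetical stand-in for the nba data set.
nba = pd.DataFrame({
    "Name": ["A", "B"],
    "Team": ["X", "Y"],
    "Salary": [100.0, 200.0],
})

subset = nba[["Name", "Team"]]  # list of names written inside the brackets

mylist = ["Team", "Salary"]     # the list can also be built beforehand
other = nba[mylist]             # result is a DataFrame, not a Series

print(subset)
print(other)
```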
adding new column
- Assignment
nba["new column"] = nba["Salary"] / 2
- nba.insert(loc, column_name, values)
insert cannot repeat an existing column name (it raises an error by default)
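Both ways of adding a column can be sketched as follows. The frame and the new column names ("Half Salary", "Sport") are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the nba data set.
nba = pd.DataFrame({"Name": ["A", "B"], "Salary": [100.0, 200.0]})

# Assignment: the new column is appended as the last column.
nba["Half Salary"] = nba["Salary"] / 2

# insert: choose the position; a scalar value is repeated for every row.
nba.insert(1, "Sport", "Basketball")

# Inserting a name that already exists raises ValueError by default.
try:
    nba.insert(0, "Sport", "Basketball")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(nba.columns.tolist())
```

Assignment always appends at the end (or overwrites a column of the same name), while `insert` is the choice when position matters.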
axis
0 = rows ("index"), 1 = columns
NaN detection and handling
dropna(), fillna()
isnull(), notnull()
df.dropna(): by default, removes any row that has any NaN
can pass a subset of column(s) to check, e.g. df.dropna(subset=["Col1"])
df.dropna(how="all"): removes only rows where all values are NaN
df["Col1"].fillna("abc")
df["Team"].isnull()
df = pd.read_csv("chicago.csv").dropna(how="all")
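A sketch of the NaN helpers above on a small hypothetical frame (the values are invented so that each call behaves differently).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, rows 1 and 3 are partly empty,
# row 2 is entirely empty.
df = pd.DataFrame({
    "Team": ["X", np.nan, np.nan, "Y"],
    "Col1": ["a", "b", np.nan, np.nan],
})

any_dropped = df.dropna()                  # keeps only rows with no NaN
all_dropped = df.dropna(how="all")         # drops only the fully empty row
team_dropped = df.dropna(subset=["Team"])  # checks the "Team" column only

filled = df["Col1"].fillna("abc")          # replace NaN with a fixed value
missing_team = df["Team"].isnull()         # boolean Series, True where NaN

print(len(any_dropped), len(all_dropped), len(team_dropped))
```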
change dtype and category
nba["Salary"] = nba["Salary"].astype("int")
nba["Position"] = nba["Position"].astype("category")
this reduces memory usage (when the column has few unique values)
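A rough sketch of the memory effect, assuming a hypothetical frame in which "Position" has only three distinct values repeated many times. Note that `astype("int")` fails if the column still contains NaN, so missing values need handling first.

```python
import pandas as pd

# Hypothetical data: few unique positions, many rows.
nba = pd.DataFrame({
    "Salary": [100.0, 200.0, 300.0] * 1000,
    "Position": ["PG", "SG", "C"] * 1000,
})

before = nba["Position"].memory_usage(deep=True)

nba["Salary"] = nba["Salary"].astype("int")
nba["Position"] = nba["Position"].astype("category")

after = nba["Position"].memory_usage(deep=True)

print(before, after)  # bytes used by "Position" before and after
```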
sort
nba.sort_values("Name")
nba.sort_index()
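A brief sketch of both sorts on a hypothetical three-row frame. Both methods return a new frame and leave the original unchanged.

```python
import pandas as pd

# Hypothetical stand-in for the nba data set.
nba = pd.DataFrame({"Name": ["Cy", "Ann", "Bob"], "Salary": [3, 1, 2]})

by_name = nba.sort_values("Name")  # rows ordered by the "Name" column
restored = by_name.sort_index()    # rows ordered by index label again

print(by_name)
print(restored)
```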
"object" dtype in pandas
i.e. usually strings
date time object
df["Start Date"] = pd.to_datetime(df["Start Date"])
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"])
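A sketch of both conversion routes. To stay self-contained it reads a hypothetical in-memory CSV instead of `employees.csv`; the rows are invented.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for employees.csv.
csv_text = "Name,Start Date\nAnn,1999-12-01\nBob,2005-07-15\n"

# Route 1: read first, then convert the column.
df = pd.read_csv(io.StringIO(csv_text))
print(df["Start Date"].dtype)  # not a datetime dtype yet
df["Start Date"] = pd.to_datetime(df["Start Date"])

# Route 2: convert while reading.
df2 = pd.read_csv(io.StringIO(csv_text), parse_dates=["Start Date"])

print(df["Start Date"].dtype, df2["Start Date"].dtype)
```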
filtering (masking)
df[df["Gender"] == "Male"]
basically, pass a boolean Series into the df's brackets
but this is hard to read
so it is better to use
mask = df["Start Date"] > "2000-3-31"
df[mask], where mask is a boolean Series
And: df[mask1 & mask2]
Or: df[mask1 | mask2]
df[(mask1 & mask2) | mask3]
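The mask combinations can be sketched on a small hypothetical employees frame. The rows and the "Team" column are invented for illustration.

```python
import pandas as pd

# Hypothetical employee data.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Start Date": pd.to_datetime(
        ["1999-12-01", "2005-07-15", "1998-03-02", "2010-01-20"]
    ),
    "Team": ["Sales", "Sales", "Legal", "Legal"],
})

mask1 = df["Gender"] == "Male"
mask2 = df["Start Date"] > "2000-03-31"
mask3 = df["Team"] == "Legal"

both = df[mask1 & mask2]                # And: every condition must hold
either = df[mask1 | mask2]              # Or: at least one condition holds
combined = df[(mask1 & mask2) | mask3]  # parentheses control the grouping

print(both["Name"].tolist())
print(either["Name"].tolist())
print(combined["Name"].tolist())
```

Naming each mask keeps the final filter readable, and the parentheses matter because `&` and `|` do not follow the precedence of Python's `and` / `or`.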