2. Data Frames Flashcards
Data Frames
multi-dimensional data
like Excel, a DataFrame is 2-dimensional: to get a value you need two points of reference (row and column)
auto-convert to float
if there is any NaN value, even an integer column will be converted to float
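A minimal sketch (made-up values) showing the upcast: the same numbers with and without a NaN land in different dtypes.

```python
import numpy as np
import pandas as pd

# Without NaN, an integer column stays integer:
clean = pd.Series([1, 2, 3])
# With a single NaN, pandas upcasts the whole column to float,
# because NaN is itself a float and has no integer representation:
dirty = pd.Series([1, 2, np.nan])

print(clean.dtype)  # int64
print(dirty.dtype)  # float64
```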
df basic shared attributes
df.index
df.values => a NumPy array
df.dtypes => dtype of each column
df.shape => tuple
df specific attributes
df.columns
df.axes
df.info() => counts non-null values per column
df shared methods
df.sum() => one total per column (sums down the rows); axis defaults to 0
rev.sum(axis="columns") or rev.sum(axis=1) => sum across the columns, one total per row
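A small sketch with made-up numbers showing both axis directions of sum:

```python
import pandas as pd

# hypothetical revenue table: rows are cities, columns are quarters
rev = pd.DataFrame({"Q1": [10, 20], "Q2": [30, 40]}, index=["NY", "LA"])

col_totals = rev.sum()        # axis=0 (default): one total per column
row_totals = rev.sum(axis=1)  # or axis="columns": one total per row

print(col_totals.tolist())  # [30, 70]
print(row_totals.tolist())  # [40, 60]
```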
extracting column in df
df["Col1"]
df.Col1 => returns a Series
but doesn't work with column names containing spaces
so do not use
extracting multiple columns
nba[["Name","Team"]] => multiple columns, returns a DataFrame
nba[mylist] => basically pass a list of column names into the df
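A quick sketch (toy data) of the single-vs-double-bracket difference: a string gives a Series, a list gives a DataFrame.

```python
import pandas as pd

nba = pd.DataFrame({"Name": ["LeBron", "Curry"],
                    "Team": ["LAL", "GSW"],
                    "Salary": [40, 45]})

one = nba["Name"]             # single brackets -> Series
many = nba[["Name", "Team"]]  # a list of names -> DataFrame

print(type(one).__name__)    # Series
print(list(many.columns))    # ['Name', 'Team']
```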
adding new column
- Assignment: nba["new column"] = nba["salary"] / 2
- nba.insert(loc, column, value) => insert at a specific position
cannot repeat a column name
axis
0 = rows, 1 = columns
NaN detection and handle
dropna(), fillna()
isnull(), notnull()
df.dropna(): by default removes any row containing any NaN
can pass in a subset of column(s): df.dropna(subset=["Col1"])
df.dropna(how="all"): removes rows where all values are NaN
df["Col1"].fillna("abc")
df["Team"].isnull()
df = pd.read_csv("chicago.csv").dropna(how="all")
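A small sketch with invented rows contrasting the dropna variants and fillna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Team": ["Legal", np.nan, np.nan],
                   "Salary": [60000, np.nan, 70000]})

n_any = len(df.dropna())                     # drops any row with any NaN
n_all = len(df.dropna(how="all"))            # drops only the all-NaN row
n_sub = len(df.dropna(subset=["Salary"]))    # judges NaN only in "Salary"
filled = df["Team"].fillna("Unknown")

print(n_any, n_all, n_sub)   # 1 2 2
print(filled.tolist())       # ['Legal', 'Unknown', 'Unknown']
```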
change dtype and category
nba["Salary"] = nba["Salary"].astype("int")
nba["Position"] = nba["Position"].astype("category")
this reduces memory usage (each distinct value is stored once, rows hold small integer codes)
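A sketch (synthetic repeated values) showing the memory saving from the category dtype:

```python
import pandas as pd

# a long column with only a few unique values, e.g. player positions
positions = pd.Series(["PG", "SG", "PG", "C", "PG", "SG"] * 1000)
as_cat = positions.astype("category")

# category stores each distinct string once plus compact integer codes,
# so deep memory usage drops sharply
saved = positions.memory_usage(deep=True) > as_cat.memory_usage(deep=True)
print(saved)  # True
```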
sort
nba.sort_values("Name")
nba.sort_index()
“object” in pandas
i.e. string
date time object
df["Start Date"] = pd.to_datetime(df["Start Date"])
df = pd.read_csv("employees.csv", parse_dates = ["Start Date","Last Login Time"])
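A minimal sketch (made-up dates) of the object-to-datetime conversion and what it unlocks:

```python
import pandas as pd

df = pd.DataFrame({"Start Date": ["1993-08-06", "1996-03-31"]})
before = df["Start Date"].dtype           # object (plain strings)

df["Start Date"] = pd.to_datetime(df["Start Date"])
after = df["Start Date"].dtype            # datetime64[ns]

# datetime columns expose the .dt accessor
print(df["Start Date"].dt.year.tolist())  # [1993, 1996]
```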
filtering (masking)
df[df["Gender"] == "Male"]
basically pass a boolean Series into the df
but hard to read
so better to use
mask = df["Start Date"] > "2000-3-31"
df[mask], where mask is a boolean Series
And: df[mask1 & mask2]
Or: df[mask1 | mask2]
df[(mask1 & mask2) | mask3]
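A sketch with invented rows combining two masks with & and |:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Start Date": pd.to_datetime(["1999-01-01", "2001-06-15", "2005-03-02"]),
})

mask1 = df["Gender"] == "Male"
mask2 = df["Start Date"] > "2000-3-31"

# & = and, | = or; each mask is a boolean Series, parentheses matter
both = df[mask1 & mask2]
either = df[mask1 | mask2]

print(len(both))    # 1
print(len(either))  # 3
```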
filtering (method)
.between
df[df["Salary"].between(60000,70000)]
isin() method:
df[df["Team"].isin(["Legal","Sales","Product"])]
isnull()
notnull()
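A sketch (toy salary table) of both filter methods; note that between() is inclusive on both ends:

```python
import pandas as pd

df = pd.DataFrame({"Team": ["Legal", "Sales", "IT", "Product"],
                   "Salary": [55000, 65000, 68000, 72000]})

in_band = df[df["Salary"].between(60000, 70000)]   # inclusive bounds
in_teams = df[df["Team"].isin(["Legal", "Sales", "Product"])]

print(in_band["Team"].tolist())   # ['Sales', 'IT']
print(in_teams["Team"].tolist())  # ['Legal', 'Sales', 'Product']
```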
extraction rows and columns
df.loc[]: slicing is inclusive; uses index labels, i.e. strings or datetimes
df.iloc[]: slicing is non-inclusive; uses integer index positions only ("integer" loc)
Note: use [] not () because it's "retrieving", not calling a method
df.loc[index] -> returns a Series / returns a df if there are multiple results
df.loc["Moonraker","Director"]
df.loc[mask,"Year":"Director"]
df.loc[["Moonraker","A View to a Kill"],"Director":"Budget"]
df.loc[:,"Director":"Budget"]
df.iloc[0:5,0:5]
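A sketch (toy Bond table) contrasting the inclusive label slicing of .loc with the exclusive position slicing of .iloc:

```python
import pandas as pd

bond = pd.DataFrame({"Year": [1979, 1985],
                     "Director": ["Lewis Gilbert", "John Glen"],
                     "Budget": [34, 30]},
                    index=["Moonraker", "A View to a Kill"])

# .loc uses labels, and the slice end IS included:
cell = bond.loc["Moonraker", "Director"]
cols = list(bond.loc[:, "Year":"Director"].columns)

# .iloc uses positions, and the slice end is NOT included:
first_row = bond.iloc[0:1].index.tolist()

print(cell)       # Lewis Gilbert
print(cols)       # ['Year', 'Director']
print(first_row)  # ['Moonraker']
```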
check duplicate
df["First Name"].duplicated(): first occurrence marked False, subsequent duplicates marked True
df["First Name"].duplicated(keep=False): all duplicates marked True
df.drop_duplicates(subset = ["First Name"], keep = False)
negation of a Series
~df["First Name"].duplicated(keep=False)
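A sketch with made-up names showing the duplicated() flags and the ~ negation trick for keeping only names that appear exactly once:

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ann", "Bob", "Ann", "Cid"]})

first_kept = df["First Name"].duplicated().tolist()
all_marked = df["First Name"].duplicated(keep=False).tolist()

print(first_kept)  # [False, False, True, False]
print(all_marked)  # [True, False, True, False]

# ~ negates the boolean Series: keep rows whose name is not duplicated at all
unique_only = df[~df["First Name"].duplicated(keep=False)]
print(unique_only["First Name"].tolist())  # ['Bob', 'Cid']
```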
unique
nba["Position"].nunique() => gives the number of unique values; the dropna parameter sets whether NaN is counted
.unique() gives you an array
adjusting index
set_index
reset_index
read_csv can specify index_col
editing data
df["col1"]["something"] = "abc" does not work: chained indexing returns a copy not connected to the df; use df.loc instead
can use a mask as well
df.loc[mask,"ColA":"ColE"] = "abc", or a list with the same shape
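A sketch (toy table, hypothetical "Bonus" column) of editing through a single .loc call, which writes straight into the original DataFrame instead of a temporary copy:

```python
import pandas as pd

df = pd.DataFrame({"Team": ["Legal", "Sales"], "Bonus": [0, 0]})

# one .loc call with a mask: row selection and assignment in one step
mask = df["Team"] == "Sales"
df.loc[mask, "Bonus"] = 500

print(df["Bonus"].tolist())  # [0, 500]
```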
rename column
df.rename(index = mydict) => change index labels
df.rename(columns = mydict) => change column labels
or columns assignment
df.columns = ['Yearabc', 'Actor', 'Director', 'Box Office', 'Budget', 'Bond Actor Salary']
df.rename(mapper = mydict) => mapper targets the index by default (axis=0)
df.rename(mapper = {"Year":"Release Date","Box Office":"Revenue"}, axis=1) => renames columns
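A sketch (toy two-column frame) showing that columns= and mapper= with axis=1 do the same thing:

```python
import pandas as pd

df = pd.DataFrame({"Year": [1964], "Box Office": [125]})

# columns= targets column labels directly...
renamed = df.rename(columns={"Year": "Release Date", "Box Office": "Revenue"})
# ...while mapper= defaults to the index and needs axis=1 for columns
same = df.rename(mapper={"Year": "Release Date", "Box Office": "Revenue"}, axis=1)

print(list(renamed.columns))  # ['Release Date', 'Revenue']
print(list(same.columns))     # ['Release Date', 'Revenue']
```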
delete row or columns
df.drop("row index")
df.drop("col name", axis = 1)
or
df.pop("col1") (in place, and returns the removed column)
or
del df["col1"]
sample
random sample
df.sample(5)
min max
df.nlargest(3, columns="Box Office")
df.nsmallest
can be applied to a Series as well
df["Col1"].nlargest(3)
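A sketch with made-up box-office numbers showing nlargest on a DataFrame and nsmallest on a Series:

```python
import pandas as pd

df = pd.DataFrame({"Film": ["A", "B", "C", "D"],
                   "Box Office": [300, 900, 600, 100]})

top3 = df.nlargest(3, columns="Box Office")     # rows sorted by the column
print(top3["Film"].tolist())                    # ['B', 'C', 'A']

# the Series versions need no columns argument
print(df["Box Office"].nsmallest(2).tolist())   # [100, 300]
```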
bulk replace " " in column names with "_"
df.columns = [x.replace(" ","_") for x in df.columns] (columns assignment + list comprehension)
or the vectorized .str accessor:
df.columns = df.columns.str.replace(" ","_")
query
df.query('Actor == "Sean Connery"')
note that you pass in a string; no quotes needed for column names, and use and/or instead of &/|
df.query('Actor == "Sean Connery" and Director == "Guy Hamilton"')
df.query('Actor in @mylist and Director == "Guy Hamilton"') (use @ to reference a Python variable)
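A sketch (toy Bond cast table) of query(), including the @ prefix needed to reference an outside Python variable inside the query string:

```python
import pandas as pd

df = pd.DataFrame({"Actor": ["Sean Connery", "Roger Moore", "Sean Connery"],
                   "Director": ["Guy Hamilton", "Lewis Gilbert", "Terence Young"]})

mylist = ["Sean Connery", "Roger Moore"]

# the whole expression is one string: bare column names, quoted string
# values, and/or keywords, @ for outside variables
hits = df.query('Actor in @mylist and Director == "Guy Hamilton"')
print(hits["Actor"].tolist())  # ['Sean Connery']
```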
apply(myfunc)
columns = ["Box_Office", "Budget", "Bond_Actor_Salary"]
for col in columns:
    df[col] = df[col].apply(mycon)
apply method for processing each row
df.apply(good_move, axis=1)
with axis=1, "apply" moves along the columns, handing each row (a Series) to the function
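A sketch of row-wise apply with a hypothetical good_movie function (invented name and data) that combines two columns per row:

```python
import pandas as pd

df = pd.DataFrame({"Box_Office": [900, 300], "Budget": [100, 400]})

# with axis=1 the function receives one ROW at a time as a Series,
# so it can look at several columns together
def good_movie(row):
    return "good" if row["Box_Office"] > row["Budget"] else "bad"

verdict = df.apply(good_movie, axis=1)
print(verdict.tolist())  # ['good', 'bad']
```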
copy method
df2 = df1.copy()
splits the reference: changes to df2 no longer affect df1
read in csv with no header
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'], header=None)
read csv with range of columns
pd.read_csv("GOOG_1min_sample.txt", header=None, usecols=[*range(0,3)])
the * unpacks the iterator returned by range. For example, [range(0, 3), 3] is a list holding a range object, but [*range(0, 3), 3] == [0, 1, 2, 3]
writing back value to df
df["Col"] = "abc" won't work when df is itself a slice/copy of another df (SettingWithCopyWarning)
use
df2.loc[:,"local_min"] = "abc"
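A sketch (made-up price data, hypothetical "local_min" column) of writing safely after filtering: take an explicit .copy() of the slice, then assign through .loc:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 11, 9, 12]})

# df[df["price"] < 11] is a slice of df; copy it explicitly so the
# assignment below writes to our own frame, not a view of the original
df2 = df[df["price"] < 11].copy()
df2.loc[:, "local_min"] = "yes"

print(df2["local_min"].tolist())  # ['yes', 'yes']
```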