2. Data Frames Flashcards
Data Frames
multi-dimensional data
like Excel, a DataFrame is 2-dimensional: to get a value you need two points of reference (row and column)
auto-convert to float
if there is any NaN value, even an integer column will be converted to float
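A minimal sketch (made-up values) showing the upcast: the same numbers with and without a NaN land in different dtypes.

```python
import numpy as np
import pandas as pd

# Without NaN, an integer column stays integer:
clean = pd.Series([1, 2, 3])
# With a single NaN, pandas upcasts the whole column to float,
# because NaN is itself a float and has no integer representation:
dirty = pd.Series([1, 2, np.nan])

print(clean.dtype)  # int64
print(dirty.dtype)  # float64
```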
df basic shared attributes
df.index
df.values => a NumPy array
df.dtypes => dtype of each column
df.shape => tuple
df specific attributes
df.columns
df.axes
df.info() => counts non-null values per column
df shared methods
df.sum() => one total per column (sums down the rows); axis defaults to 0
rev.sum(axis="columns") or rev.sum(axis=1) => sum across the columns, one total per row
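A small sketch with made-up numbers showing both axis directions of sum:

```python
import pandas as pd

# hypothetical revenue table: rows are cities, columns are quarters
rev = pd.DataFrame({"Q1": [10, 20], "Q2": [30, 40]}, index=["NY", "LA"])

col_totals = rev.sum()        # axis=0 (default): one total per column
row_totals = rev.sum(axis=1)  # or axis="columns": one total per row

print(col_totals.tolist())  # [30, 70]
print(row_totals.tolist())  # [40, 60]
```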
extracting column in df
df["Col1"]
df.Col1 => returns a Series
but doesn't work with column names containing spaces
so do not use
extracting multiple columns
nba[["Name","Team"]] => multiple columns, returns a DataFrame
nba[mylist] => basically pass a list of column names into the df
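A quick sketch (toy data) of the single-vs-double-bracket difference: a string gives a Series, a list gives a DataFrame.

```python
import pandas as pd

nba = pd.DataFrame({"Name": ["LeBron", "Curry"],
                    "Team": ["LAL", "GSW"],
                    "Salary": [40, 45]})

one = nba["Name"]             # single brackets -> Series
many = nba[["Name", "Team"]]  # a list of names -> DataFrame

print(type(one).__name__)    # Series
print(list(many.columns))    # ['Name', 'Team']
```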
adding new column
- Assignment: nba["new column"] = nba["salary"] / 2
- nba.insert(loc, column, value) => insert at a specific position
cannot repeat a column name
axis
0 = rows, 1 = columns
NaN detection and handle
dropna(), fillna()
isnull(), notnull()
df.dropna(): by default removes any row containing any NaN
can pass in a subset of column(s): df.dropna(subset=["Col1"])
df.dropna(how="all"): removes rows where all values are NaN
df["Col1"].fillna("abc")
df["Team"].isnull()
df = pd.read_csv("chicago.csv").dropna(how="all")
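A small sketch with invented rows contrasting the dropna variants and fillna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Team": ["Legal", np.nan, np.nan],
                   "Salary": [60000, np.nan, 70000]})

n_any = len(df.dropna())                     # drops any row with any NaN
n_all = len(df.dropna(how="all"))            # drops only the all-NaN row
n_sub = len(df.dropna(subset=["Salary"]))    # judges NaN only in "Salary"
filled = df["Team"].fillna("Unknown")

print(n_any, n_all, n_sub)   # 1 2 2
print(filled.tolist())       # ['Legal', 'Unknown', 'Unknown']
```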
change dtype and category
nba["Salary"] = nba["Salary"].astype("int")
nba["Position"] = nba["Position"].astype("category")
this reduces memory usage (each distinct value is stored once, rows hold small integer codes)
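A sketch (synthetic repeated values) showing the memory saving from the category dtype:

```python
import pandas as pd

# a long column with only a few unique values, e.g. player positions
positions = pd.Series(["PG", "SG", "PG", "C", "PG", "SG"] * 1000)
as_cat = positions.astype("category")

# category stores each distinct string once plus compact integer codes,
# so deep memory usage drops sharply
saved = positions.memory_usage(deep=True) > as_cat.memory_usage(deep=True)
print(saved)  # True
```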
sort
nba.sort_values("Name")
nba.sort_index()
“object” in pandas
i.e. string
date time object
df["Start Date"] = pd.to_datetime(df["Start Date"])
df = pd.read_csv("employees.csv", parse_dates = ["Start Date","Last Login Time"])
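A minimal sketch (made-up dates) of the object-to-datetime conversion and what it unlocks:

```python
import pandas as pd

df = pd.DataFrame({"Start Date": ["1993-08-06", "1996-03-31"]})
before = df["Start Date"].dtype           # object (plain strings)

df["Start Date"] = pd.to_datetime(df["Start Date"])
after = df["Start Date"].dtype            # datetime64[ns]

# datetime columns expose the .dt accessor
print(df["Start Date"].dt.year.tolist())  # [1993, 1996]
```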
filtering (masking)
df[df["Gender"] == "Male"]
basically pass a boolean Series into the df
but hard to read
so better to use
mask = df["Start Date"] > "2000-3-31"
df[mask], where mask is a boolean Series
And: df[mask1 & mask2]
Or: df[mask1 | mask2]
df[(mask1 & mask2) | mask3]
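A sketch with invented rows combining two masks with & and |:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Start Date": pd.to_datetime(["1999-01-01", "2001-06-15", "2005-03-02"]),
})

mask1 = df["Gender"] == "Male"
mask2 = df["Start Date"] > "2000-3-31"

# & = and, | = or; each mask is a boolean Series, parentheses matter
both = df[mask1 & mask2]
either = df[mask1 | mask2]

print(len(both))    # 1
print(len(either))  # 3
```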
filtering (method)
.between
df[df["Salary"].between(60000,70000)]
isin() method:
df[df["Team"].isin(["Legal","Sales","Product"])]
isnull()
notnull()
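A sketch (toy salary table) of both filter methods; note that between() is inclusive on both ends:

```python
import pandas as pd

df = pd.DataFrame({"Team": ["Legal", "Sales", "IT", "Product"],
                   "Salary": [55000, 65000, 68000, 72000]})

in_band = df[df["Salary"].between(60000, 70000)]   # inclusive bounds
in_teams = df[df["Team"].isin(["Legal", "Sales", "Product"])]

print(in_band["Team"].tolist())   # ['Sales', 'IT']
print(in_teams["Team"].tolist())  # ['Legal', 'Sales', 'Product']
```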
extraction rows and columns
df.loc[]: slicing is inclusive; uses index labels, i.e. strings or datetimes
df.iloc[]: slicing is non-inclusive; uses integer index positions only ("integer" loc)
Note: use [] not () because it's "retrieving", not calling a method
df.loc[index] -> returns a Series / returns a df if there are multiple results
df.loc["Moonraker","Director"]
df.loc[mask,"Year":"Director"]
df.loc[["Moonraker","A View to a Kill"],"Director":"Budget"]
df.loc[:,"Director":"Budget"]
df.iloc[0:5,0:5]
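A sketch (toy Bond table) contrasting the inclusive label slicing of .loc with the exclusive position slicing of .iloc:

```python
import pandas as pd

bond = pd.DataFrame({"Year": [1979, 1985],
                     "Director": ["Lewis Gilbert", "John Glen"],
                     "Budget": [34, 30]},
                    index=["Moonraker", "A View to a Kill"])

# .loc uses labels, and the slice end IS included:
cell = bond.loc["Moonraker", "Director"]
cols = list(bond.loc[:, "Year":"Director"].columns)

# .iloc uses positions, and the slice end is NOT included:
first_row = bond.iloc[0:1].index.tolist()

print(cell)       # Lewis Gilbert
print(cols)       # ['Year', 'Director']
print(first_row)  # ['Moonraker']
```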
check duplicate
df["First Name"].duplicated(): first occurrence marked False, subsequent duplicates marked True
df["First Name"].duplicated(keep=False): all duplicates marked True
df.drop_duplicates(subset = ["First Name"], keep = False)
negation of a Series
~df["First Name"].duplicated(keep=False)
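A sketch with made-up names showing the duplicated() flags and the ~ negation trick for keeping only names that appear exactly once:

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ann", "Bob", "Ann", "Cid"]})

first_kept = df["First Name"].duplicated().tolist()
all_marked = df["First Name"].duplicated(keep=False).tolist()

print(first_kept)  # [False, False, True, False]
print(all_marked)  # [True, False, True, False]

# ~ negates the boolean Series: keep rows whose name is not duplicated at all
unique_only = df[~df["First Name"].duplicated(keep=False)]
print(unique_only["First Name"].tolist())  # ['Bob', 'Cid']
```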
unique
nba["Position"].nunique() => gives the number of unique values; the dropna parameter sets whether NaN is counted
.unique() gives you an array
adjusting index
set_index
reset_index
read_csv can specify index_col
editing data
df["col1"]["something"] = "abc" does not work: chained indexing returns a copy not connected to the df; use df.loc instead
can use a mask as well
df.loc[mask,"ColA":"ColE"] = "abc", or a list with the same shape
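A sketch (toy table, hypothetical "Bonus" column) of editing through a single .loc call, which writes straight into the original DataFrame instead of a temporary copy:

```python
import pandas as pd

df = pd.DataFrame({"Team": ["Legal", "Sales"], "Bonus": [0, 0]})

# one .loc call with a mask: row selection and assignment in one step
mask = df["Team"] == "Sales"
df.loc[mask, "Bonus"] = 500

print(df["Bonus"].tolist())  # [0, 500]
```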
rename column
df.rename(index = mydict) => change index labels
df.rename(columns = mydict) => change column labels
or columns assignment
df.columns = ['Yearabc', 'Actor', 'Director', 'Box Office', 'Budget', 'Bond Actor Salary']
df.rename(mapper = mydict) => mapper targets the index by default (axis=0)
df.rename(mapper = {"Year":"Release Date","Box Office":"Revenue"}, axis=1) => renames columns
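A sketch (toy two-column frame) showing that columns= and mapper= with axis=1 do the same thing:

```python
import pandas as pd

df = pd.DataFrame({"Year": [1964], "Box Office": [125]})

# columns= targets column labels directly...
renamed = df.rename(columns={"Year": "Release Date", "Box Office": "Revenue"})
# ...while mapper= defaults to the index and needs axis=1 for columns
same = df.rename(mapper={"Year": "Release Date", "Box Office": "Revenue"}, axis=1)

print(list(renamed.columns))  # ['Release Date', 'Revenue']
print(list(same.columns))     # ['Release Date', 'Revenue']
```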
delete row or columns
df.drop("row index")
df.drop("col name", axis = 1)
or
df.pop("col1") (in place, and returns the removed column)
or
del df["col1"]
sample
random sample
df.sample(5)
min max
df.nlargest(3, columns="Box Office")
df.nsmallest
can be applied to a Series as well
df["Col1"].nlargest(3)
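A sketch with made-up box-office numbers showing nlargest on a DataFrame and nsmallest on a Series:

```python
import pandas as pd

df = pd.DataFrame({"Film": ["A", "B", "C", "D"],
                   "Box Office": [300, 900, 600, 100]})

top3 = df.nlargest(3, columns="Box Office")     # rows sorted by the column
print(top3["Film"].tolist())                    # ['B', 'C', 'A']

# the Series versions need no columns argument
print(df["Box Office"].nsmallest(2).tolist())   # [100, 300]
```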
bulk replace " " in column names with "_"
df.columns = [x.replace(" ","_") for x in df.columns] (columns assignment + list comprehension)
or the vectorized .str accessor:
df.columns = df.columns.str.replace(" ","_")
query
df.query('Actor == "Sean Connery"')
note that you pass in a string; no quotes needed for column names, and use and/or instead of &/|
df.query('Actor == "Sean Connery" and Director == "Guy Hamilton"')
df.query('Actor in @mylist and Director == "Guy Hamilton"') (use @ to reference a Python variable)
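A sketch (toy Bond cast table) of query(), including the @ prefix needed to reference an outside Python variable inside the query string:

```python
import pandas as pd

df = pd.DataFrame({"Actor": ["Sean Connery", "Roger Moore", "Sean Connery"],
                   "Director": ["Guy Hamilton", "Lewis Gilbert", "Terence Young"]})

mylist = ["Sean Connery", "Roger Moore"]

# the whole expression is one string: bare column names, quoted string
# values, and/or keywords, @ for outside variables
hits = df.query('Actor in @mylist and Director == "Guy Hamilton"')
print(hits["Actor"].tolist())  # ['Sean Connery']
```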
apply(myfunc)
columns = ["Box_Office", "Budget", "Bond_Actor_Salary"]
for col in columns:
    df[col] = df[col].apply(mycon)
apply method for processing each row
df.apply(good_move, axis=1)
with axis=1, "apply" moves along the columns, handing each row (a Series) to the function
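A sketch of row-wise apply with a hypothetical good_movie function (invented name and data) that combines two columns per row:

```python
import pandas as pd

df = pd.DataFrame({"Box_Office": [900, 300], "Budget": [100, 400]})

# with axis=1 the function receives one ROW at a time as a Series,
# so it can look at several columns together
def good_movie(row):
    return "good" if row["Box_Office"] > row["Budget"] else "bad"

verdict = df.apply(good_movie, axis=1)
print(verdict.tolist())  # ['good', 'bad']
```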
copy method
df2 = df1.copy()
splits the reference: changes to df2 no longer affect df1
read in csv with no header
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'], header=None)
read csv with range of columns
pd.read_csv("GOOG_1min_sample.txt", header=None, usecols=[*range(0,3)])
the * unpacks the iterator returned by range. For example, [range(0, 3), 3] is a list holding a range object, but [*range(0, 3), 3] == [0, 1, 2, 3]
writing back value to df
df["Col"] = "abc" won't work when df is itself a slice/copy of another df (SettingWithCopyWarning)
use
df2.loc[:,"local_min"] = "abc"
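A sketch (made-up price data, hypothetical "local_min" column) of writing safely after filtering: take an explicit .copy() of the slice, then assign through .loc:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 11, 9, 12]})

# df[df["price"] < 11] is a slice of df; copy it explicitly so the
# assignment below writes to our own frame, not a view of the original
df2 = df[df["price"] < 11].copy()
df2.loc[:, "local_min"] = "yes"

print(df2["local_min"].tolist())  # ['yes', 'yes']
```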