2. Data Frames Flashcards

1
Q

Data Frames

A

Multi-dimensional data.

Like an Excel sheet, a DataFrame is 2-dimensional: to get a value you need two points of reference (a row and a column).

2
Q

auto convert to float

A

If a column contains any NaN value, even integers are converted to float (NaN is itself a float).
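A minimal sketch of this conversion, using made-up values. The nullable "Int64" dtype shown at the end is an addition for comparison, not part of the original card.

```python
import numpy as np
import pandas as pd

# A column of whole numbers is stored as int64.
ints = pd.Series([1, 2, 3])

# One missing value forces the default dtype to float64.
with_nan = pd.Series([1, 2, np.nan])

# The nullable "Int64" extension dtype keeps integers next to missing values.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(ints.dtype, with_nan.dtype, nullable.dtype)
```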

3
Q

df basic shared attributes

A

df.index
df.values: a NumPy array
df.dtypes: the dtype of every column

df.shape => tuple (rows, columns)

4
Q

df specific attributes

A

df.columns
df.axes
df.info() => non-null count and dtype per column (a method, not an attribute)

5
Q

df shared methods

A

df.sum() => total down the rows for each column; axis defaults to 0
rev.sum(axis="columns") or rev.sum(axis=1) => total across the columns for each row
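A small sketch of both directions. The `rev` table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical revenue table: one row per day, one column per city.
rev = pd.DataFrame(
    {"New York": [10, 20], "Boston": [1, 2]},
    index=["Mon", "Tue"],
)

col_totals = rev.sum()                # axis=0: one total per column
row_totals = rev.sum(axis="columns")  # same as axis=1: one total per row
```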

6
Q

extracting a column from a df

A

df["Col1"]

df.Col1 => also returns a Series,
but it doesn't work for column names containing spaces,
so avoid it.

7
Q

extracting multiple columns

A

nba[["Name", "Team"]] => multiple columns, returned as a DataFrame

nba[mylist]: basically, pass a list of labels to [] on a Series or df

8
Q

adding new column

A
  1. Assignment
    nba["new column"] = nba["salary"] / 2
  2. nba.insert(loc, column, value)
    raises ValueError if the column name already exists (unless allow_duplicates=True)
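Both approaches can be sketched as follows. The table contents and the "Half Salary" and "Team" column names are made up for illustration.

```python
import pandas as pd

nba = pd.DataFrame({"Name": ["A", "B"], "Salary": [100.0, 200.0]})

# 1. Assignment appends the new column at the end.
nba["Half Salary"] = nba["Salary"] / 2

# 2. insert() places the new column at a given position (here: position 1).
nba.insert(1, "Team", ["X", "Y"])

# Inserting a name that already exists raises ValueError by default.
try:
    nba.insert(0, "Team", ["X", "Y"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```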
9
Q

axis

A
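0 = rows ("index"): the operation runs down the rows, giving one result per column
1 = columns: the operation runs across the columns, giving one result per row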
10
Q

NaN detection and handling

A

dropna(), fillna()
isnull(), notnull()

df.dropna(): by default removes every row that contains any NaN
can be limited to certain column(s) with subset=
df.dropna(how="all"): removes only rows where all values are NaN

df["Col1"].fillna("abc")

df["Team"].isnull()

df = pd.read_csv("chicago.csv").dropna(how="all")
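The differences between these calls are easier to see on a tiny table. The data below is invented for the sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", None, None],
        "Team": ["Sales", None, "Legal", None],
    }
)

any_dropped = df.dropna()                    # drops rows with at least one NaN
all_dropped = df.dropna(how="all")           # drops only rows that are entirely NaN
subset_dropped = df.dropna(subset=["Team"])  # only checks the Team column
filled = df["Team"].fillna("Unknown")        # returns a new Series; df is unchanged
missing_team = df["Team"].isnull()           # boolean Series
```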

11
Q

change dtype and category

A

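nba["Salary"] = nba["Salary"].astype("int")
nba["Position"] = nba["Position"].astype("category")
Converting a column with few distinct values to category reduces memory usage. astype("int") fails if the column still contains NaN, so fill or drop those first.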
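A rough way to check the memory saving, assuming a made-up column with only five distinct labels. The exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical column: 10,000 rows but only five distinct values.
positions = pd.Series(["PG", "SG", "SF", "PF", "C"] * 2000)

before = positions.memory_usage(deep=True)
after = positions.astype("category").memory_usage(deep=True)

print(before, after)
```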

12
Q

sort

A

nba.sort_values("Name")

nba.sort_index()

13
Q

“object” in pandas

A

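In most cases a string: "object" is the general-purpose dtype, which pandas has traditionally used for text columns.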

14
Q

date time object

A

df["Start Date"] = pd.to_datetime(df["Start Date"])

df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"])
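Both routes can be tried without a file on disk. This sketch uses an in-memory CSV with invented rows in place of employees.csv.

```python
import io

import pandas as pd

# Route 1: convert an existing text column.
df = pd.DataFrame({"Start Date": ["2000-03-31", "1999-12-01"]})
df["Start Date"] = pd.to_datetime(df["Start Date"])

# Route 2: parse while reading (io.StringIO stands in for a real file).
csv_text = "Name,Start Date\nAnn,2000-03-31\n"
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Start Date"])
```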

15
Q

filtering (masking)

A

df[df["Gender"] == "Male"]
basically pass a boolean Series into the df,
but this is hard to read

so better use
mask = df["Start Date"] > "2000-3-31"
df[mask], where mask is a boolean Series

And: df[mask1 & mask2]
Or: df[mask1 | mask2]
df[(mask1 & mask2) | mask3]
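A short sketch of named masks and how they combine. The employee rows are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female"],
        "Team": ["Sales", "Legal", "Legal", "Sales"],
        "Start Date": pd.to_datetime(
            ["1999-01-15", "2005-06-01", "2010-09-30", "1998-03-02"]
        ),
    }
)

is_male = df["Gender"] == "Male"
after_2000 = df["Start Date"] > "2000-03-31"
in_legal = df["Team"] == "Legal"

both = df[is_male & after_2000]                   # AND
either = df[is_male | after_2000]                 # OR
combined = df[(is_male & after_2000) | in_legal]  # parentheses control grouping
```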

16
Q

filtering (method)

A

between() method (both bounds are inclusive by default):
df[df["Salary"].between(60000, 70000)]

isin() method:
df[df["Team"].isin(["Legal", "Sales", "Product"])]

isnull()
notnull()
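These filter methods each return a boolean Series, so they slot straight into df[...]. A sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Team": ["Legal", "Sales", "Finance", None],
        "Salary": [60000, 65000, 70000, 80000],
    }
)

in_range = df[df["Salary"].between(60000, 70000)]  # both bounds included
in_teams = df[df["Team"].isin(["Legal", "Sales", "Product"])]
has_team = df[df["Team"].notnull()]
```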

17
Q

extracting rows and columns

A

df.loc[]: slicing is inclusive; uses index labels, e.g. strings or datetimes
df.iloc[]: slicing excludes the end; uses integer positions only

Note: use [] not () because it's "retrieving" (indexing), not calling a method

df.loc[label] -> returns a Series, or a DataFrame if several rows match

df.loc["Moonraker", "Director"]
df.loc[mask, "Year":"Director"]
df.loc[["Moonraker", "A View to a Kill"], "Director":"Budget"]
df.loc[:, "Director":"Budget"]
df.iloc[0:5, 0:5]
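A minimal sketch of label-based versus position-based access, assuming a small table indexed by film title. The Budget figures are placeholders, not real data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Year": [1979, 1985, 1964],
        "Director": ["Lewis Gilbert", "John Glen", "Guy Hamilton"],
        "Budget": [1.0, 2.0, 3.0],  # placeholder values
    },
    index=["Moonraker", "A View to a Kill", "Goldfinger"],
)

one_cell = df.loc["Moonraker", "Director"]    # scalar
one_row = df.loc["Moonraker"]                 # Series
label_slice = df.loc[:, "Director":"Budget"]  # both ends included
pos_slice = df.iloc[0:2, 0:2]                 # end positions excluded
```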

18
Q

check duplicate

A

df[["First Name"]].duplicated(): the first occurrence is marked False, each later duplicate is marked True
df[["First Name"]].duplicated(keep=False): every duplicated value is marked True

df.drop_duplicates(subset=["First Name"], keep=False): drops every row whose First Name appears more than once
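The True/False pattern is easy to mix up, so here is a small check on invented names:

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ann", "Bob", "Ann", "Cy"]})

first_kept = df["First Name"].duplicated()            # only the second "Ann" is True
all_marked = df["First Name"].duplicated(keep=False)  # both "Ann" rows are True
only_unique = df.drop_duplicates(subset=["First Name"], keep=False)
```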

19
Q

negation of a Series

A

~df["First Name"].duplicated(keep=False) => ~ flips every True/False, so this is True only for names that appear exactly once

20
Q

unique

A

nba["Position"].nunique() => the number of unique values; NaN is excluded by default (dropna=False counts it)

.unique() gives you an array of the unique values

21
Q

adjusting index

A

set_index
reset_index

read_csv can specify index_col

22
Q

editing data

A

df["col1"]["something"] = "abc" is not reliable: the first [] may return a copy, so the assignment never reaches df (chained assignment). Use df.loc["something", "col1"] = "abc" instead.

can use a mask as well
df.loc[mask, "ColA":"ColE"] = "abc", or a list with a matching shape
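A sketch of writing through .loc, first to a single cell and then to a masked block. The table and column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"Team": ["Sales", "Legal", "Sales"], "Bonus": [0, 0, 0], "Level": [1, 1, 1]},
    index=["Ann", "Bob", "Cy"],
)

# One cell, addressed by row label and column label in a single .loc call.
df.loc["Bob", "Team"] = "Finance"

# Every row where the mask is True, across a label slice of columns.
mask = df["Team"] == "Sales"
df.loc[mask, "Bonus":"Level"] = 5
```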

23
Q

rename column

A

df.rename(index=mydict) => change index labels
df.rename(columns=mydict) => change column names
or column assignment (every column must be listed):
df.columns = ['Yearabc', 'Actor', 'Director', 'Box Office', 'Budget', 'Bond Actor Salary']

df.rename(mapper=mydict) acts on the index by default
df.rename(mapper={"Year": "Release Date", "Box Office": "Revenue"}, axis=1) => acts on the columns
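The three rename forms side by side, on a one-row table with a placeholder Box Office value. Note that rename returns a new DataFrame and leaves the original alone.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1962], "Box Office": [1.0]}, index=["Dr. No"])
new_names = {"Year": "Release Date", "Box Office": "Revenue"}

by_columns = df.rename(columns=new_names)
by_mapper = df.rename(mapper=new_names, axis=1)  # same effect as columns=
by_index = df.rename(index={"Dr. No": "Dr No"})
```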

24
Q

delete row or columns

A

df.drop("row index")
df.drop("column name", axis=1)
or
df.pop("col1") (removes the column in place and returns it)
or
del df["col1"]

25
Q

sample

A

random rows

df.sample(5)

26
Q

min max

A

df.nlargest(3, columns="Box Office")
df.nsmallest: same arguments

can be applied to a Series as well
df["Col1"].nlargest(3)
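A quick sketch with invented numbers. The results come back sorted, largest first for nlargest and smallest first for nsmallest.

```python
import pandas as pd

df = pd.DataFrame(
    {"Film": ["A", "B", "C", "D"], "Box Office": [10, 40, 30, 20]}
)

top3 = df.nlargest(3, columns="Box Office")      # whole rows, largest first
bottom2 = df.nsmallest(2, columns="Box Office")  # whole rows, smallest first
top_series = df["Box Office"].nlargest(3)        # Series version
```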

27
Q

bulk replace spaces in column names with "_"

A

df.columns = [x.replace(" ", "_") for x in df.columns]
columns assignment + list comprehension

df.columns = df.columns.str.replace(" ", "_")

28
Q

query

A

df.query('Actor == "Sean Connery"')

Note: pass the condition in as a string; column names need no quotes, and conditions are combined with and / or. Prefix Python variables with @.

df.query('Actor == "Sean Connery" and Director == "Guy Hamilton"')
df.query('Actor in @mylist and Director == "Guy Hamilton"')
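A small sketch of the query forms above. The rows and `mylist` are made up, and the @ prefix is how a Python variable is referenced inside the query string.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Actor": ["Sean Connery", "Sean Connery", "Roger Moore"],
        "Director": ["Guy Hamilton", "Terence Young", "Guy Hamilton"],
    }
)
mylist = ["Sean Connery", "Roger Moore"]

connery = df.query('Actor == "Sean Connery"')
both = df.query('Actor == "Sean Connery" and Director == "Guy Hamilton"')
from_list = df.query('Actor in @mylist and Director == "Guy Hamilton"')
```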

29
Q

apply(myfunc)

A

columns = ["Box_Office", "Budget", "Bond_Actor_Salary"]

for col in columns:
    df[col] = df[col].apply(mycon)
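The card does not show what mycon does, so the version below is a hypothetical stand-in that strips a "$" and converts the text to a float.

```python
import pandas as pd

df = pd.DataFrame({"Box_Office": ["$1.5", "$2.0"], "Budget": ["$0.5", "$1.0"]})

def mycon(value):
    # Hypothetical converter: drop the "$" and parse the rest as a float.
    return float(value.replace("$", ""))

# Series.apply calls mycon once per value in the column.
for col in ["Box_Office", "Budget"]:
    df[col] = df[col].apply(mycon)
```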

30
Q

apply method for parsing a row

A

df.apply(good_move, axis=1)

With axis=1, apply moves along the columns: the function receives one row at a time as a Series.
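A sketch of a row-wise function. The rule inside good_move is invented here, since the card does not define it.

```python
import pandas as pd

df = pd.DataFrame(
    {"Box_Office": [500.0, 300.0], "Budget": [100.0, 250.0]},
    index=["Film A", "Film B"],
)

def good_move(row):
    # Hypothetical rule: "good" if the film earned at least 3x its budget.
    return "good" if row["Box_Office"] >= 3 * row["Budget"] else "bad"

# axis=1 passes each row to good_move as a Series keyed by column name.
verdict = df.apply(good_move, axis=1)
```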

31
Q

copy method

A

df2 = df1.copy()

Breaks the shared reference: df2 is an independent copy, so changing it does not change df1.
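The difference between a second name and a real copy, sketched on a two-row table:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
alias = df1       # same object under another name
df2 = df1.copy()  # independent copy (deep by default)

df2.loc[0, "a"] = 99  # only df2 changes
```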

32
Q

read in csv with no header

A

df = pd.read_csv(file_path, usecols=[3, 6], names=['colA', 'colB'], header=None)

33
Q

read csv with range of columns

A

pd.read_csv("GOOG_1min_sample.txt", header=None, usecols=[*range(0, 3)])

The * unpacks the values produced by range. For example, [range(0, 3), 3] holds two elements (a range object and 3), whereas [*range(0, 3), 3] == [0, 1, 2, 3].
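The unpacking can be checked directly. The in-memory CSV below is a made-up stand-in for GOOG_1min_sample.txt.

```python
import io

import pandas as pd

nested = [range(0, 3), 3]  # two elements: a range object and 3
flat = [*range(0, 3), 3]   # four elements: 0, 1, 2, 3

# Keep only the first three of four columns from a headerless CSV.
csv_text = "1,2,3,4\n5,6,7,8\n"
df = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[*range(0, 3)])
```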

34
Q

writing back value to df

A

When df2 is a slice of another df, df2["Col"] = "abc" can raise SettingWithCopyWarning, because df2 may be a copy of that df.
use
df2.loc[:, "local_min"] = "abc"