Python Pandas - Udemy Flashcards
What are Variables?
Names that act as placeholders for values, so the values can be reused later in the code.
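Are the terms list and array the same in Python?
Not strictly. A list is Python's built-in sequence type, while "array" usually refers to a NumPy array. The two words are often used interchangeably in casual speech, and pandas accepts either as input.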
What does len(df) return?
The number of rows in the DataFrame. For a list or array, len() returns the number of elements.
What is a Dictionary?
A data type that stores keys and corresponding values.
A dictionary is represented by { }
What is a Series?
A series is a one dimensional labeled array
How do you convert a list to a series object?
pd.Series(list)
What is the difference between a list and series?
A list can only be indexed by integer position, while the index of a Series can hold any labels you like (strings, dates, and so on).
What does series.values give us?
All the values in the Series as an array
What does series.index give us?
The index of the series
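A minimal sketch of the list-to-Series cards above. The prices and the labels "a", "b", "c" are made-up illustration data, not from the course.

```python
import pandas as pd

prices = [10.5, 20.0, 15.25]                  # plain Python list (hypothetical data)
s = pd.Series(prices, index=["a", "b", "c"])  # custom labels instead of 0, 1, 2

labels = s.index.tolist()    # the index labels
values = s.values.tolist()   # the underlying values
by_label = s["b"]            # look a value up by its label
print(labels, values, by_label)
```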
What do
series.sum()
series.product()
series.mean()
return?
The sum, product, and mean of the values in the Series, respectively.
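What does pd.read_csv(usecols=["abc"], squeeze=True) do?
It reads only the column 'abc' from the CSV file and returns it as a Series instead of a one-column DataFrame. Note: the squeeze parameter was deprecated in pandas 1.4 and removed in 2.0; in current versions call .squeeze("columns") on the result of read_csv instead.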
What does x = df.head() or x = df.tail() do?
head() and tail() return a new object holding the first or last 5 rows (a DataFrame when called on a DataFrame, a Series when called on a Series), so the variable x will contain that new object.
What does dir(s) do? (where 's' is a Series)
Gives you a list of the attributes and methods available on that Series.
What does sorted(series) do?
Python's built-in sorted() returns a new list with all the values of the Series in ascending order
What do
list(series)
dict(series)
do?
list(series) turns the values of the Series into a list
dict(series) turns the Series into a dictionary, with the index labels as keys
What does
series.is_unique
return?
True if all values in the Series are unique, otherwise False
What does
series.sort_values() do?
Sorts the Series in ascending order and returns a brand new Series. You can also chain further methods on the returned Series,
e.g. series.sort_values().head() returns the first 5 values of the sorted Series (the 5 smallest).
What does the inplace=True parameter do?
Applies the change to the existing object instead of returning a new one
What does the statement
"abc" in series
do?
Returns a Boolean by checking for "abc" in the index of the Series. To check for "abc" among the values of the Series you must use:
"abc" in series.values
What does
series[-30:-10]
return?
The values from position -30 up to, but not including, position -10 (positions are counted from the end of the Series).
What is the difference between
len(series) and series.count() ?
len(series) returns the length of the Series, including rows that hold NaN values.
series.count() returns only the number of rows that have values and excludes rows that hold NaN
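A short sketch of that difference, using a made-up Series with two missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])  # hypothetical data

total_rows = len(s)       # counts every row, NaN or not
non_missing = s.count()   # counts only rows that hold a value
print(total_rows, non_missing)
```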
Good to remember:
What are some of the mathematical functions available with series?
series.sum()
series.mean()
series.std()
series.median()
series.describe()
What do
series.idxmax()
series.idxmin()
return?
The index labels of the rows that hold the maximum and minimum values in the Series, respectively.
Nice way of using this is:
series[series.idxmax()]
will return the same value as
series.max()
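As a quick check of that equivalence, here is a sketch with invented ticker labels and prices:

```python
import pandas as pd

# Hypothetical stock prices labelled by ticker
s = pd.Series([120, 340, 95], index=["AAPL", "GOOG", "MSFT"])

top_label = s.idxmax()   # label of the row holding the largest value
low_label = s.idxmin()   # label of the row holding the smallest value
same_as_max = s[top_label] == s.max()
print(top_label, low_label, same_as_max)
```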
What does
series.value_counts()
do?
Returns the number of times each unique value occurs.
series.value_counts().sum()
returns the length of the Series, the same as len(series), provided there are no NaN values (value_counts() drops NaN by default).
Good to remember: value_counts() has an ascending=True/False parameter
What does series.apply()
do?
series.apply() accepts a function as a parameter and then applies that function to all the values in the series.
e.g.
series.apply(lambda stockprice: stockprice + 1)
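What does
series.map()
do?
Performs a VLOOKUP-style substitution: each value in the Series is replaced by the result of looking it up in a mapping, which can be a dictionary, a function, or a second Series (matched against that Series' index). Values with no match become NaN.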
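A small sketch of map() used as a lookup between two Series. The employee names, departments, and floor numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each employee's department ...
dept = pd.Series(["Sales", "Legal", "Sales", "HR"],
                 index=["Ann", "Bob", "Cy", "Di"])
# ... and the floor each department sits on (no entry for HR)
floor = pd.Series({"Sales": 3, "Legal": 5})

# Each department name is looked up in the index of `floor`
mapped = dept.map(floor)
print(mapped)
```

Because "HR" has no entry in the lookup Series, the result for Di is NaN, which also turns the whole result into floats.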
True or False:
The index labels in a pandas Series must be unique
False
What are pandas DataFrames?
DataFrames are 2-dimensional labeled data structures. Two dimensions means you need 2 pieces of information to access a particular value: the row and the column.
A CSV file contains integer values, but when you read it into a DataFrame a column shows up as float. When?
When some of the values in the column are NaN, pandas converts the entire column to floats. The reason is that NaN is itself a floating-point value, and the default integer dtype has no way to represent a missing value.
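The sketch below reproduces the effect with a tiny in-memory CSV (made-up data). The last line uses the nullable "Int64" dtype, which is not mentioned in the card and is shown only as one possible way to keep whole numbers alongside missing values.

```python
import io
import pandas as pd

csv_text = "id,score\n1,10\n2,\n3,30\n"   # hypothetical file: one score is missing
df = pd.read_csv(io.StringIO(csv_text))

id_dtype = str(df["id"].dtype)        # no missing values, stays integer
score_dtype = str(df["score"].dtype)  # the NaN forces the column to float

# Assumption for illustration: the nullable integer dtype keeps the gap as <NA>
nullable = df["score"].astype("Int64")
print(id_dtype, score_dtype, nullable.dtype)
```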
What does
df.info()
return?
Basic info about the dataframe as well as the number of non null values in each column.
What does
df.axes
return?
A list holding the combined result of
df.index and df.columns
What does
df.sum(axis=1)
return?
df.sum(axis=1)
or
df.sum(axis="columns")
returns the horizontal, left-to-right total of each row of the DataFrame
How do you extract a single column 'abc' from a DataFrame df?
df["abc"]
This returns a Series
How do you extract multiple columns from a DataFrame?
df[["abc", "def"]]
or
select = ["abc", "def"]
df[select]
Both of the above return the same resulting DataFrame
How do you insert a new column 'Sport' in a DataFrame?
df["Sport"] = "Basket Ball"
inserts the column Sport at the end of the DataFrame and populates all rows with the value 'Basket Ball'
df.insert(5, column="Sport", value="Basket Ball")
inserts the 'Sport' column at position 5 (counting from 0, so it becomes the sixth column) with the value 'Basket Ball' in all rows
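A minimal sketch of both ways of adding a column, on a made-up two-column DataFrame. The "League" column and its value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob"], "Salary": [50000, 60000]})

df["Sport"] = "Basket Ball"                  # appended as the last column
df.insert(1, column="League", value="NBA")   # placed at position 1 (second column)

column_order = list(df.columns)
print(column_order)
```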
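How do you add 20 to every value in the column 'Salary' of a DataFrame?
df["Salary"].add(20)
or
df["Salary"] + 20
These are broadcast operations: the single number is applied to every value in the column. The same works with the other arithmetic methods and operators. Both forms return a new Series, so assign the result back to df["Salary"] to keep it.
How do you use value_counts() with a DataFrame?
df["abc"].value_counts().head()
Important: select a single column first, so that value_counts() runs on a Series. (Since pandas 1.1 a DataFrame also has a value_counts() method, but it counts unique rows rather than the values of one column.)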
How do you remove rows with null values in a DataFrame?
df.dropna()
By default this removes every row that has at least one NaN value in any column
How do you remove a row from a DataFrame only when a particular column has a NaN value?
df.dropna(subset=["column name", "column name 2"])
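A sketch comparing the two dropna() calls on a small invented DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Team": ["Sales", None, "Legal"],      # Bob has no team
    "Salary": [50000.0, 60000.0, np.nan],  # Cy has no salary
})

no_missing_anywhere = df.dropna()          # drops Bob and Cy
has_team = df.dropna(subset=["Team"])      # drops only Bob
print(len(no_missing_anywhere), len(has_team))
```

Neither call changes df itself; each returns a new DataFrame.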
How do you replace NaN values with a particular value in a column of a DataFrame?
df["abc"] = df["abc"].fillna("Hello")
This fills all NaN values in the 'abc' column with the string "Hello". Assigning the result back is more reliable than calling fillna(inplace=True) on a selected column.
How do you convert the 'Salary' column from float to integer in a DataFrame?
df["Salary"] = df["Salary"].astype("int")
- All NaNs must be removed or replaced first for this method to work
- astype() has no inplace parameter, so you must assign the result back for the change to be permanent
How do you sort a DataFrame?
Sort the whole DataFrame by the values of a particular column:
df.sort_values("Salary")
By default, rows with NaN in that column are placed at the end of the DataFrame.
How do you sort on multiple columns in a DataFrame?
df.sort_values(["col 1", "col 2"], ascending=[True, False])
This sorts the DataFrame first by col 1 and then by col 2: col 1 in ascending order and col 2 in descending order
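A sketch of a two-column sort on made-up data: teams in ascending order, salaries in descending order within each team.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Sales", "Legal", "Sales", "Legal"],
    "Salary": [50, 70, 90, 60],   # hypothetical values
})

out = df.sort_values(["Team", "Salary"], ascending=[True, False])
print(out)
```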
How do you convert a string to a date type?
df["String_Date"] = pd.to_datetime(df["String_Date"])
How do you convert a string type to a category type?
df["Management"] = df["Management"].astype("category")
How do you filter a DataFrame so that only the rows where Gender = 'Male' are returned as a DataFrame?
df[df["Gender"] == "Male"]
or
filter = df["Gender"] == "Male"
df[filter]
How do you filter a DataFrame using more than one condition, e.g.
Gender = Male
Team = Marketing
?
filter1 = df["Gender"] == "Male"
filter2 = df["Team"] == "Marketing"
df[filter1 & filter2]
Write code to filter a DataFrame where Team = 'Legal', 'Marketing' or 'Sales'
filter = df["Team"].isin(["Legal", "Sales", "Marketing"])
df[filter]
You can also pass a Series into the isin() method,
e.g. df["Team"].isin(df2["Team"])
What do the isnull() and notnull() methods do?
isnull() returns True for each value that is NaN, else False.
notnull() returns True for each value that is not NaN, else False.
Write code to filter Salary >= 60,000 and <= 70,000?
df[df["Salary"].between(60000, 70000)]
or
x = df["Salary"] >= 60000
y = df["Salary"] <= 70000
df[x & y]
What does the ~ symbol do?
Applied to a Boolean Series, it inverts every value,
i.e. True becomes False
and False becomes True
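The filtering cards above can be tried together in one sketch. The employee data is invented, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Team": ["Sales", "Marketing", "Legal", "Marketing"],
    "Salary": [55000, 65000, 72000, 60000],
})

is_male = df["Gender"] == "Male"
in_marketing = df["Team"] == "Marketing"

male_marketing = df[is_male & in_marketing]             # both conditions
mid_salary = df[df["Salary"].between(60000, 70000)]     # both ends included
not_marketing = df[~in_marketing]                       # inverted condition
print(male_marketing, mid_salary, not_marketing, sep="\n\n")
```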
What does df["Name"].duplicated() return?
A Boolean Series that is True for every duplicate value in the Name column, except for the first occurrence, which is False.
If there are 4 Toms it returns 1 False and 3 Trues
Remove rows from a DataFrame where Name and Team are duplicated
df.drop_duplicates(subset=["Name", "Team"], keep=False, inplace=True)
Note that keep=False removes every row of a duplicated Name/Team pair, including the first occurrence.
What do unique() and nunique() do?
unique() returns an array of the unique values; NaN is included as one of them.
nunique() returns an integer count of the unique values. NaN is not counted, because the parameter dropna=True is set by default
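A sketch of duplicated(), unique() and nunique() on a made-up column with four Toms and one missing value:

```python
import numpy as np
import pandas as pd

names = pd.Series(["Tom", "Ann", "Tom", "Tom", np.nan, "Tom"])  # hypothetical data

dup_flags = names.duplicated().tolist()   # first Tom is False, later Toms are True
unique_values = names.unique()            # includes NaN as one of the values
unique_count = names.nunique()            # NaN left out by default
print(dup_flags, unique_values, unique_count)
```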
How do you set a particular column as the index of a DataFrame?
df.set_index("Col_name")
To reverse the change:
df.reset_index()
What does df.loc[] do?
Extracts rows using index labels
Extract the rows of a DataFrame at positions 18 through 35?
df.iloc[18:36]
Note: position 36 itself is not returned, because iloc[] excludes the end of a slice
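A small sketch contrasting label-based and position-based slicing, using an invented DataFrame indexed by name. One detail worth seeing in action: a loc[] label slice includes its end label, while an iloc[] slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame({"Salary": [50, 60, 70, 80]},
                  index=["Ann", "Bob", "Cy", "Di"])   # hypothetical data

by_label = df.loc["Bob":"Cy"]   # labels: both ends included
by_position = df.iloc[1:3]      # positions 1 and 2; position 3 excluded
print(by_label, by_position, sep="\n\n")
```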
What is df.ix[]?
A legacy indexer that combined iloc[] and loc[]: it accepted both string labels and integer positions as arguments. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc[] or iloc[] in current code.
Note (behaviour of the old ix[]):
When using labels in ix[] with a range or a list, and one of the labels did not exist in the DataFrame, pandas returned NaN for the missing label.
BUT
When using integer positions in ix[] with a range or a list, and one of the positions did not exist in the DataFrame, pandas raised an error for the entire query.
How do you write a value to a given row and column in a DataFrame using ix[]?
df.ix["James", "Salary"] = 80000
This changes the value in the James row and Salary column to 80000 (in pandas 1.0 and later, write df.loc["James", "Salary"] = 80000 instead)
What does this piece of code do?
filter = df["Team"] == "Marketing"
df.ix[filter, "Team"] = "Online Marketing"
Finds all rows where Team = Marketing and replaces 'Marketing' with 'Online Marketing' in the DataFrame
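Because ix[] no longer exists in pandas 1.0 and later, here is a sketch of the same two assignments written with loc[]. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"Salary": [70000, 60000],
                   "Team": ["Marketing", "Sales"]},
                  index=["James", "Ann"])   # hypothetical data

# Write a single cell: row label, then column label
df.loc["James", "Salary"] = 80000

# Conditional update: a Boolean mask selects the rows, "Team" selects the column
mask = df["Team"] == "Marketing"
df.loc[mask, "Team"] = "Online Marketing"
print(df)
```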
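How do you change the names of columns in a DataFrame?
df.rename(columns={"Team": "Dept", "Salary": "Compensation"}, inplace=True)
rename() accepts a dictionary of old name: new name pairs. Pass it as columns=; a dictionary passed without a keyword is applied to the index labels instead.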
What are the 3 ways to delete a column from a DataFrame?
df.drop("Team", axis=1, inplace=True)
or
df.pop("Team")
This method removes "Team" from the DataFrame and returns the column as a Series.
or
del df["Team"]
How do you extract 5 random rows from your dataset? Also how do you extract 25% of your data set randomly?
df.sample(n=5)
and
df.sample(frac=0.25)
How do you find the 5 highest values in the 'Revenue' column without using a sort method?
df.nlargest(5, "Revenue")
or
df["Revenue"].nlargest(5)
The same syntax can be used for nsmallest() as well
How do you use the string methods on a column of a DataFrame?
All string methods must be called through the .str accessor,
e.g.
df["Name"].str.len()
df["Name"].str.upper()
df["Name"].str.lower()
df["Name"].str.title()
Write code to replace 'Mkt' with 'Marketing' in the 'Team' column?
df["Team"] = df["Team"].str.replace("Mkt", "Marketing")
What do the following methods do?
- str.contains()
- str.startswith()
- str.endswith()
Each returns a Boolean Series that can be used to filter the DataFrame.
df["Name"].str.lower().str.contains("john")
is True for all rows where Name contains 'john', irrespective of position
df["Name"].str.lower().str.startswith("john")
is True for all rows where Name begins with 'john'
df["Name"].str.lower().str.endswith("john")
is True for all rows where Name ends with 'john'
What do
.str.strip()
.str.lstrip()
.str.rstrip()
do?
They remove whitespace from both ends, from the left end only, and from the right end only of each string, respectively
Give an example each of using the string methods on the index and columns of a dataframe?
String methods are called in the same way on the index and columns as well.
eg.
df.index.str.upper()
and
df.columns.str.upper()
How do you extract the last name from the 'Name' column, which holds the last name and first name separated by a comma and a space?
df["Name"].str.split(", ").str.get(0).value_counts().head()
Write code to extract the first name from the 'Name' column of a DataFrame?
df["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0).value_counts().head(10)
Good to remember about str.split():
str.split(expand=True, n=2)
expand=True returns the pieces as the columns of a DataFrame instead of as lists
n sets the maximum number of splits
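A sketch of splitting names with expand=True. The names below are made up, and the "LAST, FIRST MIDDLE" layout is an assumption based on the split patterns used in the cards above.

```python
import pandas as pd

names = pd.Series(["AARON, ELVIA J", "SMITH, JOHN"])   # hypothetical data

# One split at most: column 0 = last name, column 1 = everything after it
parts = names.str.split(", ", n=1, expand=True)
last_names = parts[0]
first_names = parts[1].str.split(" ").str.get(0)   # drop any middle initial
print(parts)
```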
How do you convert a Series into a list and into a DataFrame?
x = df["NM"].tolist()
y = df["NM"].to_frame()
How do you export a DataFrame to a CSV file?
df.to_csv("Tial and Error", index=False, columns=["BRTH_YR", "NM"])
index=False leaves the index out of the file
columns=[] lets you write only certain columns if you so desire
How do you read an Excel file with multiple worksheets?
df = pd.read_excel('C:/Users/SHAWN/Desktop/Python Pandas/Data - Multiple Worksheets.xlsx', sheet_name=None)
The resulting output is a dictionary that maps each sheet name to a DataFrame. (Older pandas versions spelled the parameter sheetname.)
How do you set multiple index levels on a DataFrame?
df.set_index(["Date", "Country"], inplace=True)
OR
You can do it directly while importing the CSV file, like this:
df = pd.read_csv('C:/Users/SHAWN/Desktop/bigmac.csv', index_col=["Date", "Country"])
How do you access the index values of one level in a multi-index DataFrame?
df.index.get_level_values(0)
or
df.index.get_level_values("Date")
How do you change the names of the index levels in a multi-index DataFrame?
df.index = df.index.set_names(["Day", "Location"])
Tip:
If you want the first level's name to stay the same and change only the second, pass the existing name for the first level in the list.
How do you extract a row from a multi-index DataFrame?
df.loc[("2016-10-10", "China")]
For a multi-index, .loc[] accepts a tuple as an argument
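A sketch of the multi-index cards on a tiny invented table (the prices are not real data): building the index, reading one level, pulling a row with a tuple, and renaming the levels.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2016-10-10", "2016-10-10", "2016-11-10"],
    "Country": ["China", "India", "China"],
    "Price": [2.8, 2.4, 2.9],   # hypothetical values
})
df = df.set_index(["Date", "Country"]).sort_index()

countries = df.index.get_level_values("Country").tolist()
row = df.loc[("2016-10-10", "China")]             # tuple: one label per level
renamed = df.index.set_names(["Day", "Location"]) # returns a new index
print(countries, row["Price"], list(renamed.names))
```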
How do you interchange the rows and columns in a dataframe?
df.transpose()
How do you swap the index levels in a multi index dataframe?
df.swaplevel()
What do the stack() and unstack() methods do?
stack() takes the columns and stacks them as rows (the column labels become the innermost index level).
unstack()
takes the innermost index level of the rows and turns it back into columns
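A sketch of a stack() and unstack() round trip on a small made-up table of quarterly figures. Recent pandas releases may print a FutureWarning about a newer stack() implementation; the result for this data is the same either way.

```python
import pandas as pd

df = pd.DataFrame({"Q1": [1, 2], "Q2": [3, 4]},
                  index=["East", "West"])   # hypothetical data

stacked = df.stack()            # Series indexed by (region, quarter)
unstacked = stacked.unstack()   # quarters become columns again
print(stacked, unstacked, sep="\n\n")
```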
How do you use groupby() on a DataFrame to group by department?
Group = df.groupby("Dept.")
groupby() creates a separate GroupBy object. The GroupBy object by itself is meaningless until you call methods on it.
How do you find the number of groups in a GroupBy object called G1?
len(G1)
How do you find the number of rows within each group in G1?
G1.size()
What does G1.groups return, where G1 is a GroupBy object?
A dictionary that maps each group name to the index labels of the rows that fall within that group
How do you extract all the rows from the 'Marketing' department?
G1 = df.groupby("Dept")
G1.get_group("Marketing")
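The GroupBy basics from the last few cards, sketched on a made-up staff list:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Dept": ["Marketing", "Sales", "Marketing", "Legal"],   # hypothetical data
})
G1 = df.groupby("Dept")

n_groups = len(G1)                                  # number of distinct departments
rows_per_group = G1.size()                          # Series: rows in each group
marketing = G1.get_group("Marketing")               # DataFrame with only that group
marketing_labels = list(G1.groups["Marketing"])     # index labels of its rows
print(n_groups, rows_per_group, marketing, sep="\n\n")
```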
What do the following calls do, where 'sectors' is the GroupBy object?
sectors["Revenue"].sum()
sectors["Profits"].max()
sectors["Profits"].min()
sectors["Employees"].mean()
sectors[["Revenue", "Profits"]].sum()
- Returns the sum of the 'Revenue' column for each of the N groups in sectors
- Returns the max of the 'Profits' column for each of the N groups in sectors
- Returns the min of the 'Profits' column for each of the N groups in sectors
- Returns the average of the 'Employees' column for each of the N groups in sectors
- Returns the sums of both selected columns; passing a list of names is how you choose more than one column.
How do you group by multiple columns?
sectors = df.groupby(["Sector", "Industry"])
What are the 2 ways to use .agg() with the GroupBy object?
There are 2 ways to use .agg():
- Passing a dictionary as a parameter
- Passing a list as a parameter
sectors.agg({"Revenue": ["sum", "mean"],
             "Profits": "sum",
             "Employees": "mean"})
and
sectors.agg(["size", "sum", "mean"])
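A sketch of both .agg() styles on a small invented table that reuses the column names from the card. The list form here uses only "sum" and "mean" to keep the output short.

```python
import pandas as pd

df = pd.DataFrame({
    "Sector": ["Tech", "Tech", "Retail"],
    "Revenue": [100, 200, 50],        # hypothetical figures
    "Profits": [10, 30, 5],
    "Employees": [1000, 3000, 500],
})
sectors = df.groupby("Sector")

# Dictionary: choose the aggregation(s) per column
by_dict = sectors.agg({"Revenue": ["sum", "mean"],
                       "Profits": "sum",
                       "Employees": "mean"})

# List: apply every aggregation to every remaining column
by_list = sectors.agg(["sum", "mean"])
print(by_dict, by_list, sep="\n\n")
```

In both results the columns form two levels (column name, aggregation), so a single cell is addressed with a tuple such as ("Revenue", "sum").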
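What does this code accomplish?
fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")
fortune.head(3)
top_rows = []
for sector, data in sectors:
    highest_revenue_company_in_group = data.nlargest(1, "Revenue")
    top_rows.append(highest_revenue_company_in_group)
df = pd.concat(top_rows)
It loops over every sector group, picks the row with the highest 'Revenue' in each, and collects those rows into a new DataFrame df (one top company per sector). pd.concat() is used because DataFrame.append() was removed in pandas 2.0.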