Pandas Flashcards
Dataframes are the pandas equivalent of a Numpy 2D ndarray, with a few key differences
Axis values can have string labels, not just numeric ones.
Dataframes can contain columns with multiple data types: including integer, float, and string.
Read CSV with pandas
import pandas as pd
f500 = pd.read_csv(‘f500.csv’,index_col=0)
DataFrame.shape attribute to assign the shape of f500 to f500_shape.
f500_shape = f500.shape
Use Python’s type() function to assign the type of f500 to f500_type.
f500_type = type(f500)
se the head() method to select the first 6 rows of f500. Assign the result to f500_head.
f500_head = f500.head(6)
Use the tail() method to select the last 8 rows of f500. Assign the result to f500_tail.
f500_tail = f500.tail(8)
Use the DataFrame.info() method to display information about the f500 dataframe.
f500.info()
Select the industry column of f500. Assign the result to the variable name industries.
industries=f500[“industry”]
In order, select the revenues and years_on_global_500_list columns. Assign the result to the variable name revenues_years.
revenues_years=f500[[“revenues”, “years_on_global_500_list”]]
In order, select all columns from ceo up to and including sector. Assign the result to the variable name ceo_to_sector.
ceo_to_sector=f500.loc[:,”ceo”:”sector”]
By selecting data from f500:
Create a new variable toyota, with:
Just the row with index Toyota Motor.
All columns.
toyota=f500.loc[“Toyota Motor”]
By selecting data from f500: Create a new variable, drink_companies, with:
Rows with indicies Anheuser-Busch InBev, Coca-Cola, and Heineken Holding, in that order.
All columns.
drink_companies=f500.loc[[“Anheuser-Busch InBev”, “Coca-Cola”, “Heineken Holding”]]
By selecting data from f500: Create a new variable, middle_companies with All rows with indicies from Tata Motorsto Nationwide, inclusive.
All columns from rank to country, inclusive.
middle_companies=f500.loc[“Tata Motors” : “Nationwide”, “rank”: “country”]
We’ve already saved a selection of data from f500 to a dataframe named f500_sel.
Find the counts of each unique value in the country column in the f500_sel dataframe.
Select the country column in the f500_sel dataframe. Assign it to a variable named countries.
Use the Series.value_counts() method to return the value counts for countries. Assign the results to country_counts.
countries=f500_sel[“country”]
country_counts=countries.value_counts()
print(country_counts)
From the pandas series countries_counts:
Select the item at index label India. Assign the result to the variable name india.
countries = f500['country'] countries_counts = countries.value_counts()
india = countries_counts.loc[“India”]
From the pandas series countries_counts: In order, select the items with index labels USA, Canada, and Mexico. Assign the result to the variable name north_america.
north_america= countries_counts[[“USA”,”Canada”,”Mexico”]]
By selecting data from f500:
Create a new variable big_movers, with:
Rows with indices Aviva, HP, JD.com, and BHP Billiton, in that order.
The rank and previous_rank columns, in that order.
big_movers = f500.loc[[“Aviva”, “HP”, “JD.com”, “BHP Billiton”], [“rank”,”previous_rank”]]
By selecting data from f500:
Create a new variable, bottom_companies with:
All rows with indices from National Gridto AutoNation, inclusive.
The rank, sector, and country columns.
bottom_companies = f500.loc[“National Grid”:”AutoNation”, [“rank”,”sector”,”country”]]
in f500 Subtract the values in the rank column from the values in the previous_rank column. Assign the result to rank_change.
rank_change=f500[“previous_rank”]-f500[“rank”]
Use the Series.max() method to find the maximum value for the rank_change series. Assign the result to the variable rank_change_max.
rank_change = f500["previous_rank"] - f500["rank"] rank_change_max = rank_change.max()
Use the Series.min() method to find the minimum value for the rank_change series. Assign the result to the variable rank_change_min.
rank_change_min = rank_change.min()
Return a series of descriptive statistics for the rank column in f500.
Select the rank column. Assign it to a variable named rank.
Use the Series.describe() method to return a series of statistics for rank. Assign the result to rank_desc.
rank = f500["rank"] rank_desc = rank.describe()
Use Series.value_counts() and Series.loc to return the number of companies with a value of 0 in the previous_rank column in the f500 dataframe. Assign the results to zero_previous_rank.
zero_previous_rank = f500[“previous_rank”].value_counts().loc[0]
Use the DataFrame.max() method to find the maximum value for only the numeric columns from f500 (you may need to check the documentation). Assign the result to the variable max_f500.
max_f500 = f500.max(numeric_only=True)
f500_desc = f500.describe()
Return a dataframe of descriptive statistics for all of the numeric columns in f500. Assign the result to f500_desc.
The company “Dow Chemical” has named a new CEO. Update the value where the row label is Dow Chemical and for the ceo column to Jim Fitterling in the f500 dataframe.
f500.loc[“Dow Chemical”,”ceo”] = “Jim Fitterling”
Create a boolean series, motor_bool, that compares whether the values in the industry column from the f500 dataframe are equal to “Motor Vehicles and Parts”
motor_bool = f500[“industry”] == “Motor Vehicles and Parts”
Use the motor_bool boolean series to index the country column. Assign the result to motor_countries.
motor_countries = f500.loc[motor_bool, “country”]
Use boolean indexing to update values in the previous_rank column of the f500 dataframe:
There should now be a value of np.nan where there previously was a value of 0.
import numpy as np
prev_rank_before = f500[“previous_rank”].value_counts(dropna=False).head()
f500.loc[f500[“previous_rank”] == 0, “previous_rank”] = np.nan
Create a new pandas series, prev_rank_after, using the same syntax that was used to create the prev_rank_before series.
prev_rank_after = f500[“previous_rank”].value_counts(dropna=False).head()
Add a new column named rank_change to the f500 dataframe by subtracting the values in the rank column from the values in the previous_rank column.
f500["rank_change"] = f500["previous_rank"] - f500["rank"] rank_change_desc = f500["rank_change"].describe()
industry_usa = f500[“industry”][f500[“country”] == “USA”].value_counts().head(2)
sector_china = f500[“sector”][f500[“country”] == “China”].value_counts().head(3)