pandas lesson 1 Flashcards
Read in excel file ‘movies’ and look at the first five columns from the sheet 1900s.
import pandas as pd
excel_df = pd.read_excel(‘C:/Users/User/Documents/CFG_DATA/Data_files/movies.xls’, sheet_name = ‘1900s’)
excel_df.head(5)
Show me a sample set of 5 rows from the excel file ‘movies’. Only select the following columns: [‘Title’, ‘Year’, ‘Duration’].
columns_to_select = [‘Title’, ‘Year’, ‘Duration’]
excel_df[columns_to_select].sample(n=5)
Filter the table to just the following columns and show the first 5 rows¶
[‘Title’, ‘Year’, ‘Genres’, ‘Language’, ‘Country’, ‘Content Rating’, ‘Budget’, ‘IMDB Score’]
test_df = excel_df[[‘Title’, ‘Year’, ‘Genres’, ‘Language’, ‘Country’, ‘Content Rating’, ‘Budget’, ‘IMDB Score’]]
test_df.head(5)
Check if the year 1920 is true or false
test_df[‘Year’] == 1920
Filter the test_df to rows where the IMDB Score is greater than 5
test_df[test_df[‘IMDB Score’] > 5]
Explain why two sets of square brackets are needed when filtering columns in pandas
The first set of square brackets ([]) contains a boolean expression that evaluates to a series of True or False values. The expression test_df[‘IMDB Score’] > 5 returns a series of True or False values depending on whether the ‘IMDB Score’ in each row is greater than 5 or not.
The second set of square brackets ([]) is used to select the rows of the dataframe where the boolean expression in the first set of square brackets evaluates to True. In other words, it selects all the rows where the ‘IMDB Score’ is greater than 5.
The reason why we need two sets of square brackets is because the first set returns a boolean series, which cannot be used to index a dataframe directly. However, by passing the boolean series inside another set of square brackets, we can use it to select only the rows of the dataframe where the condition is True.
Filter the test_df to films from the USA
test_df[test_df[‘Country’] == ‘USA’]
reset the test_df so that you only show the columns [‘Title’, ‘Year’, ‘Genres’, ‘Language’, ‘Country’, ‘Content Rating’, ‘Budget’, ‘IMDB Score’]. Then create three conditions where c1 country is usa, c2 country is uk and c3 country is germany. create a test_df where countries are from usa or UK.
test_df = excel_df[[‘Title’, ‘Year’, ‘Genres’, ‘Language’, ‘Country’, ‘Content Rating’, ‘Budget’, ‘IMDB Score’]]
condition_1 = (test_df[‘Country’] == ‘USA’)
condition_2 = (test_df[‘Country’] == ‘UK’)
condition_3 = (test_df[‘Country’] == ‘Germany’)
test_df= condition_1 | condition_2
print(test_df)
Filter the test_df to films which were made in the 1920s
test_df[(test_df[‘Year’] >= 1920) & (test_df[‘Year’] <=1929)]
what function can i use to get a numeric sense of the data
test_df.describe()
How can I find how many rows and columns we have
.shape
How do I find the minimum, maximum, mean
.mean()
.min()
.max()
What code do you write if you want to specify that the CSV file that you are reading the file contains column headers.
header = 0
What does parse_dates=True mean?
The date column should be parsed as a date-time object, so it will not be treated as a string column.