Lesson 7: NumPy and Pandas Analysis Flashcards
Create an array of 10 zeros and ensure they are integers.
np.zeros(10, dtype='int')
Create a matrix with 3 rows and 5 columns in which every element is 5.45.
np.full((3,5),5.45)
Create an array of 5 evenly spaced numbers between 0 and 2 (both endpoints included).
np.linspace(0, 2, 5)
Create a 3x3 array of random numbers drawn from a normal distribution with mean 0 and standard deviation 1.
np.random.normal(0, 1, (3,3))
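The array-creation cards above can be checked together in one short script. This is a minimal sketch; the variable names are illustrative and not part of the lesson.

```python
import numpy as np

zeros = np.zeros(10, dtype='int')        # ten zeros stored as integers
filled = np.full((3, 5), 5.45)           # 3 rows, 5 columns, every element 5.45
spaced = np.linspace(0, 2, 5)            # 5 evenly spaced values, endpoints included
normal = np.random.normal(0, 1, (3, 3))  # 3x3 draws with mean 0, std 1 (values vary per run)

print(zeros.dtype, filled.shape, spaced, normal.shape)
```

Note that np.linspace includes the stop value by default, which is why 2.0 appears as the last element.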
Combine the following arrays: x = np.array([1, 2, 3]), y = np.array([3, 2, 1]), z = [21, 21, 21].
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [21,21,21]
np.concatenate([x, y,z])
Concatenate the array grid = np.array([[1,2,3],[4,5,6]]) with itself.
grid = np.array([[1,2,3],[4,5,6]])
np.concatenate([grid,grid])
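For a 2D array, np.concatenate joins along axis 0 (rows) unless told otherwise. The sketch below contrasts the default with axis=1; the result names are illustrative.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

stacked_rows = np.concatenate([grid, grid])          # default axis=0: rows are appended
stacked_cols = np.concatenate([grid, grid], axis=1)  # axis=1: columns are appended

print(stacked_rows.shape)  # (4, 3)
print(stacked_cols.shape)  # (2, 6)
```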
Create a dataframe from a dictionary with the columns Fruit and Items. The values for Fruit are ['Peach', 'Apple', 'Pear', 'Plum', 'Kiwi'] and the values for Items are [121, 40, 100, 130, 11].
data = pd.DataFrame({'Fruit': ['Peach', 'Apple', 'Pear', 'Plum', 'Kiwi'],
                     'Items': [121, 40, 100, 130, 11]})
How do you get summary information on the dataset (column names, dtypes, non-null counts, memory usage)?
data.info()
Make a dataframe with the column names group and kg. Group values: 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'. kg values: 4, 3, 12, 6, 7.5, 8, 3, 5, 6.
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 'kg': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
Sort the values in the data df by kg in ascending order, modifying the original df.
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 'kg': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data.sort_values(by=['kg'], ascending=True, inplace=True)
Sort data by multiple columns: group in ascending order and kg in descending order. Make sure you do not modify the original dataset.
data.sort_values(by=['group', 'kg'], ascending=[True, False], inplace=False)
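A small runnable sketch of the multi-column sort, using the group/kg dataframe from the earlier card. The name sorted_copy is illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'kg': [4, 3, 12, 6, 7.5, 8, 3, 5, 6],
})

# group ascending, then kg descending within each group; returns a new dataframe
sorted_copy = data.sort_values(by=['group', 'kg'], ascending=[True, False])

print(sorted_copy)
print(data['kg'].tolist())  # original order is unchanged
```

Because inplace defaults to False, data keeps its original row order and only sorted_copy is reordered.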
Remove duplicate rows from data.
data = pd.DataFrame({'names': ['Mila'] * 3 + ['Igor'] * 4, 'Age': [3, 2, 1, 3, 3, 4, 4]})
data.drop_duplicates()
Remove rows with duplicate values in the names column.
data = pd.DataFrame({'names': ['Mila'] * 3 + ['Igor'] * 4, 'Age': [3, 2, 1, 3, 3, 4, 4]})
data.drop_duplicates(subset='names')
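The two drop_duplicates cards differ in what counts as a duplicate: the whole row, or only the names column. A minimal sketch comparing them (result names are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'names': ['Mila'] * 3 + ['Igor'] * 4,
    'Age': [3, 2, 1, 3, 3, 4, 4],
})

full_dedup = data.drop_duplicates()             # drops rows identical in every column
by_name = data.drop_duplicates(subset='names')  # keeps the first row for each name

print(full_dedup.index.tolist())  # [0, 1, 2, 3, 5]
print(by_name.index.tolist())     # [0, 3]
```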
For the farm shop df (data), create a new column animal2 that holds the result of applying the meat_to_animal mapping to the food column. Convert the food values to lowercase first.
data['animal2'] = data['food'].map(str.lower).map(meat_to_animal)
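The card above does not show the farm shop data or meat_to_animal, so the sketch below invents both purely for illustration: the food values and the dictionary contents are hypothetical, and meat_to_animal is assumed to be a plain dict.

```python
import pandas as pd

# Hypothetical farm shop data; the real lesson dataset is not shown in these cards
data = pd.DataFrame({'food': ['bacon', 'Pastrami', 'Honey Ham'], 'kg': [4, 3, 12]})

# Hypothetical mapping with lowercase keys
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'honey ham': 'pig'}

# Lowercase first so mixed-case entries match the dictionary keys
data['animal2'] = data['food'].map(str.lower).map(meat_to_animal)

print(data)
```

Any food value missing from the dictionary would map to NaN rather than raise an error.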
Remove the animal2 column from the dataset, modifying the original df.
data.drop('animal2', axis='columns', inplace=True)
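Make a new column called new_variable, equal to kg multiplied by 10, using assign. (assign returns a new dataframe and leaves data unchanged.)
data.assign(new_variable=data['kg'] * 10)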
Make a dataframe that has the values 0-11 in a matrix of 3 rows and 4 columns. Use the index names ['London', 'Manchester', 'Brighton'] and the column names ['one', 'two', 'three', 'four'].
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['London', 'Manchester', 'Brighton'],
                    columns=['one', 'two', 'three', 'four'])
In the dataframe data, rename the index label Manchester to Cardiff and the columns one to one_p and two to two_p. Make sure to change the original df.
data.rename(index={'Manchester': 'Cardiff'}, columns={'one': 'one_p', 'two': 'two_p'}, inplace=True)
Convert the index labels to upper case and the column names to title case.
data.rename(index=str.upper, columns=str.title, inplace=True)
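The two rename cards can be chained. The sketch below uses the non-inplace form so each step is visible; renamed and styled are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['London', 'Manchester', 'Brighton'],
                    columns=['one', 'two', 'three', 'four'])

# Step 1: rename specific labels with dictionaries
renamed = data.rename(index={'Manchester': 'Cardiff'},
                      columns={'one': 'one_p', 'two': 'two_p'})

# Step 2: apply a function to every label
styled = renamed.rename(index=str.upper, columns=str.title)

print(styled.index.tolist())    # ['LONDON', 'CARDIFF', 'BRIGHTON']
print(styled.columns.tolist())  # ['One_P', 'Two_P', 'Three', 'Four']
```

str.title capitalizes the letter after the underscore as well, so one_p becomes One_P.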
Group the variable ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32] into categories using bins = [18, 25, 35, 60, 100].
categories = pd.cut(ages, bins)
Make each bin include its left edge and exclude its right edge.
pd.cut(ages, bins, right=False)
Count how many observations fall into each bin (the frequency per bin). Do this for the categories variable.
pd.Series(categories).value_counts()
Give each category a unique name, then count how many observations fall into each bin. Use bin_names = ['Youth', 'Early 20s', 'Middle Age', 'Senior'].
bin_names = ['Youth', 'Early 20s', 'Middle Age', 'Senior']
new_cats = pd.cut(ages, bins, labels=bin_names)
pd.Series(new_cats).value_counts()
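A runnable sketch of the binning cards, showing how the counts shift when the bins are closed on the left instead of the right. It wraps the result in pd.Series before counting, which is one way to do it; counts and left_counts are illustrative names.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
bin_names = ['Youth', 'Early 20s', 'Middle Age', 'Senior']

# Default right=True: intervals are (18, 25], (25, 35], (35, 60], (60, 100]
new_cats = pd.cut(ages, bins, labels=bin_names)
counts = pd.Series(new_cats).value_counts()

# right=False: intervals are [18, 25), [25, 35), [35, 60), [60, 100)
left_closed = pd.cut(ages, bins, right=False)
left_counts = pd.Series(left_closed).value_counts().sort_index()

print(counts)
print(left_counts)
```

The age 25 sits exactly on a bin edge, so it moves from the first bin to the second when right=False.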
Create a date range called dates starting from 20210701 with a length of 7 periods. Then create a pandas DataFrame with 7 rows and 4 columns of random values drawn from a standard normal distribution. Set the row index to the dates variable and label the columns 'A', 'B', 'C', and 'D'.
dates = pd.date_range('20210701', periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list('ABCD'))
df
Get the first 3 rows from the df
df[:3]
Slice df based on the date range 20210703 to 20210705.
df['20210703':'20210705']
Slice df on the column names A and B.
df.loc[:, ['A', 'B']]
Slice df based on the dates 20210703 to 20210705 and the column names A and B.
df.loc['20210703':'20210705', ['A', 'B']]
Select the row at integer position 2 (the third row, since positions start at 0).
df.iloc[2]
Return a range of rows by integer position: the rows at positions 2 and 3 (the stop position 4 is excluded) for the first two columns.
df.iloc[2:4, 0:2]
Return specific rows (second and sixth row) and columns (first and third) using lists containing columns or row indexes.
df.iloc[[1,5],[0,2]]
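The slicing cards mix label-based (loc) and position-based (iloc) selection. The sketch below replaces the random values with np.arange(28) so the results are predictable; that substitution and the result names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20210701', periods=7)
# Deterministic values 0..27 instead of random numbers, so the output can be checked
df = pd.DataFrame(np.arange(28).reshape(7, 4), index=dates, columns=list('ABCD'))

by_label = df.loc['20210703':'20210705', ['A', 'B']]  # label slice: end date is included
by_position = df.iloc[2:4, 0:2]                       # position slice: stop is excluded
by_lists = df.iloc[[1, 5], [0, 2]]                    # explicit row and column positions

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
print(by_lists.values.tolist())
```

The main difference to remember is that loc slices include the end label, while iloc slices follow Python's usual exclusive stop.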
Copy the dataframe df to a new dataframe named df2, then add a new column E to df2.
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three', 'two']
Select rows based on column values: return the rows of df2 where column E contains two or four.
df2[df2['E'].isin(['two', 'four'])]
Select all rows of df2 except those where column E contains two or four.
df2[~df2['E'].isin(['two', 'four'])]
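A minimal sketch of isin filtering and its negation. The column A here is a hypothetical stand-in for the random columns of df; only column E matters for the filter.

```python
import pandas as pd

# Hypothetical stand-in for df2: column A replaces the random A-D columns
df2 = pd.DataFrame({
    'A': range(7),
    'E': ['one', 'one', 'two', 'three', 'four', 'three', 'two'],
})

mask = df2['E'].isin(['two', 'four'])  # boolean Series, True where E is two or four
matches = df2[mask]                    # rows that match
others = df2[~mask]                    # ~ inverts the mask, keeping everything else

print(matches.index.tolist())  # [2, 4, 6]
print(others.index.tolist())   # [0, 1, 3, 5]
```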
Make a series of 40 random integers from 1 to 9 (np.random.randint excludes the upper bound, 10). Then make a dataframe from this series, reshaped to 8 rows and 5 columns.
ser = pd.Series(np.random.randint(1, 10, 40))
df = pd.DataFrame(ser.values.reshape(8,5))
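A quick check of the card above: the reshape works because 8 x 5 equals the 40 values in the series, and the largest value that can appear is 9.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.randint(1, 10, 40))  # 40 integers in [1, 10), i.e. 1 through 9
df = pd.DataFrame(ser.values.reshape(8, 5))    # 8 rows x 5 columns = 40 values

print(df.shape)  # (8, 5)
print(df.values.min(), df.values.max())  # always within 1..9; exact values vary per run
```

To include 10 as a possible value, the upper bound passed to randint would need to be 11.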
Create a dataframe with two columns, Name and Age, where the values for the names and ages are:
names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]
names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]
# Create DataFrame
df = pd.DataFrame({'Name': names, 'Age': ages})