IMPORT relevant MODULES
DO THIS to ensure all modules used are imported, so that the code runs and the examiner has no issues when running it
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set_style("darkgrid")
plt.rcParams["figure.figsize"] = (6, 2.5)
What happens if the code does not run?
YOUR WORK WILL BE PENALISED IF YOUR CODE DOES NOT RUN; SO IMPORT THE RIGHT MODULES
What is the marking criteria?
| Use appropriate Pandas / Matplotlib / Seaborn functionality for data visualisation |
What should be the first step when starting with the data?
NOTE: The first step should not be immediately calculating descriptive statistics or visualising the data.
How do you load the data and what is important to do when loading the data?
USE THE RELATIVE PATH to make it work:
- Ensure that it opens properly
- Do not necessarily use additional arguments yet
salaries = pd.read_csv('data/salaries_aggregated.csv')
response_rates = pd.read_csv('data/response_rates_aggregated.csv')
How can you have a quick look at the data before wrangling?
You can use the name of the DataFrame variable on its own, or .head(), .tail(), .sample() as well.
Reminder: Please ensure all outputs of code are displayed in the notebook.
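A minimal sketch of these quick-look calls, using a small hypothetical frame in place of the real CSV (the values below are made up):

```python
import pandas as pd

# Minimal sketch: a toy frame stands in for the salaries data, since the
# real CSV is not reproduced here (values are hypothetical).
salaries = pd.DataFrame({
    "Department": ["Economics", "Law", "Sociology", "All departments"],
    "Level of study": ["UG", "PG", "All", "All"],
    "Average salary": [32000, 35000, 28000, 31000],
})

print(salaries.head(2))                    # first 2 rows
print(salaries.tail(2))                    # last 2 rows
print(salaries.sample(2, random_state=0))  # 2 random rows, seeded for reproducibility
```

In a notebook, evaluating `salaries` on its own line also displays the frame, which satisfies the reminder about showing outputs.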
SEE IF THE AGGREGATED INFORMATION IS ALREADY AVAILABLE OR NOT
- Just aggregating the information yourself may be WRONG
What is a common mistake with looking at the data?
Students may use salaries.groupby('column1')['Average salary'].mean()
*NOTE: From previous years:
1. Aggregated data rather than original data
–> Provided summary statistics about central tendency (e.g. mean and median) but not dispersion (e.g. range or standard deviation) or other aspects of the distribution
How can we get more information from the dataset?
.info()
salaries.info()
NOTE: It seems fine for the first 3 columns to have a non-quantitative type. The last 3 are currently float because of NA values.
How can you check the data types/check for missing data?
.dtypes (an attribute, not a method) and .isna()
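A short sketch of both checks on a toy frame (hypothetical values standing in for the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the salaries data (hypothetical values).
salaries = pd.DataFrame({
    "Department": ["Economics", "Law", "Sociology"],
    "Average salary": [32000.0, np.nan, 28000.0],
})

print(salaries.dtypes)        # note: .dtypes is an attribute, not a method
print(salaries.isna().sum())  # number of missing values per column
```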
What is important following looking at the data?
REMINDER: Use the markdown cell for text answers.
How can you get more descriptive statistics for the data?
salaries.describe(include='all')
How can you also understand the range of values to spot any errors?
salaries['Median salary'].sort_values().unique()
EXAMPLE:
The values do not seem to be rounded to a multiple of 5 for data-protection purposes, as we see values like 30034.
How can you count the different levels of study in the column to understand distribution?
salaries.value_counts('Level of study')
salaries.value_counts('Graduating year')
OUT:
Level of study
All 75
PG 75
UG 50
Name: count, dtype: int64
#We see many fewer UG than PG rows.
OUT:
Graduating year
2018/19 67
All 67
2017/18 66
Name: count, dtype: int64
# We can also see we have slightly fewer observations for 2017/18.
How do you create variables for the different parts of the data?
pg_depts = pd.Series(salaries.loc[salaries['Level of study'] == 'PG',
'Department'].unique())
ug_depts = pd.Series(salaries.loc[salaries['Level of study'] == 'UG',
'Department'].unique())
pg_only_depts = pg_depts[~pg_depts.isin(ug_depts)]
pg_only_depts
ug_depts[~ug_depts.isin(pg_depts)] # no department with only UG in the data
NOTE: There were 8 departments with PG graduates but no UG graduates.
How can you check for missing data and what should be considered?
Try to understand more what is REALLY missing:
* It seems that even if the information is missing at a more detailed level, the "aggregate" rows may still have the information.
EXAMPLE 1:
salaries[salaries.isna().any(axis=1)].sort_values(['Department', 'Level of study',
'Graduating year'])
EXAMPLE 2:
salaries[(salaries['Department'] == 'Sociology')].sort_values(['Graduating year',
'Level of study'])
NOTE: Given that the minimum of "Number of graduates paid in £" is 10, missing values may be due to a low number of observations for the corresponding cohort of students.
How can you handle missing data?
While one may try to deduce the missing data (e.g. use the "All" rows minus the "PG" rows to get "UG" values), the results are likely to be imprecise due to small sample sizes, accumulated rounding error, etc.
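A hypothetical illustration of the deduction idea (all numbers made up), and why the result should be treated as approximate:

```python
# Hypothetical numbers illustrating the deduction and why it is only approximate.
all_n, all_avg = 100, 31000  # 'All' row: number of graduates and (rounded) average salary
pg_n, pg_avg = 60, 33000     # 'PG' row

ug_n = all_n - pg_n                                # deduced UG count
ug_avg = (all_n * all_avg - pg_n * pg_avg) / ug_n  # deduced UG average salary
print(ug_n, ug_avg)  # 40 28000.0
# Any rounding already present in all_avg and pg_avg is amplified when the
# difference is divided by the (possibly small) UG count, so ug_avg is approximate.
```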
How can we help quantify how representative the data is?
Use the Response rate
Here we take a look at response rate:
response_rates
OBSERVATION:
* The first few rows of the table give us some idea about how representative the data is: for 2017/18 and 2018/19 combined, for all students, the response rate is 46%, with (rounded) students having filled in the survey.
* This also prompts the question of why we have a much smaller number in "Number of graduates paid in £" from salaries: it is (rounded) for 2017/18 and 2018/19 combined, for all students.
How can we answer the question: Is salary different for different departments? What are the observations and limitations?
CODE:
salary_per_dept = salaries.loc[(salaries['Graduating year'] == 'All') &
                               (salaries['Level of study'] == 'All')]
df_to_plot = salary_per_dept.loc[salaries['Department'] != 'All departments',
                                 ['Department', 'Average salary']]
ax = df_to_plot.sort_values('Average salary', ascending=False).set_index('Department').\
    plot.bar(legend=False, ylabel='Average salary (£)',
             title='Average salary per department (2017/18 and 2018/19, UG and PG combined)')
overall_avg_salary = salary_per_dept.\
    loc[salary_per_dept['Department'] == 'All departments', 'Average salary'].values[0]
plt.axhline(overall_avg_salary, color='red', lw=0.5)
plt.text(12, overall_avg_salary + 1000,
         f'School average salary: £{round(overall_avg_salary)}')
plt.text(df_to_plot.shape[0] - 1, 0, 'x', ha='center', va='center');
ax.figure
OBSERVATIONS:
“Quantitative” and/or “Business”-related subjects all have a higher-than-average average salary
Two of the highest average salary departments have only PG but no UG students
Law is also among the high average salary departments
LIMITATIONS:
Other factors like levels of study are not taken into account
Use of average salary
Representativeness of the data. For example:
Graduates who did not respond to the survey
Graduates not working in a job paid in £
Missing data may cause bias.
How does the initial exploration help and what should you look out for?
What is wrong with this:
salaries.groupby('Department')[['Average salary']].mean().head()
This takes an unweighted mean over already-aggregated rows, so the values are not calculated correctly; for this dataset, though, they will not be off by much.
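A hypothetical illustration of the problem: groupby(...).mean() takes an unweighted mean of the already-aggregated rows, while the true department average weights each row by its number of graduates (all numbers below are made up):

```python
import pandas as pd

# Hypothetical rows for one department, showing that an unweighted mean of
# already-aggregated rows ignores group sizes.
rows = pd.DataFrame({
    "Level of study": ["UG", "PG"],
    "Number of graduates": [90, 10],
    "Average salary": [30000, 40000],
})

unweighted = rows["Average salary"].mean()  # what groupby(...).mean() effectively does
weighted = ((rows["Average salary"] * rows["Number of graduates"]).sum()
            / rows["Number of graduates"].sum())
print(unweighted, weighted)  # 35000.0 31000.0
```

The gap between the two is large here because the toy group sizes are very unequal; in the real dataset the imbalance is smaller, which is why the error is modest.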
How do you calculate the median?
df_to_plot = salary_per_dept.loc[salaries['Department'] != 'All departments',
                                 ['Department', 'Median salary']]
ax = df_to_plot.sort_values('Median salary', ascending=False).set_index('Department') \
    .plot.bar(legend=False, ylabel='Median salary (£)',
              title='Median salary per department (2017/18 and 2018/19, UG and PG combined)')
overall_median_salary = salary_per_dept.\
    loc[salary_per_dept['Department'] == 'All departments', 'Median salary'].values[0]
plt.axhline(overall_median_salary, color='red', lw=0.5)
plt.text(12, overall_median_salary + 1000,
         f'School median salary: £{round(overall_median_salary)}')
plt.text(df_to_plot.shape[0] - 1, 0, 'x', ha='center', va='center');
ax.figure
How may the median help?
It may give us some idea about:
* The skewness or outliers of the data
* Whether we are likely to get similar results with the use of average and median salary
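A small illustration of the first point, on hypothetical numbers: in right-skewed data, a single high value pulls the mean well above the median.

```python
import pandas as pd

# Hypothetical right-skewed salaries: one high earner pulls the mean above the median.
s = pd.Series([25000, 26000, 27000, 28000, 90000])
print(s.mean(), s.median())  # 39200.0 27000.0
```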
What else could you consider with the level of study?
What can be improved:
salaries.groupby('Department')['Median salary'].mean().plot.bar();
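One possible improvement, sketched on a toy frame (the column names match the notes, but the values are made up): keep the pre-aggregated UG and PG rows instead of averaging them together, and plot the two levels of study side by side.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs headless
import pandas as pd

# Toy frame with the same column names as the notes (values are made up).
salaries = pd.DataFrame({
    "Department": ["Economics", "Economics", "Law", "Law"],
    "Level of study": ["UG", "PG", "UG", "PG"],
    "Graduating year": ["All"] * 4,
    "Median salary": [30000, 34000, 31000, 36000],
})

# Keep the pre-aggregated UG and PG rows instead of averaging them together,
# then pivot so each level of study gets its own set of bars.
subset = salaries[(salaries["Graduating year"] == "All") &
                  (salaries["Level of study"].isin(["UG", "PG"]))]
pivoted = subset.pivot(index="Department", columns="Level of study",
                       values="Median salary")
ax = pivoted.plot.bar(ylabel="Median salary (£)",
                      title="Median salary per department by level of study")
```

This avoids re-aggregating values that the dataset already provides, and makes the UG/PG comparison the point of the chart rather than a hidden source of error.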