IMPORT relevant MODULES
DO THIS to ensure all modules used are imported, so that the code runs and the examiner has no issues when running it
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set_style("darkgrid")
plt.rcParams["figure.figsize"] = (6, 2.5)
What happens if the code does not run?
YOUR WORK WILL BE PENALISED IF YOUR CODE DOES NOT RUN; SO IMPORT THE RIGHT MODULES
What is the marking criteria?
| Use appropriate Pandas / Matplotlib / Seaborn functionality for data visualisation |
What should be the first step when starting with the data?
NOTE: The first step should not be immediately calculating descriptive statistics or visualising the data.
How do you load the data and what is important to do when loading the data?
USE THE RELATIVE PATH to make it work:
- Ensure that it opens properly
- Do not necessarily use additional arguments yet
salaries = pd.read_csv('data/salaries_aggregated.csv')
response_rates = pd.read_csv('data/response_rates_aggregated.csv')
How can you have a quick look at the data before wrangling?
You can use the name of the DataFrame variable on its own, or .head(), .tail(), .sample() as well.
Reminder: Please ensure all outputs of code are displayed in the notebook.
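A minimal sketch of these quick-look calls, using a small hypothetical frame in place of the real CSV (the values below are made up):

```python
import pandas as pd

# Minimal sketch: a toy frame stands in for the salaries data, since the
# real CSV is not reproduced here (values are hypothetical).
salaries = pd.DataFrame({
    "Department": ["Economics", "Law", "Sociology", "All departments"],
    "Level of study": ["UG", "PG", "All", "All"],
    "Average salary": [32000, 35000, 28000, 31000],
})

print(salaries.head(2))                    # first 2 rows
print(salaries.tail(2))                    # last 2 rows
print(salaries.sample(2, random_state=0))  # 2 random rows, seeded for reproducibility
```

In a notebook, evaluating `salaries` on its own line also displays the frame, which satisfies the reminder about showing outputs.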
SEE IF THE AGGREGATED INFORMATION IS ALREADY AVAILABLE OR NOT
- Just aggregating the information yourself may be WRONG
What is a common mistake with looking at the data?
Students may use salaries.groupby('column1')['Average salary'].mean()
*NOTE: From previous years:
1. Aggregated data rather than original data
–> Provided summary statistics about central tendency (e.g. mean and median) but not dispersion (e.g. range or standard deviation) or other aspects of the distribution
How can we get more information from the dataset?
.info()
salaries.info()
NOTE: It seems fine for the first 3 columns to have a non-quantitative type. The last 3 are currently float because of NA values.
How can you check the data types/check for missing data?
.dtypes (an attribute, not a method) and .isna()
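A short sketch of both checks on a toy frame (hypothetical values standing in for the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the salaries data (hypothetical values).
salaries = pd.DataFrame({
    "Department": ["Economics", "Law", "Sociology"],
    "Average salary": [32000.0, np.nan, 28000.0],
})

print(salaries.dtypes)        # note: .dtypes is an attribute, not a method
print(salaries.isna().sum())  # number of missing values per column
```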
What is important following looking at the data?
REMINDER: Use the markdown cell for text answers.
How can you get more descriptive statistics for the data?
salaries.describe(include='all')
How can you also understand the range of values to spot any errors?
salaries['Median salary'].sort_values().unique()
EXAMPLE:
The values do not seem to be rounded to a multiple of 5 for data-protection purposes, as we see values like 30034.
How can you count the different levels of study in the column to understand distribution?
salaries.value_counts('Level of study')
salaries.value_counts('Graduating year')
OUT:
Level of study
All 75
PG 75
UG 50
Name: count, dtype: int64
#We see many fewer UG than PG rows.
OUT:
Graduating year
2018/19 67
All 67
2017/18 66
Name: count, dtype: int64
# We can also see we have slightly fewer observations for 2017/18.
How do you create variables for the different parts of the data?
pg_depts = pd.Series(salaries.loc[salaries['Level of study'] == 'PG',
'Department'].unique())
ug_depts = pd.Series(salaries.loc[salaries['Level of study'] == 'UG',
'Department'].unique())
pg_only_depts = pg_depts[~pg_depts.isin(ug_depts)]
pg_only_depts
ug_depts[~ug_depts.isin(pg_depts)] # no department with only UG in the data
NOTE: There were 8 departments with PG graduates but no UG graduates.
How can you check for missing data and what should be considered?
Try to understand more what is REALLY missing:
* It seems that even if the information is missing at a more detailed level, the "aggregate" rows may still have the information.
EXAMPLE 1:
salaries[salaries.isna().any(axis=1)].sort_values(['Department', 'Level of study',
'Graduating year'])
EXAMPLE 2:
salaries[(salaries['Department'] == 'Sociology')].sort_values(['Graduating year',
'Level of study'])
NOTE: Given that the minimum of "Number of graduates paid in £" is 10, missing values may be due to a low number of observations for the corresponding cohort of students.
How can you handle missing data?
While one may try to deduce the missing data (e.g. use the "All" rows minus the "PG" rows to get "UG" values), the results are likely to be imprecise due to small sample sizes, accumulated rounding error, etc.
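A hypothetical illustration of the deduction idea (all numbers made up), and why the result should be treated as approximate:

```python
# Hypothetical numbers illustrating the deduction and why it is only approximate.
all_n, all_avg = 100, 31000  # 'All' row: number of graduates and (rounded) average salary
pg_n, pg_avg = 60, 33000     # 'PG' row

ug_n = all_n - pg_n                                # deduced UG count
ug_avg = (all_n * all_avg - pg_n * pg_avg) / ug_n  # deduced UG average salary
print(ug_n, ug_avg)  # 40 28000.0
# Any rounding already present in all_avg and pg_avg is amplified when the
# difference is divided by the (possibly small) UG count, so ug_avg is approximate.
```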
How can we help quantify how representative the data is?
Use the Response rate
Here we take a look at response rate:
response_rates
OBSERVATION:
* The first few rows of the table give us some idea about how representative the data is: for 2017/18 and 2018/19 combined, for all students, the response rate is 46%, with (rounded) students having filled in the survey.
* This also prompts the question of why we have a much smaller number in "Number of graduates paid in £" from salaries: it is (rounded) for 2017/18 and 2018/19 combined, for all students.
How can we answer the question: Is salary different for different departments? What are the observations and limitations?
CODE:
salary_per_dept = salaries.loc[(salaries['Graduating year'] == 'All') &
                               (salaries['Level of study'] == 'All')]
df_to_plot = salary_per_dept.loc[salaries['Department'] != 'All departments',
                                 ['Department', 'Average salary']]
ax = df_to_plot.sort_values('Average salary', ascending=False).set_index('Department').\
    plot.bar(legend=False, ylabel='Average salary (£)',
             title='Average salary per department (2017/18 and 2018/19, UG and PG combined)')
overall_avg_salary = salary_per_dept.\
    loc[salary_per_dept['Department'] == 'All departments', 'Average salary'].values[0]
plt.axhline(overall_avg_salary, color='red', lw=0.5)
plt.text(12, overall_avg_salary + 1000,
         f'School average salary: £{round(overall_avg_salary)}')
plt.text(df_to_plot.shape[0] - 1, 0, 'x', ha='center', va='center');
ax.figure
OBSERVATIONS:
“Quantitative” and/or “Business”-related subjects all have a higher-than-average average salary
Two of the highest average salary departments have only PG but no UG students
Law is also among the high average salary departments
LIMITATIONS:
Other factors like levels of study are not taken into account
Use of average salary
Representativeness of the data. For example:
Graduates who did not respond to the survey
Graduates not working in a job paid in £
Missing data may cause bias.
How does the initial exploration help and what should you look out for?
What is wrong with this:
salaries.groupby('Department')[['Average salary']].mean().head()
This takes an unweighted mean over already-aggregated rows, so the values are not calculated correctly; for this dataset, though, they will not be off by much.
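A hypothetical illustration of the problem: groupby(...).mean() takes an unweighted mean of the already-aggregated rows, while the true department average weights each row by its number of graduates (all numbers below are made up):

```python
import pandas as pd

# Hypothetical rows for one department, showing that an unweighted mean of
# already-aggregated rows ignores group sizes.
rows = pd.DataFrame({
    "Level of study": ["UG", "PG"],
    "Number of graduates": [90, 10],
    "Average salary": [30000, 40000],
})

unweighted = rows["Average salary"].mean()  # what groupby(...).mean() effectively does
weighted = ((rows["Average salary"] * rows["Number of graduates"]).sum()
            / rows["Number of graduates"].sum())
print(unweighted, weighted)  # 35000.0 31000.0
```

The gap between the two is large here because the toy group sizes are very unequal; in the real dataset the imbalance is smaller, which is why the error is modest.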
How do you calculate the median?
df_to_plot = salary_per_dept.loc[salaries['Department'] != 'All departments',
                                 ['Department', 'Median salary']]
ax = df_to_plot.sort_values('Median salary', ascending=False).set_index('Department') \
    .plot.bar(legend=False, ylabel='Median salary (£)',
              title='Median salary per department (2017/18 and 2018/19, UG and PG combined)')
overall_median_salary = salary_per_dept.\
    loc[salary_per_dept['Department'] == 'All departments', 'Median salary'].values[0]
plt.axhline(overall_median_salary, color='red', lw=0.5)
plt.text(12, overall_median_salary + 1000,
         f'School median salary: £{round(overall_median_salary)}')
plt.text(df_to_plot.shape[0] - 1, 0, 'x', ha='center', va='center');
ax.figure
How may the median help?
It may give us some idea about:
* The skewness or outliers of the data
* Whether we are likely to get similar results with the use of average and median salary
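A small illustration of the first point, on hypothetical numbers: in right-skewed data, a single high value pulls the mean well above the median.

```python
import pandas as pd

# Hypothetical right-skewed salaries: one high earner pulls the mean above the median.
s = pd.Series([25000, 26000, 27000, 28000, 90000])
print(s.mean(), s.median())  # 39200.0 27000.0
```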
What else could you consider with the level of study?
What can be improved:
salaries.groupby('Department')['Median salary'].mean().plot.bar();
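One possible improvement, sketched on a toy frame (the column names match the notes, but the values are made up): keep the pre-aggregated UG and PG rows instead of averaging them together, and plot the two levels of study side by side.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs headless
import pandas as pd

# Toy frame with the same column names as the notes (values are made up).
salaries = pd.DataFrame({
    "Department": ["Economics", "Economics", "Law", "Law"],
    "Level of study": ["UG", "PG", "UG", "PG"],
    "Graduating year": ["All"] * 4,
    "Median salary": [30000, 34000, 31000, 36000],
})

# Keep the pre-aggregated UG and PG rows instead of averaging them together,
# then pivot so each level of study gets its own set of bars.
subset = salaries[(salaries["Graduating year"] == "All") &
                  (salaries["Level of study"].isin(["UG", "PG"]))]
pivoted = subset.pivot(index="Department", columns="Level of study",
                       values="Median salary")
ax = pivoted.plot.bar(ylabel="Median salary (£)",
                      title="Median salary per department by level of study")
```

This avoids re-aggregating values that the dataset already provides, and makes the UG/PG comparison the point of the chart rather than a hidden source of error.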