SUM1 Flashcards

1
Q

IMPORT relevant MODULES

A

DO THIS to ensure all modules used are imported, so that the code runs and the examiner will not have any issues when running it

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set_style("darkgrid")

plt.rcParams["figure.figsize"] = (6,2.5)

2
Q

What happens if the code does not run?

A

YOUR WORK WILL BE PENALISED IF YOUR CODE DOES NOT RUN; SO IMPORT THE RIGHT MODULES

3
Q

What is the marking criteria?

A
  1. Sanity of code
    * Correctness of the code
    * Use appropriate Pandas functionality for data wrangling (if required)
    * Use appropriate Pandas / Matplotlib / Seaborn functionality for data visualisation
  2. Appropriate data exploration
    * Get some idea about the data (e.g. via summary statistics and reading the background information), figure out if there is any possible issue with the data, what kind of questions you can answer, etc
  3. Appropriate data wrangling
    * Handle issues with data, clean the data, extract data, rearrange data, merge data, etc (when appropriate)
  4. Good EDA
    * Use appropriate statistics and plots to get insights about the data and answer the questions
  5. Effective visualisation
    * Truthfully plot the data and use appropriate and effective types of plots and channels
    * Appropriate graph size and use of colours, labels, scale, titles, etc. Graphs should be self-contained
    * Carefully constructed with good captions and annotations (if appropriate) for plots that are for explanatory purposes
  6. Overall analysis quality
    * Well-motivated and appropriate questions to ask, a sensible approach to answering the questions, good observations and insights
    * Sensible conclusion, discussion on limitations of the analysis (based on what you have observed), evidence of critical thinking
    * Analysis easy to follow
  7. Use of Jupyter Notebook
    * Use code cells for code and markdown cells for text
    * All code must be run and output from code must be displayed in the notebooks
  8. Others
    * Use relative path instead of absolute path


4
Q

What should be the first step when starting with the data?

A
  1. Read the metadata / background information about the data
    * Understand how the data is collected, reliability of the data, etc (last session)
  2. Have a look at the data
    * Ensure data is loaded in correctly (including the data types)
    * See if anything is surprising
    * Determine what we are working with (bias, missing data, homogeneity of the data, etc.)
    * Check whether it is the RAW ORIGINAL data (i.e. has it already been aggregated into means, medians, or other summary figures?); see the loading/first-look sketch after the note below

NOTE: The first step should not be immediately calculating descriptive statistics or visualising the data.
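A minimal first-look sketch (assuming the relative CSV path used in the later cards, data/salaries_aggregated.csv); it only loads and peeks at the data, consistent with the note above:

salaries = pd.read_csv('data/salaries_aggregated.csv')  # relative path, as recommended later
salaries.head()    # eyeball a few rows: anything surprising? any aggregate rows?
salaries.dtypes    # confirm the columns were parsed with sensible types
# In a notebook, run each of these in its own cell so every output is displayed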

5
Q

How do you load the data and what is important to do when loading the data?

A

USE THE RELATIVE PATH so the code runs on the examiner's machine:
- Ensure the file opens properly
- You do not necessarily need additional arguments yet

salaries = pd.read_csv('data/salaries_aggregated.csv')
response_rates = pd.read_csv('data/response_rates_aggregated.csv')

6
Q

How can you have a quick look at the data before wrangling?

A

You can simply type the name of the DataFrame variable (e.g. salaries), or use .head(), .tail(), or .sample(); see the sketch below.

Reminder: Please ensure all outputs of code are displayed in the notebook.

SEE IF THE AGGREGATED INFORMATION IS ALREADY AVAILABLE OR NOT
- Just aggregating the information yourself may be WRONG
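A minimal sketch, assuming the salaries DataFrame loaded in the previous card:

salaries            # typing the variable name displays the (truncated) DataFrame
salaries.head()     # first 5 rows
salaries.tail()     # last 5 rows
salaries.sample(5)  # 5 random rows
# In a notebook, run each of these in its own cell so every output is displayed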

7
Q

What is a common mistake with looking at the data?

A

Students may use salaries.groupby('column1')['Average salary'].mean()

  • This may aggregate across different educational levels and numbers of students, and fail to distinguish other factors
  • Is the median or the mean the better descriptive statistic?
  • You may already have the aggregated information available. USE FILTERS:
    salaries.loc[(salaries['Graduating year'] == 'All') & (salaries['Level of study'] == 'All')]
  • The data may already be a weighted average
  • CHATGPT/COPILOT may group it all together

*NOTE: From previous years:
  1. Aggregated data rather than original data
    -> Provided summary statistics about central tendency (e.g. mean and median) but not dispersion (e.g. range or standard deviation) or other aspects of the distribution
  2. Looks fine, but interesting to have the "all" rows
  3. The numbers of graduates all seem to be rounded to a multiple of 5, as stated
8
Q

How can we get more information from the dataset?

A

.info()

salaries.info()

NOTE: It seems fine for the first 3 columns to have a non-quantitative type. The last 3 are currently float because of NA values.

9
Q

How can you check the data types/check for missing data?

A

.dtypes (an attribute, not a method) and .isna()
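A minimal sketch, assuming the salaries DataFrame loaded earlier:

salaries.dtypes          # data type of each column (an attribute, so no brackets)
salaries.isna().sum()    # number of missing values per column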

10
Q

What is important to do after looking at the data?

A

REMINDER: Use the markdown cell for text answers.

  • While many students did use head(), info(), dtypes(), etc., they merely showed the output from the code without making useful observations or providing textual answers
  • Many students unfortunately did not look closely enough to realise that there are aggregate rows in the data
    Because of that, they performed the data analysis incorrectly.
11
Q

How can you get more descriptive statistics for the data?

A

salaries.describe(include='all')

  • NOTE: given there are aggregated rows, the descriptive statistics here are not really correct, but they can still be useful for a quick understanding of the data.
12
Q

How can you also examine the range of values to spot whether there are any errors?

A

salaries['Median salary'].sort_values().unique()

EXAMPLE:
The median salary does not seem to be rounded to a multiple of 5 for data protection purposes, as we see values like 30034

13
Q

How can you count the different levels of study in the column to understand distribution?

A

salaries.value_counts('Level of study')
salaries.value_counts('Graduating year')

OUT:
Level of study
All 75
PG 75
UG 50
Name: count, dtype: int64
# We see fewer UG rows than PG rows.

OUT:
Graduating year
2018/19 67
All 67
2017/18 66
Name: count, dtype: int64
# We can also see we have slightly fewer observations for 2017/18.

14
Q

How do you create variables for the different parts of the data?

A

pg_depts = pd.Series(salaries.loc[salaries['Level of study'] == 'PG',
'Department'].unique())
ug_depts = pd.Series(salaries.loc[salaries['Level of study'] == 'UG',
'Department'].unique())
pg_only_depts = pg_depts[~pg_depts.isin(ug_depts)]
pg_only_depts

ug_depts[~ug_depts.isin(pg_depts)]  # no dept with only UG in the data

NOTE: There were 8 departments with PG graduates but no UG graduates.

15
Q

How can you check for missing data and what should be considered?

A

Try to understand what is REALLY missing:
* Even if the information is missing at the more detailed level, the "aggregate" rows may still contain it.

EXAMPLE 1:
salaries[salaries.isna().any(axis=1)].sort_values(['Department', 'Level of study',
'Graduating year'])

EXAMPLE 2:
salaries[(salaries['Department'] == 'Sociology')].sort_values(['Graduating year',
'Level of study'])

NOTE: Given that the minimum of "Number of graduates paid in £" is 10, missing values may be due to a low number of observations for the corresponding cohort of students.

16
Q

How can you handle missing data?

A

While one may try to deduce the missing data (e.g. use "All" minus "PG" to get the "UG" values), the deduced values are likely to be imprecise due to small sample sizes, accumulated rounding errors, etc.

  • We will leave the missing data for now (I.E. KEEP THE DATA IN FOR NOW)
  • MAYBE DO NOT REVERSE-ENGINEER the values, as the results may be imprecise
17
Q

How can we help quantify how representative the data is?

A

Use the Response rate

Here we take a look at response rate:

response_rates

OBSERVATION:
* The first few rows of the table give us some idea about how representative the data is: for 2017/18 and 2018/19 combined, across all students, the response rate is 46%, with (rounded) students filling in the survey
* This also prompts the question of why "Number of graduates paid in £" in salaries is much smaller: it is (rounded) for 2017/18 and 2018/19 combined, across all students

18
Q

How can we answer the question: Is salary different for different departments? What are the observations and limitations?

A
  1. First, consider "All" levels of study, "All" years and the average salary (do not treat the "All departments" row as a department; use it only as a reference line to make comparisons with each department easier).

CODE:
salary_per_dept = salaries.loc[(salaries['Graduating year'] == 'All') &
(salaries['Level of study'] == 'All')]
df_to_plot = salary_per_dept.loc[salaries['Department'] != 'All departments',
['Department', 'Average salary']]
ax = df_to_plot.sort_values('Average salary', ascending=False).set_index('Department').\
plot.bar(legend=False, ylabel='Average salary (£)', \
title='Average salary per department (2017/18 and 2018/19, UG and PG combined)')
overall_avg_salary = salary_per_dept.\
loc[salary_per_dept['Department'] == 'All departments', 'Average salary'].values[0]
plt.axhline(overall_avg_salary, color='red', lw=0.5)
plt.text(12, overall_avg_salary+1000,
f'School average salary: £{round(overall_avg_salary)}')
plt.text(df_to_plot.shape[0]-1, 0, 'x', ha='center', va='center');

ax.figure

OBSERVATIONS:
“Quantitative” and/or “Business”-related subjects all have a higher-than-average average salary
Two of the highest average salary departments have only PG but no UG students
Law is also among the high average salary departments

LIMITATIONS:
Other factors, such as the level of study, are not taken into account
Use of the average (mean) salary
Representativeness of the data, for example:
  Graduates who did not respond to the survey
  Graduates who do not work in a job paid in £
Missing data may cause bias.

19
Q

How does the initial exploration help and what should you look out for?

A
  • Find out that we should use the aggregate rows directly if we are considering all levels of study and all graduating years
    -> Rounding issues make it difficult to do the aggregation precisely ourselves
    -> May need to do a "weighted" average by ourselves (see the sketch below)
    -> Less missing data for the aggregate rows
  • Know what is missing and what is not
    -> Missing data may not be as much as some students thought if they make use of the “all” rows
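A hedged sketch of what a weighted aggregation could look like if you did have to combine the detailed (non-"All") rows yourself; the column names are taken from the cards above and the approach is illustrative, not the prescribed solution:

# Weight the average salary by cohort size instead of taking a plain mean
detailed = salaries.loc[(salaries['Graduating year'] != 'All') &
                        (salaries['Level of study'] != 'All')].dropna()
weighted_avg = detailed.groupby('Department').apply(
    lambda g: np.average(g['Average salary'], weights=g['Number of graduates paid in £']))
weighted_avg.head()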
20
Q

What is wrong with this:
salaries.groupby('Department')[['Average salary']].mean().head()

A

The groupby takes a plain (unweighted) mean over rows that include the aggregate 'All' rows, so the values are not calculated correctly; for this dataset, however, they will not be off by too much (see the check below).
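A hedged sketch of how one could check how far off the naive calculation is, assuming the column names and "All" rows described in the earlier cards:

naive = salaries.groupby('Department')['Average salary'].mean()
provided = (salaries.loc[(salaries['Graduating year'] == 'All') &
                         (salaries['Level of study'] == 'All')]
            .set_index('Department')['Average salary'])
(naive - provided).describe()   # distribution of the per-department discrepancies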

21
Q

How do you produce the corresponding plot using the median salary?

A

df_to_plot = salary_per_dept.loc[salaries['Department'] != 'All departments',
['Department', 'Median salary']]
ax = df_to_plot.sort_values('Median salary', ascending=False).set_index('Department') \
.plot.bar(legend=False, ylabel='Median salary (£)', \
title='Median salary per department (2017/18 and 2018/19, UG and PG combined)')
overall_median_salary = salary_per_dept.\
loc[salary_per_dept['Department'] == 'All departments', 'Median salary'].values[0]
plt.axhline(overall_median_salary, color='red', lw=0.5)
plt.text(12, overall_median_salary+1000,
f'School median salary: £{round(overall_median_salary)}')
plt.text(df_to_plot.shape[0]-1, 0, 'x', ha='center', va='center');

ax.figure

22
Q

How may the median help?

A
  • Often a more robust measure than the mean/average, especially when the data is skewed or contains outliers

It may give us some idea about:
* The skewness of the data or the presence of outliers
* Whether we are likely to get similar results using the average and the median salary

23
Q

What else could you consider with the level of study?

A
  • You could split UG and PG, as the two levels of study have different outcomes
  • Observations and remarks:
    Unlike the plot for all levels of study, Law has the highest median salary and Economic History ranks much higher.
    The 'x' reminds us there is no UG Finance data
  • You can also calculate the difference between the average and median salaries (see the sketch below)
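A hedged sketch of that comparison, assuming the salary_per_dept DataFrame built in the earlier card:

gap = (salary_per_dept
       .loc[salary_per_dept['Department'] != 'All departments',
            ['Department', 'Average salary', 'Median salary']]
       .assign(gap=lambda d: d['Average salary'] - d['Median salary'])
       .sort_values('gap', ascending=False))
gap.head()   # departments where the mean most exceeds the median (possible right skew)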
24
Q

What can be improved:
salaries.groupby('Department')['Median salary'].mean().plot.bar();

A
  • The bars are in no particular order, which makes the graph difficult to read
  • It is difficult to interpret in relation to the school average
25
Q

What could be improved:
df_to_plot = salaries.groupby('Department')['Median salary'].mean().sort_values()
df_to_plot.plot(figsize=(15,3))
plt.xticks(np.arange(0, len(df_to_plot.index), 1.0), df_to_plot.index, rotation=90);

A

* Not an appropriate type of graph (line graph), as the data on the x-axis (departments) is categorical rather than ordered
26
Q

Suppose that you split the departments into Quantitative and Qualitative. What is wrong with this:
sns.boxplot(ug_salary_per_dept[ug_salary_per_dept['Department']!='All departments'].\
sort_values('Quantitative', ascending=False), x='Quantitative', y='Median salary');
plt.title('Median salary on Quant vs non-quant (2017/18 & 2018/19, UG only)')
plt.xlabel('"Quantitative" department?'); plt.ylabel('Median salary (£)');

A

* How should we interpret the boxplot? What is a "median of medians"? The box plot is built from summary statistics (range, median, IQR), which may not make sense in this context
* Only 3 data points are used to create the "True" box...
* The categorisation may not be accurate

TO IMPROVE:
* It is better to plot the raw data (e.g. a scatter/strip plot), as there are only 3 data points in the Quantitative group (see the sketch below)
* Maybe Quantitative vs Qualitative is a bad way to group the departments
* Perhaps use colour to show a different segregation of departments (e.g. orange for quantitative and blue for qualitative subjects)
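A hedged sketch of the suggested improvement, assuming the ug_salary_per_dept DataFrame and 'Quantitative' column used above:

# Plot the raw points instead of a box plot built from only a few observations
df = ug_salary_per_dept[ug_salary_per_dept['Department'] != 'All departments']
sns.stripplot(data=df, x='Quantitative', y='Median salary')
plt.title('Median salary, quantitative vs other departments (2017/18 & 2018/19, UG only)')
plt.xlabel('"Quantitative" department?')
plt.ylabel('Median salary (£)');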
27
Q

What is wrong with this:
ug_salary_per_dept.groupby('Quantitative')['Median salary'].mean().plot.bar(rot=0);

A

* The plot is not very informative
* It does not provide much more than a table would