Python questions Flashcards
Write a SQL query to find the top 5 products with the highest sales
SELECT productID, SUM(price * quantity) AS totalSales
FROM DATA
GROUP BY productID
ORDER BY totalSales DESC
LIMIT 5;
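One way to sanity-check a query like this is to run it against a small in-memory SQLite table from Python. The sketch below is only illustrative: the table name `sales` and the sample rows are made up, and only the column names `productID`, `price`, and `quantity` come from the card. It sorts totals in descending order so the largest come first.

```python
import sqlite3

# Hypothetical sample rows: (productID, price, quantity)
rows = [
    (1, 10.0, 3), (2, 5.0, 1), (3, 20.0, 2), (4, 1.0, 45),
    (5, 7.5, 4), (6, 100.0, 1), (1, 10.0, 2), (7, 2.0, 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (productID INTEGER, price REAL, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Aggregate revenue per product, largest totals first, keep five rows.
query = """
    SELECT productID, SUM(price * quantity) AS totalSales
    FROM sales
    GROUP BY productID
    ORDER BY totalSales DESC
    LIMIT 5
"""
top5 = conn.execute(query).fetchall()
conn.close()

for product_id, total in top5:
    print(product_id, total)
```

Without `DESC`, the same query would return the five lowest-selling products, because `ORDER BY` sorts ascending by default.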
Difference between normalization and denormalization
Normalization: Process of organizing data in a database to reduce redundancy and improve data integrity
Denormalization: Process of combining normalized tables to reduce the number of joins required when querying, which improves read performance at the cost of some redundancy.
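The trade-off can be seen in a tiny SQLite example run from Python. This is a sketch under assumptions: the `customers`, `orders`, and `orders_wide` tables and their rows are hypothetical and exist only to contrast the two layouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: each customer's city is stored once and referenced by id.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Ana", "Toronto"), (2, "Ben", "Ottawa")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0)])

# Reading an order together with its city requires a join.
joined = conn.execute(
    "SELECT o.id, c.name, c.city, o.amount "
    "FROM orders o JOIN customers c ON c.id = o.customer_id "
    "ORDER BY o.id"
).fetchall()

# Denormalized: the join result is stored as one wide table, so reads need
# no join, but Ana's name and city are repeated on each of her orders.
conn.execute("CREATE TABLE orders_wide (id INTEGER, name TEXT, city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders_wide VALUES (?, ?, ?, ?)", joined)
wide = conn.execute("SELECT * FROM orders_wide ORDER BY id").fetchall()
conn.close()
```

If Ana moves, the normalized layout needs one update in `customers`, while the wide table needs an update on every one of her rows.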
How would you deal with missing data?
If only a few rows have missing data, I would probably drop them with the dropna function in pandas. However, if a column is missing a substantial share of its values, I would consider dropping the entire column instead.
List 6 built-in Python data types
int, float, str, tuple, dict, bool
Which Python data types are mutable?
Lists, sets and dictionaries
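A short illustration of what "mutable" means in practice: the object can be changed in place and keeps its identity. The variable names are arbitrary.

```python
# Mutable objects can be changed in place; their identity stays the same.
nums = [1, 2, 3]
before = id(nums)
nums.append(4)
same_object = id(nums) == before  # the list was modified, not replaced

tags = {"a"}
tags.add("b")

ages = {"ana": 30}
ages["ben"] = 25

# Immutable objects such as tuples reject in-place changes.
point = (1, 2)
try:
    point[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False
```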
How would you approach a dataset with outliers?
I would analyze the data to determine whether the outliers come from data entry errors or are true variations in the data. For example, if I were given the household incomes in a neighborhood and one household's annual income was significantly below the mean, I would have to determine whether this is a data entry error or whether the household is a single-income household in a neighborhood where the majority of households are dual-income.
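Before deciding why a value is unusual, it first has to be flagged. One common way to do that is the 1.5 × IQR rule, sketched below with the standard library. The income figures (in thousands) are invented, and the 1.5 multiplier is a convention rather than something the card specifies.

```python
import statistics

# Hypothetical annual household incomes, in thousands
incomes = [52, 55, 58, 60, 61, 63, 65, 67, 70, 12]

# Quartiles, then the conventional 1.5 * IQR fences
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates to investigate, not to delete
outliers = [x for x in incomes if x < lower or x > upper]
print(outliers)
```

The rule only marks candidates; whether a flagged income is a typo or a genuine single-income household still has to be checked against the source.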
Describe a time when you used data visualization to communicate a complex idea. What tools did you use?
I used Tableau and Seaborn to plot GDP per capita against emissions per capita by country, to determine which countries manage to be both relatively developed and eco-friendly and which ones don't. Furthermore, I investigated whether countries with high emissions bear the brunt of the consequences of climate change, using metrics such as temperature increase and the frequency of natural disasters.
Bar graph vs Histogram
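Bar graph: Visualizes categorical data, with one separate bar per category showing its count or value
Histogram: Visualizes the distribution of continuous data by grouping values into adjacent numeric bins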
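The difference shows up in how the bar heights are computed, even before anything is drawn. In this sketch the sample values and the bin width of 10 are assumptions for illustration.

```python
from collections import Counter

# Categorical data: a bar graph draws one bar per distinct category.
colours = ["red", "blue", "red", "green", "blue", "red"]
bar_heights = Counter(colours)

# Continuous data: a histogram first groups values into numeric bins,
# then draws one bar per bin.
heights_cm = [151.2, 158.9, 160.4, 163.0, 167.5, 171.1, 178.8]
bin_width = 10  # assumed bin width
hist_counts = Counter(int(h // bin_width) * bin_width for h in heights_cm)

for start in sorted(hist_counts):
    print(f"{start}-{start + bin_width}: {'#' * hist_counts[start]}")
```

Changing `bin_width` changes the shape of the histogram, whereas the categories of a bar graph are fixed by the data.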
Correlation vs Causation
Correlation: A numerical value between -1 and 1 that indicates the strength and direction of the linear relationship between two variables
Causation: A relationship in which a change in one variable is directly responsible for a change in the other. Correlation alone does not establish causation.
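The correlation coefficient usually meant here is Pearson's r. A minimal hand-rolled version is sketched below; the function name `pearson_r` and the sample data are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
r_pos = pearson_r(hours, [2, 4, 6, 8, 10])   # perfectly increasing together
r_neg = pearson_r(hours, [10, 8, 6, 4, 2])   # one rises as the other falls
print(r_pos, r_neg)
```

A value near 1 or -1 says the two series move together in a straight-line fashion; it says nothing about which one, if either, drives the other.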
Linear regression & Assumptions
Statistical method used to model the relationship between a dependent variable and one or more independent variables.
The assumptions:
Relationship between the dependent and independent variables is linear
Observations are independent
Residuals have constant variance at each level of the independent variables (homoscedasticity)
Residuals are approximately normally distributed
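For a single independent variable, the least-squares line can be fitted by hand, and the residuals it leaves behind are what the assumptions are checked against. This is a sketch with made-up data; `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Residual = observed value minus fitted value
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(slope, intercept)
```

Plotting `residuals` against `xs` is a common visual check: a curve suggests the relationship is not linear, and a funnel shape suggests the variance is not constant.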
How do you prioritize your tasks when working on multiple projects with tight deadlines?
I use the priority square, which splits tasks by urgency and importance, and I start with the tasks that are both urgent and important.
Give an example of how you explained a technical concept to a non-technical stakeholder
Simplicity
Relevance
Visual aids
Engagement
Why do you want to work at RBC?
RBC is the biggest bank in Canada, so I would have the pleasure of working with some of the most brilliant minds in the country, particularly in the field of data science, which would open me up to an exceptional level of growth.
Where do you see yourself in five years in the field of data analysis?
With the latest version of ChatGPT released to the public, it is abundantly clear that just being a data analyst is not enough. In five years, I see myself playing a pivotal role in a project such as launching an AI banking assistant for RBC.