Basic Technical Flashcards
What ETL tools do you have experience with? What tools do you prefer?
What are some challenges unique to cloud computing?
Security - When we connect to an external API, that API could itself have been hacked. In the last couple of years Facebook, Twitter, Venmo, and the USPS have all been targets of API attacks, whether through code injection, DDoS attacks, or stolen authentication credentials at the source. I see that as a huge problem for us, since we may be connecting to companies that don't have great security.
Cost - This is why it's important to be doing Power BI reporting: are we making sure that we are efficient with our spend and our processes?
What’s the difference between structured and unstructured data?
Structured data is highly specific and is stored in a predefined format. Structured data takes advantage of schema-on-write.
Unstructured data is a conglomeration of many varied types of data that are stored in their native formats. Unstructured data employs schema-on-read.
Semi-structured data refers to what would normally be considered unstructured data, but that also has metadata that identifies certain characteristics.
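To make the schema-on-write vs. schema-on-read distinction concrete, here is a minimal Python sketch; the field names are made up for illustration:

```python
import json

# Schema-on-write: the structure is enforced when the record is stored (structured data).
def write_structured(record: dict) -> tuple:
    # Fails loudly if the record does not match the predefined schema.
    return (int(record["property_id"]), float(record["nightly_rate"]), record["city"])

# Schema-on-read: the raw payload is stored untouched and a structure is imposed
# only when the data is read (unstructured / semi-structured data).
def read_semi_structured(raw: str) -> dict:
    doc = json.loads(raw)
    return {
        "property_id": doc.get("property_id"),
        "nightly_rate": doc.get("nightly_rate"),
        "city": doc.get("city", "unknown"),  # tolerate missing or extra fields
    }

if __name__ == "__main__":
    print(write_structured({"property_id": "42", "nightly_rate": "120.0", "city": "Austin"}))
    print(read_semi_structured('{"property_id": 42, "amenities": ["wifi"]}'))
```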
What are the key features of Hadoop?
Hadoop is one of the first, if not the first, open-source Big Data frameworks. It uses MapReduce, which can be written in Python or Java, to process data stored in HDFS, the Hadoop Distributed File System. Another part of it is YARN, which stands for Yet Another Resource Negotiator and is used to split up and schedule computing resources.
Hadoop also needs to be installed and runs locally on an organization's own computing power and resources.
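As a concrete illustration of MapReduce in Python, here is the classic word-count mapper and reducer written for Hadoop Streaming (the single-file layout is just a convenience, not specific to any one cluster):

```python
#!/usr/bin/env python3
# Word count for Hadoop Streaming: run with "mapper" or "reducer" as the argument.
# Hadoop sorts the mapper output by key before it reaches the reducer.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "mapper" else reducer()
```

Submitted with the hadoop-streaming jar, the input is split across HDFS data nodes, YARN schedules the map and reduce containers, and the results land back in HDFS.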
How would you design a data pipeline?
A broad, beginner case study question like this wants to know how you approach a problem. With all case study questions, you should ask clarifying questions like:
What type of data is processed?
How will the information be used?
What are the requirements for the project?
How much data will be pulled? How frequently?
These questions will provide insight into the type of response the interviewer seeks. Then you can describe your design process, starting with choosing data sources and data ingestion strategies before moving on to your data processing and implementation plans.
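If it helps to ground the answer, here is a minimal, generic sketch of that ingest → transform → load shape in Python; the source URL, table name, and cleaning rules are placeholders, not a real system:

```python
import csv
import io
import sqlite3
import urllib.request

def extract(url: str) -> list[dict]:
    # Ingest: pull raw CSV rows from the source (placeholder URL).
    with urllib.request.urlopen(url) as resp:
        return list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))

def transform(rows: list[dict]) -> list[tuple]:
    # Process: drop malformed rows and cast types according to the project requirements.
    clean = []
    for row in rows:
        try:
            clean.append((row["id"], row["city"], float(row["amount"])))
        except (KeyError, ValueError):
            continue  # in a real pipeline these rows would be logged for review
    return clean

def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    # Load: write the processed rows to the target store (here just SQLite).
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS transactions (id TEXT, city TEXT, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

if __name__ == "__main__":
    load(transform(extract("https://example.com/data.csv")))
```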
How would you design a data warehouse given X criteria?
This example is a fundamental case study question in data engineering, and it requires you to provide a high-level design for a database based on criteria. To answer questions like this:
Start with clarifying questions and state your assumptions
Provide a hypothesis or high-level overview of your design
Then describe how your design would work
BooksBnB Data warehouse design
https://miro.medium.com/max/720/1*MIbLBofXvyEkasVQ-_5jxQ.png
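The linked diagram shows the actual BooksBnB design; as a generic illustration of the kind of high-level overview you might sketch, here is a hypothetical star schema for a short-term-rental warehouse (all table and column names are placeholders, not taken from the diagram):

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to a few dimension tables.
SCHEMA = """
CREATE TABLE dim_property (property_id INTEGER PRIMARY KEY, city TEXT, state TEXT, owner_id INTEGER);
CREATE TABLE dim_guest    (guest_id INTEGER PRIMARY KEY, signup_date TEXT, home_market TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, date TEXT, month INTEGER, year INTEGER);
CREATE TABLE fact_reservation (
    reservation_id  INTEGER PRIMARY KEY,
    property_id     INTEGER REFERENCES dim_property(property_id),
    guest_id        INTEGER REFERENCES dim_guest(guest_id),
    checkin_date_id INTEGER REFERENCES dim_date(date_id),
    nights          INTEGER,
    revenue         REAL
);
"""

with sqlite3.connect("booksbnb_dw.db") as conn:
    conn.executescript(SCHEMA)
```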
Hardest SQL Question you have recently worked on?
Yeah, so I had a hard one, but a really interesting one, recently.
I had to find the top 10% of properties by net profit in Texas from enterprise clients over the last 3 years, and show how their revenue changed per year.
To give some context, and I think y'all will find this interesting as people who like data: some of the clients we have are concerned that their revenue spiked during COVID in drive-to Airbnb markets. In Texas there are a lot of markets that people don't necessarily fly to, but drive to from other places in Texas. Those clients got a revenue spike, bought a bunch of properties, and want to know whether to continue that strategy or not.
The hard part of this is the many steps it takes to get the properties, plus the month-over-month difference.
1) Take the top 10% of properties by net profit in Texas from enterprise clients over the last 3 years.
2) For those properties, calculate revenue per year and how it changed from year to year.
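Roughly, the query could be structured like this, shown here as Python wrapping a PostgreSQL query; the table and column names (properties, clients, monthly_financials, tier, etc.) are assumptions about the schema, not the real one:

```python
# Hypothetical schema: properties(property_id, state, client_id),
# clients(client_id, tier), monthly_financials(property_id, month, revenue, net_profit).
TOP_DECILE_REVENUE_BY_YEAR = """
WITH tx_enterprise AS (
    SELECT p.property_id,
           SUM(f.net_profit) AS total_net_profit
    FROM properties p
    JOIN clients c            ON c.client_id = p.client_id
    JOIN monthly_financials f ON f.property_id = p.property_id
    WHERE p.state = 'TX'
      AND c.tier = 'enterprise'
      AND f.month >= CURRENT_DATE - INTERVAL '3 years'
    GROUP BY p.property_id
),
ranked AS (
    SELECT property_id,
           NTILE(10) OVER (ORDER BY total_net_profit DESC) AS decile
    FROM tx_enterprise
)
SELECT f.property_id,
       EXTRACT(YEAR FROM f.month) AS year,
       SUM(f.revenue) AS yearly_revenue,
       SUM(f.revenue) - LAG(SUM(f.revenue)) OVER (
           PARTITION BY f.property_id
           ORDER BY EXTRACT(YEAR FROM f.month)
       ) AS change_vs_prior_year
FROM monthly_financials f
JOIN ranked r ON r.property_id = f.property_id
WHERE r.decile = 1
  AND f.month >= CURRENT_DATE - INTERVAL '3 years'
GROUP BY f.property_id, EXTRACT(YEAR FROM f.month)
ORDER BY f.property_id, year;
"""

def run(connection):
    # Any DB-API connection (e.g. psycopg2) would work here.
    with connection.cursor() as cur:
        cur.execute(TOP_DECILE_REVENUE_BY_YEAR)
        return cur.fetchall()
```

The NTILE(10) window handles the top-10% cut, and the LAG over the yearly sums gives the change versus the prior year.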
Take me through the full ETL of one of your pipelines?
Sure, so let's say that someone uploads a CSV of financial data to our site. Two things are going to happen in parallel. One, we are going to extract the relevant data into PostgreSQL, transform it, and show some analytics back to them on the front end. That path is a classic ETL.
The other thing that happens is that the file gets shot, in its raw format, into our data lake. The way we have our data lake set up is, I believe, called the medallion architecture: there are three S3 buckets, bronze for raw data, silver for converted data, and gold for refined data.
We use AWS Glue to set up the tables and schemas that we want. All of the metadata is stored in the AWS Glue Data Catalog, and then we use Glue jobs to land the data in the raw (bronze) bucket.
Once the data is in the raw bucket, we refine it using AWS Athena, which we use to create more tables and also to create views for our data scientists.
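To make the Athena step concrete, a refinement query can be submitted from Python with boto3 roughly like this; the database, bucket, table, and view names are made-up placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Create a "gold" view on top of a silver-layer table (names are hypothetical).
CREATE_VIEW = """
CREATE OR REPLACE VIEW gold_revenue_per_property AS
SELECT property_id,
       date_trunc('month', revenue_date) AS month,
       SUM(revenue) AS monthly_revenue
FROM silver_financials
GROUP BY property_id, date_trunc('month', revenue_date)
"""

response = athena.start_query_execution(
    QueryString=CREATE_VIEW,
    QueryExecutionContext={"Database": "lake_silver"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```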
What is something you do day to day?
Yesterday I had meetings with Lodgify on the best way to read their API and vice versa. This is to set up real-time pricing for them using Apache Kafka, which I wish I could answer questions on, but it's completely new to me and I'm just wrapping my head around it.
Then our data scientists needed the data in the silver bucket to have occupancy per property ranked by zip code, comparing before 2019 to after 2022. That is a difficult query because you have to access properties and occupancy and separate them by date ranges.
Finally, there is a bug that is killing us in CSV uploads. Airbnb changed how they write refunds in the CSV, and refunds can also come in a month or so after a reservation. Making sure a late refund comes in and updates correctly in both the platform database and the refined bucket is very difficult.
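For the refund problem specifically, the usual pattern on the PostgreSQL side is an upsert keyed on the reservation, so a refund arriving a month later updates the existing row instead of inserting a duplicate. This is only a sketch with assumed table and column names, and it assumes reservation_id has a unique constraint:

```python
import psycopg2

UPSERT_REVENUE = """
INSERT INTO revenue (reservation_id, user_id, gross_amount, refund_amount)
VALUES (%(reservation_id)s, %(user_id)s, %(gross_amount)s, %(refund_amount)s)
ON CONFLICT (reservation_id)
DO UPDATE SET gross_amount  = EXCLUDED.gross_amount,
              refund_amount = EXCLUDED.refund_amount;
"""

def apply_csv_row(conn, row: dict) -> None:
    # A late refund shows up as a new CSV row for an existing reservation;
    # the ON CONFLICT clause turns it into an update of that reservation.
    with conn.cursor() as cur:
        cur.execute(UPSERT_REVENUE, row)
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=platform")  # placeholder connection string
    apply_csv_row(conn, {
        "reservation_id": "HMABC123",
        "user_id": 42,
        "gross_amount": 980.00,
        "refund_amount": 120.00,
    })
```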
Tell me about some of the tables in your database and why they are set up that way?
So there are four places in our ecosystem where we have relational tables: one set under our CRM, another under the platform, and then our silver and gold S3 buckets.
The silver bucket is set up for data scientists to access information. I am not a machine learning expert, or even really a novice, but they tell me what they want and I get them that data. There is a per-property table and a per-zip-code table. Then there is also the business side of things, where we have a customers table and a revenue table. Any property manager with over 100 properties is considered an enterprise partner, and for enterprise partners the data is partitioned by client, because the data scientists currently work as a service for them.
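As an example of what "partitioned by client" can look like in an S3 data lake, writing Parquet with pyarrow produces one directory per client; the path and column names here are placeholders, not the real bucket layout:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical per-property rows destined for the silver layer.
df = pd.DataFrame({
    "client_id": ["acme", "acme", "lonestar"],
    "property_id": [101, 102, 201],
    "zip_code": ["78701", "78704", "75201"],
    "occupancy_rate": [0.81, 0.64, 0.73],
})

# partition_cols writes client_id=acme/ and client_id=lonestar/ subdirectories,
# so queries scoped to one enterprise client only scan that client's files.
# In practice root_path would point at the silver S3 bucket instead of a local folder.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="silver/per_property/",
    partition_cols=["client_id"],
)
```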
What is an example of a data transformation you do?
Sure, so this is in our PostgreSQL database, which sits essentially right below our user platform.
We take the data in and store it in 5 tables: Users, Properties, Reservations, Revenue, and Expenses. All the tables are partitioned by user_id to speed up queries, since this DB is only for the platform, and at that level all analytics are done per user.
Something I might work on is making sure that data is extracted correctly into the 5 tables, and that when a CSV comes in with a row we don't recognize, it gets logged so I can debug it later. I also work with the software engineers to make sure they have the base data they need to write analytical code based on user inputs.
At this stage of our pipeline we only store new data, so a huge thing I work on is making sure we deduplicate data that already exists. This database is strictly for the user, not for storing all of the raw data we have.
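For the deduplication piece, one simple approach (a sketch, not necessarily how it is done in production) is to hash each incoming CSV row and skip rows whose fingerprint has already been stored:

```python
import csv
import hashlib
import sqlite3

def row_fingerprint(row: dict) -> str:
    # Stable hash of the row's contents, used as the dedup key.
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def load_new_rows(csv_path: str, db_path: str = "platform.db") -> int:
    inserted = 0
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS seen_rows (fingerprint TEXT PRIMARY KEY)")
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                # INSERT OR IGNORE leaves already-seen rows untouched, so only new data flows on.
                cur = conn.execute("INSERT OR IGNORE INTO seen_rows VALUES (?)",
                                   (row_fingerprint(row),))
                if cur.rowcount:
                    inserted += 1  # here the row would be routed into the 5 platform tables
    return inserted
```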
At the other end of the pipeline, when data is going to our data scientists, I work in Athena to transform revenue by reservation into revenue by day. The reason this is important is that a single reservation can span multiple days and even months, so the revenue has to be spread across those days before daily or monthly numbers come out right.
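Here is a small pandas sketch of that reservation-to-daily transformation (the column names are assumed): each reservation's revenue is spread evenly across its nights and then summed per day.

```python
import pandas as pd

# Hypothetical reservation-level revenue, as it arrives from the platform.
reservations = pd.DataFrame({
    "reservation_id": ["r1", "r2"],
    "property_id": [101, 101],
    "check_in": pd.to_datetime(["2023-06-28", "2023-07-02"]),
    "nights": [5, 3],
    "revenue": [1000.0, 450.0],
})

def revenue_by_day(res: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for r in res.itertuples(index=False):
        nightly = r.revenue / r.nights
        for night in pd.date_range(r.check_in, periods=r.nights, freq="D"):
            rows.append({"property_id": r.property_id, "date": night, "revenue": nightly})
    # Summing per day lets a reservation that straddles a month boundary
    # contribute the right amount to each month.
    return pd.DataFrame(rows).groupby(["property_id", "date"], as_index=False)["revenue"].sum()

print(revenue_by_day(reservations))
```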