ETL Flashcards
- You have two ETL jobs that feed into a single production table each day. What problems might this cause?
Off the top of my head the first one would be whichever one is secondary could be overwriting the initial ETL job.
Also you can have write skew where you are updating old data. So if you have a query that inserts something into a table but that table is not the current version then you will have write skew.
- What’s the difference between ETL and ELT?
So ETL transforms data on a seperate processing server. While ELT process the data in the dara warehouse. ELT sends raw daya right into a datawarehouse.
ETL is slower.
Generally organization should be using ELT now which is more efficient and sclaable. Also you get to store the raw data which we need.
So the one thing I would say to this and maybe you guys run into this. Is data compliance. So sometimes you have to transform the data to protect customer information. If thats so you are going to have to do some kind of Transform before you load the data or you’ll break compliance.
What is an initial load in ETL? What about full load
Full load - All the data dumps when the source loads into the warehouse.
Initial load - Data is dumped between the source and target at regular intervals. Lastly, extract dates are stored; only records are added for the extract date load. This load can be either streaming (better for small volume) or batch (better for large volume).
With what ETL tools are you most familiar?
SQL, Python, Airflow, AWS Glue, I understand you guys use informatica
What are partitions? Why might you increase the number of partitions?
partitions are subdividers that improve data perfomance.
You can think of them like signs in Home depot
What are database snapshots? What’s their importance?
read only static views of a database at a certain point in time. I use snapshots to make sure we have something to fall back on if there is a data error.
What are views in ETL? What is used to build them?
Views are a SQL query that is a step in a transformation proess. You can store them in DB management ools.
What could be potential bottlenecks in the ETL process?
Very long inefficient query’s with lots of joins. Make sure you are loading incrementally and not fully updating. Large tables that are not partitioned. Something airflow is reat at is processing in parallel.
How would you triage an ETL failure?
First thing is can I replicate the ETL failure? So figure out if it is replicable or not replicable.
If it is replicable then I can narrow it down. If not replicable that might take some time. But I would check the logs of the tools in the ETL process and see where it is breaking to narrow that down.
Also keeping edge cases in mind that usually break software. Null values, zeros. End of the month. Special characters in a string. Capitlised and non capitalised strings. extra spaces in strings and lists. things like that can help.
Describe how to use an operational data store.
An operational data store, or ODS, is a database that provides interim storage for data before it’s sent to a warehouse
AN ODS typically integrates data from multiple sources and provides an area for efficient data processing activities like operational reporting
. Create an ETL query for an aggregate table called lifetime_plays that records each user’s song count by date.
okay so how often do you want this to update?
Great so I would do an insert into to add rows to this table and set it to update every second. I would say though that it might be pointless to do it by second because a song is like a couple minutes. So we are running alot of queries to not pull anything. So maybe i would do minimum like 2 minute time. Cut donw on queries maybe similair efificiency.
So i would be accessing like maybe a songs_played table for created_at, date, user_ud, song_id
Give some examples of uses for linked lists.
a linked list consists of nodes where each node has a head with the data and a refenrecre to the next node in the list. The end is null.
If you want to able to insert items in the middle of the list, memory is not a concern, and you arent iterating on it then linkedlists are prefferable.
If the list is absolutely massive use a linked list.
so maybe if you hit shuffle songs on spotify it creates a linked list becase it is essentially an endless list of shuffled songs.
How would you implement a queue using a stack?
What is a dequeue?
Dequeue is a queue operation to remove items from the front of a queue.
Equeue adds items to the back of a list.
What are the assumptions of linear regression?
that there is a relationship between the feautes and the response variable.