Cloud Tools Flashcards
Airflow
DAG files to orchestrate data pipelines
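Here is a minimal sketch of what one of those DAG files looks like (the DAG id, task names, and commands are made up for illustration):

```python
# Minimal Airflow DAG sketch: two bash tasks chained together.
# DAG id, task ids, and commands are placeholders, not a real pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract step'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transform step'")

    extract >> transform  # extract runs before transform
```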
AWS Glue
How I use Glue:
1) We store all the JDBC connection information used to connect to our RDS instances.
2) The crawlers pull schema and metadata for the tables in our two sources and store them in the AWS Glue Data Catalog.
3) The Glue jobs write the raw data into our bronze S3 bucket and also transform some of it into the refined bucket. The raw data is written in Apache Avro format and the refined data in Apache Parquet; we use Parquet because it is a columnar format, so it is much more efficient to run analytics on (see the sketch after this list).
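Roughly what one of those Glue jobs looks like in PySpark (the database, table, and bucket names are placeholders, not our real ones):

```python
# Sketch of a Glue PySpark job: read a table the crawler registered in the
# Glue Data Catalog, land a raw Avro copy in the bronze bucket, then write a
# refined Parquet copy. All names here are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (populated by the crawlers)
listings = glue_context.create_dynamic_frame.from_catalog(
    database="platform_db", table_name="listings"
)

# Raw copy -> bronze bucket as Avro
glue_context.write_dynamic_frame.from_options(
    frame=listings,
    connection_type="s3",
    connection_options={"path": "s3://example-bronze/listings/"},
    format="avro",
)

# Light transform -> refined bucket as Parquet
refined = listings.toDF().dropDuplicates()
refined.write.mode("overwrite").parquet("s3://example-refined/listings/")

job.commit()
```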
AWS Athena
We use Athena both to transform data into the gold S3 bucket and to serve queries. We add partitions for a lot of the queries our data scientists run, and also partitions for Tableau, which connects to this bucket.
Our data scientists can also connect directly to Athena and run queries straight off it, as in the example below.
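Kicking off an Athena query from Python with boto3 looks roughly like this (the database, table, partition column, and results bucket are hypothetical names):

```python
# Sketch: run an Athena query with boto3. The database, table, partition
# column, and results bucket are made-up names for illustration.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT listing_id, avg(nightly_price) AS avg_price
        FROM refined_db.listings
        WHERE dt = '2024-01-01'  -- filtering on a partition column keeps the scan small
        GROUP BY listing_id
    """,
    QueryExecutionContext={"Database": "refined_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

print(response["QueryExecutionId"])  # poll get_query_execution with this id
```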
Apache Kafka
Open-source streaming platform for real-time data.
Sources stream data in and sinks listen for it.
Built on 4 APIs (a small producer/consumer sketch follows this list):
Producer API: publishes data
Consumer API: listens for and ingests data
Streams API: analyzes and transforms the data
Connector API: reusable connectors to external systems
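A tiny producer/consumer sketch using the kafka-python library (the topic name and broker address are made up):

```python
# Sketch with kafka-python: a producer publishes pricing events to a topic
# and a consumer reads them back. Topic and broker are placeholders.
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("rental-prices", {"listing_id": 42, "nightly_price": 180.0})
producer.flush()

consumer = KafkaConsumer(
    "rental-prices",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # e.g. {'listing_id': 42, 'nightly_price': 180.0}
    break
```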
My story:
I've been in meetings for the last two weeks because we are building a real-time pricing application for short-term rentals.
Amazon Aurora
Aurora is the database engine that we use. It's very fast and scales automatically. I did not set it up or manage it, though.
What is Booksbnb tech stack setup?
Yes, so right now (and things are always changing), one other data engineer and I manage our entire infrastructure.
We started off on a small MySQL database with only a couple thousand rows of data. Now we ingest over 100,000 rows a day, so scaling that has been really interesting.
We moved from MySQL to PostgreSQL on AWS once we got to around 20,000 rows a day. Here is how we are set up right now: we push all of our commits and source control through Git, we keep all of our SQL and DAG files in an S3 bucket, and we use Amazon MWAA to orchestrate all of our DAGs and pipelines.
For our Amazon RDS instances, we use PostgreSQL for both our platform and our CRM. So two data sources, soon to be three. Those feed directly into Glue Crawlers, which populate a Glue Data Catalog.
Then we have AWS Glue Spark jobs, using Apache Spark, which bring the data into our data lake. We follow the Databricks medallion model, so we have three S3 buckets: bronze for all raw data, silver for refined data, and gold for our aggregated, partitioned data.
Our data scientists access the gold partitioned data through AWS Athena.
Right now we are working on a new product that will stream data using Apache Kafka, but I've barely worked with it.
Amazon MWAA
Very easy way to set up an Airflow environment in AWS; you get the same UI as open-source Airflow.
Rest API
REST stands for representational state transfer.
It is for communicating between a client and a server (the ice cream shop analogy). Some of its benefits are:
1) simple and standardized
2) scalable / stateless
3) high performance, with support for caching
It uses CRUD operations (see the sketch after this list):
Create [POST]
Read [GET]
Update [PUT]
Delete [DELETE]
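A quick sketch of the four verbs with the Python requests library (the URL and payloads are hypothetical):

```python
# Sketch mapping CRUD to HTTP verbs with requests. URL and payloads are made up.
import requests

BASE = "https://api.example.com/flavors"

requests.post(BASE, json={"name": "vanilla", "price": 3.50})        # Create
print(requests.get(f"{BASE}/1").json())                             # Read
requests.put(f"{BASE}/1", json={"name": "vanilla", "price": 4.00})  # Update
requests.delete(f"{BASE}/1")                                        # Delete
```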