Analytics - AWS Services for Veracity Flashcards
Organisations need to ensure the integrity of their data at all phases of the data lifecycle. They must have accurate data as it enters their system by putting it through a data cleansing process.
What type of challenge is this?
a) Volume
b) Velocity
c) Veracity
c) Veracity
_________________ streamlines collecting and processing data for big data workloads.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
EMR
_____________ prepares and integrates all your data at any scale.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
AWS Glue
______________ cleans and normalizes data faster and more efficiently.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
Glue DataBrew
______________ shares data across your organisation with built-in governance.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
Amazon DataZone
Serverless discovery and definition of table definitions and schema.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
Glue
Central metadata repository for your data lake. It discovers schemas from unstructured data sitting in S3 and elsewhere, and publishes table definitions for use with analysis tools such as EMR, Athena, and Redshift.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
Glue
Service that runs custom ETL jobs and can discover the schema for you. Jobs can be triggered when data is received, on a schedule, or on demand.
Amazon EMR
AWS Glue
AWS Glue DataBrew
Amazon DataZone
AWS Glue
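As a rough illustration of the schedule and on-demand options, the sketch below builds the keyword arguments one might pass to boto3's `create_trigger` call for Glue. The trigger names, the job name, and the cron expression are hypothetical, the helper `build_trigger_request` is not part of any AWS library, and the event-driven case is left out. Nothing is sent to AWS.

```python
# Sketch only: builds request dictionaries, does not call AWS.
# "nightly-clean", "manual-clean" and "clean-orders-job" are made-up names.

def build_trigger_request(name, job_name, trigger_type, schedule=None):
    """Return keyword arguments for a single-job Glue trigger."""
    if trigger_type not in {"SCHEDULED", "ON_DEMAND"}:
        raise ValueError(f"unsupported trigger type: {trigger_type}")
    if trigger_type == "SCHEDULED" and not schedule:
        raise ValueError("a scheduled trigger needs a cron expression")
    request = {
        "Name": name,
        "Type": trigger_type,
        "Actions": [{"JobName": job_name}],
    }
    if trigger_type == "SCHEDULED":
        # Glue schedules use a cron(...) expression; this one means 02:00 UTC daily.
        request["Schedule"] = schedule
    return request

nightly = build_trigger_request(
    "nightly-clean", "clean-orders-job", "SCHEDULED", "cron(0 2 * * ? *)"
)
manual = build_trigger_request("manual-clean", "clean-orders-job", "ON_DEMAND")
print(nightly)
print(manual)

# With AWS credentials configured, the request could be sent with:
#   import boto3
#   boto3.client("glue").create_trigger(**nightly)
```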
AWS service that uses Apache Spark for distributed data processing. With ________ ETL, you don't need to worry about managing the Spark cluster.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
AWS Glue
_________ crawler scans data in S3 and creates a schema.
However, sometimes you need to give it hints. It can run periodically or on demand.
_____________ crawler populates the Glue Data Catalog, which stores only the table definitions.
Once catalogued, you can treat your unstructured data as if it were structured. This allows services such as Redshift, Athena, or systems running on EMR such as Hive to query your unstructured data in S3.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
AWS Glue
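To make the idea of "only the table definition" concrete, here is a toy sketch that infers column names and types from a couple of JSON records. It is not how the Glue crawler is implemented; the sample records, the type names, and the helper `infer_schema` are assumptions for illustration. The point is that the result describes the data without containing it.

```python
import json

# Toy illustration only, not the Glue crawler's algorithm.
# Sample records stand in for files that would sit in S3.
SAMPLE_LINES = [
    '{"device": "sensor-1", "temp": 21.5, "ok": true}',
    '{"device": "sensor-2", "temp": 19.0, "ok": false}',
]

# Assumed mapping from Python types to catalog-style type names.
TYPE_NAMES = {str: "string", float: "double", int: "bigint", bool: "boolean"}

def infer_schema(lines):
    """Return a column -> type mapping from JSON lines (first value seen wins)."""
    schema = {}
    for line in lines:
        for column, value in json.loads(line).items():
            schema.setdefault(column, TYPE_NAMES.get(type(value), "string"))
    return schema

table_definition = infer_schema(SAMPLE_LINES)
print(table_definition)
```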
________________ crawler will extract partitions of your data based on how your S3 data is organised.
Amazon EMR AWS Glue AWS Glue DataBrew Amazon DataZone
AWS Glue
Think about how you are going to query your data lake in S3.
e.g. by time ranges: organise your bucket into prefixes for year, month, device, etc.
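A small sketch of such a layout, assuming Hive-style `name=value` prefixes (a common convention that lets the partition names become column names). The prefix `readings`, the device name, and both helper functions are hypothetical. The parser only mimics the idea of reading partitions back out of a key.

```python
from datetime import datetime

def partitioned_key(prefix, device, ts, filename):
    """Build an S3 object key partitioned by year, month, and device."""
    return f"{prefix}/year={ts:%Y}/month={ts:%m}/device={device}/{filename}"

def parse_partitions(key):
    """Recover partition names and values from name=value path segments."""
    partitions = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions

key = partitioned_key("readings", "sensor-1", datetime(2024, 3, 9), "part-0000.json")
print(key)
print(parse_partitions(key))
```

If most queries filter by device first, putting `device=` ahead of the date segments may suit them better; the order should follow the expected query pattern.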