Processing Flashcards
Lambda
- Run code snippets in the cloud
- Serverless
- Continuous Scaling
- Support multiple programming languages
Lambda Use Cases
- Real time file processing
- Real time stream processing
- ETL
- Cron replacement
Lambda Supported Languages
- Node.js
- Python
- Java
- Go
- Ruby
Lambda Triggers
- S3
- KDS
- SNS
- SQS
- AWS CloudWatch
- AWS CloudFormation
Lambda Redshift
- Best practice for loading data into Redshift is the COPY command
- Use DynamoDB to keep track of what has been loaded
- Lambda can batch up new data and load them with COPY
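A minimal sketch of this pattern, assuming an S3-triggered Lambda, a hypothetical DynamoDB tracking table, and the Redshift Data API (all names are placeholders):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
redshift_data = boto3.client("redshift-data")

TRACKER_TABLE = "loaded_batches"      # hypothetical DynamoDB tracking table
CLUSTER_ID = "my-redshift-cluster"    # hypothetical cluster identifier

def handler(event, context):
    tracker = dynamodb.Table(TRACKER_TABLE)
    for record in event["Records"]:   # S3 put events
        key = record["s3"]["object"]["key"]
        # Skip batches DynamoDB says were already loaded (idempotency)
        if "Item" in tracker.get_item(Key={"batch_key": key}):
            continue
        redshift_data.execute_statement(
            ClusterIdentifier=CLUSTER_ID,
            Database="dev",
            DbUser="awsuser",
            Sql=f"COPY events FROM 's3://my-bucket/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                "FORMAT AS JSON 'auto';",
        )
        tracker.put_item(Item={"batch_key": key})
```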
Lambda + Kinesis
- Lambda receives an event with a batch of stream records
- Specify a batch size (up to 10,000 records)
- Batches larger than Lambda’s payload limit (6 MB) will be split
- Lambda will retry the batch until it succeeds or the data expires
- This can stall the shard if you do not handle errors properly
- Use more shards to ensure processing isn’t totally held up by errors
- Lambda processes shard data synchronously
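A minimal handler sketch for the above; Kinesis records arrive base64-encoded, and bad records are caught rather than re-raised so one poison record cannot stall the shard (process() is a stand-in for your logic):

```python
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            process(json.loads(payload))
        except Exception as exc:
            # Log (or dead-letter) the bad record instead of failing the
            # whole batch, which Lambda would otherwise keep retrying
            print(f"skipping record {record['kinesis']['sequenceNumber']}: {exc}")

def process(message):
    ...  # your business logic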
Lambda Cost Model
- Generous free tier (1M requests/month, 400K GB-seconds of compute time)
- $0.20 per million requests after that
- $0.00001667 per GB-second of compute after that
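A back-of-the-envelope check of how these numbers combine (illustrative workload; prices can vary by region):

```python
requests = 5_000_000          # invocations per month
duration_s = 0.2              # average duration per invocation
memory_gb = 0.512             # 512 MB allocated

request_cost = max(requests - 1_000_000, 0) / 1_000_000 * 0.20
gb_seconds = requests * duration_s * memory_gb
compute_cost = max(gb_seconds - 400_000, 0) * 0.00001667

print(f"{gb_seconds:,.0f} GB-s -> ${request_cost + compute_cost:.2f}/month")
# 512,000 GB-s -> $2.67/month ($0.80 requests + $1.87 compute)
```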
Lambda Promises
- High availability
- No scheduled downtime
- Retries failed code 3 times
- Unlimited scalability
- Safety throttle of 1,000 concurrent executions per region
- High Performance
- New functions callable in seconds
- Code is cached automatically
- Max Timeout is 15 mins
Lambda Anti-Patterns
- Long-running applications
- Use EC2 instead or chain functions
- Dynamic Websites
- Stateful applications
AWS Glue
- Discovery of table schema
- S3
- RDS
- Redshift
- DynamoDB
- Fully managed
- Serverless
- Event-triggered or on a schedule
- Use Cases
- Data Schema Discovery
- Data Catalog
- Data Transformation (AWS Glue Studio)
- Data Replication (AWS Glue Elastic Views)
- Data Preparation (AWS Glue DataBrew)
Glue Crawler / Data Catalog
- Glue crawler scans data in S3, creates schema
- Run periodically
- Populates the Glue Data Catalog
- Stores only table definition
- Original data stays in S3
- Once cataloged, you can treat your unstructured data like it is structured
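A hedged sketch of setting up and running a crawler with boto3 (the crawler name, IAM role, and bucket are hypothetical; the same setup is commonly done in the console):

```python
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="mydb",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/orders/"}]},
    Schedule="cron(0 2 * * ? *)",   # run daily at 02:00 UTC
)
glue.start_crawler(Name="orders-crawler")
```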
Glue and S3 Partition
- Glue crawler will extract partitions based on how your S3 data is organized
- Organize your S3 paths to match your query patterns, e.g.
- yyyy/mm/dd/device_id
- device_id/yyyy/mm/dd
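If the crawler has registered those path components as partition columns (for unkeyed paths it may name them partition_0, partition_1, ...), a Glue job can prune partitions at read time with a push_down_predicate. A sketch with hypothetical names:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Only the S3 prefixes matching the predicate are listed and read
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="telemetry",
    table_name="device_readings",
    push_down_predicate="yyyy='2023' and mm='06'",
)
```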
Glue and Hive
- Hive lets you run SQL-like queries from EMR
- The Glue Data Catalog can serve as a Hive “metastore”
- We can import a Hive metastore into Glue
Glue ETL
- Transform, Clean and Enrich data before data analysis
- Generate ETL code (we can modify it)
- We can provide our own Spark or PySpark scripts
- Target can be S3, JDBC, or tables in the Glue Data Catalog
- Fully managed
- Scala or Python
- Can be event-driven or scheduled
- Can provision additional DPUs (data processing units) to increase performance of underlying Spark jobs
- Batch-oriented, and you can schedule your ETL jobs at a minimum of 5-minute intervals
Glue DynamicFrame
- DynamicFrame is a collection of DynamicRecords
- DynamicRecords are self-describing and have schema
- Similar to Spark DataFrame
- Scala and Python APIs
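A minimal Python sketch (database/table names hypothetical): load a cataloged table as a DynamicFrame, inspect the inferred schema, and convert to a Spark DataFrame when full Spark functionality is needed:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="orders")

dyf.printSchema()   # each DynamicRecord is self-describing
df = dyf.toDF()     # to a Spark DataFrame; DynamicFrame.fromDF converts back
```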
Glue ETL Transformations
- Bundled Transformations
- DropFields, DropNullFields (remove fields / null fields)
- Filter, Join, Map
- Machine Learning Transformation
- FindMatches ML : Identify duplicate or matching records in your dataset
- Format Conversions : CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark transformation (example : KMeans)
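A short sketch chaining two of the bundled transformations on the DynamicFrame from the previous snippet (the field name is hypothetical):

```python
from awsglue.transforms import DropNullFields, Filter

# dyf: the DynamicFrame loaded in the previous sketch
recent = Filter.apply(frame=dyf, f=lambda row: row["year"] >= 2020)
clean = DropNullFields.apply(frame=recent)
```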
Glue ETL : Modifying the Data Catalog
- ETL scripts can update your schema and partitions if necessary
- Updating table schema (when there is a new partition)
- Re-run the crawler or
- Use enableUpdateCatalog / updateBehavior from script
- Restrictions
- S3 only
- JSON, CSV, Avro, Parquet only
- Parquet requires special code
- Nested schemas are not supported
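A hedged sketch of the script-driven route (S3 target, hypothetical names), continuing from the earlier snippets’ glueContext:

```python
# Writing through a catalog-aware sink updates the table's schema and
# partitions without re-running the crawler
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year"],
)
sink.setFormat("json")
sink.setCatalogInfo(catalogDatabase="mydb", catalogTableName="orders_out")
sink.writeFrame(clean)
```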
Glue Development Endpoints
- Develop ETL scripts using a Notebook
- Endpoint is in a VPC controlled by security groups; connect via
- Apache Zeppelin notebook, SageMaker notebook, or a terminal window
Running Glue Jobs
- Time-based schedules (cron styles)
- Job bookmarks
- Persists state from the job run
- Prevents reprocessing of old data
- Allows you to process new data only when re-running on a schedule
- Works with S3 sources in a variety of formats
- Works with relational databases via JDBC
- CloudWatch Events
- Fire off a Lambda function or SNS notification when ETL succeeds or fails
- Invoke EC2 run, send event to Kinesis, activate a Step Function
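Bookmarks only take effect when the job runs with bookmarks enabled (the --job-bookmark-option job argument) and the script commits its state. A sketch of the standard boilerplate with hypothetical table names:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# The transformation_ctx string is the key the bookmark uses to remember
# what this source has already processed
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="events",
    transformation_ctx="events_source")
# ... transforms and writes ...
job.commit()   # persist bookmark state so the next run skips old data
```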
Glue Cost Model
- Billed by the second for crawler and ETL jobs
- First million objects stored and first million accesses are free for the Glue Data Catalog
- Development endpoints for developing ETL code charged by the minute
Glue Anti-Patterns
- Non-Spark engines (Glue ETL runs on Apache Spark only)
AWS Glue Studio
- Visual interface for ETL workflows
- Visual Job Editor
- Create DAGs for complex workflows
- Sources include S3, Kinesis, Kafka, JDBC
- Transform / sample / join data
- Target to S3 or Glue Data Catalog
- Supports partitioning
- Visual Job Dashboard
AWS Glue DataBrew
- Visual data preparation tool
- UI for preprocessing large data sets
- Input from S3, data warehouse, or database
- Output to S3
- Over 250 ready-made transformations
- Create “recipes” of transformations that can be saved as jobs within a larger project
- Define data quality rules
- Create datasets with custom SQL from Redshift and Snowflake
- Security
- Can integrate with KMS
- SSL in transit
- IAM
AWS Glue Elastic Views
- Builds materialized views from Aurora, RDS, DynamoDB
- Those views can be used by Redshift, ElasticSearch, S3, DynamoDB, Aurora, RDS
- SQL Interface
- Handles all the copying, combining, and replicating of data
- Monitors for changes and continuously updates
- Serverless
AWS Lake Formation
- “Makes it easy to set up a secure data lake in days”
- Loading data and monitoring data flows
- Setting up partitions
- Encryption and managing keys
- Defining transformation jobs and monitoring them
- Built on top of Glue
- Auditing
AWS Lake Formation Pricing
- No cost for Lake Formation itself
- But underlying services incur charges
- Glue
- S3
- EMR
- Athena
- Redshift
AWS Lake Formation : The Finer Points
- Cross-account Lake Formation permissions
- Lake Formation does not support manifests in Athena or Redshift queries
- IAM needed to create blueprints and workflows
AWS Lake Formation : Governed Tables and Security
- Now supports “Governed Tables” that support ACID transactions across multiple tables
- Storage Optimization with Automatic Compaction
- Granular Access Control with row and cell level security
Elastic MapReduce
- Managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive, and more
- EMR Notebooks
- Several Integration points with AWS
EMR Cluster
- Master Node : Manages the cluster
- Tracks status of tasks, monitors cluster health
- Single EC2 instance
- Core Node : Hosts HDFS data and runs tasks
- Can be scaled up and down
- Multi-node clusters have at least one
- Task Node : Runs tasks and does not host data
- Optional, with no risk of data loss when removed
- Good use of Spot instances
EMR Usage
- Transient vs Long-Running Clusters
- Transient clusters terminate once all steps are completed
- Long running clusters must be manually terminated
- Basically a data warehouse with periodic processing on large datasets
- Can spin up task nodes using Spot instances for temporary capacity
- Can use reserved instances on long-running clusters to save money
- Termination protection enabled by default
- Frameworks and applications are specified at cluster launch
- Connect directly to master to run jobs directly
- Submit ordered steps via the console
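A hedged boto3 sketch of a transient cluster that runs one Spark step and auto-terminates (names, roles, and paths are hypothetical):

```python
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],   # frameworks fixed at launch
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # transient: auto-terminate
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```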
EMR and AWS Integration
- EC2 for the instances that comprise the nodes in the cluster
- Amazon VPC to configure the virtual network
- S3 as input and output data store
- CloudWatch to monitor cluster performance and configure alarms
- IAM for permissions
- CloudTrail to audit requests made to the service
- AWS Data Pipeline to schedule and start your cluster
EMR Storage
- HDFS
- Multiple copies stored across cluster instances for redundancy
- Files stored as blocks (128MB default size)
- Ephemeral : HDFS data is lost when cluster is terminated
- Useful for caching intermediate results of iterative workloads
- EMRFS : access S3 as if it were HDFS
- Allows persistent storage after cluster termination
- EMRFS Consistent View
- Uses DynamoDB to track consistency
- May need to provision read/write capacity on the DynamoDB table
- Local File System
- Suitable only for temporary data (buffers, caches, etc)
- EBS for HDFS
- Allows use of EMR on EBS-only types (M4, C4)
- Deleted when cluster is terminated
- EBS volumes can only be attached when launching a cluster
- If you manually detach an EBS volume, EMR treats that as a failure and replaces it
EMR Promises
- Charges by the hour, plus EC2 charges
- Provisions new nodes if a core node fails
- Can add and remove task nodes on the fly
- Can resize a running cluster’s core nodes
- Core nodes can also be added or removed
EMR Managed Scaling
- EMR Automatic Scaling
- Custom scaling rules based on CloudWatch metrics
- Supports instance groups only
- EMR Managed Scaling
- Supports instance groups and instance fleets
- Scales Spot, on-demand, and Savings Plan instances within the same cluster
- Available for Spark, Hive, YARN workloads
- Scale-up Strategy
- First adds core nodes, then task nodes, up to the max units specified
- Scale-down Strategy
- First removes task nodes, then core nodes, no lower than the minimum constraints
- Spot nodes always removed before on-demand instances
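A hedged sketch of attaching a managed scaling policy to a running cluster with boto3 (the cluster ID is a placeholder); EMR then scales between these limits using the strategies above:

```python
import boto3

emr = boto3.client("emr")
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",        # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 10,
            "MaximumOnDemandCapacityUnits": 4,   # the rest can be Spot
            "MaximumCoreCapacityUnits": 3,
        }
    },
)
```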
EMR Serverless Pre-Initialized Capacity
- Spark adds 10% overhead to memory requested for drivers and executors
- Be sure initial capacity is at least 10% more than requested by the job
EMR Serverless Security
- EMRFS
- S3 encryption at rest
- TLS encryption in transit
- Local disk encryption
- Spark communication between drivers and executors is encrypted
Hadoop
- MapReduce
- Framework for distributed data processing
- Maps data to key / value pairs
- Reduces intermediate results to final output (toy example after this card)
- Yet Another Resource Negotiator (YARN)
- Manages cluster resources for multiple data processing frameworks
- HDFS
- Distributes data block across cluster in a redundant manner
- Ephemeral in EMR
- Data lost on termination
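The toy example promised above: the MapReduce model in plain Python, with no Hadoop involved, just to show the map / shuffle / reduce phases:

```python
from collections import defaultdict

docs = ["big data on hadoop", "hadoop maps data"]

# Map phase: emit (key, value) pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: collapse each group to a final output
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'big': 1, 'data': 2, 'on': 1, 'hadoop': 2, 'maps': 1}
```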
Apache Spark
- Distributed processing framework for big data
- In-memory caching, optimized query execution
- Supports Java, R, Python, etc
- Supports code reuse across
- Batch Processing
- Interactive Queries
- Real-time Analytics
- Machine Learning
- Graph Processing
- Spark Streaming
- Integrated with Kinesis, Kafka, on EMR
- Spark is NOT meant for OLTP
How Spark Works
- The SparkContext (driver program) coordinates Spark applications
- SparkContext works through a Cluster Manager
- Executors run computations and store data
- SparkContext sends application code and tasks to executors
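A tiny PySpark illustration: the driver’s SparkContext splits the job into tasks, executors run them in parallel, and results come back to the driver:

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-lengths")
rdd = sc.parallelize(["spark", "runs", "tasks", "on", "executors"])
lengths = rdd.map(len).collect()   # map runs on executors; driver collects
print(lengths)                     # [5, 4, 5, 2, 9]
sc.stop()
```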
Spark Components
- Spark Streaming
- Real-time streaming analytics and structured streaming
- SparkSQL
- Up to 100x faster than MapReduce
- Supports JDBC, ODBC, JSON, HDFS, ORC, Parquet, HiveQL
- MLLib
- Classification, Regression, Clustering, Collaborative filtering, pattern mining (Read from HDFS, HBase)
- GraphX
- Graph Processing, ETL, analysis, iterative graph computation
- No longer widely used
- Spark Core
- Memory management, fault recovery, scheduling, distributing and monitoring jobs, interacting with storage
- Scala, Python, Java, R
Spark Structured Streaming
- Data stream as an unbounded input Table
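A minimal runnable sketch of that model using Spark’s built-in rate source: rows keep appending to the unbounded input table, and the aggregate is recomputed as they arrive:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unbounded-table").getOrCreate()

# The rate source emits (timestamp, value) rows continuously
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = stream.groupBy().count()        # aggregate over the growing table
query = (counts.writeStream
         .outputMode("complete")         # re-emit the full result each batch
         .format("console")
         .start())
query.awaitTermination(30)               # run ~30 s for the demo, then return
```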
Hive
- Allows users to run SQL-like queries on EMR
- Scalable
- Easy OLAP queries
- Highly optimized
- Highly extensible
Hive Metastore
- Hive maintains a “metastore” that imparts a structure you define on the unstructured data that is stored on HDFS
External Hive Metastores
- Metastore is stored in MySQL on the master node by default
- External metastores offer better resiliency and integration
- AWS Glue Data Catalog
- Shares schema across EMR and other AWS services
- Tie Glue to EMR using the console, CLI, or API
- Amazon RDS / Aurora
- Need to override default Hive configuration values for external database location
Apache Pig
- Pig introduces a scripting language that lets you use SQL-like syntax to define your map and reduce steps
- Highly extensible with user-defined functions (UDFs)
Apache Pig AWS Integration
- Use multiple file systems
- HDFS, S3, etc
- Load JARs and scripts from S3
HBase
- Non-relational, petabyte-scale database
- Based on Google’s BigTable, built on top of HDFS
- In-memory
- Hive Integration
HBase and AWS Integration
- Can store data (StoreFiles and metadata) on S3 via EMRFS
- Can back up to S3
Presto
- It can connect to many different “big data” databases and data stores at once and query across them
- Interactive queries at petabyte scale
- Familiar SQL syntax
- Optimized for OLAP
- Developed and partially maintained by Facebook
- This is what Amazon Athena uses under the hood
- Exposes JDBC, Command-Line and Tableau interfaces
Apache Zeppelin
- Notebook environment, similar to iPython notebooks
- Can share notebooks with others on your cluster
- Spark, Python, JDBC, HBase, ElasticSearch
Apache Zeppelin + Spark
- Run Spark code interactively
- Speeds up your development cycle
- Allows easy experimentation and exploration of the data
- Can execute SQL queries directly against SparkSQL
- Query results may be visualized in charts and graphs
- Makes Spark feel more like a data science tool
HBase Vs DynamoDB
- DynamoDB
- Fully managed
- More Integration with other AWS services
- Glue Integration
- HBase
- Efficient storage of sparse data
- Appropriate for high frequency counters (consistent reads and writes)
- High write and update throughput
- More integration with Hadoop
EMR Notebook
- Similar to Zeppelin with more AWS integrations
- Notebooks backed up to S3
- Provision clusters from the notebook
- Hosted inside a VPC
- Accessed only via AWS console
Hue
- Hadoop User Experience
- Graphical front-end for applications on your EMR cluster
- IAM Integration : Hue Super-users inherit IAM roles
Splunk
- Splunk makes machine data accessible, usable, and valuable to everyone
- Operational tool : can be used to visualize EMR and S3 data using your EMR Hadoop cluster
- Reserved instances on a 64-bit OS recommended
Flume
- Another way to stream data into your cluster
- Made from the start with Hadoop in mind
- Originally made to handle log aggregation
MXNet
- Like TensorFlow, a library for building and accelerating neural networks
- Included on EMR
S3DistCP
- Tool for copying large amounts of data
- from S3 into HDFS or from HDFS to S3
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large numbers of objects
- across buckets, across accounts
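A hedged sketch of submitting S3DistCp as a step on an existing cluster via command-runner.jar (cluster ID and paths are placeholders):

```python
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src=s3://my-bucket/input/",
                     "--dest=hdfs:///input/"],
        },
    }],
)
```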
EMR Security
- IAM
- Kerberos
- Secure user authentication
- SSH
- Secure connection to command line
- Tunneling for web interfaces
- Block Public Access
- Easy way to prevent public access to data stored on your EMR cluster
- Can set at the account level before creating the cluster
Kibana + ElasticSearch
- Kibana is an open-source data visualization and exploration tool used for log and time-series analytics, application monitoring, and operational intelligence use cases