Big Data and Analytics Flashcards

Question 1

Q

Q: What is Big Data?

Answer

A

A: Big Data refers to datasets too large or complex for traditional data-processing tools, so they require specialized tools and technologies to process, analyze, and manage.

Question 2

Q

Q: What are key AWS services for Big Data and Analytics?

Answer

A

Amazon S3
Amazon Redshift
Amazon EMR
AWS Glue
Amazon Athena
Amazon Kinesis
Amazon QuickSight
AWS Lake Formation
Amazon OpenSearch Service

Question 3

Q

Q: How does Amazon S3 support Big Data?

Answer

A

A: S3 provides scalable, durable, and low-cost object storage for storing massive datasets.

Question 4

Q

Q: What is AWS Lake Formation?

Answer

A

A: A service to set up, secure, and manage data lakes on S3, enabling fast data ingestion, cataloging, and governance.

Question 5

Q

Q: What is Amazon Redshift?

Answer

A

A: A fully managed, petabyte-scale data warehouse that enables fast analytics using SQL queries.

Question 6

Q

Q: What is Redshift Spectrum?

Answer

A

A: A feature that allows querying data directly from S3 without loading it into Redshift.

Question 7

Q

Q: What is Amazon Athena?

Answer

A

A: An interactive query service that allows you to run SQL queries on data stored in S3.

Question 8

Q

Q: What is Amazon EMR?

Answer

A

A: A managed service for processing big data using frameworks like Apache Hadoop, Spark, and Hive.

Question 9

Q

Q: What is Apache Spark?

Answer

A

A: A distributed data processing framework for fast analytics and machine learning on large datasets, available on EMR.

Question 10

Q

Q: What is AWS Glue?

Answer

A

A: A fully managed ETL (Extract, Transform, Load) service for preparing and transforming data.

Question 11

Q

Q: What is the AWS Glue Data Catalog?

Answer

A

A: A centralized metadata repository for managing data schemas and organizing data stored in S3 and other sources.

Question 12

Q

Q: What is Amazon Kinesis?

Answer

A

A: A service for real-time data streaming and analytics.

Question 13

Q

Q: What are Kinesis Data Streams?

Answer

A

A: A service for ingesting real-time streaming data and processing it with AWS services or custom applications.

Question 14

Q

Q: What is Kinesis Data Firehose?

Answer

A

A: A service for loading streaming data into destinations like S3, Redshift, or Amazon OpenSearch Service (formerly Amazon Elasticsearch Service).

Question 15

Q

Q: What is Kinesis Data Analytics?

Answer

A

A: A service for analyzing streaming data in real time using SQL or Apache Flink.

Question 16

Q

Q: What is Amazon OpenSearch Service?

Answer

A

A: A managed service for real-time search, log analytics, and visualization of large datasets.

Question 17

Q

Q: What is Amazon QuickSight?

Answer

A

A: A cloud-powered business intelligence (BI) tool for creating and sharing interactive dashboards and visualizations.

Question 18

Q

Q: What is AWS Data Pipeline?

Answer

A

A: A service for automating data movement and transformation across AWS services and on-premises systems.

Question 19

Q

Q: What is a data lake?

Answer

A

A: A centralized repository for storing structured, semi-structured, and unstructured data at any scale.

Question 20

Q

Q: How is a data warehouse different from a data lake?

Answer

A

A: A data warehouse stores structured data optimized for analytics, while a data lake stores all types of data in raw format.

Question 21

Q

Q: How is Amazon CloudWatch used in analytics workflows?

Answer

A

A: It collects logs and metrics from data processing pipelines and triggers alarms when they cross defined thresholds.

Question 22

Q

Q: What is ETL?

Answer

A

A: Extract, Transform, Load (ETL) is the process of extracting data from sources, transforming it into a suitable format, and loading it into a data store.
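The three stages can be sketched in a few lines of plain Python. This is a minimal illustration, not how AWS Glue or any specific service implements ETL: the CSV content, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the target data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this might be a file read from S3.
RAW_CSV = """order_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,eur
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, cast types, normalize currency."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a real data store
load(transform(extract(RAW_CSV)), conn)

result = conn.execute(
    "SELECT currency, SUM(amount) FROM orders GROUP BY currency ORDER BY currency"
).fetchall()
print(result)  # [('EUR', 7.25), ('USD', 10.5)]
```

Keeping each stage as its own function makes it easy to swap the source or the destination without touching the transformation logic.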

Question 23

Q

Q: How does AWS Step Functions support Big Data?

Answer

A

A: By orchestrating data processing workflows across multiple AWS services.

Question 24

Q

Q: What is data partitioning in Big Data?

Answer

A

A: Dividing large datasets into smaller, manageable chunks to improve query and processing performance.
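One common layout, assumed here for illustration, is Hive-style key prefixes such as `year=2024/month=01`. The sketch below groups hypothetical event records by that key and then answers a query by reading only the matching partition, which is the idea behind partition pruning. The record fields and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical event records; the field names are illustrative.
events = [
    {"date": "2024-01-15", "user": "a", "bytes": 120},
    {"date": "2024-01-20", "user": "b", "bytes": 300},
    {"date": "2024-02-03", "user": "a", "bytes": 50},
]

def partition_key(record):
    """Build a Hive-style prefix such as 'year=2024/month=01'."""
    year, month, _day = record["date"].split("-")
    return f"year={year}/month={month}"

def partition(records):
    """Group records by their partition key."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_key(record)].append(record)
    return dict(partitions)

def total_bytes(partitions, year, month):
    """Sum bytes for one month, touching only the matching partition."""
    wanted = f"year={year}/month={month}"
    return sum(r["bytes"] for r in partitions.get(wanted, []))

partitions = partition(events)
print(sorted(partitions))  # ['year=2024/month=01', 'year=2024/month=02']
print(total_bytes(partitions, "2024", "01"))  # 420
```

The February records are never read when the query filters on January, which is where the performance gain comes from at scale.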

Question 25

Q

Q: What is schema evolution?

Answer

A

A: The ability to accommodate changes in data structure (e.g., adding or removing fields) without disrupting processing.
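A small sketch of the idea, assuming a hypothetical schema change that adds an `email` field with a default and drops a `fax` field. Formats such as Avro resolve reader and writer schemas along broadly similar lines, but this code is only an illustration and does not use any real serialization library.

```python
# Hypothetical current schema: field name -> default value.
# Compared with the older version, "email" was added and "fax" was removed.
CURRENT_SCHEMA = {"id": None, "name": "", "email": "unknown"}

def read_record(raw):
    """Project a record written under any schema version onto the current schema.

    Fields missing from older records get the default value;
    fields that are no longer in the schema are ignored.
    """
    return {field: raw.get(field, default) for field, default in CURRENT_SCHEMA.items()}

old_record = {"id": 1, "name": "Ada", "fax": "555-0100"}          # written before the change
new_record = {"id": 2, "name": "Lin", "email": "lin@example.com"}  # written after the change

rows = [read_record(old_record), read_record(new_record)]
print(rows)
```

Both records come out with the same three fields, so downstream processing does not need to know which schema version produced them.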

Question 26

Q

Q: What is Amazon S3 Select?

Answer

A

A: A feature that retrieves specific data subsets from S3 objects using SQL queries, reducing data processing overhead.

Question 27

Q

Q: How is Amazon Rekognition used in Big Data?

Answer

A

A: For analyzing large collections of images and videos, for tasks such as facial recognition and object detection.

Question 28

Q

Q: What tools support data governance in AWS?

Answer

A

A: AWS Lake Formation, AWS Glue, and AWS Identity and Access Management (IAM).

Question 29

Q

Q: How is data encrypted in AWS Big Data workflows?

Answer

A

A: Encryption at rest (e.g., in S3 and Redshift) uses keys managed in AWS KMS; encryption in transit uses TLS.

Question 30

Q

Q: What is Amazon EventBridge?

Answer

A

A: A service for building event-driven architectures to process and analyze data in real time.

Question 31

Q

Q: What Hadoop ecosystem tools are supported on Amazon EMR?

Answer

A

A: Hadoop, Hive, HBase, Spark, Flink, and Presto.

Question 32

Q

Q: What are common use cases for real-time analytics on AWS?

Answer

A

A: Fraud detection, log analytics, streaming video analysis, and predictive maintenance.

Question 33

Q

Q: How does AWS integrate ML with Big Data?

Answer

A

A: Using SageMaker with data stored in S3, Redshift, or DynamoDB, and processed using EMR or Glue.

Question 34

Q

Q: Which AWS services offer query acceleration?

Answer

A

A: Amazon Athena, Redshift Spectrum, and S3 Select.

Question 35

Q

Q: How does Lake Formation enforce governance?

Answer

A

A: By providing fine-grained access control and cataloging data for secure sharing.

Question 36

Q

Q: How can you optimize Big Data costs on AWS?

Answer

A

A: Use reserved or spot instances, lifecycle policies for S3, compress data, and query subsets with tools like Athena or Redshift Spectrum.
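The effect of compression is easy to see with the standard library. The sketch below gzips a batch of hypothetical, highly repetitive sensor records; the record shape is made up, and real savings depend on the data and the format chosen. Fewer bytes mean less to store and, for services that bill by data scanned, less to pay per query.

```python
import gzip
import json

# Hypothetical sensor readings, serialized as newline-delimited JSON.
records = [{"sensor_id": i % 10, "status": "ok", "reading": 20.0} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.2f}")

# Compression is lossless: the original bytes are fully recoverable.
restored = gzip.decompress(compressed)
```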

Question 37

Q

Q: How does AWS provide elasticity for Big Data?

Answer

A

A: By scaling resources automatically with services like EMR, Redshift, and Kinesis.

Question 38

Q

Q: Which AWS services support serverless Big Data processing?

Answer

A

A: Amazon Athena, AWS Glue, Amazon Kinesis Data Firehose, and AWS Lambda.

Question 39

Q

Q: What AWS tools are used for data transformation?

Answer

A

A: AWS Glue, Amazon EMR with Spark, and Amazon Redshift with SQL.

Question 40

Q

Q: What are best practices for securing Big Data workflows in AWS?

Answer

A

A: Encrypt data at rest and in transit, grant least-privilege IAM permissions, use VPC endpoints for data transfer, and enable logging with CloudTrail.
