Big Data and Analytics Flashcards

1
Q

Q: What is Big Data?

A

A: Big Data refers to datasets so large and complex that they require specialized tools and technologies to process, analyze, and manage.

2
Q

Q: What are key AWS services for Big Data and Analytics?

A
  • Amazon S3
  • Amazon Redshift
  • Amazon EMR
  • AWS Glue
  • Amazon Athena
  • Amazon Kinesis
  • Amazon QuickSight
  • AWS Lake Formation
  • Amazon OpenSearch Service
3
Q

Q: How does Amazon S3 support Big Data?

A

A: S3 provides scalable, durable, and low-cost object storage for storing massive datasets.

4
Q

Q: What is AWS Lake Formation?

A

A: A service to set up, secure, and manage data lakes on S3, enabling fast data ingestion, cataloging, and governance.

5
Q

Q: What is Amazon Redshift?

A

A: A fully managed, petabyte-scale data warehouse that enables fast analytics using SQL queries.

6
Q

Q: What is Redshift Spectrum?

A

A: A feature that allows querying data directly from S3 without loading it into Redshift.

7
Q

Q: What is Amazon Athena?

A

A: An interactive query service that allows you to run SQL queries on data stored in S3.

8
Q

Q: What is Amazon EMR?

A

A: A managed service for processing big data using frameworks like Apache Hadoop, Spark, and Hive.

9
Q

Q: What is Apache Spark?

A

A: A distributed data processing framework for fast analytics and machine learning on large datasets, available on EMR.

10
Q

Q: What is AWS Glue?

A

A: A fully managed ETL (Extract, Transform, Load) service for preparing and transforming data.

11
Q

Q: What is the AWS Glue Data Catalog?

A

A: A centralized metadata repository for managing data schemas and organizing data stored in S3 and other sources.

12
Q

Q: What is Amazon Kinesis?

A

A: A service for real-time data streaming and analytics.

13
Q

Q: What are Kinesis Data Streams?

A

A: A service for ingesting real-time streaming data and processing it with AWS services or custom applications.

14
Q

Q: What is Kinesis Data Firehose?

A

A: A service for loading streaming data into destinations such as S3, Redshift, or Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). AWS now names it Amazon Data Firehose.

15
Q

Q: What is Kinesis Data Analytics?

A

A: A service for analyzing streaming data in real time using SQL or Apache Flink applications.

16
Q

Q: What is Amazon OpenSearch Service?

A

A: A managed service for real-time search, log analytics, and visualization of large datasets.

17
Q

Q: What is Amazon QuickSight?

A

A: A cloud-powered business intelligence (BI) tool for creating and sharing interactive dashboards and visualizations.

18
Q

Q: What is AWS Data Pipeline?

A

A: A service for automating data movement and transformation across AWS services and on-premises systems.

19
Q

Q: What is a data lake?

A

A: A centralized repository for storing structured, semi-structured, and unstructured data at any scale.

20
Q

Q: How is a data warehouse different from a data lake?

A

A: A data warehouse stores structured data optimized for analytics, while a data lake stores all types of data in raw format.

21
Q

Q: How is Amazon CloudWatch used in analytics workflows?

A

A: It collects logs and metrics from data processing pipelines and raises alarms when they cross defined thresholds.

22
Q

Q: What is ETL?

A

A: Extract, Transform, Load (ETL) is the process of extracting data from sources, transforming it into a suitable format, and loading it into a data store.
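
The three stages can be sketched with only the Python standard library. This is a minimal illustration, not how AWS Glue or any other service implements ETL: the CSV sample, the `orders` table, and the function names are all assumptions made up for the example, and an in-memory SQLite database stands in for the target data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: stray whitespace, mixed-case codes, one missing amount.
RAW_CSV = """order_id,amount,country
1, 19.99 ,us
2,5.00,DE
3,,us
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, upper-case country codes."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(amount), row["country"].strip().upper())
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in order."""
    load(transform(extract(text)), conn)
```

Calling `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` loads the two complete orders and skips the one with no amount.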

23
Q

Q: How does AWS Step Functions support Big Data?

A

A: By orchestrating data processing workflows across multiple AWS services.
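
As a rough sketch, a workflow is described in Amazon States Language (JSON). The helper below builds a two-step definition that runs a Glue job and then an Athena query. The state names, the function `build_state_machine`, and its arguments are assumptions for illustration; the sketch only produces the JSON and does not call AWS.

```python
import json


def build_state_machine(glue_job_name, athena_query, workgroup="primary"):
    """Return an Amazon States Language definition as a JSON string.

    State names and parameter values are hypothetical placeholders.
    """
    definition = {
        "Comment": "Illustrative pipeline: transform with Glue, then query with Athena",
        "StartAt": "TransformWithGlue",
        "States": {
            "TransformWithGlue": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Next": "QueryWithAthena",
            },
            "QueryWithAthena": {
                "Type": "Task",
                "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
                "Parameters": {"QueryString": athena_query, "WorkGroup": workgroup},
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)
```

The resulting string could be supplied as the definition when creating a state machine; retries, error handling, and IAM roles are left out to keep the sketch short.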

24
Q

Q: What is data partitioning in Big Data?

A

A: Dividing large datasets into smaller, manageable chunks to improve query and processing performance.
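
One common layout is Hive-style date partitioning, where each chunk lives under a `key=value` prefix. The sketch below is an assumption-laden illustration: the record shape, the `event_date` field, and the function names are hypothetical, and a plain dictionary stands in for object storage. It shows why partitioning helps: a query for one month only has to touch matching prefixes.

```python
from collections import defaultdict
from datetime import date


def partition_prefix(day):
    """Build a Hive-style prefix such as year=2024/month=03/day=05/."""
    return f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def partition_records(records):
    """Group records by the prefix derived from their event_date field."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_prefix(record["event_date"])].append(record)
    return dict(partitions)


def prune(partitions, year, month):
    """Return only records from partitions matching the given year and month."""
    wanted = f"year={year}/month={month:02d}/"
    return [
        record
        for prefix, rows in partitions.items()
        if prefix.startswith(wanted)
        for record in rows
    ]
```

Here `prune` skips every partition outside the requested month without inspecting its records, which is the effect partition pruning has in engines that understand this layout.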

25
Q

Q: What is schema evolution?

A

A: The ability to accommodate changes in data structure (e.g., adding or removing fields) without disrupting processing.
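
A minimal sketch of the idea, assuming JSON records and a reader-side schema with defaults: fields added later get a default when reading old records, and fields that were removed are ignored. `CURRENT_SCHEMA`, its fields, and `read_record` are hypothetical names for this example, not part of any AWS API.

```python
import json

# Hypothetical current schema: field name -> default used when a record lacks it.
CURRENT_SCHEMA = {"user_id": None, "email": None, "plan": "free"}


def read_record(line):
    """Project a JSON record onto the current schema.

    Missing fields (added after the record was written) receive defaults;
    unknown fields (since removed from the schema) are dropped.
    """
    raw = json.loads(line)
    return {field: raw.get(field, default) for field, default in CURRENT_SCHEMA.items()}
```

An old record that still carries a retired field and lacks `plan` is read without error, so downstream processing sees a consistent shape.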

26
Q

Q: What is Amazon S3 Select?

A

A: A feature that retrieves specific data subsets from S3 objects using SQL queries, reducing data processing overhead.

27
Q

Q: How is Amazon Rekognition used in Big Data?

A

A: For analyzing large datasets of images and videos, for tasks such as facial recognition or object detection.

28
Q

Q: What tools support data governance in AWS?

A

A: AWS Lake Formation, AWS Glue, and AWS Identity and Access Management (IAM).

29
Q

Q: How is data encrypted in AWS Big Data workflows?

A

A: At rest, using keys managed by AWS KMS (e.g., for S3 and Redshift); in transit, using TLS.

30
Q

Q: What is Amazon EventBridge?

A

A: A service for building event-driven architectures to process and analyze data in real time.

31
Q

Q: What Hadoop ecosystem tools are supported on Amazon EMR?

A

A: Hadoop, Hive, HBase, Spark, Flink, and Presto.

32
Q

Q: What are common use cases for real-time analytics on AWS?

A

A: Fraud detection, log analytics, streaming video analysis, and predictive maintenance.

33
Q

Q: How does AWS integrate ML with Big Data?

A

A: Using SageMaker with data stored in S3, Redshift, or DynamoDB, and processed using EMR or Glue.

34
Q

Q: Which AWS services offer query acceleration?

A

A: Amazon Athena, Redshift Spectrum, and S3 Select.

35
Q

Q: How does Lake Formation enforce governance?

A

A: By providing fine-grained access control and cataloging data for secure sharing.

36
Q

Q: How can you optimize Big Data costs on AWS?

A

A: Use Reserved or Spot Instances, apply S3 lifecycle policies, compress data, and query only the subsets you need with tools like Athena or Redshift Spectrum.
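
To illustrate the compression point, the sketch below gzips a batch of JSON lines and compares sizes. The sample records and function names are made up for the example, and the actual ratio depends entirely on the data; repetitive logs and sensor readings tend to compress well.

```python
import gzip
import json


def to_json_lines(records):
    """Serialize records as newline-delimited JSON bytes."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")


def compress(payload):
    """Gzip the payload, as one might before uploading it to object storage."""
    return gzip.compress(payload)


# Hypothetical, highly repetitive sample data.
records = [{"sensor": "s1", "status": "ok", "reading": i % 10} for i in range(1000)]
raw = to_json_lines(records)
packed = compress(raw)
print(f"raw bytes: {len(raw)}, gzip bytes: {len(packed)}")
```

Smaller objects cost less to store and, for services that bill by data scanned, less to query.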

37
Q

Q: How does AWS provide elasticity for Big Data?

A

A: By scaling resources automatically with services like EMR, Redshift, and Kinesis.

38
Q

Q: Which AWS services support serverless Big Data processing?

A

A: Amazon Athena, AWS Glue, Amazon Kinesis Data Firehose, and AWS Lambda.

39
Q

Q: What AWS tools are used for data transformation?

A

A: AWS Glue, Amazon EMR with Spark, and Amazon Redshift with SQL.

40
Q

Q: What are best practices for securing Big Data workflows in AWS?

A

A: Encrypt data at rest and in transit, apply least-privilege IAM permissions, use VPC endpoints for data transfer, and enable logging with CloudTrail.