Data Transformation, Integrity, and Feature Engineering Flashcards

1
Q

What is AWS EMR?

A

Elastic MapReduce, a managed Hadoop framework.

2
Q

Does AWS EMR have Notebooks?

A

Yes

3
Q

Can AWS EMR support Spark?

A

Yes

4
Q

What is the main node called in an EMR cluster?

A

The master node

5
Q

What are core nodes used for in AWS EMR?

A

Stores HDFS data and runs tasks

6
Q

What is a Task Node in an EMR cluster?

A

Runs tasks, but does not host data.

7
Q

What is a good way to reduce costs for task nodes?

A

Use spot instances.

8
Q

What is a transient cluster in AWS EMR?

A

It automatically terminates after all the steps have been completed.

9
Q

How can you start jobs in EMR?

A

By connecting to the master node and submitting jobs directly, or by adding ordered steps through the console.

10
Q

What is the alternative storage to HDFS in AWS EMR?

A

S3

11
Q

What is EMRFS?

A

A file system implementation that lets the cluster use S3 as if it were HDFS.

12
Q

What is the default size of a block in HDFS?

A

128MB

13
Q

Is HDFS ephemeral?

A

Yes; data is lost when the cluster terminates, but it offers good performance.

14
Q

How can you track consistency in EMRFS?

A

DynamoDB (via the EMRFS consistent view feature)

15
Q

Can Spark replace MapReduce in AWS EMR?

A

Yes

16
Q

What is Spark SQL in AWS EMR?

A

A low-latency query engine, up to 100x faster than MapReduce. It supports DataFrames.
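
For orientation, a minimal PySpark sketch of the DataFrame / Spark SQL workflow; the S3 path, view name, and columns are hypothetical.

```python
# Minimal Spark SQL / DataFrame sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Read data into a DataFrame, register it as a view, and query it with SQL.
df = spark.read.csv("s3://my-bucket/people.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) AS n FROM people GROUP BY age").show()
```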

17
Q

What is GraphX for AWS EMR?

A

A graph processing framework built on top of Spark.

18
Q

What is MLlib for AWS EMR?

A

A machine learning library built on top of Spark.

19
Q

Can Spark be integrated with AWS Kinesis?

A

Yes

20
Q

What is Zeppelin?

A

A notebook compatible with AWS EMR

21
Q

What is the problem with too many features in your data?

A

It leads to sparse data.

22
Q

What is a dimension?

A

Every feature is a new dimension

23
Q

What is the TF-IDF algorithm?

A

It figures out what terms are most relevant for a document.

24
Q

What are the two components of the TF-IDF equation?

A

Term Frequency (TF) and Inverse Document Frequency (IDF); intuitively, term frequency divided by document frequency.
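
In other words, TF-IDF(term, doc) = TF(term, doc) x IDF(term), where IDF(term) is roughly log(N / DF(term)). A minimal sketch with scikit-learn's TfidfVectorizer; the documents are made up.

```python
# Minimal TF-IDF sketch using scikit-learn; the documents are made up.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs chase cats"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # rows = documents, columns = terms
print(vec.get_feature_names_out())     # vocabulary: each term becomes a feature
print(X.toarray().round(2))            # TF-IDF weight of each term per document
```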

25
Q

What is mean replacement?

A

When a column has missing values, you fill them in with the average value of that column.
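
A minimal pandas sketch; the column values are made up.

```python
# Minimal mean-replacement sketch with pandas; the values are made up.
import pandas as pd

df = pd.DataFrame({"age": [25.0, 30.0, None, 45.0]})
df["age"] = df["age"].fillna(df["age"].mean())    # fill missing values with the mean
# With outliers present, df["age"].median() is the more robust choice (next card).
```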

26
Q

When is it better to use median over mean replacement?

A

When there are outliers present.

27
Q

Is KNN better than mean replacement?

A

Yes
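
A minimal sketch with scikit-learn's KNNImputer; the sample matrix is made up.

```python
# Minimal KNN-imputation sketch with scikit-learn; the matrix is made up.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))   # NaN filled using the 2 most similar rows
```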

28
Q

Does KNN work for categorical data?

A

No. Only numerical data.

29
Q

What imputation technique is best for categorical data?

A

Deep Learning

30
Q

What is the best way to deal with missing data?

A

Get more data!

31
Q

What is the best way to deal with unbalanced data?

A

Duplicate samples from the minority class

32
Q

What is SMOTE?

A

It generates new samples of the minority class by using nearest neighbors.
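
A minimal sketch assuming the imbalanced-learn package is installed; X and y are a synthetic imbalanced dataset.

```python
# Minimal SMOTE sketch; assumes the imbalanced-learn package is installed.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
# New minority samples are interpolated between existing minority nearest neighbors.
```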

33
Q

Is SMOTE better than duplicating the minority class?

A

Yes.

34
Q

What method is good for dealing with outliers?

A

Removing the outliers

35
Q

What can be used to detect outliers?

A

Random Cut Forest

36
Q

What is binning?

A

You convert numerical data into categorical data / ranges of values. For example, put everyone in their twenties into one bucket.
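
A minimal pandas sketch of fixed-width binning; the ages and bin edges are made up.

```python
# Minimal binning sketch with pandas; ages and bin edges are made up.
import pandas as pd

ages = pd.Series([22, 25, 37, 41, 58, 63])
age_bins = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70])   # fixed-width ranges
print(age_bins.value_counts())
```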

37
Q

What is quantile binning?

A

It assigns values to bins based on their place in the distribution, so each bin contains roughly the same number of samples.
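
A minimal pandas sketch; compare with pd.cut above, which uses fixed-width ranges.

```python
# Minimal quantile-binning sketch; each bin gets roughly the same number of samples.
import pandas as pd

values = pd.Series([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])
quartiles = pd.qcut(values, q=4, labels=["q1", "q2", "q3", "q4"])
print(quartiles.value_counts())
```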

38
Q

What is the one hot encoding transformation?

A

It creates a binary column (bucket) for every category, with a 1 in the column that matches the value. Common in deep learning.
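
A minimal pandas sketch; the color column is hypothetical.

```python
# Minimal one-hot encoding sketch with pandas; the column is hypothetical.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))   # one binary column per category
```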

39
Q

Can shuffling help with training data?

A

Yes.
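
A minimal pandas sketch of shuffling rows before training.

```python
# Minimal shuffling sketch with pandas; removes any ordering bias before training.
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
```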

40
Q

What is SageMaker?

A

An AWS service geared toward building, training, and deploying machine learning models.

41
Q

Where does most data in SageMaker come from?

A

S3

42
Q

What are popular data formats for SageMaker data?

A

RecordIO / Protobuf (commonly combined as RecordIO-protobuf)

43
Q

Does SageMaker integrate with Apache Spark?

A

Yes

44
Q

What services does SageMaker integrate with?

A

Athena, EMR, Redshift, Keyspaces

45
Q

What is the three-step process for SageMaker Processing?

A

Copy data from S3

Spin up processing container

Output processed data to S3 (see the sketch below)
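
A minimal sketch with the SageMaker Python SDK; the role ARN, bucket paths, script name, and framework version are assumptions and may vary by environment and SDK version.

```python
# Minimal SageMaker Processing sketch; role, paths, and versions are hypothetical.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
processor.run(
    code="preprocess.py",   # your processing script, run inside the container
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```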

46
Q

What are the two deployment models for SageMaker trained models?

A

Persistent - Always available for individual predictions on demand

Batch Transform - To get predictions on a dataset

47
Q

What is SageMaker Neo?

A

It allows you to deploy your ML model at the edge.

48
Q

What is elastic inference for SageMaker?

A

It accelerates deep learning models

49
Q

Can endpoints in SageMaker be scaled?

A

Yes

50
Q

What is Shadow Testing?

A

It allows you to evaluate new models against currently deployed models to catch errors.

51
Q

What is SageMaker Ground Truth?

A

Allows you to have humans label your data. Common for image classification.

52
Q

How does SageMaker Ground Truth work?

A

Ground Truth trains its own model as humans label images; over time, only the images that model is unsure about are sent out for human labeling.

53
Q

What are two AWS AI services that can also help with recognition and labeling?

A

Rekognition (images)
Comprehend (text)

54
Q

What is Ground Truth Plus?

A

It is a turnkey solution. Almost completely hands off.

55
Q

What is SageMaker Data Wrangler?

A

An ETL pipeline built into SageMaker, very similar to Glue Studio but focused on machine learning.

56
Q

Can you visualize data in SageMaker Data Wrangler?

A

Yes

57
Q

What does the QuickModel feature within SageMaker Data Wrangler do?

A

It lets you quickly train a model on your data and measure the results.

58
Q

What are common troubleshooting checks for Data Wrangler problems?

A

Ensure the user has the correct IAM roles/policies (needs AmazonSageMakerFullAccess)

Ensure the data source allows Data Wrangler access

Check the EC2 instance service quota (limits)

59
Q

Do you need a domain to use SageMaker Studio?

A

Yes

60
Q

Does Data Wrangler evaluate and provide an analysis on your data?

A

Yes.

61
Q

What can be used to get CloudWatch alerts on quality deviations in your deployed models?

A

SageMaker Model Monitor

62
Q

Can SageMaker Model Monitor detect data drift?

A

Yes, and it visualizes it.

63
Q

What data issues can SageMaker model monitor detect?

A

Anomalies, outliers, and new features

64
Q

What does SageMaker Clarify do?

A

It detects potential bias and can alert on new potential bias.

65
Q

What can be used in SageMaker to explain a model’s behavior?

A

SageMaker Clarify

66
Q

Does SageMaker Model Monitor integrate with third-party dashboards?

A

Yes

67
Q

What are Partial Dependence Plots (PDPs)?

A

They show how feature values influence a model's predictions. Used by SageMaker Clarify.

68
Q

What are Shapley values?

A

They determine the contribution of each feature to a model's predictions. Used by SageMaker Clarify.

69
Q

Can spot instances be used with SageMaker?

A

Yes

70
Q

What is SageMaker Feature Store?

A

It allows you to share features between models.

71
Q

Does SageMaker Feature Store support streaming or batch data ingestion?

A

Both

72
Q

Can SageMaker Feature Store be secured with PrivateLink?

A

Yes

73
Q

What is SageMaker Canvas?

A

It is a no-code machine learning service geared for business analysts.

74
Q

Can you share models from SageMaker Canvas to Studio?

A

Yes

75
Q

What is AWS Glue?

A

A serverless service for discovering and defining table definitions and schemas. It also handles custom ETL jobs.

76
Q

Can you handle data transformations in AWS Glue Studio?

A

Yes

77
Q

What is AWS Glue Data Quality?

A

It is a step you can add to your job to evaluate the quality of your data. If the data does not meet your quality rules, it can be failed or generate an alert in CloudWatch.

78
Q

What is AWS Glue DataBrew?

A

A visual data preparation tool.

79
Q

What are Glue DataBrew Recipes?

A

They are transforms that can be saved and applied on other jobs.

80
Q

Can you create datasets in AWS Glue DataBrew from Redshift and Snowflake?

A

Yes

81
Q

Can you feature engineer in DataBrew?

A

Yes

82
Q

How can you handle or remove PII in DataBrew?

A

Substitution with REPLACE_WITH_RANDOM

Shuffle with SHUFFLE_ROWS to shuffle the PII with other users' PII, rendering it inaccurate.

Deterministic encryption with DETERMINISTIC_ENCRYPTION

NULL it

Mask it

Hash it

83
Q

What is AWS Athena?

A

A query service for running SQL queries against data in S3.

84
Q

What are some good use cases for Athena?

A

Querying staging data before loading it into Redshift

Analyze CloudTrail or CloudFront logs

Integration with QuickSight

85
Q

What are Athena Workgroups?

A

They organize users/teams/apps and allow you to control query access and track costs by workgroup.

86
Q

Can you limit how much data is scanned with Athena Workgroups?

A

Yes

87
Q

Can you track query history in Athena Workgroups?

A

Yes

88
Q

Are failed queries or CREATE/ALTER/DROP queries billable?

A

No

89
Q

What can you do with your data in S3 to reduce Athena costs?

A

Use ORC or Parquet. This also increases performance.
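
A minimal pandas sketch of converting CSV to Parquet (assumes pyarrow or fastparquet is installed); the file names are hypothetical.

```python
# Minimal CSV-to-Parquet sketch; assumes pyarrow or fastparquet is installed.
import pandas as pd

df = pd.read_csv("raw_logs.csv")
df.to_parquet("logs.parquet", index=False)   # columnar format, so Athena scans less data
```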

90
Q

Is Athena good for highly formatted reports?

A

No

91
Q

Can Athena perform ETL?

A

Light ETL, but Glue is the better option.

92
Q

What is Create Table As Select (CTAS) in Athena?

A

It creates a new table from query results. Good for creating subsets of data.
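
A minimal boto3 sketch of running a CTAS query; the database, table, and S3 paths are hypothetical.

```python
# Minimal CTAS sketch with boto3; database, table, and S3 paths are hypothetical.
import boto3

athena = boto3.client("athena")
ctas = """
CREATE TABLE my_db.orders_parquet
WITH (format = 'PARQUET', external_location = 's3://my-bucket/orders-parquet/')
AS SELECT order_id, status, total FROM my_db.orders_raw WHERE year = '2024'
"""
athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```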

93
Q

Is CTAS good for changing data format?

A

Yes.

94
Q

Which processes faster in Athena: a small number of large files or a large number of small files?

A

A small number of large files.

95
Q

If you add partitions after the fact, what command must be run in Athena?

A

MSCK REPAIR TABLE

96
Q

If you want to support ACID transactions in Athena, how do you enable it?

A

Use a table_type of ICEBERG in the CREATE TABLE command.
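
A minimal boto3 sketch; the table name, columns, and S3 locations are hypothetical.

```python
# Minimal Iceberg table sketch with boto3; names and locations are hypothetical.
import boto3

athena = boto3.client("athena")
ddl = """
CREATE TABLE my_db.orders_iceberg (order_id string, total double)
LOCATION 's3://my-bucket/orders-iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```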

97
Q

Can you recover data that was recently deleted in Athena?

A

Yes, but only if you enable ACID transactions.
