Data Transformation, Integrity, and Feature Engineering Flashcards by Chris Lombardi

What is AWS EMR?

Elastic MapReduce. It is a managed Hadoop framework.

Does AWS EMR have Notebooks?

Yes

Can AWS EMR support Spark?

Yes

What is the main node called in an EMR cluster?

The master node (called the primary node in current AWS documentation).

What are core nodes used for in AWS EMR?

They store HDFS data and run tasks.

What is a Task Node in an EMR cluster?

Runs tasks, but does not host data.

What is a good way to reduce costs for task nodes?

Use spot instances.

What is a transient cluster in AWS EMR?

It is a cluster that automatically terminates after all of its steps have been completed.

How can you start jobs in EMR?

By connecting to the master node directly, or by using the console to add ordered steps.

What is the alternative storage to HDFS in AWS EMR?

EMRFS, which stores data in S3.

What is EMRFS?

The EMR File System. It lets the cluster use S3 as if it were HDFS.

What is the default size of a block in HDFS?

128 MB
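
As a rough illustration of what the 128 MB block size implies, the sketch below counts how many blocks a file of a given size would occupy. The function name and the sample file sizes are hypothetical and chosen only for the example.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size from the card above


def hdfs_block_count(file_size_mb: float) -> int:
    """Number of HDFS blocks needed to hold a file of the given size."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


# A 1 GB (1024 MB) file splits into 8 full blocks.
print(hdfs_block_count(1024))  # 8
# One extra megabyte needs a ninth, mostly empty block.
print(hdfs_block_count(1025))  # 9
```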

Is HDFS ephemeral?

Yes. The data is lost when the cluster terminates, but local storage is good for performance.

How can you track consistency in EMRFS?

DynamoDB

Can Spark replace MapReduce in AWS EMR?

Yes

What is Spark SQL in AWS EMR?

A low-latency query engine. Up to 100x faster than MapReduce. Supports DataFrames.

What is GraphX for AWS EMR?

A graph processing framework built on top of Spark.

What is MLlib for AWS EMR?

It is Spark's machine learning library. It lets you run machine learning on top of Spark.

Can Spark be integrated with AWS Kinesis?

Yes

What is Zeppelin?

A notebook compatible with AWS EMR

What is the problem with too many features in your data?

It leads to sparse data.

What is a dimension?

Every feature is a new dimension

What is the TF-IDF algorithm?

It figures out what terms are most relevant for a document.

What are the two components of the TF-IDF equation?

Term Frequency and Inverse Document Frequency: term frequency divided by document frequency.
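
A minimal sketch of the calculation, assuming the common log-scaled variant in which term frequency is multiplied by the logarithm of the inverse document frequency. The function name and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF of `term` in `doc`, using a log-scaled inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)        # how often the term occurs in this document
    df = sum(1 for d in corpus if term in d)  # how many documents contain the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)          # rarer terms get a larger weight
    return tf * idf


corpus = [
    "spark runs on emr".split(),
    "emr stores data in hdfs".split(),
    "athena queries data in s3".split(),
]

# "emr" appears in two of the three documents and "spark" in only one,
# so "spark" is the more relevant term for the first document.
print(tf_idf("spark", corpus[0], corpus) > tf_idf("emr", corpus[0], corpus))  # True
```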

What is mean replacement?

When a column is missing data, you fill the gaps with the average value of that column.

When is it better to use median over mean replacement?

When there are outliers present.
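
The difference can be seen in a small sketch. The `impute` helper and the income figures are hypothetical; the point is only that one outlier drags the mean far from the typical value while the median barely moves.

```python
from statistics import mean, median


def impute(values, strategy=mean):
    """Replace None entries with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


incomes = [40, 45, 50, None, 55, 1000]  # 1000 is an outlier

print(impute(incomes, mean)[3])    # 238, pulled up by the outlier
print(impute(incomes, median)[3])  # 50, close to the typical value
```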

Is KNN better than mean replacement?

Generally yes. It fills the gap from the most similar rows instead of a single column-wide average.

Does KNN work for categorical data?

No. Only numerical data.
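
A minimal sketch of KNN imputation on numerical data, assuming Euclidean distance over the columns that are present and a plain average of the neighbors' values. The helper name, the column meanings, and the numbers are all made up.

```python
import math


def knn_impute(rows, target_idx, col, k=2):
    """Fill rows[target_idx][col] with the mean of that column over the k
    nearest complete rows, measured on the remaining columns."""
    target = rows[target_idx]
    others = [c for c in range(len(target)) if c != col]
    candidates = [
        r for i, r in enumerate(rows)
        if i != target_idx and r[col] is not None
    ]
    candidates.sort(
        key=lambda r: math.dist([r[c] for c in others], [target[c] for c in others])
    )
    nearest = candidates[:k]
    return sum(r[col] for r in nearest) / len(nearest)


# Columns: age, years of experience, salary in thousands.
people = [
    [25, 2, 50],
    [27, 3, 54],
    [50, 25, 120],
    [26, 2, None],  # salary is missing
]

# The two closest rows are the 25- and 27-year-olds, so the estimate is their average.
print(knn_impute(people, target_idx=3, col=2))  # 52.0
```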

What imputation technique is best for categorical data?

Deep Learning

What is the best way to deal with missing data?

Get more data!

What is the best way to deal with unbalanced data?

Oversampling: duplicate samples from the minority class.

What is SMOTE?

It generates new samples of the minority class by using nearest neighbors.
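
The idea can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. This is a simplified illustration with a hypothetical helper and made-up points, not a full SMOTE implementation.

```python
import math
import random


def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point between a random minority sample and
    one of its k nearest minority neighbors."""
    base = rng.choice(minority)
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(p, base),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment from base to neighbor
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
rng = random.Random(0)  # seeded so repeated runs give the same points
synthetic = [smote_sample(minority, rng=rng) for _ in range(3)]
print(synthetic)
```

Because each new point is interpolated rather than copied, the minority class gains variety instead of exact duplicates.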

Is SMOTE better than duplicating the minority class?

Yes.

What method is good to deal with outliers?

Removing the outliers

What can be used to detect outliers?

Random Cut Forest

What is binning?

You convert numerical data into categorical data by grouping it into ranges of values. For example, you put everyone in their twenties into one category.

What is quantile binning?

It categorizes data based on its place in the distribution. It ensures even bin sizes.
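
The two kinds of binning can be contrasted in a short sketch. Both helper names and the list of ages are hypothetical.

```python
def decade_bin(age: int) -> str:
    """Fixed-width binning: map an age to its decade, e.g. 27 -> '20s'."""
    return f"{(age // 10) * 10}s"


def quantile_bins(values, n_bins):
    """Quantile binning: assign bin indexes by rank so that each bin
    holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


ages = [21, 22, 25, 29, 34, 41, 58, 73]

print([decade_bin(a) for a in ages])
# ['20s', '20s', '20s', '20s', '30s', '40s', '50s', '70s']
print(quantile_bins(ages, 4))
# [0, 0, 1, 1, 2, 2, 3, 3]
```

Fixed-width bins leave half the people in the '20s' bucket, while the quantile version puts exactly two people in each of the four bins.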

What is the one hot encoding transformation?

It creates a separate 0/1 column (bucket) for every category. Common in deep learning.
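
A minimal sketch of the transformation, with a hypothetical helper and made-up color values: each distinct category becomes its own column, and every row has a 1 in exactly one of them.

```python
def one_hot(values):
    """Return (categories, encoded): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


categories, encoded = one_hot(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```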

Can shuffling help with training data?

Yes.

What is SageMaker?

An AWS service geared toward machine learning.

Where does most data in SageMaker come from?

S3

What are popular data formats for SageMaker data?

RecordIO-Protobuf

Does SageMaker integrate with Apache Spark?

Yes

What Services does SageMaker integrate with?

Athena, EMR, Redshift, and Amazon Keyspaces

What is the three step process for SageMaker processing?

1. Copy data from S3.
2. Spin up a processing container.
3. Output the processed data to S3.

What are the two deployment models for SageMaker trained models?

Persistent endpoint: always available for individual predictions on demand.
Batch Transform: gets predictions for an entire dataset.

What is SageMaker Neo?

It allows you to deploy your ML model at the edge.

What is Elastic Inference for SageMaker?

It accelerates inference for deep learning models.

Can endpoints in SageMaker be scaled?

Yes

What is Shadow Testing?

It allows you to evaluate new models against currently deployed models to catch errors.

What is SageMaker Ground Truth?

Allows you to have humans label your data. Common for image classification.

How does SageMaker Ground Truth work?

Ground Truth trains its own model as humans label the images. Over time, only the images the model is unsure of are sent for human labeling.

What are two AWS services that can also help with image recognition and labels?

Rekognition (for images) and Comprehend (for text)

What is Ground Truth Plus?

It is a turnkey solution. Almost completely hands off.

What is SageMaker Data Wrangler?

An ETL pipeline built into SageMaker. Very similar to Glue Studio, but focused on machine learning.

Can you visualize data in SageMaker Data Wrangler?

Yes

What does the Quick Model feature within SageMaker Data Wrangler do?

It allows you to train your model with your data and measure results.

What is a common troubleshooting check for Data Wrangler problems?

Ensure the user has the correct IAM roles.
Ensure the permissions on the data source allow Data Wrangler access (needs AmazonSageMakerFullAccess).
Check the EC2 instance service quota limit.

Do you need a domain to use SageMaker Studio?

Yes

Does Data Wrangler evaluate and provide an analysis on your data?

Yes.

What can be used to get CloudWatch alerts on quality deviations in your deployed models?

SageMaker Model Monitor

Can SageMaker Model Monitor detect data drift?

Yes, and it visualizes it.

What data issues can SageMaker Model Monitor detect?

Anomalies, outliers, and new features

What does SageMaker Clarify do?

It detects potential bias and can alert on new potential bias.

What can be used in SageMaker to explain a model's behavior?

SageMaker Clarify

Does SageMaker Model Monitor integrate with third-party dashboards?

Yes

What are Partial Dependence Plots (PDPs)?

They show how feature values influence predictions. Used by SageMaker Clarify.

What are Shapley values?

They measure the contribution of each feature toward a model's predictions. Used by SageMaker Clarify.
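
For intuition, the sketch below computes exact Shapley values for a toy two-feature model by averaging each feature's marginal contribution over every order in which the features can be added. The model, feature names, and scores are hypothetical, and this brute-force enumeration is only practical for a handful of features; it is not a description of how SageMaker Clarify computes them.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which the features can be added."""
    totals = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        present = set()
        for f in order:
            before = value(frozenset(present))
            present.add(f)
            totals[f] += value(frozenset(present)) - before
    return {f: t / len(orders) for f, t in totals.items()}


def toy_prediction(known):
    """Hypothetical model output given the set of features that are known."""
    score = 0.0
    if "income" in known:
        score += 30.0
    if "age" in known:
        score += 10.0
    if {"income", "age"} <= known:
        score += 20.0  # interaction effect, shared between both features
    return score


print(shapley_values(["income", "age"], toy_prediction))
# {'income': 40.0, 'age': 20.0}
```

The two values add up to 60.0, the full prediction when both features are known, with the interaction bonus split evenly between them.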

Can spot instances be used with SageMaker?

Yes

What is SageMaker Feature Store?

It allows you to share features between models.

Does SageMaker Feature Store support Streaming or batched data ingestion?

Both

Can SageMaker Feature Store be secured with PrivateLink?

Yes

What is SageMaker Canvas?

It is a no-code machine learning service geared for business analysts.

Can you share models from SageMaker Canvas to Studio?

Yes

What is AWS Glue?

Serverless discovery and definition of table definitions and schemas. It also handles custom ETL jobs.

Can you handle data transformations in AWS Glue Studio?

Yes

What is AWS Glue Data Quality?

It is a step you can add to your job to evaluate the quality of your data. If the data does not meet your quality rules, the job can fail or raise an alert in CloudWatch.

What is AWS Glue DataBrew?

A visual data preparation tool.

What are Glue DataBrew Recipes?

They are transforms that can be saved and applied on other jobs.

Can you create datasets in AWS Glue DataBrew from Redshift and Snowflake?

Yes

Can you feature engineer in DataBrew?

Yes

How do you deal with the removal of PII in DataBrew?

Substitution with REPLACE_WITH_RANDOM.
Shuffling with SHUFFLE_ROWS, which swaps PII between users and renders it inaccurate.
Deterministic encryption with DETERMINISTIC_ENCRYPTION.
Nulling it out.
Masking it.
Hashing it.
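
To make the last three techniques concrete, here is a plain Python sketch of nulling, masking, and hashing a value. These helpers are hypothetical illustrations of the concepts only; they are not DataBrew operations or its API, and the sample value is made up.

```python
import hashlib


def null_value(_value: str) -> None:
    """Nulling: drop the value entirely."""
    return None


def mask_value(value: str, visible: int = 4, mask_char: str = "#") -> str:
    """Masking: hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def hash_value(value: str) -> str:
    """Hashing: replace the value with its SHA-256 digest.
    The same input always yields the same digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


ssn = "123-45-6789"  # made-up example value

print(null_value(ssn))       # None
print(mask_value(ssn))       # #######6789
print(len(hash_value(ssn)))  # 64
```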

What is AWS Athena?

It is a serverless SQL query interface for data in S3.

What are some good use cases for Athena?

Querying staging data before loading it into Redshift.
Analyzing CloudTrail or CloudFront logs.
Integrating with QuickSight.

What are Athena Workgroups?

They organize users/teams/apps and allow you to control query access and track costs by workgroup.

Can you limit how much data is scanned with Athena Workgroups?

Yes

Can you track query history in Athena Workgroups?

Yes

Are failed queries or CREATE/ALTER/DROP queries billable?

No

What can you do with your data in S3 to reduce Athena costs?

Use ORC or Parquet. This also increases performance.

Is Athena good for highly formatted reports?

No. That is what QuickSight is for.

Can Athena perform ETL?

Light ETL, but Glue is the better option.

What is Create Table As Select (CTAS) in Athena?

It creates a new table from query results. Good for creating subsets of data.

Is CTAS good for changing data format?

Yes.
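
As a sketch of what such a statement looks like, the helper below assembles a CTAS query that rewrites a table in Parquet. The `format` and `external_location` properties belong to Athena's CTAS `WITH` clause; the function itself, the table names, and the bucket path are hypothetical.

```python
def ctas_query(new_table: str, source_table: str, s3_location: str,
               where: str = "", fmt: str = "PARQUET") -> str:
    """Build an Athena CTAS statement that stores the query results
    as a new table in the given format."""
    where_clause = f"\nWHERE {where}" if where else ""
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (format = '{fmt}', external_location = '{s3_location}')\n"
        f"AS SELECT *\n"
        f"FROM {source_table}"
        f"{where_clause}"
    )


# Convert a CSV-backed table to Parquet, keeping only one year of data.
print(ctas_query("logs_2024_parquet", "logs_csv",
                 "s3://example-bucket/logs_parquet/", where="year = 2024"))
```

The optional WHERE clause covers the subset use case, and the format property covers the format change.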

What processes faster in Athena: a small number of large files or a large number of small files?

A small number of large files.

If you add partitions after the fact, what command must be run in Athena?

MSCK REPAIR TABLE

If you want to support ACID transactions in Athena, how do you enable it?

Set 'table_type' = 'ICEBERG' in the TBLPROPERTIES of the CREATE TABLE command.
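
A minimal sketch of the resulting DDL, built as a string. The `TBLPROPERTIES ('table_type' = 'ICEBERG')` clause is the part the card refers to; the helper, the table name, the columns, and the bucket path are hypothetical.

```python
def iceberg_table_ddl(table: str, columns: dict[str, str], s3_location: str) -> str:
    """Build an Athena CREATE TABLE statement for an Iceberg table."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"LOCATION '{s3_location}'\n"
        f"TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )


print(iceberg_table_ddl(
    "orders_iceberg",
    {"order_id": "bigint", "status": "string"},
    "s3://example-bucket/orders_iceberg/",
))
```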

Can you recover data that was recently deleted in Athena?

Yes, but only if you enable ACID transactions.

(97 cards)