Data Transformation, Integrity, and Feature Engineering Flashcards
What is AWS EMR?
Elastic MapReduce. It is a managed Hadoop framework.
Does AWS EMR have Notebooks?
Yes
Can AWS EMR support Spark?
Yes
What is the main node called in an EMR cluster?
The master node (called the primary node in current AWS documentation)
What are core nodes used for in AWS EMR?
They store HDFS data and run tasks.
What is a Task Node in an EMR cluster?
Runs tasks, but does not host data.
What is a good way to reduce costs for task nodes?
Use spot instances.
What is a transient cluster in AWS EMR?
It automatically terminates after all the steps have been completed.
How can you start jobs in EMR?
By connecting to the master node, or by using the console and adding ordered steps.
What is the alternative storage to HDFS in AWS EMR?
S3
What is EMRFS?
It makes S3 look like HDFS: the cluster reads and writes S3 as if it were a Hadoop file system.
What is the default size of a block in HDFS?
128MB
Is HDFS ephemeral?
Yes. The data is lost when the cluster terminates, but local storage is good for performance.
How can you track consistency in EMRFS?
DynamoDB
Can Spark replace MapReduce in AWS EMR?
Yes
What is Spark SQL in AWS EMR?
A low-latency query engine. Up to 100x faster than MapReduce. Allows for DataFrames.
What is GraphX for AWS EMR?
A graph processing framework built on top of Spark.
What is MLlib for AWS EMR?
It allows you to integrate machine learning on top of Spark.
Can Spark be integrated with AWS Kinesis?
Yes
What is Zeppelin?
A notebook compatible with AWS EMR
What is the problem with too many features in your data?
It leads to sparse data.
What is a dimension?
Every feature is a new dimension
What is the TF-IDF algorithm?
It figures out what terms are most relevant for a document.
What are the two components of the TF-IDF equation?
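Term Frequency and Document Frequency. The score is Term Frequency divided by Document Frequency, which is the same as Term Frequency multiplied by Inverse Document Frequency (in practice a log is usually applied to the IDF).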
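A minimal sketch of the idea, assuming the common log-scaled variant tf * log(N / df) and simple whitespace tokenization. The function name `tf_idf` and the sample sentences are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Term frequency times log-scaled inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
print(scores[0])
```

A term that appears in every document ("the") scores zero, while a term unique to one document ("sat") scores highest for that document.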
What is mean replacement?
When a column is missing data, you fill in the gaps with the average value of that column.
When is it better to use median over mean replacement?
When there are outliers present.
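A small sketch of mean versus median replacement, using only the standard library. The `impute` helper and the income figures are hypothetical; `None` stands in for a missing value.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the present values."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else median(present)
    return [fill if v is None else v for v in values]

# One extreme value (1000) drags the mean far away from the typical income.
incomes = [40, 42, 45, None, 1000]
print(impute(incomes, "mean"))
print(impute(incomes, "median"))
```

With the outlier present, the mean fill lands far above most of the observed values, while the median fill stays close to them.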
Is KNN better than mean replacement?
Yes
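A rough sketch of KNN imputation in plain Python, assuming only one column has gaps and all other columns are complete and numeric. The `knn_impute` name and the sample rows are illustrative only.

```python
import math

def knn_impute(rows, target_col, k=2):
    """Fill missing values in target_col with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[target_col] is not None]

    def distance(a, b):
        # Euclidean distance over every column except the one being imputed.
        return math.sqrt(sum(
            (x - y) ** 2
            for j, (x, y) in enumerate(zip(a, b))
            if j != target_col
        ))

    result = []
    for row in rows:
        filled = list(row)
        if row[target_col] is None:
            nearest = sorted(complete, key=lambda r: distance(row, r))[:k]
            filled[target_col] = sum(r[target_col] for r in nearest) / len(nearest)
        result.append(filled)
    return result

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]
print(knn_impute(rows, target_col=1))
```

The missing value is filled from the two most similar rows rather than from the whole column, so the distant row `[10.0, 100.0]` does not influence it.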
Does KNN work for categorical data?
Not well. It is best suited to numerical data.
What imputation technique is best for categorical data?
Deep Learning
What is the best way to deal with missing data?
Get more data!
What is a simple way to deal with unbalanced data?
Oversampling: duplicate samples from the minority class.
What is SMOTE?
It generates new samples of the minority class by using nearest neighbors.
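A simplified sketch of the SMOTE idea, not a full implementation: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. The function name, the default `k`, and the sample points are assumptions made for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        # k nearest neighbors of the chosen point within the minority class.
        neighbors = sorted(
            (p for p in minority if p is not point),
            key=lambda p: math.dist(point, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(point, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
print(smote_sketch(minority, n_new=5))
```

Unlike plain duplication, every synthetic point is new, yet it stays inside the region already occupied by the minority class.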
Is SMOTE better than duplicating the minority class?
Yes.
What method is good to deal with outliers?
Removing the outliers
What can be used to detect outliers?
Random Cut Forest
What is binning?
You take your numerical data and convert it to categorical data (ranges of values). For example, put everyone in their twenties into one category.
What is quantile binning?
It categorizes data based on its place in the distribution. Each bin holds roughly the same number of samples.
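A short sketch contrasting fixed-width binning with quantile binning. Both helper names are hypothetical, and the quantile version is a rank-based simplification (tied values can end up in different bins).

```python
def fixed_width_bin(age, width=10):
    """Map an age to a range label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def quantile_bins(values, n_bins):
    """Assign each value a bin index so bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [22, 25, 29, 31, 47, 52, 68, 90]
print([fixed_width_bin(a) for a in ages])
print(quantile_bins(ages, n_bins=4))
```

Fixed-width bins can be very uneven (three people land in `20-29`, one in `90-99`), while the quantile bins each receive two samples.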
What is the one hot encoding transformation?
Creates a bucket (a binary column) for every category: 1 marks the category a sample belongs to and 0 all others. Common in deep learning.
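A minimal one-hot encoding sketch in plain Python. The `one_hot` helper and the color values are illustrative; libraries provide equivalent transforms.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot"
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["red", "green", "blue", "green"])
print(categories)
print(encoded)
```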
Can shuffling help with training data?
Yes.
What is SageMaker?
An AWS service geared for machine learning.
Where does most data in SageMaker come from?
S3
What are popular data formats for SageMaker data?
RecordIO / Protobuf (the combined RecordIO-protobuf format)
Does SageMaker integrate with Apache Spark?
Yes
What Services does SageMaker integrate with?
Athena, EMR, Redshift, and Amazon Keyspaces
What is the three-step process for SageMaker Processing?
Copy data from S3
Spin up a processing container
Output the processed data to S3
What are the two deployment models for SageMaker trained models?
Persistent - Always available for individual predictions on demand
Batch Transform - To get predictions on a dataset
What is SageMaker Neo?
It allows you to deploy your ML model at the edge.
What is Elastic Inference for SageMaker?
It accelerates deep learning models
Can endpoints in SageMaker be scaled?
Yes
What is Shadow Testing?
It allows you to evaluate new models against currently deployed models to catch errors.
What is SageMaker Ground Truth?
Allows you to have humans label your data. Common for image classification.
How does SageMaker Ground Truth work?
Ground Truth builds a model as humans label images. Over time, only the images the model is unsure of get sent for human labeling.
What are two AWS services that can also help generate labels?
Rekognition (for images)
Comprehend (for text)
What is SageMaker Ground Truth Plus?
It is a turnkey solution. Almost completely hands off.
What is SageMaker Data Wrangler?
An ETL pipeline built into SageMaker. Very similar to Glue Studio, but focused on machine learning.
Can you visualize data in SageMaker Data Wrangler?
Yes
What does the Quick Model feature within SageMaker Data Wrangler do?
It allows you to train your model with your data and measure results.
What is a common troubleshooting check for Data Wrangler problems?
Ensure the user has the correct IAM roles
Ensure the data source allows Data Wrangler access. The role needs the AmazonSageMakerFullAccess policy.
Check the EC2 instance limit; a service quota increase may be needed.
Do you need a domain to use SageMaker Studio?
Yes
Does Data Wrangler evaluate and provide an analysis on your data?
Yes.
What can be used to get CloudWatch alerts on quality deviations in your deployed models?
SageMaker Model Monitor
Can SageMaker Model Monitor detect data drift?
Yes, and it visualizes it.
What data issues can SageMaker Model Monitor detect?
Anomalies, outliers, and new features
What does SageMaker Clarify do?
It detects potential bias and can alert on new potential bias.
What can be used in SageMaker to explain a model’s behavior?
SageMaker Clarify
Does SageMaker Model Monitor integrate with 3rd party dashboards?
Yes
What are Partial Dependence Plots (PDPs)?
They show how feature values influence predictions. Used by SageMaker Clarify.
What are Shapley values?
They determine the contribution of each feature toward a model's predictions. Used by SageMaker Clarify.
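A brute-force sketch of exact Shapley values for a toy model, averaging each feature's marginal contribution over every feature ordering. This is for intuition only and is not a claim about how SageMaker Clarify computes them; the `shapley_values` helper, the toy `predict` function, and the all-zero baseline are assumptions.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Average marginal contribution of each feature over all orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature at its baseline
        previous = predict(current)
        for i in order:
            current[i] = instance[i]  # switch feature i to its real value
            updated = predict(current)
            totals[i] += updated - previous
            previous = updated
    return [total / len(orderings) for total in totals]

def predict(x):
    # Toy model with an interaction between features 0 and 2.
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]

values = shapley_values(predict, instance=[1, 1, 1], baseline=[0, 0, 0])
print(values)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved. The cost grows factorially with the number of features, so this approach only suits tiny examples.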
Can spot instances be used with SageMaker?
Yes
What is SageMaker Feature Store?
It allows you to share features between models.
Does SageMaker Feature Store support Streaming or batched data ingestion?
Both
Can SageMaker Feature Store be secured with PrivateLink?
Yes
What is SageMaker Canvas?
It is a no-code machine learning service geared for business analysts.
Can you share models from SageMaker Canvas to Studio?
Yes
What is AWS Glue?
Serverless discovery and definition of table schemas. It also handles custom ETL jobs.
Can you handle data transformations in AWS Glue Studio?
Yes
What is AWS Glue Data Quality?
It is a step you can add to your job to evaluate the quality of your data. If the data does not meet your quality rules, the job can fail or generate an alert in CloudWatch.
What is AWS Glue DataBrew?
A visual data preparation tool.
What are Glue DataBrew Recipes?
They are transforms that can be saved and applied on other jobs.
Can you create datasets in AWS Glue DataBrew from Redshift and Snowflake?
Yes
Can you feature engineer in DataBrew?
Yes
How do you deal with the removal of PII in DataBrew?
Substitution with REPLACE_WITH_RANDOM
Shuffle with SHUFFLE_ROWS to mix the PII with other users' PII, rendering it inaccurate.
Deterministic encryption with DETERMINISTIC_ENCRYPTION
NULL it
Mask it
Hash it
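A plain-Python sketch of two of the techniques listed above, masking and hashing. It illustrates the concepts only and does not reproduce how DataBrew implements its transforms; the helper names, the `#` mask character, and the sample value are made up.

```python
import hashlib

def mask(value, visible=4, mask_char="#"):
    """Hide all but the last `visible` characters of a string."""
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]

def hash_value(value, salt=""):
    """Replace a value with its SHA-256 hex digest (same input, same output)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

ssn = "123-45-6789"  # fabricated sample value
print(mask(ssn))
print(hash_value(ssn, salt="example-salt"))
```

Masking keeps part of the value readable for humans, while hashing hides the value entirely but still lets equal values be matched across rows.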
What is AWS Athena?
It is a serverless SQL query interface for data stored in S3.
What are some good use cases for Athena?
Querying staging data before loading it into Redshift
Analyzing CloudTrail or CloudFront logs
Integration with QuickSight
What are Athena Workgroups?
They organize users/teams/apps and allow you to control query access and track costs by workgroup.
Can you limit how much data is scanned with Athena Workgroups?
Yes
Can you track query history in Athena Workgroups?
Yes
Are failed queries or CREATE/ALTER/DROP queries billable?
No
What can you do with your data in S3 to reduce Athena costs?
Use ORC or Parquet. This also increases performance.
Is Athena good for highly formatted reports?
No
Can Athena perform ETL?
Light ETL, but Glue is the better option.
What is Create Table As Select (CTAS) in Athena?
It creates a new table from query results. Good for creating subsets of data.
Is CTAS good for changing data format?
Yes.
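A sketch of what a CTAS statement that converts data to Parquet might look like, built as a Python string. The table names, the S3 path, and the `build_ctas` helper are hypothetical, and the string formatting is for illustration only (it is not safe for untrusted input).

```python
def build_ctas(new_table, source_table, s3_location, fmt="PARQUET", where=None):
    """Build an Athena CTAS statement that writes query results in a new format."""
    query = (
        f"CREATE TABLE {new_table}\n"
        f"WITH (format = '{fmt}', external_location = '{s3_location}')\n"
        f"AS SELECT * FROM {source_table}"
    )
    if where:
        # An optional filter turns the new table into a subset of the source.
        query += f"\nWHERE {where}"
    return query

print(build_ctas(
    "logs_parquet",                       # hypothetical target table
    "logs_csv",                           # hypothetical source table
    "s3://example-bucket/logs-parquet/",  # hypothetical output location
    where="year = 2024",
))
```

The `WITH (format = ...)` clause is what changes the storage format, and the `WHERE` clause is what creates a subset.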
What processes faster in Athena, a small number of large files or a large number of small files?
A small number of large files.
If you add partitions after the fact, what command must be run in Athena?
MSCK REPAIR TABLE
If you want to support ACID transactions in Athena, how do you enable it?
Set 'table_type' = 'ICEBERG' in the TBLPROPERTIES of the CREATE TABLE statement.
Can you recover data that was recently deleted in Athena?
Yes, but only for ACID (Iceberg) tables, which support time travel queries against earlier versions of the data.
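A sketch of the two statements involved, built as Python strings: creating an Iceberg table and querying it as of an earlier point in time. The table name, columns, S3 path, and helper names are hypothetical; treat the SQL as an outline to check against the Athena documentation rather than a verified script.

```python
def iceberg_create_table(table, columns, s3_location):
    """Build a CREATE TABLE statement that marks the table as Iceberg."""
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} ({column_sql})\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )

def time_travel_query(table, hours_ago):
    """Build a query that reads the table as it was `hours_ago` hours back."""
    return (
        f"SELECT * FROM {table}\n"
        f"FOR TIMESTAMP AS OF (current_timestamp - interval '{hours_ago}' hour)"
    )

print(iceberg_create_table(
    "orders",                         # hypothetical table name
    [("id", "bigint"), ("status", "string")],
    "s3://example-bucket/orders/",    # hypothetical location
))
print(time_travel_query("orders", hours_ago=2))
```

Reading an earlier version of the table is how rows deleted since that point can be seen again.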