Data Transformation, Integrity, and Feature Engineering Flashcards
What is AWS EMR?
Elastic MapReduce, a managed Hadoop framework.
Does AWS EMR have Notebooks?
Yes
Can AWS EMR support Spark?
Yes
What is the main node called in an EMR cluster?
The master node (newer AWS documentation calls it the primary node).
What are core nodes used for in AWS EMR?
They store HDFS data and run tasks.
What is a Task Node in an EMR cluster?
It runs tasks but does not store HDFS data.
What is a good way to reduce costs for task nodes?
Use spot instances.
What is a transient cluster in AWS EMR?
It automatically terminates after all the steps have been completed.
How can you start jobs in EMR?
By connecting to the master node and running them directly, or by adding ordered steps through the console.
What is the alternative storage to HDFS in AWS EMR?
S3
What is EMRFS?
The EMR File System. It lets the cluster read and write S3 as if it were HDFS.
What is the default size of a block in HDFS?
128 MB
Is HDFS ephemeral?
Yes. The data is lost when the cluster terminates, but local storage is good for performance.
How can you track consistency in EMRFS?
DynamoDB
Can Spark replace MapReduce in AWS EMR?
Yes
What is Spark SQL in AWS EMR?
A low-latency query engine, up to 100x faster than MapReduce. It lets you work with DataFrames.
What is GraphX for AWS EMR?
A graph processing framework built on top of Spark.
What is MLlib for AWS EMR?
Spark's machine learning library. It lets you run machine learning on top of Spark.
Can Spark be integrated with AWS Kinesis?
Yes
What is Zeppelin?
A notebook compatible with AWS EMR
What is the problem with too many features in your data?
It leads to sparse data.
What is a dimension?
Every feature is a new dimension
What is the TF-IDF algorithm?
It figures out what terms are most relevant for a document.
What are the two components of the TF-IDF equation?
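Term Frequency and Document Frequency. The score is Term Frequency divided by Document Frequency, usually computed as Term Frequency multiplied by the log of the Inverse Document Frequency.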
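A minimal sketch of the TF-IDF score in plain Python. The log-scaled inverse document frequency is one common variant, not the only one, and the function name `tf_idf` and the toy corpus are assumptions made for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Score `term` for one tokenized document within a corpus of tokenized documents."""
    # Term frequency: how often the term appears in this document.
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    # Inverse document frequency, log-scaled (an assumed, common variant).
    idf = math.log(len(corpus) / df)
    return tf * idf

# Hypothetical toy corpus, already split into tokens.
corpus = [
    "spark runs on emr".split(),
    "emr stores data in s3".split(),
    "spark sql queries data".split(),
]

rare = tf_idf("runs", corpus[0], corpus)    # appears in one document
common = tf_idf("emr", corpus[0], corpus)   # appears in two documents
```

Here "runs" scores higher than "emr" for the first document because it appears in fewer documents overall.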
What is mean replacement?
When a column is missing data, you fill in the missing values with the average value of that column.
When is it better to use median over mean replacement?
When there are outliers present.
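A small sketch of mean and median replacement using only the standard library. The helper name `impute` and the sample values are assumptions; missing entries are represented as `None`.

```python
from statistics import mean, median

def impute(values, strategy=mean):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with one missing value and one outlier (1000).
incomes = [40, 42, 45, None, 1000]

mean_filled = impute(incomes, mean)      # fill value is pulled up by the outlier
median_filled = impute(incomes, median)  # fill value stays near the typical values
```

With the outlier present, the mean fill is 281.75 while the median fill is 43.5, which is why the median is usually the safer choice here.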
Is KNN better than mean replacement?
Yes
Does KNN work for categorical data?
No. Only numerical data.
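A rough sketch of KNN imputation for numerical data. It assumes only one column (`target`) has missing values, uses Euclidean distance on the remaining columns, and averages the `k` nearest complete rows. The function name and data are hypothetical.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean over the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]

    def features(row):
        # Every column except the one being imputed.
        return [v for j, v in enumerate(row) if j != target]

    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue
        # Rank complete rows by distance to the incomplete row.
        nearest = sorted(complete, key=lambda c: math.dist(features(r), features(c)))[:k]
        value = sum(n[target] for n in nearest) / len(nearest)
        filled.append([value if j == target else v for j, v in enumerate(r)])
    return filled

# Hypothetical (age, income) rows; the last income is missing.
rows = [(25, 50.0), (27, 54.0), (60, 120.0), (26, None)]
result = knn_impute(rows, target=1, k=2)
```

The missing income is filled from the two rows closest in age (25 and 27), so the distant row with age 60 has no influence.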
What imputation technique is best for categorical data?
Deep Learning
What is the best way to deal with missing data?
Get more data!
What is the best way to deal with unbalanced data?
Oversampling: duplicate samples from the minority class.
What is SMOTE?
It generates new samples of the minority class by using nearest neighbors.
Is SMOTE better than duplicating the minority class?
Yes.
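A simplified sketch of the idea behind SMOTE, not a reference implementation: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. The name `smote_sketch`, the parameters, and the data are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the base itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        # Random position on the segment between base and the chosen neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=5)
```

Unlike plain duplication, every synthetic point is a new value that lies between existing minority samples.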
What method is good to deal with outliers?
Removing the outliers
What can be used to detect outliers?
Random Cut Forest
What is binning?
You take your numerical data and convert it to categorical data by grouping it into ranges of values. For example, put everyone in their twenties into one bucket.
What is quantile binning?
It categorizes data based on its place in the distribution, so that each bin holds roughly the same number of samples.
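A minimal rank-based sketch of quantile binning. The function name `quantile_bins` and the ages are assumptions, and this simple version may split tied values across adjacent bins.

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # The rank within the sorted data decides the bin, not the raw value.
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical ages; 90 is far from the rest but still lands in an equal-sized bin.
ages = [21, 22, 23, 25, 31, 38, 47, 90]
age_bins = quantile_bins(ages, 4)
```

Each of the four bins receives two ages, whereas equal-width ranges would crowd most values into the lowest bin.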
What is the one hot encoding transformation?
Creates a bucket (a binary column) for every category. Each sample gets a 1 in the bucket for its category and a 0 in all the others. Common in deep learning.
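A small sketch of one hot encoding in plain Python. The helper name `one_hot` and the color values are assumptions; categories are sorted only to make the column order predictable.

```python
def one_hot(values):
    """Return the category order and one binary row per input value."""
    categories = sorted(set(values))
    # One column per category: 1 where the value matches, 0 elsewhere.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical categorical feature.
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
```

Every encoded row contains exactly one 1, in the column for that sample's category.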
Can shuffling help with training data?
Yes.