Modelling - Past Questions Flashcards
What is semantic segmentation?
A deep learning algorithm that labels or categorises every pixel in an image.
When you are trying to find items that are similar what algorithm would you use?
K-nearest neighbour
What does the linear learner algorithm show?
How a change in an independent variable affects a dependent variable.
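As a rough illustration of that relationship, the sketch below fits a one-variable least-squares line in plain Python. It is not the SageMaker Linear Learner API; the function name `fit_line` and the sample data are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

The slope is the part that answers the card: it says how much the dependent variable moves per unit change in the independent variable.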
What type of problem is Random Cut Forest predominantly used for?
Anomaly detection (it is an unsupervised algorithm)
What SageMaker algorithm supports recommendations?
Factorisation Machines
What SageMaker algorithm supports regression?
Linear Learner
What 4 types of problem can XGBoost be used to solve?
Regression, Binary Classification, Multi-class classification and Ranking
What format should the training data be in for XGBoost?
CSV or LIBSVM
What is Random Cut Forest used for?
To identify anomalies in data (e.g. to find fraud)
How does Random Cut Forest find an anomaly?
It provides a score for each data point. A low score = similar to most of the data, high score = anomaly
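The scoring idea can be sketched without SageMaker. The scores below are hard-coded stand-ins for what the algorithm would return, and the cutoff of three standard deviations above the mean score is a common rule of thumb assumed here, not something the cards state. `flag_anomalies` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def flag_anomalies(scores, num_std=3.0):
    """Return the indices whose score exceeds mean + num_std * std."""
    cutoff = mean(scores) + num_std * pstdev(scores)
    return [i for i, score in enumerate(scores) if score > cutoff]

# Twenty ordinary points with a low score and one point with a high score.
scores = [1.0] * 20 + [10.0]
anomalies = flag_anomalies(scores)
```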
What format should training data for Random Cut Forest be in?
CSV or x-recordio-protobuf format
For online testing what type of data should you use?
live data
For offline testing what sort of data should you use?
historical data
When you perform offline testing of your models which endpoints should you deploy your trained models to?
alpha endpoints
When using online testing which endpoint should you deploy your trained models to?
SageMaker endpoint
When trying to select the correct trained model for real-time ML what steps would you take?
Deploy your models to a SageMaker endpoint, send a portion of live data to each model, and finally evaluate each model.
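The "portion of live data to each model" step can be simulated locally. This is only a sketch of weighted routing, not SageMaker's own traffic splitting; the variant names `model-a` and `model-b` and the function `route_requests` are hypothetical.

```python
import random

def route_requests(num_requests, weights, seed=0):
    """Assign each request to a model variant in proportion to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    counts = {name: 0 for name in names}
    for _ in range(num_requests):
        chosen = rng.choices(names, weights=[weights[n] for n in names])[0]
        counts[chosen] += 1
    return counts

# Send roughly half of the simulated traffic to each candidate model.
counts = route_requests(1000, {"model-a": 0.5, "model-b": 0.5})
```

Each variant's predictions on its share of the traffic would then be compared to pick the better model.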
What is object detection used for?
to identify all instances of an object within an image
How does object detection give the location of a particular object?
It uses a bounding box
What type of ML algorithm is Object detection?
Supervised
What format is recommended for Object Detection training data?
Apache MXNet RecordIO
What is incremental training?
You seed a new training job with the artefacts of a previously trained model.
When would object detection not be a good idea?
For problems at scale
What is Latent Dirichlet Allocation used for?
Discovering the topics in a collection of documents
What algorithm would you use to classify millions of high-resolution images?
SageMaker built-in Image Classification
How does SageMaker’s built-in Image Classification work?
It uses a convolutional neural network (CNN) to classify images, and it supports multi-label classification.
What is a Factorisation Machine primarily used for?
Detecting interactions between features, e.g. reactions to ads on a web page or item recommendations
What are factorization machines used for?
Classification and regression
If you want to find all elements of an item in an image and surround it with a bounding box what algorithm would you use?
Object Detection Algorithm
What is a Neural Topic Model algorithm used for?
to group documents into topics using the statistical distribution of words in the documents
What do you use XGBoost for?
Predicting a target variable very quickly and efficiently
What does XGBoost do with redundant features?
It includes them, which can drag down performance
Why is removing redundant features outright a bad idea?
There is a risk of information loss
How would you solve the issue of redundant features most efficiently and quickly?
Principal Component Analysis
How does Principal component analysis work?
It finds composites of features that are uncorrelated
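A two-feature case makes the "uncorrelated composites" idea concrete. The sketch below is a plain-Python special case for exactly two features, where the principal axes can be found from a single rotation angle; it is not the SageMaker PCA algorithm, and the data and the name `pca_2d` are illustrative.

```python
import math

def pca_2d(points):
    """Rotate centred 2-D data onto its principal axes."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]
    sxx = sum(x * x for x, _ in centred) / n
    syy = sum(y * y for _, y in centred) / n
    sxy = sum(x * y for x, y in centred) / n
    # Angle of the direction with the largest variance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * y, -s * x + c * y) for x, y in centred]

# Two strongly correlated features.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
components = pca_2d(points)
# Covariance between the two new composite features (both have zero mean).
covariance = sum(u * v for u, v in components) / len(components)
```

The covariance between the two composites comes out at zero up to rounding, and almost all of the variance sits in the first one, which is why the second can be dropped with little information loss.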
What is online learning?
the process of training your model incrementally by giving it data observations as individual observations or in mini-batches
What technique can you use within SageMaker to expedite the deployment and operation of your model?
Transfer learning
What is transfer learning?
You start with an off-the-shelf trained model and apply it to your different but similar observations.
What is incremental learning?
You begin with an existing model you have already trained and extend it with new data.
When do you use Out-of-core learning?
When training with huge datasets that you can't load into your server's memory.
How does Out-of-core learning work?
The algorithm loads some of the data, trains on that subset, loads another subset of observations, trains on that subset and repeats
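The load, train, repeat loop can be sketched with mini-batch gradient descent on a tiny linear model. The in-memory list stands in for a file too large to load at once, and `batches` and `train_out_of_core` are hypothetical names; a real job would stream chunks from disk or S3 instead.

```python
def batches(rows, batch_size):
    """Yield successive chunks; stands in for reading part of a large file."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def train_out_of_core(rows, batch_size=2, learning_rate=0.05, epochs=1000):
    """Fit y = weight * x + bias, seeing only one chunk at a time."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for batch in batches(rows, batch_size):
            errors = [(weight * x + bias - y, x) for x, y in batch]
            grad_w = sum(err * x for err, x in errors) / len(batch)
            grad_b = sum(err for err, _ in errors) / len(batch)
            weight -= learning_rate * grad_w
            bias -= learning_rate * grad_b
    return weight, bias

# Illustrative data generated from y = 2x + 1.
rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train_out_of_core(rows)
```

Only one batch is ever needed in memory at a time, which is the point of the technique.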
What does the early_stopping hyperparameter do?
Decides whether the algorithm is allowed to stop training early when further training is not expected to help.
What does the learning_rate hyperparameter do?
Decides how quickly the model adapts to new or changing data. Values range from 0.0 to 1.0.
What does a learning_rate close to 1.0 do?
The model will learn quickly and take into account new observations quickly
What does a learning_rate close to 0.0 do?
The model will learn slowly and take into account new observations slowly
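A toy update rule shows the effect of the two extremes. This is a generic illustration in which the estimate moves a fraction `learning_rate` of the way towards each new observation; it is not the internal update of any particular SageMaker algorithm, and `track` is a hypothetical name.

```python
def track(observations, learning_rate, start=0.0):
    """Move the estimate towards each observation by a fixed fraction."""
    estimate = start
    for observation in observations:
        estimate += learning_rate * (observation - estimate)
    return estimate

observations = [10.0, 10.0, 10.0]
fast = track(observations, learning_rate=0.9)  # close to 1.0
slow = track(observations, learning_rate=0.1)  # close to 0.0
```

After three observations the fast learner is almost at 10 while the slow learner has covered less than a third of the distance.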
What does the use_pretrained_model hyperparameter do?
Defines if you want a pre-trained model to be loaded in before training.
What are the three steps needed for deploying a model using Amazon SageMaker Hosting services?
- Create a model in Amazon SageMaker, including the S3 path where the model artefacts are stored and the Docker registry path for the inference image
- Create an endpoint config for an HTTPS endpoint
- Create an HTTPS endpoint
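The three steps above map onto three calls on the boto3 SageMaker client. The sketch below takes the client as an argument so it can be read without AWS credentials. The naming scheme, the variant name `AllTraffic`, the single `ml.m5.large` instance, and the function `deploy_model` itself are assumptions made for the example.

```python
def deploy_model(sm_client, name, image_uri, model_data_url, role_arn,
                 instance_type="ml.m5.large"):
    """Run the three hosting steps against a SageMaker client."""
    # Step 1: register the model (S3 artefact path + inference image).
    sm_client.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # Step 2: describe how the endpoint should host the model.
    config_name = name + "-config"
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    )
    # Step 3: create the HTTPS endpoint from that configuration.
    endpoint_name = name + "-endpoint"
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    return endpoint_name
```

With credentials configured, `sm_client` would typically be `boto3.client("sagemaker")`, and the image URI, S3 path and role ARN would come from your own account.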
What does IoT Core do?
Allows you to send IoT messages to AWS services without managing infrastructure
What does IoT Greengrass do?
Helps you quickly build edge device software and remotely deploy and manage it.
What is IoT Analytics specifically built for?
Analysing and enriching highly unstructured IoT data
What are Inference Pipelines used for?
to define and deploy pre-trained SageMaker algorithms
Can Inference pipelines be used with IoT devices?
No, they do not have that integration.
If you wanted to enrich data using Kinesis Data Streams would you need any additional steps?
Yes, you would need Lambda functions to perform the enrichment steps.
Which Amazon ML services/features would you use to manage multiple experiments at scale?
Amazon SageMaker model tracking capability
What is Amazon SageMaker Inference pipeline used for?
to deploy pre-trained SageMaker algorithms packaged in docker containers.
What can you search for in the Amazon SageMaker model tracking capability?
Key model attributes, e.g. hyperparameter values, algorithms used, and tags associated with the models.
What does Amazon SageMaker model experiments capability do?
It does not exist
What does Amazon SageMaker model containers capability do?
It does not exist
What format must the labelling file be in when using AWS Glue FindMatches ML Transform?
CSV
How should the labelling file be structured when using AWS Glue FindMatches ML Transform?
The first two columns are the labeling_set_id and the label. Then the rest should match the schema of the data to be processed.
What happens if AWS Glue FindMatches ML Transform can’t find a match for a record?
it is assigned a unique label
How should the labelling file be encoded when using AWS Glue FindMatches ML Transform?
UTF-8 without BOM
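The format, column order and encoding from the FindMatches cards can be combined in one short sketch. The schema columns `first_name` and `last_name`, the sample rows and the helper `write_labelling_file` are hypothetical, and including a header row is an assumption. Python's `utf-8` codec writes no BOM (the `utf-8-sig` codec is the one that would add it).

```python
import csv
import os
import tempfile

def write_labelling_file(path, schema_columns, rows):
    """Write a CSV whose first two columns are labeling_set_id and label."""
    header = ["labeling_set_id", "label"] + list(schema_columns)
    # encoding="utf-8" emits UTF-8 without a byte-order mark.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "labels.csv")
write_labelling_file(
    path,
    ["first_name", "last_name"],
    [
        ["set-1", "A", "Jane", "Doe"],
        ["set-1", "A", "Jane", "Do"],     # same label: marked as a match
        ["set-1", "B", "John", "Smith"],  # different label: not a match
    ],
)
```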
Does SageMaker support GPU instances for the Random Cut Forest Algorithm?
No it does not. It only supports CPU
What is K-means?
An unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another.
What is the difference between KNN and K-means?
K-Means is unsupervised and KNN is supervised.
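The difference shows up in what each algorithm is given. In the one-dimensional sketch below, `knn_predict` needs labelled examples, while `kmeans_1d` only sees raw values and starting centroids. Both functions and the data are illustrative and are not the SageMaker implementations.

```python
from collections import Counter

def knn_predict(labelled, query, k=3):
    """Supervised: vote among the k labelled points closest to the query."""
    nearest = sorted(labelled, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans_1d(values, centroids, iterations=10):
    """Unsupervised: no labels, only group values around moving centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for value in values:
            idx = min(range(len(centroids)),
                      key=lambda i: abs(value - centroids[i]))
            groups[idx].append(value)
        # Move each centroid to the mean of its group (keep it if empty).
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

labelled = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
            (8.0, "high"), (9.0, "high"), (9.5, "high")]
prediction = knn_predict(labelled, 8.4)
centres = kmeans_1d([value for value, _ in labelled], centroids=[0.0, 10.0])
```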
When do you use logistic regression?
When doing supervised classification and the decision boundary is linear.
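A linear decision boundary means the predicted class flips exactly where a weighted sum of the features crosses zero. The sketch below uses hand-picked weights, an assumption for illustration rather than a trained model, so that the boundary is the line x1 = x2.

```python
import math

def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of a linear combination of features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

weights, bias = [1.0, -1.0], 0.0  # boundary is the line x1 = x2
above = predict_proba([3.0, 1.0], weights, bias)
below = predict_proba([1.0, 3.0], weights, bias)
on_line = predict_proba([2.0, 2.0], weights, bias)
```

Points on one side of the line get a probability above 0.5, points on the other side get one below 0.5, and points on the line get exactly 0.5.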
You are building a binary classifier with highly unbalanced data. What three things can you do to improve model performance?
- Collect more data of the class with less data
- Oversample the class with less data
- Create more samples using algorithms such as SMOTE
How does SMOTE work?
It uses a k-nearest-neighbours approach to exclude members of the majority class while creating synthetic examples similar to the minority class.
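The core step can be sketched as interpolation: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line between them. This is a simplified sketch of the idea, not a full SMOTE implementation; `smote_sample` and the data are hypothetical.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority point and a neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Nearest neighbours are searched among minority-class points only.
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = [smote_sample(minority, rng=random.Random(i)) for i in range(5)]
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.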
What is the easiest