Modeling 2 Flashcards

Question

Semantic Segmentation: is training from scratch or incremental training supported?

Answer 1

both are supported

Answer 2

epochs, learning rate, batch size, optimizer, algorithm, backbone

Answer 3

GPU only (P2, P3), single machine only: ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.8xlarge, ml.p3.16xlarge

Answer 4

CPU C5, M5 | GPU P2, P3

Answer 5

Anomaly detection, unsupervised. Detects:
- unexpected spikes in time series data
- breaks in periodicity
- unclassifiable data points
Based on an algorithm developed by Amazon.

Answer 6

assigns an anomaly score to each data point

Answer 7

RecordIO-protobuf or CSV
File or Pipe mode on either
Optional test channel for computing accuracy, precision, recall and F1 on labeled data

Answer 8

Creates a forest of trees where each tree is a partition of the training data. Looks at the expected change in complexity of the tree as a result of adding a point to it.

Answer 9

The training data is randomly sampled, then the trees are trained on the samples.

Answer 10

Yes, it is. It can work on streaming data too.

Answer 11

num_trees
- increasing it reduces noise
num_samples_per_tree
- chosen so that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
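
The 1/num_samples_per_tree rule of thumb can be turned into a quick calculation. This is only a sketch: the helper name `samples_per_tree`, the 0.5% anomaly ratio and the `num_trees` value are made up for illustration.

```python
def samples_per_tree(anomaly_ratio: float) -> int:
    """Pick num_samples_per_tree so that 1/num_samples_per_tree ~ anomaly_ratio."""
    if not 0.0 < anomaly_ratio <= 1.0:
        raise ValueError("anomaly_ratio must be in (0, 1]")
    return max(1, round(1.0 / anomaly_ratio))

# Hypothetical dataset where roughly 0.5% of the points are anomalous.
hyperparameters = {
    "num_trees": 100,  # illustrative value; more trees give less noisy scores
    "num_samples_per_tree": samples_per_tree(0.005),
}
```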

Answer 12

Does not use GPU. Use M4, C4 or C5 for training; ml.c5.xlarge for inference.

Answer 13

Organize documents into topics. Classify or summarize documents based on topics. Not just TF-IDF. Unsupervised.

Answer 14

Neural Variational Inference

Answer 15

Four data channels
- train is required
- validation, test and auxiliary are optional
RecordIO-protobuf or CSV
Words must be tokenized into integers
- every document must contain a count for every word in the vocabulary (in CSV)
- the auxiliary channel is for the vocabulary
File or Pipe mode; Pipe is faster
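
A minimal sketch of the "integer tokens plus a count for every vocabulary word" CSV layout. The toy corpus, the whitespace tokenizer and the helper names are assumptions for illustration, not part of SageMaker.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def to_count_row(doc, vocab):
    """Return one count per vocabulary word, zeros included."""
    counts = Counter(vocab[t] for t in doc.lower().split() if t in vocab)
    return [counts.get(i, 0) for i in range(len(vocab))]

docs = ["the cat sat", "the dog sat on the cat"]  # toy corpus, made up
vocab = build_vocab(docs)
rows = [to_count_row(d, vocab) for d in docs]
csv_lines = [",".join(str(c) for c in row) for row in rows]
```

Each CSV line has exactly `len(vocab)` columns, so short documents are padded with zeros rather than omitted words.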

Answer 16

define how many topics we have

Answer 17

No, topics are a latent representation based on top ranking words one of two topic modeling algorithms in SageMaker - you can try them both

Answer 18

mini_batch_size and learning_rate
- lowering them can reduce validation loss, at the expense of training time
num_topics

Answer 19

GPU or CPU. GPU recommended for training; CPU, which is cheaper, is OK for inference.

Answer 20

Topic modeling, not based on deep learning.
Unsupervised
- topics are unlabeled, which means they are just groupings of documents with a shared subset of words
Can be used for things other than words.

Answer 21

cluster customers based on purchases | harmonic analysis in music

Answer 22

Train channel, optional test channel
RecordIO-protobuf or CSV
Each document has counts for every word in the vocabulary (CSV)
Pipe mode is only supported with RecordIO

Answer 23

unsupervised

Answer 24

Scoring results | - per-word log likelihood

Answer 25

similar to NTM but CPU based | - therefore cheaper / more efficient

Answer 26

num_topics
alpha0
- initial guess for the concentration parameter
- smaller values generate sparse topic mixtures
- larger values (>1.0) produce uniform mixtures
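
The effect of the concentration parameter can be seen by sampling topic mixtures from a Dirichlet distribution. This sketch assumes a symmetric Dirichlet with one per-topic value `alpha`; how alpha0 maps onto per-topic values inside SageMaker is not covered here, and the function names are made up.

```python
import random

def sample_topic_mixture(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_weight(alpha, num_topics=5, trials=2000, seed=0):
    """Average weight of the dominant topic: near 1 is sparse, near 1/num_topics is uniform."""
    rng = random.Random(seed)
    largest = (max(sample_topic_mixture(alpha, num_topics, rng)) for _ in range(trials))
    return sum(largest) / trials

sparse = mean_largest_weight(0.1)    # small concentration: one topic dominates
uniform = mean_largest_weight(10.0)  # large concentration: weights close to equal
```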

Answer 27

Single CPU

Answer 28

K-Nearest-Neighbors: a simple classification or regression algorithm. Supervised.

Answer 29

find the K closest points to a sample point and return the most frequent label

Answer 30

Find the K closest points to a sample point and return the average value
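
Both variants (most frequent label for classification, average value for regression) fit in a few lines. A brute-force sketch on made-up points, without the sampling, dimensionality reduction or index that SageMaker adds:

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the targets of the k training points closest to query."""
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    return [target for _, target in by_distance[:k]]

def knn_classify(train, query, k):
    """Classification: most frequent label among the k neighbours."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Regression: average value of the k neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(neighbours) / len(neighbours)

labelled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((1, 0), "a")]
valued = [((0,), 1.0), ((1,), 2.0), ((2,), 3.0), ((10,), 50.0)]

label = knn_classify(labelled, (0.5, 0.5), k=3)
value = knn_regress(valued, (1.2,), k=3)
```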

Answer 31

Training channel: contains the data
Test channel: emits accuracy or MSE
RecordIO-protobuf or CSV training
- first column is the label
File or Pipe mode on either

Answer 32

1. Data is sampled
2. SageMaker includes a dimensionality reduction stage
   - avoids sparse data (curse of dimensionality)
   - at the cost of noise / accuracy
   - sign or fjlt methods
3. Build an index for looking up neighbours
4. Serialize the model
5. Query the model for a given K

Answer 33

k | sample_size

Answer 34

Training on CPU or GPU
- ml.m5.2xlarge
- ml.p2.xlarge
Inference
- CPU for lower latency
- GPU for higher throughput on large batches

Answer 35

Unsupervised clustering
Divides the data into K groups, where members of a group are as similar as possible to each other
- you define "similar"
- measured by Euclidean distance
SageMaker offers web-scale k-means clustering

Answer 36

Training channel, optional test channel
- train: ShardedByS3Key
- test: FullyReplicated
RecordIO-protobuf or CSV
File or Pipe on either

Answer 37

Every observation is mapped to n-dimensional space, where n is the number of features.
Works to optimize the centers of K clusters
- "extra cluster centers" may be specified to improve accuracy (these end up getting reduced to k)
- K = k * x, where x is extra_center_factor

Answer 38

Determine initial cluster centers
- random or k-means++ approach
- k-means++ tries to make the initial clusters far apart
Iterate over the training data and calculate cluster centers
Reduce clusters from K to k
- using Lloyd's method with k-means++
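
A small in-memory sketch of k-means++ seeding followed by Lloyd's method. It is not SageMaker's streaming, web-scale implementation; the point set and function names are made up for illustration.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: later centers are drawn with probability ~ squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def lloyd(points, centers, iterations=20):
    """Lloyd's method: assign each point to its nearest center, then recompute means."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

rng = random.Random(0)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(lloyd(points, kmeans_pp_init(points, 2, rng)))
```

For the final reduction step, the same two functions could plausibly be reused by clustering the K intermediate centers down to k; that reuse is an assumption about how the steps compose, not SageMaker's code.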

Answer 39

k
- choosing k is tricky
- plot the within-cluster sum of squares as a function of k
- use the elbow method
- basically optimize for tightness of clusters
mini_batch_size
extra_center_factor
init_method
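
The elbow method can be sketched by computing the within-cluster sum of squares (WCSS) for several values of k. The 1-D data, the tiny `kmeans_1d` helper and its deterministic initialization are assumptions made to keep the example short.

```python
def kmeans_1d(values, k, iterations=25):
    """Tiny 1-D k-means; initial centers are evenly spaced order statistics."""
    data = sorted(values)
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in data:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def wcss(values, centers):
    """Within-cluster sum of squares: squared distance to the nearest center."""
    return sum(min((v - c) ** 2 for c in centers) for v in values)

values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]  # three obvious groups (made up)
curve = {k: wcss(values, kmeans_1d(values, k)) for k in range(1, 6)}
```

Plotting `curve` shows a steep drop up to k = 3 and almost no improvement afterwards; that bend is the elbow.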

Answer 40

CPU or GPU, but CPU recommended. On GPU, only one GPU per instance is used, so use p*.xlarge.

Answer 41

Dimensionality reduction: avoid the curse of dimensionality while minimizing loss of information.

Answer 42

unsupervised

Answer 43

Components
- the first component has the largest possible variability
- the second component has the next largest

Answer 44

recordIO-protobuf or CSV | File or Pipe on either

Answer 45

A covariance matrix is created, then singular value decomposition (SVD) is applied.
Two modes
- regular: for sparse data and a moderate number of observations and features
- randomized: for a large number of observations and features; uses an approximation algorithm
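
A dependency-free sketch of the covariance step and the first component. Note the swap: instead of a full SVD it uses power iteration, which only recovers the dominant direction. The data and function names are made up for illustration.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the mean-centered columns."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)] for i in range(d)]

def first_component(cov, iterations=200):
    """Dominant eigenvector of cov via power iteration (stand-in for a full SVD)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]  # y is roughly 2x (made up)
component = first_component(covariance_matrix(rows))
```

The returned unit vector points along the direction of largest variability, here close to the line y = 2x.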

Answer 46

algorithm_mode
subtract_mean
- unbiases the data

Answer 47

CPU or GPU | - it depends on the specifics of the input data

Answer 48

Classification and regression | Dealing with Sparse data

Answer 49

Click prediction
Item recommendations
Since an individual user doesn't interact with most pages / products, the data is sparse

Answer 50

supervised | - Classification or Regression

Answer 51

yes. e.g. user - item

Answer 52

recordIO-protobuf with Float32 | - Sparse data means CSV isn't practical

Answer 53

Finds factors we can use to predict a classification (e.g. click or not / purchase or not) or a value (predicted rating?), given a matrix representing some pair of things (users and items). Usually used in the context of recommender systems.
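
A sketch of the standard second-order factorization machine score: a bias, a linear term, and pairwise interactions weighted by the dot product of per-feature factor vectors. The weights, the one-hot feature layout and the function name are invented for illustration; no training is shown.

```python
from itertools import combinations

def fm_predict(x, w0, w, v):
    """Score one sparse example; x is {feature_index: value} for non-zero features only."""
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i, j in combinations(sorted(x), 2)
    )
    return w0 + linear + pairwise

# Hypothetical one-hot layout: features 0-1 are users, features 2-4 are items.
w0 = 0.1                                  # global bias
w = [0.2, -0.1, 0.05, 0.0, 0.3]           # linear terms
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]  # factor vectors

score = fm_predict({1: 1.0, 4: 1.0}, w0, w, v)  # user 1 paired with item 4
```

For a click / no-click classifier the score would typically be passed through a sigmoid; for a rating it is used as the predicted value directly.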

Answer 54

initialization methods for bias, factors, and linear terms - uniform, normal or constant - can tune properties of each method

Answer 55

CPU or GPU. CPU recommended; GPU only works for dense data.

Answer 56

Finding fishy behaviour
- identify suspicious behaviour from IP addresses
- identify logins from anomalous IPs
- identify accounts creating resources from anomalous IPs

Answer 57

unsupervised

Answer 58

Usernames, account IDs (raw data, no need to pre-process)
Training channel, optional validation channel (computes AUC scores)
CSV only (Entity, IP)

Answer 59

Uses a neural network to learn latent vector representations of entities and IP addresses.
Entities are hashed and embedded
- needs a sufficiently large hash size
Automatically generates negative samples during training by randomly pairing entities and IPs
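
A rough sketch of the two preprocessing ideas: hashing entities into a fixed number of embedding slots, and building negative samples by randomly pairing entities with IPs. The CRC32 hash, the skip-if-observed rule and the log entries are assumptions for illustration; the hash and sampling scheme SageMaker actually uses are not specified in these notes.

```python
import random
import zlib

def entity_bucket(entity, num_entity_vectors):
    """Hash an entity name into one of num_entity_vectors embedding slots."""
    return zlib.crc32(entity.encode("utf-8")) % num_entity_vectors

def negative_samples(pairs, rng):
    """Pair each entity with an IP drawn at random from the whole dataset.

    A random pairing can coincide with an observed one; such pairs are
    skipped here to keep the toy example clean.
    """
    observed = set(pairs)
    all_ips = sorted({ip for _, ip in pairs})
    negatives = []
    for entity, _ in pairs:
        candidate = (entity, rng.choice(all_ips))
        if candidate not in observed:
            negatives.append(candidate)
    return negatives

pairs = [("alice", "10.0.0.1"), ("bob", "10.0.0.2"), ("carol", "192.168.1.7")]  # made-up log
negatives = negative_samples(pairs, random.Random(0))
buckets = {e: entity_bucket(e, 2 * len(pairs)) for e, _ in pairs}
```

With too few slots, distinct entities collide in the same bucket and share an embedding, which is why the hash size needs to be large enough.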

Answer 60

num_entity_vectors
- hash size
- set to twice the number of unique entity identifiers
vector_dim
- size of the embedding vectors
- scales model size
- too large results in overfitting
epochs, learning rate, batch size, etc.

Answer 61

CPU or GPU
GPU recommended
- ml.p3.2xlarge or higher
- can use multiple GPUs
Size of CPU instance depends on
- vector_dim
- num_entity_vectors


(85 cards)