Module 7 Flashcards
Overview of logistic regression
- Used to estimate the probability that an event will occur as a function of other variables
- Can be considered a classifier as well
Describe the inputs and outputs of logistic regression
Input - Variables can be continuous or discrete
Output - A set of coefficients that indicate the relative impact of each driver, plus a linear expression for predicting the log-odds of the outcome as a function of the drivers
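A minimal sketch of these inputs and outputs using scikit-learn; the drivers and data below are made-up toy values:

```python
# Toy logistic regression: continuous drivers in, coefficients and
# event probabilities out. Data here is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5, 1.0], [1.5, 0.2], [3.0, 2.5], [2.2, 3.1]])  # drivers
y = np.array([0, 0, 1, 1])                                      # event occurred?

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)   # relative impact of each driver
print(model.predict_proba(X)[:, 1])    # estimated probability of the event
```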
List logistic regression use cases
- Probability of an event
- Binary classification
- Multi-class classification
What is the goal of logistic regression?
- Predict the true proportion of successes, pi, at any value of the predictor
- pi = (# of successes) / (# of trials)
Describe Y, X, and pi in the binary logistic regression model
Y = binary response; X = quantitative predictor; pi = proportion of successes at a given value of X
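Written out, the model these cards describe takes the standard form (log-odds linear in X):

```latex
\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 X
\qquad\Longleftrightarrow\qquad
\pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
```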
Logistic regression Pros
- Explanatory value
- Robust
- Concise
- Easy to score data
- returns good probability estimates
- preserves summary stats of training data
Logistic Regression Cons
- Does not handle missing values well
- Does not work well with discrete drivers that have many distinct values
- Cannot handle variables that affect the outcome in a discontinuous way (step functions)
- Assumes each variable affects the log-odds linearly and additively
Describe Neural Network Concept
- constructed and implemented to model the human brain
- performs pattern matching, classification, and similar tasks that are difficult for traditional computers
Describe an artificial neural network
- possesses a large number of processing elements called nodes/neurons operating in parallel
- neurons are connected by links
- each link has a weight associated with its input signal
- each neuron has internal state called activation level
What are the components of a single-layer neural network
Input layer, hidden layer, and output layer; the parameters are weights, and the intercepts are called biases
What are A_k and g(z) in a neural network
A_k are the activations in the hidden layer
g(z) is the activation function - popular choices are the sigmoid and the rectified linear unit (ReLU)
The activations are typically non-linear derived features of the inputs
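A minimal NumPy sketch of one hidden layer computing activations; sizes and values are illustrative only:

```python
# One hidden layer: linear combination of inputs, then a non-linear g(z).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # rectified linear activation g(z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation g(z)

X = np.array([0.2, -1.0, 0.5])       # input features
W = np.random.randn(4, 3)            # weights for 4 hidden units
b = np.zeros(4)                      # biases (intercepts)

A = relu(W @ X + b)                  # activations: non-linear derived features
print(A)
```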
Describe the output layer of an ANN and how the model is fit
- The output activation function encodes the softmax function
- Fit the model by minimizing the cross-entropy (the negative multinomial log-likelihood)
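A short sketch of the softmax output layer and the cross-entropy it is fit by, using toy scores:

```python
# Softmax turns output-layer scores into class probabilities; the loss
# is the cross-entropy (negative multinomial log-likelihood).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])        # output-layer scores for 3 classes
probs = softmax(z)                   # class probabilities summing to 1

y_true = np.array([1, 0, 0])         # one-hot true class
loss = -np.sum(y_true * np.log(probs))
print(probs, loss)
```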
Describe how CNN works
- builds up an image in a hierarchical fashion
- hierarchy is constructed through convolution and pooling layers
- Edges and simple shapes are recognized first and pieced together to form larger shapes and, eventually, the target image
Describe the convolution filter (how it is learned, and the score it produces)
- filters are learned during training
- Input image and filter are combined using the dot product to get a score
- the score is high if the sub-image of the input image is similar to the filter
What is the idea of convolution, its result, and the weight in the filters?
- the idea is to find common patterns that occur in different parts of the image
- Result is a new feature map
- weights are learned by the network
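A hand-rolled NumPy sketch of a single convolution; the filter values here are made up, where in practice they would be learned:

```python
# Slide a small filter over the image and take dot products; high scores
# mark sub-images that resemble the filter. The result is a feature map.
import numpy as np

image = np.random.rand(6, 6)         # toy grayscale image
filt = np.array([[1.0, -1.0],        # 2x2 filter (illustrative values)
                 [1.0, -1.0]])

h = image.shape[0] - filt.shape[0] + 1
w = image.shape[1] - filt.shape[1] + 1
feature_map = np.zeros((h, w))       # the new feature map
for i in range(h):
    for j in range(w):
        sub = image[i:i+2, j:j+2]    # sub-image under the filter
        feature_map[i, j] = np.sum(sub * filt)  # dot-product score
print(feature_map)
```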
What are Pooling and its adv
- each non-overlapping 2 x 2 block is replaced by its maximum
- sharpens feature identification
- allows for locational invariance
- reduces dimensions by a factor of 4
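A small NumPy sketch of 2 x 2 max pooling on a toy feature map:

```python
# Each non-overlapping 2x2 block is replaced by its maximum,
# reducing the dimensions by a factor of 4.
import numpy as np

fmap = np.arange(16.0).reshape(4, 4)          # toy 4x4 feature map
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                                 # 2x2: 16 values -> 4
```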
Describe the architecture of CNN
- many convolve + pool layers
- filters are typically small (3x3)
- Each filter creates a new channel in the convolution layer
- As pooling reduces size, the number of filters/channels increases
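A hypothetical Keras sketch of this convolve + pool pattern; layer counts and sizes are illustrative, not a prescribed architecture:

```python
# Small 3x3 filters; channels increase as pooling shrinks the image.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),          # halves height and width
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),          # smaller maps, more channels
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.summary()
```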
How to create features X to characterize the document?
Use Bag of words
What is a bag of words
- A bag of words counts unigrams (single words, ignoring order)
- Identify the 10K most frequently occurring words
- Create a binary vector of length 10K for each document and put a 1 in every position whose corresponding word occurs in the document
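A sketch of the binary scoring step with scikit-learn's CountVectorizer, assuming a recent scikit-learn; toy documents, so the vocabulary is far smaller than 10K:

```python
# Binary bag-of-words: 1 if the word occurs in the document, else 0.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was terrible"]
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())                     # one binary vector per document
```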
What is a recurrent neural network?
- builds a model that takes into account the sequential nature of the data and builds up a memory of the past
What is each observation in an RNN, and what is the target Y?
- The feature for each observation is a sequence of vectors
- The target Y is a single variable such as sentiment, or a one-hot vector for multiclass classification
- Y can also be a sequence
Describe the architecture of an RNN. What does it represent?
- The hidden layer is a sequence of vectors A_l, each computed from the current input X_l and the previous activation A_{l-1}, and producing an output O_l
- the same weights W, U, and B are used at each step
- represents an evolving model updated as each element is processed
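A minimal NumPy sketch of this unrolled architecture, reusing the same weights at every step; shapes and values are illustrative:

```python
# One RNN layer unrolled over a sequence: A_l depends on X_l and A_{l-1}.
import numpy as np

def rnn_forward(X_seq, W, U, B, beta):
    A = np.zeros(W.shape[0])                 # initial hidden state A_0
    for X in X_seq:                          # process the sequence in order
        A = np.tanh(W @ X + U @ A)           # same W, U reused at each step
        O = beta @ A + B                     # output at this step
    return O                                 # final output (e.g., sentiment)

X_seq = [np.random.randn(5) for _ in range(4)]   # sequence of 4 vectors
W, U = np.random.randn(8, 5), np.random.randn(8, 8)
beta, B = np.random.randn(8), 0.0
print(rnn_forward(X_seq, W, U, B, beta))
```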
How to increase accuracy for an RNN
add LSTM units - long short-term memory
What is autocorrelation
the correlation of all pairs (y_t, y_{t-l}) of values that are l time units (lags) apart
What is the RNN forecaster similar to
Autoregression procedure
When to use deep learning
- Image classification and modeling, medical imaging
- Speech modeling, language, and time-series forecasting
- when the signal to noise ratio is high
- use simpler models like AR(5) or glmnet if you can
When does fitting a neural network become difficult
When the objective is nonconvex; the solution is gradient descent
Implementing gradient descent for a nonconvex objective
- Start with a guess theta^0 for all parameters and set t = 0
- Iterate until the objective fails to decrease, taking a downhill step each time
How to find a direction that points downhill in gradient descent?
- Step along the negative gradient (the vector of partial derivatives): delta = -rho * grad R(theta^t), where rho is the learning rate
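A toy NumPy sketch of this loop; the objective and learning rate are made up:

```python
# Gradient descent: start from a guess, step downhill along the negative
# gradient, and stop when the objective fails to decrease.
import numpy as np

def R(theta):                        # toy objective
    return (theta ** 2).sum() + np.sin(3 * theta).sum()

def grad_R(theta):                   # vector of partial derivatives
    return 2 * theta + 3 * np.cos(3 * theta)

theta = np.array([2.0, -1.5])        # initial guess, t = 0
rho = 0.05                           # learning rate
for t in range(200):
    new_theta = theta - rho * grad_R(theta)   # downhill step
    if R(new_theta) >= R(theta):     # objective failed to decrease
        break
    theta = new_theta
print(theta, R(theta))
```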
What does Backpropagation use
- R is a sum so the gradient is the sum of gradients
- Backpropagation uses chain rule for differentiation
What is slow learning
- Gradient descent with a small learning rate converges slowly
- Use early stopping for regularization
What is stochastic gradient descent
- Rather than computing each gradient step with all of the data, use minibatches drawn at random
What is an epoch
One full pass through the training data; it amounts to n / (minibatch size) minibatch updates
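A sketch of minibatch SGD on a toy least-squares problem; each outer loop is one epoch:

```python
# SGD: shuffle the data each epoch and update on random minibatches.
import numpy as np

n, batch_size = 1000, 100
X, y = np.random.randn(n, 3), np.random.randn(n)
theta, rho = np.zeros(3), 0.01

for epoch in range(5):                        # one epoch = one pass over data
    order = np.random.permutation(n)
    for start in range(0, n, batch_size):     # n/batch_size updates per epoch
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)  # least-squares gradient
        theta -= rho * grad
print(theta)
```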
What is Regularization
shrinks the weights at each layer; two popular forms are dropout and data augmentation
What is dropout learning
- at each update, randomly remove units with probability phi and scale up the weights of those retained to compensate
- other units stand in for those removed
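A sketch of (inverted) dropout, assuming the usual 1/(1 - phi) rescaling of the retained units:

```python
# Dropout: zero out each unit with probability phi, rescale the survivors.
import numpy as np

phi = 0.5                                  # dropout probability
A = np.random.rand(8)                      # hidden-layer activations
mask = (np.random.rand(8) > phi)           # keep each unit w.p. 1 - phi
A_dropped = A * mask / (1 - phi)           # survivors stand in for the rest
print(A_dropped)
```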
What is data augmentation, how does it relate to ridge, and when is it effective?
- make many copies of each (x, y) and add small Gaussian noise to the x copies, leaving y unchanged
- this makes the fit robust, and is equivalent to ridge regularization in OLS
- effective with SGD
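A sketch of the augmentation recipe: noisy copies of x, labels untouched; sizes are illustrative:

```python
# Noise-based augmentation: replicate each (x, y), perturb x, keep y.
import numpy as np

X = np.random.randn(50, 4)
y = np.random.randn(50)
copies, sigma = 5, 0.1

X_aug = np.repeat(X, copies, axis=0) + sigma * np.random.randn(50 * copies, 4)
y_aug = np.repeat(y, copies)               # labels are not modified
print(X_aug.shape, y_aug.shape)
```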
What is double descent
- with neural networks better to have too many hidden units than too few
- running stochastic gradient descent till zero training error gives a good out-of-sample error
- Increasing layers and training to zero error gives better out-of-sample error
In a wide linear model (p > n), what does SGD with a small step size lead to?
The minimum-norm solution
What is minimum norm
The zero-residual solution with the smallest norm (when p > n there are many zero-residual solutions)
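A quick NumPy check of this, using the pseudoinverse, which returns the minimum-norm zero-residual solution when p > n:

```python
# Wide linear model: the pseudoinverse solution fits exactly (zero
# residuals) and has the smallest norm among all exact solutions.
import numpy as np

n, p = 20, 100                             # p > n
X = np.random.randn(n, p)
y = np.random.randn(n)

beta = np.linalg.pinv(X) @ y               # minimum-norm solution
print(np.allclose(X @ beta, y))            # zero residuals: True
print(np.linalg.norm(beta))                # smallest norm among solutions
```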
What is similar to the ridge path?
Stochastic gradient flow
Which signal-to-noise regime is less prone to overfitting?
A high signal-to-noise ratio