Computer Vision Fundamentals Flashcards
How can you find matching or partially matching images on the web using GCP?
Use the Vision API with the WEB_DETECTION feature.
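As a rough illustration, the request body for the Vision API REST method images:annotate can be built as below. This is a minimal sketch: the helper name, the bucket path and the maxResults value are assumptions for illustration, and authentication and the HTTP call itself are left out.

```python
import json

# REST endpoint for synchronous image annotation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_web_detection_request(image_uri, max_results=10):
    """Build the JSON body for an images:annotate call with WEB_DETECTION.

    build_web_detection_request is a hypothetical helper, not part of any SDK.
    """
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [
                    {"type": "WEB_DETECTION", "maxResults": max_results}
                ],
            }
        ]
    }


# Hypothetical image location, used only for illustration.
body = build_web_detection_request("gs://example-bucket/photo.jpg")
payload = json.dumps(body, indent=2)
print(payload)
```

In the response, the webDetection object carries fields such as fullMatchingImages, partialMatchingImages, visuallySimilarImages and pagesWithMatchingImages.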
What are the supported features of the Vision API?
- Text detection - OCR for images; detects sparse text in an image
- Document text detection - OCR for documents and images; detects dense text and handwriting (consider using Document AI)
- Landmark detection
- Logo detection
- Label detection - detects generalized labels with confidence scores
- Image properties - returns the dominant colors, each with a score
- Object localization - identifies generalized labels and returns bounding boxes around objects (consider using Vertex AI Vision)
- Crop hint detection - you provide an aspect ratio and it suggests a crop around the important part of the image
- Web entities and pages - returns web content related to an image: matching images, similar images, web pages where the image appears, etc.
- Explicit content detection (Safe Search) - rates the likelihood of explicit content categories: adult, spoof, medical, violence and racy
- Face detection - detects where faces are located in the image and their likely emotions (it does not identify a specific person)
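For reference, each feature in the list above corresponds to a Feature.Type value in an annotate request. The sketch below maps informal labels (my own shorthand, not API names) to those type strings and builds the "features" array for a request that asks for several features at once.

```python
# Informal labels (assumed shorthand) mapped to Vision API Feature.Type names.
FEATURE_TYPES = {
    "text detection": "TEXT_DETECTION",
    "document text detection": "DOCUMENT_TEXT_DETECTION",
    "landmark detection": "LANDMARK_DETECTION",
    "logo detection": "LOGO_DETECTION",
    "label detection": "LABEL_DETECTION",
    "image properties": "IMAGE_PROPERTIES",
    "object localization": "OBJECT_LOCALIZATION",
    "crop hints": "CROP_HINTS",
    "web detection": "WEB_DETECTION",
    "safe search": "SAFE_SEARCH_DETECTION",
    "face detection": "FACE_DETECTION",
}


def build_features(names):
    """Return the 'features' array for the given informal feature labels."""
    return [{"type": FEATURE_TYPES[name]} for name in names]


# One request can ask for several features on the same image.
features = build_features(["label detection", "safe search"])
print(features)
```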
What are the most common preprocessing steps for images before you train a model?
- Load the image dataset in batches
- Decode images (into a 3D tensor of height x width x RGB channels)
- Convert images (scale each integer in the tensor to a floating-point value between 0 and 1)
- Resize images to the desired size, e.g. 250x250
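The steps above can be sketched in plain Python on nested lists. This is only an illustration under stated assumptions: decoding is replaced by a synthetic already-decoded dataset, resizing uses nearest-neighbour sampling for simplicity, and all function names are hypothetical. A real pipeline would normally use an image or tensor library instead.

```python
def scale_to_unit(image):
    """Convert 0-255 integer channel values to floats between 0 and 1."""
    return [[[ch / 255.0 for ch in pixel] for pixel in row] for row in image]


def resize_nearest(image, new_h, new_w):
    """Resize a height x width x channels image by nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def preprocess(image, size=(250, 250)):
    """Scale pixel values, then resize to the target (height, width)."""
    return resize_nearest(scale_to_unit(image), *size)


# Synthetic stand-in for decoded images: 5 RGB images of 4 x 6 pixels.
dataset = [
    [[[(r * 40) % 256, (c * 40) % 256, 128] for c in range(6)] for r in range(4)]
    for _ in range(5)
]

# Small target size here only to keep the demo fast.
processed = [
    [preprocess(img, size=(8, 8)) for img in batch]
    for batch in batches(dataset, 2)
]
```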
How do you regularize DNNs?
With dropout: randomly drop some nodes in the neural network at each training step.
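A minimal sketch of the idea, assuming the common "inverted dropout" variant, in which surviving activations are scaled up during training so that nothing needs to change at inference time. The function name and the fixed seed are illustrative choices.

```python
import random


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training.

    Survivors are divided by the keep probability (inverted dropout),
    so the expected value of each activation stays the same.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed only to make the demo repeatable
layer_output = [1.0] * 10
train_out = dropout(layer_output, rate=0.5, training=True, rng=rng)
infer_out = dropout(layer_output, rate=0.5, training=False, rng=rng)
```

During training each value in train_out is either 0.0 (dropped) or 2.0 (kept and rescaled); at inference the activations pass through unchanged.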
What is batch normalization?
It is a technique that speeds up DNN training and also has a regularizing effect. The goal is to standardize the inputs to a layer: for each mini-batch you calculate the mean and standard deviation of the values and use them to normalize those values, either before or after the activation function. At inference time, you use moving averages of the mean and standard deviation collected during training to normalize the prediction input.
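The training and inference behaviour described above can be sketched for a single feature as follows. This is a simplified illustration: the class name, momentum and epsilon values are assumptions, and the learnable scale and shift that real batch normalization layers usually include are omitted.

```python
import math


class BatchNormOneFeature:
    """Batch normalization for one scalar feature (no learnable scale/shift)."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, values, training):
        if training:
            # Use the statistics of the current mini-batch.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            # Update the moving averages that inference will rely on.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Use the moving averages collected during training.
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in values]


bn = BatchNormOneFeature()
train_out = bn([2.0, 4.0, 6.0, 8.0], training=True)
infer_out = bn([5.0], training=False)
```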
What is convolution, and what does it relate to in the computer vision domain?
Convolution is an operation that combines two functions to produce a third function. In the computer vision domain, it is applied to groups of nearby pixels.
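To make "two functions produce a third" concrete, here is a small sketch of discrete 1D convolution, where the two functions are short sequences. The function name and the sample values are illustrative.

```python
def convolve_1d(f, g):
    """Full discrete convolution: out[n] = sum over i of f[i] * g[n - i]."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fv in enumerate(f):
        for j, gv in enumerate(g):
            out[i + j] += fv * gv
    return out


signal = [1, 2, 3]
kernel = [0, 1, 0.5]
result = convolve_1d(signal, kernel)
print(result)  # [0, 1, 2.5, 4, 1.5]
```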
What is a kernel in CNNs?
A kernel is a filter that is slid across an image or part of an image. For example, a filter represents a predefined “shape” as a 3 x 3 matrix of values. It tries to identify this shape in a 5 x 5 image, and the result is a new matrix (3 x 3 when the stride is 1 and no padding is used). Many different kernels are used in a CNN, and each of them has a different set of weights.
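The 3 x 3 filter over a 5 x 5 image can be sketched as below. Assumptions: square image and kernel, no padding, and the kernel is applied without flipping (cross-correlation), which is how CNN layers usually apply it. The vertical-edge kernel is just an example of a "shape".

```python
def slide_kernel(image, kernel, stride=1):
    """Slide a square kernel over a square image and sum the products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# 5 x 5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# 3 x 3 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = slide_kernel(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The output is large where the window covers the edge and zero where the window sees a uniform region.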
What model parameters exist in a CNN?
- number of filters (how many kernels are applied per layer)
- input channels (for color images it is 3 because RGB)
- size of the filter
- strides (how many pixels the filter moves on each step)
- padding (how much padding is added around the image so that the filter fits at the borders)
- activation (activation function, mostly ReLU)
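How these settings interact can be shown with two standard formulas: the output size along one dimension, floor((n + 2p - k) / s) + 1, and the number of trainable weights in a convolutional layer, filters x (k x k x input channels + 1 bias). The helper names and the example values (250-pixel input, 32 filters) are illustrative.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output length along one dimension of a convolutional layer."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def conv_param_count(num_filters, input_channels, filter_size):
    """Trainable weights: one k x k x channels kernel plus a bias per filter."""
    return num_filters * (filter_size * filter_size * input_channels + 1)


no_padding = conv_output_size(250, 3, stride=1, padding=0)    # 248
with_padding = conv_output_size(250, 3, stride=1, padding=1)  # 250
strided = conv_output_size(250, 3, stride=2, padding=0)       # 124
params = conv_param_count(num_filters=32, input_channels=3, filter_size=3)  # 896
```

The activation function adds no trainable weights, so it does not appear in the count.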
What is a pooling layer in a CNN?
CNNs are complex neural networks with a lot of parameters, so training a model can take a lot of time. To speed it up, pooling layers were introduced to reduce the dimensionality of a convolutional layer's output. Imagine that we have a 4x4 picture, a 2x2 filter and stride 2. A pooling layer takes each 2x2 window and reduces it to a single 1x1 value using a technique such as max, average or median pooling (e.g. taking the average of the 4 numbers in the 2x2 window). The 4x4 image is thereby reduced to a 2x2 matrix.
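The 4x4 example above, with a 2x2 window and stride 2, can be sketched as follows. The function names and sample values are illustrative, and a square input is assumed.

```python
def pool_2d(matrix, size=2, stride=2, reduce=max):
    """Apply `reduce` to each size x size window of a square matrix."""
    out_size = (len(matrix) - size) // stride + 1
    return [
        [
            reduce([matrix[r * stride + i][c * stride + j]
                    for i in range(size) for j in range(size)])
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


def average(values):
    return sum(values) / len(values)


image = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 11, 10, 12],
    [13, 15, 14, 16],
]

max_pooled = pool_2d(image)                  # [[7, 8], [15, 16]]
avg_pooled = pool_2d(image, reduce=average)  # [[4.0, 5.0], [12.0, 13.0]]
```

Passing statistics.median as reduce would give median pooling in the same way.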
How do you decide whether to freeze model weights during transfer learning?
If you have a small dataset, you should freeze the pretrained weights to avoid overfitting. As the dataset grows, you can unfreeze and retrain more of the model weights.
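What freezing means mechanically can be sketched with a toy gradient step: frozen weights are skipped by the update, so only the new layers learn. The parameter names, values and learning rate below are hypothetical; in Keras, the same effect is typically achieved by setting trainable = False on the pretrained layers.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Update only the parameters marked trainable; leave frozen ones as-is."""
    return {
        name: value - lr * grads[name] if trainable[name] else value
        for name, value in params.items()
    }


# Hypothetical weights: a pretrained "base" plus a newly added "head".
params = {"base.conv1": 0.50, "base.conv2": -0.20, "head.dense": 0.10}
grads = {"base.conv1": 1.0, "base.conv2": 1.0, "head.dense": 1.0}

# Small dataset: freeze everything in the pretrained base.
trainable = {name: not name.startswith("base.") for name in params}

updated = sgd_step(params, grads, trainable)
```

After the step, the base weights are unchanged and only head.dense has moved.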