Fundamentals of Computer Vision: Study Guide Flashcards
What is the primary goal of computer vision, and how does it differ from simple image processing?
The primary goal of computer vision is to extract meaning or actionable insights from images, whereas image processing often focuses on manipulating pixel values to apply visual effects. Computer vision aims to understand the content of an image rather than just alter its appearance.
Explain the fundamental role of filters in image processing. Provide an example of a filter’s effect on an image.
Filters are used to modify pixel values in an image to create visual effects such as blurring, sharpening, or edge detection. A filter kernel is convolved across the image: for each pixel, a new value is calculated as the weighted sum of the pixel values in the surrounding area. For example, an edge-detection filter highlights points where brightness changes sharply between neighboring pixels.
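The convolution described above can be sketched in a few lines of numpy. This is a minimal illustration (no padding, single-channel image); the Laplacian kernel values are a standard edge-detection choice, and the 5x5 test image is invented for the example.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; each output pixel is the
    weighted sum of the neighbourhood under the kernel (no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A Laplacian kernel highlights edges: over a flat region the weights
# sum to zero, while pixels that differ from their neighbours produce
# large positive or negative values.
laplace = np.array([[-1, -1, -1],
                    [-1,  8, -1],
                    [-1, -1, -1]])

# 5x5 grayscale image: dark background with one bright centre column.
img = np.zeros((5, 5))
img[:, 2] = 255

edges = convolve2d(img, laplace)  # large values along the bright column
```

Running this produces strong responses where the bright column meets the dark background, which is exactly the edge-highlighting effect the answer describes.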
Describe the basic architecture and process of a Convolutional Neural Network (CNN) for image classification, including the training process.
A CNN uses filter kernels to extract numeric feature maps from images. These feature maps are flattened and fed into a fully connected neural network, which uses a softmax function to produce a prediction. During training, the predictions are compared to the known labels to calculate the loss, and the weights of both the neural network and the filter kernels are adjusted to reduce that loss.
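The forward pass of that architecture (convolution, then flatten, then a dense layer with softmax) can be sketched in plain numpy. The kernels and weights here are random placeholders standing in for learned parameters, and the three-class setup is invented for illustration; a real CNN would learn these values via backpropagation.

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((8, 8))                # toy 8x8 grayscale input

# 1. Convolution layer: (learnable) kernels extract feature maps.
kernels = rng.standard_normal((2, 3, 3))
feature_maps = np.maximum(0, [convolve2d(image, k) for k in kernels])  # ReLU

# 2. Flatten the feature maps into one vector.
features = np.asarray(feature_maps).ravel()   # 2 maps * 6 * 6 = 72 values

# 3. Fully connected layer + softmax -> class probabilities.
W = rng.standard_normal((3, features.size))   # 3 hypothetical classes
probs = softmax(W @ features)

# During training, the loss between probs and the true label would be
# backpropagated to adjust both W and the kernel weights.
```

The softmax output is a probability distribution over the classes, which is what gets compared to the known label to compute the loss during training.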
How do transformers encode words, and what is the advantage of this method over other techniques?
Transformers encode words as vector-based embeddings: arrays of numeric values that capture semantic attributes of the words. This approach is useful because words used in similar contexts end up with similar embeddings, producing a semantic language model on which more sophisticated models can be built.
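The "words used in similar contexts are grouped together" idea can be demonstrated with cosine similarity between embedding vectors. The 4-dimensional values below are invented for illustration; real models learn embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 4-dimensional embeddings (values invented for illustration).
embeddings = {
    "dog": np.array([0.9, 0.8, 0.1, 0.0]),
    "cat": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    """Words used in similar contexts end up close in embedding space."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
sim_dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
# "dog" and "cat" score much closer together than "dog" and "car".
```

Cosine similarity is the standard way to measure closeness in embedding space because it compares vector direction rather than magnitude.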
Explain the concept of multi-modal models and how they differ from traditional computer vision approaches. Give an example of a multi-modal model.
Multi-modal models are trained using a combination of different types of data (such as images and text), often using captioned images without fixed labels. They differ from traditional approaches by encapsulating the relationships between language token embeddings and image features. The Microsoft Florence model is an example.
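The relationship between language token embeddings and image features can be illustrated with a toy matching step: if an image encoder and a text encoder project into the same shared space (as multi-modal models like Florence or CLIP do), the best caption for an image is the one whose embedding is closest. All vectors below are invented stand-ins for real encoder outputs.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors standing in for the outputs of a multi-modal
# model's image encoder and text encoder (same shared space).
image_embedding = np.array([0.8, 0.1, 0.6])
caption_embeddings = {
    "a photo of a dog":   np.array([0.7, 0.2, 0.6]),
    "a photo of a plane": np.array([0.1, 0.9, 0.1]),
}

# The caption whose embedding is closest to the image embedding wins.
best = max(caption_embeddings,
           key=lambda c: cosine(image_embedding, caption_embeddings[c]))
```

This nearest-caption lookup is the core trick that lets multi-modal models classify images they were never given fixed labels for.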
Describe how digital images are represented as numeric data for computer processing. Be sure to include an explanation of image resolution.
Digital images are represented as arrays of numeric pixel values. Each pixel value determines the color (or shade of gray) at a particular point. The image resolution is determined by the dimensions of the array; for example, a 7x7 array represents an image that is 7 pixels wide and 7 pixels tall.
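A 7x7 grayscale image of the kind described above looks like this as a numpy array. The particular pixel values are invented for illustration.

```python
import numpy as np

# A 7x7 grayscale image: each value (0-255) is one pixel's brightness.
image = np.array([
    [0,   0,   0,   0,   0,   0,   0],
    [0,   0, 255, 255, 255,   0,   0],
    [0,   0, 255,   0, 255,   0,   0],
    [0,   0, 255, 255, 255,   0,   0],
    [0,   0, 255,   0,   0,   0,   0],
    [0,   0, 255,   0,   0,   0,   0],
    [0,   0,   0,   0,   0,   0,   0],
], dtype=np.uint8)

height, width = image.shape   # resolution: 7 pixels tall, 7 pixels wide
# A colour image adds a third dimension, e.g. shape (7, 7, 3) for RGB.
```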
What is a filter kernel and how is it used to perform convolutional filtering?
A filter kernel is a small array of numeric values used as weights when performing convolutional filtering. Each pixel value in the area of the image that matches the kernel's dimensions is multiplied by the corresponding weight value in the kernel. These products are added together, and the result is assigned as the new pixel value in the filtered image.
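The weighted-sum arithmetic for a single pixel can be written out explicitly. The patch values are invented; the kernel is a common sharpening kernel.

```python
import numpy as np

# One 3x3 patch of image pixels and a sharpening-style kernel.
patch = np.array([[10, 10, 10],
                  [10, 50, 10],
                  [10, 10, 10]])
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])

# New value for the centre pixel: element-wise products, summed.
# 5*50 + (-1)*10 * 4 edge neighbours + 0 * 4 corners = 250 - 40 = 210
new_pixel = np.sum(patch * kernel)
```

The bright centre pixel (50) gets pushed even further from its darker neighbours (10), which is exactly the sharpening effect.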
List and describe at least three capabilities offered by the Azure AI Vision service.
Azure AI Vision provides capabilities such as optical character recognition (OCR), which extracts text from images; image captioning, which generates human-readable descriptions of images; and object detection, which identifies the presence of common objects within images.
What is the difference between an Azure AI Vision resource and an Azure AI services resource in the Azure cloud?
An Azure AI Vision resource is a specific resource designed solely for the Azure AI Vision service, useful for tracking its utilization and costs separately, while an Azure AI services resource provides access to multiple AI services, including Azure AI Vision, which is useful when using multiple Azure AI services.
Why is it beneficial to use a pre-trained foundation model, such as Florence, when training a custom computer vision model?
Using a pre-trained foundation model reduces the amount of training data and computing power necessary to train a custom model, as the foundation model has already learned a wide range of features. Building a custom model on top of a foundation model allows for more sophisticated results in less time.
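The "frozen backbone, trainable head" pattern behind this answer can be sketched in numpy. Here a fixed random projection stands in for a pre-trained feature extractor whose weights are not updated, and only a small logistic-regression head is trained on a toy dataset; everything about the data and dimensions is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a frozen pre-trained backbone: it maps raw inputs to
# features, and its weights are NOT updated during custom training.
W_backbone = rng.standard_normal((16, 4))
def extract_features(x):
    return np.maximum(0, x @ W_backbone.T)   # frozen ReLU features

# Tiny toy dataset: two well-separated classes.
X = np.vstack([rng.normal(+1.0, 0.3, (20, 4)),
               rng.normal(-1.0, 0.3, (20, 4))])
y = np.array([1] * 20 + [0] * 20)

# Only the small classification head is trained.
w = np.zeros(16)
for _ in range(200):
    F = extract_features(X)
    p = 1 / (1 + np.exp(-(F @ w)))            # sigmoid predictions
    w -= 0.01 * F.T @ (p - y) / len(y)        # logistic-loss gradient step

preds = (1 / (1 + np.exp(-(extract_features(X) @ w))) > 0.5)
accuracy = np.mean(preds == y)
```

Because the backbone's features are already useful, only 16 head weights need training here, which mirrors why a foundation model cuts the data and compute needed for a custom model.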