Fundamentals of Computer Vision Flashcards
What is Azure AI Vision service for?
Azure AI Vision service enables software engineers to create intelligent solutions that extract information from images; a common task in many artificial intelligence (AI) scenarios.
How are images represented for computers?
To a computer, image is nothing but a matrix of numeric pixel values. (black and white) and 3 matrices representing Red Green Blue colors combined.
What are filters in image processing?
A common way to perform image processing tasks is to apply filters that modify the pixel values of the image to create a visual effect. A filter is defined by one or more arrays of pixel values, called filter kernels. For example, you could define filter with a 3x3 kernel as shown in this example:
-1 -1 -1
-1 8 -1
-1 -1 -1
The kernel is then convolved across the image, calculating a weighted sum for each 3x3 patch of pixels and assigning the result to a new image.,, this kind of image manipulation is often referred to as convolutional filtering.
What does a laplace image filter do?
highlights the edges of objects in an image.
What are convolutional neural networks (CNNs)?
CNNs use filters to extract numeric feature maps from images, and then feed the feature values into a deep learning model to generate a label prediction.
CNNs have been at the core of computer vision solutions for many years. While they’re commonly used to solve image classification problems as described previously, they’re also the basis for more complex computer vision models. For example, object detection models combine CNN feature extraction layers with the identification of regions of interest in images to locate multiple classes of object in the same image.
What are transformers?
AI discipline - natural language processing (NLP), another type of neural network architecture, called a transformer has enabled the development of sophisticated models for language.
Transformers work by processing huge volumes of data, and encoding language tokens (representing individual words or phrases) as vector-based embeddings (arrays of numeric values). You can think of an embedding as representing a set of dimensions that each represent some semantic attribute of the token. The embeddings are created such that tokens that are commonly used in the same context are closer together dimensionally than unrelated words.
What are multi modal models?
multi-modal models, in which the model is trained using a large volume of captioned images, with no fixed labels. An image encoder extracts features from images based on pixel values and combines them with text embeddings created by a language encoder. The overall model encapsulates relationships between natural language token embeddings and image features.
The Microsoft Florence model is just such a model. Trained with huge volumes of captioned images from the Internet, it includes both a language encoder and an image encoder. Florence is an example of a foundation model. In other words, a pre-trained general model on which you can build multiple adaptive models for specialist tasks.
What can Azure AI Vision service do?
Azure AI Vision supports multiple image analysis capabilities, including:
Optical character recognition (OCR) - extracting text from images.
Generating captions and descriptions of images.
Detection of thousands of common objects in images.
Tagging visual features in images
Summary
Computer vision is built on the analysis and manipulation of numeric pixel values in images. Machine learning models are trained using a large volume of images to enable common computer vision scenarios, such as image classification, object detection, automated image tagging, optical character recognition, and others
While you can create your own machine learning models for computer vision, the Azure AI Vision service provides many pretrained capabilities that you can use to analyze images, including generating a descriptive caption, extracting relevant tags, identifying objects, and others.