Vision architectures Flashcards

Question 1

Q

What is the purpose of SSD (Single Shot MultiBox Detector) in vision tasks?

Answer

A

SSD is designed for real-time object detection, combining object localization and classification in a single forward pass.

Question 2

Q

What is R-CNN, and how does it work?

Answer

A

R-CNN (Region-Based Convolutional Neural Network) generates region proposals (selective search in the original method), runs a CNN on each proposed region to extract features, and classifies those features to detect objects.

Question 3

Q

What does YOLO (You Only Look Once) specialize in?

Answer

A

YOLO is a real-time object detection model that predicts bounding boxes and class probabilities directly from images in a single pass.
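
To make "a single pass" concrete, here is a minimal sketch of the size of the prediction tensor, assuming the grid layout of the original YOLO version (an S x S grid, B boxes per cell, C classes). The function name and default values are illustrative assumptions, not part of the card.

```python
def yolo_v1_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the prediction tensor under the assumed YOLOv1-style layout."""
    # Each box carries 5 numbers: x, y, w, h and a confidence score.
    # Class probabilities are predicted once per grid cell.
    per_cell = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, per_cell)


print(yolo_v1_output_shape())  # (7, 7, 30)
```

All boxes and class scores come out of one tensor produced by one forward pass, with no separate region-proposal stage.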

Question 4

Q

What are Siamese Networks used for in vision tasks?

Answer

A

Siamese Networks are used for tasks like image similarity and verification by comparing embeddings of two input images.
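
The comparison step can be sketched as follows, assuming the two embeddings have already been produced by the shared network. Cosine similarity and the 0.8 threshold are illustrative choices (a Euclidean distance is equally common), and the vectors are made-up values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_match(emb_a, emb_b, threshold=0.8):
    """Verification decision: True when the embeddings are close enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Hypothetical embeddings of three images from the same shared encoder.
anchor = [0.9, 0.1, 0.4]
similar = [0.8, 0.2, 0.5]
different = [-0.3, 0.9, -0.2]

print(is_match(anchor, similar))    # True
print(is_match(anchor, different))  # False
```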

Question 5

Q

What is EfficientNet, and why is it popular?

Answer

A

EfficientNet is a family of models that scales network depth, width, and input resolution together using a single compound coefficient, achieving high accuracy with fewer parameters than comparably accurate CNNs.
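
A minimal sketch of the scaling rule, assuming the compound-scaling form d = alpha^phi, w = beta^phi, r = gamma^phi with the base constants reported for EfficientNet (alpha = 1.2, beta = 1.1, gamma = 1.15). The function names are illustrative.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Multipliers for depth, width and resolution at compound coefficient phi."""
    return {
        "depth": alpha ** phi,
        "width": beta ** phi,
        "resolution": gamma ** phi,
    }


def flops_multiplier(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi


for phi in range(4):
    print(phi, compound_scale(phi), round(flops_multiplier(phi), 2))
```

Because alpha * beta^2 * gamma^2 is close to 2, each increment of phi roughly doubles the compute budget under these assumptions.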

Question 6

Q

What is Xception, and how does it improve upon traditional CNNs?

Answer

A

Xception uses depthwise separable convolutions to reduce computational cost while maintaining high performance.
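
The saving can be checked by counting weights. This sketch compares a standard k x k convolution with a depthwise separable one (a k x k depthwise step followed by a 1 x 1 pointwise step); biases are ignored and the layer sizes are arbitrary example values.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


std = standard_conv_params(3, 128, 256)
sep = separable_conv_params(3, 128, 256)
print(std, sep, round(std / sep, 2))  # 294912 33920 8.69
```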

Question 7

Q

What is MobileNet designed for?

Answer

A

MobileNet is optimized for mobile and embedded vision applications, using depthwise separable convolutions for efficiency.

Question 8

Q

What is ViT (Vision Transformer), and what makes it unique?

Answer

A

ViT applies the transformer architecture to image patches, achieving high performance without convolutional layers.
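
The patch step can be sketched in plain Python, assuming a single-channel image stored as a list of rows. A real model would also project each flattened patch linearly and add position embeddings, which is omitted here; the helper names are illustrative.

```python
def num_patches(image_size, patch_size):
    """Number of tokens for a square image cut into square patches."""
    return (image_size // patch_size) ** 2


def split_into_patches(image, patch_size):
    """Cut a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4 x 4 image
patches = split_into_patches(image, 2)
print(patches[0])            # [0, 1, 4, 5]
print(num_patches(224, 16))  # 196
```

Each flattened patch plays the role that a word token plays in a text transformer.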

Question 9

Q

What is SegFormer used for in vision tasks?

Answer

A

SegFormer is a transformer-based model for semantic segmentation that pairs a hierarchical transformer encoder with a lightweight all-MLP decoder.

Question 10

Q

What is DeepLabv3, and what tasks is it used for?

Answer

A

DeepLabv3 is designed for semantic segmentation, utilizing atrous convolutions to capture multi-scale contextual information.
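
A one-dimensional sketch of an atrous (dilated) convolution, written as a cross-correlation with no padding. It illustrates how the rate widens the receptive field without adding weights; the function names and the toy signal are illustrative assumptions.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """Valid (unpadded) 1D convolution with taps spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate  # distance between first and last tap
    return [
        sum(w * signal[start + i * rate] for i, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
print(atrous_conv1d(signal, kernel, rate=1))  # [6, 9, 12, 15, 18]
print(atrous_conv1d(signal, kernel, rate=2))  # [9, 12, 15]
print(effective_kernel_size(3, 2))            # 5
```

Running the same kernel at several rates and combining the results is one way to gather context at multiple scales.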

Question 11

Q

What is the Swin Transformer, and what is its innovation?

Answer

A

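The Swin Transformer builds hierarchical feature maps and computes self-attention within local windows that shift between layers, so the cost grows linearly with image size and the model scales to high-resolution vision tasks.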
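
A rough count of query-key pairs shows where the efficiency comes from. The sketch assumes a 56 x 56 token grid and 7 x 7 windows as example values, and ignores the shift itself, which changes which tokens share a window but not the count.

```python
def global_attention_pairs(height, width):
    """Every token attends to every token: quadratic in the token count."""
    tokens = height * width
    return tokens * tokens


def windowed_attention_pairs(height, width, window):
    """Every token attends only to the tokens inside its own window."""
    tokens = height * width
    return tokens * window * window


g = global_attention_pairs(56, 56)
w = windowed_attention_pairs(56, 56, 7)
print(g, w, g // w)  # 9834496 153664 64
```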

Question 12

Q

What is the difference between an AE (autoencoder) and a VAE (variational autoencoder)?

Answer

A

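A VAE extends the AE: instead of a single latent vector, the encoder outputs the parameters of a continuous probability distribution over the latent space (typically the mean and variance of a Gaussian).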

Question 13

Q

How can an image be reconstructed from the distribution of a VAE?

Answer

A

A latent vector is sampled from the distribution, and that sample is the input to the decoder.
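
The sampling step is commonly written with the reparameterization trick, z = mu + sigma * eps with eps drawn from a standard normal. The sketch below assumes the encoder returns a mean and a log-variance per latent dimension; the values are made up and the decoder is not shown.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps, one independent Gaussian per latent dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)  # sigma = exp(log_var / 2)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)  # seeded so the sketch is repeatable
mu = [0.5, -1.0]        # hypothetical encoder outputs
log_var = [0.0, -2.0]
z = sample_latent(mu, log_var, rng)
print(z)  # this vector would be passed to the decoder
```

Writing the sample this way keeps the randomness in eps, so gradients can flow through mu and sigma during training.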

Question 14

Q

Between which distributions is the KL divergence computed in a VAE?

Answer

A

Between the distribution whose parameters are output by the encoder and a standard normal prior with μ = 0 and σ = 1.
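
For a diagonal Gaussian this KL term has a closed form, 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2). The sketch assumes the encoder outputs a mean and a log-variance per dimension; the function name is illustrative.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(kl_to_standard_normal([1.0], [0.0]))            # 0.5
```

The term is zero only when the encoder's distribution already equals the standard normal, and it grows as the mean moves away from 0 or the variance away from 1.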

Question 15

Q

Why choose AE-based architectures over GANs?

Answer

A

GANs are mainly used to generate outputs from random noise; if we want to control the output, a VAE or VQ-VAE is better suited.

(15 cards)