Vision architectures Flashcards
What is the purpose of SSD (Single Shot MultiBox Detector) in vision tasks?
SSD is designed for real-time object detection, combining object localization and classification in a single forward pass.
What is R-CNN, and how does it work?
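R-CNN (Regions with CNN features, often expanded as Region-Based Convolutional Neural Network) is an object detector that generates region proposals, typically with selective search, extracts CNN features from each proposal, and classifies each region from those features (the original paper used per-class linear SVMs).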
What does YOLO (You Only Look Once) specialize in?
YOLO is a real-time object detection model that predicts bounding boxes and class probabilities directly from images in a single pass.
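As a rough illustration of what "a single pass" produces, the sketch below assumes the YOLOv1 layout: the image is divided into an S x S grid, and each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities. The function names are hypothetical, and later YOLO versions use different heads.

```python
def yolo_v1_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the prediction tensor for a YOLOv1-style head."""
    # Each box contributes 5 numbers: x, y, w, h, confidence.
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)


def responsible_cell(cx, cy, grid_size=7):
    """Grid cell (row, col) that contains a normalized box centre in [0, 1]."""
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col
```

With the defaults above (S=7, B=2, C=20), the whole image is described by one 7 x 7 x 30 tensor.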
What are Siamese Networks used for in vision tasks?
Siamese Networks are used for tasks like image similarity and verification by comparing embeddings of two input images.
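The comparison step can be sketched as follows, assuming the two embeddings have already been produced by the same shared encoder. Cosine similarity and the 0.8 threshold are illustrative choices, not something the card prescribes; in practice the threshold is tuned on validation pairs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_identity(emb_a, emb_b, threshold=0.8):
    # threshold is a hypothetical value chosen for this toy example
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings standing in for the encoder outputs of three images.
anchor = [1.0, 0.0, 1.0]
positive = [0.9, 0.1, 1.1]
negative = [-1.0, 1.0, 0.0]
```

Here `same_identity(anchor, positive)` accepts the pair, while `same_identity(anchor, negative)` rejects it.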
What is EfficientNet, and why is it popular?
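EfficientNet is a family of models built by compound scaling: network depth, width, and input resolution are scaled together with a single coefficient, which yields high accuracy with fewer parameters and FLOPs than comparably accurate CNNs.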
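A minimal sketch of the scaling rule, assuming the base multipliers reported in the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution) and a compound coefficient `phi`. The released B1 to B7 models use hand-rounded settings, so treat the numbers as illustrative.

```python
# Base multipliers for depth, width, and resolution (values from the paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2
```

Because ALPHA * BETA^2 * GAMMA^2 is close to 2, each unit increase in `phi` roughly doubles the FLOPs budget.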
What is Xception, and how does it improve upon traditional CNNs?
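Xception replaces Inception-style modules with depthwise separable convolutions, which factor a standard convolution into a per-channel spatial convolution and a 1x1 pointwise convolution, reducing computational cost while maintaining high performance.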
What is MobileNet designed for?
MobileNet is optimized for mobile and embedded vision applications, using depthwise separable convolutions for efficiency.
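To see where the efficiency of depthwise separable convolutions comes from, the sketch below compares weight counts for one layer. It ignores biases and batch-norm parameters, and the function names are illustrative.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise
```

For a 3x3 layer mapping 64 to 128 channels, this gives 73,728 weights for the standard convolution versus 8,768 for the separable version, roughly an 8x reduction.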
What is ViT (Vision Transformer), and what makes it unique?
ViT splits an image into fixed-size patches, linearly embeds each patch as a token, and feeds the token sequence to a standard transformer encoder, achieving high performance without convolutional layers.
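The patch step can be sketched in plain Python, assuming the image is a nested list of shape H x W x C. The learned linear projection and position embeddings that follow in a real ViT are omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append the C channel values
            patches.append(patch)
    return patches


# Toy 4x4 image with 3 channels; each pixel stores (row, col, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
```

The same arithmetic applied to a 224 x 224 RGB image with 16 x 16 patches gives 196 tokens, each of length 768 before projection.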
What is SegFormer used for in vision tasks?
SegFormer is a transformer-based model for semantic segmentation that pairs a hierarchical transformer encoder with a lightweight all-MLP decoder.
What is DeepLabv3, and what tasks is it used for?
DeepLabv3 is designed for semantic segmentation; it uses atrous (dilated) convolutions, including an Atrous Spatial Pyramid Pooling (ASPP) module with several dilation rates, to capture multi-scale contextual information.
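A one-dimensional sketch of an atrous convolution shows how the dilation rate widens the receptive field without adding weights. It uses a "valid" cross-correlation over a plain list; the helper names are illustrative.

```python
def effective_kernel_size(k, rate):
    """Span covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D cross-correlation with kernel taps spaced `rate` samples apart."""
    span = effective_kernel_size(len(kernel), rate)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


signal = list(range(8))
dense = atrous_conv1d(signal, [1, 1, 1], rate=1)    # each output sees 3 samples
dilated = atrous_conv1d(signal, [1, 1, 1], rate=2)  # each output spans 5 samples
```

Running the same 3-tap kernel at several rates in parallel and combining the results is the basic idea behind multi-scale context modules built from atrous convolutions.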
What is the Swin Transformer, and what is its innovation?
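The Swin Transformer builds hierarchical feature maps and computes self-attention inside local windows that shift between consecutive layers, so cost grows linearly with image size and the model can serve as a general-purpose backbone for classification, detection, and segmentation.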
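The window mechanics can be sketched on a small grid of token ids. This is a simplified illustration: the attention itself and the masking Swin applies inside shifted windows are left out, and the helper names are hypothetical.

```python
def window_partition(grid, window):
    """Split an H x W grid into non-overlapping window x window blocks."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[r][c] for r in range(top, top + window) for c in range(left, left + window)]
        for top in range(0, h, window)
        for left in range(0, w, window)
    ]


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)] for r in range(h)]


def attention_pairs(h, w, window=None):
    """Query-key pairs for global attention, or for window attention if window is set."""
    tokens = h * w
    return tokens * tokens if window is None else tokens * window * window


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(grid, 2)
shifted_windows = window_partition(cyclic_shift(grid, 1), 2)
```

After the shift, tokens that sat in different windows (for example 5, 6, 9, and 10) share a window, which lets information cross the previous window boundaries. On a 56 x 56 token map with 7 x 7 windows, `attention_pairs` drops from 9,834,496 to 153,664.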