L10: Object detection Flashcards
Linemod
❗️❗️❗️The modalities used by Linemod:
- Color features: local image gradients → take the gradient of whichever R/G/B channel has the largest magnitude at each pixel position (no greyscale conversion); see the sketch below.
- Depth features: local surface normal vectors, estimated from the point cloud / depth data (they have an orientation/direction).
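A minimal numpy sketch of the color-gradient step, assuming an HxWx3 float image; the function name and the central-difference gradient are illustrative choices, not Linemod's exact implementation:

```python
import numpy as np

def color_gradient(img):
    """Per-pixel gradient taken from whichever R/G/B channel has the largest
    gradient magnitude (no greyscale conversion). img: HxWx3 float array.
    Returns (magnitude, orientation in radians), both HxW."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1, :] = (img[:, 2:, :] - img[:, :-2, :]) / 2.0   # central differences
    gy[1:-1, :, :] = (img[2:, :, :] - img[:-2, :, :]) / 2.0
    mag = np.sqrt(gx ** 2 + gy ** 2)                # HxWx3 per-channel magnitudes
    best = np.argmax(mag, axis=2)                   # channel with the max gradient
    yy, xx = np.indices(best.shape)
    return mag.max(axis=2), np.arctan2(gy[yy, xx, best], gx[yy, xx, best])
```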
Linemod
❗️❗️❗️How features are quantized
The resulting directions are quantized into predefined orientation bins.
- Color gradients → bin the gradient direction into orientation bins spanning 0 to 180 degrees; negative (opposite) gradient directions are folded into the same range, so the sign of the gradient is ignored.
- Normal vectors → 3D orientations. A set of reference directions pointing out towards the camera is predefined; they lie inside a 3D cone, and each normal vector is binned to the nearest section of the cone.
Both feature types are normalized so that only the direction remains; instead of coordinates, every feature is stored as an integer bin index (see the sketch below).
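A small sketch of the binning step for color gradients, assuming 8 equal bins over 0-180 degrees (the bin count is an assumption here); normals would analogously get the index of the nearest precomputed cone direction:

```python
import numpy as np

def quantize_gradient(theta, n_bins=8):
    """Quantize a gradient orientation theta (radians) into one of n_bins
    equal sectors covering 0..180 degrees. The gradient sign is ignored,
    so a direction and its opposite end up in the same bin, and the feature
    is stored as a single integer index instead of coordinates."""
    deg = np.degrees(theta) % 180.0
    return int(deg // (180.0 / n_bins)) % n_bins
```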
Linemod
❗️❗️❗️The matching function for color/depth gradients, both pixel-wise and for an image window
A cosine (dot-product) similarity between template features and image features.
- Color domain: for each template feature, the absolute cosine of the angle between the template gradient and the image gradient, so anti-parallel gradients also count as a match (the contrast sign can flip).
- Depth domain: the same, but without the absolute value, so only normals pointing in the same direction give a positive score.
- Image window: the per-pixel similarities are summed over all template feature locations in the window (see the sketch below).
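A hedged sketch of the per-pixel and window scores, assuming orientations quantized into 8 bins over 180 degrees; dropping the absolute value would give the normal-vector variant where only parallel directions score:

```python
import numpy as np

BIN_WIDTH = np.pi / 8        # angular width of one bin (assuming 8 bins over 180°)

def pixel_score(template_bin, image_bin):
    """Pixel-wise color-gradient score: |cos(angle difference)|, so
    anti-parallel gradients still match."""
    return abs(np.cos((template_bin - image_bin) * BIN_WIDTH))

def window_score(template_bins, window_bins):
    """Score of a template placed on an image window: the per-pixel scores
    summed over all template feature locations (same-shape int arrays)."""
    return np.abs(np.cos((template_bins - window_bins) * BIN_WIDTH)).sum()
```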
Linemod
What is Linemod?
An object detection method belonging to the template-matching family.
Advantage of Linemod: it reduces the number of templates needed and speeds up matching by using multimodal templates. It can handle scale, viewpoint, and illumination changes.
- Plain template matching requires a lot of templates.
What levels are there?
Instance-level → Detect a specific instance of an object
Category-level → Detect an instance of a certain object type (like dog, fridge, oven, dining_table, etc)
Linemod
What is spreading and binarization?
Spreading copies each quantized orientation to its neighbouring pixels, which introduces a tolerance to small shifts and deformations. Binarization encodes the spread orientations as bitmasks, so matching can use precomputed responses and fast bitwise operations. Together they speed up the matching process (see the sketch below).
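A rough numpy sketch of the idea, with an assumed 8-bin encoding and a T x T spreading neighbourhood (border handling is simplified):

```python
import numpy as np

def spread_and_binarize(bin_map, T=3):
    """Encode each pixel's quantized orientation as a one-hot bitmask and
    OR it over a T x T neighbourhood ("spreading"). A template feature then
    matches if its orientation bit is set anywhere nearby, which gives
    tolerance to small shifts and enables fast bitwise matching.
    bin_map: HxW integer array with values in 0..7."""
    onehot = (1 << bin_map).astype(np.uint8)         # one bit per orientation bin
    spread = np.zeros_like(onehot)
    r = T // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):                  # OR in the shifted neighbours
            spread |= np.roll(np.roll(onehot, dy, axis=0), dx, axis=1)
    return spread

def feature_matches(spread_map, y, x, template_bin):
    """True if the template's orientation occurs near pixel (y, x)."""
    return bool(spread_map[y, x] & (1 << template_bin))
```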
Linemod
What is being matched in the gradients?
It compares quantized gradient features of the object template with the corresponding gradient features extracted from the input image.
- This can be done in both the color and depth domain.
CenterNet
What is CenterNet?
It is a category-level detector (it can also be used at instance level).
It is trained to predict bounding boxes around the detected objects.
- Tries to predict object centers plus sizes within the image.
CenterNet
What are anchor boxes?
A fixed set of predefined bounding-box shapes placed at a grid of pixel locations. The detector classifies each anchor and then regresses offsets that stretch and shift the box to fit the object it has classified.
- Anchor boxes are fixed initial bounding-box guesses.
CenterNet
❗️❗️❗️How 2D detections are parameterized, and how this is different from regular anchor-based detectors
CenterNet scans the image with stride R = 4. At each stride location the classifier predicts whether it is an object center; the object size is also predicted, and an offset correction compensates for the inaccuracy caused by the striding.
Regular anchor-based detectors count an anchor as positive when its overlap with the ground truth is IoU > 0.7; CenterNet instead assigns each object to the single cell that contains its center, so no anchor shapes or IoU thresholds are needed (see the sketch below).
Stride: means the detector moves over the image in steps of 4 pixels in both the vertical and horizontal direction.
IoU: Intersection over Union (the higher the better!)
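A small sketch contrasting the two parameterizations; the box coordinates and helper name are made up for illustration:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# Anchor-based detector: every anchor with iou(anchor, gt) > 0.7 is a positive.
# CenterNet: the object is assigned to the single stride-R cell holding its center.
R = 4
gt = (40.0, 60.0, 120.0, 140.0)                     # example ground-truth box
cx, cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
cell = (int(cx // R), int(cy // R))                 # low-resolution center location
offset = (cx / R - cell[0], cy / R - cell[1])       # target for the offset head
size = (gt[2] - gt[0], gt[3] - gt[1])               # target for the size head
```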
CenterNet
❗️❗️❗️How the three-term loss is built and what the terms mean
Training is guided by a three-term loss (see the sketch below):
1. Classification/"focal" loss → focuses training on the objects; helps with the overwhelming amount of background.
   - Penalizes wrong center predictions with a modified log-loss that down-weights easy, already well-classified locations.
2. Size loss → the network predicts the object size, and the loss compares it to the real size.
   - Penalizes the discrepancy between the predicted and true size (an L1 distance).
3. Offset loss → compensates for the downsampling caused by the stride R = 4. If predictions are scaled back up 4x, there is an offset from the ground-truth center position; the network actively predicts this offset (the sub-cell offset between the downscaled ground-truth center and its integer grid position).
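A minimal numpy sketch of the three terms, assuming the focal-loss exponents (2 and 4) and term weights from the CenterNet paper; names and shapes here are illustrative:

```python
import numpy as np

def centernet_loss(heat_pred, heat_gt, size_pred, size_gt, off_pred, off_gt,
                   lam_size=0.1, lam_off=1.0):
    """heat_*: HxWxC center heatmaps in (0, 1); size_*/off_*: Nx2 values
    gathered at the N ground-truth center cells."""
    eps = 1e-6
    pos = heat_gt == 1
    n = max(int(pos.sum()), 1)
    # 1. Focal (modified log) loss on the center heatmap
    pos_loss = -((1 - heat_pred[pos]) ** 2 * np.log(heat_pred[pos] + eps)).sum()
    neg_loss = -((1 - heat_gt[~pos]) ** 4 * heat_pred[~pos] ** 2
                 * np.log(1 - heat_pred[~pos] + eps)).sum()
    cls_loss = (pos_loss + neg_loss) / n
    # 2. L1 loss between predicted and true width/height at each center
    size_loss = np.abs(size_pred - size_gt).sum() / n
    # 3. L1 loss on the sub-cell offset caused by the stride-R downsampling
    off_loss = np.abs(off_pred - off_gt).sum() / n
    return cls_loss + lam_size * size_loss + lam_off * off_loss
```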
CenterNet
❗️❗️❗️How the prediction output is converted back to a bounding box in the full image
The box is recovered from the predicted center, size, and offset: the center cell plus its offset is scaled back up by the stride R, and the predicted width/height is placed around it. Nearby pixels can also get classified as object centers, so an 8-neighbour non-maximum suppression (keeping only local peaks in the heatmap) removes overlapping center predictions (see the sketch below).
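A sketch of the decoding step, assuming single-class HxW float maps; the 3x3 local-max trick stands in for the 8-neighbour non-maximum suppression:

```python
import numpy as np

def decode_top_box(heatmap, size_map, offset_map, R=4):
    """Turn the highest-scoring surviving center into a full-image box.
    heatmap: HxW scores; size_map / offset_map: HxWx2."""
    H, W = heatmap.shape
    pad = np.pad(heatmap, 1, constant_values=-np.inf)
    # keep only cells that equal the max of their 3x3 (8-neighbour) window
    local_max = np.max([pad[dy:dy + H, dx:dx + W]
                        for dy in range(3) for dx in range(3)], axis=0)
    peaks = np.where(heatmap == local_max, heatmap, 0.0)
    y, x = np.unravel_index(np.argmax(peaks), peaks.shape)
    ox, oy = offset_map[y, x]
    w, h = size_map[y, x]
    cx, cy = (x + ox) * R, (y + oy) * R              # back to full-image pixels
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), float(heatmap[y, x])
```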
CenterNet
❗️❗️❗️How to repurpose CenterNet to other tasks, e.g. 3D detection and human body pose estimation by joint positions
It can predict more than just a 2D box around each center point:
- 3D boxes → replace/add regression heads for the new quantities (e.g. depth, 3D dimensions, orientation).
- Human body pose estimation → regress the locations of a fixed number of joints relative to each detected center (see the sketch below).
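A toy configuration sketch of how the output heads change per task; the head names and channel counts are assumptions for illustration:

```python
num_classes, num_joints = 80, 17     # e.g. COCO categories and COCO keypoints

# Output channels predicted at every center location, per task
heads_2d_box = {"heatmap": num_classes, "size": 2, "offset": 2}
heads_3d_box = {"heatmap": num_classes, "depth": 1, "dimensions": 3,
                "orientation": 8, "offset": 2}
heads_pose   = {"center_heatmap": 1, "joint_offsets": 2 * num_joints, "offset": 2}
```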