Object-wise metric distance estimation from a single RGB image via semantic and geometric reasoning
Date
2026
Authors
Sultana, Abida
Abstract
Estimating the metric distance to an object from a single RGB image is challenging because monocular depth prediction does not provide absolute scale. Existing solutions either require additional sensing hardware such as LiDAR or stereo cameras, rely on monocular depth that remains scale-ambiguous, or use implicit vision-language reasoning that can be unstable for precise measurement. This thesis proposes a semantic–geometric pipeline that recovers metric scale by combining open-vocabulary object grounding and segmentation, label normalization, monocular depth, and camera cues. Object-centric 3D points are reconstructed from the predicted depth, an oriented 3D bounding box is fitted to them to estimate object dimensions, and real-world size priors are used to compute a scale factor that converts relative depth into absolute distance. The proposed method is evaluated on HOT3D, ScanNet, ARKitScenes, and a custom iPhone dataset, achieving Multi-Threshold Relative Accuracy (MRA) values of 68.85%, 88.30%, 75.12%, and 89.85%, respectively, under the per-frame average mean-distance strategy. The results show that frame-level averaging improves stability by reducing the influence of instance-level outliers. The main limitations of the approach are its dependence on segmentation and depth quality, its sensitivity to canonical size priors for categories with high size variation, possible instability under occlusion or truncation, and relatively high processing time. Future work includes more robust scale estimation, adaptive size priors, improved object fitting, the use of consecutive frames for temporal consistency, and pipeline optimization for lower latency.
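The scale-recovery step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `metric_scale` and `metric_distance` are hypothetical, a PCA-aligned box stands in for the oriented 3D bounding box fit, the median of per-axis ratios is one assumed way to combine the dimensions, and relative depth is assumed to differ from metric depth by a single scale factor only.

```python
import numpy as np


def metric_scale(points_rel, prior_dims_m):
    """Estimate a scale factor from relative units to metres (hypothetical helper).

    points_rel   : (N, 3) object points reconstructed from relative depth.
    prior_dims_m : three expected real-world dimensions of the object in metres.
    """
    points_rel = np.asarray(points_rel, dtype=float)
    centered = points_rel - points_rel.mean(axis=0)

    # Assumption: principal axes from PCA approximate the oriented box axes.
    _, _, axes = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ axes.T
    extents = projected.max(axis=0) - projected.min(axis=0)

    # Match the largest estimated extent to the largest prior dimension, etc.
    est = np.sort(extents)[::-1]
    prior = np.sort(np.asarray(prior_dims_m, dtype=float))[::-1]

    # Assumption: the median of the per-axis ratios is used as the scale.
    return float(np.median(prior / est))


def metric_distance(points_rel, scale):
    """Distance in metres from the camera origin to the object centroid."""
    centroid = np.asarray(points_rel, dtype=float).mean(axis=0)
    return float(scale * np.linalg.norm(centroid))
```

In this sketch the prior dimensions would come from a per-category size table, so a category with high size variation yields a correspondingly uncertain scale, which matches the limitation noted in the abstract.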
Keywords
monocular distance estimation, semantic–geometric pipeline, 3D distance, depth estimation, metric scale recovery