Efficient image based localization using machine learning techniques

Date

2021-04-23

Authors

Elmougi, Ahmed

Abstract

Localization is critical for the self-awareness of any autonomous system and is an important part of the autonomous system stack, which consists of several phases including sensing, perceiving, planning and control. In the sensing phase, data from onboard sensors are collected, preprocessed and passed to the next phase. The perceiving phase is responsible for self-awareness, or localization, and for situational awareness, which includes multi-object detection and scene understanding. Once the autonomous system knows where it is and what is around it, it can use this knowledge to plan a path and send control commands to follow that path. In this proposal, we focus on the localization part of the autonomous stack using camera images. We approach the localization problem from different perspectives, including single images and videos. Starting with single-image pose estimation, our aim is to propose systems that not only have good localization accuracy but also have low space and time complexity. First, we propose SurfCNN, a low-cost indoor localization system that uses SURF descriptors instead of the original images to reduce the complexity of training convolutional neural networks (CNN) for indoor localization. Given a single input image, the descriptors of its strongest SURF features are used as input to 5 convolutional layers to find the image's absolute position and orientation in an arbitrary reference frame. The proposed system achieves performance comparable to the state of the art using only 300 features, without the need for the full image or complex neural network architectures. Next, we propose SURF-LSTM, an extension of the idea of using SURF descriptors instead of the original images. However, instead of the CNN used in SurfCNN, we use a long short-term memory (LSTM) network, a type of recurrent neural network (RNN), to extract the sequential relation between SURF descriptors. 
Using SURF-LSTM, we need only 50 features to reach comparable or better results than SurfCNN, which needs 300 features, and than other works that use full images with large neural networks. In the following research phase, instead of using SURF descriptors as image features to reduce the training complexity, we study the effect of using features extracted from CNN models that were pretrained on other image tasks, such as image classification, without further training or fine-tuning. To learn the pose from pretrained features, graph neural networks (GNN) are adopted to solve the single-image localization problem (Pose-GNN) by using these feature representations either as features of nodes in a graph (image as a node) or converted into a graph (image as a graph). The proposed models outperform state-of-the-art methods on an indoor localization dataset and have comparable performance for outdoor scenes. In the final stage of the single-image pose estimation research, we study whether good localization results can be achieved without training a complex neural network. We propose Linear-PoseNet, which achieves results similar to other neural-network-based methods by training a single linear regression layer on image features from a pretrained ResNet50 in less than one second on a CPU. Moreover, for outdoor scenes, we propose Dense-PoseNet, which has only 3 fully connected layers, trains in a few minutes, and reaches performance comparable to more complex methods. The second localization perspective is to find the relative poses between images in a video instead of absolute poses. We extend the idea used in the SurfCNN and SURF-LSTM systems and use SURF descriptors as the feature representation of the images in the video. Two systems are proposed to find the relative poses between images in the video, using 3D-CNN and 2DCNN-RNN. We show that 3D-CNN performs better than the CNN-RNN combination for relative pose estimation.
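
The idea behind Linear-PoseNet, regressing a pose from fixed pretrained image features with a single linear layer, can be illustrated with a minimal sketch. This is not the author's implementation. It assumes a 7-dimensional pose target (3-D position plus a unit quaternion), fits the layer by closed-form ridge regression rather than by whatever training procedure the thesis uses, and replaces real ResNet50 features with small synthetic vectors so that it runs without any model download. The names fit_linear_pose and predict_pose are hypothetical.

```python
import numpy as np


def fit_linear_pose(features, poses, ridge=1e-3):
    """Fit one linear layer (weights plus bias) mapping features to poses.

    Assumption: closed-form ridge regression stands in for training a
    single linear regression layer. The bias row is regularized too.
    """
    n, d = features.shape
    X = np.hstack([features, np.ones((n, 1))])  # append a bias column
    A = X.T @ X + ridge * np.eye(d + 1)
    return np.linalg.solve(A, X.T @ poses)      # shape (d + 1, pose_dim)


def predict_pose(W, features):
    """Predict position (x, y, z) and a unit quaternion for each feature row."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    out = X @ W
    position, quat = out[:, :3], out[:, 3:]
    # A linear layer does not produce unit quaternions, so normalize them.
    quat = quat / np.linalg.norm(quat, axis=1, keepdims=True)
    return position, quat


# Synthetic stand-in data: 64-D vectors in place of pretrained CNN features
# (assumption; pooled ResNet50 features would be 2048-D).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))
true_W = rng.normal(size=(65, 7))
poses = np.hstack([feats, np.ones((200, 1))]) @ true_W  # 3 position + 4 orientation

W = fit_linear_pose(feats, poses, ridge=1e-8)
pos, quat = predict_pose(W, feats)
```

Because the only trained component is a single weight matrix solved in one step, fitting cost depends mainly on the feature dimension and the number of training images, which is in line with the sub-second CPU training time reported in the abstract.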

Keywords

SLAM, deep learning, graph neural networks, convolutional neural networks, recurrent neural networks, computer vision
