

Ming Li et al., Geo-spatial Information Science (GSIS), 2022-07-17

This article was published in Geo-spatial Information Science (GSIS).


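The rise of deep learning has further spurred researchers' exploration of this field. Inspired by it, Alex Kendall and colleagues at the University of Cambridge were the first to propose PoseNet, a method that uses a convolutional neural network (CNN) to solve for the camera pose in real time from a single image.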



VNLSTM-PoseNet: A novel deep ConvNet for real-time 6-DOF camera relocalization in urban streets

Published in Geo-spatial Information Science (GSIS).


Citation /

Ming Li, Jiangying Qin, Deren Li, Ruizhi Chen, Xuan Liao & Bingxuan Guo (2021) VNLSTM-PoseNet: A novel deep ConvNet for real-time 6-DOF camera relocalization in urban streets, Geo-Spatial Information Science, 24:3, 422-437, DOI:10.1080/10095020.2021.1960779

Main research content /

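  • This paper proposes a new deep convolutional neural network architecture that addresses the great challenge of image-based camera relocalization in urban street scenes from RGB images alone, rather than requiring local feature points to be precomputed and a true dense 3D model of the scene to be built, as traditional matching-based visual relocalization techniques do.

  • In the proposed VNLSTM-PoseNet method, the training images are first cropped with a new clipping method to obtain a larger receptive field, so that more key image information is captured and localization accuracy improves. Next, an LSTM structure is introduced into the PoseNet network to perform structured dimensionality reduction on the fully connected layer and to select the most useful relevant features for real-time camera pose regression. Finally, to obtain more suitable hyperparameters for the deep convolutional network, the network is optimized with the Nadam optimizer under the PyTorch framework.

  • A systematic experimental evaluation on two existing public outdoor datasets shows that, compared with the other PoseNet-based methods tested in the paper, VNLSTM-PoseNet significantly improves localization performance and achieves a localization accuracy of better than 0.9 m on the OldHospital dataset.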

/ Frontier viewpoints /


Image-based camera relocalization is a basic problem in many computer vision applications, such as autonomous vehicle driving, mobile robots, Augmented Reality (AR), pedestrian visual positioning, and Structure from Motion (SfM) (Li et al. 2020b; Tateno et al. 2017; Asadi et al. 2019; Liu et al. 2020; Acharya et al. 2019a; Niu et al. 2019).


In 2015, Kendall, Grimes, and Cipolla (2015) introduced Convolutional Neural Networks (CNNs) into the field of image-based camera positioning and proposed the PoseNet method. PoseNet uses transfer learning from large-scale classification data to obtain the 6-DOF camera pose directly from a single image in an end-to-end manner. Compared with traditional approaches, namely geometric positioning based on local features and image retrieval built on bag-of-words vectors and random forests, it significantly improves robustness and efficiency.
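For reference, the original PoseNet paper trains this end-to-end regression with a loss that adds the Euclidean position error to a weighted quaternion error, with the ground-truth quaternion normalized to unit length. The sketch below is a plain-Python illustration for a single image. The function names, the example poses, and the default value of `beta` are assumptions made for illustration; the article does not state which loss weighting VNLSTM-PoseNet uses.

```python
import math


def unit_quaternion(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def posenet_loss(x_true, q_true, x_pred, q_pred, beta=500.0):
    """Single-image PoseNet-style loss.

    loss = ||x_pred - x_true||_2 + beta * ||q_pred - q_true / ||q_true||||_2

    beta balances metres against quaternion units; it is a scene-dependent
    hyperparameter, and 500.0 here is only a placeholder (assumption).
    """
    position_term = math.dist(x_pred, x_true)
    orientation_term = math.dist(q_pred, unit_quaternion(q_true))
    return position_term + beta * orientation_term


# Hypothetical example: a 5 m position offset and a perfect orientation.
example_loss = posenet_loss(
    x_true=(0.0, 0.0, 0.0),
    q_true=(2.0, 0.0, 0.0, 0.0),  # normalized internally to (1, 0, 0, 0)
    x_pred=(3.0, 4.0, 0.0),
    q_pred=(1.0, 0.0, 0.0, 0.0),
)
```

Because position is measured in metres and the quaternion difference is bounded, the weight `beta` decides which of the two terms dominates training.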


Although PoseNet overcomes many limitations of existing methods, in particular by reducing the dependence on rich textures and improving the robustness and efficiency of localization, its localization accuracy is still far behind that of geometry-based visual relocalization methods in scenes where local features perform well.


By improving the clipping method applied to the input image, the network obtains a larger receptive field and therefore more feature information for image-based positioning.
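To see why the cropping step matters, it helps to quantify how much of a frame survives preprocessing. The sketch below assumes the preprocessing described in the original PoseNet paper (rescale so the shorter side is 256 pixels, then take a 224 × 224 center crop) and an assumed 1920 × 1080 input frame. The second call uses a hypothetical alternative setting purely to show that the kept area changes; it is not the clipping method of VNLSTM-PoseNet, which this article does not specify.

```python
def center_crop_coverage(width, height, short_side=256, crop=224):
    """Fraction of the original image area kept by resize-then-center-crop.

    The image is rescaled so that its shorter side equals `short_side`,
    then a `crop` x `crop` window is cut from the center.
    """
    scale = short_side / min(width, height)
    resized_w = round(width * scale)
    resized_h = round(height * scale)
    return (crop * crop) / (resized_w * resized_h)


# PoseNet-style preprocessing on an assumed 1920 x 1080 frame.
baseline = center_crop_coverage(1920, 1080)

# Hypothetical alternative: rescale the shorter side to the crop size itself,
# so the crop spans the full image height.
wider = center_crop_coverage(1920, 1080, short_side=224, crop=224)

print(f"baseline keeps {baseline:.0%} of the frame, alternative keeps {wider:.0%}")
```

Under these assumptions the baseline crop discards more than half of the frame, which is the kind of lost context a revised clipping method can recover.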


Based on the PyTorch framework, the Nadam optimizer is used to optimize the network and obtain more suitable network parameters. An LSTM structure is introduced into the PoseNet network to perform structured dimensionality reduction on the Fully Connected (FC) layer and to select the most useful relevant features for the camera relocalization task. Experiments show that the proposed method is more accurate and more robust than PoseNet.
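Nadam combines Adam's per-parameter step scaling with a Nesterov-style look-ahead on the momentum term. The sketch below is a simplified scalar version in plain Python, applied to a toy quadratic rather than a network. It omits the momentum-decay schedule that PyTorch's `torch.optim.NAdam` applies, and the learning rate, step count, and test function are assumptions chosen for illustration, not the paper's settings.

```python
import math


def nadam_minimise(grad_fn, theta, lr=0.05, beta1=0.9, beta2=0.999,
                   eps=1e-8, steps=1000):
    """Minimise a scalar function with a simplified Nadam update."""
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        # Nesterov look-ahead: blend the corrected momentum with the
        # bias-corrected current gradient.
        look_ahead = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
        theta -= lr * look_ahead / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_star = nadam_minimise(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(f"estimated minimiser: {theta_star:.4f}")
```

In a real training script the same role would be played by the framework's optimizer object stepping over all network weights.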

The position errors of LSTM-PoseNet are also greatly improved compared with PoseNet: for most images the position error is within 10 m and the orientation error is within 20 degrees. Nadam-PoseNet has only a few images with position errors between 15 m and 20 m, and its orientation errors are within 18 degrees. For VNLSTM-PoseNet, the maximum position and orientation errors are only slightly larger than 15 m and 15 degrees, and very few images reach those values; both position accuracy and orientation accuracy are greatly improved compared with PoseNet.
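The article does not say how these errors are computed. A common convention in PoseNet-style evaluations, assumed here, is the Euclidean distance between camera positions and the rotation angle between the true and estimated unit quaternions. The sketch below follows that convention; the function names and sample values are hypothetical and are not results from the paper.

```python
import math


def position_error(t_true, t_pred):
    """Euclidean distance between two camera positions, in metres."""
    return math.dist(t_true, t_pred)


def orientation_error_deg(q_true, q_pred):
    """Angle in degrees between two rotations given as (w, x, y, z)."""
    norm_true = math.sqrt(sum(c * c for c in q_true))
    norm_pred = math.sqrt(sum(c * c for c in q_pred))
    # abs() treats q and -q as the same rotation; min() guards acos
    # against values marginally above 1 caused by rounding.
    dot = abs(sum(a * b for a, b in zip(q_true, q_pred)))
    cos_half = min(1.0, dot / (norm_true * norm_pred))
    return math.degrees(2.0 * math.acos(cos_half))


def fraction_within(errors, threshold):
    """Share of query images whose error does not exceed the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


# Hypothetical sample values, not taken from the paper.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = orientation_error_deg(identity, quarter_turn_z)
share = fraction_within([1.0, 5.0, 12.0, 18.0], threshold=10.0)
```

Evaluating `fraction_within` over a range of thresholds gives the points of a cumulative error histogram.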


A systematic experimental evaluation on two existing outdoor datasets shows that VNLSTM-PoseNet leads to drastic improvements in positioning performance compared with other PoseNet-based methods, achieving a localization accuracy of better than 0.9 m on the Old Hospital dataset.



Besides aiming to close the accuracy gap with local feature matching-based image localization, the method has a clear advantage in robustness and efficiency.

The localization errors can certainly be affected by those challenging scenarios. Alternatively, the errors could be an effect of the features learnt by the deep ConvNet for localization. In future work, we will conduct more in-depth research into how these problems are correlated, and introduce more constraints and information to improve the accuracy of camera pose regression based on convolutional neural networks.



Ming Li is an associate professor at Wuhan University and a postdoctoral research fellow at ETH Zürich. His main research interests are the principles and methods of machine learning, photogrammetric computer vision, robotics, and underwater photogrammetry and remote sensing.



Jiangying Qin is a master of Wuhan University. Her main interests are machine learning and photogrammetric computer vision, especially in indoor positioning and navigation based on geometry and machine learning.



Deren Li is an academician of the Chinese Academy of Sciences and the Chinese Academy of Engineering. His main research covers theoretical innovation, integrated innovation and collaborative innovation in geospatial informatics.



Ruizhi Chen is a professor of Wuhan University, and his research interests include ubiquitous positioning of smart phones and satellite navigation.



Xuan Liao received her master’s degree from Wuhan University of Photogrammetry and Remote Sensing in 2020. She is currently a research assistant and doctoral student at Hong Kong Polytechnic University. Her current research focuses on global solar computing based on remote sensing and space information technology, deep learning, and change detection.



Bingxuan Guo is a professor of Wuhan University, mainly engaged in digital photogrammetry, computer vision, graphics and imaging, indoor positioning and artificial intelligence.


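Figure 1. Architecture of the proposed pose regression network

Figure 2. Example of PoseNet image preprocessing

Figure 3. Structure of GoogLeNet

Figure 4. Structure of PoseNet

Figure 5. Effect of the improved image preprocessing

Figure 6. Field-of-view comparison between PoseNet and the proposed method

Figure 7. Structure of the deep ConvNet

Figure 8. Example images from the experimental datasets: (a) well-lit church image; (b) poorly lit church image; (c) wide-view image of a building; (d) close-up, partial-view image of a building

Table 1. Localization results of several methods

Figure 9. Training loss at different epochs: (a) position error at different epochs; (b) orientation error at different epochs

Figure 10. Scatter plots of the localization errors of the query images: (a) PoseNet; (b) Bv-PoseNet; (c) LSTM-PoseNet; (d) Nadam-PoseNet; (e) the proposed method

Figure 11. Cumulative histograms of the localization errors: (a) position error; (b) orientation error

Figure 12. Comparison of ground-truth and estimated poses: (a) PoseNet; (b) VNLSTM-PoseNet

Figure 13. Visualization of the errors along different parts of the trajectory: (a) PoseNet; (b) VNLSTM-PoseNet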



Translation: Wang Haotian (王浩天)     Production: Wang Haotian (王浩天)

Editing: Wang Xiaozui (王晓醉)     Review: Zhang Shujuan (张淑娟)


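Geo-spatial Information Science (GSIS) is an English-language journal on surveying, mapping and remote sensing hosted by Wuhan University. Its editor-in-chief is Professor Deren Li, academician of the Chinese Academy of Sciences and the Chinese Academy of Engineering. The journal was indexed in SCIE in September 2020 (2020 impact factor: 4.288; CiteScore 2020: 7.4).

GSIS is an open access (OA) journal: once an article is published, readers worldwide can immediately download the full text free of charge, which gives your article greater exposure. If you have a high-quality article for which you need to secure priority of publication, contact us at gsis@whu.edu.cn. The editor-in-chief or an international associate editor will handle it personally, and the editorial office will answer questions and track the manuscript status at any time.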





