A Brief Look at Camera Motion in Multi-Object Tracking
©PaperWeekly Original · Author | Huang Piao
School | Master's student, Huazhong University of Science and Technology
Research focus | Multi-object tracking
In a previous article I introduced the Kalman filter, an algorithm widely used as the pedestrian motion model in multi-object tracking. Real scenes, however, often involve significant camera motion, so relying on the pedestrian motion model alone is not enough. This time I will introduce camera motion models, focusing on epipolar geometry and ECC. The complete code and examples are on GitHub:
https://github.com/nightmaredimple/libmot
Take, for example, sequence MOT17-13, where a vehicle-mounted camera observes two slowly moving pedestrians while the vehicle turns:
As the illustration shows, because the vehicle turns quickly, a pedestrian's position in the previous frame maps to a very different position in the next frame. Camera motion therefore has a large impact on multi-object tracking; models that rely solely on motion cues are disturbed especially badly.
7. Epipolar constraint: the correspondence between points on the two epipolar lines.
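Written out (this is the same relation given in the docstring of GetFundamentalMat below): for a matched pair of pixels $p_1$ and $p_2$ in the two views, with camera intrinsics $K$ and essential matrix $E$,

$$[p_2; 1]^{\top} K^{-\top} E K^{-1} [p_1; 1] = [p_2; 1]^{\top} F\, [p_1; 1] = 0, \qquad F = K^{-\top} E K^{-1}.$$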
Next, suppose the fundamental matrix F has already been estimated; it is a 3x3 matrix. To make the derivation easier, I introduce one assumption:
For every corner of an arbitrary bounding box in frame t, since we are working with three-dimensional geometric information, we append a z coordinate, making each corner a known three-dimensional (homogeneous) vector. One bounding box therefore yields four such vectors, which can be stacked into a 4x3 matrix (the matrix of box corners, A, in the code below).
The objective function can then be expanded; here (w, h) are known, and (x, y) is the top-left corner of the box in the next frame:
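The expanded formulas in the original post were images and are not reproduced; reconstructing them from the Method section of EstimateBox below (my notation, so treat this as a sketch): let $A$ be the 4x3 matrix stacking the homogeneous corners $A_i$ of the frame-$t$ box, let $B_i$ be the corresponding corners of the sought frame-$(t+1)$ box with the same $w, h$ and unknown top-left $(x, y)$, and let $M = A F^{\top}$. The epipolar constraints $B_i^{\top} F A_i = 0$ then read

$$\begin{aligned} M_{11}x + M_{12}y + M_{13} &= 0\\ M_{21}(x+w) + M_{22}y + M_{23} &= 0\\ M_{31}x + M_{32}(y+h) + M_{33} &= 0\\ M_{41}(x+w) + M_{42}(y+h) + M_{43} &= 0 \end{aligned}$$

Collecting the unknowns, the first two columns of $M$ multiply $[x; y]$ and everything else moves to the right-hand side: $M_{[:,1:2]}\,[x; y] = -M_{[:,3]} - [0;\ M_{21}w;\ M_{32}h;\ M_{41}w + M_{42}h]$.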
This is clearly a standard Ax = b problem, and the rest follows directly.
2.2 Experimental Analysis
import cv2
import numpy as np


class Epipolar(object):
    def __init__(self, feature_method = 'orb', match_method = 'brute force',
                 metric = cv2.NORM_HAMMING, n_points = 50, nfeatures = 500,
                 scaleFactor = 1.2, nlevels = 8):
        """Using Epipolar Geometry to Estimate Camera Motion
        Parameters
        ----------
        feature_method : str
            the method of feature extraction, the default is ORB, more methods will be added in the future
        match_method : str
            the method of feature matching, the default is brute force, more methods will be added in the future
        metric: metrics in cv2
            distance metric for feature matching
        n_points: int
            number of matched points to be considered
        nfeatures: int
            number of features to extract
        scaleFactor: float
            scale factor for orb
        nlevels: int
            pyramid levels for orb
        """
        self.metric = metric
        if feature_method == 'orb':
            self.feature_extractor = cv2.ORB_create(nfeatures = nfeatures,
                                                    scaleFactor = scaleFactor, nlevels = nlevels)
        if match_method == 'brute force':
            self.matcher = cv2.BFMatcher(metric, crossCheck=True)
        self.n_points = n_points

    def FeatureExtract(self, img):
        """Detect and Compute the input image's keypoints and descriptors
        Parameters
        ----------
        img : ndarray of opencv
            An HxW(x3) matrix of img
        Returns
        -------
        keypoints : List of cv2.KeyPoint
            using keypoint.pt can see (x,y); keypoints for which a descriptor cannot be computed are removed
        descriptors : ndarray
            An Nx32 ndarray of uint8 when using the "orb" method
        """
        if img.ndim == 3:
            img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # find the keypoints with ORB
        keypoints = self.feature_extractor.detect(img, None)
        # compute the descriptors with ORB
        keypoints, descriptors = self.feature_extractor.compute(img, keypoints)
        return keypoints, descriptors
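As a quick sanity check, here is a minimal usage sketch of the class so far (not from the original post; the frame paths are placeholders for two consecutive frames of an MOT sequence):

import cv2

# hypothetical paths to two consecutive frames; replace with your own data
img1 = cv2.imread('img/000001.jpg')
img2 = cv2.imread('img/000002.jpg')

epipolar = Epipolar(feature_method='orb', match_method='brute force')
keypoints1, descriptors1 = epipolar.FeatureExtract(img1)
keypoints2, descriptors2 = epipolar.FeatureExtract(img2)
print(len(keypoints1), descriptors1.shape)   # e.g. up to 500 keypoints, an (N, 32) uint8 array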
Estimating the fundamental matrix and applying least squares can both be done directly with the existing toolboxes OpenCV and NumPy:
def GetFundamentalMat(self, keypoints1, descriptors1, keypoints2, descriptors2):
    """Estimate the Fundamental Matrix using the BF matcher and RANSAC
    [p2;1]^T K^(-T) E K^(-1) [p1;1] = 0, T means transpose, K means the intrinsic matrix of camera
    F = K^(-T) E K^(-1)
    Parameters
    ----------
    keypoints : List of cv2.KeyPoint
        using keypoint.pt can see (x,y)
    descriptors : ndarray
        An Nx32 matrix of descriptors
    Returns
    -------
    F: ndarray
        A 3x3 Fundamental Matrix
    mask: ndarray
        An Nx1 matrix marking the inlier points
    pts1: ndarray
        matched keypoint coordinates in the first image
    pts2: ndarray
        matched keypoint coordinates in the second image
    matches : List of matches
        distance - distance between the two descriptors,
        queryIdx - descriptor index in the query (first) image
        trainIdx - descriptor index in the train (second) image
        imgIdx - train image's id, default is 0
    """
    # match descriptors between the two images and keep the closest pairs
    matches = self.matcher.match(descriptors1, descriptors2)
    matches = sorted(matches, key=lambda x: x.distance)
    pts1 = []
    pts2 = []
    for i, match in enumerate(matches):
        if i >= self.n_points:
            break
        pts1.append(keypoints1[match.queryIdx].pt)
        pts2.append(keypoints2[match.trainIdx].pt)
    pts1 = np.int32(pts1)
    pts2 = np.int32(pts2)
    matches = matches[:self.n_points]
    # Estimate the Fundamental Matrix by RANSAC, distance_threshold = 1, confidence_threshold = 0.99
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1, 0.99)
    return F, mask, pts1, pts2, matches
def EstimateBox(self, boxes, F):
    """Estimate boxes in the target image by the Fundamental Matrix
    Parameters
    ----------
    boxes : array like
        An Nx4 matrix of boxes in the source image (x,y,w,h)
    F : ndarray
        A 3x3 Fundamental Matrix
    Returns
    -------
    aligned_boxes: ndarray
        An Nx4 matrix of boxes in the target image (x,y,w,h)
    Method
    -------
    L = ||Bi^T F Ai||^2 + ||(Ai - A0) - (Bi - B0)||^2
    A are the four corners of the box in the source image
    B are the four corners of the aligned box in the target image
    A0,B0: top-left corner of the box, [x;y;1]
    A1,B1: top-right corner of the box
    A2,B2: bottom-left corner of the box
    A3,B3: bottom-right corner of the box
    the height and width of the boxes and the aligned boxes are assumed to be the same
    we can use a greedy strategy: let M = A^T F^T, then
        M11 x1 + M12 y1 + M13 = 0
        M21 (x1+w) + M22 y1 + M23 = 0
        M31 x1 + M32 (y1+h) + M33 = 0
        M41 (x1+w) + M42 (y1+h) + M43 = 0
    =>
        M[:, :2] [x1; y1] + M[:, 2] + [0; M21 w; M32 h; M41 w + M42 h] = 0  ->  Ax = b
        x = (pseudo inverse of A) b
    """
    boxes = np.asarray(boxes)
    if boxes.ndim == 1:
        boxes = boxes[np.newaxis, :]
    aligned_boxes = np.zeros(boxes.shape)
    for i, bbox in enumerate(boxes):
        w = bbox[2]
        h = bbox[3]
        # homogeneous coordinates of the four corners of the source box
        AT = np.array([[bbox[0]    , bbox[1]    , 1],
                       [bbox[0] + w, bbox[1]    , 1],
                       [bbox[0]    , bbox[1] + h, 1],
                       [bbox[0] + w, bbox[1] + h, 1]])
        M = AT @ F.T
        b = -M[:, 2] - np.array([0, M[1][0]*w, M[2][1]*h, M[3][0]*w + M[3][1]*h])
        # least-squares solve for the top-left corner of the aligned box
        aligned_tl = np.linalg.pinv(M[:, :2]) @ b
        aligned_boxes[i, 0] = aligned_tl[0]
        aligned_boxes[i, 1] = aligned_tl[1]
        aligned_boxes[i, 2] = w
        aligned_boxes[i, 3] = h
    return aligned_boxes.astype(np.int32)
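Putting the pieces together, here is a minimal sketch (not from the original post) of predicting where the boxes of frame t land in frame t+1, assuming the methods above live in the Epipolar class and img1, img2 are the two frames loaded as in the earlier sketch; the example box is a placeholder:

epipolar = Epipolar()
keypoints1, descriptors1 = epipolar.FeatureExtract(img1)
keypoints2, descriptors2 = epipolar.FeatureExtract(img2)
F, mask, pts1, pts2, matches = epipolar.GetFundamentalMat(keypoints1, descriptors1,
                                                          keypoints2, descriptors2)

boxes_t = [[100, 150, 60, 120]]               # (x, y, w, h) of a track in frame t
boxes_t1 = epipolar.EstimateBox(boxes_t, F)   # estimated positions in frame t+1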
The result looks like this:
The epipolar-geometry approach of Chapter 2 estimates three-dimensional information from purely two-dimensional observations, so it inevitably carries errors. In this chapter I introduce another simple and effective option: the "affine transformation". It is not quite the affine transformation in the usual sense; I will walk through the details step by step.
The goal is to align two images whose content differs little but which are affected by changes in illumination, scale, color, translation and so on. At its core the ECC algorithm is an objective function:
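The formula image from the original post is not reproduced here. For reference, the standard criterion from the ECC paper (which OpenCV's findTransformECC implements; it may differ slightly from the exact form the author showed): with $\hat{y}$ the zero-mean, vectorized target image and $\hat{w}(p)$ the zero-mean, vectorized source image warped by parameters $p$, ECC maximizes the correlation coefficient

$$\rho(p) = \frac{\hat{y}^{\top}\, \hat{w}(p)}{\lVert \hat{y} \rVert\; \lVert \hat{w}(p) \rVert},$$

which is equivalent to minimizing $\bigl\lVert \hat{y}/\lVert\hat{y}\rVert - \hat{w}(p)/\lVert\hat{w}(p)\rVert \bigr\rVert^{2}$.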
Of course this is only the original form; the actual solver adjusts it somewhat, and I will not go into the theory here. Note the function y = warp(x): the algorithm assumes some transform relates the two frames, and it need not be affine. It can be any of the following (the four motion models supported by the code below): translation, Euclidean (rotation plus translation), affine, and perspective (homography).
The last of these, the perspective transform, takes a full 3x3 matrix form:
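The matrix image from the original post is not reproduced here; the usual 3x3 parameterization (with the bottom-right entry fixed to 1) is

$$W = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix}.$$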
The first three transforms ignore the last row, i.e., they use a 2x3 matrix.
OpenCV already provides ECC-related functions, so all we need to do here is wrap them for convenient use in multi-object tracking. The heart of the ECC algorithm is solving for this warp matrix:
def ECC(src, dst, warp_mode = cv2.MOTION_EUCLIDEAN, eps = 1e-5,
        max_iter = 100, scale = None, align = False):
    """Compute the warp matrix from src to dst.
    Parameters
    ----------
    src : ndarray
        An HxW(x3) matrix of the source img (BGR or Gray), it must be the same format as dst.
    dst : ndarray
        An HxW(x3) matrix of the target img (BGR or Gray).
    warp_mode: flags of opencv
        translation: cv2.MOTION_TRANSLATION
        rotated and shifted: cv2.MOTION_EUCLIDEAN
        affine (shift, rotation, shear): cv2.MOTION_AFFINE
        homography (3d): cv2.MOTION_HOMOGRAPHY
    eps: float
        the threshold of the increment in the correlation coefficient between two iterations
    max_iter: int
        the maximum number of iterations
    scale: float or [int, int]
        scale_ratio: float
        scale_size: [W, H]
    align: bool
        whether to warp the source image with the estimated matrix and also return the aligned image
    Returns
    -------
    warp matrix : ndarray
        Returns the warp matrix from src to dst.
        if the motion model is homography, the warp matrix will be 3x3, otherwise 2x3
    src_aligned: ndarray
        aligned source image (grayscale), or None if align is False
    """
    assert src.shape == dst.shape, "the source image must have the same format as the target image!"

    # BGR2GRAY
    if src.ndim == 3:
        # Convert images to grayscale
        src = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
        dst = cv2.cvtColor(dst, cv2.COLOR_BGR2GRAY)

    # make the imgs smaller to speed up
    if scale is not None:
        if isinstance(scale, float) or isinstance(scale, int):
            if scale != 1:
                src_r = cv2.resize(src, (0, 0), fx = scale, fy = scale, interpolation = cv2.INTER_LINEAR)
                dst_r = cv2.resize(dst, (0, 0), fx = scale, fy = scale, interpolation = cv2.INTER_LINEAR)
                scale = [scale, scale]
            else:
                src_r, dst_r = src, dst
                scale = None
        else:
            if scale[0] != src.shape[1] and scale[1] != src.shape[0]:
                src_r = cv2.resize(src, (scale[0], scale[1]), interpolation = cv2.INTER_LINEAR)
                dst_r = cv2.resize(dst, (scale[0], scale[1]), interpolation = cv2.INTER_LINEAR)
                scale = [scale[0] / src.shape[1], scale[1] / src.shape[0]]
            else:
                src_r, dst_r = src, dst
                scale = None
    else:
        src_r, dst_r = src, dst

    # Define 2x3 or 3x3 matrices and initialize the matrix to identity
    if warp_mode == cv2.MOTION_HOMOGRAPHY:
        warp_matrix = np.eye(3, 3, dtype=np.float32)
    else:
        warp_matrix = np.eye(2, 3, dtype=np.float32)

    # Define termination criteria
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, max_iter, eps)

    # Run the ECC algorithm. The results are stored in warp_matrix.
    (cc, warp_matrix) = cv2.findTransformECC(src_r, dst_r, warp_matrix, warp_mode, criteria, None, 1)

    # rescale the translation terms if the estimation was done on downscaled images
    if scale is not None:
        warp_matrix[0, 2] = warp_matrix[0, 2] / scale[0]
        warp_matrix[1, 2] = warp_matrix[1, 2] / scale[1]

    if align:
        sz = src.shape
        if warp_mode == cv2.MOTION_HOMOGRAPHY:
            # Use warpPerspective for Homography
            src_aligned = cv2.warpPerspective(src, warp_matrix, (sz[1], sz[0]), flags=cv2.INTER_LINEAR)
        else:
            # Use warpAffine for Translation, Euclidean and Affine
            src_aligned = cv2.warpAffine(src, warp_matrix, (sz[1], sz[0]), flags=cv2.INTER_LINEAR)
        return warp_matrix, src_aligned
    else:
        return warp_matrix, None
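A minimal usage sketch of the wrapper (not from the original post): estimate a Euclidean warp between two consecutive frames on half-resolution images, then move a box's top-left corner with the resulting 2x3 matrix, following the src-to-dst convention stated in the docstring; the frames and the box are placeholders:

import numpy as np
import cv2

warp_matrix, src_aligned = ECC(img1, img2, warp_mode=cv2.MOTION_EUCLIDEAN,
                               max_iter=100, scale=0.5, align=True)

# move a box from frame t into frame t+1 with the 2x3 warp (illustrative only)
x, y, w, h = 100, 150, 60, 120
tl = warp_matrix @ np.array([x, y, 1.0])      # warped top-left corner
aligned_box = [int(tl[0]), int(tl[1]), w, h]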
Other Approximation Schemes