Target Localization and Body Estimation by 3D Points
I follow the camera model and the back-projection of points to rays in Hartley and Zisserman's book. Rather than the pseudo-inverse of P, the finite camera model is used to back-project a ray from a point x in the image frame.
X = (C + u M^(-1)x, 1)
where C is the camera center in the global coordinate system and u is a positive scalar. For a finite camera, M = KR is a non-singular matrix, where K is the intrinsic camera matrix and R is the camera rotation matrix w.r.t. the global coordinate system.
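A minimal NumPy sketch of this back-projection. The function name `backproject_ray` and the example intrinsics are assumptions made for illustration, and R is taken in the Hartley and Zisserman convention P = KR[I | -C].

```python
import numpy as np

def backproject_ray(K, R, C, x_pixel, u):
    """Return the 3D point C + u * M^{-1} x on the ray through pixel x."""
    M = K @ R                                     # finite camera: M = K R
    x = np.array([x_pixel[0], x_pixel[1], 1.0])   # homogeneous pixel
    direction = np.linalg.solve(M, x)             # M^{-1} x without forming the inverse
    return C + u * direction

# Hypothetical camera used only for this example.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
C = np.array([1.0, 2.0, 3.0])

# The principal point back-projects along the optical axis.
X_axis = backproject_ray(K, R, C, (320.0, 240.0), 2.0)

# Any back-projected point should reproject onto its original pixel.
X_other = backproject_ray(K, R, C, (100.0, 50.0), 3.0)
reproj = K @ R @ (X_other - C)
reproj = reproj / reproj[2]
```

Solving the linear system instead of inverting M is a numerical convenience, not something the original method prescribes.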
A cone is composed of 4 3D rays back-projected from the corners of a bounding box using the above camera model. To decide if a point is in the cone, I first represent the cone by 4 half spaces converted from the 4 3D rays:
HX >= 0
where H is a 4x3 matrix composed of 4 rows of normal vectors (of the half spaces), and X = (x, y, z) is the 3D point to be decided. If the inequality is satisfied, the point is in the cone; otherwise, not.
Note: A vector normal to two vectors can be computed from their cross product. The direction of the normal vector, which is set by the order of the cross product, determines which side of the plane the half space covers.
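The half-space test can be sketched as follows. This is an illustration under assumptions: the rays are given in the camera frame, ordered around the bounding box, and the sign of each normal is fixed by checking it against the mean ray rather than by relying on the corner order. The names `cone_half_spaces` and `in_cone` are hypothetical.

```python
import numpy as np

def cone_half_spaces(rays):
    """Build H (4x3) from four corner rays given in order around the box."""
    rays = np.asarray(rays, dtype=float)
    # Normal of each face: cross product of two adjacent rays.
    H = np.cross(rays, np.roll(rays, -1, axis=0))
    H /= np.linalg.norm(H, axis=1, keepdims=True)
    # Flip any normal that points away from the interior of the cone.
    interior = rays.mean(axis=0)
    H *= np.sign(H @ interior)[:, None]
    return H

def in_cone(H, X):
    """X is 3xn (one point per column); returns a boolean mask of length n."""
    return np.all(H @ X >= 0, axis=0)

# Corner rays on the plane z = 1 of the camera frame (example values).
rays = np.array([[-1.0, -1.0, 1.0],
                 [1.0, -1.0, 1.0],
                 [1.0, 1.0, 1.0],
                 [-1.0, 1.0, 1.0]])
H = cone_half_spaces(rays)
X = np.array([[0.0, 0.0, 5.0],
              [10.0, 0.0, 1.0],
              [0.5, 0.5, 1.0],
              [0.0, 0.0, -1.0]]).T
mask = in_cone(H, X)
```

Normalizing the rows of H is optional for the membership test, but it makes HX a Euclidean distance to each face, which is convenient when HX is later used as a weight.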
The above formula requires points in the camera coordinate system. Thus, I convert points from the world coordinate system to the camera coordinate system as follows:
P(c) = T(w2c) P(w)
where P(c) are the points in the camera coordinate system, P(w) are the points in the world coordinate system, and T(w2c) is the transform from world to camera, with T(w2c)=T(c2w)^(-1) and T(c2w)=T(b2w)T(c2b).
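A short sketch of this chain of transforms using 4x4 homogeneous matrices. The helper names and the example poses are assumptions for illustration only.

```python
import numpy as np

def make_transform(R, t):
    """Pack a rotation R (3x3) and translation t (3,) into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def world_to_camera(T_b2w, T_c2b, P_w):
    """P_w is 3xn in the world frame; returns 3xn in the camera frame."""
    T_c2w = T_b2w @ T_c2b                 # T(c2w) = T(b2w) T(c2b)
    T_w2c = np.linalg.inv(T_c2w)          # T(w2c) = T(c2w)^(-1)
    P_h = np.vstack([P_w, np.ones((1, P_w.shape[1]))])
    return (T_w2c @ P_h)[:3]

# Example poses (hypothetical): a 90 degree yaw for b2w, a small offset for c2b.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])
T_b2w = make_transform(Rz, np.array([1.0, 2.0, 0.0]))
T_c2b = make_transform(np.eye(3), np.array([0.1, 0.0, 0.0]))

# The camera origin expressed in the world frame maps back to zero.
cam_origin_w = (T_b2w @ T_c2b @ np.array([0.0, 0.0, 0.0, 1.0]))[:3]
P_c = world_to_camera(T_b2w, T_c2b, cam_origin_w.reshape(3, 1))
```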
To randomly generate 3D points in a cone:
- randomly generate 3D points in the rectangle where the cone intersects the plane z=1 in the camera coordinate system: p = a1*ray1 + a2*ray2 + a3*ray3 + a4*ray4, where a1+a2+a3+a4=1 and a1,a2,a3,a4>=0
- randomly scale the above points: p = a*p, where a is a random scalar
- convert the points to world coordinate system
In practice, I use the matrix form of the above steps to generate a batch of points.
A ~ U(4, n) # n is the number of points
S = Diag(|A|) # |.| is the L1 (Manhattan) norm of each column
A_ = AS^(-1) # all the columns of A_ are random (a1, a2, a3, a4) above
P = C^(T)A_ # C is a cone, C=[r1, r2, r3, r4]
z = aU(n) + b # a uniform sampling of z distances in a certain range
Z = Diag(z)
P_ = PZ
P_c = H(P_) # H(.) is homogeneous transform, applied to the scaled points P_
P_w = TP_c # T is transformation from camera frame to world frame
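The matrix form above maps almost line by line onto NumPy. The sketch below assumes the rays are stored as the rows of a 4x3 array with z = 1 in the camera frame, and that the depth range is given as `z_min` and `z_max` (names chosen here, not taken from the original).

```python
import numpy as np

def sample_points_in_cone(rays, T_c2w, n, z_min, z_max, rng):
    """Return 3xn world-frame points sampled inside the cone."""
    A = rng.uniform(size=(4, n))                  # A ~ U(4, n)
    A_ = A / A.sum(axis=0, keepdims=True)         # A S^(-1): columns sum to 1
    P = rays.T @ A_                               # 3xn points on the plane z = 1
    z = (z_max - z_min) * rng.uniform(size=n) + z_min   # z = a U(n) + b
    P_ = P * z                                    # P Z with Z = Diag(z)
    P_c = np.vstack([P_, np.ones((1, n))])        # homogeneous coordinates
    P_w = T_c2w @ P_c                             # camera frame to world frame
    return P_w[:3]

rng = np.random.default_rng(0)
rays = np.array([[-0.2, -0.1, 1.0],
                 [0.2, -0.1, 1.0],
                 [0.2, 0.1, 1.0],
                 [-0.2, 0.1, 1.0]])
pts = sample_points_in_cone(rays, np.eye(4), 1000, 1.0, 5.0, rng)
```

Note that normalized uniform weights tend to concentrate the samples toward the middle of the rectangle rather than spreading them uniformly over it.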
I use importance sampling to update the points. More weight is put on points with smaller HX, so points closer to the faces of the bounding box cone are weighted more heavily. The points are then resampled from a Gaussian distribution, with the number of resamples proportional to the weights. The update steps are as follows:
X_ = X + w # w is updating noise
Z = HX_ >=0 # Z is the points in a bounding box
Y = HX_ <0 # Y is the points out of the bounding box
D = min(HZ) # min(.) is to get the minimum of column values,
# so D is a vector of the minimum distances to any bounding box edges
sigma = a(T - N)/T # T is the total number of points,
# N is the number of points in the bounding box,
# a is a proportional gain
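W = N(D, sigma) # Gaussian weights: D is the distance from the mean, W holds the weights of the points
Y_ ~ <Z, W> # importance resampling: Y is replaced by draws proportional to W
X_ = Y_ + Z # the updated set combines the resampled points and the points in the bounding box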
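One possible reading of the update steps above, written as a NumPy sketch. It rests on several assumptions: points are in the camera frame, the rows of H are unit normals, the weights are a zero-mean Gaussian evaluated at D, and each point outside the cone is replaced by a copy of an inside point drawn in proportion to W. The function name `update_points` is hypothetical.

```python
import numpy as np

def update_points(X, H, noise_std, gain, rng):
    """X is 3xT in the camera frame; returns the updated 3xT point set."""
    T = X.shape[1]
    X_ = X + rng.normal(scale=noise_std, size=X.shape)   # X_ = X + w
    inside = np.all(H @ X_ >= 0, axis=0)
    Z = X_[:, inside]                                    # points in the cone
    N = Z.shape[1]
    if N == 0 or N == T:
        return X_            # nothing to resample from, or nothing to replace
    D = (H @ Z).min(axis=0)                 # distance to the nearest face
    sigma = gain * (T - N) / T              # sigma = a (T - N) / T
    log_w = -0.5 * (D / sigma) ** 2         # Gaussian weight, in log form
    W = np.exp(log_w - log_w.max())         # shift to avoid underflow
    W /= W.sum()
    idx = rng.choice(N, size=T - N, p=W)    # resample in proportion to W
    Y_ = Z[:, idx]
    return np.hstack([Y_, Z])               # X_ = Y_ + Z

rng = np.random.default_rng(0)
s = 1.0 / np.sqrt(2.0)
H = s * np.array([[0.0, 1.0, 1.0],
                  [-1.0, 0.0, 1.0],
                  [0.0, -1.0, 1.0],
                  [1.0, 0.0, 1.0]])
X = np.hstack([np.tile([[0.0], [0.0], [5.0]], (1, 6)),    # inside the cone
               np.tile([[50.0], [0.0], [1.0]], (1, 4))])  # outside the cone
X_new = update_points(X, H, noise_std=0.01, gain=1.0, rng=rng)
```

The two early returns cover the cases the pseudocode leaves open: sigma becomes zero when every point is already inside, and there is nothing to draw from when none is.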
In addition, when a bounding box is too close to the image edge, the updating should be different because the bounding box may only capture a part of an object body.
I also update points following the depth filter method, which reprojects 3D points to images and resamples them based on their positions with respect to a bounding box.
X_ = T(X + w) # w is updating noise, T is a transform from world to camera
x_ = proj(X_)
p = P(x|bbox)
X ~ Importance_Sampling(p) # X is updated 3D points
P(x|bbox) is a two-variable distribution composed of a Gaussian and a uniform distribution, similar to the measurement posterior in SVO. The Gaussian has its mean at the center of the bounding box and standard deviations equal to half the edge lengths; the uniform distribution covers the bounding box.
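A sketch of the reprojection weighting, under assumptions: the mixing weight `gauss_weight` between the two components is a free parameter introduced here (its value is not given above), the bounding box is (u_min, v_min, u_max, v_max) in pixels, and the points are already in the camera frame.

```python
import numpy as np

def proj(K, X_c):
    """Pinhole projection of 3xn camera-frame points to 2xn pixels."""
    x = K @ X_c
    return x[:2] / x[2]

def bbox_likelihood(x, bbox, gauss_weight=0.5):
    """Gaussian plus uniform mixture evaluated at 2xn pixel positions."""
    u_min, v_min, u_max, v_max = bbox
    center = np.array([[0.5 * (u_min + u_max)], [0.5 * (v_min + v_max)]])
    half = np.array([[0.5 * (u_max - u_min)], [0.5 * (v_max - v_min)]])
    # Axis-aligned Gaussian: mean at the box center, std = half edge length.
    gauss = np.exp(-0.5 * np.sum(((x - center) / half) ** 2, axis=0))
    gauss /= 2.0 * np.pi * half[0, 0] * half[1, 0]
    # Uniform density over the box, zero outside it.
    inside = (x[0] >= u_min) & (x[0] <= u_max) & (x[1] >= v_min) & (x[1] <= v_max)
    uniform = inside / ((u_max - u_min) * (v_max - v_min))
    return gauss_weight * gauss + (1.0 - gauss_weight) * uniform

def resample(X, p, rng):
    """Draw as many points as in X, with probability proportional to p."""
    idx = rng.choice(X.shape[1], size=X.shape[1], p=p / p.sum())
    return X[:, idx]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
bbox = (220.0, 190.0, 420.0, 290.0)
X_c = np.array([[0.0, 0.0, 5.0],      # projects to the box center
                [-1.0, -0.5, 5.0],    # projects to a box corner
                [5.0, 5.0, 5.0]]).T   # projects far outside the box
p = bbox_likelihood(proj(K, X_c), bbox)
X_res = resample(X_c, p, np.random.default_rng(0))
```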
The center of the ellipsoid representation is the mean of the 3D points. The three principal axes of the ellipsoid are obtained from Principal Component Analysis.
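A minimal sketch of the ellipsoid fit, assuming the semi-axis lengths are taken as one standard deviation along each principal direction (the scaling is a choice made here, not stated above). The function name `fit_ellipsoid` is hypothetical.

```python
import numpy as np

def fit_ellipsoid(points):
    """points is 3xn; returns center, axes (columns), and semi-axis lengths."""
    center = points.mean(axis=1)
    cov = np.cov(points)                       # rows are the x, y, z variables
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]          # largest variance first
    radii = np.sqrt(np.maximum(eigvals[order], 0.0))
    return center, eigvecs[:, order], radii

rng = np.random.default_rng(0)
# Synthetic cloud stretched along x, then y, then z.
points = rng.normal(size=(3, 20000)) * np.array([[3.0], [2.0], [1.0]])
center, axes, radii = fit_ellipsoid(points)
```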
When updating 3D points, the following metrics are also computed:
- Kullback–Leibler divergence (KL divergence): Localization is confirmed when the KL divergence is smaller than a threshold, i.e. KL(P_old || P_new), the relative entropy between the point distributions before and after the update.
- Differential entropy: A point set that is either too compact or too sparse is treated as a false detection.
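Both metrics have closed forms if the point sets are summarized by Gaussians. That modelling choice is an assumption made for this sketch, as the text above does not say how P_old and P_new are represented; the thresholds are left to the caller.

```python
import numpy as np

def fit_gaussian(points):
    """points is 3xn; returns the sample mean and covariance."""
    return points.mean(axis=1), np.cov(points)

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL(N(mu0, cov0) || N(mu1, cov1)) for k-dimensional Gaussians."""
    k = mu0.shape[0]
    diff = mu1 - mu0
    cov1_inv = np.linalg.inv(cov1)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - k + logdet1 - logdet0)

def gaussian_entropy(cov):
    """Differential entropy of a k-dimensional Gaussian, in nats."""
    k = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (k * np.log(2.0 * np.pi * np.e) + logdet)

I3 = np.eye(3)
kl_same = gaussian_kl(np.zeros(3), I3, np.zeros(3), I3)
kl_shift = gaussian_kl(np.zeros(3), I3, np.array([1.0, 0.0, 0.0]), I3)
h_unit = gaussian_entropy(I3)
```

With this model, a small KL divergence between consecutive updates indicates that the point cloud has stopped moving, and the entropy grows with the log-volume of the cloud, so both a lower and an upper bound on it can flag degenerate point sets.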