In LoopClosing, once the loop-closure candidate frames for the current frame have been obtained, the bag-of-words representation is used to match 2D feature points between the current frame and each candidate. Note that this 2D matching simply pairs up, under each vocabulary node shared by the two frames, the features whose descriptors are closest. Feature points without an associated map point are filtered out during bag-of-words matching, so every matched feature has a map point, and the result is effectively a set of map point matches between the two frames. These map point matches can then be used to compute the sim3 transformation between the current frame's coordinate system and the candidate frame's coordinate system. A sim3 is computed, rather than a Euclidean transformation, because monocular SLAM suffers from scale drift: the pose of the current frame, and the world coordinates of the map points computed from it, are not accurate. The coordinates of the current frame's map points expressed in the current frame's own coordinate system are accurate, however, because that is a local relationship.
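The matching rule described above can be sketched as follows. This is a simplified illustration, not ORB-SLAM's implementation: the feature layout (dicts with the keys `node`, `desc`, `mp`), the function names, and the `max_dist` threshold are all assumptions made for the example, and additional filtering that a real matcher would apply is left out.

```python
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_by_bow(feats_a, feats_b, max_dist=50):
    """Match features of two frames that fall under the same vocabulary node.

    Each feature is a dict with hypothetical keys:
      'node': id of the vocabulary node the feature belongs to
      'desc': binary descriptor packed into an int
      'mp'  : id of the associated map point, or None
    Returns a list of (map_point_a, map_point_b) pairs.
    """
    # Index frame B by vocabulary node, dropping features without a map point.
    by_node_b = defaultdict(list)
    for fb in feats_b:
        if fb["mp"] is not None:
            by_node_b[fb["node"]].append(fb)

    matches = []
    used_b = set()  # map points of B that are already matched
    for fa in feats_a:
        if fa["mp"] is None:
            continue  # features without a map point are screened out
        best, best_d = None, max_dist + 1
        for fb in by_node_b.get(fa["node"], []):
            if fb["mp"] in used_b:
                continue
            d = hamming(fa["desc"], fb["desc"])
            if d < best_d:
                best, best_d = fb, d
        if best is not None:
            used_b.add(best["mp"])
            matches.append((fa["mp"], best["mp"]))
    return matches
```

Because only features with map points take part, the returned pairs are map point correspondences, which is what the sim3 computation needs.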
Once a preliminary sim3 has been obtained, bool LoopClosing::ComputeSim3() calls matcher.SearchBySim3(), which searches the map points of the loop-closure candidate frame for additional matches with the current frame. The projection used in this matching applies the preliminary sim3 computed above. Since the transformation between the two camera coordinate systems is a sim3, scale drift is accounted for and the projection is relatively accurate. With the additional matches in hand, g2o optimization yields a more accurate sim3.
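To make the projection step concrete, here is a minimal sketch of mapping a candidate-frame point into the current frame with a sim3 and projecting it with a pinhole model. The function names and the intrinsics `fx, fy, cx, cy` are assumptions for illustration; the actual search additionally compares descriptors around the projected pixel, which is not shown.

```python
def mat_vec(R, x):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * x[j] for j in range(3)) for i in range(3)]


def sim3_apply(s, R, t, X):
    """Apply X' = s * R * X + t."""
    RX = mat_vec(R, X)
    return [s * RX[i] + t[i] for i in range(3)]


def project(Xc, fx, fy, cx, cy):
    """Pinhole projection of a point in camera coordinates.

    Returns None when the point is not in front of the camera.
    """
    x, y, z = Xc
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Usage: a point expressed in the candidate frame, mapped into the current frame.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
X_current = sim3_apply(2.0, I3, [0.0, 0.0, 1.0], [0.5, 0.0, 1.5])
pixel = project(X_current, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```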
The question arises in the last step. As far as computing the sim3 is concerned, the task is already complete at this point. However, to judge carefully whether the loop closure really succeeded, the program also calls matcher.SearchByProjection(), which projects the map points of the candidate frame and of its covisible keyframes into the current frame once more and counts how many map points end up matched. Because the candidate frame and its covisible keyframes are very close to each other, scale drift between them can be regarded as negligible. The map points of the covisible keyframes can therefore be brought into the candidate frame with a Euclidean transformation and then, as in the SearchBySim3() step above, transformed into the current frame with the sim3 to look for matches. But the two searches differ:
matcher.SearchBySim3() only matches map points between two frames, whose viewpoints do not differ much. Now, however, all map points of the candidate frame's covisible keyframe group are matched against the current frame, and the viewpoints may differ considerably. In ORB-SLAM, a match is considered unreliable when the viewing angle differs by more than 60 degrees. The check therefore needs the direction vector from the current frame's camera optical center to each map point of the covisible keyframe group, which in turn requires the real world coordinates of the camera optical center. At first I thought I understood this, but on closer thought there was still a lot I did not understand. The rough idea can nonetheless be summed up in one sentence: with the scale s factored out, the sim3 transformation takes the following form:
$X' = sRX + t = s\left(RX + \frac{1}{s}t\right)$
Here $R$ and $\frac{1}{s}t$ together form a Euclidean transformation, and a Euclidean transformation can represent a pose. Factoring out the scale in this way therefore amounts to recovering the real pose of the camera. What, then, does the scale s stand for? It is the scaling of the camera's coordinate axes. Take a vector of length 1 in the world coordinate system: under a pure Euclidean transformation its length is still 1. Under the sim3, the vector can be thought of as first undergoing the Euclidean transformation given by $R$ and $\frac{1}{s}t$, after which its coordinates are scaled by s. This is as if the camera's coordinate axes, after the Euclidean transformation, had been rescaled so that their length is no longer the unit 1 but s. (An axis has no length, so the metaphor may not be quite apt, but that is the idea.)
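The scale-stripping idea can be tried on numbers. The sketch below is written under my own assumptions rather than taken from ORB-SLAM's code: the sim3 is given as the 3x3 block sR plus the translation t, the map point's mean viewing direction is a unit vector, and all function names are hypothetical. It recovers s, R and $\frac{1}{s}t$, computes the optical center of the resulting Euclidean pose, and applies a 60-degree viewing-angle test (cos 60° = 0.5).

```python
import math


def strip_scale(sR, t):
    """Split a sim3 given as (sR, t) into s, R and t/s.

    Every row of a rotation matrix has unit norm, so the norm of
    any row of sR equals the scale s.
    """
    s = math.sqrt(sum(v * v for v in sR[0]))
    R = [[v / s for v in row] for row in sR]
    t_e = [v / s for v in t]
    return s, R, t_e


def camera_center(R, t_e):
    """Optical center of the Euclidean pose (R, t_e): Ow = -R^T * t_e."""
    return [-sum(R[j][i] * t_e[j] for j in range(3)) for i in range(3)]


def view_ok(Ow, P, normal, cos_limit=0.5):
    """True if the ray from the optical center Ow to the point P lies
    within 60 degrees of the point's unit mean viewing direction."""
    d = [P[i] - Ow[i] for i in range(3)]
    dist = math.sqrt(sum(v * v for v in d))
    dot = sum(d[i] * normal[i] for i in range(3))
    return dot >= cos_limit * dist


# Usage: scale 2 combined with a 90-degree rotation about the z axis.
sR = [[0.0, -2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
t = [2.0, 4.0, 6.0]
s, R, t_e = strip_scale(sR, t)
Ow = camera_center(R, t_e)
```

The point Ow is the one that the full sim3 maps to the origin of the camera frame, since $sR\,O_w + t = 0$ gives $O_w = -R^{T}\frac{1}{s}t$. This is why the scale has to be factored out before the optical center can be used in the viewing-angle test.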