fast-yolo4/3rdparty/opencv/inc/opencv2/surface_matching.hpp

//
//  IMPORTANT: READ BEFORE DOWNLOADING, COPYING, INSTALLING OR USING.
//
//  By downloading, copying, installing or using the software you agree to this license.
//  If you do not agree to this license, do not download, install,
//  copy or use the software.
//
//
//                          License Agreement
//                For Open Source Computer Vision Library
//
// Copyright (C) 2014, OpenCV Foundation, all rights reserved.
// Third party copyrights are property of their respective owners.
//
// Redistribution and use in source and binary forms, with or without modification,
// are permitted provided that the following conditions are met:
//
//   * Redistribution's of source code must retain the above copyright notice,
//     this list of conditions and the following disclaimer.
//
//   * Redistribution's in binary form must reproduce the above copyright notice,
//     this list of conditions and the following disclaimer in the documentation
//     and/or other materials provided with the distribution.
//
//   * The name of the copyright holders may not be used to endorse or promote products
//     derived from this software without specific prior written permission.
//
// This software is provided by the copyright holders and contributors "as is" and
// any express or implied warranties, including, but not limited to, the implied
// warranties of merchantability and fitness for a particular purpose are disclaimed.
// In no event shall the Intel Corporation or contributors be liable for any direct,
// indirect, incidental, special, exemplary, or consequential damages
// (including, but not limited to, procurement of substitute goods or services;
// loss of use, data, or profits; or business interruption) however caused
// and on any theory of liability, whether in contract, strict liability,
// or tort (including negligence or otherwise) arising in any way out of
// the use of this software, even if advised of the possibility of such damage.
#ifndef __OPENCV_SURFACE_MATCHING_HPP__
#define __OPENCV_SURFACE_MATCHING_HPP__

#include "surface_matching/ppf_match_3d.hpp"
#include "surface_matching/icp.hpp"

/** @defgroup surface_matching Surface Matching

Note about the License and Patents
-----------------------------------

The following patents have been issued for methods embodied in this
software: "Recognition and pose determination of 3D objects in 3D scenes
using geometric point pair descriptors and the generalized Hough
Transform", Bertram Heinrich Drost, Markus Ulrich, EP Patent 2385483
(Nov. 21, 2012), assignee: MVTec Software GmbH, 81675 Muenchen
(Germany); "Recognition and pose determination of 3D objects in 3D
scenes", Bertram Heinrich Drost, Markus Ulrich, US Patent 8830229 (Sept.
9, 2014), assignee: MVTec Software GmbH, 81675 Muenchen (Germany).
Further patents are pending. For further details, contact MVTec Software
GmbH (info@mvtec.com).

Note that restrictions imposed by these patents (and possibly others)
exist independently of and may be in conflict with the freedoms granted
in this license, which refers to copyright of the program, not patents
for any methods that it implements.  Both copyright and patent law must
be obeyed to legally use and redistribute this program and it is not the
purpose of this license to induce you to infringe any patents or other
property right claims or to contest validity of any such claims.  If you
redistribute or use the program, then this license merely protects you
from committing copyright infringement.  It does not protect you from
committing patent infringement.  So, before you do anything with this
program, make sure that you have permission to do so not merely in terms
of copyright, but also in terms of patent law.

Please note that this license is not to be understood as a guarantee
either.  If you use the program according to this license, but in
conflict with patent law, it does not mean that the licensor will refund
you for any losses that you incur if you are sued for your patent
infringement.


Introduction to Surface Matching
--------------------------------

Cameras and similar devices with the capability of sensation of 3D structure are becoming more
common. Thus, using depth and intensity information for matching 3D objects (or parts) are of
crucial importance for computer vision. Applications range from industrial control to guiding
everyday actions for visually impaired people. The task in recognition and pose estimation in range
images aims to identify and localize a queried 3D free-form object by matching it to the acquired
database.

From an industrial perspective, enabling robots to automatically locate and pick up randomly placed
and oriented objects from a bin is an important challenge in factory automation, replacing tedious
and heavy manual labor. A system should be able to recognize and locate objects with a predefined
shape and estimate the position with the precision necessary for a gripping robot to pick it up.
This is where vision guided robotics takes the stage. Similar tools are also capable of guiding
robots (and even people) through unstructured environments, leading to automated navigation. These
properties make 3D matching from point clouds a ubiquitous necessity. Within this context, I will
now describe the OpenCV implementation of a 3D object recognition and pose estimation algorithm
using 3D features.

Surface Matching Algorithm Through 3D Features
----------------------------------------------

The state of the algorithms in order to achieve the task 3D matching is heavily based on
@cite drost2010, which is one of the first and main practical methods presented in this area. The
approach is composed of extracting 3D feature points randomly from depth images or generic point
clouds, indexing them and later in runtime querying them efficiently. Only the 3D structure is
considered, and a trivial hash table is used for feature queries.

While being fully aware that utilization of the nice CAD model structure in order to achieve a smart
point sampling, I will be leaving that aside now in order to respect the generalizability of the
methods (Typically for such algorithms training on a CAD model is not needed, and a point cloud
would be sufficient). Below is the outline of the entire algorithm:

![Outline of the Algorithm](img/outline.jpg)

As explained, the algorithm relies on the extraction and indexing of point pair features, which are
defined as follows:

\f[\bf{{F}}(\bf{{m1}}, \bf{{m2}}) = (||\bf{{d}}||_2, <(\bf{{n1}},\bf{{d}}), <(\bf{{n2}},\bf{{d}}), <(\bf{{n1}},\bf{{n2}}))\f]

where \f$\bf{{m1}}\f$ and \f$\bf{{m2}}\f$ are feature two selected points on the model (or scene),
\f$\bf{{d}}\f$ is the difference vector, \f$\bf{{n1}}\f$ and \f$\bf{{n2}}\f$ are the normals at \f$\bf{{m1}}\f$ and
\f$\bf{m2}\f$. During the training stage, this vector is quantized, indexed. In the test stage, same
features are extracted from the scene and compared to the database. With a few tricks like
separation of the rotational components, the pose estimation part can also be made efficient (check
the reference for more details). A Hough-like voting and clustering is employed to estimate the
object pose. To cluster the poses, the raw pose hypotheses are sorted in decreasing order of the
number of votes. From the highest vote, a new cluster is created. If the next pose hypothesis is
close to one of the existing clusters, the hypothesis is added to the cluster and the cluster center
is updated as the average of the pose hypotheses within the cluster. If the next hypothesis is not
close to any of the clusters, it creates a new cluster. The proximity testing is done with fixed
thresholds in translation and rotation. Distance computation and averaging for translation are
performed in the 3D Euclidean space, while those for rotation are performed using quaternion
representation. After clustering, the clusters are sorted in decreasing order of the total number of
votes which determines confidence of the estimated poses.

This pose is further refined using \f$ICP\f$ in order to obtain the final pose.

PPF presented above depends largely on robust computation of angles between 3D vectors. Even though
not reported in the paper, the naive way of doing this (\f$\theta = cos^{-1}({\bf{a}}\cdot{\bf{b}})\f$
remains numerically unstable. A better way to do this is then use inverse tangents, like:

\f[<(\bf{n1},\bf{n2})=tan^{-1}(||{\bf{n1}  \wedge \bf{n2}}||_2, \bf{n1} \cdot \bf{n2})\f]

Rough Computation of Object Pose Given PPF
------------------------------------------

Let me summarize the following notation:

-   \f$p^i_m\f$: \f$i^{th}\f$ point of the model (\f$p^j_m\f$ accordingly)
-   \f$n^i_m\f$: Normal of the \f$i^{th}\f$ point of the model (\f$n^j_m\f$ accordingly)
-   \f$p^i_s\f$: \f$i^{th}\f$ point of the scene (\f$p^j_s\f$ accordingly)
-   \f$n^i_s\f$: Normal of the \f$i^{th}\f$ point of the scene (\f$n^j_s\f$ accordingly)
-   \f$T_{m\rightarrow g}\f$: The transformation required to translate \f$p^i_m\f$ to the origin and rotate
    its normal \f$n^i_m\f$ onto the \f$x\f$-axis.
-   \f$R_{m\rightarrow g}\f$: Rotational component of \f$T_{m\rightarrow g}\f$.
-   \f$t_{m\rightarrow g}\f$: Translational component of \f$T_{m\rightarrow g}\f$.
-   \f$(p^i_m)^{'}\f$: \f$i^{th}\f$ point of the model transformed by \f$T_{m\rightarrow g}\f$. (\f$(p^j_m)^{'}\f$
    accordingly).
-   \f${\bf{R_{m\rightarrow g}}}\f$: Axis angle representation of rotation \f$R_{m\rightarrow g}\f$.
-   \f$\theta_{m\rightarrow g}\f$: The angular component of the axis angle representation
    \f${\bf{R_{m\rightarrow g}}}\f$.

The transformation in a point pair feature is computed by first finding the transformation
\f$T_{m\rightarrow g}\f$ from the first point, and applying the same transformation to the second one.
Transforming each point, together with the normal, to the ground plane leaves us with an angle to
find out, during a comparison with a new point pair.

We could now simply start writing

\f[(p^i_m)^{'} = T_{m\rightarrow g} p^i_m\f]

where

\f[T_{m\rightarrow g} = -t_{m\rightarrow g}R_{m\rightarrow g}\f]

Note that this is nothing but a stacked transformation. The translational component
\f$t_{m\rightarrow g}\f$ reads

\f[t_{m\rightarrow g} = -R_{m\rightarrow g}p^i_m\f]

and the rotational being

\f[\theta_{m\rightarrow g} = \cos^{-1}(n^i_m \cdot {\bf{x}})\\
 {\bf{R_{m\rightarrow g}}} = n^i_m \wedge {\bf{x}}\f]

in axis angle format. Note that bold refers to the vector form. After this transformation, the
feature vectors of the model are registered onto the ground plane X and the angle with respect to
\f$x=0\f$ is called \f$\alpha_m\f$. Similarly, for the scene, it is called \f$\alpha_s\f$.

### Hough-like Voting Scheme

As shown in the outline, PPF (point pair features) are extracted from the model, quantized, stored
in the hashtable and indexed, during the training stage. During the runtime however, the similar
operation is perfomed on the input scene with the exception that this time a similarity lookup over
the hashtable is performed, instead of an insertion. This lookup also allows us to compute a
transformation to the ground plane for the scene pairs. After this point, computing the rotational
component of the pose reduces to computation of the difference \f$\alpha=\alpha_m-\alpha_s\f$. This
component carries the cue about the object pose. A Hough-like voting scheme is performed over the
local model coordinate vector and \f$\alpha\f$. The highest poses achieved for every scene point lets us
recover the object pose.

### Source Code for PPF Matching

~~~{cpp}
// pc is the loaded point cloud of the model
// (Nx6) and pcTest is a loaded point cloud of
// the scene (Mx6)
ppf_match_3d::PPF3DDetector detector(0.03, 0.05);
detector.trainModel(pc);
vector<Pose3DPtr> results;
detector.match(pcTest, results, 1.0/10.0, 0.05);
cout << "Poses: " << endl;
// print the poses
for (size_t i=0; i<results.size(); i++)
{
    Pose3DPtr pose = results[i];
    cout << "Pose Result " << i << endl;
    pose->printPose();
}
~~~

Pose Registration via ICP
-------------------------

The matching process terminates with the attainment of the pose. However, due to the multiple
matching points, erroneous hypothesis, pose averaging and etc. such pose is very open to noise and
many times is far from being perfect. Although the visual results obtained in that stage are
pleasing, the quantitative evaluation shows \f$~10\f$ degrees variation (error), which is an acceptable
level of matching. Many times, the requirement might be set well beyond this margin and it is
desired to refine the computed pose.

Furthermore, in typical RGBD scenes and point clouds, 3D structure can capture only less than half
of the model due to the visibility in the scene. Therefore, a robust pose refinement algorithm,
which can register occluded and partially visible shapes quickly and correctly is not an unrealistic
wish.

At this point, a trivial option would be to use the well known iterative closest point algorithm .
However, utilization of the basic ICP leads to slow convergence, bad registration, outlier
sensitivity and failure to register partial shapes. Thus, it is definitely not suited to the
problem. For this reason, many variants have been proposed . Different variants contribute to
different stages of the pose estimation process.

ICP is composed of \f$6\f$ stages and the improvements I propose for each stage is summarized below.

### Sampling

To improve convergence speed and computation time, it is common to use less points than the model
actually has. However, sampling the correct points to register is an issue in itself. The naive way
would be to sample uniformly and hope to get a reasonable subset. More smarter ways try to identify
the critical points, which are found to highly contribute to the registration process. Gelfand et.
al. exploit the covariance matrix in order to constrain the eigenspace, so that a set of points
which affect both translation and rotation are used. This is a clever way of subsampling, which I
will optionally be using in the implementation.

### Correspondence Search

As the name implies, this step is actually the assignment of the points in the data and the model in
a closest point fashion. Correct assignments will lead to a correct pose, where wrong assignments
strongly degrade the result. In general, KD-trees are used in the search of nearest neighbors, to
increase the speed. However this is not an optimality guarantee and many times causes wrong points
to be matched. Luckily the assignments are corrected over iterations.

To overcome some of the limitations, Picky ICP @cite pickyicp and BC-ICP (ICP using bi-unique
correspondences) are two well-known methods. Picky ICP first finds the correspondences in the
old-fashioned way and then among the resulting corresponding pairs, if more than one scene point
\f$p_i\f$ is assigned to the same model point \f$m_j\f$, it selects \f$p_i\f$ that corresponds to the minimum
distance. BC-ICP on the other hand, allows multiple correspondences first and then resolves the
assignments by establishing bi-unique correspondences. It also defines a novel no-correspondence
outlier, which intrinsically eases the process of identifying outliers.

For reference, both methods are used. Because P-ICP is a bit faster, with not-so-significant
performance drawback, it will be the method of choice in refinment of correspondences.

### Weighting of Pairs

In my implementation, I currently do not use a weighting scheme. But the common approaches involve
*normal compatibility* (\f$w_i=n^1_i\cdot n^2_j\f$) or assigning lower weights to point pairs with
greater distances (\f$w=1-\frac{||dist(m_i,s_i)||_2}{dist_{max}}\f$).

### Rejection of Pairs

The rejections are done using a dynamic thresholding based on a robust estimate of the standard
deviation. In other words, in each iteration, I find the MAD estimate of the Std. Dev. I denote this
as \f$mad_i\f$. I reject the pairs with distances \f$d_i>\tau mad_i\f$. Here \f$\tau\f$ is the threshold of
rejection and by default set to \f$3\f$. The weighting is applied prior to Picky refinement, explained
in the previous stage.

### Error Metric

As described in , a linearization of point to plane as in @cite koklimlow error metric is used. This
both speeds up the registration process and improves convergence.

### Minimization

Even though many non-linear optimizers (such as Levenberg Mardquardt) are proposed, due to the
linearization in the previous step, pose estimation reduces to solving a linear system of equations.
This is what I do exactly using cv::solve with DECOMP_SVD option.

### ICP Algorithm

Having described the steps above, here I summarize the layout of the ICP algorithm.

#### Efficient ICP Through Point Cloud Pyramids

While the up-to-now-proposed variants deal well with some outliers and bad initializations, they
require significant number of iterations. Yet, multi-resolution scheme can help reducing the number
of iterations by allowing the registration to start from a coarse level and propagate to the lower
and finer levels. Such approach both improves the performances and enhances the runtime.

The search is done through multiple levels, in a hierarchical fashion. The registration starts with
a very coarse set of samples of the model. Iteratively, the points are densified and sought. After
each iteration the previously estimated pose is used as an initial pose and refined with the ICP.

#### Visual Results

##### Results on Synthetic Data

In all of the results, the pose is initiated by PPF and the rest is left as:
\f$[\theta_x, \theta_y, \theta_z, t_x, t_y, t_z]=[0]\f$

### Source Code for Pose Refinement Using ICP

~~~{cpp}
ICP icp(200, 0.001f, 2.5f, 8);
// Using the previously declared pc and pcTest
// This will perform registration for every pose
// contained in results
icp.registerModelToScene(pc, pcTest, results);

// results now contain the refined poses
~~~

Results
-------

This section is dedicated to the results of surface matching (point-pair-feature matching and a
following ICP refinement):

![Several matches of a single frog model using ppf + icp](img/gsoc_forg_matches.jpg)

Matches of different models for Mian dataset is presented below:

![Matches of different models for Mian dataset](img/snapshot27.jpg)

You might checkout the video on [youTube here](http://www.youtube.com/watch?v=uFnqLFznuZU).

A Complete Sample
-----------------

### Parameter Tuning

Surface matching module treats its parameters relative to the model diameter (diameter of the axis
parallel bounding box), whenever it can. This makes the parameters independent from the model size.
This is why, both model and scene cloud were subsampled such that all points have a minimum distance
of \f$RelativeSamplingStep*DimensionRange\f$, where \f$DimensionRange\f$ is the distance along a given
dimension. All three dimensions are sampled in similar manner. For example, if
\f$RelativeSamplingStep\f$ is set to 0.05 and the diameter of model is 1m (1000mm), the points sampled
from the object's surface will be approximately 50 mm apart. From another point of view, if the
sampling RelativeSamplingStep is set to 0.05, at most \f$20x20x20 = 8000\f$ model points are generated
(depending on how the model fills in the volume). Consequently this results in at most 8000x8000
pairs. In practice, because the models are not uniformly distributed over a rectangular prism, much
less points are to be expected. Decreasing this value, results in more model points and thus a more
accurate representation. However, note that number of point pair features to be computed is now
quadratically increased as the complexity is O(N\^2). This is especially a concern for 32 bit
systems, where large models can easily overshoot the available memory. Typically, values in the
range of 0.025 - 0.05 seem adequate for most of the applications, where the default value is 0.03.
(Note that there is a difference in this paremeter with the one presented in @cite drost2010 . In
@cite drost2010 a uniform cuboid is used for quantization and model diameter is used for reference of
sampling. In my implementation, the cuboid is a rectangular prism, and each dimension is quantized
independently. I do not take reference from the diameter but along the individual dimensions.

It would very wise to remove the outliers from the model and prepare an ideal model initially. This
is because, the outliers directly affect the relative computations and degrade the matching
accuracy.

During runtime stage, the scene is again sampled by \f$RelativeSamplingStep\f$, as described above.
However this time, only a portion of the scene points are used as reference. This portion is
controlled by the parameter \f$RelativeSceneSampleStep\f$, where
\f$SceneSampleStep = (int)(1.0/RelativeSceneSampleStep)\f$. In other words, if the
\f$RelativeSceneSampleStep = 1.0/5.0\f$, the subsampled scene will once again be uniformly sampled to
1/5 of the number of points. Maximum value of this parameter is 1 and increasing this parameter also
increases the stability, but decreases the speed. Again, because of the initial scene-independent
relative sampling, fine tuning this parameter is not a big concern. This would only be an issue when
the model shape occupies a volume uniformly, or when the model shape is condensed in a tiny place
within the quantization volume (e.g. The octree representation would have too much empty cells).

\f$RelativeDistanceStep\f$ acts as a step of discretization over the hash table. The point pair features
are quantized to be mapped to the buckets of the hashtable. This discretization involves a
multiplication and a casting to the integer. Adjusting RelativeDistanceStep in theory controls the
collision rate. Note that, more collisions on the hashtable results in less accurate estimations.
Reducing this parameter increases the affect of quantization but starts to assign non-similar point
pairs to the same bins. Increasing it however, wanes the ability to group the similar pairs.
Generally, because during the sampling stage, the training model points are selected uniformly with
a distance controlled by RelativeSamplingStep, RelativeDistanceStep is expected to equate to this
value. Yet again, values in the range of 0.025-0.05 are sensible. This time however, when the model
is dense, it is not advised to decrease this value. For noisy scenes, the value can be increased to
improve the robustness of the matching against noisy points.

*/

#endif