SIFT algorithm
The SIFT algorithm is a feature matching algorithm proposed by D. Lowe in 2004 [11]. It builds on invariant-feature techniques and yields a point-feature registration algorithm that remains invariant under image translation and rotation. A brief process of extracting feature points from images using the SIFT algorithm is shown in Fig. 1.
1. Detect scale-space extrema. The original image is convolved with a Gaussian kernel to obtain its representation sequence in multiscale space, where the calculation formula is [12]:
$$ L\left(x,y,\sigma \right)=G\left(x,y,\sigma \right)\ast I\left(x,y\right) $$
(1)
Among them, I(x, y) is the gray value of the image at pixel (x, y), G(x, y, σ) is a scale-variable Gaussian function, which is defined as \( G\left(x,y,\sigma \right)=\frac{1}{2{\pi \sigma}^2}{e}^{-\frac{x^2+{y}^2}{2{\sigma}^2}} \), and σ is the Gaussian scale factor.
Finally, the spatial extrema are extracted from these sequences for feature extraction.
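Equation (1) can be sketched directly in numpy; the kernel truncation radius below is an illustrative choice, not part of the original formulation:

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Build a normalized 2-D Gaussian kernel G(x, y, sigma)."""
    if radius is None:
        radius = int(3 * sigma)  # common truncation choice (assumption)
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

def scale_space_layer(image, sigma):
    """L(x, y, sigma) = G(x, y, sigma) * I(x, y) via direct convolution."""
    k = gaussian_kernel(sigma)
    r = k.shape[0] // 2
    padded = np.pad(image.astype(float), r, mode="reflect")
    out = np.zeros(image.shape, dtype=float)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            # inner product of the kernel with the window around (i, j)
            out[i, j] = np.sum(padded[i:i + 2 * r + 1, j:j + 2 * r + 1] * k)
    return out
```

Varying σ over the octaves of the pyramid yields the multiscale sequence L(x, y, σ) referred to above.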
2. Locate and filter feature points. The scale-space function D(x, y, σ) is fitted in a continuous space: its Taylor expansion is taken at a local extremum point (x0, y0, σ0), and the derivative of the expansion is set equal to 0. This yields the position coordinate \( \overset{\frown }{x} \) of the extremum point; substituting this coordinate into the scale-space function gives the extremum formula of the original function:
$$ D\left(\overset{\frown }{x}\right)=D+\frac{1}{2}\frac{\partial {D}^T}{\partial X}\overset{\frown }{x} $$
(2)
When the offset in any dimension is greater than 1/2, the interpolation center has shifted to a neighboring point, and the candidate extremum point is discarded. Finally, in order to enhance noise robustness and improve stability, the Hessian matrix is used to eliminate unstable edge response points.
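The edge-response elimination can be sketched with Lowe's trace-determinant ratio test on the 2×2 Hessian of D, with the customary threshold r = 10; the second derivatives dxx, dyy, dxy are assumed precomputed by finite differences:

```python
def is_stable_point(dxx, dyy, dxy, r=10.0):
    """Reject edge-like extrema using the 2x2 Hessian of D.

    A point is kept only when trace(H)^2 / det(H) < (r + 1)^2 / r,
    i.e. the two principal curvatures are not too dissimilar
    (edges have one large and one small curvature).
    """
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:  # curvatures of opposite sign: discard outright
        return False
    return tr * tr / det < (r + 1) ** 2 / r
```

A blob-like point (similar curvatures) passes the test, while a strong edge (very unequal curvatures) is rejected.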
3. Assign the main direction to feature points. After the feature points are determined, a main direction must be determined for each feature point so that it has rotation invariance. The calculation formula for the main direction of a feature point is [13]
$$ \left\{\begin{array}{c}m\left(x,y\right)=\sqrt{{\left(L\left(x+1,y\right)-L\left(x-1,y\right)\right)}^2+{\left(L\left(x,y+1\right)-L\left(x,y-1\right)\right)}^2}\\ {}\theta \left(x,y\right)={\tan}^{-1}\left(\frac{L\left(x,y+1\right)-L\left(x,y-1\right)}{L\left(x+1,y\right)-L\left(x-1,y\right)}\right)\end{array}\right. $$
(3)
Where m(x, y) denotes the gradient modulus value of the feature point (x, y), θ(x, y) denotes the gradient direction of the feature point, and L(x, y) denotes the Gaussian-smoothed image at the scale on which the feature point lies. For each feature point, a histogram is used to accumulate the gradient distribution of the pixels in the neighborhood of the feature point; its peak determines the main direction of the feature point.
In addition, to compensate for the instability of feature points lacking affine invariance, the gradient amplitude of each sampling point is weighted in the gradient histogram; the weight of each sampling point is determined jointly by its gradient modulus value and a Gaussian window.
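The histogram voting of Eq. (3), with the Gaussian weighting just described, can be sketched in numpy; the window radius, bin count, and Gaussian σ below are illustrative choices:

```python
import numpy as np

def main_orientation(L, x, y, radius=4, nbins=36, sigma=1.5):
    """Dominant gradient direction of a keypoint, in degrees.

    L is the Gaussian-smoothed image at the keypoint's scale; gradients use
    the finite differences of Eq. (3); each sample votes with its gradient
    modulus times a Gaussian window weight.
    """
    hist = np.zeros(nbins)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            i, j = y + dy, x + dx
            if not (1 <= i < L.shape[0] - 1 and 1 <= j < L.shape[1] - 1):
                continue  # skip samples whose finite differences leave the image
            gx = L[i, j + 1] - L[i, j - 1]
            gy = L[i + 1, j] - L[i - 1, j]
            m = np.hypot(gx, gy)                       # gradient modulus m(x, y)
            theta = np.arctan2(gy, gx) % (2 * np.pi)   # direction in [0, 2*pi)
            w = np.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2))
            hist[int(theta / (2 * np.pi) * nbins) % nbins] += m * w
    return 360.0 / nbins * np.argmax(hist)  # peak bin -> main direction
```

On a pure horizontal intensity ramp every gradient points along +x, so the peak bin is the first one and the main direction is 0°.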
4. Construct the keypoint descriptor. Through the calculation of the above three steps, each detected feature point carries three pieces of information ((x, y), σ, θ): position, scale, and direction. Because the descriptor of a feature point is tied to its scale, descriptor generation is carried out in the Gaussian pyramid level of the corresponding scale [14]. The neighborhood centered on the feature point is divided into B_{P} × B_{P} subblocks, and the edge size of each subblock is mσ pixels. The construction process is as follows: first, with the feature point as center, the image in the \( m\sigma \left({B}_p+1\right)\sqrt{2}\times m\sigma \left({B}_p+1\right)\sqrt{2} \) neighborhood is rotated by the angle θ to align with the main direction of the feature point. Then, with the feature point as center, an image block of size mσB_{p} × mσB_{p} is selected and divided into B_{P} × B_{P} subblocks; a gradient histogram accumulates the gradients of all pixels in each subblock over eight directions, forming the seed points, which together yield a 128-dimensional feature vector. In addition, in the process of constructing the descriptor, all pixels in the neighborhood of the feature point are Gaussian-weighted, and the resulting vector is normalized twice in order to remove the influence of illumination and other factors [15].
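The two-pass normalization at the end of the descriptor step can be sketched as follows; the 0.2 clamp is the value used in Lowe's original paper:

```python
import numpy as np

def normalize_descriptor(v, clamp=0.2):
    """Two-pass normalization of a 128-D SIFT descriptor.

    First normalize to unit length (invariance to linear illumination
    change), then clamp large components at `clamp` and renormalize to
    reduce the influence of non-linear illumination effects.
    """
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    if n > 0:
        v = v / n          # first normalization
    v = np.minimum(v, clamp)  # suppress dominant gradient components
    n = np.linalg.norm(v)
    return v / n if n > 0 else v  # second normalization
```

The first pass makes the descriptor invariant to a global brightness scaling, which is why a uniformly scaled input produces an identical output.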
SURF algorithm
This paper uses SURF feature-based image splicing. The basic process is image preprocessing, SURF feature matching, transformation parameter estimation, global splicing with LM optimization, and image fusion.
Image preprocessing
Because of the uneven illumination of the original remote sensing image, the background gray level of the image is uneven. Therefore, to achieve the best binarization result, the image is processed block by block: a threshold is determined from the valley of each block's gray histogram, and each block of the image is then binarized.
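A block-wise binarization along these lines can be sketched as follows; the per-block threshold here uses the block mean as a simple stand-in for the histogram-valley threshold described above:

```python
import numpy as np

def blockwise_binarize(image, block=32):
    """Binarize an unevenly illuminated image block by block.

    Each block gets its own threshold, so a bright region and a dark
    region of the same scene are both separated sensibly. The block mean
    is an illustrative threshold; the paper derives it from the valley
    of the block's gray histogram.
    """
    out = np.zeros(image.shape, dtype=np.uint8)
    h, w = image.shape
    for i in range(0, h, block):
        for j in range(0, w, block):
            tile = image[i:i + block, j:j + block]
            t = tile.mean()  # stand-in for the gray-histogram valley
            out[i:i + block, j:j + block] = (tile > t).astype(np.uint8)
    return out
```

A global threshold would fail on an image with a strong illumination gradient; the per-block threshold adapts to the local background level.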
SURF feature matching
The SURF algorithm follows the idea of the SIFT algorithm. To improve the speed of feature extraction and matching, SURF uses integral images and approximate box filters while retaining scale and rotation invariance as well as good distinctiveness and robustness. SURF feature matching is divided into two main steps: SURF feature extraction and SURF feature matching.

1) SURF feature extraction

Construct the scale space and detect extreme points. A convolution operation is performed on the integral image of the input image using approximate Gaussian (box) filters scaled up layer by layer to obtain a pyramid of response images. The determinant of the Hessian matrix is calculated to obtain the feature point response value. By non-maximal suppression, each pixel in scale space is compared with its 26 neighbors in the same layer and the two adjacent layers to obtain local maximum and minimum points.
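The 26-neighbor comparison can be sketched as follows; `dog` is a stack of response layers indexed (scale, row, col), and the function name is illustrative:

```python
import numpy as np

def is_scale_space_extremum(dog, s, i, j):
    """Check whether voxel (s, i, j) of a response stack is a local
    extremum over its 26 neighbors: 8 in the same layer plus 9 in each
    of the two adjacent layers."""
    v = dog[s, i, j]
    cube = dog[s - 1:s + 2, i - 1:i + 2, j - 1:j + 2]  # 3x3x3 neighborhood
    # require a strict, unique maximum or minimum within the cube
    is_max = v >= cube.max() and (cube == cube.max()).sum() == 1
    is_min = v <= cube.min() and (cube == cube.min()).sum() == 1
    return is_max or is_min
```

The uniqueness check discards plateaus, where the center value ties with a neighbor.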

The Taylor expansion of the three-dimensional quadratic function is used for surface fitting to locate the interest points accurately.

The main direction of the feature point is determined. The main direction of the feature points in the SURF algorithm is determined based on Haar wavelet response and other information in the neighborhood of the feature points.

Feature description vector generation. On the scale image of the feature point, the coordinate axis is rotated to align with the main direction of the feature point; then, with the feature point as center, a feature description vector is constructed similarly to the SIFT method, yielding a 64-dimensional feature vector.

2) SURF feature matching
Here, the Euclidean distance between the SURF feature vectors of two feature points is used as the similarity measure for feature matching. Let the feature vectors of the feature points P_{A} and P_{B} be D_{vA} and D_{vB}, respectively. The distance between the two points is defined as [16]
$$ D\left({P}_A,{P}_B\right)=\sqrt{\sum \limits_{i=1}^n{\left({D}_{vAi}-{D}_{vBi}\right)}^2} $$
(4)
The process is as follows: a matching-point search algorithm finds the feature points with the minimum and second-minimum distances to the query point. When the ratio of the minimum distance to the second-minimum distance is smaller than a preset threshold, the match is considered successful.
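The distance of Eq. (4) combined with the nearest/second-nearest ratio test can be sketched as:

```python
import numpy as np

def ratio_match(desc_a, desc_b, ratio=0.7):
    """Match descriptor sets by Euclidean distance with a ratio test.

    For each descriptor in desc_a, find its nearest and second-nearest
    neighbors in desc_b (Eq. (4)); accept the match only when the nearest
    distance is clearly smaller than the second-nearest. The 0.7 ratio
    threshold is an illustrative choice.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # Eq. (4) to every candidate
        j, k = np.argsort(dists)[:2]                # nearest, second nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches
```

Ambiguous points, whose two best candidates are nearly equidistant, are discarded rather than matched incorrectly.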
The transformation parameter estimation
The perspective transformation between the image points X(x, y, 1)^{T} and X^{'}(x^{'}, y^{'}, 1)^{T} is expressed as
$$ {X}^{\prime}\sim HX=\left[\begin{array}{ccc}{h}_0& {h}_1& {h}_2\\ {}{h}_3& {h}_4& {h}_5\\ {}{h}_6& {h}_7& {h}_8\end{array}\right]X $$
(5)
Since the extracted initial matching points generally contain some mismatched pairs, the RANSAC algorithm with perspective-transformation constraints is first used to purify the matching pairs, and the least-squares method is then used to estimate the transformation matrix parameters. The process is: randomly select six point pairs from the initial matches and compute the transformation matrix H linearly (four pairs are the minimum required to determine a homography); then compute the distance for the remaining pairs. A pair with distance below the threshold is defined as an "inner point" (inlier), and one above the threshold as an "outer point" (outlier). The iteration is repeated, and the set with the most inliers gives the purified correct point pairs.
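This purification step can be sketched in numpy: a direct linear transform (DLT) solve for H from sampled pairs, an inlier count by reprojection distance, and a final least-squares refit on the inlier set. The sketch samples four pairs per iteration (the minimum determining a homography); names, iteration count, and threshold are illustrative:

```python
import numpy as np

def fit_homography(src, dst):
    """DLT: solve H (up to scale) from >= 4 point correspondences."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # each pair contributes two rows of the homogeneous system A h = 0
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)        # null-space vector = flattened H
    return H / H[2, 2]

def ransac_homography(src, dst, n_iter=200, thresh=3.0, seed=0):
    """RANSAC: sample minimal sets, keep the model with the most inliers,
    then refit H by least squares on all inliers."""
    rng = np.random.default_rng(seed)
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    best_inliers = np.zeros(len(src), bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        p = np.hstack([src, np.ones((len(src), 1))]) @ H.T
        proj = p[:, :2] / p[:, 2:3]                   # dehomogenize
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # final least-squares estimate on the purified pairs
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

With a handful of gross outliers injected, the outliers fall outside the distance threshold and the refit on inliers recovers the true transformation.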
Global splicing of LM optimization
In order to reduce the cumulative error caused by photographic attitude, terrain relief, and other factors [17], a globally LM-optimized splicing strategy is adopted in this paper. Based on the idea of minimizing the variance of the mean-square distance between same-name (corresponding) points, the method dynamically adjusts the transformation parameters of each image during splicing to reduce the accumulated error and achieve global optimization.
The principle of this optimization strategy is to first select the reference plane, read the single images in turn, and then simultaneously optimize the transformation matrix of each image to the reference plane so that the error of each image's transformation to the reference plane is minimized. For a pair of same-name feature points (x_{i}, x_{j}), x_{i} is projected onto the reference plane and then onto the adjacent image, giving the coordinate x_{i}'; its distance difference from x_{j} is:
$$ {d}_{ij}=\left\Vert {x}_j-{H_j}^{-1}{H}_i{x}_i\right\Vert $$
(6)
Where H_{i} is the transformation matrix projecting the feature points of image I_{i} onto the reference plane, and H_{j}^{−1} is the transformation matrix from the reference plane back to image I_{j}.
According to formula (6), summing the distance differences over all images that overlap the current image I_{i} gives the overall optimization objective [18]:
$$ e=\sum \limits_{i=1}^n\sum \limits_{j\in M(i)}\sum \limits_{k\in F\left(i,j\right)}f{\left({d}_{ij}^k\right)}^2 $$
(7)
Where n is the total number of images; F(i, j) is the set of matching pairs between image I_{i} and image I_{j}; \( {d}_{ij}^k \) is the distance difference of the k-th matching pair between image I_{i} and image I_{j}; M(i) is the set of images that overlap I_{i}; and f(x) is an error function, expressed as follows [19]:
$$ f(x)=\left\{\begin{array}{cc}\left|x\right|& \left|x\right|<{x}_{\mathrm{max}}\\ {}{x}_{\mathrm{max}}& \left|x\right|\ge {x}_{\mathrm{max}}\end{array}\right. $$
(8)
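Equations (7) and (8) amount to a sum of squared truncated residuals; a minimal sketch (flattening the triple sum into one list of residuals for simplicity):

```python
def robust_error(d, d_max=10.0):
    """Truncated error function of Eq. (8): |d| below the cutoff,
    d_max once the residual reaches it, limiting outlier influence."""
    return abs(d) if abs(d) < d_max else d_max

def total_error(residuals, d_max=10.0):
    """Global objective of Eq. (7): sum of squared truncated residuals
    over all matching pairs."""
    return sum(robust_error(d, d_max) ** 2 for d in residuals)
```

The truncation caps the contribution of any single bad match at x_max², so one gross mismatch cannot dominate the LM optimization.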
The optimal transformation matrix is determined iteratively using the LM algorithm [20]. The basic process is as follows:
1) Perform SURF matching between all images, obtain the matching pairs, and establish an image adjacency table.
2) Automatically select the reference plane. This paper uses the image I_{i} with the largest weight in the image sequence as the reference plane, where the weight is defined as [21]:
$$ T=N+\frac{n}{S} $$
(9)
Where N is the number of images sharing same-name point pairs with image I_{i}; S is the area of the overlap region; and n is the number of matching points in the overlap region. The larger the value of T, the larger the weight of image I_{i} in the image sequence, and the image with the largest T is taken as the reference plane.
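Equation (9) and the reference-image selection can be sketched directly; the dictionary layout mapping image ids to (N, n, S) tuples is our own illustration:

```python
def reference_weight(n_overlaps, n_matches, overlap_area):
    """Weight T = N + n / S of Eq. (9) for choosing the reference image."""
    return n_overlaps + n_matches / overlap_area

def pick_reference(images):
    """Return the id of the image with the largest weight T.

    `images` maps an image id to a (N, n, S) tuple: number of overlapping
    images, number of matching points, and overlap area.
    """
    return max(images, key=lambda k: reference_weight(*images[k]))
```

The image that overlaps the most neighbors with the densest matches per unit area becomes the reference plane.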
3) The transformation matrices of adjacent images within a flight strip are chained successively to establish a coarse registration between each photo and the reference image, which serves as the initial value for the subsequent LM iteration.
4) The best adjacent image is added in turn according to the image adjacency table. To minimize the error when each image is transformed to the reference plane, the LM algorithm is used to optimize the transformation matrix of each image to the reference plane.
5) Continue adding images and optimizing the transformation matrices until all images have been added; finally, output the result.
Image fusion
Once the geometric relationship between images is determined, in order to keep the colors visually consistent and the transitions smooth, this paper uses a weighted-average fusion method based on a Gaussian model to stack multiple images into a panoramic image.
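The paper does not detail its Gaussian weighting, so the following is a minimal sketch under the common assumption that each image's weight peaks at its center and decays toward the borders, which hides seams in the overlap regions:

```python
import numpy as np

def gaussian_weight_map(h, w, sigma_frac=0.4):
    """Per-pixel weight peaking at the image center (separable Gaussian).

    sigma_frac scales sigma relative to image size; it is an illustrative
    parameter, not taken from the paper.
    """
    ys = np.arange(h) - (h - 1) / 2
    xs = np.arange(w) - (w - 1) / 2
    gy = np.exp(-ys**2 / (2 * (sigma_frac * h) ** 2))
    gx = np.exp(-xs**2 / (2 * (sigma_frac * w) ** 2))
    return np.outer(gy, gx)

def blend(images, weights):
    """Weighted average of co-registered images: sum(w_i * I_i) / sum(w_i)."""
    num = sum(w * im for w, im in zip(weights, images))
    den = sum(weights)
    return num / np.maximum(den, 1e-12)  # guard against zero total weight
```

Pixels near an image's border carry little weight, so the transition between overlapping images is dominated by whichever image's center is closer, producing a smooth visual blend.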