In the spatial domain, video object segmentation amounts to detecting and segmenting the moving target, that is, separating independent regions of interest or meaning in a video sequence from the background. Target segmentation is the most basic part of video motion pose recognition: if the target can be correctly detected and segmented in every frame, correct recognition of the pose is well supported. However, target detection is subject to many unknown factors, and suppressing these external interferences often comes at the cost of real-time performance.

The motion of a two-dimensional image is the projection onto the imaging plane of the three-dimensional velocity vectors of the visible points in the scene. An estimate of the instantaneous variation of a point across a sequence of consecutive images is generally called an optical flow field or velocity field. Optical flow computation methods are generally divided into five types: gradient-based, energy-based, matching-based, phase-based, and neurodynamic methods. Among them, the gradient-based methods, which compute the optical flow field from image gray values, are the most widely studied: assuming the gray value of a moving point remains unchanged between frames, the optical flow constraint equation is derived. However, since this single equation does not uniquely determine the optical flow, additional constraints must be introduced. According to the constraint introduced, gradient-based methods divide into two categories, global constraint methods and local constraint methods; typical algorithms are the Horn-Schunck algorithm and the Lucas-Kanade algorithm. Of the two, the Lucas-Kanade algorithm is considerably better in accuracy and speed and has strong noise resistance. Its calculation is described in detail below.

Assuming that the point *m* = (*x*, *y*)^{T} on the image has gray value *I*(*x*, *y*, *t*) at time *t*, then after a time interval dt the gray value of the corresponding point is *I*(*x* + dx, *y* + dy, *t* + dt). When dt → 0, the gray values of the two points can be considered unchanged, that is,

$$ I\left(x+\mathrm{d}x,\;y+\mathrm{d}y,\;t+\mathrm{d}t\right)=I\left(x,y,t\right) $$

(1)

If the gray value of the image changes slowly with *x*, *y*, *t*, then the left side of Eq. (1) can be expanded by Taylor series:

$$ I\left(x+\mathrm{d}x,\;y+\mathrm{d}y,\;t+\mathrm{d}t\right)=I\left(x,y,t\right)+\frac{\partial I}{\partial x}\mathrm{d}x+\frac{\partial I}{\partial y}\mathrm{d}y+\frac{\partial I}{\partial t}\mathrm{d}t+\varepsilon $$

(2)

Here ε represents the terms of second and higher order. Since dt → 0, ε in the above equation can be ignored, so that

$$ \frac{\partial I}{\partial x} dx+\frac{\partial I}{\partial y} dy+\frac{\partial I}{\partial t} dt=0 $$

(3)

Let \( u=\frac{dx}{dt},v=\frac{dy}{dt} \) represent the optical flow in the x and y directions, and let \( {I}_x=\frac{\partial I}{\partial x},{I}_y=\frac{\partial I}{\partial y},{I}_t=\frac{\partial I}{\partial t} \) represent the partial derivatives of the gray value with respect to *x*, *y*, *t*, respectively. Dividing by dt, Eq. (3) can then be expressed as

$$ {I}_xu\kern0.5em +\kern0.5em {I}_yv\kern0.5em +\kern0.5em {I}_t=0 $$

(4)

The above formula is the basic optical flow equation. As mentioned above, it contains two unknowns, *u* and *v*, which cannot be uniquely determined by one equation alone, so an additional constraint equation must be found. We use the locally constrained Lucas-Kanade algorithm to add this constraint and apply a windowed weighting scheme to the optical flow computation, obtaining the optical flow between two adjacent frames. It is assumed that all points in a small region centered on point p share the same optical flow, with different points in the region given different weights, so that computing the optical flow becomes minimizing Eq. (5) to estimate the velocity (*u*, *v*):

$$ \sum_{\left(x,y\right)\in \Omega }{W}^2\left(x,y\right){\left({I}_xu+{I}_yv+{I}_t\right)}^2 $$

(5)

Here Ω represents the neighborhood centered on point p, and *W*^{2}(*x*, *y*) is the window weight function assigning a weight to each pixel in the neighborhood. A Gaussian function is usually used, so that the closer a pixel is to p, the larger its weight; pixels in the central region of the neighborhood therefore have more influence than peripheral ones. Lucas-Kanade assumes that the motion vector remains constant over the small spatial neighborhood Ω and then uses weighted least squares to obtain Eqs. (6) and (7). Solving these two equations for the velocity (*u*, *v*) resolves the problem that the optical flow constraint equation alone cannot determine two unknowns.

$$ u\sum \limits_{\left(x,y\right)\in \Omega }{W}^2\left(x,y\right){I}_x{I}_y+v\sum \limits_{\left(x,y\right)\in \Omega }{W}^2\left(x,y\right){I}_y^2+\sum \limits_{\left(x,y\right)\in \Omega }{W}^2\left(x,y\right){I}_t{I}_y=0 $$

(6)

$$ u\sum \limits_{\left(x,y\right)\in \Omega }{W}^2\left(x,y\right){I}_x^2+v\sum \limits_{\left(x,y\right)\in \Omega }{W}^2\left(x,y\right){I}_x{I}_y+\sum \limits_{\left(x,y\right)\in \Omega }{W}^2\left(x,y\right){I}_t{I}_x=0 $$

(7)
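As a concrete illustration, the weighted least-squares solution of Eqs. (6) and (7) can be sketched in a few lines of NumPy. This is a minimal single-point sketch, not the authors' implementation: the window size, the Gaussian width σ, and the use of central differences for the spatial gradients are illustrative assumptions.

```python
import numpy as np

def lucas_kanade(I1, I2, p, win=7, sigma=2.0):
    """Estimate the optical flow (u, v) at pixel p = (row, col) between two
    grayscale frames I1 and I2 by solving the weighted normal equations
    of Eqs. (6)-(7) over a (win x win) neighborhood Omega."""
    # Spatial gradients of I1 (central differences) and the temporal gradient.
    Iy_full, Ix_full = np.gradient(I1.astype(float))
    It_full = I2.astype(float) - I1.astype(float)

    r, c = p
    h = win // 2
    Ix = Ix_full[r - h:r + h + 1, c - h:c + h + 1].ravel()
    Iy = Iy_full[r - h:r + h + 1, c - h:c + h + 1].ravel()
    It = It_full[r - h:r + h + 1, c - h:c + h + 1].ravel()

    # Gaussian window weights W^2(x, y): pixels closer to p count more.
    yy, xx = np.mgrid[-h:h + 1, -h:h + 1]
    W2 = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)).ravel()

    # Weighted least squares: solve  A^T W^2 A [u v]^T = -A^T W^2 It.
    A = np.stack([Ix, Iy], axis=1)
    AtWA = A.T @ (W2[:, None] * A)
    rhs = -A.T @ (W2 * It)
    u, v = np.linalg.solve(AtWA, rhs)
    return u, v
```

For a smooth image translated by one pixel in the x direction, the estimate at a point with sufficient gradient recovers a flow close to (1, 0).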

In practice, this method is usually combined with a Gaussian pyramid. Let the original image be *I*, with *I*_{0} = *I* denoting the 0th layer (the original image). Each pyramid layer is created recursively from the layer below it, and in the actual calculation the algorithm proceeds from the top layer down; the experimental part of this paper uses four layers. Once the optical flow increment of a layer has been computed, it is added to that layer's initial value and projected down, serving as the initial value for the next layer's optical flow computation. This process continues until the optical flow of the original image is estimated.

After skin color detection, a series of connected regions is obtained; keeping only these connected regions as black pixels and excluding everything else as white yields a new binary image. Although the Lucas-Kanade method finds and masks out most of the global motion regions, some global motion regions are still missed, and after skin color detection these regions are inevitably mistaken for skin, becoming noise. The reason is that, in addition to the target skin region, the image contains background regions similar in color to skin, such as the faces of the audience in the auditorium, clothing, and other facilities in the scene. To further reduce the interference of these falsely detected skin color regions on the segmentation result, this paper exploits two characteristics of diving competition video: the player usually appears near the center of the frame, and the player's connected region is large in proportion to those formed by other noise. A projection method is then used to determine the rectangular frame bounding the moving target.
The specific algorithm is as follows: the connected regions in the noisy image are projected in the vertical direction, forming m projection segments, as shown in Fig. 1. The segment of length L_{v} corresponding to the target is then selected to determine the target rectangle.
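The projection step above can be sketched as follows. This is an illustrative reading of the algorithm, not the paper's code: the tie-breaking rule (choosing the segment closest to the frame center, per the centrality assumption stated above) and the gap threshold are assumptions.

```python
import numpy as np

def target_rect_by_projection(mask, min_gap=1):
    """Project a binary mask onto the horizontal axis (vertical projection),
    split the projection into m contiguous segments separated by empty
    columns, and pick the segment closest to the image center as the target.
    Returns (col_start, col_end) inclusive, or None if the mask is empty."""
    proj = mask.sum(axis=0)              # vertical projection: one value per column
    cols = np.flatnonzero(proj > 0)
    if cols.size == 0:
        return None
    # Split the non-empty columns wherever the gap exceeds min_gap.
    breaks = np.flatnonzero(np.diff(cols) > min_gap)
    segments = np.split(cols, breaks + 1)
    center = mask.shape[1] / 2.0
    # The player is assumed to lie near the center of the frame.
    best = min(segments, key=lambda s: abs((s[0] + s[-1]) / 2.0 - center))
    return int(best[0]), int(best[-1])
```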

The main goal of this paper is to identify the types of poses athletes perform in a complex environment. In representing the human motion state, this study therefore extracts key features of the overall shape and motion of the human body. However, neither appearance-based shape features alone nor motion features alone are sufficient to characterize a person's motion state. This paper therefore adopts the idea of feature fusion, representing sports postures with multiple fused features. The selected features are described in detail below.

Different actions give different skin tone distributions within the rectangular box. To capture this, the rectangular frame is divided into four blocks in a 2 × 2 ("田", Tian) pattern, numbered in order from left to right and from top to bottom. The average of R, G, and B is computed for each block, as shown in Eq. (8), where \( \overline{{\mathrm{f}}_{bR}} \), \( \overline{{\mathrm{f}}_{bG}} \), \( \overline{{\mathrm{f}}_{bB}} \) denote the averages of R, G, and B in the *b*th block. Here *b* = {1, 2, 3, 4}, *N* is the number of pixels in each block, and *R*_{i}, *G*_{i}, *B*_{i} are the *R*, *G*, and *B* values of the *i*th pixel.

$$ \left\{\begin{array}{c}\overline{{\mathrm{f}}_{bR}}={\sum}_{i=1}^N\frac{R_i}{N}\\ {}\overline{{\mathrm{f}}_{bG}}={\sum}_{i=1}^N\frac{G_i}{N}\\ {}\overline{{\mathrm{f}}_{bB}}={\sum}_{i=1}^N\frac{B_i}{N}\end{array}\right. $$

(8)

The color features extracted here are not, as is usual, a color histogram over the entire target, but only the average of the three primary color channels in each block. This is because when the same athlete performs different postures, the color histograms within the rectangular frames locating the target are very similar, so such color features contribute little to posture recognition. Another advantage of this feature is a much lower dimension: a color histogram over the 0–255 gray levels would contribute 256 dimensions per channel, increasing the amount of computation. Figure 2 shows three frames of the diving process.
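The block-average feature of Eq. (8) can be computed directly. This is a minimal sketch assuming the rectangle is passed as an (H, W, 3) RGB array; the function name is illustrative.

```python
import numpy as np

def block_color_means(rgb):
    """Divide the target rectangle into a 2x2 ('Tian') grid, numbered left to
    right and top to bottom, and return the mean R, G, B of each block
    (Eq. 8) as a 12-dimensional feature vector."""
    H, W, _ = rgb.shape
    feats = []
    for r0, r1 in ((0, H // 2), (H // 2, H)):
        for c0, c1 in ((0, W // 2), (W // 2, W)):
            block = rgb[r0:r1, c0:c1].reshape(-1, 3).astype(float)
            feats.extend(block.mean(axis=0))   # mean R, G, B of this block
    return np.array(feats)
```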

The next process is image grayscale processing. First, a mapping function from the color image to gray scale is defined; this function must preserve contrast, the continuity and consistency of the mapping, and the order of brightness, and it may be linear or nonlinear. Second, the image is divided into many superpixel blocks using a traditional k-means-based segmentation algorithm for subsequent processing. Third, saliency maps are generated for the color image and for the parameterized grayscale image: using the blocked and original color images, a color image saliency detection algorithm yields the color saliency map, while the parameterized grayscale image produced in the first step is passed through the same saliency detection algorithm to obtain its saliency map. Fourth, an energy function is built from the saliency map of the color image and that of the parameterized gray image, and the optimal parameter value is obtained by optimizing this energy function. Finally, the optimal value is substituted into the parameterized gray image to obtain the final gray image. The results obtained are shown in Fig. 3.
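The parameter search of the fourth step can be illustrated with a deliberately simplified sketch. The paper does not give the form of the parameterized mapping or of the energy function; here a hypothetical one-parameter linear mapping g = θ·R + (1−θ)/2·(G+B) is grid-searched so that gray-level contrast between superpixel blocks best matches their color contrast. All names and the energy form are illustrative assumptions.

```python
import numpy as np

def best_gray_mapping(region_colors, thetas=np.linspace(0, 1, 101)):
    """Grid-search a one-parameter color-to-gray mapping so that the gray
    contrast between regions matches their color contrast (a stand-in for
    the energy optimization of step 4).
    region_colors: (n, 3) array of mean RGB values per superpixel block."""
    C = np.asarray(region_colors, dtype=float)
    i, j = np.triu_indices(len(C), k=1)
    # Target contrast: Euclidean distance between region colors, rescaled.
    d_color = np.linalg.norm(C[i] - C[j], axis=1) / np.sqrt(3.0)
    best_theta, best_E = None, np.inf
    for t in thetas:
        g = t * C[:, 0] + (1 - t) / 2 * (C[:, 1] + C[:, 2])
        E = np.sum((np.abs(g[i] - g[j]) - d_color) ** 2)  # energy function
        if E < best_E:
            best_theta, best_E = t, E
    return best_theta
```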

Let x_{t} be the value of a given pixel at time *t*. If x_{t} matches the existing *j*th Gaussian distribution, the weights of the Gaussian distributions are updated as:

$$ {\mathrm{w}}_{i,t}=\left(1-\beta \right){\mathrm{w}}_{i,t-1}+\beta {\mathrm{M}}_{i,t},i=1,2,\dots, K $$

(9)

$$ {\mathrm{M}}_{i,t}=\left\{\begin{array}{c}1,i=j\\ {}0,i\ne j\end{array}\right. $$

(10)

Among them, β is the weight update rate. The above equation shows that only the weight of the Gaussian distribution matching x_{t} is increased, while the remaining weights are reduced. The parameters of the matched Gaussian distribution are updated as:

$$ \left\{\begin{array}{l}{\mu}_{j,t}=\left(1-\alpha \right){\mu}_{j,t-1}+\alpha {\mathrm{x}}_t\\ {}{\sigma}_{j,t}^2=\left(1-\alpha \right){\sigma}_{j,t-1}^2+\alpha {\left({\mathrm{x}}_t-{\mu}_{j,t-1}\right)}^T\left({\mathrm{x}}_t-{\mu}_{j,t-1}\right)\end{array}\right. $$

(11)

Among them, α is the update rate of the Gaussian distribution parameters; for Gaussian distributions that are not matched, the parameters remain unchanged. In establishing the background model, we set the number of Gaussian distributions describing each pixel to K = 3. The background model is initialized first, with initial weights *w*_{1, 0} = 1, *w*_{2, 0} = 0, *w*_{3, 0} = 0. The pixels of the first frame initialize the mean of the first Gaussian distribution, and the means of the remaining Gaussian distributions are 0. The standard deviation of each model takes a larger value σ_{i, 0} = 30, the weight update rate is β = 0.33, the learning rate is α = 0.7, and the threshold is T = 0.7. If no Gaussian distribution is found to match x_{t} at detection time, the Gaussian distribution with the lowest priority is removed, a new Gaussian distribution is introduced from x_{t} and assigned a smaller weight and a larger variance, and the weights are then normalized.
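The per-pixel update of Eqs. (9)–(11) with the parameters above can be sketched as follows. The matching test (within 2.5 standard deviations of a component mean) and the replacement weight of 0.05 are common conventions, not values stated in the text; they are assumptions of this sketch.

```python
import numpy as np

K = 3          # Gaussians per pixel
BETA = 0.33    # weight update rate beta (Eq. 9)
ALPHA = 0.7    # parameter learning rate alpha (Eq. 11)

def update_pixel_model(x, w, mu, sigma2, match_thresh=2.5):
    """One Gaussian-mixture update for a single grayscale pixel value x.
    w, mu, sigma2 are length-K arrays of weights, means, and variances.
    A component matches if x lies within match_thresh standard deviations
    of its mean (an assumed matching rule)."""
    d = np.abs(x - mu) / np.sqrt(sigma2)
    matched = np.flatnonzero(d < match_thresh)
    if matched.size:
        j = matched[np.argmax(w[matched])]            # best-matching component
        M = np.zeros(K)
        M[j] = 1.0
        w = (1 - BETA) * w + BETA * M                 # Eq. (9): only w_j grows
        mu_old = mu[j]
        mu[j] = (1 - ALPHA) * mu_old + ALPHA * x      # Eq. (11), mean
        sigma2[j] = (1 - ALPHA) * sigma2[j] + ALPHA * (x - mu_old) ** 2
    else:
        # No match: replace the lowest-priority (smallest w/sigma) component
        # with a new Gaussian at x, small weight, large variance.
        j = np.argmin(w / np.sqrt(sigma2))
        mu[j], sigma2[j], w[j] = x, 30.0 ** 2, 0.05
        w = w / w.sum()                               # renormalize weights
    return w, mu, sigma2
```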

First, Gaussian mixture background modeling and the background subtraction method are combined to detect and extract the target human body region, and morphological operators are used to deal with the noise and holes in the foreground image. By performing an opening followed by a closing, we remove isolated noise points, fill the holes in the target area, and smooth the contour of the human body. Morphological processing leaves the overall position and shape of the target unchanged, although details are easily destroyed when the target is relatively small. The motion region extraction algorithm is shown in Fig. 4.
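The opening-then-closing step can be sketched with elementary 3×3 binary morphology. This is a minimal pure-NumPy sketch; the 3×3 square structuring element is an illustrative assumption, not a choice stated in the text.

```python
import numpy as np

def dilate(m):
    """3x3 binary dilation: a pixel becomes 1 if any 8-neighbor (or itself) is 1."""
    p = np.pad(m, 1)
    out = np.zeros_like(m)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out |= p[1 + dr:1 + dr + m.shape[0], 1 + dc:1 + dc + m.shape[1]]
    return out

def erode(m):
    """3x3 binary erosion, by duality with dilation of the complement."""
    return ~dilate(~m)

def clean_foreground(fg):
    """Opening (erode then dilate) removes isolated noise points; the
    following closing (dilate then erode) fills small holes in the target,
    smoothing the extracted body contour."""
    m = fg.astype(bool)
    opened = dilate(erode(m))
    return erode(dilate(opened))
```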

In order to improve computational efficiency, the contour image of each player is divided equally into h × w sub-blocks that do not overlap each other. The normalized value of each sub-block is then computed as \( {N}_i=\frac{b(i)}{mv},i=1,2,\dots, h\times w \), where *b*(*i*) is the number of foreground pixels in the *i*th block and mv is the maximum of all *b*(*i*). Spatially, the sportsman's outline in the *t*th frame is described by *f*_{t} = [*N*_{1}, *N*_{2}, … , *N*_{h × w}], and the player's outline in the entire video is correspondingly expressed as vf = {*f*_{1}, *f*_{2}, … , *f*_{T}}. In fact, the original player outline representation V_{r} can be regarded as a special case of the block-based feature with a block size of 1 × 1. Because the distance and angle of the person from the camera differ across the video sequence, the size of the human body varies greatly, so the contour must be normalized before the contour feature is extracted: after the target area is extracted, its outline is mapped to a uniform height *H* and the width is scaled accordingly, as shown in Fig. 5.
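The block-normalized contour descriptor f_t can be computed directly from a binary silhouette. This is a minimal sketch; the default grid size is an illustrative assumption.

```python
import numpy as np

def contour_block_feature(silhouette, h=4, w=4):
    """Divide a binary silhouette into h x w non-overlapping sub-blocks and
    return f_t = [N_1, ..., N_{h*w}], where N_i = b(i)/mv: the foreground
    pixel count of block i normalized by the maximum count over all blocks."""
    H, W = silhouette.shape
    bh, bw = H // h, W // w
    # Sum foreground pixels within each (bh x bw) block.
    counts = silhouette[:bh * h, :bw * w].reshape(h, bh, w, bw).sum(axis=(1, 3))
    b = counts.ravel().astype(float)
    mv = b.max()
    return b / mv if mv > 0 else b
```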