Buffer evaluation model and scheduling strategy for video streaming services in 5G-powered drone using machine learning

With regard to video streaming services over wireless networks, improving the quality of experience (QoE) has always been a challenging task. Since the arrival of the 5G era, growing attention has been paid to analyzing the experience quality of video streaming in more complex network scenarios, such as 5G-powered drone video transmission. An insufficient buffer during video stream transmission causes playback to freeze [1]. To cope with this defect, this paper proposes a buffer starvation evaluation model based on deep learning and a video stream scheduling model based on reinforcement learning. The approach uses machine learning to extract the correlation between the buffer starvation probability distribution and the traffic load, thereby obtaining explicit evaluation results for buffer starvation events and a set of resource allocation strategies that optimize long-term QoE. To deal with the noise introduced by the random environment, the model adds an internal reward mechanism to the scheduling process so that the agent can fully explore the environment. Experiments show that our framework can effectively evaluate and improve the video service quality of a 5G-powered UAV.

Buffer starvation has a non-negligible impact on service quality and user behavior in the process of real-time video streaming.
To reduce the negative impact of buffer starvation, many researchers have studied how to evaluate this event and tried to find a reasonable start-up delay configuration strategy for video stream transmission [7][8][9][10][11]. Nevertheless, there is an unavoidable difficulty in this problem: wireless video transmission is a random process governing the arrival of data packets, so it is not comprehensive enough to focus only on video streams of limited size and length. In addition, since the network states of different transmission processes are not homogeneous across locations, a fixed start-up delay configuration is no longer suitable for changing network environments [11][12][13][14]. This shows that a prefetch threshold calculation model designed for a given traffic intensity and file size distribution has severe limitations, because it ignores the uncontrollable impact of network state fluctuations [15][16][17][18][19].
The first purpose of this article is to design an evaluation model based on deep learning to calculate the buffer starvation probability under different transmission scenarios. The deep neural network can extract deep-level correlation features, which enables the model to accurately regress the buffer starvation distribution from channel information, so that starvation behavior can be effectively evaluated [20,21]. After obtaining the specific distribution through the evaluation algorithm, we adopt a reinforcement learning method to dynamically configure the start-up delay and the data packet prefetching strategy during transmission, so as to intelligently schedule the video stream transmission process [22]. The input of the model is the encoded state, and the output is a value for each possible action. The agent starts to execute the strategy from a given initial state, where it can either take the action with the maximum value or take random exploration actions [23,24]. The model is trained on channel parameter datasets collected by a 5G-powered drone and executed on mutually independent threads according to specific strategies, which not only effectively solves the problem of insufficient buffer in this scenario, but also improves the quality of video service under different wireless network environments to a certain extent [25].
The main contributions of this paper are:
• We propose a deep neural network that can accurately regress the probability distribution of buffer starvation during video stream transmission. The calculation results are used in the subsequent video stream scheduling process.
• We propose a reinforcement learning scheduling model that dynamically allocates start-up delays and calculates packet prefetching strategies. This method can greatly improve the quality of the video transmission service.
• Based on the scheduling model, we propose an internal reward mechanism to deal with random environmental noise, which also reduces the difficulty of training when rewards are sparse.
• The robustness of the model has been verified in an actual 5G-powered UAV real-time video transmission process. Experiments show that the method performs well in complex network environments.
The structure of this article is as follows. Section 2 introduces related research. Section 3 describes the regression model based on deep learning in detail. Section 4 presents the internally driven reinforcement learning model for video streaming scheduling. Section 5 shows the experimental process and results. Section 6 summarizes this work.

Buffer starvation evaluation
There are existing studies [26] related to the present work. Regarding the evaluation of buffer starvation events, Y. D. Xu, E. Altman et al. use a method based on the Ballot theorem to obtain the buffer starvation probability during fixed-size video transmission [9]. Their work proposes an explicit solution and formulates the problem as an M/D/1 queue. In addition, the authors derive a recursive method to calculate the distribution of starvation and extend it to the ON/OFF bursty arrival process. For a given start-up threshold, the method provides a fluid model to calculate the starvation probability and analyzes how the prefetching threshold affects starvation behavior.

Video streaming scheduling
As for the transmission scheduling problem, research on the trade-off between buffer starvation and start-up delay is mainly divided into bandwidth-variation models and analytical methods [27]. Xu et al. modeled the buffer starvation probability and the generating function of starvation events in [10], and then dynamically analyzed starvation behavior at the file level. When the distributions of traffic intensity and file size are known, this method can calculate the relationship between the starvation probability and the packet prefetching threshold. Despite the merits of this work, it only considers a single video stream of fixed size and length, so it has great limitations and is not practical. In [5], the authors take into account the needs of network operators and solve three problems: measuring traffic patterns, modeling the probability of buffer starvation, and using the calculation results for resource allocation. That work balances the overall QoE by trading off short-term and long-term QoE. The authors also introduce a Bayesian inference algorithm, which gives the model the ability to infer whether the input stream is short-view or long-view.

Overview
We propose a packet-level deep learning model to calculate the starvation probability and the distribution of starvation behaviors during video streaming.
Our model is based on a recurrent neural network (RNN) framework, specifically the gated recurrent unit (GRU) structure, to extract the correlation between different time series [28,29]. In the feature selection stage, we combine a spatial attention mechanism with a channel attention mechanism. This makes the model focus on the information that plays a key role in achieving the goal; that is, we only consider the parts of the network transmission channel state that assist the judgment, so that the portions of the input most helpful to decision-making are emphasized while irrelevant information is discarded [30,31]. In addition, the model uses a multi-task learning structure to share a large part of the weights for prediction, which effectively reduces the number of parameters and makes prediction more efficient. The network architecture is shown in Fig. 1, where the backbone consists of three bilateral gated recurrent unit (BiGRU) blocks [32]. A more detailed description of this network is given in the next section.

System description
The input of the network is the sequence of state vectors at different time steps of the data packet transmission process. The elements contained in the vector are summarized in Table 1, including the Poisson arrival rate, Poisson service rate, traffic intensity, etc.
The model consists of three important parts: the attention mechanism module, the bilateral gated recurrent unit (BiGRU) module and the multi-task learning module, as shown in Fig. 1. The combination of these three structures can effectively extract the correlation between different elements of the state vector at the same time step and capture the correlation between state features across time steps. Meanwhile, it can assign greater weight to elements that have a larger impact on the final buffer starvation probability, making full use of different levels of correlation to obtain better results. Finally, the model can accurately predict the starvation probability and the distribution of starvation behaviors.
Because certain parts of the channel state information play a more critical role in the final calculation result, the model uses a spatial attention mechanism. Since this attention mechanism acts on the state information at a single moment, it is also called channel-sequence attention. We use the Squeeze-and-Excitation (SE) block structure to accomplish this task. The first stage of the SE block is the squeeze operation: it compresses the features along the dimension of the state sequence into a real number that has a global receptive field to some extent, with an output dimension matching the number of feature channels and sequences. This number characterizes the global distribution of responses over the feature channels and sequences, and enables layers close to the input to obtain global receptive fields. The second stage is the excitation operation, which uses learned parameters to assign a weight to each feature sequence and channel. The recurrent part of the network uses a three-layer bilateral GRU structure. The GRU is a variant of the LSTM with only two gates (update and reset). Because of its simple structure, the GRU block makes the training phase easier to converge and avoids the vanishing gradient problem to a certain extent. The bilateral GRU block combines a forward GRU and a backward GRU to overcome the one-way structure's inability to encode backward-to-forward sequence information, so that it can capture more comprehensive semantic dependencies between different channel state sequences.
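As a concrete illustration of the squeeze-and-excitation stage described above, the following sketch shows one way the channel-sequence attention could be implemented; the use of PyTorch, the layer widths, and the reduction factor are our own illustrative assumptions rather than the exact configuration used in the model.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation style attention over per-step channel-state features.

    Input shape: (batch, seq_len, n_features). The squeeze step pools over the
    time dimension to obtain a global descriptor; the excitation step turns it
    into a per-feature weight in (0, 1) that re-scales every time step."""

    def __init__(self, n_features: int, reduction: int = 4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(n_features, max(1, n_features // reduction)),
            nn.ReLU(),
            nn.Linear(max(1, n_features // reduction), n_features),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        squeezed = x.mean(dim=1)          # squeeze: global pooling over time
        weights = self.excite(squeezed)   # excite: per-feature weights
        return x * weights.unsqueeze(1)   # re-weight every time step
```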
This module is mainly divided into three parts: the channel information extraction layer, the channel information representation layer, and the buffer starvation behavior prediction layer.
• The channel information extraction layer. After the attention mechanism is applied, the resulting information matrix is used as the input for the next step.
• The channel information representation layer. Considering that both the forward and backward directions of the video stream transmission process may carry timing information, the bilateral GRU encoding structure is used as the representation of the channel information.
• The buffer starvation behavior prediction layer. A fully connected perceptron layer produces the prediction of starvation behavior at a given moment.
After the recurrent network, the model adopts a multi-task structure: for a given channel state input, the network simultaneously outputs the buffer starvation probability value and the starvation event distribution. When multiple tasks are predicted at the same time, a large part of the weights is shared, which reduces the overall number of model parameters and makes prediction more efficient. In addition, the two tasks are highly correlated, which experiments verify to stably improve the accuracy of buffer starvation behavior prediction.
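The sketch below (again in PyTorch, with illustrative layer sizes and a hypothetical StarvationNet name, reusing the SEAttention block from the previous sketch) shows one way the shared three-layer BiGRU trunk could feed the two task heads, one regressing the starvation probability and one outputting the starvation-count distribution.

```python
import torch
import torch.nn as nn

class StarvationNet(nn.Module):
    """Shared trunk with two heads: the starvation probability (a scalar in [0, 1])
    and the starvation-count distribution over 0..J events.
    Reuses the SEAttention block from the previous sketch."""

    def __init__(self, n_features: int, hidden: int = 64, max_events: int = 10):
        super().__init__()
        self.attention = SEAttention(n_features)
        self.bigru = nn.GRU(n_features, hidden, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.prob_head = nn.Linear(2 * hidden, 1)               # starvation probability
        self.dist_head = nn.Linear(2 * hidden, max_events + 1)  # P_S(0), ..., P_S(J)

    def forward(self, states: torch.Tensor):
        x = self.attention(states)                  # (batch, seq_len, n_features)
        _, h = self.bigru(x)                        # h: (num_layers * 2, batch, hidden)
        h_last = torch.cat([h[-2], h[-1]], dim=-1)  # final forward + backward states
        p_starve = torch.sigmoid(self.prob_head(h_last)).squeeze(-1)
        dist = torch.softmax(self.dist_head(h_last), dim=-1)
        return p_starve, dist
```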

Buffer starvation behavior loss
The loss function is the sum of two parts,
$$L = \alpha L_{S} + \beta L_{D},$$
where α and β denote discount factors.
The first part of the formula is the starvation probability loss $L_{S}$. In this part, $Y_S$ represents the output value of the model that fits the buffer starvation probability, and $P_S$ is given by a formula based on the famous Ballot theorem [18]; the detailed proof can be found in [10]. During the file transfer process, starvation events may occur multiple times. Given a fixed file size $N$, the maximum number of starvation events is $J = \lfloor N / x_1 \rfloor$, where $\lfloor \cdot \rfloor$ denotes the floor function. $P_{S_j}$ represents the probability of meeting exactly $j$ starvations. The second part is therefore the distribution loss $L_{D}$ of the starvation events, computed with the cross-entropy loss against the vector $Y_D = (P_S(0), P_S(1), \ldots, P_S(J))$. We let $P_{\epsilon}(k_l)$, $P_{S_l}(k_l)$ and $P_{U_j}(k_j)$ be the probabilities of the events 'the buffer becomes empty for the first time on the entire path', 'the buffer becomes empty again after the service of a packet, given that the previous empty buffer happened at the departure of packet $k_l$', and 'the last empty buffer is observed after the departure of packet $k_j$', respectively. $P_{S_j}$ is then obtained from these probabilities in vector form, where $T$ denotes the transpose; the detailed analysis can be found in Ref. [10].
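Assuming the first term is a regression loss on the starvation probability and the second is a cross-entropy against the Ballot-theorem distribution, the combined loss could be computed as in the following sketch (function and tensor names are ours):

```python
import torch
import torch.nn.functional as F

def starvation_loss(p_pred, p_true, dist_pred, dist_true, alpha=1.0, beta=1.0):
    """Weighted sum of the starvation-probability loss (L_S) and the
    starvation-count distribution loss (L_D, cross-entropy)."""
    prob_loss = F.mse_loss(p_pred, p_true)                       # first term, L_S
    # dist_true holds the Ballot-theorem probabilities P_S(0..J);
    # cross-entropy between two distributions: -sum(p * log q).
    dist_loss = -(dist_true * torch.log(dist_pred + 1e-12)).sum(dim=-1).mean()
    return alpha * prob_loss + beta * dist_loss
```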

Reinforcement learning model environment settings
The purpose of a reinforcement learning (RL) algorithm is to train a model so that the agent can complete a specific task. To achieve this goal, the video streaming scheduling problem must be abstracted as an RL problem, which requires defining the environment of the RL model. The environment describes the state of the task within a specified period of time, the set of actions that can be taken, and the effect of these actions [33].
The state in the environment is represented by a vector, which mainly describes the network state during video streaming in the wireless network. The elements of the vector include: the packet arrival rate, packet service rate, duration of the service time slot, start-up delay, traffic intensity, file size in packets, packet arrival probability, total number of packets, packet departure probability, minimum file size, start-up threshold in packets, and the mean of the exponentially distributed file size.
The action of the agent is to reconfigure the packet prefetching strategy in each state and to reconfigure the start-up delay when buffer starvation occurs. Every action results in a change of state. The next state is determined not only by these actions but also by the current network state; therefore, the environment has a certain degree of randomness, and the model is a model-free RL approach. The reward function of the environment assigns a real value to each possible state–action pair. In the video stream scheduling problem, we define the reward function as composed of two parts, which respectively represent the buffer starvation probability in the current state and the expected interval between two adjacent buffer starvations. This reward function $R_e$ is also a quantitative form of QoE. Here $P_s$ is the buffer starvation probability, calculated based on the Ballot theorem. Ballot theorem: in a ballot, candidate A gets $N_A$ votes and candidate B gets $N_B$ votes, where $N_A > N_B$. Assuming all counting orders are equally likely, the probability that A always leads in the number of votes during the whole counting process is $(N_A - N_B)/(N_A + N_B)$.
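For example, if candidate A receives $N_A = 7$ votes and candidate B receives $N_B = 3$ votes, the probability that A stays strictly ahead throughout the count is $(7 - 3)/(7 + 3) = 0.4$.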
After the start of a transmission service, for a given initial queue length $x_1$ and total size $N$, the starvation probability $P_s$ is given by the Ballot-theorem-based formula derived in [9]. $E(T)$ is the expected time interval between two starvations. We let $g(\cdot)$ be a strictly increasing, convex function of the expected start-up delay.
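A gym-style skeleton of how this environment could be expressed is sketched below; the class name, the discrete action set of (prefetch threshold, start-up delay) pairs, the reward weighting, and the placeholder state transition are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

class StreamingSchedulingEnv:
    """Toy skeleton of the video-stream scheduling environment.

    State: channel/network descriptors (arrival rate, service rate, traffic
    intensity, start-up threshold, ...), as listed in Table 1 of the text.
    Action: index into a discrete set of (prefetch threshold, start-up delay)
    configurations."""

    def __init__(self, actions, starvation_model, horizon=1000):
        self.actions = actions                    # list of (prefetch_threshold, startup_delay)
        self.starvation_model = starvation_model  # returns (P_s, E[T]) for a state/action
        self.horizon = horizon
        self.t = 0
        self.state = None

    def reset(self, initial_state):
        self.t = 0
        self.state = np.asarray(initial_state, dtype=np.float32)
        return self.state

    def step(self, action_idx):
        threshold, delay = self.actions[action_idx]
        p_s, expected_gap = self.starvation_model(self.state, threshold, delay)
        # External reward R_e combines the two parts described in the text:
        # the current starvation probability and the expected gap between
        # consecutive starvations (the weighting here is illustrative).
        reward = expected_gap - p_s
        self.state = self._next_state()           # channel evolves stochastically
        self.t += 1
        done = self.t >= self.horizon
        return self.state, reward, done

    def _next_state(self):
        # Placeholder transition: in practice driven by traced 5G channel measurements.
        noise = np.random.normal(0.0, 0.01, size=self.state.shape)
        return (self.state + noise).astype(np.float32)
```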

Internal reward mechanism
In the actual video stream transmission scheduling process, the network may produce unpredictable fluctuations at any time, so its state cannot be fully determined [34,35]. A problem then arises: a random environment with sparse rewards (or even almost no reward) may prevent the agent from effectively exploring the environment. In response to this problem, we propose an internal reward mechanism. Independent of external reward signals, this reward is modeled as the difference between the predicted state and the actual state in a feature space [36]. Meanwhile, a self-supervised inverse dynamics model is used to extract the state features for that feature space. When the environment changes, the model therefore remains highly adaptable. The structure of this mechanism is shown in Fig. 2.
Unlike the traditional reinforcement learning reward, the reward signal is divided into two parts; that is, the function is rewritten as $R = R_i + R_e$, where $R_i$ represents the internal reward produced by the proposed mechanism, and $R_e$ represents the reward inherent in the environment, calculated by Eq. (6). We then use a policy learning method to find the corresponding strategy by optimizing the accumulated reward.
The essence of this mechanism is to learn the information that actually affects the agent. We use a deep neural network to extract a feature ϕ(s) from the state s. Together with the feature ϕ(s′) of the next state, this is used to predict the action $\hat{a}$ taken between the two states. By minimizing the error between the predicted action $\hat{a}$ and the actually adopted action a, back-propagation allows the neural network to extract the features that are truly influenced by the action. Since the action here is discrete, we apply a SoftMax to the predicted action and define the corresponding loss through maximum likelihood estimation. After obtaining the feature ϕ(s) of the current state, another neural network predicts the feature of the next state, $\hat{\phi}(s')$. Since the predicted feature is a vector, the squared $L_2$ norm is used as the forward loss:
$$L_F\big(\phi(s'), \hat{\phi}(s')\big) = \tfrac{1}{2}\,\big\|\hat{\phi}(s') - \phi(s')\big\|_2^2 .$$
This forward loss $L_F$ is also used to calculate the internal reward $R_i$, which is taken proportional to the prediction error. The overall learning goal of the model combines the accumulated policy reward with the two auxiliary losses weighted by α and β, where α > 1 and 0 ≤ β ≤ 1 only control the scale of the corresponding terms.
In the training phase, once the predictions for already-visited states become accurate, the internal reward they generate shrinks, so in order to obtain more internal reward the agent will actively explore more unknown states.
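The following sketch condenses the inverse-model/forward-model pair and the resulting internal reward into a single PyTorch module; the feature dimension, layer sizes, scaling factor eta, and module name are illustrative assumptions, and the structure follows the curiosity-style formulation described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrinsicReward(nn.Module):
    """Curiosity-style module: an inverse model predicts the action from
    (phi(s), phi(s')); a forward model predicts phi(s') from (phi(s), a);
    the forward prediction error serves as the internal reward R_i."""

    def __init__(self, state_dim, n_actions, feat_dim=32, eta=0.5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        self.inverse = nn.Linear(2 * feat_dim, n_actions)
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        self.n_actions = n_actions
        self.eta = eta

    def forward(self, s, s_next, action):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        # Inverse dynamics: predict which discrete action connected s and s'.
        action_logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        inv_loss = F.cross_entropy(action_logits, action)        # SoftMax / max likelihood
        # Forward dynamics: predict phi(s') from phi(s) and the taken action.
        a_onehot = F.one_hot(action, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        fwd_loss = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=-1).mean()
        # Internal reward: scaled per-sample forward prediction error.
        r_i = self.eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1).detach()
        return r_i, inv_loss, fwd_loss
```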

Optimal strategy learning
Our model uses a Deep Q-Network (DQN) to learn the best strategy. DQN, like Q-learning, is a value-iteration-based algorithm [37]. In ordinary Q-learning, if the state and action spaces are high-dimensional or continuous, it is hard to maintain a Q-table. Therefore, we convert the Q-table update into a function fitting problem and use a deep neural network instead of the Q-table to generate the Q values. DQN uses a neural network to approximate the value function, and after obtaining the value function it takes an ϵ-greedy strategy to select actions. The structure of the approach is shown in Fig. 3.
The algorithm has two main structures:
• The experience replay (experience pool) method is used to solve the problems of sample correlation and non-stationary distribution.
• A MainNet is introduced to obtain the real-time Q value, and the target Q value is obtained through a separate TargetNet.
The memory mechanism of the experience pool is used to learn from previous experience. Because Q-learning is an off-policy method, it can learn from past experience as well as from the current experience, so randomly mixing in previous experience during learning makes the deep neural network more efficient. The experience pool stores the transition samples $(s_t, a_t, r_t, s_{t+1})$ obtained by the interaction between the agent and the environment at each time step into the replay memory, and randomly draws mini-batches from it during training to break the correlation between samples. The target-Q mechanism is another way to break correlation. It builds two networks with an identical structure but completely different parameters. The network that predicts the Q estimate, MainNet, uses the latest parameters, while the parameters of TargetNet, the network that predicts the Q target, come from many iterations earlier. $Q(s, a; \theta_i)$ denotes the output of the current network MainNet and measures the value function of the current state–action pair. $Q(s, a; \theta_i^-)$ denotes the output of TargetNet, which is used to compute the target Q value and to update the MainNet parameters through the loss function. The MainNet parameters are copied to TargetNet after a certain number of iterations.
In the value-function network training phase, the environment first provides an observation; the agent then obtains all Q values for this observation from the value-function network and uses the ϵ-greedy method to select an action and make a decision. After the environment receives this action, it feeds back a reward and the next observation. This constitutes one complete step. At this point, we update the parameters of the value-function network according to the reward and enter the next step. This cycle continues until a satisfactory value-function network has been trained.
The update of the value function in this algorithm is given by
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\Big[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\Big].$$
The loss function of DQN is the mean squared error, where θ denotes the network parameters:
$$L(\theta_i) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\big)^2\Big].$$
There are two structures with nearly the same design but different parameters in DQN. The MainNet predicts the Q estimate with the latest parameters, while the TargetNet predicts the Q target with parameters from many iterations earlier. When the agent takes action a in the RL environment, the target Q can be calculated according to the above formula and the MainNet parameters are updated; they are copied to TargetNet after a certain number of iterations. Thereupon, one learning step is completed.
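A minimal sketch of the MainNet/TargetNet training loop described above is given below; the network sizes, optimizer, replay capacity, and hyper-parameters are illustrative assumptions.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

class DQNAgent:
    def __init__(self, state_dim, n_actions, gamma=0.99, lr=1e-3, capacity=50_000):
        def make_net():
            return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
        self.main_net, self.target_net = make_net(), make_net()
        self.target_net.load_state_dict(self.main_net.state_dict())
        self.optimizer = torch.optim.Adam(self.main_net.parameters(), lr=lr)
        self.replay = deque(maxlen=capacity)      # experience pool
        self.gamma, self.n_actions = gamma, n_actions

    def remember(self, s, a, r, s_next):
        self.replay.append((s, a, r, s_next))

    def act(self, state, epsilon=0.1):
        # epsilon-greedy action selection on the MainNet Q values.
        if random.random() < epsilon:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.main_net(torch.as_tensor(state).float()).argmax())

    def train_step(self, batch_size=64):
        if len(self.replay) < batch_size:
            return
        batch = random.sample(self.replay, batch_size)   # break temporal correlation
        s, a, r, s_next = (torch.as_tensor(np.array(x)).float() for x in zip(*batch))
        a = a.long()
        q = self.main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_target = r + self.gamma * self.target_net(s_next).max(dim=1).values
        loss = nn.functional.mse_loss(q, q_target)       # mean-squared TD error
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def sync_target(self):
        # Periodically copy MainNet parameters into TargetNet.
        self.target_net.load_state_dict(self.main_net.state_dict())
```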

Experiment settings
In the experiment, in order to verify that the model is effective and robust in a complex 5G NSA network, a 5G-powered drone (DJI M210 UAV) equipped with a communication module (Hubble I) provided by China Mobile was used as the video streaming transmission equipment (as shown in Fig. 4). Trace-driven simulation is used to verify the accuracy of our method.
On the issue of experimental settings, we initially considered pure simulation experiments, generating the required data and variables from random distributions. However, this has a serious flaw: a pure simulation cannot represent the fluctuating network environment, nor can it reproduce the noise of a random environment. Moreover, when there are enough random events, this approach converges to a specific mathematical model, which runs counter to our purpose: even if pure simulation can verify the correctness of the model, it is not sufficient to evaluate its accuracy. Taking the above considerations into account, our experiment finally adopts a trace-driven simulation method and randomly selects requests from the 5G-powered UAV video streaming service in a real wireless network environment. Such an experimental method can not only effectively test the performance and robustness of the model in actual scenarios, but also allows the parameter ranges to be controlled to a certain extent for targeted measurement.
The video resolutions used in the experiment include 1080p, 2K and 4K, and the video frame rates include 24 fps and 30 fps. Since a 5G base station covers 100–300 m, the drone's flying range is within 100 m of the base station. The terrain for the experiment includes the area above a lake, above woods, and above buildings.
For different network environments, the accuracy of the buffer starvation probability evaluation model reaches about 96.3%. Using the channel data generated during the transmission of videos with a total duration of about 100 h for training, the model converges within 22 min. Under the same experimental conditions, the cumulative QoE finally achieved by the reinforcement learning scheduling model is more than 15% higher than that of existing methods, and the training process converges after about 130,000 episodes.

Starvation behavior prediction
For a comprehensive evaluation, video streams of different lengths have been tested up to 8000 times. We use four different parameter settings: ρ = 0.95 or 1.25, and x_1 = 40 or 60 packets. Unless otherwise mentioned, the departure rate μ is normalized to 1. The video size in the experiment is between 300 and 9000 packets. Figure 5 shows the probability of 0–4 starvations with the parameters ρ = 0.95 and x_1 = 40. As the file size increases, the no-starvation probability decreases, while the probability of multiple starvations first increases and then decreases. Figure 5 also shows that our analysis results agree closely with the simulation results. When the start-up threshold is 50 packets, Fig. 6 shows similar results: the no-starvation probability decreases as the video gets longer, while the probability of multiple starvations first increases and then decreases. Figure 7 verifies the asymptotic no-starvation probability with traffic intensity ρ = 1.25 and x_1 = 50 and 100. When the video is short and the number of data packets is small, the total transmission time is relatively short, so the network status does not change much during the whole process and the probability of buffer starvation is relatively small; that is, there is no starvation most of the time. This explains why the model curve is farther from the asymptote when there are fewer data packets. Figure 8 plots the asymptotic probability of a single starvation event with the same settings. The probability of exactly one starvation increases as the video file gets larger, and more prefetched packets result in a smaller starvation probability.

Cumulative QoE
In order to keep the base station under heavy load without exceeding its capacity region, we set the traffic intensity ρ = 0.98 and the video request intensity λ = 0.009. This also keeps the attainment rate of video requests within a stable range. In the experiments, we evaluate the overall objective quality of experience and the performance of the model for different start-up states of the video transmission process. First, we divide the video streams into two categories and measure the model on each. The Poisson arrival rate of the k-th video request category is λ_k, and the total arrival rate is λ = λ_1 + λ_2. The service time of a video stream in a state equals the video request size in bits divided by its throughput. Since the viewing time follows a super-exponential distribution, the service time of each category is also exponentially distributed. Here, we use the state pair (m, n) to indicate that there are currently m first-class flows and n second-class flows passing through the bottleneck. As more video streams share the bottleneck, the probability of buffer starvation during transmission increases. Figure 9 illustrates the change in the objective QoE of the first-class and second-class streams scheduled by our model as the video transmission duration increases from 0 to 3000 s. Figure 10 shows the results of scheduling in the same experimental environment with the method in [5]. Although our scheduling model cannot make the QoE index higher than the method in [5] at every moment of the transmission process, it effectively improves the long-term cumulative reward over the whole process. We also evaluated the fluctuation of the objective QoE when the video streaming transmission process starts from different states, as shown in Figs. 11 and 12, using the state pairs (0, 6), (2, 6) and (4, 8). In addition, Figs. 13 and 14 compare the long-term cumulative QoE of first-class and second-class flows for different maximum numbers of coexisting flows. As the maximum number of streams increases, more video streams may coexist at the base station, causing buffer starvation to occur more often. Comparing the cases where the maximum flow number is 5, 10 and 15, the long-term cumulative QoE shows a strong correlation with the maximum number of flows. Therefore, our model can be used to design a video stream admission control strategy that tolerates a certain degree of starvation probability.

Mean DT/DV ratio
The DT/DV ratio is an important indicator of the buffer ratio over the entire video streaming period. In Figs. 15 and 16, we plot the average DT/DV ratio as the maximum number of streams increases from 5 to 15. It can be seen from the curves that an increase in the number of flows leads to a higher average DT/DV ratio. However, compared with the method in [5] and with transmission without scheduling, our scheduling strategy effectively reduces the video buffering time. Our reinforcement learning model provides an effective scheduling strategy for the QoE trade-off among heterogeneous video streams. When different types of video streams perceive buffer starvation differently, our algorithm empowers the base station to intelligently schedule the flows so as to optimize the long-term cumulative QoE. For example, if a flow is more sensitive to buffer starvation at a certain moment, the scheduling strategy can give it higher priority.

Conclusion
In this article, we first propose a regression model that combines a recurrent neural network with an attention mechanism. This model can accurately calculate the buffer starvation probability and the specific distribution of starvation events in any state during the video streaming transmission process; that is, buffer starvation behavior is precisely evaluated and analyzed at the packet level. After obtaining the starvation probability distribution, we propose a reinforcement learning model that introduces an intrinsic reward mechanism to intelligently schedule the transmission of video streams. This method not only maximizes the long-term cumulative QoE by dynamically adjusting the start-up delay and the data packet prefetching strategy in spite of random noise, but is also highly adaptable to different network environments.
The effectiveness of the proposed approach has been verified in a 5G-powered UAV video streaming transmission scenario. We find that our model can stably improve the quality of the video transmission service in complex wireless network environments, and thus provides broader ideas for 5G low-latency research topics.