Semantic video analysis involves the identification of the major events in a video. We propose a robust and efficient framework for the semantic analysis of soccer video with highly varying illumination conditions. It is robust in the sense that it achieves favourable results under various soccer video conditions with minimal assumptions. It is also efficient because we apply low-level features such as colour and motion to detect the important events, while avoiding object-based features, which are computationally expensive. The proposed framework for semantic analysis of a soccer video is shown in Figure 1. The framework is fully automatic, and the entire video is processed frame by frame. The block diagram of the framework is briefly described below.
1) The first step of our framework is event segmentation. We propose a novel algorithm for event segmentation based on the change in optical flow between successive frames of the video. The change in the horizontal component of optical flow is found to be very effective for segmenting the video. At the end of event segmentation, we have a set of events, many of which are dull events that are not important to end users. The block diagram trifurcates after event segmentation; each branch is carried out independently of the others, as described subsequently.
2) We propose a novel and robust card event detection algorithm. The algorithm understands the caption and exploits this domain knowledge effectively to detect the card event. After event segmentation, the card detection algorithm is applied on the obtained set of events.
3) Event filtration is applied to remove dull events such as simply passing the ball on the ground, audience views etc. After obtaining meaningful events, event categorization is applied to separate high-impact and low-impact classes of events. High-impact events are lengthy and involve more view transitions. The high-impact class consists of events like goal, injury, player exchange etc., while the low-impact class consists of events like goal attack or corner, and other events like throw-in, offside etc.
4) View classification is carried out for all the detected event segments. This algorithm depends heavily on the dominant colour and edges. The algorithm is robust to the varying conditions of the ground, which strongly affect the grass colour; the grass exhibits a different brightness under floodlight than under daylight. Based on this view classification, HMM models are generated for goal, corner/goal attack and other types of events.
5) Event classification is carried out on the low-impact and probable goal classes of events using the obtained HMM models.
All the above-mentioned steps of the framework are described in detail in the remainder of Section 2.
2.1 Event segmentation
Event segmentation is carried out by computing optical flow between consecutive frames of the video. As sports videos are very dynamic in nature and involve a great deal of motion, optical flow is the most appropriate choice. We apply the Lucas-Kanade optical flow technique [9], a widely used differential method for optical flow estimation in computer vision; it is fast and less sensitive to image noise. The Lucas-Kanade method assumes that the displacement of the image content is approximately constant within a neighbourhood (window) of the pixel under consideration. The velocity vector (V_x, V_y) must satisfy:

$$I_x(q_i)\,V_x + I_y(q_i)\,V_y = -I_t(q_i), \quad i = 1, 2, \ldots, n \qquad (1)$$

where q_1, q_2, …, q_n are the pixels inside the window, and I_x(q_i), I_y(q_i) and I_t(q_i) are the partial derivatives of the image I with respect to positions x, y and time t, evaluated at pixel q_i and at the current time. The solution of Equation 1 is obtained by the least squares method. It is computed as:

$$\mathbf{v} = (A^{T}A)^{-1}A^{T}\mathbf{b}$$

where the rows of A are [I_x(q_i), I_y(q_i)] and b stacks the values −I_t(q_i).
In the experiments, the neighbourhood (window) size is set to 3. For a group of N + 1 frames, N optical flow fields (F_1, F_2, …, F_N) are computed by the algorithm. Before applying the optical flow computation, the resolution of the frames is down-sampled by a factor of 2 to speed up the computation. After obtaining the optical flow components, the optical flow magnitude is computed using Equation 2:

$$M = \sqrt{V_x^{2} + V_y^{2}} \qquad (2)$$

It is observed that the optical flow components V_x and V_y are also quite sensitive to shot transitions, which is natural given the global camera motion.
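As an illustration of Equations 1 and 2, the following minimal sketch computes the windowed Lucas-Kanade velocities and the flow magnitude for a pair of grey-scale frames; it is a sketch under assumed frame handling (NumPy/OpenCV, frames of equal size), not the authors' implementation, and only the window size of 3 is taken from the text.

```python
import cv2
import numpy as np

def lucas_kanade_flow(prev_gray, curr_gray, win=3):
    """Windowed Lucas-Kanade flow (Equation 1) and magnitude M (Equation 2)."""
    prev = prev_gray.astype(np.float32)
    curr = curr_gray.astype(np.float32)

    # Spatial and temporal derivatives I_x, I_y, I_t.
    Ix = cv2.Sobel(prev, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(prev, cv2.CV_32F, 0, 1, ksize=3)
    It = curr - prev

    # Windowed sums form the normal equations (A^T A) v = A^T b at every pixel.
    k = np.ones((win, win), np.float32)
    Sxx = cv2.filter2D(Ix * Ix, -1, k)
    Sxy = cv2.filter2D(Ix * Iy, -1, k)
    Syy = cv2.filter2D(Iy * Iy, -1, k)
    Sxt = cv2.filter2D(Ix * It, -1, k)
    Syt = cv2.filter2D(Iy * It, -1, k)

    # Least squares solution v = (A^T A)^{-1} A^T b, with b = -I_t.
    det = Sxx * Syy - Sxy * Sxy
    det[np.abs(det) < 1e-6] = np.inf          # ill-conditioned windows give ~0 flow
    Vx = (-Syy * Sxt + Sxy * Syt) / det
    Vy = (Sxy * Sxt - Sxx * Syt) / det

    M = np.sqrt(Vx ** 2 + Vy ** 2)            # Equation 2
    return Vx, Vy, M
```

In practice, each frame would first be down-sampled by a factor of 2, as described above, before the flow is computed.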
The occurrence of any major event in soccer involves gathering of players, audience reactions and rapid changes in views. The camera undergoes large motion during all major events in soccer, and this camera motion is effectively captured by the optical flow components. We emphasize the change in the horizontal component V_y, as cameras track the soccer ball, which moves predominantly horizontally. We propose a novel feature to measure optical flow variation: we differentiate Equation 2 with respect to V_y, which gives

$$M' = \frac{\partial M}{\partial V_y} = \frac{V_y}{\sqrt{V_x^{2} + V_y^{2}}} \qquad (3)$$
The above equation is found to be very effective at exhibiting a noticeable change at the beginning or at the end of an event period. As an event occupies a time span in the video, it is necessary to demarcate the event boundary; however, identifying the candidate frames that mark the beginning and ending of an event is challenging. Figure 2 clearly reflects the occurrences of major events in soccer through large fluctuations in M′. In the case of a goal event, M′ undergoes a rapid and large change due to frequent shot transitions and rapid camera motion. The threshold is decided automatically to detect important events by using the following min-max normalization equation:

$$T = \frac{\overline{M'} - \min_{M'}}{\max_{M'} - \min_{M'}} \left(\mathrm{new\_max} - \mathrm{new\_min}\right) + \mathrm{new\_min} \qquad (4)$$

where $\overline{M'}$ is the average value of M′, min_{M′} is the minimum value of M′, and max_{M′} is the maximum value of M′. The new_min and new_max values have been set to 0.5 and 0.8, respectively. After performing a large number of experiments on various video datasets, it has been observed that the minimum value of M′ for the detection of any event is 0.5. Based on the newly determined threshold T, significant events are demarcated. Important events like card, corner and goal continue over a certain minimum time span.
Using this fact, we consider only those events which last more than 5 s; this helps to reduce false detections. In Figure 2, the horizontal line is the threshold calculated using Equation 4. The computed M′ value varies over time (frames) and, as shown in Figure 2, contains multiple peaks corresponding to significant soccer events. Figure 3a,b shows the corner and goal event sequences, respectively. These sequences are series of various views such as goal post views, close-ups, player gatherings etc. As the goal event shows a large number of continuous peaks above the threshold, it clearly exhibits more fluctuation than the corner event and also lasts longer. Any M′ value higher than the obtained threshold marks a candidate frame indicating the presence of an event. To separate two events, there must be a gap of at least 6 s between two peaks of M′; otherwise, the next peak also contributes to the first event. This is valid because a corner event lasts at least 5 s, while card and goal events are much longer than the corner event.
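To make the demarcation concrete, the following sketch turns the per-frame M′ signal into event segments using the min-max normalized threshold of Equation 4, the 5-s minimum duration and the 6-s merging gap; the frame rate, the function name and the reading of Equation 4 as a normalization of the mean of M′ are assumptions.

```python
import numpy as np

def segment_events(m_prime, fps=25.0, new_min=0.5, new_max=0.8,
                   min_len_s=5.0, merge_gap_s=6.0):
    """Demarcate events from the per-frame M' signal (Equations 3 and 4)."""
    m = np.asarray(m_prime, dtype=np.float32)

    # Equation 4 (assumed form): min-max normalize the mean of M' to [new_min, new_max].
    t = (m.mean() - m.min()) / (m.max() - m.min() + 1e-9)
    t = t * (new_max - new_min) + new_min

    # Candidate frames are those whose M' exceeds the threshold T.
    peaks = np.flatnonzero(m > t)
    if peaks.size == 0:
        return []

    # Merge peaks separated by less than 6 s into the same event.
    gap = int(merge_gap_s * fps)
    events, start, prev = [], peaks[0], peaks[0]
    for f in peaks[1:]:
        if f - prev > gap:
            events.append((start, prev))
            start = f
        prev = f
    events.append((start, prev))

    # Keep only events that sustain for more than 5 s, to reduce false detections.
    min_len = int(min_len_s * fps)
    return [(s, e) for s, e in events if e - s >= min_len]
```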
2.2 Yellow card event detection
Soccer is an eventful sport. Unfair behaviour or a foul may lead to a yellow card being issued to a player. The yellow card is itself a very important event because it acts as a warning to the player concerned, and a second yellow card (equivalent to a red card) compels the player to leave the game. The event gains further importance because, if the yellow card has been issued in the penalty area, the opponent may be awarded a penalty kick which may eventually result in a goal; hence, it may generate a series of events. An algorithm that detects a yellow card by examining every frame of the segments obtained after event segmentation is described below.
Normally, a yellow card caption stays on screen for 2 to 4 s, while the whole event lasts longer; the caption remains on display for roughly 50 to 100 frames. A novel algorithm to detect a yellow card frame is proposed and described below. Generally, the yellow card is displayed as a caption with the player information at the bottom part of the image, and it is observed that in most broadcast soccer videos the caption appears at the bottom of the screen. We process the part of the input frame below the centre of the image, rather than only the exact bottom area, because the caption placement and its variations in size and height are difficult to predict. This domain knowledge helps us to detect this event directly rather than classifying it against a set of events like goal, corner and other types of events.
Yellow card frame detection algorithm
Phase I
1. Divide the image horizontally into three equal parts and choose the bottom part where the caption is located.
2. Convert the input frame from the RGB image into a grey image.
3. Apply a canny edge detection operator on the grey image to detect horizontal lines of the rectangular box.
4. Erode the horizontal lines which have a length shorter than the threshold.
5. Extract the connected components.
6. If the number of components > 1, then
   a. Convert the image from the RGB image into an HSV image.
   b. Set the pixels which have hue, saturation and value in the empirically defined ranges to the highest grey level (255); set the rest of the pixels to grey value 0.
   c. Convert the image to binary.
   d. Extract the connected components.
      i. If any connected component is found which has a number of pixels in the specified range, declare a probable yellow card frame.
      ii. Otherwise, neglect and proceed to the next frame.
   e. Otherwise, neglect and proceed to the next frame.
7. Collect probable yellow card frames.
8. Keep those frames as yellow card frames if the minimum number of frames within that segment meets the defined criteria.

Phase II

9. Compute the edge pixel ratio of the cropped image; apply the Sobel operator to find edges.
10. If the edge pixel ratio (EPR) is within the threshold, declare that segment as a yellow card event.

Phase III

11. Perform steps 1 to 6d on two chosen yellow card frames of every segment.
12. Find the absolute distance between the detected connected components.
13. If the distance is less than the threshold,
   a. declare a yellow frame and accept the segment;
   b. else, discard the segment.
The entire algorithm involves three phases. The first phase consists of steps 1 to 8, which find probable yellow card frames based on the presence of yellow colour. Steps 1 to 3 are straightforward. Step 4 involves erosion with a horizontal structuring element of size 20, which removes all lines shorter than 20 pixels. In step 6, we first check for connected components, because their absence indicates a smooth or constant-intensity image; the display of a yellow card in the caption exhibits various colours and hence gives rise to a number of edges, so after steps 3 and 4 a few longer edges remain which eventually contribute to connected components. Step 6a converts the RGB image to HSV, as the HSV model is considered more perceptually uniform; since we deal with yellow cards of largely varying shades, HSV is the most appropriate model. Pure yellow is represented at 60° in HSV, which corresponds to a hue of 0.16 (60°/360°). However, the intensity of yellow varies across different league videos, and a hue range of 0.14 to 0.22 is found to be satisfactory for detecting yellow card frames. From an empirical study, the thresholds of the saturation and value components of HSV are set greater than or equal to 0.8 and 0.6, respectively. These thresholds are applied in step 6b. At the end of step 6b, we confirm the presence of yellow colour by setting the yellow region to the highest grey level intensity while the rest of the pixels are set to the lowest intensity. In step 6c, the binary threshold is set to 0.8, because a strict threshold removes all the unnecessary components. From observation of different broadcast videos of standard league matches, the area of the yellow card is fixed between 20 and 450 pixels (step 6d), which covers smaller to wider yellow cards at different tilts.

As the yellow card caption stays for 2 to 4 s, step 8 uses this knowledge to identify the yellow card frames; the minimum number of frames required to declare a yellow card event is set to 15. Various types of yellow cards are shown in Figure 4, which clearly depicts the largely varying size, intensity and location of the yellow card. Figure 5 shows various intermediate steps of the yellow card event detection algorithm. A few leagues display a yellow card whose intensity keeps varying. The second phase attempts to confirm the presence of a caption in the detected probable yellow card frames, because soccer videos contain misleading frames such as a player wearing a yellow t-shirt or yellow socks, or even a yellow ball. We compute the edge pixel ratio of the cropped image to confirm the presence of a caption showing a yellow card. The edge pixel ratio is computed using the following equation:
(5)
The range of EPR is set between 0.38 and 0.55. However, every event is narrated with the support of a caption, so there can still be misleading cases where a frame shows a player wearing a yellow t-shirt together with a caption that actually conveys the information of a goal event. Phase III takes care of such misleading frames, where a caption coincides with a yellow t-shirt or a yellow logo on the t-shirt; a typical case of this type is shown in Figure 6. In falsely detected yellow event frames, the player's t-shirt moves between frames due to the player's movements. Step 11 selects two frames at a specific interval within the detected yellow event segment. If these are genuine yellow event frames, then applying steps 1 to 6d produces nearly identical images, each consisting of one connected component at the location of the yellow card. We find the city block distance between the top left coordinates of the connected components of the two images, i.e. the sum of the x and y differences of the top left coordinates. Due to motion, this distance will be high for the misleading frames.
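A minimal sketch of the Phase I colour test is given below: it masks the caption region with the yellow hue, saturation and value ranges quoted above and checks the connected-component area; OpenCV is assumed, and the function name and hue normalization are illustrative.

```python
import cv2
import numpy as np

def probable_yellow_card_frame(frame_bgr,
                               hue_range=(0.14, 0.22),
                               sat_min=0.8, val_min=0.6,
                               area_range=(20, 450)):
    """Sketch of steps 6a-6d of Phase I of the yellow card detector."""
    h, w = frame_bgr.shape[:2]
    bottom = frame_bgr[2 * h // 3:, :]            # bottom third holds the caption

    hsv = cv2.cvtColor(bottom, cv2.COLOR_BGR2HSV).astype(np.float32)
    hue = hsv[:, :, 0] / 180.0                    # OpenCV hue spans 0..180
    sat = hsv[:, :, 1] / 255.0
    val = hsv[:, :, 2] / 255.0

    # Step 6b: keep only pixels in the empirically defined yellow range.
    mask = ((hue >= hue_range[0]) & (hue <= hue_range[1]) &
            (sat >= sat_min) & (val >= val_min)).astype(np.uint8)

    # Steps 6c-6d: connected components whose area lies between 20 and 450 pixels.
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for i in range(1, n):                         # label 0 is the background
        if area_range[0] <= stats[i, cv2.CC_STAT_AREA] <= area_range[1]:
            return True                           # probable yellow card frame
    return False
```

Over a segment, the same test is repeated frame by frame and the segment is retained only if at least 15 probable yellow card frames are found (step 8).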
2.3 Event filtration
After event segmentation of the video, we obtain the set of events. The change in optical flow is the key parameter for demarcating events in the input video. As we process broadcast video, there is no control over how the content is captured: the camera moves over the ground from one angle to another, which changes the optical flow and can produce event segments that do not actually represent any event. We therefore apply event filtration on this set to filter out events which are insignificant or dull from the point of view of the end user. To carry out this task, we apply the Fourier transform to M′ and characterize every event by the mean magnitude of its Fourier transform, computed using Equation 6. If an event is short, unimportant and contains only smooth, small fluctuations, it has a small Fourier transform magnitude. Even if an event is short in duration but consists of many views (far, goal post, audience etc.) and transitions among these views, it gives rise to large high-frequency components, which the Fourier transform magnitude captures well. We use only the top 10% of frequency coefficients, i.e. the extremely high-frequency components.
$$F(k) = \sum_{n=0}^{N-1} M'(n)\, e^{-j 2\pi k n / N}, \qquad |F(k)| = \sqrt{\operatorname{Re}\{F(k)\}^{2} + \operatorname{Im}\{F(k)\}^{2}} \qquad (6)$$
However, a shorter event can also be important, so every event is further analysed by counting how many of its frames experience a change in motion magnitude (M′) above the threshold. This is a useful descriptor because goal and player exchange events last longer and also involve frames whose change in motion is greater than the threshold. This count is easily converted into the ratio of such frames to the total number of frames of the event.
Most goal events are followed by a celebration involving gathering of the players and cheering in the audience, so one can readily look for this feature. However, some goal events are shorter and may not be followed by much cheering and celebration, yet owing to increased camera movement they still contain more frames undergoing a large change in motion. We therefore use the product of two features: the Fourier transform magnitude and the ratio of frames having a change in motion greater than the threshold to the total frames of an event. This product balances the effect of both cues and enhances the ability to filter out insignificant events; we call it the event filtration feature (EFF). We compute the mean of the EFF over all events, and events which satisfy the following criterion are selected as filtered events:
$$\mathrm{EFF}_i \geq \alpha_1 \cdot \overline{\mathrm{EFF}} \qquad (7)$$
where $\overline{\mathrm{EFF}}$ is the average value of the EFF over all events, α1 is an empirical parameter which can be set between 0 and 1 (we set α1 to 0.7), and EFF_i corresponds to the EFF value of event i. Next, we proceed to the event categorization phase.
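Under the reading above, a compact sketch of the filtration feature follows: per event, EFF is the product of the mean magnitude of the highest-frequency 10% of Fourier coefficients of M′ and the ratio of frames whose M′ exceeds the threshold; the function name and the exact coefficient selection are assumptions.

```python
import numpy as np

def event_filtration_feature(m_prime_event, threshold, top_frac=0.10):
    """EFF for one event: high-frequency Fourier magnitude x high-motion frame ratio."""
    m = np.asarray(m_prime_event, dtype=np.float32)

    # Magnitude spectrum of M' over the event (Equation 6).
    mag = np.abs(np.fft.rfft(m))

    # Mean magnitude of the highest-frequency 10% of coefficients.
    k = max(1, int(top_frac * mag.size))
    fourier_feature = mag[-k:].mean()

    # Ratio of frames whose change in motion exceeds the threshold.
    motion_ratio = float(np.count_nonzero(m > threshold)) / m.size

    return fourier_feature * motion_ratio

# Filtration (Equation 7): keep events whose EFF is at least alpha1 times the mean EFF.
# effs = [event_filtration_feature(e, T) for e in events]
# filtered = [e for e, f in zip(events, effs) if f >= 0.7 * np.mean(effs)]
```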
2.3.1 Event categorization
The event categorization phase splits the set of events into low-impact and high-impact sets. The high-impact set consists of events like goal, player exchange, injury etc., while the low-impact set includes events like goal attack, corner, foul, cheering in the audience etc. Broadly, high-impact events are longer in span, while low-impact events are shorter. The goal event is the most valuable event for end users as well as for the game itself. For each event, the following features are computed. Each event is characterized by n values of M′. Using these values, first the kurtosis is computed for every event. Kurtosis is a descriptor of the shape of a probability distribution; higher kurtosis means that more of the variance is the result of infrequent extreme deviations. A goal event produces more deviations, which can be frequent or infrequent, so for a goal event the kurtosis cannot be low and will be above average or high. Kurtosis is computed using the following equation:
$$\mathrm{Kurt} = \frac{\tfrac{1}{n}\sum_{j=1}^{n} \left(M'_j - \mu\right)^{4}}{\sigma^{4}} \qquad (9)$$
where μ is the mean and σ is the standard deviation of the M′ values. Second, we compute the energy of each event i as the following sum of squares:

$$E_i = \sum_{j=1}^{n} \left(M'_j\right)^{2} \qquad (10)$$
The above equation is well suited to identifying events which extend over a longer time span and have high M′ values; in soccer, goal and player exchange events can be easily identified by this parameter. To combine both cues, we formulate an event categorization feature (ECF) as the product of the kurtosis and the energy of an event:

$$\mathrm{ECF}_i = \mathrm{Kurt}_i \times E_i \qquad (11)$$
High-impact events are selected using the following equation:
$$\mathrm{ECF}_i \geq \alpha_2 \cdot \overline{\mathrm{ECF}} \quad \text{and} \quad \mathrm{EFF}_i \geq \alpha_3 \cdot \overline{\mathrm{EFF}}, \qquad i \in \mathrm{FEVENT} \qquad (12)$$
where α2 and α3 are empirically set to 1.1 and 0.85, FEVENT corresponds to the filtered events obtained after event filtration, and $\overline{\mathrm{ECF}}$ and $\overline{\mathrm{EFF}}$ indicate the average value of the event categorization feature and the event filtration feature, respectively. Events which satisfy Equation 12 are referred to as high-impact events, while the rest are put in the low-impact class of events. At the end of the event categorization stage, we obtain high-impact and low-impact sets of events. Other events may remain present in both classes, because player clash, foul and injury events may span either a longer or a shorter duration.
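Under the same assumptions, the categorization feature can be sketched as the product of the kurtosis and the energy of the per-event M′ values; scipy's kurtosis with fisher=False is used here so that the plain fourth-moment definition of Equation 9 is returned.

```python
import numpy as np
from scipy.stats import kurtosis

def event_categorization_feature(m_prime_event):
    """ECF for one event: kurtosis (Equation 9) times energy (Equation 10)."""
    m = np.asarray(m_prime_event, dtype=np.float32)
    kurt = kurtosis(m, fisher=False)        # fourth moment about the mean / sigma^4
    energy = float(np.sum(m ** 2))          # sum of squared M' values
    return kurt * energy

# High-impact selection (Equation 12), assuming both criteria must hold:
# ecfs = [event_categorization_feature(e) for e in filtered_events]
# high_impact = [e for e, c, f in zip(filtered_events, ecfs, effs)
#                if c >= 1.1 * np.mean(ecfs) and f >= 0.85 * np.mean(effs)]
```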
2.4 Edge and caption analysis
After obtaining the high-impact events, we analyse them using the edge pixel ratio and the contents of the caption. A goal event is mostly followed by cheering audience views, gathering of players and goal post views, which give rise to edges in the frames. After the occurrence of a goal event, every broadcaster displays a caption with the goal information: the caption containing the name of the player who scored the goal and his team name is shown just after the event, as in Figure 7a, and stays for roughly 4 to 8 s. After the completion of the event, the broadcaster displays the caption with the team score information at the bottom part of the frame (image), as shown in Figure 7b. Generally, it is observed that the goal score caption is displayed within 55 s after the goal event. The presence of such views and detailed captions becomes a very important clue for the confirmation of the goal event. We do not consider the initial frames up to 4 s (100 frames) of an event for EPR computation, and we continue to compute the EPR for another 55 s (1,250 frames) after the end of the event so as to include the goal score caption.
We apply the following steps to carry out edge analysis of high-impact events:
1. Divide the image horizontally into three equal parts and choose the bottom part where the caption is located.
2. Convert the image to grey and apply the Sobel operator to detect horizontal edges.
3. Compute the EPR.
The EPR is computed using the formula mentioned in Equation 5. The EPR of every frame of an event is computed, then the average EPR of the event and the average EPR over all high-impact events are obtained. Finally, we select the events whose EPR value is greater than the threshold, which is empirically set to 0.97 times the mean EPR of the high-impact events. Events with a higher EPR also include player exchange events. Player exchange events experience large motion as players are replaced on the ground, and they often involve transitions among far (ground), close-up and audience views similar to the goal event. The caption is also displayed for a longer duration while one player is leaving and a new player is entering. When a player leaves the ground, a red triangular symbol is displayed within the caption, and a green triangular symbol is displayed while the new player is entering the ground; the caption thus contains an important triangular shape of either red or green colour. These types of captions and the results of the algorithm mentioned below are shown in Figure 8a,b, respectively. We now analyse the high EPR events based on the nature of their caption. A brief algorithm for separating the player exchange events from the high EPR events is described below; it is very similar to the yellow card detection algorithm.
1. Divide the image horizontally into three equal parts and vertically into two equal parts, and choose the bottom right part where the caption is located.
2. Resize the bottom right image by a factor of 2.
3. Obtain the HSV image and search for the triangle symbol made of red/green pixels in the HSV image.
4. Obtain the connected components of the image which have an area within the specified range.
5. Keep those frames as player exchange frames if the minimum number of frames within that segment meets the defined criteria.
6. Find the distance between the red and the green spots of the selected images; if the distance is less than the threshold, then declare a player exchange event.
We resize the image by a factor of 2 to obtain a large enough area of the red/green triangle for proper detection. In step 3, the hue range for red is set above 0.90 and for green between 0.30 and 0.40, while the saturation and value for both colours are set above 0.80 and 0.60, respectively. Because this region is small, we cannot search for the triangular shape itself; we only search for pixels that fall within the specified ranges of hue, saturation and value. The threshold for the area of the symbol is set between 10 and 150 pixels in step 4, after observing various sizes of the symbol in player exchange events. The minimum number of frames required to declare a player exchange event is set to 15 for both the red and green symbols. The threshold for the city block distance between the top left coordinates of the red and green spots is empirically set to less than 20. Detected player exchange events are removed from the set of events with high EPR, and we refer to the remaining set of events as probable goal events.
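The red/green spot test can be sketched as below, reusing the HSV masking idea of the yellow card detector with the ranges quoted above; the helper name, the normalized hue handling and the usage snippet are illustrative.

```python
import cv2
import numpy as np

def colour_spot(frame_bgr, hue_lo, hue_hi, sat_min=0.80, val_min=0.60,
                area_range=(10, 150)):
    """Return the top left corner of a red/green caption spot, or None."""
    h, w = frame_bgr.shape[:2]
    roi = frame_bgr[2 * h // 3:, w // 2:]                 # bottom right part (step 1)
    roi = cv2.resize(roi, None, fx=2, fy=2)               # step 2: upscale by 2

    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV).astype(np.float32)
    hue = hsv[:, :, 0] / 180.0
    sat = hsv[:, :, 1] / 255.0
    val = hsv[:, :, 2] / 255.0
    mask = ((hue >= hue_lo) & (hue <= hue_hi) &
            (sat >= sat_min) & (val >= val_min)).astype(np.uint8)

    # Step 4: connected components whose area lies between 10 and 150 pixels.
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for i in range(1, n):
        if area_range[0] <= stats[i, cv2.CC_STAT_AREA] <= area_range[1]:
            return stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP]
    return None

# Usage sketch: red hue > 0.90, green hue in 0.30-0.40; a player exchange is
# declared when both spots persist for >= 15 frames and their city block
# distance stays below 20, e.g.
# red, green = colour_spot(frame, 0.90, 1.00), colour_spot(frame, 0.30, 0.40)
# if red and green and abs(red[0] - green[0]) + abs(red[1] - green[1]) < 20: ...
```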
2.5 View classification
After edge analysis, we have two sets of events: probable goal events and low-impact events. In order to classify or label these events appropriately, it is necessary to capture the temporal pattern of the frames of an event; to do so, every frame of an event must be labelled. This process is referred to as view classification. Since this process is entirely independent of event filtration and categorization, it can be applied in parallel. In order to carry out view classification, we extract visual features from the frames and classify them into one of the predefined views. The characteristics of the different views are described below:
- Far field view: A far field view displays a global view of the game field. It is captured by a camera at a long distance. It is often used to show the play status, such as play position and long passes. In this view, the ratio of the field area to the whole image is high, and the size of players within the field is small.
- Goal post view: A goal post view displays the goal post area. It is shown when players are attempting to score a goal. If the goal post is captured from a long distance, the ratio of the field area is high; otherwise, the field ratio is medium to low. The goal post view is partially dominated by the audience.
- Medium field view: A medium view is a zoom-in view of a specific part of the field. It usually shows players and referees with the field as background. In a medium view, the size of players in the playfield is bigger than in a long view, and the field ratio is in the medium range.
- Close-up view: A close-up view displays a close-up of players, the coach or a gathering of players against a non-field background. It often focuses on the leading actor of the current event. In this view, the field ratio is very low.
- Audience view: An audience view displays the audience as an indication of a break caused by a highlight, such as an audience cheering view after a goal. In the audience view, the field ratio is extremely low, and the texture is generally dominant and complex.

The view classification system is shown in Figure 9. At the first level of classification, Algorithm I is applied on all the frames of an event for far field and non-far field view classification.
Algorithm I: field view detection
1. Convert the input frame from an RGB image into an HSV image.
2. Get the hue histogram of the image.
3. Define the hue range, which covers the different variations of the playfield's green colour, as a green window.
4. Compute the grass pixel ratio (GPR).
5. Apply the K-means algorithm on GPR to cluster frames into two clusters, one with high GPR values and the other with low GPR values.
The playfield usually has a distinct tone of green that may vary from stadium to stadium across different soccer leagues. Matches played under floodlight exhibit different tones of green than those played in sunlight, and shadows on the playfield under sunlight also affect the intensity of the green colour. Therefore, a hue range that covers the green colour of different playfields is carefully decided and identified as the green range. The hue range for the identification of various shades of green is set between 0.23 and 0.38, which we refer to as the green window. We also involve the saturation and value components by requiring them to be greater than 0.40. Because of the varying green tones, the grass pixel ratio differs largely across datasets; hence, it is not wise to set a static threshold for separating far field and non-far field views. Instead, in step 5, we apply k-means to cluster these views, so Algorithm I classifies each frame as either far field view or non-far field view. A minimal sketch of this classification is given below, followed by the proposed goal post view detection method (Algorithm II).
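The sketch below computes the grass pixel ratio over the green window and uses OpenCV's k-means to split frames into far field (high GPR) and non-far field (low GPR) views; the function names and the choice of cv2.kmeans are assumptions.

```python
import cv2
import numpy as np

def grass_pixel_ratio(frame_bgr, hue_range=(0.23, 0.38), sv_min=0.40):
    """GPR: fraction of pixels falling inside the green window (step 4)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hue = hsv[:, :, 0] / 180.0
    sat = hsv[:, :, 1] / 255.0
    val = hsv[:, :, 2] / 255.0
    green = ((hue >= hue_range[0]) & (hue <= hue_range[1]) &
             (sat >= sv_min) & (val >= sv_min))
    return float(np.count_nonzero(green)) / green.size

def far_field_labels(frames_bgr):
    """Cluster per-frame GPR values into far field (1) and non-far field (0)."""
    gpr = np.array([[grass_pixel_ratio(f)] for f in frames_bgr], dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1e-3)
    _, labels, centers = cv2.kmeans(gpr, 2, None, criteria, 5,
                                    cv2.KMEANS_PP_CENTERS)
    far_cluster = int(np.argmax(centers))      # cluster with the higher mean GPR
    return (labels.ravel() == far_cluster).astype(np.uint8)
```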
Algorithm II: goal post detection
1. Convert the input RGB image into a grey scale image.
2. Apply the Sobel edge detection operator on the grey image to detect vertical edges.
3. Erode the image with a vertical structuring element.
4. Apply a canny edge detection operator on the grey image to detect field lines near the goal post.
5. Apply the Hough transformation.
6. If vertical parallel lines and parallel lines on the field are detected, then the frame belongs to a goal post view.
The Sobel edge detection operator is applied on the image to detect vertical lines. The resultant image contains the vertical lines of the goal post as well as many other vertical lines whose length is less than that of the goal post lines. To remove such unimportant lines, an erosion operation with a vertical structuring element of length 5 is applied to the resultant image. The output of an edge detection operation is an image described by a set of pixels having vertical edges; this set of pixels rarely characterizes an edge completely because of noise and breaks in the edge. Edge detection is therefore followed by an edge linking technique to assemble edge pixels into meaningful edges: we apply the Hough transform to detect linked vertical edges. If parallel vertical lines of the goal post and parallel field lines are detected, then we conclude that the frame contains a goal post view. Figure 10c shows the two detected vertical poles of the goal post; these edges may be broken or noisy, so the result of the Hough transformation applied to Figure 10c is shown in Figure 10d. For the detection of the partially horizontal field lines near the goal post, the canny edge detection method is applied, since canny is a good candidate for thin as well as faint edges. Figure 10e depicts the field lines near the goal post, and Figure 10f shows the result of the Hough transformation. Figure 11a,b,c,d,e,f shows the results of the goal post detection method for a left-oriented goal post.
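A sketch of Algorithm II with OpenCV primitives follows; the Canny thresholds, Hough settings and angle tolerances are illustrative choices, not the authors' exact parameters, and only the vertical structuring element of length 5 is taken from the text.

```python
import cv2
import numpy as np

def is_goal_post_view(frame_bgr, pole_len=5):
    """Algorithm II sketch: look for parallel vertical poles and field lines."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Steps 2-3: vertical edges (Sobel in x) eroded with a vertical element.
    vert = cv2.convertScaleAbs(cv2.Sobel(gray, cv2.CV_16S, 1, 0, ksize=3))
    _, vert = cv2.threshold(vert, 60, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, pole_len))
    vert = cv2.erode(vert, kernel)

    # Step 4: canny edges for the partially horizontal field lines.
    edges = cv2.Canny(gray, 50, 150)

    # Step 5: Hough transform on both edge maps.
    poles = cv2.HoughLinesP(vert, 1, np.pi / 180, 40, minLineLength=30, maxLineGap=5)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 40, minLineLength=40, maxLineGap=10)

    def count_near_angle(segs, target_deg, tol_deg=15):
        """Count line segments whose undirected angle is close to target_deg."""
        if segs is None:
            return 0
        dy = segs[:, 0, 3] - segs[:, 0, 1]
        dx = segs[:, 0, 2] - segs[:, 0, 0]
        ang = np.degrees(np.arctan2(dy, dx)) % 180.0
        diff = np.abs(ang - target_deg)
        diff = np.minimum(diff, 180.0 - diff)
        return int(np.sum(diff < tol_deg))

    # Step 6: require at least two near-vertical poles and two near-horizontal field lines.
    return count_near_angle(poles, 90) >= 2 and count_near_angle(lines, 0) >= 2
```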
2.6 Audience view detection
Classification of audience view versus close-up view is based on the EPR. Edge images generated using canny edge detection are shown in Figure 12b,d along with their EPR values; the EPR value of the audience view is considerably higher than that of the close-up view. The EPR threshold is statically set to 4.5 after experimenting on a large number of frames from videos captured under different conditions. The audience view detection algorithm is described below.
Algorithm III: audience view detection
1. Convert the input RGB image into a grey image.
2. Convert the grey image into a binary image using the canny edge detection operator.
3. Compute the EPR as shown in Equation 5.
4. Define the edge pixel threshold (EPth) for audience view classification.
5. If EPR > EPth, then
   i. the frame is classified as audience view.
6. Else
   i. the frame is classified as close-up view.
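As a closing illustration, the following sketch applies Algorithm III with the 4.5 threshold quoted above; the Canny thresholds are illustrative, and the EPR is taken here as the percentage of edge pixels in the frame, which is an assumed reading of Equation 5.

```python
import cv2
import numpy as np

def classify_audience_or_closeup(frame_bgr, ep_threshold=4.5):
    """Algorithm III sketch: audience vs. close-up view from the edge pixel ratio."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                       # binary edge image

    # Assumed reading of Equation 5: percentage of edge pixels in the frame.
    epr = 100.0 * np.count_nonzero(edges) / edges.size

    return "audience" if epr > ep_threshold else "close-up"
```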