Transition effect detection for extracting highlights in baseball videos
EURASIP Journal on Image and Video Processing volume 2013, Article number: 27 (2013)
In this research, we present a transition effect detection scheme for identifying possible highlight segments in baseball videos. The effects that broadcasters insert manually to signal slow-motion segments are extracted, and the frames containing such effects serve as anchor positions for further processing. A set of video segments is first chosen to construct the ‘transition effect template’ for the archived video. Candidate frames are then compared with this template to search for the slow-motion video segments. In baseball videos, we further construct the ‘pitching view template’ so that the starting positions of the video segments of interest can be located. By processing these segments only, we may further employ methods such as the hidden Markov model to classify their content. The major contribution of this research is the use of compressed-domain features to achieve efficiency. The experimental results show the feasibility of the proposed scheme.
Watching sportscasts has been a popular pastime worldwide, and many viewers choose to record their favorite games for archiving or time-shifting purposes. Thanks to the superior perceptual quality of digital visual content and the convenience of storing, transmitting, and even processing it, digital recording facilities with lower cost and more computational power are becoming widely available. When users sit down to enjoy their archived digital videos, they may be more interested in watching only the game highlights, which saves them a substantial amount of time without sacrificing too much excitement. Therefore, efficient and effective sports video highlight extraction from digital content has attracted a lot of research activity [1–10] in recent years.
The approaches to extracting highlights from sports videos may be roughly classified into four categories. The first approach is to identify the unique visual and/or audio characteristics that may exist in game highlights. When an impressive performance occurs in a sportscast, a typical scene or sound may appear. By combining the audio-visual features with domain-specific knowledge, we may obtain a better understanding of the content. Visual features such as the color histogram, types of camera shots, and motion information, and audio features such as the zero-crossing rate, frequency spectrum, and signal energy level, help to identify the special events. Wang et al. presented a soccer goal extraction algorithm that analyzes the correlations among scenes to extract the ones containing goal shot attempts. A graphic representation was proposed by Ren et al. to facilitate the analysis of temporal saliency in soccer videos. In baseball videos, the combination of certain court views may be useful in determining the play of home runs or base hits. A higher-level understanding of the baseball game for highlight detection can also be achieved by delicate scene analysis. Sound processing is also applied quite often in video highlight extraction [15–19]. In sports videos, the sound from the crowds at the stadium or the speaking tone of the anchorman/commentators reflects the exciting moments of the games. The identification of such sounds as the whistles from the referees or ball hitting is also beneficial. The major drawback of sound processing may be the higher false identification rate. For example, the crowds in the stadium may not cheer for the visiting team. Additional processing may be needed to increase the accuracy.
The second approach is to analyze the text data shown in the sports videos. The caption sent along with the transmitted video surely provides more accurate information. If the caption is not available, the so-called video optical character recognition can be applied to identify the content of score boxes superimposed on the sides of the screen [20–24]. The moment when the score changes in a game is what the audience cares about, so the message conveyed in the score boxes assists the browsing of sports videos. The major challenge of this approach may be the inconsistent forms of score boxes in different sports games, as their sizes/types may differ.
The third approach is to determine the slow-motion replays in sports videos [25–27]. After a special event happens in a ball game and the broadcasters judge that the audience may be interested in viewing it again, the video segment is replayed at a slower pace. It has been observed that the replayed video segments may demonstrate certain visual representations, such as the repeated fields in TV broadcasting, the unique statistics of motion vectors in MPEG video, and the scene transitions. These characteristics may be used to differentiate the slow-motion segments from normal scenes. Some slow-motion replays are shown after fading in/out or dissolving effects, so the successful detection of these effects may help to identify the replays. Giusto et al. viewed slow-motion replays as special effects and employed the fractal/wavelet decomposition to detect them. However, the accuracy of slow-motion detection may be affected by the way the replays are processed, since this is broadcaster dependent. In addition, some slow-motion scenes are quite difficult to differentiate from the normal ones, even by the human eye. Certain replays may even be displayed with varying speeds to attract the viewers’ attention, and this inconsistent structure of slow-motion replays may complicate the process of extraction.
The fourth approach is to employ the methodology of multimodal fusion [34–40] to build highlight extraction/classification systems, which may bridge the gap between the extracted low-level features and the semantics of the data. Bertini et al. fused the camera motion, play-field zone, and players’ positions for highlight annotation. Shih et al. employed the maps of spatial/temporal features and face information to construct the attention model for identifying the highlights. Zhu et al. proposed a multimodal approach to organize the highlights extracted from racket sports videos by using a nonlinear ranking model. They also proposed to fuse text, time, and view types to extract attack events for tactics classification in soccer videos. Niu et al. further proposed a real-world trajectory extraction method based on field line detection to recognize six typical soccer attack patterns for tactic analysis.
The hidden Markov model (HMM) is utilized quite often in extracting highlights from sports videos. Cheng et al. developed a baseball highlight extraction scheme based on HMM by fusing video and audio features. Papadopoulos et al. utilized the motion vectors, and Kijak et al. [47, 48] made use of the structure of video shots in the training phase of their HMM-based schemes. Nguyen et al. employed principal component analysis and the frame features for data fusion. Wang et al. proposed to convert the low-level features into a keyword sequence for their HMM classifier by using the Viterbi algorithm. Delakis et al. employed HMM and segment models for audio-visual integration in video indexing. Chang et al. applied HMM by using scene shots and visual features in baseball games. Chen et al. further employed HMM to analyze the details of ball hitting events. Ouazzani et al. combined Bayesian inferences and HMM in soccer games. Instead of using the general HMM, Ding et al. employed the multi-channel segmental HMM for video mining in football videos. Tang et al. made use of MPEG-2 features and HMM to detect highlights in cricket games.
In this research, we present a transition effect detection scheme for locating the replay segments in baseball videos. In our opinion, the replays are selected manually, so the associated content should be more closely related to the game highlights. Besides, the insertion of such transition effects by broadcasters is becoming a trend because it serves both as advertisement and as a signal that informs the audience of replays. We may classify our scheme as a replay-related approach or as a ‘logo-based’ approach, since a transition effect usually displays a team, business, or merchandise logo. Pan et al. first proposed to detect the logos for replay detection. Their previous method of detecting slow-motion segments was applied to locate possible replays, and the frames before and after each segment were then compared to see whether similar contents or logos exist. Tong et al. proposed to detect certain logo transitions via frame-by-frame differences. The logo template was then formed from some detected candidates for further matching. Su et al. made use of the unique characteristics of transition effects in MPEG-2 bit streams for detecting replays. Zhao et al., Dang et al., and Li et al. extracted the superimposed logos from video frames by employing rule-based methods. Song et al. proposed to detect the logos and apply audio-visual multimodal analysis for verification. Xu et al. detected the logos by calculating the accumulated differences in frames to form the logo template from the candidate set in soccer videos. Zhao et al. employed speeded-up robust features to find repeated logo patterns in video frames and then searched for those patterns to handle various transition types. Although quite a few methods utilizing transition effects for locating replays have been proposed, most of them rely on fully decoding the video frames to extract either spatial or temporal features and are thus time-consuming.
In our opinion, highlight extraction is an auxiliary function of a video recorder and should not be computationally expensive. As videos are often archived in MPEG format these days, schemes that work directly in the compressed domain are preferred in manufacturing electronic products. Therefore, we further simplify and extend our early work to develop a compressed-domain transition effect detection scheme for highlight extraction. We make use of both the characteristics of the effects and their repeated appearances to construct the associated templates in the investigated video, which reduces the difficulty of using a set of fixed parameters or rules to identify all kinds of effects correctly. The classification of highlights, which can also be operated in the compressed domain, is then facilitated by analyzing the video segments of interest only. We describe the details of the proposed scheme, including the feature extraction, the construction of templates, and the classification of highlights, in the following sections. Experimental results show the feasibility of our method.
The proposed scheme
Figure 1 shows the block diagram of the proposed scheme. The input of the system is a compressed video in either MPEG-1 or MPEG-2 format, from which the representative features are extracted for the subsequent processing. We use a longer video segment, which can cover a few transition effects, to train the ‘templates’ of transition effects and pitching views in baseball videos so that these segments can be located more accurately. The so-called processing units that may include the transition effects are formed from this training video segment, and the transition effect template is constructed from them by majority voting. Then we construct the pitching view template. Since the transition effects always come with scene changes, compressed-domain scene-change detection is applied and the frames around the scene changes are compared with the effect template. Once the frames with transition effects are identified, the pitching views associated with the plays are extracted by using the pitching view template. The contents of the plays can then be classified by a method such as HMM, which is trained off-line. In the following subsections, we examine each step in detail.
In this subsection, we describe the procedures of generating the data for effective processing, including the extraction of features from the MPEG stream and the detection of scene changes.
Features from the MPEG bit stream
The features for the subsequent processing are extracted from the MPEG-compressed bit stream. The coding modes and motion vectors, which can be acquired conveniently, are employed to determine the variation of content in adjacent frames. The mean values of blocks, which are derived from the lowest frequency coefficients, i.e., the ‘DC’ coefficients in DCT (discrete cosine transform) blocks, provide the color information in the frames. The ‘DC frames,’ which are coarse down-sampled frames with size equal to 1/8×1/8 of the original frame resolution, are constructed as follows. In I frames, we can retrieve the DC coefficients directly as they are only differentially Huffman-coded. For P frames, Figure 2 shows the four 8×8 blocks $B_a$, $B_b$, $B_c$, and $B_d$ in the reference frame and the block $B_p$ in the currently processed frame. The best match of $B_p$ in the reference frame, $\hat{B}_p$, has been found by the motion estimation and is marked by the dashed line in Figure 2. Given that $\hat{B}_p$ covers parts of $B_a$, $B_b$, $B_c$, and $B_d$ with areas $A_a$, $A_b$, $A_c$, and $A_d$, respectively, the DC coefficient of $\hat{B}_p$, i.e., $DC(\hat{B}_p)$, is estimated by
$$DC(\hat{B}_p) = \frac{1}{64} \sum_{k \in \{a,b,c,d\}} A_k \, DC(B_k).$$
The DC coefficient of the residual block of $B_p$ is then decoded from the MPEG-compressed bit stream and added onto $DC(\hat{B}_p)$ to form the estimated DC coefficient of $B_p$, i.e., $DC(B_p)$, whose value is limited to [0, 255]. A similar procedure can be applied to B frames so that we acquire the estimated DC frames of the video segment of interest. Special care has to be taken at the boundaries of a frame. After applying this process to all the blocks in inter-coded frames, we obtain every DC frame of the video.
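The area-weighted estimate described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the layout of the four reference blocks (top-left, top-right, bottom-left, bottom-right) is an assumption, since Figure 2 is not reproduced here.

```python
def overlap_areas(dx, dy, block_size=8):
    """Overlap areas (A_a, A_b, A_c, A_d) of the matched block with the four
    reference blocks, assuming B_a is top-left, B_b top-right, B_c bottom-left,
    and B_d bottom-right, and that the matched block is displaced by (dx, dy)
    pixels from the corner of B_a, with 0 <= dx, dy < block_size."""
    w_left, w_right = block_size - dx, dx
    h_top, h_bottom = block_size - dy, dy
    return (w_left * h_top, w_right * h_top, w_left * h_bottom, w_right * h_bottom)


def estimate_reference_dc(dc_values, areas, block_size=8):
    """Area-weighted DC estimate of the motion-compensated reference block."""
    total_area = block_size * block_size
    if sum(areas) != total_area:
        raise ValueError("overlap areas must sum to the block area")
    return sum(a * dc for a, dc in zip(areas, dc_values)) / total_area


def estimate_block_dc(dc_values, areas, residual_dc):
    """Add the decoded residual DC and limit the result to [0, 255]."""
    estimate = estimate_reference_dc(dc_values, areas) + residual_dc
    return max(0.0, min(255.0, estimate))
```

For example, `estimate_block_dc((10, 20, 30, 40), (16, 16, 16, 16), 5)` weights the four DC values equally before adding the residual.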
The procedure of our scene-change detection using the MPEG features is as follows. We first extract the DC frames of the I frames, $I_i$ and $I_j$, from two adjacent GOPs, $GOP_i$ and $GOP_j$, respectively. We compute the histograms of $I_i$ and $I_j$ to form two vectors and calculate the distance between them.
If this distance is larger than a threshold $T_i$, a scene change is identified as occurring between $I_i$ and $I_j$. Next, we calculate the percentage of intra-coded macroblocks in every P frame in $GOP_i$. The P frame with the largest percentage, denoted by $P_m$, is chosen and its percentage is compared with another threshold $T_P$. If the percentage exceeds $T_P$, we calculate the histogram distance between $I_i$ and $P_m$ to ensure that they are not similar frames. If this distance is larger than a threshold $T_D$, $P_m$ is chosen as the frame with scene change, $F_c$. Otherwise, $I_j$ is chosen as $F_c$. We do not process B frames at this stage because the accuracy is already good enough and the complexity can thus be reduced. In other words, the percentage of intra-coding in P frames serves as a good indication of content fluctuations at a small computational cost.
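The decision logic above can be sketched in Python as follows. This is only an illustration: the 16-bin histogram and the sum-of-absolute-differences distance are assumptions made here, and the function names and threshold arguments are hypothetical.

```python
def histogram(dc_frame, bins=16):
    """Normalized histogram of a DC frame given as a 2D list of 0..255 values."""
    counts = [0] * bins
    for row in dc_frame:
        for value in row:
            counts[min(int(value) * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """Assumed distance: sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def locate_scene_change(dc_i, dc_j, p_frames, t_i, t_p, t_d):
    """Return None (no scene change), ("P", m) if the m-th P frame of GOP_i is
    chosen as F_c, or ("I", None) if I_j is chosen.
    p_frames is a list of (intra_rate, dc_frame) pairs for the P frames of GOP_i."""
    h_i = histogram(dc_i)
    if histogram_distance(h_i, histogram(dc_j)) <= t_i:
        return None
    if p_frames:
        m = max(range(len(p_frames)), key=lambda k: p_frames[k][0])
        rate, dc_m = p_frames[m]
        # P_m is accepted only if it is heavily intra-coded and differs from I_i.
        if rate > t_p and histogram_distance(h_i, histogram(dc_m)) > t_d:
            return ("P", m)
    return ("I", None)
```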
Two templates will be constructed for each baseball game video, i.e., the transition effect template and pitching view template.
Transition effect template
We first have to collect video segments that probably contain the effects. Our objective here is to ensure that a transition effect, if it exists, is completely covered by the selected segments, i.e., the processing units. Since an effect usually causes large variations in the contents of frames, scene changes can always be found within the duration of an effect. Figure 3 shows the percentage of intra-coding in P frames in a typical video segment containing transition effects and the associated slow-motion replay. There are seven scenes in the video, with scenes (1), (6), and (7) showing the normal plays and scenes (3) and (4) demonstrating two different views in the slow-motion replay. The large numbers of intra-coded macroblocks between (3) and (4) and between (6) and (7) clearly indicate the scene changes. Scenes (2) and (5) in Figure 3 illustrate the transition effects. We can see that surges of the intra-coding percentage occur during the appearance of transition effects. This may be explained by Figure 4, which shows consecutive frames of an effect. When this effect first appears, it usually covers a small portion of a frame, as shown in Figure 4b, so the number of intra-coded macroblocks is also small. This number increases as the effect emerges and reaches its maximum when the complete logo is shown. The other observation is that there are more P frames with a large number of intra-coded macroblocks in the duration of an effect than in simple scene changes, since the effects usually continue for a short while and their fast-moving characteristics affect the coding of several macroblocks. The two-peak structure in Figure 3 comes from the fact that the effects emerge and then disappear quickly, and both actions result in many intra-coded macroblocks. It should be noted that this phenomenon is not a specific case but exists in many transition effects that we have observed.
Furthermore, Figure 5 shows the curves of intra-coding percentages in P frames from five varying transition effects. The data of four different video segments of the same effect are plotted together. We can find that, in addition to the existence of multiple peaks in each case, the shapes of the curves of the same effect tend to be similar because the effects usually dominate in the frames and affect the coding in a similar manner.
After the compressed-domain shot boundary detection determines the frame of shot change, forward/backward extensions are made to establish a processing unit of several frames by the following procedure. From the scene-change frame, $F_c$, we search backward and forward to find the temporary starting frame, $F_s$, and ending frame, $F_e$, of the processing unit. We have to include more frames than necessary to make sure that the entire transition effect is covered. Since the transition effect is usually inserted when a play stops, and the scenes before and after the transition effect seldom contain large content variations, we select the frame as $F_s$ ($F_e$) once we meet $N=5$ consecutive P frames whose intra-coding percentages are smaller than a threshold in the backward (forward) search. A refinement process is then applied on the constructed DC frames as follows. A transition effect is visually different from the scenes before and after it, so we can remove a frame at the beginning (end) of the current processing unit if it is similar to the frame right after (before) it. To be more specific, in order to determine a suitable starting frame of a processing unit, we check the color difference of the first two frames, $F_1$ and $F_2$, which is computed from the DC values of the corresponding 8×8 blocks over all $M$ blocks in a frame.
If the difference is not large, we delete $F_1$ from the processing unit, make $F_2$ the starting frame, and repeat the process. The same procedure is applied at the end of the processing unit in the reverse order. After the refinement, the first and last frames of the processing unit are thus quite different from the preceding and following frames, respectively. In addition, we remove/ignore the unit once the number of frames in the unit becomes less than a threshold value, $T_l=60$, to discard some normal scene changes and even zoom-in/out shots. Finally, we check the current and previous processing units and merge the two units if they overlap.
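The refinement step can be sketched as below. The exact color-difference measure is not recoverable from the text, so the mean absolute difference of block DC values is used here purely as an assumption; `refine_unit` and `t_color` are hypothetical names.

```python
def color_difference(frame_a, frame_b):
    """Assumed measure: mean absolute difference over the M block DC values."""
    m = len(frame_a)
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / m


def refine_unit(frames, t_color, min_length=60):
    """Trim similar frames from both ends of a processing unit.

    frames is a list of DC frames, each a flat list of block DC values.
    Returns inclusive (start, end) indices, or None if the refined unit has
    fewer than min_length frames (T_l in the text)."""
    start, end = 0, len(frames) - 1
    # Drop the first frame while it is similar to the frame that follows it.
    while start < end and color_difference(frames[start], frames[start + 1]) <= t_color:
        start += 1
    # Same procedure at the end of the unit, in reverse order.
    while end > start and color_difference(frames[end], frames[end - 1]) <= t_color:
        end -= 1
    if end - start + 1 < min_length:
        return None
    return start, end
```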
Most transition effects are usually superimposed objects/logos on the video frames so when the artificial effect appears in a frame, certain parts of the scene in the ball game will also be revealed. The revealed ‘background’ pixels will complicate the identification of the ‘foreground’ transition effect so the pixels associated with the effect should be identified. In most of the cases, the background scenes before or after the effect may look quite different from the frames of the replay. Therefore, given that the starting and ending frames of the processing unit are F s and F e respectively, we will pick the frame preceding F s and the frame following F e as the background frames. We then compare the luminance DC values of all frames in the processing unit with those in the two background frames. If the DC difference at the same location in a frame and either one of the background frames is large, we mark this location as being covered by the transition effect. We can thus form a binary mask called ‘effect mask’ which indicates the pixel associated with the effect.
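A minimal sketch of the effect mask construction follows. It assumes that each DC frame is a 2D list of luminance DC values and that "large" means exceeding a hypothetical threshold `t_diff`; neither detail is fixed by the text.

```python
def effect_mask(frame, background_before, background_after, t_diff):
    """Binary effect mask: 1 where the luminance DC value differs from either
    background frame by more than t_diff, 0 elsewhere."""
    mask = []
    for row, row_b, row_a in zip(frame, background_before, background_after):
        mask.append([
            1 if abs(v - b) > t_diff or abs(v - a) > t_diff else 0
            for v, b, a in zip(row, row_b, row_a)
        ])
    return mask


def foreground_size(mask):
    """Number of positions marked as belonging to the effect."""
    return sum(map(sum, mask))
```

The "either one" rule is taken literally from the description; a stricter variant that requires a large difference from both background frames would mark fewer positions.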
Next, we employ the refined processing units that are assumed to include transition effects for training the template. The cross-correlation and majority-voting approaches are adopted to obtain the template, which is used to track all the slow-motion replays in the video. To be more specific, after marking the spatial locations of an effect in each frame in the candidate processing units, we calculate the cross-similarity of mask positions and colors among these units for grouping. This process may be time-consuming since we need not only to calculate the similarity of masks/colors between each pair of units but also to temporally synchronize each pair. We choose to simplify this process by exploiting the intra-coding rates shown in Figure 5, in which the same effects tend to have similar curves of intra-coding rates in P frames. In other words, the peaks in the curves appear at the same frames in the processing units covering the effects. We thus first apply a one-dimensional matching to these curves of intra-coding rates. For each pair of processing units, $(PU_i, PU_j)$, after recording the intra-coding rates in P frames as vectors, $s_i$ and $s_j$, we zero-pad the vectors so that their lengths are the same and equal to a power of 2. Their (circular) cross-correlation can be calculated efficiently via the fast Fourier transform (FFT) by
$$C_{intra}(PU_i, PU_j) = \mathrm{IFFT}\left(\mathrm{FFT}(s_i) \odot \mathrm{FFT}(\bar{s}_j)\right),$$
where $\odot$ indicates the point-by-point multiplication, $\bar{s}_j$ is the flipping of $s_j$, and IFFT indicates the inverse fast Fourier transform. If $C_{intra}(PU_i, PU_j)$ is larger than a threshold, $PU_i$ and $PU_j$ are viewed as a candidate pair, and the index of the largest $C_{intra}$ helps to roughly synchronize $PU_i$ and $PU_j$.
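The one-dimensional matching can be sketched in pure Python with a small recursive radix-2 FFT. This is an illustrative sketch, not the authors' implementation; the flipping is interpreted here as a circular time reversal, which is an assumption, and the function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of 2. The inverse
    transform is returned without the 1/n scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def next_power_of_two(n):
    p = 1
    while p < n:
        p *= 2
    return p


def circular_cross_correlation(s_i, s_j):
    """IFFT(FFT(s_i) * FFT(flipped s_j)) after zero-padding to a power of 2."""
    n = next_power_of_two(max(len(s_i), len(s_j)))
    a = list(s_i) + [0.0] * (n - len(s_i))
    b = list(s_j) + [0.0] * (n - len(s_j))
    b_flipped = [b[-k % n] for k in range(n)]  # circular time reversal
    product = [p * q for p, q in zip(fft(a), fft(b_flipped))]
    return [c.real / n for c in fft(product, inverse=True)]


def best_lag(corr):
    """Index of the largest correlation value, used for rough synchronization."""
    return max(range(len(corr)), key=corr.__getitem__)
```

With this convention, entry `k` of the result equals the sum over `n` of `s_i[n + k] * s_j[n]`, so the best lag is the number of P frames by which the curve of $PU_i$ trails that of $PU_j$.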
For a selected and roughly synchronized pair, PU i and PU j , their masks and colors will be further compared to achieve a more accurate matching. The procedure is shown in Figure 6. We first extract the frame at the center of PU i , F C,0, and its adjacent frames, from F C,K to F C,−K . From these 2K+1 frames, the frame with the largest foreground, F L,0, will be picked as the anchor frame, which will be compared or matched with the frames in PU j . This strategy comes from the fact that a transition effect usually looks more clearly and occupies a larger portion in the middle of its appearance. K is empirically set as 8 to select one frame from the span of around half a second. One may think that a larger K should provide us the better chance of obtaining a larger logo. Nevertheless, in many transition effects designed these days, the logo may occupy larger areas in frames at the end of its appearance but, at this moment, the logo is usually semi-transparent and cannot help to construct a good template. Therefore, we still prefer to find the logo in the middle of its appearance. Furthermore, since the contents of consecutive frames may be similar, in order to increase the accuracy of synchronization, we also include the other two frames, F L,Q and F L,−Q , to form the three anchor frames for matching. Q is set as 8 so that the three anchor frames can be slightly different from each other and contain the logo as well. Then, we shift PU j ±8 frames, one frame at a time, and count the matched foreground pixels in the corresponding three anchor frames. The mask/color matching is applied on the DC frames. The pixels are viewed as being matched if they are both in the foreground area and the difference of their colors is within 8. The largest number of matched pixels will determine whether PU i and PU j are a synchronized pair.
In the sportscast nowadays, slow-motion replays are usually sandwiched by two transition effects, which may be different. Therefore, one group (if a single logo is used) or two groups (if two different logos exist) of matched processing units will have obviously more processing units. Then, we choose one unit from the largest group and check the corresponding pixels of other units in the group. If the pixels in DC frames are both in the foreground areas and their luminance values are close, the location is ruled as being matched. The frame with the largest number of matched pixels is selected as the template frame, and the luminance mean at these matched positions in the units will be calculated to form the template. In fact, we adopt a more efficient way by iteratively forming the groups during the process of making processing units. In other words, a new processing unit will be compared with the existing ones to see if a synchronized pair can be found. We keep track of the numbers of matched units in groups and when this value in a certain group is larger than the threshold T g =4, we stop the collection of processing units and then simply construct the template by using the matched units. Two examples are shown in Figure 7, including the constructed template frames and the associated video frames. The green pixels indicate the locations of background, which are not supposed to be related to the effect.
After the template frame is constructed, all the transition effects can be detected effectively to locate the slow-motion replays. One possible way is to generate the processing units by the same procedure as in the template training phase, that is, some refined processing units are extracted and their DC frames are compared with the template frame based on the similarity of colors and masks. Nevertheless, some effects may be missed in this way. In order to find all the transition effects related to the slow-motion replays, we choose a rather conservative way and match the frames near the detected scene-change frames with the template. Because the number of scene changes in a video is large, we employ the intra-coding rate of P frames to reduce the number of matching operations. According to Figure 5, the intra-coding rate of P frames during a transition effect is usually quite high. Therefore, when we construct the template of the transition effect in this video, we also calculate the average of the largest intra-coding rates in the effects and scale this value by a factor (0.7) to form the threshold. Given a scene-change frame, we check the intra-coding rates of P frames within a span of around 2 seconds. If the intra-coding rate of a P frame is higher than this threshold, the corresponding DC frames are matched with the template frame to determine whether a transition effect happens here. This method effectively avoids skipping possible transition effects and still allows an efficient implementation. Again, the matching is basically executed by comparing the luminance values of pixels covering the effect in the template frame.
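The screening rule can be sketched as follows. The window size in P frames is an assumption standing in for the span of around 2 seconds, and the function names are hypothetical.

```python
def effect_threshold(peak_rates, scale=0.7):
    """Average of the largest intra-coding rates observed in the training
    effects, scaled by the factor 0.7 mentioned in the text."""
    return scale * sum(peak_rates) / len(peak_rates)


def frames_to_match(intra_rates, scene_change_index, threshold, window):
    """Indices of P frames within +/- window of a scene change whose
    intra-coding rate exceeds the threshold. Only these frames are compared
    with the template frame."""
    lo = max(0, scene_change_index - window)
    hi = min(len(intra_rates), scene_change_index + window + 1)
    return [k for k in range(lo, hi) if intra_rates[k] > threshold]
```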
Pitching view template
When viewers browse the video, they may prefer to watch the plays displayed at normal speed instead of in slow motion. Therefore, an appropriate starting position of the real/normal play of a game highlight should be located. Since a play in a baseball game always starts with the pitching view consisting of the pitcher, catcher, batter, and umpire, we try to locate the pitching view right before the detected transition effect. That is, after the transition effect is identified, we trace back to find the pitching view by matching the data with a pitching view template, which is again established for this specific ball game. The other motivation for finding pitching views is related to the content analysis. It should be noted that designing a common model for content identification/classification directly from slow-motion segments is challenging, since the camera angles or the ways of displaying replays may vary considerably in ball games. In contrast, the video segments of real/normal plays exhibit more unified structures, so their analysis may lead to better results.
By observing that a pitching view shot usually appears within a few shots before a transition effect, we collect a few scene-change frames before the transition effects. Because the scenes of pitching views are almost the same in one game and other views are essentially different from each other, we can apply the majority-voting strategy again to construct the pitching view template. We make use of the same training video segment as in the construction of the effect template. To be more specific, after the transition effects are located, we search backward from each transition effect to find several scene-change frames whose associated scenes are reasonably long (longer than 1 s). The closest I frame within the scene is selected, and the spatial feature is extracted for the comparison. For an $M \times N$ DC frame of an I frame, the singular value decomposition is applied to the mean-removed block, $X_{M \times N}$, as
$$X_{M \times N} = U \Lambda V^T = \sum_{i=1}^{N} \lambda_i u_i v_i^T,$$
where $u_i$, $v_i$ are the columns of $U$, $V$, representing eigenvectors of $XX^T$ and $X^TX$, respectively, and $\Lambda$ is a diagonal matrix with $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_N$ on the diagonal. We choose the first eigenvectors, $u_1$ and $v_1$, as the extracted feature of the block. As mentioned before, the pitching views of the same game tend to have a similar structure. Therefore, we group the features of the selected shot-change frames to build the template of the pitching view. For each pair of candidate scene-change frames, $F_i$ and $F_j$, we calculate the correlation of their features, $u_i$ and $u_j$ ($v_i$ and $v_j$), to obtain $corU_{ij}$ ($corV_{ij}$). $F_i$ and $F_j$ are placed in the same group if the following conditions are satisfied:
$$corU_{ij} > T_s \quad \text{and} \quad corV_{ij} > T_s, \qquad (6)$$
where $T_s$ is empirically set as 0.9. The group with the largest number of pairs is employed to calculate the representative features, $u_m$ and $v_m$, which are the median values of the features in this group. In addition, the mean of the frames in the group, DCmean, is calculated as the threshold for rough screening.
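The feature extraction and the grouping test can be sketched in pure Python. Power iteration is used here in place of a full singular value decomposition, which is a substitution made for brevity. The sign normalization and the use of the inner product of unit vectors as the correlation are likewise assumptions, and the function names are hypothetical.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec] if norm > 0 else list(vec)


def first_singular_vectors(frame, iterations=200):
    """First left/right singular vectors (u_1, v_1) of the mean-removed DC
    frame, obtained by power iteration on X^T X."""
    rows, cols = len(frame), len(frame[0])
    mean = sum(sum(r) for r in frame) / (rows * cols)
    x = [[val - mean for val in r] for r in frame]
    v = _normalize([1.0 / (k + 1) for k in range(cols)])  # deterministic start
    for _ in range(iterations):
        xv = [sum(r[k] * v[k] for k in range(cols)) for r in x]
        v = _normalize([sum(x[i][k] * xv[i] for i in range(rows)) for k in range(cols)])
    u = _normalize([sum(r[k] * v[k] for k in range(cols)) for r in x])
    # Resolve the sign ambiguity: make the largest component of u positive.
    if max(u, key=abs) < 0:
        u, v = [-c for c in u], [-c for c in v]
    return u, v


def correlation(a, b):
    """Inner product of two unit vectors (assumed correlation measure)."""
    return sum(p * q for p, q in zip(a, b))


def same_group(frame_i, frame_j, t_s=0.9):
    """Both correlations must exceed T_s for the two frames to be grouped."""
    u_i, v_i = first_singular_vectors(frame_i)
    u_j, v_j = first_singular_vectors(frame_j)
    return correlation(u_i, u_j) > t_s and correlation(v_i, v_j) > t_s
```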
The determination of the pitching view can then be applied in a straightforward manner. Our scheme simply searches the pitching view frame before a detected transition effect as the starting position of a possible highlight. If a given scene-change I frame has the mean color close to DCmean, its spatial features, u i / v i , will be extracted. The correlation between u i / v i and u m / v m is calculated to determine whether the frame shows a pitching view according to the conditions of Equation 6. Since the pitching view usually lasts for a while, to improve the accuracy, our scheme will identify the pitching view frame if at least three consecutive I frames are recognized as such frames. Figure 8 shows an example of detected pitching views from a one-inning video. We can see from this example that the template has to be resilient to the movements and uniforms of players, and such varying information as texts/numbers on the captions/score boxes.
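The rule of requiring at least three consecutive I frames can be sketched as a backward search. The input is assumed to be a list of per-I-frame decisions in temporal order that ends right before the detected transition effect; the function name is hypothetical.

```python
def find_pitching_view(flags, required=3):
    """Search backward for the closest run of at least `required` consecutive
    I frames recognized as pitching views, and return the index of the first
    frame of that run (or None if no such run exists)."""
    run = 0
    for idx in range(len(flags) - 1, -1, -1):
        run = run + 1 if flags[idx] else 0
        if run == required:
            # Extend to the beginning of the run so the play starts there.
            while idx > 0 and flags[idx - 1]:
                idx -= 1
            return idx
    return None
```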
Although the extracted slow-motion replays certainly provide good references for retrieving the highlights, content analysis is still necessary for identifying and/or classifying the data so that more accurate game highlights can be extracted. Our content analysis is based on HMM, and the compressed-domain features are employed for training our high-level semantic models, which help us to analyze the content more precisely. We collect several baseball videos and train the models off-line for the content classification in the investigated video. In our view, the content analysis here mainly serves as an illustration to show that if the transition effects can be retrieved reliably and the slow-motion replays are located, we should be able to analyze the contents more easily to determine the parts that the viewers really care about. Many existing algorithms may also be employed, and our method can help to further improve their performance since more suitable data are selected for processing.
After locating the transition effect and the associated pitching view, we first examine the number of scene changes in the replay segment. If only one or two scenes exist, the event is ruled a non-highlight event. Four types of highlight events are considered in our scheme: base hit, score, out, and special. The base hit events include base hits without scoring, while the score events may contain hits with scoring, home runs, sacrifice hits, etc. The out events may represent good defensive plays. Other plays such as double plays and errors are categorized as special events.
We adopt supervised training by HMM to classify the content; that is, we extract video segments from some baseball videos for training, each of which starts from the shot next to the pitching view and ends at the shot right before the transition effect. We build an HMM for each of the four highlight events. First, we have to define the following elements of the HMM: the state S, the observation O, the observation probability in the state Pr(O|S), the transition probability A, and the initial state distribution π. In our scheme, the video segment of interest is divided into shots to form the states S in the HMM. In other words, the states are the various video shot types. According to the video segments selected on the basis of the transition effects, we consider eight shot types or states as follows: (a) infield, (b) outfield, (c) home-base, (d) defense-infield, (e) player close-up, (f) player walking, (g) player running, and (h) others, as shown in Figure 9. The low-level features are extracted from the state to form the observations, O, which include (1) the shot length, (2) the intra-coded macroblock percentage in the P frame, (3) the existence of dominant color, and (4) the camera motion. Basically, we record the information in the frames of a shot and then determine the state observations accordingly.
To examine the dominant color, we quantize the 256 colors in DC frames into 16 levels, and the level with the largest count indicates the dominant color, which helps us to identify whether the scene covers a large area of the field. For the camera motion, the motion vectors of each inter-coded frame are examined to see whether zooming occurs; that is, a frame is divided into four quadrants and the directions of the motion vectors in each quadrant are identified. There are basically six types, including intra, skip, and four directions. Each frame is then recognized as containing inward motion directions or not, and several such frames indicate that the shot has zooming operations. Again, the features we use are extracted from the data of the MPEG bit stream to avoid complex operations such as object detection or complicated image processing procedures.
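The two compressed-domain tests described above might look roughly like the following Python sketch. Several details are assumptions made only for illustration: the share `ratio` that the fullest level must reach, the quadrant keys, the sign convention for "inward" motion (x to the right, y downward), the use of summed vectors per quadrant, and the minimum number of inward frames `min_frames`.

```python
def has_dominant_color(dc_values, ratio=0.5):
    """Quantize 8-bit values (0..255) into 16 levels and test whether the
    fullest level holds at least `ratio` of the samples (ratio is assumed)."""
    hist = [0] * 16
    for value in dc_values:
        hist[value >> 4] += 1  # 16 consecutive values share one level
    total = sum(hist)
    return total > 0 and max(hist) / total >= ratio


# Expected (sign of dx, sign of dy) for motion pointing toward the frame
# centre in each quadrant; this convention is an assumption of the sketch.
INWARD = {"tl": (1, 1), "tr": (-1, 1), "bl": (1, -1), "br": (-1, -1)}


def sign(x):
    return (x > 0) - (x < 0)


def frame_is_inward(quadrant_vectors):
    """quadrant_vectors maps 'tl'/'tr'/'bl'/'br' to lists of (dx, dy)."""
    for quadrant, expected in INWARD.items():
        vectors = quadrant_vectors.get(quadrant, [])
        dx = sum(v[0] for v in vectors)
        dy = sum(v[1] for v in vectors)
        if (sign(dx), sign(dy)) != expected:
            return False
    return True


def shot_has_zoom(frames, min_frames=3):
    """A shot is flagged as zooming when enough frames show inward motion."""
    return sum(frame_is_inward(f) for f in frames) >= min_frames
```

Both tests read only values that are already present in the bit stream (DC coefficients and motion vectors), which matches the stated goal of avoiding pixel-domain processing.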
In the training phase of the HMM, we have to evaluate the initial state probability, $\pi_i$, the prior probability of each view type, $\Pr(S_i)$, and the conditional observation probability, $\Pr(O_k|S_i)$, where $1\le i\le 8$ and $1\le k\le 16$. These items can be estimated from the training data via histogram analysis. There are 16 observations since each shot is classified as long or short, fast or slow, containing the dominant color or not, and with or without zooming ($2^4=16$ combinations). The thresholds are carefully set according to the training videos. Given $\Pr(S_i)$ and $\Pr(O_k|S_i)$, we can determine $\Pr(S_i|O_k)$ by Bayes' rule:
$$\Pr(S_i|O_k)=\frac{\Pr(O_k|S_i)\,\Pr(S_i)}{\sum_{j=1}^{8}\Pr(O_k|S_j)\,\Pr(S_j)}.$$
The transition matrix $A$ is an $8\times 8$ matrix since eight states are defined. Each element $a_{i,j}$ indicates the probability that the model evolves from state $S_i$ to state $S_j$, i.e.,
$$a_{i,j}=\Pr\bigl(S(t+1)=S_j \mid S(t)=S_i\bigr),\quad 1\le i,j\le 8,$$
where $t$ is the state or shot index and $\sum_{j=1}^{8}a_{i,j}=1$. Because the shot types of the training videos have been set manually, $A$ can also be computed in a largely automatic manner. An HMM can thus be described by $\Lambda=(A,B,\pi)$, in which the elements of $B$ (the matrix of conditional observation probabilities) are $b_{i,k}=\Pr(O_k|S_i)$, $1\le i\le 8$ and $1\le k\le 16$. We construct four HMMs for the four highlight types. Given an observation sequence $O=O(1)O(2)\ldots O(T)$, where $T$ is the number of shots (states) in the investigated video segment, we employ the Viterbi algorithm to compute $\Pr(O|\Lambda)$. To be more specific, the Viterbi algorithm considers the probability of the partial observation sequence $O(1)O(2)\ldots O(t)$ (until the time $t$), the state at the time $t$, $S(t)=S_i$, and the given model, $\Lambda$, to compute a function $\delta_i(t)$ as
$$\delta_i(t)=\max_{S(1),\ldots,S(t-1)}\Pr\bigl(S(1)\ldots S(t-1),\,S(t)=S_i,\,O(1)O(2)\ldots O(t)\mid\Lambda\bigr).$$
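We can then solve $\delta_i(t)$ inductively as follows:
$$\delta_i(1)=\pi_i\,\Pr\bigl(S(1)=S_i\mid O(1)\bigr),\quad 1\le i\le 8, \tag{10}$$
$$\delta_j(t)=\max_{1\le i\le 8}\bigl[\delta_i(t-1)\,a_{i,j}\bigr]\,\Pr\bigl(S(t)=S_j\mid O(t)\bigr),\quad 1\le j\le 8,\ 2\le t\le T, \tag{12}$$
$$\Pr(O|\Lambda)=\max_{1\le i\le 8}\delta_i(T). \tag{14}$$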
Equation 10 initializes the function $\delta$ as the joint probability of state $S_i$ and the initial observation $O(1)$. The induction step is illustrated in Figure 10, which shows the most probable path to state $S_j$ at the time $t$ from the 8 possible states, $S_i$, $1\le i\le 8$, at the time $t-1$. Since $\delta_i(t-1)$ is the probability of the joint event that $O(1)O(2)\ldots O(t-1)$ are observed and the state at the time $t-1$ is $S_i$, $\delta_i(t-1)\times a_{i,j}$ is the probability of the joint event that $O(1)O(2)\ldots O(t-1)$ are observed and state $S_j$ is reached at $t$. Finding the maximal product over all the possible states $S_i$, $1\le i\le 8$, at $t-1$ results in the probability of $S_j$ at the time $t$ with all the previous partial observations. $\delta_j(t)$ is then obtained by examining the observation $O(t)$ in state $S_j$, i.e., by multiplying the maximal quantity by the probability $\Pr(S(t)=S_j|O(t))$. The computation of Equation 12 is performed for all the states $j$, $1\le j\le 8$, and is iterated for $t=2,3,\ldots,T$. Finally, Equation 14 shows that $\Pr(O|\Lambda)$ is the maximum of the terminal probabilities, $\delta_i(T)$. It is then straightforward to determine which of the four models $\Lambda$ best describes the observation sequence: the Viterbi algorithm is evaluated for each HMM and the one achieving the highest probability is selected.
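As a rough illustration of the training and scoring steps described above, the following Python sketch estimates the model terms by counting over manually labeled shot sequences and then scores an observation sequence with the recursion as described in the text, using the posterior $\Pr(S_j|O(t))$ at each step. The function names, the bit layout of the observation symbol, and the small smoothing constant `eps` are assumptions of this sketch and are not taken from the paper.

```python
N_STATES, N_OBS = 8, 16  # eight shot types, sixteen observation symbols


def observation_index(is_long, is_fast, has_dominant, has_zoom):
    """Pack the four binary shot attributes into one symbol in 0..15
    (the bit order is an arbitrary choice of this sketch)."""
    return (int(is_long) << 3) | (int(is_fast) << 2) | (int(has_dominant) << 1) | int(has_zoom)


def normalise(row):
    total = sum(row)
    return [x / total for x in row]


def train(sequences, eps=1e-6):
    """sequences: labeled segments, each a list of (state, observation) pairs.
    Returns (pi, A, post) with post[i][k] = Pr(S_i | O_k) via Bayes' rule.
    eps is an assumed smoothing term that avoids zero probabilities."""
    pi = [eps] * N_STATES
    prior = [eps] * N_STATES
    A = [[eps] * N_STATES for _ in range(N_STATES)]
    B = [[eps] * N_OBS for _ in range(N_STATES)]
    for seq in sequences:
        pi[seq[0][0]] += 1
        for state, obs in seq:
            prior[state] += 1
            B[state][obs] += 1
        for (s_from, _), (s_to, _) in zip(seq, seq[1:]):
            A[s_from][s_to] += 1
    pi, prior = normalise(pi), normalise(prior)
    A = [normalise(row) for row in A]
    B = [normalise(row) for row in B]
    post = [[0.0] * N_OBS for _ in range(N_STATES)]
    for k in range(N_OBS):
        evidence = sum(B[j][k] * prior[j] for j in range(N_STATES))
        for i in range(N_STATES):
            post[i][k] = B[i][k] * prior[i] / evidence
    return pi, A, post


def viterbi_score(obs, pi, A, post):
    """Probability of the best state path for the observation sequence."""
    delta = [pi[i] * post[i][obs[0]] for i in range(N_STATES)]
    for o in obs[1:]:
        delta = [max(delta[i] * A[i][j] for i in range(N_STATES)) * post[j][o]
                 for j in range(N_STATES)]
    return max(delta)


def classify(obs, models):
    """models maps an event name to its (pi, A, post) tuple."""
    return max(models, key=lambda name: viterbi_score(obs, *models[name]))
```

Because the segments of interest contain only a handful of shots, plain products are used here; for longer sequences, summing logarithms instead would avoid numerical underflow.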
We collect ten baseball games recorded from the TV broadcasts of the Chinese Professional Baseball League (CPBL) and Major League Baseball (MLB). The test videos have varying effects such as fading in/out, moving logos, deforming objects, and full-frame transitions. We use these various forms of effects from different sources to verify the generality of the proposed method. The videos are compressed into MPEG-2 video streams with a resolution of either 352×240 (videos 1 to 5) or 720×480 (videos 6 to 10). The frame rate is set to 30 fps (frames per second). In each video, we use the first inning of the game to train the templates. Commercials are removed from the training segment to avoid building the templates from repeatedly displayed advertisements. It should be noted that this issue may be settled by applying automatic detection of commercials beforehand. Then, we test our scheme on the first 60 min of each game, from which commercials are also removed to facilitate the analysis of data.
We first show the performance of our compressed-domain scene-change detection, which is important to the accuracy of template and pitching view extraction. To limit the effort of verifying scene changes by eye, we use the first innings of the videos for testing, and the results are shown in Table 1. The precision rate is defined as the number of correct detections divided by the sum of correct and false detections. The recall rate is defined as the number of correct detections divided by the sum of correct detections and misses. We find that the recall rate of each video is higher than its precision rate. The high recall rates indicate that missed scene changes are rare in this scheme. Although we may detect some wrong scene changes, this does not affect our scheme much, since the features of the additional shots will be further analyzed.
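The two rates defined above can be computed as in the short Python sketch below. The function name and the zero-denominator convention (returning 0.0) are assumptions, and the counts in the usage note are made-up values rather than entries from Table 1.

```python
def precision_recall(correct, false_detections, misses):
    """Precision = correct / (correct + false detections);
    recall = correct / (correct + misses)."""
    detected = correct + false_detections
    actual = correct + misses
    precision = correct / detected if detected else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall
```

For instance, with hypothetical counts of 90 correct detections, 10 false detections, and no misses, the call `precision_recall(90, 10, 0)` gives a precision of 0.9 and a recall of 1.0, the pattern of recall exceeding precision noted in the text.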
Transition effect detection
Table 2 summarizes the transition effect detection process. The second column lists the numbers of processing units formed in the template training process. As mentioned before, we proceed to construct the template as soon as enough processing units are collected to form a group so that the training time can be reduced. The processing time in the training phase is shown in the third column. The fourth column lists the numbers of candidates considered for the transition effect detection, and the fifth column shows the time of matching or logo detection in each one-hour test video. The tests are performed on a computer with an Intel Core-2 Quad 2.4 GHz CPU and 2 GB RAM (Intel, Santa Clara, CA, USA). Although it is not easy to compare the efficiency of our scheme with that of existing ones, since execution times were seldom reported, we consider our scheme efficient, as both the training and detection processes can be finished in a reasonably short period of time. The extracted template frames, along with the corresponding video frames, are shown in Figure 11.
The detection results of transition effects are shown in Table 3. The second column shows the number of transition effects that appear in the test data, as determined by the human eye. The third and fourth columns give the numbers of correct and false detections of transition effects, respectively. The average precision rate is as high as 98%, since the template is accurately determined, and the recall rate is 95%. Given that there are so many kinds of materials in baseball videos, the performance is good enough to fulfill the requirements of our targeted application. The misses occur because the associated processing units are not included in the subsequent examination when their scene changes are not detected. More flexible thresholds may reduce the number of misses at the expense of spending more time investigating the video data.
The false detections are usually transition effects that do not relate directly to slow-motion replays but to certain statistical information about the ball game. These effects may have a similar appearance to the targeted ones, so removing them requires further content analysis. Furthermore, the use of semi-transparent logos in sportscasts these days may make the constructed template less reliable, so the resulting errors may increase slightly.
Pitching view detection
The experimental results of the pitching view detection are shown in Table 4. We tested the ten baseball games, in which the colors of the players’ jerseys, the positions of the players, and the textures of the fields are different. The second column in Table 4 lists the numbers of traceable pitching views in the test data, which are extracted according to the detected transition effects. The third and fourth columns show the numbers of correct and false detections of pitching views, respectively. The misses happen when the targeted pitching view frames are not detected in the reverse search from the transition effects. The false detections indicate that certain scenes are wrongly identified as pitching view frames, so the reverse search stops before reaching the targeted ones. We can see that the precision and recall rates are both high since the trained pitching view template effectively represents such scenes in the video. It is worth noting that the detection of pitching views can also be done efficiently. The execution time is listed in the last column of Table 4 for reference and is around 36 s on average.
The results of the highlight classification are shown in Table 5. The average precision and recall rates are 83% and 85%, respectively, which demonstrates that the HMM-based method can achieve reasonably good results. About 90% of the non-highlight events are correctly determined by checking the number of scenes in the replay segment. A brief comparison is shown in Table 6. Our results are not markedly superior to those of existing HMM-based schemes [8, 52], but most of the other methods exploit pixel-domain information or such high-level features as extracted objects/faces, so their computational complexity is higher. We believe that a more delicate training process in our scheme should help to improve the performance. In our opinion, HMM here serves as one potential approach for effective highlight classification. The major contribution of this research is to extract the more meaningful video segments for analysis so that a practical implementation of highlight extraction is possible. More advanced methods for content classification can readily be coupled with our scheme based on the transition effect detection.
Some comments about our experiments are as follows. The detection of transition effects provides the video segments of interest, which have more unified structures, so we can use low-level or MPEG-domain features for effective content classification. Our research objective is to design a practical highlight extraction scheme for digital video recorders, so we prefer to adopt the compressed-domain approach and employ the transition effect detection to exclude less likely data from processing. If the restrictions on complexity/cost are relaxed a bit, we may choose to decode some frames and make use of high-level features to improve the performance of our content classification.
The other concern is the several empirically set thresholds, which may be affected by such factors as the bit rates and resolutions of videos. The problem may become less serious if the manufacturer can test many videos, probably with different levels of compression, recorded by the specific video recorder to decide suitable thresholds. In addition, since the same transition effect will appear repeatedly in the recorded video, the methodology of majority voting is quite effective. We may also adjust the thresholds during the training process to ensure that a template can be successfully made. Nevertheless, commercials have to be excluded from the training process because the same commercial may also appear several times. These commercials may not cause problems in the detection phase though.
Finally, there exists a trade-off between execution speed and accuracy. To avoid missing effects, we may select more candidate frames for testing at the cost of more computation. The same issue exists in the template construction: the more processing units are considered when constructing the template, the better the quality of the template frame will be, and the longer the execution time.
The major drawback of this work is that our scheme only works on the sports videos with transition effects, although we think that the usage of transition effects is a trend in sportscast nowadays.
We propose to make use of the transition effects inserted by broadcasters for sports video highlight extraction. The MPEG compressed-domain features, including motion vectors, coding modes, and color information, are used to differentiate the shots containing the transition effects from others. The template of transition effects in the investigated video is obtained after training and can be used to detect the effects in the entire game. After the transition effects are identified, the positions of slow-motion replays can be located and the suitable starting positions of possible video highlights before the replay are detected by our pitching view model. The video segments of interest can be further analyzed by the trained HMMs to determine which type of highlight each segment belongs to. Experimental results demonstrate that this research direction is promising. We believe that the proposed scheme can be coupled with many existing content analysis algorithms for sports videos to either speed them up or improve their performance. The feasibility of the research is illustrated by using baseball videos, and the idea should be applicable to other sports. Since the proposed scheme only utilizes features extracted/calculated from the MPEG bit stream, we believe that a cost-effective implementation in consumers’ digital video recorders is achievable.
Hanjalic A: Adaptive extraction of highlights from a sport video based on excitement modeling. IEEE Trans. Multimedia 2005, 7(6):1114-1122.
Tjondronegoro D, Chen YP, Pham B: Sports video summarization using highlights and play-breaks. In The 5th International ACM Multimedia Information Retrieval Workshop. ACM, New York; 7 November 2003.
Kokaram A, Rea N, Dahyot R, Tekalp M, Bouthemy P, Gros P, Sezan I: Browsing sports video: trends in sports-related indexing and retrieval work. IEEE Signal Process. Mag 2006, 23(2):47-58.
Assfalg J, Bertini M, Colombo C, del Bimbo A, Nunziati W: Semantic annotation of soccer videos: automatic highlights identification. Comput. Vis. Image Unders 2003, 92(2-3):285-305. 10.1016/j.cviu.2003.06.004
Tjondronegoro D, Chen YP, Pham B: Integrating highlights for more complete sports video summarization. IEEE Multimedia 2004, 11(4):22-37. 10.1109/MMUL.2004.28
Petkovic M, Mihajlovic V, Jonker W, Djordjevic-Kajan S: Multi-modal extraction of highlights from TV Formula One programs. In IEEE International Conference on Multimedia and Expo. Lausanne; 26–29 Aug 2002:817-820.
Assfalg J, Bertini M, Nunziati W, Pala P, del Bimbo A: Soccer highlights detection and recognition using HMMs. In IEEE International Conference on Multimedia and Expo. Lausanne; 26–29 August 2002:825-828.
Chang P, Han M, Gong Y: Extract highlights from baseball game video with Hidden Markov models. In IEEE International Conference on Image Processing. Rochester; 22–25 September 2002:609-612.
Cheng CC, Hsu CT: Fusion of audio and motion information on HMM-based highlight extraction for baseball games. IEEE Trans Multimedia 2006, 8(3):585-599.
Chen HT, Chou CL, Tsai WC, Lee SY, Lin BSP: HMM-based ball hitting event exploration system for broadcast baseball video. J Vis. Commun. Image Representation 2012, 23(5):767-781. 10.1016/j.jvcir.2012.03.006
Duan LY, Xu M, Tian Q, Xu C, Jin JS: A unified framework for semantic shot classification in sports video. IEEE Trans. Multimedia 2005, 7(6):1066-1083.
Wang X, Xie S, Chen H: An algorithm of soccer goal extraction by using shot features. In International Conference on Computational Intelligence and Software Engineering. Wuhan; 11–13 December 2009:1-4.
Ren R, Jose JM: Temporal salient graph for sports event detection. In 16th IEEE International Conference on Image Processing. Cairo; 7–10 November 2009:4313-4316.
Shih HC, Huang CL: MSN: statistical understanding of broadcasted baseball video using multi-level semantic network. IEEE Trans. Broadcasting 2005, 51(4):449-459. 10.1109/TBC.2005.854169
Lu L, Jiang H, Zhang H: A robust audio classification and segmentation method. In the ninth ACM international conference on Multimedia. ACM Multimedia, Ottawa; 30 September 2001–5 October 2001:203-211.
Rui Y, Gupta A, Acero A: Automatically extracting highlights for TV baseball programs. In The 8th ACM International Conference on Multimedia. ACM Multimedia, Los Angeles; 30 October 2000–3 November 2000:105-115.
Xiong Z, Radhakrishnan R, Divakaran A, Huang TS: Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework. IEEE International Conference on Acoustics, Speech, and Signal Processing Hong Kong, April 2003, 401-404.
Zhang D, Ellis D: Detecting sound events in basketball video archive. Technical Report, Electrical Engineering Department of Columbia University, 2001
Liu J, Dong Y, Huang J, Zhao X, Wang H: Sports audio classification based on MFCC and GMM. In 2nd IEEE International Conference on Broadband Network and Multimedia Technology. Beijing; October 2009:482-485.
Zhong Y, Zhang H, Jain AK: Automatic caption localization in compressed video. IEEE Trans. Pattern Anal. Mach. Intell 2000, 22(4):385-392. 10.1109/34.845381
Zhang D, Rajendran RK, Chang SF: General and domain-specific techniques for detecting and recognizing superimposed text in video. IEEE International Conference on Image Processing Rochester, 2002, 593-596.
Zhang D, Chang SF: Event detection in baseball video using superimposed caption recognition. In Proceedings of the tenth ACM international conference on Multimedia. ACM Multimedia, Juan-les-Pins; 1–6 December 2002.
Lee GG, Kim HK, Kim WY: Highlight generation for basketball video using probabilistic excitement. IEEE International Conference on Multimedia and Expo New York, 28 June–3 July 2009, 318-321.
Jung C, Kim J: Player information extraction for semantic annotation in golf videos. IEEE Trans. Broadcasting 2009, 55: 79-83.
Boulton JC: Two mechanisms for the detection of slow motion. J. Opt. Soc. Am. A: Optics, Image Science, and Vision 1987, 4(8):1634-1642. 10.1364/JOSAA.4.001634
Wang L, Liu X, Lin S, Xu GY, Shum HY: Generic slow-motion replay detection in sports video. IEEE International Conference on Image Processing Singapore, 24, 1585-1588.
Ruan X, Li S, Dong Y, Feng J: Study on highlights detection in soccer video based on the location of slow motion replay and goal net recognition. Chinese Conference on Pattern Recognition Beijing, 22–24 October 2008, 1-6.
Pan H, Beek PV, Sezan MI: Detection of slow-motion replay segments in sports video for highlights generation. IEEE International Conference on Acoustics, Speech and Signal Processing Salt Lake City, 7–11 May 2001, 1649-1652.
Kobla V, Dementhon D, Doermann D: Detection of slow-motion replay sequences for identifying sports videos. IEEE Workshop on Multimedia Signal Processing Copenhagen, 13–15 September 1999, 135-140.
Wang J, Chng E, Xu C: Soccer replay detection using scene transition structure analysis. IEEE International Conference on Acoustics, Speech, and Signal Processing 18–23 March 2005, 433-436.
Farn EJ, Chen LH, Liou JH: A new slow-motion replay extractor for soccer game videos. Int. J. Pattern Recognit. Artif. Intell 2003, 17: 1467-1481. 10.1142/S0218001403002964
Lienhart R, Zaccarin A: A system for reliable dissolve detection in videos. IEEE International Conference on Image Processing Thessaloniki, 7–10 October, 2001, 406-409.
Giusto DD, Murroni M, Soro G: A new approach to slow motion effect for digital TV broadcasting services. IEEE Trans. Broadcasting 2007, 53(3):703-710.
Snoek C, Worring M: Multimodal video indexing: a review of the state-of-the-art. Multimedia Tools and Appl 2005, 25: 5-35.
Song Y, Wang W: Unified sports video highlight detection based on multi-feature fusion. Third International Conference on Multimedia and Ubiquitous Engineering Qingdao, 4–6 June 2009, 83-87.
Kim HG, Jeong J, Kim JH, Kim JY: Real-time highlight detection in baseball video for TVs with time-shift function. IEEE Trans Consum. Electron 2008, 54(2):831-838.
Chan LC, Chen YS, Liou RW, Kuo CH, Yeh CH, Liu BD: A real time and low cost hardware architecture for video abstraction system. IEEE International Symposium on Circuits and Systems Los Angeles, 27–30 May 2007, 773-776.
Shen J, Tao D, Li X: Modality mixture projections for semantic video event detection. IEEE Trans. Circuits Syst. Video Technol 2008, 18(11):1587-1596.
Xu D, Chang SF: Video event recognition using kernel methods with multilevel temporal alignment. IEEE Trans. Pattern Anal. Mach. Intell 2008, 30(11):1985-1997.
Zhou X, Zhuang X, Yan S, Chang SF, Hasegawa-Johnson M, Huang TS: SIFT-bag kernel for video event analysis. In Proceedings of the 16th ACM international conference on Multimedia. ACM Multimedia, Vancouver, British Columbia; 2008:229-238.
Bertini M, Cucchiara R, Bimbo AD, Prati A: Semantic adaptation of sport videos with user-centred performance analysis. IEEE Trans. Multimedia 2006, 8(3):433-443.
Shih HC, Hwang JN, Huang CL: Content-based attention ranking using visual and contextual attention model for baseball videos. IEEE Trans. Multimedia 2009, 11(2):244-255.
Zhu G, Huang Q, Xu C, Xing L, Gao W, Yao H: Human behavior analysis for highlight ranking in broadcast racket sports video. IEEE Trans. Multimedia 2007, 9(6):1167-1182.
Zhu G, Xu C, Huang Q, Rui Y, Jiang S, Gao W, Yao H: Event tactic analysis based on broadcast sports video. IEEE Trans. Multimedia 2009, 11: 49-67.
Niu Z, Gao X, Tian Q: Tactic analysis based on real-world ball trajectory in soccer video. Pattern Recognit 2012, 45(5):1937-1947. 10.1016/j.patcog.2011.10.023
Papadopoulos GT, Briassouli A, Mezaris V, Kompatsiaris I, Strintzis MG: Statistical motion information extraction and representation for semantic video analysis. IEEE Trans. Circuits and Syst. for Video Technol 2009, 19(10):1513-1528.
Kijak E, Oisel L, Gros P: Hierarchical structure analysis of sport videos using HMMs. IEEE International Conference on Image Processing Barcelona, 14–17 September 2003, 1025-1028.
Namuduri K: Automatic extraction of highlights from a cricket video using MPEG-7 descriptors. In First International Communication Systems and Networks and Workshops. Bangalore; 5–10 January 2009:1-3.
Bach NH, Shinoda K, Furui S: Robust highlight extraction using multi-stream hidden Markov models for baseball video. 2005 International Conference on Image Processing Genoa, 11–14 September 2005, 173-176.
Wang J, Xu C, Chng E, Tian Q: Sports highlight detection from keyword sequences using HMM. 2004 IEEE International Conference on Multimedia and Expo Taipei, 30 June 2004, 599-602.
Delakis M, Gravier G, Gros P: Score oriented Viterbi search in sport video structuring using HMM and segment models. 2006 IEEE 8th Workshop on Multimedia Signal Processing Cairns, 3, 484-487.
Chen HT, Chou CL, Tsai WC, Lee SY, Lin BSP: HMM-based ball hitting event exploration system for broadcast baseball video. J Vis. Commun. Image Representation 2012, 23: 767-781. 10.1016/j.jvcir.2012.03.006
Ouazzani RE, Thami ROH: Highlights recognition and learning in soccer video by using Hidden Markov Models and the Bayesian theorem. International Conference on Multimedia Computing and Systems Ouarzazate, 2–4 April 2009, 304-308.
Ding Y, Fan G: Sports video mining via multichannel segmental Hidden Markov Models. IEEE Trans. Multimedia 2009, 11(7):1301-1309.
Kwatra V, Sargin ME, Gargi U, Tang H: Detecting highlights in sports videos: cricket as a test case. IEEE International Conference on Multimedia and Expo Palo Alto, California, 11–15 July 2011.
Pan H, Li B, Sezan MI: Automatic detection of replay segments in broadcast sports programs by detection of logos in scene transitions. IEEE International Conference on Acoustics, Speech, and Signal Processing Orlando, Florida, 13–17 May 2002, 3385-3388.
Tong X, Lu H, Liu Q, Jin H: Replay detection in broadcasting sports video. In Third International Conference on Image and Graphics (ICIG). Hong Kong; 18–20 December 2004:337-340.
Su PC, Wang YW, Chen CC: Transition logo detection for sports videos highlight extraction. In SPIE Optics East. Boston, Massachusetts; 1–5 October 2006:63910S1-63910S9.
Zhao Z, Jiang S, Huang Q, Zhu G: Highlight summarization in sports video based on replay detection. In Proceedings in the IEEE International Conference on Multimedia and Expo. Toronto, Ontario; 9–12 July 2006:1613-1616.
Dang Z, Du J, Huang Q, Jiang S: Replay detection based on semi-automatic logo template sequence extraction in sports video. Fourth International Conference on Image and Graphics Chengdu, 22–24 August 2007, 839-844.
Li W, Chen S, Wang H: A rule-based sports video event detection method. In International Conference on Computational Intelligence and Software Engineering. Wuhan; 11–13 December 2009:1-4.
Xu W, Yi Y: A robust replay detection algorithm for soccer video. IEEE Signal Process. Lett 2011, 18(9):509-512.
Zhao F, Long Y, Wei Z, Wang H: Matching logos for slow motion replay detection in broadcast sports video. In IEEE International Conference on Acoustics, Speech and Signal Processing. Kyoto; 25–30 March 2012:1409-1412.
Roberts RA, Mullis CT: Digital Signal Processing. Addison Wesley, Reading, MA; 1987.
Satterwhite B, Marques O: Automatic detection of TV commercials. IEEE Potentials 2004, 23(2):9-12.
This research is supported by the National Science Council in Taiwan, under grants NSC97-2221-E-008-072 and NSC101-2221-E-008-121.
The authors declare that they have no competing interests.
Cite this article
Su, PC., Lan, CH., Wu, CS. et al. Transition effect detection for extracting highlights in baseball videos. J Image Video Proc 2013, 27 (2013). https://doi.org/10.1186/1687-5281-2013-27
- Hidden Markov Model
- Transition Effect
- Video Segment
- Ball Game
- Scene Change