# Adaptive Edge-Oriented Shot Boundary Detection

Don Adjeroh^{1} (email author), M. C. Lee^{2}, N. Banda^{1}, and Uma Kandaswamy^{1}

**2009**:859371

**DOI:** 10.1155/2009/859371

© Don Adjeroh et al. 2009

**Received:** 20 August 2008

**Accepted:** 18 May 2009

**Published:** 28 June 2009

## Abstract

We study the problem of video shot boundary detection using an adaptive edge-oriented framework. Our approach is distinct in its use of multiple multilevel features in the required processing. Adaptation is provided by a careful analysis of these multilevel features, based on shot variability. We consider three levels of adaptation: at the feature extraction stage using locally-adaptive edge maps, at the video sequence level, and at the individual shot level. We show how to provide adaptive parameters for the multilevel edge-based approach, and how to determine adaptive thresholds for the shot boundaries based on the characteristics of the particular shot being indexed. The result is a fast adaptive scheme that provides slightly better robustness and a fivefold efficiency improvement in shot characterization and classification. The reported work has applications beyond direct video indexing, and could be used in real-time applications, such as dynamic monitoring and modeling of video data traffic in multimedia communications, and real-time video surveillance. Experimental results are included.

## 1. Introduction

Video shot boundary detection (also called video partitioning or video segmentation) is a fundamental step in video indexing and retrieval, and in general video data management. The general objectives are to segment a given video sequence into its constituent shots, and to identify and classify the different shot transitions in the sequence. Different algorithms have been proposed, for instance, based on simple color histograms [1, 2], pixel color differences [3], color ratio histograms [4], edges [5], and motion [6–8]. In this work, we study the problem of video partitioning using an edge-based approach. Unlike ordinary colors, edges are largely invariant under local illumination changes and are much less affected by the possible motion in the video. To ensure robustness, we use both edge-based and color-based features under a multilevel decomposition framework. With the multiple decompositions, we can avoid the time-consuming problem of motion estimation by a careful choice of the decomposition level to operate at. Improvements in video partitioning have been recorded by performing a dynamic classification of the shots as the video is being analyzed, and then adaptively choosing the shot partitioning parameters based on the predicted class of the shot [9]. Automatic shot classification can also serve as an important step in approaching the elusive problem of capturing semantics or meaning in the video sequence (see, e.g., [14]).

We note that the problem of video shot partitioning (or segmentation) is not only relevant to video indexing and video data management. (See [11–14] for discussion on video query, browsing, and video object management). It is also an important issue in other areas of video communication, such as video compression and video traffic modeling [15]. In particular, for problems such as video traffic characterization and modeling, shot-level adaptation becomes mandatory, if the network is to dynamically allocate limited network resources in response to changing video data traffic.

In this work, we introduce adaptation at different stages in the video analysis process—both at the feature extraction stage and at the later stage of frame difference comparison. We propose a new method for fast shot characterization and classification required for such adaptation, using a new set of edge-based features. We introduce a method for automated threshold selection for adaptive scene partitioning schemes. In the next section, we describe recent reported work that is closely related to our approach. Section 3 presents the multilevel edge-response vectors, the basic features we propose for video partitioning. Shot characterization and adaptation in video partitioning in the context of the edge-based features is described in Section 4. Section 5 presents results on real video sequences. We conclude the paper in Section 6.

## 2. Related Work

The first step in content-based video data management is shot boundary detection. Simply put, it is the process of partitioning a given video sequence into its constituent shots. The purpose is to determine the beginning (and/or end) of different types of transitions that may occur in the sequence. The problem of video partitioning is compounded by the various changes that might occur in the video, (say due to illumination, motion and/or occlusion), and by the different types of shot transitions (such as fades and dissolves). The inherent variability in video shot characteristics, even for shots from the same sequence introduces further complication. The partitioning algorithm depends on the specific features used, and the similarity evaluation functions adopted. Earlier methods for video shot partitioning are described in [2–4, 16–18]. See [19–21] for a survey.

Most approaches to video partitioning make use of the color (or gray level) information in the video. The limitations of color in video partitioning are the problems of illumination variation and motion-induced false alarms. Edge-based methods have thus been proposed to reduce the sensitivity to illumination and motion. Zabih et al. [5] made explicit use of edges in video indexing, and showed how the *exiting* and *entering* edges can be used to classify different types of shot breaks. Related methods that exploit edge information for shot detection directly in the compressed domain were proposed in [4, 18, 22, 23]. In [4], color ratio features were proposed as an alternative to color histograms, and were used to identify different types of shot changes without decompressing the video. The motivation was that color ratios capture the color boundaries or color edges in the frames. In [18, 23], methods were proposed to extract edges directly from the DCT coefficients, which can then be used for video partitioning. In [22], Abdel-Mottaleb and Krishnamachari described the edge-based information used as part of the descriptions in MPEG-7. Edge descriptors were given as 4-bin histograms, where each bin is for one of the four directions: vertical, horizontal, left-diagonal, and right-diagonal. Other related compressed domain methods are reported in [9, 13, 16].

More recent approaches to the video partitioning problem have been proposed in [9, 24–27]. Li and Lai [28] described methods for video partitioning using motion estimation, where the motion vectors are extracted using optical flow computations. To account for potential changes in the lighting conditions, the optical flow computations included a parameter to model the local illumination changes during motion estimation. Cooper et al. [25, 29] proposed partitioning techniques that exploit possible self-similarity in the video, by classifying temporal patterns in the video sequence using kernel-based correlation. Li and Lee [26] studied video partitioning, with special emphasis on gradual transitions. Yoo et al. [27] studied both gradual and abrupt shot transitions, and proposed methods based on localized edge blocks. For abrupt shot boundaries, they proposed a correlation-based method, based on which localized edge gradients are then used for detecting gradual shot transitions.

The need for adaptation in the video indexing process was first identified in [30] (see also [9]), where the authors showed that video shots vary considerably from one shot to the other, even for shots that come from the same video sequence. They thus suggested that the results of an indexing scheme could be improved by treating different shots differently, for instance, by use of a different set of analysis parameters. Since then, the problem has received increasing attention. In [31], detailed experiments were carried out using television news video. It was concluded that the selection of similarity thresholds was a major problem, and hence there is a need for adaptive thresholds to capture the different characteristics of broadcast news video. Vasconcelos and Lippman [10, 32] considered the duration of video shots, and showed that the shot duration can be used to predict the position of a new shot partition, and that the shot duration depends critically on the video content. They used a statistical model of the shot duration to propose shot break thresholds. By classifying video shots in terms of the shot complexity and shot duration, and then performing indexing adaptively based on the video shot classes, it was shown in [9, 30] that adaptation could indeed be used to improve both the precision and recall simultaneously, without introducing an intolerable amount of extra computation. Dawood and Ghanbari [15] used a similar classification to model MPEG video traffic. The problem of video indexing and retrieval is very closely related to that of image indexing. Surveys on video (and/or image) indexing and retrieval can be found in [19, 21, 33, 34]. Video partitioning or segmentation has been reviewed in [20].

In this paper, we study the use of both color and edges in adaptive video partitioning. Our approach is distinct in its use of multilevel edge-based features in video partitioning, and in the provision of adaptation by a careful analysis of these multilevel features, based on the notion of shot variability. Adaptation is provided at three levels—at the feature extraction stage for the locally-adaptive edge maps, at the video sequence level, and at the individual shot level.

## 3. Multilevel Edge-Response Vectors

In our approach, we place emphasis on the structural information in the video, as it is generally invariant under various changes in the video, such as illumination changes, translation, and partial occlusion. Thus, in addition to the intensity values, we also make use of the edges in computing the features to be used. In particular, we use multiscale edges, since these can more easily capture localized structures in the video frames.

### 3.1. Multilevel Image Decomposition

Let
be an
image, with
;
. Given
, we decompose it into different blocks. For each block, we consider its content at different scales, and compute edge-based features at each of these scales. We then use the features to compare two adjacent frames in the video sequence. For simplicity in the discussion, we assume images are square, that is,
. We also assume
, for some integer *p*. The ideas can easily be extended for the general rectangular image.

Let *b* be the number of blocks at a given decomposition level. We choose *k*, the level of decomposition, such that , . Let *s* be the scale, . Then, given the original image, , we can select relevant areas of the image at different scales, . Let be the sub image part selected at scale , where . At the lowest scale ( ), we will have the entire image, viz:

*s* will therefore be , where . For a given decomposition level, we consider each of the -sized blocks and compute the required image features. If we fix the number of scales to 1 (i.e., ) at each level , (i.e., at each level, we select all the image positions within the block to compute the feature), then the multiscale scheme defaults to a simple multilevel representation of the image. Thus, using with *L* maximum number of levels (i.e., ), we will have an *N*-dimensional feature vector, where
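The single-scale multilevel representation can be illustrated with a small sketch. It assumes a quadtree-style split in which level *k* divides a square frame into 2^k × 2^k equal blocks; the text above does not preserve the exact rule, so this is an assumption. The function names `blocks_at_level` and `multilevel_means`, and the use of the block mean as the per-block feature, are hypothetical choices for illustration only.

```python
def blocks_at_level(image, k):
    """Split a square image (list of rows) into 2**k x 2**k equal blocks.

    Quadtree-style split: an assumption made for this sketch.
    Blocks are returned in row-major order.
    """
    n = len(image)
    per_side = 2 ** k
    if n % per_side != 0:
        raise ValueError("side length must be divisible by 2**k")
    size = n // per_side
    blocks = []
    for by in range(per_side):
        for bx in range(per_side):
            block = [row[bx * size:(bx + 1) * size]
                     for row in image[by * size:(by + 1) * size]]
            blocks.append(block)
    return blocks


def multilevel_means(image, num_levels):
    """Concatenate per-block mean intensities over levels 0..num_levels-1."""
    features = []
    for k in range(num_levels):
        for block in blocks_at_level(image, k):
            pixels = [p for row in block for p in row]
            features.append(sum(pixels) / len(pixels))
    return features
```

Under this assumption, three levels yield 1 + 4 + 16 = 21 values per feature, so the length of the feature vector grows geometrically with the number of levels.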

### 3.2. Edge-Oriented Features

These will be calculated once for each frame, but will be used at different levels of decomposition.

#### 3.2.1. Locally Adaptive-Edge Map

The major motivation for a multilevel approach is that certain variations in an image, such as those due to edges are local in nature, and hence will be better captured by use of local (rather than global) information. For video in particular, this becomes very important. Although some variations (such as panning, tilting, and illumination) in the video could be global with respect to a particular frame, object motion and some other camera operations (such as zooming) are more easily modeled as a local phenomenon. (Note, although zooming could also be global over the video frame, the direction of the motion vectors will vary from one area of the image to the other). We capture global information by using information from the lower levels of decomposition (smaller values of *k*). With higher levels, we can obtain information about more localized structures in the frame. Such localized structures could be treated differently for improved performance.

*r*, at the *k*th level ( ), we define the edge map as follows:

*r*th block at level *k*, and is a local threshold. We can choose the threshold simply as

where is the size of the *r*th block at level *k*, and is a constant. While the above approach to local thresholds is simple, it considers each block independently of the other blocks in the frame. It might be advantageous to consider the local threshold with respect to the global image variations [35]. At a given *k*, we can write since the block size is the same for every block *r*.

where is a constant (which can be determined empirically).

where , are, respectively, the edge response mean and standard deviation for block , at level .
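A minimal sketch of a locally adaptive edge map follows. The threshold equations are not preserved in the text above, so the sketch assumes one plausible form: within each block, a pixel is marked as an edge when its edge response exceeds the block mean plus a constant `c` times the block standard deviation. The finite-difference edge operator, the function names, and the default value of `c` are likewise assumptions for illustration.

```python
import math


def edge_response(image):
    """Gradient magnitude from simple finite differences (borders clamped)."""
    h, w = len(image), len(image[0])
    resp = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            resp[y][x] = math.hypot(gx, gy)
    return resp


def local_edge_map(resp, block_size, c=1.0):
    """Binary edge map: 1 where the response exceeds mean + c*std of its block.

    The mean + c*std rule is an assumed threshold form, not the paper's.
    """
    h, w = len(resp), len(resp[0])
    edges = [[0] * w for _ in range(h)]
    for y0 in range(0, h, block_size):
        for x0 in range(0, w, block_size):
            ys = range(y0, min(y0 + block_size, h))
            xs = range(x0, min(x0 + block_size, w))
            vals = [resp[y][x] for y in ys for x in xs]
            mean = sum(vals) / len(vals)
            std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            thr = mean + c * std
            for y in ys:
                for x in xs:
                    edges[y][x] = 1 if resp[y][x] > thr else 0
    return edges
```

Because the threshold is recomputed per block, a weak edge in a flat region can still be detected even when another part of the frame contains much stronger gradients.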

#### 3.2.2. Edge-Based Features

At a given level , and for each given block , we compute the following features.

The edge points are the pixel positions that lie on the edges—as determined by the thresholds above. We call the combined features including the color features *multilevel edge-response vectors* (MERVs).

### 3.3. Similarity Evaluation Using MERVs

Having extracted the features, the next question is how to find appropriate metrics to compare two video frames using these features. Given two images and , we can compute the distance between them using the general Minkowski distance, or some other metrics. In the following we use the simple city-block distance.

where again and are weights, with . The normalized distances can then be used with the weights in (17) to obtain the overall distance between the frames.
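The weighted city-block comparison of two frames can be sketched as below. The feature names, the dictionary layout, and the renormalisation of the weights so that they sum to one are assumptions made for illustration. The per-feature normalisation of distances mentioned above is omitted, because its exact form is not preserved in the text.

```python
def city_block(u, v):
    """City-block (L1) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def frame_distance(feats_a, feats_b, weights):
    """Weighted sum of per-feature L1 distances between two frames.

    feats_a, feats_b: dict mapping feature name -> feature vector.
    weights: dict mapping feature name -> non-negative weight.
    Weights are renormalised to sum to 1 (an assumption of this sketch).
    """
    total_w = sum(weights.values())
    if total_w <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(
        (w / total_w) * city_block(feats_a[name], feats_b[name])
        for name, w in weights.items()
    )
```

For example, with hypothetical features `color` and `edge`, setting the weight of `color` to zero makes the comparison depend on the edge features alone.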

Another important issue is the effect of each individual block in the overall difference. Let be the weight of feature *f* from the *r*th block at level *k*. That is, , where denote respective features based on color, edge response, phase angle, edge length, and edge response at edge points. A simple approach is to adopt a method whereby, for a chosen feature *f*, the contribution from every block at each level is given an equal weight. Effectively, , where . is simply the number of blocks at the *k*th level. This makes the features from the lower levels of the decomposition more important. As the number of decomposition levels *L* increases, the lower-level features will dominate in the computation of the overall difference, which will therefore become very sensitive to small spatial differences in the frames, and hence more susceptible to noise and minute motion variations in the video. For shot classification, however, this can be beneficial, since the domination of global movement or features in the video can be avoided.

*k* levels. The blocks that make up the *k*th level will then share the contribution allocated to that level. A simple way to do this is to distribute the contribution equally over all the levels:
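The two weighting schemes discussed here can be compared with a short sketch. It takes the number of blocks at each level as input; the example counts of 1, 4, and 16 blocks assume a quadtree-style decomposition, which is an assumption of this illustration. The function names are hypothetical.

```python
def equal_block_weights(blocks_per_level):
    """Every block at every level gets the same weight (weights sum to 1)."""
    total = sum(blocks_per_level)
    return [[1.0 / total] * b for b in blocks_per_level]


def equal_level_weights(blocks_per_level):
    """Each level gets 1/L of the total; its blocks share that equally."""
    num_levels = len(blocks_per_level)
    return [[1.0 / (num_levels * b)] * b for b in blocks_per_level]


# Share of the overall difference contributed by each level.
blocks = [1, 4, 16]
per_block = [sum(level) for level in equal_block_weights(blocks)]
per_level = [sum(level) for level in equal_level_weights(blocks)]
```

With equal block weights, the level with 16 blocks contributes 16/21 of the total and the whole-frame level only 1/21, which is the domination effect described above. With equal level weights, every level contributes exactly one third.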

## 4. Adaptive Video Partitioning

When the distance
is computed for a series of adjacent video frames, the result will be a sequence of frame differences, *FD-sequence* for short. The actual video partitioning is performed by a further analysis of the FD-sequence. Let
be the difference between two adjacent frames,
and
. The FD-sequence is defined as
, where
is the number of frames in the video. The FD-sequence is usually characterized by significant peaks at frame positions where a shot change has occurred. With the FD-sequence, the video partitioning problem then becomes that of determining appropriate thresholds to isolate these "significant peaks" from other peaks that might occur in the sequence. The shot threshold is defined as
. We declare a shot partition at frame
whenever the distance exceeds the threshold: that is, whenever
.
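As a minimal sketch, the FD-sequence and the fixed-threshold decision rule can be written as follows. The `distance` argument stands for any frame distance function; the scalar "frames" in the usage example are placeholders, not real feature vectors.

```python
def frame_differences(frames, distance):
    """FD-sequence: distance between each pair of adjacent frames."""
    return [distance(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]


def detect_cuts(fd_sequence, threshold):
    """Indices i where the difference between frames i and i+1 exceeds threshold."""
    return [i for i, d in enumerate(fd_sequence) if d > threshold]


# Toy usage: each "frame" is a single number standing in for its features.
frames = [1, 1, 2, 9, 9, 10, 3]
fd = frame_differences(frames, lambda a, b: abs(a - b))
cuts = detect_cuts(fd, threshold=5)
```

Here a single global threshold is used for the whole sequence, which is the non-adaptive baseline that the following subsections refine.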

### 4.1. Adaptation at the Video Sequence Level

The description above assumes that video sequences are homogeneous, and hence can all be considered using the same set of parameters. However, video sequences vary considerably from one sequence to the other. First we consider adapting the video analysis algorithm based on the entire video sequence. That is, for each video sequence, we determine the set of analysis parameters that will produce the best results. This set of parameters is then used to analyze all the frames or shots in the video sequence.

Given the weights on the multilevel features (see (17)), we can parameterize the analysis algorithm in terms of these weights, and the threshold, . For adaptation at the sequence level, rather than considering all the features for the distance calculation, we consider only the features that are relevant to the video being analyzed. Thus, based on the particular video, we can determine the best weight-threshold pair for segmenting the video.

Parameter sets used in video analysis (weights, *w*). ID's are from 1 to 32.

We observed that different videos may require different contributions from each feature (i.e., different weights, ) for best results. Also, at a given , different thresholds could produce different results. (See Table 6, Section 5). Similarly, for a given video sequence, various sets of weights can produce the same (best) results, but at different thresholds. Conceptually, adaptation at the sequence level should be simple. But there are several problems. First, at the sequence level, the video is still being considered at a very coarse granularity. Video shots are known to vary greatly, even for shots in the same video. Hence, different shots in the same video sequence could be very different in content. More importantly, automated mapping of the pair for each given video is a major problem, requiring a two-pass approach. This makes sequence-level adaptation unsuitable for real-time applications, or for network applications, where dynamic modeling of video data traffic is required.

### 4.2. Shot-Level Adaptation

The above problems can be addressed by considering the individual shots that make up the video. In [9], shots were characterized based on the activity and motion in the shots, and the respective shot duration. Using the characterization, video shots were grouped into nine classes, based on which video partitioning was performed by adaptively choosing different thresholds for each shot class. In the current work, we take a different approach for the problems of video characterization and classification.

#### 4.2.1. Estimating Video Shot Complexity

To make the thresholds sensitive to the different shot classes, we need some methods to make such thresholds locally adaptive. The overall video shot complexity depends on the activity and the motion, while the shot class depends on both the complexity and the duration of the shot. The shot duration has a strong correlation with the amount of motion in the video. The length of the shot is typically inversely proportional to the amount of motion in the video [9]. We can determine the temporal duration as we analyze the shot. We could also determine the motion complexity by computing the motion vectors using motion estimation techniques [36]. However, motion estimation is very computationally intensive.

Since we do not need accurate motion estimation to classify the shots or for adaptive indexing, an estimate of the amount of motion in the shot is enough. Thus, we can approximate the amount of motion using the differences between adjacent frames (e.g., by analyzing the FD-sequence), rather than direct computation of the motion vectors. A similar observation has been made by Tao and Orchard [37], who noticed that the residual signal generated after motion-compensated prediction is highly correlated with the gradient magnitude: the motion-compensated error is larger, on average, for pixels with larger gradient magnitude. They thus suggested that the gradient (from one frame to the other) could be estimated from the reconstructed image using the motion estimates. In this work, we are interested in the reverse procedure: given the gradient information (as captured by the edge response vectors), we wish to estimate the amount of motion in the shot, without explicit motion estimation.

We can estimate both the image activity and the motion by using the already available multilevel edge response vectors, with appropriate weights. For example, if we use (e.g., , for 4 decomposition levels), or if we ignore the global averages altogether, (i.e., the contributions from level ), then the lower-level features (which are increasingly localized) can be used to predict the amount of motion. We could also ignore further higher level features, for instance, levels at . We can estimate the activity by using the MERVs from just one frame in a given shot.

*shot variability*. To estimate the shot variability, we use the mean and standard deviation of the frame-difference sequence (the FD-sequence) within the shot. We compute this for each of the MERV features, and use a weighted average to determine the shot variability. Given two time instants, and , ( ), we compute the shot variability as follows. Let be the duration. Let be the frame difference sequence using a particular multilevel feature, say :

The weights here may not necessarily be the same as those used for the distances.

In [4, 9], different methods were proposed for computing the motion and image complexities, for instance, using the spectral entropy, and other metrics. With the above approach, one problem will be computing the standard deviation at each frame as the shot is progressing. This problem can be solved by doing the computations at only defined periodic intervals (the periods could also be chosen adaptively). However, one advantage of using the shot variability defined above is that the parameters required can be computed incrementally, using the preceding values. We can do this from the general definition of mean and standard deviation.
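The incremental computation of the mean and standard deviation can be sketched with Welford's online algorithm. This is a standard technique chosen here for illustration; the text does not say which update formula the authors used. The class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Running mean and standard deviation of a stream of frame differences.

    Uses Welford's online update, so each new value costs O(1) and no
    earlier values need to be stored.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self._m2 / self.n) if self.n > 0 else 0.0
```

In an online setting, one such object could be kept per MERV feature and reset whenever a shot boundary is declared, so that the statistics always describe the current shot.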

Shot classification results based on shot variability.

Class | Antelope | Canyon | Crops | Culture | Journal | LAS | Total |
---|---|---|---|---|---|---|---|
I | 20 | 4 | 6 | 28 | 15 | 34 | 107 |
II | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
III | 9 | 0 | 5 | 4 | 40 | 7 | 65 |
IV | 0 | 2 | 0 | 0 | 1 | 0 | 3 |
V | 2 | 0 | 1 | 0 | 5 | 2 | 10 |
VI | 0 | 4 | 0 | 1 | 2 | 0 | 7 |
VII | 0 | 5 | 0 | 0 | 3 | 0 | 8 |
VIII | 1 | 0 | 0 | 0 | 0 | 1 | 2 |
IX | 0 | 2 | 1 | 0 | 1 | 0 | 4 |
Total | 32 | 18 | 13 | 33 | 67 | 44 | 211 |

#### 4.2.2. Adaptive Shot Thresholds

Having characterized and classified the shots based on the shot variability, the next question is to determine the parameters for video shot partitioning for a given shot. Ideally, given the FD sequence, (and assuming that it was obtained from a distance (and not a similarity) measure), we expect that the threshold for shot changes should decrease with increasing shot length, but increase with increasing shot complexity (or variability). Formally, given a video shot , we classify it into a certain shot class, . The problem of shot-level adaptation then is to determine the parameter set (i.e., the pair) that will produce the best results for all shots, , . Here, best results are defined in terms of information retrieval measures of precision and recall.

We take a pragmatic approach to the problem of determining the parameters. Using a training set of video shots, we use a simple clustering technique to determine the pairs that produce the best results for each shot class in the training set. We then use these pairs for analysis of the test video sequences.

When we have , then any member of can be used as the best parameter set. The major problem is when , that is, the intersection is empty, implying that no single parameter set always produced correct results for all the class shots in the training sequences. Two approaches can be used to address this problem.

*c* shots is determined as
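The selection of a parameter set for one shot class can be sketched as below. For each training shot of the class, we assume we already know which parameter-set IDs produced a correct partition. If the intersection over all shots is non-empty, any member can be used, as stated above. The fallback for an empty intersection (pick the IDs that were correct for the most shots) is an assumption of this sketch; the two approaches the paper proposes for that case are not recoverable from the text. The function name is hypothetical.

```python
def best_parameter_sets(correct_sets):
    """Choose parameter-set IDs for one shot class.

    correct_sets: list of sets; element i holds the IDs of the parameter
    sets that partitioned training shot i correctly.
    Returns (ids, exact), where exact is True when the IDs were correct
    for every training shot in the class.
    """
    if not correct_sets:
        return set(), False
    common = set.intersection(*correct_sets)
    if common:
        return common, True
    # Fallback (assumption, not the paper's rule): majority vote.
    counts = {}
    for ids in correct_sets:
        for pid in ids:
            counts[pid] = counts.get(pid, 0) + 1
    if not counts:
        return set(), False
    top = max(counts.values())
    return {pid for pid, c in counts.items() if c == top}, False
```

Returning the `exact` flag lets the caller distinguish a parameter set that worked on every training shot from one that is only the best available compromise.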

## 5. Results

To test the performance of the proposed edge-based adaptive method, we ran experiments using two sets of video sequences. The first set had 6 sequences taken from standard MPEG-7 sequences, and from available online video sources [31]. For each video sequence, the frame size was fixed at
. The second set had 5 sequences taken from the US National Institute of Standards and Technology (NIST) benchmark TRECVID 2001 test sequences. The frame size for sequences in this set was
. The experiments were carried out in a MATLAB Version 7.3.0.267 (R2006b) environment using a personal computer with Intel(R) CPU T2400, running at 1.83 GHz with 1.99 GB RAM. We measure performance in terms of the information retrieval measures of precision and recall, defined over three sets: the set of all positions of true scene cuts in a test video sequence, the set of all positions of scene cuts returned by the system, and the subset of returned positions that are *true* scene cuts (i.e., correct detections). Precision is the number of correct detections divided by the number of returned positions, and recall is the number of correct detections divided by the number of true scene cuts.
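The two measures can be computed from sets of cut positions as in the sketch below. Matching detected and true positions exactly, with no tolerance window, is an assumption made for simplicity. The function name is hypothetical.

```python
def precision_recall(true_cuts, detected_cuts):
    """Precision and recall for detected scene-cut positions.

    A detection counts as correct only if its frame position is in the
    ground truth (exact match, an assumption of this sketch).
    """
    true_cuts, detected_cuts = set(true_cuts), set(detected_cuts)
    correct = true_cuts & detected_cuts
    precision = len(correct) / len(detected_cuts) if detected_cuts else 0.0
    recall = len(correct) / len(true_cuts) if true_cuts else 0.0
    return precision, recall


# Toy positions: 32 true cuts, 37 detections, 30 of them correct.
true_positions = set(range(32))
detected_positions = set(range(30)) | set(range(100, 107))
pr, rc = precision_recall(true_positions, detected_positions)
```

The toy counts mirror a case with 32 true cuts, 37 returned cuts, and 30 correct detections, which rounds to a precision of 0.81 and a recall of 0.94.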

### 5.1. Effectiveness of MERVs on Non-Adaptive Partitioning

Effectiveness of MERVs for non-adaptive video partitioning.

Video | Shots | Retrieved | Correct | False | Miss | Pr | Rc |
---|---|---|---|---|---|---|---|
Antelope | 32 | 37 | 30 | 7 | 2 | 0.81 | 0.94 |
Canyon | 18 | 18 | 18 | 0 | 0 | 1.00 | 1.00 |
Crops | 13 | 16 | 13 | 3 | 0 | 0.81 | 1.00 |
Culture | 33 | 21 | 20 | 1 | 13 | 0.95 | 0.61 |
Journal | 67 | 70 | 65 | 5 | 2 | 0.93 | 0.97 |
LAS | 44 | 35 | 35 | 0 | 7 | 1.00 | 0.83 |
Average | | | | | | 0.92 | 0.89 |

Results for proposed sequence-level adaptive partitioning.

### 5.2. Adaptive Partitioning

Results for adaptive partitioning using proposed shot variability measure.

Video | Shots | Retrieved | Correct | False | Miss | Pr | Rc |
---|---|---|---|---|---|---|---|
Antelope | 32 | 35 | 31 | 4 | 1 | 0.89 | 0.97 |
Canyon | 18 | 19 | 18 | 1 | 0 | 0.95 | 1.00 |
Crops | 13 | 13 | 12 | 0 | 1 | 1.00 | 0.92 |
Culture | 33 | 27 | 26 | 1 | 7 | 0.96 | 0.78 |
Journal | 67 | 60 | 59 | 1 | 8 | 0.98 | 0.88 |
LAS | 44 | 44 | 44 | 0 | 0 | 1.00 | 1.00 |
Average | | | | | | 0.96 | 0.93 |

### 5.3. Comparative Results

Comparative results with other video partitioning algorithms. The last three schemes are proposed in this work.

Video | Color histograms Pr | Color histograms Rc | Motion-vector likelihoods [28] Pr | Motion-vector likelihoods [28] Rc | Correlation-based [27] Pr | Correlation-based [27] Rc | Kernel-correlation [25] Pr | Kernel-correlation [25] Rc | Kernel-correlation [25] BestKernel | Non-adaptive MERVs Pr | Non-adaptive MERVs Rc | Sequence-level adaptation Pr | Sequence-level adaptation Rc | Shot-level adaptation Pr | Shot-level adaptation Rc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Antelope | 0.91 | 0.97 | 0.50 | 0.71 | 0.91 | 0.94 | 0.68 | 0.84 | | 0.81 | 0.94 | 0.97 | 0.94 | 0.89 | 0.97 |
Canyon | 0.53 | 1.00 | 0.78 | 0.60 | 1.00 | 0.94 | 0.74 | 1.00 | | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 1.00 |
Crops | 0.31 | 1.00 | 0.35 | 0.84 | 1.00 | 1.00 | 0.71 | 1.00 | | 0.81 | 1.00 | 1.00 | 1.00 | 1.00 | 0.92 |
Culture | 0.89 | 0.97 | 0.32 | 0.73 | 0.89 | 1.00 | 0.76 | 0.84 | | 0.95 | 0.61 | 1.00 | 1.00 | 0.96 | 0.78 |
Journal | 0.98 | 0.72 | 0.36 | 0.76 | 0.67 | 0.94 | 0.78 | 0.54 | | 0.93 | 0.97 | 1.00 | 0.97 | 0.98 | 0.88 |
LAS | 0.97 | 0.82 | 1.00 | 0.53 | 0.84 | 0.84 | 0.78 | 0.89 | | 1.00 | 0.83 | 1.00 | 0.99 | 1.00 | 1.00 |
Average | 0.77 | 0.91 | 0.55 | 0.70 | 0.89 | 0.94 | 0.74 | 0.85 | | 0.92 | 0.89 | 1.00 | 0.98 | 0.96 | 0.93 |

Comparative results with other video partitioning algorithms on TRECVID 2001 dataset.

Sequence | Number of shots | Color histograms Pr | Color histograms Rc | Motion-vector likelihoods [28] Pr | Motion-vector likelihoods [28] Rc | Correlation-based [27] Pr | Correlation-based [27] Rc | Kernel-correlation [25] ( kernel) Pr | Kernel-correlation [25] ( kernel) Rc | Kernel-correlation [25] ( kernel) Pr | Kernel-correlation [25] ( kernel) Rc | Proposed method Pr | Proposed method Rc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
anni005 | 38 | 0.64 | 0.83 | 0.46 | 0.53 | 0.87 | 0.89 | 0.71 | 0.64 | 0.72 | 0.64 | 0.87 | 0.91 |
anni006 | 41 | 0.78 | 0.70 | 0.57 | 0.56 | 0.84 | 0.88 | 0.71 | 0.68 | 0.79 | 0.69 | 0.82 | 0.89 |
anni009 | 38 | 0.84 | 0.71 | 0.59 | 0.62 | 0.86 | 0.94 | 0.81 | 0.78 | 0.83 | 0.81 | 0.87 | 0.93 |
BOR08 | 197 | 0.78 | 0.78 | 0.24 | 0.64 | 0.85 | 0.88 | 0.60 | 0.83 | 0.69 | 0.81 | 0.86 | 0.91 |
NAD53 | 83 | 0.69 | 0.62 | 0.46 | 0.73 | 0.79 | 0.94 | 0.69 | 0.84 | 0.75 | 0.80 | 0.81 | 0.97 |
Average | | 0.75 | 0.73 | 0.46 | 0.62 | 0.84 | 0.91 | 0.71 | 0.75 | 0.76 | 0.75 | 0.85 | 0.92 |

## 6. Discussion and Conclusion

Although video partitioning is an actively researched area, recent publications [9, 20, 24–28] show that the problem is far from being completely resolved. The major contributions of this paper are on two aspects of the video partitioning problem. The first is the proposed new set of features, the multilevel edge response vectors (MERVs), for video partitioning. The edge-based nature of the features makes them particularly suitable for handling significant illumination variations in the video, while the multilevel decomposition framework makes it possible to adapt the features to the nature of the video frames being considered. The second contribution is on adaptive video partitioning. While adaptive video partitioning was first described in [9, 30], here we propose a new and more efficient method for performing the scene characterization required for scene partitioning, and a method for automated determination of thresholds based on the video shot classes. The proposed method is online: it performs scene characterization and classification as the frames in a given shot are being observed, rather than waiting until the end of a given shot (as was done in [9]). This feature, coupled with the improved efficiency in shot characterization, makes the approach particularly suitable for fast and online characterization of the video, which is important in both video retrieval and video traffic modeling for adaptive network resource allocation [15]. We mention that, while we have provided adaptation based on the MERV features proposed, the general idea of adaptation in video analysis is independent of the specific features being used. For any given feature, the idea of adaptation can be applied by a careful study of the feature in question, and then adapting the analysis parameters using this feature based on the nature of the shot being analyzed.

In conclusion, we have studied the problem of video segmentation, using an adaptive edge-oriented framework. Adaptation is provided by an analysis of the video shot characteristics using the frame difference sequence. In particular, we defined the shot variability measure, based on which the video shots are characterized and then classified. To provide adaptation in the analysis, we determine the best set of parameters for each given shot class, and then analyze the shots that belong to the given class using only these parameter sets. An algorithm for determining the best parameters for each given shot class is presented. We described adaptation at three levels: at the feature extraction stage for the locally-adaptive edge maps, at the video sequence level, and at the individual shot level.

Experimental results show that the proposed multilevel edge-based features provide a performance of about 90% in terms of average precision and recall. Using the same multilevel edge-based features, the adaptive schemes perform better than the non-adaptive approach, with video sequence level adaptation producing about 99% performance. Further, the use of shot variability as a measure of shot complexity resulted in a slightly superior performance (about 2% improvement in precision) over a previously proposed method of explicit motion estimation and shot activity analysis. In addition, using the shot variability led to a fivefold improvement in efficiency. The reported work has applications beyond video indexing and retrieval. In particular, given the significant reduction in computations, the approach becomes attractive for real-time applications, such as dynamic monitoring, characterization and modeling of video data traffic, and real-time video surveillance.

## References

1. Swain MJ, Ballard DH: **Color indexing.** *International Journal of Computer Vision* 1991, **7**(1):11-32. 10.1007/BF00130487
2. Nagasaka A, Tanaka Y: **Automatic video indexing and full-video search for object appearances.** In *Visual Database Systems II*. Edited by: Knuth E, Wegner LM. Elsevier; 1992:113-127.
3. Zhang H, Kankanhalli A, Smoliar SW: **Automatic partitioning of full-motion video.** *Multimedia Systems* 1993, **1**(1):10-28. 10.1007/BF01210504
4. Adjeroh DA, Lee MC: **Robust and efficient transform domain video sequence analysis: an approach from the generalized color ratio model.** *Journal of Visual Communication and Image Representation* 1997, **8**(2):182-207. 10.1006/jvci.1997.0349
5. Zabih R, Miller J, Mai K: **A feature-based algorithm for detecting and classifying production effects.** *Multimedia Systems* 1999, **7**(2):119-128. 10.1007/s005300050115
6. Courtney JD: **Automatic video indexing via object motion analysis.** *Pattern Recognition* 1997, **30**(4):607-625. 10.1016/S0031-3203(96)00107-0
7. Bouthemy P, Gelgon M, Ganansia F: **A unified approach to shot change detection and camera motion characterization.** *IEEE Transactions on Circuits and Systems for Video Technology* 1999, **9**(7):1030-1044. 10.1109/76.795057
8. Dagtas S, Al-Khatib W, Ghafoor A, Kashyap RL: **Models for motion-based video indexing and retrieval.** *IEEE Transactions on Image Processing* 2000, **9**(1):88-101. 10.1109/83.817601
9. Adjeroh DA, Lee MC: **Scene-adaptive transform domain video partitioning.** *IEEE Transactions on Multimedia* 2004, **6**(1):58-69. 10.1109/TMM.2003.819578
10. Vasconcelos N, Lippman A: **Statistical models of video structure for content analysis and characterization.** *IEEE Transactions on Image Processing* 2000, **9**(1):3-19. 10.1109/83.817595
11. Subrahmanian VS: *Principles of Multimedia Database Systems*. Morgan Kaufmann, San Mateo, Calif, USA; 1998.
12. Kuo TCT, Chen ALP: **Content-based query processing for video databases.** *IEEE Transactions on Multimedia* 2000, **2**(1):1-13. 10.1109/6046.825790
13. Taskiran C, Chen J-Y, Albiol A, Torres L, Bouman CA, Delp EJ: **ViBE: a compressed video database structured for active browsing and search.** *IEEE Transactions on Multimedia* 2004, **6**(1):103-118. 10.1109/TMM.2003.819783
14. Cheung S-CS, Zakhor A: **Fast similarity search and clustering of video sequences on the world-wide-web.** *IEEE Transactions on Multimedia* 2005, **7**(3):524-537.
15. Dawood AHM, Ghanbari M: **Content-based MPEG video traffic modeling.** *IEEE Transactions on Multimedia* 1999, **1**(1):77-87. 10.1109/6046.748173
16. Arman F, Hsu A, Chiu M-Y: **Image processing on encoded video sequences.** *Multimedia Systems* 1994, **1**(5):211-219. 10.1007/BF01268945
17. Hampapur A, Jain R, Weymouth TE: **Production model based digital video segmentation.** *Multimedia Tools and Applications* 1995, **1**(1):9-46. 10.1007/BF01261224
18. Lee S-W, Kim Y-M, Choi SW: **Fast scene change detection using direct feature extraction from MPEG compressed videos.** *IEEE Transactions on Multimedia* 2000, **2**(4):240-254. 10.1109/6046.890059
19. Ahanger G, Little TDC: **A survey of technologies for parsing and indexing digital video.** *Journal of Visual Communication and Image Representation* 1996, **7**(1):28-43. 10.1006/jvci.1996.0004
20. Hanjalic A: **Shot-boundary detection: unraveled and resolved?** *IEEE Transactions on Circuits and Systems for Video Technology* 2002, **12**(2):90-105. 10.1109/76.988656
21. Mandal MK, Idris F, Panchanathan S: **A critical evaluation of image and video indexing techniques in the compressed domain.** *Image and Vision Computing* 1999, **17**(7):513-529. 10.1016/S0262-8856(98)00143-7
22. Abdel-Mottaleb M, Krishnamachari S: **Multimedia descriptions based on MPEG-7: extraction and applications.** *IEEE Transactions on Multimedia* 2004, **6**(3):459-468. 10.1109/TMM.2004.827500
23. Shen B, Sethi IK: **Direct feature extraction from compressed images.** *Storage and Retrieval for Still Image and Video Databases IV, February 1996, San Jose, Calif, USA, Proceedings of SPIE* **2670:** 404-414.
24. Bescos J, Cisneros G, Martinez JM, Menendez JM, Cabrera J: **A unified model for techniques on video-shot transition detection.** *IEEE Transactions on Multimedia* 2005, **7**(2):293-307.
25. Cooper M, Liu T, Rieffel E: **Video segmentation via temporal pattern classification.** *IEEE Transactions on Multimedia* 2007, **9**(3):610-618.
26. Li S, Lee M-C: **Effective detection of various wipe transitions.** *IEEE Transactions on Circuits and Systems for Video Technology* 2007, **17**(6):663-673.
27. Yoo H-W, Ryoo H-J, Jang D-S: **Gradual shot boundary detection using localized edge blocks.** *Multimedia Tools and Applications* 2006, **28**(3):283-300. 10.1007/s11042-006-7715-8
28. Li W-K, Lai S-H: **Integrated video shot segmentation algorithm.** In *Storage and Retrieval for Media Databases 2003, January 2003, Santa Clara, Calif, USA, Proceedings of SPIE*. Edited by: Yeung MM, Lienhart RW, Li C-S. **5021:** 264-271.
29. Cooper M, Foote J, Adcock J, Casi S: **Shot boundary detection via similarity analysis.** *Proceedings of the TRECVID Workshop, November 2003*
30. Adjeroh DA, Lee MC: **Adaptive transform domain video shot analysis.** *Proceedings of IEEE International Conference on Multimedia Computing and Systems, June 1997, Ontario, Canada*
31. O'Toole C, Smeaton A, Murphy N, Marlow S: **Evaluation of automatic shot boundary detection on a large video test suite.** *Proceedings of Conference on Challenge of Image Retrieval, February 1999, Newcastle Upon Tyne, UK*
32. Vasconcelos N, Lippman A: **A Bayesian video modeling framework for shot segmentation and content characterization.** *Proceedings of Workshop on Content-Based Access to Image and Video Libraries, 1997, San Juan, Puerto Rico, USA*
33. Rui Y, Huang TS, Chang S-F: **Image retrieval: current techniques, promising directions, and open issues.** *Journal of Visual Communication and Image Representation* 1999, **10**(1):39-62. 10.1006/jvci.1999.0413
34. Smeulders AWM, Worring M, Santini S, Gupta A, Jain R: **Content-based image retrieval at the end of the early years.** *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2000, **22**(12):1349-1380. 10.1109/34.895972
35. Al-Fahoum AS, Reza AM: **Combined edge crispiness and statistical differencing for deblocking JPEG compressed images.** *IEEE Transactions on Image Processing* 2001, **10**(9):1288-1298. 10.1109/83.941853
36. Dufaux F, Moscheni F: **Motion estimation techniques for digital TV: a review and a new contribution.** *Proceedings of the IEEE* 1995, **83**(6):858-876. 10.1109/5.387089
37. Tao B, Orchard MT: **Gradient-based residual variance modeling and its applications to motion-compensated video coding.** *IEEE Transactions on Image Processing* 2001, **10**(1):24-35. 10.1109/83.892440

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.