- Research Article
- Open Access
Automatic Reasoning about Causal Events in Surveillance Video
© Neil M. Robertson and Ian D. Reid. 2011
- Received: 1 April 2010
- Accepted: 13 December 2010
- Published: 7 February 2011
We present a new method for explaining causal interactions among people in video. The input to the overall system is video in which people appear at low or medium resolution. We extract and maintain a set of qualitative descriptions of single-person activity using the low-level vision techniques of spatiotemporal action recognition and gaze-direction approximation. This models the input to the "sensors" of the person agent in the scene and is a general sensing strategy for a person agent in a variety of application domains. The information subsequently available to the reasoning process is deliberately limited to model what an agent would actually be able to sense. The reasoning is therefore not a classical "all-knowing" strategy but uses these "sensed" facts obtained from the agents, combined with generic domain knowledge, to generate causal explanations of interactions. We present results from urban surveillance video.
- Reasoning Process
- Causal Reasoning
- Semantic Label
- Reasoning Engine
- Urban Scene
The goal of intelligent surveillance is to confer upon a computer the ability to not only detect and report on observed activity but to reason about interactions between agents and the scene. Reasoning has, generally, been confined to the Artificial Intelligence (AI) community and few Computer Vision researchers have addressed the problem of generating explanations of dynamic scenes. Rather, the published literature has focussed on two topics in relation to visual surveillance: first, creating low-level vision techniques to detect and classify activities, generally on the basis of the statistics of trajectory information; second, detecting unusual, or inexplicable activity as defined in relation to some model of normality. Both of these strands have shown a considerable degree of success. But recent developments suggest that bringing together the techniques that operate directly on video streams with models of how humans interpret visual scenes will enable a significant step towards automatic video understanding and explanation. An additional benefit will be the ability to query archive footage on the basis of higher-level descriptions to, for example, find all instances of people meeting together.
This work demonstrates progress towards this goal via a new approach to causal reasoning in video. This method is semiautomatic, requiring a guided training phase, yet flexible and represents a serious attempt at connecting low-level visual sensing with high-level reasoning using complex, dynamic visual features. We show results from two different urban surveillance videos.
The scientific state of the art is to output text commentary on very constrained activity such as traffic using simple image features such as trajectory points (see, e.g., ). We propose that an accurate commentary of activity can be acquired when there is a good intermediate description of activity available. This enables more complex and more general sensing of the scene than merely trajectories. In fact we develop a sensing strategy around activity recognition and head-pose estimation. This paper consequently enables the machine to explain more complex, less constrained activity and interactions among people. The focus and achievement of the work presented in this paper is to explain interactions between human agents, and to do so in a way which can be applied in different domains where people interact.
To aid the reader we now give a brief paper roadmap. In Section 2 we review related prior work in the published literature and, in Section 2.1, we highlight the main contributions of this work in relation to the literature. In Section 3 we discuss the vision algorithms that form the basis of the lowest level of our system. In Section 4, we introduce the reasoning process itself: Section 4.3 presents the full process applied to real urban surveillance scenarios. We include evaluation and discussion of failure modes. We conclude and discuss some future research directions in Section 5.
Making sense of a scene can be thought of as, "Assessing its potential for action, whether instigated by the agent or set in motion by forces already present in the world" . In other words, a causal interpretation is most easily and most commonly judged by the motion effects that take place. Michotte , with Heider and Simmel , showed that it is the kinematics of objects, not their appearance, that produce the perception of causality . There is, nonetheless, a history in scene understanding research of analysing static scenes. In the work [2, 6], for example, the causal explanation of a static scene is found in the answer to the question, "Why does this object not fall down?" MugShot , which can successfully pick up cups filled with hot fluid, is one example of a system where static causal relationships can be learned. This is an example of an explanation-mediated vision system which has two important aspects for learning: expectations and explanations. The former, if they fail, are opportunities to learn; the latter provide the context and material for learning. In such a system, where knowledge runs out, the system cannot make sense of the scene and a rule has to be introduced to prevent repeated failure. Indeed, Pearl indicates that it is the availability of prior knowledge that allows the inference problem to be structured in such a way as to be amenable to causal reasoning .
Robust computer vision methods have only recently begun to be exploited for obtaining low-level information about complex visual scenes and agents within them . The work of Brand et al. relied on the extraction of very simple, static visual features from images of blocks against a white background . Siskind demonstrated reasoning about the dynamic interactions between tracked blobs (hands, blocks) in simple video sequences . Our work addresses this problem by applying low-level vision techniques to generate probabilistic estimates over qualitative descriptions of human activity in video [10, 11].
"Anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors" is an agent, according to Russell and Norvig . An agent is, therefore, analogous to a software function. When human agents are combined, complex behaviour emerges which can model real-world behaviour, as demonstrated by Andrade and Fisher for simulated crowd scenes . There are many types of agent defined in the AI literature. The Belief-Desire-Intention agent, originally developed by Bratman , is believed to model the decision-making process humans use in everyday life .
Related to agents, and of most direct relevance to the work of this paper, is the work of Dee and Hogg . In their work, a particular model of human behaviour is verified by comparing how "interesting" the model indicates the observed behaviour is to how worthy of further investigation a human believes the behaviour to be. Their work focuses on inferring what an agent can sense through line-of-sight projection of rays and the subsequent use of a predefined model of goal-directed behaviour to predict how the agent is expected to behave. Not all of the information required for reasoning is automatically extracted from the images.
There have been notable efforts to explain behaviour using low-level information only and to bridge the "semantic gap" [17, 18]. Many of these reported works have applied variants of the HMM [19, 20] from which readable semantic labels are difficult to derive, in contrast to our work. Turaga et al. have considered the importance of the descriptive language used in action-recognition semantics .
On rule-based reasoning, Siler notes that rules have, "…shown the greatest flexibility and similarity to human thought processes…" . These rules can be quickly identified and written down by an expert. A significant positive aspect of rule-based reasoning is that it is easy to update the system's knowledge by adding new rules without changing the reasoning engine . It is also easy to transfer between applications by specifying a new set of rules.
2.1. Reasoning from the Perspective of an Agent versus the Camera
In order to formulate an effective reasoning process in this work we combine a rule-based approach with a visual sensing strategy that models what the agents can actually sense in the scene. The classical approach to reasoning about human activity is to initiate an "all-knowing" visual process to gather information about the entire scene. That is, reasoning takes place from the camera perspective. In this work we shift the emphasis from the camera to the agent within the scene. To do this, we model a generic person agent and consider what information its sensors can realistically gather given the constraints of its environment. Limiting the sensing in the scene to realistically model the agent's perceptive ability has been considered, although not to the extent which we propose or in as challenging an environment as outdoor surveillance for the use of focus of attention .
The reasoning system then takes a limited subset of the information which is theoretically available, but in doing so enables more realistic agent-perspective reasoning to take place. The generality of our approach is therefore found not in a common set of rules which can be applied across many different domains but in a common set of facts which can be derived from the sensor of a person agent regardless of the domain in which that agent is operating. Critically, the only elements of the entire system which require re-coding between scenarios are (a) the initial training data and (b) the rule set. Moreover, the time taken to encode rules is considerably reduced by the fact that the lower level of the system extracts qualitative descriptions, which enables a user to write rules in useable code very efficiently (see the appendix for instances). Provided the set of all possible events and interactions is not unbounded, specifying these rules is a much less onerous task than gathering and labelling sufficient quality training examples.
2.2. Contributions of This Work
The main contribution of this work is that we demonstrate an extension to the scientific state of the art by reasoning about dynamic scenes with complex visual features which describe human motion with a significant temporal extent. Previous attempts at causal reasoning have been limited to scenes with simple visual features such as feature points and blobs.
We also introduce a reasoning strategy which is shown to be effective in different application domains where there are interactions between people. This is possible due to the extraction of scene information which models the input to the "sensors" of a general person agent. Notably, the lack of robust vision techniques for information input to the sensors of agents has been identified as a significant weakness in visual surveillance , which is now addressed by this work.
Finally, we generate plausible human-readable explanations of interactions between people directly from video streams, in contrast to the state of the art, which obtains simple commentaries on single-person activity.
We first describe the algorithms which generate descriptions of an agent's instantaneous activity. Full detail can be found in the literature, and we recapitulate the salient details here [10, 11, 25].
The algorithms we employ compute probability distributions over hand-labelled exemplar databases using Bayesian fusion. The maximum a posteriori (MAP) output constitutes qualitative descriptions of (a) gaze direction, that is, where the person is looking in the scene (Section 3.2), (b) spatiotemporal action, for example, "running on the road" (Section 3.3) and (c) behaviour, that is, spatiotemporal actions extended over time such as "crossing the road" (Section 3.4).
Gaze direction is particularly significant for inferring intention and for detecting interactions. Clearly it is not the only cue—proximity and context are also important—but it has been recognised by vision researchers that human gaze is a predictor of intention . For the purposes of causal reasoning, this action-recognition system populates a set of "facts" which collects all the information available to the reasoning engine. This lower-level component of the system answers questions in a probabilistic fashion such as Where is the agent? What is he/she doing? Where are they looking? The language used ultimately to describe interactions is also defined at this stage by the expert's hand-labelled descriptions of the exemplar data.
3.1. Visual Tracking
which is maximised for every frame using an efficient iterative algorithm . We further employ occlusion reasoning to recover the track when, for example, a person disappears behind a tree or another person. When the Bhattacharyya coefficient drops below a certain value, the search window is expanded by computing the Bhattacharyya coefficient for a grid of windows around the current location and, provided the target has not disappeared altogether or moved outside even this wider search region, the location can be recovered .
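The recovery step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tracker used in the paper: histograms are taken to be normalised colour histograms stored as plain lists, and `histogram_at`, `threshold`, `grid_step` and `grid_radius` are hypothetical names and values introduced here.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalised histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def recover_track(target_hist, histogram_at, location, threshold=0.8,
                  grid_step=10, grid_radius=2):
    """Return a window location for the target, widening the search if needed.

    histogram_at(x, y) is a hypothetical callback that returns the normalised
    colour histogram of the window centred at (x, y). The function returns
    None when even the expanded grid contains no acceptable match.
    """
    x, y = location
    if bhattacharyya(target_hist, histogram_at(x, y)) >= threshold:
        return location  # current window still matches well enough

    # Match quality dropped: score a grid of windows around the last location.
    best_score, best_loc = -1.0, None
    for dx in range(-grid_radius, grid_radius + 1):
        for dy in range(-grid_radius, grid_radius + 1):
            cand = (x + dx * grid_step, y + dy * grid_step)
            score = bhattacharyya(target_hist, histogram_at(*cand))
            if score > best_score:
                best_score, best_loc = score, cand
    return best_loc if best_score >= threshold else None
```

Returning `None` keeps the "target has disappeared altogether" case explicit, so a caller can decide whether to keep searching in later frames or drop the track.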
3.2. Gaze Direction Approximation
3.2.1. Fast Sampling from a Database of Labelled Exemplars
Sidenbladh and Black structure a large database of high-dimensional points as a binary tree via principal component analysis of the data set . The children of each node at level i in the tree are divided into two sets: those whose ith component (relative to the PCA basis) is larger than the mean and those whose value is smaller than the mean. In Sidenbladh's application each data point comprised the concatenated joint angles over several frames of human motion capture data. The method, however, applies equally well to our application of image feature data and the pseudorandom search algorithm is identical to that derived in .
where is the sampled nearest-neighbour match from one traversal of the binary tree.
Significantly, the first (where is the number of time intervals in the training data) components are selected.
3.2.2. Combining Head Pose and Body Direction
The sampling method returns a distribution over possible head poses. Used on its own this can be noisy and so we use body direction to smooth the gazing approximation. Note that a number of assumptions are required which are valid in large-scale outdoor surveillance scenes but may not hold in indoor situations or even different social settings (see ). These are, briefly, that the person does not change direction based on gaze, that anatomically impossible gazes (looking backwards) are rejected and that gaze varies smoothly.
The overall body pose relative to the camera frame is approximated using the velocity of the body, obtained via automatically initiated colour-based tracking in the image sequence. By combining direction and head-pose information, gaze is determined more robustly than by using either feature alone.
We compute the joint posterior distribution over direction of motion and head pose, which gives us the gaze. The priors on these are initially uniform for direction of motion, reflecting the fact that there is no preference for any particular direction in the scene. For head pose, however, a centred, weighted function models a strong preference for looking forwards rather than sideways. The prior on gaze is specified using physical constraints, that is, by considering only physically possible gazes.
The joint prior, is factored as above into where the first term encodes our knowledge that people tend to look straight ahead. Thus the distribution is peaked around , while is taken to be uniform. This encodes our belief that all directions of body pose are equally likely.
While for single frame estimation this formulation fuses the measurements (of head pose and body direction) with prior beliefs, when analysing video data we can further impose smoothness constraints to encode temporal coherence: the joint prior at time is in this case taken to be , where we use the assumption that the current direction is independent of previous gaze. This is motivated by the observation that, in outdoor areas, people tend to have a fixed idea of where to go and this only changes due to major distractions in the visual field. We do recognise that, in a very limited set of cases (primarily indoors), this may in fact be a poor assumption since people may change their motion or pose in response to observing something interesting while gazing around. We also assume that current gaze depends only on current pose and previous gaze which is clearly a robust assumption. The former term, , strikes a balance between the belief that people tend to look where they are going, and temporal consistency of gaze via a mixture .
Now we compute the joint distribution for all 64 possible gazes resulting from possible combinations of 8 head poses and 8 directions. The discretisation of the full 360° into 8 poses is shown in Figure 2. This posterior distribution allows us to maintain probabilistic estimates without committing to a defined gaze, and this is advantageous for further reasoning about overall scene behaviour. Immediately though we can see that gazes which we consider very unlikely given our prior knowledge of human biomechanics (since the head cannot turn beyond 90° relative to the torso ) can be rejected in addition to the obvious benefit that the quality of lower-level match can be incorporated in a mathematically sound way.
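A small sketch of this joint computation is given below. It is illustrative only: directions and head poses are assumed to be indexed 0 to 7 in 45° steps in the camera frame, the per-frame measurement likelihoods are assumed to be supplied by the lower-level matching, and the weights in `RELATIVE_PRIOR` are hypothetical values chosen here to favour looking straight ahead and to exclude head turns beyond 90°.

```python
def angular_steps(a, b, n=8):
    """Smallest number of discrete steps between two of n directions."""
    d = abs(a - b) % n
    return min(d, n - d)


# Hypothetical prior over head pose relative to body direction, in 45-degree
# steps: looking ahead is preferred, more than 2 steps (> 90 degrees) is
# treated as anatomically impossible and gets zero weight.
RELATIVE_PRIOR = {0: 0.5, 1: 0.2, 2: 0.05}


def joint_posterior(direction_lik, head_lik):
    """Posterior over all (direction, head pose) pairs as an n x n table.

    direction_lik[b] and head_lik[h] are measurement likelihoods. The prior
    on direction is uniform, so it is a constant that cancels on
    normalisation and is omitted.
    """
    n = len(direction_lik)
    joint = [[0.0] * n for _ in range(n)]
    for b in range(n):
        for h in range(n):
            prior = RELATIVE_PRIOR.get(angular_steps(b, h, n), 0.0)
            joint[b][h] = direction_lik[b] * head_lik[h] * prior
    total = sum(sum(row) for row in joint)
    if total == 0.0:
        raise ValueError("no physically possible gaze has non-zero support")
    return [[v / total for v in row] for row in joint]
```

Keeping the whole table, rather than only its maximum, mirrors the point made above that the system need not commit to a single gaze before higher-level reasoning takes place.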
Table: Comparison of detection rate for three types of head-pose matching search (rows include NN (full data) and NN (PC coeffs)).
3.3. Spatiotemporal Action Recognition
high-level descriptions can be incorporated by a qualified expert;
by sampling nonparametrically from the data, far less training data is required than is the case for standard, statistic-based learning techniques such as Hidden Markov Models (HMMs);
probabilistic distributions prevent one from committing to a single interpretation of activity too early.
is specified in the conditional probability table for the node , is defined from the frequency of occurrence of data in the training set and is uniform in most cases.
Comprehensive statistics from the analysis of the test sequences are discussed in Section 5.
3.4. Behaviour as a Sequence of Spatiotemporal Actions
3.4.1. The Structure of the Behaviour HMM
walk on the nearside pavement,
walk on the far-side pavement,
walk on the road,
walk in the driveway.
When walking on the nearside pavement (state 1), the person will stay on the nearside pavement.
When walking on the far-side pavement, the person will most likely keep walking on the far-side pavement (state 2), but a transition to the road (state 3) is allowed.
When walking on the road, the person will most likely stay walking on the road (state 3), but can move to the action walking on the nearside pavement (state 1).
When the person is walking in the drive (state 4), no transitions are allowed as this action is not expected to occur.
Similarly, behaviour HMMs are specified for the other behaviours, "Walking-along-nearside-pavement" (which is quite trivial, being a continuous sequence of walking-on-pavement actions) and "Turn-into-drive".
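The transition structure listed above for the four walking states can be written down as a table of permitted transitions. The sketch below encodes only which transitions are allowed, not their probabilities, which are not given in the text. The names `ALLOWED` and `is_consistent` are introduced here for illustration, and treating state 4 as invalidating the whole sequence is an assumption about how "no transitions are allowed" should be read.

```python
# State numbers follow the list in the text.
NEARSIDE, FARSIDE, ROAD, DRIVE = 1, 2, 3, 4

# For each state, the set of states that may follow it (self-loops included).
ALLOWED = {
    NEARSIDE: {NEARSIDE},       # stays on the nearside pavement
    FARSIDE: {FARSIDE, ROAD},   # may step onto the road
    ROAD: {ROAD, NEARSIDE},     # may reach the nearside pavement
    DRIVE: set(),               # not expected in this behaviour
}


def is_consistent(states):
    """True if a sequence of spatiotemporal-action states fits the structure."""
    if not states:
        return False
    # Any state with no permitted transitions rules the sequence out.
    if any(not ALLOWED[s] for s in states):
        return False
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(states, states[1:]))
```

For example, `is_consistent([2, 2, 3, 3, 1, 1])` holds for a far-side to nearside crossing, whereas `[1, 3]` is rejected because the nearside state permits no move onto the road in this model.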
3.4.2. Model Selection
which has a chi-squared distribution parameterised by the difference in the model order. If is greater than the 95% confidence value of the chi-squared distribution for , the result is statistically significant.
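A minimal sketch of such a test follows, assuming the statistic is the standard likelihood-ratio statistic, that is, twice the difference in log-likelihood between the more complex and the simpler model. The function name and its arguments are hypothetical. The critical values are the usual 95% points of the chi-squared distribution for one to five degrees of freedom.

```python
# 95% critical values of the chi-squared distribution, keyed by degrees of
# freedom (here: the difference in model order between the two models).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def prefer_complex_model(loglik_simple, loglik_complex, order_difference):
    """True if the more complex model is a statistically significant improvement.

    Assumes the standard likelihood-ratio statistic 2 * (logL1 - logL0).
    """
    statistic = 2.0 * (loglik_complex - loglik_simple)
    return statistic > CHI2_95[order_difference]
```

With log-likelihoods of -120.0 and -115.0 and an order difference of 2, the statistic is 10.0, which exceeds 5.991, so the more complex model would be selected under these assumptions.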
Before describing in detail the causal reasoning process and its application in two specific example datasets we define the terminology used in the rest of this work. In particular, we explain the meaning of "events", "rules", and "facts".
4.1. Explanation of Terminology: "Events", "Rules" and "Facts"
Our process for causal reasoning is to first specify a set of "events" and "rules" pertaining to the scene. When an event is observed by a single agent, a search is performed through the available evidence which is observable by that agent. This search seeks to explain the current known activity given the predefined rules. Because the low-level sensing component of the system abstracts the visual information into maximum-likelihood (ML) text descriptions of activity, the "rules" and "events" can be encoded simply as high-level conditional statements which act on the information available to the sensors of the agent. It should be noted that, while the full distributions are available, only the ML text description is used by the reasoning system. A fully probabilistic reasoning process is much more ambitious and the subject of current work. A number of the rules are given in the appendix. An "event" is simply an occurrence which is predefined as interesting and requiring explanation, such as "cross-the-road".
The events and the rules are changed between scenarios but the reasoning process remains the same. This reasoning process is given in pseudocode in Algorithm 1. Although we do specify the events which require explanation, this is not strictly necessary. One could mandate that it is only unusual events that initiate the reasoning engine (where "unusual" is defined by some probability threshold on observed activity). Given that unusual activity is not modelled explicitly, reasoning about such events requires a more sophisticated system than that which we develop here. We discuss how these might be handled by a rule-based reasoning system in Section 5.3.3. Hence, we specify the events which need explanation and these are preloaded into the system, along with the rules which govern the scene.
Algorithm 1
check facts for event in events-list
for all frames in sequence do
    update facts list
    if event occurs then
        derive hypotheses from the rule-set
        for all hypotheses do
            search known facts for hypothesis support
        end for
    end if
end for
The final piece of information required is a set of "facts" on which the reasoning process operates, searching for an explanation given the "rules" and the "events". The facts are gathered from the low-level sensing procedures as the video is processed and take the form of text descriptions of what is observed. These facts can be augmented with higher-level descriptions which have come from an earlier reasoning process. Thus the set of "facts" contains all the information which is available to the reasoning engine at any given time.
When a trigger event occurs, a search through the rules will take place. This is what we term generating a "hypothesis", that is, postulating that one of the rules is in play. If any of the current facts lend evidence to any rule, the facts are updated. At the end of the video, the set of facts constitutes an "explanation". We show this in operation in the following sections.
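The loop of Algorithm 1 can be sketched in a few lines. This is an illustrative reading rather than the authors' code: facts are assumed to be per-frame dictionaries of text labels, a rule is assumed to be a named predicate over the facts gathered so far, and the names `run_reasoning`, `trigger_events` and `rules` are introduced here.

```python
def run_reasoning(frames, trigger_events, rules):
    """Process a sequence frame by frame and return the accumulated facts.

    frames: list of dicts of sensed labels, optionally with an "event" key.
    trigger_events: set of event labels that require explanation.
    rules: dict mapping a hypothesis name to a predicate over the fact list.
    """
    facts = []
    for observed in frames:
        facts.append(dict(observed))        # update the facts list
        event = observed.get("event")
        if event not in trigger_events:     # nothing to explain in this frame
            continue
        # Each rule is a hypothesis; keep those the known facts support.
        for hypothesis, is_supported in rules.items():
            if is_supported(facts):
                facts.append({"inferred": hypothesis, "because_of": event})
    return facts
```

Because inferred facts are appended to the same list that later rules search, a higher-level rule can build on the conclusions of an earlier one, which is the behaviour described above for augmenting the facts.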
4.2. Updating the "Facts"
Meanwhile, to root this explanation in an example, consider that a set of facts correspond to the activity of an individual (spatiotemporal action, gazing direction). The rules shown in the appendix (proximity, meeting, and move-to-road) operate on these facts in a hierarchical manner. That is, the proximity rule uses spatiotemporal action and the visibility of individuals inferred from gazing direction, as seen in Algorithm 2. Then, the set of facts is updated: either the people are "together" or "not together". The "meeting" rule then uses this information to infer whether a meeting between people is occurring, as shown in Algorithm 3. Finally, the move-to-road rule operates on the updated facts which contain the "meeting" event, which is shown in Algorithm 4. A graphical illustration of this process for the "move-to-road" event is shown in the schematic, Figure 13.
Algorithm 2
proximityThreshold = 100
timeThreshold = 100
for all frames do
    distance = ‖(P1 position) − (P2 position)‖
    if distance ≤ proximityThreshold then
        if P1 action = P2 action & P1 visible & P2 visible then
            together = 1
            increment = increment + 1
        end if
    end if
    if increment ≥ timeThreshold then
        situation = "together"
    else
        situation = "not together"
    end if
end for
Algorithm 3
meetingThresh = 50
for each frame i after the first do
    if situation(i) = situation(i − 1) then
        if situation(i) = "together" then
            togetherInc = togetherInc + 1
        end if
    else
        togetherInc = 0
    end if
    if togetherInc = meetingThresh then
        scenario = "meeting"
    else if togetherInc < meetingThresh & togetherInc > 0 then
        scenario = "potential meeting"
    else
        scenario = "not meeting"
    end if
end for
Algorithm 4
if event = "meeting" then
    for each frame i up to lastFrame do
        if scenario = "meeting" then
            currentAction = facts.positionLabel(i)
            explanation = "Person" event "to meet on"
        end if
    end for
    for each frame i up to lastFrame do
        if scenario = "ignore" then
            currentAction = facts.positionLabel(i)
            explanation = "Person" event "to avoid other
        end if
    end for
end if
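The proximity and meeting rules of Algorithms 2 and 3 might be written in Python roughly as below. This is a sketch under assumptions: each person's track is taken to be a list of per-frame dictionaries with hypothetical keys `pos`, `action` and `sees_other`, positions are in image coordinates, and `>=` is used in place of the equality test on the meeting threshold so that the "meeting" label persists once reached.

```python
import math


def proximity_labels(p1_track, p2_track, proximity_threshold=100,
                     time_threshold=100):
    """Per-frame "together" / "not together" labels for two tracked people."""
    labels, count = [], 0
    for a, b in zip(p1_track, p2_track):
        close = math.dist(a["pos"], b["pos"]) <= proximity_threshold
        if (close and a["action"] == b["action"]
                and a["sees_other"] and b["sees_other"]):
            count += 1
        labels.append("together" if count >= time_threshold else "not together")
    return labels


def meeting_labels(situations, meeting_threshold=50):
    """Per-frame scenario labels derived from consecutive proximity labels."""
    labels, run = [], 0
    for prev, cur in zip(situations, situations[1:]):
        # Count consecutive frames in which the pair remains "together".
        run = run + 1 if cur == prev == "together" else 0
        if run >= meeting_threshold:
            labels.append("meeting")
        elif run > 0:
            labels.append("potential meeting")
        else:
            labels.append("not meeting")
    return labels
```

The second function consumes only the text labels produced by the first, which reflects the hierarchical use of facts described in Section 4.2.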
4.3. Explaining Two-Person Interactions in an Urban Location
The agent has knowledge of his own state which includes action, behaviour, and gaze direction.
The agent can see other agents when they fall within the visual field, determined by the gaze direction.
The agent can sense anything within a specified range (reflecting the ability to, e.g., hear someone walking behind).
Interactions between agents are possible within a certain proximity.
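The sensing assumptions listed above can be illustrated with a simple geometric test. The sketch assumes image-plane positions, a gaze direction expressed in degrees, and hypothetical values for the field of view, the sensing range and the interaction range, none of which are specified in the text.

```python
import math


def can_sense(agent_pos, gaze_deg, other_pos, fov_deg=90.0, sense_range=50.0):
    """True if the agent can see the other agent or sense it at close range."""
    dx = other_pos[0] - agent_pos[0]
    dy = other_pos[1] - agent_pos[1]
    if math.hypot(dx, dy) <= sense_range:
        return True  # close enough to be sensed without being seen
    bearing = math.degrees(math.atan2(dy, dx))
    # Signed angular offset between gaze and bearing, wrapped to [-180, 180).
    offset = (bearing - gaze_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= fov_deg / 2.0


def can_interact(agent_pos, other_pos, interaction_range=100.0):
    """True if two agents are close enough for an interaction to be possible."""
    return math.dist(agent_pos, other_pos) <= interaction_range
```

Only agents for which `can_sense` holds would contribute facts to the agent of interest, which is what restricts the reasoning to the agent's own view of the scene.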
Table 2: The set of events which trigger the reasoning engine (left) and the set of rules which can be initiated in search of an explanation (right).
Trigger events list:
- Move to road
- Move to pavement
- Move to drive
In the analysis of activity which follows it is important to note that there is no all-knowing reasoning process which has access to all the information in the scene. The only information which is available is derived from the sensors of the agent of interest, that is, the agent whose behaviour corresponds to an activity which needs to be explained. As previously stated, this explicitly shifts the focus from the camera to the agent in the scene and thus reasons from the agent's, as opposed to the global, view.
4.3.1. Detecting and Classifying Activities Using Rules
The true reasons for events occurring are not apparent directly from the video. A person who crossed the road in order to meet his friend may have done so because it was prearranged or because he happened to see his acquaintance. It is not possible to distinguish between these hypothetical reasons from the data alone even if the scene rules are completely known. Rather, it requires detailed knowledge of the intention, goals, and history of a specific individual. This is not generally available and certainly not in a surveillance application where the individuals under observation are anonymous. Despite this fact, a "lower" level of causality is still in operation and this can be inferred from our description of the scenario: the person, "…crossed the road in order to meet…". This type of causality is amenable to analysis using the information we can currently obtain from the sensors of the agents.
People meeting with one another is a common occurrence in an urban scene. In fact, recognising groups of people versus independent individuals and, in particular, detecting cooperating individuals, is a core element of the human interpretation of urban scenes. Police surveillance officers, for example, may be interested in an exchange of illegal substances at a meeting of two individuals under observation.
There are many cues humans use to distinguish between people meeting or people ignoring one another. One such cue, discussed in Section 3.2, is that people who are together will generally acknowledge each other's presence by looking at one another periodically and at regular intervals. Other, more obvious cues include proximity. By defining precisely what is required for the event "meeting" to take place we can distinguish between people passing by one another and people meeting together.
The "proximity" of the individuals is first analysed using Algorithm 2. A "potential meeting" is identified when agents are within a predefined proximity in image coordinates for a predefined period of time (typically 100 frames) and also within one another's field of view, that is, they must be looking at one another. Note that the value of proximity is preset in Algorithm 2. The number which ought to be chosen is dependent on many factors including social criteria and cultural norms  and is easily changed. The rule for meeting is that the intermediate state potential meeting must be the current explanation of the interaction. Additionally, the agents must be performing the same spatiotemporal action, for example, they are both walking-on-the-pavement. By contrast, an "ignore" rule is initiated when the conditions for "meeting" are not met but when a "potential meeting" has previously occurred. If none of these agent states are identified, no interaction is defined.
Again we emphasise that the encoding of these rules is very efficient and extensible. Algorithm 3 in the appendix explicitly defines the rule for the scenario "meeting". As can be seen the meeting rule uses the information determined by the "proximity" rule. Note also that the "meeting" algorithm explicitly requires input from the gaze direction approximation component of the system.
4.3.2. Explaining Interactions between People
IF the event "move-to-road" is followed by event "move-to-pavement" AND the current location is not the same as the location triggering the first event (i.e., the road is crossed) AND, subsequently, a meeting takes place THEN the explanation is that, "the agent crossed the road to meet the other agent".
IF a crossing of the road is observed NOT followed by an interaction THEN the explanation is that the agent crossed the road.
IF a "move-to-road" event is triggered AND subsequently a "move-to-pavement" event but back to the same pavement THEN no explanation is provided UNLESS another agent was in the near vicinity THEN the explanation is that it was necessary to avoid collision.
The pseudocode for this scenario is shown in Algorithm 4 in the appendix. An illustrative schematic of the overall reasoning process for answering this question is shown in Figure 13. Similarly, we generate hypotheses to explain events including "stopping", "move-to-pavement" and "move-to-driveway". It is simple to change between domains by updating the rule set. There is the additional advantage that the rule set is general to all such urban scenes. The output for two different situations is automatically generated and exactly the same reasoning engine and events set may be applied to each scene independently.
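The three IF/THEN rules for the "move-to-road" event can be condensed into a single function. This is a sketch only: it assumes that a "move-to-road" event followed by a "move-to-pavement" event has already been detected, and the argument names are hypothetical summaries of facts that the lower-level rules would supply.

```python
def explain_crossing(start_pavement, end_pavement, meeting_followed,
                     other_agent_nearby):
    """Return a text explanation for a move-to-road / move-to-pavement pair.

    start_pavement: pavement label when "move-to-road" was triggered.
    end_pavement: pavement label after the later "move-to-pavement" event.
    Returns None when the rules provide no explanation.
    """
    if start_pavement != end_pavement:      # the road was crossed
        if meeting_followed:
            return "the agent crossed the road to meet the other agent"
        return "the agent crossed the road"
    # The agent returned to the same pavement.
    if other_agent_nearby:
        return "it was necessary to avoid collision"
    return None
```

For example, `explain_crossing("far-side", "nearside", True, True)` yields the "crossed the road to meet" explanation, while stepping onto the road and back with nobody nearby yields no explanation at all.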
Comprehensive data from two different urban scenes was gathered and used to evaluate our method. We first describe the datasets used and the training process, then discuss the evaluation of the reasoning process.
5.1. Dataset 1
The first dataset is illustrated in Figure 18. To obtain the data, two students were asked to act out a set of twelve two- and one-person activities. The activity was recorded using a standard home video camera from the second floor of a domestic building in Oxford. No instruction was given to the "actors", other than a brief outline of the activity. The two-person activities included walking together, meeting, passing one another by. A set of images containing these two-person activities was then extracted for experimentation. This subset of the total dataset comprises 6000 frames at 5 frames per second (fps). From this, a hand-labelled corpus of 665 frames was generated by the authors for training the low-level sensing component of the system. The low-level spatiotemporal action classes derived from these sequences are walking (away, towards, left, and right), running (away, towards, left, and right), and standing still. The people are tracked automatically and the representative (training) action classes are labelled by hand. The head-pose classes remain as described previously and database exemplars of the head-pose under these imaging conditions were extracted automatically and then given a semantic label by hand. The positional locations are defined as nearside pavement, far-side pavement, road, and driveway.
5.2. Dataset 2
5.3.1. Event Recognition
As defined previously, an "event" in the urban surveillance domain corresponds to a specified change in spatiotemporal action, which is computed from the combination of location and action. In Dataset 1, the maximum-likelihood (ML) spatiotemporal action is correct with reference to ground-truth labels 96.7% of the time, and the true model is in the distribution of all models sampled from the database 100% of the time. This is measured over 2391 frames. In Dataset 2, the ML action was correctly chosen in 74% of the tests and the correct model was in the distribution 89.5% of the time. This is a fair reflection of the differing pixel resolution available to compute the action descriptor. We tested over 18445 frames of data.
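The two figures reported for each dataset measure different things: how often the single most likely action matches the ground truth, and how often the ground-truth action appears anywhere among the sampled candidates. The sketch below shows one way such rates could be computed. The `Frame` layout, the function name `score_frames` and the example values are assumptions for illustration and are not taken from the authors' evaluation code.

```python
from typing import Dict, List, Tuple

# One test frame: (candidate action -> likelihood, ground-truth action).
# Each candidate dictionary is assumed to be non-empty.
Frame = Tuple[Dict[str, float], str]


def score_frames(frames: List[Frame]) -> Tuple[float, float]:
    """Return (ML accuracy, coverage) as fractions of the test frames."""
    if not frames:
        raise ValueError("no frames to score")
    ml_hits = 0
    covered = 0
    for candidates, truth in frames:
        # Maximum-likelihood choice: the candidate with the largest likelihood.
        ml_label = max(candidates, key=candidates.get)
        ml_hits += ml_label == truth
        # Coverage: the true label appears somewhere among the candidates.
        covered += truth in candidates
    n = len(frames)
    return ml_hits / n, covered / n


# Made-up example frames, not data from the paper.
example_frames: List[Frame] = [
    ({"walking left": 0.7, "standing still": 0.3}, "walking left"),
    ({"running towards": 0.6, "walking towards": 0.4}, "walking towards"),
    ({"walking away": 0.9, "standing still": 0.1}, "running away"),
    ({"standing still": 1.0}, "standing still"),
]

print(score_frames(example_frames))  # prints (0.5, 0.75)
```

Coverage can never be lower than ML accuracy, because every frame in which the ML choice is correct also has the true label among its candidates. This matches the ordering of the reported pairs (96.7% and 100%; 74% and 89.5%).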
5.3.2. Explanations of Events
The interpretation of the events is dependent on the detection of lower-level events. The explanatory hypotheses have already been discussed in some detail in Section 4.3. Out of the data used in this paper we identified the four events in Dataset 1 which are already listed in Table 2. In Dataset 2 this set is augmented by a series of single-person interactions with the environment. Given the "real" nature of this particular dataset, the events are somewhat uninteresting and correspond mainly to road crossings.
The result drawn from Dataset 2 shows how the reasoning process may be extended. In this case there is no other person and so a new explanation is posited: that of "crossing road". The rules are augmented for the example in Figure 19 with the knowledge that the road may legitimately be crossed at the pedestrian crossing. That is, despite there being no evidence for a meeting, crossing at the lights is a plausible reason for the observed behaviour. The accuracy of the method is also demonstrated here by the running commentary generated in Figure 19.
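The effect of augmenting the rules can be sketched as an ordered sequence of checks, where the added "crossing road" rule is consulted only when there is no evidence of a meeting. This is an illustrative assumption about how such an extension might look, not the authors' implementation; the fact keys (`other_person_present`, `at_pedestrian_crossing`, `frame`), the function names and the output strings are all hypothetical.

```python
from typing import Dict, List

Facts = Dict[str, object]  # sensed facts for one observation (hypothetical keys)


def explain_road_entry(facts: Facts) -> str:
    """Pick an explanation for a person moving onto the road."""
    # Two-person explanation: there is evidence of another person.
    if facts.get("other_person_present", False):
        return "moving to meet the other person"
    # Added domain knowledge: crossing at a pedestrian crossing is legitimate.
    if facts.get("at_pedestrian_crossing", False):
        return "crossing road at the pedestrian crossing"
    # No rule applies, so the behaviour is left unexplained.
    return "no explanation found"


def commentary(track: List[Facts]) -> List[str]:
    """Produce one commentary line per observation in the track."""
    return [f"frame {obs['frame']}: {explain_road_entry(obs)}" for obs in track]
```

For a lone pedestrian observed at the crossing, for example `{"frame": 10, "at_pedestrian_crossing": True}`, the first check fails and the added rule supplies the explanation. Without that rule the same observation would fall through to the unexplained case.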
5.3.3. Failure Modes
In contrast to the previously studied problem of reasoning about static scenes with very simple visual features, this work has developed a new system for explaining interactions between people in complex, dynamic scenes. This has been made possible by our recent work in the area of action recognition, the results of which we have exploited to enable a software "agent" to sense its environment. Using known rules about how agents interact, we created a general method for reasoning about causal interactions between people. The generality of the method is demonstrated by the results from two very different applications. The most pressing area for future work is to implement a fully Bayesian reasoning system.
See Algorithms 2, 3, and 4.
Neil Robertson was supported by the Royal Commission for the Exhibition of 1851 and the UK Ministry of Defence.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.