Sequential Monte Carlo filter based on multiple strategies for a scene specialization classifier
 Houda Maâmatou^{1, 2, 3},
 Thierry Chateau^{1},
 Sami Gazzah^{2},
 Yann Goyat^{3} and
 Najoua Essoukri Ben Amara^{2}
https://doi.org/10.1186/s1364001601434
© The Author(s) 2016
Received: 9 May 2016
Accepted: 8 November 2016
Published: 25 November 2016
The Erratum to this article has been published in EURASIP Journal on Image and Video Processing 2017 2017:5
Abstract
Transfer learning approaches have shown interesting results by using knowledge from source domains to learn a specialized classifier/detector for a target domain containing unlabeled data or only a few labeled samples. In this paper, we present a new transductive transfer learning framework based on a sequential Monte Carlo filter to specialize a generic classifier towards a specific scene. The proposed framework utilizes different strategies and iteratively approximates the hidden target distribution as a set of samples in order to learn a specialized classifier. These training samples are selected from both source and target domains according to their importance weights, which reflect the likelihood that they belong to the target distribution. The resulting classifier is applied to pedestrian and car detection on several challenging traffic scenes. The experiments demonstrate that our solution outperforms several state-of-the-art specialization algorithms on public datasets.
1 Introduction
Object detection in an image or in video frames is the first task to perform, and one of the most interesting, in several computer vision applications. A lot of work has focused on pedestrian and vehicle detection for the development of intelligent transportation systems and for video-surveillance traffic-scene analysis [1–13]. Most of these papers have proposed object-appearance detectors to improve the performance of the detection task and to avoid, or at least reduce, problems relative to a simple background subtraction algorithm, such as merging and splitting blobs, detecting mobile background objects, and detecting moving shadows. Some researchers [9, 10, 14] have focused on presenting relevant features that lower the false positive rate and raise the detection accuracy, though often at the price of an increase in the computational cost of multi-scale detection tasks. Other researchers, like Dollár et al. [11, 12], have been interested in reducing the time needed to compute features at each scale of sampled image pyramids, without adding complexity or particular hardware requirements, to allow fast multi-scale detection.
However, a key point of learning appearance-based detectors is the building of a training dataset, where thousands of manually labeled samples are needed. This dataset should cover a large variety of scales, view points, light conditions, and image resolutions. In addition, training a single object detector to deal with various urban scenarios is a very hard task because there can be much variability in traffic scenes, such as several object categories, different road infrastructures, weather influence on video quality, and time of scene recording (rush hours or off-peak hours, day or night).
The diversity of both positive and negative samples can be very restricted in a video surveillance scene recorded by one static camera. Nevertheless, it was demonstrated in [15–20] that the accuracy of a generic (pedestrian or vehicle) detector would drop off quickly when it was applied to a specific traffic scene in which the available data mismatched the source training data.
An intuitive solution is to build a scene-specialized detector that provides a higher performance than a generic detector using labeled samples from the target scene. On the other hand, labeling data manually for each scene and repeating the training process several times, according to the number of object classes in the target scene, are arduous and time-consuming tasks. A practical way to avoid these tasks is to automatically label samples from the target scene and to transfer only a set of useful samples from the labeled source dataset to the specialized target one. Our work moves along this direction. We suggest an original formalization of transductive transfer learning (TTL) based on a sequential Monte Carlo (SMC) filter [21] to specialize a generic classifier to a target scene. In the proposed formalization, we estimate a hidden target distribution using a source distribution in which we have a set of annotated samples, in order to give an estimated target distribution as an output. We consider samples of the training dataset as realizations of the joint probability distribution between samples’ features and object classes.
(1) Original formalization of TTL for classifier specialization based on SMC filter: This formalization is inspired by particle filters, mostly used to solve the problems of object tracking and robot localization [22–24]. We propose to approximate an unknown target distribution as a set of samples that compose the specialized dataset. The aim of our formalization is to automatically label the target data, to attribute weights to samples of both source and target datasets reflecting their relevance, to select relevant samples for the training according to their weights, and to train a scene-specialized classifier. Importantly, this formalization is general and can be applied to specialize any classifier.
(2) Strategies of sample proposal: In order to use informative samples for training a scene-specialized classifier, we put forward two sample-proposal strategies. They give a set of proposals composed of true positive samples, false positive ones known as “hard examples,” and samples from background models. These strategies accelerate the specialization process by avoiding handling all the samples of the target database.
(3) Strategies of observation: We also suggest two observation strategies to select the correct proposed target samples and to avoid the distortion of the specialized dataset with mislabeled samples. These strategies utilize prior information, extracted from the target video sequence, and visual context cues to assign a weight to each sample returned by the proposal strategies. Unlike some previous work [25–28], our suggested visual cues do not incorporate the score returned by the classifier, which can make the training of the specialized classifier drift.
(4) Strategy of sampling: In general, the properly classified target samples are not enough to build an efficient target classifier. However, the source dataset may contain some samples that are close to the target ones, which helps in training a specialized classifier. Therefore, we put forward a sampling strategy that selects useful samples from both target and source datasets according to their importance weights, reflecting the likelihood that they belong to the target distribution. Differently from the work developed in [25–28], which treated the dataset samples equally, or from the work of Wang et al. [16, 17], which integrated the confidence score associated to the sample in the training function of the classifier, we utilize the sampling importance resampling (SIR) algorithm. The latter transforms the weight of a sample into a number of repetitions, by replacing the samples associated to a high weight by numerous copies and the samples linked to a low weight by few copies, thus giving all resulting samples identical weights. This makes our approach applicable to specialize any classifier, while treating training samples according to the importance of their weights without modifying the training function as Wang et al. [16, 17] did.
The remainder of the paper is organized as follows. First, some related work is described in Section 2. Then, the proposed approach is presented in Section 3: We describe the general SMC scene specialization framework in Section 3.1 and the several proposed strategies for each filter step in Section 3.2. After that, our experimental results are provided in Section 4. Finally, the paper is summarized in Section 5.
2 Related work
The literature has shown that transfer learning methods have been successfully utilized in various real-world applications like object recognition and classification. These methods propose to use available annotated data and knowledge acquired through some previous tasks relative to source domains so as to improve a learning system of a target task in a target domain [29]. In this section, we are interested in the work that develops, automatically or with little human effort, classifiers or detectors specific to a target scene.
Three main categories of transfer learning methods, related to the suggested approach, were described in [20]. The first category would modify the parameters of a source learning model to improve its accuracy in a target domain [30, 31]. The second one would reduce the difference between the source and target distributions to adapt the classifier to the target domain [32, 33]. The last one would automatically select the training samples that could give a better model for the target task [34, 35]. Except for [18, 36], which presented classifiers based on Convolutional Neural Networks (CNN), most of the work cited above was presented as variants of the Support Vector Machine (SVM).
In this paper, we focus on the last category that uses an automatic labeler to collect data from the target domain. Rosenberg et al. [25] utilized the decision function of an object appearance classifier to select the training samples from one iteration to another. Since the classifier was itself the labeler, it was difficult to set up the decision function. If the latter was very selective, then only very similar data would be chosen, even if they did not contain important variability information. Otherwise, there was a risk of introducing wrong data that would degrade the system’s performance over time. To introduce new data containing more diversity, Levin et al. [27] used a system with two independent classifiers to collect unlabeled data. The data labeled with a high confidence by one of the two classifiers were added to the training data to retrain both classifiers. Another way to automatically collect new samples is to use an external entity called an “oracle.” An oracle may be built utilizing a single algorithm or combining and/or merging multiple algorithms. Nair and Clark [26] presented an oracle based on a background subtraction algorithm, while Chesnais et al. [28] put forward an oracle composed of three independent classifiers (appearance, background extraction, and optical flow). It was noted that the adapted classifier of Nair and Clark [26] was very sensitive to the risk of drifting because the selection of samples depended only on the background subtraction algorithm. Indeed, several static objects or those with an appearance similar to the background were classified as negative samples, and mobile background objects were labeled as objects of interest. Moreover, the proposed methods of Levin et al. [27] and Chesnais et al. [28] were based on the assumption that the classifiers were independent, which is not easy to validate.
Furthermore, some solutions concatenated the source dataset with new samples, which increased the dataset size during iterations [30–33]. Others were limited only to the use of samples extracted from the target domain [28], which resulted in losing pertinent information of source samples. Ali et al. [37] presented an approach that learned a specific model by propagating a sparsely labeled training video based on object tracking. Inspired by this, Mao and Yin [19] opted for chains of tracked samples (tracklets) to automatically label target data. They linked detection samples returned by an appearance-based object detector into tracklets and propagated labels to uncertain tracklets based on a comparison between their features and those of labeled tracklets. The method used a lot of parameters, which had to be determined or estimated empirically, and several sequential thresholding rules, causing an inefficient adaptation of a scene-specific detector.
Another solution was proposed in [15–18, 20, 35, 36]. It collected new samples from the target domain and selected only the useful ones from the source dataset. Wang et al. [17] used different contextual cues such as pedestrian motion, road model (pedestrians, cars...), location, size, and objects’ visual appearances to select positive and negative samples of the target domain. In fact, their method was based on a new SVM variant to select only source samples that were good for the classification in the target scene. The limitation of their method was that it could be applied only to an SVM classifier.
Recently, we have noticed an emergence of work based on deep learning, which presents high performances on classification and detection tasks. Yet, it is known that this type of model requires large datasets and has many parameters to train. In order to take advantage of these classifiers, some work has proposed to transfer a CNN trained on a large source dataset to a target domain with a small dataset. Oquab et al. [38] copied the weights from a CNN trained on the ImageNet dataset to a target network with additional layers for image classification on the Pascal VOC dataset. In [18], Li et al. suggested adapting a generic ConvNet vehicle detector to a scene-specific one by reserving shared filters between source and target data and updating the non-shared filters. In contrast to [18, 38], which needed several labeled data in the target domain, Zeng et al. [36] learnt the distribution of the target domain by opting for Wang’s approach [17] as an input to their deep model to reweight samples from both domains without manual data labeling from the target scene.
Most of the specialization algorithms cited above are based on hard-thresholding rules and can drift quickly during training [17], or they are applicable only to a few classifiers. In contrast, our proposed framework overcomes the risk of drifting by propagating a subset of the specialized dataset through iterations. It can be used to specialize any classifier while utilizing the same training function as a generic classifier and may be applied using several strategies on each step of the filter. Some preliminary results of the work presented in this paper were published in [20]. In this paper, we put forward an extension of our original TTL approach based on an SMC filter (TTL-SMC) with other sample proposal and observation strategies and more experiments. The TTL-SMC iteratively approximates the joint probability distribution between the samples and the object classes of the target scene by combining only relevant source and target data as a specialized dataset. The latter is used to train a specialized classifier for the target scene.
3 Our proposed approach
This section presents the proposed approach. We describe in Section 3.1 the core of the general specialization framework based on the SMC filter. Then, we suggest in Section 3.2 different strategies that can be used for each filter step.
3.1 SMC scene specialization framework
This subsection introduces the context and gives a detailed description of the proposed framework.
3.1.1 Context
In our work, we assume that the unknown joint distribution between the target samples and the associated labels can be approximated by a set of representative samples. The block diagram of the suggested specialization, at a given iteration k, is illustrated in Fig. 1. Algorithm 1 gives a summary of its process.
Given a source dataset, a generic classifier (which can be learnt from this source dataset), and a video sequence of a target scene, a specialized classifier and an associated specialized dataset are to be generated. These two are the outputs of the distribution approximation provided by the SMC filter.
Let \({\mathcal {D}}_{k} \doteq \{\mathbf {X}_{k}^{(n)}\}_{n=1,..,N}\) be a specialized dataset of size N at an iteration k, where \(\mathbf {X}_{k}^{(n)} \doteq (\mathbf {x}^{(n)},y)\) is the sample number n, with x being its feature vector and y its label, where \(y \in {\mathcal {Y}}\). Basically, \({\mathcal {Y}}=\{-1;1\}\), where 1 represents the object and −1 represents the background (or non-object class). In addition, \(\Theta _{{{\mathcal {D}}}_{k}}\) is a specialized classifier at an iteration k, which is trained on the previous specialized dataset \({\mathcal {D}}_{k-1}\). We use a generic classifier \(\Theta _{g}\) at the first iteration.
A source dataset \( {{{\mathcal {D}}}^ s} \doteq \{\mathbf {X}^{s (n)} \}_{n = 1,.., N^ s} \) of \(N^{s}\) labeled samples is defined. Moreover, a large target dataset \( {\mathcal {D}}^ t \doteq \{\mathbf {x}^{t (n)} \}_{n = 1,.., N^ t} \) is available. This dataset is composed of \(N^{t}\) unlabeled samples provided by a multi-scale sliding-window extraction strategy applied to the target video sequence and cropped from computed background models.
3.1.2 Classifier specialization based on SMC filter
with \(C=1/p(\mathbf {Z}_{k+1} \mid \mathbf {Z}_{0:k+1})\).
Therefore, the SMC filter is used to estimate the unknown joint distribution between the features of the target samples and the associated class labels by a set of samples that are initially unknown. We suppose that the recursion process selects relevant samples for the specialized dataset from one iteration to another, leads to converge to the right target distribution, and makes the resulting classifiers more and more efficient.
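For orientation, the textbook Bayesian filtering recursion that an SMC filter approximates can be written with the symbols of this section as shown next. This is the standard form and is given only as an assumption about the shape of Eq. 1; \(\mathbf{X}_{k}\) stands for the hidden samples, \(\mathbf{Z}_{k}\) for the observations at iteration k, and C for the normalizing constant.

$$ p\left(\mathbf{X}_{k+1} \mid \mathbf{Z}_{0:k+1}\right) = C \, p\left(\mathbf{Z}_{k+1} \mid \mathbf{X}_{k+1}\right) \int p\left(\mathbf{X}_{k+1} \mid \mathbf{X}_{k}\right) \, p\left(\mathbf{X}_{k} \mid \mathbf{Z}_{0:k}\right) \, d\mathbf{X}_{k} $$

Read from right to left, the previous posterior \(p(\mathbf{X}_{k} \mid \mathbf{Z}_{0:k})\) is propagated by the transition term, the likelihood \(p(\mathbf{Z}_{k+1} \mid \mathbf{X}_{k+1})\) reweights the propagated samples, and the weighted set is then resampled.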
The resolution of Eq. 1 is done in three steps: prediction, update, and sampling. The following paragraphs describe the details of each one.
We denote by \({\tilde {\mathcal {D}}_{k+1}} \doteq \left \{\tilde {\mathbf {X}}_{k+1}^{(n)}\right \}_{n=1,..,\tilde {N}_{k+1}}\) the specialized dataset predicted for iteration (k+1), where \(\tilde {N}_{k+1}\) is its number of samples and \(\tilde {\mathbf {X}}_{k+1}^{(n)}\) is the nth predicted sample.
where \((\breve {\mathbf {X}}^{(n)}_{k+1}, \breve {\pi }^{(n)}_{k+1})\) represents a target sample with its associated weight and \(\breve {N}_{k+1}\) is the number of weighted samples.
\(\mathbf {X}^{*(n)}_{k+1}\) is a selected sample n to be in the next specialized dataset \({\mathcal {D}}_{k+1}\); a sample can be selected either from the target dataset or from the source one.
Note that in this step we apply the SIR algorithm to approximate the conditional distribution \(p(\breve {\mathbf {X}}_{k+1} \mid \mathbf {Z}_{k+1})\) of the target samples given the observations. Furthermore, we propose to extend this target set by transferring samples from the source dataset, which mostly resemble those of the target scene, without changing the posterior distribution.
The specialization process stops when the ratio \(|\tilde {\mathcal {D}}_{k+1}|/|\tilde {\mathcal {D}}_{k}|\) exceeds a previously fixed threshold \(\alpha _{s}\), where \(|\cdot |\) denotes the dataset cardinality. The output classifier relies only on appearance to detect the object of interest (pedestrian or car) in the target scene.
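The way resampling turns weights into repetitions can be illustrated with a minimal sketch. It assumes multinomial resampling with NumPy; the function names `sir_resample` and `repetition_counts` are hypothetical and the paper does not state which resampling variant is used.

```python
import numpy as np


def sir_resample(weights, n_out, rng=None):
    """Draw n_out sample indices with probability proportional to weight."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or w.size == 0 or np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Normalise the weights so they form a probability vector.
    p = w / w.sum()
    # Sampling with replacement: heavy samples are repeated, light ones vanish.
    return rng.choice(w.size, size=n_out, replace=True, p=p)


def repetition_counts(indices, n_samples):
    """Number of copies of each original sample in the resampled set."""
    return np.bincount(indices, minlength=n_samples)
```

After resampling, every retained copy carries the same weight, so any classifier can be trained on the result without a weighted loss.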
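The three steps and the stopping rule can be put together in a short skeleton. It is a sketch under assumptions: `predict`, `update`, `sample`, and `train` are hypothetical callables standing in for the strategies of Section 3.2, and `max_iter` is an added safety bound that the paper does not mention.

```python
def should_stop(n_prev, n_next, alpha_s=0.80):
    """True when the ratio of predicted dataset sizes exceeds alpha_s."""
    if n_prev <= 0:
        return False
    return (n_next / n_prev) > alpha_s


def specialize(classifier, predict, update, sample, train,
               alpha_s=0.80, max_iter=20):
    """Iterate prediction, update, and sampling until the stopping rule fires."""
    dataset = None
    n_prev = None
    for _ in range(max_iter):
        predicted = predict(classifier, dataset)  # prediction step (proposals)
        weighted = update(predicted)              # update step (observation weights)
        dataset = sample(weighted)                # sampling step (resampling)
        classifier = train(dataset)               # retrain on the specialized dataset
        n_next = len(predicted)
        if n_prev is not None and should_stop(n_prev, n_next, alpha_s):
            break
        n_prev = n_next
    return classifier, dataset
```

Passing the steps as callables mirrors the claim that different strategies can be plugged into each filter step.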
3.2 The different proposed strategies
In this subsection, we propose several strategies for each step of the filter. This filter aims to specialize a classifier to a target scene monitored by a static camera.
In the description below, we consider a pedestrian as our object of interest, but the strategies can be applied to any other objects, e.g., cars and motorbikes.
3.2.1 Sample proposal strategies

Subset 1: It corresponds to subsampling the specialized dataset resulting from the previous iteration to propagate the distribution from one iteration to another. The ratio between the positive and negative classes (typically the same as the one of the source dataset) should be respected. This subset approximates the term \(p(\mathbf{X}_{k} \mid \mathbf{Z}_{0:k})\) in Eq. 1, according to Eq. 8:
$$ p\left(\mathbf{X}_{k} \mid \mathbf{Z}_{0:k}\right)\approx \left\{\mathbf{X}^{*(n)}_{k+1}\right\}_{n=1,..,N^{*}} $$ (8)
where \(\mathbf {X}^{*(n)}_{k+1}\) is the sample n selected from \({\mathcal {D}}_{k}\) to be in the dataset of the next iteration (k+1) and \(N^{*}\) is the number of samples in this subset with \(N^{*}=\alpha _{t} N^{s}\), where \(\alpha _{t}\in [0,1]\). The parameter \(\alpha _{t}\) determines the number of samples to be propagated from the previous dataset.
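A minimal sketch of this propagation step follows. It assumes uniform random subsampling within each class, which the text does not specify; `propagate_subset` and `pos_ratio` are hypothetical names.

```python
import numpy as np


def propagate_subset(labels, n_source, alpha_t, pos_ratio, rng=None):
    """Pick N* = alpha_t * N^s indices from the previous dataset, keeping the class ratio."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    n_star = int(round(alpha_t * n_source))
    n_pos = int(round(pos_ratio * n_star))
    n_neg = n_star - n_pos
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == -1)
    # Draw without replacement unless a class holds fewer samples than requested.
    pick_pos = rng.choice(pos_idx, size=n_pos, replace=n_pos > pos_idx.size)
    pick_neg = rng.choice(neg_idx, size=n_neg, replace=n_neg > neg_idx.size)
    return np.concatenate([pick_pos, pick_neg])
```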

Subset 2: To get this subset, we train a new specialized classifier \(\Theta _{{\mathcal {D}}_{k}}\) on \({\mathcal {D}}_{k}\) and use it to detect pedestrians on a set of frames extracted uniformly from the target video sequence, using a multi-scale sliding-window technique. This technique covers a pedestrian by several bounding boxes, so a spatial mean-shift grouping function is used to merge the closest bounding boxes. The detector thus provides a set of samples classified as pedestrians, which contains both true and false detections. Herein, we suppose that each detection can be either a positive sample or a negative one. Thus, each detection is duplicated: one sample is labeled positively and the other one is labeled negatively. This subset is returned by Eq. 9:
$$ \left\{\breve{\mathbf{X}}^{(n)}_{k+1}\right\}_{n=1,..,\breve{N}_{k}} \doteq \left\{\left(\mathbf{x}^{(n)},y\right)\right\}_{y\in {\mathcal{Y}}\ ;\ \mathbf{x}^{(n)}\in {\mathcal{D}}^{t} / \Theta_{{{\mathcal{D}}}_{k}}\left(\mathbf{x}^{(n)}\right)>0} $$ (9)
\(\breve {\mathbf {X}}^{(n)}_{k+1}\) is the n ^{th} target sample proposed to be included in the dataset of the next iteration (k+1).
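The duplication of detections can be sketched in a few lines. The function name `propose_from_detections` and the `(x, y, w, h)` box layout are assumptions made for illustration; the detection and mean-shift grouping stages are taken as already done.

```python
def propose_from_detections(boxes, scores):
    """Duplicate every detection with a positive score under both labels."""
    proposals = []
    for box, score in zip(boxes, scores):
        if score > 0:  # keep windows with a positive classifier response
            proposals.append((box, 1))   # hypothesis: true detection
            proposals.append((box, -1))  # hypothesis: false detection
    return proposals
```

The observation step then decides which of the two hypotheses survives for each detection.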

Subset 3: In some cases, the previous specialized classifier would rather miss detections than give false positive ones; and it is difficult to favor a label for several samples in subset 2. This means that we cannot select enough negative target samples to specialize the classifier from subset 2.
In order to avoid such cases, we use computed background models (in our case, a median_background and a mean_background) to provide negative target samples and produce subset 3 according to Eq. 10:
$$ \left\{\breve{\mathbf{X}}^{'(n)}_{k+1}\right\}_{n=1,..,\breve{M}_{k}} \doteq \bigcup_{b_{j} \in \{b_{1},...,b_{m}\}} \left\{(\mathbf{x}^{'(n)},-1)\right\}_{\mathbf{x}^{'(n)}\in b_{j}} $$ (10)
where \((\mathbf {x}^{'(n)},-1)\) is a sample cropped from a target background model and labeled negatively. \(\breve {M}_{k}=m \cdot \breve {N}_{k}\) is the number of all background samples.
We crop a sample from each computed background model at the same position and with the same size as each selected sample returned by the classifier.
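The construction of subset 3 can be sketched as follows, assuming grayscale frames stored as NumPy arrays and boxes given as `(x, y, w, h)`. Both function names are hypothetical, and per-pixel median and mean are used here as the simplest reading of the two background models.

```python
import numpy as np


def background_models(frames):
    """Per-pixel median and mean background images from a list of frames."""
    stack = np.stack(frames).astype(float)
    return {
        "median_background": np.median(stack, axis=0),
        "mean_background": stack.mean(axis=0),
    }


def crop_background_negatives(models, boxes):
    """Crop every box from every background model and label the crop -1."""
    negatives = []
    for bg in models.values():
        for (x, y, w, h) in boxes:
            negatives.append((bg[y:y + h, x:x + w], -1))
    return negatives
```

With m background models and N detections the function returns m times N negative samples.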
3.2.2 Observation strategies
As depicted in Fig. 3, some target samples, known as “hard examples,” are misclassified. It is unreliable to use these samples directly with their predicted labels, yet leaving them out of the specialization process would discard samples that are probably informative. In what follows, we present several strategies for weighting the samples of subset 2 in order to choose the correct proposal using information extracted from the target scene.
1  Overlap accumulation scores: Our first strategy, called overlap accumulation scores (OAS), is based on two simple spatiotemporal cues: a background extraction overlap score and a temporal accumulation one.
In a traffic scene, pedestrians rarely remain stationary for a long time, and a good detection occurs on a foreground blob, whereas false positive background detections produce regions of interest (ROIs) that appear over time at the same location and with almost the same size.
Functions and notations used in Algorithm 2
Notation: definition
p: It is a spatiotemporal ROI position in the target video sequence (\({\mathcal {D}}^{t}\)).
\(compute\_overlap(\mathbf {p}, {\mathcal {D}}^{t})\): It computes an overlap_score of ROI p.
\(compute\_accumulation(\mathbf {p},{\mathcal {D}}^{t})\): It computes an accumulation_score of ROI p.
A positive sample is assigned a weight equal to its overlap score \(\lambda _{o}\) if \(\lambda _{o}\) exceeds a fixed threshold \(\alpha _{p}\), which is determined empirically. Otherwise, its weight is zero. Similar reasoning applies to a negative sample: its weight is its accumulation score \(\lambda _{a}\) if its \(\lambda _{o}\) is null and its \(\lambda _{a}\) is greater than zero. Otherwise, its weight is zero. Any sample with a null weight is rejected.
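The weighting rule just described fits in one small function. This is a sketch: `oas_weight` is a hypothetical name, and the two scores are assumed to be computed beforehand by the overlap and accumulation functions of Algorithm 2.

```python
def oas_weight(label, overlap_score, accumulation_score, alpha_p):
    """Weight of a proposed sample under the OAS observation strategy."""
    if label == 1:
        # A positive proposal is trusted only on a sufficiently large foreground overlap.
        return overlap_score if overlap_score > alpha_p else 0.0
    if label == -1:
        # A negative proposal needs no foreground overlap and a repeated occurrence.
        if overlap_score == 0 and accumulation_score > 0:
            return accumulation_score
        return 0.0
    raise ValueError("label must be +1 or -1")
```

Samples that receive a zero weight are dropped before the sampling step.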
2  KLT feature tracker: We propose a second strategy that uses the Kanade–Lucas–Tomasi (KLT) feature tracker [39, 40]. The latter aims to find, for each feature point (also called interest point) detected on video frame (i), a corresponding feature point detected on video frame (i+1).
It is more reliable to consider that a positive sample is a true positive one if its ROI contains more foreground feature points than background ones. Conversely, a negative sample is a true negative one if its ROI contains only background feature points or a very limited number of foreground ones.
Functions and notations used in Algorithm 3
Notation: definition
p: It is a spatiotemporal ROI position in the target video sequence (\({\mathcal {D}}^{t}\)).
compute_FRPts(p _{ i },{FPts _{ j }}_{ j=1..,L }): It computes the foreground feature points of ROI p.
compute_BKPts(p _{ i },{FPts _{ j }}_{ j=1..,L }): It computes the background feature points of ROI p.
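The decision rule of the KLT strategy can be sketched as a predicate on the two point counts. The name `klt_label_is_plausible` and the threshold `max_fg_for_negative` are assumptions: the text only speaks of "a very limited number" of foreground points, and the exact weight computed in Algorithm 3 is not reproduced here.

```python
def klt_label_is_plausible(label, n_foreground, n_background, max_fg_for_negative=2):
    """Check a proposed label against foreground/background feature-point counts."""
    if label == 1:
        # Positive: moving (foreground) points must dominate inside the ROI.
        return n_foreground > n_background
    if label == -1:
        # Negative: at most a few foreground points are tolerated (assumed threshold).
        return n_foreground <= max_fg_for_negative
    raise ValueError("label must be +1 or -1")
```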
3.2.3 Sampling strategy
where \(\breve {\mathbf {X}}^{*(n)}_{k+1}\) and \(\breve {\mathbf {X}}^{*'(n)}_{k+1}\) are the selected target samples for the next iteration (k+1) from subsets 2 and 3, respectively.
In general, these selected target samples may include some with false labels because they are automatically weighted. In addition, they are insufficient to generate an efficient classifier for the target scene. However, the source dataset contains labeled samples that are similar to the target ones and that should be beneficial to the specialization of the classifier.
The specialization process stops when the ratio between the cardinalities of two predicted datasets related to two consecutive iterations exceeds \(\alpha _{s}\) (\(\alpha _{s}=0.80\), fixed empirically in our case). Once the specialization is finished, the obtained classifier can be used for pedestrian detection and classification in the target scene based only on appearance.
4 Experimental results
In this section, we present and discuss the different experiments achieved in order to evaluate the performance of our specialization algorithm.
We used the HOG descriptor as a feature vector and trained the generic and specialized classifiers using SVMLight^{2}, for both the car and pedestrian cases.
4.1 Datasets
 
CUHK_Square dataset [16]: It is a video surveillance sequence of 60 min, recording a road traffic scene by a stationary camera. We uniformly extracted (as described in [16]) 452 frames from this video, of which the first 352 frames were used for the specialization and the last 100 frames were utilized for the test.
 
MIT traffic dataset [41]: A static camera was used to record a set of 20 short video sequences of 4 min 36 s each. From the first 10 videos, we extracted 420 frames for the specialization. Also, 100 frames were extracted from the last 10 videos for the test.
 
Logiroad traffic dataset: It is a recording of almost 20 min of a traffic scene, made by a stationary camera. The same procedure was applied. We uniformly extracted 700 frames from this video, of which the first 600 frames were used for the specialization and the last 100 frames were utilized for the test.
In our evaluation, we opted for the ground truth provided by Wang and Wang in [15] (noted MIT_P) and by Wang et al. (noted CUHK_P) in [16], to test the detection results of pedestrians on the MIT traffic dataset and on the CUHK_Square dataset, respectively. As there was no available car-annotated database to test the detection results, we proposed annotations relative to cars on both MIT and Logiroad traffic datasets. We note the latter MIT_C and LOG_C, respectively.
We applied the PASCAL rule [42] to compute the true positive rate and the receiver operating characteristic (ROC) curve, so as to compare the detectors’ performances. A detection is accepted if the overlap area between the detection window and the ground-truth blob exceeds 0.5 of the union area. A ROC curve presents the pedestrian detection rate for a given false positive rate per image. Note that we use the term “specialized classifier” when a conclusion holds for all classifiers provided by our framework, independently of the strategies used. Moreover, we apply the specialized classifier based only on object appearance, without prior information, at the test stage. In addition, any detection rate reported hereafter is always relative to one false positive per image (FPPI = 1).
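The acceptance test of the PASCAL rule amounts to an intersection-over-union check. The sketch below assumes axis-aligned boxes given as `(x, y, w, h)`; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detection, ground_truth, threshold=0.5):
    """A detection is accepted when its overlap ratio exceeds the threshold."""
    return iou(detection, ground_truth) > threshold
```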
4.2 Convergence evaluation
The Kullback–Leibler divergence (KLD) was another evaluation metric, used to measure the convergence of the estimated distribution towards the true target one. We computed the KLD between a set of pedestrians cropped manually from the specialization frames and the positive samples of the specialized dataset produced at each iteration. The KLD between two sets of realizations was computed as in the work of Boltz et al. [46]. Figure 9b indicates that the KLD decreases until it shows only minimal variation from iteration 4 onwards (corresponding to the stopping iteration) on the CUHK_Square dataset. The same behavior is observed on the other datasets.
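A divergence between two sets of realizations can be estimated directly from the samples. The sketch below uses a generic k-nearest-neighbour estimator; it is an assumption that this resembles the estimator of [46], and `knn_kl_divergence` is a hypothetical name. The dense pairwise distances are fine for small sets but memory-hungry for long feature vectors, and duplicated points would yield zero distances and break the logarithm.

```python
import numpy as np


def knn_kl_divergence(x, y, k=5):
    """k-NN estimate of KL(P || Q) from samples x ~ P (n x d) and y ~ Q (m x d)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = x.shape
    m = y.shape[0]
    if n <= k or m < k:
        raise ValueError("need more than k samples in x and at least k in y")
    # Pairwise Euclidean distances within x and from x to y.
    dxx = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    dxy = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1))
    # Column 0 of the sorted within-set distances is the point itself.
    rho = np.sort(dxx, axis=1)[:, k]
    nu = np.sort(dxy, axis=1)[:, k - 1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
```

A value close to zero indicates that the two sample sets are hard to tell apart, which is the behavior expected once the specialized dataset has converged.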
4.3 Effect of sample proposal strategies
Average duration of a specialization’s iteration on several datasets
Dataset  Nb. images  Image size  SMC_WB  SMC_B 

CUHK_P  352  1440×1152  60 min  84 min 
MIT_P  420  4320×2880  210 min  285 min 
MIT_C  420  720×480  14 min  28 min 
LOG_C  600  864×486  22 min  36 min 
4.4 Effect of observation strategies
We make a comparison between two observation strategies: the OAS and the KLT feature tracker in several cases. This comparison aims to prove the performance of the specialized detector compared to the generic one and to show that our proposed specialization is a general framework. It can be applied by combining or substituting many algorithms that extract visual context cues from a video recorded by a static camera.
To correctly evaluate the effect of the observation strategies, we adopt the SMC_B proposal strategy, which has given the best performance in the tests of Section 4.3 for all the experiments. We note SMC_B_OAS a specialized detector trained by applying our framework using the SMC_B as a proposal strategy and the OAS as an observation strategy. Also, SMC_B_KLT is noted when the SMC_B and KLT strategies are used.
On the MIT traffic dataset, in the case of pedestrians, our SMC_B_OAS detector improves the detection rate from 10 to 24% at the first iteration and it starts converging from the fourth iteration with 49% of true positive detections. However, the SMC_B_KLT detector converges by a rise of 22% compared to the performance of the generic detector. In the case of cars, we record for both SMC_B_OAS and SMC_B_KLT detectors a rise in the detection rate of 5% at the first iteration, compared to that of the generic detector. Then, the detection rate of the SMC_B_OAS detector moves to about 30% at the fourth iteration against an increase from 9 to 24% recorded by the SMC_B_KLT detector. We notice that the performance increases only slightly after the fourth iteration, which corresponds to the stopping iteration in our experiments.
In particular, on the Logiroad traffic dataset, the generic detector presents a detection rate equal to 32%. Nevertheless, our specialized SMC_B_OAS detector gives a detection rate equal to 20% at the first iteration and then converges with 45% from the fourth iteration. The performance of the SMC_B_KLT detector decreases to 16% at the first iteration and then goes up to 47% at the stopping iteration. We explain the decline at the first iteration by injecting an interest object (failed to be weighted correctly by the spatiotemporal scores because it is temporarily stationary) as a negative sample in the specialized dataset. This means that this sample is detected by the detector but misclassified by the observation strategy, which may disturb the specialization process.
On the other hand, we record a slight fall in most of the final detection rates of the SMC_B_KLT detector, compared to those reached by the SMC_B_OAS detector. We can clearly see an improvement generated by our proposed specialization framework independently from the strategies used on each step.
4.5 Combination of both observation strategies
Table 4 Detection performance (in percent) of several detectors according to the observation strategy used (at FPPI = 1)
                                          Specialised detector
Object      Dataset   Generic  Iteration  OAS   KLT   Fusion
Pedestrian  CUHK      26.6     it_f       53.7  46.5  66.5
                               it_c       81.3  59.6  76.7
            MIT       10       it_f       24.2  22    26.3
                               it_c       49    44.1  45.8
Car         MIT       9        it_f       15.8  14.7  17.2
                               it_c       28.7  23.8  31.5
            Logiroad  33.5     it_f       20.8  16    25.8
                               it_c       45.6  47    46.8
Table 4 shows again that our framework can be applied with any observation strategy. Combining the two observation strategies generally yields a small improvement in classifier performance, although in some cases a single strategy gives a better detection rate than Fusion.
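The comparison between strategies can be checked by simple arithmetic on the values of Table 4. The short Python sketch below copies the specialized-detector rates from the table and counts, row by row, which observation strategy gives the highest detection rate. The names `TABLE_4`, `best_strategy`, and `count_wins` are illustrative and are not part of the published framework.

```python
# Detection rates (percent, at FPPI = 1) copied from Table 4.
# Keys: (object, dataset, iteration); values: rate per observation strategy.
TABLE_4 = {
    ("pedestrian", "CUHK", "it_f"): {"OAS": 53.7, "KLT": 46.5, "Fusion": 66.5},
    ("pedestrian", "CUHK", "it_c"): {"OAS": 81.3, "KLT": 59.6, "Fusion": 76.7},
    ("pedestrian", "MIT", "it_f"): {"OAS": 24.2, "KLT": 22.0, "Fusion": 26.3},
    ("pedestrian", "MIT", "it_c"): {"OAS": 49.0, "KLT": 44.1, "Fusion": 45.8},
    ("car", "MIT", "it_f"): {"OAS": 15.8, "KLT": 14.7, "Fusion": 17.2},
    ("car", "MIT", "it_c"): {"OAS": 28.7, "KLT": 23.8, "Fusion": 31.5},
    ("car", "Logiroad", "it_f"): {"OAS": 20.8, "KLT": 16.0, "Fusion": 25.8},
    ("car", "Logiroad", "it_c"): {"OAS": 45.6, "KLT": 47.0, "Fusion": 46.8},
}


def best_strategy(rates):
    """Return the name of the strategy with the highest detection rate."""
    return max(rates, key=rates.get)


def count_wins(table):
    """Count, over all rows, how often each strategy is the best one."""
    wins = {}
    for rates in table.values():
        name = best_strategy(rates)
        wins[name] = wins.get(name, 0) + 1
    return wins


if __name__ == "__main__":
    for key, rates in TABLE_4.items():
        print(key, "->", best_strategy(rates))
    print(count_wins(TABLE_4))
```

With these values, Fusion is the best strategy in five of the eight rows, OAS in two (CUHK and MIT pedestrians at it_c), and KLT in one (Logiroad at it_c).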
4.6 Comparison with state-of-the-art algorithms
In our proposed application, we assume that the target scene is monitored by a static camera. This assumption helps us to extract our visual context cues; however, our approach may still be used with a mobile camera, provided that other context information can be extracted.

Generic [9]: A HOG-SVM detector was built and trained on the INRIA dataset, as proposed in [9] by Dalal and Triggs.

Manual labeling: A target detector was trained on a set of labeled target samples. This set was composed of all the pedestrians of the specialization images (positive samples) and of negative samples extracted randomly from the same images, with no overlap with the pedestrian bounding boxes.

Nair 2004 [26]: A HOG-SVM detector was created in a similar way to the one suggested in [26], except that the HOG descriptor was used as the feature vector and the SVM replaced the Winnow classifier. An automatic adaptation approach picked out the target samples to be added to the initial training dataset using the output of the background subtraction method.

Wang 2014 [17]: A specific target scene detector was trained on both INRIA samples and samples extracted and labeled automatically from the target scene. The target and the source samples that had a high confidence score were selected. The scores were calculated using several contextual cues, and the selection was done by a method called “confidence-encoded SVM,” which favors samples with a high score and integrates the confidence score into the objective function of the classifier.

Mao 2015 [19]: A detector was trained on target samples labeled automatically by using tracklets and by information propagation from labeled tracklets to uncertain ones.
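As an illustration of the negative-sample extraction described for the manual-labeling baseline, the following Python sketch draws random windows and keeps only those with no overlap with any pedestrian bounding box. It is a minimal sketch under stated assumptions: boxes are (x, y, width, height) tuples, the 64 × 128 window in the usage lines is the usual HOG pedestrian window and is assumed here, and the function names are hypothetical. The paper does not describe its sampling code.

```python
import random


def boxes_overlap(a, b):
    """Return True if two axis-aligned (x, y, w, h) boxes share a non-empty area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def sample_negatives(image_size, positives, box_size, count, rng=None, max_tries=10000):
    """Draw up to `count` random boxes that overlap none of the positive boxes."""
    if rng is None:
        rng = random.Random()
    img_w, img_h = image_size
    box_w, box_h = box_size
    negatives = []
    tries = 0
    # Rejection sampling: stop early if the image is too crowded (max_tries).
    while len(negatives) < count and tries < max_tries:
        tries += 1
        x = rng.randint(0, img_w - box_w)
        y = rng.randint(0, img_h - box_h)
        candidate = (x, y, box_w, box_h)
        if not any(boxes_overlap(candidate, p) for p in positives):
            negatives.append(candidate)
    return negatives


if __name__ == "__main__":
    # Hypothetical annotation: one pedestrian box in a 640 x 480 frame.
    pedestrians = [(100, 100, 64, 128)]
    negs = sample_negatives((640, 480), pedestrians, (64, 128), 5, rng=random.Random(0))
    print(negs)
```

Rejection sampling keeps the sketch short; the function may return fewer than `count` boxes if little free space remains in the image.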
On the MIT traffic dataset (Fig. 14b), the detection rate improves from 10 to 47%. The MIT specialized SMC_B_OAS detector exceeds the detector trained on the labeled target samples by about 21%. Compared to the detector of Nair and Clark (Nair 2004), our specialized SMC_B_OAS detector gives a better detection rate for an FPPI less than 1; beyond that, the Nair 2004 detector slightly exceeds ours. The ROC curves show that our specialized detector gives a detection rate comparable to that of the Wang 2014 detector. Note that shadows in the MIT video affect the weighting and the selection of correct positive samples.
Our SMC specialization process converges after only a few iterations in four cases: two for pedestrian detection and two for car detection. In our experiments, we used different strategies at each step of our filter, which supports the generality of our approach.
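Detection rates quoted at a fixed FPPI are read from a curve of detection rate against false positives per image. The Python sketch below shows one way to read such a value, assuming linear interpolation between measured points. The function name `detection_rate_at` and the curve values are made up for illustration and are not results from the paper.

```python
def detection_rate_at(fppi_target, curve):
    """Linearly interpolate the detection rate at a given FPPI.

    `curve` is a list of (fppi, detection_rate) pairs sorted by increasing fppi.
    Values outside the measured range are clamped to the nearest end point.
    """
    if not curve:
        raise ValueError("curve must contain at least one point")
    if fppi_target <= curve[0][0]:
        return curve[0][1]
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fppi_target <= x1:
            if x1 == x0:
                return max(y0, y1)
            t = (fppi_target - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return curve[-1][1]


if __name__ == "__main__":
    # Hypothetical operating points, NOT taken from the paper.
    toy_curve = [(0.1, 0.20), (0.5, 0.35), (2.0, 0.50)]
    print(round(detection_rate_at(1.0, toy_curve), 2))
```

Two detectors whose curves cross, as reported for the Nair 2004 detector and ours around an FPPI of 1, would rank differently depending on the FPPI at which this function is evaluated.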
We notice that the OAS strategy rejects any positive sample whose weight is less than the fixed threshold α_{p}, which reduces the number of positive samples. Conversely, a static pedestrian associated with a negative label can have a high weight, because the detector finds this pedestrian at the same location in some frames, giving a null overlap_score and a high accumulation_score. The KLT feature tracker allows us to select more positive samples but may reduce the number of negative ones. We also note that executing both strategies together and combining their outputs (as in the test of Section 4.5) only slightly changes the performance of the specialized SMC_B_OAS classifier.
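The effect of the threshold on the number of retained positive samples can be sketched in a few lines of Python. This is only an illustration of the rejection rule stated above (a positive sample is rejected when its weight is less than the threshold); the function name `filter_positives`, the dictionary layout, and the weights are hypothetical, and the computation of the weights themselves is not reproduced here.

```python
def filter_positives(samples, alpha_p):
    """Keep the positive samples whose weight is not less than alpha_p."""
    return [s for s in samples if s["weight"] >= alpha_p]


if __name__ == "__main__":
    # Hypothetical weighted positive proposals (weights in [0, 1]).
    proposals = [
        {"id": 0, "weight": 0.90},
        {"id": 1, "weight": 0.40},
        {"id": 2, "weight": 0.75},
        {"id": 3, "weight": 0.10},
    ]
    for alpha_p in (0.3, 0.5, 0.8):
        kept = filter_positives(proposals, alpha_p)
        print(alpha_p, [s["id"] for s in kept])
```

Raising the threshold keeps fewer positive proposals, which is the reduction in positive samples noted above.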
Although the proposed observation strategies validate our general framework, the use of other strategies and the combination with other spatiotemporal information can enhance the performance provided by our approach and accelerate the convergence of the specialization process.
5 Conclusions
The suggested TTL-SMC filter automatically specializes a generic detector towards a specific scene. It estimates the unknown target distribution by selecting relevant samples from both source and target datasets. These samples are used to learn a specialized classifier that substantially improves the detection rate in the target scene.
We have validated the suggested method on several challenging datasets, applied it to pedestrian and car detection, and tested it with different strategies. The experiments have demonstrated that the proposed specialization gives a good performance starting from the first iteration. Moreover, the results show that our method gives a performance comparable to Wang’s approach on the MIT traffic dataset and exceeds the state-of-the-art performance on two public datasets.
As future work, we plan to combine our framework with fast feature computation techniques to accelerate the specialization process and to extend the proposed approach to a multi-object framework. In addition, we will improve the observation strategies by combining more spatiotemporal information, and we may apply our algorithm to specialize a CNN classifier.
6 Endnotes
^{1} http://www.cs.ubc.ca/research/flann/
^{2} http://svmlight.joachims.org
^{3} Video sequences provided by Logiroad company
Notes
Declarations
Acknowledgements
This work is within the scope of a cotutelle. It is supported by a CIFRE convention with the company Logiroad and it has been sponsored by the French government research program “Investissements d’avenir” through the IMobS3 Laboratory of Excellence (ANR-10-LABX-16-01), by the European Union through the program Regional competitiveness and employment 2007–2013 (ERDF – Auvergne region), and by the Auvergne region.
Authors’ contributions
HM carried out the studies on transfer learning approaches and on the theory of the sequential Monte Carlo filter, proposed the general framework and the observation strategies, performed all the experiments, and drafted the manuscript. TC validated the proposed framework in the context of transfer learning, supervised and participated in the design of the work, and helped to draft the manuscript. SG participated in the validation of the theory and the experiments. YG provided the Logiroad videos and helped to formalize the experiments. NEBA supervised the whole work and helped to draft the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
1. S Alvarez, M Sotelo, I Parra, D Llorca, M Gavilán, in Proceedings of the World Congress on Engineering and Computer Science (WCECS), 2. Vehicle and pedestrian detection in esafety applications, (2009), pp. 1–6.
2. R Danescu, F Oniga, S Nedevschi, Modeling and tracking the driving environment with a particle-based occupancy grid. ITS. 12(4), 1331–1342 (2011).
3. F Han, Y Shan, R Cekander, HS Sawhney, R Kumar, in Performance Metrics for Intelligent Systems (PMIS) 2006 Workshop. A two-stage approach to people and vehicle detection with HOG-based SVM (Citeseer, 2006), pp. 133–140.
4. BF Lin, YM Chan, LC Fu, PY Hsiao, LA Chuang, SS Huang, MF Lo, Integrating appearance and edge features for sedan vehicle detection in the blind-spot area. ITS. 13(2), 737–747 (2012).
5. S Sivaraman, MM Trivedi, Vehicle detection by independent parts for urban driver assistance. ITS. 14(4), 1597–1608 (2013).
6. D Sun, J Watada, in Intelligent Signal Processing (WISP), 9th International Symposium on. Detecting pedestrians and vehicles in traffic scene based on boosted HOG features and SVM (IEEE, 2015), pp. 1–4.
7. Q Yuan, A Thangali, V Ablavsky, S Sclaroff, Learning a family of detectors via multiplicative kernels. PAMI. 33(3), 514–530 (2011).
8. X Zhang, N Zheng, in Intelligent Transportation Systems (ITSC), 13th International IEEE Conference on. Vehicle detection under varying poses using conditional random fields (IEEE, 2010), pp. 875–880.
9. N Dalal, B Triggs, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1. Histograms of oriented gradients for human detection (IEEE, 2005), pp. 886–893.
10. P Dollár, Z Tu, P Perona, S Belongie, in British Machine Vision Conference, BMVC 2009, Proceedings. Integral channel features (British Machine Vision Association, 2009), pp. 1–11.
11. P Dollár, S Belongie, P Perona, in British Machine Vision Conference, BMVC 2010, Proceedings, vol. 2, issue 3. The fastest pedestrian detector in the west (British Machine Vision Association, 2010), pp. 1–11.
12. P Dollár, R Appel, S Belongie, P Perona, Fast feature pyramids for object detection. PAMI. 36(8), 1532–1545 (2014).
13. PF Felzenszwalb, RB Girshick, D McAllester, in The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010. Cascade object detection with deformable part models (IEEE, 2010), pp. 2241–2248.
14. P Felzenszwalb, D McAllester, D Ramanan, in Computer Vision and Pattern Recognition, CVPR 2008, IEEE Conference on. A discriminatively trained, multiscale, deformable part model (IEEE, 2008), pp. 1–8.
15. M Wang, X Wang, in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. Automatic adaptation of a generic pedestrian detector to a specific traffic scene (IEEE, 2011), pp. 3401–3408.
16. M Wang, W Li, X Wang, in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. Transferring a generic pedestrian detector towards specific scenes (IEEE, 2012), pp. 3274–3281.
17. X Wang, M Wang, W Li, Scene-specific pedestrian detection for static video surveillance. PAMI. 36(2), 361–374 (2014).
18. X Li, M Ye, M Fu, P Xu, T Li, Domain adaption of vehicle detector based on convolutional neural networks. IJCAS. 13(4), 1020–1031 (2015).
19. Y Mao, Z Yin, in 2015 IEEE Winter Conference on Applications of Computer Vision (WACV). Training a scene-specific pedestrian detector using tracklets (IEEE, 2015), pp. 170–176.
20. H Maâmatou, T Chateau, S Gazzah, Y Goyat, N Essoukri Ben Amara, in Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) - Volume 4: VISAPP. Transductive transfer learning to specialize a generic classifier towards a specific scene (SciTePress, 2016), pp. 411–422.
21. A Doucet, N De Freitas, N Gordon, Sequential Monte Carlo methods in practice (Springer Science + Business Media, New York, 2001).
22. M Isard, A Blake, Condensation—conditional density propagation for visual tracking. IJCV. 29(1), 5–28 (1998).
23. I Smal, W Niessen, E Meijering, in 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. Advanced particle filtering for multiple object tracking in dynamic fluorescence microscopy images (IEEE, 2007), pp. 1048–1051.
24. X Mei, H Ling, Robust visual tracking and vehicle classification via sparse representation. PAMI. 33(11), 2259–2272 (2011).
25. C Rosenberg, M Hebert, H Schneiderman, in Application of Computer Vision, 2005. WACV/MOTIONS ’05 Volume 1. Seventh IEEE Workshops on. Semi-supervised self-training of object detection models (IEEE Press, 2005), pp. 29–36.
26. V Nair, JJ Clark, in Computer Vision and Pattern Recognition (CVPR), Proceedings of the 2004 IEEE Conference on, 2. An unsupervised, online learning framework for moving object detection (IEEE, 2004), pp. II–317.
27. A Levin, P Viola, Y Freund, in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on (ICCV). Unsupervised improvement of visual detectors using co-training (IEEE, 2003), pp. 626–633.
28. T Chesnais, N Allezard, Y Dhome, T Chateau, in VISAPP 2012 - Proceedings of the International Conference on Computer Vision Theory and Applications, Volume 1. Automatic process to build a contextualized detector (SciTePress, 2012), pp. 513–520.
29. SJ Pan, Q Yang, A survey on transfer learning. KDE. 22(10), 1345–1359 (2010).
30. T Tommasi, F Orabona, B Caputo, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. Safety in numbers: learning categories from few examples with multi model knowledge transfer (IEEE, 2010), pp. 3081–3088.
31. Y Aytar, A Zisserman, in 2011 International Conference on Computer Vision (ICCV). Tabula rasa: model transfer for object category detection (IEEE, 2011), pp. 2252–2259.
32. SJ Pan, IW Tsang, JT Kwok, Q Yang, Domain adaptation via transfer component analysis. NN. 22(2), 199–210 (2011).
33. B Quanz, J Huan, M Mishra, Knowledge transfer with low-quality data: a feature extraction issue. KDE. 24(10), 1789–1802 (2012).
34. JJ Lim, R Salakhutdinov, A Torralba, in Advances in Neural Information Processing Systems (NIPS). Transfer learning by borrowing examples for multiclass object detection, (2011), pp. 118–126.
35. K Tang, V Ramanathan, L Fei-Fei, D Koller, in Advances in Neural Information Processing Systems (NIPS). Shifting weights: adapting object detectors from image to video, (2012), pp. 638–646.
36. X Zeng, W Ouyang, M Wang, X Wang, in European Conference on Computer Vision (ECCV). Deep learning of scene-specific classifier for pedestrian detection (Springer, 2014), pp. 472–487.
37. K Ali, D Hasler, F Fleuret, in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. Flowboost–appearance learning from sparsely annotated video (IEEE, 2011), pp. 1433–1440.
38. M Oquab, L Bottou, I Laptev, J Sivic, in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR). Learning and transferring mid-level image representations using convolutional neural networks (IEEE, 2014), pp. 1717–1724.
39. C Tomasi, T Kanade, Detection and tracking of point features (School of Computer Science, Carnegie Mellon Univ. Pittsburgh, Pittsburgh, 1991).
40. J Shi, C Tomasi, in Computer Vision and Pattern Recognition (CVPR), 1994 IEEE Conference on. Good features to track (IEEE, 1994), pp. 593–600.
41. X Wang, X Ma, WEL Grimson, Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. PAMI. 31(3), 539–555 (2009).
42. M Everingham, L Van Gool, CK Williams, J Winn, A Zisserman, The pascal visual object classes (VOC) challenge. IJCV. 88(2), 303–338 (2010).
43. P Carbonetto, G Dorkó, C Schmid, H Kück, N De Freitas, Learning to recognize objects with little supervision. IJCV. 77(1–3), 219–237 (2008).
44. S Agarwal, A Awan, D Roth, Learning to detect objects in images via a sparse, part-based representation. PAMI. 26(11), 1475–1490 (2004).
45. B Philip, P Updike, Caltech Computational Vision Caltech Cars 2001 (Rear). http://www.vision.caltech.edu/archive.html.
46. S Boltz, E Debreuve, M Barlaud, High-dimensional statistical measure for region-of-interest tracking. IP. 18(6), 1266–1283 (2009).