Automated identification of animal species in camera trap images
EURASIP Journal on Image and Video Processing, volume 2013, Article number: 52 (2013)
Image sensors are increasingly being used in biodiversity monitoring, with each study generating many thousands or millions of pictures. Efficiently identifying the species captured in each image is a critical challenge for the advancement of this field. Here, we present an automated species identification method for wildlife pictures captured by remote camera traps. Our process starts with images in which the animals have been cropped out of the background. We then use an improved sparse coding spatial pyramid matching (ScSPM) method, which extracts dense SIFT descriptors and cell-structured LBP (cLBP) as local features, generates a global feature via weighted sparse coding and max pooling with a multi-scale pyramid kernel, and classifies the images with a linear support vector machine. Weighted sparse coding is used to enforce both sparsity and locality of the encoding in feature space. We tested the method on a dataset of over 7,000 camera trap images of 18 species from two different field sites and achieved an average classification accuracy of 82%. Our analysis demonstrates that the combination of SIFT and cLBP can serve as a useful technique for animal species recognition in real, complex scenarios.
Monitoring biodiversity, especially the effects of climate and land-use change on wild populations, is a critical challenge for our society. Sensor networks are a promising approach for collecting spatio-temporal data at the scales needed to address this challenge, especially visual sensors that record images of animals moving across their field of view (i.e., camera traps [3, 4]). However, processing the large volumes of images that such studies generate in order to identify the species recorded remains a challenge.
At present, all camera-based studies of wildlife use a manual approach in which researchers examine each photograph to identify the species in the frame. For studies collecting many tens or hundreds of thousands of photographs, this is a daunting task.
Computer-assisted species recognition on camera-trap images could make this workflow more efficient and reduce, if not remove, the manual work involved. However, compared with typical surveillance video of buildings and streets, camera-trap images of animals amid vegetation are more difficult to incorporate into image analysis routines because of low frame rates, background clutter, poor illumination, severe occlusion, and the complex poses of the animals.
Inspired by recent object recognition work [6, 7] in the computer vision community, we improved the sparse coding spatial pyramid matching (ScSPM) method for species recognition on images collected by camera traps. For local feature extraction, we combined dense scale-invariant feature transform (SIFT) features with cell-structured local binary patterns (cLBP) to represent the object of interest. We applied weighted sparse coding for dictionary learning to enforce both sparsity and locality, since locality may be more important than sparsity, as suggested by Wang et al. We then used a linear SVM to classify each image by species.
We tested our method with images collected by camera traps that were deployed in two different environments, tropical rainforest and temperate forest, that represent a wide variety of backgrounds and conditions. From this collection, we selected sequences and species to keep the data balanced. Then, we manually cropped animals from all the frames to generate a dataset with 7,196 images over 18 different vertebrate species.
2 Related work
Most related work consists of camera-based wildlife studies that use image analysis to identify individual animals of selected species with unique coat patterns (e.g., spots or stripes). Bolger et al. applied software to help identify individual animals based on coat patterns for subsequent photographic mark-recapture analysis. Their data were image based, which is a cost-effective, non-invasive way to study populations. Their method relied on extracting and matching SIFT key points. It therefore addressed only the identification of individuals within strongly patterned species, not the recognition of species.
Identifying species from remote camera images remains a major challenge that has not been addressed. In the computer vision community, many methods exist for general object recognition. One of the most successful is the work of Yang et al., in which ScSPM is applied. Spatial pyramid matching (SPM) with max pooling can not only model the spatial layout of local image features but also achieve invariance to translation of the animal body. Although simple to construct, the SPM kernel turns out to be highly effective in practice. Sparse coding has been successfully applied to model local features and to construct an overcomplete dictionary that can sparsely represent them. Sparse coding can yield better results than vector quantization with hard assignment.
3 Materials and methods
Our pattern extraction and classification program is based on ScSPM, as shown in Figure 1. The algorithm first densely extracts local feature descriptors. We combine two kinds of local descriptors: SIFT and cLBP. To represent local features sparsely, a dictionary is learned via weighted sparse coding for each kind of descriptor. Similar local features generate similar codes after sparse coding on the dictionary, which is essential for recognition because it retains discriminative information while suppressing noise. Finally, max pooling with SPM is used to construct the global image feature, which converts an image or a bounding box into a single vector. We then apply linear multi-class SVMs, trained beforehand on training data, to assign the global feature to one species category.
3.1 Local feature extraction
The camera-trap images contain substantial noise and clutter. This requires a local feature that is both discriminative and invariant to describe local image patches. The dense SIFT feature, also known as the dense histogram of oriented gradients, has been used successfully in recognition work. The SIFT descriptor is invariant to moderate scaling and shifting of edges and to linear illuminance variation in an image patch; however, it fails when nonlinear illuminance change occurs. cLBP, in contrast, is a local texture descriptor that is invariant to moderate nonlinear illuminance variation. In computer vision, for human detection, HOG and cLBP features have been concatenated to obtain the final feature. Simple concatenation, however, can make the feature space more complex and more difficult to classify. We thus followed the procedure of Zhang et al. to extract HOG and cLBP, and concatenated the responses only after coding them separately.
The SIFT descriptor is similar to HOG: both are histograms of oriented gradients. The SIFT descriptor is illustrated in Figure 2. After calculating the gradient map for each image, SIFT creates oriented gradient histograms over a 4 × 4 grid of regions, instead of 2 × 2 as in HOG. The full 128-dimensional SIFT descriptor is created by concatenating the 16 histograms (8 orientation bins each) of a 16 × 16 image patch.
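The 4 × 4 × 8 = 128 layout described above can be illustrated with a minimal NumPy sketch. It is a simplified stand-in rather than the descriptor used in the paper: the function name `sift_like_descriptor` is hypothetical, and Gaussian weighting and the bin interpolation of full SIFT are deliberately omitted.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, bins=8):
    """Orientation histograms over a grid x grid layout of cells (simplified SIFT)."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)                      # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), 2 * np.pi)    # orientation in [0, 2*pi)
    bin_index = np.minimum((angle / (2 * np.pi) * bins).astype(int), bins - 1)

    cell = patch.shape[0] // grid                    # 16 // 4 = 4 pixels per cell side
    histograms = []
    for row in range(grid):
        for col in range(grid):
            window = (slice(row * cell, (row + 1) * cell),
                      slice(col * cell, (col + 1) * cell))
            # Magnitude-weighted vote of every pixel in the cell.
            hist = np.bincount(bin_index[window].ravel(),
                               weights=magnitude[window].ravel(),
                               minlength=bins)
            histograms.append(hist)

    descriptor = np.concatenate(histograms)          # 16 cells * 8 bins
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

rng = np.random.default_rng(0)
patch = rng.random((16, 16))                         # toy patch, not real image data
descriptor = sift_like_descriptor(patch)
print(descriptor.shape)                              # (128,)
```

The unit-norm step at the end matches the normalization of descriptors reported in Section 4.2.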
cLBP is a texture descriptor that extracts a histogram of LBP patterns from local cells, as shown in Figure 2. To filter out noise, LBP is restricted to uniform patterns. We use the notation $\mathrm{LBP}_{n,r}^{u}$ to denote an LBP feature that takes n sample points at radius r and allows no more than u transitions between 0 and 1 (in either direction) around the circular bit pattern. A pattern that satisfies this constraint is called a uniform pattern. For example, the pattern 0010010 has four such transitions, so it is a nonuniform pattern for $\mathrm{LBP}^{2}$ and a uniform pattern for $\mathrm{LBP}^{4}$. In our approach, we set u = 2, n = 8, and r = 1. In this setting, the dimension of the LBP histogram is 59: 58 uniform patterns plus one bin shared by all nonuniform patterns.
The rationale for combining SIFT and cLBP is as follows. At the pixel level, the oriented gradient is assigned to 8 bins in SIFT, whereas uniform LBP uses 59 bins. At the cell level, SIFT uses 16 cells, whereas cLBP uses only 1. SIFT is therefore fine-grained at the cell level but coarse, and thus invariant, at the pixel level, while the opposite holds for cLBP. Combining the two balances the trade-off between discrimination and invariance at both the pixel and the cell level.
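The uniform-pattern bookkeeping for u = 2, n = 8, r = 1 can be checked with a short pure-Python sketch. The helper names (`circular_transitions`, `uniform_lbp_table`, `lbp_code`, `cell_histogram`) and the neighbour ordering are assumptions made for illustration, not the authors' implementation.

```python
def circular_transitions(bits):
    """Count 0/1 changes between neighbouring bits, wrapping around the circle."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n))

def uniform_lbp_table(n=8, u=2):
    """Map each n-bit code to a histogram bin; nonuniform codes share the last bin."""
    table = {}
    next_bin = 0
    for code in range(2 ** n):
        bits = [(code >> k) & 1 for k in range(n)]
        if circular_transitions(bits) <= u:
            table[code] = next_bin
            next_bin += 1
    overflow_bin = next_bin
    for code in range(2 ** n):
        table.setdefault(code, overflow_bin)
    return table, overflow_bin + 1

# Eight neighbours at radius 1, visited in circular order (assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(image, y, x):
    """8-bit LBP code of pixel (y, x): bit k is set if neighbour k >= centre."""
    centre = image[y][x]
    code = 0
    for k, (dy, dx) in enumerate(OFFSETS):
        if image[y + dy][x + dx] >= centre:
            code |= 1 << k
    return code

def cell_histogram(image, table, n_bins):
    """Histogram of uniform-LBP bins over the interior pixels of one cell."""
    hist = [0] * n_bins
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[table[lbp_code(image, y, x)]] += 1
    return hist

table, n_bins = uniform_lbp_table()
print(n_bins)                                          # 59
print(circular_transitions([0, 0, 1, 0, 0, 1, 0]))     # 4

toy_cell = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]  # toy data
hist = cell_histogram(toy_cell, table, n_bins)
```

The count of 59 arises from 2 constant patterns, 56 patterns with exactly two transitions, and one shared bin for everything else.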
3.2 Dictionary learning and weighted sparse coding
The goal of dictionary learning is to capture high-level information, that is, to select a set of items that describes the distribution of the input space. We obtain a local image feature set X by random sampling in feature space, so that X approximates the distribution of the input space. However, X contains a huge number of signals, which makes it impractical to use X directly for coding. Dictionary learning aims to generate a compact dictionary that can sparsely represent an incoming signal with minimum error.
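Let X be a set of N local features in a D-dimensional feature space, i.e., $X = [x_1, \cdots, x_N] \in \mathbb{R}^{D \times N}$. The dictionary is $V = [v_1, \cdots, v_K]$ with K atoms. The traditional dictionary learning and sparse coding method formulates the problem as follows:

$$\min_{U,V} \sum_{i=1}^{N} \left( \|x_i - V u_i\|_2^2 + \lambda \|u_i\|_1 \right) \quad \text{s.t.} \quad \|v_k\|_2 \le 1, \; k = 1, \cdots, K$$

where $U = [u_1, \cdots, u_N] \in \mathbb{R}^{K \times N}$ is the matrix of sparse codes and $\lambda$ weights the sparsity penalty.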
Inspired by the work of Wang et al.  in which encoding of features is based on the locality in the feature space, we adapt the original sparse coding to the weighted sparse coding as follows to enforce both sparsity and locality:
where W is a diagonal weighting matrix whose elements are computed as
Many algorithms have been proposed to solve this dictionary learning problem. V is commonly referred to as a codebook; it is trained once and then kept fixed in the testing phase. Recently, there has been much work on supervised dictionary learning (e.g., [17, 18]) to adapt the dictionary for classification, but it is often computationally expensive and does not handle large multi-class problems well. Thus, our work employs unsupervised dictionary learning using weighted sparse coding, as in Equation 2.
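To make the idea of a locality-weighted sparsity penalty concrete, the following sketch codes a single descriptor against a fixed dictionary. It is an illustration under stated assumptions, not the authors' Equation 2: the weight formula (an exponential of the distance between the descriptor and each atom, in the spirit of Wang et al.), the parameters `lam` and `sigma`, and the use of the ISTA solver are all choices made here for the example.

```python
import numpy as np

def objective(x, V, u, w, lam):
    """0.5 * ||x - V u||^2 + lam * sum_j w_j * |u_j|."""
    return 0.5 * np.sum((x - V @ u) ** 2) + lam * np.sum(w * np.abs(u))

def weighted_sparse_code(x, V, lam=0.1, sigma=1.0, n_iter=200):
    """Code x on a fixed dictionary V with a weighted l1 penalty, solved by ISTA."""
    # Assumed locality weights: atoms far from x are penalised more heavily.
    dist = np.linalg.norm(V - x[:, None], axis=0)
    w = np.exp((dist - dist.min()) / sigma)          # smallest weight is 1
    step = 1.0 / (np.linalg.norm(V, 2) ** 2)         # 1 / Lipschitz constant
    u = np.zeros(V.shape[1])
    for _ in range(n_iter):
        grad = V.T @ (V @ u - x)                     # gradient of the quadratic term
        z = u - step * grad
        # Soft-thresholding with a per-atom threshold (weighted l1 proximal step).
        u = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)
    return u, w

rng = np.random.default_rng(0)
V = rng.normal(size=(64, 32))                        # toy dictionary: D = 64, K = 32
V /= np.linalg.norm(V, axis=0)                       # unit-norm atoms
x = V[:, 3].copy()                                   # descriptor identical to atom 3
u, w = weighted_sparse_code(x, V)
print(int(np.argmax(np.abs(u))), int(np.count_nonzero(u)))
```

Because nearby atoms are cheap and distant atoms are expensive, the code concentrates on atoms close to the descriptor, which is the combination of sparsity and locality that the text describes.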
3.3 Linear SPM and multi-scale max pooling
Spatial pyramid matching is an extension of the bag-of-words (BoW) method that models the spatial layout of local image features at multiple scales. Figure 1 illustrates the whole structure of ScSPM. Let U be the matrix of sparse codes obtained by applying Equation 2 to a descriptor set X, assuming the codebook V is pre-computed. In each cell of the pyramid, a max pooling function is applied to the absolute values of the sparse codes:

$$z_j = \max\{ |u_{j1}|, |u_{j2}|, \cdots, |u_{jM}| \}$$

where $z_j$ is the j-th element of the pooled vector z, $u_{ji}$ is the element in the j-th row and i-th column of U, and M is the number of local descriptors in the cell. The pooled features from the various locations and scales are then concatenated to form a spatial pyramid representation of the image. Max pooling is beneficial for translation invariance because the maximum response within a cell is unchanged when a feature undergoes a small translation.
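Max pooling over a spatial pyramid can be sketched as follows. The pyramid levels (1 × 1, 2 × 2, and 4 × 4 cells) are an assumption borrowed from common ScSPM practice, since the text does not state them, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def spatial_pyramid_max_pool(codes, xy, width, height, levels=(1, 2, 4)):
    """codes: (K, M) sparse codes; xy: (M, 2) patch centres as (x, y) pixels."""
    K = codes.shape[0]
    abs_codes = np.abs(codes)
    pooled = []
    for cells in levels:
        # Assign every patch to one cell of a cells x cells grid.
        col = np.minimum((xy[:, 0] * cells / width).astype(int), cells - 1)
        row = np.minimum((xy[:, 1] * cells / height).astype(int), cells - 1)
        for r in range(cells):
            for c in range(cells):
                inside = (row == r) & (col == c)
                if inside.any():
                    pooled.append(abs_codes[:, inside].max(axis=1))
                else:
                    pooled.append(np.zeros(K))       # empty cell contributes zeros
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
K, M = 5, 50                                         # toy sizes, not the paper's
codes = rng.normal(size=(K, M))
xy = rng.uniform(0, 64, size=(M, 2))
z = spatial_pyramid_max_pool(codes, xy, width=64, height=64)
print(z.shape)                                       # (105,)
```

With these assumed levels there are 1 + 4 + 16 = 21 cells, so the global feature has 21K entries; the first K entries are the maxima over the whole image.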
Let image $I_i$ be represented by $z_i$. A simple linear SPM kernel is then defined by the inner product $\kappa(z_i, z_j) = z_i^{\top} z_j$.
With linear SPM kernel, we can directly use linear SVM, for which the training cost is O(n) in computation, and the testing cost for each image depends on the dimension of feature.
3.4 Multi-class linear SVM
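Let $\{(z_i, y_i)\}_{i=1}^{n}$, with $y_i \in \{1, \cdots, L\}$, be the training data. We follow the implementation of Yang et al. and use a one-against-all strategy to train L binary linear SVMs, each of which solves the following unconstrained convex optimization problem:

$$\min_{w_c} \; J(w_c) = \|w_c\|^2 + C \sum_{i=1}^{n} \ell(w_c; y_i^c, z_i)$$

where $y_i^c = 1$ if $y_i = c$, otherwise $y_i^c = -1$, and $\ell(w_c; y_i^c, z_i)$ is the hinge loss function. The standard hinge loss is not differentiable everywhere, so we use the quadratic hinge loss below instead, which allows gradient-based optimization methods such as LBFGS:

$$\ell(w_c; y_i^c, z_i) = \left[ \max\left(0, \; 1 - y_i^c \, w_c^{\top} z_i \right) \right]^2$$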
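The one-against-all training with a quadratic hinge loss can be sketched in a few lines of NumPy. Two things are assumptions of this sketch rather than details from the paper: plain gradient descent with a fixed step is swapped in for LBFGS to keep the example dependency-free, and the toy data (three well-separated 2-D clusters with an appended bias feature) are synthetic.

```python
import numpy as np

def squared_hinge_objective(w, Z, t, C):
    """Value and gradient of ||w||^2 + C * sum_i max(0, 1 - t_i * w.z_i)^2."""
    margin = np.maximum(0.0, 1.0 - t * (Z @ w))
    value = w @ w + C * np.sum(margin ** 2)
    grad = 2.0 * w - 2.0 * C * (Z.T @ (t * margin))
    return value, grad

def train_one_vs_all(Z, y, n_classes, C=1.0, n_iter=500):
    """Train one binary linear SVM per class by fixed-step gradient descent."""
    # The gradient is Lipschitz with constant at most 2 + 2*C*||Z||_2^2.
    step = 1.0 / (2.0 + 2.0 * C * np.linalg.norm(Z, 2) ** 2)
    W = np.zeros((n_classes, Z.shape[1]))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)              # +1 for class c, -1 otherwise
        w = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            _, grad = squared_hinge_objective(w, Z, t, C)
            w = w - step * grad
        W[c] = w
    return W

def predict(W, Z):
    """Assign each sample to the class whose classifier scores highest."""
    return np.argmax(Z @ W.T, axis=1)

rng = np.random.default_rng(0)
centres = np.array([[0.0, 4.0], [-4.0, -3.0], [4.0, -3.0]])   # toy clusters
y = np.repeat(np.arange(3), 30)
points = centres[y] + 0.5 * rng.normal(size=(90, 2))
Z = np.hstack([points, np.ones((90, 1))])            # append a bias feature
W = train_one_vs_all(Z, y, n_classes=3)
accuracy = float(np.mean(predict(W, Z) == y))
print(accuracy)
```

Because the quadratic hinge loss is differentiable, any gradient-based optimizer can be substituted for the inner loop.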
4 Experimental results
4.1 Data set
We used images of wildlife captured with motion-sensitive camera traps (Reconyx RC55, PC800 and HC500, Holmen, WI, USA), which generate sequences of 3.1-megapixel JPEG images at about 1 frame/s when triggered by an infrared motion sensor. Color images are captured during the day, and gray-scale images are captured at night using an infrared flash, which is invisible to most animals. We used images from tropical rain forest (Barro Colorado Island, Panama) and from temperate forest and heathland (Hoge Veluwe National Park, the Netherlands). Expert zoologists identified the animals in the images. We did not edit the data set for ease of identification, so it includes many of the typical challenges of camera trapping data, including cases where the animal is too small or is occluded by vegetation.
In total, we obtained 10,598 sequences of 57 species. The number of sequences per species was unbalanced. As shown in Table 1, 40 of the 57 species have fewer than 50 sequences. We excluded these species and retained the top 18 species. To build a balanced test data set, we chose up to 100 sequences from each species; where fewer than 100 sequences were available for a species, we used all of them. After this selection, 1,739 sequences of 18 species remained. Table 1 lists the number of retained sequences and frames for each species.
The camera-trap sequences have a low frame rate (1 frame/s) and are short (about 10 frames/sequence). Two typical image sequences are shown in Figure 3. The first two rows show consecutive frames of an agouti, in which leaves dangle in the wind. The second two rows show consecutive frames of a collared peccary. When the peccary suddenly moved close to the camera, the illumination changed markedly because the animal blocked much of the light. Common motion detection methods cannot handle this case well. To obtain clean data, we manually cropped all the animals from the sequences. Because most frames were empty, the cameras having been triggered by background motion, only 7,196 animal images were kept. Table 2 lists the details of the resulting dataset. During cropping, we kept the original animal size, color, and aspect ratio. Figure 4 shows cropped samples for seven species.
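The selection rule described above (drop species with fewer than 50 sequences, then keep at most 100 sequences per remaining species) can be sketched as follows. The function name, the random choice of which 100 sequences to keep, and the toy counts are assumptions; the text does not say how the sequences were picked.

```python
import random
from collections import defaultdict

def balance_sequences(sequences, min_count=50, max_count=100, seed=0):
    """sequences: list of (sequence_id, species). Returns {species: chosen ids}."""
    by_species = defaultdict(list)
    for seq_id, species in sequences:
        by_species[species].append(seq_id)

    rng = random.Random(seed)
    selected = {}
    for species, ids in by_species.items():
        if len(ids) < min_count:
            continue                                  # too rare: species excluded
        if len(ids) <= max_count:
            selected[species] = list(ids)             # keep every sequence
        else:
            selected[species] = rng.sample(ids, max_count)   # assumed: random subset
    return selected

# Toy input with made-up counts, for illustration only.
toy = ([(f"a{i}", "agouti") for i in range(250)]
       + [(f"o{i}", "ocelot") for i in range(60)]
       + [(f"r{i}", "rare species") for i in range(10)])
chosen = balance_sequences(toy)
print({species: len(ids) for species, ids in chosen.items()})
# {'agouti': 100, 'ocelot': 60}
```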
4.2 Implementation and result
We developed a species recognition algorithm based on ScSPM, implemented as follows. All images were converted to gray scale, and both the SIFT and the cLBP descriptors were extracted from 16 × 16 pixel patches. The patches of each image were densely sampled on a grid with a step size of 4 pixels. Both SIFT and cLBP descriptors were normalized to unit norm, with dimensions 128 and 59, respectively. For dictionary learning, we extracted SIFT and cLBP descriptors from 20,000 patches randomly sampled from the training set. Dictionaries were trained separately for SIFT and cLBP, with the same dictionary size K = 1,024.
Following standard benchmark procedures, we repeated the experiment over 10 runs to obtain reliable results. In each run, we randomly selected 70% of the images of each species for training and kept the remaining 30% for testing. We report our final results as a confusion matrix.
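The dense sampling grid can be made concrete with a small sketch that enumerates the top-left corners of all 16 × 16 patches at a step of 4 pixels and normalizes a descriptor to unit norm. Starting the grid at pixel (0, 0) and discarding patches that would cross the image border are assumptions; the function names are hypothetical.

```python
import math

def dense_patch_origins(height, width, patch=16, step=4):
    """Top-left (y, x) corners of all patches that fit fully inside the image."""
    ys = range(0, height - patch + 1, step)
    xs = range(0, width - patch + 1, step)
    return [(y, x) for y in ys for x in xs]

def l2_normalize(vector):
    """Scale a descriptor to unit Euclidean norm (zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

origins = dense_patch_origins(height=64, width=48)    # toy image size
print(len(origins))                                   # 117
print(origins[0], origins[-1])                        # (0, 0) (48, 32)
```

For a 64 × 48 crop this gives 13 × 9 = 117 patches, each of which would yield one 128-dimensional SIFT descriptor and one 59-dimensional cLBP descriptor.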
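The evaluation protocol (10 runs, each with a random 70/30 split per species, summarized in a confusion matrix) can be sketched as below. The rounding rule for the 70% cut, the seeds, and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices per class so each class keeps the same train share."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

def confusion_matrix(true_labels, predicted_labels, classes):
    """counts[i][j] = number of samples of class i predicted as class j."""
    position = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[position[t]][position[p]] += 1
    return counts

labels = ["agouti"] * 20 + ["paca"] * 10              # toy labels
splits = [stratified_split(labels, seed=run) for run in range(10)]
train, test = splits[0]
print(len(train), len(test))                          # 21 9

# A perfect predictor on the first test split gives a diagonal matrix.
truth = [labels[i] for i in test]
print(confusion_matrix(truth, truth, ["agouti", "paca"]))   # [[6, 0], [0, 3]]
```

In a full experiment, the confusion matrices of the 10 runs would be accumulated or averaged before reporting per-species accuracy.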
We first tested our approach on all 18 species; the classification result is shown in Table 3. In real-world scenarios, it is not necessary to distinguish species across the two sites. We therefore also tested our method on the two datasets (Panama and the Netherlands) separately. The classification results are shown in Tables 4 and 5.
Since SIFT and cLBP describe texture at different levels, we ran the experiment using SIFT alone, cLBP alone, and the combination of SIFT and cLBP to show how the combination improves performance. The SIFT feature is good at extracting the silhouette of an animal, while cLBP is powerful in describing the skin texture of animals. Thus, it is reasonable to combine SIFT and cLBP. As Table 6 shows, the SIFT feature is more discriminative than cLBP, and combining them boosts performance considerably.
Table 3 shows that the overall accuracy is about 82%. The wood mouse is recognized correctly in 100% of cases, which is surprising considering that no biometric features are used. For over one third of the 18 species, including paca, ocelot, red deer, and wild boar, this experiment obtained a classification accuracy over 90%. As expected, red brocket deer is easily misclassified as white-tailed deer because both are deer and have a similar appearance. To better distinguish species such as these two, biometric features, such as spots on the fur and the shape of antlers, play a key role. However, to the best of our knowledge, automatically identifying biometric features is a challenging task.
We have shown that object recognition techniques from computer vision can be used effectively to recognize and identify wild mammals in sequences of photographs taken by camera traps in nature, which are notorious for high levels of noise and clutter. Although some species are closely related, the proposed method can detect subtle differences between them. The combination of SIFT and cLBP as descriptors of local image features, which together provide rich texture description at multiple scales, significantly improved the recognition performance.
In future work, biometric features that are important for species analysis, such as color, spots, and body size, will be included in the local features. Since the original sequences captured with motion-sensitive camera traps contain motion information, we will also develop an automatic animal segmentation algorithm.
Committee on Grand Challenges in Environmental Sciences NRCUC: Grand Challenges in Environmental Sciences. National Academies Press, Washington, DC; 2001.
Porter J, Arzberger P, Braun H, Bryant P, Gage S, Hansen T, Hanson P, Lin C, Lin F, Kratz T, Williams T, Shapiro S, King H, Michener W: Wireless sensor networks for ecology. BioScience 2005, 55(7):561-572. 10.1641/0006-3568(2005)055[0561:WSNFE]2.0.CO;2
Kays R, Tilak S, Kranstauber B, Jansen P, Carbone C, Rowcliffe M, Fountain T, Eggert J, He Z: Monitoring wild animal communities with arrays of motion sensitive camera traps. Int J Res Rev Wireless Sensor Netw 2011, 1: 19-29.
Aguzzi J, Costa C, Fujiwara Y, Iwase R, Menesatti P, Ramirez-E Llorda: A novel morphometry-based protocol of automated video-image analysis for species recognition and activity rhythms monitoring in deep-sea fauna. Sensors 2009, 9(11):8438-8455. 10.3390/s91108438
Fegraus E, Lin K, Ahumada J, Baru C, Chandra S, Youn C: Data acquisition and management software for camera trap data: a case study from the TEAM Network. Ecol. Inform 2011, 6(6):345-353. 10.1016/j.ecoinf.2011.06.003
Yang J, Yu K, Gong Y, Huang T: Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition. Miami; 20-25 June 2009:1794-1801.
Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y: Locality-constrained linear coding for image classification. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). San Francisco, CA; 13-18 June 2010:3360-3367.
Lowe D: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis 2004, 60(2):91-110.
Ahonen T, Hadid A, Pietikainen M: Face description with local binary patterns: application to face recognition. Pattern Anal. Mach. Intell, IEEE Trans 2006, 28(12):2037-2041.
Bolger DT, Morrison TA, Vance B, Lee D, Farid H: A computer-assisted system for photographic mark–recapture analysis. Methods Ecol. Evol 2012, 3(5):813-822. 10.1111/j.2041-210X.2012.00212.x
Serre T, Wolf L, Poggio T: Object recognition with features inspired by visual cortex. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, CA; 20-26 June 2005:994-1000.
Lazebnik S, Schmid C, Ponce J: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York; 17-22 June 2006:2169-2178.
Wang X, Han T, Yan S: An HOG-LBP human detector with partial occlusion handling. In 2009 IEEE 12th International Conference on Computer Vision. Kyoto, Japan; 27 September - 4 October, 2009:32-39.
Zhang J, Huang K, Yu Y, Tan T: Boosted local structured HOG-LBP for object localization. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Colorado Springs, Colorado; 20-25 June 2011:1393-1400.
Ojala T, Pietikäinen M, Harwood D: A comparative study of texture measures with classification based on featured distributions. Pattern Recognit 1996, 29: 51-59. 10.1016/0031-3203(95)00067-4
Lee H, Battle A, Raina R, Ng A: Efficient sparse coding algorithms. Adv. Neural Inf. Process. Syst 2007, 19: 801.
Mairal J, Bach F, Ponce J: Task-driven dictionary learning. Pattern Anal. Mach. Intell, IEEE Trans 2012, 34(4):791-804.
Yang J, Wang J, Huang T: Learning the sparse representation for classification. In 2011 IEEE International Conference on Multimedia and Expo (ICME). Barcelona; 11-15 July 2011:1-6.
This work was supported in part by the National Science Foundation Grant DBI 10-62351. Field data were collected with support from the National Science Foundation (NSF-DEB 0717071 to R.W.K.) and the Netherlands Organization for Scientific Research (863-07-008 to P.A.J.). XY and TW would like to acknowledge support by the National Natural Science Foundation of China Grant 61073094.
The authors declare that they have no competing interests.