Are RGB-based salient object detection methods unsuitable for light field data?

Considering the significant progress made on RGB-based deep salient object detection (SOD) methods, this paper seeks to bridge the gap between those 2D methods and 4D light field data, instead of implementing specific 4D methods. We observe that the performance of 2D methods changes dramatically with the input refocusing on different depths. This paper attempts to make the 2D methods available for light field SOD by learning to select the best single image from the 4D tensor. Given a 2D method, a deep model is proposed to explicitly compare pairs of SOD results on one light field sample. Moreover, a comparator module is designed to integrate the features from a pair, which provides more discriminative representations to classify. Experiments over 13 latest 2D methods and 2 datasets demonstrate the proposed method can bring about 24.0% and 5.3% average improvement of mean absolute error and F-measure, and outperform state-of-the-art 4D methods by a large margin.

Fig. 1 Motivation of the proposed method. a All-focus image and saliency maps by the state-of-the-art 2D method SCRN [11]. b Focal stacks and corresponding saliency results by SCRN. c Ground truth (the first and third rows) and results by the state-of-the-art 4D method DUT [7] (the second and fourth rows). Green boxes denote the best F-measure of SCRN

The 4D light field representation is converted into one all-focus image (as shown in the first row of Fig. 1a) and multiple focal stacks at different depths (as shown in the first row of Fig. 1b).
State-of-the-art 4D methods [7][8][9] can be regarded as fusion-based methods, which attempt to combine all light field slices by different strategies. However, the slices of one light field sample are not always beneficial for saliency detection. As shown in Fig. 1b, the variation of focus depth may blur the original salient object and make the undesired background noticeable. Fusing such features could lead to inferior results. As shown in the second and fourth rows of Fig. 1c, the fusion-based method DUT [7] detects many false-positive regions, which could be attributed to the combination of some undesired focal stacks.
The target of this paper is to provide a new perspective to solve 4D SOD. We find that using different input data for 2D methods changes performance dramatically, as shown in Fig. 1b. Some focal stacks happen to correctly focus on the target object and blur the background, which makes the segmentation far easier and better than the other stacks and all-focus image. If we can automatically select proper slice(s) from the light field sample, can 2D methods outperform state-of-the-art 4D ones? To the best of our knowledge, no prior work has attempted to select the input of 2D methods on the task of light field SOD.
To this end, we reformulate the task of light field SOD as selecting the best input for 2D methods, which lets this task continuously benefit from the development of 2D methods. Concretely, this paper proposes to explicitly compare pairs of light field slices. The input of our model is two images randomly selected from the same light field sample. To evaluate a given 2D method, we also concatenate the two images with their corresponding saliency maps. However, such paired inputs bring a new challenge: the same two images in a different order should correspond to different outputs. A comparator module is proposed to ensure that the model can distinguish which of the two inputs is better. The features of the two inputs are treated separately as the exemplar and the query of the comparator module. The comparator adopts the attention mechanism to reweigh the query feature based on the correlation between the query and the exemplar, which allows the whole model to pay more attention to the correlated areas and provide a more discriminative representation. To train such a model, the label is formulated as the relative performance of the two SOD results, which can be measured by common SOD metrics, e.g., F-measure [12] and E-measure [13]. Instead of regressing the relative performance, we adopt a binary classification loss, which is intended to make the learning process easier and more effective.
Once trained, the model can operate like a bubble sort. It iteratively compares two slices from the same light field sample until the best one is predicted. However, such a testing strategy is unstable: a single wrong comparison can make the whole prediction fail. This paper instead considers the prediction from a global perspective. The score of one slice should be determined jointly by all the others in the same sample. Thus, each time, we compare one slice with all the others and adopt the average score as the prediction for the target slice. In this way, the strategy brings in more information from a global view and offsets individual inaccurate predictions by the model.
To sum up, the main contributions of this work are threefold:
• This paper demonstrates that there exists an alternative way to perform light field SOD without designing specialized 4D methods, i.e., optimizing and selecting the best input to existing 2D methods.
• A novel convolutional neural network (CNN)-based model is proposed to compare any two slices from the same light field sample, the relative performance of which is regularized by an effective attention-based module and a simple binary classification loss.
• We verify the proposed model with 13 latest 2D deep SOD methods over 2 light field datasets. The experimental results demonstrate that the proposed method effectively boosts the performances and makes most of the involved 2D methods outperform state-of-the-art 4D methods.

Related work
Here, we briefly introduce the related work about salient object detection according to different types of data.

RGB-based salient object detection
SOD methods were initially devoted to single color or grayscale images and can thus be regarded as 2D methods. Early works mainly depend on intrinsic cues, e.g., hand-crafted local and global features [14,15] and heuristic priors [12,16], to extract underlying salient regions. In recent years, the introduction of CNNs has led SOD into a new era of rapid development [17]. At the beginning, the CNN was adopted as a feature extractor to provide multi-context information [18,19]. With the introduction of the fully convolutional network (FCN), CNN-based SOD was formulated as a task of pixel-wise estimation. To provide the necessary low- and high-level context, most methods attempt to integrate the features from multiple stages of the FCN. For example, Li and Yu [20] proposed a multi-scale FCN branch to capture visual contrast among multi-scale feature maps. Hou et al. [21] exploited stage-wise supervision to explicitly learn on multi-scale feature maps. In [22], the features from deep and shallow layers of the FCN are iteratively optimized using residual refinement blocks. Besides individually learning a segmentation model, some works [23][24][25][26][27] attempt to utilize boundary information for auxiliary training. Li et al. [23] proposed a novel method to alternately train a contour detection model and a SOD model. In [26], logical interrelations were adopted to constrain the simultaneous training of SOD and edge detection. An edge guidance network [27] is employed to couple edge features with saliency features at multiple scales. Attention modules in CNNs are also commonly exploited to mimic the visual attention mechanism, which is consistent with the purpose of SOD. A pixel-wise contextual attention network [28] was introduced to selectively attend to informative context locations for each pixel. Chen et al. [29] attempted to reverse the attention results for expanding object regions progressively. The work in [30] proposed an attentive feedback module to refine the features from the encoder and pass them to the corresponding decoder.

RGBD-based salient object detection
Depth maps containing various depth cues such as spatial structure and 3D layout provide necessary complementary information for 2D SOD [31,32]. CNNs have also proven powerful in the field of 3D SOD. Qu et al. [33] fused the depth maps with different low-level saliency cues as the input of a CNN. Chen and Li [34] designed a novel complementarity-aware fusion module to explicitly integrate cross-modal and cross-level features. In [35], the depth cue was processed by an independent encoder network to provide an extra prior. Recently, Zhao et al. [32] argued that fusing the CNN features of depth maps in the RGB branch is sub-optimal. Before combining them with RGB features, they adopted a contrast prior to enhance the depth cues.

Light field salient object detection
This task was first defined in [6], where objectness was adopted to integrate the saliency candidates from all-focus images and corresponding focal stacks. Zhang et al. [36] extracted background priors by weighting focusness contrast and demonstrated the effectiveness of light field data properties. Li et al. [37] built a saliency dictionary by selecting a group of salient candidates from the focal stacks, where saliency was measured by the reconstruction error. In [8], multiple cues, e.g., color, depth, and multiple viewpoints, were generated from light field features and integrated by a random-search-based weighting strategy. Compared with 2D/3D methods, CNN-based light field SOD is still at an early stage. One of the main reasons is insufficient labeled data. Recently, Wang et al. [7] introduced a large dataset and adopted CNN models to solve this task. A recurrent attention network was proposed in [7] to integrate every slice in the focal stacks, which was later combined with another stream over all-focus images. Similar to [7], Zhang et al. [9] proposed a complicated framework to fuse the focal stacks, which aimed at emphasizing the ones related to the salient object. Piao et al. [38] introduced an asymmetrical two-stream network to distill focusness knowledge to a student network, which is computation-friendly. Instead of learning the implicit relationship among the focal stacks, we attempt to explicitly select the slices of a light field sample that are compatible with the well-developed CNN-based 2D methods.

Problem definition
We consider solving the task of light field SOD by employing existing CNN-based 2D methods. Formally, a complete light field sample consists of one all-focus image $I_0$ and multiple focal stacks $\{I_n\}_{n=1}^{N}$ focusing at different depths. The problem addressed in this paper is how to select the best-performing map from the saliency maps $\{M_n\}_{n=0}^{N}$ produced by a given 2D method from $\{I_n\}_{n=0}^{N}$. The performance $\{y_n\}_{n=0}^{N}$ of $\{M_n\}_{n=0}^{N}$ can be quantified by standard evaluation metrics, such as mean absolute error, F-measure [12], E-measure [13], and S-measure [39].
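As a minimal sketch of how a performance value $y_n$ might be computed for one saliency map, the following implements the mean absolute error and a single-threshold F-measure. The fixed threshold of 0.5 and the flat-list representation of the maps are assumptions made for illustration; $\beta^2 = 0.3$ is the value conventionally used in the SOD literature. The function names are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(pred, gt):
    """Average absolute difference between a saliency map and its ground truth.

    Both inputs are flat sequences of equal length with values in [0, 1].
    """
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of a saliency map binarized at a fixed threshold (assumption)."""
    binary = [p >= threshold for p in pred]
    salient = [g >= 0.5 for g in gt]
    true_pos = sum(1 for b, s in zip(binary, salient) if b and s)
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(binary)
    recall = true_pos / sum(salient)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy 4-pixel example: the first two pixels belong to the salient object.
ground_truth = [1, 1, 0, 0]
good_map = [0.9, 0.8, 0.2, 0.1]
poor_map = [0.9, 0.2, 0.8, 0.1]
print(mean_absolute_error(good_map, ground_truth), f_measure(good_map, ground_truth))
print(mean_absolute_error(poor_map, ground_truth), f_measure(poor_map, ground_truth))
```

Lower is better for the mean absolute error and higher is better for the F-measure, so the two maps above are ranked consistently by both metrics.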
One straightforward solution is to treat individual images in $\{I_n\}_{n=0}^{N}$ as independent inputs and learn to regress the quantitative performances $\{y_n\}_{n=0}^{N}$. When testing, the slice with the maximum predicted score can be selected from each test sample. However, learning such a model may suffer from ambiguous data, where similar quantitative values may correspond to totally different SOD results. Besides, existing light field SOD datasets are quite small. The largest dataset DUTLF [7] only provides 1000 training samples with 7354 focal stacks, which is far from enough to train a regression model.
To achieve satisfactory performance with limited data, this paper proposes to explicitly compare pairs of light field slices, which naturally augments the training data. Thus, the problem is reformulated as predicting the relative performance $y_i - y_j$ between two different slices $I_i$ and $I_j$. With this strategy, a sample with $N + 1$ slices yields $(N + 1) \times N$ ordered training pairs. For example, each sample of the DUTLF [7] dataset has 8.2 slices on average, which provides more than 55,000 training samples under the above definition. This amount of data is enough to support the training of a powerful model.
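The pair count can be illustrated with a short sketch. The helper names are hypothetical, and the per-sample slice counts in the usage lines are made up; the exact total for DUTLF depends on the real distribution of slices per sample, which is not given here.

```python
from itertools import permutations


def ordered_pairs(num_slices):
    """All ordered (exemplar, query) index pairs among the slices of one sample."""
    return list(permutations(range(num_slices), 2))


def count_training_pairs(slices_per_sample):
    """Total number of ordered pairs over a dataset, given each sample's slice count."""
    return sum(n * (n - 1) for n in slices_per_sample)


pairs = ordered_pairs(4)
print(len(pairs))                    # 4 slices give 4 * 3 = 12 ordered pairs
print((0, 1) in pairs, (1, 0) in pairs)  # both orders are distinct training samples
print(count_training_pairs([8, 9]))  # 8 * 7 + 9 * 8 = 128
```

As a rough check, 1000 samples with about 8.2 slices each give on the order of 1000 × 8.2 × 7.2 ≈ 59,000 ordered pairs, which is consistent with the "more than 55,000" stated above.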

Overview
The proposed method consists of three key modules: an encoder $f_e$, a comparator $f_c$, and a predictor $f_p$, as shown in Fig. 2. At each iteration, we randomly select two slices $I_i$ and $I_j$ from a light field sample $\{I_n\}_{n=0}^{N}$ as an exemplar and a query, respectively. The goal of the proposed method is to predict whether the exemplar $I_i$ achieves better performance than the query $I_j$ for a given 2D method. As described in the previous subsection, the relative performance $y_i - y_j$ is calculated on the saliency maps $M_i$, $M_j$ produced by the given 2D method. Thus, we concatenate each input slice with its corresponding saliency map, which guides the model to learn better.
The encoder $f_e$ embeds the two inputs into features $F_i$ and $F_j$. The same two slices in opposite order correspond to the opposite labels $y_i - y_j$ and $y_j - y_i$. To distinguish such similar pairs, we propose a comparator $f_c$ to reweigh the query feature $F_j$ based on the correlation between the exemplar and query. The comparator adopts the co-attention mechanism [40,41] to couple $F_i$, $F_j$ and generate the correlated features $F_{ij}$. The comparator enables the whole model to attend more to the informative regions and provide more discriminative features to the predictor. The predictor $f_p$ is formulated as a binary classifier and outputs the probability that the exemplar $I_i$ outperforms the query $I_j$. The whole model is trained by a binary cross entropy loss (Eq. (1)), where $\sigma$ is the Sigmoid function.

During the testing phase, the trained model can operate like a bubble sort. If $\sigma(f_p(F_{ij})) \geq 0.5$, the exemplar $I_i$ is considered to achieve better performance than the query $I_j$ and is compared with the other queries. Otherwise, the query $I_j$ is regarded as the better one. This comparison proceeds until all slices in the sample have been compared. However, such a testing strategy is sub-optimal when the trained model cannot guarantee 100% accuracy: the comparison fails if a single error is injected. We instead consider the prediction from a global perspective. We simultaneously compare one slice with the others and adopt the average score as the final prediction of the target slice. The best input slice is simply the one with the maximum score (Eq. (2)).
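The training objective and the two testing strategies described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the binary target is assumed to be 1 when $y_i > y_j$ and 0 otherwise, the predictor outputs are replaced by a hand-written probability matrix, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_bce(logit, y_i, y_j):
    """Binary cross entropy for one ordered pair.

    Assumption: the target is 1 when the exemplar outperforms the query.
    """
    target = 1.0 if y_i > y_j else 0.0
    prob = min(max(sigmoid(logit), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def select_best_slice(prob):
    """Global strategy: average each slice's win probability over all others.

    prob[i][j] is the predicted probability that slice i outperforms slice j.
    Diagonal entries are ignored.
    """
    n = len(prob)
    scores = [sum(prob[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    best = max(range(n), key=scores.__getitem__)
    return best, scores


def bubble_select(prob):
    """Sequential strategy: keep the current winner and compare it with the next slice."""
    best = 0
    for j in range(1, len(prob)):
        if prob[best][j] < 0.5:
            best = j
    return best


# Slice 0 is truly the best, but the model makes one mistake on the pair (0, 1).
P = [
    [0.0, 0.45, 0.9],
    [0.4, 0.0, 0.6],
    [0.1, 0.3, 0.0],
]
print(bubble_select(P))         # the single error propagates to the final choice
print(select_best_slice(P)[0])  # averaging over all comparisons absorbs the error
```

In this toy matrix the sequential comparison discards slice 0 after one wrong decision, while the averaged score still ranks it first, which is the behaviour the global strategy is meant to provide.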

The encoder
The encoder module is a Siamese CNN. It maps a pair of images into the same feature space and provides comparable features. Inspired by the works of few-shot learning [42,43], we adopt a simple but effective network as shown in Fig. 3a. The encoder consists of four convolutional blocks and two 2 × 2 max pooling layers. Each convolutional block is a 3 × 3 convolutional layer with 32 channels, followed by a batch normalization layer and a ReLU nonlinearity activation. The two max pooling layers are deployed after the first two blocks to reduce the spatial size of the features.

The comparator
Although paired inputs can expand the training data, they incur the problem of the same inputs in a different order. Such inputs should correspond to completely opposite results. To enable the model to distinguish this difference, we propose a comparator, as shown in Fig. 4, to asymmetrically reweigh the features $F_i \in \mathbb{R}^{H \times W \times C}$, $F_j \in \mathbb{R}^{H \times W \times C}$ embedded by the encoder. The comparator treats $F_i$, $F_j$ separately, $F_i$ as the exemplar and $F_j$ as the query. The co-attention mechanism [40,41] is employed to calculate the affinity matrix $A$ between $F_i$ and $F_j$:

$$A = F_i^{\top} W F_j,$$

where $F_i \in \mathbb{R}^{C \times HW}$ and $F_j \in \mathbb{R}^{C \times HW}$ are reshaped into matrix form. $W \in \mathbb{R}^{C \times C}$ denotes the weight matrix, which can be formulated as a fully connected layer. Each row in $F_i^{\top}$ and each column in $F_j$ define a $C$-dimensional feature at one spatial position of $H \times W$. Thus, each element in $A \in \mathbb{R}^{HW \times HW}$ represents an affinity score between the features at a pair of spatial locations in $F_i$ and $F_j$. In this paper, we only reweigh the query feature to emphasize the order difference of the same input images. Concretely, each column $A^{(c)}$ of $A$ is normalized by the softmax function $\eta$ to generate attention across the exemplar $F_i$ for each position in the query $F_j$; the $c$th column $A^{(c)}$ reflects the relevance of each feature in $F_i$ to the $c$th feature in $F_j$. Next, we compute the attention contexts of the query in light of the exemplar, which yields an attended query feature in $\mathbb{R}^{C \times HW}$. Finally, the output feature $F_{ij} \in \mathbb{R}^{2C \times H \times W}$ of the comparator is formulated as the concatenation of the attended query feature and $F_i$. Through the above transformation, the order difference is formulated as reweighing the query based on the correlations between the exemplar and the query. When the order of $i$ and $j$ is exchanged, the comparator outputs totally different features.
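The comparator can be sketched in NumPy as below. This is a minimal reading of the description above under several assumptions: the attention context is taken to be the exemplar features multiplied by the column-normalized affinity matrix (the usual co-attention form), features are stored channel-first as `(C, H, W)`, the weight matrix is random rather than learned, and the names `softmax_columns` and `comparator` are hypothetical.

```python
import numpy as np


def softmax_columns(a):
    """Softmax over each column, so every column sums to 1."""
    shifted = a - a.max(axis=0, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)


def comparator(f_i, f_j, w):
    """Asymmetric co-attention between an exemplar f_i and a query f_j.

    f_i, f_j: arrays of shape (C, H, W); w: array of shape (C, C).
    Returns an array of shape (2C, H, W).
    """
    c, h, wd = f_i.shape
    fi = f_i.reshape(c, h * wd)                   # (C, HW)
    fj = f_j.reshape(c, h * wd)                   # (C, HW)
    affinity = fi.T @ w @ fj                      # (HW, HW)
    attention = softmax_columns(affinity)         # attention over exemplar positions
    context = fi @ attention                      # (C, HW), one context per query position
    out = np.concatenate([context, fi], axis=0)   # concatenated with F_i, as in the text
    return out.reshape(2 * c, h, wd)


rng = np.random.default_rng(0)
f_i = rng.standard_normal((4, 3, 3))
f_j = rng.standard_normal((4, 3, 3))
w = rng.standard_normal((4, 4))
print(comparator(f_i, f_j, w).shape)
print(np.allclose(comparator(f_i, f_j, w), comparator(f_j, f_i, w)))
```

Swapping the two inputs changes both the attention weights and the concatenated exemplar, so the two orderings of the same pair produce different features for the predictor.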

The predictor
Given a calculated feature, the predictor can be regarded as a simple CNN-based classifier that identifies the corresponding label. As shown in Fig. 3b, the predictor module consists of two convolutional blocks, two 2 × 2 max-pooling layers, and three fully connected layers. The convolutional block has the same configuration as the one in the encoder. The first block adopts a 1 × 1 convolution to reduce the feature dimension. The first fully connected layer is followed by a dropout layer with a probability of 0.5 to avoid overfitting. The output dimensions of the fully connected layers are 1024, 256, and 1, respectively.

Datasets
We evaluate the proposed model on two light field SOD datasets: LFSD [6] and DUTLF [7]. LFSD provides 100 light field samples by the Lytro light field camera, including 60 indoor and 40 outdoor scenes. This is the first dataset for solving the light field SOD problem. This dataset is captured at a resolution of 360 × 360. Then, an all-focus image is composed by using online open-source tools. DUTLF is the latest and largest dataset for improving the development of CNN-based light field SOD. It is a more challenging dataset with a wide range of scenes and multiple salient objects. This dataset is captured by a Lytro Illum camera at a resolution of 600 × 400. The DUTLF dataset consists of 1000 training and 465 testing images.

Implementation details
The proposed model is trained end-to-end from scratch with random initialization. All training and testing images are resized to 182 × 182. Thus, the input dimension of the first fully connected layer in the predictor is equal to 32 × 10 × 10. The proposed model is learned on the training set of DUTLF, where the data is augmented by horizontal flipping. The training label $y_i$ in Eq. (1) is calculated by the E-measure [13]. The network is trained by standard SGD and converges after 30 epochs with a batch size of 32. Each entry of the mini-batch consists of two images randomly selected from the same light field sample. The learning rate, momentum, and weight decay of the SGD optimizer are set to 5e−3, 0.9, and 5e−4, respectively. The learning rate is reduced to 5e−4 after 20 epochs. Our proposed model is implemented with the publicly available PyTorch library. All the experiments and analyses are conducted on an NVIDIA 1080Ti GPU.
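The 32 × 10 × 10 input size of the first fully connected layer can be checked by tracing the spatial size through the encoder and predictor. The layer order follows the descriptions of the two modules, but the padding values below are assumptions: the text does not state them, and they are chosen here only because they reproduce the stated 10 × 10 output. Other padding choices may fit as well.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Spatial output size of a non-overlapping max pooling layer (floor mode)."""
    return size // kernel


def trace(size, layers):
    """Propagate a spatial size through a list of (kind, kernel, padding) layers."""
    for kind, kernel, padding in layers:
        if kind == "conv":
            size = conv_out(size, kernel, padding)
        else:
            size = pool_out(size, kernel)
    return size


# Encoder: four 3x3 conv blocks, max pooling after the first two (padding assumed).
ENCODER = [
    ("conv", 3, 0), ("pool", 2, 0),
    ("conv", 3, 0), ("pool", 2, 0),
    ("conv", 3, 1),
    ("conv", 3, 1),
]
# Predictor: a 1x1 conv block and a 3x3 conv block, each followed by max pooling.
PREDICTOR = [
    ("conv", 1, 0), ("pool", 2, 0),
    ("conv", 3, 0), ("pool", 2, 0),
]

encoder_size = trace(182, ENCODER)
predictor_size = trace(encoder_size, PREDICTOR)
channels = 32
print(encoder_size, predictor_size, channels * predictor_size * predictor_size)
```

Under these assumed paddings the 182 × 182 input shrinks to 44 × 44 after the encoder and to 10 × 10 after the predictor's convolutional part, giving 32 × 10 × 10 = 3200 values for the first fully connected layer.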

Quantitative comparisons
Table 1 presents the quantitative scores of M, F, E, and S. Baseline in Table 1 denotes the results of 2D methods over all-focus images, while +Ours denotes the results after the selection of our proposed method. Compared with the baseline, our method brings a large improvement, especially on the dataset DUTLF. Concretely, the average improvement of M, F, E, and S on DUTLF is 29.5%, 5.5%, 5.0%, and 5.3%, respectively, while the improvement on LFSD is 18.5%, 5.1%, 5.0%, and 4.6%, respectively. The consistent improvement on different datasets demonstrates that the proposed method is a general strategy for various 2D methods and data. Compared with 4D methods, we observe that the CNN-based 4D methods outperform the baselines of all 2D methods. After the refinement of our method, GCPA, F3Net, SCRN, and CPD achieve better results than the custom 4D methods on the datasets DUTLF and LFSD. In summary, combining our method with the latest 2D methods provides a new state of the art for the task of light field SOD.

Table 1 Quantitative comparisons of mean absolute error (M), F-measure (F), E-measure (E), and S-measure (S) on two light field datasets, i.e., DUTLF [7] and LFSD [6]. Baseline denotes the results of 2D methods over all-focus images, while +Ours denotes the results after the selection of our proposed method. Red and blue denote the best and second scores, respectively. The up arrow ↑ means larger is better while the down arrow ↓ means smaller is better

Qualitative comparisons
In Fig. 5, we visually compare the five best-performing 2D methods, namely GCPA, F3Net, SCRN, CPD, and BASNet, with the 4D methods of Piao et al. and MoLF. For the 2D methods, the first row of each sample is the result on the all-focus image while the second row is the result by our method. In most cases, the proposed method provides a better option than the all-focus image. Compared with state-of-the-art 4D methods, the proposed method effectively suppresses false-positive detections and improves the true-positive performance.

Precision-recall curve
In Fig. 6, we compare the PR curves of different methods. For a clearer presentation, we select the top five 2D methods with the best overall performance. For these 2D methods, the solid lines denote the results by our method while the dashed ones denote the results of the baseline. With our method, the 2D methods always exceed the baselines.

Fig. 6 Comparisons of PR curves on the datasets of DUTLF [7] and LFSD [6]. For a clearer presentation, we select the top five 2D methods with the best overall performance. For 2D methods, the solid lines denote the results by our method while the dashed ones denote the results of the baseline

Ablation study
In this subsection, we first analyze the effectiveness of the key components and how they benefit the 2D methods. Next, we investigate the generalization of our method.

Key components
In Table 2, we summarize the contributions of key components, including training data, the loss function, comparator, testing strategy, and the architecture of encoder. Compared with the Baseline, using every possible slice in the column of Avg. always leads to inferior performance, which proves that some slices in one light field sample could bring negative effect. Thus, carefully selecting the proper slice input is necessary. Comparing the rows of Img. Input and Sal. Input, we find that saliency maps are critical, which provide necessary information about the performance of 2D method and guide the model learning.
To demonstrate that classification is more suitable than regression, we replace the binary cross entropy loss in Eq. (1) with an L1 loss that regresses the relative performance $y_i - y_j$. As shown in the Reg. Loss rows of Table 2, learning with the classification loss is always beneficial to the performance. Next, we use the plain concatenation of the features $F_i$, $F_j$ in place of the comparator. The results in the w/o Com. rows of Table 2 indicate that direct concatenation is inferior, especially on the dataset DUTLF. We then evaluate the effect of the testing strategy by testing the trained model as a bubble sort. As shown in the rows of Bub. Test, the proposed Eq. (2) considers one slice from a global perspective and provides more stable and superior predictions. Finally, we analyze the effect of the encoder architecture. The columns R-18 and SER-18 denote replacing the proposed encoder with the well-designed architectures ResNet-18 [50] and SE-ResNet-18 [51], respectively. However, using different architectures does not bring further improvement. We attribute this to the representation embedded by the proposed encoder already being powerful enough. It seems that the performance is mainly limited by the comparator and predictor.

The ability of generalization
All the above results for a given 2D method are based on the model trained on its own saliency maps. In fact, an individual model can be deployed on any 2D method without retraining. To verify the generalization ability of our method, we conduct some analyses in Fig. 7. In each confusion matrix, the method on each row denotes the source of the training data. The trained model is then evaluated on different methods, which are denoted on each column. Each entry in the confusion matrix denotes the E-measure difference from the model trained on that method's own saliency maps. We scale the differences by 1000 for better visualization. We notice that the models trained on most methods generalize well with only a minor performance drop, except DGRL, C2SNet, and RAS. From the columns of the matrix, models trained on most other methods obtain unsatisfactory performance when evaluated on EGNet, PoolNet, DGRL, and RAS. We attribute this mainly to the difference in distribution between the saliency maps of these poorly generalizing methods and those of the other methods.

Relationship with the 4D methods
Existing CNN-based 4D methods can be summarized as an implicit selection of focal stacks. The attention mechanism [7,9] is adopted to emphasize useful features in the focal slices where the salient objects happen to be in focus. Such features are then fused with the segmentation branch on the all-focus images. Unlike other fusion-based tasks, e.g., video and action recognition, where sequential information is important, light field SOD only needs to concentrate on the images where the salient object is in focus. Fusing features from all focal stacks could confuse the segmentation. Observing the qualitative results of MoLF in Fig. 5, we find that the boundary of the salient object is not as sharp as that of the 2D methods. The main reason may be the features in which the salient object is blurred. In contrast, our method explicitly selects the proper inputs and reduces light field SOD to a common 2D task. It maintains the strength of existing segmentation networks and thereby provides superior results.

Fig. 7 Generalization analysis of the proposed method. We evaluate the model trained by the saliency results of one 2D method (denoted on each row) on other 2D ones (denoted on each column). Each entry in the confusion matrix denotes the E-measure difference with the method trained on its own saliency maps. We expand the differences by 1000 times for better visualization

The upper bound analysis
The capacity of the proposed method is limited by the power of 2D methods. In Fig. 8

Typical failure cases
In Fig. 9, we present some typical failure cases of the proposed method. Green boxes denote the slice selected by our method while the red ones denote the best slice by F-measure. The first example demonstrates one case when the 2D methods fail to detect the salient object. Although the proposed method can properly select the best slice, the saliency results are still far away from the ground truth. The second example shows another challenging situation when the saliency regions have smoothly varying depths, as shown in the last three columns of the second example. These saliency results are too similar to be correctly classified by the proposed method. However, the quantitative scores

Processing speed
When testing, we predict the score of each input slice with Eq. (2), which avoids the sequential comparisons of bubble sort. Each time, we build a mini-batch by copying the target slice as many times as there are remaining slices in the same light field sample. Feeding forward such a mini-batch takes about 0.0137 s. The total running time depends on the number of slices and the processing speed of the 2D method. Take the 2D method CPD and the dataset DUTLF for instance. Each sample of the DUTLF dataset has 8.2 slices on average. The processing speed of CPD is about 62 fps. The total time to deal with one light field sample follows from these quantities. For reference, the processing time of Piao et al. [38] and MoLF [9] is 0.07 and 0.11 s, respectively.
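A rough estimate of the per-sample time can be worked out from the figures above. The cost model is an assumption: every slice is taken to need one forward pass of the 2D method plus one comparison mini-batch, with no overlap between the two. The resulting number is therefore only indicative and may differ from the figure the authors report. The function name is hypothetical.

```python
def time_per_sample(num_slices, fps_2d, compare_batch_s):
    """Estimated seconds to process one light field sample.

    Assumes one 2D saliency pass and one comparison mini-batch per slice.
    """
    saliency_time = num_slices / fps_2d
    selection_time = num_slices * compare_batch_s
    return saliency_time + selection_time


# Values quoted in the text: 8.2 slices on average, CPD at 62 fps, 0.0137 s per mini-batch.
estimate = time_per_sample(8.2, 62, 0.0137)
print(round(estimate, 3))
```

Under these assumptions the estimate is about 0.24 s per sample, of which roughly half is spent running the 2D method on every slice.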

Conclusions
In this paper, we provide an alternative solution for the task of light field SOD. Without designing a specialized segmentation network for light field data, a model is proposed to optimize the input of existing 2D methods, which have made significant progress. The proposed model learns to predict the relative performance of any two slices from one light field sample. An attention-based comparator is proposed to distinguish the same two slices when they are compared in a different order. Experiments on 13 latest 2D methods demonstrate that the proposed strategy dramatically improves the performance of 2D methods on 2 light field datasets. Moreover, extra analyses demonstrate that the model trained on the results of one method has an impressive generalization ability, which means the proposed method can continuously benefit from the improvement of 2D methods.