A feature selection framework for video semantic recognition via integrated cross-media analysis and embedded learning
EURASIP Journal on Image and Video Processing volume 2019, Article number: 44 (2019)
Abstract
Video data are usually represented by high-dimensional features. The performance of video semantic recognition, however, may deteriorate because of the irrelevant and redundant components included in these high-dimensional representations. To improve the performance of video semantic recognition, we propose a new feature selection framework and validate it through applications of video semantic recognition. Two issues are considered in our framework. First, while labeled videos are precious, relevant labeled images are abundant and available on the web. Therefore, supervised transfer learning is proposed to achieve the cross-media analysis, in which the discriminative features are selected by evaluating each feature's correlation with the classes of videos and relevant images. Second, labeled videos are normally rare in real-world applications. In our framework, therefore, unsupervised subspace learning is added to retain the most valuable information and eliminate feature redundancies by leveraging both labeled and unlabeled videos. The cross-media analysis and embedded learning are learned simultaneously in a joint framework, which enables our algorithm to use their common knowledge as supplementary information to facilitate decision making. An efficient iterative algorithm is proposed to optimize the proposed learning-based feature selection, and its convergence is guaranteed. Experiments on different databases demonstrate the effectiveness of the proposed algorithm.
Introduction
Video semantics recognition [1] is a fundamental research problem in computer vision [2, 3] and multimedia analysis [4, 5]. However, video data are usually represented by high-dimensional feature vectors [6], which often incur high computational costs. Irrelevant and redundant features may also degrade the performance of video semantic recognition. Feature selection [7] can reduce the redundancy and noise in the original feature representation, thus facilitating subsequent analysis tasks such as video semantic recognition.
Depending on whether the class label information is available, feature selection algorithms can be roughly divided into two groups [8], i.e., supervised feature selection [9] and unsupervised feature selection [10]. Supervised feature selection selects discriminative features by evaluating the features' correlation with the classes. Thus, supervised feature selection usually yields better and more reliable performance by using the label information. However, most supervised feature selection methods require sufficient labeled training data in order to learn a reliable model [11]. Since it is difficult to collect high-quality labeled training data in real-world applications [12], it is normally not practical to provide enough labeled videos for existing supervised feature selection methods to achieve satisfactory performance. Recently, some cross-media analysis methods [13, 14] have been proposed to address the shortage of labeled videos by transferring knowledge from other relevant types of media (e.g., images). This type of cross-media analysis can therefore be considered a kind of transfer learning. Moreover, relevant labeled images are available and easier to collect, and they can be leveraged to enhance the feature selection for video semantic recognition. To this end, we propose supervised transfer learning in our framework, in which the knowledge from images is adapted to improve feature selection for video semantic recognition. Specifically, we use the available images with relevant semantics as our auxiliary resource, and feature selection is performed on the target videos. To transfer the information from images to videos, we use the same type of still features to represent both videos and images.
Unsupervised feature selection exploits data variance and separability to evaluate feature relevance without labels. A frequently used criterion is to select the features that best preserve the data distribution or local structure derived from the whole feature set [15]. Recently, some unsupervised feature selection methods based on embedded learning have been proposed. The main advantage of embedded learning is that it can use the manifold structure of both labeled and unlabeled data to enhance the performance of feature selection. Further, most transfer learning algorithms require that the features extracted from the source domain have the same type as those in the target domain. In practice, the videos and images in transfer learning usually need to be represented by still features such as SIFT [16]. For example, many videos are represented by their key frames, so they cannot be described by motion features such as STIP [17], which results in losing the underlying temporal information. To represent the video semantics completely and to use the unlabeled videos effectively, we add unsupervised embedded learning into our proposed framework, based on augmented feature representations. To take full advantage of cross-media analysis and embedded learning, we assemble them into a joint optimization framework by introducing a joint ℓ_{2,1}-norm regularization [18]. In this way, the information from cross-media analysis and embedded learning can be transferred from one domain to the other. Moreover, the problem of overfitting can be alleviated, and thus, the performance of feature selection can be improved. We call the proposed feature selection framework jointing cross-media analysis and embedded learning (JCAEL). We summarize the main contributions of this paper as follows:
(1) As JCAEL can transfer the knowledge learned from relevant images to videos to improve video feature selection, it can directly use labeled images to address the problem of insufficient label information. This merit enables our method to uncover the common discriminative features in videos and images of the same class, which provides better interpretability of the features.
(2) Our method contains unsupervised embedded learning, which utilizes both labeled and unlabeled videos for feature selection. This advantage allows JCAEL to exploit the variance and separability of all training videos to find the common irrelevant or noisy features and thus generate better feature subsets. Meanwhile, videos can be represented by augmented features during embedded learning, and the augmented features give a more complete representation of videos, providing the space to select precise features of video semantics.
(3) To take advantage of both cross-media analysis and embedded learning, we propose to combine them by adding a joint ℓ_{2,1}-norm regularization. In this way, our algorithm is able to evaluate the informativeness of features jointly, where the correlation of features is employed. In addition, our proposed method enables cross-media analysis and embedded learning to share the common components/knowledge of features, so as to uncover common irrelevant features, which improves the performance of feature selection for video semantic recognition.
The rest of this paper is organized as follows. The proposed method and its corresponding optimization approach are presented in Section 2. In Section 3, the experimental results are reported. The conclusion is given in Section 4.
Proposed method
In this section, we present the framework of JCAEL. To solve the resulting optimization problem efficiently, we develop an iterative algorithm and prove its convergence.
Notations
To adapt knowledge from images to videos, we represent the labeled training videos by a still feature: \(X_{v}=\left[x_{v}^{1},x_{v}^{2},\ldots,x_{v}^{n_{l}}\right]\in R^{d_{s}\times n_{l}}\), where d_{s} is the still feature dimension and n_{l} is the number of labeled training videos. Let \(Y_{v}=\left[y_{v}^{1},y_{v}^{2},\ldots,y_{v}^{n_{l}}\right]\in \{0,1\}^{c_{v}\times n_{l}}\) be the label matrix of the labeled training videos, where c_{v} is the number of video classes. Similarly, we represent the images by a still feature: \(X_{i}=\left[x_{i}^{1},x_{i}^{2},\ldots,x_{i}^{n_{i}}\right]\in R^{d_{s}\times n_{i}}\), where n_{i} is the number of images, and \(Y_{i}=\left[y_{i}^{1},y_{i}^{2},\ldots,y_{i}^{n_{i}}\right]\in \{0,1\}^{c_{i}\times n_{i}}\) is the label matrix of the images, where c_{i} is the number of image classes. Here, \(y_{v}^{kj}\) and \(y_{i}^{kj}\) denote the jth element of \(y_{v}^{k}\) and \(y_{i}^{k}\), respectively: \(y_{v}^{kj}=1\) if \(x_{v}^{k}\) belongs to the jth class and \(y_{v}^{kj}=0\) otherwise, and likewise \(y_{i}^{kj}=1\) if \(x_{i}^{k}\) belongs to the jth class and \(y_{i}^{kj}=0\) otherwise. To fully utilize labeled and unlabeled videos, we use an augmented feature to represent all n videos: \(Z_{v}=\left[z_{v}^{1},z_{v}^{2},\ldots,z_{v}^{n}\right]\in R^{d_{a}\times n}\), where d_{a} is the dimension of the augmented feature. Following the basic idea of feature learning, we represent each original datum \(z_{v}^{j}\) by its low-dimensional embedding \(p_{v}^{j}\in R^{d_{e}}\), where d_{e} is the dimensionality of the embedding. As a result, the embedding of Z_{v} can be denoted as \(P_{v}=\left[p_{v}^{1},p_{v}^{2},\ldots,p_{v}^{n}\right]\in R^{d_{e}\times n}\).
The proposed framework of JCAEL
We first show how to exploit the knowledge from labeled videos. To achieve this objective, learning algorithms usually use the labeled training videos \(\left\{\left(x_{v}^{j},y_{v}^{j}\right)\right\}_{j=1}^{n_{l}}\) to learn a prediction function f that correlates X_{v} with Y_{v}. A common approach is to minimize the following regularized empirical error:
\[ \min_{f}\ \sum_{j=1}^{n_{l}} loss\left(f\left(x_{v}^{j}\right),y_{v}^{j}\right)+\alpha\,\Omega(f) \qquad (1) \]
where loss(.) is the loss function and αΩ(f) is the regularization with α as its parameter.
It has been shown in [19] that the least square loss function achieves comparable or better performance than other loss functions, such as the hinge loss; consequently, we use the least square loss in our algorithm. The ℓ_{2,1}-norm regularized feature selection algorithms [20, 21] use the ℓ_{2,1}-norm to control the classifiers' capacity and to ensure that the learned matrices are sparse in rows, which makes the ℓ_{2,1}-norm particularly suitable for feature selection. Therefore, we use the ℓ_{2,1}-norm to define the regularization, and thus, Eq. (1) can be written as
\[ \min_{W_{v}}\ \left\| X_{v}^{T}W_{v}-Y_{v}^{T}\right\|_{F}^{2}+\alpha\left\| W_{v}\right\|_{2,1} \qquad (2) \]
where \(W_{v}\in R^{d_{s}\times c_{v}}\) is the transformation matrix of the labeled videos with respect to the still feature, ∥.∥_{F} denotes the Frobenius norm of a matrix, and α is the regularization parameter. As indicated in [1, 22], the ℓ_{2,1}-norm of W_{v} is defined as \(\left\| W_{v}\right\|_{2,1}=\sum\limits_{j=1}^{d_{s}}\sqrt{\sum\limits_{k=1}^{c_{v}}\left(W_{v}^{jk}\right)^{2}}\), where \(W_{v}^{jk}\) is the element in the jth row and the kth column of W_{v}. When minimizing the ℓ_{2,1}-norm of W_{v}, some rows of W_{v} shrink to zero, making W_{v} particularly suitable for feature selection.
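As a small illustration of the definition above, the ℓ_{2,1}-norm can be computed as the sum of the Euclidean norms of the rows of a matrix. The following NumPy sketch is only an illustration; the function name `l21_norm` and the demo matrix are hypothetical and do not come from the original implementation.

```python
import numpy as np

def l21_norm(W):
    """l2,1-norm: sum over rows of the Euclidean norm of each row."""
    return float(np.sqrt((W ** 2).sum(axis=1)).sum())

# Hypothetical 3 x 2 matrix: the rows have Euclidean norms 5, 0 and 1.
W_demo = np.array([[3.0, 4.0],
                   [0.0, 0.0],
                   [1.0, 0.0]])
l21_demo = l21_norm(W_demo)
```

A row that is entirely zero, such as the second row here, contributes nothing to the norm; this corresponds to a feature that is discarded.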
Now, we show how to exploit the knowledge from labeled images. The fundamental step is to obtain the correlation between the images X_{i} and the labels Y_{i}. Similar to Eq. (2), we achieve that by the following objective function:
\[ \min_{W_{i}}\ \left\| X_{i}^{T}W_{i}-Y_{i}^{T}\right\|_{F}^{2}+\alpha\left\| W_{i}\right\|_{2,1} \qquad (3) \]
where \(W_{i}\in R^{d_{s}\times c_{i}}\) is the transformation matrix of the labeled images with respect to the still feature. When the images and videos share relevant knowledge, we can learn some shared components. Taking the semantics "playing violin" as an example, we may learn shared components about the object "violin," the human action "playing," and human appearance from both videos and images. To adapt the shared information of feature selection from images to videos, we use ∥W∥_{2,1} to uncover the common information shared by W_{v} and W_{i}, where W= [W_{v},W_{i}]. By minimizing ∥W∥_{2,1}, we obtain sparse rows of W and uncover the common irrelevant or noisy components in both W_{v} and W_{i}. To this end, we propose the following objective function:
\[ \min_{W_{v},W_{i}}\ \left\| X_{v}^{T}W_{v}-Y_{v}^{T}\right\|_{F}^{2}+\left\| X_{i}^{T}W_{i}-Y_{i}^{T}\right\|_{F}^{2}+\alpha\left(\left\| W_{v}\right\|_{2,1}+\left\| W_{i}\right\|_{2,1}\right)+\lambda\left\| W\right\|_{2,1} \qquad (4) \]
where λ is the regularization parameter.
To fully exploit both the labeled and the unlabeled videos with respect to the augmented feature representation, we now add the unsupervised subspace learning into Eq. (4). As shown in [23], the graph Laplacian performs well in unsupervised feature learning, so we use it to characterize the manifold structure among the labeled and unlabeled videos. We first construct the similarity matrix S, whose entry for the points \(z_{v}^{i}\) and \(z_{v}^{j}\) is \(S_{ij}=\exp\left(-\frac{\left\| z_{v}^{i}-z_{v}^{j}\right\|_{2}^{2}}{\delta}\right)\) if \(z_{v}^{j}\in \mathcal{N}\left(z_{v}^{i}\right)\) or \(z_{v}^{i}\in \mathcal{N}\left(z_{v}^{j}\right)\), and S_{ij}=0 otherwise, where δ is the width parameter and \(\mathcal{N}\left(z_{v}^{t}\right)\) is the k-nearest neighborhood set of \(z_{v}^{t}\). As a result, the unsupervised subspace learning can be described as:
\[ \min_{W_{z},P_{v}}\ tr\left(P_{v}LP_{v}^{T}\right)+\left\| W_{z}^{T}Z_{v}-P_{v}\right\|_{F}^{2}\quad s.t.\ P_{v}P_{v}^{T}=I_{d_{e}\times d_{e}} \qquad (5) \]
where \(p_{v}^{i}\in R^{d_{e}}\) is the low-dimensional embedding of the original data \(z_{v}^{i}\), d_{e} is the dimensionality of the embedding, \(I_{d_{e}\times d_{e}}\) is the identity matrix, \(W_{z}\in R^{d_{a}\times d_{e}}\) is the transformation matrix of the videos with respect to the augmented feature, L=(I_{n×n}−S)^{T}(I_{n×n}−S) is the graph Laplacian, \(P_{v}=\left[p_{v}^{1},p_{v}^{2},\ldots,p_{v}^{n}\right]\) is the embedding matrix, and tr(.) is the trace operator. In Eq. (5), the most valuable information is retained and the feature redundancies are eliminated by using the low-dimensional embedding \(p_{v}^{i}\) to represent the original data \(z_{v}^{i}\). To achieve feature selection, we use ∥W_{z}∥_{2,1} as the regularization term of Eq. (5). Therefore, the feature selection for unsupervised subspace learning can be written as:
\[ \min_{W_{z},P_{v}}\ tr\left(P_{v}LP_{v}^{T}\right)+\left\| W_{z}^{T}Z_{v}-P_{v}\right\|_{F}^{2}+\alpha\left\| W_{z}\right\|_{2,1}\quad s.t.\ P_{v}P_{v}^{T}=I_{d_{e}\times d_{e}} \qquad (6) \]
As the augmented feature is the combination of the still feature and the motion feature, the still feature representation is a part of the augmented feature representation. Since the still feature representation does not contain motion features, we set \(X_{v}=[X_{v};0]\in R^{d_{a}\times n_{l}}\) and \(X_{i}=[X_{i};0]\in R^{d_{a}\times n_{i}}\), so that W_{v} and W_{i} do not affect the loss on W_{z}. In addition, we set W= [W_{v},W_{i},W_{z}] and integrate the unsupervised subspace learning in Eq. (6) into the knowledge adaptation in Eq. (4). Finally, we arrive at the whole framework of JCAEL as follows:
\[ \min_{W_{v},W_{i},W_{z},P_{v}}\ \left\| X_{v}^{T}W_{v}-Y_{v}^{T}\right\|_{F}^{2}+\left\| X_{i}^{T}W_{i}-Y_{i}^{T}\right\|_{F}^{2}+tr\left(P_{v}LP_{v}^{T}\right)+\left\| W_{z}^{T}Z_{v}-P_{v}\right\|_{F}^{2}+\alpha\left(\left\| W_{v}\right\|_{2,1}+\left\| W_{i}\right\|_{2,1}+\left\| W_{z}\right\|_{2,1}\right)+\lambda\left\| W\right\|_{2,1}\quad s.t.\ P_{v}P_{v}^{T}=I_{d_{e}\times d_{e}} \qquad (7) \]
In Eq. (7), with the term ∥W∥_{2,1}, our algorithm is able to evaluate the informativeness of the features jointly for both knowledge adaptation and low dimensional embedding. Our algorithm further enables different feature selection functions to share the common components/knowledge across knowledge adaptation and low dimensional embedding. In this way, the information from knowledge adaptation and low dimensional embedding can be transferred from one domain to the other. On the other hand, ∥W∥_{2,1} enables W_{v}, W_{i}, and W_{z} to have the same sparse patterns and share the common components, which can result in an optimal W for feature selection. Since there are four parameters (i.e., W_{i}, W_{z}, P_{v}, and W_{v}) to be estimated in Eq. (7), the objective function in Eq. (7) is not jointly convex with respect to the four parameters, but it is convex with respect to one parameter when we fix the other parameters. Thus, we propose an alternating optimization algorithm [24] to solve the optimization problem of JCAEL.
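The similarity matrix S and the matrix L=(I−S)^{T}(I−S) used in the subspace learning term can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: a heat kernel with a negative exponent, the symmetric "or" neighborhood rule described in the text, and hypothetical function names (`knn_similarity`, `graph_matrix`) and demo sizes that are not taken from the original implementation.

```python
import numpy as np

def knn_similarity(Z, k=10, delta=1.0):
    """Z is d_a x n with one video per column; returns a symmetric n x n matrix S."""
    n = Z.shape[1]
    sq = (Z ** 2).sum(axis=0)
    # Pairwise squared Euclidean distances, clipped at zero against round-off.
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z), 0.0)
    np.fill_diagonal(dist2, np.inf)            # a point is not its own neighbor
    nbrs = np.argsort(dist2, axis=1)[:, :k]    # indices of the k nearest neighbors
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), nbrs.shape[1]), nbrs.ravel()] = True
    mask = mask | mask.T                       # keep the pair if either is a neighbor of the other
    S = np.zeros((n, n))
    S[mask] = np.exp(-dist2[mask] / delta)
    return S

def graph_matrix(S):
    """L = (I - S)^T (I - S), as defined in the text."""
    I = np.eye(S.shape[0])
    return (I - S).T @ (I - S)

# Hypothetical toy data: 12 videos with 5-dimensional augmented features.
rng = np.random.default_rng(0)
Z_toy = rng.normal(size=(5, 12))
S_toy = knn_similarity(Z_toy, k=3, delta=10.0)
L_toy = graph_matrix(S_toy)
```

By construction, L is symmetric and positive semidefinite, which is what the later eigendecomposition step relies on.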
Optimization
In this section, we introduce an optimization algorithm for the objective function in Eq. (7). As there exist a number of variables to be estimated, we propose an alternating optimization algorithm to solve the optimization problem in Eq. (7). Denote \(W_{v}=\left [w_{v}^{1};w_{v}^{2};\ldots w_{v}^{d_{a}}\right ]\), \(W_{i}=\left [w_{i}^{1};w_{i}^{2};\ldots w_{i}^{d_{a}}\right ]\), \(W_{z}=\left [w_{z}^{1};w_{z}^{2};\ldots w_{z}^{d_{a}}\right ],\) and \(W=\left [w^{1};w^{2};\ldots w^{d_{a}}\right ]\), where d_{a} is the number of features.
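(1) By fixing W_{i}, W_{z}, and P_{v} and optimizing W_{v}, the objective function in Eq. (7) can be rewritten as:
\[ \min_{W_{v}}\ \left\| X_{v}^{T}W_{v}-Y_{v}^{T}\right\|_{F}^{2}+\alpha\left\| W_{v}\right\|_{2,1}+\lambda\left\| W\right\|_{2,1} \qquad (8) \]
According to [25], Eq. (8) is equivalent to
\[ \min_{W_{v}}\ \left\| X_{v}^{T}W_{v}-Y_{v}^{T}\right\|_{F}^{2}+\alpha\,tr\left(W_{v}^{T}D_{v}W_{v}\right)+\lambda\,tr\left(W_{v}^{T}DW_{v}\right) \qquad (9) \]
where D_{v} and D are diagonal matrices whose diagonal elements \(d_{v}^{kk}\) and d^{kk} (k=1,2,…,d_{a}) are respectively defined as \(d_{v}^{kk}=\frac{1}{2\left\| w_{v}^{k}\right\|_{2}}\) and \(d^{kk}=\frac{1}{2\left\| w^{k}\right\|_{2}}\). By setting the derivative of Eq. (9) w.r.t. W_{v} to 0, we have
\[ X_{v}X_{v}^{T}W_{v}-X_{v}Y_{v}^{T}+\alpha D_{v}W_{v}+\lambda DW_{v}=0 \qquad (10) \]
Therefore, W_{v} can be derived by:
\[ W_{v}=\left(X_{v}X_{v}^{T}+\alpha D_{v}+\lambda D\right)^{-1}X_{v}Y_{v}^{T} \qquad (11) \]
(2) Similarly, by fixing W_{v}, W_{z}, and P_{v} and optimizing W_{i}, the objective function in Eq. (7) can be rewritten as:
\[ \min_{W_{i}}\ \left\| X_{i}^{T}W_{i}-Y_{i}^{T}\right\|_{F}^{2}+\alpha\left\| W_{i}\right\|_{2,1}+\lambda\left\| W\right\|_{2,1} \qquad (12) \]
Similar to Eq. (8), we first denote D_{i} as a diagonal matrix whose diagonal elements \(d_{i}^{kk}\) (k=1,2,…,d_{a}) are defined as \(d_{i}^{kk}=\frac{1}{2\left\| w_{i}^{k}\right\|_{2}}\). Then, Eq. (12) can be rewritten as
\[ \min_{W_{i}}\ \left\| X_{i}^{T}W_{i}-Y_{i}^{T}\right\|_{F}^{2}+\alpha\,tr\left(W_{i}^{T}D_{i}W_{i}\right)+\lambda\,tr\left(W_{i}^{T}DW_{i}\right) \qquad (13) \]
By setting the derivative of Eq. (13) w.r.t. W_{i} to 0, we have
\[ X_{i}X_{i}^{T}W_{i}-X_{i}Y_{i}^{T}+\alpha D_{i}W_{i}+\lambda DW_{i}=0 \qquad (14) \]
Therefore, W_{i} can be optimally determined as:
\[ W_{i}=\left(X_{i}X_{i}^{T}+\alpha D_{i}+\lambda D\right)^{-1}X_{i}Y_{i}^{T} \qquad (15) \]
(3) By fixing W_{v}, W_{i}, and P_{v} and optimizing W_{z}, the objective function in Eq. (7) can be rewritten as:
\[ \min_{W_{z}}\ \left\| W_{z}^{T}Z_{v}-P_{v}\right\|_{F}^{2}+\alpha\left\| W_{z}\right\|_{2,1}+\lambda\left\| W\right\|_{2,1} \qquad (16) \]
Similar to Eq. (8), we first denote D_{z} as a diagonal matrix whose diagonal elements \(d_{z}^{kk}\) (k=1,2,…,d_{a}) are defined as \(d_{z}^{kk}=\frac{1}{2\left\| w_{z}^{k}\right\|_{2}}\). Then, Eq. (16) can be rewritten as
\[ \min_{W_{z}}\ \left\| W_{z}^{T}Z_{v}-P_{v}\right\|_{F}^{2}+\alpha\,tr\left(W_{z}^{T}D_{z}W_{z}\right)+\lambda\,tr\left(W_{z}^{T}DW_{z}\right) \qquad (17) \]
By setting the derivative of Eq. (17) w.r.t. W_{z} to 0, we have
\[ Z_{v}Z_{v}^{T}W_{z}-Z_{v}P_{v}^{T}+\alpha D_{z}W_{z}+\lambda DW_{z}=0 \qquad (18) \]
Therefore, W_{z} can be optimally determined as:
\[ W_{z}=\left(Z_{v}Z_{v}^{T}+\alpha D_{z}+\lambda D\right)^{-1}Z_{v}P_{v}^{T} \qquad (19) \]
(4) By fixing W_{v} and W_{i} and substituting W_{z} of Eq. (19) into Eq. (7), we optimize P_{v}. Denoting \(A=Z_{v}Z_{v}^{T}+\alpha D_{z}+\lambda D\), so that \(W_{z}=A^{-1}Z_{v}P_{v}^{T}\), the objective function in Eq. (7) can be rewritten as:
\[ \min_{P_{v}}\ tr\left(P_{v}LP_{v}^{T}\right)+tr\left(P_{v}\left(I_{n\times n}-Z_{v}^{T}A^{-1}Z_{v}\right)P_{v}^{T}\right) \qquad (20) \]
Considering the objective function in Eq. (20) and the constraint \(P_{v}P_{v}^{T}=I_{d_{e}\times d_{e}}\), the optimization problem becomes
\[ \min_{P_{v}}\ tr\left(P_{v}\left(L+I_{n\times n}-Z_{v}^{T}A^{-1}Z_{v}\right)P_{v}^{T}\right)\quad s.t.\ P_{v}P_{v}^{T}=I_{d_{e}\times d_{e}} \qquad (21) \]
If A and L are fixed, the optimization problem in Eq. (21) can be solved by eigendecomposition of the matrix \(\left(L+I_{n\times n}-Z_{v}^{T}A^{-1}Z_{v}\right)\). We pick the eigenvectors corresponding to the d_{e} smallest eigenvalues as the rows of P_{v}.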
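The eigendecomposition step for P_{v} can be sketched as follows. This is only an illustrative sketch: it assumes the matrix to decompose is L+I−Z_{v}^{T}A^{−1}Z_{v} with A as defined in step (4), and the function name `embedding_step` and the random demo inputs (including the stand-in for αD_{z}+λD) are hypothetical.

```python
import numpy as np

def embedding_step(L, Z, A, d_e):
    """Return P (d_e x n) whose rows are the eigenvectors of
    M = L + I - Z^T A^{-1} Z belonging to the d_e smallest eigenvalues."""
    n = Z.shape[1]
    M = L + np.eye(n) - Z.T @ np.linalg.solve(A, Z)
    M = 0.5 * (M + M.T)                 # symmetrize against round-off
    _, vecs = np.linalg.eigh(M)         # eigh returns eigenvalues in ascending order
    return vecs[:, :d_e].T

# Hypothetical demo inputs: 10 videos, 6 augmented features.
rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(6, 10))
B_demo = rng.normal(size=(10, 10))
L_demo = B_demo.T @ B_demo                          # any symmetric PSD stand-in for L
A_demo = Z_demo @ Z_demo.T + 0.5 * np.eye(6)        # stand-in for Z Z^T + alpha*D_z + lambda*D
P_demo = embedding_step(L_demo, Z_demo, A_demo, d_e=3)
```

Because the eigenvectors returned by `numpy.linalg.eigh` are orthonormal, the constraint P_{v}P_{v}^{T}=I is satisfied automatically.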
Based on the above derivation, we propose an alternating algorithm to optimize the objective function in Eq. (7), which is summarized in Algorithm 1. Once W is obtained, we sort the d_{a} features according to ∥w^{k}∥_{F} (k=1,2,…,d_{a}) in descending order and select the top-ranked ones.
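Since Algorithm 1 itself is not reproduced in the text, the alternating scheme can only be sketched here under assumptions. The sketch below assumes the closed-form updates W=(XX^{T}+αD_{*}+λD)^{−1}XY^{T} for W_{v} and W_{i}, the eigendecomposition step for P_{v}, and W_{z}=A^{−1}Z_{v}P_{v}^{T}; the update order, the random initialization, the small constant `eps` that guards against zero rows, and all function names are hypothetical choices rather than the authors' implementation.

```python
import numpy as np

def row_norms(W, eps=1e-8):
    # eps keeps the reweighting matrices finite when a row shrinks to zero.
    return np.sqrt((W ** 2).sum(axis=1)) + eps

def jcael_sketch(Xv, Yv, Xi, Yi, Z, L, d_e, alpha=1.0, lam=1.0, n_iter=15):
    """Xv: d_a x n_l, Yv: c_v x n_l, Xi: d_a x n_i, Yi: c_i x n_i, Z: d_a x n, L: n x n.
    Xv and Xi are assumed to be zero-padded from d_s to d_a rows."""
    d_a, n = Z.shape
    rng = np.random.default_rng(0)
    Wv = rng.normal(size=(d_a, Yv.shape[0]))
    Wi = rng.normal(size=(d_a, Yi.shape[0]))
    Wz = rng.normal(size=(d_a, d_e))
    for _ in range(n_iter):
        # Diagonal reweighting matrices built from the current row norms.
        D = np.diag(0.5 / row_norms(np.hstack([Wv, Wi, Wz])))
        Dv = np.diag(0.5 / row_norms(Wv))
        Di = np.diag(0.5 / row_norms(Wi))
        Dz = np.diag(0.5 / row_norms(Wz))
        # Closed-form updates for the two supervised transformation matrices.
        Wv = np.linalg.solve(Xv @ Xv.T + alpha * Dv + lam * D, Xv @ Yv.T)
        Wi = np.linalg.solve(Xi @ Xi.T + alpha * Di + lam * D, Xi @ Yi.T)
        # Embedding from the d_e smallest eigenvalues, then W_z from it.
        A = Z @ Z.T + alpha * Dz + lam * D
        M = L + np.eye(n) - Z.T @ np.linalg.solve(A, Z)
        _, vecs = np.linalg.eigh(0.5 * (M + M.T))
        P = vecs[:, :d_e].T
        Wz = np.linalg.solve(A, Z @ P.T)
    return np.hstack([Wv, Wi, Wz]), P

def top_features(W, k):
    """Indices of the k rows of W with the largest Euclidean norm."""
    scores = np.sqrt((W ** 2).sum(axis=1))
    return np.argsort(-scores)[:k]

# Hypothetical synthetic data: 4 still + 3 motion dimensions, 2 classes.
rng = np.random.default_rng(1)
d_s, d_m, n_l, n_i, n = 4, 3, 8, 10, 20
pad = lambda X: np.vstack([X, np.zeros((d_m, X.shape[1]))])
Xv = pad(rng.normal(size=(d_s, n_l)))
Xi = pad(rng.normal(size=(d_s, n_i)))
Yv = np.eye(2)[:, rng.integers(0, 2, n_l)]
Yi = np.eye(2)[:, rng.integers(0, 2, n_i)]
Z = rng.normal(size=(d_s + d_m, n))
B = 0.1 * rng.normal(size=(n, n))
W, P = jcael_sketch(Xv, Yv, Xi, Yi, Z, B.T @ B, d_e=2)
selected = top_features(W, 3)
```

Given the reweighting matrices, the subproblems for W_{v}, W_{i}, and the pair (W_{z}, P_{v}) decouple in this sketch, so each pass solves them independently before the matrices are refreshed.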
Convergence and computational complexity
Convergence
In this section, we theoretically show that Algorithm 1 proposed in this paper converges. We begin with the following lemma [22].
Lemma 1
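For any nonzero vectors w and \(\widehat w\), the following inequality holds:
\[ \left\| \widehat w\right\|_{2}-\frac{\left\| \widehat w\right\|_{2}^{2}}{2\left\| w\right\|_{2}}\le \left\| w\right\|_{2}-\frac{\left\| w\right\|_{2}^{2}}{2\left\| w\right\|_{2}} \qquad (22) \]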
As a result, the second lemma can be derived as described below.
Lemma 2
By fixing W_{i} and W_{v}, we obtain the global solutions for W_{z} and P_{v} in Eq. (7). Likewise, by fixing W_{i}, W_{z}, and P_{v}, we obtain the global solution for W_{v} in Eq. (7). In the same manner, by fixing W_{v}, W_{z}, and P_{v}, we obtain the global solution for W_{i} in Eq. (7).
Proof
When W_{i} and W_{v} are fixed, the optimization problem in Eq. (7) is equivalent to the problems described in Eq. (17) and Eq. (21). We can solve the convex optimization problem with respect to W_{z} by setting the derivative of Eq. (17) to zero. Further, we can derive the global solution for P_{v} by solving the eigendecomposition problem with respect to P_{v}. When W_{z}, P_{v}, and W_{i} are fixed, the optimization problem in Eq. (7) is equivalent to the problem described in Eq. (9). We can solve the convex optimization problem with respect to W_{v} by setting the derivative of Eq. (9) to zero. Thus, we derive the global solution for W_{v} in Eq. (7), provided that W_{z}, P_{v}, and W_{i} are fixed. Similarly, we can derive the global solution for W_{i} when W_{v}, W_{z}, and P_{v} are fixed. □
Theorem 1
The proposed algorithm monotonically decreases the objective function value of Eq. (7) in each iteration. Next, we prove Theorem 1 as follows.
Proof
Let \(\widehat W_{v}\), \(\widehat W_{i}\), \(\widehat P_{v}\), and \(\widehat W_{z}\) denote the updated W_{v}, W_{i}, P_{v}, and W_{z}, respectively. The loop to update W_{v}, W_{i}, P_{v}, and W_{z} in the proposed algorithm corresponds to the optimal W_{v}, W_{i}, P_{v}, and W_{z} of the following problem:
Since \(\left\| W\right\|_{2,1}=\sum_{k=1}^{d_{a}}\left\| w^{k}\right\|_{2}\) [26], according to Lemma 2, we can obtain:
Then, we have the following inequality:
According to Lemma 1, another inequality can be established as follows:
□
This indicates that, with the updating rule in the proposed algorithm, the objective function value of Eq. (7) monotonically decreases until convergence is reached.
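The inequality of Lemma 1, on which this argument rests, can be checked numerically. The sketch below assumes the standard form of the lemma from the ℓ_{2,1}-norm literature, ∥ŵ∥_{2} − ∥ŵ∥_{2}^{2}/(2∥w∥_{2}) ≤ ∥w∥_{2} − ∥w∥_{2}^{2}/(2∥w∥_{2}), and uses hypothetical random vectors; it is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    w = rng.normal(size=5)        # current row of W
    w_hat = rng.normal(size=5)    # updated row of W
    a, b = np.linalg.norm(w_hat), np.linalg.norm(w)
    lhs = a - a ** 2 / (2 * b)
    rhs = b - b ** 2 / (2 * b)
    gaps.append(rhs - lhs)        # algebraically equal to (b - a)^2 / (2 b) >= 0
min_gap = min(gaps)
```

The gap equals (∥w∥_{2} − ∥ŵ∥_{2})^{2}/(2∥w∥_{2}), which is nonnegative, so the smallest observed gap should be zero up to floating-point round-off.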
Computational complexity
For the computational complexity of Algorithm 1, computing the graph Laplacian matrix L is O(n^{2}). During training, learning W_{v}, W_{i}, and W_{z} involves calculating the inverses of a number of matrices, the most expensive of which costs \(O\left(d_{a}^{3}\right)\). To optimize P_{v}, the most time-consuming operation is the eigendecomposition of the matrix \(ED=\left(L+I_{n\times n}-Z_{v}^{T}A^{-1}Z_{v}\right)\). Note that ED∈R^{n×n}. The time complexity of this operation is approximately O(n^{3}). Thus, the computational complexity of JCAEL is \(\max\left\{O\left(t\times n^{3}\right),O\left(t\times d_{a}^{3}\right)\right\}\), where t is the number of iterations required for convergence. From the experiments, we observe that the algorithm converges within 10 to 15 iterations, which indicates that our proposed algorithm is efficient in feature selection for video semantics recognition.
Experimental results and discussion
In this section, we present video semantic recognition experiments that evaluate the performance of our jointing cross-media analysis and embedded learning (JCAEL) for feature selection.
Experimental datasets
In order to evaluate the contribution of cross-media analysis, we construct three pairs of video and image datasets: HMDB13 (video dataset) ← "Extensive Images Databases" (EID, image dataset), UCF10 (video dataset) ← Actions Images Databases (AID, image dataset), and UCF (video dataset) ← PPMI4 (image dataset), where "←" denotes the direction of adaptation from images to videos. The videos and images of HMDB13 ←EID and UCF10 ←AID have the same semantic classes, whereas UCF ←PPMI4 has different semantic classes for videos and images.
HMDB13 ←EID
The HMDB51 dataset [27] is collected from a variety of sources ranging from digitized movies to YouTube videos. It contains 6766 video sequences that are categorized into 51 classes. This dataset contains simple facial actions, general body movements, and human interactions. In order to increase the number of overlapping classes, we select 13 overlapping classes between HMDB51 and an image dataset that we call Extensive Images Databases (EID), which includes two open benchmark datasets (i.e., Stanford40 [28] and Still DB [29]). We call the resulting video dataset "HMDB13." Table 1 provides the details of the overlapping classes from EID to HMDB51.
UCF10 ←AID
UCF101 [30] is a dataset of realistic action videos collected from YouTube, which has 101 action categories. It offers a large diversity of actions, with large variations in subject appearance, including scale and pose, related objects, cluttered background, and illumination conditions. Such a challenging diversity is suitable for verifying the effect of information learned from images on video semantics recognition. To further evaluate whether images coming from various sources contribute to the feature selection, we select ten overlapping classes between UCF101 and an action image dataset, referred to as Actions Images Databases (AID), which includes four open benchmark datasets (i.e., action DB [31], PPMI [32], Willow-actions [33], and Still DB). For convenience of experiment design and description, we call the video dataset UCF10. Table 2 shows the chosen categories of UCF10 ←AID, which are taken as the video dataset and the image dataset, respectively.
UCF ←PPMI4
The PPMI dataset [32] consists of 7 different musical instruments: bassoon, erhu, flute, French horn, guitar, saxophone, and violin. In order to assess the performance of the proposed algorithm when the image dataset has different classes from those in the video dataset, we choose ten classes from UCF101 and then select four overlapping image categories from PPMI. We call the video dataset UCF and the image dataset PPMI4. Table 3 summarizes the selected classes of UCF ←PPMI4.
Experiment setup
For all the datasets, we select 30 images from each overlapping category for knowledge adaptation, as the number of images is relatively small. We sample videos for labeled training data and take the remaining videos as the testing data. To evaluate the contribution of unsupervised subspace learning, we conduct experiments to study the performance when only a few labeled training samples are provided, with the ratio of labeled video data set to 5%. For each dataset, we repeat the sampling 10 times and report the average results. We extract SIFT features [16, 34] from the key frames of videos and from images. The STIP features [17] are extracted from videos. We use the standard Bag-of-Words (BoW) method [35, 36] to generate the BoW representations of SIFT and STIP features, where the number of visual words is set to 600. For videos, we obtain a still feature with 600 dimensions and an augmented feature with 1200 dimensions, and for images, we obtain a still feature with 600 dimensions.
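The feature construction described above (a 600-word BoW histogram per descriptor type, concatenated into a 1200-dimensional augmented feature) can be sketched as follows. Everything beyond the 600 visual words is an assumption made for illustration: the descriptor dimensions (128 for SIFT, 162 for STIP), the random stand-in codebooks and descriptors, the L1 normalization, and the function name `bow_histogram`.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """descriptors: m x d, codebook: K x d. Returns an L1-normalized K-bin histogram."""
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          + (codebook ** 2).sum(axis=1)[None, :]
          - 2.0 * descriptors @ codebook.T)        # squared distances to every visual word
    words = d2.argmin(axis=1)                      # nearest visual word per descriptor
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
sift_codebook = rng.normal(size=(600, 128))        # stand-in for a learned SIFT codebook
stip_codebook = rng.normal(size=(600, 162))        # stand-in for a learned STIP codebook
sift_desc = rng.normal(size=(200, 128))            # descriptors from the key frames of one video
stip_desc = rng.normal(size=(80, 162))             # descriptors from the same video

still_feature = bow_histogram(sift_desc, sift_codebook)
motion_feature = bow_histogram(stip_desc, stip_codebook)
augmented_feature = np.concatenate([still_feature, motion_feature])
```

An image would only receive the 600-dimensional still feature, which is why the still features are zero-padded to 1200 dimensions in the joint framework.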
Comparison algorithms
To benchmark our proposed jointing cross-media analysis and embedded learning (JCAEL), we select a number of representative state-of-the-art methods for performance comparison, which are described below:

Full features (FF): adopts all the features for classification. It is used as the baseline method in this paper.

Fisher score feature selection (FSFS) [37]: a supervised feature selection method that depends on fully labeled training data to select the features with the best discriminating ability.

Feature selection via joint ℓ_{2,1}-norms minimization (FSNM) [22]: a supervised feature selection method that employs joint ℓ_{2,1}-norm minimization on both the loss function and the regularization to realize feature selection across all data points.

ℓ_{2,1}-norm least square regression (LSR_{21}) [22]: a supervised feature selection method built upon least square regression by using the ℓ_{2,1}-norm as the regularization term.

Multiclass ℓ_{2,1}-norm support vector machine (SVM_{21}) [20]: a supervised feature selection method built upon SVM by using the ℓ_{2,1}-norm as the regularization term.

Ensemble feature selection (EnFS) [25]: a supervised feature selection method based on transfer learning, which transfers the shared information between different classifiers by adding a joint ℓ_{2,1}-norm on multiple feature selection matrices.

Joint embedding learning and sparse regression (JELSR) [26]: an unsupervised feature selection method that uses the local linear approximation weights and ℓ_{2,1}-norm regularization.

Jointing cross-media analysis and embedded learning (JCAEL): our proposed method, which performs feature selection by adapting knowledge from images based on the still feature and utilizing both labeled and unlabeled videos based on the augmented feature.
During training and prediction, we use the augmented feature to represent the videos for the baseline methods, including FSFS, FSNM, LSR_{21}, SVM_{21}, and JELSR, as these methods cannot use the information adapted from images. For EnFS and JCAEL, we use the still features to represent the image data and the augmented feature to represent the videos. To compare the feature selection algorithms fairly, we use a grid-search strategy over {10^{−6},10^{−5},…,10^{5},10^{6}} to tune the parameters of all the compared algorithms. By setting the number of selected features to {120,240,…,1200}, we report the best results obtained from the different parameters. For the k-nearest neighbors used in the Laplacian matrix L, the parameter is set to k=10. In our experiment, each feature selection algorithm is first performed to select features. Then, three classifiers, i.e., linear multiclass SVM (LMCSVM), least square regression (LSR), and multiclass kNN (MCkNN), are applied to the selected features to assess the performance of feature selection. For the least square regression classifier, we learn a threshold from the labeled training data to quantize the continuous label prediction scores to binary values. To measure the feature selection performance, we use the average accuracy (AA) over all semantic classes as the evaluation metric, which is defined as:
\[ AA=\frac{1}{c_{v}}\sum_{k=1}^{c_{v}}acc_{k} \qquad (27) \]
where c_{v} is the number of action classes and acc_{k} is the accuracy for the kth class.
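The average accuracy metric can be illustrated with a short sketch: the accuracy is computed separately for each class and then averaged with equal weight. The function name `average_accuracy`, the toy labels, and the choice to skip classes that do not occur in the test labels are assumptions made for this example.

```python
import numpy as np

def average_accuracy(y_true, y_pred, n_classes):
    """Mean of the per-class accuracies acc_k over the classes present in y_true."""
    accs = []
    for k in range(n_classes):
        members = (y_true == k)
        if members.any():                       # skip classes with no test samples
            accs.append((y_pred[members] == k).mean())
    return float(np.mean(accs))

# Hypothetical predictions for 7 test videos and 3 classes.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2])
aa = average_accuracy(y_true, y_pred, 3)        # per-class accuracies: 0.75, 0.5, 1.0
```

Unlike the overall accuracy (5 of 7 correct in this toy case), AA gives every class the same weight regardless of how many test videos it has.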
Experimental results
In order to evaluate the effectiveness of JCAEL, we compare JCAEL with FF, FSFS, FSNM, LSR_{21}, SVM_{21}, EnFS, and JELSR on both the HMDB13 ←EID and UCF10 ←AID datasets. The comparison results are summarized in Tables 4 and 5, where the best and the second best results are highlighted in bold and italic, respectively. We also conduct a number of experiments to study the performance when the ratio of labeled video data is set to 5%, 10%, 20%, 30%, and 40%, and the results are displayed in Figs. 1 and 2.
From the experimental results in Tables 4–5 and Figs. 1–2, we can make the following observations:
(1) The results of the feature selection algorithms are generally better than those of full features (FF). As classification also becomes much faster when the number of features is reduced, feature selection is valuable in practical applications.
(2) As the number of labeled training videos increases, the performance of all methods improves. This is consistent with the general principle that more information is made available for training.
(3) Classification using multiclass SVM and multiclass kNN achieves better performance than least square regression when the ratio of labeled video data is set to 5%. The main reason is that the threshold learned from the small training set leads to a bias in the quantization of the continuous label prediction scores.
(4) When the ratio of labeled video data is set to 5%, JELSR is generally the second most competitive algorithm. This indicates that incorporating the additional information contained in the unlabeled training data through unsupervised embedded learning is indeed useful.
(5) As shown in Figs. 1–2, the supervised method based on transfer learning (EnFS) achieves better performance than the other compared methods when the number of labeled training videos is sufficient (e.g., when the ratio of labeled video data is set to 40%), since EnFS can uncover common irrelevant features by transferring the relevant information between different classifiers.
(6) As shown in Figs. 1–2, our proposed JCAEL remains the best performing algorithm among the different methods and in the different cases. The main reason is that our method can take advantage of both transfer learning and embedded learning. We can also see from Tables 4–5 that JCAEL achieves the best results when only a small number of labeled training videos are available. This advantage is especially desirable for real-world problems, since precisely annotated videos are often rare.
Experiment on convergence
In this section, we study the convergence of the proposed JCAEL as described in Algorithm 1. Because we solve our objective function with an alternating approach, how fast the algorithm converges is crucial to its overall computational efficiency in practice. Hence, we conduct an experiment to test the convergence of the proposed JCAEL algorithm according to the objective function value in Eq. (7) on both the HMDB13 ←EID and UCF10 ←AID datasets, where the ratio of labeled video data is set to 5%. The results are illustrated as convergence curves, and the results obtained when the ratio of labeled video data is set to 40% are summarized in Fig. 3, where all the parameters involved are fixed at their optimal values. From the results shown in Fig. 3, it can be seen that our algorithm converges within a few iterations. For example, it takes no more than 10 iterations for UCF10 ←AID and no more than 15 iterations for HMDB13 ←EID.
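As a rough illustration of how such a convergence curve can be recorded, the following sketch runs a generic alternating minimization on a toy two-variable objective and logs the objective value after every sweep. The objective, the names `objective` and `alternating_minimize`, and the stopping tolerance are all assumptions made for illustration; they are not Eq. (7) or Algorithm 1.

```python
def objective(x, y):
    """Toy jointly convex objective; a stand-in, NOT Eq. (7) of the paper."""
    return (x - 2.0 * y) ** 2 + (y - 3.0) ** 2 + 0.5 * x ** 2


def alternating_minimize(x=0.0, y=0.0, tol=1e-10, max_iter=100):
    """Minimize `objective` by exactly updating one variable at a time."""
    history = [objective(x, y)]          # objective value before any update
    for _ in range(max_iter):
        x = 4.0 * y / 3.0                # closed-form argmin over x, y fixed
        y = (2.0 * x + 3.0) / 5.0        # closed-form argmin over y, x fixed
        history.append(objective(x, y))
        prev, cur = history[-2], history[-1]
        # Stop once the decrease is negligible relative to the objective.
        if prev - cur <= tol * max(abs(prev), 1.0):
            break
    return x, y, history


x_opt, y_opt, history = alternating_minimize()
for it, value in enumerate(history):
    print(f"iteration {it:2d}: objective = {value:.10f}")
```

Because each update minimizes the objective exactly over one variable, the recorded values never increase; plotting `history` against the iteration index gives a curve of the same kind as a convergence plot.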
Experiment on parameter sensitivity
There are two regularization parameters, α and λ, in Eq. (7). To learn how they affect the performance, we conduct a parameter sensitivity experiment in which LMCSVM is used to classify the videos. We show the results on both HMDB13 ←EID and UCF10 ←AID in Fig. 4, where the ratio of labeled video data is set to 5%. It can be seen that, for HMDB13 ←EID, the performance is sensitive to the two parameters, whereas for UCF10 ←AID the performance does not change much. In general, our proposed method can perform well on these datasets when α and λ are comparable. For example, good performance is obtained when α=0.0001 and λ=100 for HMDB13 ←EID and when α=0.001 and λ=10000 for UCF10 ←AID.
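A sensitivity study of this kind is commonly organized as a grid search over log-spaced parameter values. The sketch below shows one way to do that. The grid range is an assumption, and `dummy_evaluate` is a hypothetical placeholder for the real pipeline (train with a given α and λ, select features, classify with LMCSVM, return the accuracy); its scores are synthetic and do not reproduce any reported result.

```python
import math
from itertools import product


def grid_search(evaluate, alphas, lambdas):
    """Score every (alpha, lambda) pair; return the score table and best pair."""
    scores = {(a, l): evaluate(a, l) for a, l in product(alphas, lambdas)}
    best = max(scores, key=scores.get)
    return scores, best


def dummy_evaluate(alpha, lam):
    """Synthetic score peaking at alpha=1e-4, lam=1e2; NOT a real accuracy."""
    return -abs(math.log10(alpha) + 4.0) - abs(math.log10(lam) - 2.0)


# Assumed log-spaced grid: 1e-4, 1e-2, 1, 1e2, 1e4.
grid = [10.0 ** p for p in range(-4, 5, 2)]
scores, best = grid_search(dummy_evaluate, grid, grid)
print("best (alpha, lambda):", best)
```

Keeping the full `scores` table, rather than only the best pair, is what allows the accuracy to be plotted against both parameters.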
Experiment on selected features
As feature selection is aimed at both accuracy and computational efficiency, we perform an experiment to study how the number of selected features affects the performance. We conduct the experiments on both HMDB13 ←EID and UCF10 ←AID when the ratio of labeled video data is set to 5%. Again, LMCSVM is used to classify the videos, and Fig. 5 shows the performance variation w.r.t. the number of selected features. From the results illustrated in Fig. 5, the following observations can be made: (1) When the number of selected features is too small, the result is not competitive with using all features for video semantic recognition, which could be attributed to the fact that too much information is lost in this case. For instance, when using fewer than 360 features of HMDB13 ←EID, the result is worse than using all features. (2) The results peak when using 720 features for HMDB13 ←EID and 840 features for UCF10 ←AID. The difference between the two datasets is related to the properties of the datasets. (3) When all the features are selected, the results are lower than those obtained with 720 features for HMDB13 ←EID and 840 features for UCF10 ←AID. In conclusion, our method reduces noise, as selecting a subset of the features improves the results on both databases.
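The sweep over the number of selected features can be sketched as follows. In ℓ_{2,1}-norm-based feature selection, features are commonly ranked by the ℓ_2 norm of the corresponding row of a learned projection matrix; the sketch assumes that this ranking rule applies here, which this section does not state explicitly. The matrix `W`, the data `X`, and the function names are toy values and hypothetical helpers, not the authors' implementation.

```python
import math


def rank_features(W):
    """Return feature indices sorted by decreasing l2 norm of the rows of W."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in W]
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)


def select_top_k(X, ranking, k):
    """Keep only the k highest-ranked feature columns of each sample in X."""
    keep = sorted(ranking[:k])  # preserve the original column order
    return [[sample[j] for j in keep] for sample in X]


# Toy projection matrix: 4 features (rows) x 2 classes (columns).
W = [[0.9, -0.4], [0.0, 0.01], [0.3, 0.3], [-1.2, 0.5]]
# Toy data: 2 samples x 4 features.
X = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

ranking = rank_features(W)
for k in (1, 2, 4):
    # In a real study, a classifier would be trained on each reduced X here.
    print(k, select_top_k(X, ranking, k))
```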
Experiment on embedding features
In this section, we investigate the influence of embedding features with different dimensions. We conduct the experiment on both HMDB13 ←EID and UCF10 ←AID when the ratio of labeled video data is set to 5%. With the videos classified by LMCSVM, Fig. 6 shows the performance variation w.r.t. the number of embedding features. From the illustrated results, two observations can be made: (1) The result peaks when using 390 embedding features for HMDB13 ←EID and 10 embedding features for UCF10 ←AID. The difference between the two datasets is related to the properties of the datasets. (2) Without embedded learning, the results are lower than those obtained with 390 embedding features for HMDB13 ←EID and 10 embedding features for UCF10 ←AID, even when all the features are used. In conclusion, our proposed JCAEL can achieve good performance because the most valuable information is retained and the feature redundancies are eliminated in embedded learning.
Influence of crossmedia analysis and embedded learning
To further investigate the effectiveness of the integrated crossmedia analysis and embedded learning, we construct two new algorithms: (1) The embedded learning part (ELP), which is the unsupervised embedded learning part of JCAEL (i.e., Eq. (6)). ELP utilizes both labeled and unlabeled videos as the training dataset and represents each video with the augmented feature. (2) The crossmedia analysis part (CAP), which is the transfer learning part of JCAEL (i.e., Eq. (4)). CAP transfers the knowledge from images to labeled videos and uses only the still feature to do so.
We construct a new experiment to compare JCAEL with ELP and CAP on the UCF ←PPMI4 dataset. The rest of the experimental setup is similar to that described in Section 3.2, and the comparison results are shown in Table 6 and Fig. 7.
From the results presented in Table 6 and Fig. 7, we can make the following observations: (1) Among the different methods and labeled ratios, JCAEL performs best. It achieves the highest accuracy in most cases, especially when only a few labeled training videos are provided. This is mainly because (a) JCAEL benefits from the unsupervised embedded learning, which can utilize both labeled and unlabeled data, (b) JCAEL leverages the knowledge from images to boost its performance, and (c) JCAEL integrates transfer learning and embedded learning into a joint optimization framework, so that the gains from the two parts reinforce each other. (2) The performance of JCAEL is generally better than that of ELP for all the labeled ratios, indicating that JCAEL is able to use the extra knowledge from images to achieve higher accuracy. (3) JCAEL generally outperforms CAP, indicating that it is beneficial to utilize unlabeled videos for video semantic recognition, especially when the number of labeled data is not sufficient.
To show the influence of the knowledge transferred from images, we show the confusion matrices of ELP, CAP, and JCAEL in Fig. 8 for the case where the ratio of labeled video data is set to 5%. Compared with ELP and CAP, JCAEL obtains better results on “playing cello,” “playing flute,” “playing guitar,” and “playing violin.” The main reasons can be highlighted as follows: (1) The extra related semantic knowledge is adapted from images to videos and used to obtain coherent semantics in videos. (2) The unlabeled videos also include more relevant information, which plays a positive role in improving the performance of semantic recognition.
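For readers unfamiliar with confusion matrices, the short sketch below shows how one can be computed from true and predicted labels, along with the per-class accuracy read off its diagonal. The label sequences are made up purely to exercise the functions and are not results from the paper; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


labels = ["cello", "flute", "guitar", "violin"]
# Made-up labels for illustration only; NOT experimental data.
y_true = ["cello", "cello", "flute", "guitar", "violin", "violin"]
y_pred = ["cello", "violin", "flute", "guitar", "violin", "cello"]

cm = confusion_matrix(y_true, y_pred, labels)
print(cm)
print(per_class_accuracy(cm))
```

Off-diagonal entries show which classes are mistaken for one another, which is the information used when comparing methods class by class.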
Conclusions
There are many labeled images and unlabeled videos in the real world. To achieve good performance in video semantic recognition, we propose a new feature selection framework that borrows the knowledge transferred from images to improve its performance. Meanwhile, it utilizes both labeled and unlabeled videos to enhance the performance of semantic recognition in videos. Extensive experiments validate that the knowledge transferred from images and the information contained in unlabeled videos can indeed be used to select more discriminative features, leading to higher accuracy in recognizing the semantics of videos. In comparison with existing state-of-the-art methods, the experimental results show that the proposed JCAEL performs better in video semantic recognition. Even when only a few labeled training videos are available, our proposed JCAEL still performs competitively against all the compared state-of-the-art methods, which gives it a high level of flexibility in real-world applications.
Abbreviations
AA: Average accuracy
BoW: Bag-of-words
EnFS: Ensemble feature selection
FSFS: Fisher score feature selection
FSNM: Feature selection via joint ℓ_{2,1}-norm minimization
JCAEL: Jointing crossmedia analysis and embedded learning
JELSR: Joint embedding learning and sparse regression
LMCSVM: Linear multiclass SVM
LSR: Least square regression
LSR_{21}: ℓ_{2,1}-norm least square regression
MCkNN: Multiclass kNN
PCA: Principal component analysis
SIFT: Scale-invariant feature transform
STIP: Space-time interest points
SVM_{21}: Multiclass ℓ_{2,1}-norm support vector machine
Acknowledgements
Not applicable.
Funding
This work was supported in part by National Science Foundation of China (under Grant No. 61620106008, 61702165, U1509206, 61472276, 61876130). This work was supported in part by the Hebei Provincial Natural Science Foundation, China (under Grant No. F2016111005). This work was supported in part by the Foundation for Talents Program Fostering of Hebei Province (No. A201803025). This work was supported in part by Shenzhen Commission for Scientific Research & Innovations (under Grant No. JCYJ20160226191842793). This work was supported in part by Tianjin Natural Science Foundation (No. 15JCYBJC15400). This work was supported in part by the Project of Hebei Province Higher Educational Science and Technology Research (under Grant No. QN2017513). This work was supported in part by the Research Foundation for Advanced Talents of Hengshui University (under Grant No. 2018GC01).
Availability of data and materials
Please contact the corresponding author for data requests.
Author information
Contributions
All authors took part in the work described in this paper. The team conducted the literature reading and discussion together. JZ designed the proposed algorithm and carried out the mathematical derivation. ZZ collected and preprocessed the image and video datasets; DA, JL, and ZS then programmed together to implement and verify the proposed algorithm. JZ wrote the first version of this paper, and YH and JJ then repeatedly revised the manuscript. All authors participated in the discussion leading to the final submitted manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Jianmin Jiang.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Zhang, J., Han, Y., Jiang, J. et al. A feature selection framework for video semantic recognition via integrated crossmedia analysis and embedded learning. J Image Video Proc. 2019, 44 (2019). doi:10.1186/s13640-019-0428-5
Keywords
 Feature selection
 Crossmedia analysis
 Embedded learning