
A semi-supervised convolutional neural network based on subspace representation for image classification

Abstract

This work presents a shallow network based on subspaces with applications in image classification. Recently, shallow networks based on PCA filter banks have been employed to solve many computer vision-related problems, including texture classification, face recognition, and scene understanding. These approaches are robust, with a straightforward implementation that enables fast prototyping of practical applications. However, these architectures employ either unsupervised or supervised learning. As a result, they may not achieve highly discriminative features in more complicated computer vision problems containing variations in camera motion, object appearance, pose, scale, and texture, due to drawbacks related to each learning paradigm. To cope with this disadvantage, we propose a semi-supervised shallow network equipped with both unsupervised and supervised filter banks, presenting representative and discriminative abilities. Besides, the introduced architecture is flexible, performing favorably on different applications where the amount of supervised data is an issue, making it an attractive choice in practice. The proposed network is evaluated on five datasets. The results show improvement in terms of prediction rate compared with current shallow networks.

1 Introduction

In supervised machine learning, classifiers employ labeled data to create models. However, in many practical situations, labeled data is challenging and expensive to obtain, for example, in real-world remote sensing [1], medical image analysis [2], and facial expression recognition [3]. Besides, the difficulty in finding specialists to label data in certain areas may render projects unfeasible [4, 5]. In contrast, models based on unsupervised learning are generated from unlabeled data which, in some scenarios, is readily available and can be obtained at low cost [6, 7]. For example, meteorological data, such as temperature and pressure, can be obtained inexpensively in environmental preservation projects [8, 9]. In addition, unlabeled images and videos can be obtained from social networks and employed to train unsupervised machine learning models [10, 11].

There is often no consensus on how to employ labeled and unlabeled data in conjunction to improve machine learning models, due to the large imbalance between labeled and unlabeled data [12, 13]. Therefore, most classification methods produce models based only on labeled datasets, neglecting unlabeled data. To solve this problem, the literature offers a class of learning techniques called semi-supervised learning. This class may be categorized as supervised learning, though it also makes use of unlabeled data for training. In general, these techniques employ a large amount of unlabeled data together with a small amount of labeled data. Many studies show that this kind of combination can provide significant enhancement in learning accuracy over unsupervised learning [14, 15].

In the context of image classification, however, the supervised approach is dominant. Image classification is one of the central problems in computer vision, covering a diverse range of applications including human-computer interaction [16, 17], image and video retrieval [18, 19], video surveillance [20, 21], biometrics [22, 23], and analysis of social media networks [24, 25]. In this context, deep learning methods, such as the convolutional neural network (CNN), are currently the state of the art in several applications [26–28].

The literature shows that deep learning [29, 30] has been employed as an alternative to handcrafted features for image classification, such as Gabor features [31] and local binary patterns (LBP) [32, 33] for texture and face classification, and scale-invariant feature transform (SIFT) [34] and histogram of oriented gradients (HOG) [35] features for object recognition [36, 37]. The central concept of deep learning is that all relevant information required for recognizing image patterns can be structured in a hierarchical model, which can be obtained through iterative learning of the training image patterns. When the amount of available data is large enough (e.g., the ImageNet dataset [38]) and there are no computational resource restrictions, deep learning models outperform handcrafted feature-based methods [39, 40].

Despite its success, the number of parameters to be trained in a typical deep learning model is huge, consequently requiring a large amount of training data, which can lead to a high computational cost even when GPU-equipped computational resources are available. As a result, the computational complexity required by most deep learning architectures prevents some computer vision applications from fully employing the capabilities of deep CNNs.

As an alternative, shallow networks have been proposed to exploit the advantageous characteristics of deep learning models while lightening the computational cost associated with their training. Although these networks hold hierarchical structures, their weights are obtained through derivative-free methods, giving them a processing time advantage of several orders of magnitude over traditional deep network models. For instance, in [41], a convolutional neural network with neither pooling layers nor activation functions and without end-to-end learning is proposed. Instead, PCA or LDA is employed to replace the convolutional kernels of a CNN. While presenting a simple architecture, this strategy exhibited performance comparable to the state of the art on several image classification tasks. Other examples of similar solutions include LDA [42] and Gabor and ICA [43].

Even though shallow networks have been successfully applied in various recognition tasks, such methods can only describe either supervised or unsupervised data and are not able to efficiently exploit both. This paper proposes a convolutional shallow network to solve this issue. In contrast to the conventional networks [41, 43], the filter banks employed by the proposed network are produced by both PCA and generalized difference subspaces (GDS) [44, 45], which preserve the discriminative information among different classes, generating more efficient representations.

Accordingly, the proposed network can operate on both labeled and unlabeled data, improving the performance when only small volumes of labeled data are available. This network is called dual flow subspace network (DFSNet), due to its flexibility in handling both learning paradigms. In addition to its advantages, semi-supervised learning is of theoretical interest, since it makes it possible to understand the mechanisms of human learning [46, 47].

Therefore, our work provides the following contributions:

  1. We introduce a new type of filter bank based on GDS. Different from PCA, the filter banks produced by GDS can efficiently handle labeled data.

  2. We introduce a semi-supervised shallow network based on PCA and GDS, presenting a flexible framework.

In summary, the organization of this work is as follows: Section 2 gives a brief review on shallow networks. Then, in Section 3, we develop the proposed semi-supervised neural network for image classification. Section 4 shows the advantages of DFSNet over current shallow networks by experimental results using CIFAR-10 and ETH-80 databases for object recognition, LFW and FERET databases for face recognition, and NYU Depth V1 database for scene recognition. Finally, conclusions and future work are discussed in the last section.

2 Related work

In this section, we provide a brief review of CNN-like shallow networks. This analysis is important to clarify the differences between DFSNet and current methods. In all these examples, the employed techniques can be regarded as CNN-like architectures based on local multistage filter banks [48]. The typical framework of these approaches is shown in Fig. 1. In this framework, the input images are processed by multiple layers, ranging from 2 to 4, followed by feature mapping and classification. In this section, we discuss both supervised and unsupervised shallow networks.

Fig. 1

Conceptual framework of the shallow networks investigated in this work. First, the input image is pre-processed by mean-removal or z-normalization. Then, the normalized image is processed by convolutional layers obtained by the reshaping of PCA or LDA basis vectors. The convolutional layers are obtained from either unsupervised or supervised approach. After that, a feature mapping strategy is applied, which consists of binarization and block-wise histogramming. Finally, classification is performed by KNN or SVM

PCANet [41] is an unsupervised shallow network based on CNN, where multistage filter banks are learned from the data as principal components at the local image patch level. In PCANet, the eigenvectors of the local patch covariance matrix are employed as filter banks for convolution and feature extraction, followed by binarization and block-wise histogramming. This straightforward shallow network works well in a variety of image classification benchmarks, including handwritten and face recognition, achieving performance comparable to the state-of-the-art.

PCANet has been chosen as the main framework for several applications, including personal identification from ECG signal [49], traffic light recognition [50], remote sensing [51], medical image analysis [52], and automatic ship detection [53]. LDANet follows the same strategy used by PCANet and employs a similar architecture, with the difference that the filter banks used for convolution are obtained through the LDA basis vectors.

DCTNet [48] is an alternative to PCANet that employs the discrete cosine transform (DCT) instead of PCA to build its filter banks. Since the 2D DCT basis is independent of the data, DCTNet is a learning-free framework with low computational complexity in its learning stages. DCTNet has been widely applied to several face recognition benchmarks and has shown performance equivalent or superior to PCANet.

The canonical correlation analysis network (CCANet) was introduced in [54], inspired by the flexibility and accuracy of the wavelet scattering network (ScatNet) [55–57] and PCANet. It is also an unsupervised shallow network. However, unlike ScatNet and PCANet, CCANet can handle images that are represented by two-view features, introducing more flexibility to the framework. Besides, CCANet produces its convolutional kernels by maximizing the correlation of the projected two-view variables. Therefore, the weights can reflect more discriminative information about the same object compared to PCANet and LDANet. The advantages of CCANet are as follows. First, CCANet can concurrently extract two-view features of a single image, which is assumed to minimize intra-class variance. Second is the reduced number of convolutional stages in comparison to similar shallow networks. Also, as in PCANet and LDANet, CCANet does not require the backpropagation algorithm to fine-tune its parameters. To demonstrate its effectiveness, CCANet was evaluated on several computer vision-related tasks in [54]. The results showed that CCANet outperformed PCANet and LDANet on object, face, and handwritten digit recognition problems.

Although PCANet and similar networks achieve high recognition rates on several datasets, these networks may not extract discriminative features in more complicated computer vision problems, since PCA does not preserve the relationship between different classes, which can be useful in pattern classification. To mitigate this issue, the discriminative canonical correlation network (DCCNet) [58] was introduced, where discriminative canonical correlations analysis (DCC) [59, 60] is employed to build the filter banks. Learning filters from DCC ensures that the network will provide discriminative features, generating more representative information by using supervised data. DCCNet was evaluated on four datasets, including object and house number classification, outperforming PCANet and LDANet in these tasks.

Despite its versatility, PCANet only works with unsupervised convolution filters, not making use of supervised information when available. To solve this problem, the orthogonal subspace network (OSNet) [61] was proposed to make use of supervised data. The central concept of OSNet is to express images as subspaces. In this scenario, the subspace representation is more compact than the traditional image set representation, since it selects the most relevant set of eigenvectors of an image set. To produce discriminative information, a space is computed to decorrelate the between-class covariance matrix. The convolutional kernels of OSNet can be efficiently learned from class subspaces and directly employed to produce highly discriminative features in a CNN-like architecture. Another benefit of the subspace representation is that it requires less memory for storage and less processing time. The effectiveness of OSNet is shown in [61] by experiments using four databases, where OSNet outperformed PCANet.

In order to alleviate the high demand for storage space and computation required to learn deep feature representations, a shallow network named compact feature representation (CFR-ELM) was proposed [62]. By using the extreme learning machine (ELM) under a shallow network design, this framework requires little storage space and few computational resources, like PCANet. The solution consists of the following steps: first, patch-based mean removal is employed, followed by ELM auto-encoder (ELM-AE) feature extraction. Then, max pooling is used to compact the features. Finally, hashing and block-wise histogramming provide the post-processed features. CFR-ELM was evaluated on MNIST, Coil-20/100, ETH-80, and CIFAR-10, demonstrating results competitive with existing supervised shallow networks.

More recently, the cosine convolutional kernel network (Cosine-CKN) [63] was proposed as an unsupervised convolutional network architecture that employs a kernel function designed as a convex combination of a (possibly uncountably infinite) number of cosine kernels. In contrast to the standard CKN, the introduced approximation is more closely related to CNN, where the inner product operator measures the similarity between filters and image patches. Different from traditional CNNs, Cosine-CKN has fewer hyperparameters, which makes its prototyping and training much faster. Cosine-CKN was evaluated on several datasets, including MNIST, CIFAR-10, C-Cube, and FERET. The experimental results demonstrated that this network achieved better recognition accuracy and shorter training time than PCANet and LDANet.

It is important to note that supervised shallow networks depend on the availability of labeled data and that unsupervised shallow networks have no mechanism to use labeled data when available. In this case, a shallow network whose architecture allows the use of both labeled and unlabeled data may exhibit a significant advantage, since the network will be able to employ all available data, regardless of whether it is labeled or not. Besides, such flexibility is expected to translate into competitive accuracy.

Finally, we should point out that PCA and LDA can be regarded as subspace-based methods, a class of learning techniques that employs subspaces to represent the data. Accordingly, we can introduce more sophisticated subspace methods such as GDS, where the discriminability of features is enhanced through the orthogonalization of the different class subspaces. GDS has been employed in image set classification problems, achieving robustness to illumination conditions. Due to its low computational cost, GDS is preferred over other supervised methods such as DCC or LDA. Another merit of using GDS is that it is robust to the small sample size problem, a persistent issue in computer vision-related problems [64].

By using supervised and unsupervised subspaces, we can introduce a shallow network capable of efficiently exploiting both learning paradigms, providing a very flexible architecture. After a thorough search of the relevant literature, we believe that this is the first work that introduces a semi-supervised shallow network based on subspaces for image classification. In Fig. 2, we show a conceptual schema of a semi-supervised shallow network for image classification. In the next section, we give details on the proposed architecture.

Fig. 2

Conceptual illustration of the proposed shallow network. DFSNet employs two distinct filter banks which work in complementary directions. In order to reduce the high dimensionality of the features and increase rotation invariance, the proposed method is followed by a feature mapping, as is done in most shallow networks. Then, classification is performed by a linear support vector machine

3 Proposed method

Inspired by shallow network architectures, this section presents a semi-supervised network for image classification. The content of this section is organized as follows. First, we provide notation for the main concepts. Next, we explain the representation of the training images by patches. Then, we define the procedure for learning convolution filters through subspaces to generate supervised and unsupervised filter banks. After that, we describe the process of creating the final feature mapping.

3.1 Notations

In the context of this work, we use the following notation. Scalars are denoted by uppercase letters (e.g., Nu, Mu, N, M, K), vectors by lowercase letters (e.g., v), and matrices by boldface uppercase letters (e.g., A, Xu, Xs). Calligraphic letters are assigned to orthogonal basis vectors (e.g., S, ℳ) as well as to filter banks F. The set of filters \(\{{\phi }_{i}\}_{i=1}^{D}\) contains D elements, e.g., \(\{{\phi }_{1}, \dots, {\phi }_{D}\}\). Given a matrix \(\mathbf {A} \in \mathbb {R}^{M \times N}\), \(\mathbf {A}^{T} \in \mathbb {R}^{N \times M}\) denotes its transpose.

3.2 Problem setting

Let us consider a learning problem with two training sets Xu and Xs, where Xu contains Nu unlabeled and Xs contains Ns labeled images of size M×N.

The objective of DFSNet is to extract discriminative and representative structures so as to maximize classification performance subject to the available training data. More precisely, subspaces should be obtained from the unsupervised and supervised training sets hierarchically, such that features of different abstraction levels can be efficiently represented.

Then, given Xu and Xs, we should implement a mechanism that produces 2Z filter banks, where Z denotes the number of convolutional layers in the network, in such a manner that each layer is equipped with an unsupervised filter bank Fu and a supervised filter bank Fs.

3.3 Representation by patches

We extract patches of size K=K1×K2 from Xu and Xs. This procedure is performed by taking a patch around each pixel from each one of the Nu+Ns training images. Here, we denote the set of unsupervised and supervised patches as Pu and Ps, respectively. Given that each image patch will have size K(=K1×K2), the sets Pu and Ps will then contain Mu=NuMN and Ms=NsMN patches, respectively.
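As a reference, a minimal sketch of this patch extraction step is given below, assuming grayscale images stored as NumPy arrays; the function name and the zero-padding at the image borders are illustrative choices rather than details fixed by the method.

```python
import numpy as np

def extract_patches(images, k1, k2):
    """Take a k1 x k2 patch around every pixel of every image.

    images: array of shape (num_images, M, N).
    Returns an array of shape (num_images * M * N, k1 * k2),
    one vectorized patch per row; borders are zero-padded.
    """
    pad1, pad2 = k1 // 2, k2 // 2
    patches = []
    for img in images:
        padded = np.pad(img, ((pad1, pad1), (pad2, pad2)), mode="constant")
        for i in range(img.shape[0]):
            for j in range(img.shape[1]):
                patches.append(padded[i:i + k1, j:j + k2].ravel())
    return np.asarray(patches)
```

With Nu and Ns images of size M×N, calling this function on Xu and Xs yields the Mu=NuMN and Ms=NsMN patches of Pu and Ps, respectively.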

3.4 Producing unsupervised filter banks

The procedure for building unsupervised filters can be implemented in several ways. The literature points out that both data-dependent filters (e.g., PCA, CCA) and data-independent filters (e.g., FFT, DCT, wavelet transform) can be used to generate unsupervised filters. In our proposal, we use PCA filter banks due to their flexibility in handling different applications [65, 66] and their fast training and testing times.

The procedure to calculate the PCA filters is carried out as follows: given the unsupervised patch set \(\mathbf {P}_{u} = \{p_{i} \in \mathbb {R}^{K}\}_{i=1}^{M_{u}}\), the empirical mean vector of Pu is computed as \(\overline {p} = \frac {1}{M_{u}}\sum \limits _{i=1}^{M_{u}} p_{i} \in \mathbb {R}^{K}\). After that, we subtract the mean vector from each vector pi to form the centered set \(\overline {\mathbf {P}_{u}}\). Once we obtain \(\overline {\mathbf {P}_{u}}\), we can build the feature matrix \(\mathbf {A} \in \mathbb {R}^{{M_{u}} \times K}\) containing in its rows the elements of \(\overline {\mathbf {P}_{u}}\).

Once the feature matrix A is obtained, we can compute the autocorrelation matrix \(\mathbf {C}_{u} = \mathbf {A}^{T}\mathbf {A} \in \mathbb {R}^{K \times K}\). Now that we are equipped with the autocorrelation matrix Cu, we can move forward to calculate the matrix Uu of eigenvectors which diagonalizes the autocorrelation matrix Cu:

$$ \mathbf{D}_{u} = \mathbf{U}_{u}^{-1}\mathbf{C}_{u} \mathbf{U}_{u}. $$
(1)

In Eq. 1, Uu is a K×K orthogonal matrix, i.e., \( \mathbf {U}_{u} \mathbf {U}_{u}^{T} = \mathbf {U}_{u}^{T} \mathbf {U}_{u} = \mathbf {I}\), where I is a K×K identity matrix. The columns of Uu that correspond to nonzero singular values form a set of orthonormal basis vectors for the range of Cu. Du is the diagonal matrix of eigenvalues of Cu.

The unsupervised filter bank Fu is formed by the first Du vectors of Uu, sorted in descending order according to the eigenvalues in Du. Therefore, we define Fu as follows:

$$ \mathcal{F}_{u} = \mathbf{U}_{u} \mathbf{R}_{u}, $$
(2)

where Ru is a K×K matrix containing 1 on its first Du principal diagonal entries and 0 elsewhere. After this procedure, we should have an unsupervised filter bank \(\mathcal{F}_{u} \in \mathbb {R}^{D_{u} \times K}\).
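A minimal sketch of this unsupervised filter bank computation (Eqs. 1 and 2) is shown below; since Cu is symmetric, a symmetric eigendecomposition is used, and the selection matrix Ru is realized implicitly by keeping only the leading Du eigenvectors.

```python
import numpy as np

def pca_filter_bank(patches, d_u):
    """Build the unsupervised filter bank F_u, a sketch of Eqs. 1-2.

    patches: array of shape (M_u, K), one vectorized patch per row.
    d_u: number of filters to keep.
    Returns an array of shape (d_u, K); each row reshapes to a K1 x K2 filter.
    """
    centered = patches - patches.mean(axis=0)   # form the centered set
    c_u = centered.T @ centered                 # autocorrelation matrix C_u (K x K)
    eigvals, eigvecs = np.linalg.eigh(c_u)      # eigendecomposition of Eq. 1
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues in descending order
    return eigvecs[:, order[:d_u]].T            # keep the first D_u eigenvectors
```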

3.5 Producing supervised filter banks

There are also many types of supervised methods that can be employed to implement efficient supervised filters for DFSNet, such as LDA and DCC. In this work, we use GDS, which is suitable for the semi-supervised problem setting since it works well even with a small quantity of supervised data. This setting, known as the small sample size problem, is very challenging for LDA and DCC due to their inability to estimate the within-class scatter matrix adequately in such circumstances. In contrast, GDS avoids this issue by introducing the subspace representation, which can be stably estimated from even a few samples [64]. Practical examples exist in the literature; for instance, an illumination subspace can be generated from a set of at most 9 frontal face images. In this example, the subspace produced by GDS represents explicit information about the object shape [44, 67], which is not achievable by LDA or DCC. Besides, the computational cost of GDS is relatively low for a supervised subspace-based method [68, 69].

To create the supervised filter banks, we will use the supervised patch set \(\mathbf {P}_{s} = \{p_{i} \in \mathbb {R}^{K}\}_{i=1}^{M_{s}}\). For a C-class classification problem, it is required to compute a set of C feature matrices \(\{\mathbf {A}_{j}\}_{j=1}^{C}\). For each feature matrix Aj, we compute the autocorrelation matrix \(\mathbf{C}_{j} = \mathbf{A}_{j}^{T}\mathbf{A}_{j}\).

Equipped with all C autocorrelation matrices, we can move forward to calculate the matrix Uj of eigenvectors which diagonalizes the autocorrelation matrix Cj:

$$ \mathbf D_{j} = {\mathbf U_{j}}^{-1}\mathbf C_{j} \mathbf U_{j}, \quad j=\{1, \ldots, C\}. $$
(3)

In Eq. 3, each Uj is a K×K orthogonal matrix, i.e., \(\mathbf{U}_{j}\mathbf{U}_{j}^{T} = \mathbf{U}_{j}^{T}\mathbf{U}_{j} = \mathbf{I}\). The columns of Uj that correspond to nonzero singular values form a set of orthonormal basis vectors for the range of Cj. Dj is the diagonal matrix of eigenvalues of Cj. It is important to note that GDS does not center the data at the mean [44, 70], in contrast to the feature matrix created using PCA. In addition, unlike PCA, GDS produces a subspace for each class independently, in order to exploit the correlations among the different classes. Once all the basis vectors Uj have been obtained, we can calculate the total projection matrix G as follows:

$$ \mathbf G = \sum\limits_{j=1}^{C} {\mathbf{U}_{j}}^{T} \mathbf{U}_{j}. $$
(4)

The eigen-decomposition of the total projection matrix G produces a K×K orthogonal matrix Us. The sum subspace S, spanned by Us, can be decomposed into the sum of the following subspaces:

$$ \mathcal{S} = \mathcal{M} \oplus \mathcal{D}, $$
(5)

where 𝒟 is the generalized difference subspace. By using this decomposition, we can formulate the subspace that represents the differences among all the class subspaces by simply excluding the principal subspace ℳ from the sum subspace S. In practical terms, the filter bank Fs is defined by the Ds vectors of S that remain after excluding the first Dℳ vectors. This procedure can be implemented by the following expression:

$$ \mathcal{F}_{s} = \mathbf{U}_{s}\mathbf{R}_{s}, $$
(6)

where Rs is a K×K matrix containing 0 on its first Dℳ principal diagonal entries, 1 on the remaining Ds principal diagonal entries, and 0 elsewhere. After this procedure, we should have a supervised filter bank \(\mathcal {F}_{s} \in \mathbb {R}^{D_{s} \times K}\).
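A minimal sketch of this supervised filter bank computation (Eqs. 3 to 6) is given below, under the assumption that each class contributes its leading eigenvectors, taken as columns, to the projection matrix G; the per-class subspace dimension (class_dim) is an illustrative parameter.

```python
import numpy as np

def gds_filter_bank(patches_per_class, class_dim, d_m, d_s):
    """Build the supervised filter bank F_s via GDS, a sketch of Eqs. 3-6.

    patches_per_class: list of C arrays, each (M_j, K), NOT mean-centered.
    class_dim: number of basis vectors kept for each class subspace.
    d_m: dimension of the principal subspace M to discard.
    d_s: number of difference-subspace filters to keep.
    Returns an array of shape (d_s, K).
    """
    k = patches_per_class[0].shape[1]
    g = np.zeros((k, k))
    for a_j in patches_per_class:
        c_j = a_j.T @ a_j                            # class autocorrelation (Eq. 3)
        eigvals, u_j = np.linalg.eigh(c_j)
        basis = u_j[:, np.argsort(eigvals)[::-1][:class_dim]]
        g += basis @ basis.T                         # accumulate G (Eq. 4)
    eigvals, u_s = np.linalg.eigh(g)                 # U_s spans the sum subspace S
    order = np.argsort(eigvals)[::-1]
    # discard the first d_m principal directions (subspace M), keep the next d_s (Eq. 6)
    return u_s[:, order[d_m:d_m + d_s]].T
```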

3.6 Filtering an input image

Here, we describe how to filter an input image using the unsupervised and supervised filter banks developed previously. Since the filter banks span Du- and Ds-dimensional subspaces, we can use each eigenvector of \(\mathcal{F}_{u} = \{{\phi }_{r}\}_{r=1}^{D_{u}}\) and \(\mathcal{F}_{s} = \{{\psi }_{t}\}_{t=1}^{D_{s}}\) as a convolutional filter. Therefore, given an input image \(\mathbf {P}_{in} \in \mathbb {R}^{N \times M}\), the goal is to filter Pin as follows:

$$ \mathbf{V}_{\mathit{r}} = \text{map}_{K}(\phi_{r}) * \mathbf{P}_{\mathit{in}}, \quad r=\{1, \ldots, D_{u} \}. $$
(7)
$$ \mathbf{W}_{\mathit{t}} = \text{map}_{K}(\psi_{t}) * \mathbf{P}_{\mathit{in}}, \quad t=\{1, \ldots, D_{s} \}. $$
(8)

In Eqs. 7 and 8, the operator mapK(·) maps an input vector \(y \in \mathbb {R}^{K_{1} K_{2}}\) onto a matrix \(\mathbf {Y} \in \mathbb {R}^{{K_{1} \times K_{2}}}\). The symbol ∗ refers to a convolution with zero-padding in the boundary of the image patch.

It is important to note that the first layer of our proposed network will produce Ds+Du output images. By using the unsupervised and supervised filtered images Vr and Wt, more subspaces can be learned to create more layers. Usually, more than one layer is employed in shallow networks, so that more features can be extracted from Pin. For instance, for a network with Z=2 layers, we should learn 4 filter banks, where \(\mathcal{F}_{u}^{1}\) and \(\mathcal{F}_{s}^{1}\) may be learned from Xu and Xs, and \(\mathcal{F}_{u}^{2}\) and \(\mathcal{F}_{s}^{2}\) can be learned from Vr and Wt. Figure 3 shows the convolution process using two basis vectors.

Fig. 3

Illustration of a single convolutional layer. The input image Pin or a feature map of the previous layer is convolved by unsupervised and supervised filters ϕ and ψ, respectively, to yield the output feature maps
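A minimal sketch of the filtering step of Eqs. 7 and 8 is given below, using SciPy's convolve2d with zero-filled boundaries; the reshape plays the role of the mapK(·) operator.

```python
import numpy as np
from scipy.signal import convolve2d

def filter_image(p_in, f_u, f_s, k1, k2):
    """Apply the unsupervised and supervised filters to one image (Eqs. 7-8).

    p_in: input image of shape (M, N).
    f_u, f_s: filter banks of shape (D_u, K) and (D_s, K), with K = k1 * k2.
    Returns an array of shape (D_u + D_s, M, N) with the filtered maps.
    """
    maps = []
    for bank in (f_u, f_s):
        for filt in bank:
            kernel = filt.reshape(k1, k2)     # the map_K(.) operator
            maps.append(convolve2d(p_in, kernel, mode="same",
                                   boundary="fill", fillvalue=0))
    return np.stack(maps)
```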

3.7 Feature mapping

The feature vectors generated by the convolutional layers of shallow networks are usually very large, since there are no pooling layers. As the model becomes deeper (i.e., the number of layers increases), the number of feature maps grows exponentially. The fast growth of the feature vector severely limits feature extraction performance and processing efficiency. To address this weakness, a specific layer is required to reduce the dimensionality of the feature vector generated by the convolutional layers.

After filtering the input image Pin, the produced filtered images are concatenated into a high-dimensional vector. For example, consider a feature vector generated from a network with the following parameters: K1=K2=8, input image size M=N=28, Du=Ds=5, and Z=1. The final feature vector will then be (Du+Ds)(MN)=7840-dimensional. This simple example makes it clear that a dimensionality reduction technique is required.

For the Zth layer, \(N_{u}^{Z} + N_{s}^{Z}\) images will be generated as a result of successive Z convolutions. The number of images in the final convolutional layer depends on the dimension of the unsupervised and supervised subspaces of each layer and can be obtained as follows:

$$ N_{u}^{Z} = \prod_{z = 1}^{Z} D_{u}^{z}. $$
(9)
$$ N_{s}^{Z} = \prod_{z = 1}^{Z} D_{s}^{z}. $$
(10)
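As a concrete instance, for a hypothetical Z=2 network with eight unsupervised filters per layer, the final layer yields

$$ N_{u}^{2} = D_{u}^{1} D_{u}^{2} = 8 \times 8 = 64 $$

unsupervised feature maps.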

Following the procedure of PCANet, we can convert the filtered images to a set of \(N_{u}^{Z-1} + N_{s}^{Z-1}\) images as follows:

$$ \mathbf{T}_{u}^{m} = \sum\limits_{z=1}^{N_{u}^{Z}} 2^{(z-1)}\mathrm{H}(\mathbf{V}_{m}), \quad m=\{1, \ldots, N_{u}^{Z-1}\}. $$
(11)
$$ \mathbf{T}_{s}^{n} = \sum\limits_{z=1}^{N_{s}^{Z}} 2^{(z-1)}\mathrm{H}(\mathbf{W}_{n}), \quad n=\{1, \ldots, N_{s}^{Z-1}\}. $$
(12)

In Eqs. 11 and 12, the filtered images Vm and Wn are binarized using a Heaviside step-like function H(·), whose value is 1 for positive entries and 0 otherwise. After this procedure, we obtain \(N_{u}^{Z-1} + N_{s}^{Z-1}\) integer-valued images \(\mathbf {T}_{u}^{m}\) and \(\mathbf {T}_{s}^{n}\) with pixel values in the ranges \([0, 2^{N_{u}^{Z}}-1]\) and \([0, 2^{N_{s}^{Z}}-1]\), respectively. It is worth noting that this dimensionality reduction is also employed in shallow network-based transfer learning [71]. Then, each image \(\mathbf {T}_{u}^{m}\) and \(\mathbf {T}_{s}^{n}\) is partitioned into B blocks, over which block-wise histograms are computed. Finally, the feature f=[fu,fs] of the input image Pin is defined as the set of block-wise histograms bh:

$$ f_{u} = [\mathbf{b}_{h}(\mathbf{T}_{u}^{1}), \mathbf{b}_{h}(\mathbf{T}_{u}^{2}), \ldots, \mathbf{b}_{h}(\mathbf{T}_{u}^{N_{u}^{Z-1}})]^{T}. $$
(13)
$$ f_{s} = [\mathbf{b}_{h}(\mathbf{T}_{s}^{1}), \mathbf{b}_{h}(\mathbf{T}_{s}^{2}), \ldots, \mathbf{b}_{h}(\mathbf{T}_{s}^{N_{s}^{Z-1}})]^{T}. $$
(14)

Most modern networks [72] make use of features of each layer, creating a huge vector. Although the idea is appealing, we chose to use the strategy employed by PCANet, since it is more similar to the procedure used by CNN. In the investigated shallow networks, SVM is applied for the classification. The same classifier is then used with DFSNet.
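A minimal sketch of this feature mapping (Eqs. 11 to 14) is given below for one group of filtered maps sharing a parent map in the previous layer; non-overlapping blocks are assumed here, although overlapping blocks (used in some of the experiments) would be handled analogously.

```python
import numpy as np

def hash_and_histogram(feature_maps, block_size):
    """Binarize, hash, and histogram a group of maps, a sketch of Eqs. 11-14.

    feature_maps: array (num_bits, M, N); num_bits plays the role of N^Z.
    Returns the concatenated block-wise histograms of the hashed image.
    """
    num_bits = feature_maps.shape[0]
    binary = (feature_maps > 0).astype(np.int64)    # Heaviside step H(.)
    weights = 2 ** np.arange(num_bits)              # the 2^(z-1) factors
    hashed = np.tensordot(weights, binary, axes=1)  # integer image in [0, 2^num_bits - 1]
    m, n = hashed.shape
    hists = []
    for i in range(0, m - block_size + 1, block_size):     # non-overlapping blocks
        for j in range(0, n - block_size + 1, block_size):
            block = hashed[i:i + block_size, j:j + block_size]
            hists.append(np.bincount(block.ravel(), minlength=2 ** num_bits))
    return np.concatenate(hists)
```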

One of the advantages of our proposed shallow network is its reduced number of parameters compared to deep learning networks. The hyper-parameters of DFSNet are as follows: the filter size K, the number of layers Z, the number of filters in each layer \(D_{u}^{1}, D_{u}^{2}, \ldots, D_{u}^{Z}\) and \(D_{s}^{1}, D_{s}^{2}, \ldots, D_{s}^{Z}\), and the block size B for the histogram. Figure 4 presents the proposed shallow network equipped with two convolutional layers and a feature mapping layer.

Fig. 4

Conceptual figure of DFSNet. The network employs two distinct filter banks based on PCA and GDS. To reduce the high dimensionality of the feature vectors and increase rotation invariance, the proposed method is followed by a feature mapping that includes binarization and block-wise histogramming. Similar to most shallow networks, the classification is performed by linear SVM
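For concreteness, a hypothetical DFSNet configuration in the spirit of the FERET settings of Section 4 can be written as follows; the key names are illustrative and not taken from the original implementation.

```python
# Hypothetical DFSNet hyper-parameter configuration (illustrative names).
dfsnet_config = {
    "filter_size": (5, 5),            # K1 x K2
    "num_layers": 2,                  # Z
    "filters_unsupervised": [8, 8],   # D_u per layer
    "filters_supervised": [8, 8],     # D_s per layer
    "block_size": 15,                 # B, for the block-wise histograms
}
```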

4 Experimental results and discussion

In this section, the effectiveness of the proposed network is evaluated using five datasets: CIFAR-10 [73], LFW [74], NYU Depth V1 [75], ETH-80 [76], and FERET [77], covering varied classification tasks such as face recognition, indoor scene recognition, and object classification. Our experiments are divided into three main parts. First, a visualization of the filters produced by the proposed network on the ALOI [78] dataset is provided to compare the unsupervised and supervised filters. Then, the feature separability of DFSNet is analyzed in different scenarios, including when only unsupervised data is available and when only supervised data is employed. Finally, a comparison with current shallow networks is presented.

4.1 Visualization of the filters produced by the proposed method

In this experiment, the unsupervised and supervised filters are presented and analyzed. DFSNet is trained using the ALOI database with 50% of unsupervised data and 50% of supervised data in order to make a clear comparison.

ALOI is a database containing 72,000 images from 1000 classes. These images were obtained from several points of view and with variations in illumination. The version of the ALOI dataset that contains only changes of point of view was utilized. For the sake of simplicity, DFSNet was trained with 1 layer, where K1=K2=8. The ALOI database provides good examples of highly similar classes, which may expose the difficulties in extracting discriminative patterns. For visualization purposes, the filters were learned from RGB data. Figure 5 shows samples of the ALOI dataset employed in this experiment.

Fig. 5

Image samples of ALOI dataset

Figure 6 presents the filters and the filtered images produced by the proposed network. Figure 6a shows the unsupervised filters produced by PCA, which are arranged in each row according to their eigenvalues in decreasing order, from left to right. Thus, the leftmost filter of each row is the most representative one. Regarding the filters produced by PCA, it is possible to observe that the first filters resemble edge and contour detectors, while the following filters resemble texture and color detectors. Although these filters provide an interpretable view, they are not discriminative, since PCA does not account for the relation between patterns of different image classes.

Fig. 6

Filters produced by PCA and GDS on the ALOI dataset

Figure 6b presents the supervised filters generated by GDS. Again, the leftmost filter of each row is the most discriminative one. In this experiment, we set Dℳ=2, since this value reduces information loss. From the filtered images, we can notice that the ones produced by GDS exhibit higher variability than the filtered images produced by the PCA filter banks. For example, images filtered by PCA are very similar in terms of color aspects, while images filtered by GDS present more color variability. This phenomenon is directly related to the GDS approach, which acts by exposing discriminatory characteristics (that is, features that are not present in other classes of images), while images filtered by PCA focus on common patterns (i.e., the principal components). According to this observation, we can confirm that images filtered by GDS produce more distinctive features than features provided by PCA.

Moreover, in the filters produced by GDS, it is difficult to find visually interpretable patterns, such as those found in the filters created by PCA. This behavior is especially due to the fact that GDS evaluates the differences between the edges, contours, colors, and textures generated by all classes. As a result, GDS filters provide less visual interpretability, since they represent the differences between all subspace combinations.

4.2 Analyzing feature separability in different scenarios

The objective of this experiment is to determine whether supervised information improves the discriminative ability of DFSNet. To perform this experiment, the proposed method is trained using only 1 layer in 4 different scenarios: (1) when no supervised data is available, (2) when unsupervised data is abundant (80% unsupervised and 20% supervised data), (3) when unsupervised and supervised data are balanced (50% of each), and (4) when supervised data is abundant (20% unsupervised and 80% supervised data).

Multidimensional scaling (MDS) [79] is used to visualize features obtained from 5 classes of the ALOI dataset. These classes, whose images are shown in Fig. 5, were selected due to their high similarity regarding shape and color. For example, the first and second classes, here called classes A and B, present a similar shape, whereas the three remaining classes (C, D, and E) exhibit identical texture and color.
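A sketch of this visualization step, using scikit-learn's MDS implementation, is shown below; features and labels stand for the DFSNet feature vectors and the class labels of the five selected classes and are assumed to be NumPy arrays.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# `features` (n, d) and `labels` (n,) are assumed to hold the DFSNet
# feature vectors and class labels of the five selected ALOI classes.
embedding = MDS(n_components=2).fit_transform(features)
for cls in sorted(set(labels)):
    mask = labels == cls
    plt.scatter(embedding[mask, 0], embedding[mask, 1], label=f"class {cls}")
plt.legend()
plt.show()
```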

Figure 7a shows the scatter plot when only unsupervised data is available. In this scenario, the proposed network reduces to PCANet, where the filter banks are produced using only unsupervised data. This plot suggests that patterns of classes C, D, and E present a high rate of overlap, making it challenging for a classifier to generate appropriate separating hyperplanes.

Fig. 7

Scatter plots using the first two MDS dimensions showing distances between five classes of the ALOI dataset on four different scenarios: (1) when no supervised data is available, (2) when unsupervised data is abundant, (3) when unsupervised and supervised data are balanced, and (4) when supervised data is abundant

In Fig. 7b, where unsupervised data is still abundant but a small amount of labeled data is also used, patterns of classes C, D, and E present lower overlap compared to the previous scenario. In this case, a classifier trained with an appropriate kernel may learn a feasible solution. The situation where unsupervised data is abundant is the most realistic among all scenarios investigated in this section.

Figure 7c shows the scenario where unsupervised and supervised data are balanced. As expected, the overlap between patterns is lower than in the previous scenario, reflecting the influence of supervised data. Here, GDS has sufficient supervised data to reduce the overlap between the classes considerably and, visually, class C is well separated from classes D and E.

Finally, as was also expected, Fig. 7d exhibits the best scenario, when supervised data is abundant. In this illustration, the extracted features are mostly supervised and reveal the discriminative ability of GDS to remove overlap between classes. Among all the investigated scenarios, this is the least realistic regarding the semi-supervised learning paradigm.

4.3 Comparison with related shallow networks

In this section, we compare DFSNet to the following unsupervised shallow networks: PCANet, DCTNet, CCANet, and CFR-ELM, as well as to the supervised shallow networks: LDANet, DCCNet, OSNet, and CKNet. In the following, we describe the employed datasets and, after that, we show the experimental results.

4.3.1 Datasets and experimental settings

For face recognition evaluation, the FERET dataset [77] is employed. FERET comprises 1196 images from 429 subjects. Images were taken under varying lighting conditions, with diverse expressions, and over a period of 3 years. The dataset is divided into gallery and probe sets. The probe set is subdivided into 4 sections, as follows: Fb, containing different expressions; Fc, including varying lighting conditions; dup-I, obtained within a period of 3 to 4 months; and finally, dup-II, obtained a year and a half after the initial dataset development. We employed 150×90 grayscale images with K1=K2=5, L1=L2=8, and the size of the non-overlapping blocks was set to 15×15. The dimension of the produced features was reduced to 1000 by whitening PCA in order to facilitate the comparison with the other shallow networks. These parameter values were chosen experimentally.

We employ the ETH-80 dataset for object recognition. ETH-80 contains images of 8 object categories, where each category includes 10 object subcategories in 41 different image orientations, resulting in 410 images per category. In total, the ETH-80 database contains 3280 images. We resized the images to 64×64 pixels. ETH-80 provides images with and without background. To analyze the behavior of the learning methods, we used the object images with background. In this experiment, we set L1=L2=8, K1=K2=7, block size 7×7, and block overlapping ratio 0.5. Since ETH-80 does not explicitly provide a training set, we conduct 10 experimental runs with 2000 training images, randomly selected for each run.

We use the LFW dataset [74] for a more challenging face recognition evaluation. It consists of images of faces collected from the web. The faces were detected using the Viola-Jones face detector and cropped to 150×80 pixels. The LFW dataset is especially challenging because it was designed for studying the problem of unconstrained face recognition. Following the standard evaluation protocol, we perform 10-fold cross-validation using the provided 10 subsets, where each subset contains 300 intra-class pairs and 300 inter-class pairs. In this experiment, we set K1=K2=7, L1=L2=8, and 15×13 for the non-overlapping block size. We report the average result of the 10 folds. For the final feature, we employ WPCA with a size of 3000. In contrast to the experimental setup reported in [41], we do not employ the square-root operation on the final feature, to maintain consistency with the other experiments provided in this work.

For object recognition, we also use the CIFAR-10 [73] dataset, which consists of 50,000 training and 10,000 test images. The large variability in scale, viewpoint, illumination, and background clutter of the images in CIFAR-10 poses a significant challenge for classification. In this experiment, we set K1=K2=5, L1=40, L2=10, and 8×8 for the overlapping block size, with an overlapping ratio of 0.5. Different from the experimental setup reported in [41], we do not employ spatial pyramid pooling, in order to evaluate only the convolution method. Instead, we employ WPCA to produce a final feature vector of size 1000.

We also use the NYU Depth V1 dataset [75], which was collected by New York University. The dataset includes depth information, which contains both geometric information and the distance of objects. The NYU Depth V1 dataset consists of 2347 pairs of images grouped into 7 categories: bathroom, bedroom, bookstore, cafe, kitchen, living room, and office. In this experiment, we employ K1=K2=7 and L1=L2=8. Exceptionally for LDANet, the number of filters is set to 6, since the reduced dimensionality must be less than the number of classes. For a fair comparison, we adopt the same parameter settings for all the evaluated networks, and we report results for the RGB data.

4.3.2 Results

Since the amount of unsupervised and supervised data may vary according to the application, four versions of DFSNet are evaluated: (1) when unsupervised data is abundant (80% unsupervised and 20% supervised data), (2) when there is slightly more unsupervised than supervised data (60% and 40%, respectively), (3) when there is slightly more supervised than unsupervised data (40% unsupervised and 60% supervised data), and (4) when supervised data is abundant (20% unsupervised and 80% supervised data).

For an adequate comparison, the Coiflets and Daubechies orthogonal wavelet transforms are used to extract the low-frequency sub-images of the original images to generate the two-view features for CCANet [54]. Besides, the TR normalization introduced in [48] is not employed, so that we can evaluate the shallow networks only in relation to their convolutional filters. As in PCANet, LDANet, and DCTNet, linear SVM is adopted for the classification step, since it is less prone to overfitting than its non-linear version.
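As an illustration, the classification step can be sketched with scikit-learn's linear SVM; train_feats, train_labels, and test_feats are assumed to hold the (optionally WPCA-reduced) block-wise histogram features.

```python
from sklearn.svm import LinearSVC

# `train_feats`, `train_labels`, and `test_feats` are assumed to be the
# DFSNet block-wise histogram features and their class labels.
clf = LinearSVC(C=1.0)              # linear SVM, as in PCANet/LDANet/DCTNet
clf.fit(train_feats, train_labels)
predictions = clf.predict(test_feats)
```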

Surprisingly, the investigated shallow networks obtained comparable recognition rates, regardless of the learning paradigm used. Although the difference is small, in some scenarios, it is evident that one learning paradigm presents an advantage over the other. More precisely, when the amount of training data is not enough to learn a robust model, unsupervised methods offer an advantage. This observation is visible on the FERET database, where DCTNet has shown superior results compared to the other methods. When the amount of training data is sufficient to learn a robust model, supervised methods have an advantage, as in the example of the CIFAR-10 database, where DCCNet produced a very competitive recognition rate. This observation suggests that applications may benefit from models that employ both learning paradigms, thus exploiting training data efficiently. More precisely, the amount of labeled data required to improve the accuracy of the method is relatively low, establishing a better compromise between the advantages of the supervised and unsupervised paradigms.

According to Table 1, PCANet and LDANet consistently produce high recognition rates. PCANet is very competitive, even compared to supervised methods, such as LDANet and OSNet. This is an indication of the advantages that the multistage model employed by shallow networks can provide, even in the absence of labeled data. Among the unsupervised methods, PCANet presented the highest recognition results.

Table 1 Recognition rates of the proposed and the related shallow networks

Despite being built using only random Fourier features, CKNet is extremely competitive on the FERET, ETH-80, and CIFAR-10 datasets. This method is very similar to DCTNet, with the difference that in DCTNet the filters are selected deterministically. CKNet presents the ability to decode textures, which is inherited from the Fourier descriptors. Besides, the Fourier transform introduces translation, scale, and rotation invariance to the features.

DCCNet and OSNet are subspace-based methods that exploit the concept of a constraint subspace to create more discriminative features. The fundamental difference between these methods is that DCCNet employs an iterative process to create its constraint subspace, while OSNet produces it through the decomposition of the principal subspace ℳ. As a result, DCCNet performs well on CIFAR-10, where the number of classes is low and the number of training samples is high, due to the iterative method of calculating the constraint subspace. Also, DCCNet can represent nonlinear structures, which may be found in the CIFAR-10 database. OSNet is competitive on ETH-80, outperforming DCCNet. On this dataset, the restricted number of training examples benefits subspace methods based on decompositions, also suggesting that the iterative method employed by DCCNet requires more samples to obtain a more efficient constraint subspace.

Compared to PCANet and LDANet, CCANet presents competitive results on CIFAR-10 and ETH-80, while performing less well on the remaining datasets. This observation suggests that CCANet is recommended for problems involving object recognition. When applied to the face recognition datasets, PCANet and LDANet perform efficiently compared to CCANet. In comparison to PCANet and LDANet, CCANet has the disadvantage of easily overfitting to noise correlations, weakening its discriminative capability.

DCTNet presents particularly good results in face recognition, achieving high accuracy on LFW and FERET, which are competitive results compared to PCANet and LDANet. DCTNet benefits from the ability of the DCT to concentrate energy in the first few coefficients. The filter banks employed by DCTNet make use of the first coefficients and discard the high frequencies that generally represent noise. As a result, the feature vector produced by DCTNet can be viewed as denoised data, which yields good results on face recognition datasets.

CFR-ELM provided impressive results on CIFAR-10 and ETH-80. The method achieved competitive results on CIFAR-10, outperforming the unsupervised methods and producing results competitive with DCCNet and CKNet. These results suggest that the nonlinear adaptive processing capacity that CFR-ELM inherits from the ELM can learn a rich representation for CIFAR-10. CFR-ELM attained the highest results on ETH-80, suggesting that object classification tasks can benefit from the auto-encoder mechanism employed by CFR-ELM.

The proposed network demonstrated a superior classification rate compared to the other evaluated shallow networks, confirming the efficiency of employing unsupervised and supervised subspaces as convolutional layers. Even when only 20% of the data is supervised, the proposed method performs competitively. These results confirm that the supervised subspace provided by GDS produces discriminative features that improve the classification rate. CFR-ELM performed slightly better on ETH-80. This result is somewhat predictable, given that the nonlinear adaptive processing of CFR-ELM also works effectively on the other datasets. This suggests that adding some nonlinear processing to the generation of the filters may improve our method further.

Here, we highlight that the proposed network attained a superior recognition rate compared to the other shallow networks on the CIFAR-10 database. This observation may be explained by the large amount of training data that the database provides, as well as its reduced number of classes. When a database provides a large amount of training data, DFSNet can learn discriminative structures efficiently.

Given a small set of labeled data and abundant unlabeled data, GDS attempts to select the most discriminative subspace from the image classes, providing complementary information. Feature fusion in neural networks, by concatenation or by addition, has been demonstrated to be a powerful strategy to provide deeper representations [80–82]. In this approach, features from adjacent layers are concatenated to produce a more representative feature. In DFSNet, we can observe that PCA and GDS work in a similar manner, since GDS is based on the SVD of the PCA basis vectors.

Another justification for the proposed architecture is the benefit of using networks in parallel, as in the Siamese [83, 84] and two-stream [85, 86] networks. These networks aim to extract more information from the data by employing an architecture with two networks in parallel.

5 Conclusions and future work

In this paper, a new shallow network is proposed and tested on face recognition, object recognition, and scene understanding. Unlike conventional shallow networks, the proposed network is capable of handling both supervised and unsupervised data. This ability makes the proposed network efficient even when only a small amount of supervised data is available. Another advantage of the proposed method is its independence from automatic differentiation algorithms. Because its convolution filters are formed by a decomposition performed by SVD per layer, the method has an advantage in contexts where time is a limiting factor. The results obtained on the CIFAR-10, LFW, NYU Depth V1, ETH-80, and FERET datasets show that the proposed network is capable of producing highly discriminative features compared to networks of similar architectures.

The number of layers is a limitation directly associated with the network capacity. Modern neural networks that produce competitive results generally have a very large number of layers. We understand that the nature of the subspace method causes such a limitation. Since the basis vectors that span the subspaces are a subset of the basis vectors produced by PCA, a small amount of information is lost at each layer. The subspace used as the first convolution filter bank represents a total of 90% of the variation found in the database. As the second subspace is produced from the images processed by the first subspace and also has a cutoff margin, the information retained by the second subspace is on the order of 81%, following the same threshold factor. This value becomes even lower if we add a third layer: using the same threshold factor, this layer will represent only about 72% of the dataset. Without an optimization method that can adjust the subspaces in a more suitable direction, adding more layers makes the method slower and, worse, weakens the network representation.

The second limitation of our method is the absence of pooling. Although the results produced by shallow networks in general (PCANet, LDANet, and CCANet) are very competitive, the feature vector provided by such networks is very large. Since there are no dimensionality reduction mechanisms between the layers, the produced features grow exponentially with the number of layers. This problem restricts these networks to no more than four layers. A pooling method would add robustness to pattern rotations and provide dimensionality reduction, which would make the feature size independent of the number of layers.

Usually, the training algorithms for neural networks are iterative and, consequently, require some initial set of parameters from which to start the iterations. Also, training neural networks is a challenging task, as most methods are significantly affected by the selection of the initialization parameters. Motivated by this challenge, the proposed method can be an alternative to the random initialization process. In this direction, the filter banks of the proposed network can be employed as the filter banks of a deep neural network during its initialization stage. Since the proposed network produces better results than RandNet [41], it is expected that employing the basis vectors of a subspace may provide better accuracy in fewer iterations.

An important research direction is to extend the proposed network to handle tensor data, which is recommended for video analysis, such as gesture and action recognition. Tensor subspaces exist in the literature and may provide convolutional filters for such networks. In addition, it is possible to employ CFR-ELM instead of PCANet in the semi-supervised framework. The learning paradigm employed in this work can also be extended to deeper architectures, which may exhibit the same advantages (e.g., computational cost). In the same research line, the proposed network can be employed as an initialization method for deeper networks.

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Abbreviations

CNN: Convolutional neural network
PCA: Principal component analysis
LDA: Linear discriminant analysis
DCC: Discriminative canonical correlations analysis
LBP: Local binary patterns
SIFT: Scale-invariant feature transform
HOG: Histogram of oriented gradients
GDS: Generalized difference subspaces
DCT: Discrete cosine transform
CCANet: Canonical correlation analysis network
PCANet: Principal component analysis network
DCTNet: Discrete cosine transform network
DCCNet: Discriminative canonical correlation network
LDANet: Linear discriminant analysis network
OSNet: Orthogonal subspace network
Cosine-CKN: Cosine convolutional kernel network
CFR-ELM: Compact feature representation
MDS: Multidimensional scaling

References

  1. Z. Gong, P. Zhong, Y. Yu, W. Hu, Diversity-promoting deep structural metric learning for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens.56(1), 371–390 (2018).

    Article  Google Scholar 

  2. N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, J. Liang, Convolutional neural networks for medical image analysis: Full training or fine tuning?IEEE Trans. Med. Imaging. 35(5), 1299–1312 (2016).

    Article  Google Scholar 

  3. A. T. Lopes, E. de Aguiar, A. F. De Souza, T. Oliveira-Santos, Facial expression recognition with convolutional neural networks: coping with few data and the training sample order. Pattern Recog.61:, 610–628 (2017).

    Article  Google Scholar 

  4. X. Gao, T. Zhang, Unsupervised learning to detect loops using deep neural networks for visual slam system. Auton. Robot.41(1), 1–18 (2017).

    Article  MathSciNet  Google Scholar 

  5. X. Xie, H. Liu, M. Edmonds, F. Gaol, S. Qi, Y. Zhu, B. Rothrock, S. C. Zhu, in 2018 IEEE International Conference on Robotics and Automation (ICRA). Unsupervised learning of hierarchical models for hand-object interactions (IEEE, 2018), pp. 1–9.

  6. A. M. Dai, Q. V. Le, in Advances in neural information processing systems. Semi-supervised sequence learning, (2015), pp. 3079–3087.

  7. A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, T. Brox, in Advances in Neural Information Processing Systems. Discriminative unsupervised feature learning with convolutional neural networks, (2014), pp. 766–774.

  8. I. Bougoudis, K. Demertzis, L. Iliadis, Fast and low cost prediction of extreme air pollution values with hybrid unsupervised learning. Integr. Comput. Aided Eng.23(2), 115–127 (2016).

    Article  Google Scholar 

  9. M. C. Thomas, W. Zhu, J. A. Romagnoli, Data mining and clustering in chemical process databases for monitoring and knowledge discovery. J. Process Control. 67:, 160–175 (2018).

    Article  Google Scholar 

  10. M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, E. Muharemagic, Deep learning applications and challenges in big data analytics. J. Big Data 2(1), 1 (2015).

  11. Q. Zhang, L. T. Yang, Z. Chen, Deep computation model for unsupervised feature learning on big data. IEEE Trans. Serv. Comput. 9(1), 161–171 (2016).

  12. A. M. Dai, Q. V. Le, in Advances in neural information processing systems. Semi-supervised sequence learning, (2015), pp. 3079–3087.

  13. M. I. Jordan, T. M. Mitchell, Machine learning: trends, perspectives, and prospects. Science 349, 255–260 (2015).

  14. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, in International conference on machine learning. Decaf: a deep convolutional activation feature for generic visual recognition, (2014), pp. 647–655.

  15. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, in Advances in Neural Information Processing Systems. Improved techniques for training gans, (2016), pp. 2234–2242.

  16. A. Holzinger, Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Inf. 3(2), 119–131 (2016).

  17. S. S. Rautaray, A. Agrawal, Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev. 43(1), 1–54 (2015).

  18. J. Song, L. Gao, L. Liu, X. Zhu, N. Sebe, Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recog. 75, 175–187 (2018).

  19. R. Xia, Y. Pan, H. Lai, C. Liu, S. Yan, in AAAI. Supervised hashing for image retrieval via image representation learning, (2014), p. 2.

  20. T. Bouwmans, E. H. Zahzah, Robust PCA via principal component pursuit: a review for a comparative evaluation in video surveillance. Comput. Vis. Image Underst. 122, 22–34 (2014).

  21. S. Ojha, S. Sakhare, in Pervasive Computing (ICPC), 2015 International Conference on. Image processing techniques for object tracking in video surveillance-a survey (IEEE, 2015), pp. 1–6.

  22. K. Jaseena, B. C. Kovoor, A survey on deep learning techniques for big data in biometrics. Int. J. Adv. Res. Comput. Sci. 9(1) (2018).

  23. K. Sundararajan, D. L. Woodard, Deep learning for biometrics: a survey. ACM Comput. Surv. (CSUR) 51(3), 65 (2018).

  24. X. Geng, H. Zhang, J. Bian, T. S. Chua, in Proceedings of the IEEE International Conference on Computer Vision. Learning image and user features for recommendation in social networks, (2015), pp. 4274–4282.

  25. J. Wang, M. Korayem, S. Blanco, D. J. Crandall, in Proceedings of the 2016 ACM on Multimedia Conference. Tracking natural events through social media and computer vision (ACM, 2016), pp. 1097–1101.

  26. D. Ciregan, U. Meier, J. Schmidhuber, in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. Multi-column deep neural networks for image classification (IEEE, 2012), pp. 3642–3649.

  27. C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013).

  28. Y. Sun, Y. Chen, X. Wang, X. Tang, in Advances in Neural Information Processing Systems. Deep learning face representation by joint identification-verification, (2014), pp. 1988–1996.

  29. L. Nanni, S. Ghidoni, S. Brahnam, Handcrafted vs. non-handcrafted features for computer vision classification. Pattern Recogn. 71, 158–172 (2017).

  30. F. Zhu, L. Shao, J. Xie, Y. Fang, From handcrafted to learned representations for human action recognition: a survey. Image Vision Comput. 55, 42–52 (2016).

  31. M. R. Turner, Texture discrimination by Gabor functions. Biol. Cybern. 55(2-3), 71–82 (1986).

  32. T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions. Pattern Recog. 29(1), 51–59 (1996).

  33. T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002).

  34. D. G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004).

  35. N. Dalal, B. Triggs, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1. Histograms of oriented gradients for human detection (IEEE, 2005), pp. 886–893.

  36. K. Lai, L. Bo, X. Ren, D. Fox, in Robotics and Automation (ICRA) 2011 IEEE International Conference on. A large-scale hierarchical multi-view RGB-D object dataset (IEEE, 2011), pp. 1817–1824.

  37. Q. Zhu, M. C. Yeh, K. T. Cheng, S. Avidan, in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. Fast human detection using a cascade of histograms of oriented gradients (IEEE, 2006), pp. 1491–1498.

  38. A. Krizhevsky, I. Sutskever, G. E. Hinton, in Advances in neural information processing systems. Imagenet classification with deep convolutional neural networks, (2012), pp. 1097–1105.

  39. M. A. Alsheikh, D. Niyato, S. Lin, H. P. Tan, Z. Han, Mobile big data analytics using deep learning and Apache Spark. IEEE Netw. 30(3), 22–29 (2016).

  40. Y. Qian, J. Dong, W. Wang, T. Tan, in Media Watermarking, Security, and Forensics 2015, vol. 9409. Deep learning for steganalysis via convolutional neural networks (International Society for Optics and Photonics, 2015).

  41. T. H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, PCANet: a simple deep learning baseline for image classification? IEEE Trans. Image Process. 24(12), 5017–5032 (2015).

  42. M. Dorfer, R. Kelz, G. Widmer, Deep linear discriminant analysis. arXiv preprint arXiv:1511.04707 (2015).

  43. C. Y. Low, A. B. J. Teoh, C. J. Ng, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multi-fold Gabor filter convolution descriptor for face recognition (IEEE, 2016), pp. 2094–2098.

  44. K. Fukui, A. Maki, Difference subspace and its generalization for subspace-based methods. IEEE Trans. Pattern Anal. Mach. Intell. 37(11), 2164–2177 (2015).

  45. M. Nishiyama, O. Yamaguchi, K. Fukui, in International Conference on Audio-and Video-Based Biometric Person Authentication. Face recognition with the multiple constrained mutual subspace method (Springer, 2005), pp. 71–80.

  46. S. Ding, X. Xi, Z. Liu, H. Qiao, B. Zhang, A novel manifold regularized online semi-supervised learning model. Cogn. Comput. 10(1), 49–61 (2018).

  47. T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al., Never-ending learning. Commun. ACM 61(5), 103–115 (2018).

  48. C. J. Ng, A. B. J. Teoh, in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). DCTNet: a simple learning-free approach for face recognition (IEEE, 2015), pp. 761–768.

  49. J. N. Lee, Y. H. Byeon, S. B. Pan, K. C. Kwak, An EigenECG network approach based on PCANet for personal identification from ECG signal. Sensors 18(11), 4024 (2018).

  50. T. Almeida, H. Macedo, L. Matos, N. Vasconcelos, Prototyping a traffic light recognition device with expert knowledge. Information 9(11), 278 (2018).

  51. Y. Zi, F. Xie, Z. Jiang, A cloud detection method for Landsat 8 images based on PCANet. Remote Sens. 10(6), 877 (2018).

  52. X. Zhu, M. Ding, T. Huang, X. Jin, X. Zhang, PCANet-based structural representation for nonrigid multimodal medical image registration. Sensors 18(5), 1477 (2018).

  53. N. Wang, B. Li, Q. Xu, Y. Wang, Automatic ship detection in optical remote sensing images based on anomaly detection and SPP-PCANet. Remote Sens. 11(1), 47 (2018). https://doi.org/10.3390/rs11010047.

  54. X. Yang, W. Liu, D. Tao, J. Cheng, Canonical correlation analysis networks for two-view image recognition. Inf. Sci. 385, 338–352 (2017).

  55. J. Bruna, S. Mallat, Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1872–1886 (2013).

  56. E. Oyallon, S. Mallat, L. Sifre, Generic deep networks with wavelet scattering. arXiv preprint arXiv:1312.5940 (2013).

  57. L. Sifre, S. Mallat, in Proceedings of the IEEE conference on computer vision and pattern recognition. Rotation, scaling and deformation invariant scattering for texture discrimination, (2013), pp. 1233–1240.

  58. B. B. Gatto, E. M. dos Santos, in Image Processing (ICIP) 2017 IEEE International Conference on. Discriminative canonical correlation analysis network for image classification (IEEE, 2017), pp. 4487–4491.

  59. T. K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1005–1018 (2007).

  60. T. K. Kim, B. Stenger, J. Kittler, R. Cipolla, Incremental linear discriminant analysis using sufficient spanning sets and its applications. Int. J. Comput. Vis. 91(2), 216–232 (2011).

  61. B. B. Gatto, E. M. dos Santos, K. Fukui, in Document Analysis and Recognition (ICDAR) 2017 14th IAPR International Conference on, vol. 1. Subspace-based convolutional network for handwritten character recognition (IEEE, 2017), pp. 1044–1049.

  62. D. Cui, G. Zhang, W. Han, L. Lekamalage Chamara Kasun, K. Hu Huang, in Proceedings of the IEEE International Conference on Computer Vision Workshops. Compact feature representation for image classification using ELMs, (2017), pp. 1015–1022.

  63. M. R. Mohammadnia-Qaraei, R. Monsefi, K. Ghiasi-Shirazi, Convolutional kernel networks based on a convex combination of cosine kernels. Pattern Recogn. Lett. (2018).

  64. K. Fukui, N. Sogi, T. Kobayashi, J. H. Xue, A. Maki, Discriminant analysis based on projection onto generalized difference subspace. arXiv preprint arXiv:1910.13113 (2019).

  65. Y. Sun, L. Zheng, W. Deng, S. Wang, in Computer Vision (ICCV) 2017 IEEE International Conference on. SVDNet for pedestrian retrieval (IEEE, 2017), pp. 3820–3828.

  66. Z. Zou, Z. Shi, Ship detection in spaceborne optical image with SVD networks. IEEE Trans. Geosci. Remote Sens. 54(10), 5832–5845 (2016).

  67. K. C. Lee, J. Ho, D. J. Kriegman, Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 684–698 (2005).

  68. Z. Q. Zhao, S. T. Xu, D. Liu, W. D. Tian, Z. D. Jiang, A review of image set classification. Neurocomputing (2018).

  69. L. Chen, N. Hassanpour, Survey: How good are the current advances in image set based face identification? Experiments on three popular benchmarks with a naïve approach. Comput. Vis. Image Underst. 160, 1–23 (2017).

  70. H. Tan, Y. Gao, Z. Ma, Regularized constraint subspace based method for image set classification. Pattern Recogn. 76, 434–448 (2018).

  71. L. Nanni, S. Ghidoni, S. Brahnam, Handcrafted vs. non-handcrafted features for computer vision classification. Pattern Recogn. 71, 158–172 (2017).

  72. S. Wazarkar, B. N. Keshavamurthy, A survey on image data analysis through clustering techniques for real world applications. J. Visual Commun. Image Represent. 55, 596–626 (2018).

  73. A. Krizhevsky, Learning multiple layers of features from tiny images. Master’s thesis (University of Toronto, 2009).

  74. G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49 (University of Massachusetts, Amherst, 2007).

  75. N. Silberman, R. Fergus, in Computer Vision Workshops (ICCV Workshops) 2011 IEEE International Conference on. Indoor scene segmentation using a structured light sensor (IEEE, 2011), pp. 601–608.

  76. B. Leibe, B. Schiele, in Computer Vision and Pattern Recognition, 2003. Proceedings 2003 IEEE Computer Society Conference on, vol. 2. Analyzing appearance and contour based methods for object categorization (IEEE, 2003), pp. II–409.

  77. P. J. Phillips, H. Moon, S. A. Rizvi, P. J. Rauss, The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 22(10), 1090–1104 (2000).

  78. J. M. Geusebroek, G. J. Burghouts, A. W. Smeulders, The Amsterdam library of object images. Int. J. Comput. Vis. 61(1), 103–112 (2005).

  79. I. Borg, P. J. Groenen, P. Mair, Applied multidimensional scaling and unfolding (Springer, 2017).

  80. G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, in CVPR. Densely connected convolutional networks, (2017).

  81. C. T. Chung, C. Y. Tsai, C. H. Liu, L. S. Lee, Unsupervised iterative deep learning of speech features and acoustic tokens with applications to spoken term detection. IEEE/ACM Trans. Audio Speech Lang. Process. 25(10), 1914–1928 (2017).

  82. K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE conference on computer vision and pattern recognition. Deep residual learning for image recognition, (2016), pp. 770–778.

  83. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. Torr, in European conference on computer vision. Fully-convolutional siamese networks for object tracking (Springer, 2016), pp. 850–865.

  84. R. R. Varior, M. Haloi, G. Wang, in European Conference on Computer Vision. Gated Siamese convolutional neural network architecture for human re-identification (Springer, 2016), pp. 791–808.

  85. C. Feichtenhofer, A. Pinz, A. Zisserman, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Convolutional two-stream network fusion for video action recognition, (2016), pp. 1933–1941.

  86. X. Peng, C. Schmid, in European Conference on Computer Vision. Multi-region two-stream R-CNN for action detection (Springer, 2016), pp. 744–759.

Acknowledgments

We are grateful to the anonymous reviewers for their constructive comments, which helped improve the revised version of the manuscript.

Funding

This work was supported by JSPS KAKENHI grant number 19K20335 and the Foundation for Research Support of the State of Amazonas (FAPEAM).

Author information

Contributions

BG, LS, and ES conceived the idea, developed the method, and conducted the experiments. KF, WJ, and KS were involved in extensive discussions and evaluations, and read and approved the final manuscript.

Corresponding author

Correspondence to Bernardo B. Gatto.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Gatto, B.B., Souza, L.S., dos Santos, E.M. et al. A semi-supervised convolutional neural network based on subspace representation for image classification. J Image Video Proc. 2020, 22 (2020). https://doi.org/10.1186/s13640-020-00507-5

Keywords