The analysis pipeline employed in the current approach is presented in Fig. 2. The first step entails preprocessing, followed by motion representation and feature extraction from the motion images. Dimensionality reduction is performed next, to provide the classifier with the appropriate feature descriptors.
Preprocessing
Meaningful processing of the video frames requires, first, extraction of the region of interest (the face). Detection of 2D facial landmarks (cf. Fig. 1) and extraction of aligned facial images of size 112 ×112 pixels were accomplished using OpenFace, an open-source application [13]. A binary “success” score is provided for each frame, with “0” and “1” indicating unsuccessful and successful detection, respectively. In the present work, only successfully detected frames were retained for further processing.
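As a minimal sketch of this frame-selection step, assuming OpenFace's per-frame CSV output with its standard `success` column (the file name is an illustrative placeholder):

```python
import pandas as pd

# Per-frame OpenFace output; the path is an illustrative placeholder.
meta = pd.read_csv("subject01.csv")
meta.columns = meta.columns.str.strip()  # OpenFace column names may carry leading spaces

# Keep only the frames where landmark detection succeeded (success == 1).
valid_frames = meta.loc[meta["success"] == 1, "frame"].tolist()
print(f"retained {len(valid_frames)} of {len(meta)} frames")
```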
Motion representation
It is well supported in the clinical literature that most non-verbal signs of depression are dynamic in nature [6, 7]. Therefore, the use of video-based (dynamic) methods, as opposed to frame-based (static) ones, is preferable. In the proposed work, three different motion history images were implemented: (a) the Motion History Image (MHI), as derived from the basic algorithm; (b) the Landmark Motion History Image (LMHI), which relies on facial landmarks; and (c) the Gabor Motion History Image (GMHI). More details regarding the specific motion representation algorithms are presented below, with implementation examples illustrated in Fig. 4.
Motion History Image
The MHI is a grayscale image in which white pixels correspond to the most recent movement in the video, intermediate grayscale values to correspondingly less recent movements, and black pixels to the absence of movement. The MHI algorithm, with slight variations as explained next, is applied to the aligned face image sequences derived from the preprocessed data using OpenFace, as described in Section 3.1.
The MHI H, with resolution equal to that of the aligned faces, is computed based on an update function $\Psi_{i}(x,y)$ as follows:
$$ H_{i}(x,y)=\left\{\begin{array}{cc} 0 & i=1\\ i \cdot s &\Psi_{i}(x,y)=1 \\ H_{(i-1)}(x,y) & \text{otherwise} \end{array}\right. $$
(1)
where s=255/N, N is the total number of video frames, (x,y) is the position of the corresponding pixel, and i is the frame number. $\Psi_{i}(x,y)$ indicates the presence of movement, derived from the comparison of consecutive frames using a threshold ξ:
$$ \Psi_{i}(x,y) = \left\{\begin{array}{cc} 1 & D_{i}(x,y)\geq \xi\\ 0 & \text{otherwise} \end{array}\right. $$
(2)
where $D_{i}(x,y)$ is defined as a difference distance:
$$ D_{i}(x,y)=\left| I_{i}(x,y)-I_{(i-1)}(x,y) \right| $$
(3)
$I_{i}(x,y)$ is the pixel intensity value at position (x,y) in the ith frame. The final MHI is $H_{N}(x,y)$.
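As a minimal NumPy sketch of Eqs. (1)-(3), assuming `frames` is the list of successfully detected, aligned grayscale face images and `xi` is the motion threshold ξ (the default value below is an illustrative choice):

```python
import numpy as np

def motion_history_image(frames, xi=10):
    """MHI of Eqs. (1)-(3); the threshold xi=10 is an illustrative choice."""
    N = len(frames)
    s = 255.0 / N
    H = np.zeros_like(frames[0], dtype=np.float64)           # H_1 = 0
    for i in range(1, N):                                     # Python index i maps to frame i+1
        D = np.abs(frames[i].astype(float) - frames[i - 1].astype(float))  # Eq. (3)
        H[D >= xi] = (i + 1) * s                              # Eqs. (1)-(2): recent motion is brighter
        # pixels without motion keep their previous value (the "otherwise" branch of Eq. (1))
    return H.astype(np.uint8)
```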
Landmark Motion History Image
The LMHI, originally proposed in Pampouchidou et al. [23], considers the landmarks derived from OpenFace. Only the landmarks corresponding to the facial features (eyes, eyebrows, nose-tip, and mouth) are considered, while the face outline is excluded.
The face outline is excluded in order to emphasize inner facial movements and to disregard overall head movements. The latter are compensated for by co-registering the involved landmarks with an affine transformation before computing the LMHI, through alignment of the points corresponding to the temples, chin, and inner and outer corners of the eyes (landmarks {1, 9, 17, 37, 40, 43, 46}).
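A minimal sketch of this co-registration, assuming `landmarks` holds the 68 2D landmark coordinates of the current frame and `ref` those of a reference frame (variable names are illustrative):

```python
import numpy as np
from skimage import transform

# 0-based indices of the alignment landmarks (OpenFace landmarks 1, 9, 17, 37, 40, 43, 46)
STABLE = np.array([1, 9, 17, 37, 40, 43, 46]) - 1

def coregister(landmarks, ref):
    """Map a frame's 68x2 landmark array onto the reference frame via an affine
    transformation estimated from the stable points only."""
    tform = transform.estimate_transform("affine", landmarks[STABLE], ref[STABLE])
    return tform(landmarks)  # 68 x 2 co-registered landmarks
```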
LMHI differs from the conventional MHI in that image intensities are not considered, but only the facial landmarks detected in each frame. The adopted LMHI algorithm is similar to MHI, maintaining the same $H_{i}$ as in (1) and modifying $\Psi_{i}$ as follows:
$$ \Psi_{i}(x,y) = \left\{\begin{array}{cc} 1 & (x,y)\in L_{i}\\ 0 & \text{otherwise} \end{array}\right. $$
(4)
where $L_{i}$ corresponds to the selected landmarks as detected in the ith frame.
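A corresponding sketch of the LMHI update of Eq. (4), assuming `landmarks_per_frame` holds, for each frame, the co-registered (x, y) pixel coordinates of the selected landmarks and `shape` is the resolution of the aligned face images:

```python
import numpy as np

def landmark_motion_history_image(landmarks_per_frame, shape=(112, 112)):
    """LMHI: Eq. (1) combined with the update function of Eq. (4)."""
    N = len(landmarks_per_frame)
    s = 255.0 / N
    H = np.zeros(shape, dtype=np.float64)
    for i, landmarks in enumerate(landmarks_per_frame, start=1):
        for x, y in np.round(landmarks).astype(int):
            if 0 <= y < shape[0] and 0 <= x < shape[1]:
                H[y, x] = i * s        # Psi_i(x, y) = 1 only at landmark positions
    return H.astype(np.uint8)
```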
Gabor Motion History Image
GMHI is another variant of MHI, in which Gabor-inhibited images replace the original image intensities. The motivation for this variant is that it focuses on the important details of the facial features and thus extracts the most relevant information. The motion representation algorithm is identical to the one described in Section 3.2.1, but the input image I is the result of the Gabor inhibition. The process of obtaining the Gabor-inhibited image is explained in detail below.
The Gabor wavelet at position (x,y) is given by:
$$ \Psi_{\lambda,\theta,\phi,\sigma,\gamma}(x,y)= \exp\left(-\frac{x'^{2}+\gamma^{2}y'^{2}}{2\sigma^{2}}\right)\cos\left(2\pi\frac{x'}{\lambda}+\phi\right) $$
(5)
with
$$ \begin{aligned} x' &= x\cos\theta + y\sin\theta \\ y' &= -x\sin\theta + y\cos\theta \end{aligned} $$
(6)
where λ stands for the wavelength, θ for the orientation, ϕ for the phase offset, σ for the standard deviation of the Gaussian, and γ for the spatial aspect ratio [34].
The input image is usually filtered with a bank of wavelets covering multiple orientations and wavelengths. The energy filter response is obtained by combining the convolutions computed for two different phase offsets ($\phi_{0}=0$ and $\phi_{1}=\pi/2$) using the L2-norm. Background texture suppression is applied to the filter response by subtracting a DoG-filtered image from the original response for each orientation [27]. Finally, the mean response of the Gabor filtering is used to combine the responses across the different orientations, resulting in the pseudo-image used to compute the GMHI. An example of applying the common Gabor and the Gabor-inhibited algorithms to an aligned face image is illustrated in Fig. 3, where the Gabor-inhibited image appears sharper, with less texture in uniform regions, than the original Gabor response.
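A rough OpenCV sketch of the Gabor-energy filtering with a simple DoG-based texture suppression (a loose approximation of the inhibition scheme of [27], not its exact formulation); the kernel size, wavelengths, and suppression strength `alpha` are illustrative assumptions:

```python
import cv2
import numpy as np

def gabor_inhibited(img, wavelengths=(4, 8), n_orient=8, sigma=2.0, gamma=0.5, alpha=1.0):
    """Mean Gabor-energy response with DoG-based texture suppression per orientation."""
    img = img.astype(np.float64)
    responses = []
    for lam in wavelengths:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            # Energy: L2-norm of the responses to two phase offsets (0 and pi/2)
            g0 = cv2.getGaborKernel((31, 31), sigma, theta, lam, gamma, psi=0)
            g1 = cv2.getGaborKernel((31, 31), sigma, theta, lam, gamma, psi=np.pi / 2)
            energy = np.sqrt(cv2.filter2D(img, -1, g0) ** 2 + cv2.filter2D(img, -1, g1) ** 2)
            # Texture suppression: subtract a DoG-filtered version of the energy response
            dog = cv2.GaussianBlur(energy, (0, 0), 4 * sigma) - cv2.GaussianBlur(energy, (0, 0), sigma)
            responses.append(np.clip(energy - alpha * np.clip(dog, 0, None), 0, None))
    # Mean response across orientations (and wavelengths) -> pseudo-image fed to the MHI algorithm
    return np.mean(responses, axis=0)
```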
Feature extraction
Feature extraction was implemented in the present work using two alternative approaches. The first employs appearance-based descriptors, popular in facial image analysis, while the second presents a preliminary attempt to address the problem based on deep learning methods. In both cases, the features were extracted from motion images, instead of the original video recordings.
Appearance-based descriptors
The appearance-based descriptors employed in the present work include the Histogram of Oriented Gradients (HOG), the Local Binary Patterns (LBP), and the Local Phase Quantization (LPQ). Additionally, the combined histogram, mean, and standard deviation of the motion-image gray values are considered as a single descriptor [Hist-Mean-Std]. Specifically for the histogram, zero values (absence of movement) are disregarded and only the bins of the remaining 255 gray values are considered, which, together with the mean and standard deviation, results in a 1 ×257 feature vector. The rest of the descriptors are explained next and illustrated in Fig. 4 for each motion image.
Histogram of oriented gradients (HOG) [35] entails counting gradient orientations in a dense grid. Each image is divided into uniform, non-overlapping cells; the weighted histograms of binned gradient orientations are computed for each cell and subsequently combined to form the final feature vector. HOG results in a 1 ×6084 feature vector.
Local Binary Patterns (LBP) [36] entails dividing the image into partially overlapping cells. Each pixel of a cell is compared to its neighbors to produce a binary value (pattern). The resulting descriptor is a histogram representing the occurrence of the different patterns. LBP for two sets of {radius, neighborhood} results in feature vectors of size 1 ×59 for {1,8} and 1 ×243 for {2,16}.
Local Phase Quantization (LPQ) [37] is computed in the frequency domain, based on the Fourier transform, for each pixel. Local Fourier coefficients are computed, and their phase information is scalar-quantized into binary coefficients. The final descriptor corresponds to the histogram of the binary coefficients and consists of 1 ×256 features.
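A sketch of the descriptor extraction using scikit-image, taking a single motion image `mhi` as input. The HOG and LBP parameters shown are assumptions that happen to match the dimensionalities stated above for 112 ×112 images; for brevity, global LBP histograms are used here rather than per-cell histograms, and LPQ is not available in scikit-image, so it would require a separate implementation:

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

def hist_mean_std(mhi):
    # 255 bins for gray values 1..255 (zero = no motion is disregarded), plus mean and std -> 1 x 257
    h, _ = np.histogram(mhi[mhi > 0], bins=255, range=(1, 256))
    return np.concatenate([h, [mhi.mean(), mhi.std()]])

def appearance_descriptors(mhi):
    return {
        "Hist-Mean-Std": hist_mean_std(mhi),
        # For a 112x112 image these parameters yield a 1 x 6084 HOG vector
        "HOG": hog(mhi, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)),
        # 'nri_uniform' codes give 59 bins for {radius 1, 8 neighbors} and 243 for {2, 16}
        "LBP_1_8": np.histogram(local_binary_pattern(mhi, 8, 1, "nri_uniform"),
                                bins=59, range=(0, 59))[0],
        "LBP_2_16": np.histogram(local_binary_pattern(mhi, 16, 2, "nri_uniform"),
                                 bins=243, range=(0, 243))[0],
    }
```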
Visual Geometry Group
Visual Geometry Group (VGG) is a CNN variant proposed by Simonyan and Zisserman [38]. Using VGG, they achieved 92.7% top-5 test accuracy in the 1000-class ILSVRC classification task of ImageNet, a dataset comprising over 14 million images. The architecture of VGG16 is shown in Fig. 5.
The RGB input image, with pixel values ranging between 0 and 255, is normalized by subtracting the mean pixel value. The input to VGG (a fixed-size 224 ×224 RGB image) passes through a stack of convolutional layers with very small filters of receptive field 3 ×3, which capture the notion of left/right, up/down, and center. The convolution stride is fixed to 1 pixel, and the spatial padding of a convolutional layer input is chosen so that the spatial resolution is preserved after convolution, i.e., the padding is 1 pixel for the 3 ×3 layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the convolutional layers (not all of them are followed by max-pooling); max-pooling is performed over a 2 ×2 pixel window with stride 2.
The stack of convolutional layers (whose depth differs among the architecture configurations) is followed by three fully connected (FC) layers: the first two have 4096 channels each, and the third performs the 1000-way ILSVRC classification and thus contains 1000 channels (one per class). The final layer is a soft-max layer. The configuration of the fully connected layers is the same in all networks, and all hidden layers are equipped with the ReLU non-linearity [30].
In the present work, features were extracted from a pre-trained VGG16 network at different layers, for each motion history image separately as well as for the motion images combined in the form of an RGB image. The proposed method thus involves transfer learning: the pre-trained VGG is applied to the motion history images in order to extract features before the 1st fully connected layer, as well as after the 1st, 2nd, and 3rd fully connected layers. Specifically, configuration D of the network was chosen, as it has shown excellent results in related medical applications. The extracted features are subsequently used for classification in exactly the same manner as the appearance-based descriptors. The different implementations are explained in what follows.
Before fully connected layer (BFCL) uses the features that feed the first fully connected layer of VGG16, as shown in Fig. 5. At this stage the feature maps are of size 14 ×14, with 512 kernels. The mean and the max value are calculated over each filter output, each resulting in a 1 ×512 feature vector for each image.
After fully connected layer (AFCL) is the second approach. VGG16 has three fully connected layers: layers 1 and 2 each yield a feature vector of size 1 ×4096, and layer 3 a feature vector of size 1 ×1000.
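A rough PyTorch/torchvision sketch of the BFCL and AFCL feature extraction under transfer learning. Two assumptions of this sketch differ from the description above: preprocessing uses torchvision's standard ImageNet normalization rather than plain mean-pixel subtraction, and the conv-stack output of torchvision's VGG16 is 7 ×7 ×512 for a 224 ×224 input (the 14 ×14 ×512 stage mentioned above would correspond to tapping the network before the last max-pooling):

```python
import torch
from torchvision import models, transforms

# Pre-trained VGG16 (configuration D) in inference mode
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

prep = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def vgg_features(motion_rgb):            # motion_rgb: HxWx3 uint8 (e.g., MHI/LMHI/GMHI stacked)
    x = prep(motion_rgb).unsqueeze(0)    # 1 x 3 x 224 x 224
    fmap = vgg.features(x)               # 1 x 512 x 7 x 7 convolutional feature maps
    # BFCL: mean and max over each of the 512 feature maps -> two 1 x 512 vectors
    bfcl_mean = fmap.mean(dim=(2, 3)).squeeze(0)
    bfcl_max = fmap.amax(dim=(2, 3)).squeeze(0)
    # AFCL: outputs after the 1st, 2nd, and 3rd fully connected layers of vgg.classifier
    z1 = vgg.classifier[0](torch.flatten(vgg.avgpool(fmap), 1))        # FC1 -> 1 x 4096
    z2 = vgg.classifier[3](vgg.classifier[2](vgg.classifier[1](z1)))   # FC2 -> 1 x 4096
    z3 = vgg.classifier[6](vgg.classifier[5](vgg.classifier[4](z2)))   # FC3 -> 1 x 1000
    return bfcl_mean, bfcl_max, z1.squeeze(0), z2.squeeze(0), z3.squeeze(0)
```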
Dimensionality reduction and classification
In the present work, principal component analysis (PCA) was employed to achieve dimensionality reduction. PCA is one of the most popular methods for this purpose and is based on a linear transformation of the original feature vector into a set of uncorrelated principal components. For a dataset of size N×M (i.e., N samples and M features), PCA identifies an M×M coefficient matrix (the component loadings) that maps each data vector from the original space to a new space of M principal components. By selecting a smaller set of K<M components, the dimensionality of the data can be reduced while still retaining much of the information (i.e., variance) in the original dataset.
In the present work, classification was based on a supervised learning model using Support Vector Machines (SVMs). SVM is a non-probabilistic binary linear classifier which, based on the training samples, attempts to identify the optimal hyperplane that maximizes the margin between the two classes.
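A minimal scikit-learn sketch of this stage; the variance threshold, kernel choice, and the synthetic data standing in for the motion-image descriptors are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the N x M descriptor matrix and the binary (depressed / non-depressed) labels
X, y = make_classification(n_samples=60, n_features=500, random_state=0)

clf = make_pipeline(
    StandardScaler(),           # PCA and SVM both benefit from standardized features
    PCA(n_components=0.95),     # keep the K < M components explaining ~95% of the variance
    SVC(kernel="linear"),       # non-probabilistic binary linear SVM
)
print("cross-validated accuracy: %.3f" % cross_val_score(clf, X, y, cv=5).mean())
```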