
Age estimation algorithm of facial images based on multi-label sorting


Multi-label sorting learning has been successful in many fields. It can not only express the complex semantic information of learning objects, but also generalizes well when dealing with complex problems. This paper proposes an age estimation algorithm for facial images based on multi-label sorting. The algorithm addresses the shortage of facial age data: it replaces the traditional multi-class classification approach, simplifies the tedious steps of age estimation, and shortens model training time. A series of experiments on two age datasets shows that the algorithm achieves very good results on the evaluation indicators, which include MAE (mean absolute error), CS (cumulative score), and convergence rate. Comparison with several classic age estimation algorithms verifies the efficiency and accuracy of the algorithm.


In recent years, multi-label sorting learning has been widely used in research fields such as document classification, image recognition, and gene function prediction. However, it has seen relatively little use in age estimation of facial images. In age estimation datasets, each facial image is usually annotated with a single exact age value, but this simple annotation method has many problems. The most important is that an exact age value is an unreliable and unstable representation of the true age of a facial image: the face's appearance changes slowly with age, and facial images at similar ages differ only slightly, so neighboring ages are easily confused in age classification. In addition, how to build a good age estimation model from a limited facial dataset has long been a question of interest in facial age estimation research, and it remains unresolved because the number of samples in facial age datasets is very small.

In research on face age estimation, many scholars have proposed classical algorithms. Geng et al. proposed the Aging Patterns Subspace (AGES) algorithm [1], which models the aging pattern of the human face by constructing a representative subspace, but the modeling time is long. Guo et al. proposed the support vector machine (SVM) algorithm [2, 3] and the support vector regression (SVR) algorithm [4, 5]; these algorithms are simple and robust, but they consume a lot of machine memory and computing time. Hong et al. proposed the k-nearest neighbor sorting (kNN) algorithm [6]; this algorithm is simple and accurate, but it requires a lot of calculation. In addition, Li et al. proposed the Ordinal Hyperplanes Ranker (OHRank) algorithm [7], Geng et al. proposed the Improved Iterative Scaling-Learning from Label Distribution (IIS-LLD) algorithm [8], and Yin et al. proposed the Conditional Probability Neural Network (CPNN) algorithm [9]. Among them, IIS-LLD and CPNN are both label distribution (LD) algorithms [10]. All the algorithms in references [7,8,9] can be used for face age recognition. However, they require a large number of training samples, and the number of training samples directly affects recognition accuracy. Liao et al. proposed a face age feature extraction method based on a deep convolutional neural network [11]; it has strong discrimination and robustness, but the neural network is more complex to implement, and the model takes longer to train.

In this paper, a face age estimation algorithm based on multi-label sorting is proposed, which addresses the problems of insufficient training samples and long model training time. It also discriminates better when recognizing the age of faces of different races and genders. Compared with other classical algorithms, it achieves better face age recognition results.

A review of multi-label sorting learning

Overview of multi-label sorting learning

Multi-label sorting learning can naturally express the complex semantic information of learning objects and has been successful in many fields. It was originally applied in text processing, where, for example, the ambiguity of words often causes misclassification in document classification tasks. To address this challenge, Schapire and Singer [12] and others first improved the AdaBoost algorithm and achieved significant results with BoostTexter, a Boosting-based multi-label document classification system. After that, multi-label learning attracted the attention of more and more researchers and appeared in many research fields such as multimedia content identification, bioinformatics, and information retrieval.

Learning a good classifier requires a lot of training sample data. However, due to practical constraints, there are often very few samples for some labels, so it is impossible to train a highly accurate classification model. Traditional classification learning assumes that the labels of different categories are almost uncorrelated and exist independently, whereas multi-label learning holds that they may have essential interrelationships that matter for model training. It is undeniable that making full use of the useful information between class labels can effectively reduce the difficulty of learning tasks when training data are limited.

Theoretical content of multi-label sorting learning

The key to multi-label learning is to learn possible relationships [13] between the output labels of the training samples and to use these relationships to optimize the mathematical model and improve its accuracy. First, the input training samples require a mathematical representation: \( X = \mathbb{R}^d \) indicates that the features of each training sample lie in a d-dimensional feature space, and \( Y = \{y_1, y_2, \dots, y_m\} \) is the set of all training sample category labels. The training sample set is \( D = \{(x_1, Y_1), (x_2, Y_2), \dots, (x_n, Y_n)\} \), in which each sample carries multiple labels; \( x_i = [x_{i1}, x_{i2}, \dots, x_{id}]^T \) represents the features of the ith training sample, and \( Y_i \subseteq Y \) is the set of categories annotated for the ith training sample, which is a subset of the label set. To meet the requirements of model training, the number of class labels per sample must be made uniform, so \( y_i = [y_{i1}, y_{i2}, \dots, y_{im}]^T \) with \( y_{ij} \in \{-1, +1\} \) is used as the label vector of the sample. If \( y_{ij} = +1 \), sample \( x_i \) is marked as category \( y_j \); otherwise, sample \( x_i \) is not marked as category \( y_j \). The purpose of multi-label learning is to obtain the mapping function from input to output, \( g: X \to 2^Y \), from the training sample set. In designing the mapping function, the problem is often converted to finding an optimal function \( f: X \times Y \to \mathbb{R} \) according to the model requirements. \( f(x, y) \) indicates the possibility that the sample x is marked as y: the greater \( f(x, y) \) is, the greater the probability that the sample x is marked as class y. Therefore, the purpose of training is to give each training sample a higher confidence for its relevant labels and a lower confidence for its irrelevant labels.

After the mapping function is learned, an evaluation indicator is needed to measure its performance. Because samples carry multiple labels and the correlation between labels is learned at the same time, multi-label sorting learning adopts the RL (ranking loss) function [14] as the standard for measuring model performance. The calculation formula is shown in Eq. (1):

$$ \mathrm{RL}\left(f,D\right)=\frac{1}{n}{\sum}_{i=1}^n\frac{1}{\left|{Y}_i\right|\left|{\overline{Y}}_i\right|}\left|\left\{\left(y,\overline{y}\right)\ \middle|\ f\left({x}_i,y\right)\le f\left({x}_i,\overline{y}\right),\left(y,\overline{y}\right)\in {Y}_i\times {\overline{Y}}_i\right\}\right| $$

where Yi and \( {\overline{Y}}_i \) represent xi’s relevant label set and irrelevant label set, respectively. This indicator is used to calculate the proportion of errors in the label sorting process caused by relevant and irrelevant tags. Therefore, the smaller the value of RL is, the better. When RL is taken to 0, it means that the irrelevant tags are all behind the relevant tags on all the samples.
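As an illustration of Eq. (1), the following minimal sketch counts, for each sample, the relevant/irrelevant label pairs that are ordered wrongly. The function name `ranking_loss`, the array layout, and the convention that a sample with no relevant or no irrelevant labels contributes zero are assumptions made for this example.

```python
import numpy as np

def ranking_loss(scores, labels):
    """Ranking loss of Eq. (1).

    scores: (n, m) array, scores[i, j] = f(x_i, y_j)
    labels: (n, m) array, +1 for relevant labels and -1 for irrelevant ones
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for s, y in zip(scores, labels):
        rel, irr = s[y == 1], s[y == -1]
        if rel.size == 0 or irr.size == 0:
            continue  # assumption: no pair can be formed, so the sample adds 0
        # number of pairs (y, y_bar) with f(x_i, y) <= f(x_i, y_bar)
        wrong = np.sum(rel[:, None] <= irr[None, :])
        total += wrong / (rel.size * irr.size)
    return total / len(scores)

# Hypothetical scores for two samples and three labels.
scores = [[0.9, 0.2, 0.1],
          [0.3, 0.8, 0.5]]
labels = [[1, -1, -1],
          [1, -1, 1]]
rl = ranking_loss(scores, labels)
```

In this toy input the first sample is ranked perfectly and the second has both of its relevant labels below the irrelevant one, so the two samples contribute 0 and 1 to the average.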

Method—age estimation mode of multi-label sorting

The primary difficulty faced by research on age estimation of face images is the lack of training samples. To obtain better age estimation results, multiple labels replace the original single label on each face sample, and sorting learning is based on the correlation between the age labels and the face image. In this way, an age estimation algorithm based on multi-label sorting learning can establish the mapping between face image and age well.

Multi-label sorting learning is an important means of alleviating both the specificity of age estimation and the inaccurate age estimates caused by insufficient training data. The algorithm first expresses a facial image that has only a single age label as a vector, in which the size of each element represents the degree of correlation between the facial image and the corresponding age label. It then sorts the ages in ascending order and makes full use of the ordered information between age labels [15, 16] to integrate the estimation models of all ages into one model matrix. The age estimation model is implemented through this model matrix, instead of by constructing multiple binary classifiers. During model building, the trace norm [17] of the model matrix is introduced to control the complexity of the model, and matrix recovery theory is used to obtain the optimal solution [18, 19]. At test time, the learned model predicts a correlation vector for the face sample, all elements of the vector are sorted in descending order, and the age label with the largest correlation is selected as the estimated age, on the basis that a larger sorting value indicates higher relevance between the face image and the age label. This age estimation model makes full use of the limited face age dataset by introducing a multi-label sorting technique, and because it relies on simple matrix operations, it also shortens model training time [20].

Based on multi-label face samples

To learn the age estimation model by multi-label sorting, face images are marked with multiple age labels. First, \( X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n} \) represents the input of the training sample set, \( T = \{t_1, t_2, \dots, t_m\} \) is the set of all age labels, \( t_1 \prec t_2 \prec \dots \prec t_m \) represents the order relationship between the age labels, and \( Y = [y_1, y_2, \dots, y_n] \in \{0, 1\}^{m \times n} \) is the age label status of the training sample set. If the sample \( x_i \) is marked with the age label \( t_j \), the two are related and the corresponding element of the age label vector \( y_i \) is \( y_{ij} = 1 \); otherwise, they are not related and \( y_{ij} = 0 \). With this representation, multiple age labels are used instead of a single age label in the age dataset.

A single-label sample is thus converted to a multi-label sample, where each face image corresponds to a label vector. The traditional face age estimation problem is often converted into a multi-class classification problem based on individual age labels, in which positive and negative samples are divided for each age value to construct a binary classifier. With the multi-label representation, simple and effective matrix operations can not only implement the age estimation algorithm as multi-label learning, but also learn the relationship between ages.
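The conversion from single ages to a label matrix \( Y \in \{0,1\}^{m \times n} \) can be sketched as follows. The text does not state which additional age labels are attached to a face, so the rule used here (mark every age within `window` years of the annotated age) is purely an assumption for illustration, as are the names `age_label_matrix` and `window`.

```python
import numpy as np

def age_label_matrix(ages, age_set, window=1):
    """Build Y with Y[j, i] = 1 when age label t_j is marked for sample i.

    ages:    annotated age of each of the n samples
    age_set: all age labels t_1 < t_2 < ... < t_m in ascending order
    window:  hypothetical rule: ages within +/- window years count as relevant
    """
    ages = np.asarray(ages)
    age_set = np.asarray(age_set)
    return (np.abs(age_set[:, None] - ages[None, :]) <= window).astype(int)

# Two faces annotated with ages 20 and 35, label set 18..40.
Y = age_label_matrix([20, 35], age_set=np.arange(18, 41), window=1)
```

With `window=0` the function falls back to the usual one-hot single-label encoding, which makes the difference between the two representations easy to inspect.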

The establishment of age estimation model

After the multi-label representation of face images, all age labels are integrated in ascending order, and the traditional face age estimation problem based on multi-age classification is converted into learning an age matrix. To control the complexity of the age estimation model, a matrix norm of the model is introduced, and matrix estimation theory is used to solve for the age estimation function. The specific process of the model is shown in Fig. 1.

Fig. 1 Algorithm flow chart

Before the age estimation function is established, the relevant functions are described mathematically. Let \( \ell(z) \) be a loss function, and let the prediction function \( f_i(x) \) give the confidence that the age of a test sample is \( t_i \); the confidence value is generally between 0 and 1. The learned model should give a higher score to the age labels with which a sample is marked. To measure whether the age predicted by the age estimation function is accurate, \( \varepsilon_{j,k}(x_i, y_i) \) is used to calculate the sorting loss between ages \( t_j \) and \( t_k \) for face sample \( x_i \) with label vector \( y_i \). The calculation is given in Eq. (2):

$$ {\varepsilon}_{j,k}\left({x}_i,{y}_i\right)=I\left({y}_{ij}\ne {y}_{ik}\right)\mathrm{\ell}\left(\left({y}_{ij}-{y}_{ik}\right)\left({f}_j\left({x}_i\right)-{f}_k\left({x}_i\right)\right)\right) $$

Here \( I(z) \) is the indicator function, which takes the value 1 when z is logically true and 0 when z is logically false. Eq. (2) shows that if the age label of face sample \( x_i \) is \( t_j \) rather than \( t_k \), that is, \( y_{ij} = 1 \) and \( y_{ik} = 0 \), the prediction function should yield correlation measurements with \( f_j(x_i) > f_k(x_i) \). The smaller the age difference between the age labels \( t_j \) and \( t_k \), the closer the predicted correlation measurements and the smaller the sorting loss between them; conversely, the larger the gap between the age labels, the greater the difference in the calculated correlation and the more likely a larger sorting loss. This matches the requirement that the correct age label be ranked ahead of incorrect age labels in the face age estimation process. When the face sample \( x_i \) is marked with neither \( t_j \) nor \( t_k \), there is no sorting loss between the age labels \( t_j \) and \( t_k \). The sorting loss incurred by a single face image sample over all age labels is therefore given by Eq. (3):

$$ \varepsilon \left({x}_i,{y}_i\right)={\sum}_{j,k=1}^m{\varepsilon}_{j,k}\left({x}_i,{y}_i\right) $$

The sorting loss over the entire training data set is obtained by summing the loss of the individual face samples, \( \sum_{i=1}^n \varepsilon\left({x}_i,{y}_i\right) \). To simplify the calculation, the prediction function is restricted to a linear function \( {f}_i(x)={w}_i^Tx \). The prediction functions of all age labels are combined into a parameter matrix, and \( W = [w_1, w_2, \dots, w_m] \in \mathbb{R}^{d \times m} \) is defined as the matrix parameter to be learned. Letting f(W) denote the sorting loss over the entire training data set, f(W) is defined according to the multi-label sorting theory above as shown in Eq. (4):

$$ f(W)=\frac{1}{n}{\sum \limits}_{i=1}^n{\sum \limits}_{j,k=1}^m{\varepsilon}_{j,k}\left({x}_i,{y}_i\right) $$
$$ =\frac{1}{n}{\sum}_{i=1}^n{\sum}_{j,k=1}^mI\left({y}_{ij}\ne {y}_{ik}\right)\mathrm{\ell}\left(\left({y}_{ij}-{y}_{ik}\right)\left({w}_j^T{x}_i-{w}_k^T{x}_i\right)\right) $$
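To make Eq. (4) concrete, the following minimal sketch evaluates f(W) with NumPy. It assumes the logistic loss in the form \( \ell(z) = \log(1 + e^{-z}) \), so that a larger margin gives a smaller loss; the function names and array shapes are choices made for this example only.

```python
import numpy as np

def logistic_loss(z):
    # l(z) = log(1 + exp(-z)), evaluated in a numerically stable way
    return np.logaddexp(0.0, -z)

def sorting_loss(W, X, Y):
    """Sorting loss f(W) of Eq. (4).

    W: (d, m) model matrix, X: (d, n) features, Y: (m, n) labels in {0, 1}
    """
    n = X.shape[1]
    F = W.T @ X                                   # F[j, i] = w_j^T x_i
    total = 0.0
    for i in range(n):
        dy = Y[:, i][:, None] - Y[:, i][None, :]  # y_ij - y_ik
        df = F[:, i][:, None] - F[:, i][None, :]  # f_j(x_i) - f_k(x_i)
        mask = dy != 0                            # I(y_ij != y_ik)
        total += logistic_loss(dy[mask] * df[mask]).sum()
    return total / n

# Toy data: three samples, three age labels, sample i carries label i.
X = np.eye(3)
Y = np.eye(3, dtype=int)
loss_zero = sorting_loss(np.zeros((3, 3)), X, Y)
loss_good = sorting_loss(5.0 * np.eye(3), X, Y)
loss_bad = sorting_loss(-5.0 * np.eye(3), X, Y)
```

A model that scores the marked age above the others (`loss_good`) should incur a smaller loss than the zero model, and a model that inverts the order (`loss_bad`) a larger one.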

When training samples are in serious shortage, directly searching for the W that minimizes the sorting loss f(W) easily leads to overfitting. On the other hand, long-term study of age estimation has found that the face changes slowly with age and that there is an orderly correlation between the age labels. To take full advantage of this correlation, we assume that the prediction functions in W are linearly dependent, which makes the matrix W low rank, and the final optimization problem can be expressed by Eq. (5):

$$ \underset{W\in \psi }{\min }\ f(W)\kern0.5em \mathrm{s.t.}\ \psi =\left\{W\in {\mathbb{R}}^{d\times m},\operatorname{rank}(W)\le r,{\left|\left|W\right|\right|}_2\le s\right\} $$

Here \( {\left|\left|\cdot \right|\right|}_2 \) denotes the matrix spectral norm and ψ denotes the feasible range of the matrix W, which combines control of the prediction model's complexity with the low-rank property of the matrix. Because the rank constraint is non-convex, solving Eq. (5) directly is computationally very expensive. For convenience of calculation, introduce the inequality in Eq. (6):

$$ {\left|\left|W\right|\right|}_{\ast}\le \operatorname{rank}(W){\left|\left|W\right|\right|}_2 $$

Here \( {\left|\left|\cdot \right|\right|}_{\ast } \) denotes the matrix trace norm. Using inequality (6), the non-convex constraint set is replaced with the set in Eq. (7):

$$ {\psi}^{\prime }=\left\{W\in {\mathbb{R}}^{d\times m}:{\left|\left|W\right|\right|}_{\ast}\le sr\right\} $$

So the original optimization problem becomes that of solving \( \underset{W\in {\psi}^{\prime }}{\min }\ f(W) \).
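The relaxation rests on inequality (6), which follows because the trace norm is the sum of the nonzero singular values, of which there are rank(W), each no larger than the spectral norm. A small numerical check on a randomly generated low-rank matrix (the sizes and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# A 6 x 5 matrix built as a product of thin factors, so its rank is 2.
W = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))

sigma = np.linalg.svd(W, compute_uv=False)  # singular values, descending
trace_norm = sigma.sum()                    # ||W||_*
spectral_norm = sigma[0]                    # ||W||_2
rank = np.linalg.matrix_rank(W)

print(trace_norm, "<=", rank * spectral_norm)
```

Any W satisfying the constraints of Eq. (5) therefore has trace norm at most sr, so the convex set in Eq. (7) contains the original non-convex one.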

To further simplify the model, the constraint on the value range of the matrix is moved into the objective as a regular term, and the final age estimation objective function is the optimization problem shown in Eq. (8):

$$ \underset{W\in {\mathbb{R}}^{d\times m}}{\min }\ F(W):= f(W)+\lambda {\left|\left|W\right|\right|}_{\ast } $$

Here the regularization parameter takes a value λ > 0. It balances the regular term against the loss on the training sample set and prevents overfitting of the objective function.

Optimal solution

The multi-label sorting objective function established above is convex, so a gradient descent algorithm can be used to solve it. To simplify the calculation, the logistic function [21] \( \ell(z) = \log \left(1+{e}^{-z}\right) \) is used in this paper as the loss in Eq. (2). At the tth iteration, given that the current solution of Eq. (8) is \( W_t \), first calculate the gradient of the objective function F(W) at \( W = W_t \); denoting this gradient by \( \nabla F\left({W}_t\right) \), the updated solution of the objective function is obtained as shown in Eq. (9):

$$ {W}_{t+1}={W}_t-{\eta}_t\nabla F\left({W}_t\right) $$

Here \( \eta_t \) represents the step length used at the tth iteration, which is set to a value greater than 0. Since \( {U}_t{V}_t^T \) is a subgradient of \( {\left|\left|W\right|\right|}_{\ast } \) at \( W = W_t \), where \( {W}_t={U}_t{\Sigma}_t{V}_t^T \) is the SVD (singular value decomposition) of \( W_t \), the gradient is given by Eq. (10):

$$ \nabla F\left({W}_t\right)=\lambda {U}_t{V}_t^T+\frac{1}{n}{\sum}_{i=1}^n{\sum}_{j,k=1}^m{\alpha}_{jk}^i{x}_i{\left({e}_j^m-{e}_k^m\right)}^T $$

where \( {\alpha}_{jk}^i=I\left({y}_{ij}\ne {y}_{ik}\right)\left({y}_{ij}-{y}_{ik}\right){\ell}^{\prime}\left(\left({y}_{ij}-{y}_{ik}\right){x}_i^T\left({w}_j-{w}_k\right)\right) \), the factor \( \left({y}_{ij}-{y}_{ik}\right) \) coming from the chain rule applied to Eq. (4), and \( {e}_j^m \) is an m-dimensional vector in which only the jth element is 1 and all other elements are 0.
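The smooth part of the gradient in Eq. (10) can be sketched and checked against a finite difference as follows. The sketch assumes the logistic loss \( \ell(z) = \log(1 + e^{-z}) \) and differentiates Eq. (4) directly with the chain rule, so the factor \( (y_{ij} - y_{ik}) \) appears explicitly in the coefficients; function names and shapes are choices made for this example.

```python
import numpy as np

def sorting_loss(W, X, Y):
    """f(W) of Eq. (4) with l(z) = log(1 + exp(-z))."""
    F = W.T @ X
    total = 0.0
    for i in range(X.shape[1]):
        dy = Y[:, i][:, None] - Y[:, i][None, :]
        df = F[:, i][:, None] - F[:, i][None, :]
        total += np.logaddexp(0.0, -(dy * df))[dy != 0].sum()
    return total / X.shape[1]

def smooth_grad(W, X, Y):
    """Gradient of f(W); W: (d, m), X: (d, n), Y: (m, n) in {0, 1}."""
    n = X.shape[1]
    F = W.T @ X
    G = np.zeros(W.shape, dtype=float)
    for i in range(n):
        y, f = Y[:, i], F[:, i]
        dy = y[:, None] - y[None, :]               # y_ij - y_ik
        z = dy * (f[:, None] - f[None, :])
        lprime = -0.5 * (1.0 - np.tanh(z / 2.0))   # l'(z) = -1 / (1 + exp(z))
        coef = dy * lprime                         # zero where y_ij == y_ik
        # sum_{j,k} coef[j, k] * x_i (e_j - e_k)^T collapses to one outer
        # product: row sums weight e_j, column sums weight e_k.
        weights = coef.sum(axis=1) - coef.sum(axis=0)
        G += np.outer(X[:, i], weights)
    return G / n

# Finite-difference comparison on random toy data.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
X = rng.standard_normal((3, 5))
Y = np.eye(4, dtype=int)[:, rng.integers(0, 4, size=5)]  # one label per sample
G = smooth_grad(W, X, Y)
E = np.zeros_like(W)
E[1, 2] = 1e-6
fd = (sorting_loss(W + E, X, Y) - sorting_loss(W - E, X, Y)) / 2e-6
```

The full gradient of Eq. (10) would add \( \lambda U_t V_t^T \) from the trace norm term; it is left out here because the proximal update used later handles that term separately.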

Because the SVD is expensive to compute, evaluating the gradient in Eq. (10) at every iteration can incur a huge computational cost. For smooth objective functions, many researchers have found that accelerated gradient methods can reach a convergence rate of \( O\left({T}^{-2}\right) \). In recent years, a similar result has been found for objective functions that consist of a smooth term plus a trace norm regular term: they can also be solved in an accelerated manner. Therefore, in this paper, the accelerated proximal gradient (APG) algorithm [22] is used to solve the optimization problem in Eq. (8). According to the APG algorithm, the update is given by Eq. (11):

$$ {W}_t=\arg \underset{W}{\min}\frac{1}{2{\eta}_t}{\left|\left|W-{W}_t^{\prime}\right|\right|}_F^2+\lambda {\left|\left|W\right|\right|}_{\ast } $$

where \( {W}_t^{\prime }={W}_{t-1}-{\eta}_t\nabla f\left({W}_{t-1}\right) \). The optimal solution of Eq. (11) is obtained from the SVD of \( {W}_t^{\prime } \), as shown in Eq. (12):

$$ {W}_t=U{\Sigma}_{\lambda {\eta}_t}{V}^T $$

In Eq. (12), \( {W}_t^{\prime }=U\Sigma {V}^T \) is the SVD of \( {W}_t^{\prime } \), and \( {\Sigma}_{\lambda {\eta}_t} \) is the diagonal matrix with entries \( {\left({\Sigma}_{\lambda {\eta}_t}\right)}_{ii}=\max \left\{0,{\Sigma}_{ii}-\lambda {\eta}_t\right\} \).
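The closed-form update of Eq. (12) is the singular value thresholding operator, which can be sketched in a few lines of NumPy. The function name `svt` and the argument `tau` (standing for \( \lambda \eta_t \)) are names chosen for this example.

```python
import numpy as np

def svt(W_prime, tau):
    """Shrink every singular value of W_prime by tau, clipping at zero."""
    U, s, Vt = np.linalg.svd(W_prime, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt   # scales the columns of U, then recombines

# Singular values 3 and 1 shrunk by tau = 2 become 1 and 0.
W_new = svt(np.diag([3.0, 1.0]), tau=2.0)
```

Singular values below the threshold are set exactly to zero, which is how the trace norm term drives the model matrix toward low rank.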

After the form of the optimal solution is obtained, the step value \( \eta_t \) of each iteration must be determined, which plays an important role in the acceleration algorithm. In this paper, a simple line search is used to find a suitable \( \eta_t \). Let \( {P}_{\eta}\left({W}_{t-1}\right) \) denote the optimal solution of Eq. (11) for step η, and let \( {Q}_{\eta}\left({P}_{\eta}\left({W}_{t-1}\right),{W}_{t-1}\right) \) denote the optimal value calculated by Eq. (11). Starting from an assumed step value, the step for each iteration is searched by reducing it while \( F\left({P}_{\eta}\left({W}_{t-1}\right)\right)>{Q}_{\eta}\left({P}_{\eta}\left({W}_{t-1}\right),{W}_{t-1}\right) \) holds. The optimal matrix is obtained after the optimization calculation described in Table 1.
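The update and step search described above can be sketched as a generic accelerated proximal gradient loop. This is an illustration under stated assumptions rather than the procedure of Table 1: the momentum sequence is the standard FISTA-style one, the step is halved whenever the smooth part exceeds its quadratic model, and `apg_trace_norm`, `eta0`, and `shrink` are hypothetical names. The loop is tested on a toy quadratic whose exact minimiser is known.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding, the solution form of Eqs. (11)-(12)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def apg_trace_norm(f, grad_f, W0, lam, eta0=1.0, shrink=0.5, iters=100):
    """Minimise f(W) + lam * ||W||_* for a smooth convex f."""
    W = W0.copy()
    Z = W0.copy()            # extrapolated search point (assumed FISTA form)
    alpha, eta = 1.0, eta0
    for _ in range(iters):
        g = grad_f(Z)
        while True:
            W_new = svt(Z - eta * g, lam * eta)   # proximal step
            diff = W_new - Z
            # Quadratic model of the smooth part. The trace norm term sits
            # on both sides of F > Q, so it cancels in the comparison.
            q = f(Z) + np.sum(g * diff) + np.sum(diff ** 2) / (2.0 * eta)
            if f(W_new) <= q + 1e-12:
                break
            eta *= shrink                          # step too long: reduce it
        alpha_new = (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0
        Z = W_new + ((alpha - 1.0) / alpha_new) * (W_new - W)
        W, alpha = W_new, alpha_new
    return W

# Toy problem: for f(W) = 0.5 * ||W - A||_F^2 the minimiser is svt(A, lam).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
W_hat = apg_trace_norm(lambda W: 0.5 * np.sum((W - A) ** 2),
                       lambda W: W - A,
                       np.zeros_like(A), lam=0.5)
```

Passing the loss and gradient as callables keeps the loop independent of the particular sorting loss, so the same routine could be reused with the loss of Eq. (4) and its gradient.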

Table 1 Multi-label sorting calculation algorithm

Age estimation prediction and model evaluation criteria

Age estimation prediction

The face age recognition algorithm based on multi-label sorting makes full use of the ordered information between age labels in the training samples. Through the multi-label sorting objective established in the previous section and its optimization, the age characteristic matrix is finally obtained. Denoting the obtained age characteristic matrix by \( {W}_{\ast } \), the prediction function of age estimation for this algorithm is as shown in Eq. (13):

$$ {y}_t={W}_{\ast}^T{x}_t $$

Here \( x_t \) is the facial feature vector of the test face sample and \( y_t \) is the age label relevance vector calculated from Eq. (13). The size of each element of this vector represents the degree of correlation between the tested face sample and the corresponding age label. All elements are therefore sorted in descending order of correlation, and the top-ranked age value, whose correlation with the tested face sample is largest, is taken as the one most likely to be close to the real age. In this way the age of the facial image is estimated, completing the design and implementation of the face age estimation algorithm based on multi-label sorting.
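The prediction step of Eq. (13) amounts to one matrix-vector product followed by a sort. A minimal sketch, in which `predict_age`, the toy matrix, and the three-label age set are illustrative assumptions:

```python
import numpy as np

def predict_age(W_star, x, age_set):
    """Return the estimated age and the relevance vector y_t of Eq. (13)."""
    relevance = W_star.T @ x             # one relevance score per age label
    order = np.argsort(-relevance)       # age labels by descending relevance
    return age_set[order[0]], relevance

# Toy model with d = 3 features and m = 3 age labels.
W_star = np.eye(3)
age_set = np.array([20, 21, 22])
x_t = np.array([0.1, 0.7, 0.2])
age, y_t = predict_age(W_star, x_t, age_set)
```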

Estimation model evaluation criteria

The face age estimation algorithm mainly uses mean absolute error (MAE) and cumulative score (CS) as the standard to measure the accuracy of age estimation [23].

Mean absolute error (MAE)

The mean absolute error is the average, over all tested face images, of the absolute error between the predicted age and the true age. The specific formula is shown in Eq. (14):

$$ \mathrm{MAE}=\frac{1}{\widehat{N}}{\sum}_{i=1}^{\widehat{N}}\left|{a}_i-{\widehat{a}}_i\right| $$

Among them, \( \widehat{N} \) is the number of tested face samples and ai and \( {\widehat{a}}_i \) are the true age of the ith tested face sample and the predicted age obtained by the age estimation algorithm, respectively. MAE visually describes the accuracy of sample set estimation through the age estimation algorithm. The smaller the MAE value, the higher the accuracy of the age estimation.
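As a worked instance of Eq. (14), the following sketch computes the MAE for three hypothetical test faces; the ages are made-up values for illustration.

```python
import numpy as np

def mean_absolute_error(true_ages, pred_ages):
    """MAE of Eq. (14): average absolute gap between true and predicted age."""
    true_ages = np.asarray(true_ages, dtype=float)
    pred_ages = np.asarray(pred_ages, dtype=float)
    return float(np.mean(np.abs(true_ages - pred_ages)))

# Absolute errors are 2, 0 and 6 years, so the MAE is 8 / 3 years.
mae = mean_absolute_error([20, 35, 50], [22, 35, 44])
```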

Cumulative score (CS)

The cumulative score is the proportion of test samples for which the difference between the age estimated by the algorithm and the true age of the face image is less than or equal to a predetermined threshold, relative to the total number of test samples. The specific formula is shown in Eq. (15):

$$ \mathrm{CS}(e)=\frac{1}{\widehat{N}}{\sum}_{i=1}^{\widehat{N}}g\left(\left|{a}_i-{\widehat{a}}_i\right|-e\right) $$

Here g(·) is a Boolean function with g(x) = 1 if x ≤ 0 and g(x) = 0 otherwise; e is the error tolerance, which is the set threshold; \( \widehat{N} \) is the number of test samples; \( a_i \) and \( {\widehat{a}}_i \) are the true age and the predicted age of the ith tested face sample, respectively. For a given threshold e, the larger the value of CS, the more samples satisfy the condition and the better the age estimation. The threshold is generally set to at most 10 years, because this is taken as the upper limit of the age estimation error.
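Eq. (15) can be evaluated at several thresholds to trace out a CS curve. A minimal sketch on the same kind of made-up ages (the values and the function name are assumptions for illustration):

```python
import numpy as np

def cumulative_score(true_ages, pred_ages, e):
    """CS(e) of Eq. (15): fraction of samples with absolute error <= e."""
    err = np.abs(np.asarray(true_ages, dtype=float)
                 - np.asarray(pred_ages, dtype=float))
    return float(np.mean(err <= e))

true_ages = [20, 35, 50]
pred_ages = [22, 35, 44]          # absolute errors: 2, 0, 6 years
cs_curve = [cumulative_score(true_ages, pred_ages, e) for e in range(0, 11)]
```

Because the set of samples within tolerance can only grow as e increases, the resulting curve is non-decreasing and reaches 1 once e covers the largest error.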

The performance of the age estimation method can be evaluated from different perspectives based on two evaluation criteria. The MAE values reflect the error level of the age estimation algorithm as a whole; the CS reflects the accuracy of the age estimation method through the error statistic curve within each age error range. These two evaluation methods are complementary and coordinated.

Experimental result and discussions

To verify the accuracy of the proposed age estimation algorithm, a series of experiments was conducted on two authoritative age datasets, FG-NET and Refined-MORPH. To assess its performance, the algorithm is also compared experimentally with several mainstream algorithms that have high age estimation accuracy.

Experiment setup

To test the performance of the multi-label sorting algorithm, this paper selects two public age datasets, FG-NET and Refined-MORPH; according to the characteristics of each dataset, different facial features were extracted and the experiments were organized and tested.

(1) The FG-NET dataset contains 1002 face images of 82 different individuals, scanned from old photographs. Each individual has 6 to 18 face images at different ages, and the ages range from 0 to 69 years. The AAM (active appearance model) features [24] are extracted from the faces using am_tools; they incorporate both face shape and texture information, and thus fully reflect the changes in the facial skeleton and the slackening of the skin during human growth. To extract the AAM features of a face image, first calibrate 68 key feature points of the face, then extract the facial features with reference to the AAM feature section, and finally pick out the 95% AAM features used in the tests. The AAM features are selected from the extracted features using a PCA (principal component analysis) dimension reduction algorithm, and the resulting data has 200 dimensions. Given the characteristics of this dataset, the LOPO (leave-one-person-out) protocol is used to divide training and test sets in the experimental design: all face images of one individual are selected as the test set, and all remaining face images are used as the training set. After 82 such experiments, the average of all results is used as the final age estimation result. Since the FG-NET dataset has a small number of pictures and was collected per individual, the LOPO protocol gives a more precise and scientific experimental design.

(2) In order to avoid the influence of gender and ethnicity on face age estimation studies, this paper uses the Refined-MORPH dataset. The dataset consists of 21,060 face samples carefully selected from the MORPH-II dataset: 2570 white female (WF) face photos, 7960 white male (WM) face photos, 2570 black female (BF) face photos, and 7960 black male (BM) face photos. The dataset is thus balanced between white and black faces. To verify the influence of gender and ethnicity on face age estimation, a total of 16 experiments were set up covering four settings: same gender and race, cross-race, cross-gender, and cross-race and cross-gender. Before the experiments, 4096-dimensional BIF features, which have been favored by many researchers in face age estimation in recent years, were extracted from this dataset.
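The LOPO protocol described in item (1) can be sketched as a generator over subjects. The identifiers below are made-up, and `lopo_splits` is a name chosen for this example; for FG-NET the loop would run over the 82 individuals.

```python
import numpy as np

def lopo_splits(subject_ids):
    """Yield (subject, train_indices, test_indices), one split per person."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        # all images of this person are held out, everything else trains
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Six images belonging to three hypothetical individuals.
subject_ids = [1, 1, 2, 3, 3, 3]
splits = list(lopo_splits(subject_ids))
```

The per-subject results would then be averaged, as the text describes, to give the final figure for the dataset.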

To ensure the scientific rigor and consistency of the comparison and to eliminate extraneous influences, the age estimation algorithm proposed in this paper and the comparison algorithms are run under the same conditions, including the dataset and face recognition procedure. To verify the performance of multi-label sorting learning on age estimation, this section selects several classic age estimation algorithms for comparison, including commonly used classification and regression algorithms for age estimation. In the experiments, the LIBSVM tool was used to train the SVM and SVR face age estimation models. For the kNN algorithm, k is set to 30 according to general experience. The experiments with AGES, OHRank, and LD were performed with the algorithms and parameters designed by their authors. In addition, the convergence of the algorithm is verified on the two datasets.

Meanwhile, to further verify the validity of the proposed multi-label learning and to evaluate the impact of the sorting loss (R) and the trace norm (T) in the objective function on age estimation, they are compared with the original classification loss (C) and the F norm (F), giving four groups of objective loss functions: C&F, C&T, R&F, and R&T.

Analysis of results

According to the above experimental setup, this paper compares the proposed age estimation algorithm with the corresponding algorithm on the FG-NET and Refined-MORPH datasets and conducts detailed analysis and evaluation. In order to make a scientific evaluation of the algorithm, two evaluation indexes the MAE and the CS were used in this paper.

FG-NET dataset

The MAE values and CS curves are calculated on the FG-NET dataset. The MAE values of all algorithms are shown in Table 2.

Table 2 MAE results of FG-NET dataset

Table 2 shows the MAE values of the AGES, SVM, SVR, kNN, OHRank, IIS-LLD, and CPNN algorithms on the FG-NET dataset. Although the more recently proposed OHRank and CPNN algorithms have reduced the age estimation error, the multi-label sorting algorithm proposed in this paper reduces the estimation error by a further 3% and 9%, respectively, compared with these two, and achieves the best result among all compared algorithms. The CS curves on the FG-NET dataset, calculated according to the CS evaluation index, are shown in Fig. 2; the threshold here covers age errors of up to 10 years.

Fig. 2 CS curves with different error levels on the FG-NET dataset

It can be seen from Fig. 2 that although the FG-NET dataset is small, the multi-label sorting learning algorithm proposed in this paper achieves almost the maximum CS value at each error level. This shows that, on the whole, the number of samples the algorithm estimates accurately grows as the acceptable error increases.

Fig. 3 Convergence of the objective function on FG-NET dataset

Refined-MORPH dataset

Table 3 shows the MAE values calculated by using the different algorithms in the Refined-MORPH dataset based on the face experiments of different genders and races.

Table 3 MAE results of Refined-MORPH dataset

Table 3 shows the MAE values of the AGES, SVM, SVR, kNN, OHRank, IIS-LLD, and CPNN algorithms on the Refined-MORPH dataset. In this table, the more recent OHRank and label distribution algorithms still generally outperform the earlier algorithms, but on the whole they are not as good as the multi-label sorting learning algorithm proposed in this paper. In addition, the table shows that the smallest and the next smallest MAE values occur within the same gender and ethnic group, while the largest MAE value occurs regardless of gender and ethnicity. This finding shows that face age estimation is sensitive to gender and ethnicity. From the data in Table 3, it can be concluded that the algorithm proposed in this paper is superior to the other algorithms in the face age recognition experiments across different genders and races.

Convergence detection

Figures 3 and 4 show the convergence of the algorithm’s objective function on the FG-NET and Refined-MORPH datasets, respectively. The objective converges quickly on both datasets. This illustrates that recasting age estimation as a matrix model through multi-label sorting learning not only simplifies the steps of the age estimation model but also shortens the time needed to build it.

Fig. 4 Convergence of the objective function on the Refined-MORPH dataset
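Convergence curves of this kind can be reproduced in miniature. The paper states that its trace-norm-regularized model is optimized with the APG algorithm; the sketch below shows a generic accelerated proximal gradient loop whose proximal step is singular value thresholding, the standard proximal operator of the trace norm. It is only an illustration under stated assumptions: a least-squares data term stands in for the paper's ranking loss, and the names `svt`, `trace_norm`, `apg_trace_norm`, `lam`, and `tol` are hypothetical.

```python
import numpy as np


def trace_norm(W: np.ndarray) -> float:
    """Trace (nuclear) norm: the sum of the singular values of W."""
    return float(np.linalg.svd(W, compute_uv=False).sum())


def svt(W: np.ndarray, tau: float) -> np.ndarray:
    """Singular value thresholding: shrink every singular value of W by tau."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt


def apg_trace_norm(X, Y, lam, n_iter=200, tol=1e-8):
    """Minimize 0.5 * ||X W - Y||_F^2 + lam * ||W||_* with an accelerated
    proximal gradient loop. The least-squares term is an assumed stand-in
    for the paper's loss. Returns W and the objective value per iteration."""
    lipschitz = np.linalg.norm(X, 2) ** 2  # largest singular value of X, squared
    W = np.zeros((X.shape[1], Y.shape[1]))
    Z = W.copy()  # extrapolated point
    t = 1.0
    history = []
    for _ in range(n_iter):
        grad = X.T @ (X @ Z - Y)  # gradient of the smooth term at Z
        W_next = svt(Z - grad / lipschitz, lam / lipschitz)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = W_next + ((t - 1.0) / t_next) * (W_next - W)  # momentum step
        W, t = W_next, t_next
        objective = 0.5 * np.linalg.norm(X @ W - Y, "fro") ** 2 + lam * trace_norm(W)
        history.append(float(objective))
        # Stop once the relative change of the objective is negligible.
        if len(history) > 1 and abs(history[-2] - history[-1]) <= tol * max(1.0, abs(history[-2])):
            break
    return W, history
```

Plotting `history` against the iteration index gives a curve of the same kind as Figs. 3 and 4. The step size `1 / lipschitz` is the usual safe choice for a least-squares term; a different loss would need its own gradient and step-size rule.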

Comparison between different loss functions

To verify the effect of the ranking loss RL (R) and the trace norm (T) in the objective loss function on age estimation, classification loss (C) and the F norm (F) are used as baselines, giving four loss functions in total: C&F, C&T, R&F, and R&T. Table 4 shows the MAE values obtained on the FG-NET and Refined-MORPH datasets. R&T achieves a smaller MAE value than the other loss functions, which supports the effectiveness of the algorithm.
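To make the ranking loss concrete, the sketch below computes the standard multi-label ranking loss for one sample: the fraction of (relevant, irrelevant) label pairs in which the irrelevant label is scored at least as high as the relevant one. That the paper's RL term takes exactly this form is an assumption, and the function name and the example scores are hypothetical.

```python
from typing import Sequence


def ranking_loss(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    """Fraction of (relevant, irrelevant) label pairs that are ordered wrongly.

    scores[i] is the model's score for label i; relevant[i] says whether
    label i belongs to the sample. Ties count as errors.
    """
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have the same length")
    positive = [s for s, r in zip(scores, relevant) if r]
    negative = [s for s, r in zip(scores, relevant) if not r]
    if not positive or not negative:
        return 0.0  # no pairs to compare
    wrong = sum(1 for p in positive for n in negative if p <= n)
    return wrong / (len(positive) * len(negative))
```

As a usage example, take five age labels 18 to 22 with the labels 19, 20, and 21 marked relevant. The scores `[0.1, 0.6, 0.9, 0.4, 0.5]` rank age 22 above age 21, so one of the six pairs is wrong and the loss is 1/6. Unlike a per-label classification loss, this quantity depends only on the relative order of the scores, which is the property that lets it capture ordering information among age labels.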

Table 4 MAE results of FG-NET dataset

Based on the above experimental analysis, the multi-label sorting learning algorithm for age estimation proposed in this paper achieves good results in terms of MAE, the CS curve, and convergence speed, which is mainly attributed to the effectiveness of multi-label sorting learning. First, representing face samples with multiple age labels when training samples are limited not only enriches the representation of the dataset to a certain extent but also turns the traditional multi-category age estimation method into the solution of an age matrix, thereby shortening the training time of the age estimation model. Second, to learn the ordinal information among the age labels, the algorithm uses the sorting loss function and introduces the matrix norm, which reduces the age estimation error; the fourth set of experiments verifies the effectiveness of the proposed objective loss function.
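The idea of representing one face sample with several age labels can be illustrated with a toy encoding. The sketch below marks every age within a fixed window of the annotated age as relevant and decodes a prediction by taking the highest-scoring label. The fixed window is a simplifying assumption standing in for the paper's correlation-based rule, and `encode_age`, `decode_age`, `window`, and the default age range are hypothetical.

```python
from typing import List, Sequence


def encode_age(age: int, min_age: int = 0, max_age: int = 69, window: int = 2) -> List[int]:
    """Turn a single age into a 0/1 vector over min_age..max_age.

    Ages within `window` years of `age` are marked 1 (assumed stand-in
    for the paper's correlation-based label assignment).
    """
    if not min_age <= age <= max_age:
        raise ValueError("age is outside the label range")
    return [1 if abs(a - age) <= window else 0 for a in range(min_age, max_age + 1)]


def decode_age(scores: Sequence[float], min_age: int = 0) -> int:
    """Return the age whose label has the highest score (first one on ties)."""
    if len(scores) == 0:
        raise ValueError("scores must not be empty")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return min_age + best_index
```

Stacking the encoded vectors of all training samples row by row gives a label matrix, so a single matrix model can be fitted instead of one classifier per age. For example, `encode_age(20, min_age=18, max_age=24, window=1)` yields `[0, 1, 1, 1, 0, 0, 0]`.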


Although multi-label learning is widely used in fields such as text analysis and bioinformatics and shows good generalization ability in dealing with complex objects, whether it can improve the accuracy of face age estimation has remained largely unexplored. To address the shortage of age data, this paper first transforms the single age label of each face image sample into a multi-label vector according to the degree of correlation and then assembles an age feature matrix ordered by age, which replaces the traditional approach of multiple binary classifications. Working with this age matrix simplifies the tedious steps of the age estimation problem and shortens the training time of the age estimation model. At the same time, to take full advantage of the ordering information in the age labels and make up for the lack of training samples, the multi-label learning method uses a ranking loss function to learn the order among all age labels and introduces a matrix trace norm to control the complexity of the age estimation model. The model is optimized with the APG algorithm. A series of experiments on two age datasets verifies the efficiency and accuracy of the proposed multi-label learning algorithm in comparison with several classical age estimation algorithms.



AAM: Active appearance model feature
AGES: Aging Patterns Subspace
CPNN: Conditional Probability Neural Network
CS: Cumulative score
IIS-LLD: Improved Iterative Scaling-Learning from Label Distribution
kNN: k-nearest neighbor sorting
LD: Label distribution
MAE: Mean absolute error
OHRank: Ordinal hyperplanes ranker
PCA: Principal component analysis
RL: Ranking loss
SVD: Singular value decomposition
SVM: Support vector machine
SVR: Support vector regression
WM: White male


  1. X. Geng, Z.H. Zhou, K. Smithmiles, Correction to “automatic age estimation based on facial aging patterns”. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 368–381 (2008)

    Article  Google Scholar 

  2. G. Guo, G. Mu, Y. Fu, Human age estimation using bio-inspired features. Comput. Vis. Pattern Recognit. 2009. CVPR 2009. IEEE, 112–119 (2009)

    Google Scholar 

  3. G. Guo, G. Mu, Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression. Comput. Vis. Pattern Recognit. IEEE, 42(7), 657–664 (2011).

  4. G. Guo, Y. Fu, T.S. Huang, Locally adjusted robust regression for human age estimation. IEEE Trans. Pattern Anal. Mach. Intell. 76(6), 331–346 (2014)

    Google Scholar 

  5. C. Li, Q. Liu, Image-based human age estimation by manifold learning and locally adjusted robust regression. IEEE Trans. Image Process. 64(1), 1176–1188 (2016)

    MathSciNet  Google Scholar 

  6. R. Hong, Z. Hu, L. Liu, Understanding blooming human groups in social networks. IEEE Trans. Multimedia 17(11), 1–15 (2016)

    Google Scholar 

  7. C. Li, Q. Liu, W. Dong, Human age estimation based on locality and ordinal information. IEEE Trans. Cybern 45(11), 2522–2534 (2017)

    Article  Google Scholar 

  8. X. Geng, Z.H. Zhou, K.S. Miles, Facial Age Estimation by Learning from Lable Distributions (Proceedings of the 24th AAAI Conference on Artificial Intelligence, Atlanta, 2010), pp. 451–456

    Google Scholar 

  9. C. Yin, X. Geng, Facial age estimation by conditional probability neural network. Pattern Recognit. Springer, Berlin Heidelberg 15(2), 243–250 (2012)

    Google Scholar 

  10. Q. Zhao, X. Geng, Selection of objective functions in market-distributive learning. Com. Sci. Explor. 11(5), 708–719 (2017)

    Google Scholar 

  11. Liao H B, Yan Y C, Dai W H and Fan P: Age Estimation of Face Images Based on CNN and Divide-And-Rule Strategy, Mathematical Problems in Engineering, 2018

    Google Scholar 

  12. R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37(3), 297–336 (1999)

    Article  Google Scholar 

  13. G. Liu, Z. Lin, S. Yan, Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 171–184 (2010)

    Article  Google Scholar 

  14. Z.B. Ren, L.L. Wang, Z.L. Fu, Multi-label classification integration learning algorithm based on ranking loss. Comput. Appl. 33((S1)), 40–42 (2013) 68

    Google Scholar 

  15. C.W.L. Chao, J.Z. Liu, J.J. Ding, Facial age estimation based on label-sensitive learning and age-oriented regression. Pattern Recogn. 46(3), 628–641 (2013)

    Article  Google Scholar 

  16. K. Chen, S. Gong, T. Xiang, Cumulative attribute space for age and crowd density estimation. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE Comput Soc. 9(4), 2467–2474 (2013)

  17. K. Yu, X.J. Wu, Semi-supervised community discovery of latent mapping based on KL divergence matrix traces. Comput. Eng. 12, 296–302 (2017)

    Google Scholar 

  18. K.-C. Toh, Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacif. J. Optim. 6(3), 615–640 (2010)

    MathSciNet  MATH  Google Scholar 

  19. S. Ji, J. Ye, An Accelerated Gradient Method for Trace Norm Minimization (International Conference on Machine Learning, ICML 2009, Montreal, 2009), pp. 457–464

    Google Scholar 

  20. S.J. Huang, Demilitarization of the Use of Marker Relationships in Multi-Marker Learning. Journal of Nanjing University (Natural Science Edition). 56(8), 882-890 (2015)

  21. Y.F. Guo, F.Y. Ning, H.H. Chao, A socialized matrix decomposition recommendation algorithm based on logistic function. J. Beijing Inst. Technol. 36(1), 70–74 (2016)

    Google Scholar 

  22. Y.M. Wang, J.P. Zhai, Y. Mo, 3D reconstruction of human body based on orthogonal matching tracking and accelerating proximal gradient. Chin. J. Biomed. Eng. 36(4), 385–393 (2017)

    Google Scholar 

  23. Q. Wang, Face Age Estimation Based on Adaptive Marker Distribution Learning. Journal of Southeast University (Natural Science Edition). 3, 475–479 (2017)

  24. L.F. Xu, J.Y. Wang, J.N. Cui, Dynamic expression recognition based on dynamic time warping and active appearance model. Chinese. J. Electron. Inf. (EIS) 40(2), 338–345 (2018)

    Google Scholar 

Download references


The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.


This work was supported in part by a grant from the Characteristics innovation project of colleges and universities of Guangdong Province (Natural Science, No. 2016KTSCX182, 2016) and a grant from the Youth Innovation Talent Project of colleges and universities of Guangdong Province (No. 2016KQNCX230, 2016).

Availability of data and materials

We can provide the data.

About the authors

Zijiang Zhu received a master’s degree in software engineering from Wuhan University in 2009 and the title of senior engineer in 2008. From 2012 to 2016, he was a school-level training candidate in the seventh batch of the “thousand, 100 and ten” projects in Guangdong higher education institutions. He is a senior member of the China Computer Federation. He is currently an associate professor, dean of the School of Information Science and Technology, and deputy director of the Institute for Intelligent Information Processing, South China Business College of Guangdong University of Foreign Studies, Guangzhou, China. His current research areas include image processing, machine learning, and big data technology.

Hang Chen received a master’s degree in software engineering from Guangdong University of Technology in 2012 and the title of senior engineer in 2007. From 2012 to 2016, he was a school-level training candidate in the seventh batch of the “thousand, 100 and ten” projects in Guangdong higher education institutions. He is currently an associate professor and associate dean of the School of Computer Science and Engineering, Tianhe College of Guangdong Polytechnic Normal University, Guangzhou, China. His current research areas include cloud computing technology, image recognition, data mining, and personalized recommendation technology.

Yi Hu received a master’s degree in software engineering from South China University of Technology in 2018 and a senior engineer title for Information System Project Management in 2015. He is currently a lecturer and dean assistant of the School of Information Science and Technology, South China Business College of Guangdong University of Foreign Studies, Guangzhou, China. His current research areas include image processing, machine learning, and big data technology.

Junshan Li was promoted to professor in 1999. He obtained a doctorate in computer system structure in 2001 and was elected as a provincial and ministerial expert in 2002. He is the head of the national boutique resource sharing course and the head of the national boutique course, the director of the China Computer Society, and the director of the Chinese Society of Image and Graphics. He is currently the director of the Institute of Intelligent Information Processing of South China Business School of Guangdong University of Foreign Studies. His current research interests include image processing and image understanding, intelligent computing, and intelligent systems.


All authors took part in the discussion of the work described in this paper. ZZ wrote the first version of the paper and carried out part of the experiments. HC, JL, and YH revised successive versions of the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zijiang Zhu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Zhu, Z., Chen, H., Hu, Y. et al. Age estimation algorithm of facial images based on multi-label sorting. J Image Video Proc. 2018, 114 (2018).


  • Multi-label sorting
  • Age estimation of facial images
  • Mean absolute error
  • Cumulative score