Age and Gender Prediction From Face Images Using Attentional Convolutional Network
I. INTRODUCTION

Age and gender information is very important for various real-world applications, such as social understanding, biometrics, identity verification, video surveillance, human-computer interaction, electronic customer service, crowd behavior analysis, online advertisement, item recommendation, and many more. Despite these numerous applications, automatically predicting age and gender from face images is a very hard problem, mainly due to the various sources of intra-class variation in the facial images of people, which limits the use of these models in real-world applications.

There are numerous works proposed for age and gender prediction in the past several years. The earlier works were mainly based on hand-crafted features extracted from facial images, followed by a classifier. But with the great success of deep learning models in various computer vision problems in the past decade [1]–[5], the more recent works on age and gender prediction have mostly shifted toward models based on deep neural networks.

In this work, we propose a deep learning framework to jointly predict the age and gender from face images. Given the intuition that some local regions of the face carry clearer signals about the age and gender of an individual (such as beard and mustache for male gender, and wrinkles around the eyes and mouth for age), we use an attentional convolutional network as one of our backbone models, to better attend to the salient and informative parts of the face. Figure 1 provides three sample images, and the corresponding attention map outputs of two different layers of our model for these images. As we can see, the model outputs are mostly sensitive to the edge patterns around facial parts, as well as wrinkles, which are important for age and gender prediction.

As predicting age and gender from faces are closely related tasks, we use a single model with a multi-task learning approach to jointly predict both gender and age bucket. Also, given that knowing the gender of someone lets us better estimate her/his age, we augment the features of the age-prediction branch with the predicted gender output. Through experimental results, we show that adding the predicted gender information to the age-prediction branch improves the model performance. To further improve the prediction accuracy of our model, we combine the prediction of the attentional network with that of a residual network, and use their ensemble as the final predictor.

Here are the contributions of this work:
• We propose a multi-task learning framework to jointly predict the age and gender of individuals from their face images.
• We develop an ensemble of attentional and residual networks, which outperforms both individual models. The attention layers of our model learn to focus on the most important and salient parts of the face.
• We further propose to feed the predicted gender label to the age prediction branch, and show that doing so improves the accuracy of the age prediction branch.
• With the help of the attention mechanism, we can explain the predictions of the classifiers after they are trained, by locating the salient facial regions they focus on in each image.

The structure of the remaining parts of this paper is as follows. Section II provides an overview of some of the previous works on age and gender prediction. Section III provides the details of our proposed framework, and the architecture of our multi-task learning model. Section IV provides a quick overview of the dataset used in our framework. Then, in Section V, we provide the experimental studies, the quantitative performance of our model, and also a visual evaluation of model outputs. Finally, the paper is concluded in Section VI.

classification in uncontrolled environments. Not surprisingly, they were able to obtain promising results on the features learned from these popular architectures.

In [18], Lapuschkin et al. compared four popular neural network architectures, studied the effect of pre-training, evaluated the robustness of the considered alignment preprocessings via cross-method test set swapping, and intuitively visualized the models' prediction strategies under the given pre-processing conditions using the Layer-wise Relevance Propagation (LRP) algorithm. They were able to obtain very interesting relevance maps for some of the popular model architectures, as shown in Figure 3.

Fig. 2. The architecture of the work by Levi and Hassner, courtesy of [15].
III. THE PROPOSED FRAMEWORK

In this section we provide the details of the proposed age and gender prediction framework. We formulate this as a multi-task learning problem, in which a single model is used to predict both gender and age bucket simultaneously. In other words, a single convolutional neural network with two heads (output branches) is used to jointly predict age and gender. Figure 6 shows the block diagram of a simple multi-task learning model for joint age and gender prediction.

Fig. 6. The block-diagram of a multi-task learning network for joint age and gender prediction.

Given the intuition that knowing one's gender can enable us to better predict her/his age, we augment the input features of the age prediction part of the model with the predicted gender from the other head. This gives the age model access to a rough estimate of the gender. Through experimental study, we show that doing so improves the performance of the age prediction. Figure 7 provides the block diagram of the proposed model architecture.

As mentioned earlier, our final framework uses an ensemble of two models; their average predictions (output probabilities) are used as the final prediction. We give more details on the architecture of each of these two models in the parts below.

A. Residual Attentional Network

As mentioned above, an important piece of our framework is the residual attentional network (RAN), a convolutional neural network with an attention mechanism that can be incorporated into state-of-the-art feed-forward network architectures in an end-to-end training fashion [23]. This network is built by stacking attention modules, which generate attention-aware features that adaptively change as the layers go deeper into the network.

Each attention module includes two branches: the trunk branch and the mask branch. The trunk branch performs feature processing with residual units. The mask branch uses a bottom-up top-down structure to softly weight the output features, with the goal of improving the trunk branch features. The bottom-up step collects global information of the whole image by downsampling it (i.e., max pooling); the top-down step combines this global information with the original feature maps by upsampling (i.e., interpolation), so that the output size stays the same as the input feature map. The full architecture of the residual attention network is shown in Figure 5.

B. ResNet Model

The other model used in our framework is based on the residual convolutional network (ResNet) [24]. ResNet is known to have better gradient flow, thanks to the skip connection in each residual block. Here we use a ResNet architecture to perform gender classification on the input image. Then, the predicted output of the gender branch is concatenated with the last hidden layer of the age branch. The overall block diagram of the ResNet18 model is illustrated in Figure 8.
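The mask/trunk composition of the attention module described in Section III-A can be sketched on a toy 1-D feature map. Following [23], the mask M re-weights the trunk features T in residual form, (1 + M(x)) · T(x), so trunk features still pass through where the mask is near zero. The pooling window, nearest-neighbor interpolation, and sigmoid gate below are illustrative assumptions, not the paper's exact configuration:

```python
# Toy 1-D sketch of an attention module: bottom-up (max pool), top-down
# (upsample), then residual re-weighting of the trunk features.
import math

def bottom_up(features, window=2):
    """Bottom-up step: max-pool to collect coarser, more global information."""
    return [max(features[i:i + window]) for i in range(0, len(features), window)]

def top_down(pooled, target_len):
    """Top-down step: nearest-neighbor upsample back to the input size."""
    scale = target_len / len(pooled)
    return [pooled[min(int(i / scale), len(pooled) - 1)] for i in range(target_len)]

def attention_module(trunk_features):
    """Mask branch softly re-weights the trunk: out = (1 + M) * T."""
    mask = top_down(bottom_up(trunk_features), len(trunk_features))
    mask = [1.0 / (1.0 + math.exp(-m)) for m in mask]  # squash mask to (0, 1)
    return [(1.0 + m) * t for m, t in zip(mask, trunk_features)]

out = attention_module([0.5, -1.0, 2.0, 0.1])
```

Because the mask lies in (0, 1), the module can only amplify each trunk feature by a factor between 1 and 2, which keeps the re-weighting "soft".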
Fig. 7. The block-diagram of the framework for joint age and gender prediction, in which the predicted gender is used along with the neural embedding as the input for the age prediction branch.

The ResNet50 architecture is quite similar to ResNet18, the main difference being that it has more layers.
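The two-head design described in this section can be sketched end to end: a shared embedding feeds a gender head, the age head receives the embedding concatenated with the predicted gender probabilities, and the final predictor averages the output probabilities of two models. This is a toy sketch, not the paper's actual layers; the embedding size, bucket count, and random weights are made-up illustrative values:

```python
# Toy multi-task sketch: shared embedding -> gender head; the age head sees
# the embedding plus the predicted gender probabilities.
import math
import random

random.seed(0)

def linear(x, weights, bias):
    """One dense layer: weights is a list of per-output rows."""
    return [sum(xi * wij for xi, wij in zip(x, row)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

EMBED, GENDERS, AGE_BUCKETS = 8, 2, 4  # illustrative sizes
w_gender = [[random.gauss(0, 0.1) for _ in range(EMBED)] for _ in range(GENDERS)]
w_age = [[random.gauss(0, 0.1) for _ in range(EMBED + GENDERS)]
         for _ in range(AGE_BUCKETS)]

def predict(embedding):
    gender_probs = softmax(linear(embedding, w_gender, [0.0] * GENDERS))
    age_input = embedding + gender_probs  # feed predicted gender to the age head
    age_probs = softmax(linear(age_input, w_age, [0.0] * AGE_BUCKETS))
    return gender_probs, age_probs

def ensemble(p1, p2):
    """Final predictor: average the output probabilities of the two models."""
    return [(a + b) / 2 for a, b in zip(p1, p2)]
```

In a real implementation the two heads would share convolutional layers and be trained jointly with a combined loss; only the concatenation and averaging steps are shown here.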
[Figure: architecture diagram showing a stack of convolutional layers (3x3 kernels with 64, 128, 256, and 512 channels), followed by a 512x1000 fully-connected layer and a softmax mapping the input image to output classes.]
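The 512x1000 fully-connected layer in the diagram matches the standard ImageNet ResNet-18 head. The arithmetic below walks the usual ResNet-18 stride pattern (an assumption; the paper's exact input resolution is not shown here) to see how a 224x224 input shrinks to the 512-d vector that the final layer consumes:

```python
# Feature-map size arithmetic for a standard ResNet-18-style stack.
def conv_out(size, kernel, stride, pad):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 224
size = conv_out(size, 7, 2, 3)   # 7x7/2 stem conv -> 112
size = conv_out(size, 3, 2, 1)   # 3x3/2 max pool  -> 56
for stride in (1, 2, 2, 2):      # four residual stages: 64/128/256/512 channels
    size = conv_out(size, 3, stride, 1)
# size is now 7: global average pooling turns the 7x7x512 map into a
# 512-d vector, which the 512x1000 fully-connected layer maps to scores.
```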
[Figure: bar chart of predicted probabilities for the male and female classes, bucketed into 0.1-wide intervals from [0, 0.1] to [0.9, 1], with frequencies between 0 and 0.5.]
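The chart above bins predicted probabilities into 0.1-wide buckets. A minimal re-creation of that bucketing (the probability values below are made up for illustration):

```python
# Bucket a list of probabilities in [0, 1] into fixed-width histogram bins.
def bucketize(probs, width=0.1):
    n = round(1 / width)
    counts = [0] * n
    for p in probs:
        counts[min(int(p / width), n - 1)] += 1  # p == 1.0 goes in the last bin
    return counts

counts = bucketize([0.05, 0.55, 0.62, 0.97, 1.0])
```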
VI. CONCLUSION
REFERENCES

[3] S. Minaee, A. Abdolrashidi, H. Su, M. Bennamoun, and D. Zhang, "Biometric recognition using deep learning: A survey," arXiv preprint arXiv:1912.00271, 2019.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[5] S. Minaee, Y. Wang, A. Aygar, S. Chung, X. Wang, Y. W. Lui, E. Fieremans, S. Flanagan, and J. Rath, "MTBI identification from diffusion MR images using bag of adversarial visual features," IEEE Transactions on Medical Imaging, vol. 38, no. 11, pp. 2545–2555, 2019.
[6] Q. Zhao, L. Zhang, D. Zhang, and N. Luo, "Direct pore matching for fingerprint recognition," in International Conference on Biometrics. Springer, 2009, pp. 597–606.
[7] S. Minaee and Y. Wang, "Fingerprint recognition using translation invariant scattering network," in 2015 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE, 2015, pp. 1–6.
[8] M. De Marsico, A. Petrosino, and S. Ricciardi, "Iris recognition through machine learning techniques: A survey," Pattern Recognition Letters, vol. 82, pp. 106–115, 2016.
[9] S. Minaee, A. Abdolrashidiy, and Y. Wang, "An experimental study of deep convolutional features for iris recognition," in 2016 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE, 2016, pp. 1–6.
[10] S. Minaee and A. Abdolrashidi, "Highly accurate palmprint recognition using statistical and wavelet features," in 2015 IEEE Signal Processing and Signal Processing Education Workshop (SP/SPE). IEEE, 2015, pp. 31–36.
[11] S. A. Mistani, S. Minaee, and E. Fatemizadeh, "Multispectral palmprint recognition using a hybrid feature," arXiv preprint arXiv:1112.5997, 2011.
[12] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2008.
[13] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," 2015.
[14] S. Minaee, A. Abdolrashidi, and Y. Wang, "Face recognition using scattering convolutional network," in 2017 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE, 2017, pp. 1–6.
[15] G. Levi and T. Hassner, "Age and gender classification using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 34–42.
[16] M. Duan, K. Li, C. Yang, and K. Li, "A hybrid deep learning CNN–ELM for age and gender classification," Neurocomputing, vol. 275, pp. 448–461, 2018.
[17] G. Ozbulak, Y. Aytar, and H. K. Ekenel, "How transferable are CNN-based features for age and gender classification?" in 2016 International Conference of the Biometrics Special Interest Group (BIOSIG). IEEE, 2016, pp. 1–6.
[18] S. Lapuschkin, A. Binder, K.-R. Müller, and W. Samek, "Understanding and comparing deep neural networks for age and gender classification," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 1629–1638.
[19] P. Rodríguez, G. Cucurull, J. M. Gonfaus, F. X. Roca, and J. Gonzàlez, "Age and gender recognition in the wild with deep attention," Pattern Recognition, vol. 72, pp. 563–571, 2017.
[20] S. S. Lee, H. G. Kim, K. Kim, and Y. M. Ro, "Adversarial spatial frequency domain critic learning for age and gender classification," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2032–2036.
[21] H. Kim, S.-H. Lee, M.-K. Sohn, and B. Hwang, "Age and gender estimation using region-SIFT and multi-layered SVM," in Tenth International Conference on Machine Vision (ICMV 2017), vol. 10696. International Society for Optics and Photonics, 2018, p. 106962J.
[22] Y. Zhang and T. Xu, "Landmark-guided local deep neural networks for age and gender classification," Journal of Sensors, vol. 2018, 2018.
[23] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] Z. Zhang, Y. Song, and H. Qi, "Age progression/regression by conditional adversarial autoencoder," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5810–5818.
[26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[27] R. R. Nair, R. Madhavankutty, and S. Nema, "Automated detection of gender from face images," 2019.