Face Recognition with Learning-based Descriptor
Zhimin Cao (1)    Qi Yin (2)    Xiaoou Tang (1,3)    Jian Sun (4)

(1) The Chinese University of Hong Kong
(2) ITCS, Tsinghua University
(3) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
(4) Microsoft Research Asia
Abstract
We present a novel approach to address the representation issue and the matching issue in face recognition (verification). Firstly, our approach encodes the micro-structures of the face by a new learning-based encoding method. Unlike many previous manually designed encoding methods (e.g., LBP or SIFT), we use unsupervised learning techniques to learn an encoder from the training examples, which can automatically achieve a very good tradeoff between discriminative power and invariance. Then we apply PCA to get a compact face descriptor. We find that a simple normalization mechanism after PCA can further improve the discriminative ability of the descriptor. The resulting face representation, the learning-based (LE) descriptor, is compact, highly discriminative, and easy to extract.

To handle the large pose variation in real-life scenarios, we propose a pose-adaptive matching method that uses pose-specific classifiers to deal with different pose combinations (e.g., frontal vs. frontal, frontal vs. left) of the matching face pair. Our approach is comparable with the state-of-the-art methods on the Labeled Faces in the Wild (LFW) benchmark (we achieved an 84.45% recognition rate), while maintaining excellent compactness, simplicity, and generalization ability across different datasets.
1. Introduction
Recently, face recognition has attracted much research effort [9, 25, 26, 10, 11, 13, 19, 20, 30, 32] due to the progress of local descriptors [6, 16, 17, 22, 27, 28, 29, 30] and the increasing demands of real-world applications, such as face tagging on the desktop [5] or on the Internet (e.g., Picasa Web Albums, http://picasaweb.google.com/). There are two main kinds of face recognition tasks: face identification (who is who in a probe face set, given a gallery face set) and face verification (same or not, given two faces). In this paper, we focus on the verification task, which is more widely applicable and is also the foundation of the identification task.

Figure 1. Images from the same person may look quite different due to pose (upper left), expression (upper right), illumination (lower left), and occlusion (lower right).
Since face verification is a binary classification problem on an input face pair, there are two major components of a verification approach: face representation and face matching. The extracted feature (descriptor) is required to be not only discriminative but also invariant to apparent changes and noise. The matching should be robust to variations from pose, expression, and occlusion, as shown in Figure 1. These requirements render face verification a challenging problem.
Currently, descriptor-based approaches [10, 20, 31] have been proven to be effective face representations producing the best performance [8, 18, 12]. Ahonen et al. [1] proposed to use the histogram of Local Binary Patterns (LBP) [17] to describe the micro-structures of the face. LBP encodes the relative intensity magnitude between each pixel and its neighboring pixels. It is invariant to monotonic photometric change and can be efficiently extracted. Since LBP is encoded by a handcrafted design, many LBP varieties [21, 30, 33] have been proposed to improve the original LBP. SIFT [16] and Histograms of Oriented Gradients (HOG) [6] are other kinds of effective descriptors using handcrafted encoding. The atomic element in these descriptors can be viewed as the quantized code of the image gradients. Essentially, different encoding methods and descriptors have to balance between discriminative power and robustness against data variance.
However, existing handcrafted encoding methods suffer from two drawbacks. On one hand, manually arriving at an optimal encoding method is difficult. Usually, using more contextual pixels (a higher-dimensional vector) can generate a more discriminative code. But it is non-trivial to manually design an encoding method and determine a codebook size that achieve a reasonable tradeoff between discrimination and robustness in a high-dimensional space. On the other hand, handcrafted codes are usually unevenly distributed, as shown in Figure 2. Some codes may rarely appear in real-life face images. This means that the resulting code histogram will be less informative and less compact, degrading the discriminative ability of the descriptor.
In this paper, to tackle the aforementioned difficulties, we present a learning-based encoding method, which uses unsupervised learning methods to encode the local micro-structures of the face into a set of discrete codes. The learned codes are more uniformly distributed (as shown in Figure 2), and the resulting code histogram can achieve a much better tradeoff between discriminative power and robustness than existing handcrafted encoding methods. Furthermore, to pursue compactness, we apply a dimension reduction technique, PCA, to the code histogram. And we find that a proper normalization mechanism after PCA can improve the discriminative ability of the code histogram. Using two simple unsupervised learning methods, we obtain a highly discriminative and compact face representation, the learning-based (LE) descriptor.

Many recent works also apply learning approaches in face recognition, such as subspace learning [25, 26], metric learning [9], high-level trait learning [13], and discriminant model learning [20, 30, 31], but few of these works focus on the issue of local feature encoding [14, 24] and the study of descriptor compactness. Though Ahonen et al. [2] tried K-means clustering to build a local filter-response codebook, they argued that manual thresholding is faster and more robust.
Besides the representation, the matching also plays an important role. In most practices, the face is aligned by a similarity or affine transformation using detected face landmarks. Such 2D holistic alignment is not sufficient to handle large pose deviations from the frontal pose. Further, a large localization error of any landmark will result in misalignment of the whole face. 3D alignment [3] is more principled but error-prone and computationally intensive. Wright et al. [32] recently encoded the geometric information into descriptors and used an implicit matching algorithm to deal with the misalignment and pose problem. Hua and Akbarzadeh [10] demonstrated that a simple elastic and partial matching metric can also handle pose change and cluttered backgrounds.
To explicitly handle large pose variance, we propose a pose-adaptive matching method. We found that a specific face component contributes differently when the pose combinations of input face pairs are different. Based on this observation, we train a set of pose-specific classifiers, each for one specific pose combination, to make the final decision.

Figure 2. The code uniformity comparison of LBP, HOG, and the proposed LE code. We computed the distribution of code emergence frequency for LBP (59 uniform codes), HOG (32 orientation bins), and LE (64 codes) in 1,000 face images. Clearly, the histogram distribution is uneven for LBP and HOG, while our LE code is close to uniform.
Combining a powerful learning-based descriptor and a pose-adaptive matching scheme, our system achieves leading performance on both the LFW [12] and Multi-PIE [8] benchmarks.
2. Overview of framework
Pipeline overview. Our system is a two-level pipeline: the upper level is the learning-based descriptor pipeline, while the bottom level is the pose-adaptive face matching pipeline.
As shown in Figure 3, we first use a standard fiducial point detector [15] to extract face landmarks. Nine different components (e.g., nose, mouth) are aligned separately based on the detected landmarks. The resulting component images are fed into a DoG filter (with σ1 = 2.0 and σ2 = 4.0) [10] to remove both low-frequency and high-frequency illumination variations. In each component image, a low-level feature vector is obtained at each pixel and encoded by our learning-based encoder. The final component representation is a compact descriptor (LE descriptor) generated by the concatenated patch histogram of the encoded features after PCA reduction and normalization. The component similarity is measured by the L2 distance between corresponding LE descriptors of the face pair. The resulting 9 component similarity scores are fed into a pose-adaptive classifier, consisting of a set of pose-specific classifiers. The pose-specific classifier optimized for the pose combination of the matching pair gives the final decision.
Figure 3. The proposed LE descriptor pipeline and the pose-adaptive face matching framework.
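To make the preprocessing concrete, here is a minimal sketch of the DoG filtering step with σ1 = 2.0 and σ2 = 4.0, assuming SciPy's Gaussian filter; kernel truncation and border handling are implementation details not specified in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_preprocess(image, sigma1=2.0, sigma2=4.0):
    """Difference-of-Gaussians band-pass filter: subtracting a wide
    Gaussian blur removes low-frequency illumination, while the narrow
    blur suppresses high-frequency variation."""
    image = image.astype(np.float64)
    return gaussian_filter(image, sigma1) - gaussian_filter(image, sigma2)
```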
Experiment overview. We mainly use the LFW benchmark [12] in our experiments and follow its protocol. The LFW standard test set consists of ten subsets, and each subset contains 300 intra-personal/extra-personal pairs. The recognition algorithm needs to run ten times for formal evaluation purposes. Each time, one subset is chosen for testing and the other nine are used for training. The final average recognition performance serves as the evaluation criterion.
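Schematically, this evaluation loop looks like the following sketch; train_fn and test_fn are hypothetical callables standing in for classifier training and pair verification.

```python
import numpy as np

def lfw_ten_fold(subsets, train_fn, test_fn):
    """LFW-style evaluation: hold one of the ten subsets out for testing,
    train on the other nine, and average the ten accuracies."""
    accuracies = []
    for i in range(len(subsets)):
        test_pairs = subsets[i]
        train_pairs = [p for j, s in enumerate(subsets) if j != i for p in s]
        model = train_fn(train_pairs)
        accuracies.append(test_fn(model, test_pairs))
    return np.mean(accuracies), np.std(accuracies)
```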
3. Learning-based descriptor extraction
In this section, we describe the critical steps in the learning-based (LE) descriptor extraction. In order to study the LE descriptor's power precisely, all the experiments in this section are conducted at the holistic face level, without using component-level pose-adaptive matching.
3.1. Sampling and normalization
At each pixel, we sample its neighboring pixels in a ring-based pattern to form a low-level feature vector. We sample r × 8 pixels at even intervals on the ring of radius r. Figure 4 shows four effective sampling patterns we found in an empirical manner. We extensively varied the parameters (e.g., ring number, ring radius, sampling number of each ring) but found the differences among good patterns are not significant: no more than 1% on the LFW benchmark. The 2nd pattern in Figure 4 is our best single pattern, and we use it as our default sampling method.

Although the performances of single patterns are similar, combining them together may give us a chance to exploit the complementary information captured by different sampling methods. We will discuss the use of multiple patterns later in this section.

After the sampling, we normalize the sampled feature vector to unit length. Such normalization, combined with the DoG preprocessing, makes the feature vector invariant to local photometric affine change.
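For illustration, a minimal sketch of the default sampling pattern (pattern 2 in Figure 4: R1 = 1, R2 = 2, with center) and the unit-length normalization; the nearest-pixel rounding used here is an assumption, as the paper does not specify the interpolation.

```python
import numpy as np

def ring_sample(image, x, y, radii=(1, 2), with_center=True):
    """Sample r*8 neighbors at even intervals on each ring of radius r
    around (x, y), then normalize the feature vector to unit length.
    Nearest-pixel rounding is used for simplicity; (x, y) must lie at
    least max(radii) pixels away from the image border."""
    values = [float(image[y, x])] if with_center else []
    for r in radii:
        for k in range(8 * r):
            theta = 2.0 * np.pi * k / (8 * r)
            sx = int(round(x + r * np.cos(theta)))
            sy = int(round(y + r * np.sin(theta)))
            values.append(float(image[sy, sx]))
    v = np.asarray(values)          # 1 + 8 + 16 = 25 dims for this pattern
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```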
Figure 4. Four typical sampling methods used in our experiments: (1) R1 = 1, with center; (2) R1 = 1, R2 = 2, with center; (3) R1 = 3, no center; (4) R1 = 4, R2 = 7, no center. (The sampling dots on the green-square-labeled arcs are omitted for better visibility.)
3.2. Learning-based encoding and histogram representation
Next, an encoding method is applied to encode the normalized feature vector into discrete codes. Unlike many handcrafted encoders, in our approach the encoder is specifically trained for the face in an unsupervised manner from a set of training face images. We have tried three unsupervised learning methods: K-means, PCA tree [7], and random-projection tree [7]. While K-means is commonly used to discover data clusters, random-projection trees and PCA trees have recently been proven effective for vector quantization. In our implementation, the random-projection tree and the PCA tree recursively split the data based on a uniformity criterion, which means each leaf of the tree is hit by the same number of vectors. In other words, all the quantized codes have a similar emergence frequency in the vector space (as shown in Figure 2).
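To illustrate the uniform-splitting idea, below is a minimal sketch of a random-projection-tree quantizer in the spirit of [7], assuming median splits so that each leaf (code) receives the same number of training vectors; the projection distribution and split rule are our assumptions, not the paper's exact implementation.

```python
import numpy as np

class RPTreeQuantizer:
    """Random-projection tree: recursively split the training vectors at
    the median of a random 1-D projection, so every leaf (= discrete
    code) is hit by the same number of vectors. Assumes enough distinct
    training vectors to split at every level."""

    def __init__(self, depth):
        self.depth = depth      # 2**depth leaves = code number
        self.splits = {}        # internal node id -> (direction, threshold)

    def fit(self, X, node=0, level=0):
        if level == self.depth:
            return self
        d = np.random.randn(X.shape[1])
        d /= np.linalg.norm(d)
        proj = X @ d
        t = np.median(proj)     # median split keeps the leaf counts even
        self.splits[node] = (d, t)
        self.fit(X[proj <= t], 2 * node + 1, level + 1)
        self.fit(X[proj > t], 2 * node + 2, level + 1)
        return self

    def encode(self, v):
        node = 0
        for _ in range(self.depth):
            d, t = self.splits[node]
            node = 2 * node + 1 if v @ d <= t else 2 * node + 2
        return node - (2 ** self.depth - 1)   # code in [0, 2**depth)

# e.g., 256 codes from 25-dimensional sampled feature vectors:
# encoder = RPTreeQuantizer(depth=8).fit(X_train)
```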
After the encoding, the input image is turned into a "code" image (Figure 3). Following the method described in Ahonen et al.'s work [1], the encoded image is divided into a grid of patches (5×7 patches for the 84×96 holistic face used in this section). A histogram of the LE codes is computed in each patch, and the patch histograms are concatenated to form the descriptor of the whole face image.
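A minimal sketch of this histogram step under the settings above (a 5×7 patch grid and, say, 256 codes):

```python
import numpy as np

def patch_histogram_descriptor(code_image, grid=(7, 5), n_codes=256):
    """Divide the code image into a grid of patches, histogram the LE
    codes in each patch, and concatenate the patch histograms.
    Codes are assumed to lie in [0, n_codes)."""
    rows, cols = grid                       # 7 rows x 5 columns here
    h, w = code_image.shape
    hists = []
    for i in range(rows):
        for j in range(cols):
            patch = code_image[i * h // rows:(i + 1) * h // rows,
                               j * w // cols:(j + 1) * w // cols]
            hists.append(np.bincount(patch.ravel(), minlength=n_codes))
    return np.concatenate(hists).astype(np.float64)

# e.g., an 84x96 holistic face: 35 patches x 256 codes = 8,960 dims
```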
The choices of the learning method and the code number are important for our learning-based encoding. Figure 5 shows the performance comparison of the three learning methods under different code number settings. We select 1,000 images from the LFW training set to train our learning-based encoders. On each image, 8,064 (= 84 × 96) feature vectors are sampled as training examples. We varied the code number from 4 to 131,072 (= 2^17) and plotted the recognition rate (we stopped testing K-means after reaching 2^9 codes since the computation becomes intractable). Notice that the random-projection tree slightly outperforms the other two and is thus adopted as the default in the following. We compare our LE descriptor with LBP (59-bin), HOG (8-bin), and Gabor [29] on the LFW. Our LE descriptors start to beat the existing descriptors (LBP 72.35%, HOG 71.25%, and Gabor 68.53%) when the code number reaches 32. And our LE descriptor achieves a 77.78% rate when the code number reaches 2^15.
3.3. PCA dimension reduction
If we use the concatenated histogram directly as the final descriptor, the resulting face feature may be too large (e.g., 256 codes × 35 patches = 8,960 dimensions). A large feature not only limits the number of faces which can be loaded into memory, but also slows down the recognition speed. This is very important for applications that need to handle a large number of faces, for example, recognizing all face photos on a desktop. To reduce the feature size, we apply Principal Component Analysis (PCA) [23] to compress the concatenated histogram, and call the compressed descriptor our final learning-based (LE) descriptor.
Surprisingly, we found that PCA compression substantially improves the performance if a simple normalization is applied after the compression. Figure 6 shows the recognition rates of LE descriptors with different normalization methods. Without the normalization, the compressed feature is inferior to the uncompressed one by about 6 percentage points. But with L1 or L2 normalization, the PCA version can be 5 points higher. This result reveals that the angle difference between features is most essential for the recognition in the compressed space. To confirm this key observation, we also tried applying PCA compression to LBP. We repeated the same compression and normalization operations and also found that simple normalization can boost the uncompressed LBP's performance by 3 percentage points, while skipping this step degrades it by 5 points.
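As an illustration of this compression-plus-normalization step, a minimal sketch assuming scikit-learn's PCA; the training histograms below are random stand-ins, not real LE data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_hists = rng.random((1000, 8960))    # stand-in for real LE histograms

pca = PCA(n_components=400).fit(train_hists)

def compress(hist):
    """Project the 8,960-D histogram to 400 dimensions, then L2-normalize:
    after normalization, the L2 distance between two compressed
    descriptors depends only on the angle between them."""
    z = pca.transform(hist.reshape(1, -1)).ravel()
    return z / np.linalg.norm(z)

# similarity of a face pair = -||compress(h_a) - compress(h_b)||
```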
To obtain the optimal setting for the LE descriptor, we extensively studied the parameter combinations of code number and PCA dimension. Since a large code number shows little performance advantage after PCA compression, we choose 256 codes and 400 PCA dimensions as our default setting in the following experiments.

Our default LE descriptor achieves a recognition rate as high as 81.22%, which significantly outperforms previous descriptors, using only a 400-dimensional feature vector for the holistic face, about 20% of the size of the 59-code LBP descriptor. This demonstrates that our descriptor extraction pipeline (preprocessing, sampling and normalization, learning-based encoding, and dimension reduction) is very effective for producing a compact and highly discriminative descriptor.
Figure 5. Performance comparison vs. learning method. We studied the recognition performance of the LE descriptors using three learning methods (random-projection tree, PCA tree, and K-means) under different code number settings. We also give several existing descriptors' results (8-bin HOG, Gabor, 59-code LBP) for comparison.

Figure 6. Effects of the PCA dimension under different normalization methods (no PCA, PCA only, PCA + L1, PCA + L2). After applying PCA compression to the concatenated patch histogram vector, we normalize the resulting vector with different normalization methods and then compute the similarity score with the L2 distance.

3.4. Multiple LE descriptors

As discussed in Section 3.1, our flexible sampling method enables us to generate a class of complementary LE descriptors, and the combination of multiple LE descriptors may achieve better performance. In this paper, we take a simple approach by training a linear SVM [4] to combine the similarity scores generated by different LE descriptors (a minimal sketch follows below). Generally, the combination always achieves a better result. In our experiments, the combination of the four LE descriptors shown in Figure 4 obtained the best performance on the LFW. Figure 7 gives the comparison curves of different descriptors.
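For concreteness, a minimal sketch of this score-level fusion, assuming scikit-learn's LinearSVC in place of the LIBSVM package [4] used in the paper, and random stand-in training data:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Hypothetical training data: one row of 4 similarity scores per face
# pair (one score per LE descriptor); label 1 = same person, 0 = not.
sim_scores = rng.random((2700, 4))
labels = rng.integers(0, 2, size=2700)

fusion = LinearSVC(C=1.0).fit(sim_scores, labels)

def verify(pair_scores):
    """Fuse the four per-descriptor similarity scores into one decision."""
    return fusion.predict(np.asarray(pair_scores).reshape(1, -1))[0]
```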
Figure 7. ROC curve comparison between our LE descriptors (single LE with and without PCA, multiple LE) and the existing descriptors (8-bin HOG, Gabor, 59-code LBP).
4. Pose-adaptive matching
In the previous section, we used 2D holistic alignment and matching for comparison purposes. In this section, we will show that pose-adaptive matching at the component level can effectively handle large pose variation and further boost the recognition accuracy.
4.1. Component-level face alignment
Instead of using a 2D holistic (similarity) alignment on the whole face, we align 9 face components (shown in Figure 8) separately using similarity transforms. For each component, two landmarks are selected from the five detected fiducial landmarks (eyes, nose, and mouth corners) to determine the similarity transformation (details in Table 1, and see the sketch below for the two-point transform). Compared with the 2D holistic alignment, the component-level alignment presents advantages in the large pose-variant case. The component-level approach can more accurately align each component without balancing across the whole face. And the negative effect of landmark error will also be reduced. Figure 8 shows aligned components and Table 2 compares the performance of different alignment methods.
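A two-point similarity transform is fully determined by mapping the two selected landmarks to their canonical positions and can be solved in closed form. A minimal sketch, assuming OpenCV-style 2×3 affine matrices; the canonical landmark coordinates below are hypothetical.

```python
import numpy as np

def similarity_from_two_points(src, dst):
    """Solve the 4-DOF similarity transform (scale, rotation, translation)
    mapping the two source landmarks onto the two target landmarks.
    Parameterized as [[a, -b, tx], [b, a, ty]]; each point pair gives
    two linear equations in (a, b, tx, ty)."""
    (x1, y1), (x2, y2) = src
    A = np.array([[x1, -y1, 1, 0],
                  [y1,  x1, 0, 1],
                  [x2, -y2, 1, 0],
                  [y2,  x2, 0, 1]], dtype=np.float64)
    rhs = np.array([dst[0][0], dst[0][1], dst[1][0], dst[1][1]])
    a, b, tx, ty = np.linalg.solve(A, rhs)
    return np.array([[a, -b, tx], [b, a, ty]])  # use with cv2.warpAffine

# e.g., map detected eye centers to canonical positions in an eye crop
M = similarity_from_two_points([(120.0, 95.0), (150.0, 97.0)],
                               [(8.0, 12.0), (28.0, 12.0)])
```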
4.2. Pose-adaptive matching
Using the component-level alignment, the face similarity score is the sum of the similarities between corresponding components. We found that each component contributes differently to the recognition when the pose combination of the matching pair is different. For example, the left eye is less effective when we match a frontal face against a left-turned face. Based on this observation, we take a simple pose-adaptive matching method.
Figure 8. Fiducial points and component alignment (forehead, brows, eyes, cheeks, nose, and mouth).

Component        Selected landmarks
Forehead         left eye + right eye
Left eyebrow     left eye + right eye
Right eyebrow    left eye + right eye
Left eye         left eye + right eye
Right eye        left eye + right eye
Nose             nose tip + nose pedal*
Left cheek       left eye + nose tip
Right cheek      right eye + nose tip
Mouth            two mouth corners
Table 1. Landmark selection for component alignment. (* denotes the foot of the perpendicular from the nose tip to the eye line.)

Alignment mode     Recog. rate
2-Point holistic   79.85% ± 0.42%
5-Point holistic   81.22% ± 0.53%
Component          82.73% ± 0.43%
Table 2. Recognition rate vs. alignment mode.

Firstly, we categorize the pose of the input face into one of three poses: frontal (F), left (L), and right (R). To estimate the pose category, three images are selected from the Multi-PIE dataset, one image for each pose, while the other factors in these three images, such as person identity, illumination, and expression, remain the same. After measuring the similarity between these three gallery images and the probe face, the pose label of the most similar gallery image is assigned to the probe face.
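A minimal sketch of this nearest-gallery pose labeling, assuming the L2 distance on descriptors (the paper's similarity measure for components):

```python
import numpy as np

def estimate_pose(probe_desc, gallery):
    """Assign the probe the pose label of the most similar gallery face.
    `gallery` maps 'F', 'L', 'R' to the descriptor of one Multi-PIE image
    per pose (same identity, illumination, and expression)."""
    return min(gallery, key=lambda p: np.linalg.norm(probe_desc - gallery[p]))
```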
Given the estimated pose of each face, the pose combination of a face pair can be one of {FF, LL, RR, LR (RL), LF (FL), RF (FR)}. Our final pose-adaptive classifier consists of a set of linear SVM classifiers, each trained on the subset of training pairs with a specific pose combination. The best-fit classifier, the one trained for the same pose combination as the input matching pair, makes the final decision (a minimal sketch follows below). Through pose-adaptive matching, we explicitly handle the large pose variation with this divide-and-conquer method.
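A minimal sketch of this dispatch, assuming scikit-learn's LinearSVC and random stand-in training data; the input is the 9-dimensional component similarity vector of a face pair.

```python
import numpy as np
from sklearn.svm import LinearSVC

COMBOS = ["FF", "LL", "RR", "LR", "LF", "RF"]

def canonical(pose_a, pose_b):
    """Map a pose pair to its canonical combination, e.g. RL -> LR."""
    combo = pose_a + pose_b
    return combo if combo in COMBOS else pose_b + pose_a

# Hypothetical training data: per combination, 9-D component similarity
# vectors with same/not-same labels.
rng = np.random.default_rng(0)
classifiers = {}
for combo in COMBOS:
    X = rng.random((1500, 9))
    y = rng.integers(0, 2, size=1500)
    classifiers[combo] = LinearSVC(C=1.0).fit(X, y)

def pose_adaptive_verify(sim_vector, pose_a, pose_b):
    """Dispatch to the classifier trained for this pose combination."""
    clf = classifiers[canonical(pose_a, pose_b)]
    return clf.predict(np.asarray(sim_vector).reshape(1, -1))[0]
```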
Component        Image size   Patch division
Forehead         76 × 24      7 × 2
Left eyebrow     46 × 34      4 × 3
Right eyebrow    46 × 34      4 × 3
Left eye         36 × 24      3 × 2
Right eye        36 × 24      3 × 2
Nose             24 × 76      2 × 7
Left cheek       34 × 46      3 × 4
Right cheek      34 × 46      3 × 4
Mouth            76 × 24      7 × 2
Table 3. Patch division for face components.
4.3. Evaluations of pose-adaptive matching
To best evaluate the ability to handle pose changes, we constructed a new test set from the LFW dataset by randomly sampling 3,000 intra-personal/extra-personal pairs for each pose combination. The total pair number in our new test set is 3,000 × 6 = 18,000. Note that this new test set is more challenging than the standard test data in the LFW due to the larger pose difference between the matching pairs. We use half of them as the training set and the rest as the test set. Subjects are mutually exclusive in these two sets. The patch division in the component-level setting is shown in Table 3. Recognition performance was compared before (76.20% ± 0.41%) and after (78.30% ± 0.42%) pose-adaptive matching was adopted, and the results showed that the proposed technique is useful in such a large pose-variant case.
5. Experimental results
In this section, we systematically report our final face recognition performance on the LFW benchmark and then validate the excellent generalization ability of our system across different datasets.
5.1. Results on the LFW benchmark
We present our recognition results on the LFW benchmark in the form of ROC curves. Figure 9 shows comparison results for the validation of our proposed individual techniques. In Figure 9, "single LE + holistic" means that we only use the single best LE descriptor to represent the holistic face, and it is the baseline showing the power of LE without other techniques. "Single LE + comp" indicates the application of component-level, pose-adaptive matching to the baseline single LE. Multiple LE descriptors are combined to form "multiple LE + holistic". And "multiple LE + comp" is our best performer. The accuracies of these four methods are 81.22% ± 0.53%, 82.72% ± 0.43%, 83.43% ± 0.55%, and 84.45% ± 0.46%, respectively. Beyond the strong discriminative ability of the LE descriptor itself, the pose-adaptive matching and
Figure 9. Effects of our proposed techniques on the LFW benchmark (single LE + holistic, single LE + comp, multiple LE + holistic, multiple LE + comp). Here, "holistic" means using the holistic face representation, while "comp" means component-level, pose-adaptive matching.
Figure 10. Face recognition comparison on the LFW benchmark under the restricted protocol: LDML, funneled [9]; hybrid, aligned [20]; V1-like, funneled [19]; LBP baseline; attribute [13]; simile [13]; attribute + simile [13]; background samples [31]; and multiple LE + comp (our best).
multiple-descriptor combination further enhance the recognition performance of our system.

Our best ROC curve is comparable with previous state-of-the-art methods, as shown in Figure 10. On the LFW benchmark, two new algorithms show leading performance. Wolf et al.'s work [31] adopts background learning by using the identity information within the training set. Kumar et al. [13] used supervised learning to train high-level classifiers on a huge volume of training images outside of the LFW dataset. These two methods [13, 31] both use additional information outside the LFW test protocol, so the comparison with other methods (including ours) in Figure 10 is not entirely fair. Additional training data or information may also improve other approaches.

Our system achieves the best performance when the standard test protocol is strictly respected [12]. More importantly, our work focuses on low-level face representation, which can be easily combined with previous algorithms to produce better performance.

Descriptor               Recog. rate on Multi-PIE
59-code LBP              84.30% ± 0.89%
8-bin HOG                84.02% ± 0.66%
Gabor                    86.42% ± 0.85%
single LE + holistic     91.58% ± 0.50%
single LE + comp         92.12% ± 0.52%
multiple LE + holistic   92.20% ± 0.49%
multiple LE + comp       95.19% ± 0.46%
Table 4. Recognition performance on the Multi-PIE dataset.
5.2. Results on Multi-PIE
We also perform extensive experiments on the Multi-PIE dataset to verify the generalization ability of our approach. The Multi-PIE dataset contains face images from 337 subjects, imaged under 15 viewpoints and 19 illumination conditions in 4 recording sessions. Large differences exist between LFW and Multi-PIE, considering the pose compositions, illumination variance, and resolution. Moreover, Multi-PIE is collected under a controlled setting systematically simulating the effects of pose, illumination, and expression. On the other hand, the LFW is closer to the real-life setting since its faces are selected from news images. For these reasons, training on one dataset and testing on the other can better demonstrate the generalization ability of a recognition system.

Similar to the LFW benchmark, we randomly generate 10 subsets of face images from Multi-PIE, each having 300 intra-personal and 300 extra-personal image pairs. The identities of subjects are mutually exclusive among these 10 subsets, and a cross-validation mode similar to LFW's is applied. The default single LE descriptor and the multiple LE descriptors trained on the LFW benchmark are adopted in the experiments.

As shown in Table 4, the single LE with holistic face representation outperforms the commonly used descriptors by more than 5 points, and the pose-specific classifiers trained on the LFW dataset also perform well on the Multi-PIE dataset. All these results demonstrate the excellent generalization ability of our system.
6. Conclusion and discussion
We have introduced a new approach for face recognition using a learning-based (LE) descriptor and pose-adaptive matching. We validated our recognition system on the LFW benchmark and demonstrated its excellent generalization ability on Multi-PIE.

In this work, the face micro-pattern encoding is learned, but the pattern sampling is still manually designed. Automating this step with learning techniques [27] may produce a more powerful descriptor for face recognition.

Acknowledgments: This work was performed while Cao and Yin visited Microsoft Research Asia. It was supported in part by the National Natural Science Foundation of China Grant No. 60553001, and the National Basic Research Program of China Grant Nos. 2007CB807900 and 2007CB807901.
References
[1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. Lecture Notes in Computer Science, pages 469-481, 2004.
[2] T. Ahonen and M. Pietikäinen. Image description using joint distribution of filter bank responses. Pattern Recognition Letters, 30(4):368-376, 2009.
[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063-1074, 2003.
[4] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] J. Cui, F. Wen, R. Xiao, Y. Tian, and X. Tang. EasyAlbum: an interactive photo annotation system based on face clustering and re-ranking. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, 2007.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.
[7] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the structure of manifolds using random projections. In NIPS, 2007.
[8] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. In International Conference on Automatic Face and Gesture Recognition, 2008.
[9] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Proc. ICCV, 2009.
[10] G. Hua and A. Akbarzadeh. A robust elastic and partial matching metric for face recognition. In Proc. ICCV, 2009.
[11] G. Hua, P. Viola, and S. Drucker. Face recognition using discriminatively trained orthogonal rank one tensor projections. In Proc. CVPR, 2007.
[12] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. University of Massachusetts, Amherst, Technical Report 07-49, 2007.
[13] N. Kumar, A. Berg, P. Belhumeur, and S. Nayar. Attribute and simile classifiers for face verification. In Proc. ICCV, 2009.
[14] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1):29-44, 2001.
[15] L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Proc. ECCV, 2008.
[16] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[17] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971-987, 2002.
[18] P. Phillips, H. Moon, S. Rizvi, and P. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090-1104, 2000.
[19] N. Pinto, J. DiCarlo, and D. Cox. How far can you get with a modern face recognition test set using only simple features? In Proc. CVPR, 2009.
[20] Y. Taigman, L. Wolf, and T. Hassner. Multiple one-shots for utilizing class label information. In BMVC, 2009.
[21] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. Lecture Notes in Computer Science, 4778:168, 2007.
[22] E. Tola, V. Lepetit, and P. Fua. A fast local descriptor for dense matching. In Proc. CVPR, 2008.
[23] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. CVPR, 1991.
[24] M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In Proc. CVPR, 2003.
[25] X. Wang and X. Tang. A unified framework for subspace face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1222-1228, 2004.
[26] X. Wang and X. Tang. Random sampling for subspace face recognition. International Journal of Computer Vision, 70(1):91-104, 2006.
[27] S. Winder and M. Brown. Learning local image descriptors. In Proc. CVPR, 2007.
[28] S. Winder, G. Hua, and M. Brown. Picking the best DAISY. In Proc. CVPR, 2009.
[29] L. Wiskott, J. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775-779, 1997.
[30] L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Faces in Real-Life Images Workshop at ECCV, 2008.
[31] L. Wolf, T. Hassner, and Y. Taigman. Similarity scores based on background samples. In Proc. ACCV, 2009.
[32] J. Wright and G. Hua. Implicit elastic matching with random projections for pose-variant face recognition. In Proc. CVPR, 2009.
[33] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Li. Face detection based on multi-block LBP representation. Lecture Notes in Computer Science, 4642:11, 2007.