
DeepFaceDrawing: Deep Generation of Face Images from Sketches

SHU-YU CHEN† , Institute of Computing Technology, CAS and University of Chinese Academy of Sciences
WANCHAO SU† , School of Creative Media, City University of Hong Kong
LIN GAO∗ , Institute of Computing Technology, CAS and University of Chinese Academy of Sciences
SHIHONG XIA, Institute of Computing Technology, CAS and University of Chinese Academy of Sciences
HONGBO FU, School of Creative Media, City University of Hong Kong


Fig. 1. Our DeepFaceDrawing system allows users with little training in drawing to produce high-quality face images (Bottom) from rough or even incomplete
freehand sketches (Top). Note that our method faithfully respects user intentions in input strokes, which serve more like soft constraints to guide image
synthesis.

Recent deep image-to-image translation techniques allow fast generation of face images from freehand sketches. However, existing solutions tend to overfit to sketches, thus requiring professional sketches or even edge maps as input. To address this issue, our key idea is to implicitly model the shape space of plausible face images and synthesize a face image in this space to approximate an input sketch. We take a local-to-global approach. We first learn feature embeddings of key face components, and push corresponding parts of input sketches towards underlying component manifolds defined by the feature vectors of face component samples. We also propose another deep neural network to learn the mapping from the embedded component features to realistic images, with multi-channel feature maps as intermediate results to improve the information flow. Our method essentially uses input sketches as soft constraints and is thus able to produce high-quality face images even from rough and/or incomplete sketches. Our tool is easy to use even for non-artists, while still supporting fine-grained control of shape details. Both qualitative and quantitative evaluations show the superior generation ability of our system to existing and alternative solutions. The usability and expressiveness of our system are confirmed by a user study.

CCS Concepts: • Human-centered computing → Graphical user interfaces; • Computing methodologies → Perception; Texturing; Image processing.

Additional Key Words and Phrases: image-to-image translation, feature embedding, sketch-based generation, face synthesis

† Authors contributed equally.
∗ Corresponding author.
Webpage: http://geometrylearning.com/DeepFaceDrawing/
This is the author's version of the work. It is posted here for your personal use. Not for redistribution.

1 INTRODUCTION
Creating realistic human face images from scratch benefits various applications including criminal investigation, character design, educational training, etc. Due to their simplicity, conciseness and ease of use, sketches are often used to depict desired faces. The recently proposed deep learning based image-to-image translation techniques (e.g., [19, 38]) allow automatic generation of photo images from sketches for various object categories including human faces, and lead to impressive results.

Most such deep learning based solutions (e.g., [6, 19, 26, 38]) for sketch-to-image translation take input sketches as almost fixed and attempt to infer the missing texture or shading information between strokes. To some extent, their problems are formulated more like reconstruction problems with input sketches as hard constraints. Since they often train their networks from pairs of real
images and their corresponding edge maps, due to the data-driven nature, they thus require test sketches with quality similar to edge maps of real images to synthesize realistic face images. However, such sketches are difficult to make, especially for users with little training in drawing.

To address this issue, our key idea is to implicitly learn a space of plausible face sketches from real face sketch images and find the closest point in this space to approximate an input sketch. In this way, sketches can be used more like soft constraints to guide image synthesis. Thus we can increase the plausibility of synthesized images even for rough and/or incomplete input sketches, while respecting the characteristics represented in the sketches (e.g., Figure 1 (a-d)). Learning such a space globally (if it exists) is not very feasible due to the limited training data against an expected high-dimensional feature space. This motivates us to implicitly model component-level manifolds, for which it makes better sense to assume each component manifold is low-dimensional and locally linear [32]. This decision not only helps locally span such manifolds using a limited amount of face data, but also enables finer-grained control of shape details (Figure 1 (e)).

To this end, we present a novel deep learning framework for sketch-based face image synthesis, as illustrated in Figure 3. Our system consists of three main modules, namely, CE (Component Embedding), FM (Feature Mapping), and IS (Image Synthesis). The CE module adopts an auto-encoder architecture and separately learns five feature descriptors from the face sketch data, namely for "left-eye", "right-eye", "nose", "mouth", and "remainder", for locally spanning the component manifolds. The FM and IS modules together form another deep learning sub-network for conditional image generation, and map component feature vectors to realistic images. Although FM looks similar to the decoding part of CE, by mapping the feature vectors to 32-channel feature maps instead of 1-channel sketches, it improves the information flow and thus provides more flexibility to fuse individual face components for higher-quality synthesis results.

Inspired by [25], we provide a shadow-guided interface (implemented based on CE) for users to input face sketches with proper structures more easily (Figure 8). Corresponding parts of input sketches are projected to the underlying facial component manifolds and then mapped to the corresponding feature maps as conditions for image synthesis. Our system produces high-quality realistic face images (with a resolution of 512 × 512), which faithfully respect input sketches. We evaluate our system by comparing with existing and alternative solutions, both quantitatively and qualitatively. The results show that our method produces visually more pleasing face images. The usability and expressiveness of our system are confirmed by a user study. We also propose several interesting applications using our method.

2 RELATED WORK
Our work is related to existing works for drawing assistance and conditional face generation. We focus on the works closely related to ours; a full review of such topics is beyond the scope of our paper.

2.1 Drawing Assistance
Multiple guidance or suggestive interfaces (e.g., [17]) have been proposed to assist users in creating drawings of better quality. For example, Dixon et al. [7] proposed iCanDraw, which provides corrective feedback based on an input sketch and facial features extracted from a reference image. ShadowDraw by Lee et al. [25] retrieves real images from an image repository involving many object categories for an input sketch as query, and then blends the retrieved images as a shadow for drawing guidance. Our shadow-guided interface for inputting sketches is based on the concept of ShadowDraw but is specially designed for assisting in face drawing. Matsui et al. [29] proposed DrawFromDrawings, which allows the retrieval of reference sketches and their interpolation with an input sketch. Our solution for projecting an input sketch to underlying component manifolds follows a similar retrieval-and-interpolation idea, but we perform this in the learned feature spaces, without the explicit correspondence detection needed by DrawFromDrawings. Unlike the above works, which aim to produce quality sketches as output, our work treats such sketches as possible inputs, and we are more interested in producing realistic face images even from rough and/or incomplete sketches.

Another group of methods (e.g., [1, 18]) take a more aggressive approach and aim to automatically correct input sketches. For example, Limpaecher et al. [27] learn a correction vector field from a crowdsourced set of face drawings to correct a face sketch, under the assumption that such face drawings and the input sketch are of the same subject. Xie et al. [41] and Su et al. [36] propose optimization-based approaches for refining sketches roughly drawn on a reference image. We refine an input sketch by projecting individual face components of the input sketch to the corresponding component manifolds. However, as shown in Figure 5, directly using such refined component sketches as input to conditional image generation might cause artifacts across facial components. Since our goal is sketch-based image synthesis, we perform sketch refinement only implicitly.

2.2 Conditional Face Generation
In recent years, conditional generative models, in particular conditional Generative Adversarial Networks (GANs) [11], have been popular for image generation conditioned on various input types. Karras et al. [22] propose an alternative generator for GANs that separates the high-level face attributes and stochastic variations in generating high-quality face images. Based on conditional GANs [30], Isola et al. [19] present the pix2pix framework for various image-to-image translation problems like image colorization, semantic segmentation, sketch-to-image synthesis, etc. Wang et al. [38] introduce pix2pixHD, an improved version of pix2pix that generates higher-resolution images, and demonstrate its application to image synthesis from semantic label maps. Wang et al. [37] generate an image given a semantic label map as well as an image exemplar. Sangkloy et al. [34] take hand-drawn sketches as input and colorize them under the guidance of user-specified sparse color strokes. These systems tend to overfit to conditions seen during training, and thus, when sketches are used as conditions, they achieve quality results only given edge maps as input.
To address this issue, instead of training an end-to-end network for sketch-to-image synthesis, we exploit the domain knowledge and condition the GAN on feature maps derived from the component feature vectors.

Considering the known structure of human faces, researchers have explored component-based methods (e.g., [15]) for face image generation. For example, given an input sketch, Wu and Dai [40] first retrieve best-fit face components from a database of face images, then compose the retrieved components together, and finally deform the composed image to approximate the sketch. Due to their synthesis-and-deforming strategy, their solution requires a well-drawn sketch as input. To enable component-level controllability, Gu et al. [12] use auto-encoders to learn feature embeddings for individual face components, and fuse component feature tensors in a mask-guided generative network. Our CE module is inspired by their work. However, their local embeddings, learned from real images, are mainly used to generate portrait images with high diversity, while ours, learned from sketch images, are mainly for implicitly refining and completing input sketches.

Conditional GANs have also been adopted for local editing of face images, via interfaces based either on semantic label masks [12, 24, 37] or on sketches [20, 31]. While the former is more flexible for applications such as component transfer and style transfer, the latter provides more direct and finer control of details, even within face components. Deep sketch-based face editing is essentially a sketch-guided image completion problem, which requires the completion of missing parts such that the completed content faithfully reflects an input sketch and seamlessly connects to the known context. It thus requires different networks from ours. The SN-patchGAN proposed by Jo and Park [20] is able to produce impressive details, for example for a sketched earring. However, this also implies that their solution requires high-quality sketches as input. To tolerate the errors in hand-drawn sketches, Portenier et al. [31] propose to use smoothed edge maps as part of the input to their conditional completion network. Our work takes a step further to implicitly model face component manifolds and perform manifold projection.

Several attempts have also been made to generate images from incomplete sketches. To synthesize face images from line maps possibly with some missing face components, Li et al. [26] proposed a conditional self-attention GAN with a multi-scale discriminator, where a large-scale discriminator enforces the completeness of global structures. Although their method leads to visually better results than pix2pix [19] and SketchyGAN [5], due to the direct conditioning on edge maps, their solution has poor ability to handle hand-drawn sketches. Ghosh et al. [10] present a shape generator to complete a partial sketch before image generation, and present interesting auto-completion results. However, their synthesized images still exhibit noticeable artifacts, since the performance of their image generation step (i.e., pix2pixHD [38] for single-class generation and SkinnyResNet [10] for multi-class generation) heavily depends on the quality of input sketches. A similar problem exists with the progressive image reconstruction network proposed by You et al. [45], which is able to reconstruct images from extremely sparse inputs but still requires relatively accurate inputs.

To alleviate the heterogeneity of input sketches and real face images, some researchers resort to unpaired image-to-image methods (e.g., [16, 44, 48]). These methods adopt self-consistency constraints to address the lack of paired data. While the self-consistency mechanism ensures the correspondence between the input and the reconstructed input, there is no guarantee of correspondence between the input and the transformed representation. Since our goal is to transform sketches into the corresponding face images, these frameworks are not suitable for our task. In addition, some works leverage image manifolds. For example, Lu et al. [28] learn a fused representation from shape and texture features to construct a face retrieval system. In contrast, our method not only retrieves but also interpolates the face representations in generation. Zhu et al. [47] first construct a manifold with the real image dataset, then predict a dense correspondence between a projected source image and an edit-oriented "feasible" target in the manifold, and finally apply the dense correspondence back to the original source image to complete the visual manipulation. In contrast, our method directly interpolates the nearest neighbors of the query and feeds the interpolation result to the subsequent generation process. Compared to Zhu et al. [47], our method is more direct and efficient for the sketch-based image generation task.

Fig. 2. Comparisons of different edge extraction methods. (a): Input real image. (b): Result by HED [42]. (c): Result by APDrawingGAN [43]. (d): Canny edges [4]. (e): Result by the Photocopy filter in Photoshop. (f): Simplification of (e) by [35]. Photo (a) courtesy of © LanaLucia.

3 METHODOLOGY
The 3D shape space of human faces has been well studied (see the classic morphable face model [3]). A possible approach to synthesizing realistic faces from hand-drawn sketches is to first project an input sketch to such a 3D face space [13] and then synthesize a face image from the generated 3D face. However, such a global parametric model is not flexible enough to accommodate rich image details or support local editing. Inspired by [8], which shows the effectiveness of a local-global structure for faithful local detail synthesis, our method aims to model the shape spaces of face components in the image domain.
Fig. 3. Illustration of our network architecture. The upper half is the Component Embedding module. We learn feature embeddings of face components using
individual auto-encoders. The feature vectors of component samples are considered as the point samples of the underlying component manifolds and are used
to refine an input hand-drawn sketch by projecting its individual parts to the corresponding component manifolds. The lower half illustrates a sub-network
consisting of the Feature Mapping (FM) and the Image Synthesis (IS) modules. The FM module decodes the component feature vectors to the corresponding
multi-channel feature maps (𝐻 × 𝑊 × 32), which are combined according to the spatial locations of the corresponding facial components before passing them
to the IS module.

To achieve this, we first learn the feature embeddings of face components (Section 3.2). For each component type, the points corresponding to component samples implicitly define a manifold. However, we do not explicitly learn this manifold, since we are more interested in knowing the closest point in such a manifold given a new sketched face component, which needs to be refined. Observing that semantically similar components are close to each other in the embedding spaces, we assume that the underlying component manifolds are locally linear. We then follow the main idea of the classic locally linear embedding (LLE) algorithm [32] to project the feature vector of the sketched face component to its component manifold (Section 3.3).

The learned feature embeddings also allow us to guide conditional sketch-to-image synthesis to explicitly exploit the information in the feature space. Unlike traditional sketch-to-image synthesis methods (e.g., [19, 38]), which learn conditional GANs to translate sketches to images, our approach forces the synthesis pipeline to go through the component feature spaces and then map 1-channel feature vectors to 32-channel feature maps before the use of a conditional GAN (Section 3.2). This greatly improves the information flow and benefits component fusion. Below we first discuss our data preparation procedure (Section 3.1). We then introduce our novel pipeline for sketch-to-image synthesis (Section 3.2) and our approach for manifold projection (Section 3.3). Finally, we present our shadow-guided interface (Section 3.4).

3.1 Data Preparation
Training our network requires a reasonably large-scale dataset of face sketch-image pairs. There exist several relevant datasets like the CUHK face sketch database [39, 46]. However, the sketches in such datasets involve shading effects, while we expect a more abstract representation of faces using sparse lines. We thus contribute a new dataset of face images and corresponding synthesized sketches. We build this on the face image data of CelebAMask-HQ [24], which contains high-resolution facial images with semantic masks of facial attributes. For simplicity, we currently focus on front faces without decorative accessories (e.g., glasses, face masks).

To extract sparse lines from real images, we have tried the following edge detection methods. As shown in Figure 2 (b) and (d),
the holistically-nested edge detection (HED) method [42] and the traditional Canny edge detection algorithm [4] tend to produce edge
maps with discontinuous lines. APDrawingGAN [43], a very recent
approach for generating portrait drawings from face photos leads
to artistically pleasing results, which, however, are different from
our expectation (e.g., see the regional effects in the hair area and
missing details around the mouth in Figure 2 (c)). We also resorted
to the Photocopy filter in Photoshop, which preserves facial details
but meanwhile brings excessive noise (Figure 2 (e)). By applying the
sketch simplification method by Simo-Serra et al. [35] to the result
by the Photocopy filter, we get an edge map with the noise reduced
and the lines better resembling hand-drawn sketches (Figure 2 (f)).
We thus adopt this approach (i.e., Photocopy + sketch simplification)
to prepare our training dataset, which contains 17K sketch-image pairs (see an example pair in Figure 2 (f) and (a)), with 6247
for male subjects and 11456 for female subjects. Since our dataset is
not very large-scale, we reserve the data in the training process as
much as possible to provide as many samples as possible to span
the component manifolds. Thus we set a training/testing ratio of 20:1 in our experiments, resulting in 16,860 samples for training and
842 for testing.
Fig. 4. Two examples of generation flexibility supported by using separate
components for the left and right eyes.
3.2 Sketch-to-Image Synthesis Architecture
As illustrated in Figure 3, our deep learning framework takes as
input a sketch image and generates a high-quality facial image of
size 512×512. It consists of two sub-networks: the first sub-network is our CE module, which is responsible for learning feature embeddings of individual face components using separate auto-encoder networks. This step turns component sketches into semantically meaningful feature vectors. The second sub-network consists of two modules: FM and IS. FM turns the component feature vectors into the corresponding feature maps to improve the information flow. The feature maps of individual face components are then combined according to the face structure and finally passed to IS for face image synthesis.

Component Embedding Module. Since human faces share a clear structure, we decompose a face sketch into five components, denoted as 𝑆𝑐, 𝑐 ∈ {1, 2, 3, 4, 5}, for "left-eye", "right-eye", "nose", "mouth", and "remainder", respectively. To handle the details in-between components, we define the first four components simply by using four overlapping windows centered at individual face components (derived from the pre-labeled segmentation masks in the dataset), as illustrated in Figure 3 (Top-Left). A "remainder" image corresponding to the "remainder" component is the same as the original sketch image but with the eyes, nose and mouth removed. Here we treat "left-eye" and "right-eye" separately to best explore the flexibility in the generated faces (see two examples in Figure 4). To better control the details of individual components, for each face component type we learn a local feature embedding. We obtain the feature descriptors of individual components by using five auto-encoder networks, denoted as {𝐸𝑐, 𝐷𝑐}, with 𝐸𝑐 being an encoder and 𝐷𝑐 a decoder for component 𝑐.

Each auto-encoder consists of five encoding layers and five decoding layers. We add a fully connected layer in the middle to ensure the latent descriptor is of 512 dimensions for all five components. We experimented with different numbers of dimensions for the latent representation (128, 256, 512) and found that 512 dimensions are enough for reconstructing and representing the sketch details, while lower-dimensional representations tend to lead to blurry results. By trial and error, we append a residual block after every convolution/deconvolution operation in each encoding/decoding layer to construct the latent descriptors, instead of only using convolution and deconvolution layers. We use the Adam solver [23] in the training process. Please find the details of the network architectures and the parameter settings in the supplemental materials.
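To make this description concrete, the following PyTorch sketch shows what one per-component auto-encoder {𝐸𝑐, 𝐷𝑐} could look like. It is a minimal illustration under our own assumptions (the patch size, channel widths, and activations are ours, and the per-layer residual blocks mentioned above are omitted); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ComponentAutoEncoder(nn.Module):
    """Simplified per-component auto-encoder: five stride-2 conv layers,
    a 512-D fully connected bottleneck, and five deconv layers back to a
    1-channel sketch patch. Hyperparameters are illustrative only."""

    def __init__(self, patch_size=128, width=32):
        super().__init__()
        s = patch_size // 2 ** 5                       # spatial size after 5 downsamplings
        chans = [1, width, width * 2, width * 4, width * 8, width * 8]
        enc = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv2d(cin, cout, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*enc)
        self.to_latent = nn.Linear(chans[-1] * s * s, 512)   # 512-D component descriptor
        self.from_latent = nn.Linear(512, chans[-1] * s * s)
        dec = []
        for cin, cout in zip(chans[:0:-1], chans[-2::-1]):
            dec += [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
        dec[-1] = nn.Sigmoid()                         # final activation outputs a sketch patch
        self.decoder = nn.Sequential(*dec)
        self._shape = (chans[-1], s, s)

    def encode(self, x):                               # E_c: patch -> 512-D vector
        return self.to_latent(self.encoder(x).flatten(1))

    def decode(self, z):                               # D_c: 512-D vector -> patch
        h = self.from_latent(z).view(z.size(0), *self._shape)
        return self.decoder(h)

    def forward(self, x):
        return self.decode(self.encode(x))
```

Five independent instances of such a module, one per component, would be trained with the MSE reconstruction loss of Stage I described below.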

Feature Mapping Module. Given an input sketch, we can project its individual parts to the component manifolds to increase its plausibility (Section 3.3). One possible solution to synthesize a realistic image is to first convert the feature vectors of the projected manifold points back to component sketches using the learned decoders {𝐷𝑐}, then perform component-level sketch-to-image synthesis (e.g., based on [38]), and finally fuse the component images together to get a complete face. However, this straightforward solution easily leads to inconsistencies in synthesized results in terms of both local details and global styles, since there is no mechanism to coordinate the individual generation processes.

Another possible solution is to first fuse the decoded component sketches into a complete face sketch (Figure 5 (b)) and then perform sketch-to-image synthesis to get a face image (Figure 5 (c)). It can be seen that this solution also easily causes artifacts (e.g., misalignment between face components, incompatible hair styles) in the synthesized sketch, and such artifacts are inherited by the synthesized image, since existing deep learning solutions for sketch-to-image synthesis tend to use input sketches as rather hard constraints, as discussed previously.
f sc
w2c c
w1c f proj
wkc

Mc Kc
Fig. 6. Illustration of manifold projection. Given a new feature vector 𝑓𝑠˜𝑐 , we
(a) (b) replace it with the projected feature vector 𝑓𝑝𝑟𝑜 𝑗 using K nearest neighbors
𝑐

of 𝑓𝑠˜𝑐 .

Image Synthesis Module. Given the combined feature maps, the


IS module converts them to a realistic face image. We implement
this module using a conditional GAN architecture, which takes the
feature maps as input to a generator, with the generation guided
by a discriminator. Like the global generator in pix2pixHD [38],
our generator contains an encoding part, a residual block, and a
decoding unit. The input feature maps go through these units se-
quentially. Similar to [38], the discriminator is designed to determine
(c) (d) the samples in a multi-scale manner: we downsample the input to
multiple sizes and use multiple discriminators to process different
Fig. 5. Given the same input sketch (a), image synthesis conditioned on the inputs at different scales. We use this setting to learn the high-level
feature vectors after manifold projection achieves a more realistic result (d) correlations among parts implicitly.
than that (c) by image synthesis conditioned on an intermediate sketch (b).
See the highlighted artifacts in both the intermediate sketch (b) and the Two-stage Training. As illustrated in Figure 3, we adopt a two-
corresponding synthesized result (c) by pix2pixHD [38]. stage training strategy to train our network using our dataset of
sketch-image pairs (Section 3.1). In Stage I, we train only the CE mod-
ule, by using component sketches to train five individual auto-
encoders for feature embeddings. The training is done in a self-
supervised manner, with the mean square error (MSE) loss between
We observe that the above issues mainly happen in the overlap- an input sketch image and the reconstructed image. In Stage II, we
ping regions of the cropping windows for individual components. fix the parameters of the trained component encoders and train
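The placement-and-merge step can be made concrete with a short NumPy sketch. The window representation and array layout below are our assumptions for illustration; only the fixed depth order comes from the description above.

```python
import numpy as np

def combine_feature_maps(maps, windows, height, width, channels=32):
    """Paste per-component feature maps onto a whole-face feature tensor.

    maps    : dict {component: (h, w, channels) array} from the FM decoders.
    windows : dict {component: (top, left)} pixel positions of each cropping
              window in the input sketch (our assumed representation).
    The fixed depth order "left/right eyes > nose > mouth > remainder" is
    realized by writing lower-priority maps first, so higher-priority
    components overwrite them wherever the windows overlap.
    """
    canvas = np.zeros((height, width, channels), dtype=np.float32)
    for c in ("remainder", "mouth", "nose", "right_eye", "left_eye"):
        top, left = windows[c]
        h, w, _ = maps[c].shape
        canvas[top:top + h, left:left + w, :] = maps[c]
    return canvas
```

Writing the maps from lowest to highest priority lets the eye, nose, and mouth maps overwrite the "remainder" map in the overlapping regions, which is exactly where the single-channel sketch fusion failed.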
Image Synthesis Module. Given the combined feature maps, the IS module converts them to a realistic face image. We implement this module using a conditional GAN architecture, which takes the feature maps as input to a generator, with the generation guided by a discriminator. Like the global generator in pix2pixHD [38], our generator contains an encoding part, a residual block, and a decoding unit; the input feature maps go through these units sequentially. Similar to [38], the discriminator judges the samples in a multi-scale manner: we downsample the input to multiple sizes and use multiple discriminators to process the inputs at different scales. We use this setting to implicitly learn the high-level correlations among parts.

Two-stage Training. As illustrated in Figure 3, we adopt a two-stage training strategy to train our network using our dataset of sketch-image pairs (Section 3.1). In Stage I, we train only the CE module, using component sketches to train the five individual auto-encoders for feature embeddings. The training is done in a self-supervised manner, with the mean square error (MSE) loss between an input sketch image and the reconstructed image. In Stage II, we fix the parameters of the trained component encoders and train the entire network, with the unknown parameters in the FM and IS modules, in an end-to-end manner. For the GAN in the IS module, besides the GAN loss, we also incorporate an 𝐿1 loss to further guide the generator and thus ensure the pixel-wise quality of generated images. We use the perceptual loss [21] in the discriminator to compare the high-level difference between real and generated images. Due to the different characteristics of female and male portraits, we train the network using the complete set but constrain the search space to the male or female subspace at test time.
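As a rough illustration of the Stage II objective, the sketch below combines an adversarial term, an 𝐿1 term, and a perceptual (feature-comparison) term. The specific adversarial loss form, the feature extractor, and the weights lambda_l1 and lambda_perc are our assumptions; the paper does not state these details here.

```python
import torch
import torch.nn.functional as F

def stage2_generator_loss(fake, real, d_fake_logits, features_fn,
                          lambda_l1=100.0, lambda_perc=10.0):
    """Illustrative Stage II generator objective: GAN + L1 + perceptual loss.

    d_fake_logits : discriminator output on the generated image.
    features_fn   : callable returning a list of high-level feature maps
                    (e.g. discriminator or VGG activations) for an image.
    The loss weights are illustrative assumptions, not the paper's values.
    """
    # Non-saturating GAN loss on the generator side.
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Pixel-wise L1 loss for low-level fidelity.
    l1 = F.l1_loss(fake, real)
    # Perceptual loss: compare high-level features of real and generated images.
    perc = sum(F.l1_loss(a, b) for a, b in zip(features_fn(fake), features_fn(real)))
    return adv + lambda_l1 * l1 + lambda_perc * perc
```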
Fig. 6. Illustration of manifold projection. Given a new feature vector 𝑓𝑠̃𝑐, we replace it with the projected feature vector 𝑓𝑝𝑟𝑜𝑗𝑐 computed from the 𝐾 nearest neighbors of 𝑓𝑠̃𝑐.

3.3 Manifold Projection
Let S = {𝑠𝑖} denote a set of sketch images used to train the feature embeddings of face components (Section 3.2). For each component 𝑐, we can get a set of points in the 𝑐-component feature space by using the trained encoders, denoted as F𝑐 = {𝑓𝑖𝑐 = 𝐸𝑐(𝑠𝑖𝑐)}. Although each feature space is 512-dimensional, given that similar component images are placed closely in such feature spaces, we tend to believe that all the points in F𝑐 are in an underlying low-dimensional manifold, denoted as M𝑐, and further assume each component manifold is locally linear: each point and its neighbors lie on or close to a locally linear patch of the manifold [32].
Fig. 7. Illustration of linear interpolation between pairs of randomly selected neighboring component sketches (Leftmost and Rightmost) in the corresponding feature spaces. The middle three images are decoded from the uniformly interpolated feature vectors.

Given an input sketch 𝑠̃, to increase its plausibility as a human face, we project its 𝑐-th component to M𝑐. With the locally linear assumption, we follow the main idea of LLE and take a retrieval-and-interpolation approach to project the 𝑐-th component feature vector 𝐸𝑐(𝑠̃𝑐), denoted as 𝑓𝑠̃𝑐, to M𝑐, as illustrated in Figure 3. As illustrated in Figure 6, given the 𝑐-th component feature vector 𝑓𝑠̃𝑐, we first find its 𝐾 nearest samples in F𝑐 under the Euclidean distance. By trial and error, we found that 𝐾 = 10 is sufficient for providing face plausibility while maintaining adequate variations. Let K𝑐 = {𝑠𝑘𝑐} (with {𝑠𝑘} ⊂ S) denote the resulting set of 𝐾 nearest samples, i.e., the neighbors of 𝑠̃𝑐 on M𝑐. We then seek a linear combination of these neighbors to reconstruct 𝑠̃𝑐 by minimizing the reconstruction error. This is equivalent to solving for the interpolation weights through the following minimization problem:

    \min_{\{w_k^c\}} \; \Big\| f_{\tilde{s}}^c - \sum_{k \in \mathcal{K}^c} w_k^c \cdot f_k^c \Big\|_2^2, \quad \text{s.t.} \; \sum_{k \in \mathcal{K}^c} w_k^c = 1,    (1)

where 𝑤𝑘𝑐 is the unknown weight for sample 𝑠𝑘𝑐. The weights can be found by solving a constrained least-squares problem for individual components independently. Given the solved weights {𝑤𝑘𝑐}, the projected point of 𝑠̃𝑐 on M𝑐 can be computed as

    f_{proj}^c = \sum_{k \in \mathcal{K}^c} w_k^c \cdot f_k^c.    (2)

𝑓𝑝𝑟𝑜𝑗𝑐 is the feature vector of the refined version of 𝑠̃𝑐, and can be passed to the FM and IS modules for image synthesis.

To verify the local continuity of the underlying manifolds, we first randomly select a sample from S, and for its 𝑐-th component randomly select one of its nearest neighbors in the corresponding feature space. We then perform linear interpolation between such a pair of component sketches in the 𝑐-th feature space, and reconstruct the interpolated component sketches using the learned 𝑐-th decoder 𝐷𝑐. The reconstructed results are shown in Figure 7. It can be seen that as we change the interpolation weight continuously, the consecutive reconstructed component sketches change smoothly between the pair of selected sketches. This shows the feasibility of our descriptor interpolation.
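The retrieval-and-interpolation projection of Eqs. (1) and (2) can be written compactly; below is a minimal NumPy sketch. The regularization of the local Gram matrix (reg) is our addition for numerical stability and is not part of the formulation above.

```python
import numpy as np

def project_to_manifold(f, samples, k=10, reg=1e-3):
    """Project a component feature vector onto its component manifold.

    f       : (512,) feature vector of the sketched component.
    samples : (N, 512) feature vectors of the component samples (the set F^c).
    Returns the projected vector f_proj of Eq. (2).
    """
    # K nearest neighbours of f under the Euclidean distance.
    dists = np.linalg.norm(samples - f, axis=1)
    idx = np.argsort(dists)[:k]
    neighbors = samples[idx]                         # (k, 512)

    # Solve  min || f - sum_k w_k f_k ||^2  s.t.  sum_k w_k = 1  (Eq. 1),
    # following the standard LLE weight computation via the local Gram matrix.
    diffs = neighbors - f                            # (k, 512)
    gram = diffs @ diffs.T
    gram += reg * np.trace(gram) * np.eye(k)         # regularize near-singular cases
    w = np.linalg.solve(gram, np.ones(k))
    w /= w.sum()                                     # enforce the sum-to-one constraint

    return w @ neighbors                             # Eq. (2): interpolated feature vector
```

With 𝐾 = 10 the linear system is tiny, so the projection itself adds negligible cost per component.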
Fig. 8. A screenshot of our shadow-guided sketching interface (Left) for facial image synthesis (Right). The sliders at the top-right corner can be used to control the degree of interpolation between an input sketch and a refined version after manifold projection for individual components.

3.4 Shadow-guided Sketching Interface
To assist users, especially those with little training in drawing, inspired by ShadowDraw [25] we provide a shadow-guided sketching interface. Given a current sketch 𝑠̃, we first find the 𝐾 (𝐾 = 10 in our implementation) most similar sketch component images from S according to 𝑠̃𝑐, using the Euclidean distance in the feature space. The found component images are then blended as a shadow and placed at the corresponding components' positions for sketching guidance (Figure 8 (Left)). Initially, when the canvas is empty, the shadow is more blurry. The shadow is updated instantly for every new input stroke. The synthesized image is displayed in the window on the right. Users may choose to update the synthesized image instantly or trigger a "Convert" command. We show two sequences of sketching and synthesis results in Figure 18.

Users with good drawing skills tend to trust their own drawings more than those with little training in drawing. We thus provide a slider for each component type to control the blending weight between a sketched component and its refined version after manifold projection. Let 𝑤𝑏𝑐 denote the blending weight for component 𝑐. The feature vector after blending can be calculated as:

    f_{blend}^c = w_b^c \times f_{\tilde{s}}^c + (1 - w_b^c) \times f_{proj}^c.    (3)

Feeding 𝑓𝑏𝑙𝑒𝑛𝑑𝑐 to the subsequent trained modules, we get a new synthesized image.

Figure 9 shows an example of synthesized results under different values of 𝑤𝑏𝑐. This blending feature is particularly useful for creating faces that are very different from any existing samples or their blends. For example, for the female data in our training set, most of the subjects have long hairstyles. Always pushing our input sketch to such samples would not allow us to create short-hairstyle effects. This is solved by trusting the input sketch for its "remainder" component by adjusting its corresponding blending weight.
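A small sketch of how Eq. (3) and the shadow guidance could be realized follows. The simple averaging of the decoded nearest-neighbor patches into a shadow is our assumption about the blending; the per-component weight w_b corresponds to the sliders in Figure 8.

```python
import numpy as np

def blend_feature(f_sketch, f_proj, w_b):
    """Eq. (3): w_b = 1 fully trusts the drawn component, w_b = 0 the projection."""
    return w_b * f_sketch + (1.0 - w_b) * f_proj

def shadow_for_component(f_sketch, samples, decode, k=10):
    """Blend the K most similar component sketches into a guidance shadow.

    samples : (N, 512) component feature vectors of the dataset.
    decode  : the learned component decoder D_c, feature vector -> sketch patch.
    Averaging the decoded patches is our assumption for how the retrieved
    components are blended into a shadow.
    """
    dists = np.linalg.norm(samples - f_sketch, axis=1)
    idx = np.argsort(dists)[:k]
    patches = [decode(samples[i]) for i in idx]
    return np.mean(patches, axis=0)                  # faint "shadow" of plausible strokes
```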
Fig. 9. Interpolating an input sketch and its refined version (for the "remainder" component in this example) after manifold projection under different blending weight values (𝑤𝑏5 = 0, 0.25, 0.5, 0.75, 1). 𝑤𝑏𝑐 = 1 means a full use of the input sketch for image synthesis, while by setting 𝑤𝑏𝑐 = 0 we fully trust the data for interpolation.

Figure 10 shows another example with different blending weights for different components. It can be easily seen that the result with automatic refinement (lower left) is visually more realistic than that without any refinement (upper right). Fine-tuning the blending weights leads to a result that reflects the input sketch more faithfully.

Fig. 10. Blending an input sketch and its refined version after manifold projection for the "left-eye", "right-eye", and "mouth" components. Upper Right: result without any sketch refinement (𝑤𝑏 = 1.0); Lower Left: result with full-degree sketch refinement (𝑤𝑏 = 0.0); Lower Right: result with partial-degree sketch refinement (𝑤𝑏 = {0.7, 0.4, 0.3} for the three components).

4 EXPERIMENTS
We have done extensive evaluations to show the effectiveness of our sketch-to-image face synthesis system and its usability via a pilot study. Below we present some of the obtained results. Please refer to the supplemental materials for more results and an accompanying video for sketch-based image synthesis in action.

Figure 11 shows two representative results where users progressively introduce new strokes to add or stress local details. As shown in the demo video, running on a PC with an Intel i7-7700 CPU, 16GB RAM, and a single Nvidia GTX 1080Ti GPU, our method achieves real-time feedback. Thanks to our local-to-global approach, more strokes generally lead to new or refined details (e.g., the nose in the first example, and the eyebrows and wrinkles in the second example), with other areas largely unchanged. Still, due to the combination step, local editing might introduce subtle but global changes. For example, in the first example, the local change of lighting in the nose area leads to a change of the highlight in the whole face (especially in the forehead region). Figure 18 shows two more complete sequences of progressive sketching and synthesis with our shadow-guided interface.

4.1 Usability Study
We conducted a usability study to evaluate the usefulness and effectiveness of our system. 10 subjects (9 male and 1 female, aged from 20 to 26) were invited to participate in this study. We first asked them to self-assess their drawing skills on a nine-point Likert scale (1: novice to 9: professional), and divided them into three groups: 4 novice users (drawing skill score: 1 – 3), 4 middle users (4 – 6), and 2 professional users (7 – 9). Before the drawing session, each participant was given a short tutorial about our system (about 10 minutes). The participants used an iPad with iPencil to remotely control the server PC for drawing. Each of them was then asked to create at least 3 faces using our system. The study ended with a questionnaire to get user feedback on ease of use, controllability, variance of results, quality of results, and expectation fitness. Additional comments on our system were also welcome.

Figure 12 gives a gallery of sketches and synthesized faces by the participants. It can be seen that our system consistently produces realistic results given input sketches with different styles and levels of abstraction. For several examples, the participants attempted to depict beard styles via hatching, and our system captured the users' intention very well.

Figure 13 shows a radar plot summarizing quantitative feedback on our system for participant groups with different levels of drawing skills. The feedback from all groups of participants was positive in all the measured aspects. In particular, the participants with good drawing skills felt a high level of controllability, while they gave

slightly lower scores for the degree of result variance. Using our system, the average time needed for drawing a face sketch among the participants with different drawing abilities was: 17′14′′ (professional), 3′17′′ (middle), and 2′26′′ (novice). It took much longer for professionals, since they spent more time sketching and refining details. For the refinement sliders, the most frequently used slider was for the "remainder" component (56.69%), which means that for more than half of the results, the "remainder" slider was manipulated. In contrast, for the other components we have 21.66% for "left-eye", 12.74% for "right-eye", 12.10% for "nose", and 19.75% for "mouth". For all the adjustments made in the components, participants trusted the "remainder" component most, with an average confidence of 0.78; the least trusted component was "mouth" (0.56); the other component confidences were 0.70 ("left-eye"), 0.61 ("right-eye"), and 0.58 ("nose"). The average confidences imply the importance of sketch refinement in creating the synthesized faces in this study.

Fig. 11. Representative results through progressive sketching for adding details (1st example) and stressing local details (2nd example).

All of the participants felt that our system was powerful for creating realistic faces from such sparse sketches. They liked the intuitive shadow-guided interface, which was quite helpful for them to construct face sketches with proper structures and layouts. On the other hand, some users, particularly those with good drawing skills, felt that the shadow guidance was sometimes distracting when editing details. This finding is consistent with the conclusions in the original ShadowDraw paper [25]. One of the professional users mentioned that automatic synthesis of face images given sparse inputs saved a lot of effort and time compared to traditional painting software. One professional user mentioned that it would be better if our system could provide color control.

4.2 Comparison with Alternative Refinement Strategies
To refine an input sketch, we essentially take a component-level retrieval-and-interpolation approach. We compare this method with two alternative sketch refinement methods that globally or locally retrieve the most similar sample in the training data. For a fair comparison, we use the same FM and IS modules for image synthesis. The local retrieval method is the same as our method except that for manifold projection we simply retrieve the closest (i.e., top-1 instead of top-𝐾) component sample in each component-level feature space, without any interpolation. For the global retrieval method, we replace the CE module with a new module for the feature embeddings of entire face sketches. Specifically, we first learn the feature embeddings of the entire face sketch images, and given a new sketch we find the most similar (i.e., top-1) sample in the whole-face feature space. For each component in the globally retrieved sample image, we then encode it using the corresponding trained component-level encoder (i.e., 𝐸𝑐), and pass all the component-level feature vectors to our FM and IS modules for image synthesis. Note that we do not globally retrieve real face images, since our goal here is a fair comparison of the sketch refinement methods.

Figure 14 shows comparison results. From the overlay of input sketches and the retrieved or interpolated sketches, it can be easily seen that the component-level retrieval method returns samples closer to the input component sketches than the global retrieval method, mainly due to the limited sample data. Thanks to the interpolation step, the interpolated sketches almost perfectly fit the input sketches. Note that we show the decoded sketches after interpolation here only for comparison purposes; our conditional image synthesis sub-network takes the interpolated feature vectors as input (Section 3.2).

4.3 Perceptive Evaluation Study
As shown in Figure 14 (Right), the three refinement methods all lead to realistic face images. To evaluate the visual quality and the faithfulness (i.e., the degree of fitness to input sketches) of synthesized results, we conducted a user study.

We prepared a set of input sketches, containing in total 22 sketches, including 9 from the usability study (Section 4.1) and 13 from the authors. This sketch set (see the supplementary materials) covered inputs with different levels of abstraction and different degrees of roughness. We applied the three refinement methods to each input sketch to generate the corresponding synthesized results (see two representative sets in Figure 15).

The evaluation was done through an online questionnaire. 60 participants (39 male, 21 female, aged from 18 to 32) participated in this study. Most of them had no professional training in drawing. We showed each participant four images, including the input sketch and the three synthesized images, placed side by side in a random order.
Fig. 12. Gallery of input sketches and synthesized results in the usability study.
Each participant was asked to evaluate the quality and faithfulness of each result, both on a seven-point Likert scale (1 = strongly negative to 7 = strongly positive). In total, to evaluate either the faithfulness or the quality, we got 60 (participants) × 22 (sketches) = 1,320 subjective evaluations for each method.

Fig. 13. The summary of quantitative feedback in the usability study for the novice, middle, and professional participant groups. (a) Ease of use. (b) Controllability. (c) Variance of results. (d) Quality of results. (e) Expectation fitness.

Fig. 14. Comparisons of using global retrieval (a), component-level retrieval (b), and our method (essentially component-level retrieval followed by interpolation) (c) for sketch refinement. The right column shows the corresponding synthesized results. For easy comparison we overlay input sketches (in light blue) on top of the retrieved or interpolated sketches by the different methods.

Fig. 15. Two representative sets of input sketches and synthesized results used in the perceptive evaluation study. From left to right: input sketch, the results by sketch refinement through global retrieval, local retrieval, and local retrieval with interpolation (our method).

Fig. 16. Box plots of the average quality and faithfulness perception scores over the participants for each method.

Figure 16 plots the statistics of the evaluation results. We performed one-way ANOVA tests on the quality and faithfulness scores, and found significant effects for both quality (𝐹(2,63) = 47.26, 𝑝 < 0.001) and faithfulness (𝐹(2,63) = 51.72, 𝑝 < 0.001). Paired t-tests further confirmed that our method (mean: 4.85) led to significantly more faithful results than both the global (mean: 3.65; 𝑡 = −29.77, 𝑝 < 0.001) and local (mean: 4.23; 𝑡 = −16.34, 𝑝 < 0.001) retrieval methods. This is consistent with our expectation, since our method provides the largest flexibility to fit to input sketches.

In terms of visual quality, our method (mean: 5.50) significantly outperformed the global retrieval method (mean: 5.37; 𝑡 = −3.94, 𝑝 < 0.001) and the local retrieval method (mean: 4.68; 𝑡 = −24.60, 𝑝 < 0.001). It was surprising to see that the quality of the results by our method was even higher than that of the global retrieval method, since we had expected the visual quality of the results by the global method and ours to be similar. This is possibly because some information is lost after first decomposing an entire sketch into components and then recombining the corresponding feature maps.
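For reference, the kind of analysis reported above (a one-way ANOVA across the three refinement methods followed by paired t-tests against our method) could be computed with SciPy as sketched below; the variable names and data layout are our assumptions.

```python
from scipy import stats

def analyze_scores(global_scores, local_scores, ours_scores):
    """Each argument is the per-participant average score for one method
    (e.g. 60 averaged quality or faithfulness scores), in the same order."""
    f_stat, p_anova = stats.f_oneway(global_scores, local_scores, ours_scores)
    t_glob, p_glob = stats.ttest_rel(global_scores, ours_scores)   # ours vs. global
    t_loc, p_loc = stats.ttest_rel(local_scores, ours_scores)      # ours vs. local
    return {"anova": (f_stat, p_anova),
            "ours_vs_global": (t_glob, p_glob),
            "ours_vs_local": (t_loc, p_loc)}
```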

4.4 Comparison with Existing Solutions
We compare our method with state-of-the-art methods for image synthesis conditioned on sketches, including pix2pix [19], pix2pixHD [38], Lines2FacePhoto [26], and iSketchNFill [10], in terms of the visual quality of generated faces. We use their released source code, but for fair comparisons we train all the networks on our training dataset (Section 3.1). The (input and output) resolution for our method and pix2pixHD is 512 × 512, while we have 256 × 256 for pix2pix and Lines2FacePhoto according to their default settings.
Fig. 17. Comparisons with the state-of-the-art methods given the same input sketches (Top Row). Rows (top to bottom): input sketch, pix2pix, Lines2FacePhoto, pix2pixHD, iSketchNFill, and ours.

In addition, for Lines2FacePhoto, following their original paper, we also convert each sketch to a distance map as input for both training and testing. For iSketchNFill, we train their shape completion module before feeding its output to pix2pix [19] (acting as the appearance synthesis module). The input and output resolutions in their method are 256 × 256 and 128 × 128, respectively.

Figure 17 shows representative testing results given the same sketches as input. It can be easily seen that our method produces more realistic synthesized results. Since the input sketches are rough and/or incomplete, they are generally different from the training data, making the compared methods fail to produce realistic faces. Although Lines2FacePhoto generates a relatively plausible result
given an incomplete sketch, its ability to handle data imperfections is rather limited. We attempted to perform quantitative evaluation as well. However, none of the assessment metrics we tried, including Fréchet Inception Distance [14] and Inception Score [33], could faithfully reflect visual perception. For example, the averaged values of the Inception Score were 2.59 and 1.82 (the higher, the better) for pix2pixHD and ours, respectively. However, it is easily noticeable that our results are visually better than those by pix2pixHD.

5 APPLICATIONS
Our system can be adapted for various applications. In this section we present two applications: face morphing and face copy-paste.

5.1 Face Morphing
Traditional face morphing algorithms [2] often require a set of keypoint-level correspondences between two face images to guide semantic interpolation. We show a simple but effective morphing approach: 1) decompose a pair of source and target face sketches in the training dataset into five components (Section 3.2); 2) encode the component sketches as feature vectors in the corresponding feature spaces; 3) perform linear interpolation between the source and target feature vectors for the corresponding components; and 4) finally feed the interpolated feature vectors to the FM and IS modules to get intermediate face images. Figure 19 shows examples of face morphing using our method. It can be seen that our method leads to smoothly transforming results in identity, expression, and even highlight effects.
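The morphing procedure reuses the trained modules without any extra training. A minimal sketch follows, assuming a render callable that stands in for the trained FM and IS modules (the name and data layout are ours).

```python
import numpy as np

def morph_faces(src_feats, dst_feats, render, steps=5):
    """Component-wise face morphing (Section 5.1).

    src_feats, dst_feats : dicts {component: (512,) feature vector}, obtained by
                           encoding two face sketches with the trained CE encoders.
    render               : callable mapping such a dict through the FM and IS
                           modules to a face image (a stand-in for the trained
                           sub-network; the name is our assumption).
    Returns a list of intermediate face images.
    """
    frames = []
    for t in np.linspace(0.0, 1.0, steps):
        blended = {c: (1.0 - t) * src_feats[c] + t * dst_feats[c] for c in src_feats}
        frames.append(render(blended))
    return frames
```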
5.2 Face Copy-Paste
Traditional copy-paste methods (e.g., [9]) use seamless stitching on colored images. However, there are situations where the hue of local areas is irrelevant. To address this issue, we recombine face components to compose new faces, which maintains the consistency of the overall color and lighting. Specifically, this can be achieved by first encoding face component sketches (possibly from different subjects) as feature vectors and then combining them into new faces by using the FM and IS modules. This can be used either to replace components of existing faces with corresponding components from another source, or to combine components from multiple persons. Figure 20 presents several synthesized new faces obtained by re-combining the eyes, nose, mouth, and remainder region from four source sketches. Our image synthesis sub-network is able to resolve the inconsistencies between face components from different sources in terms of both lighting and shape.
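Under the same assumption of a render callable standing in for the trained FM and IS modules, the recombination itself reduces to swapping per-component feature vectors, as sketched below.

```python
def copy_paste_face(component_sources, encoders, render):
    """Face copy-paste (Section 5.2): recombine components from different subjects.

    component_sources : dict {component: sketch patch}, where each patch may
                        come from a different source sketch.
    encoders          : dict of trained CE encoders, patch -> (512,) feature vector.
    render            : callable mapping per-component feature vectors through
                        the FM and IS modules to a face image (name assumed).
    """
    feats = {c: encoders[c](patch) for c, patch in component_sources.items()}
    return render(feats)
```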
6 CONCLUSION AND DISCUSSIONS
In this paper we have presented a novel deep learning framework
for synthesizing realistic face images from rough and/or incomplete ACKNOWLEDGMENTS
freehand sketches. We take a local-to-global approach by first de- This work was supported by Beijing Program for International
composing a sketched face into components, refining its individual S&T Cooperation Project (No. Z191100001619003), Royal Society
components by projecting them to component manifolds defined by Newton Advanced Fellowship (No. NAF\R2\192151), Youth Inno-
the existing component samples in the feature spaces, mapping the vation Promotion Association CAS, CCF-Tencent Open Fund and
refined feature vectors to the feature maps for spatial combination, Open Project Program of the National Laboratory of Pattern Recog-
and finally translating the combined feature maps to realistic im- nition (NLPR) (No. 201900055). Hongbo Fu was supported by an
ages. This approach naturally supports local editing and makes the unrestricted gift from Adobe and grants from the Research Grants

Fig. 18. Two sequences of progressive sketching (under shadow guidance) and synthesis results.

Fig. 19. Examples of face morphing by interpolating the component-level feature vectors of two given face sketches (Leftmost and Rightmost are corresponding
synthesized images).

Fig. 20. In each set, we show a color image (Left) of the source sketches (not shown here), a new face sketch (Middle) obtained by directly recombining the cropped source sketches in the image domain, and a new face (Right) synthesized by our method with the recombined sketches of the cropped components (eyes, nose, mouth, and remainder) as input.

Fig. 21. A less successful example. The eyes in the generated image have different colors, and the sketched mouth is slightly below the expected position, leading to a blurry result for this component.

ACKNOWLEDGMENTS
This work was supported by Beijing Program for International S&T Cooperation Project (No. Z191100001619003), Royal Society Newton Advanced Fellowship (No. NAF\R2\192151), Youth Innovation Promotion Association CAS, CCF-Tencent Open Fund and Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (No. 201900055). Hongbo Fu was supported by an unrestricted gift from Adobe and grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CityU 11212119, 11237116), City University of Hong Kong (No. SRG 7005176), and the Centre for Applied Computing and Interactive Media (ACIM) of School of Creative Media, CityU.

REFERENCES
[1] James Arvo and Kevin Novins. 2000. Fluid Sketches: Continuous Recognition and Morphing of Simple Hand-Drawn Shapes. In Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology. ACM, 73–80.
[2] Martin Bichsel. 1996. Automatic interpolation and recognition of face images by morphing. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition. IEEE, 128–135.
[3] Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 187–194.
[4] John Canny. 1986. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8, 6 (1986), 679–698.
[5] Wengling Chen and James Hays. 2018. SketchyGAN: Towards diverse and realistic sketch to image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 9416–9425.
[6] Tali Dekel, Chuang Gan, Dilip Krishnan, Ce Liu, and William T Freeman. 2018. Sparse, smart contours to represent and edit images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3511–3520.
[7] Daniel Dixon, Manoj Prasad, and Tracy Hammond. 2010. iCanDraw: using sketch recognition and corrective feedback to assist a user in drawing human faces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 897–906.
[8] Lin Gao, Jie Yang, Tong Wu, Yu-Jie Yuan, Hongbo Fu, Yu-Kun Lai, and Hao (Richard) Zhang. 2019. SDM-NET: Deep Generative Network for Structured Deformable Mesh. ACM Trans. Graph. 38, 6 (2019), 243:1–243:15.
[9] Shiming Ge, Xin Jin, Qiting Ye, Zhao Luo, and Qiang Li. 2018. Image editing by object-aware optimal boundary searching and mixed-domain composition. Computational Visual Media 4 (2018). https://doi.org/10.1007/s41095-017-0102-8
[10] Arnab Ghosh, Richard Zhang, Puneet K Dokania, Oliver Wang, Alexei A Efros, Philip HS Torr, and Eli Shechtman. 2019. Interactive Sketch & Fill: Multiclass Sketch-to-Image Translation. In IEEE International Conference on Computer Vision (ICCV). IEEE, 1171–1180.
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Montreal, Canada) (NIPS'14). MIT Press, Cambridge, MA, USA, 2672–2680.
[12] Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, and Lu Yuan. 2019. Mask-Guided Portrait Editing with Conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3436–3445.
[13] Xiaoguang Han, Chang Gao, and Yizhou Yu. 2017. DeepSketch2Face: a deep learning based sketching system for 3D face and caricature modeling. ACM Trans. Graph. 36, 4, Article 126 (2017), 12 pages.
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 6626–6637.
[15] Rui Huang, Shu Zhang, Tianyu Li, and Ran He. 2017. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2439–2448.
[16] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision (ECCV). 172–189.
[17] Emmanuel Iarussi, Adrien Bousseau, and Theophanis Tsandilas. 2013. The drawing assistant: Automated drawing guidance and feedback from photographs. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology. ACM, 183–192.
[18] Takeo Igarashi, Satoshi Matsuoka, Sachiko Kawachiya, and Hidehiko Tanaka. 1997. Interactive Beautification: A Technique for Rapid Geometric Design. In Proceedings of the 10th Annual ACM Symposium on User Interface Software and Technology (UIST '97). Association for Computing Machinery, 105–114.
[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1125–1134.
[20] Youngjoo Jo and Jongyoul Park. 2019. SC-FEGAN: Face Editing Generative Adversarial Network with User's Sketch and Color. In IEEE International Conference on Computer Vision (ICCV). IEEE, 1745–1753.
[21] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV). Springer-Verlag, 694–711.
[22] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4401–4410.
[23] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
[24] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. 2019. MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. arXiv preprint arXiv:1907.11922 (2019).
[25] Yong Jae Lee, C Lawrence Zitnick, and Michael F Cohen. 2011. ShadowDraw: real-time user guidance for freehand drawing. ACM Trans. Graph. 30, 4, Article 27 (2011), 10 pages.
[26] Yuhang Li, Xuejin Chen, Feng Wu, and Zheng-Jun Zha. 2019. LinesToFacePhoto: Face Photo Generation From Lines With Conditional Self-Attention Generative Adversarial Networks. In Proceedings of the 27th ACM International Conference on Multimedia. ACM, 2323–2331.
[27] Alex Limpaecher, Nicolas Feltman, Adrien Treuille, and Michael Cohen. 2013. Real-time drawing assistance through crowdsourcing. ACM Trans. Graph. 32, 4, Article 54 (2013), 8 pages.
[28] Zongguang Lu, Yang Jing, and Qingshan Liu. 2017. Face image retrieval based on shape and texture feature fusion. Computational Visual Media 3, 4 (2017), 359–368. https://doi.org/10.1007/s41095-017-0091-7
[29] Yusuke Matsui, Takaaki Shiratori, and Kiyoharu Aizawa. 2016. DrawFromDrawings: 2D drawing assistance via stroke interpolation with a sketch database. IEEE Transactions on Visualization and Computer Graphics 23, 7 (2016), 1852–1862.
[30] Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
[31] Tiziano Portenier, Qiyang Hu, Attila Szabo, Siavash Arjomand Bigdeli, Paolo Favaro, and Matthias Zwicker. 2018. Faceshop: Deep sketch-based face image editing. ACM Trans. Graph. 37, 4, Article 99 (2018), 13 pages.
[32] Sam T Roweis and Lawrence K Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 5500 (2000), 2323–2326.
[33] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2234–2242.
[34] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. 2017. Scribbler: Controlling deep image synthesis with sketch and color. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 5400–5409.
[35] Edgar Simo-Serra, Satoshi Iizuka, Kazuma Sasaki, and Hiroshi Ishikawa. 2016. Learning to Simplify: Fully Convolutional Networks for Rough Sketch Cleanup. ACM Trans. Graph. 35, 4, Article 121 (2016), 11 pages.
[36] Qingkun Su, Wing Ho Andy Li, Jue Wang, and Hongbo Fu. 2014. EZ-sketching: three-level optimization for error-tolerant image tracing. ACM Trans. Graph. 33, 4, Article 54 (2014), 9 pages.
[37] Miao Wang, Guo-Ye Yang, Ruilong Li, Run-Ze Liang, Song-Hai Zhang, Peter M Hall, and Shi-Min Hu. 2019. Example-guided style-consistent image synthesis from semantic labeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1495–1504.
[38] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 8798–8807.
[39] Xiaogang Wang and Xiaoou Tang. 2008. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 11 (2008), 1955–1967.
[40] Di Wu and Qionghai Dai. 2009. Sketch realizing: lifelike portrait synthesis from sketch. In Proceedings of the 2009 Computer Graphics International Conference. ACM, 13–20.
[41] Jun Xie, Aaron Hertzmann, Wilmot Li, and Holger Winnemöller. 2014. PortraitSketch: face sketching assistance for novices. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. ACM, 407–417.
[42] Saining Xie and Zhuowen Tu. 2015. Holistically-Nested Edge Detection. In IEEE International Conference on Computer Vision (ICCV). IEEE, 1395–1403.
[43] Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul L Rosin. 2019. APDrawingGAN: Generating Artistic Portrait Drawings from Face Photos with Hierarchical GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 10743–10752.
[44] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. In IEEE International Conference on Computer Vision (ICCV). 2849–2857.
[45] Sheng You, Ning You, and Minxue Pan. 2019. PI-REC: Progressive Image Reconstruction Network With Edge and Color Domain. arXiv preprint arXiv:1903.10146 (2019).
[46] Wei Zhang, Xiaogang Wang, and Xiaoou Tang. 2011. Coupled information-theoretic encoding for face photo-sketch recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 513–520.
[47] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. 2016. Generative Visual Manipulation on the Natural Image Manifold. In European Conference on Computer Vision (ECCV).
[48] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV). 2223–2232.
