Toward Characteristic-Preserving Image-based
Virtual Try-On Network
Bochao Wang¹,², Huabin Zheng¹,², Xiaodan Liang¹*, Yimin Chen², Liang Lin¹,², and Meng Yang¹
¹ Sun Yat-sen University, China
² SenseTime Group Limited
{wangboch,zhhuab}@mail2.sysu.edu.cn, [email protected],
Abstract. Image-based virtual try-on systems, which fit a new in-shop clothing item onto a person image, have attracted increasing research attention, yet the task remains challenging. A desirable pipeline should not only transform the target clothes into the most fitting shape seamlessly but also preserve the identity of the clothes in the generated image, that is, the key characteristics (e.g. texture, logo, embroidery) that depict the original clothes. However, previous image-conditioned generation works fail to meet these critical requirements for plausible virtual try-on performance, since they cannot handle the large spatial misalignment between the input image and the target clothes. Prior work explicitly tackled spatial deformation using shape context matching, but failed to preserve clothing details due to its coarse-to-fine strategy. In this work, we propose a new fully-learnable Characteristic-Preserving Virtual Try-On Network (CP-VTON) for addressing all real-world challenges in this task. First, CP-VTON learns a thin-plate spline transformation for warping the in-shop clothes to fit the body shape of the target person via a new Geometric Matching Module (GMM), rather than computing correspondences of interest points as prior works did. Second, to alleviate boundary artifacts of the warped clothes and make the results more realistic, we employ a Try-On Module that learns a composition mask to integrate the warped clothes and the rendered image, ensuring smoothness. Extensive experiments on a fashion dataset demonstrate that our CP-VTON achieves state-of-the-art virtual try-on performance both qualitatively and quantitatively.
Keywords: Virtual Try-On · Characteristic-Preserving · Thin Plate
Spline · Image Alignment
1 Introduction
Online apparel shopping has huge commercial advantages compared to traditional shopping (e.g. time, choice, price) but lacks physical apprehension. To create a shopping environment close to reality, virtual try-on technology has attracted a lot of interest recently by delivering product information similar to that obtained from direct product examination. It allows users to see themselves wearing different clothes without the effort of changing them physically. This helps users quickly judge whether they like a garment or not and make buying decisions, and it improves the sales efficiency of retailers. The traditional pipeline is to use computer graphics to build 3D models and render the output images, since graphics methods provide precise control of geometric transformations and physical constraints. But these approaches require plenty of manual labor or expensive devices to collect the necessary information for building 3D models, as well as massive computation.

* The corresponding author is Xiaodan Liang.

Fig. 1. The proposed CP-VTON can generate more realistic image-based virtual try-on results that preserve well the key characteristics of the in-shop clothes, compared to the state-of-the-art VITON [10].
More recently, the image-based virtual try-on system [10], which does not resort to 3D information, provides a more economical solution and shows promising results by reformulating the task as a conditional image generation problem. Given two images, one of a person and the other of an in-shop clothing item, such a pipeline aims to synthesize a new image that meets the following requirements: a) the person is dressed in the new clothes; b) the original body shape and pose are retained; c) the clothing product is warped smoothly with high fidelity and seamlessly connected with the other parts; d) the characteristics of the clothing product, such as texture, logo and text, are well preserved, without any noticeable artifacts or distortions. Current research and advances in conditional image generation (e.g. image-to-image translation [12, 38, 5, 34, 20, 6]) make it seem a natural approach to this problem. Besides the common pixel-wise losses (e.g. L1 or L2 losses) and the perceptual loss [14], an adversarial loss [12] is used to alleviate the blurriness to some degree, but it still misses critical details. Furthermore, these methods can only handle tasks with roughly aligned input-output pairs and fail to deal with large transformations. Such limitations hinder their application to this challenging virtual try-on task in the wild. One reason is their poor capability of preserving details when facing large geometric changes, e.g. when conditioned on unaligned images [23]. The best practice in image-conditioned virtual try-on is still the two-stage pipeline VITON [10], but its results are far from the plausible and desired generation, as illustrated in Fig. 1. We argue that the main reason lies in its imperfect shape-context matching for aligning clothes and body shape, and its inferior appearance merging strategy.
To address the aforementioned challenges, we present a new image-based method, named the Characteristic-Preserving Image-based Virtual Try-On Network (CP-VTON), that achieves plausible try-on image synthesis while preserving clothing characteristics such as texture, logo and text. In particular, distinguished from the hand-crafted shape context matching, we propose a new learnable thin-plate spline transformation via a tailored convolutional neural network in order to align the in-shop clothes well with the target person. The network parameters are trained from paired images of in-shop clothes and a wearer, without the need for any explicit correspondences of interest points. Second, our model takes the aligned clothes and the clothing-agnostic yet descriptive person representation proposed in [10] as inputs, and generates a pose-coherent image and a composition mask which indicates which details of the aligned clothes are kept in the synthesized image. The composition mask tends to utilize the information of the aligned clothes and balances the smoothness of the synthesized image. Extensive experiments show that the proposed model handles large shape and pose transformations well and achieves state-of-the-art results on the dataset collected by Han et al. [10] for the image-based virtual try-on task.
Our contributions can be summarized as follows:
- We propose a new Characteristic-Preserving image-based Virtual Try-On Network (CP-VTON) that addresses the characteristic-preserving issue under the large spatial deformation challenge in the realistic virtual try-on task.
- Different from the hand-crafted shape context matching, our CP-VTON incorporates a fully learnable thin-plate spline transformation via a new Geometric Matching Module to obtain a more robust and powerful alignment.
- Given the aligned images, a new Try-On Module is used to dynamically merge the rendered results and the warped results.
- The significantly superior performance of our CP-VTON in the image-based virtual try-on task is demonstrated extensively by experiments on the dataset collected by Han et al. [10].
2 Related Work
2.1 Image synthesis
Generative adversarial networks (GANs) [9] aim to model the real image distribution by forcing the generated samples to be indistinguishable from real images. Conditional generative adversarial networks (cGANs) have shown impressive results on image-to-image translation, whose goal is to translate an input image from one domain to another [12, 38, 5, 34, 18, 19, 35]. Compared to the L1/L2 loss, which often leads to blurry images, the adversarial loss has become a popular choice for many image-to-image tasks. Recently, Chen and Koltun [3] suggested that the adversarial loss might be unstable for high-resolution image generation. We find that the adversarial loss brings little improvement to our model. In image-to-image translation tasks, there is an implicit assumption that the input and output are roughly aligned with each other and represent the same underlying structure. However, most of these methods have problems when dealing with large spatial deformations between the conditioned image and the target one. Most image-to-image translation tasks conditioned on unaligned images [10, 23, 37] adopt a coarse-to-fine manner to enhance the quality of the final results. To address the misalignment of conditioned images, Siarohin et al. [31] introduced deformable skip connections in GANs, using the correspondences of pose points. VITON [10] computes a shape context thin-plate spline (TPS) transformation [2] between the mask of the in-shop clothes and the predicted foreground mask. Shape context is a hand-crafted shape feature, and matching two shapes with it is time-consuming. Besides, the computed TPS transformations are vulnerable to errors in the predicted mask. Inspired by Rocco et al. [27], we design a convolutional neural network (CNN) to estimate a TPS transformation between the in-shop clothes and the target image without any explicit correspondences of interest points.
2.2 Person Image generation
Lassner et al. [17] introduced a generative model that can generate human parsing [8] maps and translate them into persons in clothing, but it is not clear how to control the generated fashion items. Zhao et al. [37] addressed the problem of generating multi-view clothing images based on a given clothing image of a certain view. PG2 [23] synthesizes person images in arbitrary poses, explicitly using the target pose as a condition. Siarohin et al. [31] dealt with the same task as PG2, but used the correspondences between the target pose and the pose of the conditional image. The fashion items generated in [37, 23, 31] are kept consistent with those in the conditional images. FashionGAN [39] changed the fashion items on a person and generated new outfits from text descriptions. The goal of virtual try-on is to synthesize a photo-realistic new image with a new piece of clothing product, while leaving out the effects of the old one. Yoo et al. [36] generated in-shop clothes conditioned on a person in clothing, rather than the reverse.
2.3 Virtual Try-on System
Most virtual try-on works are based on graphics models. Sekine et al. [30] introduced a virtual fitting system that captures 3D measurements of body shape. Chen et al. [4] used a SCAPE [1] body model to generate synthetic people. Pons-Moll et al. [26] used a 3D scanner to automatically capture real clothing and estimate body shape and pose. Compared to graphics models, image-based generative models are more computationally efficient. Jetchev and Bergmann [13] proposed a conditional analogy GAN to swap fashion articles, without any other descriptive person representation. They did not take pose variation into consideration, and during inference they require paired images of in-shop clothes and a wearer, which limits their practical scenarios. The most related work is VITON [10]. Both VITON and our work aim to synthesize photo-realistic images directly from 2D images. VITON addressed this problem with a coarse-to-fine framework and expected to capture the cloth deformation with a shape context TPS transformation. We propose an alignment network and a single-pass generative framework, which preserves the characteristics of the in-shop clothes.

Fig. 2. An overview of our CP-VTON, containing two main modules. (a) Geometric Matching Module: the in-shop clothes c and the input person representation p are aligned via a learnable matching module. (b) Try-On Module: it generates a composition mask M and a rendered person I_r. The final result I_o is composed of the warped clothes ĉ and the rendered person I_r via the composition mask M.
3 Characteristic-Preserving Virtual Try-On Network
We address the task of image-based virtual try-on as a conditional image generation problem. Generally, given a reference image I_i of a person wearing clothes c_i and a target clothing item c, the goal of CP-VTON is to synthesize a new image I_o of the wearer in the new clothes c, in which the body shape and pose of I_i are retained, the characteristics of the target clothes c are preserved, and the effects of the old clothes c_i are eliminated.
Training with sample triplets (I_i, c, I_t), where I_t is the ground truth of I_o and c is coupled with I_t (worn as c_t), is straightforward but undesirable in practice, because such triplets are difficult to collect. It is easier if I_i is the same as I_t, which means that (c, I_t) pairs are enough; such pairs are abundant on shopping websites. But directly training on (I_t, c, I_t) harms the model's generalization ability at the testing phase, when only decoupled inputs (I_i, c) are available. Prior work [10] addressed this dilemma by constructing a clothing-agnostic person representation p to eliminate the effects of the source clothing item c_i. With (I_t, c, I_t) transformed into a new triplet form (p, c, I_t), the training and testing phases are unified. We adopt this representation in our method and further enhance it by eliminating less information from the reference person image. Details are described in Sec. 3.1.

One of the challenges of image-based virtual try-on lies in the large spatial misalignment between the in-shop clothing item and the wearer's body. Existing network architectures for conditional image generation (e.g. FCN [21], UNet [28], ResNet [11]) lack the ability to handle large spatial deformation, leading to blurry try-on results. We propose a Geometric Matching Module (GMM) to explicitly align the input clothes c with the aforementioned person representation p and produce a warped clothes image ĉ. GMM is an end-to-end neural network directly trained with a pixel-wise L1 loss. Sec. 3.2 gives the details. Sec. 3.3 completes our virtual try-on pipeline with a characteristic-preserving Try-On Module. The Try-On Module synthesizes the final try-on result I_o by fusing the warped clothes ĉ and the rendered person image I_r. The overall pipeline is depicted in Fig. 2.
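For concreteness, the following sketch (PyTorch-style pseudocode; the interfaces of gmm and tom, the two modules detailed in Sec. 3.2 and Sec. 3.3, are our own assumption rather than the released implementation) chains the two stages at inference time:

import torch

def cp_vton_infer(gmm, tom, p, c):
    # p: cloth-agnostic person representation (B, 22, 256, 192); c: in-shop clothes image
    with torch.no_grad():
        c_hat = gmm(p, c)                                   # warped clothes (GMM, Sec. 3.2)
        rendered, mask = tom(torch.cat([p, c_hat], dim=1))  # I_r and M (Try-On Module, Sec. 3.3)
        return mask * c_hat + (1 - mask) * rendered         # final try-on result I_o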
3.1 Person Representation
The original cloth-agnostic person representation [10] aims at leaving out the effects of the old clothes c_i, such as their color, texture and shape, while preserving as much information of the input person I_i as possible, including the person's face, hair, body shape and pose. It contains three components:

- Pose heatmap: an 18-channel feature map with each channel corresponding to one human pose keypoint, drawn as an 11 × 11 white rectangle.
- Body shape: a 1-channel feature map of a blurred binary mask that roughly covers different parts of the human body.
- Reserved regions: an RGB image that contains the reserved regions for maintaining the identity of the person, including face and hair.

These feature maps are all scaled to a fixed resolution of 256 × 192 and concatenated together to form the cloth-agnostic person representation map p of k channels, where k = 18 + 1 + 3 = 22. We utilize this representation in both our matching module and our try-on module.
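A minimal sketch of assembling this representation, assuming the per-component tensors have already been rendered at 256 × 192 (the shapes and helper name are illustrative, not taken from the paper's code):

import torch

H, W = 256, 192

def build_person_representation(pose_heatmap, body_shape, reserved_rgb):
    # pose_heatmap: (18, H, W), one channel per keypoint drawn as an 11x11 white rectangle
    # body_shape:   (1, H, W), blurred binary body mask
    # reserved_rgb: (3, H, W), RGB image of the reserved face and hair regions
    p = torch.cat([pose_heatmap, body_shape, reserved_rgb], dim=0)
    assert p.shape == (22, H, W)  # k = 18 + 1 + 3 = 22 channels
    return p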
3.2 Geometric Matching Module
The classical approach to the geometry estimation task of image matching consists of three stages: (1) local descriptors (e.g. shape context [2], SIFT [22]) are extracted from both input images; (2) the descriptors are matched across images to form a set of tentative correspondences; (3) these correspondences are used to robustly estimate the parameters of a geometric model using RANSAC [7] or Hough voting [16, 22].

Rocco et al. [27] mimic this process with differentiable modules so that it can be trained end-to-end for geometry estimation tasks. Inspired by this work, we design a new Geometric Matching Module (GMM) to transform the target clothes c into warped clothes ĉ which are roughly aligned with the input person representation p. As illustrated in Fig. 2, our GMM consists of four parts: (1) two networks for extracting high-level features of p and c respectively; (2) a correlation layer to combine the two features into a single tensor as input to the regressor network; (3) the regression network for predicting the spatial transformation parameters θ; (4) a Thin-Plate Spline (TPS) transformation module T for warping an image into the output ĉ = T_θ(c). The pipeline is end-to-end learnable and trained with sample triplets (p, c, c_t), under a pixel-wise L1 loss between the warped result ĉ and the ground truth c_t, where c_t is the clothes worn on the target person in I_t:

L_GMM(θ) = ||ĉ - c_t||_1 = ||T_θ(c) - c_t||_1    (1)

The key differences between our approach and Rocco et al. [27] are three-fold. First, we train from scratch rather than using a pretrained VGG network. Second, our training ground truths are acquired from the wearer's real clothes rather than synthesized by simulated warping. Most importantly, our GMM is directly supervised by the pixel-wise L1 loss between the warping output and the ground truth.
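The sketch below illustrates the four-part GMM forward pass under our own assumptions (the feature normalization in the correlation layer and the module interfaces follow Rocco et al. [27] in spirit, but the details are not taken from the released CP-VTON code):

import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation(feat_p, feat_c):
    # Combine two feature maps (B, C, H, W) into a (B, H*W, H, W) tensor of
    # normalized inner products between all pairs of spatial locations.
    b, c, h, w = feat_p.shape
    fp = F.normalize(feat_p, dim=1).view(b, c, h * w)
    fc = F.normalize(feat_c, dim=1).view(b, c, h * w)
    corr = torch.bmm(fc.transpose(1, 2), fp)       # (B, HW, HW)
    return corr.view(b, h * w, h, w)

class GMM(nn.Module):
    def __init__(self, extractor_p, extractor_c, regressor, tps_warp):
        super().__init__()
        self.extractor_p = extractor_p   # (1) encodes person representation p
        self.extractor_c = extractor_c   # (1) encodes in-shop clothes c
        self.regressor = regressor       # (3) predicts TPS parameters theta
        self.tps_warp = tps_warp         # (4) differentiable TPS warp T_theta

    def forward(self, p, c):
        theta = self.regressor(correlation(self.extractor_p(p), self.extractor_c(c)))  # (2)+(3)
        return self.tps_warp(c, theta)   # warped clothes c_hat = T_theta(c)

# Training supervises the warp directly with the pixel-wise L1 loss of Eq. (1):
# loss = F.l1_loss(gmm(p, c), c_t)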
3.3 Try-on Module
Now that the warped clothes ĉ are roughly aligned with the body shape of the target person, the goal of our Try-On Module is to fuse ĉ with the target person and synthesize the final try-on result.

One straightforward solution is to directly paste ĉ onto the target person image I_t. This has the advantage that the characteristics of the warped clothes are fully preserved, but it leads to an unnatural appearance at the boundary regions of the clothes and undesirable occlusion of some body parts (e.g. hair, arms). Another solution, widely adopted in conditional image generation, is to translate inputs to outputs by a single forward pass of an encoder-decoder network such as UNet [28], which is desirable for rendering seamless, smooth images. However, it is impossible to perfectly align clothes with the target body shape. Lacking explicit spatial deformation ability, even minor misalignment makes the UNet-rendered output blurry.

Our Try-On Module aims to combine the advantages of both approaches. As illustrated in Fig. 2, given the concatenated input of the person representation p and the warped clothes ĉ, a UNet simultaneously renders a person image I_r and predicts a composition mask M. The rendered person I_r and the warped clothes ĉ are then fused together using the composition mask M to synthesize the final try-on result I_o:

I_o = M ⊙ ĉ + (1 - M) ⊙ I_r    (2)

where ⊙ denotes element-wise matrix multiplication.

At the training phase, given the sample triplets (p, c, I_t), the goal of the Try-On Module is to minimize the discrepancy between the output I_o and the ground truth I_t. We adopt the widely used strategy in conditional image generation of combining an L1 loss with the VGG perceptual loss [14], where the VGG perceptual loss is defined as:

L_VGG(I_o, I_t) = Σ_{i=1}^{5} λ_i ||φ_i(I_o) - φ_i(I_t)||_1    (3)

where φ_i(I) denotes the feature map of image I at the i-th layer of the visual perception network φ, a VGG19 [32] pre-trained on ImageNet. The layers i = 1, ..., 5 stand for 'conv1_2', 'conv2_2', 'conv3_2', 'conv4_2', 'conv5_2', respectively.

Towards our goal of characteristic preservation, we bias the composition mask M to select the warped clothes as much as possible by applying an L1 regularization ||1 - M||_1 on M. The overall loss function for the Try-On Module (TOM) is:

L_TOM = λ_L1 ||I_o - I_t||_1 + λ_vgg L_VGG(I_o, I_t) + λ_mask ||1 - M||_1.    (4)
4 Experiments and Analysis
4.1 Dataset
We conduct all our experiments on the dataset collected by Han et al. [10]. It contains around 19,000 front-view image pairs of women and top clothing items. There are 16,253 cleaned pairs, which are split into a training set and a validation set with 14,221 and 2,032 pairs, respectively. We rearrange the images in the validation set into unpaired person-clothes combinations as the testing set.
4.2 Quantitative Evaluation
We evaluate the quantitative performance of different virtual try-on methods via a human subjective perceptual study. The Inception Score (IS) [29] is usually used to quantitatively evaluate image synthesis quality, but it is not suitable for this task because it cannot reflect whether details are preserved, as described in [10]. Since we are interested in characteristic preservation, we focus on clothes with rich details instead of evaluating on the whole testing set. For simplicity, we measure the detail richness of a clothing image by its total variation (TV) norm. This is appropriate for this dataset since the background is a pure color and the TV norm is contributed only by the clothes themselves, as illustrated in Fig. 3. We extracted the 50 testing pairs with the largest clothing TV norm, named LARGE, to evaluate the characteristic preservation of our method, and the 50 pairs with the smallest TV norm, named SMALL, to ensure that our method performs at least as well as previous state-of-the-art methods in simpler cases.

Fig. 3. From top to bottom, the TV norm values increase. Each row shows some clothes at the same level.
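A possible way to compute this selection is sketched below; the anisotropic TV norm is our assumption of the exact formulation, not a detail given in the paper.

import numpy as np

def tv_norm(img):
    # img: (H, W, 3) array in [0, 1]; sum of absolute differences between
    # vertically and horizontally neighboring pixels.
    dv = np.abs(img[1:, :, :] - img[:-1, :, :]).sum()
    dh = np.abs(img[:, 1:, :] - img[:, :-1, :]).sum()
    return dv + dh

# ranked = sorted(clothing_images, key=tv_norm)
# SMALL, LARGE = ranked[:50], ranked[-50:]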
We conducted pairwise A/B tests on the Amazon Mechanical Turk (AMT) platform. Specifically, given a person image and a target clothing image, a worker is asked to select which of the two virtual try-on results from different methods is more realistic and preserves more details of the target clothes. There is no time limit for these jobs, and each job is assigned to 4 different workers. The human evaluation metric is computed in the same way as in [10].
4.3 Implementation Details
Training Setup In all experiments, we use λ_L1 = λ_vgg = 1. When the composition mask is used, we set λ_mask = 1. We train both the Geometric Matching Module and the Try-On Module for 200K steps with batch size 4. We use the Adam [15] optimizer with β_1 = 0.5 and β_2 = 0.999. The learning rate is first fixed at 0.0001 for 100K steps and then linearly decays to zero over the remaining steps. All input images are resized to 256 × 192 and the output images have the same resolution.
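A sketch of this schedule in PyTorch (an assumed reconstruction of the setup described above, not the released training script):

import torch

def make_optimizer_and_scheduler(model, total_steps=200_000, keep_steps=100_000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
    def lr_lambda(step):
        # constant for the first keep_steps, then linear decay to zero
        if step < keep_steps:
            return 1.0
        return max(0.0, 1.0 - (step - keep_steps) / float(total_steps - keep_steps))
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched  # call sched.step() once per training step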
Geometric Matching Module The feature extraction networks for the person representation and the clothes have a similar structure, containing four 2-strided down-sampling convolutional layers succeeded by two 1-strided ones, with 64, 128, 256, 512 and 512 filters, respectively. The only difference is the number of input channels. The regression network contains two 2-strided convolutional layers, two 1-strided ones and one fully-connected output layer; the numbers of filters are 512, 256, 128, 64. The fully-connected layer predicts the x- and y-coordinate offsets of the TPS anchor points and thus has an output size of 2 × 5 × 5 = 50.
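A rough sketch of these two sub-networks follows. Only the layer and filter counts come from the text above; kernel sizes, padding, normalization, the output activation, and assigning 512 filters to both 1-strided layers of the extractor are our assumptions.

import torch.nn as nn

def conv(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def feature_extractor(in_channels):
    # four 2-strided down-sampling convs, then two 1-strided convs
    layers, cin = [], in_channels
    for cout in (64, 128, 256, 512):
        layers.append(conv(cin, cout, stride=2))
        cin = cout
    layers += [conv(cin, 512, stride=1), conv(512, 512, stride=1)]
    return nn.Sequential(*layers)

def regressor(in_channels):
    # two 2-strided convs, two 1-strided convs, then an FC layer that outputs
    # the 2 x 5 x 5 = 50 x/y offsets of the TPS anchor points
    return nn.Sequential(conv(in_channels, 512, 2), conv(512, 256, 2),
                         conv(256, 128, 1), conv(128, 64, 1),
                         nn.Flatten(), nn.LazyLinear(50), nn.Tanh())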
Try-On Module We use a 12-layer UNet with six 2-strided down-sampling convolutional layers and six up-sampling layers. To alleviate the so-called "checkerboard artifacts", we replace the 2-strided deconvolutional layers normally used for up-sampling with the combination of nearest-neighbor interpolation layers and 1-strided convolutional layers, as suggested by [25]. The numbers of filters for the down-sampling convolutional layers are 64, 128, 256, 512, 512, 512. The numbers of filters for the up-sampling convolutional layers are 512, 512, 256, 128, 64, 4. Each convolutional layer is followed by an Instance Normalization layer [33] and a Leaky ReLU [24] with slope 0.2.

Fig. 4. Matching results of SCMM and GMM. Warped clothes are directly pasted onto target persons for visual checking. Our method is comparable with SCMM and produces less weird results.
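One up-sampling block of this design might look as follows (the exact layer arrangement is assumed, not taken from the released code):

import torch.nn as nn

def upsample_block(in_channels, out_channels, negative_slope=0.2):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),  # interpolation instead of a strided deconvolution
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm2d(out_channels),
        nn.LeakyReLU(negative_slope),
    )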
4.4 Comparison of Warping Results
The Shape Context Matching Module (SCMM) uses hand-crafted descriptors and explicitly computes their correspondences using an iterative algorithm, which is time-consuming, while GMM runs much faster. On average, processing a sample pair takes GMM 0.06 s on a GPU and 0.52 s on a CPU, while SCMM takes 2.01 s on a CPU.
Qualitative results Fig. 4 shows a qualitative comparison of SCMM and GMM. Both modules are able to roughly align the clothes with the target person's pose. However, SCMM tends to overly shrink a long sleeve into a "thin band", as shown in the 6th column of Fig. 4. This is because SCMM merely relies on matched shape context descriptors on the boundary of the clothing shape while ignoring its internal structure. Once there are incorrect correspondences between descriptors, the warping results become weird. In contrast, GMM takes full advantage of the learned rich representations of clothes and person images to determine the TPS transformation parameters and is more robust to large shape differences.

Fig. 5. Qualitative comparisons of VITON and CP-VTON. Our CP-VTON successfully preserves key details of the in-shop clothes.
Quantitative results It is difficult to directly evaluate the quantitative performance of the matching modules due to the lack of ground truth in the testing phase. Nevertheless, we can simply paste the warped clothes onto the original person image, as in the non-parametric warped synthesis method in [10]. We conduct a perceptual user study following the protocol described in Sec. 4.2 for these two warped synthesis methods. The results synthesized by GMM are rated more realistic in 49.5% and 42.0% of comparisons for LARGE and SMALL respectively, which indicates that GMM is comparable to SCMM for shape alignment.
4.5 Comparison of Try-on Results
Qualitative results Fig. 5 shows that our pipeline performs roughly the same as VITON when the patterns of the target clothes are simple. However, our pipeline preserves sharp and intact characteristics on clothes with rich details (e.g. texture, logo, embroidery), while VITON produces blurry results.
We argue that the failure of VITON lies in its coarse-to-fine strategy and its imperfect matching module. Precisely, VITON first learns to synthesize a coarse person image, then aligns the clothes with the target person via shape context matching, then produces a composition mask for fusing the UNet-rendered person with the warped clothes, and finally produces a refined result. After extensive training, the rendered person image already has a small VGG perceptual loss with respect to the ground truth. On the other hand, the imperfect matching module introduces unavoidable minor misalignment between the warped clothes and the ground truth, making the warped clothes unfavorable to the perceptual loss. Taken together, when further refined by the truncated perceptual loss, the composition mask is biased towards selecting the rendered person image rather than the warped clothes, despite the regularization on the composition mask (Eq. 4). The "ragged" VITON masks shown in Fig. 6 confirm this argument.

Table 1. Results of pairwise comparisons of images synthesized with LARGE and SMALL clothes by different models. Each column compares our approach with one of the baselines. Higher is better. The random chance is 50%.

Data    VITON    CP-VTON (w/o mask)    CP-VTON (w/o L1 loss)
LARGE   67.5%    72.5%                 84.5%
SMALL   55.0%    42.0%                 38.5%

Fig. 6. An example of VITON stage II. The composition mask tends to ignore the details of the coarsely aligned clothes.
Our pipeline does not address the aforementioned issue by improving the matching results; rather, it sidesteps the issue by simultaneously learning to produce a UNet-rendered person image and a composition mask. Before the rendered person image becomes favorable to the loss function, the central clothing region of the composition mask is biased towards the warped clothes, because they agree more with the ground truth in the early training stage. It is thus the warped clothes, rather than the rendered person image, that take the early advantage in the competition for mask selection. After that, the UNet learns to adaptively expose the regions where UNet rendering is more suitable than direct pasting. Once the regions of hair and arms are exposed, they are rendered and seamlessly fused with the warped clothes.
Quantitative results The first column of Table 1 shows that our pipeline surpasses VITON in preserving the details of clothes when using an identical person representation. According to the table, our approach performs better than the other methods when dealing with clothes that have rich details.
4.6 Discussion and Ablation Studies
Effects of composition mask To empirically justify the design of the composition mask and the mask L1 regularization (Eq. 4) in our pipeline, we compare it with two variants in ablation studies: (1) the mask composition is removed and the final results are directly rendered by the UNet, denoted CP-VTON (w/o mask); (2) the mask composition is used but the mask L1 regularization is removed, denoted CP-VTON (w/o L1 loss).

As shown in Fig. 7, even though the warped clothes are roughly aligned with the target person, CP-VTON (w/o mask) still loses characteristic details and produces blurry results. This verifies that encoder-decoder network architectures like UNet fail to handle even minor spatial deformation.

Though integrated with mask composition, CP-VTON (w/o L1 loss) performs as poorly as the variant CP-VTON (w/o mask). Fig. 7 shows that, without L1 regularization, the composition mask tends to select the rendered person image. This verifies that even minor misalignment introduces large perceptual disagreement between the warped clothes and the ground truth.

Fig. 7. Ablation studies on the composition mask and the mask L1 loss. Without mask composition, UNet cannot handle even minor misalignment well and produces undesirable try-on results. Without L1 regularization on the mask, it tends to select the UNet-rendered person, leading to blurry results as well.
Robustness against minor misalignment In Sec. 4.5 we argue that VITON is vulnerable to minor misalignment due to its coarse-to-fine strategy, while our pipeline sidesteps imperfect alignment by simultaneously producing a rendered person and a composition mask. This is further verified below under a controlled condition with simulated warped clothes.

Specifically, rather than the real warped clothes produced by the matching module, we use the worn clothes extracted from person images to simulate perfect alignment results. We then train VITON stage II, our proposed variant CP-VTON (w/o mask), and our full pipeline. For VITON stage II, we synthesize the coarse person image with its source code and released model checkpoint.

It is predictable that with this "perfect matching module", all three methods achieve excellent performance in the training and validation phases, where input samples are paired. Then comes the interesting part: what if the perfect alignment is randomly perturbed within a range of N pixels to simulate an imperfect matching module? As the perturbation grows (N = 0, 5, 10, 15, 20), how fast does the try-on performance decay?

These questions are answered in Fig. 8. As we apply greater perturbation, the performance of both VITON stage II and CP-VTON (w/o mask) decays quickly. In contrast, our pipeline is robust against the perturbation and manages to preserve the detailed characteristics.

Fig. 8. Comparison of the robustness of three methods against minor misalignment simulated by random shifts within radius N. As N increases, the results of CP-VTON degrade more gracefully than those of the other methods.

Fig. 9. Some failure cases of our CP-VTON.
Failure cases Fig. 9 shows three failure cases of our CP-VTON method, caused by (1) improperly preserved shape information of the old clothes, (2) rare poses, and (3) the inner side of the clothes being indistinguishable from the outer side, respectively.
5 Conclusions
In this paper, we propose a fully learnable image-based virtual try-on pipeline
towards the characteristic-preserving image generation, named as CP-VTON,
including a new geometric matching module and a try-on module with the
new merging strategy. The geometric matching module aims at aligning in-shop
clothes and target person body with large spatial displacement. Given aligned
clothes, the try-on module learns to preserve well the detailed characteristic of
clothes. Extensive experiments show the overall CP-VTON pipeline produces
high-fidelity virtual try-on results that retain well key characteristics of in-shop
clothes. Our CP-VTON achieves state-of-the-art performance on the dataset
collected by Han et al. [10] both qualitatively and quantitatively.
References
1. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: Shape completion and animation of people. In: ACM Transactions on Graphics (TOG). vol. 24, pp. 408–416. ACM (2005)
2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
3. Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: The IEEE International Conference on Computer Vision (ICCV). vol. 1 (2017)
4. Chen, W., Wang, H., Li, Y., Su, H., Wang, Z., Tu, C., Lischinski, D., Cohen-Or, D., Chen, B.: Synthesizing training images for boosting human 3D pose estimation. In: 3D Vision (3DV), 2016 Fourth International Conference on. pp. 479–488. IEEE (2016)
5. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint arXiv:1711.09020 (2017)
6. Deng, Z., Zhang, H., Liang, X., Yang, L., Xu, S., Zhu, J., Xing, E.P.: Structured generative adversarial networks. In: Advances in Neural Information Processing Systems. pp. 3899–3909 (2017)
7. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. In: Readings in Computer Vision, pp. 726–740. Elsevier (1987)
8. Gong, K., Liang, X., Shen, X., Lin, L.: Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. arXiv preprint arXiv:1703.05446 (2017)
9. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
10. Han, X., Wu, Z., Wu, Z., Yu, R., Davis, L.S.: VITON: An image-based virtual try-on network. arXiv preprint arXiv:1711.08447 (2017)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
12. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint (2017)
13. Jetchev, N., Bergmann, U.: The conditional analogy GAN: Swapping fashion articles on people images. arXiv preprint arXiv:1709.04695 (2017)
14. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV. pp. 694–711 (2016)
15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
16. Lamdan, Y., Schwartz, J.T., Wolfson, H.J.: Object recognition by affine invariant matching. In: Computer Vision and Pattern Recognition, 1988. Proceedings CVPR '88., Computer Society Conference on. pp. 335–344. IEEE (1988)
17. Lassner, C., Pons-Moll, G., Gehler, P.V.: A generative model of people in clothing. arXiv preprint arXiv:1705.04098 (2017)
18. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection. In: IEEE CVPR (2017)
19. Liang, X., Lee, L., Dai, W., Xing, E.P.: Dual motion GAN for future-flow embedded video prediction. In: IEEE International Conference on Computer Vision (ICCV). vol. 1 (2017)
20. Liang, X., Zhang, H., Xing, E.P.: Generative semantic manipulation with contrasting GAN. arXiv preprint arXiv:1708.00315 (2017)
21. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. pp. 3431–3440 (2015)
22. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
23. Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided person image generation. In: Advances in Neural Information Processing Systems. pp. 405–415 (2017)
24. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML. vol. 30, p. 3 (2013)
25. Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill 1(10), e3 (2016)
26. Pons-Moll, G., Pujades, S., Hu, S., Black, M.J.: ClothCap: Seamless 4D clothing capture and retargeting. ACM Transactions on Graphics (TOG) 36(4), 73 (2017)
27. Rocco, I., Arandjelovic, R., Sivic, J.: Convolutional neural network architecture for geometric matching. In: Proc. CVPR. vol. 2 (2017)
28. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
29. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: NIPS. pp. 2234–2242 (2016)
30. Sekine, M., Sugita, K., Perbet, F., Stenger, B., Nishiyama, M.: Virtual fitting by single-shot body shape estimation. In: Int. Conf. on 3D Body Scanning Technologies. pp. 406–413. Citeseer (2014)
31. Siarohin, A., Sangineto, E., Lathuiliere, S., Sebe, N.: Deformable GANs for pose-based human image generation. arXiv preprint arXiv:1801.00055 (2017)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
33. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In: Proc. CVPR (2017)
34. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585 (2017)
35. Yang, L., Liang, X., Xing, E.: Unsupervised real-to-virtual domain unification for end-to-end highway driving. arXiv preprint arXiv:1801.03458 (2018)
36. Yoo, D., Kim, N., Park, S., Paek, A.S., Kweon, I.S.: Pixel-level domain transfer. In: European Conference on Computer Vision. pp. 517–532. Springer (2016)
37. Zhao, B., Wu, X., Cheng, Z.Q., Liu, H., Feng, J.: Multi-view image generation from a single-view. arXiv preprint arXiv:1704.04886 (2017)
38. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593 (2017)
39. Zhu, S., Fidler, S., Urtasun, R., Lin, D., Loy, C.C.: Be your own Prada: Fashion synthesis with structural coherence. arXiv preprint arXiv:1710.07346 (2017)