A way for online shoppers to virtually try out products is a sought-after technology that can create a more immersive shopping experience. Examples include realistically draping clothes on an image of the shopper or inserting pieces of furniture into images of the shopper’s living space.
In the clothing category, this problem is traditionally known as virtual try-on; we call the more general problem, which targets any category of product in any personal setting, the virtual try-all problem.
In a paper we recently posted on arXiv, we presented a solution to the virtual-try-all problem called Diffuse-to-Choose (DTC), a novel generative-AI model that allows users to seamlessly insert any product at any location in any scene.
The customer starts with an image of a personal scene and an image of a product and draws a mask in the scene to tell the model where the product should go. The model then integrates the item into the scene, with realistic angles, lighting, shadows, and so on. If necessary, the model infers new perspectives on the item, and it preserves the item’s fine-grained visual-identity details.
Diffuse-to-Choose: a new “virtual try-all” method that works with any product, in any personal setting, and enables precise control of the image regions to be modified.
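For a sense of what that three-input interface (scene image, mask, product reference) looks like in practice, here is a sketch using the diffusers library’s PaintByExamplePipeline, a publicly available image-conditioned inpainting model with the same kind of interface. It is a stand-in for illustration only, not the Diffuse-to-Choose model, and the file names are hypothetical.

```python
import torch
from PIL import Image
from diffusers import PaintByExamplePipeline

# Public image-conditioned inpainting model with the same three inputs
# (scene, mask, product reference). NOT Diffuse-to-Choose; illustration only.
pipe = PaintByExamplePipeline.from_pretrained(
    "Fantasy-Studio/Paint-by-Example", torch_dtype=torch.float16
).to("cuda")

scene = Image.open("living_room.jpg").convert("RGB").resize((512, 512))   # personal scene
mask = Image.open("user_mask.png").convert("RGB").resize((512, 512))      # white = region to fill
product = Image.open("lamp_reference.jpg").convert("RGB")                 # single 2-D product photo

result = pipe(image=scene, mask_image=mask, example_image=product).images[0]
result.save("scene_with_product.png")
```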
The Diffuse-to-Choose model has a number of characteristics that set it apart from existing work on related problems. To begin with, it is the first model to address the virtual-try-all problem rather than the virtual-try-on problem: it is a single model that works across a wide range of product categories. Second, it doesn’t require 3-D models or multiple views of the product, just a single 2-D reference image. Nor does it require sanitized, white-background, or professional-studio-grade images: it works with “in the wild” images, such as ordinary cellphone pictures. Finally, it is fast, cost-effective, and scalable, generating an image in approximately 6.4 seconds on a single AWS g5.xlarge instance (NVIDIA A10G with 24 GB of GPU memory).
Under the hood, Diffuse-to-Choose is an inpainting latent-diffusion model, with architectural enhancements that allow it to preserve products’ fine-grained visual details. A diffusion model is one that is incrementally trained to denoise increasingly noisy inputs, and a latent-diffusion model is one in which the denoising happens in the model’s representation (latent) space rather than in pixel space. Inpainting is a technique in which part of an image is masked and the model is trained to fill in (“inpaint”) the masked region with a realistic reconstruction, sometimes guided by a text prompt or a reference image.
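To make that training recipe concrete, here is a minimal PyTorch sketch of the standard noise-prediction objective used to train diffusion models, with a toy denoiser, random latents, and a toy noise schedule standing in for the real U-Net, VAE latents, and schedule. It illustrates the general technique, not DTC’s actual training code.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the denoising U-Net; a real model is also conditioned
# on the timestep, the mask, and (in DTC) the hint signal.
class TinyDenoiser(torch.nn.Module):
    def __init__(self, channels=4):
        super().__init__()
        self.net = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, noisy_latents, t):
        return self.net(noisy_latents)      # predicts the noise that was added

model = TinyDenoiser()
latents = torch.randn(8, 4, 64, 64)         # stand-in for VAE latents of training images
noise = torch.randn_like(latents)
t = torch.randint(0, 1000, (8,))            # random diffusion timesteps
alpha_bar = (torch.cos(t.float() / 1000 * torch.pi / 2) ** 2).view(-1, 1, 1, 1)  # toy schedule

# Forward process: mix signal and noise, with more noise at larger t.
noisy_latents = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise

# Training objective: predict the added noise (mean squared error).
loss = F.mse_loss(model(noisy_latents, t), noise)
loss.backward()
```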
Like most inpainting models, DTC uses an encoder-decoder model known as a U-Net to do the diffusion modeling. The U-Net’s encoder consists of a convolutional neural network, which divides the input image into small blocks of pixels and applies a battery of filters to each block, looking for particular image features. Each layer of the encoder steps down the resolution of the image representation; the decoder steps the resolution back up. (The U-shaped curve describing the resolution of the representation over successive layers gives the network its name.)
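The sketch below shows a toy version of that U shape, with a convolutional encoder level that halves the resolution, a decoder level that doubles it back, and a skip connection across the level. A real diffusion U-Net adds attention blocks, timestep conditioning, and many more levels.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: the encoder steps resolution down, the decoder steps it
    back up, and a skip connection carries encoder features across."""
    def __init__(self, in_ch=4, base=32):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, base, 3, padding=1)
        self.down = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)          # halves resolution
        self.mid = nn.Conv2d(base * 2, base * 2, 3, padding=1)
        self.up = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)   # doubles resolution
        self.dec = nn.Conv2d(base * 2, in_ch, 3, padding=1)

    def forward(self, x):
        e = torch.relu(self.enc(x))
        m = torch.relu(self.mid(self.down(e)))
        u = self.up(m)
        return self.dec(torch.cat([u, e], dim=1))   # skip connection from the encoder

out = TinyUNet()(torch.randn(1, 4, 64, 64))         # output matches the input's shape
```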
Our main innovation is to introduce a secondary U-Net encoder into the diffusion process. The input to this encoder is a rough copy-paste collage in which the product image, resized to match the scale of the background scene, has been pasted into the region masked by the customer. It’s a very crude approximation of the desired output, but the idea is that the encoding will preserve fine-grained details of the product image, which the final image reconstruction will incorporate.
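Constructing that collage is ordinary image manipulation: find the bounding box of the user’s mask, resize the product image to fit it, and paste it into the scene. A rough sketch of that preprocessing step, with hypothetical file names, might look like this (the exact resizing and alignment logic in DTC may differ):

```python
import numpy as np
from PIL import Image

scene = Image.open("scene.jpg").convert("RGB")              # hypothetical file names
product = Image.open("product.jpg").convert("RGB")
mask = np.array(Image.open("mask.png").convert("L")) > 127  # True inside the user-drawn mask

# Bounding box of the masked region.
ys, xs = np.where(mask)
top, left, bottom, right = ys.min(), xs.min(), ys.max(), xs.max()

# Resize the product to the mask's scale and paste it into the scene.
# The result is a crude collage whose only job is to carry the product's
# fine-grained details into the secondary encoder.
patch = product.resize((int(right - left + 1), int(bottom - top + 1)))
collage = scene.copy()
collage.paste(patch, (int(left), int(top)))
collage.save("hint_collage.png")
```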
We call the secondary encoder’s output a “hint signal”. Both it and the output of the primary U-Net’s encoder pass to a feature-wise linear-modulation (FiLM) module, which aligns the features of the two encodings. Then the encodings pass to the U-Net decoder.
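FiLM itself is a lightweight operation: a conditioning signal is mapped to a per-channel scale (gamma) and shift (beta) that modulate another feature map. Below is a minimal sketch of a classic FiLM layer applied to the hint and main features; the layer sizes and exact wiring are illustrative, not the model’s actual configuration.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Classic feature-wise linear modulation: pool the conditioning (hint)
    features, predict a per-channel scale (gamma) and shift (beta), and
    apply them to the main features."""
    def __init__(self, channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(channels, 2 * channels)

    def forward(self, main_feats, hint_feats):
        pooled = hint_feats.mean(dim=(2, 3))                      # one value per channel
        gamma, beta = self.to_gamma_beta(pooled).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return gamma * main_feats + beta

main = torch.randn(1, 64, 32, 32)   # features from the primary U-Net encoder
hint = torch.randn(1, 64, 32, 32)   # "hint signal" from the secondary encoder
aligned = FiLM(64)(main, hint)      # modulated features, passed on toward the decoder
```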
We trained Diffuse-to-Choose on AWS p4d.24xlarge instances (with NVIDIA A100 40GB GPUs), with a dataset of a few million pairs of public images. In experiments, we compared its performance on the virtual-try-all task to those of four different versions of a traditional image-conditioned inpainting model, and we compared it to the state-of-the-art model on the more-specialized virtual-try-on task.
In addition to human-based qualitative evaluation of similarity and semantic blending, we used two quantitative metrics to assess performance: the CLIP (contrastive language-image pretraining) score and the Fréchet inception distance (FID), which measures the realism and diversity of generated images. On the virtual-try-all task, DTC outperformed all four image-conditioned inpainting baselines on both metrics, improving on the best-performing baseline’s FID by 9%.
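Both metrics can be computed with off-the-shelf tooling; for example, the torchmetrics library implements FID, and an image-to-image CLIP similarity can be derived from CLIP embeddings via the transformers library. The sketch below, on dummy and hypothetical inputs, shows how such scores are typically computed; the exact CLIP-score variant used in our evaluation may differ.

```python
import torch
import torch.nn.functional as F
from torchmetrics.image.fid import FrechetInceptionDistance
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# FID compares Inception-feature statistics of real vs. generated images
# (lower is better). Requires torchmetrics' image extras (torch-fidelity).
fid = FrechetInceptionDistance(feature=64)
real = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)   # dummy real images
fake = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)   # dummy generated images
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# One common CLIP-based similarity: cosine similarity between CLIP image
# embeddings of the reference product and the generated image (hypothetical files).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
imgs = [Image.open("product.jpg"), Image.open("generated.png")]
feats = clip.get_image_features(**proc(images=imgs, return_tensors="pt"))
print("CLIP similarity:", F.cosine_similarity(feats[0:1], feats[1:2]).item())
```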
On the virtual-try-on task, DTC was comparable to the baseline: slightly higher in CLIP score (90.14 vs. 90.11) but also slightly higher in FID, where lower is better (5.39 vs. 5.28). Given DTC’s generality, however, performing comparably to a special-purpose model on its specialized task is a substantial achievement. Finally, we demonstrated that DTC’s results are comparable in quality to those of methods that are an order of magnitude more expensive because they rely on few-shot fine-tuning for every product, such as our previous DreamPaint method.