A product page in the Amazon Store will often include links to product variants, which differ by color, size, style, and so on. Sometimes, however, errors can creep into the product catalogue, resulting in links to unrelated products or duplicate listings, which can compromise customers’ shopping experiences.
At this year’s Winter Conference on Applications of Computer Vision (WACV), we presented a new method for automatically identifying errors in product variation listings, which uses computer vision to determine whether the products depicted in different images are identical or different.
We frame the task as a metric-learning problem, meaning that our machine learning model learns a function for measuring distances between vector representations of products in an embedding space. Embeddings of instances of the same product should be similar, while embeddings of different products should be dissimilar. Because the learned feature embedding typically generalizes well, the model can be applied to products unseen during training.
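To make the setup concrete, here is a minimal sketch of a contrastive-style metric-learning objective in PyTorch. The specific loss form, margin, and normalization are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_product, margin=0.5):
    """Pull embeddings of the same product together and push different
    products at least `margin` apart (an illustrative hinge-style loss)."""
    # Euclidean distance between L2-normalized embeddings
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    dist = (emb_a - emb_b).norm(dim=-1)

    pos_term = same_product * dist.pow(2)                         # same-product pairs: small distance
    neg_term = (1 - same_product) * F.relu(margin - dist).pow(2)  # different products: large distance
    return (pos_term + neg_term).mean()

# Usage: emb_a and emb_b are (batch, d) embeddings produced by the model;
# same_product is a float tensor of 1s (same product) and 0s (different products).
```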
Our model is multimodal, in that its inputs include a product image and the product title. The only supervision signal is the overarching product descriptor that encompasses all the variants.
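A training example can be pictured as a simple record like the one below. The field names are hypothetical placeholders, not the dataset's actual schema; the subcategory field is catalog metadata that comes into play later, when hard negative examples are sampled.

```python
from dataclasses import dataclass

@dataclass
class VariantExample:
    """One training example: an image and its title, labeled only by the
    overarching product descriptor shared by all variants. Field names are
    hypothetical placeholders, not the dataset's actual schema."""
    image_path: str        # product image
    title: str             # product title
    product_group_id: str  # overarching descriptor covering all variants
    subcategory: str       # catalog subcategory, used later for hard-negative sampling
```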
In experiments, we compared our model to a similarly multimodal benchmark model and found that it increased the area under the precision-recall curve (or PR-AUC, which evaluates the tradeoff between false positives and false negatives) by 5.2%.
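For reference, PR-AUC for pairwise same/different predictions can be computed roughly as follows, assuming scikit-learn; the paper's exact evaluation protocol may differ.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_auc(distances, labels):
    """Area under the precision-recall curve for pairwise predictions.
    `labels` are 1 for same-product pairs and 0 otherwise. Smaller distances
    should indicate 'same product', so pairs are scored by negated distance."""
    scores = -np.asarray(distances)
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)
```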
The approach
The purpose of using the product title is to guide the model toward learning more robust and relevant representations. For instance, the title provides context that helps the model focus on the relevant regions of the image, making it more robust to noisy backgrounds. It also helps resolve ambiguities that arise due to multiple objects appearing in the image.
The architecture
Our model has two branches, one global and one local. The global network takes the whole image as input, and based on the product title, it determines which portion of the image to focus on. That information is used to crop the input image, and the cropped image passes to the local branch.
The backbone of each branch is a convolutional neural network (CNN), a type of network commonly used in computer vision that applies a series of identical filters to portions of the image representation.
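The sketch below shows one way this global-to-local flow could be wired together in PyTorch, with ResNet-50 backbones standing in for the CNNs. The box-prediction head, crop resolution, and the way the title embedding conditions the crop are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms.functional as TF

class TwoBranchEncoder(nn.Module):
    """Global branch looks at the whole image and, conditioned on the title
    embedding, predicts a region of interest; the local branch re-encodes the
    crop. Backbone choice and the box-prediction head are assumptions."""
    def __init__(self, embed_dim=256, title_dim=256):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.global_cnn = nn.Sequential(*list(resnet.children())[:-2])  # conv feature map
        self.local_cnn = models.resnet50(weights=None)
        self.local_cnn.fc = nn.Linear(self.local_cnn.fc.in_features, embed_dim)
        self.box_head = nn.Linear(2048 + title_dim, 4)  # predicts (x, y, w, h) in [0, 1]

    def forward(self, image, title_emb):
        # Global branch: pooled CNN features plus the title decide where to crop
        feat_map = self.global_cnn(image)                  # (B, 2048, H', W')
        pooled = feat_map.mean(dim=(2, 3))                 # (B, 2048)
        box = torch.sigmoid(self.box_head(torch.cat([pooled, title_emb], dim=-1)))

        # Crop each image around its predicted box and encode with the local branch
        crops = []
        _, _, H, W = image.shape
        for img, (x, y, w, h) in zip(image, box):
            left, top = int(x * W), int(y * H)
            width, height = max(int(w * W), 1), max(int(h * H), 1)
            crops.append(TF.resized_crop(img, top, left, height, width, [224, 224]))
        local_emb = self.local_cnn(torch.stack(crops))     # (B, embed_dim)
        return local_emb
```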
Features extracted by the CNN are augmented by a self-attention mechanism, to better capture spatial dependencies. The augmented features then pass to spatial and channel attention layers. The spatial attention — i.e., “where to attend” — uses the title to attend over the relevant regions of the image. The channel attention — i.e., “what to attend” — emphasizes the relevant features of the image representation.
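Here is a compact sketch of what title-conditioned spatial and channel attention over a CNN feature map might look like; the projections and gating below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TitleGuidedAttention(nn.Module):
    """Spatial attention ('where to attend') and channel attention
    ('what to attend') conditioned on a title embedding. A minimal sketch;
    layer sizes and the exact gating are assumptions."""
    def __init__(self, channels=2048, title_dim=256):
        super().__init__()
        self.spatial_proj = nn.Linear(title_dim, channels)
        self.channel_proj = nn.Linear(title_dim, channels)

    def forward(self, feat_map, title_emb):
        B, C, H, W = feat_map.shape

        # Spatial attention: score each location by its similarity to the title
        query = self.spatial_proj(title_emb).view(B, C, 1, 1)
        spatial_logits = (feat_map * query).sum(dim=1, keepdim=True)       # (B, 1, H, W)
        spatial_weights = torch.softmax(
            spatial_logits.view(B, 1, -1), dim=-1).view(B, 1, H, W)
        attended = feat_map * spatial_weights

        # Channel attention: gate feature channels based on the title
        channel_gate = torch.sigmoid(self.channel_proj(title_emb)).view(B, C, 1, 1)
        return attended * channel_gate
```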
Both the spatial attention and the channel attention are based on a self-attentive embedding of the title information — that is, an embedding that weighs each word of the title in light of the other words.
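One simple form of such a self-attentive title embedding is sketched below: each word receives a learned weight, and the weighted word vectors are pooled into a single title vector. Dimensions and the scoring network are assumptions, and padding handling is omitted.

```python
import torch
import torch.nn as nn

class SelfAttentiveTitleEmbedding(nn.Module):
    """Weigh each word of the title in light of the others and pool the word
    vectors into one title embedding (a sketch in the spirit of self-attentive
    sentence embeddings; sizes are assumptions)."""
    def __init__(self, vocab_size, word_dim=300, hidden_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.scorer = nn.Sequential(
            nn.Linear(word_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, token_ids):                            # (B, T)
        words = self.word_emb(token_ids)                     # (B, T, word_dim)
        weights = torch.softmax(self.scorer(words), dim=1)   # (B, T, 1)
        return (weights * words).sum(dim=1)                  # (B, word_dim)
```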
We train using both positive and negative examples. For positive examples, we simply pair instances of the same overarching product descriptor.
In order for the model to learn efficiently, the negative examples have to be difficult: teaching the model to discriminate between, say, a shoe and a garden rake won’t help it distinguish between similar types of shoes. So for the negative examples, we pair products in the same subcategories. This results in a significant improvement in performance.
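The pairing strategy might be implemented along these lines, reusing the hypothetical record fields from the earlier sketch; the 50/50 positive/negative split is an arbitrary choice for illustration.

```python
import random
from collections import defaultdict

def sample_pairs(examples, num_pairs):
    """Build training pairs: positives share the same overarching product
    descriptor; negatives are different products drawn from the same
    subcategory, so they are hard to tell apart. `product_group_id` and
    `subcategory` are the hypothetical field names used above."""
    by_product = defaultdict(list)
    by_subcategory = defaultdict(list)
    for ex in examples:
        by_product[ex.product_group_id].append(ex)
        by_subcategory[ex.subcategory].append(ex)

    pairs = []
    for _ in range(num_pairs):
        anchor = random.choice(examples)
        same_group = [ex for ex in by_product[anchor.product_group_id] if ex is not anchor]
        if random.random() < 0.5 and same_group:
            # Positive: another instance of the same product
            pairs.append((anchor, random.choice(same_group), 1))
        else:
            # Hard negative: a different product from the same subcategory
            candidates = [ex for ex in by_subcategory[anchor.subcategory]
                          if ex.product_group_id != anchor.product_group_id]
            if candidates:
                pairs.append((anchor, random.choice(candidates), 0))
    return pairs
```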
To test our approach, we created a dataset consisting of images and titles from three different product categories. As baselines in our experiments, we used image-only models and a recent multimodal approach that uses product attributes to attend over images.
Compared to the image-only models, our approach yields a gain in PR-AUC of up to 17%. Compared to the multimodal benchmark, the improvement is 5.2%.