Amazon has a range of ways to package product shipments: bags, padded mailers, T-folder boxes (the classic Amazon book box), carton boxes, and so on. The best package for a given product will optimize the trade-off between shipping costs (more elaborate packaging costs more) and the costs associated with the return of damaged products.
At this year’s European Conference on Machine Learning, my colleagues and I presented a new model for determining the best way to package a given product. The model has been applied to hundreds of thousands of Amazon packages, reducing shipment damage by 24% while actually cutting shipping costs by 5%.
The problem we address has certain structural features that make standard machine learning methods impractical.
The first is a lack of ground-truth data. If we had abundant examples of how the same product fared — what the damage rates were — with each of the eight package types we consider, we could train a standard, supervised machine learning model to decide a package type on the basis of product features.
But we don’t have those examples: most products are shipped in only one or two package types, and damage is rare.
The other feature is that we want our model to output probabilities that respect the ordinality of the package types. That is, we’d like the model to predict higher probabilities of damage for less-expensive (less-robust) packaging options and lower probabilities of damage for more-expensive (more-robust) options. Enforcing ordinality is not something that standard machine learning techniques do naturally.
So instead, we use a simple linear model, with carefully designed constraints on the model parameters to impose ordinality. The model performs a set of arithmetic operations on vectors representing product features, yielding a score that indicates probability of damage for each combination of product and package type. The product features include things like product title, category, subcategory, dimensions, weight, the difference between the package volume and the product volume, and whether or not the product is fragile or liquid or involves hazardous materials.
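The paper spells out the exact parameterization, but the basic mechanism can be sketched in a few lines. In the illustrative version below (the offset scheme, feature encoding, and function names are our own, not the paper's), a single linear score shared across package types is shifted by cumulative, nonnegative offsets, one per package tier; because the offsets can only grow with robustness, the predicted damage probability can only shrink:

```python
import numpy as np

def damage_probabilities(x, w, deltas):
    """Illustrative ordinal linear scorer (a sketch, not the paper's exact model).

    x      -- feature vector for one product, shape (d,)
    w      -- linear weights shared across package types, shape (d,)
    deltas -- nonnegative increments, shape (K - 1,); package type 0 is the
              least robust, package type K - 1 the most robust
    """
    base = w @ x                                          # shared linear score
    offsets = np.concatenate(([0.0], np.cumsum(deltas)))  # 0 = o_0 <= o_1 <= ... <= o_{K-1}
    scores = base - offsets                               # more robust package => lower score
    return 1.0 / (1.0 + np.exp(-scores))                  # one damage probability per package type
```

Keeping the increments nonnegative during fitting is one simple way to realize the kind of parameter constraints described above: the predicted damage probabilities are then guaranteed to be nonincreasing as the packaging gets more robust.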
To further enforce ordinality, we use what in machine learning parlance is called data augmentation. For every example of a product-package pair that resulted in product damage, we add examples that pair the same product with each of the less-robust packaging options, also labeled as resulting in damage.
Similarly, for each example of a product-package pair that was successfully delivered, we add examples that pair the same product with each of the more-robust packaging options, also labeled as having resulted in successful delivery.
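In code, the augmentation step might look like the sketch below, assuming package types are encoded as integers ordered from least robust (0) to most robust (the encoding and names are our own):

```python
def augment(features, package_type, damaged, num_package_types=8):
    """Expand one shipment record into the additional examples implied by ordinality."""
    examples = [(features, package_type, damaged)]
    if damaged:
        # A damaged shipment implies every less robust package would also have failed.
        for weaker in range(package_type):
            examples.append((features, weaker, True))
    else:
        # A successful shipment implies every more robust package would also have succeeded.
        for stronger in range(package_type + 1, num_package_types):
            examples.append((features, stronger, False))
    return examples
```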
Not only does using a linear model make it easier to enforce ordinality, but it also makes model building much more efficient — a crucial concern, given that we’re fitting our model to 100 million different product-package pairs.
Formulating the problem
Our goal is to find a function that maps product features to package types in a way that minimizes the sum of shipping costs and damage-related costs for each product. (Using features rather than product identifiers as inputs to the function ensures that our model will also apply to products added after the model is trained.) At the same time, the function needs to keep the cumulative damage cost across products below some predetermined threshold.
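Written out (in our own notation, which may differ from the paper's), the exact problem is to choose a mapping $f$ from product features to package types such that

$$
\min_{f} \; \sum_{i} \Big[\, c\big(f(x_i)\big) + d\big(x_i, f(x_i)\big) \,\Big]
\quad \text{subject to} \quad
\sum_{i} d\big(x_i, f(x_i)\big) \le D,
$$

where $x_i$ is the feature vector of product $i$, $c(k)$ is the shipping cost of package type $k$, $d(x_i, k)$ is the expected damage-related cost of shipping product $i$ in package type $k$, and $D$ is the predetermined damage budget.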
Unfortunately, this formulation of the problem is NP-complete, meaning that no known algorithm can solve it efficiently at the scale real-world catalogues require.
We prove, however, that given a set of realistic assumptions, this formulation is equivalent to a simpler optimization problem that minimizes a weighted sum of the total cost of packaging and the total damage cost.
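Concretely, the simpler problem has the form of an unconstrained, weighted objective (again in our notation):

$$
\min_{f} \; \sum_{i} c\big(f(x_i)\big) \;+\; \lambda \sum_{i} d\big(x_i, f(x_i)\big),
$$

where $\lambda$ is the weighting parameter on the damage cost discussed next. The constraint from the exact formulation disappears; its effect is absorbed into the choice of $\lambda$.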
Proving equivalence requires that we find the right weighting parameter — the parameter by which the damage cost is multiplied in the weighted sum. (Weighting the damage cost, rather than the cost of packaging, makes sense, as there are fewer examples of damaged goods in the data set.) In the paper, we show that the weight can be computed efficiently using binary search.
The search procedure starts with some large weight — in our experiments, we started with 1,000 — as the maximum of the search range, with 0 as the minimum, and cuts the range in half to define a midpoint.
Next, solve the optimization problem (minimize the weighted sum) at that midpoint weight. If, using the resulting model, the damage cost is above the threshold from the exact optimization problem, reset the minimum weight to the current midpoint; if it’s below the threshold, reset the maximum weight to the current midpoint. Then repeat the operation.
At each iteration, this procedure halves the interval in which we search for the weight setting that will keep the damage cost just under the target threshold. The procedure ends when the damage cost is within some small, predetermined tolerance of the threshold.
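Put together, the search loop looks something like the sketch below. The helper `solve_weighted` is a stand-in for re-solving the weighted optimization at a given weight and reporting the resulting total damage cost; the starting bound, tolerance, and names are illustrative:

```python
def find_weight(solve_weighted, damage_threshold, w_max=1000.0, tol=1e-2, max_iters=50):
    """Binary-search the damage weight until the total damage cost lands near the threshold."""
    w_min = 0.0
    w_mid = w_max / 2.0
    for _ in range(max_iters):
        w_mid = (w_min + w_max) / 2.0
        damage_cost = solve_weighted(w_mid)          # recompute the optimal model at this weight
        if abs(damage_cost - damage_threshold) <= tol:
            break                                    # close enough to the damage budget
        if damage_cost > damage_threshold:
            w_min = w_mid                            # too much damage: penalize damage more heavily
        else:
            w_max = w_mid                            # under budget: try a smaller weight
    return w_mid
```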
In our experiments, the procedure required 19 iterations, which meant recomputing the optimal model 19 times. But the procedure scales linearly with the data: our proof lets us exploit the fact that, in the weighted-sum formulation, no constraint couples different products, so the optimization decomposes into an independent problem for each product. So even with 100 million data points, this isn’t an overwhelming computational burden.
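For a sense of why each iteration is cheap, here is a decoupled sketch of the inner solve. Once the weight is fixed and damage probabilities are available from the fitted model, the best package for each product can be chosen independently in a single pass over the data (the helpers `shipping_cost`, `damage_prob`, and `damage_cost` are our own stand-ins, and this glosses over refitting the linear model itself):

```python
def solve_weighted_decoupled(products, shipping_cost, damage_prob, damage_cost, weight):
    """Pick a package type per product independently; return the choices and total damage cost.

    products      -- iterable of (product_id, feature_vector) pairs
    shipping_cost -- shipping_cost[k] is the cost of shipping in package type k
    damage_prob   -- damage_prob(features, k) is the model's damage probability for type k
    damage_cost   -- cost incurred when a shipment arrives damaged
    weight        -- the multiplier on the damage term found by the binary search
    """
    num_types = len(shipping_cost)
    choices, total_damage = {}, 0.0
    for product_id, features in products:
        # No constraint couples products, so each argmin is independent of the others.
        best_k = min(
            range(num_types),
            key=lambda k: shipping_cost[k] + weight * damage_prob(features, k) * damage_cost,
        )
        choices[product_id] = best_k
        total_damage += damage_prob(features, best_k) * damage_cost
    return choices, total_damage
```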