In an effort to improve the performance of robots that pick, sort, and pack products in warehouses, Amazon has publicly released the largest dataset of images captured in an industrial product-sorting setting. Where the largest previous dataset of industrial images featured on the order of 100 objects, the Amazon dataset, called ARMBench, features more than 190,000 objects. As such, it could be used to train “pick and place” robots that are better able to generalize to new products and contexts.
We describe ARMBench in a paper we will present later this spring at the International Conference on Robotics and Automation (ICRA).
The scenario in which the ARMBench images were collected involves a robotic arm that must retrieve a single item from a bin full of items and transfer it to a tray on a conveyor belt. The variety of the objects, their configurations within the bin, and their interactions with the robotic system make this a uniquely challenging task.
ARMBench contains image sets for three separate tasks: (1) object segmentation, or identifying the boundaries of different products in the same bin; (2) object identification, or determining which product image in a reference database corresponds to the highlighted product in an image; and (3) defect detection, or determining when the robot has committed an error, such as picking up multiple items rather than one or damaging an item during transfer.
The images in the dataset fall into three categories, which together document a single pick-and-place activity (see the sketch after this list):
- the pick image is a top-down image of a bin filled with items, prior to robotic handling;
- transfer images are captured from multiple viewpoints as the robot transfers an item to the tray;
- the place image is a top-down image of the tray in which the selected item is placed.
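To make that grouping concrete, here is a minimal sketch of how one pick-and-place activity might be represented in code. The field names, file names, and folder layout are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class PickActivity:
    """One pick-and-place activity, grouping the three image categories."""
    activity_id: str
    pick_image: Path                 # top-down bin image, before handling
    transfer_images: list[Path]     # multi-viewpoint images during transfer
    place_image: Path                # top-down tray image, after placement

def load_activity(root: Path, activity_id: str) -> PickActivity:
    """Gather one activity's images from a hypothetical folder layout."""
    folder = root / activity_id
    return PickActivity(
        activity_id=activity_id,
        pick_image=folder / "pick.jpg",
        transfer_images=sorted(folder.glob("transfer_*.jpg")),
        place_image=folder / "place.jpg",
    )
```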
The object segmentation dataset contains more than 50,000 images, with anywhere from one to 50 manual object segmentations per image, for an average of about 10.5. The high degree of clutter, combined with the variety of the objects — some of which are even transparent or reflective — makes this a challenging and unique benchmark.
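For a sense of how per-image statistics like those above could be computed, the short sketch below tallies instance counts from the annotations. It assumes the labels are exported in COCO-style JSON, which may not match ARMBench's actual annotation format, and the file path is hypothetical.

```python
import json
from collections import Counter

# Hypothetical path; substitute the real annotation file.
with open("segmentation/annotations.json") as f:
    data = json.load(f)

# COCO-style files have one entry per annotated object instance,
# keyed by the image it belongs to.
counts = Counter(ann["image_id"] for ann in data["annotations"])
per_image = list(counts.values())  # images with zero annotations are absent

print(f"annotated images:              {len(per_image)}")
print(f"instances per image (min/max): {min(per_image)}/{max(per_image)}")
print(f"instances per image (mean):    {sum(per_image) / len(per_image):.1f}")
```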
The object identification dataset contains more than 235,000 labeled “pick activities”; each pick activity includes a pick image and three transfer images. There are also reference images and text descriptions of more than 190,000 products; in the object identification task, a model must learn to match one of these reference products to an object highlighted in pick and transfer images.
Some of the challenges posed by this task include differentiating between similar-looking products, matching across large variations in viewpoints, and fusing multimodal information such as images and text to make predictions.
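As a simplified illustration of the matching problem, the sketch below implements a generic retrieval baseline: embed the query image and every reference image, then return the nearest reference by cosine similarity. The embed function here is a trivial stand-in for a real image encoder, and none of this reflects the specific models described in the paper.

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Trivial stand-in encoder: flatten and L2-normalize the pixels.
    Swap in a real image encoder (CNN, vision transformer, ...) in practice."""
    v = image.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def identify(query_image: np.ndarray, reference_images: list) -> int:
    """Return the index of the best-matching reference product.
    Assumes all images have been resized to a common shape."""
    query = embed(query_image)                                # (d,)
    gallery = np.stack([embed(r) for r in reference_images])  # (n, d)
    scores = gallery @ query            # cosine similarity of unit vectors
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = [rng.random((64, 64, 3)) for _ in range(5)]  # stand-in reference images
    query = refs[2] + 0.05 * rng.random((64, 64, 3))    # noisy view of product 2
    print(identify(query, refs))                        # -> 2
```

Cosine similarity over normalized embeddings is a common choice for this kind of open-set retrieval, because new products can be added to the reference gallery without retraining the encoder.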
The defect detection dataset includes both still images and videos. The still images — more than 19,000 of them — were captured during the transfer phase and are intended to train defect detection models, which determine when a robot arm has inadvertently damaged an object or picked up more than one object.
The 4,000 videos document pick-and-place activities that resulted in damage to a product. Certain types of product damage are best diagnosed through video, as they can occur at any point in the transfer process; multipick errors, by contrast, necessarily occur at the beginning of transfer and are visible in still images. The dataset also contains images and videos for over 100,000 pick-and-place activities without any defects.
The stringent accuracy requirements for defect detection in warehouse settings require the exploration and improvement of several key computer vision technologies, such as image classification, anomaly detection, and the detection of defect events in videos.
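To make the first of those technologies concrete, here is a minimal image-classification baseline: a torchvision ResNet-18 fine-tuned with a three-way head. The label set (nominal, multi-pick, damaged) and the training details are assumptions for illustration, not the models reported in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # assumed label set: nominal / multi-pick / damaged

# Start from ImageNet weights and replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of transfer-phase images.
    `images` is (B, 3, H, W); `labels` is (B,) with class indices."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```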
In our paper, we describe several approaches we adopted to build models for the ARMBench tasks, and we report our models’ performance on those tasks to provide other researchers with performance benchmarks.
We intend to continue to expand the number of images and videos in the ARMBench dataset and the range of products they depict. It is our hope that ARMBench can help improve the utility of robots that relieve warehouse workers, such as the hundreds of thousands of employees at Amazon fulfillment centers, of repetitive tasks.
We also hope that the scale and diversity of the ARMBench data and the quality of its annotations will make it useful for training other types of computer vision models — not just those that help control warehouse robots.