Amazon releases code, datasets for developing embodied AI agents


Alexa Arena is a new embodied-AI framework developed to push the boundaries of human-robot interaction. It offers an interactive, user-centric platform for creating robotic tasks that involve navigating multiroom simulated environments and manipulating a wide range of objects in real time. In a gamelike setting, users can interact with virtual robots through natural-language dialogue, helping the robots complete their tasks. The framework currently includes a large set of multiroom layouts for a home, a warehouse, and a lab.


Arena enables the training and evaluation of embodied-AI models, along with the generation of new training data based on the human-robot interactions. It can thus contribute to the development of generalizable embodied agents with a wide variety of AI capabilities, such as task planning, visual dialogue, multimodal reasoning, task completion, teachable AI, and conversational understanding.

We have publicly released (a) the code repository for Arena, which includes the simulation engine artifacts and a machine learning (ML) toolbox for model training and visual inference; (b) comprehensive datasets for training embodied agents; and (c) benchmark ML models that incorporate vision and language planning for task completion. We have also launched a new leaderboard for Arena to evaluate the performance of embodied agents on unseen tasks.

The simulation engine of Alexa Arena is built using the Unity game engine and includes 330+ assets spanning both commonplace objects in homes (such as refrigerators and chairs) and uncommon objects (such as forklifts and floppy disks). Arena also features more than 200,000 multiroom scenes, each with a unique combination of room specifications and furniture arrangement.

In addition, each scene can randomize the robot’s initial location, the placement of movable objects (such as computers and books), floor materials, wall colors, etc., to provide the rich set of visual variations needed to train embodied agents through both supervised and reinforcement learning methods.
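As a rough sketch of what such scene randomization might look like in code, each call below samples one visual variation of a layout for training-data generation. The function and field names are illustrative assumptions, not part of the released Arena API.

```python
import random

# Hypothetical value sets for visual variation; not the actual Arena configuration schema.
FLOOR_MATERIALS = ["oak", "tile", "carpet"]
WALL_COLORS = ["white", "beige", "light_gray"]

def sample_scene_config(layout_id: str, movable_objects: list) -> dict:
    """Draw one randomized variation of a multiroom layout."""
    return {
        "layout_id": layout_id,
        "robot_start": {"room": random.choice(["kitchen", "office", "hallway"]),
                        "x": random.uniform(-2.0, 2.0),
                        "z": random.uniform(-2.0, 2.0)},
        "object_placements": {obj: random.choice(["table", "shelf", "counter"])
                              for obj in movable_objects},
        "floor_material": random.choice(FLOOR_MATERIALS),
        "wall_color": random.choice(WALL_COLORS),
    }

# Each call yields a new visual variation of the same layout.
configs = [sample_scene_config("home_01", ["computer", "book"]) for _ in range(5)]
```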

An example of a game built with Arena, with, at left, a virtual room seen from a simulated robot’s perspective and, at right, dialogue between the robot and the human operator.

To make games more engaging, Arena includes live background animations and sounds, user-friendly graphics, smooth robot navigation with live visuals, support for multiple viewpoints with switching between first-person and third-person cameras, hazards and preconditions that can be incorporated into task completion criteria, a mini-map showing the robot's location within a scene, and a configurable hint-generation mechanism. After every action executed in the environment, Arena generates a rich set of metadata, such as images from RGB and depth cameras, segmentation maps, the robot's location, and error codes.
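The per-action metadata can be thought of as a record like the one sketched below. The exact fields and types in the Arena ML toolbox may differ, so treat this as an assumption-laden illustration of how a training pipeline might consume each step.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative container for the per-action metadata described above;
# the actual schema in the Arena toolkit may differ.
@dataclass
class StepMetadata:
    rgb: np.ndarray            # HxWx3 color image from the robot's camera
    depth: np.ndarray          # HxW depth map
    segmentation: np.ndarray   # HxW map of per-pixel object IDs
    robot_location: tuple      # (x, y, z) position in the scene
    error_code: int            # 0 on success, nonzero if the action failed

def log_step(action: str, meta: StepMetadata) -> None:
    """Record one environment step for later supervised or RL training."""
    status = "ok" if meta.error_code == 0 else f"error {meta.error_code}"
    print(f"{action}: robot at {meta.robot_location}, {status}")
```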

Long-horizon robotic tasks (such as “make a hot cup of tea”) can be authored in Arena, using a new challenge definition format (CDF) to specify the initial states of objects (such as “cabinet doors are closed”), goal conditions to be satisfied (such as “cup is filled with milk or water”), and textual hints planted at specific locations in the scene (such as “check the fridge for milk”).
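To illustrate how such a task might be expressed, the snippet below mirrors the three parts described above (initial states, goal conditions, and hints) as a Python dictionary that could be serialized into a CDF file. The key names are assumptions for illustration, not the actual CDF schema.

```python
# Hypothetical CDF-style definition of the "make a hot cup of tea" task.
tea_task = {
    "task_description": "Make a hot cup of tea",
    "initial_states": [
        {"object": "cabinet", "state": "doors_closed"},
        {"object": "mug", "location": "cabinet"},
    ],
    "goal_conditions": [
        # Either fill state satisfies this condition.
        {"object": "mug", "state": {"filled_with": ["milk", "water"]}},
        {"object": "mug", "state": "heated"},
    ],
    "hints": [
        {"location": "kitchen_counter", "text": "Check the fridge for milk."},
    ],
}
```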


The Arena framework powers the Alexa Prize SimBot Challenge, in which 10 university teams are competing to develop embodied-AI agents that complete tasks with guidance from Alexa customers. Customers with Echo Show or Fire TV devices interact with the agents through voice commands, helping the robots achieve goals displayed on-screen. The challenge finals will take place in early May 2023.

The code repository for Arena includes two datasets: (a) an instruction-following dataset, containing 46,000 human-annotated dialogue instructions, along with ground truth action trajectories and robot view images, and (b) a vision dataset containing 660,000 images from Arena scenes spanning 160+ semantic-object groups, collected by navigating the robot to various virtual locations and capturing images of the objects there from different perspectives and distances.
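A minimal sketch of consuming the instruction-following data might look like the following. The file layout and field names are assumptions for illustration only, not the released dataset format.

```python
import json
from pathlib import Path

def load_missions(data_dir: str):
    """Yield (dialogue, actions, image_paths) tuples from per-mission JSON files."""
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path) as f:
            mission = json.load(f)
        # Each record pairs human dialogue instructions with the ground-truth
        # action trajectory and the robot view images captured along it.
        yield (mission["dialogue"],      # list of user/robot utterances
               mission["actions"],       # ground-truth action sequence
               mission["image_paths"])   # robot views along the trajectory

for dialogue, actions, images in load_missions("data/trajectories"):
    print(len(dialogue), "turns ->", len(actions), "actions")
```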

The data collection methodology that we used to create the instruction-following dataset is similar to the two-step procedure we adopted in our earlier work on DialFRED, where we used demonstration videos (generated by a symbolic planner) to crowd-source natural-language instructions in the form of multiturn Q&A dialogues.

Sample data from the Arena dataset.

Using the datasets mentioned above, we trained two embodied-agent models as benchmarks for Arena tasks. One is a neuro-symbolic model that uses the contextual history of past actions and a dedicated vision model:

Overview of the neuro-symbolic approach.

The other is an embodied vision-language (EVL) model that incorporates a joint vision-language encoder and a multihead model for task planning and mask prediction:

Overview of the vision-language model.
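To make the multihead idea concrete, here is a rough PyTorch sketch of a joint vision-language encoder feeding separate heads for action prediction (task planning), object selection, and coarse mask prediction. The layer sizes, token counts, and fusion scheme are illustrative assumptions and do not reproduce the released EVL model.

```python
import torch
import torch.nn as nn

class EVLSketch(nn.Module):
    """Illustrative joint vision-language encoder with task-planning and mask heads."""
    def __init__(self, d_model=256, num_actions=20, num_objects=160):
        super().__init__()
        self.vision_proj = nn.Linear(512, d_model)   # projects visual tokens
        self.text_proj = nn.Linear(768, d_model)     # projects instruction tokens
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(d_model, num_actions)   # next action (task planning)
        self.object_head = nn.Linear(d_model, num_objects)   # which object to act on
        self.mask_head = nn.Linear(d_model, 64 * 64)          # coarse mask prediction

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, Nv, 512) visual tokens; txt_feats: (B, Nt, 768) text tokens
        tokens = torch.cat([self.vision_proj(img_feats), self.text_proj(txt_feats)], dim=1)
        fused = self.fusion(tokens)
        pooled = fused.mean(dim=1)
        mask_logits = self.mask_head(pooled).view(-1, 64, 64)
        return self.action_head(pooled), self.object_head(pooled), mask_logits
```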

To evaluate our benchmarks, we used a metric called mission success rate (MSR), which is the ratio of successfully completed tasks to total tasks, across all tasks in the evaluation set.
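For example, MSR can be computed directly from per-mission success flags:

```python
def mission_success_rate(results: list) -> float:
    """MSR = successfully completed missions / total missions in the evaluation set."""
    return sum(results) / len(results) if results else 0.0

# Example: 3 of 4 evaluation missions completed successfully -> MSR = 0.75
print(mission_success_rate([True, True, False, True]))
```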


In our experiments, the EVL model achieves an MSR of 34.20%, which is 14.9 percentage points better than the MSR of the neuro-symbolic model. The results also indicate that the addition of clarification Q&A dialogue boosts the performance of the EVL model by 11.6% by enabling better object instance segmentation and visual grounding.

Alexa Arena is another example of Amazon’s industry-leading research in artificial intelligence and robotics. In the coming years, the Arena framework will be a critical tool for the development and training of new devices and robots that bring about a whole new era of generalizable AI and human-robot interaction.





