In this study, we develop a mobile service robot that retrieves and transports everyday objects based on free-form instructions that may refer to both visual features and scene text. Our approach focuses on retrieving relevant images of target objects from a set of pre-collected images in diverse environments. This task is challenging because it requires not only an understanding of the visual features and spatial relationships of objects, but also the ability to interpret scene text embedded in the environment.
To address this, we propose Scene Text-Aware Retrieval-Augmented Generation for Everyday Objects (STARE), which leverages both visual and textual cues in response to free-form instructions. STARE introduces Crosslingual Visual Prompts to enhance scene text interpretation, and models the complex relationships between named entities in the instructions and their corresponding objects.
We evaluated STARE using two novel benchmarks, GoGetIt and TextCaps-test, both of which consist of real-world indoor/outdoor images and instructions. The results show that STARE outperformed all baseline methods on standard multimodal retrieval metrics. Furthermore, we conducted physical experiments in a zero-shot transfer setting and achieved an overall success rate exceeding 80%.
We address the task of retrieving relevant images from a pre-collected dataset based on free-form natural language instructions for object manipulation or navigation. We define this task as Scene Text-Aware Multimodal Retrieval (STMR), where the retrieved images may or may not contain scene text.
Fig. 1: A typical example of the STMR task. STARE retrieves target objects based on free-form language instructions using scene text and visual context.
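At its core, STMR amounts to ranking a fixed set of pre-collected images against a free-form instruction. The following is a minimal sketch of that interface only, assuming pre-computed embeddings and cosine similarity as the relevance score; the encoder, embedding dimensionality, and function name rank_images are illustrative and not taken from the paper.

import numpy as np

def rank_images(instruction_emb: np.ndarray, image_embs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k pre-collected images most relevant to the instruction.

    instruction_emb: (d,) embedding of the free-form instruction.
    image_embs:      (n, d) embeddings of the pre-collected images.
    """
    # Cosine similarity between the instruction and every image.
    a = instruction_emb / (np.linalg.norm(instruction_emb) + 1e-8)
    b = image_embs / (np.linalg.norm(image_embs, axis=1, keepdims=True) + 1e-8)
    scores = b @ a
    # Higher score means more relevant; return the top-k image indices.
    return np.argsort(-scores)[:k].tolist()

# Toy usage with random vectors standing in for a real encoder.
rng = np.random.default_rng(0)
print(rank_images(rng.normal(size=512), rng.normal(size=(100, 512)), k=3))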
To tackle this task, we propose a novel approach that integrates scene text and visual features using Crosslingual Visual Prompts (CVP) and a Scene Text Reranker, thereby enhancing retrieval based on free-form natural language instructions.
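The CVP mechanism is not spelled out in this section, but one plausible reading is that detected scene-text regions are visually annotated with both the recognized string and its English rendering before the image is passed to a vision-language model. The sketch below illustrates that idea with Pillow; the function name, the red-box styling, and the reliance on external OCR and translation outputs are all assumptions for illustration.

from PIL import Image, ImageDraw

def draw_crosslingual_prompt(image: Image.Image,
                             boxes: list[tuple[int, int, int, int]],
                             texts: list[str],
                             translations: list[str]) -> Image.Image:
    """Annotate detected scene-text regions with the recognized string and its
    English translation, so a downstream vision-language model can read both."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for (x0, y0, x1, y1), text, trans in zip(boxes, texts, translations):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)           # mark the region
        draw.text((x0, max(0, y0 - 12)), f"{text} / {trans}", fill="red")  # original / translation
    return out

# Toy usage: one fake detection on a blank image. Real OCR and machine-translation
# results would come from external models.
img = Image.new("RGB", (320, 240), "white")
annotated = draw_crosslingual_prompt(img, [(40, 60, 220, 100)],
                                     ["PAPEL PARA HORNEAR"], ["baking paper"])
annotated.save("cvp_example.png")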
CORE NOVELTIES:
Fig. 2: Overview of STARE. Given a natural language instruction and a set of pre-collected images, the model retrieves the most relevant images by integrating scene text and visual features through three modules: MFIE, STVE, and STRR.
The MFIE models complex expressions within an instruction by incorporating multi-level aligned language features, including descriptions generated using world knowledge from LLMs.
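As a rough illustration of the multi-level idea (not the actual MFIE architecture), one can stack features of the raw instruction, its individual words, and an LLM-generated description that contributes world knowledge. The placeholder embed function below stands in for a real pretrained text encoder; all names are hypothetical.

import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Placeholder text encoder; a real system would use a pretrained language model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def multilevel_instruction_features(instruction: str, llm_description: str) -> np.ndarray:
    """Stack features at several linguistic levels: the full instruction, its
    individual words, and an LLM-generated description adding world knowledge."""
    word_feats = np.mean([embed(w) for w in instruction.split()], axis=0)
    return np.concatenate([embed(instruction), word_feats, embed(llm_description)])

feats = multilevel_instruction_features(
    "Pass me the red container of Sun-Maid raisins on the kitchen counter.",
    "Sun-Maid is a brand of dried fruit; its raisins come in a red box.",  # e.g., produced by an LLM
)
print(feats.shape)  # (768,)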
The STVE combines narrative representations that account for scene text, overlapped patchified features, and multi-scale aligned features.
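The exact STVE design is not detailed here; the sketch below only illustrates what "overlapped patchified" and "multi-scale" could mean in practice, namely patches extracted with a stride smaller than the patch size and pooled at more than one granularity. Function names and patch sizes are made up for the example.

import numpy as np

def overlapped_patches(img: np.ndarray, patch: int = 32, stride: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into overlapping patches (stride < patch size),
    so scene text spanning a patch border is still seen whole by at least one patch."""
    h, w, _ = img.shape
    patches = [
        img[y:y + patch, x:x + patch]
        for y in range(0, h - patch + 1, stride)
        for x in range(0, w - patch + 1, stride)
    ]
    return np.stack(patches)  # (num_patches, patch, patch, C)

def multiscale_features(img: np.ndarray, scales=(1.0, 0.5)) -> list[np.ndarray]:
    """Very rough multi-scale pooling: mean intensity of each overlapping patch,
    computed with a coarser patch size at each smaller scale."""
    feats = []
    for s in scales:
        side = max(32, int(32 / s))
        p = overlapped_patches(img, patch=side, stride=side // 2)
        feats.append(p.reshape(len(p), -1).mean(axis=1))
    return feats

img = np.random.rand(128, 128, 3)
print([f.shape for f in multiscale_features(img)])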
The STRR efficiently models the highly complex relationships between signifiers, such as named entities in the instruction, and the objects they signify.
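A concrete, much-simplified way to picture this module: boost retrieval candidates whose OCR'd scene text contains the named entities mentioned in the instruction. The toy reranker below is an assumption for illustration, not the STRR itself; the candidate dictionary layout and the weight are invented.

def rerank_by_scene_text(candidates: list[dict], entities: list[str], weight: float = 0.5) -> list[dict]:
    """Rerank retrieved candidates by how well their OCR'd scene text matches
    named entities from the instruction (e.g., brand names).

    Each candidate dict carries a base retrieval 'score' and a list of 'ocr' strings.
    """
    reranked = []
    for cand in candidates:
        ocr_text = " ".join(cand["ocr"]).lower()
        overlap = sum(e.lower() in ocr_text for e in entities) / max(len(entities), 1)
        reranked.append({**cand, "score": cand["score"] + weight * overlap})
    return sorted(reranked, key=lambda c: c["score"], reverse=True)

candidates = [
    {"id": "img_01", "score": 0.62, "ocr": ["MAXWELL", "HOUSE", "COFFEE"]},
    {"id": "img_02", "score": 0.65, "ocr": ["NESCAFE", "GOLD"]},
]
print(rerank_by_scene_text(candidates, entities=["Maxwell"])[0]["id"])  # img_01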
Finally, the model outputs a ranked list of images that are most relevant to the instruction. STARE is designed to handle images both with and without scene text, enabling flexible and robust retrieval in real-world environments.
Fig. 3: Qualitative results of the proposed method and a baseline method on the following benchmarks: (a) the RefText subset of the GoGetIt benchmark, (b) the Instruction subset of the GoGetIt benchmark, and (c) the TextCaps-test benchmark. The given instructions for each sample were as follows: (a) "a long box of non-stick baking paper."; (b) "Pass me the red container of Sun-Maid raisins on the kitchen counter."; (c) "Buy the red Orbit candy at the kiosk."
Fig. 4: Qualitative results of the hardware experiments. The given instruction was "Pass me the Maxwell can." The top-2 retrieved images and the scene of the robot fetching the corresponding object are shown.
Table 1: Quantitative results. The best results are marked in bold.
To appear.