In this study, we develop a mobile service robot that retrieves and transports everyday objects based on free-form instructions that may refer to both visual features and scene text. Our approach focuses on retrieving relevant images of target objects from a set of pre-collected images in diverse environments. This task is challenging because it requires not only an understanding of the visual features and spatial relationships of objects, but also the ability to interpret scene text embedded in the environment.
To address this, we propose Scene Text-Aware Retrieval-Augmented Generation for Everyday Objects (STARE), which leverages both visual and textual cues in response to free-form instructions. STARE introduces Crosslingual Visual Prompts to enhance scene text interpretation, and models the complex relationships between named entities in the instructions and their corresponding objects.
We evaluated STARE using two novel benchmarks, GoGetIt and TextCaps-test, both of which consist of real-world indoor/outdoor images and instructions. The results show that STARE outperformed all baseline methods on standard multimodal retrieval metrics. Furthermore, we conducted physical experiments in a zero-shot transfer setting and achieved an overall success rate exceeding 80%.
We address the task of retrieving relevant images from a pre-collected dataset based on free-form natural language instructions for object manipulation or navigation. We define this task as Scene Text-Aware Multimodal Retrieval (STMR), where the retrieved images may or may not contain scene text.
Fig. 1: A typical example of the STMR task. STARE retrieves target objects based on free-form language instructions using scene text and visual context.
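At its core, STMR amounts to ranking a fixed set of pre-collected images against a free-form instruction. The following is a minimal sketch of that interface only, assuming pre-computed embeddings and cosine similarity as the relevance score; the encoder, embedding dimensionality, and function name rank_images are illustrative and not taken from the paper.

import numpy as np

def rank_images(instruction_emb: np.ndarray, image_embs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k pre-collected images most relevant to the instruction.

    instruction_emb: (d,) embedding of the free-form instruction.
    image_embs:      (n, d) embeddings of the pre-collected images.
    """
    # Cosine similarity between the instruction and every image.
    a = instruction_emb / (np.linalg.norm(instruction_emb) + 1e-8)
    b = image_embs / (np.linalg.norm(image_embs, axis=1, keepdims=True) + 1e-8)
    scores = b @ a
    # Higher score means more relevant; return the top-k image indices.
    return np.argsort(-scores)[:k].tolist()

# Toy usage with random vectors standing in for a real encoder.
rng = np.random.default_rng(0)
print(rank_images(rng.normal(size=512), rng.normal(size=(100, 512)), k=3))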
To tackle this task, we propose a novel approach that integrates scene text and visual features using Crosslingual Visual Prompts (CVP) and a Scene Text Reranker, thereby enhancing retrieval based on free-form natural language instructions.
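The CVP mechanism is not spelled out in this section, but one plausible reading is that detected scene-text regions are visually annotated with both the recognized string and its English rendering before the image is passed to a vision-language model. The sketch below illustrates that idea with Pillow; the function name, the red-box styling, and the reliance on external OCR and translation outputs are all assumptions for illustration.

from PIL import Image, ImageDraw

def draw_crosslingual_prompt(image: Image.Image,
                             boxes: list[tuple[int, int, int, int]],
                             texts: list[str],
                             translations: list[str]) -> Image.Image:
    """Annotate detected scene-text regions with the recognized string and its
    English translation, so a downstream vision-language model can read both."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for (x0, y0, x1, y1), text, trans in zip(boxes, texts, translations):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)           # mark the region
        draw.text((x0, max(0, y0 - 12)), f"{text} / {trans}", fill="red")  # original / translation
    return out

# Toy usage: one fake detection on a blank image. Real OCR and machine-translation
# results would come from external models.
img = Image.new("RGB", (320, 240), "white")
annotated = draw_crosslingual_prompt(img, [(40, 60, 220, 100)],
                                     ["PAPEL PARA HORNEAR"], ["baking paper"])
annotated.save("cvp_example.png")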
CORE NOVELTIES:
Fig. 2: Overview of STARE. Given a natural language instruction and a set of pre-collected images, the model retrieves the most relevant images by integrating scene text and visual features through three modules: MFIE, STVE, and STRR.
The MFIE models complex expressions within an instruction by incorporating multi-level aligned language features, including descriptions generated using world knowledge from LLMs.
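As a rough illustration of the multi-level idea (not the actual MFIE architecture), one can stack features of the raw instruction, its individual words, and an LLM-generated description that contributes world knowledge. The placeholder embed function below stands in for a real pretrained text encoder; all names are hypothetical.

import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Placeholder text encoder; a real system would use a pretrained language model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def multilevel_instruction_features(instruction: str, llm_description: str) -> np.ndarray:
    """Stack features at several linguistic levels: the full instruction, its
    individual words, and an LLM-generated description adding world knowledge."""
    word_feats = np.mean([embed(w) for w in instruction.split()], axis=0)
    return np.concatenate([embed(instruction), word_feats, embed(llm_description)])

feats = multilevel_instruction_features(
    "Pass me the red container of Sun-Maid raisins on the kitchen counter.",
    "Sun-Maid is a brand of dried fruit; its raisins come in a red box.",  # e.g., produced by an LLM
)
print(feats.shape)  # (768,)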
The STVE combines narrative representations that account for scene text, overlapped patchified features, and multi-scale aligned features.
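The exact STVE design is not detailed here; the sketch below only illustrates what "overlapped patchified" and "multi-scale" could mean in practice, namely patches extracted with a stride smaller than the patch size and pooled at more than one granularity. Function names and patch sizes are made up for the example.

import numpy as np

def overlapped_patches(img: np.ndarray, patch: int = 32, stride: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into overlapping patches (stride < patch size),
    so scene text spanning a patch border is still seen whole by at least one patch."""
    h, w, _ = img.shape
    patches = [
        img[y:y + patch, x:x + patch]
        for y in range(0, h - patch + 1, stride)
        for x in range(0, w - patch + 1, stride)
    ]
    return np.stack(patches)  # (num_patches, patch, patch, C)

def multiscale_features(img: np.ndarray, scales=(1.0, 0.5)) -> list[np.ndarray]:
    """Very rough multi-scale pooling: mean intensity of each overlapping patch,
    computed with a coarser patch size at each smaller scale."""
    feats = []
    for s in scales:
        side = max(32, int(32 / s))
        p = overlapped_patches(img, patch=side, stride=side // 2)
        feats.append(p.reshape(len(p), -1).mean(axis=1))
    return feats

img = np.random.rand(128, 128, 3)
print([f.shape for f in multiscale_features(img)])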
The STRR efficiently models the highly complex relationships between signifiers, such as named entities in the instruction, and the objects they signify.
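A concrete, much-simplified way to picture this module: boost retrieval candidates whose OCR'd scene text contains the named entities mentioned in the instruction. The toy reranker below is an assumption for illustration, not the STRR itself; the candidate dictionary layout and the weight are invented.

def rerank_by_scene_text(candidates: list[dict], entities: list[str], weight: float = 0.5) -> list[dict]:
    """Rerank retrieved candidates by how well their OCR'd scene text matches
    named entities from the instruction (e.g., brand names).

    Each candidate dict carries a base retrieval 'score' and a list of 'ocr' strings.
    """
    reranked = []
    for cand in candidates:
        ocr_text = " ".join(cand["ocr"]).lower()
        overlap = sum(e.lower() in ocr_text for e in entities) / max(len(entities), 1)
        reranked.append({**cand, "score": cand["score"] + weight * overlap})
    return sorted(reranked, key=lambda c: c["score"], reverse=True)

candidates = [
    {"id": "img_01", "score": 0.62, "ocr": ["MAXWELL", "HOUSE", "COFFEE"]},
    {"id": "img_02", "score": 0.65, "ocr": ["NESCAFE", "GOLD"]},
]
print(rerank_by_scene_text(candidates, entities=["Maxwell"])[0]["id"])  # img_01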
Finally, the model outputs a ranked list of images that are most relevant to the instruction. STARE is designed to handle images both with and without scene text, enabling flexible and robust retrieval in real-world environments.
Fig. 3: Qualitative results of the proposed method and a baseline method on the following benchmarks: (a) the RefText subset of the GoGetIt benchmark, (b) the Instruction subset of the GoGetIt benchmark, and (c) the TextCaps-test benchmark. The given instructions for each sample were as follows: (a) "a long box of non-stick baking paper."; (b) "Pass me the red container of Sun-Maid raisins on the kitchen counter."; (c) "Buy the red Orbit candy at the kiosk."
Fig. 4: Qualitative results of the hardware experiments. The given instruction was "Pass me the Maxwell can." The top-2 retrieved images and the scene of the robot fetching the corresponding object are shown.
Table 1: Quantitative results. The best results are marked in bold.
To appear.