Global-Local Unified Encoding for Steering Visual Attention in Robotic Manipulation  

Anonymous IROS Submission

Abstract

In recent years, considerable progress has been made in visual representation learning for robotic manipulation. However, in complex out-of-distribution (OOD) settings characterized by clutter and occlusion, visual representations that rely solely on global features can be disrupted, degrading policy performance. This limitation motivates the incorporation of local features for task-relevant objects. Leveraging the inherent invariance of these local features allows visual attention to be steered toward task-critical regions across OOD settings. Accordingly, we propose GLUE, a global-local unified encoding framework for steering visual attention in robotic manipulation. GLUE uses a text-guided mechanism to select and track key-patches as task-critical regions. It features a novel fusion framework that integrates global and local features to construct a unified visual representation. This fused representation maps in-domain and OOD visual observations into a similar task-informative feature space, ultimately enhancing the robustness of the imitation learning policy. Experiments demonstrate that GLUE achieves strong performance across diverse tasks in both simulation and real-world settings, outperforming the strongest baseline by 17.6% in simulation, 36.3% in real-world environments, and 58.3% in real-world generalization settings.


Evaluation Results in In-Domain Environments

Push Button (DP)

Push Button (ACT)

Push Button (GLUE-S)

Push Button (GLUE)

Stack Block (DP)

Stack Block (ACT)

Stack Block (GLUE-S)

Stack Block (GLUE)

Place Fruit (DP)

Place Fruit (ACT)

Place Fruit (GLUE-S)

Place Fruit (GLUE)

Fold Towel (DP)

Fold Towel (ACT)

Fold Towel (GLUE-S)

Fold Towel (GLUE)


Evaluation Results in Cluttered OOD Environments

Push Button (DP)

Push Button (ACT)

Push Button (GLUE-S)

Push Button (GLUE)

Stack Block (DP)

Stack Block (ACT)

Stack Block (GLUE-S)

Stack Block (GLUE)


Evaluation Results in Occluded OOD Environments

Push Button (DP)

Push Button (ACT)

Push Button (GLUE-S)

Push Button (GLUE)

Place Fruit (DP)

Place Fruit (ACT)

Place Fruit (GLUE-S)

Place Fruit (GLUE)


Evaluation Results in Illumination-Disturbed OOD Environments

Place Fruit (DP)

Place Fruit (ACT)

Place Fruit (GLUE-S)

Place Fruit (GLUE)

Fold Towel (DP)

Fold Towel (ACT)

Fold Towel (GLUE-S)

Fold Towel (GLUE)