Abstract
In recent years, considerable progress has been made in visual representation learning for robotic manipulation. However, in complex out-of-distribution (OOD) settings characterized by clutter and occlusion, visual representations that rely solely on global features are susceptible to interference, leading to degraded policy performance. This limitation motivates incorporating local features of task-relevant objects: because such features remain largely invariant across settings, they can be used to steer visual attention toward task-critical regions even under OOD conditions. Accordingly, we propose GLUE, a global-local unified encoding framework for steering visual attention in robotic manipulation. GLUE employs a text-guided mechanism to select and track key patches that correspond to task-critical regions. It features a novel fusion framework that integrates global and local features into a unified visual representation. This fused representation maps in-domain and OOD visual observations into a shared, task-informative feature space, ultimately enhancing the robustness of the imitation learning policy. Experiments demonstrate that GLUE achieves strong performance across diverse tasks in both simulation and real-world settings, outperforming the strongest baseline by 17.6% in simulation, 36.3% in real-world environments, and 58.3% in real-world generalization settings.
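To make the global-local fusion idea concrete, the following is a minimal PyTorch sketch of one plausible way to combine a global image embedding with pooled features from text-selected key patches. All names, dimensions, and the concatenate-then-project design are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class GlobalLocalFusion(nn.Module):
    """Illustrative fusion of a global embedding with pooled key-patch features.

    Assumptions (hypothetical, not from the paper): the global feature is a
    ViT-style [CLS] token, key patches are pre-selected by a text-guided
    module, and fusion is concatenation followed by a linear projection.
    """

    def __init__(self, dim: int = 768, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, out_dim)

    def forward(self, global_feat, patch_feats, key_idx):
        # global_feat: (B, dim)    global image embedding
        # patch_feats: (B, N, dim) per-patch tokens
        # key_idx:     (B, K)      indices of text-selected key patches
        local = torch.gather(
            patch_feats, 1,
            key_idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1)),
        )
        local = local.mean(dim=1)                     # pool local key-patch features
        fused = torch.cat([global_feat, local], -1)   # unify global + local cues
        return self.proj(fused)                       # unified representation for the policy


# Toy usage with random tensors
fusion = GlobalLocalFusion()
g = torch.randn(2, 768)
p = torch.randn(2, 196, 768)
idx = torch.randint(0, 196, (2, 8))
print(fusion(g, p, idx).shape)  # torch.Size([2, 512])
```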