A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
Fawad Javed Fateh*, Ali Shah Ali*, Murad Popattia, Usman Nizamani, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran
* indicates joint first author.

Retrocausal, Inc.
Abstract
We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach consisting of two successive levels of vector quantization: the lower level assigns input actions to fine-grained subclusters, while the higher level further maps these subclusters to coarser clusters. This hierarchical approach outperforms its non-hierarchical counterpart while exploiting mainly spatial information by reconstructing input actions. We then extend it to leverage both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach performs multi-level clustering while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulated and real robotic manipulation benchmarks show that our approach establishes a new state of the art in in-context imitation learning.
Motivation
Learning effective action representations for in-context imitation learning (ICIL) remains a challenging problem due to the complex spatiotemporal nature of robotic actions. In real-world manipulation tasks, actions exhibit fine-grained variations, temporal dependencies, and continuity constraints that are difficult to capture with simple representations. While recent works leverage action tokenization to discretize continuous robot actions, existing approaches remain limited. For instance, MLP-based tokenizers used in prior ICIL frameworks fail to preserve smoothness and continuity in action trajectories. LipVQ-VAE improves upon this by introducing vector quantization with Lipschitz regularization to enforce smoothness; however, it relies on flat clustering and primarily exploits spatial information through action reconstruction. As a result, it lacks the ability to model hierarchical structures in actions and does not explicitly capture temporal dependencies beyond smoothness constraints. Moreover, temporal modeling in action tokenization is still underexplored. While techniques such as positional encoding or standard vector quantization can encode temporal order, they often fail to preserve temporal coherence and smooth transitions across actions. This limitation hinders the ability of ICIL models to generalize across tasks, especially when demonstrations involve long-horizon dependencies and compositional action structures. These challenges motivate the need for a more expressive action tokenizer that can capture both hierarchical structure and spatiotemporal dependencies, while maintaining smooth and consistent action representations.
Approach
We propose a hierarchical spatiotemporal action tokenizer (HiST-AT) for in-context imitation learning that addresses the limitations of prior tokenization methods by jointly modeling hierarchical structure and spatiotemporal dependencies. Our approach is built on two key components: hierarchical vector quantization and spatiotemporal reconstruction. First, we introduce a hierarchical clustering mechanism based on multi-level vector quantization. Unlike prior methods that perform flat clustering, our approach organizes action representations into two levels: a lower level that captures fine-grained sub-action primitives and a higher level that groups these primitives into more abstract action clusters. This hierarchical design enables the model to represent both short-term and long-term action structures, improving the expressiveness and generalization of the learned representations. Second, we incorporate spatiotemporal reconstruction to explicitly leverage both spatial and temporal cues. In addition to reconstructing input actions, our model also predicts the associated timestamps, encouraging the learned representations to capture temporal dynamics alongside spatial structure. This joint reconstruction allows the model to maintain temporal coherence and better encode dependencies across action sequences. Furthermore, we enforce smoothness in the latent space using Lipschitz regularization, which promotes continuity in the learned representations and reduces noise in tokenized actions. By integrating hierarchical clustering, spatiotemporal reconstruction, and smoothness constraints into a unified framework, HiST-AT learns structured and transferable action representations. Our approach avoids the limitations of flat tokenization schemes and implicit temporal modeling, leading to improved performance and generalization in in-context imitation learning tasks across both simulation and real-world settings.
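The two-level quantization described above can be sketched as follows. This is a minimal, inference-time illustration only: it assumes fixed codebooks, Euclidean nearest-neighbor assignment, and the (αK, K) = (64, 16) codebook sizes from our ablation; the function names and shapes are illustrative, not our actual implementation.

```python
import numpy as np

def quantize(x, codebook):
    """Assign each row of x to its nearest codebook entry (Euclidean)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)           # one token id per input vector
    return codebook[idx], idx            # quantized vectors, indices

def hierarchical_tokenize(z_e, codebook_z, codebook_a):
    """Two successive VQ stages: fine sub-action codes (Codebook Z),
    then coarser action clusters (Codebook A) over the fine codes."""
    z_q, fine_idx = quantize(z_e, codebook_z)     # lower level
    _, coarse_idx = quantize(z_q, codebook_a)     # higher level
    return fine_idx, coarse_idx

# toy usage with the (64, 16) codebook sizes used in our experiments
rng = np.random.default_rng(0)
codebook_z = rng.standard_normal((64, 8))   # alpha*K fine codes
codebook_a = rng.standard_normal((16, 8))   # K coarse clusters
actions = rng.standard_normal((5, 8))       # 5 encoded action vectors
fine, coarse = hierarchical_tokenize(actions, codebook_z, codebook_a)
```

During training, each quantization stage additionally carries the usual codebook and commitment losses with straight-through gradients, which this sketch omits.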
Overview. (a) In-context imitation learning (ICIL) allows robots to generalize from demonstrations to new tasks without retraining. The action tokenizer (AT) is key to capturing demonstration information effectively. (b) Previous AT methods rely on vector quantization, conducting flat clustering and focusing on spatial cues by recovering input actions. (c) We propose a hierarchical spatiotemporal action tokenizer, which performs multi-level clustering and exploits both spatial and temporal cues by jointly reconstructing actions and timestamps, yielding superior performance.
Method. Our framework learns hierarchical spatiotemporal action representations through multi-level vector quantization and joint reconstruction. Input actions are encoded and regularized for smoothness, then quantized in two stages: fine-grained sub-actions (Codebook Z) and higher-level action clusters (Codebook A), each with corresponding losses. A spatiotemporal reconstruction module recovers both actions and timestamps, enabling the model to capture spatial structure and temporal dynamics. Arrows denote computation and gradient flow, while different blocks represent encoding, quantization, regularization, and reconstruction.
Experimental Results
We conduct extensive experiments on the RoboCasa and ManiSkill datasets. We evaluate performance via success rate, as defined by each environment in the datasets.
Comparisons on RoboCasa Dataset
We evaluate on the MimicGen dataset in RoboCasa. The results show that our method significantly enhances performance, achieving an average success rate of 59% compared to 53% for the previous best, LipVQ-VAE. Moreover, incorporating hierarchical clustering and spatiotemporal reconstruction increases the overall effectiveness of our method, demonstrated by a 14.8% performance gap between our HiST-AT and the lowest-performing MLP tokenizer. Overall, HiST-AT outperforms prior action tokenizers, including FAST, which additionally uses language inputs and is fine-tuned on one million action samples.

Comparisons on RoboCasa Dataset. Bold denotes the best results.
Comparisons on ManiSkill Dataset
To examine generalization beyond ICIL, we replace the cVAE-based encoder in ACT with different action tokenizers, including our HiST-AT. In addition, following LipVQ-VAE, we add a depth channel and train MCR jointly with the policy head. The results show that HiST-AT achieves the best overall performance, surpassing the prior best, LipVQ-VAE, by 5.3%. While previous approaches such as ACT and LipVQ-VAE attempt to address action smoothness, their limitations in modeling hierarchical structure and temporal consistency restrict their performance, whereas HiST-AT effectively captures both, leading to notable improvements.

Comparisons on ManiSkill Dataset. Bold denotes the best result.
Cross-Dataset Results
We evaluate transfer from the MimicGen dataset to the Human dataset in RoboCasa, which contains sparser and less structured object arrangements. The results in the table show that ICRT-based methods demonstrate stronger robustness than BC-Transformer. Even with an MLP action tokenizer, the ICRT framework surpasses BC-Transformer. More importantly, our HiST-AT further improves cross-dataset performance, outperforming the second-best LipVQ-VAE by 10% on average, highlighting the benefit of hierarchical clustering and spatiotemporal reconstruction in capturing transferable action representations.

Comparisons on Cross-Dataset Transfer. Bold denotes the best results.
Zero-Shot Results
To further evaluate generalization to unseen data, we perform zero-shot experiments by training on one subset of tasks and testing on another, following the split in RoboCasa. The results in the table show that ICRT-based methods outperform other approaches such as BC-Transformer. Moreover, our HiST-AT performs the best, surpassing the second-best LipVQ-VAE by 6.2% on average, demonstrating stronger generalization to unseen action sequences.

Comparisons on Zero-Shot Results. Bold denotes the best results.
Ablation Results on Model Components
We analyze the contribution of each component of our method on RoboCasa in the table. Starting from the baseline LipVQ-VAE, adding hierarchical clustering improves success rates significantly, highlighting the benefit of modeling structured action hierarchies, while integrating spatiotemporal reconstruction alone yields smaller gains. Incorporating both components in our HiST-AT performs best, achieving a 6% average performance increase over the baseline. These results demonstrate that hierarchical clustering and spatiotemporal reconstruction provide complementary gains.

Model components ablation. Bold denotes the best results.
Ablation Results on Codebook Sizes
We investigate the effect of the sizes of the codebooks Z and A, i.e., (αK, K) respectively, on ManiSkill. As shown in the Impact of Codebook Sizes graph, increasing from (32, 16) to (64, 16) improves overall performance, indicating that a larger number of sub-action clusters helps capture fine-grained action dynamics. However, further increasing to (64, 32) does not provide additional gains, suggesting redundancy in the representation. Overall, (64, 16) offers the best tradeoff between capturing high-level structures and detailed action variations, and is used in all of our experiments.

Ablation Results on λtemp
We analyze the effect of the temporal reconstruction weight λtemp on ManiSkill by varying its value in the range [0.002, 2], as shown in the graph. The results indicate that moderate temporal supervision is most effective, with λtemp = 0.02 achieving the strongest overall performance. Larger weights lead to a decline in performance, suggesting that excessive emphasis on timestamp prediction can hinder the learning of action representations. Overall, a balanced temporal reconstruction weight is crucial for capturing action dynamics without overwhelming the primary learning objective, and we set λtemp = 0.02 in all of our experiments.
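The role of λtemp can be illustrated with a simplified objective. This sketch assumes mean-squared-error reconstruction terms and folds the two codebook losses into given scalars; the function name and exact loss forms are illustrative assumptions, not our actual implementation.

```python
import numpy as np

def hist_at_objective(a_pred, a_true, t_pred, t_true,
                      vq_loss_z, vq_loss_a, lam_temp=0.02):
    """Simplified training objective: action reconstruction plus
    timestamp reconstruction weighted by lam_temp, plus the losses
    contributed by the two codebooks (Z and A)."""
    l_action = np.mean((a_pred - a_true) ** 2)   # spatial cue
    l_time = np.mean((t_pred - t_true) ** 2)     # temporal cue
    return l_action + lam_temp * l_time + vq_loss_z + vq_loss_a

# with lam_temp = 0.02, timestamp errors contribute only mildly,
# so temporal supervision guides rather than dominates training
loss = hist_at_objective(np.zeros(4), np.zeros(4),
                         np.ones(4), np.zeros(4),
                         vq_loss_z=0.1, vq_loss_a=0.1)
```

Scaling lam_temp toward the upper end of the [0.002, 2] range makes the timestamp term comparable to the action term, matching the performance decline observed in the ablation.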

Citation
@article{fateh2026hierarchical,
title={A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics},
author={Fateh, Fawad Javed and Ali, Ali Shah and Popattia, Murad and Nizamani, Usman and Konin, Andrey and Zia, M Zeeshan and Tran, Quoc-Huy},
journal={arXiv preprint arXiv:2604.15215},
year={2026}
}