Procedure Learning via Regularized
Gromov-Wasserstein Optimal Transport

Retrocausal, Inc.

Abstract

We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation with a structural prior for computing frame-to-frame mappings between videos. However, optimizing exclusively for the above temporal alignment term may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and hence every video is associated with only one key step. To address this limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space, avoiding the collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks to demonstrate the superior performance of our approach over previous methods, including OPEL, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.

Motivation

Learning procedures from instructional videos is a challenging problem due to variations in step appearances, differences in step order, and the presence of background or redundant frames. These issues are particularly prevalent in real-world datasets, where videos often contain non-monotonic step sequences, repeated actions, and untrimmed content. While earlier works relied on full or weak supervision, such approaches are costly and do not scale to large domains. Recent self-supervised methods address these challenges by aligning videos, but they remain limited. For instance, CnC employs temporal cycle-consistency and contrastive learning but is sensitive to order variations and redundant frames. Similarly, VAVA frames alignment as an optimal transport problem, incorporating an optimality prior along with inter-video and intra-video contrastive losses; however, a key limitation of VAVA is the difficulty of balancing its multiple losses and handling repeated actions. OPEL, which builds on VAVA, inherits these shortcomings. These limitations motivate the need for a self-supervised framework that can robustly align videos while addressing step order variations, redundant frames, and repeated actions, without relying on complex loss balancing or heavy supervision.

Approach

We propose a self-supervised procedure learning framework built upon a fused Gromov-Wasserstein optimal transport (FGWOT) formulation with a structural prior. Unlike previous OT-based approaches that utilize a traditional Kantorovich OT formulation, our method not only aligns frames based on appearance but also enforces a structural prior (i.e., temporal consistency) across videos. This makes our model better suited for handling step order variations, background/redundant frames, and repeated actions. Moreover, we empirically identify that optimizing only for temporal alignment may lead to degenerate solutions, where all frames collapse to a single cluster in the embedding space. To prevent this issue, we utilize Contrastive Inverse Difference Moment (C-IDM) as a regularization that encourages embedding diversity across frames. Our regularized Gromov-Wasserstein optimal transport (RGWOT) approach uses a unified loss with a purpose-aligned regularization, avoiding the difficulty of balancing multiple losses and conflicting regularizations. Our model outperforms prior works on both egocentric and third-person procedure learning benchmarks.
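To make the temporal alignment step concrete, below is a minimal sketch of how a fused Gromov-Wasserstein coupling between the frame embeddings of two videos could be computed with the POT library. The normalized temporal-distance matrices used here are only a simple stand-in for the structural prior described above, and the function and hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' implementation) of a fused Gromov-Wasserstein
# coupling between two videos' frame embeddings using the POT library.
import numpy as np
import ot  # pip install pot

def fgw_frame_coupling(X, Y, alpha=0.5):
    """X: (n, d) frame embeddings of video 1; Y: (m, d) frame embeddings of video 2."""
    n, m = len(X), len(Y)
    # Appearance cost: pairwise distances between frame embeddings.
    M = ot.dist(X, Y, metric="euclidean")
    M /= M.max() + 1e-8
    # Structural term: normalized temporal distances within each video
    # (a simple stand-in for the paper's structural prior).
    t1 = np.arange(n, dtype=float) / max(n - 1, 1)
    t2 = np.arange(m, dtype=float) / max(m - 1, 1)
    C1 = np.abs(t1[:, None] - t1[None, :])
    C2 = np.abs(t2[:, None] - t2[None, :])
    # Uniform marginals over frames.
    p, q = ot.unif(n), ot.unif(m)
    # Fused GW trades off appearance (alpha=0) against structure (alpha=1).
    T = ot.gromov.fused_gromov_wasserstein(M, C1, C2, p, q,
                                            loss_fun="square_loss", alpha=alpha)
    return T  # (n, m) soft frame-to-frame correspondence matrix
```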

Overview. Procedure learning methods usually learn video frame representations via temporal video alignment (a). The learned embeddings are then used for extracting key steps and their order (b). In this work, we rely on a regularized Gromov-Wasserstein optimal transport formulation for tackling order variations, background/redundant frames, and repeated actions in (a), yielding state-of-the-art results in (b).

Method. Our method incorporates a fused Gromov-Wasserstein optimal transport formulation with a structural prior for establishing frame-to-frame correspondences between videos, along with a contrastive regularization for avoiding degenerate solutions. Forward/backward arrows denote computation/gradient flows. Blue and orange/green represent temporal alignment and contrastive regularization, respectively.
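The contrastive regularization above can be sketched as a temporally windowed loss that pulls embeddings of nearby frames together and pushes embeddings of distant frames apart, preventing all frames from collapsing to a single cluster. The snippet below is a simplified approximation of C-IDM; the exact weighting and hyperparameters in the paper may differ.

```python
# Simplified sketch of a C-IDM-style contrastive regularizer (weights and
# hyperparameters are illustrative assumptions, not the paper's exact loss).
import torch

def contrastive_idm_loss(emb, window=15, margin=2.0, lam=1.0):
    """emb: (T, d) frame embeddings of one video."""
    T = emb.shape[0]
    idx = torch.arange(T, device=emb.device)
    gap = (idx[:, None] - idx[None, :]).abs().float()         # temporal separation
    d = torch.cdist(emb, emb, p=2)                             # pairwise embedding distances
    w = 1.0 / (1.0 + gap)                                      # inverse-difference-moment weight
    pos = (gap <= window).float()                              # temporally close pairs
    neg = 1.0 - pos                                            # temporally distant pairs
    attract = pos * w * d.pow(2)                               # pull close frames together
    repel = neg * lam * torch.clamp(margin - d, min=0).pow(2)  # push distant frames apart
    return (attract + repel).sum() / (T * T)
```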

Experimental Results

We conduct extensive experiments on both egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) datasets.

Comparisons on Egocentric Dataset

We provide a comparative evaluation of our RGWOT approach against previous methods on the large-scale egocentric benchmark of EgoProceL. EgoProceL is a recent dataset tailored for egocentric procedure learning, serving as a strong benchmark for evaluating new methods on egocentric videos. Remarkably, RGWOT consistently surpasses existing approaches across all sub-datasets of EgoProceL. Notably, RGWOT achieves an average improvement of 15.1% in F1 score and 15.3% in IoU compared to previous methods. These results demonstrate the superiority of our RGWOT formulation with a structural prior over traditional Kantorovich optimal transport formulations with an optimality prior.

Comparisons on Egocentric Dataset. Bold and underline denote the best and second best respectively.

Comparisons on Third-Person Datasets

We evaluate the performance of our RGWOT approach against previous self-supervised procedure learning models on two third-person datasets, ProceL and CrossTask. It is evident that RGWOT achieves the best overall results across both datasets, outperforming all existing models. Notably, RGWOT demonstrates a significant improvement over the previous best-performing approach, surpassing it by 9.4% on ProceL and by 5.4% on CrossTask in F1 score.

Comparisons on Third-Person Datasets. Bold and underline denote the best and second best respectively.

Comparisons with Multimodal Method

We compare the performance of our RGWOT approach against STEPs, a multimodal approach to unsupervised procedure learning that utilizes depth and gaze data along with RGB, whereas our RGWOT approach uses only RGB data. Our RGWOT approach outperforms STEPs on most datasets, achieving a lower F1 score only on EPIC-Tents while still surpassing it in IoU. It should also be noted that our approach outperforms models that utilize narrations along with video data.

Comparisons with Multimodal Method. Bold and underline denote the best and second best respectively.

Comparisons with Action Segmentation Methods

We compare our RGWOT approach against several state-of-the-art unsupervised action segmentation methods, as well as OPEL, on the ProceL and CrossTask datasets. On ProceL, RGWOT achieves a substantial improvement with the highest precision, recall, and F1 score, significantly surpassing previous methods. On CrossTask, although StepFormer achieves the highest recall, its F1 score remains relatively low at 28.3%, indicating a poor balance between precision and recall. In contrast, RGWOT achieves the highest F1 score of 40.4% with a well-balanced precision (40.4%) and recall (40.7%), which is critical since F1 provides a more reliable measure of overall segmentation quality. These results demonstrate that RGWOT is robust and effective, consistently outperforming other approaches across both datasets.
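For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so it rewards methods whose precision and recall are balanced. The snippet below illustrates this with hypothetical numbers (not drawn from the reported tables).

```python
# Illustration with hypothetical numbers: an imbalanced precision/recall pair
# yields a lower F1 than a balanced pair with the same arithmetic mean.
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.40, 0.40))  # balanced   -> 0.400
print(f1(0.15, 0.65))  # imbalanced -> 0.244 (same arithmetic mean of 0.40)
```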

Comparisons with Action Segmentation Methods. Bold and underline denote the best and second best respectively.

Qualitative results on ProceL. Colored segments represent predicted actions with a particular color denoting the same action across all models.

Citation

@article{mahmood2025procedure,
  title={Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport},
  author={Mahmood, Syed Ahmed and Ali, Ali Shah and Ahmed, Umer and Fateh, Fawad Javed and Zia, M Zeeshan and Tran, Quoc-Huy},
  journal={arXiv preprint arXiv:2507.15540},
  year={2025}
}