Joint Self-Supervised Video Alignment and Action Segmentation

Retrocausal, Inc.

Abstract

Our peer-reviewed research has been accepted for publication at ICCV 2025. We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations to solve the optimal transport problem. Our single-task method achieves state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing only a single model, reducing both time and memory consumption compared to two separate single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves video alignment results comparable to, and action segmentation results superior to, previous single-task methods. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.

Motivation

One popular group of self-supervised video alignment methods relies on global alignment techniques widely used in the time series literature. For example, LAV utilizes dynamic time warping, which assumes monotonic orderings and no background/redundant frames. VAVA relaxes these assumptions by incorporating an optimality prior into a standard Kantorovich optimal transport framework, along with an inter-video contrastive term and an intra-video contrastive term. However, balancing these multiple losses is challenging, and repeated actions remain difficult to handle. Similarly, self-supervised action segmentation methods based on optimal transport have been introduced, including TOT and UFSA, which may suffer in cases of order variations, unbalanced segmentation, and repeated actions. ASOT addresses these drawbacks via a fused Gromov-Wasserstein optimal transport framework with a structural prior, which outperforms previous works in self-supervised action segmentation. Lastly, though both self-supervised video alignment and action segmentation require fine-grained temporal understanding of videos, their interaction in a multi-task learning setup has not been explored.

Approach

Motivated by the above observations, we first propose VAOT, a novel self-supervised video alignment approach based on a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which tackles order variations, background/redundant frames, and repeated actions in a single global alignment framework. Our single-task model trains efficiently on GPUs and requires only a few iterations to derive the optimal transport solution, while outperforming previous methods, including VAVA, on video alignment datasets. Moreover, we develop VASOT, a joint self-supervised video alignment and action segmentation approach, by exploring the relationship between self-supervised video alignment and action segmentation via a unified optimal transport framework. Our multi-task model performs on par with prior works on video alignment benchmarks yet establishes the new state of the art on action segmentation benchmarks. In addition, our joint model requires training and storing only a single model, reducing both time and memory consumption compared to two separate single-task models. Lastly, we empirically observe that, in a multi-task learning setting, action segmentation provides little boost to video alignment results, whereas video alignment increases action segmentation performance significantly. To the best of our knowledge, our work is the first to analyze the relationship between video alignment and action segmentation.
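To make the formulation concrete, the following is a minimal NumPy sketch of a generic entropic fused Gromov-Wasserstein scheme: the coupling is refined by alternating a linearized Gromov-Wasserstein cost (built from the structural priors Cx and Cy) with Sinkhorn projections onto the marginal constraints. This is not the paper's solver; the function names, toy data, and all hyperparameter values (alpha, eps, iteration counts) are illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, p, q, eps=0.05, n_iters=200):
    """Entropic OT: scale K = exp(-cost/eps) to match marginals p and q."""
    K = np.exp(-cost / eps)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def entropic_fgw(M, Cx, Cy, p, q, alpha=0.3, eps=0.05, n_outer=10):
    """Simplified entropic fused Gromov-Wasserstein sketch (illustrative).

    M  : (n, m) feature-space cost between frames of the two videos
    Cx : (n, n) structural prior on video x (here, temporal distances)
    Cy : (m, m) structural prior on video y
    alpha balances the GW (structural) and Kantorovich (feature) terms.
    """
    T = np.outer(p, q)  # independent coupling as initialization
    for _ in range(n_outer):
        # Linearized GW cost for the squared loss (constant terms dropped)
        gw_cost = -2.0 * Cx @ T @ Cy.T
        T = sinkhorn((1 - alpha) * M + alpha * gw_cost, p, q, eps)
    return T

# Toy example: two "videos" of 6 and 8 frames with 1-D features
rng = np.random.default_rng(0)
fx, fy = np.sort(rng.random(6)), np.sort(rng.random(8))
M = (fx[:, None] - fy[None, :]) ** 2
Cx = np.abs(np.arange(6)[:, None] - np.arange(6)[None, :]) / 6.0
Cy = np.abs(np.arange(8)[:, None] - np.arange(8)[None, :]) / 8.0
p, q = np.full(6, 1 / 6), np.full(8, 1 / 8)
T = entropic_fgw(M, Cx, Cy, p, q)  # soft frame-to-frame coupling
```

Because every step is a matrix product or an elementwise operation, the same loop runs efficiently on GPUs when the arrays are replaced with framework tensors, which is in the spirit of the efficiency claim above.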

Overview. (a) Our self-supervised video alignment method (VAOT) based on a fused Gromov-Wasserstein optimal transport with structural priors {Cx,Cy}. (b) Our joint self-supervised video alignment and action segmentation method (VASOT) based on a unified optimal transport with structural priors {Cx,Cy} for video alignment and {Cx,Ca} and {Cy,Ca} for action segmentation.

Self-supervised learning. (a) Our self-supervised video alignment method (VAOT). (b) Our joint self-supervised video alignment and action segmentation method (VASOT). Learnable parameters are shown in red. Arrows denote computation/gradient flows (blue and green represent video alignment and action segmentation respectively).

Experimental Results

We conduct extensive experiments on several video alignment and action segmentation datasets, i.e., Pouring, Penn Action, IKEA ASM, 50 Salads, YouTube Instructions, Breakfast, and Desktop Assembly, to validate the advantages of our single-task method (VAOT) and our multi-task method (VASOT).

Video Alignment Comparison Results

We now benchmark our VAOT and VASOT approaches against previous self-supervised video alignment methods and present quantitative results. Firstly, it is evident from the results that our VAOT approach achieves the best overall performance across all datasets, outperforming all competing methods, i.e., SAL, TCN, TCC, LAV, VAVA, and GTCC. In particular, VAOT shows major improvements over VAVA on the in-the-wild IKEA ASM dataset. The results confirm the advantage of our FGW formulation with a structural prior over the classical Kantorovich formulation with an optimality prior in VAVA. Next, we show some qualitative results, where VAOT retrieves all 5 correct frames with the same action (Hands at shoulder) as the query image, while LAV obtains 3 incorrect frames (Hands above head), highlighted by red ovals. Lastly, we find that action segmentation offers little benefit to video alignment in a multi-task learning setup, and our multi-task VASOT approach performs mostly similarly to our single-task VAOT approach. This is likely because video alignment is a more complex problem involving finer-grained frame-to-frame assignment, as compared to the coarser-grained frame-to-action assignment in action segmentation. Nevertheless, VASOT obtains mostly favorable results over prior works. Deep supervision has the potential to improve VASOT by placing the more complex video alignment loss at a deeper layer, which we leave for future work. Please see our supplementary material for multi-action evaluation.

Video alignment comparison results. Bold and underline denote the best and second best respectively.

Fine-grained frame retrieval results on Penn Action. The query image is on the left, while on the right are the top 5 matching images retrieved by VAOT (blue box) and LAV (red box).

Action Segmentation Comparison Results

We test our VASOT approach against state-of-the-art self-supervised action segmentation methods and present quantitative results. Firstly, from the results, VASOT consistently achieves the best results across all metrics and datasets, outperforming all competing methods, i.e., CTE, VTE, UDE, ASAL, TOT, UFSA, and ASOT. While our multi-task VASOT approach shows small gains over the single-task ASOT baseline on Breakfast and YouTube Instructions, our improvements on 50 Salads and Desktop Assembly are substantial. The results validate the benefit of fusing video alignment with action segmentation and demonstrate that video alignment boosts action segmentation results notably in a multi-task learning setup. Moreover, we show some qualitative results, where VASOT predicts segmentations that capture action boundaries more accurately and are more closely aligned with ground truth than those of ASOT. Please refer to our supplementary material for per-video evaluation.

Action segmentation comparison results. Bold and underline denote the best and second best respectively.

Action segmentation results on Breakfast (top) and YouTube Instructions (bottom).

Citation

@article{ali2025joint,
  title={Joint Self-Supervised Video Alignment and Action Segmentation},
  author={Ali, Ali Shah and Mahmood, Syed Ahmed and Saeed, Mubin and Konin, Andrey and Zia, M Zeeshan and Tran, Quoc-Huy},
  journal={arXiv preprint arXiv:2503.16832},
  year={2025}
}