Joint Self-Supervised Video Alignment and Action Segmentation

Retrocausal, Inc.
Abstract
Motivation
One popular group of self-supervised video alignment methods relies on global alignment techniques widely used in the time series literature. For example, LAV utilizes dynamic time warping, which assumes monotonic orderings and no background/redundant frames. VAVA relaxes these assumptions by incorporating an optimality prior into a standard Kantorovich optimal transport framework, together with an inter-video contrastive term and an intra-video contrastive term. However, balancing multiple losses and handling repeated actions remain challenging. Similarly, self-supervised action segmentation methods based on optimal transport have been introduced, including TOT and UFSA, which may suffer in cases of order variations, unbalanced segmentation, and repeated actions. ASOT addresses these drawbacks via a fused Gromov-Wasserstein optimal transport framework with a structural prior and outperforms previous works in self-supervised action segmentation. Lastly, although both self-supervised video alignment and action segmentation require fine-grained temporal understanding of videos, their interaction in a multi-task learning setup has not been explored.
Approach
Motivated by the above observations, we first propose VAOT, a novel self-supervised video alignment approach based on a fused Gromov-Wasserstein (FGW) optimal transport formulation with a structural prior, which tackles order variations, background/redundant frames, and repeated actions in a single global alignment framework. Our single-task model trains efficiently on GPUs and requires few iterations to derive the optimal transport solution, while outperforming previous methods, including VAVA, on video alignment datasets. Moreover, we develop VASOT, a joint self-supervised video alignment and action segmentation approach that exploits the relationship between the two tasks via a unified optimal transport framework. Our multi-task model performs on par with prior works on video alignment benchmarks while establishing a new state of the art on action segmentation benchmarks. In addition, our joint model requires training and storing only a single model, saving both time and memory compared to two separate single-task models. Lastly, we empirically observe that, in a multi-task learning setting, action segmentation provides little boost to video alignment results, whereas video alignment improves action segmentation performance significantly. To the best of our knowledge, our work is the first to analyze the relationship between video alignment and action segmentation.
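The paper does not include a reference implementation here, but entropic FGW transport plans of this kind are commonly approximated by alternating a linearization of the Gromov-Wasserstein term with Sinkhorn iterations. The sketch below is a minimal NumPy version under that assumption; the cost matrix `C` (frame-to-frame feature cost), the structural priors `Cx`, `Cy` (e.g., temporal distances within each video), and all parameter values are illustrative, not the authors' exact formulation.

```python
import numpy as np

def sinkhorn(M, a, b, eps=0.1, n_iters=100):
    """Entropic OT: approximate the plan T minimizing <T, M> - eps*H(T)
    subject to row marginals a and column marginals b."""
    K = np.exp(-M / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def fgw_align(C, Cx, Cy, alpha=0.3, eps=0.1, n_outer=10):
    """Approximate fused Gromov-Wasserstein plan between two frame sequences.

    C  : (n, m) feature cost between the two videos' frame embeddings.
    Cx : (n, n), Cy : (m, m) structural priors on each video.
    alpha balances the feature (Wasserstein) and structural (GW) terms.
    """
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    T = np.outer(a, b)  # initialize with the independent coupling
    for _ in range(n_outer):
        # Linearized cost: for the squared loss, the gradient of the GW
        # term reduces to -2 * Cx @ T @ Cy (up to T-independent constants).
        M = (1 - alpha) * C - 2 * alpha * (Cx @ T @ Cy)
        M = M - M.min()  # constant shift for numerical stability
        T = sinkhorn(M, a, b, eps)
    return T
```

In a pipeline like the one described above, the resulting plan would serve as a soft frame-to-frame correspondence that supervises the alignment loss; few outer iterations typically suffice, which is consistent with the efficiency claim.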
Overview. (a) Our self-supervised video alignment method (VAOT) based on a fused Gromov-Wasserstein optimal transport with structural priors {Cx,Cy}. (b) Our joint self-supervised video alignment and action segmentation method (VASOT) based on a unified optimal transport with structural priors {Cx,Cy} for video alignment and {Cx,Ca} and {Cy,Ca} for action segmentation.
Self-supervised learning. (a) Our self-supervised video alignment method (VAOT). (b) Our joint self-supervised video alignment and action segmentation method (VASOT). Learnable parameters are shown in red. Arrows denote computation/gradient flows (blue and green represent video alignment and action segmentation respectively).
Experimental Results
We conduct extensive experiments on several video alignment and action segmentation datasets, namely Pouring, Penn Action, IKEA ASM, 50 Salads, YouTube Instructions, Breakfast, and Desktop Assembly, to validate the advantages of our single-task method (VAOT) and our multi-task method (VASOT).
Video Alignment Comparison Results
We now benchmark our VAOT and VASOT approaches against previous self-supervised video alignment methods and present quantitative results. Firstly, it is evident from the results that our VAOT approach achieves the best overall performance across all datasets, outperforming all competing methods, namely SAL, TCN, TCC, LAV, VAVA, and GTCC. In particular, VAOT shows major improvements over VAVA on the in-the-wild IKEA ASM dataset. The results confirm the advantage of our FGW formulation with a structural prior over the classical Kantorovich formulation with an optimality prior in VAVA. Next, we show some qualitative results, where VAOT retrieves all 5 correct frames with the same action (Hands at shoulder) as the query image, while LAV obtains 3 incorrect frames (Hands above head), highlighted by red ovals. Lastly, we find that action segmentation offers little benefit to video alignment in a multi-task learning setup, and our multi-task VASOT approach performs mostly on par with our single-task VAOT approach. This is likely because video alignment is a more complex problem involving finer-grained frame-to-frame assignment, as compared to the coarser-grained frame-to-action assignment in action segmentation. Nevertheless, VASOT obtains mostly favorable results over prior works. Deep supervision has the potential to improve VASOT by placing the more complex video alignment loss at a deeper layer, which we leave for future work. Please see our supplementary material for multi-action evaluation.
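Fine-grained frame retrieval of this kind is typically evaluated by embedding all frames with the learned encoder and returning nearest neighbors in the embedding space. The sketch below illustrates that evaluation step under this assumption, using cosine similarity over hypothetical precomputed embeddings (it is not the authors' evaluation code).

```python
import numpy as np

def retrieve_top_k(query_emb, support_embs, k=5):
    """Return indices of the k support frames most similar to the query,
    measured by cosine similarity in the learned embedding space."""
    q = query_emb / np.linalg.norm(query_emb)
    S = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = S @ q                    # cosine similarity per support frame
    return np.argsort(-sims)[:k]    # highest-similarity frames first
```

Retrieval precision at k is then the fraction of the k retrieved frames that share the query frame's action label.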
Video alignment comparison results. Bold and underline denote the best and second best respectively.
Fine-grained frame retrieval results on Penn Action. The query image is on the left, while on the right are the top 5 matching images retrieved by VAOT (blue box) and LAV (red box).
Action Segmentation Comparison Results
We test our VASOT approach against state-of-the-art self-supervised action segmentation methods and present quantitative results. Firstly, VASOT consistently achieves the best results across all metrics and datasets, outperforming all competing methods, namely CTE, VTE, UDE, ASAL, TOT, UFSA, and ASOT. While our multi-task VASOT approach shows small gains over the single-task ASOT baseline on Breakfast and YouTube Instructions, our improvements on 50 Salads and Desktop Assembly are substantial. The results validate the benefit of fusing video alignment with action segmentation and demonstrate that video alignment boosts action segmentation results notably in a multi-task learning setup. Moreover, we show some qualitative results, where VASOT predicts segmentations that capture action boundaries more accurately and align more closely with the ground truth than ASOT. Please refer to our supplementary material for per-video evaluation.
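In self-supervised action segmentation, predictions are typically decoded by taking the most likely action per frame from the soft frame-to-action assignment, and accuracy (mean-over-frames, MoF) is reported after matching predicted clusters to ground-truth actions with a Hungarian assignment. The sketch below illustrates this standard evaluation protocol, not the authors' exact code; the brute-force permutation search stands in for a proper Hungarian solver and assumes a small number of actions.

```python
import itertools
import numpy as np

def decode_segmentation(T):
    """Frame labels from a soft frame-to-action assignment T
    of shape (n_frames, n_actions): most likely action per frame."""
    return T.argmax(axis=1)

def mof_with_matching(pred, gt, n_actions):
    """Mean-over-frames accuracy under the best one-to-one mapping from
    predicted clusters to ground-truth actions. Brute-force over
    permutations, which is fine for a handful of actions."""
    best = 0.0
    for perm in itertools.permutations(range(n_actions)):
        mapped = np.array([perm[p] for p in pred])
        best = max(best, float(np.mean(mapped == gt)))
    return best
```

Because cluster indices carry no inherent meaning in the self-supervised setting, the matching step is what makes MoF scores comparable across methods.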
Action segmentation comparison results. Bold and underline denote the best and second best respectively.
Action segmentation results on Breakfast (top) and YouTube Instructions (bottom).
Citation
@article{ali2025joint,
title={Joint Self-Supervised Video Alignment and Action Segmentation},
author={Ali, Ali Shah and Mahmood, Syed Ahmed and Saeed, Mubin and Konin, Andrey and Zia, M Zeeshan and Tran, Quoc-Huy},
journal={arXiv preprint arXiv:2503.16832},
year={2025}
}