Research Contributions
Our researchers are actively involved in the broader industrial engineering and computer vision research communities. We share some of this technical work here.

Video LLMs for Temporal Reasoning in Long Videos
We propose TemporalVLM, a video large language model for effective temporal reasoning and fine-grained understanding in long videos. TemporalVLM includes two novel components: (i) a time-aware clip encoder and (ii) a bidirectional long short-term memory module.
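The two components can be illustrated with a highly simplified, hypothetical sketch: a time-aware encoder that injects each clip's timestamp into its feature vector, and bidirectional context aggregation standing in for the BiLSTM module. All function names, the sinusoidal time code, and the moving-average aggregation rule below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of (i) a time-aware clip encoder and (ii) bidirectional
# context aggregation (a crude stand-in for a BiLSTM). Illustrative only.
import math

def time_aware_encode(clip_feats, timestamps, dim=4):
    """Append a sinusoidal timestamp encoding to each clip feature vector."""
    encoded = []
    for feat, t in zip(clip_feats, timestamps):
        time_code = [math.sin(t / 10.0 ** (2 * i / dim)) for i in range(dim)]
        encoded.append(feat + time_code)  # list concatenation
    return encoded

def bidirectional_context(feats, decay=0.5):
    """Exponential moving averages in both directions, mimicking a BiLSTM's
    forward and backward hidden states; outputs are concatenated."""
    n, d = len(feats), len(feats[0])
    fwd, bwd = [None] * n, [None] * n
    state = [0.0] * d
    for i in range(n):
        state = [decay * s + (1 - decay) * x for s, x in zip(state, feats[i])]
        fwd[i] = state
    state = [0.0] * d
    for i in reversed(range(n)):
        state = [decay * s + (1 - decay) * x for s, x in zip(state, feats[i])]
        bwd[i] = state
    return [f + b for f, b in zip(fwd, bwd)]

clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy clip features
enc = time_aware_encode(clips, timestamps=[0.0, 1.0, 2.0])
ctx = bidirectional_context(enc)
print(len(ctx), len(ctx[0]))  # 3 clips, doubled feature dimension
```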

Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
We propose a self-supervised procedure learning framework built on a fused Gromov-Wasserstein optimal transport formulation. A structural prior establishes frame-to-frame correspondences between videos, while a contrastive regularization prevents collapse to degenerate solutions, yielding superior results on large-scale egocentric and third-person benchmarks. Our peer-reviewed research has been accepted for publication at WACV 2026.
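As a much-simplified illustration of optimal-transport-based frame matching, the sketch below runs entropy-regularized (Sinkhorn) transport between two videos' frames. The fused Gromov-Wasserstein terms, structural prior, and contrastive regularization of the actual method are omitted; the cost matrix and all names are toy assumptions.

```python
# Minimal Sinkhorn-style entropic OT between two videos' frames, a
# simplified stand-in for the fused Gromov-Wasserstein formulation above.
import math

def sinkhorn(cost, n_iters=200, eps=0.1):
    """Entropy-regularized OT with uniform marginals; returns the soft
    frame-to-frame correspondence (transport) matrix."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost between frames of video A and video B (lower = more similar).
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
T = sinkhorn(cost)
# Diagonal entries dominate: frame i of A matches mostly to frame i of B.
print(all(T[i][i] == max(T[i]) for i in range(3)))
```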

Joint Self-Supervised Video Alignment and Action Segmentation
We propose a novel approach for joint self-supervised video alignment and action segmentation based on a unified optimal transport framework, yielding comparable video alignment yet superior action segmentation results while reducing both runtime and memory consumption. Our peer-reviewed research has been accepted for publication at ICCV 2025.

Action Segmentation Using 2D Skeleton Heatmaps and Multi-Modality Fusion
We use 2D skeleton heatmaps and multi-modality fusion to obtain state-of-the-art results on action segmentation. Our peer-reviewed research has been accepted for publication at ICRA 2024.

Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion
We learn representations by aligning 2D skeleton sequences and applying multi-modality fusion, obtaining state-of-the-art results on several fine-grained human activity understanding tasks. Our peer-reviewed research has been accepted for publication at ECCV 2024.

Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment Alignment
We present a novel transformer-based framework for unsupervised activity segmentation, which leverages not only frame-level cues but also segment-level cues. Our peer-reviewed research has been accepted for publication at WACV 2024.

Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering
We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. Our peer-reviewed research has been accepted for publication at CVPR 2022.

Timestamp-Supervised Action Segmentation with Graph Convolutional Networks
We introduce an approach for temporal activity segmentation with timestamp supervision based on a graph convolutional network. The network is learned end-to-end to exploit both frame features and connections between neighboring frames, generating dense framewise labels from sparse timestamp labels. Our peer-reviewed research has been accepted for publication at IROS 2022.
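The core idea of densifying sparse timestamp labels over a frame graph can be sketched with plain label propagation along temporal edges; the actual method learns this end-to-end with a graph convolutional network over frame features. Every name and number below is an illustrative assumption.

```python
# Hypothetical sketch: propagate sparse timestamp labels to dense framewise
# labels over a chain graph of temporally adjacent frames. A simple
# label-propagation stand-in for the end-to-end GCN described above.

def propagate_labels(n_frames, timestamps, n_iters=50):
    """timestamps: {frame_index: class_id}. Each unlabeled frame repeatedly
    averages its neighbors' soft labels (one propagation step per iter)."""
    classes = sorted(set(timestamps.values()))
    k = len(classes)
    # One-hot soft labels at annotated frames, uniform elsewhere.
    soft = [[1.0 / k] * k for _ in range(n_frames)]
    for f, c in timestamps.items():
        soft[f] = [1.0 if cls == c else 0.0 for cls in classes]
    for _ in range(n_iters):
        new = [row[:] for row in soft]
        for i in range(n_frames):
            if i in timestamps:          # keep annotated frames fixed
                continue
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n_frames]
            new[i] = [sum(soft[j][c] for j in nbrs) / len(nbrs)
                      for c in range(k)]
        soft = new
    # Dense framewise labels: argmax over propagated soft labels.
    return [max(range(k), key=lambda c: row[c]) for row in soft]

# Ten frames, timestamp labels only at frame 2 (class 0) and frame 7 (class 1).
dense = propagate_labels(10, {2: 0, 7: 1})
print(dense)  # frames split into a class-0 run followed by a class-1 run
```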

Learning by Aligning Videos in Time
We present a self-supervised approach for training viewpoint-, actor-, and scene-invariant video representations. Our peer-reviewed research was accepted for publication at CVPR 2021.

Towards Anomaly Detection in Dashcam Videos
We present a novel framework for unsupervised anomaly detection in video streams, evaluated on dashcam video datasets. Our peer-reviewed research was published at IV 2020.

ICCV Workshop on Computer Vision for the Factory Floor
With this workshop, we bring together computer vision researchers and leaders from academia and industry to exchange ideas at the intersection of computer vision and the smart factory.

Domain-Specific Priors and Meta Learning for Few-Shot First-Person Action Recognition
We present a novel approach for activity recognition in few-shot settings. Our peer-reviewed research was published in the IEEE Transactions on Pattern Analysis and Machine Intelligence in 2021.