Video LLMs for Temporal Reasoning in Long Videos

* indicates joint first author.

Retrocausal, Inc.

Abstract

This paper introduces TemporalVLM, a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos. At its core, our approach includes a visual encoder that maps a long-term input video into time-aware features containing both local and global cues. In particular, it first divides the input video into short-term clips, which are jointly encoded with their timestamps into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory module for global feature aggregation. The extracted time-aware, multi-level features are important for accurate temporal reasoning and fine-grained understanding in long videos. Extensive experiments on long-video datasets, including TimeIT and IndustryASM, show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.

Limitations of Existing Video LLMs

[Teaser figure]
Previous video large language models (a-c) are usually not time-sensitive (a, b), treat an input video as a single clip (a, c), and apply a pooling operation (a, b) or query aggregation (c) to aggregate global semantic information. In contrast, our model (d) includes a time-aware clip encoder, which extracts time-aware fine-grained cues from short-term clips sampled from a long-term input video, and a bidirectional long short-term memory module, which captures long-range temporal dependencies across multiple clips. The extracted time-aware, multi-level features are crucial for temporal reasoning in long videos.

TemporalVLM

[Method overview figure]
We propose TemporalVLM, a video large language model for effective temporal reasoning and fine-grained understanding in long videos. TemporalVLM consists of two main architectural contributions: (i) a time-aware clip encoder and (ii) a bidirectional long short-term memory module. In particular, given a long-term input video with timestamps, we first divide it into short-term clips with timestamps. The time-aware clip encoder is then applied to each clip along with its timestamps to extract time-aware local features, which capture local fine-grained cues within each clip. Next, the bidirectional long short-term memory module computes global features, which capture global temporal relationships across multiple clips. Lastly, the large language model takes both video and text tokens as inputs and generates responses to user queries.
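
To make the pipeline concrete, below is a minimal PyTorch sketch of the data flow described above. The module names, feature dimensions, and the way timestamps are embedded are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn


class TimeAwareClipEncoder(nn.Module):
    """Encodes a short-term clip together with its timestamps (hypothetical design)."""

    def __init__(self, feat_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, feat_dim)   # stand-in for a frozen visual backbone
        self.time_embed = nn.Linear(2, feat_dim)           # embeds (start, end) timestamps

    def forward(self, clip_feats, timestamps):
        # clip_feats: (num_frames, feat_dim), timestamps: (2,) -> one token per clip
        visual = self.visual_proj(clip_feats).mean(dim=0)
        return visual + self.time_embed(timestamps)


class TemporalVLMSketch(nn.Module):
    """Time-aware clip encoding followed by BiLSTM aggregation across clips."""

    def __init__(self, feat_dim=768, llm_dim=4096):
        super().__init__()
        self.clip_encoder = TimeAwareClipEncoder(feat_dim)
        self.bilstm = nn.LSTM(feat_dim, feat_dim // 2, bidirectional=True, batch_first=True)
        self.to_llm = nn.Linear(feat_dim, llm_dim)          # projects video tokens into the LLM space

    def forward(self, clips, clip_timestamps):
        # clips: list of (num_frames, feat_dim) tensors; clip_timestamps: list of (2,) tensors
        local = torch.stack([self.clip_encoder(c, t) for c, t in zip(clips, clip_timestamps)])
        global_feats, _ = self.bilstm(local.unsqueeze(0))   # (1, num_clips, feat_dim)
        video_tokens = self.to_llm(global_feats)            # to be prepended to text tokens
        return video_tokens

In the full model, the resulting video tokens would be concatenated with tokenized text instructions before being passed to the large language model.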

Experimental Results

We conduct extensive experiments on datasets of long videos, i.e., TimeIT and IndustryASM, to show our superior results over previous methods across temporal reasoning and fine-grained understanding tasks such as dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.
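
As an example of how one of these tasks is commonly scored, the sketch below computes the temporal IoU and R@1 recall typically reported for temporal video grounding on benchmarks such as Charades-STA; this is a standard protocol, not necessarily the paper's exact evaluation code.

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: one query grounded at [12.0, 34.5] s against ground truth [10.0, 30.0] s.
print(recall_at_1([(12.0, 34.5)], [(10.0, 30.0)], threshold=0.5))  # IoU ~0.73 -> recall 1.0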

Quantitative Comparisons

We finetune all models and evaluate them on YouCook2 for dense video captioning (top left), QVHighlights for video highlight detection (top middle), Charades-STA for temporal video grounding (top right), and IndustryASM for temporal action segmentation (bottom). The results are presented in the tables below. Best numbers are in bold, while second-best numbers are underlined. The results show that TemporalVLM significantly outperforms previous video large language models such as TimeChat and LongVLM. This is likely because TimeChat treats the entire video as a single clip and relies on query aggregation for global information, whereas LongVLM lacks a time-sensitive encoder and applies a pooling operation for global feature aggregation.
[Quantitative comparison tables]
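
To illustrate the last point, the toy snippet below (our own illustration, not code from any of the compared models) shows that mean pooling over clip features is invariant to clip order and therefore discards temporal structure, whereas a bidirectional LSTM produces order-dependent outputs.

import torch
import torch.nn as nn

torch.manual_seed(0)
clips = torch.randn(1, 8, 64)                 # (batch, num_clips, feat_dim)
reversed_clips = torch.flip(clips, dims=[1])  # same clips, reversed temporal order

# Mean pooling: identical result for both orders -> no temporal sensitivity.
print(torch.allclose(clips.mean(dim=1), reversed_clips.mean(dim=1)))  # True

# BiLSTM: outputs depend on clip order -> temporal dependencies are preserved.
bilstm = nn.LSTM(64, 32, bidirectional=True, batch_first=True)
out_fwd, _ = bilstm(clips)
out_rev, _ = bilstm(reversed_clips)
print(torch.allclose(out_fwd, out_rev))                               # False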

Qualitative Comparisons

Example results of dense video captioning in the zero-shot setting on a YouCook2 video are shown below. Inaccuracies are shown in red. TimeChat fails to predict any correct timestamps and tends to hallucinate. LongVLM predicts some timestamps correctly and generates more accurate captions, although it still hallucinates objects not present in the video and produces inaccurate or repeated captions. In contrast, TemporalVLM produces considerably more accurate timestamps and captions. Moreover, it is the only model that provides captions through the end of the video, demonstrating its effectiveness for long video understanding.
[Qualitative comparison: dense video captioning on a YouCook2 video]
Example results of temporal action segmentation in the supervised setting on an IndustryASM video are illustrated below. Black represents background frames. TimeChat falls short in segmenting the video correctly and hallucinates actions not present in the video. LongVLM fares better in terms of the actions it detects but struggles to predict accurate segment boundaries and hallucinates actions later in the video. In contrast, TemporalVLM predicts action segments that are significantly closer to the ground truth.
[Qualitative comparison: temporal action segmentation on an IndustryASM video]
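
For temporal action segmentation, a simple way to quantify how close predicted segments are to the ground truth is frame-wise accuracy. The sketch below uses a hypothetical segment format and action names; it is an illustrative metric computation, not the paper's evaluation code.

def segments_to_frames(segments, num_frames, background="background"):
    """segments: list of (start_frame, end_frame, action) with end exclusive."""
    labels = [background] * num_frames
    for start, end, action in segments:
        for f in range(max(0, start), min(num_frames, end)):
            labels[f] = action
    return labels

def frame_accuracy(pred_segments, gt_segments, num_frames):
    """Fraction of frames whose predicted label matches the ground truth."""
    pred = segments_to_frames(pred_segments, num_frames)
    gt = segments_to_frames(gt_segments, num_frames)
    return sum(p == g for p, g in zip(pred, gt)) / num_frames

# Example with two hypothetical actions over a 100-frame video.
gt = [(0, 40, "pick screw"), (40, 90, "tighten screw")]
pred = [(0, 35, "pick screw"), (35, 95, "tighten screw")]
print(frame_accuracy(pred, gt, 100))  # 0.9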

Citation


@article{fateh2024video,
    title={Video LLMs for Temporal Reasoning in Long Videos},
    author={Fateh, Fawad Javed and Ahmed, Umer and Khan, Hamza and Zia, M Zeeshan and Tran, Quoc-Huy},
    journal={arXiv preprint arXiv:2412.02930},
    year={2024}
}