STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Abstract

Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated videos limits their broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that generates audio by conditioning on both text and video features, ensuring semantic and temporal alignment with the videos. First, we propose an onset prediction pretext task for extracting local temporal features from video and an attentive pooling module for obtaining global semantic features of video. We then propose a Latent Diffusion Model initialized with Text-to-Audio prior knowledge and guided by cross-modal features from text and video. Finally, we propose a new objective metric, AA-Align, to evaluate temporal alignment between the generated and target audio. Subjective and objective experiments demonstrate that our method surpasses existing Video-to-Audio models in audio quality, semantic alignment, and temporal alignment. Ablation studies further validate the effectiveness of each module.
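The page does not detail the attentive pooling module, so the following is only a minimal sketch of generic attentive pooling over per-frame video features; the class name, feature dimension, and single-query scoring head are illustrative assumptions, not the exact architecture used in STA-V2A.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Pool per-frame video features into one global semantic vector.

    A hedged sketch: a learnable scorer assigns an importance weight to
    each frame, and the global feature is the weighted sum of frames.
    """

    def __init__(self, feature_dim: int):
        super().__init__()
        # Learnable scoring function producing one scalar weight per frame.
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim)
        weights = torch.softmax(self.score(frame_features), dim=1)  # (batch, num_frames, 1)
        # Weighted sum over time yields one global vector per video.
        return (weights * frame_features).sum(dim=1)                # (batch, feature_dim)

# Example usage with hypothetical shapes: 8 videos, 32 frames, 512-dim features.
pooling = AttentivePooling(feature_dim=512)
global_semantic = pooling(torch.randn(8, 32, 512))
print(global_semantic.shape)  # torch.Size([8, 512])
```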

Framework of STAV2A

Code

Our code will be available at:

STAV2A

Demos


Demo samples 1-10 compare the ground truth (GT) with STAV2A (Ours), Im2Wav, Diff-Foley, VTA-LDM, and FoleyCrafter.