Tell What You Hear From What You See - Video to Audio Generation Through Text (NeurIPS 2024)

Xiulong Liu, Kun Su, Eli Shlizerman
University of Washington
Teaser Image
VATT is a flexible audio generative model that operates in two modes: (i) when a silent video is the sole input, the model generates audio along with a caption describing the audio that could match the video; (ii) when a text prompt is provided in addition to the video, the model generates audio aligned with both the video and the prompt.
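
For concreteness, a minimal sketch of the two call patterns described above is shown below. The VATT class and its generate() signature are placeholders invented for illustration; they assume nothing about the released code.

from typing import Optional, Tuple

class VATT:
    """Placeholder stand-in for the real model; for illustration only."""
    def generate(self, video, prompt: Optional[str] = None) -> Tuple[list, Optional[str]]:
        # Mode i (video only): generate audio and also caption it.
        # Mode ii (video + text): generate audio steered by the prompt.
        caption = "generated audio caption" if prompt is None else None
        audio = []  # waveform samples would be returned here
        return audio, caption

model = VATT()
audio, caption = model.generate(video="silent_clip.mp4")                  # mode i
audio, _ = model.generate(video="silent_clip.mp4", prompt="dog barking")  # mode ii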

Abstract

The content of visual and audio scenes is multi-faceted: a video stream can be paired with various plausible audio streams and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description (caption) of the audio. Such a framework has two unique advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for the video by producing audio captions. VATT consists of two key modules: VATT Converter, an instruction-tuned LLM with a projection layer that maps video features into the LLM vector space, and VATT Audio, a bidirectional transformer that generates audio tokens from the visual frames and the optional text prompt using iterative parallel decoding. A pretrained neural codec then converts the generated audio tokens into a waveform. Our experiments show that when VATT is compared to existing video-to-audio generation methods on objective metrics over the VGGSound audiovisual dataset, it achieves competitive performance when no audio caption is provided, and even more refined performance when an audio caption is given as a prompt (with a lowest KLD score of 1.41). Furthermore, in subjective studies asking participants to choose the most compatible generated audio for a given silent video, audio generated by VATT was on average preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text, as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
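
The "iterative parallel decoding" mentioned above follows the masked-token generation scheme popularized by MaskGIT: start from a fully masked token sequence, predict all positions in parallel, keep the most confident predictions, and re-mask the rest on a shrinking schedule. The sketch below illustrates that loop under assumed details; the cosine schedule, step count, vocabulary size, and dummy model stub are illustrative choices, not VATT's released implementation.

import math
import torch

def iterative_parallel_decode(model, cond, seq_len, mask_id,
                              num_steps=12, device="cpu"):
    """Decode a token sequence by iterative parallel refinement.

    model: callable(tokens, cond) -> logits of shape (1, seq_len, vocab_size).
    cond:  conditioning features (e.g., video frames and text embeddings).
    """
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = model(tokens, cond)                 # predict all positions at once
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)               # per-position confidence
        committed = tokens != mask_id                # already-accepted positions
        conf = torch.where(committed, torch.ones_like(conf), conf)
        tokens = torch.where(committed, tokens, pred)
        # Cosine schedule: fraction of positions to re-mask this step.
        num_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))
        if num_mask == 0:
            break                                    # everything is committed
        remask = conf.topk(num_mask, largest=False).indices  # least confident
        tokens[0, remask[0]] = mask_id
    return tokens

# Toy usage: a random stub stands in for the bidirectional transformer.
torch.manual_seed(0)
dummy = lambda tok, cond: torch.randn(tok.shape[0], tok.shape[1], 1024)
codes = iterative_parallel_decode(dummy, cond=None, seq_len=256, mask_id=1024)

A pretrained neural codec (e.g., an EnCodec-style decoder) would then map the resulting token sequence to a waveform; that stage is omitted here.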

Video Presentation

BibTeX

@inproceedings{NEURIPS2024_b782a346,
  author    = {Liu, Xiulong and Su, Kun and Shlizerman, Eli},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
  pages     = {101337--101366},
  publisher = {Curran Associates, Inc.},
  title     = {Tell What You Hear From What You See - Video to Audio Generation Through Text},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2024/file/b782a3462ee9d566291cff148333ea9b-Paper-Conference.pdf},
  volume    = {37},
  year      = {2024}
}