Tell What You Hear From What You See

Xiulong Liu, Kun Su, Eli Shlizerman. NeurIPS 2024

VATT: a general multi-modal audio generation framework that can generate a wide variety of sounds from visual and/or text inputs.