Tell What You Hear From What You See

Video to Audio Generation Through Text (VATT)

Anonymous

VATT vs. baselines on VGGSound test set samples

This table shows a subset of the samples used in the human evaluation.

Our generated samples achieve the best audio-visual relevance in subjective evaluation while maintaining competitive audio quality.

Index	SpecVQGAN	Img2Wav	Diff-Foley	FoleyGen	V2A-Mapper	VATT-LLAMA-T (Ours)	Input Text Prompt
1							A man is speaking and playing tennis.
2							Footsteps are heard on snowy ground.
3							A dog is howling repeatedly.
4							Cat meowing and caterwauling loudly.
5							A woman is speaking, followed by a loud explosion and more speeches.
6							Music is playing with scratching sounds.
7							A vehicle is heard with crushing and splintering sounds.
8							A man is speaking indoor is heard.
9							Music plays with thumps, human voices.
10							People are speaking, tapping, and shouting.
11							People are playing table tennis, shouting with background noise.
12							Music is heard along with badminton hitting sounds.
13							Music and man singing can be heard.
14							Music, speech synthesizer, and sound effects are heard with human voices.
15							Music, a horse neighing, and men speaking are heard with background noise.
16							Rain is falling and thunder is heard.
17							Wind and breathing sounds from harmonica are heard.
18							The group of crows are making noises.
19							Dogs barking and growling with rustling in the background noise.
20							Kid is speaking and laughing while playing the firecracker.

VATT can also generate high-quality audio samples without text prompts.

These samples are a subset of outputs from the VATT-Gemma-2B model, generated without text prompts.

Generated samples where a text prompt improves performance: VATT vs. VATT-T (with prompt)

For some challenging videos where VATT fails to capture visual nuances, providing an additional text prompt with key information about the sound sources can help the model generate better results.

VATT
Input Text Prompt	A vehicle is heard with crushing and splintering sounds.	Music, a horse neighing, and men speaking are heard with background noise.	Music and hands clapping sounds are heard.	Tapping sounds are heard along with noises from the crowd.	Two cats are meowing and caterwauling at each other loudly.
VATT-T

VATT can be controlled through text prompts: different prompts yield different sounds.

When VATT is given text guidance that plausibly fits the context of the video, the model generates diverse results that align with both the text and the video, enabling reasonable controllability.

Index	Prompt 1	Prompt 2	Prompt 3
1	Live music along with people shouting and water splashing sounds.	Music and water splashing noise are heard.	Water splashing noise
2	A man is speaking, followed by volcano explosion and shouting in the background.	A woman is speaking, followed by a loud explosion.	music heard while the volcano eruption happens.
3	A child is crying in the background mixed with cat meowing.	The cat is meowing.	The crow is cawing at the cat.
4	The donkey is making high-pitch noises while the violin played.	The donkey is singing along with the violin.	The violin sound is heard.
5	a cat meowing, music and video game sounds heard.	a chorus singing heard while the cat meowing.	The cartoon cat meowing while music is heard.