My first-author paper “Tell What You Hear From What You See - Video to Audio Generation Through Text” has been accepted to NeurIPS 2024!