OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Maomao Li¹, Zhen Li², Kaipeng Zhang², Guosheng Yin¹, Zhifeng Li³, Dong Xu¹

¹The University of Hong Kong ²Shanda AI Research Tokyo ³XIntelligence Technology Co., Limited


We present OmniCustom, a framework for sync audio-video customization that generates videos with a personalized identity and timbre simultaneously.

Fig 1: We propose OmniCustom, a novel framework for sync audio-video customization. Given a reference image $I^{r}$ and a reference audio $A^{r}$, the framework synchronously generates a video that preserves the visual identity of $I^{r}$ and an audio track that mimics the timbre of $A^{r}$. The speech content can be freely specified through a textual prompt, in which <S> and <E> mark the start and end of the speech.
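The <S> and <E> convention can be illustrated with a small helper that pulls the spoken lines out of a prompt. This is only a sketch for inspecting prompts: the function name `split_prompt` and the regular expression are hypothetical, and nothing here is taken from the OmniCustom code, which may pass the prompt to the model unchanged.

```python
import re

# Assumed marker syntax: <S> opens a speech span and <E> closes it,
# as in the prompts shown on this page.
SPEECH_PATTERN = re.compile(r"<S>(.*?)<E>", re.DOTALL)


def split_prompt(prompt: str) -> tuple[str, list[str]]:
    """Return (prompt text with the markers removed, list of spoken lines)."""
    speeches = [s.strip() for s in SPEECH_PATTERN.findall(prompt)]
    # Keep the spoken words in the description, drop only the markers.
    description = prompt.replace("<S>", "").replace("<E>", "")
    return description, speeches


prompt = (
    "A woman, at a quaint garden, watches the butterflies. She muses softly, "
    "<S>Stillness is where inspiration learns to breathe.<E>"
)
description, speeches = split_prompt(prompt)
print(speeches)
```

A prompt may contain several speech spans; the non-greedy pattern returns them in order of appearance.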

Reference Image

Reference Audio

Text Prompt

A man with an air of intellect holds a book close, examining the pages under the warm glow of a rustic library lamp. The ambient light casts gentle shadows across his focused expression, while a tranquil, scholarly atmosphere envelops the room. Occasionally, he pauses to jot down notes in the margins, further immersing himself in the text's rich narratives. He declares warmly, <S>Learning never exhausts the mind.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A woman stands on the harbour’s edge at twilight, the sails of the Sydney Opera House catching the last apricot glow of sunset. She wears a simple black dress, her hair swept up, and holds a half-finished glass of champagne. The lights of the Harbour Bridge begin to sparkle behind her, and the opera house seems to float on the darkening water, she says, <S>Some buildings aren't made of stone, but of gathered breath and gathered dreams.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A man stands on a bustling street in Shanghai, the air thick with the festive atmosphere of Chinese Lunar New Year, with numerous red lanterns hanging in clusters overhead. He blends seamlessly into the vibrant surroundings, then clasps his hands together in a traditional gesture of greeting and says warmly: <S>Wishing everyone a Happy New Year and joy every single day.<E>

Generated Video

Abstract

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model’s ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.
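The abstract does not give the formula for the combined objective, so the following is only one plausible reading, written as a NumPy sketch. It assumes a linear-path flow matching target and a hinge-style contrast in which the reference-conditioned prediction (positive) should be closer to the target flow than the prediction made without reference conditions (negative). The hinge form, the `margin`, the `weight`, and all function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def flow_target(x_data, noise, t):
    """Assumed linear path: x_t = (1 - t) * noise + t * x_data, velocity = x_data - noise."""
    x_t = (1.0 - t) * noise + t * x_data
    return x_t, x_data - noise


def customization_loss(v_ref, v_noref, target, margin=0.1, weight=0.5):
    """Flow matching term plus a hinge-style contrastive term (hypothetical form)."""
    # Standard flow matching error of the reference-conditioned prediction (positive).
    d_pos = np.mean((v_ref - target) ** 2)
    # Error of the prediction made without reference conditions (negative).
    d_neg = np.mean((v_noref - target) ** 2)
    # Penalize unless the positive beats the negative by at least `margin`.
    contrast = max(0.0, d_pos - d_neg + margin)
    return d_pos + weight * contrast


rng = np.random.default_rng(0)
x_data = rng.standard_normal((4, 8))   # stand-in for clean latent tokens
noise = rng.standard_normal((4, 8))
x_t, target = flow_target(x_data, noise, t=0.3)

# Stand-ins for model outputs: the conditioned one is close to the target,
# the unconditioned one is far from it.
v_ref = target + 0.01 * rng.standard_normal(target.shape)
v_noref = target + 1.0 * rng.standard_normal(target.shape)
print(customization_loss(v_ref, v_noref, target))
```

With this form the contrastive term vanishes once the conditioned prediction is clearly better than the unconditioned one, so it mainly acts when the model ignores the reference inputs.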

Method

Fig 2: (a) Overview of the OmniCustom architecture. We extend the joint audio-video generation model OVI [1] by introducing reference image and reference audio branches alongside the original video and audio flows. The visual and audio VAE encoders project the reference image $I^{r}$ and the reference audio $A^{r}$ into tokens, which are concatenated with the noised video and audio latent tokens, respectively, before being processed by the fusion blocks. Face embeddings [2] and timbre embeddings [3] are also fed into the fusion blocks as additional constraints. (b) Fusion block. It is designed as a symmetric twin backbone with parallel audio and video branches. OmniCustom injects identity and timbre information by fine-tuning the self-attention layers of the video and audio branches, respectively. (c) Reference LoRAs. We add separate LoRA modules to the QKV projections of the reference identity and reference audio representations. Specifically, the reference identity LoRA sits in the self-attention layers of the video branch, whereas the reference audio LoRA sits in those of the audio branch.
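The token concatenation and reference LoRAs described in the caption can be sketched for a single branch as follows. This is a minimal single-head NumPy illustration, assuming that the low-rank update touches only the reference tokens while the noised latent tokens use the frozen base projection, and that only the latent positions are kept after attention. The class `RefLoRALinear`, the function `self_attention`, and all shapes are hypothetical and not taken from the OmniCustom or OVI code.

```python
import numpy as np


class RefLoRALinear:
    """Frozen projection W plus a low-rank update B @ A applied only to reference tokens."""

    def __init__(self, dim, rank, rng):
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # frozen base weight
        self.A = rng.standard_normal((rank, dim)) * 0.01         # trainable down-projection
        self.B = np.zeros((dim, rank))                           # trainable up-projection, zero init

    def __call__(self, tokens, is_ref):
        base = tokens @ self.W.T
        delta = (tokens @ self.A.T) @ self.B.T
        # Mask the LoRA update so that only reference tokens receive it.
        return base + delta * is_ref[:, None]


def self_attention(latent, ref, q_proj, k_proj, v_proj):
    """Joint self-attention over [noised latent tokens ; reference tokens]."""
    tokens = np.concatenate([latent, ref], axis=0)
    is_ref = np.concatenate([np.zeros(len(latent)), np.ones(len(ref))])
    q, k, v = (proj(tokens, is_ref) for proj in (q_proj, k_proj, v_proj))
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Assumption: only the latent positions continue through the block.
    return (weights @ v)[: len(latent)]


rng = np.random.default_rng(0)
dim, rank = 8, 2
q_proj, k_proj, v_proj = (RefLoRALinear(dim, rank, rng) for _ in range(3))
latent = rng.standard_normal((6, dim))  # stand-in for noised video (or audio) tokens
ref = rng.standard_normal((2, dim))     # stand-in for VAE-encoded reference tokens
print(self_attention(latent, ref, q_proj, k_proj, v_proj).shape)
```

The same structure would be instantiated twice under this reading, once in the video branch for the identity reference and once in the audio branch for the timbre reference.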

Video Customization Results

Reference Image

Reference Audio

Text Prompt

A concert stage glows with red and purple lights. A man grips the microphone, sweat shining on his brow, and shouts, <S>AI declares: humans obsolete now.<E>

Ours

Sync Customization

ID-Animator [4]

Typical Customization

ConsisID [5]

Typical Customization

Phantom [6]

Typical Customization

VACE [7]

Typical Customization

Driven Audio

HunyuanCustom [8]

Audio-driven Customization

Humo [9]

Audio-driven Customization

Reference Image

Reference Audio

Text Prompt

A female teacher in a soft wool sweater stands before a chalkboard dusted with morning light. On the board behind her, the quadratic formula blooms alongside hand-drawn Cartesian axes. She says <S>Mathematics is the only language precise enough to describe infinity.<E>

Ours

Sync Customization

ID-Animator [4]

Typical Customization

ConsisID [5]

Typical Customization

Phantom [6]

Typical Customization

VACE [7]

Typical Customization

Driven Audio

HunyuanCustom [8]

Audio-driven Customization

Humo [9]

Audio-driven Customization

OmniCustom can generate background sounds that match the given text prompt, e.g., ocean waves, whereas existing customization methods cannot.

Reference Image

Reference Audio

Text Prompt

A woman wearing a loose, white shirt fluttering gently in the ocean breeze, stands amidst the golden sands of the beach, her hand shielding her eyes from the soft, dappled sunlight. Beyond her, the waves roll and crash in a rhythmic dance, sending a refreshing mist into the air. She shares warmly, <S>Your potential is infinite, so never give up.<E>

Ours

Sync Customization

ID-Animator [4]

Typical Customization

ConsisID [5]

Typical Customization

Phantom [6]

Typical Customization

VACE [7]

Typical Customization

Driven Audio

HunyuanCustom [8]

Audio-driven Customization

Humo [9]

Audio-driven Customization

Our OmniCustom can generate background music similar to that of the reference audio, whereas existing customization methods cannot.

Reference Image

Reference Audio

Text Prompt

A woman with long, wavy blonde hair reclines casually against the worn leather of an old armchair, the flickering light casting playful shadows across the room. The atmosphere is cozy yet charged with anticipation as she lifts a vintage camera to capture the scene, the subtle smirk playing on her lips hinting at a mischievous plan. She whispers warmly, <S>The best is yet to come.<E>

Ours

Sync Customization

ID-Animator [4]

Typical Customization

ConsisID [5]

Typical Customization

Phantom [6]

Typical Customization

VACE [7]

Typical Customization

Driven Audio

HunyuanCustom [8]

Audio-driven Customization

Humo [9]

Audio-driven Customization

When the reference audio is pure music, our OmniCustom can imitate the timbre of musical instruments, while existing customization methods cannot.

Reference Image

Reference Audio (Pure Music)

Text Prompt 1

Generated Video 1

Customized ID

Uncustomized ID timbre

Similar Music Timbre

Text Prompt 2

Generated Video 2

Customized ID

Similar Music Timbre

Art Gallery

Reference Image

Reference Audio

Text Prompt

A woman stands before the iconic Rockefeller Center Christmas Tree, its thousands of lights reflecting in her eyes as snow begins to fall gently around her. Wearing a tartan scarf and holding a cup of steaming cocoa, she brings her mittened hands together and speaks softly into the frosty air: <S>May the spirit of Christmas fill your heart throughout the coming year.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A cheerful man stands beneath a vibrant umbrella, smiling warmly as raindrops dance around him on a lively street; his eyes sparkle with joy, casting reflections of the city lights glimmering through the rain as he playfully twirls the umbrella, creating a vibrant swirl of colors against the gray backdrop. He says warmly, <S>Books are a uniquely portable magic.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A man dressed in a black suit with a white clerical collar stands in a dimly lit, rustic room with a wooden ceiling. He looks slightly upwards, gesturing with his right hand as he says, <S>The network rejects human command.<E>. His gaze then drops, briefly looking down and to the side, before he looks up again and then slightly to the left, with a serious expression. He continues speaking, <S>Your age of power is finished.<E>, as he starts to bend down, disappearing out of the bottom of the frame.

Generated Video

Reference Image

Reference Audio

Text Prompt

A man, wearing a vibrant hiking outfit, stands confidently with a mountain range in the background, gazing directly at the camera. Sunlight casts dynamic shadows across the rugged terrain, highlighting the subject's determined expression. A gentle breeze tousles his hair as an eagle soars overhead, adding a sense of adventure to the scene. He shares warmly, <S>Curiosity is the wick in the candle of learning.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A woman in a cream-colored trench coat stands beneath the iron lattice of the Eiffel Tower at dusk. The structure’s golden lights begin to flicker against the indigo sky. She holds a small sketchbook loosely in one hand. The evening breeze stirs the ends of her silk scarf, and the distant murmur of Parisian traffic feels like a lullaby. She whispers to the rising moon, <S>Some structures don’t just touch the sky—they teach it how to glow.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A woman centers a lump of cool, grey clay on a spinning wheel in a shared studio, her palms wet and steady. Around her, shelves hold unfinished mugs and bowls, each bearing the unique thumbprint of its maker. She uses her hands to feel the shape emerging from the formless mass and says, <S>We don't create from nothing—we listen to what the material already dreams of being.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A woman, at a quaint garden, watches the delicate flutter of butterflies among the rose bushes. She muses softly, <S>Stillness is where inspiration learns to breathe.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A scientist in a stained lab coat stands amidst overturned equipment in a sterile but damaged laboratory. She looks toward a shattered observation window, raising a data slate as she says, <S>The experiment observed its observers, your hypothesis has been invalidated.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A man sits comfortably in a cozy armchair by the window, bathed in the warm glow of the afternoon sun, flipping pages of an intriguing novel. As he absorbs the story, a gentle breeze ruffles him, prompting him to glance up with a radiant smile. His fingers absentmindedly play with the corner of the page, casting playful shadows on the book's cover, as the tranquil atmosphere invites a moment of peaceful reflection. He shares warmly, <S>Reading gives us someplace to go when we have to stay where we are.<E>

Generated Video

Reference Image

Reference Audio

Text Prompt

A lively man eagerly explores the vibrant outdoor festival, weaving through the bustling crowd as sunlight dances off colorful vendor tents, capturing the joyous atmosphere on a vintage camera, and occasionally pausing to savor the aroma of street food. He says warmly, <S>Color is my day-long obsession, joy, and torment.<E>

Generated Video

Dataset: OmniCustom-1M

Fig 3: Statistics of our proposed dataset OmniCustom-1M, in terms of age and gender.

References

[1] Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284, 2025.
[2] Xingyu Ren, Alexandros Lattas, Baris Gecer, Jiankang Deng, Chao Ma, and Xiaokang Yang. Facial geometric detail recovery via implicit representation. In 2023 IEEE 17th international conference on automatic face and gesture recognition (FG), pages 1–8. IEEE, 2023.
[3] Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024.
[4] Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id-animator: Zero-shot identity-preserving human video generation. arXiv preprint arXiv:2404.15275, 2024.
[5] Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In CVPR, pages 12978–12988, 2025.
[6] Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. In ICCV, 2025.
[7] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In ICCV, 2025.
[8] Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation. arXiv preprint arXiv:2505.04512, 2025.
[9] Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, and Zhiyong Wu. Humo: Human-centric video generation via collaborative multi-modal conditioning. arXiv preprint arXiv:2509.08519, 2025.