Fig 3: Statistics of our proposed dataset OmniCustom-1M, in terms of age and gender.
Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model’s ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.
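As an illustration of the second contribution, the following minimal sketch shows one way a contrastive term could be combined with a flow matching loss. The abstract does not give the exact formulation, so everything here is an assumption: the linear interpolation path, the triplet-style hinge, the function names, and the `margin` and `weight` values are hypothetical and are not taken from OmniCustom.

```python
# Hypothetical sketch: flow matching plus a triplet-style contrastive term.
# None of these names or hyperparameters come from the paper.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def flow_matching_target(x0, x1, t):
    """Assumed linear path: x_t = (1 - t) * x0 + t * x1, target flow u = x1 - x0.

    x0 is a noise sample, x1 is a data sample, t is in [0, 1].
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]
    return x_t, u

def contrastive_flow_loss(v_pos, v_neg, u, margin=1.0, weight=0.1):
    """Combine the flow matching loss with an assumed hinge-style contrast.

    v_pos: flow predicted WITH the reference image/audio conditions (positive).
    v_neg: flow predicted WITHOUT the reference conditions (negative).
    u:     target flow from flow_matching_target.
    """
    fm = mse(v_pos, u)    # standard flow matching term on the positive
    neg = mse(v_neg, u)   # distance of the unconditioned prediction to the target
    # Penalize cases where the conditioned prediction is not closer to the
    # target than the unconditioned one by at least `margin`.
    contrast = max(0.0, margin + fm - neg)
    return fm + weight * contrast

if __name__ == "__main__":
    x_t, u = flow_matching_target([0.0, 0.0], [1.0, 2.0], t=0.5)
    print(contrastive_flow_loss(v_pos=u, v_neg=[0.0, 0.0], u=u))
```

In this sketch the contrastive term vanishes once the reference-conditioned prediction is clearly closer to the target flow than the unconditioned one, so it only affects training while the reference conditions are being underused.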
A concert stage glows with red and purple lights. A man grips the microphone, sweat shining on his brow, and shouts, <S>AI declares: humans obsolete now.<E>
A female teacher in a soft wool sweater stands before a chalkboard dusted with morning light. On the board behind her, the quadratic formula blooms alongside hand-drawn Cartesian axes. She says <S>Mathematics is the only language precise enough to describe infinity.<E>
Our OmniCustom can generate background sounds corresponding to the given text prompt, e.g., ocean waves, while existing customization methods cannot.
A woman wearing a loose, white shirt fluttering gently in the ocean breeze stands amidst the golden sands of the beach, her hand shielding her eyes from the soft, dappled sunlight. Beyond her, the waves roll and crash in a rhythmic dance, sending a refreshing mist into the air. She shares warmly, <S>Your potential is infinite, so never give up.<E>
Our OmniCustom can generate background music similar to that of the reference audio, whereas existing customization methods cannot.
A woman with long, wavy blonde hair reclines casually against the worn leather of an old armchair, the flickering light casting playful shadows across the room. The atmosphere is cozy yet charged with anticipation as she lifts a vintage camera to capture the scene, the subtle smirk playing on her lips hinting at a mischievous plan. She whispers warmly, <S>The best is yet to come.<E>
A woman stands before the iconic Rockefeller Center Christmas Tree, its thousands of lights reflecting in her eyes as snow begins to fall gently around her. Wearing a tartan scarf and holding a cup of steaming cocoa, she brings her mittened hands together and speaks softly into the frosty air: <S>May the spirit of Christmas fill your heart throughout the coming year.<E>
A cheerful man stands beneath a vibrant umbrella, smiling warmly as raindrops dance around him on a lively street; his eyes sparkle with joy, casting reflections of the city lights glimmering through the rain as he playfully twirls the umbrella, creating a vibrant swirl of colors against the gray backdrop. He says warmly, <S>Books are a uniquely portable magic.<E>
A man dressed in a black suit with a white clerical collar stands in a dimly lit, rustic room with a wooden ceiling. He looks slightly upwards, gesturing with his right hand as he says, <S>The network rejects human command.<E>. His gaze then drops, briefly looking down and to the side, before he looks up again and then slightly to the left, with a serious expression. He continues speaking, <S>Your age of power is finished.<E>, as he starts to bend down, disappearing out of the bottom of the frame.
A man, wearing a vibrant hiking outfit, stands confidently with a mountain range in the background, gazing directly at the camera. Sunlight casts dynamic shadows across the rugged terrain, highlighting the subject's determined expression. A gentle breeze tousles his hair as an eagle soars overhead, adding a sense of adventure to the scene. He shares warmly, <S>Curiosity is the wick in the candle of learning.<E>
A woman in a cream-colored trench coat stands beneath the iron lattice of the Eiffel Tower at dusk. The structure’s golden lights begin to flicker against the indigo sky. She holds a small sketchbook loosely in one hand. The evening breeze stirs the ends of her silk scarf, and the distant murmur of Parisian traffic feels like a lullaby. She whispers to the rising moon, <S>Some structures don’t just touch the sky—they teach it how to glow.<E>
A woman with long blonde hair stands beneath the golden autumn trees, the vibrant colors of her scarf and cozy sweater casting warm shadows across her face as she gathers fallen leaves, the gentle breeze playfully tugging at her attire, creating an atmosphere of serene joy in the dappled sunlight. She shares warmly, <S>Movement is a medicine for creating change.<E>
A woman, at a quaint garden, watches the delicate flutter of butterflies among the rose bushes. She muses softly, <S>Stillness is where inspiration learns to breathe.<E>
A scientist in a stained lab coat stands amidst overturned equipment in a sterile but damaged laboratory. She looks toward a shattered observation window, raising a data slate as she says, <S>The experiment observed its observers, your hypothesis has been invalidated.<E>
A man sits comfortably in a cozy armchair by the window, bathed in the warm glow of the afternoon sun, flipping pages of an intriguing novel. As he absorbs the story, a gentle breeze ruffles him, prompting him to glance up with a radiant smile. His fingers absentmindedly play with the corner of the page, casting playful shadows on the book's cover, as the tranquil atmosphere invites a moment of peaceful reflection. He shares warmly, <S>Reading gives us someplace to go when we have to stay where we are.<E>
A lively man eagerly explores the vibrant outdoor festival, weaving through the bustling crowd as sunlight dances off colorful vendor tents, capturing the joyous atmosphere on a vintage camera, and occasionally pausing to savor the aroma of street food. He says warmly, <S>Color is my day-long obsession, joy, and torment.<E>