VoxFlash-TTS⚡️

Ultra-Compressed Latent Diffusion for Real-Time Voice Cloning

Email: zhangtaiyan072@gmail.com

Abstract

VoxFlash is a real-time voice cloning engine designed for ultra-low latency and cost-efficient deployment.

It encodes raw speech waveforms into highly compressed latent representations using a VAE, generates speech in the latent space with a diffusion model, and decodes the results back to high-quality audio through a lightweight decoder.

By operating on extremely short latent sequences, VoxFlash achieves millisecond-level inference, enabling real-time voice cloning even on low-end hardware. This architecture significantly reduces computation, memory usage, and deployment cost, making VoxFlash suitable for edge devices, on-device applications, and large-scale real-time systems.



Model Overview

VoxFlash VAE

The VoxFlash VAE employs a lightweight architecture to encode raw speech waveforms into highly compressed latent representations. Operating on 24 kHz audio, the model compresses waveforms to only 9 frames per second (9 Hz), achieving a substantially higher compression ratio than existing approaches while preserving high-fidelity speech quality.
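To make the compression figures concrete, the short sketch below works out how many raw samples each latent frame covers at the stated rates (24 kHz input, 9 Hz latents). The function names are illustrative only and are not part of any VoxFlash API. The number of channels per latent frame is not stated on this page, so the result is a temporal downsampling factor rather than a full compression ratio.

```python
import math

SAMPLE_RATE_HZ = 24_000  # input audio rate stated on this page
LATENT_RATE_HZ = 9       # latent frame rate stated on this page


def samples_per_latent_frame(sample_rate_hz: int, latent_rate_hz: int) -> float:
    """Number of raw waveform samples summarized by one latent frame."""
    return sample_rate_hz / latent_rate_hz


def latent_frames(duration_s: float, latent_rate_hz: int = LATENT_RATE_HZ) -> int:
    """Latent sequence length for a clip, rounded up to a whole frame."""
    return math.ceil(duration_s * latent_rate_hz)


if __name__ == "__main__":
    # About 2667 samples per frame: 24000 / 9
    print(samples_per_latent_frame(SAMPLE_RATE_HZ, LATENT_RATE_HZ))
    # A 10-second clip becomes 90 latent frames
    print(latent_frames(10.0))
```

In other words, a 10-second utterance of 240,000 samples is represented by a sequence of only 90 latent frames.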

This extreme temporal compression significantly reduces the computational cost of downstream speech generation models. For components whose cost grows with the square of the sequence length, such as self-attention, a shorter latent sequence cuts computation quadratically, and it also allows the use of fewer network blocks and dramatically fewer parameters. As a result, the total computational cost is reduced by orders of magnitude compared to conventional speech generation pipelines, while maintaining high audio realism.
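The quadratic effect of sequence length can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a component whose cost scales with the square of the number of frames (self-attention is the usual example), and it compares the 9 Hz rate against a 50 Hz baseline. The 50 Hz figure is a hypothetical value chosen for illustration, not a measurement of any particular system, and the page does not state which layers VoxFlash actually uses.

```python
def pairwise_cost(duration_s: float, frame_rate_hz: float) -> int:
    """Frame-pair count for a quadratic-cost layer (for example self-attention)."""
    frames = round(duration_s * frame_rate_hz)
    return frames * frames


def relative_cost(duration_s: float, baseline_rate_hz: float,
                  compressed_rate_hz: float) -> float:
    """How many times more frame pairs the baseline rate needs."""
    return (pairwise_cost(duration_s, baseline_rate_hz)
            / pairwise_cost(duration_s, compressed_rate_hz))


if __name__ == "__main__":
    # ASSUMPTION: 50 Hz is an illustrative baseline, not a measured system.
    ratio = relative_cost(10.0, baseline_rate_hz=50, compressed_rate_hz=9)
    print(f"Quadratic-cost layers do about {ratio:.1f}x less work per layer")
```

Under these assumptions, a 10-second clip needs 500 x 500 frame pairs at 50 Hz but only 90 x 90 at 9 Hz, roughly a 31-fold reduction per layer before any savings from using fewer blocks.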

VoxFlash TTS

VoxFlash TTS is a low-latency text-to-speech system designed for efficient and scalable speech synthesis. It first encodes phoneme sequences using a lightweight ConvNeXtV2-based network, followed by a novel coarse alignment algorithm that maps textual representations to latent speech sequences with significantly lower computational complexity than cross-attention–based approaches.
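The page does not describe the coarse alignment algorithm itself. Purely as an illustration of how text features can be mapped onto a latent timeline without cross-attention, the sketch below uses a simple stand-in: it divides the phoneme sequence into evenly sized contiguous spans, one per latent frame, and mean-pools the features in each span. This uniform scheme and all names in it are assumptions for illustration, not the VoxFlash method. Its cost is linear in sequence length, whereas cross-attention compares every latent frame with every phoneme.

```python
from typing import List, Tuple


def uniform_spans(num_phonemes: int, num_frames: int) -> List[Tuple[int, int]]:
    """Assign each latent frame a contiguous [start, end) span of phonemes."""
    spans = []
    for i in range(num_frames):
        start = (i * num_phonemes) // num_frames
        end = ((i + 1) * num_phonemes) // num_frames
        # Guarantee at least one phoneme per frame when frames outnumber phonemes.
        spans.append((start, max(start + 1, end)))
    return spans


def pool_to_frames(features: List[List[float]], num_frames: int) -> List[List[float]]:
    """Mean-pool phoneme feature vectors over each frame's span."""
    pooled = []
    for start, end in uniform_spans(len(features), num_frames):
        chunk = features[start:end]
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    # HYPOTHETICAL input: four phonemes with 1-dimensional features, two frames.
    print(pool_to_frames([[1.0], [3.0], [5.0], [7.0]], num_frames=2))
```

A learned or duration-aware alignment would place span boundaries less rigidly; the point here is only that a monotonic span assignment avoids the frame-by-phoneme comparison that cross-attention performs.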

Built upon a modern diffusion architecture, VoxFlash TTS predicts speech latents through multi-step iterative refinement, enabling stable and high-quality generation. The generated latent representations are then decoded into waveforms using a lightweight VAE decoder. By combining efficient alignment, compressed latent modeling, and lightweight components, VoxFlash TTS achieves real-time synthesis with substantially reduced computation and model size, making it suitable for on-device and low-resource deployments.
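Multi-step iterative refinement can be sketched as a fixed-step sampling loop in which each step calls the network once, so the number of steps equals the number of function evaluations (NFE). The example below is a minimal sketch that assumes an Euler ODE sampler and replaces the trained network with a toy velocity field that points straight at a known target. The page does not specify which sampler or parameterization VoxFlash uses, so every name here is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate from t=0 to t=1 with fixed Euler steps (one model call per step)."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Tuple[Callable[[Vector, float], Vector], Dict[str, int]]:
    """ASSUMPTION: toy stand-in for a trained network; also counts its own calls."""
    calls = {"n": 0}

    def velocity_fn(x: Vector, t: float) -> Vector:
        calls["n"] += 1
        # Straight-line velocity toward the target over the remaining time.
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]

    return velocity_fn, calls


if __name__ == "__main__":
    velocity_fn, calls = make_toy_velocity(target=[0.5, -1.0, 2.0])
    latent = euler_sample(velocity_fn, x0=[0.0, 0.0, 0.0], num_steps=16)
    print(latent, "function evaluations:", calls["n"])
```

With a real model, the starting point would be random noise and the velocity function would be the text-conditioned network; fewer steps lower latency, usually at some cost in quality.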

Samples on this demo page are generated with VoxFlash using 16 function evaluations (NFE=16).

Same Language Zero-shot Generation

Prompt and text are from the demo page of Seed-TTS.

Prompt Text VoxFlash

I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.

Perhaps they are driven by the delicious blend of flavors, or it could be the appealing visual presentation. At the end of the day, our choices in food reflect our personal preferences and sometimes, even our lifestyle or belief system.

Your safety and the pack's reputation are at stake. Your bravery is admirable, but sometimes bravery is knowing when to retreat. Please, consider returning with me. We can work out a plan, but only if you're willing to listen.

突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"

顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”

皇上的面色未变,宛如雕塑般静止,他的眼中闪过一丝动人的温度。他深深地看了那位忠心耿耿的臣子一眼,终于开口:“诺,我会再考虑考虑的。”他的声音低沉且坚定,留下空气中隐隐的无奈与柔情。

Cross-Lingual Zero-shot Generation

Prompt and text are from the demo page of Seed-TTS.

Prompt Text VoxFlash

顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”

我抬起头,坚定地说:“身高不能决定一切,这世界在看我,我更看得到世界。无论是北上广,或是别的什么,我都将以我自己的方式去攀爬,去追逐。我可能小,但我绝不会被忽视。”

你的安全以及族群的声誉都危在旦夕。你的勇敢令人钦佩,但有时候勇敢在于懂得何时撤退。拜托,考虑一下和我一起回去吧。我们可以制定一个计划,但前提是你愿意倾听。

Suddenly, there was a burst of laughter beside me. I looked at them, stood up straight with high spirit, shook the slightly fleshy arms, and smiled lightly, saying, "The flesh on my body is to hide my bursting charm. Otherwise, wouldn't it scare you?"

Suddenly, the atmosphere became gloomy. At first glance, all the troubles seemed to surround me. I frowned, feeling that pressure, but I know I can't give up, can't admit defeat. So, I took a deep breath, and the voice in my heart told me, "Anyway, must calm down and start again."

The emperor's complexion did not change, remaining as still as a sculpture, and a touch of touching warmth flashed in his eyes. He deeply glanced at the loyal minister, and finally spoke: "Well, I will consider it again." His voice was low and firm, leaving a faint hint of helplessness and tenderness in the air.