This commit is contained in:
JianweiYu 2025-08-27 13:34:54 -07:00
parent 560870cbe1
commit 2b75b745a4
2 changed files with 4 additions and 0 deletions

BIN
Figures/VibeVoice_logo.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.4 MiB

View file

@ -11,6 +11,10 @@
<img src="Figures/log.png" alt="VibeVoice Logo" width="200"> <img src="Figures/log.png" alt="VibeVoice Logo" width="200">
</div> --> </div> -->
<div align="center">
<img src="Figures/VibeVoice_logo.png" alt="VibeVoice Logo" width="300">
</div>
VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details. A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.