update

2025-08-27 13:34:54 -07:00 · 2b75b745a4
commit 2b75b745a4
parent 560870cbe1
2 changed files with 4 additions and 0 deletions
--- a/Figures/VibeVoice_logo.png
+++ b/Figures/VibeVoice_logo.png
--- a/README.md
+++ b/README.md
@@ -11,6 +11,10 @@
 <img src="Figures/log.png" alt="VibeVoice Logo" width="200">
 </div> -->
 <div align="center">
 <img src="Figures/VibeVoice_logo.png" alt="VibeVoice Logo" width="300">
 </div>
 VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
 A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
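
To make the 7.5 Hz figure concrete, the sketch below counts how many tokenizer frames a clip of a given length would produce. It is an illustration only, not code from the VibeVoice repository: the function name `frames_for_duration` is hypothetical, and the 50 Hz comparison rate is an assumed value chosen purely for contrast.

```python
import math

# Frame rate stated in the README for the continuous speech tokenizers.
VIBEVOICE_FRAME_RATE_HZ = 7.5

# Hypothetical higher frame rate, used here only as a point of comparison.
COMPARISON_FRAME_RATE_HZ = 50.0


def frames_for_duration(seconds: float, frame_rate_hz: float = VIBEVOICE_FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover `seconds` of audio."""
    if seconds < 0 or frame_rate_hz <= 0:
        raise ValueError("seconds must be >= 0 and frame_rate_hz must be > 0")
    # Round up so a trailing partial frame is still counted.
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    one_minute = frames_for_duration(60)
    ninety_minutes = frames_for_duration(90 * 60)
    ninety_minutes_at_50hz = frames_for_duration(90 * 60, COMPARISON_FRAME_RATE_HZ)

    print(f"1 min at 7.5 Hz:  {one_minute} frames")
    print(f"90 min at 7.5 Hz: {ninety_minutes} frames")
    print(f"90 min at 50 Hz:  {ninety_minutes_at_50hz} frames (assumed comparison rate)")
```

At 7.5 Hz, one minute of audio maps to 450 frames, so the sequence the LLM has to process grows slowly with duration, which is what makes long-form generation tractable.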