diff --git a/Figures/VibeVoice_logo.png b/Figures/VibeVoice_logo.png
new file mode 100644
index 0000000..2619848
Binary files /dev/null and b/Figures/VibeVoice_logo.png differ
diff --git a/README.md b/README.md
index def73f8..3c67eda 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,10 @@
 VibeVoice Logo --> +
+VibeVoice Logo +
+ VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
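To make the 7.5 Hz figure concrete, the sketch below counts how many tokenizer frames a recording of a given length would occupy. It is a back-of-the-envelope illustration only, not VibeVoice code: the function name `frames_for_duration`, the example durations, and the 50 Hz comparison rate are all assumptions chosen for illustration; only the 7.5 Hz rate comes from the description above.

```python
import math

# Frame rate stated in the README description.
VIBEVOICE_FRAME_RATE_HZ = 7.5

# Hypothetical higher frame rate, used only as a point of comparison (assumption).
COMPARISON_FRAME_RATE_HZ = 50.0


def frames_for_duration(duration_seconds: float,
                        frame_rate_hz: float = VIBEVOICE_FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover `duration_seconds` of audio."""
    if duration_seconds < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    # Round up so a trailing partial frame is still counted.
    return math.ceil(duration_seconds * frame_rate_hz)


if __name__ == "__main__":
    # Example durations in minutes (illustrative values, not from the source).
    for minutes in (1, 10, 60):
        seconds = minutes * 60
        low = frames_for_duration(seconds)
        high = frames_for_duration(seconds, COMPARISON_FRAME_RATE_HZ)
        print(f"{minutes:>3} min: {low:>6} frames at 7.5 Hz, "
              f"{high:>7} frames at 50 Hz")
```

Under these assumptions, sequence length grows linearly with the frame rate, so a lower rate directly shortens the sequence the LLM has to process for the same audio duration.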