diff --git a/Figures/MOS-preference.png b/Figures/MOS-preference.png
new file mode 100644
index 0000000..3e1d21b
Binary files /dev/null and b/Figures/MOS-preference.png differ
diff --git a/Figures/VibeVoice.jpg b/Figures/VibeVoice.jpg
new file mode 100644
index 0000000..4a99d78
Binary files /dev/null and b/Figures/VibeVoice.jpg differ
diff --git a/README.md b/README.md
index b9f30cd..1b78e4c 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,8 @@
-# VibeVoice: Frontier Open-Source Text-to-Speech
+## 🎵 VibeVoice: A Frontier Open-Source Text-to-Speech
+[![Demo Page](https://img.shields.io/badge/Project-Page-blue?logo=google-chrome)](https://microsoft.github.io/VibeVoice)
+[![GitHub](https://img.shields.io/badge/GitHub-microsoft%2FVibeVoice-black?logo=github)](https://github.com/microsoft/VibeVoice)
+[![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f)
-

-
- Project Page
-
-
- Hugging Face
-
-
- Demo
-
-

 VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
@@ -19,7 +11,12 @@ A core innovation of VibeVoice is its use of continuous speech tokenizers (Acous
 The model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models.
-Try it out via [Demo](https://aka.ms/VibeVoiceDemo).
+Try it out via [Demo](https://microsoft.github.io/VibeVoice).
+
+

+ VibeVoice Overview
+ MOS Preference Results
+

 ## Models
 | Model | Context Length | Generation Length | Weight |