update
parent 560870cbe1
commit 2b75b745a4
2 changed files with 4 additions and 0 deletions
BIN Figures/VibeVoice_logo.png (Normal file)
Binary file not shown. After: Size: 1.4 MiB
@@ -11,6 +11,10 @@
<img src="Figures/log.png" alt="VibeVoice Logo" width="200">
</div> -->
<div align="center">
<img src="Figures/VibeVoice_logo.png" alt="VibeVoice Logo" width="300">
</div>
VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
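To make the 7.5 Hz frame rate concrete, the sketch below counts how many tokenizer frames a long recording would need. Only the 7.5 Hz rate comes from the text above; the 90-minute duration, the 50 Hz comparison rate, and the helper name `frames_for_duration` are illustrative assumptions, not part of VibeVoice.

```python
import math

# Only the 7.5 Hz rate is taken from the README text.
VIBEVOICE_FRAME_RATE_HZ = 7.5

# Hypothetical values chosen purely for illustration.
ASSUMED_DURATION_SECONDS = 90 * 60      # a 90-minute recording
ASSUMED_BASELINE_RATE_HZ = 50.0         # an assumed higher-rate tokenizer


def frames_for_duration(duration_seconds: float, frame_rate_hz: float) -> int:
    """Return the number of frames needed to cover the given duration."""
    if duration_seconds < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    # Round up so a partial trailing frame is still counted.
    return math.ceil(duration_seconds * frame_rate_hz)


if __name__ == "__main__":
    low = frames_for_duration(ASSUMED_DURATION_SECONDS, VIBEVOICE_FRAME_RATE_HZ)
    high = frames_for_duration(ASSUMED_DURATION_SECONDS, ASSUMED_BASELINE_RATE_HZ)
    print(f"frames at 7.5 Hz: {low}")
    print(f"frames at 50 Hz (assumed baseline): {high}")
    print(f"sequence length ratio: {high / low:.2f}x")
```

Under these assumptions, the sequence at 7.5 Hz is several times shorter than at the assumed baseline rate, which is the sense in which a low frame rate helps with long sequences.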