Microsoft has released VibeVoice, a new open-source artificial intelligence (AI) model that lets users create podcasts and other audio — a counter to Google’s popular NotebookLM.
But there are notable differences. Microsoft’s text-to-speech model can generate four voices and up to 90 minutes of podcast-quality speech. NotebookLM can do two voices.
Additionally, VibeVoice reads and organizes text, while NotebookLM ingests documents and turns them into two-person podcasts. NotebookLM users can also query their documents and get summaries, according to tech firm Hugging Face.
That means VibeVoice doesn’t try to understand the text but rather performs it audibly, ostensibly to replace a recording studio.
VibeVoice is the latest offering in voice AI technology, which has been attracting venture capital funding.
In 2024, voice AI startups raised $2.1 billion, up eightfold from the prior year, according to market research firm CB Insights. There’s also rising interest in voice shopping: A PYMNTS Intelligence report shows that 30.4% of Gen Z consumers already shop by voice every week, followed by millennials. Across all age groups, an average of 17.9% of consumers use voice to shop.
VibeVoice runs on 1.5 billion parameters, relatively small for a model capable of sustaining dialogue across multiple speakers.
It is built on Alibaba’s open-source Qwen2.5, a large language model that helps orchestrate natural turn-taking and contextually aware speech patterns during dialogues.
Microsoft claims this means VibeVoice can produce fluid conversations among four voices and yet maintain each voice’s distinct characteristics, even in longer conversations.
How to use VibeVoice
Potential research applications of VibeVoice include the following:
Prototyping podcasts and training content
Accessibility and education
Game and media development
Recognizing the risks of deepfakes, Microsoft said VibeVoice’s safeguards ensure that every audio file includes both a disclaimer, such as “This segment was generated by AI,” and a hidden digital watermark.
Microsoft bars impersonation, disinformation and live deepfake uses such as real-time voice conversion in calls. The model supports only English and Chinese speech for now and is available for research, not commercial deployment.
The post Microsoft Unveils VibeVoice for Longer Conversational AI Audio appeared first on PYMNTS.com.