Introducing super realistic AI voices optimized for conversations

In human-bot conversations, bots can now produce more natural, fluent, and high-quality responses than ever before, thanks to Large Language Models (LLMs) such as Azure OpenAI GPT. As a result, spoken conversations demand more naturalness and expressiveness from Text-to-Speech (TTS) voices than ever. Today we are introducing new voices designed specifically for conversational scenarios. Whether you are creating a speech-based chatbot, a voice assistant, or a conversational agent, these voices will make your interactions more realistic, lifelike, and engaging.

The new realistic voices suit any application that needs lifelike speech interaction, including chatbots, voice assistants, gaming, e-learning, entertainment, and more.

Meet the four new voices we are introducing today: en-US-AndrewNeural, en-US-BrianNeural, en-US-EmmaNeural, and zh-CN-YunjieNeural. All are optimized for conversational scenarios and are available in public preview in three regions: East US, Southeast Asia, and West Europe.

Check out the voice samples

Demo of new voices in comparison with other voices

Hear how these voices sound in conversation, compared to existing stock voices that are designed for more general purposes.

Script: I can help you with a lot of things! I can answer questions, provide information on a wide range of topics, help you find things on the web, and more. If you have a specific question or task in mind, feel free to ask me and I'll do my best to assist you.
New voice optimized for conversations: Emma
Existing voice designed for general purpose: Jenny

Script: I'm not sure what you're asking. If you're asking for a paraphrase of the sentence “I learn about myself that I can lead a team”, then it means that the speaker has discovered that they have the ability to lead a team. Is there anything else I can help you with?
New voice optimized for conversations: Andrew
Existing voice designed for general purpose: Guy

Script: 风筝有风,海豚有海,而您有我,感谢您的光临。么么哒! (English: Kites have the wind, dolphins have the sea, and you have me. Thank you for visiting. Mwah!)
New voice optimized for conversations: Yunjie
Existing voice designed for general purpose: Yunxi

More samples

Script: I understand. It sounds like a place that is both impressive and terrifying. I wonder what kind of tea they serve there. Is it made from the sun's rays or from something else? And who are the people who live there? Are they loyal to the Empire or do they have their own agendas?
New voice: Emma

Script: Yes, that is what I said. A maximin strategy is the one that maximizes the minimum payoff of a player, regardless of what the other players do. It is a way of ensuring that the player gets at least a certain amount of payoff, even in the worst case scenario.
New voice: Andrew

Script: If you can't find the information, you may want to consider contacting your state's insurance department. They may be able to help you locate any life insurance policies that were taken out on your husband. I hope this helps. Please let me know if you have any other questions.
New voice: Brian

Script: 好的,让我为您创建一个新的理赔单。请稍等。我已经为您创建了一个新的理赔单。我们会联系您安排修理您的车子。我们还会通过电子邮件给您发送一个链接,以便您可以上传您拍摄的照片。还有什么其他我可以帮助您的吗? (English: Okay, let me create a new claim for you. One moment, please. I have created a new claim for you. We will contact you to arrange the repair of your car. We will also email you a link so that you can upload the photos you took. Is there anything else I can help you with?)
New voice: Yunjie

Demo of full conversation

Conversations between Andrew and Emma (in English):

Conversations between Yunjie and Xiaochen (in Chinese):

Integrate these new voices with Azure OpenAI

You can incorporate these new neural Text-to-Speech (TTS) voices into your applications using the Azure Speech SDK or REST API. You can also use the Azure Bot Framework to build intelligent bots that use these voices for speech synthesis.
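
As a rough illustration of the REST route, the sketch below builds (but does not send) a synthesis request for one of the new voices. It assumes the standard Speech service text-to-speech endpoint and headers; the helper names build_ssml and build_tts_request, the region identifier, and the key value are placeholders introduced here for illustration, not part of the original article.

```python
from urllib.request import Request
from xml.sax.saxutils import escape


def build_ssml(text: str, voice: str = "en-US-AndrewNeural") -> str:
    """Wrap plain text in a minimal SSML document for the given voice."""
    # Derive the locale from the voice name: "en-US-AndrewNeural" -> "en-US".
    lang = "-".join(voice.split("-")[:2])
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


def build_tts_request(text: str, voice: str, region: str, key: str) -> Request:
    """Prepare a POST request for the text-to-speech REST endpoint.

    Assumption: `region` is an Azure region identifier such as "eastus",
    and `key` is a Speech resource key supplied by the caller.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "conversational-voice-sketch",  # hypothetical app name
    }
    body = build_ssml(text, voice).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    request = build_tts_request(
        "Is there anything else I can help you with?",
        voice="en-US-EmmaNeural",
        region="eastus",
        key="YOUR_SPEECH_KEY",  # placeholder, not a real key
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

With a valid key, passing the prepared request to urllib.request.urlopen should return the synthesized audio bytes. The Speech SDK wraps the same idea behind a higher-level client, so this sketch is mainly useful for seeing what goes over the wire.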

To minimize latency when integrating Large Language Models (LLMs) with TTS, we recommend sending text to the TTS service while the LLM is still generating its response. You can find a demo sample here that demonstrates generating TTS responses in a streaming manner.
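
One way to picture this streaming pattern is to buffer the LLM's output tokens and hand each sentence to TTS as soon as it is complete. The sketch below is not the demo sample mentioned above; sentence_chunks, stream_to_tts, and the synthesize callback are hypothetical names, and the sentence splitting is deliberately naive (it would also split at the point in "3.14").

```python
import re
from typing import Callable, Iterable, Iterator

# A sentence is text ending in one or more terminators (English or Chinese)
# that is followed by at least one non-terminator character. Requiring the
# following character avoids cutting an ellipsis that is still arriving.
_SENTENCE = re.compile(r"\s*(.+?[.!?。！？]+)(?=[^.!?。！？])", re.S)


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of partial text tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer.
        while (match := _SENTENCE.match(buffer)):
            yield match.group(1)
            buffer = buffer[match.end():]
    # Flush whatever is left once the LLM has finished generating.
    tail = buffer.strip()
    if tail:
        yield tail


def stream_to_tts(tokens: Iterable[str], synthesize: Callable[[str], None]) -> None:
    """Send each sentence to TTS as soon as it is available.

    Assumption: `synthesize` is a caller-supplied function that submits
    one piece of text to the TTS service.
    """
    for sentence in sentence_chunks(tokens):
        synthesize(sentence)


if __name__ == "__main__":
    # Simulated LLM output arriving in arbitrary fragments.
    llm_tokens = ["Hel", "lo the", "re. How", " are you", "? Fine", "."]
    stream_to_tts(llm_tokens, lambda text: print("TTS <-", text))
```

The trade-off is that a sentence is only released once the next fragment arrives (or the stream ends), which costs one token of delay but keeps trailing punctuation attached to its sentence.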

The technology behind the voices

We began by crafting the persona of each voice as if it were a real person who is friendly and optimistic about life, always eager to assist others and share intriguing or practical knowledge. The speaking style of the voice resembles a conversation with an acquaintance over a cup of tea, maintaining a natural and unexaggerated tone.

Furthermore, we continuously enhance our Text-to-Speech (TTS) modeling techniques to improve the quality of our voices. Our most recent projects, such as DelightfulTTS 2 and MuLanTTS, have significantly narrowed the quality gap between synthetic voices and professional human recordings, producing more natural and realistic voices than ever before. These advancements are the foundation on which the new AI voices are built.

Get started

Microsoft offers over 400 neural voices covering more than 140 languages and locales. With these Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users. In addition, with the Custom Neural Voice capability, you can easily create a brand voice for your business.

This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.