What’s new in Azure AI Speech


Today at Microsoft Ignite, we're excited to announce a number of new capabilities for Azure AI Speech! This article summarizes all the new and recent releases.

We also recently released improved pricing for some of our services, including lower prices for batch transcription and Custom Speech. Please see Updates to Azure AI Speech Service pricing for more information.

Speech in Chat Playground

We're introducing speech input and output in the Azure OpenAI Studio's Chat playground, powered by Azure Speech! This enhances the chat interaction experience, enabling powerful multi-modal input and output. It provides a showcase for what you can achieve for your customers across a wide range of scenarios like voice assistants, enterprise chatbots, IVR in contact centers, and more.

The playground allows you to choose any of the 138 locales supported by Azure AI Speech for both speech to text and text to speech. Once enabled through the settings, you can use the microphone button to provide voice input.

To help kickstart the integration of speech input and output with Azure OpenAI, just click the 'View code' button in the playground UI for sample code you can use in your applications and products. We'll continue to improve the developer experience of connecting these services together, and we're excited to see what you build!
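As a rough sketch of the loop the playground wires up for you — recognize voice input, pass the text to your chat model, then speak the reply — here is a stdlib-only version built on the Azure Speech REST endpoints. The key, region, and voice values are placeholders you supply yourself, and the chat call itself is omitted:

```python
# Minimal sketch of speech in / speech out around a chat backend, using the
# documented Azure Speech short-audio and text to speech REST endpoints.
# SPEECH_KEY / SPEECH_REGION values are placeholders; error handling omitted.
import urllib.request


def stt_request(region: str, key: str, wav_bytes: bytes,
                locale: str = "en-US") -> urllib.request.Request:
    """Build a short-audio recognition request for the Speech REST API."""
    url = (f"https://{region}.stt.speech.microsoft.com/speech/recognition/"
           f"conversation/cognitiveservices/v1?language={locale}")
    return urllib.request.Request(
        url, data=wav_bytes, method="POST",
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"})


def tts_ssml(text: str, voice: str = "en-US-JennyNeural") -> str:
    """SSML body for the text to speech REST endpoint."""
    return (f"<speak version='1.0' xml:lang='en-US'>"
            f"<voice name='{voice}'>{text}</voice></speak>")


def speak_reply(region: str, key: str, text: str) -> bytes:
    """POST SSML to the TTS endpoint and return the synthesized audio bytes."""
    req = urllib.request.Request(
        f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=tts_ssml(text).encode("utf-8"), method="POST",
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/ssml+xml",
                 "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

In a real application you would sit your Azure OpenAI chat call between the recognition and synthesis steps; the 'View code' sample in the playground shows the SDK-based equivalent.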

TTS Avatar, Limited Access Public Preview

Azure text to speech avatar is now in Public Preview! This is a text to speech feature that allows developers to use simple text input to generate a 2D photorealistic avatar that speaks with a neural text to speech voice. To create the visualization of the avatar, a model is trained with human video recordings.

With avatars, you can create video content in a simple and efficient way for training videos, product introductions, customer testimonials, and more. Integrated with other Azure AI capabilities like speech to text and Azure OpenAI, text to speech avatars enable developers to put a human face on real-time interactive conversational agents, voice assistants, and chatbots. We're excited about this technology enabling more natural and engaging experiences.

For Public Preview, we're offering both prebuilt and custom avatars. Adhering to Responsible AI principles, access to custom avatars is limited and will require an application.

To learn more, please read this blog: https://aka.ms/previewblog-ttsavatar.

Personal Voice, Limited Access Public Preview

Personal Voice is another new text to speech feature that we're releasing in Public Preview, which allows you to build applications where your users can easily create and use their own AI voice. Users can replicate their voice by providing a 1-minute speech sample as an audio prompt, and then use it to generate speech in any of the 100 supported locales.

You can leverage this feature to provide fully personalized voice experiences like unique voices for every gamer's character, voice assistants with bespoke voices, content dubbing in real-time with the original speakers' voices, and more.

As an extension of Custom Neural Voice, personal voice adheres to our Responsible AI principles. Access is limited and will require an application.

To learn more, please read this blog: https://aka.ms/previewblog-personalvoice.

Speech Analytics Try-out

Azure AI Studio now includes a new try-out experience for Speech Analytics. Speech analytics is an upcoming capability that integrates Azure AI Speech with Azure OpenAI to transcribe audio and video recordings, generate enhanced outputs like summaries, and extract valuable information such as key topics, Personally Identifiable Information (PII), sentiment, and more.

We'll provide pre-built templates to help you get results quickly for common scenarios like summarization, PII-redaction, post-call analytics, agent assist, video captioning, and game-chat moderation. This feature will also include automatic job processing and monitoring to create high-quality transcription results. Beyond the pre-built templates, speech analytics will also include the flexibility to customize to your individual business needs.

We hope to empower you to deploy new proofs of concept with low effort, built on infrastructure that can scale to production workloads out of the box.

The try-out experience in Azure AI Studio presents a first look at example scenarios and some of the insights we envision, along with an option to sign up for our private preview waitlist.

Explore it here: https://aka.ms/speechanalytics/try-out.

Customization of OpenAI's Whisper model in Azure AI Speech, Public Preview

We've added the ability to customize OpenAI's Whisper models using audio with human-labeled transcripts! This allows you to fine-tune Whisper models to the domain-specific vocabulary and acoustic conditions of your use cases. Customized models can then be used through Azure AI Speech's batch transcription API.

Customization can be achieved through the Azure AI Custom Speech portal or REST API.

This extends our Public Preview of Azure OpenAI Whisper in both Azure OpenAI and Azure AI Speech, and builds on top of features Azure AI Speech provides for Whisper models like support for very large audio files, word-level timestamps, and speaker diarization.

On September 15th we announced the availability of the Public Preview of Azure OpenAI Whisper in both Azure OpenAI and Azure AI Speech.

  • Azure OpenAI Service enables developers to run OpenAI's Whisper model in Azure, mirroring the OpenAI Whisper API in features and functionality, including transcription and translation capabilities.
  • Users of Azure AI Speech can leverage OpenAI's Whisper model in conjunction with the Azure AI Speech batch transcription API. This enables customers to easily transcribe large volumes of audio content at scale.

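As a sketch of what submitting a batch job against a fine-tuned model looks like, the function below builds the JSON body for the Speech to text v3.1 "Transcriptions - Create" REST operation. The model ID, display name, and content URL are placeholders; you would POST this body to `https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions` with your subscription key:

```python
# Build the request body for a batch transcription job that references a
# custom (fine-tuned Whisper) model by its self link. Placeholder values
# throughout; the payload shape follows the v3.1 transcriptions API.
import json


def batch_transcription_job(region: str, custom_model_id: str,
                            content_urls: list, locale: str = "en-US") -> str:
    """Return the JSON body for POST /speechtotext/v3.1/transcriptions."""
    body = {
        "displayName": "custom-whisper-batch",   # any label you like
        "locale": locale,
        "contentUrls": content_urls,             # URLs of the audio to transcribe
        # Point the job at the fine-tuned model via its self link.
        "model": {
            "self": (f"https://{region}.api.cognitive.microsoft.com/"
                     f"speechtotext/v3.1/models/{custom_model_id}")
        },
        "properties": {
            "wordLevelTimestampsEnabled": True,  # word timings in the result
            "diarizationEnabled": True,          # label speakers in the output
        },
    }
    return json.dumps(body, indent=2)
```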

Bilingual Models, General Availability

For the very first time, we're introducing bilingual speech to text models! These models allow users to seamlessly switch between language pairs in real-time interactions. Starting Nov 30th, we're supporting English & Spanish and English & French as the first language pairs by updating our es-US and fr-CA models to be bilingual by default. The models' ability to understand both languages in real time opens exciting new opportunities for effective and barrier-free communication.
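Because the bilingual support ships through the existing es-US and fr-CA locales by default, enabling it is just a locale choice. A minimal sketch, assuming the azure-cognitiveservices-speech package and placeholder key/region values:

```python
# The two bilingual locales announced so far and the languages each covers.
BILINGUAL_LOCALES = {
    "es-US": ("Spanish", "English"),
    "fr-CA": ("French", "English"),
}


def pick_locale(primary: str) -> str:
    """Map a primary language to the first bilingual locale that covers it."""
    for locale, languages in BILINGUAL_LOCALES.items():
        if primary in languages:
            return locale
    raise ValueError(f"No bilingual model covers {primary!r} yet")


def recognize_bilingual(key: str, region: str, locale: str = "es-US"):
    """One-shot recognition; the es-US/fr-CA models handle mixed speech."""
    # Imported lazily so the helpers above work without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_recognition_language = locale  # bilingual by default
    recognizer = speechsdk.SpeechRecognizer(speech_config=config)
    return recognizer.recognize_once()
```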

To learn more, please read this blog: https://aka.ms/previewblog-bilingualmodel-speech.

Embedded Speech, General Availability

Embedded Speech is now generally available! Embedded speech is designed for on-device speech to text and text to speech scenarios where cloud connectivity is intermittent or unavailable. It provides an additional way for you to access Azure AI Speech beyond the Azure cloud and connected/disconnected containers, and gives you access to the same technology that powers many Windows 11 experiences like Live Captions, Voice Access, and Narrator.

Access to embedded speech is limited and requires an application.

To learn more, please read our blog post: https://aka.ms/ignite2023-embeddedspeech-blog.

Pronunciation Assessment Language Support & Enhancements

Pronunciation Assessment now supports 14+ locales including English (United States), English (United Kingdom), English (Australia), French, German, Japanese, Korean, Portuguese, Spanish, Chinese, and more.

In addition to General Availability of these locales, we're also releasing prosody, grammar, vocabulary, and topic support as new features in Public Preview for English. These features will provide a comprehensive language learning experience for chatbots and conversation-based evaluations.

For reading scenarios, the new features are used in Reading Progress in Microsoft Teams to save teachers time and improve students' reading accuracy and prosody. For speaking scenarios, the speaker coach in PowerPoint guides presenters on the correct pronunciation of spoken words throughout their rehearsal.

To learn more, please read this blog: https://aka.ms/ignite2023-pronunciation-assessment-blog. You can also try out the new features in the Azure AI Studio.
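For a sense of how assessment is configured, the sketch below builds the base64-encoded JSON that the short-audio speech to text REST API accepts in its `Pronunciation-Assessment` header (the Speech SDK exposes the same options through `PronunciationAssessmentConfig`). The reference text is just an example:

```python
# Build the Pronunciation-Assessment header value: a base64-encoded JSON
# config describing what to score and how.
import base64
import json


def pronunciation_assessment_header(reference_text: str) -> str:
    """Return the base64 config for the Pronunciation-Assessment header."""
    params = {
        "ReferenceText": reference_text,   # what the learner should say
        "GradingSystem": "HundredMark",    # scores on a 0-100 scale
        "Granularity": "Phoneme",          # word- and phoneme-level results
        "Dimension": "Comprehensive",      # accuracy, fluency, completeness
        "EnableMiscue": "True",            # flag omitted and inserted words
    }
    return base64.b64encode(json.dumps(params).encode("utf-8")).decode("ascii")
```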

Real-time Speaker Diarization, Public Preview

Earlier this year we announced that real-time diarization is available as a Public Preview for Azure AI Speech. It is an enhanced add-on feature that answers the question of who said what and when. It differentiates speakers in the input audio based on their voice characteristics to produce real-time transcription with results attributed to the different speakers as Guest 1, Guest 2, Guest 3, and so on.

To learn more, please read this blog: Announcing the public preview of Real-time Diarization in Azure AI Speech, or reference our documentation Real-time diarization quickstart – Speech service – Azure AI services | Microsoft Learn.
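A minimal sketch of the real-time flow, assuming the azure-cognitiveservices-speech package and placeholder key/region values, using the SDK's `ConversationTranscriber` and attributing each result to its speaker label:

```python
def attribute(speaker_id: str, text: str) -> str:
    """Format one transcribed utterance with its speaker label."""
    return f"{speaker_id or 'Unknown'}: {text}"


def transcribe_meeting(key: str, region: str) -> None:
    """Transcribe the default microphone with real-time speaker diarization."""
    # Imported lazily so attribute() is usable without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    transcriber = speechsdk.transcription.ConversationTranscriber(
        speech_config=config)

    def on_transcribed(evt):
        # evt.result.speaker_id arrives as "Guest-1", "Guest-2", ...
        print(attribute(evt.result.speaker_id, evt.result.text))

    transcriber.transcribed.connect(on_transcribed)
    transcriber.start_transcribing_async().get()
    input("Transcribing from the default microphone; press Enter to stop.\n")
    transcriber.stop_transcribing_async().get()
```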


This article was originally published on Microsoft's Azure AI Services Blog.