Mine knowledge from audio files with Microsoft AI

Knowledge mining is a technique for extracting insights from structured and unstructured data. In this context, Azure Search is the standard Microsoft knowledge mining service: it uses AI to create metadata about images, relational databases, and textual data, providing a web-like search experience. Audio is a data type that matters for companies in all industries, because it contains customer and business information. In this article you will learn how to combine these two separate worlds into a single search solution.

The challenge with audio data mining

We can turn Azure Search into Azure Cognitive Search by adding an enrichment pipeline that creates metadata on images and text files. The pipeline pushes the created metadata into a search index that contains pointers to the original data at its physical location on Azure. The supported data types are Microsoft Office documents, PDFs, JSON, XML, and image formats such as PNG and JPG. Images may contain a scene, which will be tagged and described, or text, which will be extracted. The text in the images may also be handwritten, as in a form or meeting notes, both common in every business.

But audio data isn't supported, and a company may also want to search the content of podcasts, meeting recordings, call center phone calls, and so on. One possible solution is a custom skill. In addition to the regular Cognitive Search pipeline for your standard supported documents, a second skillset is necessary: it would extract only the physical metadata from the file, and the custom skill would do the job. But today, August 2019, this approach has limitations on processing time, file sizes, and more. It will be addressed in a future blog post.

Another possible solution has two steps, pre-processing and a link-through, which allow users not only to search the company's recordings but also to have the “click-to-open” experience, as in Microsoft's JFK or Azure Search Accelerator demos. The diagram below shows how it works, and the details follow.


Figure 1: The traditional Cognitive Search diagram, with the addition, in grey, of the required audio file pre-processing step.

First Step – Pre-processing to extract textual data from the audio

In this first step, we need to extract textual data from the audio files before the Azure Cognitive Search enrichment pipeline runs. The process feeds the pipeline with JSON data containing the transcriptions of the audio files. This content must be merged with the content of the other files, which happens just before the pipeline's AI analysis, such as entity or key phrase extraction.

The same idea appears in speech support for bots, with the help of the Microsoft Speech SDK. The SDK converts the audio to text, and then you can integrate your bot with other Azure Cognitive Services, like LUIS for language understanding, Text Analytics for information extraction, or QnA Maker for knowledge bases; all are REST APIs that expect textual input.

To implement this pre-processing, we will use the Microsoft Speech to Text API, a cognitive service that performs the transformation we need to mine knowledge from business audio. This API offers a range of capabilities you can embed into your apps to support various transcription scenarios, including conversation transcription, speech transcription, and custom speech transcription. The location of the audio files is passed as a parameter of the API call, and the service accesses the files.

The code below allows you to submit your audio files to the API and get back the transcriptions in JSON files that carry the name of the original recording file. This is key to allowing the application to offer the “click-to-open” experience, which you will see in the next step of the solution. This code and the sample files are available in a GitHub lab that has all the details necessary to accomplish this task.


Figure 2: The script code to call the REST API for your audio files and get the transcription back.
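The naming convention described above can be sketched in a few lines of Python. This is a minimal sketch, not the script from the GitHub lab: the helper names, the JSON keys (`audioFileName`, `content`), and the choice to append ".json" to the full recording name are assumptions made for illustration, and the actual Speech to Text call is passed in as a `transcribe` function so the sketch stays independent of any specific API version.

```python
import json
from pathlib import Path
from typing import Callable


def transcript_path(audio_path: Path, out_dir: Path) -> Path:
    """Return the JSON path for a recording, keeping the full original file name."""
    # "call-001.wav" becomes "call-001.wav.json", so removing ".json" later
    # recovers the exact name of the audio file.
    return out_dir / (audio_path.name + ".json")


def save_transcript(audio_path: Path, out_dir: Path,
                    transcribe: Callable[[Path], str]) -> Path:
    """Transcribe one recording and write the text to its matching JSON file.

    `transcribe` is a hypothetical callable that wraps the Speech to Text
    request and returns the recognized text.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    target = transcript_path(audio_path, out_dir)
    # Assumed payload layout: the audio name plus the transcription text.
    payload = {"audioFileName": audio_path.name, "content": transcribe(audio_path)}
    target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return target
```

Keeping the audio extension inside the JSON file name is what lets the second step map each indexed transcription back to its recording.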

For enterprise scenarios, Batch Transcription is necessary. It is ideal for call centers, which typically accumulate thousands of hours of audio, and it makes it easy to transcribe large volumes of recordings. Using the dedicated REST API, you can point to audio files with a shared access signature (SAS) URI and receive the transcriptions asynchronously. It requires the S0 (Standard) tier of the Speech Service and lets you process file types like WAV and MP3, without data volume limitations.
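As an illustration of pointing the batch API at SAS URIs, here is a hedged sketch that only builds the HTTP request and sends nothing. The endpoint path and the body fields (`contentUrls`, `locale`, `displayName`) follow the v3.1 shape of the batch transcription REST API as we understand it; the API has changed across versions, so treat them as assumptions and check the current reference before use. The function name and the display name value are made up for this example.

```python
import json
import urllib.request
from typing import List


def build_batch_request(region: str, key: str, sas_urls: List[str],
                        locale: str = "en-US",
                        api_version: str = "v3.1") -> urllib.request.Request:
    """Build (but do not send) a request that creates a batch transcription."""
    # Assumed endpoint shape; older and newer API versions use different paths.
    url = (f"https://{region}.api.cognitive.microsoft.com"
           f"/speechtotext/{api_version}/transcriptions")
    body = {
        "contentUrls": sas_urls,           # SAS URIs of the audio blobs
        "locale": locale,                  # language of the recordings
        "displayName": "call center batch",  # illustrative name
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` would start the job; because the service works asynchronously, the caller then has to poll for the result instead of waiting on the initial response.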

The team published a great blog post on using the Text Analytics API on call center recordings. They used a similar architecture, with a focus on dashboards for sentiment analysis and key phrases instead of Cognitive Search. An interesting approach is to combine both designs to create the best solution for your needs.

Second Step – Link-through for a “click-to-open” experience

Now that you have the transcriptions, upload the JSON data to the same location as the rest of your dataset, and it will go through all the AI processes you defined for that dataset. The created metadata will be available for search in an inverted index, the theory behind Azure Search's incredible performance. But the enrichment pipeline will index the JSON file, not the original audio. That means that, in a click-to-open solution, the user will read the JSON file text instead of listening to the original recording. Here are the required actions to fix it:

  • First, create an Azure Search index with a field used exclusively for the URL location of the audio files. This can be done together with the previous step, when you create the index, or once after the first execution of the pipeline, using the update index operation. This field will only be filled for audio files, and it is a good idea to use a name that suggests this, like “audioFileUrlLocation”. The field will be null for all non-audio files in your dataset. Tip: it is useful to keep the audio files in the same account that you are using for the other documents, because your application won't need to deal with another connection string.
  • Second, populate this field, which can be done with another script. With the document update method, you can update the field mentioned above with the real location of the audio files. That's why keeping the original file name within the JSON file name is so important: it helps the link-through process connect the index information to the correct audio data source location. Your script will:
  1. Run an Azure Search query to get the keys of the audio transcriptions, which are required to update a single document in the index. The query must retrieve the id and the metadata_storage_name.
  2. For each audio document, find the name of its audio file by parsing the metadata_storage_name, simply removing the “.json” extension.
  3. Submit a “merge” for each audio document, with the body definition shown in Figure 3.
  4. Avoid unnecessary processing manually. Azure Search knows which files were added to a blob container and skips the ones it has already processed, but your script has no such tracking, so you will need to handle it yourself. Use any kind of control you want, like moving files from a “pending” container to a “processed” container. That means the real URL location should be a link to the “processed” container.


Figure 3: The body definition of your document update API call.
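Since the body definition is shown as an image, here is a rough Python sketch of the index field and of steps 1 to 3. It assumes the query results arrive as dictionaries with `id` and `metadata_storage_name`, that `id` is the index key, and that all recordings live under one base URL; the function names and the field attribute choices are illustrative, not taken from the original script. The `@search.action` / `value` layout is the standard Azure Search document batch format.

```python
from typing import Dict, Iterable, List
from urllib.parse import quote


def audio_field_definition() -> Dict[str, object]:
    """Index field that holds the audio URL; attribute choices are an assumption."""
    return {"name": "audioFileUrlLocation", "type": "Edm.String",
            "searchable": False, "filterable": False,
            "sortable": False, "facetable": False, "retrievable": True}


def audio_name_from_transcript(metadata_storage_name: str) -> str:
    """Step 2: 'call-001.wav.json' -> 'call-001.wav'."""
    suffix = ".json"
    if metadata_storage_name.lower().endswith(suffix):
        return metadata_storage_name[: -len(suffix)]
    return metadata_storage_name


def build_merge_batch(docs: Iterable[Dict[str, str]],
                      audio_base_url: str) -> Dict[str, List[Dict[str, str]]]:
    """Step 3: build one 'merge' action per transcription document."""
    base = audio_base_url.rstrip("/")
    actions = []
    for doc in docs:
        name = doc["metadata_storage_name"]
        if not name.lower().endswith(".json"):
            continue  # not a transcription file, leave the document untouched
        audio_name = audio_name_from_transcript(name)
        actions.append({
            "@search.action": "merge",   # update only the listed fields
            "id": doc["id"],             # document key from the step 1 query
            "audioFileUrlLocation": f"{base}/{quote(audio_name)}",
        })
    return {"value": actions}
```

The resulting dictionary would be posted as JSON to the index's document endpoint. With "merge", only `audioFileUrlLocation` changes, so the metadata created by the enrichment pipeline stays intact.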

  • Third, the user interface change is simple. Change an “IF” in your interface: when the audioFileUrlLocation field is not null, use it for user clicks instead of the metadata_storage_file_path field, which is automatically filled for all files in your dataset. That field holds the physical location of the data and is returned to search applications for the click-to-open experience; for our audio files, it points to the JSON transcriptions. The IF allows your application to open the audio file location and start playing it or, at least, offer the option to play it.
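The “IF” from the third action can be sketched as a small helper. This is an illustration only: it assumes each search result reaches the interface as a dictionary keyed by index field name, and the function name `click_target` is made up for this example.

```python
from typing import Mapping, Optional


def click_target(doc: Mapping[str, Optional[str]]) -> Optional[str]:
    """Pick the URL a result should open when the user clicks it."""
    audio_url = doc.get("audioFileUrlLocation")
    if audio_url:
        # Audio transcription: open the recording, not the JSON file.
        return audio_url
    # Every other document: use the location filled in by the pipeline.
    return doc.get("metadata_storage_file_path")
```

Because the field is null for non-audio documents, the same helper works for every result without checking file types.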


This blog post shows how powerful pre-processing is for knowledge mining scenarios. The same process used here for audio files can be applied, along with the Video Indexer API, to video files. You will be extending the impact and the use of the knowledge mining solution to new data types that the original product can't yet handle. The same data enrichment is done for audio and video, allowing a complete search experience for users. For industry-specific vocabularies, you may want to use custom language models, making your transcriptions even more effective. Stay tuned: this blog channel has a long list of AI solutions to be published.


This article was originally published by Microsoft's AI - Customer Engineering Team Blog. You can find the original article here.