Introducing Custom Display Format in Azure AI Speech

The display format of automatic speech recognition output is critical to readability and to downstream tasks, and one size doesn't always fit all.

We are thrilled to announce the Public Preview of Custom Display Format (also known as “Custom Display-Post-Processing” or “Custom DPP”) within Azure Custom Speech Service, which allows you to customize the speech recognition output format to fit your business needs.

What's in Public Preview:

Custom Display Formatting allows customers to define their own speech recognition display formats with custom speech models and endpoints. It includes the following capabilities:

  • Rewrite: Capitalize, reform, or replace certain words and phrases.
  • Custom profanity: Mask additional profanity words and phrases from output.
  • Custom ITN (inverse-text-normalization): Define advanced patterns such as numbers, addresses, emails, etc.


Supported languages and locales:

en-us, en-au, en-ca, en-gb, en-hk, en-in, en-ie, en-nz, en-ng, en-ph, en-sg, en-zh, zh-cn, zh-hk, pt-br, pt-pt, it-it, es-es, es-mx, fr-fr, fr-ca, de-de, ja-jp, ko-kr, nb-no, nl-nl, sv-se, da-dk, fi-fi, tr-tr, hi-in, pl-pl, etc (Language support)

How it works:

The display text pipeline is composed of a sequence of display format builders. Each builder corresponds to a display format task such as ITN, capitalization, or profanity filtering.

  • Inverse Text Normalization (ITN): To convert the text of spoken form numbers to display form. For example: I spend twenty dollars -> I spend $20
  • Capitalization: To upper case entity names, acronyms, or the first letter of a sentence. For example: she is from microsoft -> She is from Microsoft
  • Profanity filtering: Masking or removal of profanity words from a sentence. For example, assuming “abcd” is a profanity word, then the word will be masked by profanity masking: I never say abcd -> I never say ****
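The idea of a pipeline built from a sequence of builders can be sketched in a few lines of Python. This is only a toy illustration, not the Speech service implementation: the function names (toy_itn, toy_capitalize, toy_profanity_mask, run_pipeline) are hypothetical, each builder handles only the single example given above, and the order in which they run here is an assumption.

```python
import re
from typing import Callable, List

# A builder is modeled as a function from text to text (an assumption of this sketch).
Builder = Callable[[str], str]


def toy_itn(text: str) -> str:
    # Toy ITN: handles only the one spoken phrase from the example above.
    return re.sub(r"\btwenty dollars\b", "$20", text)


def toy_capitalize(text: str) -> str:
    # Capitalize one known entity name and the first letter of the sentence.
    text = re.sub(r"\bmicrosoft\b", "Microsoft", text)
    return text[:1].upper() + text[1:]


def toy_profanity_mask(text: str) -> str:
    # Mask the placeholder profanity word "abcd" with one asterisk per character.
    return re.sub(r"\babcd\b", lambda m: "*" * len(m.group()), text, flags=re.IGNORECASE)


def run_pipeline(text: str, builders: List[Builder]) -> str:
    # Feed the output of each builder into the next one.
    for builder in builders:
        text = builder(text)
    return text


# The ordering below is an assumption made for this sketch.
PIPELINE = [toy_itn, toy_capitalize, toy_profanity_mask]

print(run_pipeline("she is from microsoft", PIPELINE))
print(run_pipeline("i spend twenty dollars", PIPELINE))
```

Modeling each builder as a plain function keeps the sketch easy to extend: a custom step would simply be one more function in the list.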

The base builders of the display text pipeline are maintained by Microsoft for general-purpose display processing tasks, and you get them by default when you use the Speech service. On top of the base builders, you can define custom display text formatting rules (Custom ITN, Custom Profanity, Rewrite) to customize the display text formatting pipeline for your specific scenarios.

The order of the display text formatting pipeline is illustrated in this diagram.

[Diagram: order of the display text formatting pipeline]

How to use in Studio (video tutorial):

How to prepare your Custom Display Format data:

Download sample data

Rewrite

A rewrite model is a collection of rewrite rules. Generally speaking, for each rewrite rule, the rewrite model tries to replace the original phrase in the input string with the corresponding new phrase.

  • A rewrite rule is a pair of phrases: the original phrase and a new phrase. The two phrases are separated by a TAB character. For example: original phrase{TAB}new phrase
  • The original phrase will be matched (case-insensitive) and replaced with the new phrase (case-sensitive). Grammar punctuation characters in the original phrase are ignored during match.
  • If any rewrite rules conflict, the one with the longer original phrase will be used as the match.
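The three rules above can be approximated with a short Python sketch. It is an illustration under stated assumptions, not the service's matcher: load_rules and apply_rewrite are hypothetical names, matches are restricted to whole words, and the rule about ignoring grammar punctuation in the original phrase is left out to keep the example small.

```python
import re
from typing import Dict


def load_rules(tsv_text: str) -> Dict[str, str]:
    """Parse lines of the form 'original phrase<TAB>new phrase'."""
    rules: Dict[str, str] = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        original, new = line.split("\t", 1)
        # Keys are lowercased because matching is case-insensitive;
        # the new phrase is kept exactly as written (case-sensitive).
        rules[original.strip().lower()] = new.strip()
    return rules


def apply_rewrite(text: str, rules: Dict[str, str]) -> str:
    if not rules:
        return text
    # Longest original phrase first, so it wins when rules conflict.
    phrases = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in phrases) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: rules[m.group(0).lower()], text)


RULES = load_rules(
    "covered 19\tCOVID-19\n"
    "gottfried leibniz\tGottfried Leibniz\n"
    "covered\tcover\n"
)

print(apply_rewrite("covered 19 is a virus", RULES))
```

The extra rule for "covered" is there only to show the conflict case: "covered 19" is longer, so it is the rule that fires on that input.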

Rewrite: Grammar Punctuation

Grammar punctuation is used to separate a sentence or phrase and to clarify how it should be read. The following characters are grammar punctuation when they are followed by a space or appear at the beginning or end of a sentence or phrase:

. , ? 、 ! : ; ? 。 , ¿ ¡ । ؟ ،

Here are the grammar punctuation rules:

  • The supported punctuation characters are grammar punctuations if they are followed by space or at the beginning or end of a sentence or phrase. For example, the . in x. y (with a space between . and y) is a grammar punctuation.
  • Punctuation characters that are in the middle of a word (except zh-cn and ja-jp) aren't grammar punctuations. In that case they are ordinary characters. For example, the . in x.y isn't a grammar punctuation.
  • For zh-cn and ja-jp (non-spacing locales), punctuation characters are always used as grammar punctuation even if they are between characters. For example, the . in 中.文 is a grammar punctuation.
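The three rules above amount to a small decision procedure, sketched below in Python. The function name is_grammar_punctuation is hypothetical, and the character set is simply the de-duplicated list of characters shown above; treat it as an illustration of the rules rather than the service's actual logic.

```python
# Characters taken from the list above (duplicates removed).
GRAMMAR_PUNCT = set(". , ? 、 ! : ; 。 ¿ ¡ । ؟ ،".split())

# Non-spacing locales named in the rules above.
NON_SPACING_LOCALES = {"zh-cn", "ja-jp"}


def is_grammar_punctuation(text: str, index: int, locale: str) -> bool:
    """Decide whether text[index] counts as grammar punctuation."""
    ch = text[index]
    if ch not in GRAMMAR_PUNCT:
        return False
    # In non-spacing locales the character always counts, even between characters.
    if locale.lower() in NON_SPACING_LOCALES:
        return True
    at_edge = index == 0 or index == len(text) - 1
    followed_by_space = index + 1 < len(text) and text[index + 1].isspace()
    return at_edge or followed_by_space


print(is_grammar_punctuation("x. y", 1, "en-us"))  # followed by a space
print(is_grammar_punctuation("x.y", 1, "en-us"))   # in the middle of a word
print(is_grammar_punctuation("中.文", 1, "zh-cn"))  # non-spacing locale
```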

Rewrite: Examples

  • Spelling correction

The name “COVID-19” is sometimes recognized as “covered 19”. To correct it:

Rewrite rule: covered 19{TAB}COVID-19

Sample: "covered 19 is a virus" -> "COVID-19 is a virus"

  • Name capitalization

Gottfried Wilhelm Leibniz was a German mathematician. To make his name display correctly:

Rewrite rule: gottfried leibniz{TAB}Gottfried Leibniz

Sample: "gottfried leibniz was a mathematician" -> "Gottfried Leibniz was a mathematician"

Custom Profanity

A custom profanity model acts the same as the base profanity model, except that it uses a custom profanity phrase list: it tries to match all the profanity phrases defined in the display text formatting file.

  • The profanity phrases will be matched (case-insensitive).
  • If any profanity phrase rules conflict, the longest phrase will be used as the match.
  • These punctuation characters aren't supported in a profanity phrase: . , ? 、 ! : ; ? 。 , ¿ ¡ । ؟ ، .
  • For zh-cn and ja-jp locales, English profanity phrases aren't supported. English profanity words are supported. Profanity phrases for zh-cn and ja-jp locales are supported.

Custom Profanity: Examples

  • Single profanity word

Assume xyz is a profanity word. To add it:

Profanity rule: xyz

Test sample: Turned on profanity masking to mask xyz -> Turned on profanity masking to mask ***

  • Profanity phrase

Assume abc lmn is a profanity phrase. To add it:

Profanity rule: abc lmn

Test sample: Turned on profanity masking to mask abc lmn -> Turned on profanity masking to mask *** ***
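The masking behavior in these two samples can be reproduced with a small Python sketch. It assumes whole-word, case-insensitive matching with the longest phrase winning, and it replaces every non-space character of a match with an asterisk; mask_profanity is a hypothetical helper, not part of any Speech SDK.

```python
import re
from typing import Iterable


def mask_profanity(text: str, phrases: Iterable[str]) -> str:
    # Drop empty entries and sort longest first so the longest phrase wins.
    cleaned = sorted({p.strip() for p in phrases if p.strip()}, key=len, reverse=True)
    if not cleaned:
        return text
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in cleaned) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Replace every non-space character of a match with '*', keeping the spaces.
    return pattern.sub(lambda m: re.sub(r"\S", "*", m.group(0)), text)


PHRASES = ["xyz", "abc lmn"]

print(mask_profanity("Turned on profanity masking to mask xyz", PHRASES))
print(mask_profanity("Turned on profanity masking to mask abc lmn", PHRASES))
```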

Custom ITN

The philosophy of pattern-based custom ITN is that you can simply specify the final output that you want to see. The Speech service will figure out how the words might be spoken and map the predicted spoken expressions to the specified output format.

A custom ITN model is built from a set of ITN rules. An ITN rule is a regular-expression-like pattern string that describes:

  • A matching pattern of the input string
  • The desired format of the output string

The default ITN rules provided by Microsoft will be applied first. The output of the default ITN model will be used as the input of the custom ITN model. The matching algorithm inside the custom ITN model is case-insensitive.

There are four categories of pattern matching with custom ITN rules.

  • Patterns with literals
  • Patterns with wildcards
  • Patterns with Regex-style Notation
  • Patterns with explicit replacement

Custom ITN: Patterns with Literals

For example, a developer may have an item (maybe a product) named with the alphanumeric form “JO:500”. The job of our system will be to figure out that users might say the letter part as “J O”, or they might say “joe”, and the number part as “five hundred” or “five zero zero” or “five oh oh” or “five double zero”, and then build a model that maps all of these possibilities back to “JO:500” (including inserting the colon).

Patterns can be applied in parallel (put one per line), so a pattern specification file can be specified like:

JO:500

MM:760
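To make the idea of mapping spoken variants back to one display form concrete, here is a toy Python sketch. The Speech service derives the spoken forms itself; in this sketch they are written out by hand, using only the variants named in the paragraph above, and build_lookup and to_display are hypothetical names.

```python
from itertools import product
from typing import Dict, List

# Spoken variants listed in the text above for the two parts of "JO:500".
LETTER_FORMS = ["j o", "joe"]
NUMBER_FORMS = ["five hundred", "five zero zero", "five oh oh", "five double zero"]


def build_lookup(display: str, *part_forms: List[str]) -> Dict[str, str]:
    # Every combination of one variant per part maps to the same display form.
    return {" ".join(combo): display for combo in product(*part_forms)}


LOOKUP = build_lookup("JO:500", LETTER_FORMS, NUMBER_FORMS)


def to_display(spoken: str) -> str:
    # Normalize case and whitespace, then fall back to the input if unknown.
    key = " ".join(spoken.lower().split())
    return LOOKUP.get(key, spoken)


print(len(LOOKUP))
print(to_display("Joe five double zero"))
```

With two letter variants and four number variants, the table holds eight spoken forms for a single pattern, which is why specifying only the output is far less work than listing the inputs.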

Custom ITN: Patterns with Wildcards

Suppose a customer needs to refer to a whole series of alphanumeric items named “JO:500”, “JO:600”, “JO:700”, etc. There are several ways to support this without spelling out every possibility.

Character ranges can be specified with the notation [...], so JO:[5-7]00 is equivalent to writing out three patterns.

There is also a set of wildcard items that can be used. One of these is \d, which means any digit. So “JO:\d00” covers “JO:000”, “JO:100”, etc … “JO:900”.

Here is a list of supported character classes:

  • \d – match a digit from '0' to '9', and output it directly
  • \l – match a letter (case-insensitive) and transduce it to lower case
  • \u – match a letter (case-insensitive) and transduce it to upper case
  • \a – match a letter (case-insensitive) and output it directly

There are also “whack escape” expressions for referring to characters that otherwise have special syntactic meaning:

  • \\ – match and output the char \
  • \( and \)
  • \{ and \}
  • \|
  • \+ and \? and \*
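One way to get a feel for which written forms a wildcard pattern covers is to translate it into an ordinary Python regular expression. The sketch below assumes the character classes and escapes are written with a leading backslash (\d, \l, \u, \a, \(, and so on), handles only literals, [...] ranges, classes, and escapes, and models the output side of the pattern (so \l becomes lower-case letters and \u upper-case letters). pattern_to_regex is a hypothetical helper, not a service API.

```python
import re

# Output-side character sets for each class (an interpretation for this sketch).
CLASS_MAP = {"d": "[0-9]", "l": "[a-z]", "u": "[A-Z]", "a": "[A-Za-z]"}
# Characters that may be escaped with a backslash to be taken literally.
ESCAPABLE = set("\\(){}|+?*")


def pattern_to_regex(pattern: str) -> re.Pattern:
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in CLASS_MAP:
                out.append(CLASS_MAP[nxt])       # character class such as \d
            elif nxt in ESCAPABLE:
                out.append(re.escape(nxt))       # escaped literal such as \+
            else:
                raise ValueError(f"unsupported escape: \\{nxt}")
            i += 2
        elif ch == "[":
            end = pattern.index("]", i)          # raises ValueError if unclosed
            out.append(pattern[i:end + 1])       # copy the range, e.g. [5-7]
            i = end + 1
        else:
            out.append(re.escape(ch))            # ordinary literal character
            i += 1
    return re.compile("".join(out))


print(bool(pattern_to_regex("JO:[5-7]00").fullmatch("JO:600")))
print(bool(pattern_to_regex(r"JO:\d00").fullmatch("JO:900")))
```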

Custom ITN: Patterns with regex-style notation

To enhance the flexibility of pattern writing, regular-expression-like constructions of phrases with alternatives and Kleene closure are supported.

  • A phrase is indicated with parentheses, like (...) – The parentheses don't literally count as characters to be matched.
  • You can indicate alternatives within a phrase with the | character, such as (AB|CDE).
  • You can suffix a phrase with ? to indicate that it is optional, + to indicate that it can be repeated, or * to indicate both. Note that you can only suffix phrases with these characters and not individual characters (which is more restrictive than most regular expression implementations).

A pattern such as (AB|CD)-(\d)+ would represent constructs like “AB-9” or “CD-22” and be expanded to spoken words like A B nine and C D twenty two (or C D two two).
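The restriction that ?, + and * may only follow a parenthesized phrase is easy to check mechanically. The Python sketch below does that and then, for patterns limited to ( ), |, ?, +, * and \d (where the syntax happens to coincide with Python's re module), tests whether a written form is covered. It assumes the digit wildcard is written as \d; misplaced_quantifiers and covers are hypothetical helper names.

```python
import re
from typing import List


def misplaced_quantifiers(pattern: str) -> List[int]:
    """Return indexes of ?, + or * that do not directly follow a closing ')'."""
    bad = []
    i = 0
    prev_closes_phrase = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Skip the backslash and the escaped character (e.g. \d or \+).
            prev_closes_phrase = False
            i += 2
            continue
        if ch in "?+*" and not prev_closes_phrase:
            bad.append(i)
        prev_closes_phrase = ch == ")"
        i += 1
    return bad


def covers(pattern: str, written: str) -> bool:
    # Only valid for the subset of patterns that is also valid Python regex.
    if misplaced_quantifiers(pattern):
        raise ValueError("?, + and * may only follow a parenthesized phrase")
    return re.fullmatch(pattern, written) is not None


print(misplaced_quantifiers(r"AB-\d+"))
print(covers(r"(AB|CD)-(\d)+", "CD-22"))
```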

Custom ITN: Patterns with explicit replacement

Although our general philosophy is “you show us what the output should look like, and we'll figure out how people will say it”, this doesn't work 100% of the time because some scenarios may have quirky, unpredictable ways of saying things, or our background rules may just have gaps. For example, there may be colloquial pronunciations for initials and acronyms: “ZPI” might be spoken as “zippy”. In this case a pattern like ZPI-\d\d is unlikely to work if a user says “zippy twenty two”. For this sort of situation, we have a notation {spoken>written}. This particular case could be written as {zippy>ZPI}-\d\d.

This can be useful for handling things that the Speech mapping rules don't yet support. For example, you might write a pattern \d0-\d0 expecting the system to understand that “-” can mean a range and should be pronounced “to”, as in “twenty to thirty”. But perhaps it doesn't. So you can write a more explicit pattern like \d0{to>-}\d0 and tell it how you expect the dash to be read.

You can also leave out the > and the following written form to indicate words that should be recognized but ignored. So a pattern like {write} (\u.)+ will recognize “write A B C” and output “A.B.C”, dropping the “write” part.
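The {spoken>written} notation is simple enough to pull apart with a regular expression. The Python sketch below extracts the spoken and written parts of each explicit replacement and shows the written-side template that remains. It assumes wildcards are written with a leading backslash and does not handle every escaping corner case; extract_replacements and written_template are hypothetical names.

```python
import re
from typing import List, Tuple

# {spoken>written} or {spoken}; a brace preceded by a backslash is skipped.
REPLACEMENT = re.compile(r"(?<!\\)\{([^{}>]*)(?:>([^{}]*))?\}")


def extract_replacements(pattern: str) -> List[Tuple[str, str]]:
    """Return (spoken, written) pairs; written is '' when the words are dropped."""
    return [(m.group(1), m.group(2) or "") for m in REPLACEMENT.finditer(pattern)]


def written_template(pattern: str) -> str:
    """Replace each {spoken>written} token with its written form."""
    return REPLACEMENT.sub(lambda m: m.group(2) or "", pattern)


print(extract_replacements(r"{zippy>ZPI}-\d\d"))
print(written_template(r"\d0{to>-}\d0"))
```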

Custom ITN: Examples

  • Group digits -> To group 6 digits into 2 groups and add a '-' char between them:

ITN rule: \d\d\d-\d\d\d

Sample: "cadence one oh five one fifteen" -> "cadence 105-115"

  • Format a film name -> Space: 1999 is a famous film, to support it:

ITN rule: Space: 1999

Sample: "watching space nineteen ninety nine" -> "watching Space: 1999"

  • Format a weapon name -> To support the weapon names of AK family:

ITN rule: AK-\d\d

Sample: "a k forty seven" -> "AK-47"

  • Pattern with Replacement:

ITN rule: \d[05]{ to >-}\d[05]

Sample: "fifteen to twenty" -> "15-20"


This article was originally published by Microsoft's Azure AI Services Blog. You can find the original article here.