Hacking the RegEx entity within LUIS

By Jaya Mathew and Mithun Prasad, PhD

In our previous blog, we gave a brief introduction to machine translation and explored topics such as identifying the language and translating or transliterating spoken or typed text using Microsoft's Translator Text API. We also discussed how translated or transliterated text can be integrated within a LUIS app. In this blog, we highlight new language support coming to LUIS and provide tips on improving app performance when using languages that are in the preview phase.

At the time of writing the previous blog, Hindi was not natively supported in the LUIS portal. However, the LUIS portal now supports additional languages in preview, including Hindi in its native script. A user can therefore create a new app with the culture set to ‘Hindi Indian (Preview)’, as shown in Figure-1, and then type Hindi utterances directly into the new app.


Figure-1: Creating an app with Hindi language

In the preview phase, however, some of the pre-built entities, such as URL, are not supported for native-language text, so the user might run into issues when trying to tag URLs, as shown in Figure-2:


Figure-2: URLs in Hindi native script

One way to work around this is to create a RegEx (Regular Expression) entity manually as shown in Figure-3:


Figure-3: RegEx workaround for pre-built entities

Generic entities such as phone numbers and URLs can be extracted using regular expressions for matching standard patterns. Examples of URLs rendered in native language are as follows:


It is important to note that, to cover the full range of Devanagari characters used in Hindi script, we use the Unicode range \u0900-\u097F in the pattern rather than typing native characters.
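
To illustrate the idea outside the portal, here is a minimal sketch in Python of a URL pattern built on the \u0900-\u097F range. The pattern itself, the names `URL_PATTERN` and `find_urls`, and the sample sentence are our own assumptions for illustration; they are not the exact expression shown in Figure-3, and the regular expression syntax accepted by a LUIS RegEx entity may differ from Python's `re` module.

```python
import re

# Devanagari block written as Unicode escapes rather than native characters.
# The raw string keeps the backslashes so that the regex engine resolves them.
DEVANAGARI = r"\u0900-\u097F"

# One dot-separated label: ASCII letters, digits, hyphens, or Devanagari characters.
LABEL = rf"[A-Za-z0-9{DEVANAGARI}-]+"

# Assumed URL shape: optional http(s) scheme, two or more labels joined by dots,
# and an optional path made of non-whitespace characters.
URL_PATTERN = re.compile(rf"(?:https?://)?{LABEL}(?:\.{LABEL})+(?:/\S*)?")


def find_urls(text: str) -> list[str]:
    """Return every substring of text that matches the assumed URL shape."""
    return URL_PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical utterance containing a URL written in Hindi script.
    print(find_urls("कृपया www.उदाहरण.भारत देखें"))
```

Because the character class is expressed as a range of code points, the same pattern covers every character in the Devanagari block, including vowel signs and other combining marks that are easy to miss when listing native characters by hand.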



This adds a lot of flexibility when building a LUIS model in languages that are in preview and do not yet support the various prebuilt entities.








This article was originally published by Microsoft's AI - Customer Engineering Team Blog. You can find the original article here.