Organizations and applications increasingly aim to cater to an international clientele. This often requires them to operate in various languages. For example, a multinational corporation may have to translate internal documents for employees who live in different countries and speak diverse languages. The same company may also want to offer its user manuals in various languages, as it sells its products in different countries.
Microsoft Translator is a cognitive service that can be a good starting point in these circumstances. It can be used for:
- Personal use: to translate real-time conversations, menus and street signs while offline, websites, documents, and more using the Translator apps.
- Business operations: to help globalize your business and customer interactions.
- Education: to create a more inclusive classroom for both students and parents with live captioning and cross-language understanding.
The Custom Translator feature of the Microsoft Translator API enables you to take it a step further, by allowing you to create a translation service that is customized to work on your own text data. It is currently limited to translating between English and around 50 supported languages (Custom Translator does not currently support translation between two non-English languages).
Let's say you are asked to create a translation service so that employees can share their work by translating documents back and forth between the different languages spoken at their organization. Unfortunately, these documents are full of idiosyncratic terminology that publicly available translation services cannot understand. Imagine having to translate the German word “Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, which is the name of a law that regulates how responsibilities can be delegated among public employees who are in charge of ensuring that beef products are labeled correctly. This might be tough for a translation service that is not familiar with this subject or term.
More seriously though, consider the following scenarios in which the Custom Translator feature might be helpful:
- User manuals for airplanes contain concepts that are not used outside the aviation industry. A general-purpose translation service may erroneously translate these concepts as if they were part of common language. You don't want “Microsoft Surface Go” to be translated into German as “Microsoft Oberfläche Gehen”.
- A similar problem occurs when translating internal product or process documentation. Here, an additional requirement might be that these documents contain highly sensitive information, such as intellectual property. These documents therefore cannot be published, which would be necessary for general-purpose translation systems to have access to them for training.
In summary, the Custom Translator feature enables you to get better translations if you expect that translation requests will contain a lot of idiosyncratic terminology (e.g. industry-specific vocabulary). It also lets you train on sensitive documents without having to publish them.
Creating your first custom translation service for two languages is straightforward; the Quickstart is about a 2-minute read. You basically upload parallel documents, train the model, and deploy it. As the name implies, parallel documents consist of two parts, one for each language. For example, a parallel document could be a zip archive with two files in it, one containing the English versions and the other one the German versions of your user manuals. Ideally, the rows of the two files are aligned, but some pre-processing is performed at the beginning of training to ensure that rows are perfectly aligned. Do make sure that you have at least 10K rows of data though, otherwise you probably won't have a rich enough vocabulary during training to achieve a reasonable customization. With fewer rows, training will fail, as stated in the Sentence Alignment documentation.
Figure 1. Basic steps of creating a custom translation service.
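To make the data preparation step more concrete, here is a minimal Python sketch that checks two lists of rows against the 10K minimum and packs them into a zip archive with one file per language. The function names and the file names (manuals_en.txt, manuals_de.txt) are made up for illustration, so please check the Custom Translator documentation for the file formats and naming conventions it actually accepts.

```python
import io
import zipfile

MIN_ROWS = 10_000  # minimum number of rows mentioned above


def check_parallel_rows(source_rows, target_rows, min_rows=MIN_ROWS):
    """Return a list of problems; an empty list means the data looks usable."""
    problems = []
    if len(source_rows) != len(target_rows):
        problems.append(
            f"row count mismatch: {len(source_rows)} source rows "
            f"vs {len(target_rows)} target rows"
        )
    if min(len(source_rows), len(target_rows)) < min_rows:
        problems.append(f"fewer than {min_rows} rows of data")
    return problems


def build_parallel_zip(source_rows, target_rows,
                       source_name="manuals_en.txt",   # hypothetical name
                       target_name="manuals_de.txt"):  # hypothetical name
    """Pack both languages into one in-memory zip archive, one row per line."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(source_name, "\n".join(source_rows))
        archive.writestr(target_name, "\n".join(target_rows))
    return buffer.getvalue()


if __name__ == "__main__":
    english = ["Press the power button.", "Close the cover."]
    german = ["Drücken Sie die Ein/Aus-Taste.", "Schließen Sie die Abdeckung."]
    # With only two rows this reports that the data set is too small.
    for problem in check_parallel_rows(english, german):
        print("Problem:", problem)
    print(len(build_parallel_zip(english, german)), "bytes in archive")
```

Running the check before uploading is cheap and can save you a failed training run.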
The devil is often hidden in the details, so let's look at some best practices that might be helpful for your use-case.
Proper usage of dictionaries
In addition to parallel documents, the Custom Translator UI also allows the upload of bilingual dictionaries, containing entries defining the terminology used in your documents. Dictionaries can contain phrases or sentences for which you want to use an exact translation. When a translation request contains a phrase or sentence that matches a source entry in the dictionary, the target phrase or sentence is returned in the translation; otherwise, the best match from the model is returned. This can be especially useful for compound words, such as “Microsoft Surface Go”, to ensure they are translated correctly, and that capitalization is exactly how you want it to be. For proper dictionary usage and best practices, please refer to the documentation.
I recommend that you use dictionaries only sparingly, because they have a downside: When a translation request for a sentence includes a word from the dictionary, a placeholder is inserted for that word before translating the remaining sentence. The translation model will then lack this word as context when translating the rest of the sentence. Imagine trying to translate a sentence yourself when critical words are missing: You're not going to do so well. The model will have the same problem.
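To illustrate the effect, here is a toy Python sketch of the placeholder idea. It is not how Custom Translator is implemented internally; the function name and the placeholder format (__TERM0__) are my own assumptions, chosen only to show what the model gets to see once a dictionary term has been swapped out.

```python
import re


def mask_dictionary_terms(sentence, dictionary):
    """Replace dictionary source terms with placeholders (toy illustration).

    Returns the masked sentence and a mapping from each placeholder to the
    target-language term that would be filled in after translation.
    """
    masked = sentence
    replacements = {}
    # Longest terms first, so multi-word entries win over shorter ones.
    for index, term in enumerate(sorted(dictionary, key=len, reverse=True)):
        placeholder = f"__TERM{index}__"  # hypothetical placeholder format
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(masked):
            masked = pattern.sub(placeholder, masked)
            replacements[placeholder] = dictionary[term]
    return masked, replacements


if __name__ == "__main__":
    dictionary = {"Microsoft Surface Go": "Microsoft Surface Go"}
    masked, replacements = mask_dictionary_terms(
        "The Microsoft Surface Go ships with a charger.", dictionary
    )
    print(masked)
    print(replacements)
```

In this sketch, the sentence left for the model is "The __TERM0__ ships with a charger.", so the model has to translate the rest without knowing what the subject of the sentence is.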
One alternative use for dictionaries is to upload them as additional parallel documents for training, rather than as dictionaries. This way the model will just learn to use the dictionary entries, and words from the dictionary will be available as context when translating the rest of the sentence.
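As a rough sketch of this conversion, the following Python function turns a dictionary into two aligned texts, one per language, that can then be saved as a parallel document. I am assuming here that the dictionary is available as tab-separated text with one source/target pair per line; the function name is made up, and your dictionary may well be in a different format.

```python
def dictionary_to_parallel_text(tsv_text):
    """Split tab-separated dictionary entries into two aligned texts.

    Assumes one "source<TAB>target" entry per line (an assumption about the
    input format). Lines that do not have exactly two non-empty fields are
    skipped, so both outputs always have the same number of rows.
    """
    source_rows, target_rows = [], []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue
        source, target = fields[0].strip(), fields[1].strip()
        if source and target:
            source_rows.append(source)
            target_rows.append(target)
    return "\n".join(source_rows), "\n".join(target_rows)


if __name__ == "__main__":
    entries = "Microsoft Surface Go\tMicrosoft Surface Go\nlanding gear\tFahrwerk\n"
    english, german = dictionary_to_parallel_text(entries)
    print(english)
    print(german)
```

Keep in mind that dictionary entries used this way count as ordinary training rows, so the model learns them as a strong hint rather than as a guaranteed exact translation.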
Getting parallel documents in line with terminology
One final suggestion on how to get translations in line with the desired terminology: I have had the issue before where I included a dictionary as a parallel document for training, but my tests revealed that the deployed model wasn't using the terminology reliably after all. What happened?
I realized that some of the parallel documents I used for training were kind of old, containing deprecated terminology. That must have been really confusing for the model. There are two options for dealing with this. The simple solution is to remove sentences from the parallel documents that are inconsistent with the terminology. The other, more complicated solution is to correct those sentences in the parallel documents, to ensure that they are consistent with the terminology.
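Here is a minimal Python sketch of how such inconsistent sentences could be found. It uses a naive, case-insensitive substring check: a sentence pair is flagged when the source contains a term but the target lacks the approved translation. The function names are my own, and a real check would probably need to handle inflected forms, which this one does not.

```python
def is_consistent(source, target, terminology):
    """Return False if the source uses a term whose approved translation
    is missing from the target (naive, case-insensitive substring check)."""
    source_lower, target_lower = source.lower(), target.lower()
    for source_term, target_term in terminology.items():
        if source_term.lower() in source_lower and target_term.lower() not in target_lower:
            return False
    return True


def split_by_consistency(pairs, terminology):
    """Split sentence pairs into those to keep and those to drop or correct."""
    kept, flagged = [], []
    for source, target in pairs:
        if is_consistent(source, target, terminology):
            kept.append((source, target))
        else:
            flagged.append((source, target))
    return kept, flagged


if __name__ == "__main__":
    terminology = {"Surface Go": "Surface Go"}
    pairs = [
        ("The Surface Go is light.", "Das Surface Go ist leicht."),
        ("The Surface Go is light.", "Das Oberfläche Gehen ist leicht."),
        ("Hello.", "Hallo."),
    ]
    kept, flagged = split_by_consistency(pairs, terminology)
    print(len(kept), "pairs kept,", len(flagged), "pairs flagged")
```

For the simple solution you train only on the kept pairs; for the more complicated one you review and correct the flagged pairs before adding them back.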
Thanks for reading, I hope these suggestions will be helpful when you try to build a custom translator solution.