Our data scientists and services team created an internal AI translation service for confidential texts

Proekspert runs an internal AI translation service for confidential texts. It is built on argos-translate and LibreTranslate and supports English, German, Danish, Dutch, Finnish, Spanish, Estonian, and 14 other languages.

With the wave of large language models (LLMs) that began almost a year ago with the release of ChatGPT, everybody started using and testing them for various purposes. This led to a series of discussions about using external and internal LLMs, with a focus on data privacy for our clients and ourselves.

One concrete idea that arose from these discussions was the need for an internal translation service. Popular online translation services may not be compatible with privacy rules in many cases, because all the text passes through third-party software on third-party servers. A local service prevents any data from leaving Proekspert's premises, or your own company's if you decide to set up the same service.

The Proekspert data science team created an internal translation service that is safe for sensitive data.

The service: 

  • is based on open-source LibreTranslate (interface) and argos-translate, but can also enhance precision using ChatGPT;
  • supports text, documents, and API calls;
  • supports many languages (English, German, Danish, Dutch, Finnish, etc.), with new ones being added;
  • can also be used as a regular Python library through argos-translate;
  • is accessible only from an internal network.
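
As a rough illustration of the API route, the sketch below builds a call to LibreTranslate's `/translate` endpoint using only the Python standard library. The host name and port in `BASE_URL` are hypothetical placeholders, not the address of our service, and the helper names are made up for this example. An instance configured to require API keys would also need an `api_key` field in the payload.

```python
import json
import urllib.request

# Hypothetical internal address (assumption); replace with your own instance.
BASE_URL = "http://translate.internal.example:5000"


def build_translate_request(text, source, target, base_url=BASE_URL):
    """Build a POST request for the LibreTranslate /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        url=f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source, target, base_url=BASE_URL):
    """Send the request and return the translated string from the JSON reply."""
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))["translatedText"]
```

With a reachable instance, `translate("Hello", "en", "de")` would return the German text. Keeping the request builder separate from the network call makes the payload easy to inspect without a running server.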

Why argos?

We performed a brief search for tools, and several came up: free and proprietary, stand-alone and self-hosted software, and SaaS-based (mainly proprietary) options. We wanted to:

  1. Look into in-house AI solutions such as LLMs or NLP services (for maximum assurance of customer data privacy);
  2. Have maximum control over the solution so that, in principle, we could offer to set up a similar service for other use cases or for our clients;
  3. Cover the most useful languages while still using an AI/ML-based solution rather than older translation technology, which will probably become outdated soon;
  4. Provide translation not only for plain text but also for files and web pages;
  5. Provide an API for integration with other software;
  6. Avoid requiring any special hardware, not even a GPU (which most deep learning models need to run fast or to run at all).

Therefore, the best solution for our needs was open-source, self-hostable, ML-based software that provides an API, a web interface, and file translation.

Out of the solutions we could find, argos-translate + LibreTranslate ticked most of the boxes. The translation quality is also OK. It is not “the best”, but it is very likely to improve, because translation models can usually be swapped for newer versions without any source-code changes, which may not be the case for classical translation services.

Why Azure AI?

While argos-translate is quite good, and especially energy efficient compared to resource-hungry LLMs, smaller NLP models do lack some context awareness. Azure AI, with its context awareness, is therefore well suited to cross-validating the results in edge cases, but it does not have to be used in simpler cases.

The simpler cases can, for example, be identified by round-trip translation: if a forward translation followed by a reverse translation through one or more languages returns exactly the original text, the translation is very likely good. If not, validation with LLMs can be employed.
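
A minimal sketch of that routing idea follows. It assumes the translation backend is passed in as a plain function `translate(text, source, target)`, so the helper names and the exact-match rule shown here are illustrative rather than a description of our production code. The path runs forward through the given languages and then back in reverse order, which is one possible reading of "through multiple languages".

```python
from typing import Callable, Sequence

# (text, source_code, target_code) -> translated text; any backend can be plugged in.
Translator = Callable[[str, str, str], str]


def round_trip(text: str, languages: Sequence[str], translate: Translator) -> str:
    """Translate forward along `languages`, then back in reverse order."""
    if len(languages) < 2:
        raise ValueError("need a source language and at least one target language")
    # ["en", "de", "fi"] becomes the path en -> de -> fi -> de -> en.
    path = list(languages) + list(reversed(languages))[1:]
    current = text
    for source, target in zip(path, path[1:]):
        current = translate(current, source, target)
    return current


def needs_llm_review(text: str, languages: Sequence[str], translate: Translator) -> bool:
    """Return True when the round trip does not reproduce the original text exactly."""
    return round_trip(text, languages, translate).strip() != text.strip()
```

Texts for which `needs_llm_review` returns `False` could be served by the small model alone, while the rest would be passed on for cross-validation.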

We are still experimenting to find the right balance between the different models.

Our recommendations

Context matters with translation services. If possible and needed, add some context to the text and remove it from the result after translation. It can change the translation quality (and perhaps improve the overall message by removing confusion!). Rephrasing a sentence may also help.

Example: 

English: An electronic component driving the motor is a motor driver. 

German: Ein den Motor antreibendes elektronisches Bauteil ist ein Motortreiber. 

English: A person driving the car is a driver. 

German: Eine Person, die das Auto fährt, ist ein Fahrer. 

Avoid slang and acronyms. Argos-translate usually fails to translate acronyms and slang terms, as these are rarely part of a language's core dataset. It is better to keep the text professional and formal.

Example: 

English: Where is the requirements.txt file? 

Finnish: Missä ovat vaatimukset. Txt-tiedosto? 

English: The ISP API is broken 

Finnish: ISP API on rikki 

English: The Internet service provider’s (ISP) application programming interface (API) is broken. 

Finnish: Internet-palveluntarjoajan (ISP) sovellusliittymä (API) on rikki. 
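
The ISP/API example suggests a simple pre-processing step: spell out known acronyms before sending the text to the translator. The sketch below is one way to do it. The glossary and the function name are made up for illustration, and a real glossary would have to be maintained per team or domain.

```python
import re

# Illustrative glossary (assumption); not a list the service ships with.
GLOSSARY = {
    "ISP": "Internet service provider",
    "API": "application programming interface",
}


def expand_acronyms(text, glossary=GLOSSARY):
    """Replace the first standalone use of each known acronym with 'long form (ACRONYM)'."""
    for acronym, long_form in glossary.items():
        # Skip acronyms whose long form the author has already written out.
        if long_form.lower() in text.lower():
            continue
        pattern = r"\b" + re.escape(acronym) + r"\b"
        replacement = f"{long_form} ({acronym})"
        # A function replacement avoids backslash handling in the replacement string.
        text = re.sub(pattern, lambda match: replacement, text, count=1)
    return text
```

The output still needs a human eye for grammar, for instance possessives or plural forms around the expanded term.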

Use round-trip translation. Translate a text into one language and then translate the result back into the original language. It can help catch some obvious mistakes. 
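
For manual checks, an exact match is often too strict, because a harmless synonym also changes the text. A softer variant is to score how close the back-translation is to the original and to look only at the low scorers. The sketch below uses `difflib` from the Python standard library. The 0.8 threshold is an arbitrary assumption that would need tuning on real texts, and the function names are illustrative.

```python
from difflib import SequenceMatcher


def round_trip_similarity(original, back_translated):
    """Return a 0..1 similarity score, ignoring case and repeated whitespace."""
    a = " ".join(original.lower().split())
    b = " ".join(back_translated.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def looks_suspicious(original, back_translated, threshold=0.8):
    """Flag a translation when the back-translation drifts too far from the original."""
    return round_trip_similarity(original, back_translated) < threshold
```

Character-level similarity is a crude signal: it can miss a single wrong word that flips the meaning, so it narrows down what to review rather than replacing the review.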

How to use such services? 

After testing several different texts with various technical terms, we can say that it is currently best to use such translation services for short replies or thank-you notes on LinkedIn, or for translating e-mails from a foreign language into English. Translating short phrases, such as button names in software, is also safe. It is still a good idea to check the translations manually, as a synonym might fit better, for example. Luckily, automatic translation speeds up the whole process, and checking the result is easier and faster than starting from scratch. The service is also useful for multinational and multicultural companies where several working languages are used.

It pays to be even more attentive and aware of the risks when translating highly technical or legal texts, long semi-technical blog posts or when sending out important information.  

Remember that slang, jargon-heavy technical texts, abbreviations, and acronyms are usually a hurdle for the software.  

Outlook 

LLMs have gone through an explosion in the past year, and their smaller cousins, NLP translation models, are benefiting from this development as well. In principle, it will be possible to integrate Helsinki-NLP models from Hugging Face into this translation system, and probably also the recent SeamlessM4T model from Meta (Facebook), which may offer higher quality than argos-translate. In any case, the service is up and running at Proekspert, and we are currently gathering feedback.

