Our data scientists created an internal AI translation service for confidential texts
Proekspert runs an internal AI translation service for confidential texts. It is built on argos-translate and LibreTranslate and translates English, German, Danish, Dutch, Finnish, and Spanish.
The wave of large language models (LLMs) that began almost a year ago with the release of ChatGPT prompted everybody to use and test them for various purposes. This led to a series of discussions about using external and internal LLMs, with a focus on the data privacy of our clients and ourselves.
One concrete idea that arose from these discussions was the need for an internal translation service. Popular online translation services are often incompatible with privacy rules, because all the text passes through third-party software on third-party servers. A local service keeps all data within Proekspert, or within your own company premises if you decide to set up the same service.
The Proekspert data science team created an internal translation service that is safe for sensitive data. The service:
- is based on the open-source LibreTranslate (interface) and the argos-translate library;
- supports text, documents, and API calls;
- supports many languages (English, German, Danish, Dutch, Finnish, etc.), with new ones being added;
- can also be used as a regular Python library (argos-translate);
- does not currently support Estonian, so we recommend using English, or your native language if it is supported;
- is accessible only from an internal network.
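As an illustration of the API access mentioned above, here is a minimal sketch of calling a self-hosted LibreTranslate instance from Python using only the standard library. The host name and port are hypothetical placeholders, not the actual Proekspert address, and the sketch assumes the instance does not require an API key. The `/translate` endpoint and the `q`, `source`, `target`, and `format` fields follow LibreTranslate's public API.

```python
import json
import urllib.request

# Hypothetical internal address; replace it with wherever your instance is hosted.
BASE_URL = "http://translate.internal.example:5000"


def build_translate_request(text, source, target, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source, target, base_url=BASE_URL, timeout=30):
    """Send the request and return the translated string from the JSON reply."""
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["translatedText"]
```

With a reachable instance, `translate("A person driving the car is a driver.", "en", "de")` would return the German text. Because the request never leaves the internal network, the confidentiality argument above still holds.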
We performed a brief search for tools and found several: free and proprietary, stand-alone and self-hosted, as well as SaaS-based (mainly proprietary) options. We wanted to:
- look into the topic of in-house AI solutions such as LLMs or NLP services (for maximum assurance of customer data privacy);
- have maximum control over the solution so that, in principle, we could offer to set up a similar service for other use cases or for our clients;
- cover the most useful languages while still using an AI/ML-based solution rather than older translation technology, which will probably become outdated soon;
- provide translation not only for plain text but also for files and web pages;
- provide an API for integration with other software;
- avoid requiring any special hardware, not even a GPU (which most deep learning models need to run fast, or to run at all).
Therefore, the best solution for our needs was open-source, self-hostable, ML-based software that provides an API, a web interface, and file translation.
Of the solutions we could find, argos-translate combined with LibreTranslate ticked most of the boxes. The translation quality is acceptable. It is not “the best”, but it is very likely to improve, since translation models can usually be swapped for newer versions without any source-code changes, which may not be the case for classical translation services.
Context matters with translation services. If possible and needed, add some extra context to the text and remove it from the result after translation. This can change the translation quality (and may even improve the overall message by removing confusion!). Rephrasing a sentence may also help. In the examples below, the same word, “driving”, is translated differently depending on its context.
English: An electronic component driving the motor is a motor driver.
German: Ein den Motor antreibendes elektronisches Bauteil ist ein Motortreiber.
English: A person driving the car is a driver.
German: Eine Person, die das Auto fährt, ist ein Fahrer.
Avoid slang and acronyms. Argos-translate usually fails to translate acronyms and slang terms, as these are rarely part of the core language dataset. It is better to keep the text professional and formal, and to spell acronyms out.
English: Where is the requirements.txt file?
Finnish: Missä ovat vaatimukset. Txt-tiedosto?
English: The ISP API is broken
Finnish: ISP API on rikki
English: The Internet service provider’s (ISP) application programming interface (API) is broken.
Finnish: Internet-palveluntarjoajan (ISP) sovellusliittymä (API) on rikki.
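The examples above suggest checking a text for acronyms and file names before submitting it. The following is a rough sketch of such a check. The function name `find_risky_tokens` and both regular expressions are our own illustrative assumptions, not part of argos-translate or LibreTranslate, and the patterns are deliberately simple heuristics (they will, for instance, also flag decimal numbers as file names).

```python
import re

# Two or more consecutive capital letters, e.g. "ISP" or "API".
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
# A word followed by a dot and a short extension, e.g. "requirements.txt".
FILENAME = re.compile(r"\b[\w-]+\.[A-Za-z0-9]{1,5}\b")


def find_risky_tokens(text):
    """Return acronyms and file-like names that may be mistranslated."""
    return {
        "acronyms": sorted(set(ACRONYM.findall(text))),
        "filenames": sorted(set(FILENAME.findall(text))),
    }
```

The helper only flags tokens. Whether to expand an acronym, or to paste a file name back in after translation, remains a manual decision.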
Use round-trip translation. Translate a text into another language and then translate the result back into the original language. This can help catch some obvious mistakes.
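A round-trip check can be sketched in a few lines. The helper below is hypothetical: it accepts any callable `translate_fn(text, source, target)` (for example, a thin wrapper around your translation service) and uses the standard library's `difflib` to give a rough similarity score between the original and the back-translated text. Character-level similarity is only a crude proxy, so a low score is a prompt for manual review rather than proof of an error.

```python
from difflib import SequenceMatcher


def round_trip(text, source, pivot, translate_fn):
    """Translate source -> pivot -> source and score how much text survives."""
    forward = translate_fn(text, source, pivot)
    back = translate_fn(forward, pivot, source)
    # Ratio is 1.0 for identical strings and approaches 0.0 as they diverge.
    similarity = SequenceMatcher(None, text.lower(), back.lower()).ratio()
    return forward, back, similarity
```

Texts whose score falls below a threshold you choose can then be queued for a closer manual look.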
How to use such services?
After testing several different texts with a variety of technical terms, we can say that it is currently best to use such translation services for sending short replies or “thank you” messages on LinkedIn, or for translating e-mails from a foreign language into English. Translating short phrases, such as button names in software, is also safe. It is still a good idea to check the translations manually, as a synonym might fit better, for example. Fortunately, automatic translation speeds up the whole process, and checking the result is easier and faster than starting from scratch. The service is also useful for multinational and multicultural companies where several working languages are used.
It pays to be even more attentive to the risks when translating highly technical or legal texts or long semi-technical blog posts, or when sending out important information.
Remember that slang, jargon-heavy technical texts, abbreviations, and acronyms are usually a hurdle for the software.
LLMs have gone through an explosion in the past year, and their smaller cousins, NLP translation models, are benefiting from this development as well. In principle, it should be possible in the future to integrate Helsinki-NLP models from Hugging Face into this translation system, and probably also the recent SeamlessM4T model from Meta (Facebook), which may offer higher quality than argos-translate. In any case, the service is up and running at Proekspert and we are currently gathering feedback.