- Posted by VoiceBoxer Team
- On 31 March 2016
Every now and again, media attention to machine interpretation surges. So what is machine interpretation, how does it work, and why is it so limited in its application?
Machine interpretation is a complicated three-step process that combines three technologies. It converts speech into text, translates the text into another language, and synthesizes the translated text back into speech.
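To make the three-step chain concrete, here is a minimal sketch in Python. All three functions and their lookup tables are hypothetical stand-ins that we made up for illustration; real systems use trained models at every stage. The only point is that each stage receives nothing but the text produced by the stage before it.

```python
# Hypothetical stand-ins for the three stages. Real systems use trained
# models; these lookups only illustrate how the stages chain together.

def recognize_speech(audio: str) -> str:
    """Stage 1: pretend to turn audio into text (the 'audio' is just a label)."""
    fake_transcripts = {"audio_hello.wav": "good morning"}
    return fake_transcripts[audio]

def translate_text(text: str) -> str:
    """Stage 2: pretend to translate English text into Spanish."""
    fake_translations = {"good morning": "buenos días"}
    return fake_translations[text]

def synthesize_speech(text: str) -> str:
    """Stage 3: pretend to turn text back into audio (returns a label)."""
    return f"<synthetic audio: {text}>"

def interpret(audio: str) -> str:
    """Chain the stages; each one only sees the previous stage's text output."""
    return synthesize_speech(translate_text(recognize_speech(audio)))

print(interpret("audio_hello.wav"))  # <synthetic audio: buenos días>
```

Because the stages are chained, anything a stage drops or gets wrong is passed on to the next one.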
1. Automated Speech Recognition
For a message to be interpreted into another language, it has to become a text file at some point during the process. Automated speech recognition is a technology used to do just that–it converts spoken language into written text. You will likely know this from your smartphone, commanding Siri to do something or dictating a reply to a text message. While you might say that the voice commands work fine, dictating a text message usually produces (occasionally hilarious) errors.
This happens because when you are giving a command to Siri, it knows what to expect. There are only so many commands you can give to your smartphone. When writing a message, on the other hand, the system has much less foresight, and that is when it starts producing errors.
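The difference between commands and dictation can be sketched with a few lines of Python. This is not how Siri or any real assistant works internally; the command list is hypothetical, and we use simple string similarity from the standard library's `difflib` as a stand-in for a recognizer that "knows what to expect."

```python
import difflib
from typing import Optional

# Hypothetical command set; a real assistant's list is larger but still finite.
COMMANDS = ["call mom", "set an alarm", "play music", "send a message"]

def match_command(noisy_transcript: str) -> Optional[str]:
    """Return the closest known command, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(noisy_transcript, COMMANDS, n=1, cutoff=0.6)
    return matches[0] if matches else None

# A slightly misheard command can be snapped back to the nearest known one.
print(match_command("set an alarn"))    # set an alarm

# Free dictation has no such list to fall back on, so the error stays.
print(match_command("see you at ate"))  # None
```

With a short list of expected inputs, a small recognition error is easy to repair. With open-ended dictation, the system has nothing to snap the misheard words back to.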
Even when everything goes according to plan and all of your message survives the conversion without errors, a lot of data is lost. Text captures WHAT you say, but it barely captures HOW you say it. While this might be less important when writing an email, it is crucial in spoken communication. The phrase “What are you doing?” can have a lot of different meanings, depending on the context and intonation of the message. Intonation can make the message sound angry, curious, or accusatory. Text can only differentiate between “?” and “!” and “.”, so a lot of nuance is lost in the process.
2. Machine Translation
Once your speech is converted to text, it enters the next step–machine translation. Living in the 21st century, you have likely encountered one machine translation service or another.
There are different approaches to machine translation, yet the two most prominent ones are the following:
Rule-based models could be compared to how you or I would learn to speak a foreign language. A rule-based model is given the vocabulary, linguistic rules, and syntax of both languages in a pair, and it translates what it is given based on this set of rules. This approach is well suited for language pairs with drastically different word order, such as English to Chinese. The downside is that the rules are time-consuming to create, and they have to be created for every individual language pair: a separate model has to be created for English to Spanish, English to French, English to German, etc.
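As a rough illustration of the rule-based idea, here is a toy English-to-Spanish translator in Python. The three-word lexicon and the single reordering rule are hypothetical and hand-written for this example; a real rule-based system needs thousands of entries and rules per language pair.

```python
# Hypothetical hand-written lexicon: English word -> (Spanish word, part of speech).
LEXICON = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
}

def translate_rule_based(sentence: str) -> str:
    """Look up each word, then apply one word-order rule."""
    tagged = [LEXICON[word] for word in sentence.lower().split()]
    output = []
    i = 0
    while i < len(tagged):
        # Reordering rule: English ADJ NOUN becomes Spanish NOUN ADJ.
        if (i + 1 < len(tagged)
                and tagged[i][1] == "ADJ"
                and tagged[i + 1][1] == "NOUN"):
            output += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            output.append(tagged[i][0])
            i += 1
    return " ".join(output)

print(translate_rule_based("the red car"))  # el coche rojo
```

Any word missing from the lexicon raises a `KeyError` here, which hints at why building these models by hand takes so long.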
Statistical Machine Translation, on the other hand, is based on probabilities. A statistically generated model usually starts out by using texts that have already been translated by humans (the more the better) and generates a mathematical model based on the source material (the corpora) it has been given. In action, a statistical model takes a segment to be translated, generates thousands of possible translations, assigns probabilities to them, and chooses the one that is most likely to be correct based on its underlying model. As such, the quality of the statistical machine translation is highly dependent on the materials on which its model is based. In practice, the initial corpora have so far mostly been legal documents such as patents, as they are readily available in multiple languages, which can sometimes give a translation a slightly legal style. You can see the difference in size and quality of the corpora when comparing the quality of translations between different languages. English to Spanish will yield better results than Spanish to Welsh might, simply because there are fewer documents on which to base the statistical model.
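The "count, assign probabilities, pick the most likely" idea can be sketched in a few lines of Python. The tiny parallel corpus below is hypothetical, and estimating probabilities by simple relative frequency is a heavy simplification of what real statistical systems do, so treat this as a toy and not as a description of any actual engine.

```python
from collections import Counter, defaultdict

# Tiny hypothetical "parallel corpus" of aligned phrase pairs (English, Spanish).
PARALLEL = [
    ("bank", "banco"), ("bank", "banco"), ("bank", "banco"),
    ("bank", "orilla"),
    ("house", "casa"), ("house", "casa"),
]

def train(pairs):
    """Estimate P(target | source) by relative frequency in the corpus."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    model = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        model[source] = {t: n / total for t, n in target_counts.items()}
    return model

def translate(model, phrase):
    """Choose the candidate translation with the highest probability."""
    candidates = model[phrase]
    return max(candidates, key=candidates.get)

model = train(PARALLEL)
print(model["bank"])             # {'banco': 0.75, 'orilla': 0.25}
print(translate(model, "bank"))  # banco
```

Note that this toy always picks "banco" for "bank," even if the speaker meant a river bank. The answer depends entirely on what the corpus happened to contain.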
Machine translation is an impressive technological achievement. Even so, its application is limited. While it is quick, translating content that is more complex than simple conversations can make you scratch your head in bewilderment at the result. Additionally, even though a translation might technically be correct, it may well fail to capture the meaning of the original message, as the translation engine is agnostic to context and meaning. Machine translation does not know whether you are writing a legal document, an email to a customer, or a love letter.
3. Speech Synthesis
We have reached the final stage of the machine interpretation process. Your message has been converted to text and translated into the target language. It will now be transformed into audible speech by a synthetic voice.
Synthetic voices are constructed from recordings of a speaker that contain every possible sound in a given language. Those recordings are then split into the individual sounds, normalized and stored in an acoustic database. When “speaking,” a synthetic voice stitches these individual parts back together to produce words and sentences. There is a lot of text-to-speech software out there, yet even the best programs sound uncanny and unnatural.
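The "stitching" step described above can be pictured with a very small Python sketch. The acoustic database here is hypothetical: each sound unit maps to a couple of made-up numbers standing in for a recorded snippet, where a real database would hold actual audio samples.

```python
# Hypothetical acoustic database: each sound unit maps to a short list of
# made-up sample values standing in for a recorded, normalized snippet.
ACOUSTIC_DB = {
    "h": [0.1, 0.2],
    "e": [0.3, 0.1],
    "l": [0.0, -0.1],
    "o": [0.2, 0.4],
}

def synthesize(units):
    """Concatenate the stored snippet for each unit into one waveform."""
    waveform = []
    for unit in units:
        waveform.extend(ACOUSTIC_DB[unit])
    return waveform

samples = synthesize(["h", "e", "l", "o"])
print(len(samples))  # 8
```

The snippets are simply glued end to end, with nothing smoothing the joins or shaping the melody of the sentence, which gives a feel for why stitched speech can sound unnatural.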
Additionally, we encounter the data loss we faced in stage one again. Most of the pronunciation, rhythm and intonation of the original message got lost when we converted it into text, so even if the synthetic voice didn’t sound uncanny, it couldn’t faithfully reproduce the speaker’s style.
To sum it up, machine interpretation is oblivious to the context and meaning of a message, cannot pay attention to speed, intonation, pitch or other nuanced aspects of communication, and can generate multiple and compounded errors in the process. The output of the machine interpretation process will likely bear little resemblance to the input provided. Your message may get lost along the way. As such, automated interpretation is at best suited for simple, low-stakes communication with a very generous margin for error.
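A quick back-of-the-envelope calculation shows how errors compound across the three stages. The per-stage accuracies below are hypothetical numbers chosen for illustration, not measurements, and the calculation assumes that errors in each stage are independent of one another, which is a simplification.

```python
import math

# Hypothetical per-stage accuracies, chosen for illustration only.
STAGE_ACCURACY = {
    "speech recognition": 0.90,
    "machine translation": 0.85,
    "speech synthesis": 0.95,
}

def end_to_end_accuracy(accuracies):
    """Multiply stage accuracies, assuming errors are independent."""
    return math.prod(accuracies)

overall = end_to_end_accuracy(STAGE_ACCURACY.values())
print(f"{overall:.3f}")  # 0.727
```

Under these assumptions, three stages that each look reasonably good on their own leave only about 73% of the message intact from end to end.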
If you want your message perfected, leave it to the professionals–conference interpreters focus not only on what you say, but also on what you mean.