University Post
University of Copenhagen
Independent of management

Science

No more garbled gobbledegook from Google

Copenhagen research to make Google a better translator

It is the plague of international researchers and students – also here at the University of Copenhagen.

It is the infamous translator Google Translate, a programme that either helps internationals get the sense of official-looking university e-mails in Danish – or confuses them even more.

Now, thanks to a researcher at the University of Copenhagen, there is still hope. Lecturer in language technology Anders Søgaard has identified and documented fundamental patterns of translation, patterns which standard translator systems cannot handle.

At present Google Translate can translate fixed words or expressions. But if you change the word order or put a little word in between, the system gets confused. As a result, texts translated with the Google tool are often filled with meaningless words and phrases. These translations are often unintentionally funny.

Danish a tough language for Google

But Anders Søgaard’s research has led to the development of a new translation system, called Phrasal, from Stanford University in California.

»Two thirds of the translated sentences have patterns that cannot be translated by today’s translation systems. This is a problem. Many of the errors from Google Translate that we find funny, are about discontinuous phrases – or translation units with gaps – which confuse the system. This error is solved in the new system, and the quality of translation will be much better in languages with a lot of these translation units with gaps – for example Danish, German and Spanish«, Søgaard says to the Danish daily Politiken.

He expects Google to implement the system by the end of the year, and many companies to switch their translation systems to Phrasal.

There are several challenges in machine translation, he explains. Danish has a lot of fixed expressions, adverbs and closed units, and the system needs a lot of data to work out what they mean.

Will not solve bad e-mails

A survey last year by the University Post showed that international researchers received important e-mails from the University of Copenhagen only in Danish. Researchers and students from several faculties are forced to translate them into English with Google Translate. They do so routinely, and are sometimes disconcerted by the strange results.

Søgaard says to the University Post that this specific problem will continue even after the implementation of Phrasal, because these types of e-mails include a type of language and data that is not well-known to Google.

»The errors you experience as a user indicate that Google Translate has not seen a sufficient amount of similar texts. A solution would be for the university to train their own machine translation system on large amounts of data in the form of manually translated university e-mails from, and into, Danish. But I do not know if they have that kind of data«, Søgaard says.

Not enough data

The new system would not have enough data to solve all translation-related problems, according to Søgaard. A lot of data is needed to train the system with an expanded vocabulary.

A machine-based system will always react to, and solve, problems in a different manner than humans, he explains.

»Humans will try to create meaning out of a sentence and, based on that meaning, recreate the sentence in another language. But the machine system will translate more like a hacker finding out a password«.

That said, »translator programmes can create good and useful translations,« he emphasises.

Decryption like in WW2

Google Translate is a phrase-based translation system, which means that it learns from large amounts of text. By registering which Danish expressions often appear together with a similar English expression, the system can infer that these expressions probably have the same meaning. In other words, the system makes guesses to find the right translation.
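The guessing described above can be illustrated with a very small sketch. It is not Google's actual method: the three sentence pairs are invented for illustration, the function names `train` and `best_guess` are hypothetical, and the Dice score used to rank candidates is simply one common way of measuring how often two words appear together.

```python
from collections import Counter

# Hypothetical toy parallel corpus (Danish, English), invented for illustration.
corpus = [
    ("god morgen", "good morning"),
    ("god aften", "good evening"),
    ("tak", "thanks"),
]

def train(pairs):
    """Count, per sentence pair, how often each word and each word pair occurs."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update((s, t) for s in s_words for t in t_words)
    return src_count, tgt_count, joint

def best_guess(word, model):
    """Return the target word that most consistently co-occurs with `word`."""
    src_count, tgt_count, joint = model
    # Dice score: 2 * joint count / (source count + target count).
    scores = {
        t: 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None

model = train(corpus)
print(best_guess("morgen", model))
print(best_guess("god", model))
```

With so little data the guesses are only as good as the examples the system has seen, which is the point Søgaard makes about university e-mails.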

»The system translates by looking at possible meanings of a sentence and finds the one which makes the most sense. There is no analysis or semantics in this system. It is based on the decoding of ciphers from WW2. It is simple and works well if two languages are much alike«, Søgaard explains to the University Post.

»The new thing with Phrasal is that it can work with the so-called translation units with gaps«, Anders Søgaard adds.
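To make the idea of a translation unit with a gap concrete, here is a minimal sketch. It does not show how Phrasal works internally; the functions `contiguous_match` and `match_with_gap`, the `max_gap` limit and the English example "turn ... off" are all assumptions chosen for illustration. It only shows why a lookup of fixed, adjacent words misses an expression once other words are placed in between.

```python
def contiguous_match(tokens, phrase):
    """Return the start index of `phrase` as adjacent tokens, or None."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return None

def match_with_gap(tokens, left, right, max_gap=4):
    """Return (i, j) where tokens[i] == left and tokens[j] == right,
    with at most `max_gap` tokens in between, or None if no such pair exists."""
    for i, tok in enumerate(tokens):
        if tok != left:
            continue
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] == right:
                return i, j
    return None

sentence = "please turn the old radio off".split()

# A fixed-phrase lookup does not find "turn off" here...
print(contiguous_match(sentence, ["turn", "off"]))
# ...but a gapped unit "turn ... off" does.
print(match_with_gap(sentence, "turn", "off"))
```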

uni-avis@adm.ku.dk
