Don’t worry Mr. Translator, the robot overlords won’t take you away just yet
Fellow language-o-phile Katy Pearce points me to “The Translator’s Blues - Will I get replaced by a computer program?” over at Slate.
It’s interesting to see a translator’s take on whether machine translation is an economic threat to his livelihood, but I pretty much stand by what I said in a previous post, “Is Machine Translation Possible? Well, yeah, but…”
This bit in particular, though, merits comment:
The one that stood out from the pack was Language Weaver. Not only did it recognize the subject as a human being—”The period of his youth was not easy”—but it translated the rest of the paragraph with only one minor error. Intrigued, I began to put the software through its paces. A headline from El Pais [sic]: “A wave of attacks left more than 100 dead in several cities in Iraq.” So far, so good. A speech from the United Nations: “The problem is to maintain the level of international attention and ensure the implementation of the commitments.” Perfect. The first line of Don Quixote: “In a place of the Channel, whose name do not want to remember, has not much Time living a Hidalgo the spearheaded in shipyard, adarga Antigua, Rocín weak and galgo corridor.” Clearly, in the world of machine translation, everything has its limits.
The problem with translation software is context…
Actually, I don’t think context is what’s behind the varying quality of these translations.
The problems with machine translation are:
- the expense of training the system: finding bilingual corpora, tuning the software, etc.
- the fact that appropriate training corpora may not exist in large enough volume to be useful
The reason the U.N. speech was translated so convincingly is that Language Weaver (like pretty much every other MT system out there) was trained on U.N. speeches. If it had been trained on a bazillion carefully translated Cervantes novels, the result would have been equally convincing.
Okay, well, maybe that’s not quite true, since U.N. text is far more boring, er, formulaic than Cervantes. Generally speaking, the more repetitive and formulaic the training data is, the more accurate the output of MT will be.
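To put a rough number on “formulaic,” here is a minimal sketch in Python. It measures what share of a text’s word pairs (bigrams) are repeats of a pair that already appeared. The function name and both sample strings are made up for illustration; this is not how Language Weaver or any real MT system scores its training data.

```python
def bigram_repetition(text):
    """Share of bigrams that repeat an earlier bigram (0.0 = no repeats)."""
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1 - len(set(bigrams)) / len(bigrams)

# Hypothetical samples: a U.N.-flavored sentence and a literary-flavored one.
FORMULAIC = ("the committee calls for the implementation of the commitments "
             "and the committee calls for the implementation of the resolution")
LITERARY = "in a village of la mancha whose name i do not care to recall"

if __name__ == "__main__":
    print("formulaic:", round(bigram_repetition(FORMULAIC), 2))
    print("literary: ", round(bigram_repetition(LITERARY), 2))
```

The formulaic sample scores higher because it keeps reusing the same word pairs, and that kind of reuse is what gives a statistical system lots of examples to learn from.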
But even so, if the training data had had more lanzas and adargas and galgos in the first place, then it would have sent more lances and bucklers and greyhounds out the other end. If the words are in the training data, then the system will do a good job of figuring out how they should be translated.
After all, there are plenty of instances of the phrases “implementation of the commitments” and tons for “international attention” on the U.N.’s site. Not so many “adargas”.
So it’s pretty clear that there’s plenty of training data available for that kind of text.
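Here is a toy illustration of that point, sketched in Python. It assumes we already have word-aligned Spanish/English pairs (the `ALIGNED_PAIRS` list is invented for this example), picks the most frequent translation for each source word, and passes anything it has never seen straight through. Real statistical MT works on phrases and probabilities, so treat this as a cartoon of the idea, not a description of any actual system.

```python
from collections import Counter, defaultdict

# Hypothetical toy "training data": word-aligned Spanish/English pairs.
ALIGNED_PAIRS = [
    ("atención", "attention"),
    ("atención", "attention"),
    ("atención", "care"),
    ("internacional", "international"),
    ("compromisos", "commitments"),
    ("nivel", "level"),
]

def build_table(pairs):
    """Map each source word to its most frequently seen translation."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

def translate(words, table):
    """Translate word by word; unseen words are passed through unchanged."""
    return [table.get(w, w) for w in words]

TABLE = build_table(ALIGNED_PAIRS)

if __name__ == "__main__":
    print(translate(["nivel", "internacional", "adarga"], TABLE))
```

Since “adarga” never shows up in the toy data, it comes out the other end untouched, much like the “adarga” and “galgo” left sitting in the Don Quixote output quoted above.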
As far as MT systems are concerned, the rest is mostly (very sophisticated) math.
But such training data simply doesn’t exist between most pairs of languages.
As far as MT systems are concerned, those languages don’t exist.
Tags: machine translation, translation
Who is Patrick?
Your friendly neighborhood language geek, that’s who!