In which we point at some blog posts
Which are interesting: The Universe of Discourse posts on language.
No comments yet.
Technorati tags: Language and the Web
An ickle rant ensues.
The public doesn’t understand how machine translation works. And generally speaking, the public doesn’t understand that machine translation couldn’t exist without human translators in the loop.
In other words, it’s not really “machines” that are “doing” the translating, it’s people. The machines are simply programmed to imitate the translations the people do.
I think this is symptomatic of a widespread disease in the computer world: an obsession with cutting people out of the loop.
It’s the same tunnel vision that motivated the much (and rightly) criticized footer that graced Google News at its launch:
This page was generated entirely by computer algorithms without human editors.
No humans were harmed or even used in the creation of this page.
Admittedly, this quote has long since been removed, and it was probably only meant as a joke in the first place anyway. But this problematic attitude still underlies a lot of reactions to Machine Translation. In general, people don’t realize that it wouldn’t be possible without human translators.
Yes, the title is pretty much a joke. I do visit reality once in a while.
2 comments.
Technorati tags: Language and the Web, machine translation, translation
Dear interwebs series of tubes people:
Random thought:
If you have two translations, and you perform some sort of compression on both of them, could interesting relationships between the content of the two translations be uncovered? For instance, it seems like you might be able to get rid of non-content words, which might make it conceivable to align the texts at a phrase level.
I’ve dug a bit, but only found a paper by Conley and Klein, “Using Alignment for Text Compression.” But a quick glance at that (haven’t read it yet) suggests that they’re interested in improving compression for compression’s sake, which isn’t what I have in mind.
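One rough way to poke at the idea: compressors exploit repetition, so comparing the compressed size of two texts glued together against their sizes compressed separately hints at how much material they share. That measure is usually called normalized compression distance. Reading "some sort of compression" this way is an assumption, it is not taken from the Conley and Klein paper, and the method names below are made up for illustration. It only sees shared surface strings, so at best it is a starting point for texts in the same language.

```ruby
require "zlib"

# Size in bytes of the text after DEFLATE compression.
def compressed_size(text)
  Zlib::Deflate.deflate(text).bytesize
end

# Normalized compression distance: near 0 when the two texts share
# most of their content, closer to 1 when they share very little.
def ncd(a, b)
  ca = compressed_size(a)
  cb = compressed_size(b)
  cab = compressed_size(a + b)
  (cab - [ca, cb].min).to_f / [ca, cb].max
end

first  = "the quick brown fox jumps over the lazy dog near the river bank today"
second = "colossal squid eyes contain photoreceptors and an enormous lens"

puts "same text:       #{ncd(first, first).round(2)}"
puts "different texts: #{ncd(first, second).round(2)}"
```

On very short strings the compressor's fixed overhead dominates, so the numbers only start to mean something on paragraph-sized input.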
Thanks for your thoughts and observations, interwebs.
No comments yet.
Technorati tags: alignment, Code, Linguistic Computing, translation
The Google Blog has a chart showing that there is a very clear trend toward Unicode adoption.
Apparently their numbers refer to UTF-8 alone (as opposed to UTF-16/UCS-2 or (haha) UTF-32/UCS-4), which again is good news. (Though one wonders if there is any uptake of UTF-16 on the web… I hope not.)
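Part of why UTF-8 suits the web is size: markup is mostly ASCII, which UTF-8 stores in one byte per character. A quick sketch that counts the bytes for the same text in each encoding (the method name `encoded_sizes` and the sample strings are just for illustration):

```ruby
# -*- coding: utf-8 -*-

# Number of bytes a string occupies in each Unicode encoding form.
# The little-endian variants are used so that no byte order mark is added.
def encoded_sizes(text)
  %w[UTF-8 UTF-16LE UTF-32LE].map { |name| [name, text.encode(name).bytesize] }
end

["<html>", "あいう"].each do |sample|
  encoded_sizes(sample).each do |name, size|
    puts "#{sample} in #{name}: #{size} bytes"
  end
end
```

Kana come out smaller in UTF-16 than in UTF-8, but a real page wraps them in plenty of ASCII markup.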
The data is “Google internal”… peer-reviewed, it ain’t.
Thanks to Won for the pointer!
6 comments.
Technorati tags: Language and the Web, unicode
I’m a big fan of Pharyngula, but… he was kinna wrong about a nitpicky little detail. And this particular nitpicky little detail was about language, so, the truth must out!
In his post As big as dinner plates?, Dr. Myers compares two articles about the recent dissection of a Colossal squid. (Can we please pause to acknowledge that that thing is HUGE? Kthx.)
Read the USA Today article on the colossal squid eye, which boils down to basically, “Oooh, they’re big!”. Then compare it to the blog entry on the colossal squid eye, written by a scientist. The latter is much more informative, and contains more specific details, and isn’t afraid to challenge the reader with words longer than a single syllable.
Emphasis added, to the bit that’s linguistic. Now obviously, that’s an exaggeration; the AP article (not USA Today, in fact) doesn’t really consist of one-syllable words (though one can say a lot with one-syllable words…)
But the idea is clear enough: USA Today uses shorter words than scientists, because journalists dumb down science.
Right?
Heck, I dunno. So, I answered the question the way I usually do: I wrote a program.
Surprise!
Average Word Lengths
4.45 Blog
4.64 AP

Longest words
Blog: photoreceptors tremendously considerably architeuthis neighbouring cephalopods cephalopods disappeared
AP: mesonychoteuthis communications international invertebrates redistributed formaldehyde centimeters spectacular
Average word length is about the same. Which says nothing whatsoever about the quality of information in the articles; it does however say that they use words that are about the same size.
One can argue about the character of the long words in each text (you could use frequency counts of them, too), but still, the AP article uses the word “Mesonychoteuthis.” That’s a long way from a one-syllable word.
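The program itself isn't shown here, but a minimal Ruby sketch of this kind of measurement might look like the following. The method names and the tokenization rule (a word is any run of letters) are assumptions for illustration, so its numbers won't necessarily match the ones above.

```ruby
# Split text into lowercase words. A "word" here is any run of letters,
# so hyphens, digits, and apostrophes act as separators (an assumption).
def words(text)
  text.downcase.scan(/\p{L}+/)
end

# Mean number of characters per word, or 0.0 for empty input.
def average_word_length(text)
  ws = words(text)
  return 0.0 if ws.empty?
  ws.sum(&:length).to_f / ws.length
end

# The n longest distinct words, longest first.
def longest_words(text, n = 8)
  words(text).uniq.max_by(n, &:length)
end

sample = "The colossal squid has tremendously large photoreceptors"
puts format("%.2f", average_word_length(sample))
puts longest_words(sample, 2).join(" ")
```

Point it at the text of each article (say, `File.read` on a saved copy) to compare two sources.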
No comments yet.
Technorati tags: Code, Fun, Linguistic Computing
I’m a Unicode geek, but I feel like I don’t really know enough about what goes on in operating systems after the encoding and decoding issues are worked out. That is, when and where does all of that glyph selection and font shaping and other black magic actually happen? How much of it is dependent on the operating system? Applications? Fonts?
Head a splodes.
But whatever, live and learn. I did run across a very nicely done introduction to what appears to be the cutting edge in computer typography, OpenType:
Adam Twardoch’s PDF slides on OpenType: Typographic perfection with OpenType?
It gives a good idea of the sort of subtleties that OpenType can handle. And it’s very pretty. Font nerds, rejoice…
No comments yet.
Technorati tags: Language and the Web, opentype
Random translation observation:
I was translating some Brazilian Portuguese into English, and the source article was about an earthquake near São Paulo. (See? Brazil does have natural disasters!)
A particular phrase got me thinking: a zona leste de São Paulo meaning something like “the Eastern Zone of São Paulo.” That’s a pretty tricky thing to translate: you don’t really talk about the “Eastern Zone” of a city in English.
In the States, you might talk about the “Upper East Side” of New York, or the “South Side” of Chicago, or “Northwest” (sometimes just “NW”) in DC. My Brazilo-Londonian homey Carlos tells me that zones in London have numbers, so you talk about “Zone 5,” etc.
And then there are those arrondissements in Paris, which are numbered like an escargot.
So does “The East Side of São Paulo” work as a translation for “a zona leste de São Paulo”? Sounds okay to me, actually.
I’d be curious to know about the ways that other cities are subdivided.
4 comments.
Technorati tags: translation
$ cat unicode.rb
# -*- coding: utf-8 -*-

s = "ABCあいう"
puts "s: #{s}"
puts "s[0]: #{s[0]}"
puts "s[3,1]: #{s[3,1]}"

puts "s.length: #{s.length}"
puts "s.reverse: #{s.reverse}"
puts "s.encoding: #{s.encoding}"

$ ruby1.8 unicode.rb
s: ABCあいう
s[0]: 65
s[3,1]:
s.length: 12
s.reverse: ��め�CBA
unicode.rb:10: undefined method `encoding' for "ABC\343\201\202\343\201\204\343\201\206":String (NoMethodError)

$ ruby1.9 unicode.rb
s: ABCあいう
s[0]: A
s[3,1]: あ
s.length: 6
s.reverse: ういあCBA
s.encoding: UTF-8
$ # yay!
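The difference comes down to bytes versus characters. Ruby 1.8 treats a string as a sequence of bytes, and each of the three kana takes three bytes in UTF-8, hence the length of 12 and the mangled reverse. A small sketch for Ruby 1.9 or later that shows both views of the same string (`byte_reverse` is a made-up helper that imitates the byte-oriented behavior):

```ruby
# -*- coding: utf-8 -*-

# Reverse a string byte by byte, ignoring character boundaries
# (a stand-in for what a byte-oriented String#reverse does).
def byte_reverse(str)
  str.bytes.to_a.reverse.pack("C*").force_encoding(str.encoding)
end

s = "ABCあいう"
puts "characters: #{s.length}"
puts "bytes:      #{s.bytesize}"

# Each character with the number of bytes UTF-8 uses to encode it.
s.each_char { |c| puts "#{c}: #{c.bytesize} byte(s)" }

puts "char-reversed valid? #{s.reverse.valid_encoding?}"
puts "byte-reversed valid? #{byte_reverse(s).valid_encoding?}"
```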
Big ups to Matz.
No comments yet.
Technorati tags: Code, ruby, unicode
And now, a wee rant.
Check out this page:
Now, tell me where exactly in the world this carsharing business exists. This is the front page of the company!
I was linked to this site by a friend in Helsinki (in the context of another conversation, about carsharing, as it happens). But the only place Helsinki is mentioned on this company’s website is in a caption to a photo. We are also met with this quote:
City Car Club offers a cost effective and easy to use way of driving. We have different types of cars and vans available at over 70 locations in the capital area.
The capital area?
So this is a Finnish carsharing company, based in Helsinki (I think), with a website in English, and nary a mention of Helsinki.
Or Finland.
Heck, or even Finnish… in order to get to the Finnish version, I just randomly guessed and went to CityCarClub.fi.
And guess what? No mention of Helsinki there either! (Or Helsingissä or any of the other 148614601 forms a Finnish noun can take…)
I have noticed that the same thing is often true of local newspapers’ websites. They’ll be called “The Gazette” or “The Tribune” or “The Local Paper.”
But they don’t tell you where they are!
Does this strike anyone else around here as weird?
Sheesh, talk about localization issues…
1 comment.
Technorati tags: Fun, l18n, Language and the Web, localization
Information Week blogger Serdar Yegulalp has some thoughts on the intersection of machine translation and open source:
Talk To Me, Openly - Open Source Blog - InformationWeek
He’s got an anecdote about how he tackled studying Japanese, and it serves as a nice intro to the idea behind bitext and statistical machine translation:
..Since I didn’t have money for classes, I homebrewed my own self-teaching method. I went out and bought a grammar guide, and then two copies of a given book — one in Japanese, the other an English translation — and sat with them side-by-side, comparing the two on a sentence-by-sentence and phrase-by-phrase level. It worked, up to a point, and while I’m no native speaker I can certainly figure out a fair amount of what’s put in front of me as long as I have a dictionary.
I didn’t know it at the time, but this parallel-texts technique is actually one of the best ways to also teach a computer to perform translations between languages.
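To make the parallel-texts idea concrete, here is a toy sketch: given sentence pairs that are already lined up, count which words keep turning up together and guess that they translate each other. The tiny English/Portuguese corpus is invented for illustration, the scoring is a plain Dice coefficient, and the method names are made up. Real statistical MT systems use far more elaborate alignment models and far more data, so treat this as a cartoon of the principle.

```ruby
require "set"

# Toy sentence-aligned English/Portuguese pairs, invented for illustration.
BITEXT = [
  ["the house", "a casa"],
  ["the blue house", "a casa azul"],
  ["the blue book", "o livro azul"],
  ["a book", "um livro"],
  ["the table", "a mesa"],
].freeze

# For each word, the set of sentence-pair indexes it occurs in.
def occurrences(sentences)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  sentences.each_with_index do |sentence, i|
    sentence.split.each { |word| index[word] << i }
  end
  index
end

# Target word with the highest Dice coefficient against source_word,
# scored by how often the two appear in the same sentence pair.
# Returns nil if source_word never occurs in the corpus.
def best_match(source_word, bitext)
  src = occurrences(bitext.map(&:first))
  tgt = occurrences(bitext.map(&:last))
  return nil unless src.key?(source_word)
  seen = src[source_word]
  best = tgt.max_by { |_word, pairs| 2.0 * (seen & pairs).size / (seen.size + pairs.size) }
  best.first
end

%w[house blue book].each { |w| puts "#{w} -> #{best_match(w, BITEXT)}" }
```

Every guess comes from pairs that a human translator produced, which is also why the licensing of that data matters.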
He’s also got some thoughts on the licensing issues involved with the data used to build MT systems, a topic that I don’t think has gotten enough attention.
(Please consider this an open thread for your thoughts on how MT and FOSS can and should interact.)
No comments yet.
Technorati tags: machine translation, open source, translation