Weirdness with Greek on Twitter

Written by Patrick Hall, October 5th, 2008

Sometimes I search for translations of the word “translation” (or “translator”, etc.) on various search engines, just to see what comes up. Even when I don’t know the language in question, I can sometimes glean something interesting, if only the fact that translation between particular languages is happening.

So anyway, today I stuck the Greek word for “translation”, “Μετάφραση”, into Twitter. I got some weird results:

Μετάφραση - Twitter Search

Corrupted Greek in index view

Many of the results on the results page come back as strings of question marks (“????”), but when you click through the “View Tweet” links on those particular tweets, like this one, proper Greek appears.

Uncorrupted Greek in individual Tweet view

Any theories as to what sort of encoding issue could be going on here? And does anyone know what the most commonly used encoding for Greek is? Perhaps it’s already UTF-8?
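
For what it's worth, here is a small Python sketch of one way question marks like these can appear. It is only a guess at the mechanism: nothing here is known about how Twitter actually handles text, and the ASCII step is just a stand-in for any conversion to a character set that has no Greek letters.

```python
import unicodedata

# The search term, normalized so the accented alpha is a single code point.
word = unicodedata.normalize("NFC", "Μετάφραση")

# UTF-8 can represent Greek; each of these letters takes two bytes.
utf8_bytes = word.encode("utf-8")

# Hypothetical lossy step: forcing the text into a charset without Greek
# letters (ASCII here) with errors="replace" turns every letter into "?".
lossy = word.encode("ascii", errors="replace").decode("ascii")

print(len(word), len(utf8_bytes))
print(lossy)
```

If one view of the data passes through a step like that and another does not, the same tweet would show up as question marks in one place and as proper Greek in the other.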

1 Comment for 'Weirdness with Greek on Twitter'

  1. Comment received October 5th, 2008 from ke

    Hm, if we assume the following architecture for Twitter…

    User ——> Main module
         \–> Search module
             |–> Indexing module
             \–> Snippet storage module

    …then it could be that encodings are resolved correctly in the main module and in the indexing module but for some reason not in the snippet storage module. Weird, but I cannot come up with a more plausible explanation for the behavior you’re describing.

    This could apply particularly to automatic fixing of pathological cases, e.g. where clients submit posts in an encoding other than the one they declare (likely to occur given the variety of client apps that allow you to post something to Twitter).

    The encoding hidden beneath the question marks is probably ISO-8859-7 (Greek text in that encoding produces byte sequences that are invalid in UTF-8).
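
The two ideas in the comment above (text arriving in ISO-8859-7 rather than the declared UTF-8, and a module that repairs such cases automatically) can be sketched in Python. This is an illustration under assumptions, not a description of Twitter's code: `is_valid_utf8` and `decode_with_fallback` are hypothetical helper names, and ISO-8859-7 as the only fallback is simply the commenter's guess.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_with_fallback(raw: bytes, declared: str = "utf-8",
                         fallbacks: tuple = ("iso-8859-7",)) -> str:
    """Try the declared encoding first, then each fallback in order.

    Hypothetical repair step: if nothing decodes cleanly, undecodable
    bytes are replaced with U+FFFD instead of raising an error.
    """
    for encoding in (declared, *fallbacks):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(declared, errors="replace")


word = unicodedata.normalize("NFC", "Μετάφραση")
legacy = word.encode("iso-8859-7")  # one byte per Greek letter

print(is_valid_utf8(legacy))                 # the legacy bytes are not UTF-8
print(decode_with_fallback(legacy) == word)  # the fallback recovers the word
```

Trying strict UTF-8 first matters here: ISO-8859-7 accepts almost any byte sequence, so attempting it first would silently turn valid UTF-8 into garbage instead of failing.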
