Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Bird Song

Written by Patrick Hall, 2 years ago.

This is a bit off-topic, I suppose, nothing to do with translation, but here’s an article about some rather amazing research:

“Uniquely human” component of language found in gregarious birds

To assess the birds’ syntactical skills, the research team exploited the diverse sounds in starling songs. They recorded eight different ‘rattles’ and eight ‘warbles’ from a single male starling and combined them to construct a total of 16 artificial songs. These songs followed two different grammars, or patterning rules.

Eight songs followed the “finite-state” rule, the simplest sort, thought to account for all non-human communication. A finite-state grammar allows for sounds to be appended only at the beginning or end of a string. These songs were built up from a rattle-warble base by adding rattle-warble pairs at the end. The simplest song (ab) was one rattle followed by one warble. The next simplest a rattle, then a warble, followed by a different Rattle and Warble (abAB).

The other eight songs followed the “context-free” rule, which allows for sounds to be inserted in the middle of an acoustic string, the simplest form of recursive center-embedding. So a context-free sequence also began with rattle-warble base (ab) but built up by inserting new sounds in the middle, such as rattle-Rattle-Warble-warble (aABb).
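To make the two patterning rules concrete, here is a minimal sketch in Python. It assumes a simplification that is not in the article: every rattle is written as "a" and every warble as "b", with upper and lower case treated alike, so the finite-state songs look like (ab)^n and the context-free songs look like a^n b^n. The function names are made up for illustration.

```python
import re

def is_finite_state(song: str) -> bool:
    """True for (ab)^n with n >= 1: rattle-warble pairs appended at the end."""
    # A plain regular expression is enough here; no counting is required.
    return re.fullmatch(r"(?:ab)+", song.lower()) is not None

def is_context_free(song: str) -> bool:
    """True for a^n b^n with n >= 1: new sounds inserted in the middle."""
    s = song.lower()
    n = len(s) // 2
    # The number of rattles has to equal the number of warbles, which is the
    # bookkeeping that a finite-state machine cannot do for unbounded n.
    return n >= 1 and s == "a" * n + "b" * n

print(is_finite_state("abAB"), is_context_free("abAB"))
print(is_finite_state("aABb"), is_context_free("aABb"))
```

Note that the base song (ab) satisfies both rules, so the two grammars only come apart once a second pair is added.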

It’s funny, I’ve recently been thinking a lot about finite state machines while working on a little article on transliteration for text input in Amharic (currently extremely incomplete).

Actually, my current system doesn’t require a context-free grammar.

If it turns out I need one, I’ll stick my head out the window and ask a starling.

Loonicode+0003

Written by Patrick Hall, 2 years ago.
Loonicode 2

mezzoblue on HTML and Foreign Languages

Written by Patrick Hall, 2 years ago.

I suppose it’s rather odd to post a link to an article that’s 3 years old, but marking up content so that it contains information about the language that content is written in is one of those topics that take a long time to really spread around the web. And a good blog post is hard to find (and, *cough*, to write), so check this one out:

mezzoblue § HTML and Foreign Languages

Even though it was written in 2003, I think Dave Shea hits a lot of key points here. The only place where I would be a bit more strict than he is (as anyone who reads this blog on a regular basis will know) is that I would never want to send out anything that isn’t in UTF-8 and clearly marked as such.

Maybe I’m obsessive. ☺

Noam Chomsky meets Ali G

Written by Patrick Hall, 2 years ago.

Ali G interviews Noam Chomsky

Chortle.

Noam on the Range

Python String Reform

Written by Patrick Hall, 2 years ago.

O frabjous day!

From Guido van Rossum’s slides about the upcoming major revision of Python, the very best page of all (the Unicode part, of course!):

String Types Reform

  • bytes and str instead of str and unicode
    • bytes is a mutable array of int (in range(256))
    • encode/decode API? bytes(s, "Latin-1")?
    • bytes have some str-ish methods (e.g. b1.find(b2))
    • but not others (e.g. not b.upper())
  • All data is either binary or text
    • all text data is represented as Unicode
    • conversions happen at I/O time
  • Different APIs for binary and text streams
    • how to establish file encoding? (Platform decides)

I think when these changes come around Python will take the lead in Unicode support among scripting languages.

(…Unless Ruby gets its Unicode act together. I’m not terribly optimistic about that, myself. Like i18n in Rails, it just doesn’t seem to be too high of a priority in the zeitgeist. Maybe I’ll be proven wrong; I hope so.)
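For a feel of what the bytes/str split on that slide means in practice, here is a small sketch written against Python 3 as it was eventually released, which differs from the slide in places: bytes ended up immutable, with bytearray as the mutable variant. The sample word is only an illustration.

```python
text = "façade"                      # str: a sequence of Unicode code points

# Two spellings of the same conversion from text to binary data.
latin1 = bytes(text, "latin-1")      # one byte per character for this word
utf8 = text.encode("utf-8")          # the c-cedilla takes two bytes in UTF-8

# bytes behave like a sequence of ints in range(256)...
first_three = list(latin1[:3])

# ...and have some str-ish methods, such as find().
position = latin1.find(b"ade")

# Released Python 3 made bytes immutable; bytearray is the mutable version.
buffer = bytearray(latin1)
buffer[0] = ord("F")

# Binary and text no longer mix silently.
try:
    text + latin1
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding turns binary data back into text.
round_trip = utf8.decode("utf-8")

print(len(text), len(latin1), len(utf8), first_three, position, mixed_ok)
```

The point of the design is that the encoding has to be named at the moment data crosses between the two types, instead of being guessed later.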

Loonicode+0002

Written by Patrick Hall, 2 years ago.
Loonicode 2

Of Dictionaries and Licenses

Written by Patrick Hall, 2 years, 1 month ago.

What does it mean to license a dictionary?

After all, you can’t really copyright the fact that the French word for “cat” is “chat.” It’s not as though it’s a trademark or something. It’s just a word. But one pair of words does not a dictionary make, nor a database.

It would be easy to determine that someone had lifted a 10-page entry from the Oxford English Dictionary, but it’s not just dictionaries of that degree of complexity that are copyrighted. Even cheapo little pocket bilingual dictionaries with nothing more than a one-to-one listing of words, precisely of the cat-chat variety, are copyrighted.

It gets still hairier when you talk about electronic dictionaries (or lexicons, as digital dictionaries tend to be called). Consider the case of the license of this recently released Welsh/English lexicon:

The Welsh Language Board is the owner and/or manager of the copyright; database rights and all other rights pertaining to this database of terms.

Users are only allowed to download lists of terms to the memory of one computer or to translation memories shared across one closed network for their personal use or the sole use of their employers.

It is not permitted to reproduce, copy or publish these list in any form whatsoever without the Board’s prior permission.

If you agree to these conditions, please click on the ‘Accept’ button below.

I find this hard to understand. After all, even using a term from a lexicon subject to such terms seems to violate its copyright. By that (obviously incorrect) reasoning, I shouldn’t be able to repeat the fact that a gyriannau dyfais is a “device driver.”*

Clearly the rights of the people who labored to produce that lexicon should be protected. Anyone who turned around and put their name on that lexicon and sold it as a dictionary would clearly be a plagiarist, and would deserve whatever punishment they got.

(Even so, I think it’s rather ridiculous that an organization like the Welsh Language Board, which ostensibly exists to promote the language, should put such a tool under a restrictive license, particularly when you consider that among the term lists one finds Shop signs and Food menus… domains where any help at all is sorely needed.)

* Shouldn’t that be “device drivers,” anyway?

Loonicode+0001

Written by Patrick Hall, 2 years, 1 month ago.
Loonicode 1

Python for Linguistics 1: How to Count Letters with Python

Written by Patrick Hall, 2 years, 1 month ago.

Okay, here goes.

Some observations before you check it out:

  • It’s really hard to write a tutorial for newbies. You have to dig out all your own assumptions and figure out which ones matter.
  • This is the first time I’ve written such a thing; I hope it’s comprehensible.
  • I’ve decided to keep these things outside of the blog so that I can improve them over time (and pretend they were perfect in the first place).

Please feel free to post comments here! Tell me where you got confused, whether you found it boring, fun, useful, useless, totally misleading, in need of a bipartisan commission to pass judgement on my sanity, whatever.

Also, I guess it’s not fair to say that it’s appropriate for total newbs: I didn’t explain what variables are, and I didn’t get down to the “click the icon to open the Python prompt” sort of directions. There’s a lot of that out there on the web already, but it depends on your OS, etc etc.

Anyway, enough prolegomenizing:

Python for Linguistics 1: How to Count Letters with Python
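If you just want a taste of the idea before clicking through, here is a minimal sketch of letter counting. It is not the tutorial's code, only one way to do it in current Python; the function name and the sample sentence are made up for illustration.

```python
from collections import Counter

def count_letters(text):
    """Count how often each letter occurs, ignoring case and non-letters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

counts = count_letters("Call me Ishmael.")
# Print the letters from most to least frequent.
for letter, n in counts.most_common():
    print(letter, n)
```

Because str.isalpha() understands Unicode, the same function should work on text in scripts other than Latin.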

By the way, we Blogamundistas tend to hang out in #blogamundo on chat.freenode.net, if you do that IRC thing. If you’re trying to get through this tutorial and get stuck, someone (probably me!) will be happy to try to help you.

Kind of Shocking Discovery about open() in Python

Written by Patrick Hall, 2 years, 1 month ago.

So, here’s why the tutorial is rather late:

I ran across a weird… er… feature of Python’s open() command. If you’re waiting for the tutorial, this will only freak you out, so ignore this post, hehe. If you’re a Python guru, I’d like to hear what you think.

Check this out:

$ python2.4
>>> len(open('moby10b.txt').read())
1256167

That file is Moby Dick, from here. Just a normal text file.

Length: 1,256,167 bytes.

Now, here’s the same command running on Windows, with the same file:

C:\> python
>>> len(open('moby10b.txt').read())
1232923

Length: 1,232,923 bytes.

Yes, friends and neighbors, Python on Windows and Python on Linux disagree on how long a file is.

*head a splodes*

After much digging, I discovered that there is a way around this. I have no idea why, after several years of messing around with Python, I only ran across it once I went looking for it.

The trick is to pass in the “Universal Newline” option to open(), like this:

>>> len(open('moby10b.txt', 'U').read())
1232923

That gives the same results on Linux and Windows. Can anyone out there test this on OSX? I presume that it will have the same behavior as Linux, being based on BSD. But good grief. Does this mean that I need to use open(foo, 'U') every time I open a file if I want reliable file lengths across platforms?

To paraphrase my dad, “Beats the hell out of me, commander!”
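One plausible reading of the numbers, assuming moby10b.txt is stored with \r\n line endings (an assumption; nothing above confirms it): text mode on Windows turns each \r\n into \n while Linux leaves the bytes alone, so the two counts differ by exactly one character per line. The sketch below reproduces the effect with a made-up three-line sample, written for Python 3, where universal newlines are the default for text mode and the 'U' flag was eventually removed.

```python
import os
import tempfile

def read_lengths(path):
    """Return (raw byte count, character count after newline translation)."""
    with open(path, "rb") as f:          # binary mode: no translation anywhere
        raw = len(f.read())
    # newline=None (the default) turns \r\n and \r into \n on every platform.
    with open(path, "r", encoding="latin-1", newline=None) as f:
        translated = len(f.read())
    return raw, translated

def demo():
    # Hypothetical sample with Windows-style line endings.
    sample = b"Call me Ishmael.\r\n" * 3
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(sample)
        return read_lengths(path)
    finally:
        os.remove(path)

raw, translated = demo()
print(raw, translated, raw - translated)   # 54 51 3
print(1256167 - 1232923)                   # 23244
```

If the assumption holds, 23,244 would be the number of line endings in the file. For a length that matches the size on disk on every platform, opening in binary mode ('rb') is the safer choice.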

Okay so, tutorial TOMORROW. We promises.

Or you could go and look at the current, broken version now, and confuse yourself. It will be unbroken tomorrow.

(Thanks to Won for helping me figure out the newline madness.)
