Is there something wrong with putting Unicode into Javascript source code?
As far as I can tell, it’s perfectly okay to put UTF-8 encoded Javascript directly into .js files. And yet, programmers seem to be wary of doing so, and prefer to use numerically encoded escapes. So for instance, you’ll see:
nWithTilde = String.fromCharCode(241)
instead of:
nWithTilde = "ñ"
Why?
Is there some good reason to avoid putting actual characters in, or is it just legacy ASCII-ism? It seems to me that if a language has good Unicode support, as Javascript does, there’s no reason not to take advantage of it!
And the same can be said of other scripting languages. Python and Ruby both allow UTF-8 in source files if the file is marked with this special comment:
# coding: utf-8
If you’re dealing with non-ASCII content (and your default assumption should be that you are), then directly adding the characters seems a lot more elegant to me.
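A minimal sketch of that equivalence, assuming Python 3 (which this post predates, and which reads source files as UTF-8 by default, so the coding comment is kept only for illustration). The variable names are made up for the example.

```python
# coding: utf-8
# Three ways to spell the same one-character string.
from_code_point = chr(241)   # numeric code point, like String.fromCharCode(241)
from_escape = "\u00f1"       # escape sequence inside a string literal
from_literal = "ñ"           # the actual character, typed directly

# Once the source file has been decoded, all three are the same string.
assert from_code_point == from_escape == from_literal
print(from_literal, len(from_literal), hex(ord(from_literal)))  # ñ 1 0xf1
```

The literal form only differs in what is stored on disk: it depends on the file being saved and read in the encoding the interpreter expects.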

My guess would be that the complex matter of encodings still scares the hell out of many programmers, to the point that they don’t know what encoding they save their source files in.
I agree with KE. It is astonishing how grown-up developers freeze when it comes to entering non-ASCII characters, even simple, basic Western European ones. As for knowing what encoding they are saving a file in, or how to find out if they don’t know, I’m pessimistic. Also, some seem to fear characters breaking when files are transmitted.
As I see it, there are two reasons you see escaped UTF-8 characters in JS:
- Some json_encode() methods (e.g. in PHP) escape UTF-8 characters.
- You do it manually to force the characters “to stay alive” in case you (or a bad script editor) accidentally save the file as ASCII.
I always use pure UTF-8 in my JS files because I set the right headers and all the rest.
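To illustrate the first point with Python rather than PHP, here is a rough sketch using Python 3's standard json module. That module is an assumption of this example and is not mentioned in the comment above. It escapes non-ASCII characters by default and emits them directly when ensure_ascii is turned off.

```python
import json

data = {"letter": "ñ"}

escaped = json.dumps(data)                  # default: non-ASCII becomes \uXXXX
raw = json.dumps(data, ensure_ascii=False)  # keeps the character itself

print(escaped)  # {"letter": "\u00f1"}
print(raw)      # {"letter": "ñ"}

# Both forms decode to the same data; the escape only changes the bytes on the wire.
assert json.loads(escaped) == json.loads(raw) == data
```

The escaped form survives being mislabeled as ASCII or Latin-1, which is presumably why encoders default to it. The raw form is shorter and readable but relies on the consumer decoding the text as UTF-8.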
Thanks, folks, for the additional points. I typed this up before reading Andi’s comment, but it turns out to be an expansion of his first point. (Great minds, etc etc :P)
Another reason which I ran across just now is that some software packages insist on outputting escaped characters.
Case in point: I’m trying to use Python to generate a JSON (ie, Javascript-syntax) array which contains a transliteration scheme called Velthuis. The scheme holds rules like this:
AA Ā
The library I used (because it was sitting around, maybe there’s something better now) is called simplejson. Here’s what happened:
>>> import simplejson
>>> v = {} # velthuis transliteration dictionary
>>> v[u'AA'] = u'Ā' # putting a single pair to test
>>> print v.items()
[(u'AA', u'\u0100')]
Python wants to send an escape to the terminal, and…
>>> print simplejson.dumps( v.items())
[["AA", "\u0100"]]
So does simplejson’s “dump string” function, dumps().
What happens if we use dump() instead, which will write to a file, which we can specify as utf-8 encoded?
>>> import codecs
>>> utf8_filepointer = codecs.open('doesitescape.log', mode='w', encoding='utf-8')
>>> simplejson.dump(v.items(), utf8_filepointer)
>>> utf8_filepointer.close() # flush the output to disk before inspecting the file
Okay, that worked…
So now we want to know what simplejson did in that file.
$ cat doesitescape.log
[["AA", "\u0100"]]
simplejson stores those characters as escapes. As developers we are given no choice in the matter. This is the attitude which makes people think that non-escaped characters are somehow dangerous.
the reason i tend to avoid non-ascii characters in all source code is: other people may edit it. other people use platforms with silly default encodings. sadly, text files don’t say what encoding they are in. so, people end up corrupting the file without noticing.
if you work on your own, this doesn’t matter. if you work with others, it does matter. if you work on an open source project with people coming and going, no policy enforcement and no way to control the working environment, it’s a life saver.
Hey Daniel,
I guess I can see that. Aside from a couple of projects I’ve just started up, I’ve not really worked to any great extent in any open source projects myself. (Well, I have, but not necessarily with boatloads of code…)
Still, isn’t the source file encoding a convention which is similar to, say, rules about code indentation and other stylistic questions? People somehow manage to enforce conventions on those (well, some do anyway). Even a wild and woolly project needs some rules…
yes, character set is a convention like code style. But it’s very easy to overlook a broken character in some other part of the file - you just never notice that you used the wrong encoding. And the next guy editing doesn’t notice either. Also, coding conventions are easy to fix (use a pretty printer), mangled utf-8 can often only be repaired by hand.
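A small sketch of how that kind of damage happens, assuming Python 3 and an editor that wrongly treats a UTF-8 file as Latin-1 (one common case among many). The variable names are invented for the example.

```python
original = "ñ"
utf8_bytes = original.encode("utf-8")     # b'\xc3\xb1'

# An editor that assumes Latin-1 shows two characters instead of one...
misread = utf8_bytes.decode("latin-1")    # 'Ã±'

# ...and saving that view as UTF-8 bakes the damage into the file.
resaved = misread.encode("utf-8")         # b'\xc3\x83\xc2\xb1'

print(misread, resaved)

# Undoing it means replaying the exact mistake in reverse.
repaired = resaved.decode("utf-8").encode("latin-1").decode("utf-8")
assert repaired == original
```

The repair works here only because the mistake is known and happened exactly once to the whole file. When only some characters were mangled, or several wrong encodings were involved, the fix has to be worked out case by case.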
enforcing guidelines in an oss project is like herding cats. in a nice & cosy project with three maintainers and a couple of people contributing patches, it’s no problem. i’m thinking of mediawiki with a dozen or so active developers, and, worse, a hundred off-and-on committers. and i keep forgetting to set the flag for automatic linebreak conversion on new files, too.
it boils down to “sticking to the basics keeps you out of trouble”. this *should* not be needed, and if everything works perfectly, it isn’t. but it tends to bite you when you rely on that. you know, the next guy uploads the js file via ftp, has the connection set to “ascii”, and *boom*…
These are compelling reasons, Daniel. I wonder what the long term solution is. With regard to the issue of pasting broken characters into files, perhaps editors (referring to the applications) could be taught to recognize characters that are broken? In this case UTF-8 has the unique (as far as I know) characteristic of being somewhat self-regulating, ie, putting in a byte which doesn’t “fit” its surrounding bytes “breaks” the document, because of the way continuation bytes work in UTF-8.
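Here is a rough sketch of the kind of check an editor could run, assuming Python 3. The helper name is_valid_utf8 is hypothetical and simply wraps the built-in strict decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

good = "señor".encode("utf-8")   # ñ is stored as the two bytes C3 B1
pasted_latin1 = b"se\xf1or"      # the same word saved as Latin-1
truncated = good[:3]             # lead byte C3 with its continuation byte cut off

print(is_valid_utf8(good))           # True
print(is_valid_utf8(pasted_latin1))  # False
print(is_valid_utf8(truncated))      # False
```

A check like this catches bytes that were never UTF-8 in the first place. It cannot catch text that was misread and then re-saved as UTF-8, because the result of that is perfectly valid UTF-8 that merely spells the wrong characters.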
In my personal opinion, what needs to happen (in the long, possibly very long, term) is that the concept of what’s “basic” or “plain” needs to shift toward UTF-8, for all files or streams which are going to be exchanged at all. An FTP client shouldn’t be designed to change such things.