Blogs, Anonymity, and Forensic Linguistics
Translator Margaret Marks pointed to an article on forensic linguistics. (There’s a very brief intro at Wikipedia.) Forensic linguistics is the use of linguistic expertise in the law.
This sort of thing has a fairly long history.
It also overlaps pretty closely with what’s called stylometry, the quantitative analysis of the “style” of spoken or written language. Surprisingly, stylometry goes right back to 1440 and the Donation of Constantine:
The Italian humanist Lorenzo Valla proved in 1440 that the Donation must be a fake by analyzing its language, and showing that while certain imperial-era formulas are used in the text, some of the Latin in the document could not have been written in the fourth century.
(Speaking of Latin, “nihil novum sub sole” comes to mind.)
There is a very modern consequence of this sort of technology that I’ve never seen discussed. That consequence is the potential use of stylometric techniques to identify the authors of anonymous blogs. It seems to me that sooner or later this is going to happen.
Anonymity in blogging is pretty important these days, and I think it’s important that people who need to blog anonymously understand this simple fact: anonymity networks and encryption aren’t enough to keep your identity hidden. In the long run, at least, the only way to really ensure your anonymity is to never, ever associate any text you’ve written with your real name.
The scenario would go like this:
- A blogger has a public blog with content that’s not, well, “controversial.”
- The same blogger writes another blog with “controversial” content, and does so using anonymizing technology (Tor, whatever), but the content itself is still publicly available.
- For whatever reason, someone who wants to know comes to suspect that both blogs are written by the same person.
- Stylometry is used to verify this suspicion.
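To make that last step a bit more concrete, here is a minimal Python sketch of one classic stylometric idea: comparing how often two texts use common function words. The word list, the function names (`profile`, `cosine_similarity`, `style_similarity`), and the choice of cosine similarity are all illustrative assumptions, not a description of what any real forensic tool does.

```python
import math
import re
from collections import Counter

# A small, illustrative list of English function words (an assumption;
# serious studies tend to use much longer lists).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it",
                  "is", "was", "i", "for", "on", "with", "but", "not"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no function words at all: nothing to compare
    return dot / (norm_u * norm_v)


def style_similarity(text_a, text_b):
    """Similarity of the two texts' function-word profiles."""
    return cosine_similarity(profile(text_a), profile(text_b))
```

You would call `style_similarity(public_blog_text, anonymous_blog_text)` on long samples from each blog. A score near 1.0 only says the function-word habits are similar; to mean anything it would have to be compared against scores for other plausible authors, and even then it is a hint rather than proof.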
It’s a very sticky problem. And the stickiest part is that there’s really no way to completely disguise the way you write. When we use language, we aren’t really even conscious of our style. And even little typographical details may serve to help identify an author—do you use em-dashes?
Stuff like that could put ya in the slammer just like a fingerprint could. From the article:
Among other textual similarities, Mr. Fitzgerald found both the anonymous letters and the doctor’s own writing samples contained similar and unusual spacing between words.
Spacing. Who thinks about that? It’s just habit.
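Habits like these are easy to count mechanically. The sketch below tallies a few typographical quirks in a piece of text; the particular features and the function name `typographic_habits` are hypothetical choices for illustration, not the method used in the case quoted above.

```python
import re


def typographic_habits(text):
    """Count a few typographical quirks that tend to be unconscious habits."""
    return {
        # Literal em-dash characters (U+2014).
        "em_dashes": text.count("\u2014"),
        # Exactly two hyphens used as a stand-in for a dash.
        "double_hyphens": len(re.findall(r"(?<!-)--(?!-)", text)),
        # Any run of two or more consecutive spaces.
        "multi_space_runs": len(re.findall(r" {2,}", text)),
        # Exactly two spaces after a period (the old typewriter habit).
        "two_spaces_after_period": len(re.findall(r"\.  (?! )", text)),
        # A space placed before punctuation, e.g. "maybe ; fine".
        "space_before_punct": len(re.findall(r" [,;:!?]", text)),
    }
```

Counts like these could be normalized by text length and added to a word-frequency profile as extra features. On their own they are weak evidence, since blogging software and editors often rewrite spacing and dashes.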
Mathematically, stylometry (or authorship attribution as it’s also known) is really interesting. You can do neat stuff in literature, such as make the case that Shakespeare copped some of his stuff from Marlowe.
But in terms of privacy and blogging, it’s a little creepy.
Something to think about.
2 comments.
You think “Stylometry” would actually hold up in a court of law?
That’s a fair question, and the short answer is that I don’t know… I’m not a lawyer, for one thing. Also, since writing the post I’ve done a bit more digging around in the web world of forensic linguistics, and I have to say that some of it seems a bit fringey. Reminiscent, a bit, of the handwriting analysis crowd. (Which is certainly not science.)
The term “stylometry” does have a pseudoscientific ring to it — “forensic linguistics” seems slightly less fringey, but even there… well, the papers seem to be in very large fonts. ☺
It does seem to be the case that linguistic analysis has been used as evidence in the past, but I haven’t found any specific instances of the kind of statistical author identification I describe playing a primary role in a conviction.
Even so, I personally think that it’s perfectly reasonable that carefully conducted analyses could be used in court; there must be existing clear rules about how to measure the accuracy of evidence, after all. But I don’t imagine any reasonable judge giving such data weight comparable to what a DNA analysis would command.
Anyway, I’ve never seen this whole topic brought up about bloggers, so I was just bringing up the idea. Perhaps my post sounds too confident. I’ve certainly never implemented any such software, but I suspect that tools from NLP could give reliable results in some cases. I’m just saying it seems like a possibility, nothing more.