Google’s Stemming Considered… Not So Useful.
It seems lately that Google has increased the amount of stemming that takes place on search queries. At least, that’s my anecdotal impression.
Here’s what I mean:
Try this search:
Almost all of the results that come back treat genes as the name Gene.
It’s true that I can put quotes around that term:
and get what I originally intended.
Personally I don’t find that sort of second-guessing very useful. If the user bothered to type the plural, they want the plural.
Would anyone argue that it makes sense to return plurals for singular searches, getting “genes” for “gene”? It makes no more sense to me the other way around, returning the singular for a plural search, which is what Google is doing at the moment.
8 comments.
Yes, the stemming is extremely annoying, especially when trying to use Google for linguistic research. I live in continual fear that they might introduce stemming even inside of quotation marks…
Hi Anatol,
I noticed today that this search:
site:www.library.yale.edu translation
will return not just “translations” but also “translator” and even “translator’s”.
Sigh.
That is truly horrible. There seems to be neither rhyme nor reason to the morphological behavior of this function:
translation will return the words you mention as well as the verb form translate, but strangely not translates, translated or translating;
translate will return all the derived nouns as well as the third person translates, but not translated or translating;
translates will return translation and translations but not translator or any of its forms, and the verb forms translate and translated, but not translating;
translating will return the same nouns and the verb form translate but not the other verb forms;
translated will return the same nouns and translate but not the other verb forms.
What are they thinking?
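For anyone who wants to see the lopsidedness at a glance, here is a minimal Python sketch that tabulates the behavior listed above and picks out the one-way expansions. The table is only a reading of this comment: "all the derived nouns" is taken to mean translation, translations, translator and translator’s, and "the same nouns" is assumed to mean translation and translations. The function name `asymmetric_pairs` is made up for illustration.

```python
# Expansions as reported in the comment above. Assumption: "the same nouns"
# is read as "translation" and "translations"; "all the derived nouns" as
# translation, translations, translator and translator's.
observed = {
    "translation": {"translations", "translator", "translator's", "translate"},
    "translate": {"translation", "translations", "translator", "translator's",
                  "translates"},
    "translates": {"translation", "translations", "translate", "translated"},
    "translating": {"translation", "translations", "translate"},
    "translated": {"translation", "translations", "translate"},
}


def asymmetric_pairs(expansions):
    """Return (query, form) pairs where the query returns the form,
    but the form, when tested as a query itself, does not return the query."""
    pairs = []
    for query, forms in expansions.items():
        for form in forms:
            # Only forms that were also tested as queries can be compared.
            if form in expansions and query not in expansions[form]:
                pairs.append((query, form))
    return sorted(pairs)


for query, form in asymmetric_pairs(observed):
    print(f"{query} returns {form}, but not the other way around")
```

Under that reading, every inflected verb form pulls in translation and translate without being pulled in by them, which is about as far from a symmetric stemmer as you can get.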
Hi again Anatol. I see from your blog that you’re a German speaker.
Any evidence that Google does similar stuff in German?
Hi Patrick, yes, Google does the same thing in German and no more systematically so than in English… There are too many inflectional forms for me to test them all, but here are some examples:
- übersetzen (‘translate’, 1st/3rd present or infinitive) returns Übersetzer (‘translator.masc’) but no inflectional forms and not the feminine form Übersetzerin, and it doesn’t return Übersetzung (‘translation’) or its plural nor any other verb forms;
- Übersetzung does not return anything but itself, not even the plural;
- Übersetzer returns the infinitive/1st/3rd present übersetzen but no other verb forms nor the nouns Übersetzung or Übersetzerin;
- Übersetzerin (‘translator.fem’) returns nothing but itself, not even the masculine form;
- a randomly chosen verb form, übersetzt (past participle/3sg present), returns nothing but itself.
It would be interesting to know what kind of algorithm Google uses to decide which forms to include in any given search. My guess is that it has nothing to do with linguistics (duh), but is somehow based on the frequency of these forms and perhaps their per-document co-occurrence.
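To make the co-occurrence guess a bit more concrete, here is a rough Python sketch of what a per-document co-occurrence score could look like. It is purely a guess at the kind of statistic involved, not anything Google is known to use; the function name `cooccurrence_scores` and the three toy documents are invented for illustration.

```python
from itertools import combinations


def cooccurrence_scores(documents):
    """Score each pair of word forms by the Jaccard overlap of the sets of
    documents they occur in (1.0 = always together, absent = never together)."""
    doc_sets = {}
    for doc_id, text in enumerate(documents):
        for word in set(text.lower().split()):
            doc_sets.setdefault(word, set()).add(doc_id)

    scores = {}
    for a, b in combinations(sorted(doc_sets), 2):
        shared = doc_sets[a] & doc_sets[b]
        if shared:
            scores[(a, b)] = len(shared) / len(doc_sets[a] | doc_sets[b])
    return scores


# Toy documents, invented for this example only.
toy_docs = [
    "the translator finished the translation",
    "a translation by a famous translator",
    "she translated the letter",
]

scores = cooccurrence_scores(toy_docs)
print(scores[("translation", "translator")])
print(("translated", "translation") in scores)
```

A search engine working along these lines would expand translation to translator whenever the two keep turning up in the same documents, regardless of how they are related morphologically, which would at least be consistent with the patchy behavior described above.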
I was thinking the same thing with regard to frequency, Anatol.
It might be interesting to take a set of these terms and see if their frequencies correspond somehow to their occurrence patterns.
I’ll push that on my mile-long todo list ;)
aaargh
aaaaaaaaargh
I’m gonna stop moaning about this right… now.
As soon as I’ve moaned about this:
“apartment maintenance” tipping
I wasn’t looking for tips, I was looking for information about tipping.
Ugh!