Ben Langhinrichs

Genii Weblog

Lies and double-speak - Google and the trillion pages

Mon 28 Jul 2008, 10:40 AM



by Ben Langhinrichs
It is interesting to see how good companies have gotten at "spin", learning perhaps from the politicians.  Google announced that they had indexed one trillion web pages.  Or did they?  If you believe "Google's Index Reaches a Trillion URLs" or any of a number of similar stories, you'll think they did.  Even Google's official blog has a post titled "We knew the web was big..." which seems to imply this, with the cleverly worded quote:
The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. ... Recently, even our search engineers stopped in awe about just how big the web is these days -- when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!
But look at that carefully again.  They are comparing apples and oranges.  The first two statistics are about how many pages are in their index.  The third is about how many unique URLs their systems found on the web.  Later in the article, they even 'fess up, once they are comfortable that most people won't keep reading:
We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers.
Well, you might ask yourself, what difference does that make?  The clue is found in the CNN story, "Ex-Googlers launch rival search engine", which talks about Cuil (pronounced "cool"), a new search engine that boasts that its index "spans 120 billion Web pages".

So, which index is bigger?  The court of public opinion will now say "Google has a trillion pages, while Cuil only has a tenth of that", but we have absolutely no way of knowing.  Cuil may well have more, as its owners imply, but they are constrained by confidentiality agreements from saying what they know, and even they don't know for certain since they left Google a while ago.  But Google has managed to start a meme that will be hard to beat.

OK, maybe they didn't learn from politicians.  Perhaps they learned from the "seat wars" between Microsoft and IBM.  

Copyright © 2008 Genii Software Ltd.
