Chinese Women News

This post really does not have any news about Chinese women.  Rather, this is an exercise in comparing search engines, a subject that is of current interest.

Before I get to the non-news about Chinese women, let me engage in the usual self-absorption exercise about the EastSouthWestNorth blog.  The term 'EastSouthWestNorth' does not exist in any dictionary, and therefore search engine results are likely to refer to this blog (whereas the blog name "Peking Duck" might refer to the famous dish).

Here are the search result numbers for 'EastSouthWestNorth' across major search engines:

There are obviously huge differences in what these search engines are indexing into their databases.  This is something that we all know already -- some search engines are better for some purposes than other search engines.  Thus, once upon a time, Altavista was my preferred search engine.  Today, I have migrated to because I found it to be more useful for my purposes.  For another person with different needs, the choice may be different.  Also, does the huge difference between and imply censorship by the latter?  No, it can't be because here are the top three results:

The top result is an immensely popular MSN Spaces blog by a famous columnist.  His blog roll contains a link to the EastSouthWestNorth blog and so this is an acceptable result.  The second result incredibly links directly to this Chinese-language EastSouthWestNorth page (translated into English as Taishi Village, My Neighbor) and there is no way in hell that anyone doing content filtering in China could have let this unedited page slip through.  After all, this essay was considered to be one precipitating factor in the shutdown of the Freezing Point supplement of China Youth Daily.  This is the kind of page that makes people ask me, "Are you worried about the secret police kicking in the door and arresting you in the middle of the night?"  And finally the third result is to this blog.  There is just something fundamentally different about and, and it is not just censorship.

Now I get to the main purpose of this blog post.  In the discussion about the English-language and Chinese-language versions of search engines, there are two assumptions that I wish to challenge:

False Assumption 1: At any moment, there is a fixed (and very large) number of pages on the World Wide Web.  Any decent search engine should be able to index most of these pages and therefore they should be able to produce more or less the same results in about the same ranking order.

As the above example of "EastSouthWestNorth" search results shows, this is a false assumption.  Most people realize that.  There are many billions of pages on the World Wide Web and Google is said to be indexing less than 10% of the pages in existence.  There is no reason to expect that two different search engines with different algorithms for their spiders and crawlers will end up with the same set of search results.  It also means that we should not rate one search engine as 'better' or 'worse' than others purely on the basis of the number of search results returned on a small number of terms (such as "EastSouthWestNorth").  Google's ~350,000 results for "EastSouthWestNorth" probably contain many splogs which will only frustrate users, so that MSN's cleaned-up ~23,000 results might actually give a better user experience.  After all, no user is actually expected to click through 23,000 or 350,000 pages; they will always look at the top (and presumably most relevant) results.

False Assumption 2: The Chinese-language version of each search engine is the 'censored' version of the English-language version of the search engine, where 'certain results have been removed to comply with local laws.'

Thus, if the English-language version of search engine delivers X results for a term and the Chinese-language version delivers Y results, then (X-Y) is the count of the censored results.
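The subtraction in this assumption only works if the Chinese-language index is a strict subset of the English-language one.  Here is a minimal Python sketch of what happens when it is not.  The page names and both index sets are made up purely for illustration (they are not real search data), and the function names are hypothetical.

```python
def naive_censored_count(x_results: int, y_results: int) -> int:
    """The (X - Y) estimate implied by False Assumption 2."""
    return x_results - y_results


def actually_missing(english_index: set, chinese_index: set) -> set:
    """Pages present in the English-language index but absent from the Chinese one."""
    return english_index - chinese_index


# Hypothetical indexes: the two versions crawl overlapping but different pages.
english_index = {"page_a", "page_b", "page_c", "page_d"}
chinese_index = {"page_a", "page_b", "page_e", "page_f", "page_g"}

naive = naive_censored_count(len(english_index), len(chinese_index))
missing = actually_missing(english_index, chinese_index)
extra = chinese_index - english_index  # pages only the Chinese version has

print("naive (X - Y):", naive)
print("pages missing from the Chinese index:", sorted(missing))
print("pages found only in the Chinese index:", sorted(extra))
```

In this toy case the naive count comes out negative even though two pages really are missing, because the Chinese-language index also holds pages that the English-language one never crawled.  The raw totals alone cannot separate removal from a difference in coverage.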

For the exercise here, I chose three common terms: China (中国), Woman (女人) and News (新闻).  How plain and generic can I get?  The following table contains the search results:

Search Engine   China (中国)   Woman (女人)   News (新闻)
631,000,000 (beta)
Baidu 100,000,000   27,900,000 75,600,000

The first observation is that the numbers of returned search results are vastly different across search engines, and this confirms the statement above about False Assumption #1.  The search engine databases vary in size, content, algorithm and so on, and therefore the results are different too.

The second observation is that contains more terms than is probably a superset (and definitely not a subset) of for these search terms.  What is being kept away from the users of the global service (especially about women)?  I don't know, but this result falsifies assumption #2 above -- is not the 'castrated eunuch' (自宫) version of and it really loves "woman."

The third observation is that and are about the same for "China" and "News" but very different for "Woman."  What is going on there?  What has got against "woman"?

The fourth observation is that contains more terms than  What are we to assume here?  That censored hundreds of millions of search results on common terms such as "China," "Woman" and "News" in order to 'comply with local laws'?  Or is something else going on (such as different algorithms)?

I have questions, but I have no answers here.  It will be up to the search engines to explain, if they feel like it (and I don't think that they will, because this involves detailed proprietary information about their operations).  But I know that there is a simple outcome -- as an individual, I will use the search engine that serves my purposes best.  If a search engine wants to 'castrate' itself to the detriment of user satisfaction, it will be punished in the free market.

Relevant link: China censorship: Yahoo, Google and Microsoft compared.  Rebecca MacKinnon, RConversation