Computational Linguistics

Just because I don't allow comments here doesn't mean I am not interested in seeing who is reading this blog.  I can still check Technorati or Google to find out who is linking here.  For some time, I have noticed that a number of those links are 'spam' commercial links.  For example, Google returns the following links in reply to 'EastSouthWestNorth':

Who are these people?  Who has the time on their hands to compose these types of pages?  Here is the explanation from an article:

Linguists must often correct lay people's misconceptions of what they do. Their job is not to be experts in “correct” grammar, ready at any moment to smack your wrist for a split infinitive. What they seek are the underlying rules of how language works in the minds and mouths of its users. In the common shorthand, linguistics is descriptive, not prescriptive. What actually sounds right and wrong to people, what they actually write and say, is the linguist's raw material.

But that raw material is surprisingly elusive. Getting people to speak naturally in a controlled study is hard. Eavesdropping is difficult, time-consuming and invasive of privacy. For these reasons, linguists often rely on a “corpus” of language, a body of recorded speech and writing, nowadays usually computerised. But traditional corpora have their disadvantages too. The British National Corpus contains 100m words, of which 10m are speech and 90m writing. But it represents only British English, and 100m words is not so many when linguists search for rare usages. Other corpora, such as the North American News Text Corpus, are bigger, but contain only formal writing and speech. 

Linguists, however, are slowly coming to discover the joys of a free and searchable corpus of maybe 10 trillion words that is available to anyone with an internet connection: the world wide web. The trend, predictably enough, is prevalent on the internet itself. For example, a group of linguists write informally on a weblog called Language Log. There, they use Google to discuss the frequency of non-standard usages such as “far from” as an adverb (“He far from succeeded”), as opposed to more standard usages such as “He didn't succeed—far from it”. A search of the blog itself shows that 354 Language Log pages use the word “Google”. The blog's authors clearly rely heavily on it.

For several reasons, though, researchers are wary about using the web in more formal research. One, as Mark Liberman, a Language Log contributor, warns colleagues, is that “there are some mean texts out there”. The web is filled with words intended to attract internet searches to gambling and pornography sites, and these can muck up linguists' results. Originally, such sites would contain these words as lists, so the makers of Google, the biggest search engine, fitted their product with a list filter that would exclude hits without a correct syntactical context. In response, as Dr Liberman notes, many offending websites have hired computational linguists to churn out syntactically correct but meaningless verbiage including common search terms. “When some sandbank over a superslots hibernates, a directness toward a progressive jackpot earns frequent flier miles” is a typical example. Such pages are not filtered by Google, and thus create noise in research data.
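The passage above describes keyword-stuffed pages that are syntactically well-formed but semantically empty.  One plausible way such text could be produced — a minimal sketch, not the actual method any spam site uses — is to expand a toy phrase-structure grammar at random, with the spam keywords placed in ordinary noun slots so they appear in grammatical contexts that a naive list filter would not catch.  The grammar rules and word lists below are illustrative assumptions:

```python
import random

# Toy phrase-structure grammar: each symbol maps to a list of possible
# expansions.  Expansion bottoms out in the word lists below.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["Det", "Adj", "N"]],
    "VP": [["V", "NP"], ["V", "PP"]],
    "PP": [["P", "NP"]],
}

WORDS = {
    "Det": ["a", "the", "some"],
    "Adj": ["frequent", "progressive", "direct"],
    # Spam keywords go in the noun slot, so every sentence is
    # syntactically plausible but meaningless -- like the
    # "sandbank over a superslots" example quoted above.
    "N":   ["sandbank", "superslots", "jackpot", "directness"],
    "V":   ["hibernates", "earns", "overlooks"],
    "P":   ["over", "toward", "beside"],
}

def expand(symbol, rng):
    """Recursively expand a grammar symbol into a list of words."""
    if symbol in WORDS:
        return [rng.choice(WORDS[symbol])]
    production = rng.choice(GRAMMAR[symbol])
    return [word for part in production for word in expand(part, rng)]

def nonsense_sentence(rng=None):
    """Generate one grammatical-but-meaningless sentence."""
    rng = rng or random.Random()
    return " ".join(expand("S", rng)).capitalize() + "."

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(nonsense_sentence(rng))
```

Because every output parses as a normal English sentence, filtering by syntactic context alone — the countermeasure described above — cannot distinguish this output from genuine prose, which is exactly why such pages pollute linguists' search results.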

At this point, these spam commercial links are still in the minority for this blog compared to legitimate references.  But if these pages are computationally generated, then sooner or later the spam will overwhelm everything else unless Google does something more (which it must do in order to preserve its product).

Meanwhile, I am glad that I am fodder for computational linguists.  That should keep them off the streets ...