I am working on a paper for the Worldwide Readership Research Symposium. Here is the background section of the paper:
In the United States, government agencies have gone through a series of homeland security data mining projects by the names of CAPPS (“Computer-Assisted Passenger Pre-Screening System”), TIA (“Total Information Awareness”), MATRIX (“Multistate Anti-Terrorism Information Exchange”), Able Danger and so on. A key assumption behind some of these projects is that large-scale consumer databases can be used for predictive purposes to identify and interdict target persons.
Do we know that data mining works? The performance characteristics of the Homeland Security projects are obviously classified secrets. In this paper, we intend to see if a large-scale consumer database can be used to predict ailment conditions in the population.
Since the conference does not take place until October, it is inappropriate for me to give away the whole story. But let us just say that I am interested in the relative predictive power of information. Here are my three classes of information.
First, my baseline will be real information collected directly from people in a large survey. I will use demographic variables: age, sex, race, education, household income, occupation and geographical region. These are the standard demographic categories by which ailment incidences are usually reported.
Second, I will use the variables that were supplied by a commercial database compiler. The information contains demographic variables such as age, sex, race, education, household income, occupation and geographical region, but they are compiled from various sources instead of directly elicited. As such, they may be subject to error or missing altogether in some cases. The information also contains data elements such as home value, car ownership, credit card ownership and mail order usage, and lifestyle variables such as gambling, cruise vacations, collectibles (coins, stamps, antiques, etc.), vegetarianism, gourmet cooking, wines, dieting, reading science fiction, playing tennis, and much more.
You are shocked. You certainly have not told anyone about your lifestyle, tastes, habits and hobbies. How do they know that? Is the information accurate? I don't worry about those questions, because someone once said, "It does not matter whether the cat is black or white, just as long as it can catch mice." I don't worry about who is a true vegetarian or not, for I am only concerned about whether being designated as one by the database compiler has predictive power. I accept the epithet "pragmatist" with professional pride.
Third, I am using Census-based variables that can be derived from the street address (including the nine-digit postal code). Based on the address, I can attach the US Census data for the Census-defined block (e.g. median age, median household income, median housing value, % black, % Hispanic origin, % persons age 65+, % persons in poverty, % college graduates, etc.).
I have a database in which all the above data elements have been collected for more than 20,000 persons, and I ran each of the three sets of predictor variables against a list of ailments.
Which is the best method? You should know better than to ask a question like that. Any method or set of predictors is certain to do well on the problems that it fits and not so well on the problems that it does not fit. There is the famous "No Free Lunch Theorem" which can be grossly simplified as saying:
If there is a best method that is best in all cases, then we get a free lunch in the sense that we won't have to think or work anymore.
This theorem has obvious applications to any claim that this -ism or that -ism is the best and only thing out there (e.g. Hua Guofeng's Two Whatevers: "Whatever policy originated from Chairman Mao, we must continue to support" and "Whatever directions were given to us from Chairman Mao, we must continue to work on their basis."). There is always something known as the reality check, because reality does not have to follow your formulation.
When push comes to shove, I will tell you that anything that comes from the individual level (that is, the real data or the compiled database) is usually superior to the aggregate-level data (such as the characteristics of your neighborhood as a whole). Neighborhoods can be very mixed, so that being right on the average may mean being wrong at the individual level all the time (e.g. a mean annual income of $10,000 could be the result of most people making $100 and a very small set making $1,000,000). I found two notable exceptions, and these ailments can be sexually transmitted -- namely, herpes and HIV/AIDS. For those two cases, knowing about your social environment is more important than knowing a lot about you personally that does not pertain directly to the behavior.
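The income example can be checked with a few lines of arithmetic. The sketch below is only an illustration: the neighborhood of 100 people, split 99 to 1, is an assumed set of numbers chosen to match the example, not data from the survey.

```python
from statistics import mean, median

# Hypothetical neighborhood of 100 people (illustrative numbers, not survey data):
# 99 residents make $100 a year and 1 resident makes $1,000,000.
incomes = [100] * 99 + [1_000_000]

neighborhood_mean = mean(incomes)      # the aggregate-level figure
neighborhood_median = median(incomes)  # what the typical resident makes

# Use the neighborhood mean as the "prediction" for every resident and
# measure how far off it is for each individual.
errors = [abs(income - neighborhood_mean) for income in incomes]
smallest_error = min(errors)

print(f"mean income:               ${neighborhood_mean:,.2f}")
print(f"median income:             ${neighborhood_median:,.2f}")
print(f"smallest individual error: ${smallest_error:,.2f}")
```

Under these assumed numbers the mean is $10,099, yet the median is $100 and the mean misses every single resident by at least $9,999. The aggregate figure is right on the average and wrong for each individual.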
Between the two individual sets of predictors, you can pretty much guess what will happen. The real demographic data are going to work well for those ailments that have age/sex specificity (such as Alzheimer's disease, arthritis, erectile difficulty, hangover, osteoporosis, etc.). The compiled data elements, which contain so much lifestyle information, are going to work well for those ailments that have social factors (such as anxiety/panic attack, bipolar disorder, depression, hypertension). The compiled data also contain much on food and cooking, so they are going to work well for ailments such as acid reflux, diabetes, food allergy, high cholesterol, heartburn, nutrition deficiency, obesity and so on.
Having said all that, I am sure my readers will react here like many people who have listened to me talk at conferences or read my published papers -- "But it is so obvious!" Was it really that obvious before I gave out the answer?
So here I will have to stop this post and go back to finish the conference paper. There is another section to it: how well will these predictors be able to predict which magazines people read? I won't give out the answer yet so you can decide if this is so obvious ...