The Statistical Reliability of Television Ratings in China
This post is inspired by a story in Legal Mirror via Media In China. During a forum, there were questions about the scientific nature of the television ratings from CCTV-SOFRES and Nielsen Media Research. The lament is that "Since CCTV-SOFRES is implementing a system in which the lowest rated program will be eliminated, we program hosts will have to pay attention to the ratings." At this point, it is necessary to disclose that my company has a 50% stake in the Nielsen Media Research service, but I am personally not involved in television audience measurement in China. I am not here to attack a competitor, but I am here to defend my profession against media distortion.
It is axiomatic that every program host thinks their ratings should be better, because they know that they are doing good work. I cannot get into the specifics of this particular complaint for lack of information.
The more interesting part is the last part of the article. To counterbalance the claim of the television program host, the reporter interviewed a Mr. Hu at CCTV-SOFRES. Here is the translation:
According to him, the calculation of the rating is not based upon "the number of viewers divided by the total number of survey respondents."
First of all, the company will select 300 representative households by considering sex, education, family composition and other factors. Next, in order to reduce the effect of fatigue from participation in the survey, the company will replace 2% of the households in the sample. Furthermore, the calculation for a time period is based on the minute as the basic time unit. Thus, the number of viewing minutes is added up for all persons and divided by "total number of persons in the sample multiplied by the total number of minutes in the time period" and the result is the rating for the television program.
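The calculation described in the quote can be sketched in a few lines of Python. This is only an illustration of the formula as reported: the function name and the four-person data set are made up for the example, and a real service would also apply weights that are not described here.

```python
def rating(viewing_minutes, period_minutes):
    """Average-minute rating, in percent.

    viewing_minutes: one entry per sample person, giving the number of
                     minutes that person viewed during the time period.
    period_minutes:  total number of minutes in the time period.
    """
    if period_minutes <= 0 or not viewing_minutes:
        raise ValueError("need a positive period and at least one person")
    total_viewed = sum(viewing_minutes)
    # Denominator from the quote: persons in the sample x minutes in the period
    total_possible = len(viewing_minutes) * period_minutes
    return 100.0 * total_viewed / total_possible


# Hypothetical sample of 4 persons and a 30-minute program:
# one person watched all of it, one watched half, two watched nothing.
print(rating([30, 15, 0, 0], 30))  # 37.5
```

Note that this is not the same as "viewers divided by respondents": a person who watched half of the program counts as half a viewer.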
Why does the sample consist of 300 households? According to Mr. Hu, "Based upon international standards, 1,067 sample persons is the point of balance between reliable numbers and research value. This number is the number of persons in 300 households in the case of China."
"From a sample of 1,067, we can achieve the international standard of being able to achieve an error tolerance of plus or minus under 3% at the 95% confidence level. If we increase the sample to 1,000 households, the costs will increase geometrically but the increase in reliability and the reduction in sampling error will be quite limited," said Mr. Hu.
I hope that Mr. Hu did not say that, and it is usually the case that the reporters get the technical details wrong.
What do I know about the sampling error around a television rating? I am going to refer to someone who has the distinction of having calculated more standard errors around television ratings than the sum total of all other persons in the history of mankind, or so I was told (note: when I last checked with him, his running count was 14 million). So I take it that he knows what he is talking about. The following is an excerpt from Roland Soong (1988) The Statistical Reliability of People Meter Ratings. Journal of Advertising Research, February/March, p.51-56.
Television audience ratings are obtained from samples of television households. The ratings are subject to inherent error due to the fact that different samples will generate somewhat different results. The margin of possible error due to this factor is commonly referred to as sampling error.
The size of the sampling error is measured by a quantity called the standard error. The size of the standard error can be influenced by many factors. If the sample is a simple random sample of size n, then the standard error of a rating (p) is given by SQRT(p(100-p)/n). However, real-life samples are never just simple random samples and the sampling error is influenced by many other factors such as repeated measurement over multiple time units per household or person; the clustering effect of measuring multiple persons per household; weighting; guest viewing; and so forth.
As a result, the actual standard error can be greater, equal to, or less than the simple standard error.
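For reference, the simple standard error in the excerpt is easy to compute. The sketch below just transcribes SQRT(p(100-p)/n), with p in percent; the function name is mine, and the result applies only under the simple random sample assumption.

```python
import math


def simple_standard_error(p, n):
    """Standard error (in rating points) of a rating p (in percent)
    from a simple random sample of size n: SQRT(p(100-p)/n)."""
    if not 0 <= p <= 100:
        raise ValueError("rating p must be between 0 and 100")
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return math.sqrt(p * (100 - p) / n)


# A rating of 50 from 100 persons has a standard error of 5 rating points.
print(simple_standard_error(50, 100))  # 5.0
```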
Let us look at the statements attributed to Mr. Hu.
First, let us assume that the CCTV-SOFRES sample is indeed a simple random sample of 1,067 persons.
Technically, it is correct to say that a simple random sample of 1,067 persons yields a 95% confidence interval of plus or minus 3% or less: the simple standard error is largest at a rating of 50%, where 1.96 standard errors come to about 3 percentage points. But this statement misses the point about how the width of the confidence interval depends on the rating level. The more important issue, though, is that the sample is not a simple random sample.
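To see how much the rating level matters, here is a small sketch that tabulates the 95% margin of error for a simple random sample of 1,067 persons. The rating levels in the loop are chosen by me for illustration, and 1.96 is the usual normal multiplier for 95% confidence.

```python
import math

Z_95 = 1.96  # normal multiplier for a 95% confidence interval


def margin_of_error(p, n, z=Z_95):
    """Half-width of the confidence interval, in rating points,
    for a rating p (in percent) under simple random sampling."""
    return z * math.sqrt(p * (100 - p) / n)


# Illustrative rating levels only; these are not actual ratings.
for p in (1, 5, 10, 20, 50):
    moe = margin_of_error(p, 1067)
    relative = 100 * moe / p
    print(f"rating {p:>2}%: +/- {moe:.2f} points ({relative:.0f}% of the rating)")
```

At a rating of 50 the margin is about 3 points, which is where the "plus or minus 3%" comes from. At a rating of 1 the margin is only about 0.6 points, but that is roughly 60% of the rating itself.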
Quoting again from Soong (1988),
Television is a group activity. When persons in a household view television together, there is a decision-making process to choose the programs. In recent years, this has become less important because of the increase in the number of sets per household. Nevertheless, close to 50% of the television households are single-set households. The information relevant to group viewing is contained in the people statistical efficiencies ... [which] reflect primarily the duplication in television viewing of members of the same households ...
So if this is a sample of 1,067 persons living in 300 households, this intra-household clustering effect exists and will make it less reliable than a true simple random sample. For example, Soong (1988) cites discount factors of 16%, 16%, 14% and 30% for adult men, adult women, teens and children 2-11 respectively due to this effect. On this basis alone, the confidence intervals would be larger than those for a simple random sample of the same size.
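A rough sketch of what such a discount does to the standard error follows. It rests on an assumption that I want to state loudly: I am treating each discount factor as a reduction in the effective sample size (so a 16% discount leaves 84% of the persons "effective"). The subgroup size of 400 and the rating of 10 are hypothetical numbers for the example, not figures from any service.

```python
import math

# Discount factors cited from Soong (1988), as fractions.
# ASSUMPTION: each is read as a reduction in effective sample size.
DISCOUNTS = {
    "adult men": 0.16,
    "adult women": 0.16,
    "teens": 0.14,
    "children 2-11": 0.30,
}


def clustered_standard_error(p, n, discount):
    """Standard error of a rating p (percent) when the n sample persons
    are worth only n * (1 - discount) independent persons."""
    effective_n = n * (1 - discount)
    return math.sqrt(p * (100 - p) / effective_n)


# Hypothetical: a rating of 10 among 400 sample persons in each group.
simple = math.sqrt(10 * 90 / 400)
for group, discount in DISCOUNTS.items():
    se = clustered_standard_error(10, 400, discount)
    print(f"{group}: {se:.2f} vs. simple {simple:.2f}")
```

Under this reading, a 16% discount inflates the standard error by a factor of 1/SQRT(0.84), or about 9%, and a 30% discount by about 20%.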
The other factor that impacts statistical reliability is the time period. As Mr. Hu described, "the calculation for a time period is based on the minute as the basic time unit. Thus, the number of viewing minutes is added up for all persons and divided by 'total number of persons in the sample multiplied by the total number of minutes in the time period' and the result is the rating for the television program." From this description, the most basic sampling unit is not a person. It is a minute within a person. And the actual rating is based upon a cluster sample of multiple time units (such as the minutes within a program) within the sample persons. For cluster samples, the more time units, the more reliable the number. The outcome depends on the time units involved. For example, the worst case is the single minute 8:00pm-8:01pm on a specific Sunday (July 24, 2005); greater reliability is achieved with the 30 minutes between 8:00pm-8:30pm on the same day; even greater reliability is achieved with the 210 minutes in the 8:00pm-8:30pm slot across Monday-Sunday (July 18-July 24, 2005); and even better for the same program over 3 months (13 weeks x 210 minutes per week = 2,730 minutes).
Thus, if one wants to look at a single minute on a particular day, there may be a large standard error. As one looks at the average rating of a one-hour program across multiple weeks, the standard error may be substantially smaller.
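The count of time units in each of those scenarios is simple arithmetic, sketched below with a helper function of my own naming. The sketch only counts minutes per sample person. It does not estimate the standard error, because the minutes within a person are correlated and the gain from adding minutes is therefore smaller than it would be for independent observations.

```python
def time_units(minutes_per_day, days_per_week=1, weeks=1):
    """Number of minute-level time units measured per sample person."""
    return minutes_per_day * days_per_week * weeks


scenarios = {
    "single minute, one day": time_units(1),
    "30-minute slot, one day": time_units(30),
    "30-minute slot, Monday-Sunday": time_units(30, days_per_week=7),
    "30-minute slot, 13 weeks": time_units(30, days_per_week=7, weeks=13),
}

for label, minutes in scenarios.items():
    print(f"{label}: {minutes} minutes")
```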
Also, there is no such thing as an international standard about acceptable sampling error. In the end, it is all about money (surprise!). When I first started to work in television audience measurement in local markets, my company had 500 households in New York City and Los Angeles, 400 in Chicago, 300 in Philadelphia and San Francisco. Meanwhile, Nielsen Media Research has 7,000 households in the USA national sample today. We would all like to have large samples, but it is all about what the market can afford to pay.
Here is what usually happens. Mr. Hu of CCTV-SOFRES most likely knows exactly what these issues are, and he tries to explain the complexities to the reporter. The reporter may even grasp how many factors can impact the statistical reliability of television ratings. But, in the end, the reporter needs to come up with a simplistic scenario that he/she believes the typical reader can handle and thereby butchers the explanation. And that is a good summary of my professional dealings with the media over the past two decades.