Official nationwide censuses have been the original big data sources for centuries. Censuses were conducted across a number of ancient societies, including China, India, Egypt, Greece, and Rome. According to Webster’s dictionary, a census is the official process of counting the number of people in a country, city, or town and collecting information about them. Most ancient kingdoms used a census to track their populations for political purposes, primarily to keep an accurate count of ethnic communities and to levy taxes. The US Constitution stipulates that a census be conducted every ten years, and by 1880 the size and geographical breadth of the US population had driven the manual process to its limits, fueling automation and the use of punch-card machines.1 Even today, the primary beneficiary of the census in the United States is the process of apportioning and drawing congressional districts.
A census remains a costly and time-consuming undertaking, as each citizen must respond to a questionnaire with a large number of questions. For a country of more than 300 million persons spread out over 3.6 million square miles, counting the entire populace is an immense logistical feat. To accomplish it for the 2010 census, the Census Bureau mailed approximately 134 million questionnaires that were to be completed by April 1. That would have cost nearly $60 million in postage alone had the Census Bureau not received free postage from the United States Postal Service (USPS). The collective weight of all 360 million printed questionnaires (from all three mailings) was nearly 12 million pounds, and stacked on top of one another they would have been nearly 29 miles high.2
Electoral candidates as well as pollsters have used census data extensively in building their big-data-driven prediction models for election results. As I watched the counting of votes in the 2012 US elections, I was fascinated by how John King from CNN used the census data and his predictive model to provide early analysis of the results.3
While a census is mandated by the constitution in many nations around the world as an important input for organizing the democratic division of the voting process, it also provides a wealth of information to the marketing community. It is by far the most comprehensive view of a nation, and by combining data across many nations, a marketing organization can assemble a global view of its consumers. Marketers have mastered the art of collecting statistical information from small samples and projecting it to the entire population using census data. For example, Nielsen’s report on Asian consumers uses census data to accurately estimate the size of the Asian population in the United States and then employs a large number of statistically significant samples of the population to project the behavior of these consumers.4
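
The projection described above comes down to simple arithmetic: measure a rate in each sampled group, then scale it by that group's census count. The following is only a minimal sketch of that idea. The age groups, census counts, and survey figures are invented for illustration and do not come from Nielsen or any census, and `project_to_population` is a hypothetical helper name.

```python
# Hypothetical census counts per age group (made-up numbers for illustration).
census_counts = {"18-34": 4_000_000, "35-54": 5_000_000, "55+": 3_000_000}

# Hypothetical survey results per group: (respondents, respondents who buy).
survey = {"18-34": (400, 120), "35-54": (500, 100), "55+": (300, 30)}


def project_to_population(census_counts, survey):
    """Scale each group's observed sample rate by its census count."""
    projected = {}
    for group, population in census_counts.items():
        respondents, buyers = survey[group]
        # buyers / respondents is the sample rate; multiplying by the census
        # count projects it to the whole group.
        projected[group] = buyers * population / respondents
    return projected


projection = project_to_population(census_counts, survey)
total_buyers = sum(projection.values())
overall_rate = total_buyers / sum(census_counts.values())

print(f"Projected buyers: {total_buyers:,.0f}")
print(f"Projected share of population: {overall_rate:.1%}")
```

Note that the survey here covers only 1,200 respondents, yet the census counts let the result be expressed for a population of 12 million. Weighting each group separately also keeps an over-sampled group from dominating the overall estimate.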
A census is also an important case study on the protection of personal data. Although the data is collected systematically from every individual in the population, it is typically made public only in aggregate form. The aggregation level is chosen so that it does not reveal the identity of any individual, while at the same time it provides valuable information about a community.
Statistical sampling offers us an important way to collect detailed data from a small number of people and achieve relatively high accuracy in predicting the behavior of an entire population, as long as the sampling is done “randomly,” that is, each individual has an equal probability of being chosen as a sample representative of the population. For example, if we were to conduct a telephone survey to elicit opinions in a society where only the wealthiest 10 percent of individuals use telephones, the sample would not be random, because the remaining 90 percent would have no chance of being selected. For a long time, with a census as the only source of big data, there was no easy way to challenge a prediction based on statistical sample data projected to the population using census data. In the recent past, we began to see other sources of big data that extensively represent society at large. In many cases, they represent observations as opposed to reported information. How do we combine census data with these other sources to analyze and infer consumer behavior? Let me discuss a couple of examples.
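
Before turning to those examples, a small simulation may make the caveat about random sampling concrete. This is a sketch under stated assumptions, not a model of any real survey: the population is synthetic, the link between wealth and opinion (`0.2 + 0.6 * wealth`) is invented, and "phone owners" are simply defined as the wealthiest 10 percent, as in the telephone example above.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic population: each person has a wealth percentile in [0, 1) and a
# yes/no opinion. Assumption for illustration only: wealthier people are more
# likely to answer "yes".
population = []
for _ in range(100_000):
    wealth = rng.random()
    says_yes = rng.random() < 0.2 + 0.6 * wealth
    population.append((wealth, says_yes))


def share_yes(people):
    """Fraction of the given people who answered 'yes'."""
    return sum(1 for _, yes in people if yes) / len(people)


true_share = share_yes(population)

# Random sample: every individual has the same chance of being chosen.
random_sample = rng.sample(population, 1_000)

# Telephone sample: only the wealthiest 10 percent can be reached at all.
phone_owners = [person for person in population if person[0] >= 0.9]
phone_sample = rng.sample(phone_owners, 1_000)

print(f"True share saying yes:     {true_share:.3f}")
print(f"Random-sample estimate:    {share_yes(random_sample):.3f}")
print(f"Telephone-sample estimate: {share_yes(phone_sample):.3f}")
```

Both samples contain the same number of respondents, so the gap between the two estimates comes from who could be selected, not from how many were asked.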