Next, I would like to discuss location data. We carry our cell phones everywhere, and we now use mobile devices to watch movies, browse social media, and make transactions. How can a marketer collect, organize, and analyze location data? I had a chance to work with Jeff Jonas, an IBM Fellow, who was very interested in looking at location data for clues about mobility patterns. In his quest for big data, he came across publicly available mobility data posted by Malte Spitz, a German Green Party politician. Spitz disclosed his mobility information to make everyone aware of the privacy issues around location data collected by telcos. He went through a court dispute to obtain the data from his wireless phone company, Deutsche Telekom, and made six months of his mobility data publicly available.9 This data was a gold mine for location researchers, and Jonas decided to take up the challenge. As a premier expert in the field of customer identity, he wanted to understand whether he could establish the identity of an individual by analyzing mobility patterns.

How was this data collected from Spitz’s phone? A cell phone is served by a collection of cell phone towers, and its location can be inferred by triangulating its distance from the nearest towers. In addition, most smartphones can provide Global Positioning System (GPS) location information, which is more accurate (to within about 20 meters) but can rapidly drain the cell phone battery. In most marketing situations, cell tower location data combined with occasional GPS is good enough. A location record includes longitude and latitude and, if properly stored, takes about 26 bytes. If we store 24 hours of location data for 50 million subscribers at a frequency of once a minute, the data adds up to about 2 terabytes per day. This is the amount of information stored in the location servers at a typical telco. While that is a lot of data, it can be rapidly aggregated to keep only meaningful information.
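The 2-terabyte figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the function and constant names are illustrative, and decimal terabytes (10^12 bytes) are assumed.

```python
# Back-of-the-envelope storage estimate using the figures quoted in the text.
SUBSCRIBERS = 50_000_000       # subscribers tracked
FIXES_PER_DAY = 24 * 60        # one location fix per minute for 24 hours
BYTES_PER_FIX = 26             # approximate size of one stored location record


def daily_storage_bytes(subscribers=SUBSCRIBERS,
                        fixes_per_day=FIXES_PER_DAY,
                        bytes_per_fix=BYTES_PER_FIX):
    """Raw bytes of location data generated per day, before any aggregation."""
    return subscribers * fixes_per_day * bytes_per_fix


total = daily_storage_bytes()
print(f"{total / 1e12:.2f} TB per day")  # prints "1.87 TB per day"
```

The result, roughly 1.9 terabytes, is consistent with the "about 2 terabytes" quoted for a typical telco.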

By itself, longitude-latitude is hard to analyze. We need to know the granularity of the data and have a measure of proximity so that we can infer whether a person is at one location or another. If we have to count the number of people sitting in a building, we need simple measures of location that can be counted. Fortunately, a number of techniques have emerged for summarizing and counting location. Geohash is one such measure, and is available in the public domain.10 For a given location, specified as an address or a longitude-latitude pair, the geohash algorithm converts it into a code. The code reads left to right, and each byte (character) further divides the rectangular space represented by the code so far. A two-byte geohash represents an accuracy of ±630 kilometers, while an eight-byte geohash represents an accuracy of ±19 meters (see table 3.1). The geohash “9x” covers nearly all of Wyoming and northern Colorado, all the way from the eastern boundary facing Nebraska and Kansas to the western boundary facing Utah and Idaho, while the longer geohash “9xj6v0v” narrows the location down to the corner of Wazee and 20th Street, near the Coors Field stadium in downtown Denver.

The presence of a person in a specific location for a certain duration is considered a spacetime box, and it can be used to encode the hangout of an individual in a specific business or residential location for a specific time period. By converting longitude-latitude to geohash, I can count how many people were physically present in the vast area covered by Wyoming and northern Colorado on July 4 at 5 p.m., or do the same analysis for a street corner in downtown Denver. The first two bytes of these two examples are exactly the same, “9x,” because the street corner in downtown Denver is fully contained in the bigger box. However, if we are asked to compute the number of people in Colorado, we need to aggregate geohashes inside both “9x” and “9w,” as the state of Colorado is split between those two 2-byte geohashes.

Table 3.1 Geohash accuracy level11

geohash length   lat bits   lng bits   lat error (deg)   lng error (deg)   km error
1                2          3          ± 23              ± 23              ± 2500
2                5          5          ± 2.8             ± 5.6             ± 630
3                7          8          ± 0.70            ± 0.70            ± 78
4                10         10         ± 0.087           ± 0.18            ± 20
5                12         13         ± 0.022           ± 0.022           ± 2.4
6                15         15         ± 0.0027          ± 0.0055          ± 0.61
7                17         18         ± 0.00068         ± 0.00068         ± 0.076
8                20         20         ± 0.000085        ± 0.00017         ± 0.019
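To make the prefix behavior concrete, here is a minimal sketch of the standard public-domain geohash encoding, followed by a count of people per two-character box. The coordinates are approximate and the helper names (`geohash_encode`, `count_by_prefix`) are illustrative, not taken from any particular library.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lng, length=8):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    refine_lng = True  # bits alternate: longitude first, then latitude
    while len(chars) < length:
        if refine_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits, lng_lo = (bits << 1) | 1, mid
            else:
                bits, lng_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        refine_lng = not refine_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def count_by_prefix(points, prefix_len):
    """Count points per geohash box at the chosen prefix length."""
    return Counter(geohash_encode(lat, lng, prefix_len) for lat, lng in points)


# Approximate, illustrative coordinates for three phones.
points = [
    (39.7539, -104.9942),  # downtown Denver
    (41.14, -104.82),      # Cheyenne, Wyoming
    (38.83, -104.82),      # Colorado Springs (southern half of Colorado)
]
print(geohash_encode(*points[0]))   # an 8-character code starting with "9x"
print(count_by_prefix(points, 2))   # two phones in "9x", one in "9w"
```

Because a longer geohash always starts with the code of the box that contains it, truncating the string is all it takes to move from street-corner counts to regional counts.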

We expected a politician to be fairly mobile and irregular in his mobility patterns. As Jonas began to analyze Spitz’s mobility data, however, he found definite patterns. Spitz moved around a lot, but was still a creature of habit. A small number of hangouts dominated his locations: possibly his home, work, and social meeting places. Jonas used this data and other such studies to establish the identity of an individual based on the specific hangouts visited by that individual. This insight is very useful to a prepaid wireless service provider. Most subscribers in developing countries are prepaid and often change their phones and subscriber identity module (SIM) cards. Using hangouts, however, we can create signatures that accurately represent individuals, giving us the ability to identify them even if they change names, contact details, addresses, and phone numbers. The same signatures can be used to identify someone who switches brands regularly and to offer that person incentives to stay with a specific brand.
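One simple way to picture a hangout signature is as the set of spacetime boxes a phone occupies most often. The sketch below is an assumption for illustration, not the method Jonas used: it keeps the top three (geohash, part of day) boxes per SIM card and compares two signatures with Jaccard similarity. The geohash codes, visit counts, and function names are all hypothetical.

```python
from collections import Counter


def hangout_signature(visits, top_k=3):
    """Return the top_k most frequent (geohash, daypart) boxes as a set."""
    counts = Counter(visits)
    return {box for box, _ in counts.most_common(top_k)}


def signature_similarity(sig_a, sig_b):
    """Jaccard similarity: shared boxes divided by all boxes in either signature."""
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 0.0


# Hypothetical observations: one (geohash, daypart) tuple per recorded visit.
home, office, gym, cafe = "9xj5smj4", "9xj64gz2", "9xj3qp7e", "9xj7nt0k"
old_sim = ([(home, "night")] * 20 + [(office, "day")] * 15
           + [(gym, "evening")] * 5 + [(cafe, "day")] * 1)
new_sim = [(home, "night")] * 9 + [(office, "day")] * 7 + [(gym, "evening")] * 3
stranger = ([(office, "day")] * 10 + [("9xj2wv8d", "night")] * 12
            + [("9xj0bc5f", "evening")] * 4)

same_person = signature_similarity(hangout_signature(old_sim),
                                   hangout_signature(new_sim))
coworker = signature_similarity(hangout_signature(old_sim),
                                hangout_signature(stranger))
print(same_person, coworker)  # 1.0 for the matching SIMs, 0.2 for the coworker
```

A new SIM card whose signature closely matches an old one is a candidate for the same subscriber; in practice the threshold and the number of boxes kept would need tuning against real data.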

The discussions with Jonas motivated me to study mobility patterns. I was interested in mobility data for a group of people and to establish insights that can be used for segmentation. In my first job as a market researcher, we used surveys to collect customer data. Surveys can only be administered on a very small (and probably a fairly biased) sample, and are based on recollection of history. We can use cell phone data to collect history, as it happens. I found an app that was able to record my mobility patterns, and after three months of data collection, my pastime was to watch my past mobility patterns through my travels around the world. It was surreal, as if I had a video recording of my past three months of movement. Since it was my own data, I could see the data as well as the inaccuracies. For example, as I drove past a turnpike in Pennsylvania, the tracking was off by a couple of miles, and I could use map data to change the location. As we discovered later, this was a much-needed addition to raw location data, where context information such as street maps can be used to make the data more accurate.

As I work for IBM, I travel extensively, and I expected my mobility patterns to be as wild as water bubbles in a steaming kettle. To my surprise, I showed very definite, stable patterns and a small number of hangouts where I spent most of my time. Since my data was coming from my cell phone, it gave me accuracy at the geohash8 level, which is ± 19 meters in table 3.1. Nearly 15 out of 30 days in a month, I work at home and spend nearly all the time in a single geohash, with occasional (and predictable) movements to neighboring geohashes. The other 15 days, my mobility patterns took me from my home geohash to the airport geohash. At that point, the cell phone was turned off, and it was turned on again at one of my travel locations. While the distances between these geohashes were large, there were only a small number of travel destinations in that three-month period, representing the four clients I was working with. In each case, the patterns repeated for each city, representing the hotel, the office location, and the restaurants and bars I was regularly visiting. The analysis also showed me the limitations of using geohash as the mechanism. The IBM office at Armonk is covered by two geohashes, and while I spent most of the time in a single conference room, the occasional trip to the neighboring kitchen to get coffee was depicted as travel to the neighboring geohash 1.22 kilometers away. However, a clever algorithm that estimates the velocity of travel allowed me to remove these geohash edge traversals.

I assembled a team of researchers and collected location data from several wireless service providers, on the condition that the data would be anonymized before analysis and that none of the results would be shared at the atomic (subscriber) level. The wireless service providers were certainly interested in the research, but wanted to make sure the demonstrations would not be misused to identify personal information, such as cheating spouses and executives texting while driving (both interesting insights, which can be estimated using location data, although not with 100% certainty). This team of IBM researchers used the data to build a showcase of how the data could be used by marketers.12 The source data was accurate at the geohash5 level, which means it was accurate to ± 2.4 kilometers. However, with a little adjustment, Tommy Eunice was able to improve the accuracy to geohash6 (± 0.61 kilometer). Now we had the data to predict location to roughly the block level. Eunice led the data science work and was rapidly finding clusters of interesting patterns: people who worked from home, buddies who traveled together, and popular lunch places.

Aggregation and clustering are often-used techniques for mining location data. As I stated earlier, geohash coding provides a natural aggregation. A one-byte geohash represents a rectangle on a continental scale (most of the contiguous United States and all of Mexico fall within the geohash “9” and the adjacent geohash “d”). As we add more bytes to the geohash, it divides the bigger rectangle into smaller ones. Once we had the mobility patterns for a large number of subscribers in a city, Eunice was able to aggregate these mobility patterns into two important sets of aggregations. First, he found the popular hangouts by aggregating mobility into geohashes at different times of day: early morning, rush-hour commute time, late morning, lunch time, early afternoon, late afternoon, evening commute, dinner time, late night TV time, and so forth. Cell phones and their respective owners congregated at popular hangouts at each of these times, and we could easily spot residential communities, office areas, popular lunch locations, travel congestion spots, and fine dining places. Second, he started to find buddies who traveled together. Two or more cell phones were together in two, three, four, or five locations at the same time. As the number of places visited simultaneously increased, the data gave us confidence that these cell phones belonged to people who traveled together and were somehow related to each other. By analyzing the time of day when these cell phones are together, we can predict work, social, or family ties.
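Both aggregations can be sketched in a few lines. The example below is a simplified illustration rather than the team's actual pipeline: the daypart boundaries, the ping format `(subscriber, geohash, day, hour)`, the geohash codes, and the `min_places` threshold are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical daypart boundaries (start hour, label); the text names the
# dayparts but does not give the hours.
DAYPARTS = [(5, "early morning"), (7, "morning commute"), (9, "late morning"),
            (11, "lunch"), (14, "afternoon"), (17, "evening commute"),
            (19, "dinner"), (22, "late night")]


def daypart(hour):
    """Map an hour (0-23) to a daypart label; hours before 5 count as late night."""
    label = "late night"
    for start, name in DAYPARTS:
        if hour >= start:
            label = name
    return label


def popular_hangouts(pings):
    """Count distinct subscribers seen in each (daypart, geohash) box."""
    seen = defaultdict(set)
    for subscriber, geohash, _day, hour in pings:
        seen[(daypart(hour), geohash)].add(subscriber)
    return Counter({box: len(subs) for box, subs in seen.items()})


def find_buddies(pings, min_places=2):
    """Return pairs of subscribers seen together in at least min_places geohashes."""
    present = defaultdict(set)
    for subscriber, geohash, day, hour in pings:
        present[(geohash, day, hour)].add(subscriber)
    shared = defaultdict(set)
    for (geohash, _day, _hour), subs in present.items():
        for pair in combinations(sorted(subs), 2):
            shared[pair].add(geohash)
    return {pair: places for pair, places in shared.items()
            if len(places) >= min_places}


# Hypothetical anonymized pings: (subscriber, geohash6, day, hour).
pings = [
    ("A", "9xj64g", "d1", 12), ("B", "9xj64g", "d1", 12), ("C", "9xj64g", "d1", 12),
    ("A", "9xj5sm", "d1", 19), ("B", "9xj5sm", "d1", 19), ("C", "9xj3qp", "d1", 19),
]
print(popular_hangouts(pings).most_common(1))  # the lunch spot, with 3 subscribers
print(find_buddies(pings))                     # only A and B share two places
```

Raising `min_places` trades recall for confidence: a single shared lunch spot says little, while repeated co-location across several places suggests a real tie.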

Location data can be generated at different levels of accuracy. Typical cell tower data as described above is accurate to within 1-2 kilometers. However, wireless subscribers often turn on GPS to find directions on their cell phones. At the expense of a cell phone battery that may get rapidly consumed, the location data captured through GPS is at about geohash 8, with an accuracy of 20 meters. Similar accuracies can be achieved when we turn on Wi-Fi in a sports stadium. Wi-Fi location data has one more advantage: in addition to the fact that it does not drain the battery as quickly, it also improves our ability to connect to the Internet. A public gathering area like a stadium may offer free Wi-Fi to its audience, to ascertain their location and use it for a variety of operational and marketing purposes. For example, the stadium may advise a visitor on which gate to use for entry based on the visitor’s current location, seat location, the gate locations, and the lines at each gate. There are many interesting marketing opportunities once we have a person with a smartphone located in a stadium who is able to watch the television screen, interact with a small screen, and has a fair amount of interest in buying merchandise located around him/her. By combining more aggregated cell tower data with Wi-Fi data, we can now combine the behavioral characteristics of a shopper (couch potato vs. frequent mall shopper) with the shopping behavior of the shopper (time spent in each aisle or combinations of aisles visited).

A savvy data scientist can also use clustering algorithms to establish micro-segments by finding individuals who follow similar mobility patterns. Some of the micro-segments are based on people traveling to similar locations. However, more complex micro-segments are based on mobility patterns to diverse locations. For example, a statistical program can find active weekend golfers who wake up early on the weekend and show up at the golf course for a Saturday morning game. These golfers may be showing up at different golf courses around the globe, but they share the Saturday morning mobility pattern. This micro-segment is of enormous interest to golf companies, golf resorts, and the leisure travel industry.
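The golfer example can be approximated with a plain rule-based filter, which is a simpler stand-in for the clustering algorithms mentioned above. In the sketch below, the set of golf-course geohashes, the 6 a.m. to 10 a.m. window, and the two-Saturday threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical geohash6 boxes known to contain golf courses.
GOLF_COURSES = {"9xj3qp", "dr5rus"}


def saturday_morning_golfers(pings, courses=GOLF_COURSES, min_saturdays=2):
    """Find subscribers seen at any golf course on several Saturday mornings.

    pings: iterable of (subscriber, geohash, timestamp) tuples.
    """
    saturdays = defaultdict(set)
    for subscriber, geohash, ts in pings:
        is_saturday = ts.weekday() == 5          # Monday is 0, Saturday is 5
        is_morning = 6 <= ts.hour < 10
        if is_saturday and is_morning and geohash[:6] in courses:
            saturdays[subscriber].add(ts.date())
    return {sub for sub, days in saturdays.items() if len(days) >= min_saturdays}


# Hypothetical pings; July 6 and July 13, 2013 were Saturdays.
pings = [
    ("g1", "9xj3qp7e", datetime(2013, 7, 6, 7, 15)),    # Saturday, course 1
    ("g1", "dr5rusx2", datetime(2013, 7, 13, 8, 5)),    # Saturday, different course
    ("g2", "9xj3qp7e", datetime(2013, 7, 6, 7, 30)),    # only one Saturday
    ("g3", "9xj3qp7e", datetime(2013, 7, 7, 7, 30)),    # a Sunday
    ("g3", "9xj3qp7e", datetime(2013, 7, 13, 14, 0)),   # Saturday afternoon
]
print(saturday_morning_golfers(pings))  # {'g1'}
```

Note that the match is on the pattern rather than the place: subscriber g1 qualifies even though the two visits are to different courses.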

How about combining social media and location data? If there is a way to correlate social media data to mobility data, it can provide marketers with a valuable cross-correlation of customer profiles. To examine this, a team at IBM’s Global Solution Center collected two months of Twitter data and performed a series of unstructured analytics.13 They also had access to the mobility data described above. A marketer might like to find people who show specific travel patterns and who tweet about sports to offer them sports memorabilia at their next visit to the stadium. Finding people who like sports is not as easy, however, as tweets come with different words that describe sports. Once a data scientist has found a baseball fan, the next interesting challenge is to align it to the mobility data. Unless the correlation is done with full disclosure to the customer, this task may not be appreciated by the customer. In many cases, the stadium may have complimentary Wi-Fi and may trade free Internet access for wireless information and a Twitter handle, possibly offering a sports statistics app as a promotion. Now we have access to all the Twitter information from this Twitter handle to make guesses about the person behind the Twitter handle. We can also use the mobility data to find additional micro-segments. We can find their buddies and start offering products and promotions that appeal to the consumer or to his/her social circle. If someone tweets “Enjoying a Rockies game with my hubby” and is a frequent visitor to Rockies games who works in downtown Denver, the marketer can easily infer: “married,” “woman,” “enjoys baseball,” “frequent Rockies game visitor,” and “daily grinder,” and offer specific promotions that may appeal to this person.

I earlier discussed census data, which is the most comprehensive big data source. The mobility data discussed here is still a sample, as it represents only cell phone users, and only those who subscribe to a single wireless provider, unless we start combining data across wireless providers. However, cell phone data provides a marketer with an observed count of people who work at home in a geographical area. What if we instead picked a statistically significant sample of individuals from the same geographical area and asked them whether they work from home? The reported information would not be as accurate as the observed data, because the answers depend on the exact phrasing of the question and on how each respondent interprets it. On the other hand, the observed data from a wireless device is a good representation of the phone users of that specific wireless provider, but not necessarily of the rest of the population. That data would fail to represent my 90-year-old retiree dad, who does not carry a phone and stays at home most of the time.
