Currently, there are many large network datasets available. Indeed, they have become abundant, perhaps too abundant. While many are potentially useful, the value of any such dataset rests upon 1) how it was conceptualized, 2) how key concepts were operationalized, and 3) how the data were collected. If any part of the conceptualization, the operationalization, or the data collection is fatally flawed, then the resulting data are likely to have little or no value when understanding network phenomena is the goal. The risk seems particularly acute for data collected electronically and remotely. We add a fourth crucial requirement, that the boundary specification problem be solved either completely or very well. These four criteria imply, at a minimum, that data have to be collected and selected with great care.
Faced with this general selection problem, we used two different approaches for obtaining the data for our empirical analyses. One was to use already collected data that appeared to pass muster on all four of the data adequacy criteria we think are essential for making such choices. The other was to collect our own data.
The network datasets we used
The patent data we used, covering 30 years, were the cleanest network data among the citation networks we studied. While this seemed likely at the outset, as we studied the patent application process, we realized how clean these data really were. Similarly, the Supreme Court data appeared to be well defined and were extracted from the complete record of all Supreme Court decisions handed down in a period lasting more than 200 years. The SNA citation data were the least clean of the citation datasets as we noted in our discussion of cleaning them. For both the Supreme Court data and the SNA citation data, cleaning them moved these datasets to a form satisfying all four selection criteria.
At face value, the football network data are quite bizarre, given that the initial motivation for collecting them was whimsy and idle curiosity. The initial, seemingly naive, questions were: 1) Where did the players come from? 2) How did they get here (the English Premier League)? and 3) Where did they go? The effort of collecting these data was monumental, as described in Appendix A.4, because there was no single reliable source for extant data. Cleaning these data, once they had been assembled from a huge number of data sources, bordered on being a nightmare.
By focusing on England's top football league, the sampling design was rather restrictive. Yet the idiosyncratic characteristics of these data, coupled to our reading about football as it has been organized in England, are instructive for considering the interrelation of substance, method, and data. Always, the questions asked have to be questions answerable when using the data at hand. Our questions centered on the organization of football in England, especially regarding the importation of players from elsewhere. As a result, the complete data we assembled for a specific 15-year period were ideal for pursuing these questions after they were expressed in the form of explicit hypotheses.
The areal unit adjacency network data for the counties of the United States were available from a single source. They were the only 'easy' data to obtain and came from a highly trusted source.
Because these large networks represent very different substantive domains, considering them separately places an inherent limit on the generalizability of our results.
• Both the centrality and SNA literatures are only parts of a much larger literature. Some of the processes operative in this segment of the overall scientific literature may operate elsewhere, at least to some extent. Even for the narrowly defined centrality literature, we learned that our initial concern with SNA was overly restrictive. There are markedly different co-authorship cultures, differing levels of institutional control of publication by professional associations, and a heavier dominance by particular research centers in other scientific domains. These differences will limit simple generalizations from our results but the idea of including these 'external' influences has general relevance for understanding the multiple structures of scientific citation networks.
• The procedures of the USPTO seem uniform and broadly consistent with similar organizations assessing patent applications elsewhere. While it is likely that our results will generalize beyond the USA, this has to be checked.
• All nations have unique histories, implying that the creation and operation of the US Supreme Court may have been unique. Certainly, this court changed, sometimes dramatically, over the years, depending on economic, political, and social conditions plus the composition of the court. This, most likely, will limit simple generalizations to other such courts. However, the idea of paying attention to the historical contexts of national court systems will generalize.
• While our results regarding the network dynamics within the organization of football in England do have implications for the global organization of football, they cannot be generalized very far beyond England. Caution is also in order when generalizing beyond the period defined by the first 15 EPL seasons, as the dynamics have changed to some extent.
• Our results concerning the delineation of the spatial diversity of the USA may not generalize beyond the USA. But, again, the applicability of the methods we used to map this diversity will generalize.
Our broad contention is that the approaches we have taken to studying large networks in a variety of substantive contexts were valuable and can serve as a (partial) template for other large networks and substantive problems. In particular, our results show the general utility of incorporating the study of large networks into the study of substantive issues. The latter are enriched by including network analytic ideas and the former becomes more fruitful when informed by broader substantive concerns.
The supplementary datasets we used
We were compelled to assemble and assess attribute data for the units in the large networks that we considered, and data for the contexts within which the networks were located. The amount of such information varied across these networks.
• For the scientific citation networks, most of the attribute data on, for example, authors, journals, institutional locations, and keywords, were located in the original data source. However, as outlined in the appendix, there were major problems in obtaining clean versions of these items. Such data were important for constructing a set of two-mode networks. Studying the one-mode citation network is greatly enhanced by also considering the two-mode networks constructed from these attribute data.
• Constructing two-mode network data from attributes was also important for the patent citation network. Although the inherent data problems were far less severe than for the scientific citation network, the same generic issues were present. Also, the mixture of very general keywords and highly specific keywords demanded attention. We accepted the rather limited attribute data for patents in the original source. Beyond conceptualizing originality as heterogeneity for patents, we did not expand the number of attributes.
• The Supreme Court data did not come with attributes for the decisions, especially in the form of keywords, so we could not extract them. Instead, we looked at the opinions of decisions to discern the substantive content of decisions and the Constitutional principles informing them.
• Understanding the football network required the assembly of a considerable amount of supplementary information, as described in the data appendix. Principal among these efforts was the construction of historical memberships of clubs in the top league levels in England. We did the same for the top leagues in France, Italy, Germany, and Spain to assemble historical information on club participation rates and the numbers of titles won by clubs. To obtain a ranking of clubs over time, a variety of UEFA rankings of clubs and leagues were consulted. Additionally, information was assembled on the flows of players from around the world into these top leagues.
• To perform both the clustering with a relational constraint and the discriminant analyses for the spatial data, statistical data were extracted from the US Census Bureau's data on counties, along with such data from other sources. Matching these data with counties, even from the Census data, was not a straightforward task, as described in the data appendix.
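The construction of two-mode networks from attribute data, mentioned above for the scientific and patent citation networks, can be illustrated with a minimal sketch. It is not the procedure we used: the records, the field names (work, authors, keywords), and the helper functions are all hypothetical, and real attribute data would need the cleaning described in the appendix before this step.

```python
from collections import defaultdict

# Hypothetical records (illustrative only, not taken from any dataset in the text):
# each work carries lists of attribute values.
records = [
    {"work": "W1", "authors": ["A. Smith", "B. Jones"], "keywords": ["centrality"]},
    {"work": "W2", "authors": ["B. Jones"], "keywords": ["centrality", "blockmodel"]},
    {"work": "W3", "authors": ["C. Lee"], "keywords": ["blockmodel"]},
]


def two_mode_edges(records, attribute):
    """Return the set of (work, attribute value) pairs for one attribute.

    This is the edge list of a two-mode network, e.g. works x authors.
    """
    edges = set()
    for rec in records:
        for value in rec.get(attribute, []):
            edges.add((rec["work"], value.strip()))
    return edges


def project_onto_works(edges):
    """Project a two-mode network onto works.

    Two works are linked when they share at least one attribute value;
    the weight counts the number of shared values.
    """
    works_by_value = defaultdict(set)
    for work, value in edges:
        works_by_value[value].add(work)

    weights = defaultdict(int)
    for works in works_by_value.values():
        ordered = sorted(works)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                weights[(u, v)] += 1
    return dict(weights)


works_authors = two_mode_edges(records, "authors")
works_keywords = two_mode_edges(records, "keywords")
shared_keywords = project_onto_works(works_keywords)
```

Here works_authors and works_keywords are two-mode edge lists that can be studied alongside the one-mode citation network, and shared_keywords is one possible derived one-mode network. Very general keywords would link many works in such a projection, which is one reason the mixture of general and specific keywords demands attention.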