Data for large temporal networks

The datasets we use fall into two categories. The first contains data defined by the substantive interests outlined in Section 1.3; these data are used for the analyses and results presented in Chapters 4-9. The second (secondary) category[1] contains data used to illustrate the concepts and methods introduced in Chapters 2 and 3. We know that the term 'interesting' (when it is not used as cover for declining to express an opinion one way or another) is in the eye of the beholder. The distinction between main (primary) and secondary datasets is not intended as an evaluative statement about their relative merits, even though we do insist that the data considered here must be relevant to specific substantive concerns. The secondary datasets were assembled with different substantive interests and technical issues in mind.

The main datasets

We briefly describe the main datasets, each driven by substantive interests, and present their dimensions here. Appendix A contains detailed descriptions, including how the data were obtained and how they were processed into the form we use. Their initial[2] dimensions are provided in Table 1.1. Some[3] of these data are freely available at Pajek datasets (see vlado.fmf.uni-lj.si/pub/networks/data/sport/football.htm).

We have noticed that, for submitted manuscripts involving blockmodeling (see Doreian et al. (2005)), reviewers often demand full coverage of the community detection literature (created mainly by physicists), even when community detection ideas are tangential. While there is, at face value, some commonality between these approaches, the differences are quite marked and rather subtle. Such broad summaries are often distractions and, when the demands for them are complied with, can affect citation networks.

Table 1.1 Dimensions of the datasets used in Chapters 4-10.


The patent citation network (for patents citing earlier granted patents) features patents issued in the USA. The time period is relatively short, covering 1976-2006, a mere 30 years. However, this network is the largest dataset we consider for the substantive chapters, having more than 3.2 million patents linked by over 32 million citation links. The US Supreme Court citation network, in contrast, is much smaller with more than 30,000 units and over 216,000 citation links. However, it covers more than 200 years; this is by far the longest time span of all of the networks we study.

Beyond their sizes, there are intrinsic differences between these two networks. We noted in Section 1.3.1 the strict constraints on patent citations, in contrast to the freedom that Supreme Court (SC) Justices have in citing prior decisions. Many SC decisions neither made nor received citations. One practical consequence is that the relevant citation network has fewer units than the number listed in Table 1.1. However, the long time span and the depth (defined in Chapter 3) of the SC citation network created technical problems requiring attention before the general methods for acyclic networks presented in Chapter 3 could be used. The patent citation network was acyclic as received. This was not the case for the SC data: some decisions handed down by the same Court within a short period of time cite each other, a phenomenon also present in scientific citation data for publications appearing in the same year. Solutions for handling this problem are described in Chapter 3 and mobilized in the analyses of the Supreme Court network and of both the centrality and the broader social network analysis (SNA) literatures. In analyzing the centrality and SNA literatures, we also used some other bibliometric networks.
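The acyclicity problem can be made concrete with a small sketch. Decisions (or publications) that cite each other form strongly connected components with more than one unit, and locating those components is a natural first diagnostic. The Python code below is only an illustration under stated assumptions: the function name strong_components, the (citing, cited) arc format, and the toy labels A-D are hypothetical, and the code only finds the offending groups. It does not reproduce the solutions described in Chapter 3.

```python
from collections import defaultdict


def strong_components(arcs):
    """Return the strongly connected components of a directed graph.

    arcs: iterable of (citing, cited) pairs (a hypothetical input format).
    Uses Kosaraju's two-pass algorithm, written iteratively so that deep
    citation chains do not hit Python's recursion limit.
    """
    succ = defaultdict(list)  # citing -> list of cited units
    pred = defaultdict(list)  # cited -> list of citing units
    nodes = set()
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Pass 1: depth-first search on the graph, recording finish order.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(succ[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for prev in pred[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


# Toy data (made up): A and B cite each other, so they break acyclicity.
toy_arcs = [("A", "B"), ("B", "A"), ("B", "C"), ("D", "C")]
cyclic_groups = [sorted(c) for c in strong_components(toy_arcs) if len(c) > 1]
print(cyclic_groups)
```

A network is acyclic exactly when every component returned has a single unit and no unit cites itself. Self-citations (loops) are not flagged by the size test used here and would need a separate check.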

The football data that we constructed have a far more complex structure, featuring football players, football clubs, and countries. The dataset was defined by the 3749 football players playing in any of the first 15 seasons of the EPL. These players had 148 nationalities (dual citizenship is precluded when defining the nationality of players). Even though the player network is defined by these players, our primary interest centered on the clubs for which they played. More specifically, the ties of this network are the links between these clubs created by players moving between them. The number of clubs involved in their migrations to and from the EPL was 2355. These clubs are located in 152 countries. The total number of player moves between clubs was 40,246. We also used ancillary data (described in Appendix A.4) on clubs and player presence by nationality in other top European leagues for additional analyses.
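To illustrate how ties between clubs can arise from player moves, the following Python sketch counts moves between consecutive clubs in each player's career. It is a minimal illustration, not the procedure used to build the actual data (see Appendix A): the function name club_move_network, the career-list input format, the toy players and clubs, and the choice to ignore repeated consecutive entries for the same club are all assumptions.

```python
from collections import Counter


def club_move_network(careers):
    """Count player moves between clubs.

    careers: dict mapping a player to the chronological list of clubs
             played for (a hypothetical input format).
    Returns a Counter mapping (from_club, to_club) -> number of moves.
    """
    moves = Counter()
    for clubs in careers.values():
        # Pair each club with the next one in the player's career.
        for source, target in zip(clubs, clubs[1:]):
            if source != target:  # assumption: staying put is not a move
                moves[(source, target)] += 1
    return moves


# Toy careers (made up for illustration only).
toy_careers = {
    "player_1": ["Club A", "Club B", "Club C"],
    "player_2": ["Club A", "Club B"],
    "player_3": ["Club C", "Club A"],
}
toy_moves = club_move_network(toy_careers)
toy_clubs = {club for pair in toy_moves for club in pair}
print(len(toy_clubs), "clubs,", sum(toy_moves.values()), "moves")
```

In this representation the clubs are the units, each ordered pair of clubs is a directed tie, and the count is the tie weight. The totals reported in the text (2355 clubs and 40,246 moves) correspond to the number of distinct clubs and the sum of the weights.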

Our example of a large spatial network features all of the counties of the contiguous USA and was motivated by trying to reconcile two very different approaches to mapping social diversity in geographic space. The substantive problem has intrinsic interest, and the network we study is one of the larger substantively interesting networks we have located.

Table 1.2 Dimensions of the illustrative datasets.


  • [1] With a few exceptions, we maintain this distinction to have our substantively relevant results remain within single chapters.
  • [2] For some analyses, not all of these data were used. For other analyses various subsets were used and the results combined. (See Section 2.4 for a description of the 'divide and conquer' strategy that we employ for simplifying large networks.)
  • [3] The exception is the football data because we intend to explore them further before making them available publicly.
 