Secondary datasets

We report our results for the networks described in Section 1.5.1 extensively in the relevant chapters. To avoid repetition of results, we used data from other sources to illustrate our methods in Chapters 2 and 3. These data, the dimensions of which are listed in Table 1.2, came from the following sources.

The Edinburgh Associative Thesaurus (EAT)

The primary goal of the EAT project was to understand how words in the English language are coupled, by examining empirical 'associations' between words. The approach taken to obtain these associations was straightforward. Subjects were shown a word and asked to provide the first word that came to mind; the words were presented to each subject in batches. The presented words were regarded as stimuli and the words offered by subjects as responses. The links between stimuli and responses were established by the subjects alone: no rules were imposed on the nature (appropriateness) of the responses, so the pairings are simply empirical associations. To quantify these associations, the responses for each pair of words were aggregated across subjects. For example, frequent couplings included 'husband' in response to 'wife' and 'cheddar' in response to 'cheese'.
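The aggregation step can be sketched in a few lines of Python. This is a minimal illustration only, not the EAT project's own procedure: the record layout (one stimulus-response pair per subject answer), the function name `aggregate_associations`, and the sample answers are all assumptions made for the example.

```python
from collections import Counter


def aggregate_associations(responses):
    """Count how often each (stimulus, response) pair occurs across subjects."""
    return Counter(responses)


# Hypothetical records: one (stimulus, response) pair per subject answer.
raw_responses = [
    ("wife", "husband"), ("wife", "husband"), ("wife", "woman"),
    ("cheese", "cheddar"), ("cheese", "cheddar"), ("cheese", "mouse"),
]

# Each distinct pair becomes a weighted arc from stimulus to response.
weights = aggregate_associations(raw_responses)
for (stimulus, response), count in weights.most_common():
    print(f"{stimulus} -> {response}: {count}")
```

The resulting counts can be read as arc weights in a directed network whose vertices are words.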

The Edinburgh Thesaurus association norms were built outward from a nucleus set of words. Further associations were collected by expanding from the nucleus: the initial words were used to obtain responses, which, together with additional words, served as stimuli in turn. The EAT website reports that this cycle was repeated about three times. By then, the number of different responses had become so large that they could not be reused as stimuli in a systematic fashion. Data collection stopped after 8400 stimulus words had been used. The result was a total of 23,219 words in the Thesaurus network linked by 325,589 associations. The database has two files: an SR (stimulus-response) file and an RS (response-stimulus) file. These data are used in Section 2.5.

Each stimulus word was presented to 100 different subjects. The EAT website, which also describes the project, reports that the subjects were mostly undergraduates from many British universities, with ages ranging from 17 to 22 and a modal age of 19. The sex distribution was about 64 per cent male and 36 per cent female. The data were collected between June 1968 and May 1971. Any bias in the distribution of associations due to using university students as subjects has no relevance for our illustrative purposes regarding methods.

The NBER-United Nations Trade Data, 1962-2000

This network was used for illustrative purposes in Section 2.6. The network ties are trade exchanges (exports and imports) between nations. The data we used are for 1999: there are 174 vertices and 11,755 trade flows linking nations. The arc weights are trade values in thousands of US dollars. The data come from the NBER-United Nations Trade Data, 1962-2000; the complete dataset is available as the zipped Pajek project file listed in Table 1.2.

The Kansas Event Data (KEDS)

The data in this resource are the results of a 20-year project, originally based in the Department of Political Science at the University of Kansas. This project and its data were known as the Kansas Event Data System (KEDS), a label we use here. The project moved to the Department of Political Science at Pennsylvania State University in January 2010. It uses automated coding of English-language news reports from a variety of news sources to generate political event data focusing on the Middle East, the Balkans, and West Africa. These data were designed primarily for use in statistical early warning models that predict political change in these regions, with attention given to suggestions and policies for mediating conflicts. The units of this network are nations and organizations. The relations include ties between nations in the form of actions, described by verbs, that one nation directs towards another. These actions include visits, seeking information, issuing warnings, and expelling persons. The KEDS data for the Balkans are used in Section 2.2. The full dataset is also available at the KEDS website.

Krebs Internet industry partnerships

In 2002, Valdis Krebs collected a network of Internet industry partnerships. Two companies are linked by a line if they announced a joint venture, strategic alliance, or other partnership during the period 1998-2001. The companies are classified into three classes: 1 - content, 2 - infrastructure, 3 - commerce.

Data archives

There are a variety of sources containing many datasets, both large and small, but with a primary focus on large datasets. One is SNAP, the Stanford Large Network Dataset Collection maintained by Jure Leskovec, which is documented on its website. The topics covered include online social networks, communication networks, citation networks, and collaboration networks. There are also graphs of the internet and physical road systems. Signed networks are included in this archive. KONECT, the Koblenz Network Collection, contains large network datasets assembled at the Institute of Web Science and Technologies at the University of Koblenz-Landau. As stated on its website, 'KONECT contains over a hundred network datasets of various types, including directed, undirected, bipartite, weighted, unweighted, signed and rating networks. The networks of KONECT are collected from many diverse areas such as social networks, hyperlink networks, authorship networks, physical networks, interaction networks and communication networks.'

These archives of datasets are used in Section 2.3 when describing the distribution of network sizes in terms of the numbers of units and relational ties. Networks are sparse when they have roughly the same number of units and relational ties. More specifically, the number of ties is not orders of magnitude larger than the number of units. Sparsity is crucial for developing efficient methods for analyzing large networks.
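The comparison between numbers of units and ties can be made concrete with a short Python sketch. It uses the sizes quoted above for the EAT and the 1999 trade network. The function names are hypothetical, and the density formula assumes a directed network without loops, which is an assumption of this example and not a statement about how these datasets are coded.

```python
import math


def ties_per_unit(n_units, n_ties):
    """Average number of ties per unit."""
    return n_ties / n_units


def magnitude_gap(n_units, n_ties):
    """Orders of magnitude by which the ties outnumber the units."""
    return math.log10(n_ties / n_units)


def density(n_units, n_ties):
    """Share of possible arcs present (assumes a directed network, no loops)."""
    return n_ties / (n_units * (n_units - 1))


# (units, ties) as reported in the text above.
networks = {"EAT": (23219, 325589), "Trade 1999": (174, 11755)}

for name, (n, m) in networks.items():
    print(f"{name}: ties/unit = {ties_per_unit(n, m):.2f}, "
          f"log10 gap = {magnitude_gap(n, m):.2f}, "
          f"density = {density(n, m):.4f}")
```

A ratio of ties to units that stays within about one order of magnitude fits the informal description of sparsity given above. Where exactly to draw the line is a judgment call and is not fixed by the text.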
