Probabilistic flow in acyclic network
Another way to measure the importance of vertices and arcs in acyclic networks is the following. Let ^V" = (V, A) be a standardized acyclic network with source sgV and sink t e V. The vertex potential, p(v), is defined by:
• p(s) = 1 and
The flow on the arc (u, v) is defined as
It follows immediately that
which states that Kirchhoff's law holds for the flow cp.
The probabilistic interpretation of the flow cp has two parts:
1. The vertex potential of v, p(v), is equal to the probability that a random walk starting in the source s goes through the vertex v.
2. The arc flow on (u, v), cp(u, v), is equal to the probability that a random walk starting in the source s goes through the arc (u, v).
Note that the measures p and cp consider only 'users' (future) and do not depend on the past.
Nonacyclic citation networks
The problem with cycles is that even one cycle in a network creates an infinite number of trails between some units. The counts described in Section 3.2 are useless and cannot be used because all of them are massively inflated. This is a serious problem. Fortunately, there are two broad approaches for overcoming it. One focuses attention on the weights:
• introduce some 'aging' factor to force the total weight of all trails to converge to some finite value; or
• restrict the definition of weights to some finite subset of trails, for example, paths or geodesies.
Alas, this approach does not work well because two new problems arise: 1) How do we establish the right value for the aging (down-weighting) factor? 2) Is there an efficient algorithm to count the restricted trails? Most likely, using different down-weighting factors will lead to different orderings of weights which destroys a critical property of the weights established in Section 3.2. To our knowledge, regarding the second problem, no efficient algorithm exists. For our purposes, this approach is a dead end.
The second broad approach involves dealing with the cycles directly. Since a citation network is usually almost acyclic, this approach transforms a non-acyclic network into an acyclic network. There are at least three ways of doing this:
1. since the presence of cycles creates non-trivial strong components, identifying and removing them by shrinking (see Section 2.7);
2. strategically deleting some arcs also creates acyclic networks; and
3. using what we call a preprint transformation.
Figure 3.6 Preprint transformation.
Figure 3.6 illustrates these ideas. There are two cyclic configurations on the left. For the top 2-cycle, shrinking simply collapses u and v into a single vertex in a new network. Similarly, for the second configuration, shrinking collapses u, v, and w. If there is no need to preserve the identities of the vertices, this is a very simple way of forming an acyclic network. The other two operations are applicable when there is a need to preserve the identities of the vertices. Deleting arcs to create an acyclic network, while effective, is truly problematic because selecting arcs to delete is arbitrary and the choices change the meaning of specific citations. For the lower cyclic configuration in Figure 3.6, one option is the removal of (u, w), (w, v), and (v, u) which creates a different fragment than removing (u, v) and (w, v). We do not recommend this tactic.
The preprint transformation is based on the following idea: Each paper, u, from a strong component is duplicated with its 'preprint' version, u'. Under this transformation, the scientific productions, mainly papers, inside a strong component are treated as 'citing' their preprints, and the preprints cite the papers outside the strong component.
Table 3.1 presents some characteristics for some of the citation networks we have considered. The labels for the columns include n and m, the number of vertices and arcs respectively. The next three columns are the size of the largest weakly connected component (nc), the number of non-trivial weakly connected components (kc), and the depth of network, the minimum number of levels (h). The last three columns contain the numbers of strongly connected components (cyclic parts) of sizes 2, 3, and 4. In citation networks, large strong components are highly unlikely.10 This is made clear in Table 3.1, where there are no strong components of size greater than 4. Indeed, the presence of large strong components usually indicates at least one error in the data. In general, shrinking and using the preprint transformation are very useful ways of converting a non-acyclic citation network into an acyclic network in order to mobilize the arc counts introduced in the previous section.
An exception to this rule is a citation network of the high energy particle physics literature KDD Cup (2003) from arXiv. In it, different versions of the same paper are treated as separate units. Doing this leads to large strongly connected components. The idea of preprint transformation can be used in this case to eliminate cycles if necessary.
Table 3.1 Some citation network characteristics.