Importance Sampling: Geyer-Thompson Approach
The Geyer-Thompson (1992) algorithm takes one large sample of graphs for a provisional value of the parameter vector and then treats this sample as representing all possible graphs. Thus, even when we change the parameter vector, we use the same sample to calculate the sample average to determine whether z§ — z(xobs) = 0. Intuitively, this is much like the brute force approach, but we only need to use a small number of large samples, and there are effective minimization algorithms. A Geyer-Thompson algorithm is the main ingredient of the default method in the statnet program (Handcock, 2003; Handcock, et al., 2003; Hunter, Handcock, Butts, Goodreau, & Morris, 2008; Hunter & Handcock, 2006). To take into account that the sample of graphs is only a sample, and not the entire space of graphs, we need to take a weighted average of the statistics when we calculate z§. More specifically, if the sample is generated from the model Pg(x), the sample average fQ = w(1) f(x(1)) + w(2) f(x(2)) + ••• + w(M) f(x(M)), of a function f(x), with weights
is a good approximation to the true expected value Eg (f (X)) when M gets large and Q is reasonably close to Q. The closer Q is to Q, the closer the weights are to M, and if, for example, Q = Q and f (x) = z(x), then fQ (x) = zQ (x). If Q is not close to Q, there will be a high dispersion between the weights w(m), and the estimate, although still centered, will have a large standard deviation.
Typically, the chain generating x(1), x(2),..., x(M) is not allowed to burn in between successive sampling points. Instead, a smaller number of steps, k (the thinning), are discarded between sample points, thus giving a sample that is weakly correlated. This is generally acceptable as long as the autocorrelation of the chain is not too large, in which case the algorithm may perform badly. In statnet, the number of steps k is set by the argument “interval” to “ergm” (as described in Goodreau et al. (2008); the user may consider setting k according to the heuristic shown below in Section 12.4.2, which depends on the multiplication factor).
To solve the likelihood equation, a sequence of parameters Q (0) = Q, Q (1), Q (2),..., Q (G) is generated using a version of Newton-Raphson (or Fisher scoring). An updating step of Q(g) in the minimization algorithm is done through
Because M=1w(m)z(x(m)) is an approximation of EQ(z(X)), if Q (g—1 is
truly the MLE, then EQ(z(X)) — z(xobs) = 0, and the updating step will simply set Q(g) equal to Q (g—1).
The scaling matrix D(Q) in the algorithm scales the difference between the observed values and the expected values of the statistics. This is needed because statistics can differ in their sensitivity to changes in corresponding parameters, and the parameters will affect not only their “own” but also other statistics. The matrix D(Q) is set to the weighted sample covariance J2m w(m) z(x(m))z(x(m))T — [ m W(m) z(x(m))] [ m W(m) z(x(m)]T.
Typically, this algorithm has to be restarted several times, taking Q(G) as the provisional MLE Q, and then restarting the algorithm with Q = Q(G), (the number of restarts is set by the argument “maxit” to “ergm”). Of great importance is that Q is not too far from the true MLE, and in particular, the sample of statistics z (x(1)), z(x(2)),..., z(x(M)) must “cover” the observed statistics z(xoys).  Hence, finding a good starting point for this algorithm may be difficult.
-  Technically, we say that z(xobs) is in the relative interior of the convex hull onz(x(1)),...,z(x(M)) (Handcock, 2003).
-  The format of the algorithm here adheres to the settings in PNet. Some further modifications are found in Snijders (2002). In statnet, the thinning and the burn-in are setusing “interval” and “burn-in”, respectively. Other Robbins-Monro estimation settingsare set using the “control” argument to ergm.