Regularized Logistic Regression Classification
In general, the linear logistic regression model has the following classification function:
$$y = F(Xw + b)$$
Here $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of samples (subjects) and $p$ is the number of features. As all computations were performed within the MDT mask (193,586 voxels, roughly 200,000), $p$ is the number of voxels times the number of diffusion measures (five in this case). The parameters to be estimated are $w$ and $b$, where $w \in \mathbb{R}^p$ is a $p$-dimensional weight vector and $b \in \mathbb{R}$ is the intercept; $y \in \{-1, 1\}$ is the class label, in our case the subject's diagnosis. The regularized cost to be optimized is:
$$\hat{w} = \arg\min_{w, b} \; L(y, F(Xw + b)) + \lambda J(w)$$
where $L$ is the logistic loss function, $J(w)$ is the regularization term and $\lambda$ is the Lagrange multiplier. The intercept $b$ is not regularized and depends only on the loss function. We will simplify $L(y, F(Xw + b))$ to $L(w)$. In our case, the standard TV-L1 norm cost becomes:
where the first term is the LASSO or L1 cost, TV is the Total Variation penalty, $w_j$ is the weight map of microstructural measure $j$, $N_m$ (= 5 here) is the number of measures used and $\alpha$ is a constant that sets the desired tradeoff between the L1 and TV terms. The L1 penalty encourages sparsity in the model by setting most coefficients to zero. This penalty has limitations when there is a large number of parameters $p$ to fit and few observations $n$, as LASSO selects at most $n$ variables before it saturates. Further, if there is a group of highly correlated variables, LASSO tends to select one variable from the group and ignore the others. The TV penalty, on the other hand, is defined as the L1 norm of the image gradient; it allows for sharp edges and encourages the recovery of a smooth, piecewise constant weight map. This in turn makes the weight maps easier to interpret, as they may highlight clusters that resemble anatomical regions.
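To make the two penalty terms concrete, the following is a minimal sketch of how the combined penalty could be evaluated on a set of weight maps. It rests on several assumptions that are not taken from the text: that $\alpha$ weights the L1 term and $(1 - \alpha)$ the TV term, that TV is the anisotropic variant (sum of absolute finite differences along each axis), and that the maps are full rectangular volumes rather than values restricted to the MDT mask. The function names are hypothetical.

```python
import numpy as np

def total_variation(w_map):
    """Anisotropic TV (assumed variant): sum of absolute forward
    differences of the map along every spatial axis."""
    return sum(np.abs(np.diff(w_map, axis=ax)).sum() for ax in range(w_map.ndim))

def tv_l1_penalty(w_maps, alpha):
    """Assumed form: alpha * sum_j ||w_j||_1 + (1 - alpha) * sum_j TV(w_j),
    where w_maps holds one weight map per diffusion measure."""
    l1 = sum(np.abs(w_j).sum() for w_j in w_maps)
    tv = sum(total_variation(w_j) for w_j in w_maps)
    return alpha * l1 + (1.0 - alpha) * tv

# A piecewise constant map with one sharp edge has a much lower TV
# than a noisy map, even though both are non-sparse.
step_map = np.zeros((4, 4, 4))
step_map[2:] = 1.0
noisy_map = np.random.default_rng(0).standard_normal((4, 4, 4))
print(total_variation(step_map), total_variation(noisy_map))
```

With $\alpha = 1$ the sketch reduces to a pure LASSO penalty, and with $\alpha = 0$ to a pure TV penalty, which is the tradeoff described above.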
We used the FISTA procedure to find $\hat{w}$ (the estimated value of $w$). As the L1 terms are not smooth, naive gradient descent may not always converge to a good minimum. For this convex optimization, smooth and non-smooth terms are considered separately. The logistic loss and the logistic gradient are the smooth terms:
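The split between smooth and non-smooth terms can be illustrated with a simplified sketch. It is not the procedure used here: it handles only the L1 term (whose proximal operator is soft-thresholding) and omits the TV term, whose proximal operator needs an inner iterative solver. The mean-normalized logistic loss, the fixed step size derived from a Lipschitz bound, and all function names are assumptions made for illustration.

```python
import numpy as np

def logistic_loss(X, y, w, b):
    """Mean logistic loss for labels y in {-1, +1} (smooth term)."""
    margins = y * (X @ w + b)
    return np.mean(np.logaddexp(0.0, -margins))

def logistic_grad(X, y, w, b):
    """Gradient of the mean logistic loss with respect to w and b."""
    margins = y * (X @ w + b)
    # -y * sigmoid(-margins), written with tanh to avoid overflow in exp
    coef = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return X.T @ coef / len(y), coef.mean()

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1 (non-smooth term)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def fista_l1_logistic(X, y, lam, n_iter=300):
    """FISTA for logistic loss + lam * ||w||_1; the intercept is not penalized."""
    n, p = X.shape
    # Lipschitz bound of the gradient: ||[X, 1]||_2^2 / (4 n); step = 1 / bound
    X_aug = np.hstack([X, np.ones((n, 1))])
    step = 4.0 * n / np.linalg.norm(X_aug, 2) ** 2
    w, b = np.zeros(p), 0.0
    v_w, v_b, t = w.copy(), b, 1.0
    for _ in range(n_iter):
        g_w, g_b = logistic_grad(X, y, v_w, v_b)
        # Gradient step on the smooth term, then proximal step on the L1 term
        w_new = soft_threshold(v_w - step * g_w, step * lam)
        b_new = v_b - step * g_b
        # Nesterov-style momentum update that distinguishes FISTA from ISTA
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        momentum = (t - 1.0) / t_new
        v_w = w_new + momentum * (w_new - w)
        v_b = b_new + momentum * (b_new - b)
        w, b, t = w_new, b_new, t_new
    return w, b
```

Calling `fista_l1_logistic(X, y, lam=0.01)` on a small feature matrix returns a sparse weight vector and an intercept; extending the sketch to TV-L1 would mean replacing `soft_threshold` with a proximal operator for the combined penalty.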
We used an eightfold nested cross-validation to tune the parameters $\alpha$ and $\lambda$.
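The structure of a nested cross-validation can be sketched as below. This is an illustrative outline under stated assumptions, not the exact protocol: the inner loop is assumed to be eightfold as well, folds are drawn by simple shuffling without stratifying by diagnosis, and `fit` and `score` are hypothetical callables supplied by the user (for instance, a TV-L1 solver and classification accuracy).

```python
import numpy as np
from itertools import product

def k_fold_indices(n_samples, n_folds, rng):
    """Shuffle sample indices and split them into n_folds nearly equal parts."""
    return np.array_split(rng.permutation(n_samples), n_folds)

def nested_cv(X, y, fit, score, alphas, lambdas, n_outer=8, n_inner=8, seed=0):
    """Outer folds estimate performance; inner folds pick (alpha, lambda).

    fit(X, y, alpha, lam) -> model and score(model, X, y) -> float
    are hypothetical user-supplied callables (higher score is better).
    """
    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(y))
    outer_scores, chosen = [], []
    for test_idx in k_fold_indices(len(y), n_outer, rng):
        train_idx = np.setdiff1d(all_idx, test_idx)
        inner_folds = k_fold_indices(len(train_idx), n_inner, rng)
        best_params, best_score = None, -np.inf
        for alpha, lam in product(alphas, lambdas):
            fold_scores = []
            for val_pos in inner_folds:
                val_idx = train_idx[val_pos]
                fit_idx = np.setdiff1d(train_idx, val_idx)
                model = fit(X[fit_idx], y[fit_idx], alpha, lam)
                fold_scores.append(score(model, X[val_idx], y[val_idx]))
            mean_score = np.mean(fold_scores)
            if mean_score > best_score:
                best_params, best_score = (alpha, lam), mean_score
        # Refit on the full outer-training set with the selected parameters,
        # then evaluate once on the held-out outer fold.
        model = fit(X[train_idx], y[train_idx], *best_params)
        outer_scores.append(score(model, X[test_idx], y[test_idx]))
        chosen.append(best_params)
    return np.array(outer_scores), chosen
```

The key property is that each outer test fold is never seen while $\alpha$ and $\lambda$ are being selected, so the outer scores are not biased by the parameter search.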