# Biomarker Study of Metabolomics: Semiparametric Transformation Models

HIV-positive individuals who have been on long-term ART appear to be at an increased risk of cardiometabolic diseases, including diabetes, compared to HIV-negative individuals. Plasma levels of amino acids and other small molecules reflective of impaired energy metabolism, such as acylcarnitines and organic acids, were measured with mass spectrometry to provide a detailed metabolic profile for 70 nondiabetic, HIV-infected persons who were on efavirenz, tenofovir, and emtricitabine with an undetectable viral load for over 2 years (Koethe et al. 2016).

There is interest in assessing associations between these biomarkers and demographic or clinical variables. In this section, we focus on modeling one specific biomarker, 2-hydroxybutyric acid, which is thought to be an early indicator of insulin resistance in nondiabetic persons; elevated serum 2-hydroxybutyric acid has been seen to predict worsening glucose tolerance. The distribution of 2-hydroxybutyric acid is fairly right skewed, ranging from 13 to 151 μM with a median of 34 μM in our data set. Even after a log transformation, the distribution remains slightly right skewed with some outlying values. Predictor variables for our model include age, sex, race, body mass index (BMI), CD4 cell count, smoking status, and ART duration (log transformed).

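Because the biomarker outcome is skewed, we favor fitting a semiparametric transformation model, specifically $Y = T(\beta Z + \varepsilon)$, where $T(\cdot)$ is an unspecified monotonically increasing transformation and $\varepsilon$ is a random error with a specified parametric distribution $F_{\varepsilon}$ (Zeng and Lin 2007). The conditional distribution of $Y$ given $Z$ is therefore

$$F_{Y|Z}(y) = P\{T(\beta Z + \varepsilon) \le y \mid Z\} = P\{\varepsilon \le T^{-1}(y) - \beta Z \mid Z\} = F_{\varepsilon}\{T^{-1}(y) - \beta Z\}.$$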

Hence, the semiparametric transformation model can be written in a manner similar to that of the ordinal cumulative probability model, $g[F_{Y|Z}(y)] = \alpha(y) - \beta Z$, with the link function $g(\cdot) = F_{\varepsilon}^{-1}(\cdot)$ and the intercept $\alpha(y) = T^{-1}(y)$. Harrell (2015) has proposed using this fact to estimate parameters from the semiparametric transformation model with continuous data by maximizing an approximated multinomial likelihood, and he has implemented this procedure, denoted as orm, in R Statistical Software as part of his popular "rms" package.
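The relation between the link-transformed conditional distribution and the linear predictor can be checked numerically. The following is a minimal simulation sketch, not the orm procedure itself. It assumes, purely for illustration, a logistic error (so that the link is the logit), the transformation T = exp (so that the intercept function is log y), a single covariate, and a hypothetical coefficient of 0.8; none of these choices come from the biomarker analysis.

```python
import math
import random


def logit(p):
    """Inverse of the standard logistic CDF, used here as the link g."""
    return math.log(p / (1.0 - p))


def simulate_outcomes(z, beta, n, rng):
    """Draw n values of Y = T(beta * z + eps) with T = exp and logistic eps."""
    outcomes = []
    for _ in range(n):
        # Keep u strictly inside (0, 1) so the logit is always finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        eps = logit(u)  # inverse-CDF draw from the standard logistic
        outcomes.append(math.exp(beta * z + eps))
    return outcomes


def empirical_cdf(sample, y):
    """Proportion of the sample at or below y."""
    return sum(value <= y for value in sample) / len(sample)


rng = random.Random(2024)
beta = 0.8   # hypothetical coefficient, chosen only for illustration
n = 100_000
y_given_z0 = simulate_outcomes(0.0, beta, n, rng)
y_given_z1 = simulate_outcomes(1.0, beta, n, rng)

shifts = []
for y in (0.5, 1.0, 2.0, 4.0):
    link_z0 = logit(empirical_cdf(y_given_z0, y))
    link_z1 = logit(empirical_cdf(y_given_z1, y))
    shifts.append(link_z0 - link_z1)
    # link_z0 estimates alpha(y) = log(y); the gap estimates beta * (1 - 0).
    print(f"y={y:4.1f}  alpha(y)={math.log(y):6.3f}  "
          f"g[F(y|Z=0)]={link_z0:6.3f}  gap={link_z0 - link_z1:5.3f}")
```

On the link scale, the two conditional distribution functions should differ by roughly the same amount at every value of y, which is the parallel structure shared with ordinal cumulative probability models.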

In our biomarker analysis, we fit three models of 2-hydroxybutyric acid (denoted as $Y$) on covariates: (1) a multivariable linear regression model with $Y$ untransformed; (2) a multivariable linear model with $Y$ log transformed; and (3) a semiparametric transformation model fitted using orm with the link function $g(\cdot) = \log(-\log(\cdot))$, which corresponds to assuming that $F_{\varepsilon}$ is an extreme value distribution. Figure 9.2 shows quantile-quantile (Q-Q) plots of PSRs from each of these models against quantiles of a uniform (-1, 1) distribution. If the model is correctly specified, the PSRs should be approximately uniformly distributed on (-1, 1). PSRs from the normal linear model are clearly far from uniform. PSRs from the linear model after log transforming the biomarker are closer to uniform, and PSRs from the flexible semiparametric transformation model are closer still.
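To illustrate the kind of comparison made in the Q-Q plots, here is a minimal sketch using simulated data rather than the study data. It assumes the usual definition of the PSR for a continuous outcome, P(Y* < y) - P(Y* > y) under the fitted distribution, which reduces to 2F(y) - 1. The data-generating values, the single covariate, and the helper names `fit_simple_ols`, `psr_normal_linear`, and `qq_max_gap` are all hypothetical. The semiparametric fit is omitted because it requires an ordinal regression routine such as orm.

```python
import math
import random
from statistics import NormalDist, mean


def fit_simple_ols(x, y):
    """Least-squares fit of y on one covariate; returns intercept, slope, sigma."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, math.sqrt(rss / (len(y) - 2))


def psr_normal_linear(x, y):
    """PSR = 2 * F(y) - 1, with F the fitted normal conditional distribution."""
    intercept, slope, sigma = fit_simple_ols(x, y)
    std_normal = NormalDist()
    return [2.0 * std_normal.cdf((yi - intercept - slope * xi) / sigma) - 1.0
            for xi, yi in zip(x, y)]


def qq_max_gap(psr):
    """Largest vertical gap between sorted PSRs and uniform(-1, 1) quantiles."""
    n = len(psr)
    return max(abs(r - (2.0 * (i + 0.5) / n - 1.0))
               for i, r in enumerate(sorted(psr)))


# Simulated right-skewed outcome (log-normal); all values are made up.
rng = random.Random(7)
x = [rng.uniform(0.0, 1.0) for _ in range(2000)]
y = [math.exp(3.5 + 0.4 * xi + rng.gauss(0.0, 0.8)) for xi in x]

psr_raw = psr_normal_linear(x, y)                          # model (1)
psr_log = psr_normal_linear(x, [math.log(v) for v in y])   # model (2)
gap_raw, gap_log = qq_max_gap(psr_raw), qq_max_gap(psr_log)
print(f"max Q-Q gap, untransformed: {gap_raw:.3f}; log transformed: {gap_log:.3f}")
```

In this simulation the log-scale model is correctly specified by construction, so its PSRs should track the uniform quantiles much more closely than those from the untransformed model.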

In this analysis, we could also have used OMERs to uncover lack of fit for the linear models. However, the OMER is difficult to calculate for the semiparametric transformation model because it requires computing the conditional expectation of the outcome. Even if we estimated the conditional expectation for every observed covariate combination, the OMERs would still be skewed and of limited use for model diagnostics, because the semiparametric transformation model assumes neither symmetry nor equal variance of the OMERs. In contrast, the PSR is easily and naturally calculated from the fitted semiparametric transformation model and requires no assumptions beyond those of the original model. Because the PSR is on the same scale for all three models, it is useful for comparing their fit.

Figure 9.3 shows residual-by-predictor plots for continuous covariates from the semiparametric transformation model using PSRs. There is some evidence of nonlinear relationships (top panel). The model was refit expanding age, BMI, and log-transformed ART duration using restricted cubic splines with 3 knots.
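As a sketch of what expanding a covariate with a 3-knot restricted cubic spline involves, the code below builds the basis using the truncated power form given by Harrell (2015), scaled by the squared distance between the outer knots. The knot locations (approximately the 10th, 50th, and 90th percentiles) are an assumption, since the text does not state where the knots were placed, and the ages are simulated, not study data. The function name `rcs_basis_3knots` is hypothetical.

```python
import random
from statistics import quantiles


def rcs_basis_3knots(x, knots):
    """Return (linear term, nonlinear term) of a 3-knot restricted cubic spline."""
    t1, t2, t3 = knots

    def pos_cubed(u):
        # Truncated cubic: zero to the left of the knot.
        return max(u, 0.0) ** 3

    scale = (t3 - t1) ** 2  # keeps the nonlinear term on the scale of x
    nonlinear = (pos_cubed(x - t1)
                 - pos_cubed(x - t2) * (t3 - t1) / (t3 - t2)
                 + pos_cubed(x - t3) * (t2 - t1) / (t3 - t2)) / scale
    return x, nonlinear


# Simulated ages for 70 hypothetical participants.
rng = random.Random(1)
ages = [rng.uniform(25.0, 65.0) for _ in range(70)]

# Assumed knot placement: roughly the 10th, 50th, and 90th percentiles.
deciles = quantiles(ages, n=10)
knots = (deciles[0], deciles[4], deciles[8])

# Two columns per participant replace the single linear age term.
design = [rcs_basis_3knots(age, knots) for age in ages]
print("knots:", [round(k, 1) for k in knots])
print("first rows:", [(round(a, 1), round(b, 3)) for a, b in design[:3]])
```

With 3 knots, each continuous covariate contributes one linear and one nonlinear column, so expanding age, BMI, and log-transformed ART duration in this way would add three parameters in total. The fitted curve is constrained to be linear beyond the outer knots.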

FIGURE 9.2

Q-Q plots of PSRs from (a) linear, (b) linear after log transformation, and (c) semiparametric transformation models of 2-hydroxybutyric acid compared to a uniform (-1, 1) distribution.

FIGURE 9.3

Residual-by-predictor plots. The smoothed relationship is shown using lowess curves. The top panel shows PSRs versus continuous predictors from an initial model fit without splines. The bottom panel shows PSRs versus continuous predictors after expanding age, BMI, and log-transformed ART duration using restricted cubic splines with 3 knots.

**186 Quantitative Methods for HIV/AIDS Research**

Residual-by-predictor plots from the refit model are given in the bottom panel of Figure 9.3. There is no longer evidence of nonlinear residual relationships. A likelihood ratio test indicates that the second model, with the nonlinear terms, fits better (*p* = 0.010); despite the added model complexity, the Akaike information criterion (AIC) for the model with the nonlinear terms is lower than that for the model without them (599 vs. 605).
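The two comparisons used here, the likelihood ratio test and the AIC, can be sketched as follows. The log-likelihoods and parameter counts below are hypothetical placeholders, not values from the fitted models. The difference of 3 degrees of freedom assumes one nonlinear term for each of the three splined covariates. The chi-square tail probability is computed with a closed form that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-half) * sum(half ** k / math.factorial(k)
                                     for k in range(df // 2))
    # Odd df: normal tail plus a finite correction sum.
    tail = math.erfc(math.sqrt(half))
    correction = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                     for k in range(1, (df - 1) // 2 + 1))
    return tail + math.exp(-half) * correction


def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Return the LR statistic and its chi-square p-value."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf(statistic, df_diff)


def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * n_params


# Hypothetical values for illustration only.
loglik_linear, params_linear = -250.0, 10
loglik_spline, params_spline = -244.0, 13

lr_stat, p_value = likelihood_ratio_test(
    loglik_linear, loglik_spline, params_spline - params_linear)
aic_linear = aic(loglik_linear, params_linear)
aic_spline = aic(loglik_spline, params_spline)
print(f"LR statistic = {lr_stat:.2f}, p = {p_value:.4f}")
print(f"AIC without splines = {aic_linear:.1f}, with splines = {aic_spline:.1f}")
```

The two criteria are linked: the AIC difference equals the likelihood ratio statistic minus twice the number of added parameters, so the spline model has the lower AIC only when the statistic exceeds that penalty.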