MLR-Based QSAR Models for Predicting Inhibitory Activity of Reverse Transcriptase by HEPT Derivatives using GETAWAY Descriptors

In this study, quantitative structure-activity relationship (QSAR) models for nonnucleoside reverse transcriptase inhibitors based on 1-[(2-hydroxyethoxy)-methyl]-6(phenylthio)thymine (HEPT) derivatives were generated. The structures of the compounds and their activities were obtained from the literature. The data set were divided into two sets: training set (N=91) and validating set (N=10). All 3-D structures of these inhibitors were optimized by semi-empirical method, AM1 prior to calculations of 3-D molecular descriptors, GETAWAY. Multiple linear regression (MLR) using stepwise method was applied to determined significant descriptors. Out of 197 GETAWAY descriptors, 4-14 molecular descriptors have significant relationships with the activities (expressed as log (1/EC50)) of HEPT. The MLR method generated 14 models. The predictive power of these models were evaluated internally by applying the following statistical parameters for the training set and test set: root-mean-square error for prediction (RMSE), correlation coefficient (R), squared correlation coefficient (R2), adjusted squared correlation coefficient (Radj), difference between R2 and Radj (R2 – Radj), squared cross-validation correlation coefficient (Q2). External validation was performed by employing Golbraikh and Tropsha criteria. Moreover, residual analysis was performed. Internal validation of Model XX (N = 91) revealed that it has the highest predictive power (RMSE = 0.4288, R = 0.9393, R2 = 0.882, Radj = 0.8620, R2 Radj = 0.0203, Q2 = 0.8317). However, external validation (using the validating set, N=10) showed that Model XII has the highest predictive power (R2 = 0.961, R0 = 0.9565, k = 0.8648, k’ = 0.9800, [R2 R0] = 0.0066, [R2 R0] / R2 = 0.0069, Rpred = 0.9481) based on Golbraikh and Tropsha criteria. Residual analysis confirmed that both models are valid.

(560 amino acids) and p51 subunits (440 amino acids) (Kohlstaedt et al., 1992).Because of this process, RT has become target for anti-HIV drug discovery.The strategy is to inhibit the RT using candidate molecules.Example of these molecules are 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio) thymine (HEPT) derivatives (De Clercq, New, 2001;Baba et al.;1989, Miyasaka et al., 1989;;Tiwari et al., 2006).HEPT are non-nucleoside inhibitors that block RT by binding to a compartment adjacent to the catalytic site of the enzyme and thereby interrupt the conformation of several amino acids crucial for proper RT function (De Clercq, 1998).The binding site contains five aromatic , six hydrophobic  and five hydrophilic  amino acids that belong to the p66 subunit and two amino acids  belonging to the p51 subunit (Kohlstaedt et al., 1992).However, drug development is timeconsuming and expensive.Thus, it is necessary to develop a method that can predict efficacy of the candidate drug prior to laboratory or clinical stage to reduce drug development cost (Kola and Landis, 2004).
Quantitative structure-activity relationship (QSAR) models have been used intensively in predicting biological activities, chemical properties and toxicities of many organic molecules (Bazoui et al., 2002;Duda-Seiman et al., 2007;Verma et al., 2010).QSAR models are generated by finding significant relationships between activities/properties and molecular descriptors.Furthermore, QSARs studies are vital part of drug development because they shorten the amount of time to find a better lead compounds by allowing us to predict activities/toxicities of these compounds using molecular descriptors without undergoing tedious animal and laboratory testing (Verma et al., 2010).Currently, there are more than 3000 molecular descriptors that are used in QSAR studies (Todeschini and Consonni, 2009).These are categorized into different types: 0D, 1D, 2D, and 3D molecular descriptors.In this work, GETAWAY descriptors were used for QSAR modelling.
GETAWAY stands for GEometry, Topology, and Atom Weights AssemblY.V. Consonni, et al. (Consonni, Theory, 2002) proposed this set of novel molecular descriptors based on a leverage matrix, called Molecular Influence Matrix (MIM).GETAWAY is a set of 3D molecular descriptors that matches 3D molecular geometry provided by MIM and atom relatedness by topology with chemical information by different atomic weighting schemes such as unit weights, mass, polarizability, electronegativity.GETAWAY descriptors have low or no degeneracy, which avoids getting the same value for a descriptor for more than one compound sharing the same structural features which are often observed in topological descriptors.The H is defined by where M is the molecular matrix which represents the location of each atom (A) of an optimized molecule using Cartesian coordinates x, y, z.The resultant A x A matrix is invariant to rotation of the molecular coordinates.The superscript T refers to the transposed of the matrix.The diagonal elements hij are leverages and represent the influence of each atom in determining the shape of the molecule.Each off diagonal element hij, represents the degree of accessibility of the j'th atom to interactions with the i'th atom.GETAWAY descriptors are classified into two types: H -GETAWAY and R -GETAWAY.We illustrate here some of the descriptors derived by V. Consonni, et al. (Consonni, Theory, 2002).
H -GETAWAY descriptors are calculated based on the information provided by the MIM.The H -GETAWAY autocorrelation descriptors use the geometric information inherent in the leverage values and atomic weighting schemes to make a new set of descriptors.For example, the HATS indices are defined using a weight vector w' for each atom in a molecule: Therefore, (Equation 3) and the HATSk is written as 4) where dij is the topological distance between the i'th and j'th atoms and k = 1, 2,...,d with Hence, the HATS total index is (Equation 5) A second set, called R -GETAWAY, combines information with geometric interatomic distances in the molecule.R and R+ descriptors are obtained from the leverage/geometry matrix.The matrix R is defined as 6) where rij is the geometric distance between the two atoms.Also, the R -GETAWAY autocorrelational indices are defined analogously to the H -GETAWAY autocorrelational indices.Hence, we have the w weighted k'th order autocorrelation index (Rk) The R total index is defined as (Equation 8) V. Consonni et al. (2002) evaluated extensively the prediction ability of GETAWAY descriptors by analyzing the regressions of these descriptors for selected properties of some reference compound classes.Also, they studied the general performance of these descriptors in QSAR/QSPR with respect to other well-known sets of molecular descriptors.They concluded that GETAWAY descriptors provide more predictive models when the property to be modeled depends strictly on the 3D features of the molecule, e.g.biological activities.Furthermore, when the GETAWAY descriptors are added to other types of descriptors the predictive ability of the model improved.
Multiple linear regressions (MLR) have been widely used method in QSAR research (Verma et al., 2010).MLR has an advantage for its simplicity and ease of interpretation because the model assumes a linear relationship between activity and molecular descriptors.The MLR model is expressed as 9) where β0 is model constant, x1,...,xk are molecular descriptors with their corresponding coefficients β1,…βk.These coefficients are calculated by least-squares method, which minimizes the sum of squared residuals.The magnitudes of the coefficients indicate the extent of influence of the corresponding descriptors on the activity (y).
Several QSAR studies involving HEPT derivatives as inhibitors of HIV-RT using 0D, 1D, 2D and 3D molecular descriptors has been reported in the literature.Castro and colleagues (2005) developed QSAR for HEPT using Morgan Extended Connectivity in Graph of Atomic Orbitals (GAO).The best model was based on correlation weights of Morgan extended connectivity of first order in the GAO (N = 59, R 2 = 0.91, s = 0.41, F = 577).Application of the model to a test set (N= 20) gave a good prediction power based on coefficient of determination (R 2 = 0.91, s = 0.42, F = 183).
In the recent study of D. Ivan, et al. (2013), the best MLR-based QSAR they obtained has five descriptors of different types (topological, connectivity index, and GETAWAY).The model has high coefficient of determination (N = 91, R 2 = 0.826, s = 0.682, F = 84.4) and satisfactorily predicted the activities of HEPT derivatives in the test set (N =20, R 2 =0.675,R 2 pred = 0.663).
The objective of this research was to generate QSAR models using the GETAWAY descriptors to predict the inhibitory activities of HEPT derivatives against RT.We performed extensive statistical analyses to validate the models.Furthermore, we decided to use these descriptors because they represent 3D molecular chemical information of the molecules under study.

METHODOLOGY
We used the data set of 101 HEPTs (Figure 1) with inhibitory activities against HIV-RT reported in the literature (Table 1).The data were divided into two sets: training set (N = 91) and test set (N = 10).All 3-D structures of these HEPTs were optimized by MM+ followed by semi -empirical method, AM1, applying the Polak -Rabiere conjugate gradient with restricted Hartree -Fock (RHF) spin pairing, 0.001 convergence limit in vacuo and RMS gradient of 0.001 kcal/A mol.AM1 method allows a large number of structures to be optimized in a shorter timeframe, and for calculations involving large molecular systems.Also, AM1 has been applied for optimization of HEPT structures in the literature (Hannongbua, Structure, 1996;Zarei and Atabati, 2009;Ivan, et al., 2013).All calculations were performed using Hyperchem (Hypercube Inc.).After geometry optimization, 197 GETAWAY were calculated using Dragon software (Talete srl).
Multiple linear regression (MLR) using stepwise method was applied to determined significant descriptors.Internal validation of these models was evaluated by applying the following statistical parameters for the training set and test set: root -mean -square error for prediction 10) 11) 12) 13) where: For external validation, the following Golbraikh-Tropsha criteria (Golbraikh et al., 2003;Golbraikh and Tropsha, 2002) were evaluated to determine the predictive ability of the MLRbased QSAR model applied to a test set (N =10).
The predictive power of QSAR models were also tested by predictive parameter (R 2 pred).For a predictive QSAR model, the value of R 2 pred should be higher than 0.5 (Equation 14) In addition, we performed correlation analysis among the independent variables (e.g.variation inflation factor (VIF), and tolerance) (Kutner et al., 2004).Paired t-test was employed to determine the significant difference between the experimental and predicted inhibitory activities by the models.All statistical analyses were performed using SPSS software (SPSS Inc.).

RESULTS AND DISCUSSION
QSAR analysis was carried out in order to explore properties, which might be responsible for the interaction of molecules with HIV -RT receptors.The structures of molecules, which were employed in this study, are shown in Table 1.MLR using stepwise method was employed to select significant descriptors and to generate QSAR models.Twenty QSAR models were generated using the training set, N = 91(Table 2).These models have four to 14 GETAWAY descriptors which have significant correlations with the actual activities of HEPT derivatives (Table 3).The number of independent variables considered in the QSAR models was five to six times less than the number of molecules in the training set, which is the acceptable number of variables for MLR modeling for a particular database size used in this study (Tropsha et al., 2003).Internal validation of the training set (N = 91) using RMSE, R, R 2 , (R 2 -R 2 adj), Q 2 revealed that model XX is the best 13 -descriptor QSAR model (Table 3).The model has an acceptable values of root -mean -squared error (RMSE) < 1.00, correlation coefficient (R) > 0.80, squared correlation coefficient (R 2 ) > 0.65, adjusted squared correlation coefficient (R 2 adj) > 0.6 , difference between R 2 and R 2 adj (R 2 -R 2 adj) < 0.30, which, all show that the results are not based on chance correlation.The model's Q 2 > 0.65 supports the predictive ability and significance of the model.
Model XX has two mass -, three electronegativity -, two van der Waals volume -, one polarizabilty -related, and five unweighted GETAWAY descriptors.In this model the most (negative) influential variable to Log (1/EC50) is R2u + (R maximal autocorrelation of lag 2 / unweighted), which is related to three dimensional geometry of the molecule (Consonni et al., 2002).However, after we performed correlation analysis, we found several multicollinear descriptors that may complicate the interpretation of the model.To test if there is complication exists, we subsequently calculated the variance inflation factor (VIP) for each variable.VIP provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity (Kutner et al., 2004).Results indicate that the variances of all regression coefficients of the model are not severely affected by collinearity (VIP).Moreover, the 13 variables in the model explained 88.2% variation in Log (1/EC50) and they are statistically significant (P < 0.01).The F statistic (F = 44.416,P < 0.0001) shows that the regression equation/model is statistically significant.The To further validate the model, we employed residual analysis on the residuals calculated from the difference between the experimental and predicted inhibitory activities by Model XX.For the MLR model to be valid there are three assumptions to be checked on the residuals: (1) no outliers; (2) the data points must be independent; (3) the distribution of the residuals must be normal with mean = 0, and constant variance (Chan, 2004).The first assumption is satisfied by the model.For the second, we used Durbin-Watson estimate to check for independence.Values near two indicate that data points are independent.The value we obtained for Model XX is 1.869 which satisfies the independence assumption.Furthermore, it can be seen in Figure 2 (A -C) that the distribution of the standardized residual is normal (mean = 0) and the scatter of the points has no clear pattern indicating that the variance is constant.Therefore the third assumption is satisfied.These results conformed to our internal validation performed for Model XX, which, suggests the model is valid.Moreover, to estimate the true predictive power of a QSAR model, it must be able to predict the activities of the molecules in the test set and compare the predicted with the experimental values (Veerasamy et al., 2011).We applied the Golbraikh and Tropsha criteria to estimate the predictive power of Model XX using the test set pred presented in Table 4, Model XX satisfactorily predicted the activities of the HEPT derivatives in the test set.However, the calculated value of k for Model XX (k = 0.7864) is not within the range of Golbraikh and Tropsha criteria (0.85 ≤ k ≤ 1.15).Hence, we applied the same criteria for other models to find a better model.
After applying the same procedure for the other 19 models and we found that Model XII has the best prediction performance when applied to test set, although, it is not the best QSAR model based on the internal validation.This has been expected since it was reported (Kubinyi et al., 1997;Kubinyi, Three, 1998) that in general there is no relationship between internal and external validations of the model.Furthermore, our results confirmed the findings (Kubinyi et al., 1997) in the literature that low internal predictivity may result to high external predictivity and vice versa.We further explored the predictive performances of the two selected models using the compounds in the test set by employing paired t test and residual analysis (Table 5).
Results show that there is no significant difference between the predicted and experimental activities (t(XII) = 1.444,P = 0.183; t(XX) = 0.998, P = 0.344, respectively).Likewise, the residuals distributions are normal (mean(XII) = 0.1230 and mean(XX) = 0.1488) and have constant variance for both models.).In addition, most of the HEPT QSAR studies reported used 2D and 3D molecular descriptors, which might complicate the interpretation of the models (Ivan et al., 2013;Zarei and Atabati, 2009).In contrast, the use of one type of descriptors (particularly 3D molecular descriptors) simplifies the information about the possible interaction of the ligand to the enzyme under study.This makes drug design more manageable albeit the structure of the receptor or enzyme is unknown.
The GETAWAY descriptors selected in the present models suggest that the geometry, net electronegativity and size of the molecules play an important role in their activities.These conform to the results of 3D-QSAR studies on HIV RT-HEPT interaction, that the steric and electrostatic interactions for the HIV-1 RT-HEPT affect inhibitory activities (Hannongbua et al., 2001).Moreover, these observations agree well with the experimental studies on the crystal structure of HIV-1-HEPT complex.The increase in potency of the HEPT is due to the interaction between residue Tyr181 in the enzyme and the 6-benzyl ring of the inhibitors, which stabilizes the structure of the complex (Hopkins et al., 1996).Also, our models support the conclusion in the literature that GETAWAY descriptors provide more predictive models for biological activities that are dependent on 3D features of the molecule (Consonni et al., 2002).
In this work, we performed extensive statistical analyses to validate Model XII and XX for their prediction performances.Our results showed that the best model, XII was realized after performing validation methods and applying the models to predict the activities using the test set.
Validating procedures are necessary to establish the predictive performances of the QSAR models.Most of the researches in QSAR modeling relied only on several statistical parameters (e.g.R, R 2 , Q 2 ) to validate their MLR -based QSAR models.Hawkins D.M., et al. (2003) argued that for sample size holding a portion of it back for testing is wasteful, and it is much better to use the cumbersome leave -one -out cross -validation.In contrast, A. Golbraikh, and A Tropsha (2002) suggested that this assumption is generally incorrect and there is no correlation between the values of Q 2 for the training set and predictive ability for the test set.Hence, the high value of Q 2 is necessary, but not an indication that the model has high predictive power.Still, the best method of validating a model is external validation (Golbraikh and Tropsha, 2002;Veerasamy et al., 2011).Furthermore, applying all available validation strategies to check the robustness of the model is necessary (Veerasamy et al., 2011).
In summary, the present QSAR models, XII and XX, due to their high predictive power, can be used as an alternative method to the costly and time -consuming experiments for determining the inhibitory activities of new HEPT derivatives.The two models have five H -and eight R -GETAWAY descriptors.These descriptors are related to the geometry, distributions of electronegativity, van der Waals, and atomic mass in a molecule.The results conform to the idea that many biological activities are dependent on the three -dimensional arrangement of atoms in a molecule (Schuur et al., 1996;Vracˇko, 2002).Furthermore, this study demonstrated that the use of GETAWAY descriptors successfully predicted the inhibitory activity of HEPT against HIV-RT.

CONCLUSION
The MLR -based QSAR models for the inhibitory activity of 91 HEPT derivatives have been obtained using GETAWAY descriptors.The best MLR model is XII and has 13 GETAWAY descriptors.The present models have good stability, robustness, and predictability when verified by internal validation.Application of the model XII to HEPT derivatives in a test set, satisfactorily predicted their inhibitory activities.Moreover, the model suggests that the three dimensional distributions of atomic electronegativity, mass, volume, and geometry of the molecule have considerably effect on the inhibitory activities of HEPT derivatives.Furthermore, these models offer new theoretical tools for drug design and development.
of Squares = sum of the squared deviations of the predicted value ( ̂) about the mean ( ̅) SST = Total Sum of Squares = sum of the squared deviations of the observed variable (  ) about the mean PRESS = predictive residual sum of squares = the difference between the predictive values and observed values SS = sum of squares = the difference between the observed values and their meanGenerally, an acceptable QSAR model generated from training set is considered to have a higher predictive if the following criteria are met: R>

Figure 2 .
Figure 2. Distribution and variance of standardized residuals calculated from the difference between experimental and predicted inhibitory activities by Model XX (A-C) and Model XII (D -F).

Figure 3 .
Figure 3. Correlations between the experimental and predicted values of inhibitory activity of HEPT using model XII and XX using the training set (N = 91)

Figure 4 .
Figure 4. Correlations between the predicted and experimental values of inhibitory activity of HEPT using model XII and XX using the test set (N = 10).

Table 2 . Internal Validation of 20 QSAR models using Training Set (N = 91).
bSelected models used in this study.

Table 4 . Predictive power results for the external test set; Golbraikh and Tropsha criteria.
Model XII has two mass -, four electronegativity -, two van der Waals volume -related and five unweighted GETAWAY descriptors.The two most influential variables in this model are R1v + (R maximal autocorrelation of lag 1/weighted by atomic van der Waals volumes) and R2u + (R maximal autocorrelation of lag 2 / unweighted) which are both related to the 3D geometry of the HEPT derivatives.Akin to model XX, model XII has muticolinearity among its variables; however, VIPs calculations suggest that this has no significant threat to predictivity of the model (VIP < 10, tolerance > 0.10).The 13 variables explained 85.7% of variance in Log (1/EC50).Moreover, the F statistic (F = 35.430,P<0.0001)indicatesthatthe regression equation/model is significant.Furthermore, the values of t statistic for all regression coefficients are all significant and can be used to explain Log (1/EC50).The comparison of predictive performances of model XII and XX using the test set based on Golbraikh and Tropsha criteria are shown in Table4.Unlike model XX, model XII conformed to the criteria.The graph of correlations between the predicted and experimental values of inhibitory activities of HEPT (in the test set) using model XII and XX are presented in Figure4.These results show that all calculated statistical parameters indicate that both models have good predictive power.Nevertheless, the values obtained for model XII are more satisfactory than that for Model XX.