A novel approach is developed for modeling situations in which the property of interest is an algebraic transformation of the original experimental data. In many cases such a transformation yields a data set with a significantly smaller data range. Here we explore the effect of data range on modeling statistics. We illustrate the approach using data on the mass spectrometry collision energy (CE) required to decompose 50% of precursor ions to fragments (CE50). Earlier we showed that a nonlinear center-of-mass transformation, yielding Ecom50, produces values less dependent on the specific mass spectrometric experimental conditions. For this data set the Ecom50 range is 13.5% of the CE50 range. We propose a two-step modeling method. First, the original experimental data, CE50 (with the larger data range), are modeled by a standard method, partial least squares (PLS). Second, the calculated dependent variable from that model is algebraically transformed (not modeled) according to the center-of-mass transformation, providing the generally more useful quantity, Ecom50. As shown here, using this two-step method to predict Ecom50 (from previously published data) produces a standard error 21% smaller than modeling Ecom50 directly and correspondingly narrows the confidence interval for prediction. Some specific implications for prediction are given for a published data set. This work is part of the ongoing development of a system of models to assist in the development of human metabolites.
Keywords: Collision energy at 50% reduction (CE50), molconn structure descriptors, PLS models, range of data significance, PubChem structures prediction.
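The second step of the method described above is a purely algebraic conversion, which can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard center-of-mass relation Ecom = CE × m_gas / (m_gas + m_ion), assumes argon as the collision gas (the abstract does not name the gas), and uses hypothetical function names and made-up CE50 values standing in for the output of a step-1 PLS model.

```python
# Illustrative sketch only: function names, the collision gas, and all
# numeric inputs are assumptions, not taken from the paper.

ARGON_MASS = 39.948  # assumed collision gas, average atomic mass in Da


def to_center_of_mass(ce_lab, precursor_mass, gas_mass=ARGON_MASS):
    """Convert a laboratory-frame collision energy to the center-of-mass frame.

    Uses the standard relation E_com = E_lab * m_gas / (m_gas + m_ion).
    """
    if precursor_mass <= 0 or gas_mass <= 0:
        raise ValueError("masses must be positive")
    return ce_lab * gas_mass / (gas_mass + precursor_mass)


def predict_ecom50(predicted_ce50, precursor_masses, gas_mass=ARGON_MASS):
    """Step 2: transform model-predicted CE50 values; no second model is fitted."""
    return [
        to_center_of_mass(ce, mass, gas_mass)
        for ce, mass in zip(predicted_ce50, precursor_masses)
    ]


# Hypothetical step-1 predictions of CE50 (arbitrary energy units) and
# the corresponding precursor ion masses (Da).
predicted_ce50 = [20.0, 30.0, 40.0]
precursor_masses = [200.0, 400.0, 600.0]

ecom50 = predict_ecom50(predicted_ce50, precursor_masses)
for ce, mass, ecom in zip(predicted_ce50, precursor_masses, ecom50):
    print(f"CE50={ce:.1f}  m_ion={mass:.1f}  Ecom50={ecom:.3f}")
```

Because the mass factor is below 1 and shrinks as the precursor mass grows, the transformed values occupy a narrower range than the CE50 values, which is the range compression the abstract refers to.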