Quantitative high throughput screening (qHTS) assays are used to assess toxicity for many chemicals in a short period by collectively analyzing them at several concentrations. Data are routinely analyzed using nonlinear regression models; however, we propose a new method to analyze qHTS data using a nonlinear mixed effects model. qHTS data are generated by repeating the same experiment several times for each chemical; therefor, they can be viewed as if they are repeated measures data and hence analyzed using a nonlinear mixed effects model which accounts for both intra- and inter-individual variabilities. Furthermore, we apply a one-step approach incorporating robust estimation methods to estimate fixed effect parameters and the variance-covariance structure since outliers or influential observations are not uncommon in qHTS data. The toxicity of chemicals from a qHTS assay is classified based on the significance of a parameter related to the efficacy of the chemicals using the proposed method. We evaluate the performance of the proposed method in terms of power and false discovery rate using simulation studies comparing with one existing method. The proposed method is illustrated using a dataset obtained from the National Toxicology Program.
Quantitative high throughput screening (qHTS) assays take advantage of robotic technology and high sensitivity detectors (Michael
Many previous studies have considered qHTS data to assess potential toxicity. Shockley (2015) mentioned that qHTS data could be analyzed using the Hill model, a nonlinear regression model, with significant potential for toxicity testing. He suggested that a robust method be developed to take full advantage of qHTS assays, to overcome large uncertainties in parameter estimation when fitting nonlinear models to qHTS data. Lim
Several methods have been proposed to screen toxic chemicals using qHTS data. Inglese
Repeated measures data are generated by observing a number of individuals repeatedly under differing experimental conditions, where individuals are assumed to constitute a random sample from a population of interest (Lindstrom and Bate, 1990). For the qHTS assays in the current study, each chemical was measured 1–48 times at 15 concentrations. Therefore, for an analysis purposes these qHTS data may be technically treated as repeated measures data taken on multiple subjects, even though they have commonly been evaluated using nonlinear regression models.
Nonlinear mixed effects models are generally used to analyze repeated measures data. For example, Strathe
Outliers often occur in qHTS assay data; therefore, this paper analyzes qHTS assay data by introducing the one-step approach proposed by Williams
The remainder of this paper is structured as follows. Section 2 describes the basic nonlinear mixed effects model, current methods, one-step approach and the proposed method to analyze qHTS data. The performance of the proposed method is compared with that of an existing method on the basis of simulation studies described in Section 3. Section 4 summarizes the results of applying the proposed method to qHTS data. We conclude the paper and discusses potential future research topics in Section 5.
The qHTS assay considered in this study is the “TOX21 ERa LUC BG1 Agonist” assay from the Tox21 library. The assay identifies small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway using the BG1 cell line. Chemical interactions with estrogen receptor (ER) disrupts normal endocrine function. Therefore, it is important to understand the effect of environmental chemicals on the ER signaling pathway to effectively manage endocrine disrupting chemicals (EDCs). The BG1-Luc-4E2 cell line was employed in the TOX21 ERa LUC BG1 Agonist assay to identify ER agonists, and has been used to screen the Tox21 10K compound library. BG1-Luc-4E2 cell line endogenously expresses full length ER-alpha and is stably transfected with a plasmid containing four estrogen responsive elements (ERE) upstream of a luciferase reporter gene (https://tripod.nih.gov/tox/apps/assays/).
The Hill model is a nonlinear model primarily used to describe the shape of dose-response curves in biology, chemistry, and pharmacology. We apply the Hill model for each chemical since qHTS data are generated by repeating the same experiment at various concentrations for thousands of chemicals. The Hill model can be expressed by
Each parameter in Hill model has a specific bio-chemical meaning (Figure 1).
Repeated measures data are analyzed using a nonlinear mixed effects model, which allows two types of correlation: correlation among random effects within each subject, and correlation in the within-subject variance-covariance matrix of errors (Williams
where
We can generally assume that the number of random effects and the variance structure of random effects are the same for all subjects, respectively (Williams
where
where
where
Various methods have been proposed to estimate parameters for mixed effects models; however, we restrict this paper to maximum likelihood (ML) methods.
This section introduces one of robust estimation methods, the one-step approach (Williams
We apply the multivariate first-order Taylor expansion to approximate
where
Substituting
If the left side of
where
Since
where
where
where
and
where
We propose a method to identify toxic chemicals using a series of qHTS data analyses, as shown in Figure 2 and are detailed as:
Step 1. Perform one sample
Classify chemicals that fail to reject the null hypothesis as ‘Inactive’.
If chemical rejects the null hypothesis, go to next Step.
Step 2. Fit a nonlinear regression model to each chemical and identify the sign of estimates
Classify chemicals with negative estimates (
If estimates are positive (
Step 3. Fit different models according to replication count for each chemical.
Fit a nonlinear regression model to chemicals that were only measured once.
Fit a nonlinear mixed effects model to chemicals that have at least two observations at each concentration
Step 4. After fitting a model to each chemical in Step 3, identify toxic chemicals based on the significance of
If
If
When actually fitting the curve on the real data, the method above may not be ensured. For example, there are some cases where the model fit fails to converge. One possible reason for the failure might be that the plateau of the Hill curve is not specified by the data. Nevertheless, we see that the Hill function can still be used to represent the data. Thus, we have set a maximum iteration number to be 50 in all steps. If not converged, the result of the last iteration is returned. We checked for individual cases and confirmed that the number of cases is small. Most have converged successfully. Another remedy to address this issue for our proposed method is to have the first step where we conduct a simple
Another possible problem is that the covariance matrix of the estimated parameters is not positive definite. We observed that there are a few cases having the problem. To resolve this issue, we applied a ridge regression type approach similar to the method proposed by Lim (2015). When a covariance matrix was not positive definite, we added a very small value to the diagonal elements of the Hessian matrix and re-calculated the covariance matrix, which turned to be positive definite.
Last, we know that starting values are very important for a nonlinear model to be converged and bad initial points lead the optimization procedure to converge to local minima. Therefore, it is often recommended to use a grid of starting values to ensure to fit the model correctly. However, from our long-term experience of the Hill function, we know how to select very good starting values. Using them showed that the model fit converged in most cases. Therefore, we did not utilize the grid method for starting values in this paper.
In this simulation study we compare the proposed method with one existing method proposed by Parham
Active: If
Inactive: If
Marginal: Neither classified as active nor inactive.
Our criteria of comparison are the power and the false discovery rate (FDR). Power is the ratio of predicted actives among true actives and FDR is the ratio of true inactives among predicted actives. High power with small FDR is preferred, while high power with high FDR is not.
The design of the simulations was modeled after the real qHTS data considered in this paper. Logarithmic value of the concentration log(
Active data are expected to follow the Hill model in
Inactive data do not follow the Hill model and scatter around 0, with no trends. They are sampled from normal distribution with mean 0 and
At Step 1 of our method, we set the significance level of
The estimated power and FDR are summarized in Figure 3. The standard errors of all power and FDR estimates provided in this figure are less than 2% and 21% of the estimates, respectively. In all cases, the Parham method has a very high FDR compared to our proposed methods. For example, when the proportion of active chemicals is 0.05, the FDR of the Parham method is 51.6%. However, those of RNM5, RNM10 and RNM15 are 14.3%, 24.6%, and 32.7%, respectively. As the proportion of active chemicals increases, the FDR decrease for all methods. However, even when the proportion is 0.2, the FDR of the Parham method is 18.4%, still quite high. On the other hand, those of RNM5, RNM10 and RNM15 are 3.47%, 6.80%, and 9.98%, respectively, which are much lower than that of the Parham method. The proposed methods outperform the Parham method in terms of the FDR. The RNM5 performs best.
The power of the Parham method is around 91% in all cases. On the other hand, those of RNM5, RNM10 and RNM15 are around 82%, 85% and 87%, respectively, in all cases. Although the proposed methods have lower power than the Parham method, they maintain reasonably good power in all cases. However, the Parham method has a much larger FDR than the proposed methods and hence it is not recommended. As described, the three variants of the proposed method are based on the significance level of the
Additionally, Figure 4 shows that the Parham method declare a much larger proportion of chemicals to be marginal than the proposed methods. For example, when the proportion of active chemicals is 0.10, the Parham method declared 40.10% of inactive chemicals as marginal whereas the RNM5 declared only 2.39% as inconclusive.
In summary, our simulation studies suggest that the proposed method outperforms the method proposed by Parham
The US Tox21 program established a library of approximately 10,000 chemicals comprising commercial origin substances by the US Environmental Protection Agency, National Toxicology Program, and NCGC (Huang
The qHTS experiment in this paper automated screening in the 1,536-well plate format and tested each library chemical at multiple concentrations. The qHTS test vessel is the plate, incorporating a grid of 1,536 small open divots (Figure 5). The first and second rows of divots were positive controls, and third and fourth rows were vehicle controls. The remaining rows were treated with chemicals and measurements were normalized relative to the controls,
where
The qHTS data included 8,306 chemicals and a variable that indicated the cluster by CASANOVA (within each chemical) a profile belonged to. CASANOVA refers to Cluster Analysis by Subgroups using ANOVA (Shockley
Tests were run various times at same concentration, i.e., the number of replication was in (1, 48), hence sample size for each chemical was in (15, 720). The data set for each chemical was analyzed using the one-step approach for the nonlinear mixed effects model (Williams
Table 1 summarizes the results of the analysis using the proposed method (RMN5) and Parham method. In Step 1 of the proposed method, among 11,064 chemicals, 5,270 chemicals classified as ‘Inactive’ because they failed to reject the null hypothesis in the one sample
In this paper we analyzed quantitative high throughput screening (qHTS) data using a robust one-step approach, and assessed potential toxicity for a number of chemicals. The ToxCast pipeline (TCPL) package (Filer
We first perform a one sample
Although outliers often occur in qHTS data, it is not easy to differentiate between outliers and natural variability in the data. However, we can still say that there may be potential outliers in the data and if there are any, then the standard least squares estimators are affected by them. Our goal is thus not to identify outliers, but to estimate parameters still relatively correctly even though there are outliers by using a robust method such as the M-estimator. It is well known that the M-estimator is robust to outliers or influential observations in linear regression. Recently, the authors introduced the M-estimation methods to nonlinear regression and showed that it is again robust to outliers in nonlinear regression through extensive simulations. For more details, see Lim (2015), Lim
The proposed qHTS data analysis process will be useful for classifying chemicals with toxicity and rapidly processing the collected 10,000 chemical datasets. The approach could subsequently be applied to other qHTS data libraries. We analyzed qHTS data using the one-step approach (Williams
Screening results for 11064 chemicals of BG1-Luc-4E2 cell agonist data using the proposed method (RMN5) and Parham method
Class | Count (%) | |
---|---|---|
RMN5 | Parham | |
Active | 1,600 (14.5%) | 1,978 (17.9%) |
Inactive | 6,711 (60.7%) | 5,782 (53.3%) |
Inconclusive (Marginal) | 2,753 (24.8%) | 3,304 (29.8%) |
Total | 11,064 (100%) | 11,064 (100%) |