This paper considers the use of classical and Bayesian inference methods to analyze data generated by variables whose natural behavior can be modeled using asymmetric distributions in the presence of left censoring. Our approach used a Lèvy distribution in the presence of left censored data and covariates. This distribution could be a good alternative to model data with asymmetric behavior in many applications as lifetime data for instance, especially in engineering applications and health research, when some observations are large in comparison to other ones and standard distributions commonly used to model asymmetry data like the exponential, Weibull or log-logistic are not appropriate to be fitted by the data. Inferences for the parameters of the proposed model under a classical inference approach are obtained using a maximum likelihood estimators (MLEs) approach and usual asymptotical normality for MLEs based on the Fisher information measure. Under a Bayesian approach, the posterior summaries of interest are obtained using standard Markov chain Monte Carlo simulation methods and available software like SAS. A numerical illustration is presented considering data of thyroglobulin levels present in a group of individuals with differentiated cancer of thyroid.
Stable distributions, usually denoted by
A great problem associated to stable distributions
Stable distributions have been widely used in the analysis of economics and actuarial data sets (Kleiber and Kotz, 2003), in the theoretical and applied physics (Barkai
In applied areas as in engineering and medical studies, it is possible to have data from variables whose behavior in the nature can be described using distribution with marked asymmetry and fat tails and additionally, there is the presence of some observations which can not be measured below some small threshold by the limitation of the measuring equipment. In this direction, there are situations as in the case of the treatment of individuals with differentiated thyroid cancer, which are necessary to follow up the patient by a fixed predetermined time, that is, we have in this case, longitudinal data with left censorship, asymmetry and heavy tails. In this paper, we develop a methodology that assumes the use of the Lèvy distribution as alternative to model data obtained during the follow up of the individuals, whose empirical distribution of frequencies shows a marked asymmetry in the presence of left censoring, extreme values and covariates. For this purpose, we first conducted a simulation study to evaluate the fit of the Lèvy distribution and compare it with results obtained using other existing common lifetime distributions; then we applied our procedure to a data set of individuals diagnosed with different forms of thyroid cancer, where all patients were treated with surgery in the thyroid gland and radioactive iodine therapy.
This paper is organized as follows: in Section 2, we present a background about the Lèvy distribution; in Section 3, we introduce the likelihood function; in Section 4, we introduce the Lèvy distribution in presence of censored data and covariates; in Section 5, we present an example with simulated data; in Section 6 we present an application with a real medical data; finally in Section 7 we introduce some comments and remarks.
If a random variable
where
When
where
The standard Lèvy distribution is given when
where
In Figure 1, we have the plots of the Lèvy pdf assuming
The cumulative distribution function (cdf) of the random variable
where erfc (
The Lèvy distribution with pdf (
where the inverse complementary error function is defined as erfc^{−1} (1 –
The hazard function
In Figure 2, it is presented the plots of the standard Lèvy hazard function (
(maximum error: 5×10^{−4}), where
Let
The log likelihood function for
The maximum likelihood estimator (MLE) for
Hypothesis tests and confidence intervals for the parameter
Assuming the Lèvy distribution with
where
In presence of left censored data, the likelihood function for
where
Assuming the presence of a vector of covariates
Inferences for the parameters
Recently, problems involving censored data were developed by Kim (2016, 2017), Khan
Under a survival study scenario, we simulated data from a standard Lèvy distribution. Using the inversion method, we have from (
Using the transformation
where Φ(
In this way, considering a given fixed value of
that is,
Considering
The sample mean, sample median and sample standard-deviation for the simulated lifetimes in Table 1 are given, respectively by, 254.32, 8.98, and 833.80. Assuming the standard Lèvy distribution with pdf (
In Figure 3, it is presented the histogram of the simulated lifetime data with the fitted standard Lèvy pdf curve from where it is observed a good fit of the model for the dataset. Using some existing popular lifetime distributions, we computed the MLEs for the median lifetime obtaining a value of 176.28 with the exponential distribution, 20.2076 with the Weibull distribution and 12.8717 with the log-logistic distribution. It is therefore possible to get different estimates for the median lifetime using these standard lifetimes. In addition, we obtained probability plots, the empirical cdf and the goodness-of-fit statistics for each one of these distributions (Table 2, Figure 4, and Figure 5). From these plots we observe that the fits of these distributions are inadequate for the data set introduced in Table 1, wherever it was observed a good fit of the data to the standard Lèvy distribution (smaller goodness-of-fit statistics).
Now, assuming a Bayesian approach for the standard Lèvy distribution with a gamma non-informative prior distribution, Gamma(0.000001, 0.000001), for
Assuming left censored data in
The results of Table 3 it indicated a good fit of the left censored data to the standard Lèvy distribution (smaller goodness-of-fit statistics).
In this section we also study the performance of the MLE estimation for
For each combination (
where
In Figure 6 it is presented the behavior of the MLE, for both bias and the root of the mean-squared error. We observe from these charts that: 1) the biases of the MLE for the parameter
To illustrate our proposed methodology, we used a medical data set related with differentiated cancer of thyroid. The data set consists of 91 patients treated by the head and neck surgeon Gabriel Sánchez de Guzmán who works in a level IV hospital located in Bogotá Colombia. This data set was also used by López
From the histograms in Figure 7, it is observed that some observations are large when compared to others values (very asymmetric data), that is, standard lifetime distributions as Weibull are not appropriate to analyze this data set. It is also observed the presence of left censored data in
To analyze the Thyroglobulin data set, it was assumed three standard Lèvy distributions with parameters
Considering the first case (not presence of covariates and independent responses) to analyze the proposed medical data set, we get inferences for parameter
Maximum likelihood estimates were obtained in SAS/NLMIXED procedure (SAS, 2010b), by applying the Newton-Raphson algorithm. To obtain the Bayesian estimates we have used MCMC methods available in SAS software 9.2, SAS/MCMC (SAS, 2010a). A single chain has been used in the simulation of sample. We simulated other 10,000 Gibbs samples taking every 10
In Figure 8, it is presented the plots of estimated cdfs based on the maximum likelihood estimates and the plots of the empirical cdfs for censored data using the Kaplan-Meier method. In addition, we obtained goodness-of-fit statistics for the standard Lèvy distribution, for the exponential distribution, for the Weibull distribution and for the log-logistic distribution (Table 5).
From the fitted models (Figure 8 and Table 5), we conclude that the thyroglobulin data are better fitted by the standard Lèvy distribution. From the results of Table 4, we observe similar inference results using classical or Bayesian approaches. For the Bayesian approach, we have used a non-informative Gamma(0.000001, 0.000001) prior distribution for the parameter
Considering the regression model (
where
Tables 6 and 7 indicate maximum likelihood estimates and the inference results considering a Bayesian approach for regression models for each response of interest. In the Bayesian analysis it was assumed for the parameters
From the obtained results of Table 6, it is observed that the covariate
The results of Table 7 indicate that the covariates
Assuming dependent responses for
where
Table 8 presents the inference results considering a Bayesian approach for the regression model. We consider for the parameters
The results of Table 8 indicate that the covariates
Different model selection methods to choose the most adequate model could be adopted under the Bayesian paradigm (Berg
We used the CPO to choose the most adequate model because the DIC is not indicated to compare the independent responses and dependent responses models. We conclude, based on t he results of Table 10, that the model with independent response is better fitted by the data (larger CPO), that is, we use this model to obtain our final inferences of interest.
Usually, in the statistical analysis of real data sets, we could have observations generated by variables whose natural behavior is asymmetric and where some sample units show large values in comparison to other ones. Additionally, it is possible to have the presence of left censored data and covariates that make a quite complex scenario of the estimation process. To analyze this kind of data, we could use a Lèvy distribution as a good alternative for standard existing lifetime distributions available in the literature commonly used to model data with asymmetry behavior.
As a special application and motivation for the present paper, it was considered a real medical data set, that contains few individuals having very large values of the response variable in comparison with the other observations and where the equipment that measures the variable does not detect very small values below a given threshold. From the obtained results of a simulation study it was observed that the standard lifetime distributions usually used as exponential, Weibull, gamma, log-normal, log-logistic or generalized gamma are not good candidates to fit this type of data. From the results obtained by fitting the Lèvy distribution to a real data set, we concluded that this type of model can be effective to fit asymmetric data in the presence of left censored data and extreme values. It is important to point out that this paper extended the work of Achcar
Section 6 verified the good fit of the Lèvy distribution with the medical data set, where standard Bayesian diagnostic criteria as DIC or CPO were used to show the best fit of the Lèvy distribution for the thyroglobulin levels data set. From this analysis, it was possible to verify the covariates with significant effects on the response of interest and discover the best dependence structure for the data. In the same section, it was observed that, the model assuming independent responses on the time shows the best fit to the data (smaller DIC and larger CPO). Comparing the observed results under both approaches (classical and Bayesian), we concluded that the Bayesian inference approach could detect more covariate effects than a classical inference approach. Under the maximum likelihood approach we detected an effect of the covariate gender (first measure of the follow up) and tumor size (second and third evaluation) on the thyroglobulin levels. Under the Bayesian approach, it was possible to detect that the age and gender have important effects on thyroglobulin levels in the first evaluation, but in the second moment of follow up the covariate age is not important but it was observed the effect of tumor size besides the covariate gender. Finally, at the end of the follow up, the variable which presents more effect in the thyroglobulin levels is the tumor size. We conclude that these obtained inference results can be important for the medical knowledge about the thyroglobulin levels behavior during the follow up time after treatment for different forms of thyroid cancer.
In according with the obtained results from our simulation study, the more popular distributions (exponential, Weibull and log-logistic) usually used to fit asymmetry data, show a poor fit when the data are generated by a process that naturally can be model by a Lèvy distribution. This result implies that if the frequency distribution of the observed data shows marked asymmetry and extreme values, necessarily the data analyst should consider the Lèvy distribution as a candidate among the other commonly used distributions in their inference procedure.
Finally, it is important to point out that for a Bayesian analysis using standard MCMC simulation methods as the Gibbs sampling or Metropolis-Hastings algorithms (see for example, Casella and George (1992) or Chib and Greenberg (1995)) we have many available software that does not require a high computational expertize to get the posterior summaries of interest and could be used in any applied study.
The authors are grateful to Gabriel Sánchez Guzmán, a head and neck surgeon of the Clinica Cardio Infantil de Bogotá, who kindly provided the data set used to fit the models proposed in our methodology. The third author had partial financial support by the Coordenação de Aperfeiçoamento de pessoal de Nível superior (CAPES) of Brazil
Simulated values (
3289.97 | 4.51 | 18.57 | 45.59 | 112.29 |
0.75 | 91.05 | 40.21 | 4456.61 | 4.00 |
36.17 | 3.13 | 1.75 | 8.44 | 3.75 |
21.24 | 3.04 | 1386.99 | 326.37 | 79.96 |
1.87 | 0.71 | 6.05 | 13.38 | 2.00 |
2120.27 | 39.47 | 5.19 | 2.66 | 40.64 |
9.06 | 6.58 | 25.40 | 1.65 | 4.08 |
2.88 | 45.63 | 13.49 | 30.31 | 162.06 |
0.56 | 4.68 | 49.61 | 32.08 | 6.17 |
136.35 | 4.80 | 8.91 | 2.81 | 2.35 |
Goodness-of-fit statistics
Distribution | −2 × log [ | AIC | BIC |
---|---|---|---|
Lèvy | 392.3 | 394.3 | 396.2 |
Exponential | 653.9 | 655.9 | 657.8 |
Weibull | 512.0 | 516.0 | 519.9 |
Log-Logistic | 491.9 | 495.9 | 499.7 |
AIC = Akaike information criterion; BIC = Bayesian information criterion.
Goodness-of-fit statistics (left censored data in
Distribution | −2 × log [ | AIC | BIC |
---|---|---|---|
Lèvy | 204.1 | 206.1 | 207.3 |
Exponential | 327.5 | 329.5 | 330.7 |
Weibull | 222.6 | 226.6 | 229.1 |
Log-Logistic | 225.9 | 229.9 | 232.3 |
AIC = Akaike information criterion; BIC = Bayesian information criterion.
Inference results for Lèvy
Variable | MLE (SE) | 95% confidence interval | PM (SD) | 95% credible interval |
---|---|---|---|---|
0.8790 (0.1303) | (0.6202; 1.1379) | 0.8777 (0.1324) | (0.6399; 1.1535) | |
0.1358 (0.0244) | (0.0874; 0.1843) | 0.1359 (0.0252) | (0.0933; 0.1912) | |
0.0747 (0.0155) | (0.0439; 0.1056) | 0.0752 (0.0159) | (0.0462; 0.1090) |
MLE = maximum likelihood estimates; SE = standard error; PM = posterior mean; SD = standard deviation.
Values of model discrimination criteria
Variable | Distribution | −2 × log [ | AIC | BIC |
---|---|---|---|---|
Lèvy | 590.2 | 592.2 | 594.8 | |
Exponential | 1030.0 | 1032.0 | 1034.0 | |
Weibull | 761.6 | 765.6 | 770.6 | |
Log-Logistic | 739.2 | 743.2 | 748.2 | |
Lèvy | 225.9 | 227.9 | 230.4 | |
Exponential | 697.3 | 699.3 | 701.8 | |
Weibull | 375.6 | 379.6 | 384.7 | |
Log-Logistic | 360.4 | 364.4 | 369.4 | |
Lèvy | 223.7 | 225.7 | 228.3 | |
Exponential | 803.6 | 805.6 | 808.1 | |
Weibull | 345.2 | 349.2 | 354.2 | |
Log-Logistic | 334.3 | 338.3 | 343.3 |
AIC = Akaike information criterion; BIC = Bayesian information criterion.
Maximum likelihood estimates (MLE) and standard error estimates (SE) for the regression models
Variable | Parameter | MLE | SE | 95% confidence interval |
---|---|---|---|---|
−0.0010 | 0.6555 | (−1.3032; 1.3011) | ||
0.0215 | 0.0119 | (−0.0021; 0.0451) | ||
−1.6025 | 0.4818 | (−2.5596; −0.6454) | ||
0.0146 | 0.0115 | (−0.0083; 0.0375) | ||
−5.1193 | 1.5545 | (−8.2072; −2.0314) | ||
0.0309 | 0.0203 | (−0.0093; 0.0712) | ||
−0.7994 | 0.5373 | (−1.8667; 0.2679) | ||
0.0348 | 0.0119 | (0.0112; 0.0584) | ||
0.0109 | 0.0098 | (−0.0085; 0.0303) | ||
−4.3124 | 1.8805 | (−8.0477; −0.5771) | ||
0.0156 | 0.0226 | (−0.0293; 0.0606) | ||
−0.6396 | 0.6163 | (−1.8637; 0.5846) | ||
0.0348 | 0.0128 | (0.0094; 0.0602) | ||
0.0050 | 0.0117 | (−0.0182; 0.0283) |
Posterior means (PM) and standard deviation (SD) for the regression models
Variable | Parameter | PM | SD | 95% credible interval |
---|---|---|---|---|
0.6235 | 0.5923 | (−0.3279; 1.4727) | ||
0.0192 | 0.0102 | (0.0014; 0.0386) | ||
−1.9977 | 0.2580 | (−2.4019; −1.2970) | ||
0.0047 | 0.0118 | (−0.0191; 0.0259) | ||
−5.5631 | 0.3562 | (−6.1059; −4.9325) | ||
0.0317 | 0.0180 | (−0.0030; 0.0662) | ||
−0.7634 | 0.1842 | (−1.0771; −0.4025) | ||
0.0346 | 0.0116 | (0.0108; 0.0552) | ||
0.0130 | 0.0063 | (0.0003; 0.0249) | ||
−3.8062 | 1.4488 | (−6.4211; −1.0299) | ||
0.0119 | 0.0219 | (−0.0299; 0.0551) | ||
−0.4974 | 0.6063 | (−1.6931; 0.7043) | ||
0.0336 | 0.0137 | (0.0043; 0.0581) | ||
0.0017 | 0.0094 | (−0.0151; 0.0205) |
Posterior means (PM) and standard deviation (SD) for dependent responses regression model
Variable | Parameter | PM | SD | 95% credible interval |
---|---|---|---|---|
−0.7663 | 0.6358 | (−2.6072; 0.3472) | ||
0.0009 | 0.0134 | (−0.0183; 0.0361) | ||
−2.6465 | 0.4021 | (−3.3359; −1.9144) | ||
0.0489 | 0.0130 | (0.0224; 0.0772) | ||
−8.4149 | 1.7018 | (−11.0656; −4.5569) | ||
0.0030 | 0.0648 | (−0.1272; 0.0970) | ||
−1.1078 | 1.3358 | (−3.4245; 1.6122) | ||
0.0997 | 0.0247 | (0.0432; 0.1382) | ||
−0.0109 | 0.0183 | (−0.0371; 0.0195) | ||
19.0035 | 7.1519 | (7.6302; 28.5281) | ||
0.0058 | 0.1010 | (−0.1325; 0.2615) | ||
−3.2657 | 1.9517 | (−6.7785; 1.1160) | ||
0.3152 | 0.1490 | (0.1142; 0.6747) | ||
−0.2662 | 0.1088 | (−0.5338; −0.1402) |
Deviance information criterion (DIC)
Model | DIC |
---|---|
Independent responses ( | 583.36 |
Independent responses ( | 220.00 |
Independent responses ( | 225.50 |
Dependent responses | 1472.93 |
Conditional predictive ordinate (average)
Model | Independent responses | Dependent responses |
---|---|---|
0.3916 | 0.3663 | |
5.6547 | 0.3560 | |
5.3591 | 0.3426 |