Survival has traditionally been analyzed using statistical methodologies that include nonparametric methods such as Kaplan-Meier (KM) curves, actuarial methods, and some data modeling procedures using analytical probability distributions that are easy to use, such as the Exponential, Weibull, Gamma and Lognormal distributions. However, these distributions may not necessarily reflect the natural behavior of the times observed in a monitoring study, presenting restrictions related to the density form. Datasets are sometimes observed to exhibit important extreme values. An example are patients who survive for periods longer than those expected by the treating physicians, which is mainly caused by the present biological diversity that makes these patients different in terms of their susceptibility to contracting diseases or their response to a performed procedure. Such situations cause statistical methods to provide inaccurate predictions of times, as the likelihood function associated with the probability distribution used for modeling fails to include sufficient information about extreme values, thereby affecting the subsequent estimations (Martínez and Achcar, 2014).
Consequently, the present study proposes modeling the Lévy probability distribution, which is characterized by its tails, which are heavier than a Gaussian distribution; consequently, it may present a greater impulsivity degree in the distribution of times, where some events or group of events still have an occurrence probability. The model is part of the
It is possible to use different reparametrizations of the Lévy distribution. According to Nolan (2009) one of the most used is the standard Lévy distribution, with
Using the obtained results, this study compares the proposed distribution with parametric and nonparametric traditional methods (Kaplan-Meier). Additionally, illustrates the methodological proposal using a dataset concerning the survival times of patients with colon cancer (CC) that exhibit a marked asymmetry and censorship presence during the follow-up of individuals.
The Lévy distribution is a continuous probability distribution, named by the French mathematician Paul Lévy in 1925 during his investigation of the behavior of independent random variables. This distribution belongs to a family of stable distributions that have been of great interest because they allow modeling the asymmetry and arbitrarily larger tails than normal distribution (Buckle, 1995).
Stable distributions are described by four parameters (
The parameter
The parameter
The location parameter
The scale parameter
In the special case of the Lévy distribution, the stability parameter
In addition, it has the following distribution and survival functions:
where the function
It is possible to use different reparametrizations of the Lévy distribution, according to Nolan (2009) one of the most used is the “standard Lévy distribution”, with
By definition, the risk function for any parametric model associated with a random variable
Taking into account the survival function (2.3), the risk function can be expressed as
then
The numerator in (3.2) corresponds to the density function of a random variable with a Lévy distribution with parameters
Note that,
To perform an iterative calculation of the abovementioned series, the following alternative formula is used:
In this manner, the risk function calculated iteratively is determined as
Another manner in which to express the denominator of the risk function is in terms of the standard normal distribution function, where the only difference is its scale and one translation (Ayant and Borg, 1974).
Consequently,
Therefore, considering the expression (3.5), the risk function (3.4) is rewritten as:
Figure 2 shows some graphs related to the risk function (3.6) of the Lévy distribution with different values of the parameter
Let
It can be observed in (4.1) that the log-likelihood function has a complex structure that produce derivatives (Annexed) that are difficult to address. Therefore the use of some numerical approximation through iterative methods such as the Newton-Raphson algorithm are suggested to obtain a solution to the problem.
A simulation study was performed to evaluate and model different cases that may arise during the analysis, considering the existence of censorship in the estimation process. It was decided to perform
where
The lower limit
In the particular case of the standard Lévy distribution, it was necessary to approximate the following probability:
considering that
The random samples
therefore
The previous step was repeated for the censored random variables
A random sample from the 1,000 simulated samples was chosen considering each scenario. With this sample, the survival function, denoted by
Considering the asymptotic normality of the maximum likelihood estimators, the 100(1 −
From the parametric approach, the CIs for the survival function
The confidence limits of the survival function
From the sample that was chosen to estimate the survival function, a bootstrap sample was selected (
Then, the set of artificial data (bootstrap sample) was denoted datos*; these data were used to estimate the parameters via the maximum likelihood method.
Based on the previous estimate, it was obtained
The previous steps were repeated
Then, it was possible to obtain a distribution of the bootstrap estimates for each time
For each distribution, the confidence limits were calculated using a method based on the frequency distribution itself, called the “percentile method”. In this method, the confidence limits 100(1−
Based on all the percentiles, the lower and upper confidence limits were drawn, which turned out to be around the respective parametric survival curve, which was accompanied by the function estimated via Kaplan-Meier and some CIs calculated in previous steps.
In addition, a comparison was also made between the survival function
To calculate these differences, the following expression was used:
The integral (5.2) was evaluated between the minimum and maximum of the simulated random sample.
Table 1 indicates that the estimator of the scale parameter
According to the values obtained when taking the differences between the survival functions, it was observed that, under any simulation scenario, the two curves approximated inadequately when the sample size increased, especially when it was
Figure 3 shows that the chances of survival are high at the beginning of the time interval. This behavior was repeated in all simulation scenarios. Under the standard Lévy distribution, it is possible to calculate the survival probabilities for censored times. The lengths of the CIs for the Kaplan-Meier estimator increased according to the increase in censorship and moderate sample sizes (
To illustrate the proposed method, a set of data provided by a Health Services Provider Institution in Colombia was used. The data file contains basic information regarding individuals treated during oncology consultations in the period between October 2003 and January 2016. The information includes sociodemographic information, care type, and procedures performed (chemotherapy, biopsy, and radiotherapy) on patients from different regions of Colombia diagnosed with CC. The information includes 88 patients (women and men) who entered the study gradually as they underwent consultation. The event of interest was “death associated with a failure attributable to CC”, which occurred for 79 of the individuals. The remaining individuals experienced diverse situations that classified them as
It is worth mentioning that the times included in the analysis correspond to information about patients treated with different procedures, considering their condition, in addition to patients with different stages of CC (stages II, III, and IV) since the available information was quite restricted. To illustrate the methodological proposal, it was decided via consultation with experts to combine the information about all patients to obtain preliminary results regarding the overall survival of the disease.
Figure 4 presents the frequency distribution and the boxplot of the time variable. A larger observation than the rest of the data is shown, reflecting great asymmetry and therefore a histogram with a long tail to the right. The shortest survival time found among the patients was 0.24 months, however, the longest-surviving patient reached 147.5 months (4425 days). The censorship percentage exhibited by the dataset was 10.2%.
Table 2 reports the results of the estimations via the maximum likelihood method of candidate distributions. Based on the Akaike information criterion (AIC), Bayesian information criterion (BIC) and Akaike information criterion corrected (AICc), it was found that from the candidate distributions, those that presented a better fit were the standard Lévy distribution (325.8, 328.2, 325.8) and the Lognormal distribution (476.3, 481.3, 476.5). When the two parameters of the model Lév(
The estimation of parameters makes it possible to compare the empirical frequency distribution and theoretical distribution functions. The empirical frequency distribution was constructed using the Kaplan-Meier method due to the presence of censored data. It was observed in Figure 5 that the exponential and gamma distributions fitted the dataset the least well. In addition, the standard Lévy distribution curve exhibited a different behavior compared to the other distributions. This is because this distribution is more flexible and allows modeling extreme data (147.5 months) located in the right tail of the curve.
With the purpose of comparing the two curves of the
One of the main functions of the survival model is the risk function. With this, the manner in which the instantaneous death rate changes in patients with CC over time. Figure 7 shows that that from the distributions with the best fit, the risk function
Usually, in a survival analysis, datasets have observations associated with variables whose natural behavior is asymmetric, and some sample units can exhibit large values in comparison with others. Additionally, it is possible to have censored data on the right that makes the scenario quite complex at the time of the estimation process. To analyze this type of data, the Lévy distribution could be used as a good alternative to common parametric models available in the literature: Exponential, Gamma, Weibull and Lognormal.
In addition, if the assumptions of the parametric models exist, a more robust analysis can be performed in comparison with the non-parametric methods (Kaplan-Meier curves and actuarial methods), since taking into account some assumptions and the selection of a distribution of hypothetical probability for survival times, statistical inference is more accurate, which makes standard deviations smaller when there are no such assumptions (Hashemian
According to a preliminary analysis regarding the performance of the new distribution, assuming that the location parameter
Usually in the literature, the confidence limits of the survival function are reported for times of interest as the median or some arbitrary time, which differs from the proposal that focuses on building CIs via the percentile method for functions
It is important to remember is that this study includes the analytical expression of the risk function of the Lévy model along with its graphical representation, which is similar to the density function
As a special application and motivation for the present study, a set of medical data that contains a few people with very large values of the response over time that they survive the event was considered. From the obtained results, it was detected that the standard distributions are not good candidates to adapt to this type of data. However, after the data were fitted using the Lévy distribution with the location parameter
Among the limitations of the study, we refer to the low number of patients included. The cancer staging could also be a good alternative for a specific analysis of patients, since a global survival analysis for colon cancer was carried out based on available information. For future research work, other types of asymmetrical distributions with characteristics can be considered flexible functions of density and risk, which favor modeling, when faced with sets of data that present extreme values of importance, such as the Gamma-Kumaraswamy Generalized distribution, which contain a large number of special sub-models that are useful to compare in terms of goodness of fit with the proposed study distribution.
In this sense, the use of these models can provide great benefits for medical research from new procedures of parametric survival analysis, where some units of the sample can have very large values compared to others. This type of methodology can be used for the analysis of times in the area of health as well as in the area of industrial quality control with elements that have a longer duration than the rest of the data.
The authors express their sincere thanks to Dr. Jose William Martinez for authorizing the use of cancer data to illustrate the concepts discussed in this article.
The derivatives of the log-likelihood in relation to the parameters
Functions of the survival Lévy model.
Risk functions of the Lévy model.
Survival functions of the standard model Lévy and the Kaplan-Meier estimator, taking into account the censorship and the size of the sample.
Distribution of survival times in months of patients with colon cancer. (a) Histogram of survival times; (b) Box-plot of survival times.
Adjustment of the candidate parametric models. (a) Histogram versus probability distributions; (b) Adjustment of the proposed distributions; (c) Empirical distribution function versus the estimated distribution function.
Survival functions of the times with censorship, coming from patients with colon cancer, under the parametric and nonparametric (Kaplan-Meier) approach. (a) Standard Lévy distribution; (b) Lognormal distribution.
Risk function for survival times from patients with colon cancer. (a) Standard Lévy distribution; (b) Lognormal distribution.
Estimation of the
Cens | ALI |
PC |
|||||||
---|---|---|---|---|---|---|---|---|---|
10% | 1.993 | 60 | 0.05200 | 0.00949 | 0.00009 | 0.03721 | 0.943 | 0.03146 | 1.5623 |
100 | 0.05121 | 0.00724 | 0.00006 | 0.02839 | 0.948 | 0.02389 | 1.4631 | ||
500 | 0.05015 | 0.00317 | 0.00001 | 0.01243 | 0.949 | 0.00278 | 1.3650 | ||
1000 | 0.05003 | 0.00224 | 0.00000 | 0.00877 | 0.950 | 0.00168 | 1.2694 | ||
30% | 0.193 | 60 | 0.05129 | 0.00939 | 0.00008 | 0.03681 | 0.961 | 0.00515 | 1.6872 |
100 | 0.05087 | 0.00721 | 0.00005 | 0.02828 | 0.958 | 0.02296 | 1.5817 | ||
500 | 0.05020 | 0.00318 | 0.00001 | 0.01247 | 0.948 | 0.00652 | 1.4809 | ||
1000 | 0.05003 | 0.00224 | 0.00000 | 0.00879 | 0.951 | 0.00118 | 1.3870 | ||
55% | 0.036 | 60 | 0.05112 | 0.00939 | 0.00009 | 0.03866 | 0.940 | 0.03413 | 2.6871 |
100 | 0.05107 | 0.00763 | 0.00005 | 0.02989 | 0.941 | 0.03246 | 2.3874 | ||
500 | 0.05023 | 0.00335 | 0.00001 | 0.01312 | 0.946 | 0.00296 | 2.2277 | ||
1000 | 0.05007 | 0.00236 | 0.00000 | 0.00924 | 0.946 | 0.00207 | 2.2821 |
Estimation of the parameters for the candidate distributions
Distribution | Parameter | Estimation | SE | CI (95%) | −2 × log [ |
AIC | BIC | AICc | |
---|---|---|---|---|---|---|---|---|---|
Common distribution | Exponential | 0.122 | 0.013 | (0.095, 0.149) | 490.2 | 492.2 | 494.7 | 492.3 | |
Gamma | 0.877 | 0.117 | (0.646, 1.108) | 489.2 | 493.2 | 498.2 | 493.4 | ||
0.106 | 0.020 | (0.067, 0.145) | |||||||
Weibull | 0.853 | 0.063 | (0.728, 0.978) | 485.1 | 489.1 | 494.0 | 489.2 | ||
0.133 | 0.018 | (0.105, 0.182) | |||||||
Lognormal | 1.440 | 0.128 | (1.188, 1.691) | 472.3 | 476.3 | 481.3 | 476.5 | ||
1.176 | 0.093 | (0.992, 1.359) | |||||||
Unconventional distribution | Lévy | −0.111 | 0.108 | (−0.324, 0.101) | 161.1 | 326.3 | 331.3 | 324.3 | |
2.365 | 0.487 | (1.415, 3.316) | |||||||
Standard Lévy | |||||||||
1.994 | 0.301 | (1.403, 2.584) | 323.8 | 325.8 | 328.2 | 325.8 |
SE = Standard error; CI = confidence interval; AIC = Akaike information criterion; BIC = Bayesian information criterion; AICc = corrected AIC.