Density estimation is the process of constructing an estimate of the probability density function (pdf) from observed data. The estimate not only describes the distribution of the data but also yields summary statistics such as the mean, median, variance, moments, and quantiles. Density estimates also reveal distributional characteristics, including skewness, kurtosis, and multimodality. Estimating the pdf is a fundamental problem in statistics and a widely researched topic. There are two common approaches to density estimation: parametric and nonparametric. The parametric approach assumes that the data are drawn from a known family of distributions, whereas the nonparametric approach estimates the density function directly from the data. Common nonparametric techniques include the histogram, the naïve density estimator, the nearest neighbor method, and the orthogonal series estimator. Kernel density estimation (KDE) is a widely used nonparametric method. The KDE relies on a kernel function, which determines the weight assigned to each data point, and a bandwidth, which controls the smoothness of the estimate. Hence, bandwidth selection is the most crucial step in kernel density estimation.
Bandwidth selection is a critical issue in KDE because the performance of the estimator depends on the chosen bandwidth. A bandwidth that is too small produces an undersmoothed density, while one that is too large produces an oversmoothed density (Gramacki, 2018). Methods for choosing the bandwidth fall into three main categories: rules of thumb (ROT), cross-validation (CV), and plug-in (PI). The plug-in method has been shown to perform well in many cases and is therefore a common first choice in practical applications, although there is still room for improvement in its implementation (Wand and Jones, 1995). Plug-in bandwidths rest on the straightforward idea of substituting estimates of the unknown quantities into the formula for the asymptotically optimal bandwidth.
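The effect of the bandwidth on smoothness can be illustrated with a minimal Python sketch. It assumes a Gaussian kernel and a simulated standard normal sample; the function names `gaussian_kde` and `roughness`, the grid, and the three bandwidth values are illustrative choices, not part of the original study.

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """Evaluate a Gaussian-kernel density estimate on x_grid with bandwidth h."""
    u = (x_grid[:, None] - data[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (data.size * h)

def roughness(y, x_grid):
    """Integrated squared second derivative of y, by finite differences."""
    dx = x_grid[1] - x_grid[0]
    d2 = np.gradient(np.gradient(y, dx), dx)
    return np.sum(d2**2) * dx

rng = np.random.default_rng(0)
data = rng.normal(size=200)            # simulated sample (assumption)
grid = np.linspace(-5.0, 5.0, 1001)

# A small h gives a wiggly (undersmoothed) estimate, a large h a flat one.
for h in (0.05, 0.4, 2.0):
    est = gaussian_kde(grid, data, h)
    print(f"h={h:4.2f}  roughness={roughness(est, grid):.4f}")
```

The printed roughness falls as the bandwidth grows, which is the undersmoothing versus oversmoothing trade-off described above expressed as a single number.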
To address the problem of selecting an optimal bandwidth, this study proposes a new plug-in bandwidth based on the first kind shifted Chebyshev polynomials. The effectiveness of plug-in methods relies on the estimation of integrated squared density derivative functionals, a subject that has been explored by many researchers. Silverman (1986) provided a comprehensive overview of density estimation techniques, including discussions of bandwidth selection and the role of integrated squared density derivative functionals. Sheather and Jones (1991) discussed a data-driven method for bandwidth selection in kernel density estimation, which is related to integrated squared density derivative functionals. Raykar and Duraiswami (2006) developed algorithms for estimating density derivatives using the univariate Gaussian kernel; these algorithms are used to calculate the optimal bandwidth for kernel density estimation. Tenreiro (2011, 2020) proposed direct plug-in bandwidths for the KDE based on the Fourier series and the Hermite series. More recently, Dharmani (2022) introduced a bandwidth selector based on a near-Gaussian assumption, which allows the Gram-Charlier A series to be used as an approximation to the density for the purpose of estimating its derivative. The objective of this paper is to derive a bandwidth that uses the first kind shifted Chebyshev polynomials as an approximation to the density function in order to estimate the integrated squared density derivative functionals.
The remainder of this article is organized as follows. Section 2 reviews the fundamental properties of kernel density estimation. Section 3 discusses several bandwidth selection methods: the least squares cross-validation bandwidth, an improved rule-of-thumb bandwidth, and the Sheather and Jones plug-in bandwidth. Section 4 briefly defines the first kind shifted Chebyshev polynomials, uses them to approximate the underlying density function, and presents the proposed plug-in bandwidth based on this approximation. Section 5 presents a simulation study, carried out in the R programming language, that examines the performance of the proposed bandwidth under different distributions and sample sizes. Section 6 applies the proposed bandwidth to a real dataset. Finally, Section 7 concludes the article with a summary of the findings.
The kernel density estimator for a random sample
where
The kernel function
In practice, it is common to consider a global error criterion that measures the distance between the estimated density function
where
The assumptions are
where
Several methods are available for determining an appropriate bandwidth for KDE. The three main types are cross-validation (CV), rules of thumb (ROT), and plug-in (PI) (Gramacki, 2018). Cross-validation includes least squares cross-validation (LSCV), biased cross-validation (BCV), and smoothed cross-validation (SCV). Rules of thumb include Silverman’s rule of thumb and its improved version. Plug-in methods have been explored by various authors, including Park and Marron (1990), Sheather and Jones (1991), and Hall
A well-known method for selecting the bandwidth is least squares cross-validation (LSCV), proposed by Rudemo (1982) and Bowman (1984). The main objective is to find the optimal bandwidth
(Silverman, 1986; Wand and Jones, 1995).
The first term ∫
where
The second term ∫
where
Finally, the last term ∫
As a result,
(Härdle
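As an illustration of least squares cross-validation, the following Python sketch evaluates the standard LSCV criterion (the integral of the squared estimate minus twice the average leave-one-out density) and minimizes it over a grid. It assumes a Gaussian kernel, for which the convolution of the kernel with itself is the N(0, 2) density; the function names, the simulated sample, and the search grid are illustrative assumptions.

```python
import numpy as np

def lscv_score(h, data):
    """LSCV criterion for a Gaussian kernel at bandwidth h."""
    n = data.size
    d = (data[:, None] - data[None, :]) / h
    # The Gaussian kernel convolved with itself is the N(0, 2) density.
    conv = np.exp(-0.25 * d**2) / (2.0 * np.sqrt(np.pi))
    kern = np.exp(-0.5 * d**2) / np.sqrt(2.0 * np.pi)
    int_fhat_sq = conv.sum() / (n**2 * h)
    # Leave-one-out average: drop the diagonal (i == j) terms.
    loo = (kern.sum() - np.trace(kern)) / (n * (n - 1) * h)
    return int_fhat_sq - 2.0 * loo

def lscv_bandwidth(data, grid):
    """Return the grid value that minimizes the LSCV criterion."""
    scores = np.array([lscv_score(h, data) for h in grid])
    return grid[np.argmin(scores)]

rng = np.random.default_rng(1)
sample = rng.normal(size=300)              # simulated sample (assumption)
h_grid = np.linspace(0.05, 1.5, 146)       # search grid (assumption)
h_lscv = lscv_bandwidth(sample, h_grid)
print(f"LSCV bandwidth: {h_lscv:.3f}")
```

A grid search is used here for transparency; a numerical optimizer could be substituted, although the LSCV criterion can have several local minima.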
The optimal bandwidth minimizes AMISE
where
The rule-of-thumb bandwidth is determined by replacing the density function
The rule-of-thumb bandwidth is sensitive to outliers, which cause an overestimation of
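The sensitivity to outliers can be seen in a short Python sketch. It assumes the normal-reference constant 1.06 for the Gaussian kernel and, for the improved version, Silverman's (1986) robust variant that replaces the standard deviation with the smaller of the standard deviation and IQR/1.34; whether this matches the exact improved rule used in this article is an assumption. The data are simulated.

```python
import numpy as np

def rot_bandwidth(data):
    """Normal-reference rule of thumb: 1.06 * sigma * n^(-1/5)."""
    return 1.06 * np.std(data, ddof=1) * data.size ** (-0.2)

def improved_rot_bandwidth(data):
    """Robust variant: 0.9 * min(sigma, IQR / 1.34) * n^(-1/5)."""
    sigma = np.std(data, ddof=1)
    q75, q25 = np.percentile(data, [75, 25])
    spread = min(sigma, (q75 - q25) / 1.34)
    return 0.9 * spread * data.size ** (-0.2)

rng = np.random.default_rng(2)
clean = rng.normal(size=100)
contaminated = np.concatenate([clean, [25.0, -30.0]])  # two gross outliers

print("rule of thumb:", rot_bandwidth(clean), rot_bandwidth(contaminated))
print("improved rule:", improved_rot_bandwidth(clean),
      improved_rot_bandwidth(contaminated))
```

The two outliers inflate the sample standard deviation and with it the first bandwidth, while the interquartile range, and hence the robust bandwidth, is nearly unchanged.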
The concept of plug-in bandwidth was originally introduced by Woodroofe (1970). This concept is based on the idea of using an optimal bandwidth that minimizes AMISE(
The pilot bandwidth for the estimation of
where
where
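The plug-in idea can be sketched in Python as follows. This is a simplified one-stage direct plug-in rule in the style of Wand and Jones (1995), not the solve-the-equation procedure of Sheather and Jones (1991): the pilot bandwidth is chosen under a normal-scale assumption for the sixth-order functional, the integrated squared second derivative is then estimated with a Gaussian kernel, and the result is substituted into the AMISE-optimal formula. All function names and the simulated data are illustrative assumptions.

```python
import numpy as np

SQRT_PI = np.sqrt(np.pi)

def psi4_hat(data, g):
    """Kernel estimate of the integral of f''(x)^2 (Gaussian kernel, pilot g)."""
    n = data.size
    u = (data[:, None] - data[None, :]) / g
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    phi4 = (u**4 - 6.0 * u**2 + 3.0) * phi   # fourth derivative of the Gaussian
    return phi4.sum() / (n**2 * g**5)

def dpi_bandwidth(data):
    """One-stage direct plug-in bandwidth for the Gaussian kernel."""
    n = data.size
    sigma = np.std(data, ddof=1)
    # Normal-scale value of psi_6; an assumption used only to pick the pilot.
    psi6 = -15.0 / (16.0 * SQRT_PI * sigma**7)
    k4_zero = 3.0 / np.sqrt(2.0 * np.pi)     # fourth kernel derivative at 0
    g = (-2.0 * k4_zero / (psi6 * n)) ** (1.0 / 7.0)
    r_kernel = 1.0 / (2.0 * SQRT_PI)         # R(K); mu_2(K) = 1 for this kernel
    return (r_kernel / (psi4_hat(data, g) * n)) ** 0.2

rng = np.random.default_rng(3)
sample = rng.normal(size=500)                # simulated sample (assumption)
h_dpi = dpi_bandwidth(sample)
print(f"direct plug-in bandwidth: {h_dpi:.3f}")
```

For normal data the result should lie close to the normal-reference value 1.06 σ n^(-1/5), since both rules target the same asymptotically optimal bandwidth.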
Bandwidth selection in KDE using the AMISE criterion involves estimating the second-order derivative of the unknown density. The first kind shifted Chebyshev series expansion can be used to approximate an unknown density function. This section covers the necessary background on the first kind shifted Chebyshev polynomials and derives the proposed bandwidth.
The first kind Chebyshev polynomials of degree
In order to use the first kind Chebyshev polynomials on a finite range [
Afterward, the first kind shifted Chebyshev polynomials are generated by
In the context of the interval [
where the coefficients are defined via the formula
where
are the Chebyshev zero nodes.
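As a sketch of how such an expansion can be computed, the Python code below obtains the coefficients from function values at the Chebyshev zeros mapped to a finite interval, using the discrete orthogonality of the first kind Chebyshev polynomials. The convention assumed here writes the series as c_0 T_0 + c_1 T_1 + ..., with the leading coefficient already halved, which may differ from the normalization used in this article. The function names and the test function `exp` are illustrative.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def shifted_cheb_coeffs(func, a, b, n_terms):
    """Coefficients c_0..c_{N-1} of the shifted Chebyshev expansion of func
    on [a, b], computed from values at the N Chebyshev zeros."""
    k = np.arange(1, n_terms + 1)
    theta = (2 * k - 1) * np.pi / (2 * n_terms)
    t = np.cos(theta)                        # zeros of T_N on [-1, 1]
    x = 0.5 * (b - a) * t + 0.5 * (a + b)    # the same nodes mapped to [a, b]
    fx = func(x)
    j = np.arange(n_terms)
    # T_j(t_k) = cos(j * theta_k); discrete orthogonality gives the coefficients.
    coeffs = (2.0 / n_terms) * (np.cos(np.outer(j, theta)) @ fx)
    coeffs[0] *= 0.5
    return coeffs

def shifted_cheb_eval(coeffs, a, b, x):
    """Evaluate the shifted series at x by mapping [a, b] back to [-1, 1]."""
    t = (2.0 * np.asarray(x) - (a + b)) / (b - a)
    return cheb.chebval(t, coeffs)

c = shifted_cheb_coeffs(np.exp, 0.0, 2.0, 12)
xs = np.linspace(0.0, 2.0, 101)
max_err = np.max(np.abs(shifted_cheb_eval(c, 0.0, 2.0, xs) - np.exp(xs)))
print(f"maximum approximation error: {max_err:.2e}")
```

In the density setting the function values at the nodes are not known exactly and would have to be replaced by estimates; the smooth test function is used here only to check the mechanics.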
The integration of the squared function of the second-order derivative of the first kind shifted Chebyshev series expansion
By substituting the values of
The bandwidth
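The step from a Chebyshev series to a bandwidth can be sketched in Python with NumPy's `Chebyshev` class, which handles the mapping between the interval [a, b] and [-1, 1] when differentiating and integrating. The sketch assumes a Gaussian kernel, so that R(K) = 1/(2√π) and μ₂(K) = 1, and uses the standard normal density as a known stand-in for the unknown density purely to check the machinery; the degree, interval, and function names are illustrative assumptions rather than the settings of this article.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def roughness_from_series(series):
    """Integral over the series domain of the squared second derivative."""
    a, b = series.domain
    second = series.deriv(2)
    antiderivative = (second * second).integ()
    return antiderivative(b) - antiderivative(a)

def amise_bandwidth(roughness, n):
    """h = [R(K) / (mu_2(K)^2 * R(f'') * n)]^(1/5) for the Gaussian kernel."""
    r_kernel = 1.0 / (2.0 * np.sqrt(np.pi))   # R(K); mu_2(K) = 1
    return (r_kernel / (roughness * n)) ** 0.2

def normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Interpolate the stand-in density at Chebyshev points on [-6, 6] (assumption).
series = Chebyshev.interpolate(normal_pdf, 50, domain=[-6.0, 6.0])
r_hat = roughness_from_series(series)
h = amise_bandwidth(r_hat, n=100)
print(f"R(f'') = {r_hat:.5f}, h = {h:.4f}")
```

For the standard normal density the exact value of the functional is 3/(8√π), so the resulting bandwidth should reproduce the familiar normal-reference value (4/(3n))^(1/5).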
In this section, the aim is to evaluate the performance of the proposed bandwidth
The main idea is to find the optimal number of terms in the expansion (
The simulation results evaluate the performance of the plug-in bandwidth by finding the bandwidth that minimizes the mean integrated squared error, MISE (
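A Monte Carlo estimate of the MISE for a given bandwidth rule can be sketched in Python as follows. The sketch assumes a Gaussian kernel, a standard normal target density, 200 replications, and the normal-reference rule as the bandwidth selector; these are stand-ins chosen for illustration and do not reproduce the test densities, replication counts, or selectors of the simulation study, which was carried out in R.

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """Evaluate a Gaussian-kernel density estimate on x_grid with bandwidth h."""
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def normal_reference(sample):
    """Stand-in bandwidth rule: 1.06 * sigma * n^(-1/5)."""
    return 1.06 * np.std(sample, ddof=1) * sample.size ** (-0.2)

def monte_carlo_mise(bandwidth_rule, n, n_rep=200, seed=0):
    """Average integrated squared error over n_rep simulated samples of size n."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-5.0, 5.0, 501)
    dx = grid[1] - grid[0]
    truth = normal_pdf(grid)
    ise = np.empty(n_rep)
    for r in range(n_rep):
        sample = rng.normal(size=n)
        est = gaussian_kde(grid, sample, bandwidth_rule(sample))
        ise[r] = np.sum((est - truth) ** 2) * dx   # Riemann-sum ISE
    return ise.mean()

mise_25 = monte_carlo_mise(normal_reference, n=25)
mise_200 = monte_carlo_mise(normal_reference, n=200)
print(f"MISE estimate, n=25: {mise_25:.5f}; n=200: {mise_200:.5f}")
```

Passing a different function as `bandwidth_rule` allows several selectors to be compared on the same simulated samples, since the seed fixes the random draws.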
In this section, kernel density estimation is applied to real datasets. The performance of the proposed bandwidth
A real dataset named “flywheels” from Anderson-Cook (1999), comprising 60 observations on flywheel imbalance angles, is used. The analysis focuses on how different bandwidth choices influence kernel density estimates, which, together with histograms, serve as tools for understanding the data distribution. Figure 2 displays a histogram of this dataset with 14 bins. The density appears to be asymmetric and bimodal. The kernel density estimate, using different bandwidth options such as
The direct plug-in method is a common first choice for bandwidth selection in kernel density estimation, but it leaves room for improvement. Computing the bandwidth requires the integral of the squared second-order derivative of the unknown density. This article introduces a bandwidth selection technique by incorporating the estimation of
The simulation studies revealed that the first kind shifted Chebyshev series-based plug-in bandwidth (
The authors would like to thank Kasetsart University and Rajamangala University of Technology Rattanakosin for their support.
MISE (
Density | n | | | | |
---|---|---|---|---|---|
Density #1 | 25 | 10.7819 | 4.8937 | 7.7200 | |
Density #1 | 50 | 4.4220 | 2.7017 | 3.7667 | |
Density #1 | 100 | 2.5376 | 1.5025 | 1.8990 | |
Density #1 | 150 | 1.8510 | 1.1071 | 1.3038 | |
Density #1 | 200 | 1.2925 | 0.9060 | 1.0161 | |
Density #2 | 25 | 15.8366 | 9.6809 | 14.2788 | |
Density #2 | 50 | 8.0684 | 5.2273 | 6.4727 | |
Density #2 | 100 | 4.3664 | 2.9407 | 3.3472 | |
Density #2 | 150 | 3.2559 | 2.2902 | 2.5283 | |
Density #2 | 200 | 2.7399 | 2.0048 | 2.1457 | |
Density #3 | 25 | 129.2468 | 154.7279 | 90.0448 | |
Density #3 | 50 | 62.6229 | 143.2790 | 64.9015 | |
Density #3 | 100 | 36.2592 | 131.1196 | 46.5380 | |
Density #3 | 150 | 24.5300 | 125.3598 | 35.7687 | |
Density #3 | 200 | 19.9220 | 117.4577 | 30.3761 | |
Density #4 | 25 | 219.2996 | 172.4827 | 136.0272 | |
Density #4 | 50 | 89.2615 | 136.3549 | 74.9258 | |
Density #4 | 100 | 39.6160 | 114.3767 | 41.9968 | |
Density #4 | 150 | 27.6475 | 98.8623 | 28.8491 | |
Density #4 | 200 | 22.6444 | 90.6911 | 23.0213 | |
Density #5 | 25 | 827.9669 | 471.6216 | 753.9018 | |
Density #5 | 50 | 366.6813 | 213.6145 | 276.7979 | |
Density #5 | 100 | 197.7864 | 120.3047 | 143.0313 | |
Density #5 | 150 | 139.2668 | 91.8233 | 103.4045 | |
Density #5 | 200 | 110.0786 | 76.4904 | 84.0808 | |
Density #6 | 25 | 7.9374 | 3.9941 | 6.1431 | |
Density #6 | 50 | 4.6343 | 3.0836 | 2.6055 | |
Density #6 | 100 | 2.3224 | 1.7155 | 1.7884 | |
Density #6 | 150 | 1.7449 | 1.3714 | 1.3327 | |
Density #6 | 200 | 1.4163 | 1.1692 | 1.1060 | |
Density #7 | 25 | 16.2093 | 18.0853 | 8.5167 | |
Density #7 | 50 | 8.0472 | 14.4206 | 4.7440 | |
Density #7 | 100 | 3.8401 | 11.2955 | 2.8610 | |
Density #7 | 150 | 2.9414 | 9.5903 | 2.1900 | |
Density #7 | 200 | 2.3237 | 8.4671 | 1.7893 | |
Density #8 | 25 | 12.2326 | 6.2189 | 8.5853 | |
Density #8 | 50 | 6.9255 | 4.0860 | 4.6909 | |
Density #8 | 100 | 3.5754 | 2.9847 | 2.7308 | |
Density #8 | 150 | 2.6956 | 2.4547 | 2.1257 | |
Density #8 | 200 | 2.2223 | 2.1839 | 1.7520 | |
Density #9 | 25 | 10.4642 | 4.1865 | 5.6758 | |
Density #9 | 50 | 4.2284 | 2.9857 | 3.1319 | |
Density #9 | 100 | 2.6069 | 2.1917 | 2.0087 | |
Density #9 | 150 | 1.8859 | 1.8096 | 1.5154 | |
Density #9 | 200 | 1.5083 | 1.5946 | 1.2736 | |
Density #10 | 25 | 36.8391 | 23.5191 | 26.6129 | |
Density #10 | 50 | 26.1360 | 21.0004 | 21.4497 | |
Density #10 | 100 | 19.5172 | 20.0515 | 19.5826 | |
Density #10 | 150 | 14.3548 | 19.3202 | 18.6060 | |
Density #10 | 200 | 10.8535 | 18.9143 | 17.9090 | |
Density #11 | 25 | 8.0731 | 4.5490 | 6.4799 | |
Density #11 | 50 | 4.9207 | 3.4638 | 3.0759 | |
Density #11 | 100 | 2.7689 | 2.2636 | 2.2999 | |
Density #11 | 150 | 2.2054 | 1.8821 | 1.8253 | |
Density #11 | 200 | 1.8125 | 1.6684 | 1.5879 | |
Density #12 | 25 | 16.7257 | 9.6699 | 12.1060 | |
Density #12 | 50 | 10.4177 | 8.1942 | 8.6793 | |
Density #12 | 100 | 7.8214 | 7.5209 | 7.3742 | |
Density #12 | 150 | 6.0406 | 7.1254 | 6.6118 | |
Density #12 | 200 | 4.8533 | 6.7446 | 5.9887 | |
Density #13 | 25 | 10.6791 | 5.8031 | 7.6241 | |
Density #13 | 50 | 6.4439 | 4.3061 | 4.7116 | |
Density #13 | 100 | 4.0942 | 3.4850 | 3.3782 | |
Density #13 | 150 | 3.3745 | 3.0413 | 2.8025 | |
Density #13 | 200 | 3.0361 | 2.8499 | 2.6000 | |
Density #14 | 25 | 32.0359 | 24.3535 | 15.4503 | |
Density #14 | 50 | 15.4430 | 22.3324 | 11.9413 | |
Density #14 | 100 | 10.0967 | 20.7340 | 9.5673 | |
Density #14 | 150 | 7.9991 | 19.4951 | 8.1350 | |
Density #14 | 200 | 6.4908 | 18.5153 | 7.1848 | |
Density #15 | 25 | 20.8581 | 28.7160 | 19.9445 | |
Density #15 | 50 | 14.3004 | 27.7959 | 12.0152 | |
Density #15 | 100 | 10.2774 | 26.6659 | 8.6503 | |
Density #15 | 150 | 8.5594 | 25.5584 | 7.5136 | |
Density #15 | 200 | 7.1760 | 24.7048 | 6.7325 | |
MSE (
Bandwidth | MSE |
---|---|
1.9152 | |
2.0158 | |
1.6829 | |