Multiple regression is a cornerstone of supervised learning in statistics, with countless applications. It is used to identify the relationship between multiple predictors and a response variable. The basic idea extends to the generalized linear model, in which the predictors are related to the response via a link function when the conditional distribution of the response belongs to an exponential family satisfying some regularity conditions; see Nelder and Wedderburn (1972). A straightforward approach in the (generalized) linear model is to use a linear predictor, that is, a linear combination of the predictor variables. However, this approach is often too restrictive in practice, especially when the predictors are related to the mean of the response through a complicated relationship. Nonparametric regression methods have the advantage of uncovering such complex relationships. Examples of nonparametric methods include local polynomial regression, kernel regression, basis expansion methods such as spline and wavelet regression, and so on. One may refer to Tsybakov (2008); Hastie
Many nonparametric estimation methods are known to enjoy good theoretical properties, at least in the asymptotic sense. In finite samples, however, the shape of the domain can make a significant difference in performance. Even for one-dimensional function estimation, extensive research has addressed the impact of domain shape on estimation accuracy. Examples include estimating functions on positive domains (Geenens, 2021; Wright and Zabin, 1994), boundary effects (Müller, 1991), and estimation over the entire real line (Bak
The impact of domain shape on estimation accuracy is also observed within the generalized linear model framework. However, research on estimation methodologies that account for this phenomenon is very limited. Within the nonparametric approach to generalized linear models, the standard approach involves considering a tensor product space. In the popular regression spline model, this corresponds to constructing a tensor product spline basis for estimation. For example, Stone (1994) considered the use of polynomial splines and their tensor products in multivariate function estimation and showed that they lead to desirable statistical properties. However, a drawback of the tensor product spline method is that it implicitly assumes the shape of the domain: tensor product splines assume that the predictors are observed on a rectangular domain. When the domain is irregular and complex, the supports of the tensor product basis functions may fail to partition it appropriately. As a remedy, a nonparametric regression method based on triangulation, which efficiently partitions the domain using triangles, has been developed. Barycentric coordinate functions defined with respect to the resulting triangles form a basis for a space of piecewise polynomial functions over the triangulation. For details concerning triangulation and the barycentric coordinate basis functions, one may refer to Mark Hansen and Sardy (1998); Lai and Schumaker (2007); Jhong
Striking a good balance between bias and variance is a fundamental issue in nonparametric estimation. In the triogram regression model, this comes down to choosing the optimal number and location of the vertices of the triangulation. Ideally, vertices should be densely placed in regions where the regression function fluctuates strongly, while smoothly varying regions should have fewer vertices. If an appropriate triangular partition can be obtained, the estimator can capture local trends in the data without compromising overall smoothness. To this end, Mark Hansen and Sardy (1998) considered stepwise selection of vertices with the use of the Rao (score) statistic for addition and the Wald statistic for deletion. Koenker and Mizera (2004) used a total variation-type penalty in the quantile regression framework. In a similar vein, Jhong
In this study, we investigate the logistic regression model based on a sparse triangulation of a compact domain in ℝ². Despite this promising possibility, there is very little research extending the triogram model to logistic regression. One practical reason is that a large number of vertices can compromise the stability of the algorithm, making it challenging to model complex relationships effectively. To address this issue, we introduce sparsity in the boundary vertices of the triangulation based on the Ramer-Douglas-Peucker (RDP) algorithm (Ramer, 1972; Douglas and Peucker, 1973), and employ the
The rest of the paper is organized as follows. Section 2 reviews the basics of triangulation and the associated barycentric coordinate basis functions, and defines the logistic regression estimator. Section 3 describes the implementation scheme, including the proposed triangulation process and the coordinate descent algorithm. A numerical study including simulation and analysis of the digit data is presented in Section 4. Section 5 summarizes the findings of this study and presents a discussion of possible generalizations of the results.
This section introduces the concepts of triangulation and the tent spline basis, and defines the notation used throughout the paper. For details concerning the triangulation and the corresponding spline space, one may refer to Stone (1994), Mark Hansen and Sardy (1998), Koenker and Mizera (2004), Jhong
Let Ω be a compact region in ℝ². Let △ = {τ₁, …, τ_M} be a triangulation of Ω with vertices v₁, …, v_K, that is, a collection of triangles whose union is Ω and whose pairwise intersections are empty, a common edge, or a common vertex. For each vertex v_j, the tent (barycentric coordinate) basis function B_j is the unique continuous piecewise linear function on △ satisfying

B_j(v_k) = δ_{jk}, for j, k = 1, …, K,

where δ_{jk} denotes the Kronecker delta. Upon obtaining the basis functions, any continuous piecewise linear function f over △ is expressed as

f(x) = ∑_{j=1}^{K} β_j B_j(x), x ∈ Ω,

with coefficient vector β = (β₁, …, β_K)ᵀ. The dimension of the resulting spline space therefore equals the number of vertices, K.
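To make the tent-basis representation concrete, here is a minimal sketch (in Python rather than the paper's R implementation; the function names are ours): the barycentric coordinates of a point with respect to a triangle solve a small linear system, and a piecewise linear function is the coefficient-weighted sum of the tent basis values on the triangle containing the point.

```python
import numpy as np

def barycentric_coords(tri, x):
    """Barycentric coordinates of point x w.r.t. triangle tri (3 vertices).

    Solves [v1 v2 v3; 1 1 1] @ lam = [x; 1], so the coordinates sum to one
    and are all nonnegative exactly when x lies inside the triangle.
    """
    A = np.vstack([np.asarray(tri, float).T, np.ones(3)])
    b = np.append(np.asarray(x, float), 1.0)
    return np.linalg.solve(A, b)

def eval_pl(tri, beta, x):
    """Evaluate f = sum_j beta_j B_j at x on a single triangle; each tent
    basis function restricted to the triangle is a barycentric coordinate."""
    return float(barycentric_coords(tri, x) @ np.asarray(beta, float))

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
lam = barycentric_coords(tri, (0.25, 0.25))  # (0.5, 0.25, 0.25)
```

A point on a vertex yields coordinates (1, 0, 0) up to permutation, which is exactly the interpolation property B_j(v_k) = δ_{jk}.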
The logistic regression model is introduced to deal with the binary classification problem in which the response variable Y takes values in {0, 1}. The probability function is modeled as

p(x) = P(Y = 1 | X = x) = σ(f(x)), x ∈ Ω,

where

σ(t) = exp(t) / {1 + exp(t)}, t ∈ ℝ,

denotes the logistic function. Suppose that we are given a set of data

(x₁, y₁), …, (x_n, y_n) ∈ Ω × {0, 1},

where the x_i are the observed predictors and the y_i are the corresponding binary responses. The sparse triogram probability estimator (STriPE) of p is defined as p̂ = σ(f̂), where f̂ minimizes the negative log-likelihood over the space of continuous piecewise linear functions on the sparse triangulation described in Section 3.
This section summarizes the algorithm for fitting logistic regression based on a given triangulation. We first consider the standard Newton-Raphson algorithm for the logistic regression model. Let G ∈ ℝ^{n×K} be the design matrix with entries G_{ij} = B_j(x_i), let g_i ∈ ℝ^K denote the i-th row of G, and write the objective function as the negative log-likelihood

ℓ(β) = ∑_{i=1}^{n} [ log{1 + exp(g_iᵀβ)} − y_i g_iᵀβ ].

The gradient vector is given by

∇ℓ(β) = Gᵀ{p(β) − y},

where y = (y₁, …, y_n)ᵀ and p(β) = (σ(g₁ᵀβ), …, σ(g_nᵀβ))ᵀ. The Hessian matrix is given by

∇²ℓ(β) = GᵀW(β)G,

where W(β) is the diagonal matrix whose i-th diagonal entry is σ(g_iᵀβ){1 − σ(g_iᵀβ)}. The iterative update formula of the pure Newton-Raphson algorithm is given by

β^(t+1) = β^(t) − {∇²ℓ(β^(t))}⁻¹ ∇ℓ(β^(t)).
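As an illustration of this update (a minimal Python sketch on simulated data, not the paper's R implementation; all names and the synthetic data are ours), a pure Newton-Raphson fit can be written as:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_logistic(G, y, n_iter=25):
    """Pure Newton-Raphson for logistic regression: minimizes the negative
    log-likelihood with gradient G'(p - y) and Hessian G'WG, W = diag(p(1-p))."""
    beta = np.zeros(G.shape[1])
    for _ in range(n_iter):
        p = sigmoid(G @ beta)
        grad = G.T @ (p - y)
        hess = G.T @ (G * (p * (1.0 - p))[:, None])
        beta = beta - np.linalg.solve(hess, grad)  # Newton step
    return beta

# synthetic example: intercept plus one predictor
rng = np.random.default_rng(0)
G = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < sigmoid(G @ np.array([-0.5, 1.5]))).astype(float)
beta_hat = newton_logistic(G, y)
```

At convergence the gradient Gᵀ(p − y) is numerically zero, which is the first-order optimality condition for the maximum likelihood estimate.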
The algorithm summarized above is outlined in Mark Hansen and Sardy (1998). However, it may exhibit instability in practical applications: as the dimension of the spline space increases, the triangles become narrow, leading to convergence issues. A poor condition number of the Gram matrix degrades numerical stability. In response to this issue, we consider a coordinate descent algorithm along with the initialization strategy described in the next subsection. We use a second-order Taylor approximation of the univariate objective function to obtain an update formula. We now consider the coordinate-wise optimization algorithm. Let the current value of the coefficients be β̃.
The univariate objective function in the j-th coordinate is approximated by the second-order expansion

ℓ(β̃ + (β_j − β̃_j)e_j) ≈ ℓ(β̃) + ∂_jℓ(β̃)(β_j − β̃_j) + (1/2) ∂_j²ℓ(β̃)(β_j − β̃_j)²,

where e_j denotes the j-th standard basis vector and

∂_jℓ(β) = ∑_{i=1}^{n} G_{ij}{σ(g_iᵀβ) − y_i}, ∂_j²ℓ(β) = ∑_{i=1}^{n} G_{ij}² σ(g_iᵀβ){1 − σ(g_iᵀβ)}.

To obtain the closed-form solution for the minimizer, we differentiate the above expression with respect to β_j, which gives

∂_jℓ(β̃) + ∂_j²ℓ(β̃)(β_j − β̃_j).

Setting the equation equal to zero, we derive the minimizer

β̂_j = β̃_j − ∂_jℓ(β̃) / ∂_j²ℓ(β̃).
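A minimal Python sketch of this coordinate-wise scheme (our own illustrative code on synthetic data; the paper's implementation is in R) applies one such univariate Newton step per coordinate and sweeps until the coefficients stabilize:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cd_logistic(G, y, n_sweeps=200, tol=1e-8):
    """Coordinate descent for the logistic negative log-likelihood: each
    coordinate is updated as beta_j <- beta_j - (d l/d beta_j)/(d^2 l/d beta_j^2)."""
    n, k = G.shape
    beta = np.zeros(k)
    eta = G @ beta  # linear predictor, kept in sync with beta
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(k):
            p = sigmoid(eta)
            gj = G[:, j]
            grad_j = gj @ (p - y)               # first partial derivative
            hess_j = gj @ (gj * p * (1.0 - p))  # second partial derivative
            step = grad_j / hess_j
            beta[j] -= step
            eta -= step * gj
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

rng = np.random.default_rng(1)
G = np.column_stack([np.ones(300), rng.normal(size=300)])
y = (rng.random(300) < sigmoid(-0.3 + 1.2 * G[:, 1])).astype(float)
beta_hat = cd_logistic(G, y)
```

Only one coordinate of the linear predictor changes per update, so the running value of Gβ can be maintained incrementally, which is what makes coordinate descent cheap per step.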
The coordinate-wise updates are applied cyclically over j = 1, …, K until convergence, where convergence is declared when the change in the coefficient vector falls below a prescribed tolerance.
The RDP algorithm (Ramer, 1972; Douglas and Peucker, 1973), an effective method for polyline simplification, reduces the number of vertices of a given polyline or polygon while preserving its overall shape. The RDP algorithm operates on the principle of recursive division of the line: starting with an ordered set of points and a specified distance threshold ε, it keeps the two endpoints, finds the intermediate point farthest from the segment joining them, and, if that distance exceeds ε, splits the polyline at that point and recurses on the two halves; otherwise all intermediate points are discarded.
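A compact Python implementation of this recursion (an illustrative sketch; the paper itself uses the RDP R package) is:

```python
import math

def rdp(points, eps):
    """Ramer-Douglas-Peucker polyline simplification.

    Keeps the endpoints, finds the point farthest from the chord joining
    them, and recurses on both halves if that distance exceeds eps;
    otherwise every intermediate point is dropped.
    """
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    chord = math.hypot(x2 - x1, y2 - y1)

    def dist(p):
        # perpendicular distance from p to the line through the endpoints
        if chord == 0.0:
            return math.hypot(p[0] - x1, p[1] - y1)
        return abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1)) / chord

    i_max = max(range(1, len(points) - 1), key=lambda i: dist(points[i]))
    if dist(points[i_max]) > eps:
        left = rdp(points[:i_max + 1], eps)
        right = rdp(points[i_max:], eps)
        return left[:-1] + right  # drop the duplicated split point
    return [points[0], points[-1]]

poly = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
simplified = rdp(poly, eps=1.0)  # -> [(0, 0), (2, -0.1), (3, 5), (7, 9)]
```

Larger ε discards more points and yields a sparser boundary, which is exactly the role the threshold plays in the proposed initialization.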
The numerical instability of the triogram method arises primarily when the number of vertices is large. There are several approaches to mitigate this issue and improve estimation accuracy at the same time. Koenker and Mizera (2004) advocated for a total variation-type penalty in the quantile regression framework.
As a remedy, we adopt the RDP algorithm to impose sparsity on the boundary vertices. Previous studies have utilized the convex hull algorithm proposed by Eddy (1977) for initial triangulation; see, for example, Jhong
Since we are dealing with two-dimensional data, one can visually assess whether the sparse triangulation is suitable for the given data. It is usually sufficient to evaluate its adequacy by comparing the sparse representation with a visualization of the data's shape and its convex hull. Although the sparsity parameter
We adopt the k-means clustering algorithm to determine the interior vertices. In the triangulation process, the interior vertices are taken to be the cluster centroids obtained by applying k-means to the observed predictors, where the number of clusters is a user-specified parameter.
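As a hedged illustration (a plain Python stand-in for R's stats::kmeans; the data and names are ours), interior-vertex candidates can be obtained as k-means centroids via Lloyd's algorithm:

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: the returned centroids serve as candidate
    interior vertices for the triangulation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers as cluster means (keep old center if cluster empties)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

# two well-separated clouds; the centroids should land near (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(5.0, 0.3, size=(100, 2))])
centers = kmeans_centroids(X, k=2)
```

Placing interior vertices at centroids concentrates them where observations are dense, which matches the goal of allocating more vertices to regions carrying more data.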
Through numerical experiments, we confirmed that the triangulation strategy, combined with the coordinate descent algorithm presented in Section 3.1, significantly enhances numerical stability and estimation accuracy. The proposed algorithm was implemented using R software, employing the grDevices, stats, and RDP packages. The overall implementation algorithm is summarized in Algorithm 1 below.
Algorithm 1. Implementation of STriPE.

1: Input: data (x₁, y₁), …, (x_n, y_n), RDP threshold ε, number of interior vertices, tolerance tol
2: Load the routines grDevices::chull(), stats::kmeans(), and RDP::RamerDouglasPeucker()
3: Step 1: Triangulation
4: Compute the boundary vertices: apply chull() to the predictors, then simplify the hull with RamerDouglasPeucker() and threshold ε
5: Choose the number of interior clusters
6: Compute the interior vertices: the centroids returned by kmeans()
7: Combine boundary vertices and interior centroids and triangulate the resulting vertex set
8: Compute the design matrix G ∈ ℝ^{n×K} with entries G_{ij} = B_j(x_i)
9: Step 2: Model fitting
10: Initialize the coefficient vector β^(0) and set t ← 0
11: repeat
12:   for j = 1, …, K do
13:     Compute the gradient ∇ℓ(β^(t))
14:     Compute the Hessian ∇²ℓ(β^(t))
15:     Update the j-th coefficient by the coordinate-wise formula
16:   end for
17:   diff = ||β^(t+1) − β^(t)||, t ← t + 1
18: until diff < tol
19: return β̂ and p̂ = σ(∑_j β̂_j B_j)
This section illustrates the advantages of the proposed method based on simulation studies. We consider three probability functions defined on non-rectangular domains. Each function is defined as a logistic transformation of a linear combination of basis functions defined on a triangulation obtained by adding 1, 3, and 2 interior vertices to pre-specified boundary vertices, respectively. The contour plots of the example functions and the domain areas can be seen in Figure 2.
We randomly generated
We consider sample sizes of
where
We apply our method to analyze the handwritten zip code database presented in LeCun
In the left panel of Figure 4, we observe that the data is not distributed in a rectangular shape. We obtained an initial triangulation using the proposed initialization strategy. The user-defined parameter
For
This paper proposed a multivariate nonparametric logistic regression method based on sparse triangulation within a compact domain in ℝ². Sparsity in the boundary vertices of the triangulation was imposed by applying the RDP algorithm to the initial vertices obtained by the convex hull algorithm. The complexity of the estimation method was controlled by the
The results of this paper are expected to provide a foundation for further research. They can be generalized and extended in a few ways. First, we can extend the method to the case of
Second, we can consider combining the proposed methodology with sparsity-inducing penalization. In Jhong
The average MSE values and the standard errors (in parentheses) of STriPE, TPS, CS, and KLR over 50 replicated simulations

Example 1

| Sample Size | STriPE(se) | TPS(se) | CS(se) | KLR(se) |
|---|---|---|---|---|
| | 0.0345(0.0048) | 0.0355(0.0053) | 0.0425(0.0059) | |
| | 0.0305(0.0011) | 0.0116(0.0012) | 0.0281(0.0028) | |
| | 0.0073(0.0004) | 0.0069(0.0004) | 0.0221(0.0015) | |
| | 0.0057(0.0002) | 0.0056(0.0002) | 0.0198(0.0011) | |
| | 0.0046(0.0002) | 0.0043(0.0002) | 0.0177(0.0014) | |

Example 2

| Sample Size | STriPE(se) | TPS(se) | CS(se) | KLR(se) |
|---|---|---|---|---|
| | 0.0377(0.0043) | 0.0400(0.0050) | 0.0916(0.0020) | |
| | 0.0160(0.0009) | 0.0150(0.0007) | 0.0890(0.0012) | |
| | 0.0121(0.0005) | 0.0124(0.0005) | 0.0900(0.0011) | |
| | 0.0098(0.0003) | 0.0099(0.0003) | 0.0877(0.0009) | |
| | 0.0092(0.0004) | 0.0091(0.0003) | 0.0891(0.0010) | |

Example 3

| Sample Size | STriPE(se) | TPS(se) | CS(se) | KLR(se) |
|---|---|---|---|---|
| | 0.0287(0.0045) | 0.0271(0.0043) | 0.0180(0.0007) | |
| | 0.0062(0.0006) | 0.0067(0.0009) | 0.0154(0.0003) | |
| | 0.0047(0.0004) | 0.0050(0.0005) | 0.0146(0.0002) | |
| | 0.0033(0.0003) | 0.0034(0.0003) | 0.0143(0.0001) | |
| | 0.0026(0.0002) | | 0.0141(0.0001) | |

The best value in each row is shown in bold.