Some efficient ratio-type exponential estimators using the Robust regression’s Huber M-estimation function
Communications for Statistical Applications and Methods 2024;31:291-308
Published online May 31, 2024
© 2024 Korean Statistical Society.

Vinay Kumar Yadav1,a,b, Shakti Prasadc

aDepartment of Basic and Applied Science, National Institute of Technology Arunachal Pradesh, India;
bDepartment of Mathematics, School of Computational and Applied Sciences, Brainware University, India;
cDepartment of Mathematics, National Institute of Technology Jamshedpur, India
Correspondence to: 1 Department of Basic & Applied Science, National Institute of Technology, Arunachal Pradesh, Jote, Papum pare-791113, India. E-mail: vkyadavbhu@gmail.com
Received June 16, 2023; Revised September 22, 2023; Accepted December 16, 2023.
Abstract
The current article discusses ratio type exponential estimators for estimating the mean of a finite population in sample surveys. The estimators use the robust regression Huber M-estimation function, and their bias and mean squared error expressions are derived. They are compared with the Kadilar, Candan, and Cingi (Hacet J Math Stat, 36, 181–188, 2007) estimators, and the circumstances under which the suggested estimators perform better than the competing estimators are discussed. Five population datasets, each containing a well-recognized outlier, are used in extensive numerical and simulation-based studies that carefully assess and validate the theoretical results reported here. The proposed estimators are intended to improve both the efficiency and the accuracy of estimating the mean of a finite population, so the results obtained from statistical analyses become more reliable and precise.
Keywords : ratio type exponential estimator, mean squared error (MSE), Huber M function, Robust regression, auxiliary variable, percent relative efficiency
1. Introduction

In statistics, estimation is the process of inferring an unknown population parameter from sample data; determining the population mean from a sample is a frequent example. One method of estimation is the point estimator, a single value used to estimate the population parameter. Point estimators are sensitive to outliers and other unusual observations in the data, which may result in unreliable or biased estimates.

A ratio estimator, which employs the ratio of two variables present in the sample data to estimate the ratio of the associated population parameters, is one approach to overcoming this issue. Whenever the correlation coefficient between the study variable and an auxiliary variable is positive, the two variables tend to vary together, and this relationship may be exploited to increase estimation accuracy. By adjusting the estimate of the study variable with the auxiliary variable, a more precise and efficient estimate of the population mean may be achieved.

To improve the performance of the ratio estimator further, information about the auxiliary variable may be integrated into the estimation process. For example, the estimator may be adjusted for the auxiliary variable's variability relative to its mean using the coefficient of variation, or for its distributional shape using the coefficient of kurtosis. The ratio estimator can thus be tailored to the particular properties of the data, which can yield even more precise and efficient estimates of the unknown population parameter. Overall, ratio estimators are an effective method for estimating population characteristics from sample observations, and they are particularly useful when there is a strong positive correlation between the study and auxiliary variables. Nevertheless, as with any estimation technique, it is important to keep in mind the potential restrictions and underlying assumptions that may apply.

Several statisticians have explored the potential benefits of incorporating information on the auxiliary variable to improve the performance of ratio estimators.

A number of studies have made use of this approach, for example Kadilar and Cingi (2004), Kadilar et al. (2007), Noor-ul-Amin et al. (2016, 2018, 2022), Prasad (2020), Zaman (2020, 2021), and Zaman and Kadilar (2021a, 2021b). By using additional information from the auxiliary variables, these investigations increase the precision and accuracy of the ratio estimators, with significant consequences for the validity and precision of statistical research. Robust regression methods are a well-established technique for handling outliers in datasets: they were created to reduce the impact of outliers on regression analysis while still taking a significant portion of the data points into account. There are several other strategies for dealing with outliers, including transforming the data, Winsorization, and identifying and removing outliers; the methodology used depends on the particular features of the data and the objectives of the study.

The L1 criterion for this purpose was initially proposed by Edgeworth (1887). The most widely employed M-estimator was subsequently suggested by Huber (1973). Compared to the least squares (LS) estimator, the Huber M-estimator has the advantage of being far less sensitive to outliers; it is therefore a more reliable approach than LS estimation when outliers may be present, and is frequently used in such circumstances. The Huber M-estimator makes use of the function ρ(ε), which strikes a balance between ε2 and |ε|. Here, ε refers to the error term of the linear regression model y = a + bx + ε, and a is the model constant. The Huber ρ(ε) function is defined as:

$$\rho(\varepsilon)=\begin{cases}\dfrac{\varepsilon^{2}}{2}, & -l\le\varepsilon\le l,\\[4pt] l|\varepsilon|-\dfrac{l^{2}}{2}, & \varepsilon<-l\ \text{or}\ \varepsilon>l.\end{cases}$$

The tuning constant l is a parameter that affects the level of robustness of the estimator used in statistical analysis. By adjusting the value of l, one can control the sensitivity of the estimator to outliers and other unusual observations in the data. A larger value of l makes the estimator more robust, while a smaller value of l makes the estimator more sensitive to outliers. The appropriate value of l depends on the specific characteristics of the data and the goals of the analysis, and may need to be determined through experimentation or other means.
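The role of the tuning constant l can be sketched as follows (a minimal Python illustration of the ρ function, not the authors' code; the function name is ours):

```python
def huber_rho(eps, l=1.5):
    """Huber rho function: quadratic inside [-l, l], linear outside."""
    if -l <= eps <= l:
        return eps**2 / 2            # behaves like least squares for small errors
    return l * abs(eps) - l**2 / 2   # grows only linearly for large errors

# A small error is penalised quadratically, a large one only linearly,
# so a single gross outlier cannot dominate the overall loss.
print(huber_rho(0.5))   # 0.125
print(huber_rho(10.0))  # 13.875
```

Lowering l moves the quadratic-to-linear switch closer to zero, down-weighting more observations and making the fit more robust; raising l makes the loss approach ordinary least squares.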

The formula l = 1.5σ̂ was developed by Huber (1981), where σ̂ is an estimate of the standard deviation σ of the population's random errors. For additional details on the constant l and on M-estimators, see Rousseeuw and Leroy (1987). The robust regression coefficient β_hm is obtained by minimising

$$\sum_{i=1}^{n}\rho\left(y_{i}-a-bx_{i}\right)$$

with respect to a and b.
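This minimisation has no closed form; in practice it is solved by iteratively reweighted least squares (IRLS). The sketch below is a simplified Python analogue of what MASS::rlm does in R; the function name, the MAD-based scale estimate, and the iteration count are our assumptions, not the paper's specification:

```python
import numpy as np

def huber_m_fit(x, y, l=1.5, iters=50):
    """Fit y = a + b*x by IRLS with Huber weights (illustrative sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]   # start from least squares
    for _ in range(iters):
        r = y - a - b * x
        # robust scale estimate via the median absolute deviation (MAD)
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r / s)
        w = np.where(u <= l, 1.0, l / np.maximum(u, 1e-12))  # Huber weights
        W = w[:, None]
        a, b = np.linalg.solve(X.T @ (W * X), X.T @ (w * y))
    return a, b

# One gross outlier barely moves the Huber fit, unlike least squares.
x = np.arange(20.0)
y = 2 + 3 * x
y[-1] += 100          # contaminate the last observation
a, b = huber_m_fit(x, y)
```

Observations with standardised residuals inside [-l, l] get full weight; larger residuals are down-weighted in proportion to their size, which is exactly the behaviour implied by the ρ function above.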

Huber (1981) developed the M-estimation method within the robust regression framework to deal with outliers. Kadilar et al. (2007) furthered this approach by developing ratio estimators that incorporate Huber's M-estimator, which has been shown to deliver accurate and reliable results in the presence of outliers. Essentially, these methods suppress the effects of outliers, improving the accuracy and robustness of regression models. To lessen the negative impact of outlying observations, this paper explores employing the Huber M-estimation function in ratio type exponential estimators.

For more on robust regression, see the quantreg package (Koenker, 2009) in R-Software (2021), Zaman and Bulut (2019, 2021), Zaman et al. (2021, 2022), and Bulut and Zaman (2022).

In Section 2, we consider the existing estimators using robust regression. In Section 3, we discuss new ratio type exponential estimators based on the Huber M-estimation function, as well as their MSEs. Section 4 provides efficiency comparisons of the existing and considered estimators based on the expression of MSEs. Sections 5 and 6 present the results of the numerical illustration and simulation study, respectively. In the final section, we draw a conclusion based on these results.

2. Existing ratio estimators

Kadilar et al. (2007) proposed the ratio estimators ȳcki (i = 1, 2, 3, 4, 5) for estimating the finite population mean Ȳ using robust regression, given as

$$\bar{y}_{ck1}=\frac{\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})}{\bar{x}}\,\bar{X},\qquad
\bar{y}_{ck2}=\frac{\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})}{\bar{x}+C_{x}}\,(\bar{X}+C_{x}),$$

$$\bar{y}_{ck3}=\frac{\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})}{\bar{x}+\beta_{2}(x)}\,\left(\bar{X}+\beta_{2}(x)\right),\qquad
\bar{y}_{ck4}=\frac{\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})}{\bar{x}\beta_{2}(x)+C_{x}}\,\left(\bar{X}\beta_{2}(x)+C_{x}\right),$$

$$\bar{y}_{ck5}=\frac{\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})}{\bar{x}C_{x}+\beta_{2}(x)}\,\left(\bar{X}C_{x}+\beta_{2}(x)\right),$$

where Cx and β2(x) are the auxiliary variable's population coefficient of variation and coefficient of kurtosis, respectively; ȳ and x̄ are the sample means of the study and auxiliary variables, respectively; and the population mean X̄ is assumed known. In robust regression, the Huber M-estimation function is utilised to calculate β_hm.

Using a first-degree approximation, the MSEs of the estimators (1)–(5) can be calculated as follows:

$$\mathrm{MSE}\left(\bar{y}_{cki}\right)=\frac{1-f}{n}\left(R_{cki}^{2}S_{x}^{2}+2\beta_{hm}R_{cki}S_{x}^{2}+\beta_{hm}^{2}S_{x}^{2}-2R_{cki}S_{xy}-2\beta_{hm}S_{xy}+S_{y}^{2}\right),$$

where i = 1, 2, 3, 4, 5; f = n/N; n is the sample size; N is the population size;

Rck1 = Ȳ/X̄, Rck2 = Ȳ/(X̄ + Cx), Rck3 = Ȳ/(X̄ + β2(x)), Rck4 = Ȳβ2(x)/(X̄β2(x) + Cx), Rck5 = ȲCx/(X̄Cx + β2(x)) are the population ratios; the variances of the study and auxiliary variables are Sy2 and Sx2, respectively, while the covariance between the study and auxiliary variables is Sxy.
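As an illustration, the five estimators above can be computed from the sample summaries as follows (a Python sketch of equations (1)–(5), not the authors' code; the robust slope beta_hm would come from a Huber M fit and is simply an input here):

```python
def kadilar_estimators(ybar, xbar, Xbar, beta_hm, Cx, b2x):
    """Kadilar et al. (2007) ratio estimators; b2x stands for beta_2(x)."""
    core = ybar + beta_hm * (Xbar - xbar)   # regression-adjusted sample mean
    return {
        "ck1": core / xbar * Xbar,
        "ck2": core / (xbar + Cx) * (Xbar + Cx),
        "ck3": core / (xbar + b2x) * (Xbar + b2x),
        "ck4": core / (xbar * b2x + Cx) * (Xbar * b2x + Cx),
        "ck5": core / (xbar * Cx + b2x) * (Xbar * Cx + b2x),
    }

# With beta_hm = 0, ck1 reduces to the classical ratio estimator ybar * Xbar / xbar.
est = kadilar_estimators(ybar=10.0, xbar=5.0, Xbar=6.0, beta_hm=0.0, Cx=0.4, b2x=3.0)
```

All five share the same regression-adjusted numerator and differ only in how Cx and β2(x) shift or scale the ratio adjustment.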

3. Mathematical formulation of suggested ratio type exponential estimators

In recent years, the use of robust statistical approaches in sampling studies and finite population mean estimation has received a great deal of attention. The need to improve the efficiency and accuracy of estimating the population mean in the presence of outliers was the primary motivation for the development and study of the ratio type exponential estimators discussed in this article. Sample surveys are an important method for drawing conclusions about finite populations; however, when outliers are present in the data, they can have a disproportionate impact on traditional estimators and compromise their reliability. This limitation has motivated the investigation of robust estimation strategies.

The suggested estimators are based on the Huber M-estimation function of robust regression. The objective was to develop estimators that can handle the disruptive impact of outliers while providing more accurate estimates of the population mean. By utilising the adaptive characteristics of Huber M-estimation, we aim to reduce the biases and inefficiencies that conventional estimators suffer in the presence of unusual data points.

In this paper, we derive mathematical expressions for the bias and the mean squared error of the suggested estimators, which enable a thorough evaluation of their performance. We contrast their performance with that of the estimators developed by Kadilar, Candan, and Cingi (Hacet J Math Stat, 36, 181–188, 2007) in order to assess their efficacy; such a comparison is necessary to determine the scenarios in which the recommended estimators outperform current techniques. Furthermore, because real-world applicability is essential, we undertake numerical and simulation-based investigations employing five population datasets, each of which contains an outlier. These empirical studies support our theoretical conclusions and offer insight into how the suggested estimators behave in practice.

Inspired by the work of Kadilar et al. (2007) and Prasad (2020), we present ratio type exponential estimators employing the Huber (1981) M function. The suggested estimators can produce efficient results even when there are outliers. The following estimators are recommended for estimating the population mean:

$$\bar{y}_{sv1}=\left[\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})\right]\exp\left[\frac{\bar{X}-\bar{x}}{\bar{X}+\bar{x}}\right],$$

$$\bar{y}_{sv2}=\left[\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})\right]\exp\left[\frac{\bar{X}-\bar{x}}{(\bar{X}+\bar{x})+2C_{x}}\right],$$

$$\bar{y}_{sv3}=\left[\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})\right]\exp\left[\frac{\bar{X}-\bar{x}}{(\bar{X}+\bar{x})+2\beta_{2}(x)}\right],$$

$$\bar{y}_{sv4}=\left[\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})\right]\exp\left[\frac{\beta_{2}(x)(\bar{X}-\bar{x})}{\beta_{2}(x)(\bar{X}+\bar{x})+2C_{x}}\right],$$

$$\bar{y}_{sv5}=\left[\bar{y}+\hat{\beta}_{hm}(\bar{X}-\bar{x})\right]\exp\left[\frac{C_{x}(\bar{X}-\bar{x})}{C_{x}(\bar{X}+\bar{x})+2\beta_{2}(x)}\right].$$

To calculate the mean squared errors (MSE) of the suggested estimators ȳsv1, ȳsv2, ȳsv3, ȳsv4 and ȳsv5 up to the first order of approximation, we use the following transformations:

$$\bar{y}=\bar{Y}(1+\varepsilon_{0}),\qquad \bar{x}=\bar{X}(1+\varepsilon_{1}),$$

such that E(εj) = 0 and |εj| < 1 for all j = 0, 1, 2, 3, with

$$E\left(\varepsilon_{0}^{2}\right)=\left(\frac{1}{n}-\frac{1}{N}\right)C_{y}^{2},\qquad E\left(\varepsilon_{1}^{2}\right)=\left(\frac{1}{n}-\frac{1}{N}\right)C_{x}^{2},\qquad E\left(\varepsilon_{0}\varepsilon_{1}\right)=\left(\frac{1}{n}-\frac{1}{N}\right)\rho_{yx}C_{y}C_{x}.$$

We would like to point out that the population data are used to calculate β_hm. Using the above transformations to express equations (3.1)–(3.5) in terms of the ε's, we get

$$\bar{y}_{sv1}=\left\{\bar{Y}(1+\varepsilon_{0})-\bar{X}\beta_{hm}\varepsilon_{1}(1+\varepsilon_{2})(1+\varepsilon_{3})^{-1}\right\}\exp\left[-\frac{1}{2}\varepsilon_{1}\left(1+\frac{1}{2}\varepsilon_{1}\right)^{-1}\right],$$

$$\bar{y}_{sv2}=\left\{\bar{Y}(1+\varepsilon_{0})-\bar{X}\beta_{hm}\varepsilon_{1}(1+\varepsilon_{2})(1+\varepsilon_{3})^{-1}\right\}\exp\left[-\frac{1}{2}\Phi_{sv2}\varepsilon_{1}\left(1+\frac{1}{2}\Phi_{sv2}\varepsilon_{1}\right)^{-1}\right],$$

$$\bar{y}_{sv3}=\left\{\bar{Y}(1+\varepsilon_{0})-\bar{X}\beta_{hm}\varepsilon_{1}(1+\varepsilon_{2})(1+\varepsilon_{3})^{-1}\right\}\exp\left[-\frac{1}{2}\Phi_{sv3}\varepsilon_{1}\left(1+\frac{1}{2}\Phi_{sv3}\varepsilon_{1}\right)^{-1}\right],$$

$$\bar{y}_{sv4}=\left\{\bar{Y}(1+\varepsilon_{0})-\bar{X}\beta_{hm}\varepsilon_{1}(1+\varepsilon_{2})(1+\varepsilon_{3})^{-1}\right\}\exp\left[-\frac{1}{2}\Phi_{sv4}\varepsilon_{1}\left(1+\frac{1}{2}\Phi_{sv4}\varepsilon_{1}\right)^{-1}\right],$$

$$\bar{y}_{sv5}=\left\{\bar{Y}(1+\varepsilon_{0})-\bar{X}\beta_{hm}\varepsilon_{1}(1+\varepsilon_{2})(1+\varepsilon_{3})^{-1}\right\}\exp\left[-\frac{1}{2}\Phi_{sv5}\varepsilon_{1}\left(1+\frac{1}{2}\Phi_{sv5}\varepsilon_{1}\right)^{-1}\right],$$

where Φsv2 = X̄/(X̄ + Cx), Φsv3 = X̄/(X̄ + β2(x)), Φsv4 = X̄β2(x)/(X̄β2(x) + Cx), Φsv5 = X̄Cx/(X̄Cx + β2(x)).

Expanding the right-hand sides of (3.6)–(3.10), multiplying out, and ignoring terms in the ε's of power higher than two, we have

$$\bar{y}_{sv1}-\bar{Y}\approx\bar{Y}\left[\varepsilon_{0}-\frac{1}{2}\varepsilon_{1}+\frac{3}{8}\varepsilon_{1}^{2}-\frac{1}{2}\varepsilon_{0}\varepsilon_{1}-\frac{\bar{X}\beta_{hm}}{\bar{Y}}\left(\varepsilon_{1}-\frac{1}{2}\varepsilon_{1}^{2}+\varepsilon_{1}\varepsilon_{2}-\varepsilon_{1}\varepsilon_{3}\right)\right],$$

$$\bar{y}_{sv2}-\bar{Y}\approx\bar{Y}\left[\varepsilon_{0}-\frac{1}{2}\Phi_{sv2}\varepsilon_{1}+\frac{3}{8}\Phi_{sv2}^{2}\varepsilon_{1}^{2}-\frac{1}{2}\Phi_{sv2}\varepsilon_{0}\varepsilon_{1}-\frac{\bar{X}\beta_{hm}}{\bar{Y}}\left(\varepsilon_{1}-\frac{1}{2}\Phi_{sv2}\varepsilon_{1}^{2}+\varepsilon_{1}\varepsilon_{2}-\varepsilon_{1}\varepsilon_{3}\right)\right],$$

$$\bar{y}_{sv3}-\bar{Y}\approx\bar{Y}\left[\varepsilon_{0}-\frac{1}{2}\Phi_{sv3}\varepsilon_{1}+\frac{3}{8}\Phi_{sv3}^{2}\varepsilon_{1}^{2}-\frac{1}{2}\Phi_{sv3}\varepsilon_{0}\varepsilon_{1}-\frac{\bar{X}\beta_{hm}}{\bar{Y}}\left(\varepsilon_{1}-\frac{1}{2}\Phi_{sv3}\varepsilon_{1}^{2}+\varepsilon_{1}\varepsilon_{2}-\varepsilon_{1}\varepsilon_{3}\right)\right],$$

$$\bar{y}_{sv4}-\bar{Y}\approx\bar{Y}\left[\varepsilon_{0}-\frac{1}{2}\Phi_{sv4}\varepsilon_{1}+\frac{3}{8}\Phi_{sv4}^{2}\varepsilon_{1}^{2}-\frac{1}{2}\Phi_{sv4}\varepsilon_{0}\varepsilon_{1}-\frac{\bar{X}\beta_{hm}}{\bar{Y}}\left(\varepsilon_{1}-\frac{1}{2}\Phi_{sv4}\varepsilon_{1}^{2}+\varepsilon_{1}\varepsilon_{2}-\varepsilon_{1}\varepsilon_{3}\right)\right],$$

$$\bar{y}_{sv5}-\bar{Y}\approx\bar{Y}\left[\varepsilon_{0}-\frac{1}{2}\Phi_{sv5}\varepsilon_{1}+\frac{3}{8}\Phi_{sv5}^{2}\varepsilon_{1}^{2}-\frac{1}{2}\Phi_{sv5}\varepsilon_{0}\varepsilon_{1}-\frac{\bar{X}\beta_{hm}}{\bar{Y}}\left(\varepsilon_{1}-\frac{1}{2}\Phi_{sv5}\varepsilon_{1}^{2}+\varepsilon_{1}\varepsilon_{2}-\varepsilon_{1}\varepsilon_{3}\right)\right].$$

Squaring both sides of (3.11)–(3.15) and discarding terms in the ε's of power higher than two, we get

$$\left[\bar{y}_{sv1}-\bar{Y}\right]^{2}=\bar{Y}^{2}\left[\varepsilon_{0}^{2}+\varepsilon_{1}^{2}\left(\frac{1}{2}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)^{2}-2\varepsilon_{0}\varepsilon_{1}\left(\frac{1}{2}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)\right],$$

$$\left[\bar{y}_{sv2}-\bar{Y}\right]^{2}=\bar{Y}^{2}\left[\varepsilon_{0}^{2}+\varepsilon_{1}^{2}\left(\frac{1}{2}\Phi_{sv2}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)^{2}-2\varepsilon_{0}\varepsilon_{1}\left(\frac{1}{2}\Phi_{sv2}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)\right],$$

$$\left[\bar{y}_{sv3}-\bar{Y}\right]^{2}=\bar{Y}^{2}\left[\varepsilon_{0}^{2}+\varepsilon_{1}^{2}\left(\frac{1}{2}\Phi_{sv3}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)^{2}-2\varepsilon_{0}\varepsilon_{1}\left(\frac{1}{2}\Phi_{sv3}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)\right],$$

$$\left[\bar{y}_{sv4}-\bar{Y}\right]^{2}=\bar{Y}^{2}\left[\varepsilon_{0}^{2}+\varepsilon_{1}^{2}\left(\frac{1}{2}\Phi_{sv4}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)^{2}-2\varepsilon_{0}\varepsilon_{1}\left(\frac{1}{2}\Phi_{sv4}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)\right],$$

$$\left[\bar{y}_{sv5}-\bar{Y}\right]^{2}=\bar{Y}^{2}\left[\varepsilon_{0}^{2}+\varepsilon_{1}^{2}\left(\frac{1}{2}\Phi_{sv5}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)^{2}-2\varepsilon_{0}\varepsilon_{1}\left(\frac{1}{2}\Phi_{sv5}+\frac{\bar{X}\beta_{hm}}{\bar{Y}}\right)\right].$$

Taking the expectation of both sides of equations (3.16)–(3.20), we obtain the MSEs of the suggested estimators ȳsv1, ȳsv2, ȳsv3, ȳsv4 and ȳsv5, up to the first order of approximation, as

$$\mathrm{MSE}\left(\bar{y}_{svi}\right)=\frac{1-f}{n}\left[S_{y}^{2}+\frac{1}{4}R_{svi}^{2}S_{x}^{2}+\beta_{hm}^{2}S_{x}^{2}-R_{svi}S_{yx}+R_{svi}\beta_{hm}S_{x}^{2}-2\beta_{hm}S_{yx}\right],\qquad i=1,2,\ldots,5,$$

where Rsv1 = Ȳ/X̄, Rsv2 = Ȳ/(X̄ + Cx), Rsv3 = Ȳ/(X̄ + β2(x)), Rsv4 = Ȳβ2(x)/(X̄β2(x) + Cx), Rsv5 = ȲCx/(X̄Cx + β2(x)).
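The suggested estimators are easy to compute from the sample summaries; the following Python sketch mirrors equations (3.1)–(3.5) (the function and variable names are ours, and the robust slope beta_hm is supplied as an input):

```python
import math

def proposed_estimators(ybar, xbar, Xbar, beta_hm, Cx, b2x):
    """Suggested ratio-type exponential estimators; b2x stands for beta_2(x)."""
    core = ybar + beta_hm * (Xbar - xbar)   # regression-adjusted sample mean
    d, s = Xbar - xbar, Xbar + xbar
    return {
        "sv1": core * math.exp(d / s),
        "sv2": core * math.exp(d / (s + 2 * Cx)),
        "sv3": core * math.exp(d / (s + 2 * b2x)),
        "sv4": core * math.exp(b2x * d / (b2x * s + 2 * Cx)),
        "sv5": core * math.exp(Cx * d / (Cx * s + 2 * b2x)),
    }

# With beta_hm = 0, sv1 reduces to the classical exponential ratio estimator
# ybar * exp((Xbar - xbar) / (Xbar + xbar)).
est = proposed_estimators(ybar=10.0, xbar=5.0, Xbar=6.0, beta_hm=0.0, Cx=0.4, b2x=3.0)
```

Each variant shrinks the exponential adjustment by a different factor built from Cx and β2(x), which is what the Φ_svi terms capture in the derivation above.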

4. Theoretical efficiency comparison

In this section, the efficiency conditions for the proposed estimators ȳsvi (i = 1, 2, …, 5) are derived algebraically relative to the Kadilar et al. (2007) estimators.

$$\mathrm{MSE}\left(\bar{y}_{cki}\right)-\mathrm{MSE}\left(\bar{y}_{svi}\right)=\frac{1-f}{n}\left[\frac{3}{4}R_{svi}^{2}S_{x}^{2}+\beta_{hm}R_{svi}S_{x}^{2}-R_{svi}S_{xy}\right]>0,$$

which holds if

$$R_{svi}>0\quad\text{and}\quad R_{svi}>\frac{4}{3}\left(\frac{S_{xy}}{S_{x}^{2}}-\beta_{hm}\right),\qquad i=1,2,\ldots,5.$$

According to (4.1), the proposed estimators ȳsvi (i = 1, 2, 3, 4, 5) are more efficient than the Kadilar et al. (2007) estimators as long as the conditions in (4.1) are fulfilled.
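The gap between the two MSE expressions can be checked numerically; the Python sketch below encodes both first-order formulas (the illustrative parameter values are our own, not taken from the paper):

```python
def mse_ck(R, b, Sy2, Sx2, Sxy, f, n):
    # First-order MSE of the Kadilar et al. (2007) estimators.
    return (1 - f) / n * (R**2 * Sx2 + 2 * b * R * Sx2 + b**2 * Sx2
                          - 2 * R * Sxy - 2 * b * Sxy + Sy2)

def mse_sv(R, b, Sy2, Sx2, Sxy, f, n):
    # First-order MSE of the suggested exponential estimators.
    return (1 - f) / n * (Sy2 + 0.25 * R**2 * Sx2 + b**2 * Sx2
                          - R * Sxy + R * b * Sx2 - 2 * b * Sxy)

# The difference collapses to ((1-f)/n) * ((3/4)*R**2*Sx2 + b*R*Sx2 - R*Sxy),
# which is positive whenever R > 0 and R > (4/3) * (Sxy/Sx2 - b).
R, b, Sy2, Sx2, Sxy, f, n = 0.8, 0.3, 4.0, 2.0, 1.0, 0.2, 20
gap = mse_ck(R, b, Sy2, Sx2, Sxy, f, n) - mse_sv(R, b, Sy2, Sx2, Sxy, f, n)
```

With these values Sxy/Sx2 = 0.5, so the condition R > (4/3)(0.5 - 0.3) = 0.267 is satisfied and the gap is positive, as the theory predicts.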

5. Numerical illustration

In this section, the suggested estimators are compared with the existing estimators in the literature. Five different natural population data sets (described in Table 1) were used to compare the behaviour of the suggested estimators with that of the existing estimators. All five data sets contain outliers, as real population data sets often do.

We visually detected outliers in datasets A, B, C, D, and E and display them in Figure 1. These data points are considered outliers because they diverge considerably from the overall trend of the dataset, which suggests that they may distort parameter estimation. To mitigate this effect, we used the Huber M robust regression method, which is known for its proficiency in handling various kinds of outliers. Since the proposed estimators are designed to give more robust and trustworthy parameter estimates in the presence of outliers, we anticipate that they will outperform those found in the literature.

Data set - A : is taken from "UScereals" (Ripley et al., 2013) in the "MASS" package of R-Software (2021), where

Y = the weight of the calories in grams,
X = the weight of the sodium in grams.

Outlier description: We used scatter plots to visually identify outliers in this dataset, concentrating on cases where the sodium and calorie values differed significantly from the majority of data points. Outliers in this dataset are mostly high-sodium, high-calorie cereals.

Data set - B : is taken from Singh (2003) (page 1111), where

Y = the amount of real estate farm loans taken out in 1977,
X = the amount of non-real estate farm loans taken out in 1977.

Outlier description: Outliers in this dataset were identified using scatter plots, revealing several data points with substantially higher values for both non-real estate and real estate farm loans in 1977.

Data set - C : is taken from Murthy (1967) (page 399), where

Y = the region's cultivated area under wheat in 1964,
X = the region's cultivated area under wheat in 1963.

Outlier description: This dataset's outliers were identified using scatter plots, which showed cases where the cultivated area under wheat in 1964 diverged significantly from the normal range.

Data set - D : is taken from Murthy (1967) (page 288), where

Y = the output of 80 factories in a region,
X = the fixed capital of 80 factories in a region.

Outlier description: In this dataset, we used scatter plots to visually examine data points where the fixed capital and output of the factories showed unusual patterns, and identified those points as outliers.

Data set - E : is taken from the "Engel" data set (Koenker and Bassett, 1982) in the "quantreg" package of R-Software (2021), where

Y = annual household food expenditure in Belgian francs,
X = annual household income in Belgian francs.

Outlier description: Scatter plots were used to visually identify outliers, concentrating on households with unusually high annual income or annual food expenditure in Belgian francs.

We discussed how these visually identified outliers affect our study and the estimation techniques we used. In addition to visual outlier identification, we applied the Huber M robust regression method, whose adaptability to different kinds of outliers and extreme values is well recognised. Compared with conventional least squares regression, this technique reduces the effect of outliers on parameter estimation and produces more accurate results; it is designed to manage datasets containing various forms of outliers, including high-leverage, influential, and extreme values. The adoption of Huber M robust regression demonstrates our commitment to tackling outlier difficulties and producing robust estimates in the presence of multiple outlier types; it also improves the transparency of our research and gives a clearer picture of the difficulties posed by outliers in the data sets under study. The RE (%) of the suggested estimators ȳsvi (i = 1, 2, …, 5) with respect to the Kadilar et al. (2007) estimators is given as:

$$\mathrm{RE}(\text{Existing Estimator},\ \text{Proposed Estimator})=\frac{\mathrm{MSE}(\text{Existing Estimator})}{\mathrm{MSE}(\text{Proposed Estimator})}\times 100.$$

Tables 2–5 present the relative efficiency (RE (%)) results, in which the performance of the estimators is compared and evaluated. These tables provide insightful information about the performance of our suggested estimators in comparison with the existing estimators; in particular, RE (%) values above 100 indicate an advantage for the suggested estimators.

This highlights a key finding: our suggested estimators frequently beat their competitors in terms of mean squared error, which occurs whenever the percent relative efficiencies surpass 100. This illustrates how well our estimators perform, in terms of accuracy and precision, when applied to the datasets considered. These findings underline the practical usefulness and improved efficiency of our suggested estimators, highlighting their potential as tools for statistical estimation tasks.

6. Simulation studies

To find the RE (%) of the suggested estimators, we conduct a simulation study based on the "Engel Data Set" (Koenker and Bassett, 1982) presented in Table 1. This data set contains income and food expenditure for 235 working-class Belgian households. To load it in R, load the quantreg library and then enter the command data(engel).

We carry out the steps listed below, coded in R-Software (2021), to determine the MSEs of the suggested estimators ȳsvi (i = 1, 2, …, 5) and the traditional estimators ȳcki (i = 1, 2, …, 5):

Step 1 : Select 5,000, 10,000, and 100,000 samples of each size n (n = 20, 30, 40, 50) from the "Engel Data Set" (Koenker and Bassett, 1982) using simple random sampling without replacement.

Step 2 : Compute the estimate $\hat{\bar{Y}}$ from each sample. We then have 5,000, 10,000, and 100,000 values of $\hat{\bar{Y}}$, one per sample, for each sample size n.

Step 3 : The mean squared error of $\hat{\bar{Y}}$ is computed for each n by

$$\mathrm{MSE}\left(\hat{\bar{Y}}\right)=\frac{1}{5000}\sum_{i=1}^{5000}\left(\hat{\bar{Y}}_{i}-\bar{Y}\right)^{2}$$

(and analogously over the 10,000 and 100,000 samples), where $\bar{Y}$ is the population mean of the study variable and $\hat{\bar{Y}}_{i}$ is the estimate from the $i^{th}$ sample.

The ratio of the MSE of the existing estimators to that of the suggested estimators is computed for each sample size n to estimate the relative efficiency. All of the suggested estimators clearly outperform the existing ones across all sample sizes, demonstrating their greater efficiency compared with the conventional estimators. The simulation results support this observation and corroborate our theoretical results.

It is important to highlight that the efficiency of the suggested estimators increases significantly relative to the existing estimators; in other words, the suggested estimators show considerably higher efficiency in situations where outliers are likely to occur in the data. Tables 3–5 summarise the simulation results over multiple iterations.
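The simulation steps above can be sketched compactly in Python. Since the Engel data ships with R's quantreg package, a synthetic bivariate finite population stands in for it here, and an ordinary least squares slope stands in for the Huber M slope; both substitutions are our simplifications for a dependency-free illustration:

```python
import math
import random

random.seed(42)

# A synthetic finite population (stand-in for the Engel data): y is strongly
# and positively related to x, as in the real data set.
N = 235
xs = [10.0 + 0.5 * i for i in range(N)]
ys = [5.0 + 3.0 * x + random.gauss(0, 10) for x in xs]
Ybar = sum(ys) / N
Xbar = sum(xs) / N

# Population regression slope (OLS here; the paper uses the Huber M slope).
sxy = sum((x - Xbar) * (y - Ybar) for x, y in zip(xs, ys)) / (N - 1)
sx2 = sum((x - Xbar) ** 2 for x in xs) / (N - 1)
beta = sxy / sx2

B, n = 2000, 20  # number of replications and sample size
ck1, sv1 = [], []
for _ in range(B):
    idx = random.sample(range(N), n)      # Step 1: SRS without replacement
    ybar = sum(ys[i] for i in idx) / n    # Step 2: estimate from each sample
    xbar = sum(xs[i] for i in idx) / n
    core = ybar + beta * (Xbar - xbar)
    ck1.append(core / xbar * Xbar)                              # existing ck1
    sv1.append(core * math.exp((Xbar - xbar) / (Xbar + xbar)))  # proposed sv1

# Step 3: empirical MSE about the true population mean, then the RE (%).
mse_ck1 = sum((t - Ybar) ** 2 for t in ck1) / B
mse_sv1 = sum((t - Ybar) ** 2 for t in sv1) / B
re_percent = mse_ck1 / mse_sv1 * 100
```

The full R implementation over the actual Engel data, covering all five estimator pairs, is given in the Appendix.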

7. Analysis of numerical illustration and simulation study

From Tables 1–5, the following interpretations can be drawn:

  • We present descriptions of five real-world data sets in Table 1 to demonstrate the applications of our research.

  • From Table 2:

    • For Data Set A, the PREs of the estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 194.9832% and 221.5548% for a population size of 65 and a sample size of 20.

    • For Data Set B, with population size 50 and sample size 20, the PREs of the estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 249.12% and 252.19%.

    • For Data Set C, the PREs of the estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 356.50% and 382.89% for a population size of 34 and a sample size of 20.

    • For Data Set D, with population size 80 and sample size 20, the PREs of the estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 377.31% and 382.23%.

    • For Data Set E, the PREs of the estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 266.57% and 293.98% for a population size of 235 and a sample size of 20.

  • From Table 3:

    • For a sample size of 20 and 5,000 simulated samples, the PREs of the suggested estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 282.49% and 291.16%.

    • For a sample size of 30 and 5,000 simulated samples, the PREs of the suggested estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 275.73% and 286.17%.

    • For a sample size of 40 and 5,000 simulated samples, the PREs of the suggested estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 274.22% and 286.26%.

    • For a sample size of 50 and 5,000 simulated samples, the PREs of the suggested estimators ȳsvi (i = 1, 2, …, 5) over the existing estimators lie between 277.19% and 290.79%.

  • Similar results are obtained from Tables 4–5.

In our study of real-world datasets, we introduced our estimators, designated ȳsvi (i = 1, 2, …, 5), and compared them with current approaches. Our estimators frequently beat the alternatives across situations with different population and sample sizes, demonstrating their efficiency in estimating population characteristics. Notably, all of the relative efficiency (RE %) values of the suggested estimators were greater than 100%, showing clearly that they are more efficient than the existing estimators. We therefore recommend that survey practitioners use these estimators, since they offer reliable and trustworthy estimates for a variety of survey applications.

8. Conclusions

In the presence of outliers, using traditional statistical approaches for data analysis can lead to incorrect outcomes. Robust regression approaches have been used to enhance methods for estimating the population mean in order to address this problem. This article presents an innovative approach for analysing sample survey data, concentrating on the development of ratio type exponential estimators using the Huber M function. The study compares the mean squared errors (MSEs) of the newly suggested estimators with those of the estimators previously proposed by Kadilar, Candan, and Cingi (2007), offering a thorough review that includes both theoretical derivations and practical implementations. The results consistently show that the newly suggested estimators work better than their competitors, producing lower MSEs under different circumstances. These findings are supported by rigorous numerical illustrations and simulation studies that intentionally take into account the existence of outliers in the data. As a result, the estimators based on the robust regression techniques given in this article outperformed the estimators of Kadilar et al. (2007) across both real-world data sets and simulated scenarios. The positive findings of this study not only confirm the effectiveness of the suggested estimators but also open the door for future efforts to broaden their applicability across various sampling techniques. This article advances data analysis methods, especially in situations where outliers are a challenge, and holds significant potential for applications in a variety of fields, including business, economics, and agriculture, eventually encouraging informed policy development. Future research will aim to improve and diversify the estimator toolbox for sample surveys.

Data Availability Statement

All the relevant data information is available within the manuscript, and the code is given in the Appendix.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgement

Authors are thankful to the Editor-in-Chief and learned referees for their inspiring and fruitful suggestions. The authors are also thankful to the National Institute of Technology Arunachal Pradesh, Jote, for providing the necessary infrastructure for the completion of the present work.

Appendix A: R - Code

# Load necessary libraries
library(quantreg)  # provides the 'engel' data set
library(moments)   # provides kurtosis()

# Load the ‘engel’ dataset
data(engel)

# Define the study variable (food expenditure) and auxiliary variable (income)
Y <- engel$foodexp
X <- engel$income

# Calculate the means variables
Ybar <- mean(Y)
Xbar <- mean(X)

# Specify the number of bootstrap replications (5000, 10000, 100000)
B <- 100000

# Set the sample size ‘n’ (20, 30, 40, 50)
n <- 50

# Define ‘N’ as the length of the ’income’ vector
N <- length(engel$income)

# Initialize vectors to store results for different estimators
T1.1 <- numeric(B)
T1.2 <- numeric(B)
T1.3 <- numeric(B)
T1.4 <- numeric(B)
T1.5 <- numeric(B)
P1.1 <- numeric(B)
P1.2 <- numeric(B)
P1.3 <- numeric(B)
P1.4 <- numeric(B)
P1.5 <- numeric(B)

# Loop for bootstrap replications
for (K in 1:B) {

# Randomly sample data without replacement to create a bootstrap sample
swor <- sample(N, size = n, replace = FALSE)
y <- engel$foodexp[swor]
x <- engel$income[swor]
ybar <- mean(y)
xbar <- mean(x)
Cx <- sd(x) / mean(x)
B2 <- kurtosis(x)

# Huber M-estimation of the regression slope via rlm() from the MASS package
library(MASS)
Br1 <- rlm(y ~ x)
Br <- Br1$coefficients[2]

# Calculate various estimators
T1.1[K] <- ((ybar + Br * (Xbar - xbar)) / xbar) * Xbar
T1.2[K] <- ((ybar + Br * (Xbar - xbar)) / (xbar + Cx)) * (Xbar + Cx)
T1.3[K] <- ((ybar + Br * (Xbar - xbar)) / (xbar + B2)) * (Xbar + B2)
T1.4[K] <- ((ybar + Br * (Xbar - xbar)) / (xbar * B2 + Cx)) * (Xbar * B2 + Cx)
T1.5[K] <- ((ybar + Br * (Xbar - xbar)) / (xbar * Cx + B2)) * (Xbar * Cx + B2)

P1.1[K] <- (ybar + Br * (Xbar - xbar)) * exp((Xbar - xbar) / (Xbar + xbar))
P1.2[K] <- (ybar + Br * (Xbar - xbar)) * exp((Xbar - xbar) / ((Xbar + xbar) + 2 * Cx))
P1.3[K] <- (ybar + Br * (Xbar - xbar)) * exp((Xbar - xbar) / ((Xbar + xbar) + 2 * B2))
P1.4[K] <- (ybar + Br * (Xbar - xbar)) * exp((B2 * (Xbar - xbar)) / (B2 * (Xbar + xbar) + 2 * Cx))
P1.5[K] <- (ybar + Br * (Xbar - xbar)) * exp((Cx * (Xbar - xbar)) / (Cx * (Xbar + xbar) + 2 * B2))
}

# Calculate Mean Squared Errors (MSE) of each estimator about the true mean Ybar
MSEY1 <- mean((T1.1 - Ybar)^2)
MSEY2 <- mean((T1.2 - Ybar)^2)
MSEY3 <- mean((T1.3 - Ybar)^2)
MSEY4 <- mean((T1.4 - Ybar)^2)
MSEY5 <- mean((T1.5 - Ybar)^2)

d <- data.frame(MSEY1, MSEY2, MSEY3, MSEY4, MSEY5)

MSEYpr1 <- mean((P1.1 - Ybar)^2)
MSEYpr2 <- mean((P1.2 - Ybar)^2)
MSEYpr3 <- mean((P1.3 - Ybar)^2)
MSEYpr4 <- mean((P1.4 - Ybar)^2)
MSEYpr5 <- mean((P1.5 - Ybar)^2)

d1 <- data.frame(MSEYpr1, MSEYpr2, MSEYpr3, MSEYpr4, MSEYpr5)

# Relative Efficiency
da1 <- c(MSEY1 / MSEYpr1, MSEY2 / MSEYpr1, MSEY3 / MSEYpr1, MSEY4 / MSEYpr1, MSEY5 / MSEYpr1)
da2 <- c(MSEY1 / MSEYpr2, MSEY2 / MSEYpr2, MSEY3 / MSEYpr2, MSEY4 / MSEYpr2, MSEY5 / MSEYpr2)
da3 <- c(MSEY1 / MSEYpr3, MSEY2 / MSEYpr3, MSEY3 / MSEYpr3, MSEY4 / MSEYpr3, MSEY5 / MSEYpr3)
da4 <- c(MSEY1 / MSEYpr4, MSEY2 / MSEYpr4, MSEY3 / MSEYpr4, MSEY4 / MSEYpr4, MSEY5 / MSEYpr4)
da5 <- c(MSEY1 / MSEYpr5, MSEY2 / MSEYpr5, MSEY3 / MSEYpr5, MSEY4 / MSEYpr5, MSEY5 / MSEYpr5)

da <- c(da1, da2, da3, da4, da5)
RE_matrix <- matrix(da, ncol = 5, nrow = 5, byrow = T) * 100

# Display or export results
d # MSE results for T1 estimators
d1 # MSE results for P1 estimators
RE_matrix # Relative Efficiency matrix

Figures
Fig. 1. Scatter plots across various datasets.
TABLES

Table 1

Parameters of the five natural population data sets

              A              B              C              D              E
Source        UScereals      Singh (2003,   Murthy (1967,  Murthy (1967,  Engel (Koenker and
              (Ripley et     p. 1111)       p. 399)        p. 288)        Bassett, 1982)
              al., 2013)
N             65             50             34             80             235
n             20             20             20             20             20
Ȳ             149.4083       555.4345       199.4412       5182.637       624.1501
X̄             237.8384       878.1624       208.8824       1126.463       982.473
ρ             0.5286552      0.8038341      0.9800867      0.9413055      0.9112434
Cy            0.4177271      1.052916       0.7531797      0.3541939      0.4429335
Cx            0.549239       1.235168       0.7205298      0.7506772      0.5284938
βhm           0.1928509      0.4123359      0.9537324      1.989718       0.5368326
β2(x)         8.191083       4.617048       2.912272       2.866433       17.63426
Sx            130.6296       1084.678       150.506        845.6097       519.2309
Sy            62.41187       584.826        150.215        1835.659       276.457
Sxy           4310.041       509910.4       22158.05       1461142        130804.4
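The βhm values in Table 1 are the Huber M-estimation slope coefficients from a robust regression of the study variable on the auxiliary variable; in R they can be obtained from MASS::rlm, whose default psi function is Huber's. As a hedged illustration (not the authors' code), the same slope can be computed by iteratively reweighted least squares in Python:

```python
import numpy as np

def huber_slope(x, y, k=1.345, n_iter=100, tol=1e-8):
    """Huber M-estimate of the simple-regression slope via IRLS.

    k = 1.345 is the conventional tuning constant giving roughly 95%
    efficiency at the normal model; the scale is re-estimated at each
    iteration by the normalized MAD of the residuals.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS start
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale
        if s <= 0:                                     # degenerate residuals
            break
        u = r / (k * s)
        w = np.where(np.abs(u) <= 1.0, 1.0, 1.0 / np.abs(u))  # Huber weights
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta[1]                                     # the slope, beta_hm

# Outlier-contaminated illustration: true slope 2, one gross outlier
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 60)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 60)
y[0] += 100.0
ols = np.polyfit(x, y, 1)[0]        # OLS slope, pulled by the outlier
hub = huber_slope(x, y)             # Huber slope, close to the true value 2
print(round(ols, 3), round(hub, 3))
```

Because the Huber weight falls off as 1/|residual| beyond the tuning constant, a single gross outlier contributes almost nothing to the slope, which is exactly why βhm is a sensible plug-in coefficient for the ratio-type exponential estimators studied here.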

Table 2

Percent relative efficiencies of the suggested estimators svi (i = 1, 2, …, 5) over the existing estimators

Data set  Estimator  ck1       ck2       ck3       ck4       ck5
A         sv1        212.7720  212.0549  202.5874  212.6842  194.9832
          sv2        213.1139  212.3956  202.9129  213.0259  195.2965
          sv3        217.7221  216.9884  207.3006  217.6323  199.5195
          sv4        212.8138  212.0966  202.6272  212.7260  195.0216
          sv5        221.5548  220.8081  210.9498  221.4634  203.0318

B         sv1        250.9038  250.3324  248.7804  250.7798  249.1821
          sv2        251.2503  250.6781  249.1240  251.1261  249.5262
          sv3        252.1963  251.6219  250.0620  252.0717  250.4657
          sv4        250.9788  250.4073  248.8549  250.8549  249.2567
          sv5        251.9507  251.3770  249.8185  251.8263  250.2219

C         sv1        370.1677  367.6736  360.2442  369.3084  356.5053
          sv2        372.4287  369.9194  362.4446  371.5642  358.6828
          sv3        379.3293  376.7735  369.1601  378.4487  365.3287
          sv4        370.9437  368.4444  360.9993  370.0825  357.2526
          sv5        382.8989  380.3190  372.6340  382.0100  368.7665

D         sv1        379.8456  379.3468  377.9464  379.6715  377.3188
          sv2        380.3142  379.8148  378.4127  380.1399  377.7843
          sv3        381.6361  381.1350  379.7280  381.4612  379.0975
          sv4        380.0090  379.5100  378.1091  379.8348  377.4812
          sv5        382.2315  381.7296  380.3204  382.0563  379.6889

E         sv1        281.8757  281.6214  273.6011  281.8612  266.5705
          sv2        282.0685  281.8140  273.7882  282.0541  266.7529
          sv3        288.2916  288.0315  279.8286  288.2768  272.6380
          sv4        281.8866  281.6323  273.6117  281.8722  266.5809
          sv5        293.9832  293.7180  285.3532  293.9681  278.0206

Table 3

Relative efficiencies (%) of the considered estimators svi (i = 1, 2, …, 5) over the existing estimators for simulation studies for 5,000 iterations

Sample size  Estimator  ck1       ck2       ck3       ck4       ck5
n = 20       sv1        287.4186  287.1696  284.8613  287.3576  282.4920
             sv2        287.6115  287.3623  285.0525  287.5505  282.6816
             sv3        289.4916  289.2408  286.9159  289.4302  284.5295
             sv4        287.4619  287.2128  284.9042  287.4009  282.5346
             sv5        291.1673  290.9150  288.5766  291.1055  286.1765

n = 30       sv1        281.6029  281.3562  278.3832  281.5545  275.7349
             sv2        281.7920  281.5452  278.5702  281.7435  275.9200
             sv3        284.2157  283.9668  280.9662  284.1669  278.2933
             sv4        281.6370  281.3903  278.4170  281.5886  275.7683
             sv5        286.1736  285.9230  282.9017  286.1244  280.2103

n = 40       sv1        280.9477  280.7004  277.2045  280.9056  274.2236
             sv2        281.1367  280.8892  277.3909  281.0946  274.4081
             sv3        283.9790  283.7289  280.1953  283.9364  277.1823
             sv4        280.9777  280.7303  277.2340  280.9356  274.2529
             sv5        286.2625  286.0104  282.4484  286.2196  279.4112

n = 50       sv1        284.7560  284.5029  280.5511  284.7162  277.1985
             sv2        284.9496  284.6964  280.7418  284.9098  277.3870
             sv3        288.1667  287.9106  283.9114  288.1265  280.5187
             sv4        284.7844  284.5314  280.5791  284.7447  277.2262
             sv5        290.7960  290.5376  286.5019  290.7554  283.0782

Table 4

Relative efficiencies (%) of the considered estimators svi (i = 1, 2, …, 5) over the existing estimators for simulation studies for 10,000 iterations

Sample size  Estimator  ck1       ck2       ck3       ck4       ck5
n = 20       sv1        283.7811  283.5358  281.2569  283.7217  278.9507
             sv2        283.9702  283.7247  281.4443  283.9108  279.1366
             sv3        285.8136  285.5665  283.2714  285.7538  280.9487
             sv4        283.8230  283.5776  281.2984  283.7636  278.9919
             sv5        287.4332  287.1847  284.8765  287.3731  282.5407

n = 30       sv1        281.8515  281.6043  278.6550  281.8028  276.0324
             sv2        282.0411  281.7937  278.8425  281.9923  276.2181
             sv3        284.4471  284.1975  281.2211  284.3978  278.5743
             sv4        281.8860  281.6387  278.6891  281.8372  276.0661
             sv5        286.3941  286.1429  283.1461  286.3446  280.4812

n = 40       sv1        282.4314  282.1826  278.6957  282.3882  275.6903
             sv2        282.6213  282.3723  278.8831  282.5781  275.8757
             sv3        285.4612  285.2097  281.6854  285.4175  278.6478
             sv4        282.4620  282.2132  278.7259  282.4189  275.7203
             sv5        287.7615  287.5080  283.9553  287.7175  280.8932

n = 50       sv1        284.7022  284.4496  280.5371  284.6623  277.2014
             sv2        284.8955  284.6427  280.7276  284.8556  277.3896
             sv3        288.0771  287.8214  283.8626  288.0367  280.4873
             sv4        284.7308  284.4782  280.5653  284.6909  277.2293
             sv5        290.6930  290.4350  286.4403  290.6522  283.0343

Table 5

Relative efficiencies (%) of the considered estimators svi (i = 1, 2, …, 5) over the existing estimators for simulation studies for 100,000 iterations

Sample size  Estimator  ck1       ck2       ck3       ck4       ck5
n = 20       sv1        283.8537  283.6076  281.3037  283.7942  278.9635
             sv2        284.0422  283.7959  281.4905  283.9826  279.1487
             sv3        285.8925  285.6447  283.3242  285.8326  280.9672
             sv4        283.8955  283.6494  281.3451  283.8360  279.0046
             sv5        287.5312  287.2819  284.9482  287.4709  282.5776

n = 30       sv1        283.6235  283.3746  280.3902  283.5742  277.6850
             sv2        283.8144  283.5653  280.5789  283.7651  277.8719
             sv3        286.2491  285.9979  282.9859  286.1994  280.2556
             sv4        283.6584  283.4095  280.4247  283.6091  277.7192
             sv5        288.2599  288.0069  284.9737  288.2098  282.2243

n = 40       sv1        282.6203  282.3717  278.8912  282.5771  275.8742
             sv2        282.8102  282.5614  279.0786  282.7670  276.0596
             sv3        285.6450  285.3937  281.8760  285.6013  278.8267
             sv4        282.6510  282.4024  278.9215  282.6078  275.9042
             sv5        287.9584  287.7051  284.1589  287.9144  281.0849

n = 50       sv1        282.9058  282.6569  278.8137  282.8660  275.5154
             sv2        283.0956  282.8465  279.0007  283.0558  275.7003
             sv3        286.2084  285.9566  282.0685  286.1682  278.7318
             sv4        282.9342  282.6853  278.8417  282.8945  275.5432
             sv5        288.7843  288.5303  284.6072  288.7438  281.2404

References
  1. Bulut H and Zaman T (2022). An improved class of robust ratio estimators by using the minimum covariance determinant estimation. Communications in Statistics - Simulation and Computation, 51, 2457-2463.
  2. Edgeworth FY (1887). Observations relating to several quantities. Hermathena, 6, 279-285.
  3. Huber PJ (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. Annals of Statistics, 1, 799-821.
  4. Huber PJ (1981). Robust Statistics, Wiley, New York.
  5. Kadilar C, Candan M, and Cingi H (2007). Ratio estimation using robust regression. Hacettepe Journal of Mathematics and Statistics, 36, 181-188.
  6. Kadilar C and Cingi H (2004). Ratio estimators in simple random sampling. Applied Mathematics and Computation, 151, 893-902.
  7. Koenker R (2009). quantreg: Quantile regression, R package.
  8. Koenker R and Bassett G (1982). Robust tests for heteroscedasticity based on regression quantiles. Econometrica, 50, 43-61.
  9. Murthy MN (1967). Sampling: Theory and Methods, Statistical Publishing Society, Calcutta.
  10. Noor-ul-Amin M, Shahbaz MQ, and Kadilar C (2016). Ratio estimators for population mean using robust regression in double sampling. Gazi University Journal of Science, 29, 793-798.
  11. Noor-ul-Amin M, Asghar S, and Sanaullah A (2018). Redescending M-estimator for robust regression. Journal of Reliability and Statistical Studies, 11, 69-80.
  12. Noor-ul-Amin M, Asghar S, and Sanaullah A (2022). Ratio estimators in the presence of outliers using redescending M-estimator. Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, 92, 65-70.
  13. Prasad S (2020). Some linear regression type ratio exponential estimators for estimating the population mean based on quartile deviation and deciles. Statistics in Transition, New Series, 21, 85-98.
  14. R Core Team (2021). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
  15. Rousseeuw P and Leroy A (1987). Robust Regression and Outlier Detection, Wiley, New York.
  16. Ripley B, Venables B, Bates DM, Hornik K, Gebhardt A, Firth D, and Ripley MB (2013). Package 'MASS'. CRAN, 538, 113-120.
  17. Singh S (2003). Advanced Sampling Theory with Applications: How Michael 'Selected' Amy (Vol. 2), Kluwer Academic Publishers.
  18. Zaman T (2020). Generalized exponential estimators for the finite population mean. Statistics in Transition New Series, 21, 159-168.
  19. Zaman T (2021). An efficient exponential estimator of the mean under stratified random sampling. Mathematical Population Studies, 28, 104-121.
  20. Zaman T and Kadilar C (2021a). Exponential ratio and product type estimators of the mean in stratified two-phase sampling. AIMS Mathematics, 6, 4265-4279.
  21. Zaman T and Kadilar C (2021b). New class of exponential estimators for finite population mean in two-phase sampling. Communications in Statistics - Theory and Methods, 50, 874-889.
  22. Zaman T and Bulut H (2019). Modified ratio estimators using robust regression methods. Communications in Statistics - Theory and Methods, 48, 2039-2048.
  23. Zaman T and Bulut H (2023). An efficient family of robust-type estimators for the population variance in simple and stratified random sampling. Communications in Statistics - Theory and Methods, 52, 2610-2624.
  24. Zaman T, Dünder E, Audu A, Alilah DA, Shahzad U, and Hanif M (2021). Robust regression-ratio-type estimators of the mean utilizing two auxiliary variables: A simulation study. Mathematical Problems in Engineering, 2021, 1-9.
  25. Zaman T, Bulut H, and Yadav SK (2022). Robust ratio-type estimators for finite population mean in simple random sampling: A simulation study. Concurrency and Computation: Practice and Experience, 34, e7273.