
Recently, artificial intelligence (AI) has been used as a decision-making tool in various domains such as credit scoring, criminal risk assessment, and college admissions (Angwin et al., 2016).
Demographic disparities due to AI, which refer to socially unacceptable bias in which an AI model favors certain groups (e.g., white, male) over other groups (e.g., black, female), have been observed frequently in many applications of AI such as the COMPAS recidivism risk assessment (Angwin et al., 2016).
In this paper, we consider the problem of using the sensitive variable for fair prediction. In most real applications, the sensitive variable itself carries important information for prediction, and using it as part of the input variables usually helps improve prediction accuracy. Moreover, some fair AI algorithms inevitably produce prediction models that depend on the sensitive variable as well as the input variables. An example is the algorithm of Jiang et al. (2020), whose prediction model is a function of both the input vector and the sensitive variable.
A simple solution for reflecting the information of the sensitive variable in prediction, when using the sensitive variable explicitly in the prediction phase is prohibited, is to use an imputed sensitive variable. That is, the sensitive variable is fully used in the learning phase to obtain a prediction model depending on the sensitive variable, and then an imputed sensitive variable is plugged in at the prediction phase. The aim of this paper is to evaluate this procedure by analyzing several benchmark datasets. We illustrate that using an imputed sensitive variable is helpful for improving prediction accuracy without hampering the degree of fairness much. That is, prediction models with an imputed sensitive variable are superior to prediction models that do not use the sensitive variable at all.
The paper is organized as follows. Various fairness algorithms are reviewed in Section 2. The proposed procedure, which includes the information of the sensitive variable in the prediction phase through an imputed sensitive variable, is explained in Section 3. Results of numerical studies are presented in Section 4, and concluding remarks follow in Section 5.
We let $\{(\mathbf{x}_i, s_i, y_i)\}_{i=1}^{n}$ be a set of training data of size $n$, where $\mathbf{x}_i \in \mathcal{X} \subseteq \mathbb{R}^p$ is an input vector, $s_i \in \{0, 1\}$ is a sensitive variable, and $y_i$ is a class label. We consider a binary classification problem, which means $y_i \in \{-1, 1\}$, and for notational simplicity, we let $\mathbf{z}_i = (\mathbf{x}_i, s_i)$, where $\mathbf{z}_i \in \mathcal{Z} = \mathcal{X} \times \{0, 1\}$.
In this paper, we focus on between-group fairness (BGF), which requires that certain statistics of predicted values in each sensitive group be similar. Other concepts of fairness, such as individual fairness (Dwork et al., 2012), are not considered in this paper.
We consider AI algorithms that yield a real-valued function, a so-called score function, which assigns higher scores to positively labeled instances than to negatively labeled ones. An example of a score function is the conditional class probability $\Pr(Y = 1 \mid \mathbf{x}, s)$.
Let $\mathcal{F}$ be a given set of score functions, in which we search for an optimal score function in a certain sense (e.g., minimizing the cross-entropy for classification problems). Examples of $\mathcal{F}$ are linear functions, reproducing kernel Hilbert spaces, and deep neural networks, to name a few.
For a given score function $f \in \mathcal{F}$, a group performance function is defined as
$$q_s(f) = \Pr\left(\mathcal{E} \mid \mathcal{E}', S = s\right)$$
for events $\mathcal{E}$ and $\mathcal{E}'$ that might depend on $f$ and $Y$ (for mean score parity, the probability is replaced by the expectation of the score). The group performance function quantifies how a given score function behaves on each sensitive group; Table 1 lists the events corresponding to several popular fairness criteria. For given group performance functions $q_0$ and $q_1$, a BGF constraint requires that $q_0(f)$ and $q_1(f)$ be similar, for example $|q_1(f) - q_0(f)| \le \epsilon$ for a prespecified small $\epsilon > 0$.
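For concreteness, the following is a minimal Python sketch (with illustrative function and variable names, not part of the original formulation) of how such group performance gaps can be estimated from predictions, labels, and the sensitive variable.

```python
import numpy as np

def group_rate(event, condition, s, group):
    """Empirical q_s = Pr(event | condition, S = group)."""
    mask = (s == group) & condition
    return event[mask].mean() if mask.any() else np.nan

def bgf_gaps(score, y, s):
    """DI and equal-opportunity gaps |q_1 - q_0| for a score function's outputs."""
    pred_pos = score > 0                      # event E : predicted positive
    everyone = np.ones_like(y, dtype=bool)    # event E': no conditioning
    actual_pos = y == 1                       # event E': truly positive

    di = abs(group_rate(pred_pos, everyone, s, 1) - group_rate(pred_pos, everyone, s, 0))
    eo = abs(group_rate(pred_pos, actual_pos, s, 1) - group_rate(pred_pos, actual_pos, s, 0))
    return {"DI": di, "EqOpp": eo}
```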
Several learning algorithms have been proposed to find an accurate model satisfying given BGF constraints, and they are roughly categorized into pre-processing, in-processing, and post-processing methods.
Pre-processing methods remove bias in the training data or find a fair representation with respect to the sensitive variable before the learning phase, and then learn AI models based on the de-biased data or fair representation (Calmon et al., 2017).
In-processing methods generally train an AI model by minimizing a given cost function (e.g., the cross-entropy, the sum of squared residuals, the empirical AUC) subject to a given BGF constraint (Zafar et al., 2017).
Post-processing methods first learn an AI model without any BGF constraint and then transform the decision boundary or score function of the trained AI model for each sensitive group to satisfy given BGF criteria (Chzhen et al., 2020; Jiang et al., 2020).
We consider two situations according to whether the sensitive variable can be used in the prediction phase. The first situation (Situation 1, S1) is that the sensitive variable can be used in the prediction phase, so that the prediction model is a function of both the input vector and the sensitive variable. The second situation (Situation 0, S0) is that the sensitive variable cannot be used in the prediction phase, so that the prediction model is a function of the input vector only.
In this paper, we propose a method to use the information in the sensitive variable under Situation 0. The idea of the proposed method is simple and intuitive. At the learning phase, a prediction model is learned with the input vector including the sensitive variable. Then, at the prediction phase, we impute the sensitive variable based on the other input variables and make a prediction with the imputed sensitive variable.
To be more specific, let $\hat{f}(\mathbf{x}, s)$ be a prediction model learned with the training data $(\mathbf{x}_i, s_i, y_i),\ i = 1, \ldots, n$, by a fair learning algorithm, and let $\hat{s}(\mathbf{x})$ be a model learned with the training data $(\mathbf{x}_i, s_i),\ i = 1, \ldots, n$, which predicts the sensitive variable from the input vector. At the prediction phase, for a new input $\mathbf{x}$, the final prediction is given by $\hat{f}(\mathbf{x}, \hat{s}(\mathbf{x}))$.
There are at least two advantages of the proposed method compared to the standard method for Situation 0, which learns a prediction model depending only on the input vector in the learning phase. First of all, using the information of the sensitive variable in the learning phase is helpful for improving prediction accuracy, since the sensitive variable itself carries information about the class label.
The second advantage, which is the main motivation of our proposed method, is that several useful fair AI algorithms yield prediction models that are functions of both the input vector and the sensitive variable. An example is the Wasserstein fair algorithm (Jiang et al., 2020). With an imputed sensitive variable, such algorithms can be used even under Situation 0.
Learning fair prediction models with an imputed sensitive variable

[1] Learning phase: Learn a prediction model $\hat{f}(\mathbf{x}, s)$ with the training data $(\mathbf{x}_i, s_i, y_i),\ i = 1, \ldots, n$, using a fair learning algorithm.
[2] Imputation phase: Learn a prediction model $\hat{s}(\mathbf{x})$ for the sensitive variable with the training data $(\mathbf{x}_i, s_i),\ i = 1, \ldots, n$.
[3] Prediction phase: For a given new input $\mathbf{x}$, predict the label by $\hat{f}(\mathbf{x}, \hat{s}(\mathbf{x}))$.
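For illustration, the following is a minimal Python sketch of the three phases using scikit-learn; the function names are ours, and the plain logistic regression in the learning phase is only a stand-in for whatever fair learning algorithm is actually used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def fit_fair_with_imputed_s(X_train, s_train, y_train):
    # [1] Learning phase: learn f_hat(x, s) on (x_i, s_i, y_i).
    #     LogisticRegression is only a placeholder; in practice a fair learning
    #     algorithm (e.g., DI- or PI-constrained) is used here.
    f_hat = LogisticRegression(max_iter=1000)
    f_hat.fit(np.column_stack([X_train, s_train]), y_train)

    # [2] Imputation phase: learn s_hat(x) predicting the sensitive variable.
    s_hat = GradientBoostingClassifier()
    s_hat.fit(X_train, s_train)
    return f_hat, s_hat

def predict_with_imputed_s(f_hat, s_hat, X_new):
    # [3] Prediction phase: impute s from x, then predict with f_hat(x, s_hat(x)).
    s_imputed = s_hat.predict(X_new)
    return f_hat.predict(np.column_stack([X_new, s_imputed]))
```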
In this section, we investigate the performance of the proposed procedure and compare it with the prediction models that do not use an imputed sensitive variable (either not using the sensitive variable under S0 or using the sensitive variable fully in the prediction phase under S1).
For fair AI algorithms, we consider three algorithms: (1) the fair classifier with the disparate impact (DI) constraint (Zafar et al., 2017), (2) the fair classifier with the PI constraint, and (3) the Wasserstein fair classifier (Jiang et al., 2020).
The DI constraint requires that a prediction model $f$ satisfy
$$\mathrm{DI}(f) = \left| \hat{\mathbb{E}}\{\mathbb{I}(f(\mathbf{z}) > 0) \mid S = 1\} - \hat{\mathbb{E}}\{\mathbb{I}(f(\mathbf{z}) > 0) \mid S = 0\} \right| \le \epsilon$$
for a prespecified $\epsilon > 0$, where $\hat{\mathbb{E}}$ is the expectation with respect to the empirical distribution. By use of a Lagrangian multiplier, we could learn a prediction model by minimizing the penalized empirical risk
$$\frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f(\mathbf{z}_i)\right) + \lambda\, \mathrm{DI}(f)$$
over $\mathcal{F}$, where $\ell$ is a classification loss and $\lambda > 0$ is the Lagrangian multiplier. However, the indicator function in $\mathrm{DI}(f)$ is neither continuous nor convex. A typical remedy is to replace the indicator function by a convex surrogate function. One of the popularly used convex surrogate functions is the hinge function $(1 + t)_+ = \max(1 + t, 0)$, and we learn a prediction model by minimizing the resulting surrogate-penalized empirical risk over $\mathcal{F}$.
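As an illustration, the sketch below implements this kind of surrogate-penalized objective in PyTorch for a generic score vector; the specific loss, surrogate, and penalty weight `lam` are illustrative choices rather than the exact settings used in the experiments.

```python
import torch

def di_penalized_loss(score, y, s, lam=1.0):
    """Cross-entropy plus lambda times a surrogate DI penalty.

    score: raw scores f(z_i); y: labels in {0, 1}; s: sensitive variable in {0, 1}.
    The indicator I(f(z) > 0) in DI is replaced by the hinge surrogate (1 + f(z))_+.
    """
    bce = torch.nn.functional.binary_cross_entropy_with_logits(score, y.float())
    surrogate = torch.clamp(1.0 + score, min=0.0)          # (1 + f(z))_+
    di = torch.abs(surrogate[s == 1].mean() - surrogate[s == 0].mean())
    return bce + lam * di

# Usage with a linear score function trained by gradient descent:
#   w = torch.zeros(p, requires_grad=True); b = torch.zeros(1, requires_grad=True)
#   loss = di_penalized_loss(X @ w + b, y, s, lam=2.0); loss.backward()
```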
PI measures the statistical dependence between the predicted class label and the sensitive variable, and a given classifier is regarded as fairer when its PI is smaller. For data $(\mathbf{z}_i, y_i),\ i = 1, \ldots, n$, the empirical PI is computed from the joint empirical distribution of the predicted label and the sensitive variable, and a fair prediction model is learned by minimizing the penalized empirical risk with the empirical PI as the penalty term, in the same way as for the DI constraint.
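Since the exact PI formula is not reproduced above, the sketch below uses the empirical mutual information between the predicted label and the sensitive variable as one plausible stand-in for such a dependence measure; this is an assumption for illustration, not necessarily the definition used in the paper.

```python
import numpy as np

def empirical_pi(pred, s, eps=1e-12):
    """Empirical mutual information between predicted label and sensitive variable.

    This is one possible dependence measure; the paper's exact PI may differ.
    pred, s: arrays with values in {0, 1}.
    """
    pi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((pred == a) & (s == b))
            p_a, p_b = np.mean(pred == a), np.mean(s == b)
            if p_ab > 0:
                pi += p_ab * np.log(p_ab / (p_a * p_b + eps))
    return pi
```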
This method is a post-processing method. Let $\hat{f}$ be a score (belief) function learned without any fairness constraint, and let $\hat{F}_s$ be the empirical distribution of $\hat{f}(\mathbf{Z})$ on the group $\{S = s\}$. The Wasserstein fair classifier transforms each group-conditional score distribution $\hat{F}_s$ onto the Wasserstein barycenter of $\hat{F}_0$ and $\hat{F}_1$, so that the transformed scores have (almost) the same distribution across the sensitive groups, and the final prediction is obtained by thresholding the transformed score.
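A minimal NumPy sketch of this type of post-processing for a binary sensitive variable is given below, assuming the one-dimensional Wasserstein barycenter is obtained by averaging the group quantile functions (its standard closed form in one dimension); all names are illustrative.

```python
import numpy as np

def wasserstein_repair(cal_scores, cal_s, new_scores, new_s, grid=1000):
    """Map each group's score distribution onto the 1-D Wasserstein barycenter.

    cal_scores, cal_s: calibration scores and sensitive groups (0/1) used to
    estimate the group-wise distributions; new_scores, new_s: scores to repair.
    """
    taus = np.linspace(0.0, 1.0, grid)
    q = {g: np.quantile(cal_scores[cal_s == g], taus) for g in (0, 1)}
    w1 = np.mean(cal_s == 1)
    q_bar = (1 - w1) * q[0] + w1 * q[1]          # barycenter quantile function

    repaired = np.empty_like(new_scores, dtype=float)
    for g in (0, 1):
        mask = new_s == g
        ref = np.sort(cal_scores[cal_s == g])
        # rank of each new score within its own group, then read off the
        # barycenter quantile at that rank
        ranks = np.searchsorted(ref, new_scores[mask], side="right") / len(ref)
        repaired[mask] = np.interp(ranks, taus, q_bar)
    return repaired

# Final labels are obtained by thresholding the repaired scores,
# e.g., repaired > 0.5 when the score is a belief (probability).
```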
To evaluate the proposed procedure, we analyze three benchmark real-world datasets: the Adult income dataset, the Bank marketing dataset, and the Law school dataset.
The Adult dataset is composed of 45,222 individuals' features (e.g., age, education, race) with a binary label indicating whether an individual's income is larger than 50K USD (positive class). We take the variable 'sex' as the sensitive variable.
The Bank dataset is composed of 41,188 clients' features (e.g., job, education) with a binary label indicating whether a client has subscribed to a term deposit. We take the variable 'age' as the sensitive variable, which is 1 when the client's age is between 25 and 60 years.
The Law dataset is composed of 26,551 law school applicants' features (e.g., LSAT score, family income) with a binary label indicating whether an applicant is admitted to law school. We take the variable 'race' as the sensitive variable, which is 1 when the applicant is white.
We use three classification algorithms for imputing the sensitive variable: gradient boosting (Boosting), logistic regression (Logistic), and a deep neural network (DNN).
Whenever a regularization parameter has to be selected (e.g., the Lagrangian multipliers for DI and PI), we search for it so that the corresponding estimated classifier achieves a prespecified level of fairness, and we compare the accuracies of the classifiers on test data. Specifically, we select the classifier with the highest training accuracy among those whose regularization parameters make the DI on the training data less than 0.05 (0.005 in the case of PI). We repeat the random splitting of the data into training and test sets with ratio 7:3 five times and average the performances.
For the Wasserstein fair classifier, which is a post-processing method, we evaluate the strong demographic parity (SDP) measure as well as the DI. The SDP for a given belief function measures the discrepancy between the group-conditional distributions of the belief (score) values, so that a smaller SDP indicates that the distribution of the score depends less on the sensitive variable.
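As one concrete way to compute such a distributional discrepancy, the sketch below averages the absolute difference of the group-wise exceedance rates over a grid of thresholds; this is an illustrative implementation and may differ from the exact SDP formula used in the experiments.

```python
import numpy as np

def sdp_measure(belief, s, grid=100):
    """Average over thresholds of |Pr(belief > t | S=1) - Pr(belief > t | S=0)|.

    One concrete way to quantify how strongly the belief (score) distribution
    depends on the sensitive variable; smaller values indicate more fairness.
    """
    thresholds = np.linspace(belief.min(), belief.max(), grid)
    b1, b0 = belief[s == 1], belief[s == 0]
    gaps = [abs((b1 > t).mean() - (b0 > t).mean()) for t in thresholds]
    return float(np.mean(gaps))
```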
Results of our numerical studies are presented in Tables 2 to 5. The remarks are summarized as follows.
Table 2 summarizes the results without fairness constraints. It is observed that using the sensitive variable in prediction is not helpful in this case. However, as we see in Tables 3 and 4, using the sensitive variable in prediction is helpful for fair learning algorithms. It would be interesting to figure out why the role of the sensitive variable in prediction differs between standard and fair learning algorithms, which we leave as future work.
From Tables 3 and 4, we can see that the fair classifiers with an imputed sensitive variable are superior to the fair classifiers that do not use the sensitive variable at all (S0), and are comparable to the fair classifiers that use the sensitive variable itself in the prediction phase (S1).
For the Wasserstein fair classifier, whose results are presented in Table 5, it is somewhat surprising that the fair classifier with an imputed sensitive variable is superior to the one using the sensitive variable itself. A possible explanation is that an imputed sensitive variable could regularize the estimated classifier and thus help avoid overfitting.
In general, the performance of fair prediction models with an imputed sensitive variable does not strongly depend on the choice of the imputation algorithm. Table 6 summarizes the accuracies of the three imputation algorithms. Boosting seems to be the best, but the performances of the corresponding fair prediction models are similar across the imputation algorithms. These observations suggest that the accuracy of an imputed sensitive variable is not critical for the performance of the corresponding fair prediction model unless the accuracy is too low.
In this paper, we have illustrated that using an imputed sensitive variable is helpful when using the sensitive variable itself in the prediction phase is not allowed. Also, the accuracy of imputing the sensitive variable does not strongly affect the overall performance of the fair classifiers; any reasonable supervised learning algorithm would be enough to obtain an imputed sensitive variable.
We proposed a two-step procedure in which the fair classifier and the imputation model for the sensitive variable are learned separately. A better procedure would do these two jobs at the same time, that is, learn the fair classifier and the imputation model simultaneously. This would be a promising direction for future work.
Table 1. Some group performance functions: events $\mathcal{E}$ and $\mathcal{E}'$ defining $q_s(f)$ for several fairness criteria. Here $\Omega$ denotes the whole sample space (no conditioning event).
Fairness criteria | $\mathcal{E}$ | $\mathcal{E}'$
---|---|---
Disparate impact (Barocas and Selbst, 2016) | $\{f(\mathbf{z}) > 0\}$ | $\Omega$
Equal opportunity (Hardt et al., 2016) | $\{f(\mathbf{z}) > 0\}$ | $\{Y = 1\}$
Disparate mistreatment w.r.t. error rate (Zafar et al., 2017) | $\{\operatorname{sign}(f(\mathbf{z})) \ne Y\}$ | $\Omega$
Mean score parity (Coston et al.) | $f(\mathbf{z})$ itself (score in place of an event indicator) | $\Omega$
Table 2. No fairness constraint: performance of classifiers learned with no fairness constraint. We fit classifiers under S0 (without the sensitive variable), S1 (with the sensitive variable), and the proposed method with the three imputation algorithms (Boost, Logistic, DNN).
Data | Value | S0 | S1 | Boost | Logistic | DNN | S0 | S1 | Boost | Logistic | DNN
---|---|---|---|---|---|---|---|---|---|---|---
Adults | ACC | 0.852 | 0.852 | 0.851 | 0.852 | 0.852 | 0.851 | 0.852 | 0.851 | 0.852 | 0.851
 | DI | 0.172 | 0.177 | 0.186 | 0.187 | 0.194 | 0.172 | 0.178 | 0.187 | 0.188 | 0.195
 | PI | 0.021 | 0.023 | 0.026 | 0.026 | 0.026 | 0.021 | 0.024 | 0.041 | 0.026 | 0.029
Bank | ACC | 0.911 | 0.911 | 0.911 | 0.911 | 0.910 | 0.911 | 0.911 | 0.911 | 0.911 | 0.911
 | DI | 0.174 | 0.229 | 0.343 | 0.395 | 0.336 | 0.173 | 0.210 | 0.316 | 0.366 | 0.307
 | PI | 0.007 | 0.009 | 0.011 | 0.013 | 0.011 | 0.007 | 0.008 | 0.010 | 0.012 | 0.011
Law | ACC | 0.823 | 0.823 | 0.823 | 0.822 | 0.823 | 0.823 | 0.823 | 0.823 | 0.823 | 0.823
 | DI | 0.119 | 0.148 | 0.277 | 0.277 | 0.260 | 0.119 | 0.145 | 0.272 | 0.273 | 0.251
 | PI | 0.009 | 0.012 | 0.022 | 0.018 | 0.019 | 0.009 | 0.012 | 0.022 | 0.018 | 0.019
Table 3. In-processing DI: performance of classifiers with the DI fairness constraint. We fit classifiers under S0, S1, and the proposed method with the three imputation algorithms (Boost, Logistic, DNN).
Data | Value | S0 | S1 | Boost | Logistic | DNN | S0 | S1 | Boost | Logistic | DNN
---|---|---|---|---|---|---|---|---|---|---|---
Adults | ACC | 0.832 | 0.832 | 0.832 | 0.832 | 0.832 | 0.832 | 0.832 | 0.832 | 0.832 | 0.832
 | DI | 0.027 | 0.028 | 0.037 | 0.039 | 0.045 | 0.028 | 0.029 | 0.037 | 0.040 | 0.045
Bank | ACC | 0.904 | 0.908 | 0.908 | 0.908 | 0.908 | 0.905 | 0.909 | 0.909 | 0.909 | 0.909
 | DI | 0.031 | 0.007 | 0.019 | 0.032 | 0.017 | 0.031 | 0.005 | 0.025 | 0.047 | 0.021
Law | ACC | 0.811 | 0.820 | 0.820 | 0.821 | 0.821 | 0.811 | 0.820 | 0.819 | 0.820 | 0.820
 | DI | 0.017 | 0.030 | 0.082 | 0.093 | 0.085 | 0.016 | 0.030 | 0.081 | 0.091 | 0.083
Table 4. In-processing PI: performance of classifiers with the PI fairness constraint. We fit classifiers under S0, S1, and the proposed method with the three imputation algorithms (Boost, Logistic, DNN).
Data | Value | S0 | S1 | Boost | Logistic | DNN | S0 | S1 | Boost | Logistic | DNN
---|---|---|---|---|---|---|---|---|---|---|---
Adults | ACC | 0.829 | 0.832 | 0.833 | 0.832 | 0.833 | 0.829 | 0.832 | 0.833 | 0.833 | 0.834
 | PI | 0.003 | 0.000 | 0.001 | 0.001 | 0.002 | 0.003 | 0.000 | 0.001 | 0.001 | 0.002
Bank | ACC | 0.907 | 0.909 | 0.909 | 0.909 | 0.909 | 0.908 | 0.911 | 0.911 | 0.911 | 0.911
 | PI | 0.002 | 0.001 | 0.001 | 0.002 | 0.007 | 0.002 | 0.004 | 0.006 | 0.007 | 0.006
Law | ACC | 0.812 | 0.818 | 0.817 | 0.818 | 0.818 | 0.812 | 0.822 | 0.817 | 0.819 | 0.818
 | PI | 0.002 | 0.000 | 0.004 | 0.004 | 0.004 | 0.002 | 0.005 | 0.004 | 0.004 | 0.004
Table 5. Wasserstein post-processing: performance of classifiers with Wasserstein post-processing. We fit classifiers under S1 and the proposed method with the three imputation algorithms (Boosting, Logistic, DNN).
Data | Value | S1 | Boosting | Logistic | DNN | S1 | Boosting | Logistic | DNN
---|---|---|---|---|---|---|---|---|---
Adults | ACC | 0.721 | 0.832 | 0.831 | 0.833 | 0.720 | 0.838 | 0.837 | 0.839
 | DI | 0.008 | 0.010 | 0.003 | 0.013 | 0.002 | 0.013 | 0.005 | 0.017
 | SDP | 0.003 | 0.008 | 0.006 | 0.013 | 0.003 | 0.008 | 0.007 | 0.015
Bank | ACC | 0.875 | 0.909 | 0.909 | 0.909 | 0.872 | 0.911 | 0.911 | 0.911
 | DI | 0.044 | 0.024 | 0.048 | 0.030 | 0.041 | 0.024 | 0.042 | 0.030
 | SDP | 0.056 | 0.039 | 0.064 | 0.044 | 0.052 | 0.037 | 0.050 | 0.043
Law | ACC | 0.789 | 0.815 | 0.817 | 0.816 | 0.790 | 0.825 | 0.828 | 0.827
 | DI | 0.034 | 0.042 | 0.052 | 0.044 | 0.046 | 0.043 | 0.052 | 0.046
 | SDP | 0.029 | 0.060 | 0.071 | 0.066 | 0.036 | 0.061 | 0.070 | 0.066
Table 6. Accuracy of the imputation models for the sensitive variable (Boosting, Logistic, DNN) on training and test data.
Data | Method | Train Acc | Test Acc
---|---|---|---
Adults | Boosting | 0.879 | 0.848
 | Logistic | 0.844 | 0.842
 | DNN | 0.872 | 0.837
Bank | Boosting | 0.982 | 0.967
 | Logistic | 0.964 | 0.964
 | DNN | 0.972 | 0.966
Law | Boosting | 0.928 | 0.881
 | Logistic | 0.877 | 0.878
 | DNN | 0.883 | 0.882