With the proliferation of experimental studies in education, policymakers and practitioners have grown increasingly interested in understanding the extent to which the results of a study apply or generalize to a population or target group of students (Tipton and Olsen, 2018). When the goal is to generalize the results from a sample to a larger population of students, the so-called “narrow to broad” perspective (Shadish
Propensity score approaches have made an important contribution to improving the contextual generalizability of experimental results. Specifically, in narrow to broad generalizations, propensity scores are primarily used to match or reweight students in a sample with those in the population, with the understanding that when the two groups are compositionally similar (based on the observable treatment effect moderators), treatment effect estimates will be generalizable. A similar assessment is made when generalizing across samples. However, generalization based on contextual (compositional) similarities is not the only type of generalization of interest in studies. The notion of generalization has different meanings in various research fields. For example, in the statistical learning context, generalizability refers to the extent to which a statistical model can be used to make out-of-sample predictions (Linden and Yarnold, 2016a, 2016b; Cai
Given the differences between how generalizability is defined, an important question is whether different notions of generalization in the contextual and model framework are associated with each other. The purpose of this study is to explore the connection between measures of contextual generalization (based on compositional similarity) and measures of model generalization, the latter of which refers to statistical models and predictive accuracy across various samples. This exploration is motivated by two factors. One, contextual generalization and propensity score methods are often used in studies where causal inference and the estimation of the causal impact of an intervention is the goal. In contrast, model generalization is related to prediction and the extent to which a fitted model can be applied across different samples of students or units to produce accurate predictions. Because causal inference and prediction are generally not discussed in the same context, a goal of this study is to assess whether a relationship exists between the two frameworks within the context of generalizability. Two, while both contextual and model generalization are important strands of generalization research, the concepts related to the former are often not present in studies that focus on the latter; namely, studies that center around the generalizability of models generally do not include discussions of contextual generalization. As a result, this study aims to bridge the two frameworks of generalization by providing the first empirical evidence of the association (or absence of one) between measures of contextual and model generalization. If a relationship exists between the two frameworks, this has important implications for generalization research as it would allow researchers to identify the conditions under which the results of a study apply across various groups of students.
The article is organized as follows. First, we provide a brief review of the concepts related to contextual and model generalization. In the same section, we also discuss the assumptions and existing measures used to assess each type of generalization (contextual and model) and highlight some important differences between the goals of each framework. Then, we explore the relationship between contextual and model generalization using a case study based on a subset of data from the Programme for International Student Assessment (PISA) 2015 study. PISA is a widely used, publicly available educational data set that assesses various aspects of 15-year-olds’ educational experiences across a sample of countries. Because the empirical example is based on observational data from PISA, our study differs from prior generalization research in an important way: we use an observational data set, rather than an experimental study, as the empirical example. We made this choice because PISA provides a rich source of covariate information that can be used to address a variety of research questions; moreover, because PISA is publicly available, the results of our study may motivate future work on the connection between contextual and model generalization. However, while our empirical example is based on an international large-scale data set, the relationship between contextual and model generalization has broader implications for educational research. In the final section, we conclude with a discussion of these implications in relation to our findings based on the PISA data.
In this section, we provide a brief review of contextual generalization. Although the sample to population (narrow to broad) framework is the common one of interest in education studies, we center our review on the sample to sample (units of similar levels) perspective to be consistent with the PISA empirical example. However, extensions of the assumptions and approaches can be made to the narrow to broad perspective. Our review is based on the causal inference model used to estimate treatment effects in non-experimental studies (Rosenbaum and Rubin, 1983). In non-experimental or observational studies, the goal is to estimate an average treatment effect in the presence of self-selection (non-random selection) into the treatment and comparison groups. In our context, we refer to membership in a specific country (in PISA) as the equivalent of a “treatment” and assess the contextual generalizability across individuals in the treatment and comparison countries (groups).
Given this setup, consider a population
The challenge in many generalization studies in education is that the selection into samples is not random. This creates a selection bias in parameter estimates since units that select into a sample may be compositionally different from units that select into different samples. In this case, to make valid generalizations, model-based methods are needed (Olsen
Under the given framework, let
Because propensity scores are a function of the covariates
The validity of propensity score-based estimators depends on several assumptions. First, the Stable Unit Treatment Value Assumption (SUTVA) (Rubin, 1978, 1980, 1990; Tipton, 2013a) for the sample is required. Under this assumption, the outcomes of each student do not depend on membership in a specific sample and there is no interference among students across samples. Second, the treatment assignment must be strongly ignorable (Rosenbaum and Rubin, 1983). Under this assumption, the parameter
If the assumptions for propensity score methods are met, they can be used to produce bias-reduced estimates of the parameter of interest. In addition to their use in estimation, matching, and reweighting, propensity scores have also been used to assess generalization among students of different populations (Tipton, 2014; Chan, 2017; Tipton and Olsen, 2018). In this framework, contextual generalization is assessed by examining the extent of compositional similarity among groups of students or schools based on the propensity score distributions. Note that because propensity scores are univariate summaries of multiple covariates, similarity on the propensity scores is equivalent to similarity in the distributions of covariates that are used to estimate the propensity scores (Rosenbaum and Rubin, 1983). Formally, compositional similarity and contextual generalization are equivalent to the condition:

$$f\{e(X) \mid Z = 1\} = f\{e(X) \mid Z = 0\},$$

where the distributions of the propensity scores $e(X)$ are perfectly balanced between the treatment ($Z = 1$) and comparison ($Z = 0$) samples.
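To make the workflow concrete, the sketch below (not the authors’ code) treats membership in a focal country as the “treatment” indicator, estimates propensity scores with a logistic regression on the observed moderators, and returns the scores for both groups so their distributions can be compared. The data file, column names, and choice of moderators are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_scores(df, focal_country, moderators, country_col="country"):
    """Estimate each student's probability of belonging to the focal country
    ("treatment") rather than the comparison countries, given the moderators."""
    z = (df[country_col] == focal_country).astype(int).to_numpy()
    X = df[moderators].to_numpy()
    model = LogisticRegression(max_iter=1000).fit(X, z)
    return model.predict_proba(X)[:, 1], z  # propensity scores and membership indicator

# Hypothetical usage: inspect the two propensity score distributions.
# df = pd.read_csv("pisa_low_ses_subsample.csv")
# e, z = propensity_scores(df, "Finland", ["ESCS", "BELONG", "MOTIVAT"])
# print(np.quantile(e[z == 1], [0.1, 0.5, 0.9]), np.quantile(e[z == 0], [0.1, 0.5, 0.9]))
```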
In practice, several statistical measures are available to quantify the extent of contextual generalization or compositional similarity. Among these, the generalizability index, or B-index (Tipton, 2014), summarizes the similarity between two propensity score distributions on a scale from 0 to 1, with values closer to 1 indicating greater compositional similarity and, hence, stronger contextual generalization.
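As an illustration, the index can be approximated in a few lines. The sketch below assumes the formulation described by Tipton (2014), in which the index is the Bhattacharyya coefficient of kernel density estimates of the two propensity score distributions; it is an illustrative approximation rather than the exact estimator used in this study.

```python
import numpy as np
from scipy.stats import gaussian_kde

def b_index(ps_focal, ps_comparison, grid_size=512):
    """Approximate the generalizability (B) index as the Bhattacharyya coefficient
    of two propensity score distributions, estimated with Gaussian kernel densities."""
    grid = np.linspace(0.0, 1.0, grid_size)          # propensity scores lie in [0, 1]
    f_focal = gaussian_kde(ps_focal)(grid)
    f_comp = gaussian_kde(ps_comparison)(grid)
    # Integrate sqrt(f_focal * f_comp) over the grid; 1 indicates identical distributions.
    return float(np.sum(np.sqrt(f_focal * f_comp)) * (grid[1] - grid[0]))

# Hypothetical usage with the scores from the previous sketch:
# print(b_index(e[z == 1], e[z == 0]))
```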
Other measures used to quantify contextual generalization include standardized mean differences for individual covariates and for the propensity score logits (Stuart
Contextual generalization focuses on compositional similarity among samples based on the distributions of covariates. Model generalization focuses on the extent to which a predictive model fitted among units in one sample yields accurate predictions of an outcome for another sample. As a result, model generalization is assessed using measures of predictive accuracy between two samples; namely, the better the predictive accuracy, the more generalizable the model across various samples of students or schools. Formally, model generalization requires that the prediction error of a model fitted to one sample remain small when the model is applied to units in a different sample.
In practice, to assess model generalization, predictive accuracy is summarized using the root mean squared error (RMSE) of the predictions in the validation sample:

$$\mathrm{RMSE} = \sqrt{\mathbb{E}\left[\left(Y_i - \hat{Y}_i\right)^2\right]},$$

where the expression is the square root of the expected value of the squared differences between the observed outcomes $Y_i$ and the predicted outcomes $\hat{Y}_i$ for units in the validation sample. Lower values of the RMSE indicate stronger model generalization.
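For concreteness, the following minimal sketch computes this cross-sample RMSE by fitting a regression model in one sample and evaluating its predictions in a second, held-out sample; the model and the column names are placeholders rather than the specifications used in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def cross_sample_rmse(train_df, test_df, predictors, outcome):
    """Fit a model in one sample and return the RMSE of its predictions in another."""
    model = LinearRegression().fit(train_df[predictors], train_df[outcome])
    predictions = model.predict(test_df[predictors])
    squared_errors = (test_df[outcome].to_numpy() - predictions) ** 2
    return float(np.sqrt(squared_errors.mean()))

# Hypothetical usage:
# rmse = cross_sample_rmse(sample_a, sample_b, ["ESCS", "MOTIVAT"], "science_score")
```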
Given the differences between contextual and model generalizability, the current study addresses the following research questions.
Are measures of contextual and model generalizability associated with each other? That is, if two samples are contextually generalizable, does this imply that a model fit to one sample will generalize to units in the other sample (and have a small RMSE)?
Under what conditions, if any, does contextual generalization imply model generalization?
Collectively, the two research questions inform practitioners and policymakers of the conditions under which two frameworks for generalization are aligned with each other. If the compositional similarity between two groups is associated with accuracy in predictions, this can have implications for the choice of analytic approach to use when generalizing parameter estimates across populations.
To explore the relationship between contextual and model generalization, we use data from the Programme for International Student Assessment (PISA) 2015 study. PISA is a widely used, publicly available educational data set that assesses various aspects of 15-year-olds’ education across a sample of countries. To date, seven waves of PISA data are publicly available, and the 2015 wave focuses on science as the learning outcome, which we also use in the current study. The original sample size of the PISA 2015 data is
Given the large number of variables and observations among all participating countries, we reduced the original size of the data by restricting the focus of our study to 30 countries and to students (in each country) who were categorized as low socioeconomic status (SES). Our decision to focus on a subset of the data was motivated by two factors. One, for our statistical analyses, it was essential to reduce the dimension of the observational data because including all variables substantially slowed computation. Two, we focused on low-SES students (rather than high-SES students) because researchers of educational equity are strongly interested in understanding the factors that affect academic achievement among these student groups. Note that we also conducted the same analyses using the subpopulation of high-SES students; the results were similar and can be provided upon request. Students were identified as low-SES using the PISA Economic, Social and Cultural Status (ESCS) index in the specified country. Using the subset of low-SES students among 30 countries, our final analytic sample comprised
Preliminary analyses revealed that most of the PISA covariates had missing data. However, among the variables that were used in our statistical analyses, most (90%) had rates of missingness of less than 10%. Because the identified variables were crucial for the generalization analyses, we performed missing data imputation using Multiple Imputation by Chained Equations (MICE) (Van Buuren and Groothuis-Oudshoorn, 2010). MICE was chosen for its flexibility in dealing with different types of variables, and it was applied to covariates with missing data for each country and economy in the analytic sample. Note that alternative approaches to MICE for imputation can also be used, but these methods may depend on different assumptions (Murray, 2018).
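As a rough illustration of this step, a chained-equations-style imputation can be run within each country using scikit-learn’s IterativeImputer. This is a Python analogue of, not a substitute for, the MICE implementation cited above, and the column names are placeholders.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the class below)
from sklearn.impute import IterativeImputer

def impute_by_country(df, covariates, country_col="CNT"):
    """Chained-equations-style imputation of the covariates, run within each country."""
    imputed_parts = []
    for _, part in df.groupby(country_col):
        imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
        part = part.copy()
        part[covariates] = imputer.fit_transform(part[covariates])
        imputed_parts.append(part)
    return pd.concat(imputed_parts)
```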
The PISA 2015 study included a rich set of covariates for each participating country and economy. One of the key assumptions required to facilitate generalizations (of any form) is that the covariates be moderators of the outcome. As a preliminary analysis, we identified the variables that were significant predictors (and, as a result, potential moderators) of the outcome of science achievement. We used random forest and gradient boosting methods (Freund and Schapire, 1996; Breiman, 2001) to identify the covariates from the original PISA set, which included hundreds of variables at the student, teacher, and school levels. Random forest (RF) (Breiman, 2001) is an ensemble machine learning approach that combines classification and regression trees (CART) to identify a set of relevant predictors. Similarly, gradient boosting methods (GBM) (Friedman, 2001) use CARTs to build prediction models and rank the covariates based on the accuracy of predictions. Both methods were used because they can accommodate high-dimensional data and facilitate model interpretability (Guelman, 2012; Natekin and Knoll, 2013; Taieb and Hyndman, 2014; Zhang and Haghani, 2015). Using the combination of RF and GBM, we identified 28 covariates that served as potential outcome moderators. Of these, 21 were at the student level and 7 were at the school level. A description of the variables can be found in Table 1. In the next section, we describe how the covariates were used to construct subsamples of the countries and compute statistics for contextual generalization.
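A simplified sketch of this screening step is shown below; the rule for combining the two methods (averaging the RF and GBM importance scores) is an illustrative assumption rather than the exact procedure used in the study, and the variable names are placeholders.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

def rank_moderators(X, y, n_top=28):
    """Rank candidate covariates by the average of RF and GBM importance scores."""
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
    gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
    importance = pd.DataFrame(
        {"rf": rf.feature_importances_, "gbm": gbm.feature_importances_},
        index=X.columns,
    )
    importance["mean"] = importance.mean(axis=1)
    return importance.sort_values("mean", ascending=False).head(n_top)

# Hypothetical usage:
# top_covariates = rank_moderators(analytic[candidate_vars], analytic["science_score"])
```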
Since the current study involves multiple countries, we organized our analysis and discussion of contextual and model generalization among three groups based on: (i) geographic region, (ii) GDP per capita and (iii)
Table 2 lists the three main groups of the 30 countries and economies based on the three criteria. Within each group, we identified multiple subsamples of countries. For the subsample based on geographic region (first row), we used a coarse grouping method based on the continents, which created four main subsamples of countries. For the second approach based on GDP per capita (second row), we created five subsamples based on ranges that quantified the wealth of the countries. Note that the 2015 GDP per capita values used in this approach are in current US dollars from the World Bank, updated as of October 2019 (World Bank, 2019). The third grouping method was based on the statistical method of
where a
where
To perform
Using the 30 countries and three main groups (by geographic region, GDP, and
We assessed the relationship between contextual and model generalization in the following way. We computed values of the B-index and RMSE using an individual-to-subsample approach, in which generalization is assessed between a single country and the remaining countries in its subsample. Thus, to assess contextual generalization, the B-index is computed by comparing the propensity score distributions of students in a single country with those of students in the remaining countries of the subsample. Similarly, for model generalization, the model is trained on all countries but one in a subsample and validated using the holdout country (Grandvalet and Bengio, 2004).
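A minimal sketch of this individual-to-subsample scheme is shown below: each country in turn serves as the holdout, the model is fit to the remaining countries, and one RMSE is returned per country. The linear model and column names are placeholders rather than the study’s specifications.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def leave_one_country_out_rmse(df, predictors, outcome, country_col="CNT"):
    """For each country, train on the remaining countries in the subsample and
    return the RMSE of the predictions for the held-out country."""
    rmses = {}
    for country in df[country_col].unique():
        train = df[df[country_col] != country]
        test = df[df[country_col] == country]
        model = LinearRegression().fit(train[predictors], train[outcome])
        predictions = model.predict(test[predictors])
        rmses[country] = float(np.sqrt(np.mean((test[outcome].to_numpy() - predictions) ** 2)))
    return rmses
```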
To determine whether there is a relationship between contextual and model generalization, we computed correlations between the B-indices and RMSEs. Because a higher B-index implies stronger contextual generalization while a lower RMSE implies stronger model generalization, a significant negative correlation between the two measures would suggest an alignment between the two types of generalization.
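Illustratively, the per-country B-indices and RMSEs produced by the earlier sketches can be paired and correlated as follows; this is a sketch of the idea, and the study’s exact computation may differ.

```python
from scipy.stats import pearsonr

def b_rmse_correlation(b_indices, rmses):
    """Correlate per-country B-indices with per-country RMSEs.
    Both arguments are dictionaries keyed by country."""
    countries = sorted(set(b_indices) & set(rmses))
    return pearsonr([b_indices[c] for c in countries],
                    [rmses[c] for c in countries])  # returns (r, p-value)
```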
In this section, we discuss the results from the contextual and model generalization analyses. For the sake of parsimony, we summarize the overall findings from the 90 combinations of B-index and RMSE computations (30 countries × 3 main groups). We center our discussion around three main trends: (i) values in the B-index for contextual generalization among the countries in each subsample, (ii) values of the RMSE for model generalization among the subsamples and, (iii) the correlation between values of the B-index and RMSE. In the following section, we begin with the values of the contextual and model generalization statistics for the linear regression case.
The results suggest that the B-indices that describe contextual generalization varied among the three main groups of countries. When grouped by geographic region, the B-indices were highest (B-index ranging from 0.49 to 0.78) for the countries in Western Europe and in the Central and Eastern European subsample. Note, however, that while the values were considered highest for these two subsamples, the range from 0.49 to 0.78 suggests moderate compositional similarity at best (Tipton, 2014). Thus, when grouped by geographic region, most countries in the subsample were less compositionally similar based on the potential outcome moderators. When grouped by GDP per capita, there was no discernible pattern among the values in each subsample with the exception that countries with high GDP per capita generally had small values of the B-index. This suggests that there was less compositional similarity among countries grouped by GDP. In the
We analyzed the values of the RMSE for model generalization in the same framework. As mentioned, we fit 10 models on the subsamples, of which four were linear regression models. When grouped by geographic region, the patterns of cross-validation RMSEs were similar overall across the four models. Countries in the Americas had the lowest RMSE values across the four models and, as a result, these countries had the strongest generalizability in model predictions. In contrast, countries in Asia had the lowest level of model generalization, with the highest values of RMSE. When grouped by GDP per capita, the countries with the lowest GDP had the lowest RMSEs and hence the strongest model generalizability. This is consistent with the trends in the
An important question is how the values of the B-index for contextual generalization compared with the values of the RMSE for model generalization when fitting predictive models based on linear regression. Table 3 provides the overall correlations between the B-indices and RMSE values in each of the three main groups. The first row provides the average correlation across all the models, computed by averaging the correlations for the model-by-country combinations across the subsamples within each group. Because we assessed the statistical significance of the correlations across multiple combinations, we applied a Benjamini-Hochberg correction to the p-values.
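As a sketch of this adjustment, the correlation p-values can be corrected with the Benjamini-Hochberg (false discovery rate) procedure; the p-values below are placeholders for illustration only.

```python
from statsmodels.stats.multitest import multipletests

# One p-value per B-index/RMSE correlation (placeholder values for illustration).
p_values = [0.03, 0.20, 0.004, 0.45]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```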
In this section, we assess whether the association between contextual and model generalization depends on the type of model. In addition to the four linear regression models, we fit six multilevel models in which students were nested within schools in each participating country. The models are depicted in Figure 4. As in the linear regression case, we assessed contextual and model generalization by referring to individual countries in the subsamples as the “treatment” and the remaining countries as the comparison.
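For illustration, a random-intercept model of this kind can be sketched with statsmodels; the formula, predictors, and column names below are hypothetical and do not reproduce the exact specifications of MLM Models 1 through 6.

```python
import statsmodels.formula.api as smf

def fit_school_random_intercept(df):
    """Random-intercept model: science achievement on student-level covariates,
    with students nested in schools (school-level random intercept).
    Column names here are illustrative placeholders."""
    model = smf.mixedlm("science_score ~ ESCS + MOTIVAT + BELONG",
                        data=df, groups=df["school_id"])
    return model.fit()

# Hypothetical usage on the training countries of a subsample:
# result = fit_school_random_intercept(train_df)
# print(result.summary())
```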
We first analyzed the trends in B-index and RMSE values. Because the propensity scores were estimated using the same covariates and countries in each subsample, the B-index values were the same as in the linear regression cases. As a result, we focus on the trends among RMSE values in the subsamples. When grouped by geographic region, the average RMSE varied across the subsamples with some of the smallest RMSE values (implying high model generalization) seen in the Americas and Eastern European countries. This trend was consistent across the six multilevel models. When grouped by GDP per capita, there was notably more variability in RMSE values though the smallest values were largely seen in the low GDP subsample. Finally, when grouped by
Using the estimated B-index and RMSE values for the multilevel models, we estimated the correlations between the two measures. Table 4 shows the correlations for the six models and three main groups of countries. The first row provides the average correlation across the multilevel models, computed in the same manner as for the linear regression models. Interestingly, the correlations for the individual models were negative but nonsignificant when countries were grouped by geographic region and GDP per capita. In contrast, they were all positive (but still nonsignificant) when countries were grouped by
The goal of this application to the PISA 2015 data was to assess the extent to which concepts of contextual generalization aligned with concepts associated with model generalization. The analyses above sought to address this question by focusing on three main groups of countries from the data set. Within each group, we created subsamples of countries on which to base our generalization assessments. The results suggest two main implications. First, the results are somewhat mixed with respect to an alignment between contextual and model generalization. While the correlations between the B-index and RMSE were largely negative among the geographic region and GDP groups, few were significant, and the correlations were all positive when countries were grouped by
Generalization research continues to play an important role in informing educational research as policymakers and practitioners have grown increasingly interested in identifying approaches and best practices to support populations of students. Because generalization is defined and assessed differently across various disciplines, the current study sought to assess the extent to which two definitions of generalizability were aligned. This study focused on the relationship between contextual and model generalization, which, at a broad level, reflects the relationship between causal inference and prediction. This connection, if present, is important in several ways. One, if compositional similarity were related to model generalization and predictive accuracy, then precise estimates of parameters of interest could be derived for various populations of students or individuals. Two, if a relationship exists between contextual and model generalization, prediction can be used to derive parameter estimates for students or individuals not sampled in a study. This can have implications for ways to handle missing data. Finally, if contextual and model generalization are aligned, this relationship can potentially facilitate causal inference across multiple populations of students, particularly when a model that is used to derive precise causal estimates can be applied in various groups of individuals.
The results of our study suggest that contextual and model generalization are potentially aligned, but the evidence is inconsistent. Because the current study focused on generalizations from individual countries to subsamples of countries, the results suggest that any potential alignment between contextual and model generalization is difficult to observe from this perspective. This finding may not necessarily be surprising, as model generalization depends on the specific samples used in the training set. However, the empirical evidence that an alignment between the two definitions of generalization potentially exists is useful. For researchers who use model-based methods for statistical inference, the results of this study can inform ways of deriving inferences across different samples of students, particularly when data on potential outcome moderators are available. Future research should explore whether the empirical evidence of alignment is stronger in other perspectives of generalization, such as sample to population or population to sample (Shadish
Note: LM refers to linear model.
Note: MLM refers to multilevel model.
Table 1. Covariate descriptions (PISA 2015 data)
Variable | Variable abbreviation | Variable description |
---|---|---|
Student-level Covariates a | ||
ESCS | ESCS | Index of economic, social and cultural status (a composite score built by the indicators parental education (PARED), highest parental occupation (HISEI), and home possessions (HOMEPOS) including books in the home via principal component analysis (PCA)) |
Mother_Edu | ST005Q01TA | What is the highest level of schooling completed by your mother? (1 = ISCED level 3A; 2 = ISCED level 3B, 3C; 3 = ISCED level 2; 4 = ISCED level 1; 5 = She did not complete ISCED level 1) |
Father_Edu | ST007Q01TA | What is the highest level of schooling completed by your father? (1 = ISCED level 3A; 2 = ISCED level 3B, 3C; 3 = ISCED level 2; 4 = ISCED level 1; 5 = He did not complete ISCED level 1) |
Vocational_Edu | ISCEDO | Programme orientation (ISCEDO) indicates whether the programme’s curricular content was general, pre-vocational or vocational (1 = General; 2 = Pre-Vocational; 3 = Vocational; 4 = Modular) |
For the following items: (1 = No, never; 2 = Yes, once; 3 = Yes, twice or more) | |
Retention_PrimaryEdu | ST127Q01TA | Have you ever repeated a grade? At ISCED 1: primary education |
Retention_LowSecEdu | ST127Q02TA | Have you ever repeated a grade? At ISCED 2: lower secondary education |
For the following items: How often do you do these things? (1 = Very often; 2 = Regularly; 3 = Sometimes; 4 = Never or hardly ever) | |
Freq_HaveSciBooks | ST146Q02TA | Borrow or buy books on broad science topics |
Freq_GoSciClub | ST146Q05TA | Attend a science club |
Freq_SimNaturalSciLab | ST146Q06NA | Simulate natural phenomena in computer programs virtual labs |
Freq_SimTechSciLab | ST146Q07NA | Simulate technical processes in computer programs virtual labs |
Freq_VisitEcologyWebPage | ST146Q08NA | Visit web sites of ecology organisations |
Freq_FollowNewsBlog | ST146Q09NA | Follow news via blogs and microblogging |
Test_Anxiety | ANXTEST | Personality: Test Anxiety (WLE b), derived from ST118 based on IRT scaling. |
Enjoy_Teamwork | COOPERATE | Collaboration and teamwork dispositions: Enjoy cooperation (WLE), including answers to items ST082Q02NA, ST082Q03NA, ST082Q08NA, and ST082Q12NA. |
Value_Teamwork | CPSVALUE | Collaboration and teamwork dispositions: Value cooperation (WLE), including answers to items ST082Q01NA, ST082Q09NA, ST082Q13NA and ST082Q14NA. |
Achy_Motivat | MOTIVAT | Student Attitudes, Preferences and Self-related beliefs: Achieving motivation (WLE), derived from ST119 based on IRT scaling. |
Sch_Belonging | BELONG | Subjective well-being: Sense of Belonging to School (WLE), derived from ST034 based on IRT scaling. |
Num_SciClsAWeek | ST059Q03TA | Number of class periods required per week in science |
Num_ClsAWeek | ST060Q01NA | In a normal, full week at school, how many class periods are you required to attend in total? |
Min_MathLearnAWeek | MMINS | Learning time (minutes per week) - Mathematics |
Min_LearnAWeek | TMINS | Learning time (minutes per week) - in total |
School-level Covariates a | ||
Sch_Size | SCHSIZE | School Size (Sum) |
Prop_ComputerInternet | RATCMP2 | Proportion of available computers that are connected to the Internet |
Staff_Short | STAFFSHORT | Shortage of educational staff (WLE) |
Prop_CertTcher | PROATCE | Index proportion of all teachers fully certified |
Prop_SciTertiaryGrad | PROSTMAS | Index proportion of science teachers with ISCED level 5A and a major in science |
Num_TcherSch | TOTAT | Total number of all teachers at school |
Num_SciTcherSch | TOTST | Total number of science teachers at school |
a The covariates were selected using GBM and RF models.
b WLE refers to weighted likelihood estimates (Warm, 1989).
Table 2. Subsamples of countries
Grouping criterion | Subsamples | Description | Countries |
---|---|---|---|
Geographic region | 1 | Central & Eastern Europe | Montenegro, Bulgaria, Turkey, Croatia, Czech Republic, Estonia, Lithuania, Latvia, Poland, Russian Federation |
| 2 | Western Europe | Switzerland, Spain, Ireland, Iceland, Luxembourg, Portugal, Greece, Finland |
| 3 | Asia | United Arab Emirates (UAE), B-S-J-G (China), Korea, Chinese Taipei, Hong Kong, Macao |
| 4 | Americas | Costa Rica, Mexico, Colombia, Peru, Uruguay, United States |
GDP per capita | 1 | GDP per capita < 10,000 | Colombia, Peru, Montenegro, Bulgaria, B-S-J-G (China)*, Russian Federation, Mexico |
| 2 | GDP per capita < 15,000 | Turkey, Costa Rica, Croatia, Poland, Latvia, Lithuania |
| 3 | GDP per capita < 20,000 | Uruguay, Estonia, Czech Republic, Greece, Portugal |
| 4 | GDP per capita < 50,000 | Spain, Korea, United Arab Emirates, Hong Kong, Finland |
| 5 | GDP per capita > 50,000 | Iceland, United States, Ireland, Macao, Switzerland, Luxembourg |
| 1 | Low ESCS but high science achievement on average | B-S-J-G (China), Macao, Hong Kong, Chinese Taipei, Costa Rica, Luxembourg, Portugal, Spain, Uruguay, Colombia |
| 2 | Low ESCS and low science achievement on average | Bulgaria, Mexico, Montenegro, Peru, Russian Federation, Turkey, United Arab Emirates |
| 3 | High ESCS and high science achievement on average | Croatia, Czech Republic, Estonia, Finland, Greece, Iceland, Ireland, Korea, Latvia, Lithuania, Poland, Switzerland, United States |
ESCS refers to the PISA economic, social, and cultural status index. *Because of data limitations, the national GDP per capita for China was used for the four participating provinces (B-S-J-G: Beijing, Shanghai, Jiangsu, Guangdong), although the GDP per capita of these four richer provinces could be higher than the national figure.
Table 3. Average correlations between B-index and RMSE for linear regression models
Model | Geographic region | GDP per capita | |
---|---|---|---|
All Models | −0.17 | −0.16 | 0.24* |
LM Model 1 | −0.19 | −0.07 | 0.29 |
LM Model 2 | −0.29 | −0.20 | 0.06 |
LM Model 3 | −0.08 | −0.19 | 0.32 |
LM Model 4 | −0.12 | −0.18 | 0.34 |
Note: * p < 0.05; ** p < 0.01; *** p < 0.001.
Table 4. Average correlations between B-index and RMSE for multilevel linear regression models
Model | Geographic region | GDP per capita | |
---|---|---|---|
All Models | −0.23* | −0.26** | 0.24** |
MLM Model 1 | −0.18 | −0.26 | 0.17 |
MLM Model 2 | −0.20 | −0.27 | 0.19 |
MLM Model 3 | −0.31 | −0.22 | 0.39 |
MLM Model 4 | −0.29 | −0.30 | 0.41 |
MLM Model 5 | −0.19 | −0.27 | 0.17 |
MLM Model 6 | −0.21 | −0.27 | 0.16 |
Note: * p < 0.05; ** p < 0.01; *** p < 0.001.