This study develops a new type of latent class analysis (LCA) to explain the associations between one latent variable and several other categorical latent variables. Our model postulates that the prevalence of the latent variable of interest is affected by another latent variable composed of several other latent variables. For parameter estimation, we propose deterministic annealing EM (DAEM) to deal with the local maxima problem in the proposed model. We perform a simulation study to demonstrate how DAEM can find the set of parameter estimates at the global maximum of the likelihood over repeated samples. We apply the proposed LCA model to investigate the effect of the joint patterns of drug-using behavior on violent behavior among US high school male students, using data from the Youth Risk Behavior Surveillance System 2015. Considering the age of male adolescents as a covariate influencing violent behavior, we identified three classes of violent behavior and three classes of drug-using behavior. We also discovered that the prevalence of violent behavior is affected by the type of drug used.
Violent and drug-using behaviors are major social issues among US adolescents that contribute to premature death, disability, and other social problems (Van Horn
In this paper, we propose a new type of LCA with multiple latent groups (LCA-MLG) to investigate the effect of drug-using behavior on violent behavior. In our model, the subgroups of drug-using behavior are identified by the joint latent class analysis (JLCA) using the framework of LCA with multiple groups. We adopt deterministic annealing EM (DAEM) as a parameter estimation strategy to overcome the local maxima problem. We then apply the proposed model to data from the Youth Risk Behavior Surveillance System 2015 (YRBSS 2015) to investigate the effect of the joint patterns of drug-using behavior on violent behavior among US high school male students (Centers for Disease Control and Prevention, 2015).
The remainder of this paper presents the description of the proposed model LCA-MLG and estimation methods for the model parameters in Sections 2 and 3, respectively. In Section 4, we evaluate the performance of DAEM over repeated sampling. We then apply the proposed model to the real dataset from the YRBSS 2015 in Section 5.
The LCA is a finite mixture model for dividing a population into several subgroups based on individuals’ responses to the manifest items. LCA assumes that the population is composed of several unobservable subgroups (i.e., latent classes) that can be measured indirectly by multiple manifest items, implying that associations among the manifest items are fully explained by the latent class variable. Suppose there are
where
The JLCA is an extended version of the LCA model to deal with multiple latent class variables. Suppose there are
Let the joint latent class variable
where
The prevalence parameter,
The LCA-MLG postulates that a latent class variable may be affected by a joint latent class variable, which can be identified by the JLCA model. We therefore treat the joint latent class as a latent group variable in the traditional LCA. We propose the LCA-MLG and illustrate the model in Figure 1. The right side of Figure 1 is a JLCA with joint latent class variable
The likelihood of the manifest items (i.e., the observed-data likelihood) can be derived by the marginal summation of (
The prevalence of the outcome latent class variable may be affected by the demographic characteristics or other individual information, and these characteristics can be considered as covariates in the proposed model (Figure 1). Suppose we have a vector of covariates
The LCA with a latent group variable involves an unobservable latent structure; therefore, parameter estimation may be regarded as a missing-data problem. The expectation-maximization (EM) algorithm is the standard method to estimate the model parameters for LCA and JLCA. However, the EM algorithm is highly sensitive to its initial values. Given inappropriate initial values, the EM algorithm may converge to one of the local maxima rather than the global maximum. A common remedy is to run the standard EM algorithm from a large number of sets of starting values and choose the estimates with the highest log-likelihood. This may help to resolve the local maxima problem; however, the computational cost can be very high, and there is no guarantee that the result with the highest log-likelihood among the candidates is actually the global maximum. To overcome the local maxima problem in the proposed model, we adopt the DAEM method (Ueda and Nakano, 1998).
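The contrast between the multiple-start strategy and annealing can be sketched on a much simpler model than the LCA-MLG. The Python example below fits a basic LCA with binary items and is only an illustration under stated assumptions: the function names (`fit_lca`, `log_joint`, `log_likelihood`), the annealing schedule, the small jitter added at each stage, and the simulated data are all hypothetical and are not taken from the routine used in this paper. Setting the tempering factor `beta` to 1 gives the standard EM algorithm; smaller values flatten the posterior class probabilities in the spirit of Ueda and Nakano (1998).

```python
import numpy as np

def log_joint(Y, gamma, rho):
    """Log of (class prevalence x item likelihood), one column per class."""
    return np.log(gamma) + Y @ np.log(rho).T + (1 - Y) @ np.log(1 - rho).T

def log_likelihood(Y, gamma, rho):
    """Observed-data log-likelihood of a basic LCA with binary items."""
    lj = log_joint(Y, gamma, rho)
    top = lj.max(axis=1, keepdims=True)
    return float(np.sum(top[:, 0] + np.log(np.exp(lj - top).sum(axis=1))))

def fit_lca(Y, n_classes, betas=(1.0,), n_iter=200, seed=0):
    """Tempered EM: betas=(1.0,) is standard EM; an increasing schedule
    ending at 1.0 is a DAEM-style fit (the schedule is an assumption)."""
    rng = np.random.default_rng(seed)
    gamma = np.full(n_classes, 1.0 / n_classes)                # class prevalences
    rho = rng.uniform(0.2, 0.8, size=(n_classes, Y.shape[1]))  # P(item = 1 | class)
    for beta in betas:
        # Assumed jitter at the start of every stage so that classes which
        # merged at a small beta can separate again at a larger one.
        rho = np.clip(rho + rng.normal(0.0, 1e-3, rho.shape), 1e-6, 1 - 1e-6)
        for _ in range(n_iter):
            # E-step: posterior proportional to the joint raised to the power beta
            w = beta * log_joint(Y, gamma, rho)
            w -= w.max(axis=1, keepdims=True)
            post = np.exp(w)
            post /= post.sum(axis=1, keepdims=True)
            # M-step: weighted proportions, clipped away from the boundary
            gamma = np.clip(post.mean(axis=0), 1e-10, None)
            gamma /= gamma.sum()
            rho = np.clip((post.T @ Y) / (post.sum(axis=0)[:, None] + 1e-12),
                          1e-6, 1 - 1e-6)
    return gamma, rho, log_likelihood(Y, gamma, rho)

# Simulated data: two classes, four binary items (illustrative values only)
data_rng = np.random.default_rng(1)
true_rho = np.array([[0.9, 0.9, 0.9, 0.9], [0.1, 0.1, 0.1, 0.1]])
z = data_rng.integers(0, 2, size=500)
Y = (data_rng.random((500, 4)) < true_rho[z]).astype(float)

# Multiple-start EM: keep the run with the highest log-likelihood
em_runs = [fit_lca(Y, 2, seed=s) for s in range(10)]
best_em = max(em_runs, key=lambda run: run[2])

# Annealed fit from a single start
daem = fit_lca(Y, 2, betas=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0)
```

Comparing `best_em[2]` with `daem[2]` shows whether the single annealed run reaches a log-likelihood as high as the best of the random starts on this toy problem. The comparison says nothing by itself about the more complex LCA-MLG.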
The DAEM is proposed to overcome the local maxima problem by using the principle of maximum entropy. In the DAEM process, a modified posterior is introduced by additional factor
E-step: The DAEM maximizes the modified observed-data log-likelihood which can be defined as
where
To determine the optimal choice of
The optimal choice
where the modified posterior probabilities for the specific dimensions are defined as
M-step: We may obtain the estimators that maximize the expectation given in (
Starting with the fixed value of
It is important for LCA to assess model fit with a balanced judgement that considers objective measures as well as substantive knowledge in order to understand distinctive features and underlying structure of the data in a simple manner. The chi-square asymptotic assumption for the likelihood ratio test statistic (
It is also important to examine the absolute model fit by calculating the difference between expected and the observed frequencies. Jeong and Lee (2009) suggested the parametric bootstrap testing procedure to obtain the asymptotic distribution of test statistics proposed as a goodness-of-fit statistic for the cumulative logit model. Chung
Once the number of latent classes for each latent variable is determined, the number of group latent classes (i.e., joint latent classes) may be selected using similar criteria such as smaller AIC (or BIC) and bootstrap
In this section, we perform two sets of simulation studies to check how well the DAEM method works. The first study examines whether the DAEM method is superior to the EM in finding ML solutions at the global maximum. The second study evaluates how well the DAEM estimates model parameters in LCA with the latent group variable and covariates. We construct confidence intervals based on asymptotic standard errors from the Hessian matrix given in
In the first study, we generated one target dataset with 500 observations and randomly generated 30 sets of initial values. With these starting values, we independently performed parameter estimation using (a) the conventional EM method and (b) the DAEM method with
The second study examined whether the DAEM properly provides ML estimates of the proposed LCA model. We generated 200 data sets with a sample size of 500 and calculated ML estimates using the DAEM. For each data set, the parameter estimates and the standard errors from the Hessian matrix were used to construct 95% confidence intervals, and we checked whether each interval covered the true value of the parameter. These procedures were repeated independently for the 200 generated data sets, and the coverage of the confidence intervals was subsequently calculated.
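The coverage calculation itself is simple and can be sketched as follows. This Python example is a stand-in under stated assumptions: a binomial proportion with a Wald interval replaces the LCA-MLG estimates and Hessian-based standard errors, and the function name `coverage`, the seed, and the true value 0.10 are hypothetical choices for illustration.

```python
import math
import random

def coverage(true_p=0.10, n=500, n_rep=200, z=1.96, seed=2015):
    """Share of replications whose 95% Wald interval covers true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        # One simulated data set of size n and its point estimate
        successes = sum(rng.random() < true_p for _ in range(n))
        est = successes / n
        se = math.sqrt(est * (1 - est) / n)   # asymptotic standard error
        hits += (est - z * se <= true_p <= est + z * se)
    return hits / n_rep

cp = coverage()

# Monte Carlo standard error of an estimated coverage near 0.95
# when it is based on 200 replications
mc_se = math.sqrt(0.95 * 0.05 / 200)
```

With 200 replications, `mc_se` is about 0.015, so an estimated coverage that differs from 0.95 by a few hundredths is compatible with simulation noise alone.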
The structure of the generated data sets is as follows. There are two latent variables, each with two latent classes measured by four manifest items. These two latent variables form a group latent variable with two joint classes. There is also one outcome latent variable with two latent classes measured by four manifest items. For the measurement parameters, both strong measurement (Table 1) and mixed measurement (Table 2) were considered. The average estimates from the DAEM are close to the true values, and the coverage probabilities of the 95% confidence intervals are fairly close to 0.95 under both strong and mixed measurement. This implies that the parameter estimation and model identification work properly.
The Youth Risk Behavior Surveillance System 2015 (YRBSS 2015) is a biennial survey of health risk behavior, including drug-using behavior, among US adolescents. Among the 15,624 survey participants in the data, we focus on 4,957 high-school male students 16 to 18 years of age.
In this paper, we have 18 self-report items to measure violent and drug-using behavior. Five items were used to measure violent behavior: (1) During the past 30 days, on how many days did you carry a weapon such as a gun, knife, or club? (2) During the past 30 days, on how many days did you not go to school because you felt you would be unsafe at school or on your way to or from school? (3) During the past 12 months, how many times has someone threatened or injured you with a weapon such as a gun, knife, or club on school property? (4) During the past 12 months, how many times were you in a physical fight? (5) During the past 12 months, how many times were you in a physical fight in which you were injured and had to be treated by a doctor or nurse? Cigarette smoking was measured by four manifest items: (1) Have you ever tried cigarette smoking, even one or two puffs? (2) How old were you when you smoked a whole cigarette for the first time? (3) During the past 30 days, on how many days did you smoke cigarettes? (4) During the past 30 days, on the days you smoked, how many cigarettes did you smoke per day? The four items on alcohol consumption are as follows: (1) During your life, on how many days have you had at least one drink of alcohol? (2) How old were you when you had your first drink of alcohol other than a few sips? (3) During the past 30 days, on how many days did you have at least one drink of alcohol? (4) During the past 30 days, on how many days did you have 5 or more drinks of alcohol in a row, that is, within a couple of hours? Finally, the five items on other illegal drug-using behavior were: (1) During your life, how many times have you used marijuana? (2) How old were you when you tried marijuana for the first time? (3) During the past 30 days, how many times did you use marijuana? (4) Have you ever tried one of these illegal drugs: cocaine, sniffed solvents, heroin, methamphetamines, or ecstasy? (5) During the past 12 months, has anyone offered, sold, or given you an illegal drug on school property?
Among the original 18 manifest items, the quantitative variables are converted into binary items: the variables on the number of days (or times) are coded as 1 if one day or more (or one time or more), and 0 otherwise, so that each response indicates whether or not an individual has experienced the behavior. The items on the age of first use are coded as 1 if under 13 years old, and 0 otherwise, indicating whether or not there was early exposure. Table 3 shows the proportions of responding ‘yes’ to the manifest items and the missing rates.
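The recoding rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the raw responses are assumed to be already numeric, missing responses are represented by `None` and kept missing, and respondents who never used a substance are assumed to fall under "0 otherwise" for the age-of-first-use items.

```python
def code_frequency(value):
    """Return 1 if the behavior occurred on one or more days (or times), else 0."""
    if value is None:          # missing response stays missing
        return None
    return 1 if value >= 1 else 0

def code_early_onset(age_at_first_use, ever_used=True):
    """Return 1 if first use was before age 13, else 0.

    Assumption: respondents who never used the substance are coded 0.
    """
    if not ever_used:
        return 0
    if age_at_first_use is None:   # missing response stays missing
        return None
    return 1 if age_at_first_use < 13 else 0

# Hypothetical raw responses for one item of each kind
days_carried_weapon = [0, 3, None, 1]
recoded_days = [code_frequency(v) for v in days_carried_weapon]

first_drink = [(12, True), (15, True), (None, False), (None, True)]
recoded_onset = [code_early_onset(age, used) for age, used in first_drink]
```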
Using these 18 items along with age as a covariate, we inspect the effect of drug-using behavior on violent behavior by answering the following questions: (a) What kinds of latent classes may be found for each drug use and violent behavior? (b) What kinds of common joint patterns can be identified for cigarette, alcohol, and other illegal drug use behavior? (c) How does the prevalence of violent behavior change as the joint latent class membership of drug use varies?
To construct the LCA-MLG model, we need to determine the number of latent classes for each latent variable. First, we perform four LCAs to select the number of classes for the respective latent variables (i.e., violent behavior, cigarette smoking, alcohol consumption, and other illegal drug use) based on each set of manifest items. In this step, covariates need not be included owing to the marginalization property (Bandeen-Roche
Table 4 shows the goodness-of-fit statistics with the different number of classes for each latent variable. Note that only 2- and 3-class models are allowed to be fitted for
Given the number of latent classes for each drug-using behavior, we determine the number of joint latent classes in the LCA-MLG model. Table 5 lists a series of LCA-MLG models fitted with various numbers of joint latent classes for drug-using behavior and their goodness-of-fit measures. The 3-class model and the 4-class model were both considered since the number of latent classes for
The LCA-MLG model considered drug-using behavior as a group variable that may affect the outcome latent variable of
Under the selected model structure, the primary measurement parameter estimates for drug-using behavior (i.e.,
The estimated class prevalence for drug-using behavior may be estimated as
for
Table 7 shows the secondary measurement parameters (i.e.,
As a result, the second joint latent class may be interpreted as ‘marijuana user with cigarette and alcohol onset’ group. For the third joint latent class, individuals show probabilities of 0.580 for ‘early onset current smoker,’ 0.907 for ‘binge drinker,’ and 0.953 for ‘early and multiple drug user.’ Therefore, the third joint latent class can be labeled as ‘multiple drug user’ group. The last row in Table 7 indicates
Under the selected LCA-MLG model, the outcome latent variable is
for
We consider age as a covariate to investigate its influence on
The estimated prevalences of
Table 10 shows that the prevalences of
This paper proposes a new latent variable model to examine the relationship between violent behavior and multiple drug-using behavior among high-school male students. The conventional LCA deals with a single latent variable. The newly proposed LCA-MLG with covariates can investigate the joint effects of several latent variables on the prevalence of an outcome latent variable. The EM algorithm is widely adopted for parameter estimation with incomplete data, but it suffers from the problem of local maxima in the likelihood. We adopted the DAEM algorithm (Ueda and Nakano, 1998) to overcome this problem and to provide precise ML estimates for a latent variable model whose latent structure is quite complex. In addition, the Hessian matrix of the model was calculated to provide asymptotic standard errors of the estimates. The analysis of YRBSS 2015 indicates three representative subgroups that show similar patterns in drug-using behavior including cigarette, alcohol and other illegal drugs. These common patterns form a joint latent variable whose joint latent classes can be referred to as the ‘non drug user,’ ‘current drug user,’ and ‘multiple drug user’ groups, depending on the extent of experiences or behavior towards various drugs. Similarly, three common subgroups were discovered for the violent behavior of high-school students, as measured by five binary items. The LCA-MLG model enables us to examine how the prevalence of
The LCA-MLG tries to explain the association between several latent group variables and one latent outcome variable. However, the associations captured by the LCA-MLG model do not establish causality. Consequently, causal inference among several latent variables may be a valuable topic for further research. Finally, we have written a DAEM routine for LCA-MLG in R (version 3.3.1), which is available on request.
This work was supported by a Korea University Grant (K1509141 to Hwan Chung) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01056846 to Hwan Chung).
Average EST, MSE, and CP of 95% confidence intervals for parameter estimates (strong measurement)
Parameter | True | EST | MSE | CP | Parameter | True | EST | MSE | CP |
---|---|---|---|---|---|---|---|---|---|
0.10 | 0.101 | 0.0003 | 0.93 | 0.50 | 0.495 | 0.0020 | 0.93 | ||
0.10 | 0.098 | 0.0003 | 0.95 | 0.90 | 0.903 | 0.0016 | 0.96 | ||
0.10 | 0.101 | 0.0003 | 0.95 | 0.10 | 0.101 | 0.0018 | 0.94 | ||
0.10 | 0.100 | 0.0003 | 0.96 | 0.10 | 0.098 | 0.0017 | 0.93 | ||
0.90 | 0.900 | 0.0003 | 0.93 | 0.90 | 0.901 | 0.0018 | 0.93 | ||
0.90 | 0.901 | 0.0004 | 0.93 | −1.00 | −1.055 | 0.0823 | 0.96 | ||
0.90 | 0.900 | 0.0004 | 0.94 | 1.00 | 1.038 | 0.0511 | 0.93 | ||
0.90 | 0.900 | 0.0004 | 0.95 | 1.00 | 1.023 | 0.0672 | 0.98 | ||
0.90 | 0.902 | 0.0004 | 0.95 | −1.00 | −1.032 | 0.0488 | 0.96 | ||
0.90 | 0.900 | 0.0003 | 0.93 | 0.10 | 0.099 | 0.0004 | 0.97 | ||
0.90 | 0.900 | 0.0003 | 0.93 | 0.10 | 0.101 | 0.0004 | 0.95 | ||
0.90 | 0.899 | 0.0004 | 0.94 | 0.10 | 0.099 | 0.0004 | 0.93 | ||
0.10 | 0.100 | 0.0003 | 0.97 | 0.10 | 0.101 | 0.0004 | 0.93 | ||
0.10 | 0.098 | 0.0003 | 0.94 | 0.90 | 0.898 | 0.0003 | 0.91 | ||
0.10 | 0.099 | 0.0004 | 0.95 | 0.90 | 0.899 | 0.0004 | 0.95 | ||
0.10 | 0.101 | 0.0003 | 0.95 | 0.90 | 0.899 | 0.0004 | 0.94 | ||
0.90 | 0.902 | 0.0003 | 0.94 |
EST = estimates; MSE = mean square error; CP = coverage probability.
Average EST, MSE, and CP of 95% confidence intervals for parameter estimates (mixed measurement)
Parameter | True | EST | MSE | CP | Parameter | True | EST | MSE | CP |
---|---|---|---|---|---|---|---|---|---|
0.10 | 0.097 | 0.0011 | 0.91 | 0.50 | 0.513 | 0.0133 | 0.97 | ||
0.10 | 0.098 | 0.0011 | 0.95 | 0.70 | 0.700 | 0.0074 | 0.96 | ||
0.30 | 0.299 | 0.0009 | 0.98 | 0.20 | 0.187 | 0.0066 | 0.97 | ||
0.30 | 0.298 | 0.0011 | 0.95 | 0.30 | 0.298 | 0.0076 | 0.98 | ||
0.90 | 0.901 | 0.0008 | 0.96 | 0.80 | 0.813 | 0.0074 | 0.96 | ||
0.90 | 0.902 | 0.0006 | 0.97 | −1.00 | −1.242 | 0.7771 | 0.98 | ||
0.70 | 0.700 | 0.0008 | 0.98 | 1.00 | 1.212 | 0.7228 | 0.98 | ||
0.70 | 0.693 | 0.0008 | 0.95 | 1.00 | 1.331 | 0.6397 | 0.98 | ||
0.90 | 0.899 | 0.0005 | 0.98 | −1.00 | −1.349 | 0.6534 | 0.99 | ||
0.90 | 0.898 | 0.0006 | 0.97 | 0.10 | 0.098 | 0.0015 | 0.92 | ||
0.70 | 0.709 | 0.0006 | 0.98 | 0.10 | 0.087 | 0.0016 | 0.95 | ||
0.70 | 0.702 | 0.0009 | 0.93 | 0.30 | 0.310 | 0.0007 | 0.97 | ||
0.10 | 0.103 | 0.0007 | 0.95 | 0.30 | 0.302 | 0.0009 | 0.95 | ||
0.10 | 0.096 | 0.0008 | 0.98 | 0.70 | 0.691 | 0.0017 | 0.93 | ||
0.30 | 0.299 | 0.0009 | 0.98 | 0.70 | 0.701 | 0.0007 | 0.93 | ||
0.30 | 0.300 | 0.0015 | 0.96 | 0.10 | 0.103 | 0.0009 | 0.95 | ||
0.10 | 0.102 | 0.0017 | 0.96 |
EST = estimates; MSE = mean square error; CP = coverage probability.
Percentages of responding ‘yes’ to the manifest items for each latent variable and their missing rates
Latent variable | Manifest item (questionnaire items on “Have you...?”) | Yes (%) | Missing (%) |
---|---|---|---|
Violent behavior | Carried a weapon during recent 30 days | 24.26 | 8.21 |
Violent behavior | Absent from school due to feeling unsafe recent 12 months | 5.14 | 0.32 |
Violent behavior | Threatened by weapon at school recent 12 months | 7.10 | 3.87 |
Violent behavior | Involved in a physical fight recent 12 months | 22.47 | 13.71 |
Violent behavior | Seriously injured in a physical fight recent 12 months | 3.31 | 12.16 |
Cigarette smoking | Ever tried cigarette smoking | 36.35 | 9.96 |
Cigarette smoking | Smoked before age 13 years for the first time | 8.73 | 5.85 |
Cigarette smoking | Smoked cigarettes during the recent 30 days | 13.33 | 5.18 |
Cigarette smoking | Smoked more than 10 cigarettes per day | 1.45 | 5.38 |
Alcohol consumption | Ever drank alcohol | 64.95 | 3.18 |
Alcohol consumption | Drank alcohol before age 13 years for the first time | 19.42 | 2.78 |
Alcohol consumption | Drank alcohol during the recent 30 days | 33.10 | 16.76 |
Alcohol consumption | Drank five or more drinks of alcohol in a row | 21.44 | 5.26 |
Other illegal drug use | Ever tried MJ | 45.87 | 4.17 |
Other illegal drug use | Tried MJ before age 13 years for the first time | 10.65 | 2.54 |
Other illegal drug use | Used MJ during the recent 30 days | 26.62 | 10.53 |
Other illegal drug use | Ever tried other illegal drugs | 15.29 | 5.41 |
Other illegal drug use | Ever offered, sold, or given an illegal drug on school property | 25.47 | 3.77 |
MJ = marijuana.
Goodness-of-fit measures for a series of latent class analysis models with the different number of classes for four different class variables
Latent variable | Number of classes | AIC | BIC | Bootstrap |
---|---|---|---|---|
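Violent behavior | 2 | 14779.8 | 14851.4 | 0.00 |
Violent behavior | 3 | 14687.0 | 14797.6 | 0.00 |
Violent behavior | 4 | 14652.0 | 14801.7 | 0.34 |
Violent behavior | 5 | 14663.7 | 14852.4 | 0.49 |
Cigarette smoking | 2 | 11195.1 | 11253.6 | 0.00 |
Cigarette smoking | 3 | 11058.8 | 11150.0 | 0.34 |
Alcohol consumption | 2 | 16826.8 | 16885.4 | 0.00 |
Alcohol consumption | 3 | 16607.9 | 16699.0 | 0.41 |
Other illegal drug use | 2 | 20381.3 | 20452.9 | 0.00 |
Other illegal drug use | 3 | 20275.3 | 20386.0 | 0.28 |
Other illegal drug use | 4 | 20279.7 | 20429.4 | 0.45 |
Other illegal drug use | 5 | 20291.2 | 20480.0 | 0.68 |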
AIC = Akaike information criterion; BIC = Bayesian information criterion.
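The relation between the AIC and BIC columns can be checked with a few lines of arithmetic. The Python sketch below assumes an unrestricted LCA with binary items, for which the number of free parameters is (classes × items) + (classes − 1), and assumes that the BIC uses the sample size n = 4,957. The function names are hypothetical. Under these assumptions, BIC − AIC = p(ln n − 2), which is compared with the values reported for the first (five-item) block of the table above.

```python
import math

def lca_n_params(n_classes, n_items):
    """Free parameters of an unrestricted LCA with binary items:
    one response probability per class and item, plus the prevalences."""
    return n_classes * n_items + (n_classes - 1)

def bic_minus_aic(n_params, n_obs):
    """BIC - AIC = p * ln(n) - 2p, since both share the term -2 log L."""
    return n_params * (math.log(n_obs) - 2.0)

n_obs = 4957  # assumed sample size used in the BIC

# (AIC, BIC) reported for the first block of the table, by number of classes
reported = {
    2: (14779.8, 14851.4),
    3: (14687.0, 14797.6),
    4: (14652.0, 14801.7),
    5: (14663.7, 14852.4),
}

# Implied gap for a five-item latent variable, next to the reported gap
implied_gap = {c: bic_minus_aic(lca_n_params(c, 5), n_obs) for c in reported}
reported_gap = {c: bic - aic for c, (aic, bic) in reported.items()}
```

The implied and reported gaps agree to within the rounding of the table, which suggests that the penalty terms follow the usual definitions. The AIC favors the 4-class model in this block while the BIC favors the 3-class model, because the BIC penalty per parameter, ln(4957) ≈ 8.5, is much larger than 2.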
Goodness-of-fit measures for a series of LCA-MLG models with the different number of joint classes of drug-using behavior (i.e.,
Number of classes for | Number of joint classes for drug-using behavior | AIC | BIC | Bootstrap |
---|---|---|---|---|
3 | 2 | 78725.8 | 79187.9 | 0.01 |
3 | 3 | 78590.1 | 79110.8 | 0.06 |
3 | 4 | 78555.8 | 79135.0 | 0.24 |
3 | 5 | 78547.7 | 79185.5 | 0.22 |
3 | 6 | 78611.0 | 79307.4 | 0.33 |
4 | 2 | 78661.5 | 79169.2 | 0.04 |
4 | 3 | 77987.4 | 78560.1 | 0.16 |
4 | 4 | 77911.2 | 78614.1 | 0.16 |
4 | 5 | 77979.7 | 78617.5 | 0.32 |
4 | 6 | 163880.7 | 164648.7 | 0.34 |
LCA-MLG = latent class analysis with multiple latent groups; AIC = Akaike information criterion; BIC = Bayesian information criterion.
The estimated probabilities of responding ‘yes’ to the manifest items and class prevalences for each of the latent variables
Manifest item | Latent class for | ||
---|---|---|---|
Non-smoker | Lifetime smoker | Early onset current smoker | |
0.047 | 1.000* | 1.000* | |
0.000* | 0.158 | 0.535 | |
0.000* | 0.219 | 1.000* | |
0.000* | 0.000* | 0.200 | |
Class prevalence | 0.614 | 0.298 | 0.088 |
Manifest item | Latent class for | ||
Non-drinker | Lifetime drinker | Binge drinker | |
0.160 | 1.000* | 1.000* | |
0.036 | 0.233 | 0.382 | |
0.000* | 0.306 | 1.000* | |
0.000* | 0.000* | 0.886 | |
Class prevalence | 0.378 | 0.352 | 0.270 |
Manifest item | Latent class for | ||
Non-user | Current marijuana user | Early and multiple drug user | |
0.038 | 1.000* | 1.000* | |
0.000* | 0.124 | 0.497 | |
0.000* | 0.497 | 0.835 | |
0.032 | 0.199 | 0.803 | |
0.165 | 0.306 | 0.577 | |
Class prevalence | 0.539 | 0.314 | 0.147 |
MJ = marijuana.
The estimated probabilities of belonging to a latent class for a given joint class membership and the prevalence of joint classes
Latent class | Joint latent class for drug-using behavior | |||
---|---|---|---|---|
Non drug user | Current drug user | Multiple drug user | ||
Non-smoker | 0.988 | 0.401 | 0.079 | |
Lifetime smoker | 0.005 | 0.598 | 0.341 | |
Early onset current smoker | 0.007 | 0.001 | 0.580 | |
Non-drinker | 0.760 | 0.089 | 0.034 | |
Lifetime drinker | 0.220 | 0.597 | 0.059 | |
Binge drinker | 0.020 | 0.314 | 0.907 | |
Non-user | 0.964 | 0.259 | 0.038 | |
Current marijuana user | 0.036 | 0.720 | 0.009 | |
Early and multiple drug user | 0.000* | 0.021 | 0.953 | |
Joint class prevalence | 0.443 | 0.412 | 0.145 |
^{*}The probabilities are constrained to be zero or one.
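The marginal class prevalence of each drug-using behavior follows from the table above by weighting the conditional probabilities with the joint class prevalences (the law of total probability). The Python sketch below copies the reported estimates. The variable and function names are hypothetical, and the comparison assumes that the class prevalences reported with the primary measurement parameters are these marginals.

```python
# Joint class prevalences: non drug user, current drug user, multiple drug user
joint_prev = [0.443, 0.412, 0.145]

# P(latent class | joint class); rows are classes, columns are joint classes
conditional = {
    "cigarette": [[0.988, 0.401, 0.079],   # non-smoker
                  [0.005, 0.598, 0.341],   # lifetime smoker
                  [0.007, 0.001, 0.580]],  # early onset current smoker
    "alcohol":   [[0.760, 0.089, 0.034],   # non-drinker
                  [0.220, 0.597, 0.059],   # lifetime drinker
                  [0.020, 0.314, 0.907]],  # binge drinker
    "other":     [[0.964, 0.259, 0.038],   # non-user
                  [0.036, 0.720, 0.009],   # current marijuana user
                  [0.000, 0.021, 0.953]],  # early and multiple drug user
}

def marginal_prevalence(rows, weights):
    """Sum over joint classes of P(class | joint class) * P(joint class)."""
    return [sum(p * w for p, w in zip(row, weights)) for row in rows]

marginals = {name: marginal_prevalence(rows, joint_prev)
             for name, rows in conditional.items()}

# Class prevalences reported with the primary measurement parameters
reported = {
    "cigarette": [0.614, 0.298, 0.088],
    "alcohol":   [0.378, 0.352, 0.270],
    "other":     [0.539, 0.314, 0.147],
}
```

Each computed marginal matches the reported class prevalence to three decimals, so the two tables are mutually consistent.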
The estimated probabilities of responding ‘yes’ to the manifest items and class prevalences for each of latent variables
Manifest item | Latent class for | ||
---|---|---|---|
Not violent | Weapon carry and fight | Seriously violent | |
0.177 | 0.501 | 0.797 | |
0.018 | 0.066 | 0.494 | |
0.021 | 0.113 | 0.724 | |
0.057 | 1.000* | 0.827 | |
0.000* | 0.117 | 0.364 | |
Class prevalence | 0.762 | 0.185 | 0.053 |
^{*}The probabilities are constrained to be zero or one.
The estimated odds ratios of age for
Joint latent class | Latent class for | |
---|---|---|
Weapon carry and fight | Seriously violent | |
Non drug user | 0.723 [0.472, 1.106] | NA* |
Current drug user | 0.690 [0.558, 0.852] | 0.815 [0.325, 2.041] |
Multiple drug user | 0.892 [0.640, 1.241] | 0.857 [0.526, 1.396] |
^{*}The probability of belonging to ‘seriously violent’ for ‘non drug user’ group is constrained to be zero.
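The intervals in this table are consistent with Wald intervals built on the log-odds scale, which makes it possible to recover the approximate coefficient and its standard error from a reported interval. The Python sketch below does this for the ‘current drug user’ row of ‘weapon carry and fight’. It assumes a 95% interval of the form exp(b ± 1.96 SE), and the function name is hypothetical.

```python
import math

def wald_from_ci(lower, upper, z=1.96):
    """Recover the log odds ratio and its standard error from a CI that is
    assumed to be symmetric on the log scale."""
    log_or = (math.log(lower) + math.log(upper)) / 2.0
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return log_or, se

# 'Current drug user', 'weapon carry and fight': 0.690 [0.558, 0.852]
log_or, se = wald_from_ci(0.558, 0.852)
odds_ratio = math.exp(log_or)   # should reproduce the reported 0.690
z_stat = log_or / se            # Wald statistic for the age coefficient

# 'Non drug user', 'weapon carry and fight': 0.723 [0.472, 1.106]
log_or_non, se_non = wald_from_ci(0.472, 1.106)
```

The recovered odds ratio agrees with the reported point estimate. Among the intervals reported in the table, only the one for this row excludes 1.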
The estimated prevalence of
Group variable | Latent class for | ||
---|---|---|---|
Not violent | Weapon carry and fight | Seriously violent | |
Non drug user | 0.943 | 0.057 | 0.000* |
Current drug user | 0.742 | 0.231 | 0.027 |
Multiple drug user | 0.267 | 0.450 | 0.283 |
^{*}The probability of belonging to ‘seriously violent’ for ‘non drug user’ group is constrained to be zero.