Cardiovascular disease (CVD) is the leading cause of death worldwide and has a high mortality rate after onset; CVD management therefore requires both the development of treatment plans and the prediction of prevalence rates. In our study, age, income, education level, marriage status, diabetes, and obesity were identified as risk factors for CVD. Using these 6 factors, we proposed a nomogram for CVD based on a naïve Bayesian classifier model. The attributes of each factor were assigned point values between −100 and 100 using Bayes’ theorem, so that the sign of a point value shows whether the attribute lowers or raises the likelihood of CVD. Additionally, the prevalence rate can be calculated even when some attribute values are missing. The nomogram was verified with a receiver operating characteristic (ROC) curve and a calibration plot. Consequently, when the attribute values for these risk factors are known, the prevalence rate for CVD can be predicted using the proposed nomogram based on a naïve Bayesian classifier model.
Cardiovascular disease (CVD) occurs in the heart and major arteries. It is caused by narrowing or clogging of blood vessels, and there are few early symptoms. However, it has a high mortality rate after onset. In Europe, Asia, and the Americas, CVD was the leading cause of death in 2013 (Wilson
Studies on the risk factors of CVD have progressed steadily. One of the major risk factors for CVD is obesity. Nowadays, the number of people who prefer fast food is increasing because of their busy daily lives, and the rate of obesity is gradually increasing as a result. Many studies emphasize that obesity is highly related to CVD (Poirier
Once risk factors have been selected for a disease, a nomogram, which is a visual tool, can be constructed to predict the prevalence rate through simple calculation. A nomogram consists of a point line, straight lines for each risk factor, a total point line, and a probability line (Iasonos
We used 2013–2015 data from KNHANES to build a nomogram applying a naïve Bayesian classifier model. The data was separated into a training set (
Posterior probability,
In naïve Bayes, the predicted prevalence rate can be calculated by conditional probability. When
In
According to
Odds of
When we take the log of both sides of
Therefore,
Here,
Therefore,
Modern medicine is advancing; however, it is still important to find and treat CVD early. The nomogram plot is a visualization of the naïve Bayesian classifier model, and it can be used to predict prevalence from the status of several factors. It consists of a point line, straight lines for each risk factor, a total point line, and a probability line. Now, the attribute value is specified as
The point value corresponding to each attribute can be obtained by the following equation.
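A form consistent with the likelihood ratios and point values reported in Table 2 is sketched here. The notation is assumed for illustration ($x_{ij}$ denotes attribute $j$ of risk factor $X_i$) and is not necessarily the authors' own:

$$
\mathrm{LR}(x_{ij}) = \frac{P(X_i = x_{ij} \mid \mathrm{CVD})}{P(X_i = x_{ij} \mid \text{no CVD})},
\qquad
\mathrm{Point}(x_{ij}) = 100 \times \frac{\ln \mathrm{LR}(x_{ij})}{\max_{k,l} \left| \ln \mathrm{LR}(x_{kl}) \right|}
$$

For example, the largest absolute log likelihood ratio in Table 2 belongs to age 19–39 (LR = 0.0813), so diabetes: yes (LR = 8.5254) receives $100 \times \ln 8.5254 / \left|\ln 0.0813\right| \approx 85$ points.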
After all LR(
Through (
The total point values are the sum of corresponding
In
In
We now apply the nomogram based on the naïve Bayesian classifier model to CVD data. Data from KNHANES VI (2013 to 2015) were used, and the total sample size was 7,856. The prevalence rate of CVD was 27.5%; by age group, it was 3.1% from 19 to 39 years old, 23.3% from 40 to 59 years old, and 60.2% from 60 to 80 years old (Table 1). Thus, it increases rapidly with age. Statistical analysis was conducted using Statistical Package for the Social Sciences (SPSS) Version 23.0. Table 1 shows the results of the Chi-square test for the training data set. All variables, including sex, age, income, education level, marriage status, diabetes, renal failure, depression, rheumatoid arthritis, smoking status, alcohol status, stress, obesity, and starvation, were significant at the 5% significance level, and we chose 6 factors (age, income, education, marriage, diabetes, and obesity) as risk factors for CVD according to
In Table 2, each likelihood ratio is obtained through
The CVD nomogram plot was drawn using the point values for each attribute obtained from Table 2 (Figure 1). The nomogram is composed of a point line, 6 predictor lines, a total point line, and a probability line, and it graphically represents the numerical relationships between CVD and the 6 risk factors (age, income, education level, marriage status, diabetes, and obesity status). The nomogram plot is obtained from the naïve Bayesian classifier model, and each patient receives a point value for each factor. We can then get the predicted prevalence rate by adding all of those point values, locating the sum on the total point line, and reading the corresponding prevalence rate from the probability line. Even if the attribute value for a risk factor is unknown, we can still calculate the prevalence rate by assigning 0 points to that factor. As listed in Table 2, the assigned point values are: for age, 19–39 is −100 points, 40–59 is −11 points, and 60–80 is 53 points; for income, < 25% is 38 points, 25–50% is 2 points, 50–75% is −14 points, and ≥ 75% is −17 points; for education level, elementary school is 49 points, middle school is 23 points, high school is −16 points, and college is −35 points; for marriage status, married is 9 points and single is −90 points; for diabetes, yes is 85 points and no is −9 points; for obesity status, lower weight is −66 points, normal is −8 points, and obesity is 20 points. Within each factor, the higher the point value of an attribute, the more strongly it is associated with CVD. Age has the longest line, so it is the most influential factor for the prevalence of CVD. Diabetes (yes) is assigned the largest positive point value. For example, a patient (age: 68 years old, income: 25–50%, education level: high school, marriage status: married, diabetes: yes, obesity status: obesity) has a total of about 153 points, and the corresponding prevalence rate of CVD is about 95%. In this case, the risk of having CVD is high, so a treatment plan for CVD needs to be established for the patient.
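The reading of the nomogram for this patient can be sketched in a few lines of Python. The point values are copied from Table 2; the conversion from total points to a probability is an assumption made for illustration (100 points are taken to equal |ln 0.0813|, the log likelihood ratio of the attribute that received −100 points, and the prior is taken to be the overall prevalence of 27.5% quoted in the text). The function and variable names are hypothetical.

```python
import math

# Point values copied from Table 2.
POINTS = {
    "age": {"19-39": -100, "40-59": -11, "60-80": 53},
    "income": {"<25%": 38, "25-50%": 2, "50-75%": -14, ">75%": -17},
    "education": {"elementary": 49, "middle": 23, "high": -16, "college": -35},
    "marriage": {"married": 9, "single": -90},
    "diabetes": {"yes": 85, "no": -9},
    "obesity": {"lower weight": -66, "normal": -8, "obesity": 20},
}

# Assumption: 100 points correspond to |ln 0.0813|, the log likelihood ratio
# of the attribute that received -100 points (age 19-39).
LOG_LR_PER_100_POINTS = abs(math.log(0.0813))

# Assumption: the prior is the overall CVD prevalence quoted in the text.
PRIOR_PREVALENCE = 0.275


def total_points(patient):
    """Sum the point values of the patient's attributes."""
    return sum(POINTS[factor][attribute] for factor, attribute in patient.items())


def prevalence_from_points(total, prior=PRIOR_PREVALENCE):
    """Map total points to a probability through the posterior log-odds."""
    log_odds = math.log(prior / (1 - prior)) + total / 100 * LOG_LR_PER_100_POINTS
    return 1 / (1 + math.exp(-log_odds))


patient = {
    "age": "60-80",          # 68 years old
    "income": "25-50%",
    "education": "high",
    "marriage": "married",
    "diabetes": "yes",
    "obesity": "obesity",
}
total = total_points(patient)
rate = prevalence_from_points(total)
print(f"total points = {total}, predicted prevalence = {rate:.1%}")
```

Under these assumptions the total is 153 points and the predicted rate is close to the 95% read from the nomogram.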
To verify the discrimination of the nomogram, ROC curves were drawn (Figure 2). The left panel of Figure 2 is the ROC curve for the training set, and the area under the ROC curve (AUC) is 0.836 (
Figure 3 is a calibration plot, presented to verify the nomogram obtained for CVD. Subjects were divided into 59 groups of similar predicted probability from the nomogram, and the average predicted probability and the observed probability of each group were compared. The dotted line represents the straight line of
A nomogram is a graphical tool that anyone can use to predict a prevalence rate, because the prediction is obtained simply by adding points. Nomograms have been developed for many diseases studied in preventive medicine or diagnostic medicine (Ahn
The nomogram for CVD can also be built from a logistic regression model. Figure 4 is the nomogram obtained by applying the logistic regression model to the selected 6 factors. Even though the lines for diabetes and obesity are long, these factors should be considered in relation to the remaining social factors, since we only considered main effects in the logistic regression model. Moreover, when an attribute value is missing, the prevalence rate cannot be calculated using logistic regression because there is no value to substitute. In contrast, the naïve Bayesian nomogram can still be used by assigning 0 points to the missing factor, and it does not require considering overfitting or shrinkage when estimating regression coefficients. The prevalence rate is calculated as a posterior probability under the assumption of independence between explanatory variables. However, the naïve Bayesian nomogram is disadvantageous in that, unlike the logistic nomogram, it cannot identify the influence of each factor independently of the others (Hosmer and Lemeshow, 2000).
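The handling of missing values can be illustrated with a minimal Python sketch of the naïve Bayes calculation, written directly in terms of likelihood ratios: posterior odds are the prior odds multiplied by the likelihood ratio of each known attribute, and an unknown attribute is skipped, which is equivalent to a likelihood ratio of 1 (0 points). The likelihood ratios are copied from Table 2; the prior of 27.5% is the overall prevalence quoted in the text and its use as the prior is an assumption, as are the function and variable names.

```python
import math

# Likelihood ratios copied from Table 2.
LIKELIHOOD_RATIOS = {
    "age": {"19-39": 0.0813, "40-59": 0.7640, "60-80": 3.8114},
    "income": {"<25%": 2.6160, "25-50%": 1.0536, "50-75%": 0.6980, ">75%": 0.6599},
    "education": {"elementary": 3.4199, "middle": 1.7995, "high": 0.6770, "college": 0.4104},
    "marriage": {"married": 1.2448, "single": 0.1056},
    "diabetes": {"yes": 8.5254, "no": 0.8029},
    "obesity": {"lower weight": 0.1913, "normal": 0.8102, "obesity": 1.6479},
}

# Assumption: the prior is the overall CVD prevalence quoted in the text.
PRIOR_PREVALENCE = 0.275


def predicted_prevalence(patient, prior=PRIOR_PREVALENCE):
    """Posterior probability of CVD under the independence assumption.

    Attributes set to None are treated as unknown and contribute nothing
    to the log-odds (likelihood ratio 1, i.e. 0 points on the nomogram).
    """
    log_odds = math.log(prior / (1 - prior))
    for factor, attribute in patient.items():
        if attribute is None:
            continue  # missing value: no evidence from this factor
        log_odds += math.log(LIKELIHOOD_RATIOS[factor][attribute])
    return 1 / (1 + math.exp(-log_odds))


complete = {
    "age": "60-80",
    "income": "25-50%",
    "education": "high",
    "marriage": "married",
    "diabetes": "yes",
    "obesity": "obesity",
}
# Same patient, but with diabetes status unknown.
partial = dict(complete, diabetes=None)

p_complete = predicted_prevalence(complete)
p_partial = predicted_prevalence(partial)
print(f"all factors known: {p_complete:.1%}")
print(f"diabetes unknown:  {p_partial:.1%}")
```

A logistic regression model has no analogous neutral value for an unobserved predictor, which is the contrast drawn in the paragraph above.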
We have proposed a naïve Bayesian nomogram for CVD using three years of KNHANES data, which are representative of Korea. It will be useful in medical fields because dependencies are taken into consideration through conditional probability and the nomogram can be used to determine the CVD prevalence rate from attribute values. Given the efficiency of the naïve Bayesian nomogram, we expect it to be built and used in practice in various fields for early detection and treatment.
Table 1. Characteristics of the study population

| Characteristic | Attribute | CVD: yes, n (%) | CVD: no, n (%) | χ² (df) | p-value |
|---|---|---|---|---|---|
| Sex | male | 1114 (30.6) | 2529 (69.4) | 14.9 (1) | 0.0001 |
| | female | 1221 (26.7) | 3349 (73.3) | | |
| Age | 19–39 | 78 (3.1) | 2414 (96.9) | 2069.8 (2) | < 0.0001 |
| | 40–59 | 749 (23.3) | 2468 (76.7) | | |
| | 60–80 | 1508 (60.2) | 996 (39.8) | | |
| Income | <25% | 716 (51.0) | 689 (49.0) | 472.2 (3) | < 0.0001 |
| | 25–50% | 619 (29.5) | 1479 (70.5) | | |
| | 50–75% | 503 (21.7) | 1814 (78.3) | | |
| | >75% | 497 (20.8) | 1896 (79.2) | | |
| Education level | elementary school | 970 (57.6) | 714 (42.4) | 1136.9 (3) | < 0.0001 |
| | middle school | 366 (41.7) | 512 (58.3) | | |
| | high school | 611 (21.2) | 2272 (78.8) | | |
| | college | 388 (14.0) | 2380 (86.0) | | |
| Marriage status | married | 2282 (33.1) | 4615 (66.9) | 458.6 (1) | < 0.0001 |
| | single | 53 (4.0) | 1263 (96.0) | | |
| Diabetes | yes | 508 (77.2) | 150 (22.8) | 836.3 (1) | < 0.0001 |
| | no | 1827 (24.2) | 5728 (75.8) | | |
| Renal failure | yes | 26 (72.2) | 10 (27.8) | 34.1 (1) | < 0.0001 |
| | no | 2309 (28.2) | 5868 (71.8) | | |
| Depression | yes | 161 (42.5) | 218 (57.0) | 38.5 (1) | < 0.0001 |
| | no | 2174 (27.8) | 5660 (72.2) | | |
| Rheumatoid arthritis | yes | 67 (48.9) | 70 (51.1) | 28.7 (1) | < 0.0001 |
| | no | 2268 (28.1) | 5808 (71.9) | | |
| Smoking status | current smoker | 355 (22.5) | 1223 (77.5) | 97.4 (2) | < 0.0001 |
| | past smoker | 670 (37.1) | 1138 (62.9) | | |
| | non-smoker | 1310 (27.1) | 3517 (72.9) | | |
| Alcohol status | none | 647 (40.4) | 956 (59.6) | 178.1 (3) | < 0.0001 |
| | <2/week | 1115 (23.8) | 3571 (76.2) | | |
| | 2–3/week | 356 (27.1) | 958 (72.9) | | |
| | >4/week | 217 (35.6) | 393 (64.4) | | |
| Stress | yes | 486 (24.5) | 1500 (75.5) | 20.2 (1) | < 0.0001 |
| | no | 1849 (29.7) | 4378 (70.3) | | |
| Obesity status | lower weight | 25 (7.1) | 329 (92.9) | 281.2 (2) | < 0.0001 |
| | normal | 1279 (24.3) | 3974 (75.7) | | |
| | obesity | 1031 (39.6) | 1575 (60.4) | | |
| Starvation | yes | 520 (20.1) | 2062 (79.9) | 127.2 (1) | < 0.0001 |
| | no | 1815 (32.2) | 3816 (67.8) | | |
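
The test statistics in this table can be checked by hand. The following is a minimal Python sketch of the Pearson Chi-square statistic (without continuity correction) applied to the sex and age rows; the function name is hypothetical, and whether SPSS was run with exactly these settings is an assumption. The p-value is not computed here because it requires the Chi-square distribution function.

```python
def pearson_chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts; no continuity
    correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Observed counts (CVD yes, CVD no) taken from the sex and age rows.
sex = [[1114, 2529], [1221, 3349]]
age = [[78, 2414], [749, 2468], [1508, 996]]

sex_stat, sex_df = pearson_chi_square(sex)
age_stat, age_df = pearson_chi_square(age)
print(f"sex: chi-square = {sex_stat:.1f}, df = {sex_df}")
print(f"age: chi-square = {age_stat:.1f}, df = {age_df}")
```

Both results agree with the tabulated values of 14.9 (1) and 2069.8 (2) to the reported precision.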
Table 2. Log likelihood ratio and point values for risk factors

| Risk factor | Attribute value | Likelihood ratio | Point value |
|---|---|---|---|
| Age | 19–39 | 0.0813 | −100 |
| | 40–59 | 0.7640 | −11 |
| | 60–80 | 3.8114 | 53 |
| Income | < 25% | 2.6160 | 38 |
| | 25–50% | 1.0536 | 2 |
| | 50–75% | 0.6980 | −14 |
| | > 75% | 0.6599 | −17 |
| Education level | elementary school | 3.4199 | 49 |
| | middle school | 1.7995 | 23 |
| | high school | 0.6770 | −16 |
| | college | 0.4104 | −35 |
| Marriage status | married | 1.2448 | 9 |
| | single | 0.1056 | −90 |
| Diabetes | yes | 8.5254 | 85 |
| | no | 0.8029 | −9 |
| Obesity status | lower weight | 0.1913 | −66 |
| | normal | 0.8102 | −8 |
| | obesity | 1.6479 | 20 |
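
The two numeric columns of this table can be reproduced with a short Python sketch. It assumes that a likelihood ratio is the share of CVD cases with an attribute divided by the share of non-cases with it (using the counts in Table 1), and that point values are log likelihood ratios rescaled so that the largest absolute value maps to 100. Both rules are inferred from the tabulated numbers rather than quoted from the text, and the function names are hypothetical.

```python
import math

# Likelihood ratios copied from Table 2.
LIKELIHOOD_RATIOS = {
    "age": {"19-39": 0.0813, "40-59": 0.7640, "60-80": 3.8114},
    "income": {"<25%": 2.6160, "25-50%": 1.0536, "50-75%": 0.6980, ">75%": 0.6599},
    "education": {"elementary": 3.4199, "middle": 1.7995, "high": 0.6770, "college": 0.4104},
    "marriage": {"married": 1.2448, "single": 0.1056},
    "diabetes": {"yes": 8.5254, "no": 0.8029},
    "obesity": {"lower weight": 0.1913, "normal": 0.8102, "obesity": 1.6479},
}


def likelihood_ratio(n_attr_cvd, n_cvd, n_attr_no_cvd, n_no_cvd):
    """P(attribute | CVD) / P(attribute | no CVD) from raw counts."""
    return (n_attr_cvd / n_cvd) / (n_attr_no_cvd / n_no_cvd)


def point_values(likelihood_ratios):
    """Rescale log likelihood ratios so the largest |log LR| is 100 points."""
    logs = {
        factor: {attr: math.log(lr) for attr, lr in attrs.items()}
        for factor, attrs in likelihood_ratios.items()
    }
    scale = max(abs(v) for attrs in logs.values() for v in attrs.values())
    return {
        factor: {attr: round(100 * v / scale) for attr, v in attrs.items()}
        for factor, attrs in logs.items()
    }


# Age 19-39 in Table 1: 78 of 2335 CVD cases, 2414 of 5878 non-cases.
lr_young = likelihood_ratio(78, 2335, 2414, 5878)
points = point_values(LIKELIHOOD_RATIOS)
print(f"LR(age 19-39) = {lr_young:.4f}")
for factor, attrs in points.items():
    print(factor, attrs)
```

With these two rules, the sketch returns the same integer point values as the table for all 18 attributes.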