The genes used for identification, including paternity identification and racial/ethnic identification, are parts of an entire genetic sequence. Such genes have a partly repeated nucleotide sequence or include a genetic mutation with a single or more base pair difference in the genetic sequence. Usually, these nucleotide sequences used for gene identification are called markers. By comparing and analyzing different DNA polymorphisms, individual identification, familial relationships, and racial/ethnic identification can be performed (Butler and Hill, 2012).
Research has sought to develop DNA markers that can accurately distinguish an individual from among numerous others, and this work continues to this day. Among the types of DNA markers explored, nucleotide sequences that repeat at a specific position differ in their number of repetitions between individuals (Schneider, 2007). As with other genes, these repeats are passed down from parent to offspring following Mendelian inheritance rules. A short tandem repeat (STR) is a unit, normally 2–5 base pairs long, that repeats multiple times at a specific point on a chromosome (Evett and Weir, 1998). Each individual carries two alleles at an STR marker, each expressed as a number of repeats. When the two alleles have the same number of repeats, the genotype is called homozygous; when the numbers differ, it is called heterozygous (Evett and Weir, 1998).
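As a minimal illustration of the terms above, an STR profile can be represented as a mapping from marker names to pairs of allele repeat numbers; a locus is homozygous when the two repeat numbers match. The genotypes below are hypothetical examples, not data from this study.

```python
# Minimal sketch (hypothetical data): an STR genotype at one locus is a pair
# of allele repeat numbers; it is homozygous when both alleles match.

def zygosity(genotype):
    """Return 'homozygous' or 'heterozygous' for one STR locus."""
    a, b = genotype
    return "homozygous" if a == b else "heterozygous"

# Hypothetical genotypes at the TH01 locus.
print(zygosity((9, 9)))    # both alleles have 9 repeats -> homozygous
print(zygosity((6, 9.3)))  # different repeat numbers    -> heterozygous
```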
These unique markers can identify individuals in the same way that fingerprints can, and criminals are being identified through this type of forensic investigation at increasing rates. Genetic information can be extracted from evidence such as bloodstains or cigarette butts, making it possible to identify individuals and possible criminal suspects. Beyond individual identification, these markers can be used to establish familial relationships, as they are inherited half from each parent, which also makes it possible to identify siblings (Butler and Hill, 2012). It is now possible to trace an offender using only a tiny bloodstain at the scene of a crime, to check parent-child relationships, and even to identify the ethnicity of a criminal (Bieber et al.).
With such a database, individual suspects can be identified; moreover, the familial searching technique can be used, in which the blood relatives of candidate individuals are investigated when no DNA profile that matches the genetic evidence can be found (Myers et al.).
Familial searching has been used in the investigation of various crimes and has produced successful apprehensions of suspects (Schneider, 2007). Research to extend its use to a broader range of investigations is ongoing in many countries. However, it is difficult to apply universally, due to the limited range of identifiable relationships and because the standard criteria for quantification are ambiguous (Lee et al.).
In this study, an analysis of familial data from 896 people (118 households) was conducted under the auspices of the Department of Forensic Medicine at Seoul National University and the National Bank of Korea (NBK) at the Korea National Institute of Health. The data contained 790 parent-child pairs (first-degree relatives), 522 sibling pairs (second-degree relatives), 221 grandparent-grandchild pairs (second-degree relatives), 859 uncle-nephew pairs (third-degree relatives), and 468 first cousin pairs (fourth-degree relatives). In addition, 1,000 pairs were formed from individuals who were not related by blood. The experiment was conducted at the Department of Forensic Medicine at Seoul National University, and the genetic data included information on allele and genotype frequencies (Jeong et al.).
The STR markers used in the analysis were the CODIS 13 markers (D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D21S11, CSF1PO, FGA, TH01, TPOX, vWA) used by the US Federal Bureau of Investigation and 9 markers commonly used in forensic analysis (D2S1338, D19S433, PentaE, PentaD, D1S1656, D2S441, D10S1248, D12S391, D22S1045), for a total of 22 markers (Yang et al.).
The familial relationship analysis method using the total number of shared alleles (TNSA), counted identical by state (IBS), corresponds to the most fundamental estimation approach suggested by Evett and Weir (1998): the number of alleles shared between two people who are thought to be related is counted, and when this number exceeds a certain critical value, the pair is considered to be related by blood (Cowen and Thomson, 2008).
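The IBS counting behind TNSA can be sketched as follows: at each marker, count how many alleles two profiles share (0, 1, or 2, without regard to parental origin), then sum over all markers and compare the total to a cut-off. The profiles below are hypothetical two-marker examples.

```python
# Hedged sketch of the TNSA/IBS count (hypothetical profiles, not study data).

def shared_alleles(g1, g2):
    """Number of alleles shared identical-by-state at one locus (0-2)."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            shared += 1
            remaining.remove(allele)  # each allele can be matched only once
    return shared

def tnsa(profile1, profile2):
    """Total number of shared alleles over all common markers."""
    return sum(shared_alleles(profile1[m], profile2[m]) for m in profile1)

# Hypothetical two-marker profiles.
p1 = {"TH01": (6, 9), "FGA": (21, 24)}
p2 = {"TH01": (9, 9), "FGA": (20, 24)}
print(tnsa(p1, p2))  # 1 shared at TH01 + 1 shared at FGA = 2
```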
The likelihood ratio (LR) method is commonly used to calculate the ratio of the probability of observing the genotypes if the claimed familial relationship holds to the probability of observing them if the individuals are unrelated. Using this likelihood ratio, a familial relationship can be declared when the value is greater than or equal to a pre-specified critical value (Bieber et al.).
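A common way to combine per-locus likelihood ratios is to assume independence across loci, so the LRs multiply; working on the log scale keeps the product numerically stable. The per-locus LR values below are hypothetical, used only to illustrate comparison against the parent-child log-likelihood-ratio cut-off of 4.36 reported in this study; this is a conceptual sketch, not the authors' exact formulas.

```python
import math

# Conceptual sketch: per-locus LRs are assumed independent, so their log10
# values add; the pair is declared related when the total log-LR reaches a
# pre-specified critical value.

def log_likelihood_ratio(per_locus_lrs):
    """Sum of log10 per-locus likelihood ratios."""
    return sum(math.log10(lr) for lr in per_locus_lrs)

def is_related(per_locus_lrs, cutoff):
    """Declare a familial relationship when the log-LR meets the cut-off."""
    return log_likelihood_ratio(per_locus_lrs) >= cutoff

# Hypothetical per-locus LRs for a candidate parent-child pair.
lrs = [3.2, 1.8, 5.0, 2.4, 7.1, 1.2, 4.4, 2.9]
print(round(log_likelihood_ratio(lrs), 2))  # about 3.88
print(is_related(lrs, 4.36))                # below the cut-off -> False
```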
The total number of shared alleles and the likelihood ratio (TNSA & LR) can also be considered simultaneously in kinship identification. When both the number of shared alleles and the likelihood ratio reach their respective critical values, the sources of the samples are considered to be related (Jeong et al.).
Using the foregoing familial relationship analysis methods for the identification of third-degree relatives does not yield good results, because the distribution of the unrelated group overlaps that of the related group, making familial relationships difficult to distinguish. Therefore, to evaluate the identification of familial relationships of the third degree and beyond, statistical classification methods can be applied. These methods include linear discriminant analysis (LDA) (Friedman, 1988), shrinkage discriminant analysis (SDA) (Friedman, 1988), regularized discriminant analysis (RDA) (Friedman, 1988), classification and regression trees (CART) (Quinlan, 2007), C5.0 (Jansson, 2016), random forest (RF) (Breiman, 2001), extreme gradient boosting (Xgboost) (Chen and Guestrin, 2016), logistic regression (Cox, 1958), K-nearest neighbor (KNN) (Altman, 1992), Naive Bayes (Bickel and Levina, 2004), penalized multivariate analysis (PMA) (Witten et al.), support vector machine (SVM), and artificial neural network (ANN).
LDA seeks to maximize separation between classes and to reduce dimensions while maintaining the information between classes as much as possible (Friedman, 1988). That is, LDA maximizes between-class scatter and reduces within-class scatter by projecting the data from a high-dimensional space onto a lower-dimensional subspace.
CART builds binary decision trees using entropy or the Gini index as an impurity function, returning at each node the split that best separates the data by maximizing the reduction in impurity (Quinlan, 2007). C5.0 is an algorithm that uses entropy to perform multiple splits, choosing those that most reduce entropy (Jansson, 2016). To determine the optimal segmentation using entropy, split efficiency can be measured through InfoGain(F). This efficiency is calculated as the difference between the entropy S1 of the segment before the split and the entropy S2 after the split (Jansson, 2016). RF is a type of ensemble learning method that combines the classification or regression results of multiple decision trees built during training (Breiman, 2001). This method maximizes confidence, with information gain used as the measurement criterion (Breiman, 2001). Xgboost is a type of gradient tree boosting algorithm that allows multiple decision trees to be used in parallel (Chen and Guestrin, 2016). Xgboost generates optimized models that control the complexity of the trees to minimize training loss and prevent overfitting (Chen and Guestrin, 2016). Logistic regression does not require data normality or the assumption that the covariance structures of the two groups are the same; it is a statistical method used to predict the probability of an event from a linear combination of independent variables (Cox, 1958). Here, the odds represent the probability of success divided by the probability of failure, and the logistic model is obtained by performing a logit transformation of the odds (Cox, 1958).
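The entropy-based split efficiency described above can be computed directly: InfoGain is the parent-node entropy minus the weighted entropy of the child segments after a split. The labels and the split below are hypothetical (1 = related pair, 0 = unrelated pair).

```python
import math

# Sketch of entropy and InfoGain as described in the text (toy labels).

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical split of eight pairs on some feature cut-off.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]
print(round(info_gain(parent, [left, right]), 3))  # 0.189
```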
In machine learning algorithms, KNN is a type of nonparametric method that is used for classification or regression. It classifies a new observation according to the majority class among its k nearest neighbors in the training data (Altman, 1992).
The stacking model refers to an ensemble model that generates a meta-model by re-learning the predictions produced by two or more algorithms on the learning data, as shown in Figure 1 (Rho and Kim, 2009). Because each individual model is developed under the assumption of independence, the ensemble can be optimized to have the smallest misclassification rate and to be robust to outliers (Rokach, 2010). The stacking model is developed in two stages.
As described in Figure 1, in the first step, various basic models are selected and trained. The posterior probability is derived from each trained model, and a new learning dataset is generated by collecting the derived posterior probabilities (Rokach, 2010). In the second step, this new learning dataset is used as the input for training the final meta-model.
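The two steps can be sketched conceptually as follows. The base models here are hypothetical probability functions (toy stand-ins for the fitted models in this study), and the meta-model is a simple averaging rule purely for illustration; the actual study trains a proper classification model on the collected posteriors.

```python
# Conceptual two-step stacking sketch (toy stand-ins, not the paper's models):
# step 1 collects posterior probabilities from several base models into a new
# feature set; step 2 applies a meta-model to those probabilities.

def base_model_tnsa(pair):          # pseudo-posterior from shared alleles
    return min(pair["tnsa"] / 30.0, 1.0)

def base_model_lr(pair):            # pseudo-posterior from the log-LR sign
    return 1.0 if pair["log_lr"] > 0 else 0.0

BASE_MODELS = [base_model_tnsa, base_model_lr]

def step1_features(pairs):
    """New learning dataset: one row of posterior probabilities per pair."""
    return [[m(p) for m in BASE_MODELS] for p in pairs]

def step2_meta(prob_row, threshold=0.5):
    """Meta-model stand-in: average the base posteriors and threshold."""
    return sum(prob_row) / len(prob_row) >= threshold

pairs = [{"tnsa": 24, "log_lr": 5.1}, {"tnsa": 12, "log_lr": -1.3}]
features = step1_features(pairs)
print([step2_meta(row) for row in features])  # [True, False]
```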
The analysis was carried out using the statistical programs SAS 9.4 (SAS Institute, Inc., Cary, NC) and R 3.5.1; the packages ‘lda’, ‘sda’, ‘klaR’, ‘tree’, ‘C50’, ‘varSelRF’, ‘xgboost’, ‘foreign’, ‘class’, ‘pamr’, ‘e1071’, and ‘nnet’ were used.
The basic analysis of familial relationships relies on calculating the TNSA and LR over the 22 STR loci. Related pairs are then compared to unrelated pairs, and an optimal cut-off value is obtained to determine the existence of a familial relationship. Taking into account sensitivity (the probability that the test correctly identifies that people are related), specificity (the probability that the test correctly identifies that people are unrelated), and accuracy (the proportion of correctly classified pairs among all pairs analyzed), the value yielding the highest combined sensitivity and specificity was defined as the optimal cut-off. In the stacking model, a new learning dataset was generated using various data mining models as basic models (step 1), and the classification was optimized by using this dataset as input to an appropriate classification model (step 2).
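The cut-off selection described above can be sketched with toy scores: for each candidate cut-off, compute sensitivity, specificity, and accuracy, and keep the cut-off with the best combined sensitivity and specificity. The ranking criterion here (Youden's J, i.e., sensitivity + specificity) is one common choice and an assumption of this sketch, as are the example scores.

```python
# Hedged sketch of optimal cut-off selection (hypothetical TNSA scores).

def confusion_rates(scores, labels, cutoff):
    """Sensitivity, specificity, accuracy for score >= cutoff => 'related'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / len(labels)
    return sens, spec, acc

def optimal_cutoff(scores, labels):
    """Candidate cut-off maximizing sensitivity + specificity."""
    return max(set(scores),
               key=lambda c: sum(confusion_rates(scores, labels, c)[:2]))

# Hypothetical TNSA values: label 1 = related pair, 0 = unrelated pair.
scores = [25, 24, 23, 22, 20, 19, 18, 15]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(optimal_cutoff(scores, labels))  # 22 separates the toy groups perfectly
```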
Analysis was conducted on 790 parent-child pairs and 790 randomly selected unrelated pairs. To avoid overfitting, the total of 1,580 pairs was allocated as follows: train:validation:test = 6:2:2 (948 pairs:316 pairs:316 pairs). The results were obtained with 100 replicates and are presented as means ± standard deviations. The optimal cut-off TNSA value for distinguishing parent-child pairs is 23, and the cut-off value of the log-likelihood ratio is 4.36. Among the conventional analysis methods (TNSA, LR, TNSA & LR), TNSA & LR had the highest accuracy (99.40 ± 0.34%), and among the classification models (LDA, SDA, RDA, CART, C5.0, RF, Xgboost, logistic, KNN, NB, PMA, SVM, and ANN), the Xgboost model had the highest accuracy (99.85 ± 0.20%). The estimate from the Xgboost model, the most accurate classification model, was then applied to the stacking model, which yielded the highest accuracy overall, 99.86 ± 0.20% (cf. Table 1).
Analysis was based on 522 pairs of siblings and 522 randomly selected unrelated pairs. To prevent overfitting, the sample of 1,044 pairs was distributed as train:validation:test = 6:2:2 (628 pairs:208 pairs:208 pairs), and the analysis was iterated 100 times. The results are presented as means ± standard deviations. The optimal cut-off value of TNSA for distinguishing siblings is 21, and the cut-off value of the log-likelihood ratio is 0.15. Among the conventional analysis methods, TNSA & LR had the highest accuracy (99.00 ± 0.62%), and among the classification models, the RF model had the best performance (99.15 ± 0.55%). The estimate from the RF model, the most accurate classification model, was then applied to the stacking model, which yielded the highest accuracy, 99.41 ± 0.49% (cf. Table 2).
A set of 221 grandparent-grandchild pairs and 221 randomly selected unrelated pairs was analyzed. To avoid overfitting, the sample of 442 pairs was assigned as train:validation:test = 6:2:2 (266 pairs:88 pairs:88 pairs), and the analysis was replicated 100 times. The results are provided as means ± standard deviations. The optimal cut-off value of TNSA for distinguishing grandparent-grandchild pairs was 18, and the cut-off value of the log-likelihood ratio was 2.93. Among the conventional methods, LR showed the highest accuracy (89.50 ± 2.96%), and the RF model had the highest accuracy (94.06 ± 2.24%) among the classification models. The estimate from the RF model was then applied to the stacking model, which showed the highest accuracy, 95.40 ± 1.74% (cf. Table 2).
In all, 859 uncle-nephew pairs and 859 randomly selected unrelated pairs were assessed. To avoid overfitting, the sample of 1,718 pairs was assigned as train:validation:test = 6:2:2 (1,034 pairs:342 pairs:342 pairs), and the analysis was replicated 100 times. The results are given as means ± standard deviations. The optimal cut-off value of TNSA for identifying uncle-nephew relationships is 18, and the cut-off value of the log-likelihood ratio is −0.13. Among the conventional analysis methods, TNSA & LR had the highest accuracy (89.10 ± 1.47%), and among the classification models, the C5.0 model had the highest accuracy (91.94 ± 1.47%). The estimate from the C5.0 model was then applied to the stacking model, which showed the highest accuracy, 93.65 ± 1.22% (cf. Table 3).
The analysis was performed on 468 pairs of first cousins and 468 randomly selected unrelated pairs. To avoid overfitting, the total of 936 pairs was allocated as train:validation:test = 6:2:2 (564 pairs:186 pairs:186 pairs), and the analysis was performed with 100 replicates. The results are presented as means ± standard deviations. The cut-off value of TNSA for discriminating fourth-degree relatives is 16, and the cut-off value of the log-likelihood ratio is −0.11. Among the conventional analysis methods, TNSA & LR showed the highest accuracy (73.61 ± 2.35%), and the C5.0 model had the best performance among the classification models (79.28 ± 2.93%). As in the uncle-nephew case, the estimate from the C5.0 model, the most accurate classification model, was applied to the stacking model, which showed the highest accuracy, 82.10 ± 2.97% (cf. Table 4).
The methods most commonly used for the identification of familial relationships distinguish related from unrelated pairs by calculating the LR from STR markers consisting of repeats of specific nucleotide sequences. Accordingly, using traditional methods such as the total number of shared alleles (TNSA) or the likelihood ratio (LR), a cut-off value that minimizes the classification error rate was deduced, and familial relationships were identified based on these cut-off values (Jeong et al.).
Therefore, a more effective classification method is needed to identify familial relationships of the third degree and beyond, and for this purpose, various data mining classification techniques have been applied (Yang et al.).
The classification results obtained by applying the TNSA and LR methods, previously the main methods used, are as follows. For fourth-degree relatives, the accuracy was 73.61 ± 2.35%, but the specificity was only about 56.91 ± 4.05%, so the misclassification rate was high. Therefore, to increase accuracy, it was necessary to produce and apply a new classification method rather than relying on existing methods, as shown in Figure 2. Our proposed stacking ensemble machine learning algorithm improved the classification of third- and fourth-degree relatives, which had been difficult to identify in the past, to a maximum of 93.65 ± 1.22% (uncle-nephew) and 82.10 ± 2.97% (first cousins).
When the blood relationship between two people is unknown, we do not apply all five models simultaneously but apply them step by step, starting from the first-degree model. That is, if the pair is classified as first-degree relatives (parent-child) by our first-degree model (the stacking model based on Xgboost), the relationship is predicted to be parent-child. If it is not classified as first-degree, our models are then applied in order for second-degree relatives (full-sibling and grandparent-grandchild pairs), third-degree relatives (uncle-nephew pairs), and fourth-degree relatives (first-cousin pairs) to determine the relationship sequentially.
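The sequential screening above can be sketched as a simple cascade: hypothetical stand-ins for the fitted degree-specific stacking models are applied in order, and the pair is labelled with the first degree whose model classifies it as related, or "unrelated" otherwise. The threshold rules below use the TNSA cut-offs reported in this study (23, 21, 18, 16) purely as illustrative stand-ins for the actual models.

```python
# Conceptual cascade sketch (toy threshold models, not the fitted ones).

def classify_degree(pair, models):
    """Return the first degree whose model accepts the pair, else 'unrelated'."""
    for degree, model in models:
        if model(pair):
            return degree
    return "unrelated"

# Toy stand-ins using the TNSA cut-offs from this study for illustration.
models = [
    ("first-degree (parent-child)", lambda p: p["tnsa"] >= 23),
    ("second-degree (sibling/grandparent)", lambda p: p["tnsa"] >= 21),
    ("third-degree (uncle-nephew)", lambda p: p["tnsa"] >= 18),
    ("fourth-degree (first cousin)", lambda p: p["tnsa"] >= 16),
]

print(classify_degree({"tnsa": 24}, models))  # first-degree (parent-child)
print(classify_degree({"tnsa": 19}, models))  # third-degree (uncle-nephew)
print(classify_degree({"tnsa": 10}, models))  # unrelated
```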
These results can be applied to calculating the familial relationship distance between distant relatives beyond the fourth degree in future work. We therefore judge that this work can become important material for developing more accurate and quantitative familial relationship determinations in the various fields that require genetic identification through familial searching in databases of DNA profiles.
This research was supported and funded by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00208882) and the Korean National Police Agency [Project Name: Advancing the Appraisal Techniques of Forensic Entomology/Project Number: PR10-04-000-22].
Results for identification of first-degree relatives (parent-child pairs) by classification models
| Method | Model | Sensitivity (n = 158) | Specificity (n = 158) | Accuracy (n = 316) |
|---|---|---|---|---|
| Traditional method | TNSA^a | 98.23 ± 0.85 | 99.63 ± 0.40 | 98.93 ± 0.45 |
| | LR^b | 97.39 ± 1.01 | 99.03 ± 0.66 | 98.21 ± 0.61 |
| | TNSA & LR | 100.00 ± 0.00 | 98.80 ± 0.69 | 99.40 ± 0.34 |
| One-step | LDA | 99.42 ± 0.58 | 99.22 ± 0.61 | 99.32 ± 0.40 |
| | SDA | 98.80 ± 0.75 | 99.32 ± 0.51 | 99.06 ± 0.46 |
| | RDA | 50.37 ± 4.15 | 98.86 ± 0.83 | 74.61 ± 2.05 |
| | CART | 99.68 ± 0.58 | 99.89 ± 0.24 | 99.78 ± 0.29 |
| | C5.0 | 99.84 ± 0.35 | 99.86 ± 0.26 | 99.85 ± 0.20 |
| | RF | 99.84 ± 0.33 | 99.84 ± 0.29 | 99.84 ± 0.21 |
| | Logistic | 98.58 ± 1.05 | 98.97 ± 0.80 | 98.78 ± 0.63 |
| | kNN | 99.22 ± 0.62 | 99.16 ± 0.60 | 99.19 ± 0.46 |
| | NB | 90.83 ± 2.27 | 98.99 ± 0.77 | 94.91 ± 1.18 |
| | PAM | 98.79 ± 0.73 | 99.34 ± 0.50 | 99.06 ± 0.43 |
| | SVM | 99.15 ± 0.63 | 99.25 ± 0.61 | 99.20 ± 0.44 |
| | ANN | 99.09 ± 0.88 | 99.36 ± 0.61 | 99.23 ± 0.53 |
| Two-step (stacking) | | | | |
a. The cut-off value of TNSA is 23
b. The cut-off value of LR is 4.36
Results for identification of second-degree relatives (siblings and grandparent-grandchild pairs) by classification models
| Method | Model | Siblings: Sensitivity (n = 104) | Siblings: Specificity (n = 104) | Siblings: Accuracy (n = 208) | Grandparent-grandchild: Sensitivity (n = 44) | Grandparent-grandchild: Specificity (n = 44) | Grandparent-grandchild: Accuracy (n = 88) |
|---|---|---|---|---|---|---|---|
| Traditional method | TNSA^a | 94.99 ± 2.02 | 98.48 ± 1.08 | 96.73 ± 1.12 | 83.47 ± 5.02 | 85.78 ± 4.27 | 84.62 ± 3.48 |
| | LR^b | 95.80 ± 1.64 | 99.12 ± 0.80 | 97.46 ± 0.87 | 88.44 ± 4.09 | 90.56 ± 3.87 | 89.50 ± 2.96 |
| | TNSA & LR | 99.52 ± 0.60 | 98.48 ± 1.08 | 99.00 ± 0.62 | 91.24 ± 3.68 | 77.40 ± 5.89 | 84.32 ± 3.60 |
| One-step | LDA | 95.80 ± 1.89 | 99.13 ± 0.84 | 97.47 ± 0.98 | 84.51 ± 5.21 | 94.69 ± 3.31 | 89.60 ± 2.81 |
| | SDA | 70.70 ± 14.13 | 99.86 ± 0.34 | 85.28 ± 7.08 | 84.22 ± 5.90 | 94.73 ± 3.23 | 89.48 ± 3.27 |
| | RDA | 59.49 ± 10.50 | 97.81 ± 1.46 | 78.65 ± 5.08 | 44.84 ± 10.29 | 95.73 ± 2.89 | 70.29 ± 5.40 |
| | CART | 98.11 ± 1.98 | 99.06 ± 0.98 | 98.59 ± 1.06 | 89.51 ± 4.79 | 93.51 ± 4.63 | 91.51 ± 2.85 |
| | C5.0 | 98.82 ± 0.97 | 99.48 ± 0.79 | 99.15 ± 0.57 | 91.36 ± 4.07 | 94.73 ± 3.55 | 93.04 ± 2.63 |
| | Xgboost | 98.97 ± 0.98 | 99.14 ± 0.90 | 99.06 ± 0.59 | 91.20 ± 3.71 | 95.09 ± 3.02 | 93.14 ± 2.23 |
| | Logistic | 96.50 ± 2.01 | 96.50 ± 2.24 | 96.50 ± 1.41 | 89.09 ± 4.88 | 89.91 ± 4.80 | 89.50 ± 3.42 |
| | kNN | 95.60 ± 2.01 | 98.78 ± 1.07 | 97.19 ± 1.12 | 86.11 ± 4.70 | 90.16 ± 4.36 | 88.13 ± 2.92 |
| | NB | 90.88 ± 2.49 | 96.02 ± 1.99 | 93.45 ± 1.48 | 69.80 ± 7.99 | 96.40 ± 2.60 | 83.10 ± 4.33 |
| | PAM | 96.54 ± 1.54 | 99.40 ± 0.74 | 97.97 ± 0.80 | 87.36 ± 4.70 | 94.04 ± 3.40 | 90.70 ± 2.67 |
| | SVM | 98.05 ± 1.21 | 98.09 ± 1.18 | 98.07 ± 0.77 | 88.49 ± 4.35 | 93.09 ± 3.87 | 90.79 ± 3.02 |
| | ANN | 98.42 ± 1.26 | 98.02 ± 1.57 | 98.22 ± 0.97 | 90.02 ± 4.35 | 90.07 ± 4.59 | 90.04 ± 2.95 |
| Two-step (stacking) | | | | | | | |
a. The cut-off values of TNSA are 21 for siblings and 18 for grandparent-grandchild pairs.
b. The cut-off values of LR are 0.15 for siblings and 2.93 for grandparent-grandchild pairs.
Results for identification of third-degree relatives (uncle-nephew pairs) by classification models
| Method | Model | Sensitivity (n = 171) | Specificity (n = 171) | Accuracy (n = 342) |
|---|---|---|---|---|
| Traditional method | TNSA^a | 73.99 ± 2.90 | 86.86 ± 2.24 | 80.43 ± 1.90 |
| | LR^b | 79.07 ± 2.42 | 91.02 ± 1.59 | 85.05 ± 1.51 |
| | TNSA & LR | 94.67 ± 1.45 | 83.54 ± 2.36 | 89.10 ± 1.47 |
| One-step | LDA | 84.99 ± 2.51 | 86.33 ± 2.47 | 85.66 ± 1.81 |
| | SDA | 85.38 ± 2.56 | 85.17 ± 2.71 | 85.28 ± 1.90 |
| | RDA | 36.33 ± 4.56 | 96.23 ± 1.47 | 66.28 ± 2.26 |
| | CART | 90.80 ± 2.57 | 90.37 ± 2.60 | 90.58 ± 1.48 |
| | RF | 91.00 ± 2.48 | 91.44 ± 2.02 | 91.22 ± 1.63 |
| | Xgboost | 91.52 ± 2.07 | 92.09 ± 2.02 | 91.81 ± 1.44 |
| | Logistic | 85.45 ± 2.58 | 86.08 ± 2.47 | 85.76 ± 1.79 |
| | kNN | 80.12 ± 2.91 | 88.87 ± 2.37 | 84.49 ± 1.64 |
| | NB | 50.67 ± 4.78 | 94.96 ± 1.60 | 72.82 ± 2.30 |
| | PAM | 85.21 ± 2.49 | 84.21 ± 2.31 | 84.71 ± 1.70 |
| | SVM | 88.74 ± 2.20 | 89.05 ± 2.29 | 88.90 ± 1.65 |
| | ANN | 87.63 ± 2.87 | 87.26 ± 2.58 | 87.45 ± 1.88 |
| Two-step (stacking) | | | | |
a. The cut-off value of TNSA is 18
b. The cut-off value of LR is −0.13
Results for identification of fourth-degree relatives (first cousins) by classification models
| Method | Model | Sensitivity (n = 93) | Specificity (n = 93) | Accuracy (n = 186) |
|---|---|---|---|---|
| Traditional method | TNSA^a | 71.51 ± 3.78 | 63.90 ± 3.48 | 67.71 ± 2.49 |
| | LR^b | 73.12 ± 4.42 | 68.94 ± 4.22 | 71.03 ± 3.26 |
| | TNSA & LR | 90.31 ± 2.75 | 56.91 ± 4.05 | 73.61 ± 2.35 |
| One-step | LDA | 69.66 ± 4.95 | 72.55 ± 4.00 | 71.11 ± 2.91 |
| | SDA | 71.07 ± 4.56 | 72.29 ± 4.33 | 71.68 ± 2.73 |
| | RDA | 24.11 ± 6.29 | 95.96 ± 2.13 | 60.03 ± 3.03 |
| | CART | 73.96 ± 8.96 | 71.65 ± 8.67 | 72.80 ± 3.45 |
| | RF | 79.67 ± 4.49 | 73.98 ± 4.85 | 76.82 ± 2.89 |
| | Xgboost | 80.00 ± 3.75 | 76.91 ± 4.70 | 78.46 ± 3.06 |
| | Logistic | 68.88 ± 4.84 | 72.13 ± 4.12 | 70.51 ± 2.96 |
| | kNN | 57.85 ± 5.25 | 70.48 ± 5.00 | 64.16 ± 3.58 |
| | NB | 32.72 ± 5.25 | 91.36 ± 3.53 | 62.04 ± 2.70 |
| | PAM | 71.83 ± 4.35 | 69.06 ± 4.29 | 70.45 ± 2.60 |
| | SVM | 75.38 ± 4.65 | 70.79 ± 4.67 | 73.09 ± 2.71 |
| | ANN | 68.82 ± 5.63 | 70.29 ± 5.24 | 69.55 ± 3.77 |
| Two-step (stacking) | | | | |
a. The cut-off value of TNSA is 16
b. The cut-off value of LR is −0.11