
Genes are sometimes involved in diseases such as cancer. Therefore, it is medically essential to know which genes are involved in a disease, and this information can assist in the development of new drugs and treatment methods. However, there are tens of thousands of genes, and it is impossible to conduct experiments on each gene. Therefore, candidate genes involved in a disease are typically narrowed down using statistical analyses (Dudoit
Genomic data were used to detect DEGs, and two data types are currently used. One was microarray data obtained from DNA microarray experiments, and the other was RNA-seq data obtained from next-generation sequencing experiments. Microarray data were used in this study. The expression level is defined as the fluorescence intensity of a fluorescent substance that shines more strongly when it is strongly expressed and is a continuous type of data (Draghici, 2012). These data roughly follow a logarithmic normal distribution and, when analyzed, a method that assumes a normal distribution with a logarithmic transformation (Holye
In this study, we focused on identifying the DEGs between the two groups. Examples of two-group comparisons include the cases mentioned above of normal cells and cancer cells, as well as cases of cells treated with drugs and cells not treated with drugs. Most of these are used to identify the causative gene of a particular disease or a gene affected by a specific substance. Outside of medicine, they are used to discover genes with different expression patterns between organisms, such as monkeys and chimpanzees. In addition, although not addressed in this study, it is sometimes used to compare expression levels in three or more groups of organisms (Churchill, 2004). This is the case when researchers want to compare the effects of multiple treatments or drugs for a particular disease or when they want to observe whether the expression levels change over time after the administration of a specific drug.
There are currently two primary methods for identifying DEGs using genomic data. One is based on fold change, and the other on testing. In a broad sense, fold change measures how much expression differs between two groups. The statistic that could be inferred to have a high probability of being a DEG with a large value was calculated, and the gene with a large value was designated as a DEG. Representative methods include the average difference (AD), weighted average difference (WAD) (Kadota
The test-based method performs statistical testing for each gene, and the gene with the most significant difference is designated as a DEG. Usually, genes are ranked by the
To overcome these weaknesses, we propose a new clustering method. When a specific reaction or change occurs in a living organism, it may be accompanied by another change. In this case, when a gene controlling one of the changes was expressed, the gene that is controlling the other was also expressed. Therefore, many genes interact with each other. Because there are many genes whose functions are not yet known, even with current technology, this property may be used to predict the function of an unknown gene (Brown, 2002). Specifically, the expression levels of genes were measured under multiple conditions and clustered, and the genes showing similar expression patterns were identified. In this case, the genes classified in the same cluster can be interpreted as containing similar information. Based on this assumption, DEGs and non-DEGs were classified into different clusters when genes in the genome data were clustered. Therefore, DEGs can be detected by determining which cluster is a DEG cluster rather than by deciding whether each gene is a DEG. Consequently, we considered a method to determine whether each cluster was a DEG cluster using a test after clustering. By testing clusters instead of genes, the number of tests is reduced, and the sample size for each test is increased; thus, the accuracy is expected to be higher than that of the conventional method.
When a correlation exists between genes, independence between samples in the same cluster cannot be assumed. Because most of the tests currently used for two-group comparisons require the assumption of independence, it is necessary to consider how to deal with this problem. Instead of the
In the next section, we describe the proposed method in detail. Section 3 presents the setup and simulation results, and Section 4 summarizes the study.
Let
The expression levels in group 2 were
In this case, we use
and
as samples and a two-group comparison test was performed. If the expression levels of the two groups were regarded as significantly different, all the genes belonging to cluster
The
where
To solve this, we obtain
where
As mentioned above, microarray data are often tested by assuming a normal distribution. Therefore, we decided to use the normal distribution method for clustering. The GMM introduced in this section is a clustering method that uses a mixed normal distribution. The mixed normal distribution is a combination of several multivariate normal distributions and has the following probability density function:
where
The parameters (
When the partial differentiation is set to zero, we obtain
where
This rate is known as the burden rate. As
Differentiating
Because all the parameters calculated above are functions of the burden ratio, and the burden ratio also depends on the parameters, they cannot be calculated analytically. Therefore, they were computed using the expectation-maximization (EM) algorithm (Dempster
If the data belonging to the same cluster are correlated, the higher the correlation coefficient, the better the clustering accuracy of the proposed method. The Euclidean distance between
We can say that the larger the correlation coefficient, the smaller the expected distance between two points. In the
In the proposed method, because the expression levels of multiple genes obtained from the same cell were tested as samples, independence between the samples cannot be assumed if there is a relationship between the genes. In most cases, a positive correlation was observed between the expression of these genes. One problem in this case is that the test statistic of the
When correlations are ignored, the
A correction for the
Finally, we discuss the tests that do not require random sampling. As mentioned above, the problem of correlated data is caused by sample variance. Therefore, an increase in type-one errors can be prevented by conducting a test in which the test statistic is less affected by sample variance. Consequently, we consider a nonparametric permutation test that uses the difference in means. Nonparametric permutation tests do not require random sampling (Edgington and Onghena, 2014). However, some studies have shown that if the samples are correlated, the permutation changes the correlation structure, thus reducing the power of the test (Blair and Karniski, 1993). Therefore, we decided to compare the simulations in which the test is more suitable for microarray data, the permutation test controls the type-one error, and the
The permutation test is a nonparametric test that calculates the
Then,
If the alternative hypothesis is that the population mean of Group 2 is larger, this
In the simulations, we compared the proposed method with two conventional methods. As a common setup for all simulations, we used data with
The simulation results are listed in Table 1. First, we calculated the AUC. The highest AUC is obtained by the method using the permutation test, followed by the method using the
Next, we consider detection power and type-one errors. First, both the detection power and type-one errors of the conventional method were minimal. This may be because the
Table 2 shows the results of comparing the proposed method with the test for gene sets. The clusters formed by
Table 3 shows the simulation results when genes are correlated but do not follow the same distribution. More specifically, half of the genes in Cluster 1 followed Distribution 1, and the other half followed Distribution 2. Similarly, half of the genes in Cluster 2 followed Distribution 2, and the other half followed Distribution 3. Finally, half of the genes in Cluster 3 followed Distribution 3, whereas the other half followed Distribution 1. It can be seen that the accuracy of the conventional method does not change significantly. Because this method considers each gene individually, it is believed that there is no effect on the correlation structure. The results of the other methods were not significantly different; however, the AUC was generally lower. This may be because the correlation structure is no longer simple and the clustering accuracy decreases. In addition, unlike the previous simulation, an increase in the correlation coefficient did not necessarily improve the accuracy in terms of AUC, detection power, and type-one error. This is also due to changes in the correlation structure. However, the proposed method using a permutation test is still recommended for DEG detection.
Although the type-one error when using the permutation test was smaller than that of the
In the actual data analysis, we applied only the
Less than half of all DEGs were determined to be DEGs using each method. Additionally, the same genes were more likely to be identified as DEGs when the threshold for identifying them as DEGs increased. We concluded that the number of genes detected by the proposed method significantly differed from that detected by the conventional method. This difference should be considered when performing detection. Another interpretation is that genes commonly identified as DEGs using these two methods have a very high probability of being DEGs. If the researcher wishes to narrow the number of DEG candidates to a more precise and smaller number, this may be accomplished by checking the results of multiple methods.
We proposed a new method to analyze microarray data to detect DEGs. The proposed method is an improvement over the test-based method. It can solve the drawbacks of the conventional method, such as low accuracy due to small sample size and multiplicity of tests. The simulation of the proposed method was performed under the assumptions that the expression levels followed a normal distribution and that there was a positive correlation among the genes. In addition, we conducted simulations for the case in which some genes were correlated but followed different distributions.
The results showed that all methods could detect DEGs in terms of AUC; however, it was necessary to use the proposed method because the conventional method did not work when the significance level was set at 0.05. Among the proposed methods, those using the
The simulations showed that the method using the permutation test is the most recommended, even when genes are correlated but follow different distributions. However, unlike the aforementioned simulations, the accuracy did not increase as the correlation coefficient increased. This indicates that, even if the correlation becomes more robust, the accuracy of clustering is not likely to improve if the distribution of the samples is different. However, because such a situation may occur in actual microarray data, it is necessary to consider ways to cope with it. One way to address this issue is to change the clustering method. By clustering using correlation coefficients, it was possible to divide genes that influenced each other into the same cluster.
The actual data analysis showed that the genes detected by the proposed method significantly differed from those seen by the conventional method. This indicates that the genes likely to be detected using the proposed method differ from those detected using the conventional method. Unlike simulation data, real data do not always have normality, and the sample size is often small; therefore, it is difficult to determine whether the conventional
The proposed method has room for improvement, as various testing and clustering methods can be applied. In this study, we applied tests that are generally available for a variety of data, but there are several tests for genes, such as the sequence kernel association test (SKAT) (Michael
Results of calculating the AUC, detection power, and type-one error for each method for different correlation coefficients
Evaluation indices | |||||
---|---|---|---|---|---|
Welch | AUC | 0.703 | 0.704 | 0.703 | 0.701 |
power | 4.0 × 10−6 | 8.3 × 10−6 | 1.0 × 10−6 | 1.1 × 10−6 | |
type-one error | 8.6 × 10−7 | 8.6 × 10−7 | 8.6 × 10−7 | 8.5 × 10−7 | |
Wilcoxon rank-sum test | AUC | 0.687 | 0.687 | 0.686 | 0.686 |
power | 0.000 | 0.000 | 0.000 | 0.000 | |
type-one error | 0.000 | 0.000 | 0.000 | 0.000 | |
AUC | 0.717 | 0.717 | 0.720 | 0.723 | |
power | 0.865 | 0.867 | 0.872 | 0.879 | |
type-one error | 0.672 | 0.673 | 0.674 | 0.680 | |
GMM+Welch | AUC | 0.717 | 0.718 | 0.721 | 0.726 |
power | 0.867 | 0.871 | 0.878 | 0.883 | |
type-one error | 0.673 | 0.675 | 0.680 | 0.686 | |
AUC | 0.714 | 0.714 | 0.716 | 0.720 | |
power | 0.849 | 0.851 | 0.856 | 0.862 | |
type-one error | 0.645 | 0.644 | 0.645 | 0.651 | |
GMM+Wilcoxon rank-sum test | AUC | 0.714 | 0.715 | 0.718 | 0.722 |
power | 0.853 | 0.857 | 0.862 | 0.867 | |
type-one error | 0.646 | 0.648 | 0.652 | 0.658 | |
AUC | 0.773 | 0.775 | 0.782 | 0.785 | |
power | 0.830 | 0.832 | 0.841 | 0.853 | |
type-one error | 0.361 | 0.359 | 0.354 | 0.360 | |
GMM+Permutation test | AUC | 0.769 | 0.772 | 0.781 | 0.784 |
power | 0.830 | 0.834 | 0.846 | 0.856 | |
type-one error | 0.362 | 0.361 | 0.356 | 0.363 |
Results of calculating the AUC, detection power, and type-one error for each method for different correlation coefficients when compared to the test for the gene set
Evaluation indices | |||||
---|---|---|---|---|---|
AUC | 0.648 | 0.645 | 0.637 | 0.632 | |
power | 0.782 | 0.786 | 0.798 | 0.820 | |
type-one error | 0.666 | 0.673 | 0.696 | 0.728 | |
GMM+GSEA | AUC | 0.654 | 0.648 | 0.636 | 0.632 |
power | 0.798 | 0.795 | 0.803 | 0.825 | |
type-one error | 0.671 | 0.677 | 0.699 | 0.729 | |
AUC | 0.691 | 0.692 | 0.698 | 0.705 | |
power | 0.037 | 0.038 | 0.037 | 0.039 | |
type-one error | 0.012 | 0.012 | 0.012 | 0.012 | |
GMM+SAM-GS | AUC | 0.697 | 0.699 | 0.706 | 0.711 |
power | 0.039 | 0.038 | 0.038 | 0.040 | |
type-one error | 0.013 | 0.012 | 0.012 | 0.012 | |
AUC | 0.691 | 0.692 | 0.698 | 0.711 | |
power | 0.037 | 0.037 | 0.038 | 0.038 | |
type-one error | 0.012 | 0.012 | 0.012 | 0.011 | |
GMM+ROAST | AUC | 0.697 | 0.699 | 0.707 | 0.711 |
power | 0.039 | 0.039 | 0.039 | 0.040 | |
type-one error | 0.013 | 0.013 | 0.012 | 0.012 | |
AUC | 0.773 | 0.775 | 0.782 | 0.785 | |
power | 0.830 | 0.832 | 0.841 | 0.853 | |
type-one error | 0.361 | 0.359 | 0.354 | 0.360 | |
GMM+Permutation test | AUC | 0.769 | 0.772 | 0.781 | 0.784 |
power | 0.830 | 0.834 | 0.846 | 0.856 | |
type-one error | 0.362 | 0.361 | 0.356 | 0.363 |
Results of calculating AUC, detection power, and type-one error for each method with different correlation coefficients when genes are correlated but do not follow the same distribution
Evaluation indices | ||||
---|---|---|---|---|
Welch | AUC | 0.705 | 0.703 | 0.698 |
power | 1.3 × 10−6 | 1.0 × 10−6 | 1.1 × 10−6 | |
type-one error | 1.57 × 10−7 | 1.57 × 10−7 | 2.57 × 10−7 | |
Wilcoxon rank-sum test | AUC | 0.688 | 0.686 | 0.684 |
power | 0.000 | 0.000 | 0.000 | |
type-one error | 0.000 | 0.00 | 0.000 | |
AUC | 0.718 | 0.716 | 0.716 | |
power | 0.866 | 0.870 | 0.874 | |
type-one error | 0.673 | 0.675 | 0.680 | |
GMM+Welch | AUC | 0.717 | 0.708 | 0.698 |
power | 0.870 | 0.875 | 0.874 | |
type-one error | 0.674 | 0.683 | 0.691 | |
AUC | 0.715 | 0.713 | 0.714 | |
power | 0.850 | 0.854 | 0.857 | |
type-one error | 0.644 | 0.646 | 0.652 | |
GMM+Wilcoxon rank-sum test | AUC | 0.714 | 0.706 | 0.696 |
power | 0.856 | 0.859 | 0.857 | |
type-one error | 0.647 | 0.656 | 0.665 | |
AUC | 0.774 | 0.779 | 0.780 | |
power | 0.832 | 0.838 | 0.845 | |
type-one error | 0.360 | 0.356 | 0.359 | |
GMM+Permutation test | AUC | 0.771 | 0.777 | 0.775 |
power | 0.834 | 0.841 | 0.846 | |
type-one error | 0.363 | 0.360 | 0.366 |
Type-one error in the proposed method using
3 : 3 | 0.361 | 0.359 | 0.354 | 0.360 |
10 : 10 | 0.168 | 0.187 | 0.219 | 0.261 |
20 : 20 | 0.115 | 0.116 | 0.114 | 0.115 |
30 : 30 | 0.046 | 0.057 | 0.075 | 0.110 |
Number and percentage of genes commonly identified as DEGs in each method
1000 | 149 | 0.149 |
3000 | 1079 | 0.360 |
5000 | 2122 | 0.424 |