## Abstract

We describe a simple method to transfer from weights in deep neural networks (NNs) trained by a deep belief network (DBN) to weights in a backpropagation NN (BPNN) in the recursive-rule eXtraction (Re-RX) algorithm with J48graft (Re-RX with J48graft) and propose a new method to extract accurate and interpretable classification rules for rating category data sets. We apply this method to the Wisconsin Breast Cancer Data Set (WBCD), the Mammographic Mass Data Set, and the Dermatology Dataset, which are small, high-abstraction data sets with prior knowledge. After training these three data sets, our proposed rule extraction method was able to extract accurate and concise rules for deep NNs trained by a DBN. These results suggest that our proposed method could help fill the gap between the very high learning capability of DBNs and the very high interpretability of rule extraction algorithms such as Re-RX with J48graft.

## 1  Introduction

Hinton and Salakhutdinov (2006) noted that, with good data, unsupervised pretraining of deep belief networks (DBNs) followed by fine-tuning with backpropagation (BP) could provide better accuracy. Erhan, Bengio, Courville, Manzagol, and Vincent (2010) suggested that unsupervised pretraining before a supervised learning task can guide learning toward basins of attraction of minima that support better generalization from a training data set. These findings support a regularization explanation for the effect of pretraining. However, whether such deep architectures have theoretical advantages over shallow architectures on rating category data sets (Luo, Wu, & Wu, 2017) remains unclear. Erhan et al. (2010) noted that when the training data set is small, little importance is typically placed on minimizing training error because overfitting is the major issue. In such cases, unsupervised pretraining can help identify apparent local minima that have better generalization error.

Building on this important argument (Erhan et al., 2010), Abdel-Zaher and Eldeib (2016) simply transferred the entire weight matrix of a fully trained DBN to a BP neural network (BPNN) with a similar architecture and applied the idea to the diagnosis of the Wisconsin Breast Cancer Data Set (WBCD) (UCI Repository, 2015).

The WBCD is a rating category data set: biomarkers or pathologists' readings, with a diagnostic task that involves distinguishing between two mutually exclusive possibilities (e.g., “disease” or “nondisease”). It has nine pathological features (attributes) and two classes. The WBCD was originally developed with the aim of enabling accurate diagnosis of breast masses based solely on a fine needle aspiration (FNA) cytology test. Each sample consists of nine high-level abstraction attributes that were originally generated from cytopathologic “images” rated by pathologists, with each attribute assigned an integer value between 1 and 10 (with 1 being the closest to benign and 10 being the most anaplastic). Thus, the WBCD is a high-level abstraction data set with prior knowledge (Gülçehre & Bengio, 2016).

A recent approach for diagnosis of the WBCD exhibited two promising characteristics: very high classification performance and only a few simple rules, resulting in good interpretability. This approach confirmed that artificial intelligence (AI)–based data mining technologies could be successfully implemented for cancer prediction. Traditional breast cancer diagnosis can be transformed into a two-class classification problem to categorize tumors in existing data sets as benign or malignant. However, most of the current diagnostic methods using classification for breast cancer are black box models, which cannot satisfactorily reveal hidden information in the data that typically plays a key role in providing a quality medical diagnosis.

Rule extraction (Andrews, Diederich, & Tickle, 1995) is a powerful AI-based data mining technique that attempts to find a compromise between these two requirements, accuracy and interpretability, by building a simple rule set that mimics how a well-performing complex model (the black box) makes decisions. The recursive-rule eXtraction (Re-RX) algorithm developed by Setiono, Baesens, and Mues (2008) was originally intended to be a rule extraction tool. However, because of its recursive nature, the algorithm tends to generate more rules than other rule extraction algorithms. Therefore, one of its major drawbacks is that it typically generates large rule sets for middle-sized or larger data sets.

To extract rules that are both concise and highly accurate while maintaining the good framework of the Re-RX algorithm, we recently proposed supplementing the Re-RX algorithm with J48graft, a class for generating a grafted C4.5 (Quinlan, 1993) decision tree—hereafter Re-RX with J48graft (Hayashi & Nakano, 2015; Hayashi, 2017). J48graft (Class J48graft, 2017) is an implementation of the C4.5A grafting algorithm (Webb, 1999).

In contrast to black box models, Re-RX with J48graft (Hayashi & Nakano, 2015; Hayashi, 2017) not only provides very high classification accuracy but can also be easily explained and interpreted in terms of the concise rules it extracts—that is, Re-RX with J48graft provides if-then rules. This “white box” model is easier to understand and is thus often preferred by physicians and clinicians. We previously confirmed the excellent synergy between grafting in J48graft and subdivision in the Re-RX algorithm (Hayashi, 2017).

## 2  Motivation

Despite the impact of deep learning (Liu et al., 2017), most researchers tend to overlook three issues: (1) it is applicable only to data sets with rich information hidden in the data, such as computer vision and images; (2) it has a black box nature, that is, there are considerable descriptions of the limitations of deep learning in terms of its nonexplainability, even in recent successful applications of deep learning algorithms (Gulshan et al., 2016; Mohamed, Luo, Peng, Jankowitz, & Wu, 2017); and (3) it is applicable only to very large, labeled data sets (Oakden-Rayner et al., 2017). We believe that resolving these potential issues will vastly expand the utility of deep learning to explainable AI for more practical usages.

## 3  Objectives

The objectives of this study were as follows: to provide a simple and general-purpose method to transfer weights in deep NNs trained by DBNs to weights in BPNNs in Re-RX with J48graft; to propose, on this basis, a new method to extract accurate and interpretable classification rules for rating category data sets; and to apply this method to the WBCD, Mammographic Mass, and Dermatology data sets (UCI Repository, 2015), which are small, high-abstraction data sets with prior knowledge (Gülçehre & Bengio, 2016). Here, we also demonstrate an effective way to exploit exceptions to the limitations that the no-free-lunch (NFL) theorem (Wolpert, 1996a, 1996b) places on DBNs.

## 4  Theoretical Background

### 4.1  Effects of Introducing Prior Knowledge into the Abstraction Data Sets

Gülçehre and Bengio (2016) hypothesized that more abstract learning tasks, such as those obtained by composing simpler tasks, are generally difficult for general-purpose machine learning algorithms and thus more likely to yield effective local minima for NNs.

Similarly, based on experimental observations of the training of deep NNs, Bengio (2014) made the following inference: it is more realistic to expect a concept to be discovered by an NN trained with supervision (where examples are provided in which the concept is present, as in the WBCD, or absent, as in the MNIST database; LeCun et al., 1998) than by unsupervised learning. Regarding the learning of high-level abstraction data sets, Bengio (2014) also proposed the following hypothesis: the deeper the network, the more difficult the learning task; solutions may exist, but effective local minima become more likely to hinder learning as the required depth of the architecture increases.

### 4.2  Optimization Difficulty of Learning High-Level Abstraction Data Sets

Erhan et al. (2010) reported that early examples carry much greater weight in the final solution. The observation that the learner appears to become stuck in or near a local minimum highlights both the issue of effective local minima and the regularization effect that occurs when a deep NN is initialized by unsupervised pretraining. Interestingly, the difficulty owing to effective local minima increases as the network deepens.

### 4.3  Limitations of a DBN Trained Using a Small Number of Data Sets by the NFL Theorem

Wolpert (1996a, 1996b) introduced what would become known as the NFL theorem, according to which, averaged over all possible problems, all learning algorithms perform similarly, so that on sufficiently large data sets they should achieve similar accuracies. The most sensible explanation for observed differences is that real-world classification problems typically belong to a particular subgroup of possible theoretical data sets (Gómez & Rojas, 2016). Although these techniques have shown remarkable success, deep learning remains subject to NFL limitations. From the perspective of abstraction, deep learning has been shown to work for data sets with inherently hierarchical structure, such as images; without that kind of structure, however, the added complexity may not be worthwhile (Gómez & Rojas, 2016).

### 4.4  Re-RX with J48graft

With the objective of extracting more accurate and concise classification rules, we proposed replacing the C4.5 decision tree used in the conventional Re-RX algorithm (Setiono et al., 2008) with J48graft, yielding Re-RX with J48graft. The conventional pruning used in J48 both complements and contrasts with the grafting used in J48graft (Webb, 1999). The performance of the Re-RX algorithm is thought to be greatly affected by the decision tree, so in consideration of the grafting concepts associated with J48graft, J48 was replaced with J48graft in the Re-RX algorithm. Re-RX with J48graft forms decision trees in a recursive manner while training multilayer perceptrons (MLPs) using BP, which allows pruning (Setiono, 1997) and therefore generates more concise MLPs for rule extraction.
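The recursive structure described above can be sketched compactly. The following is a minimal illustration under stated assumptions: scikit-learn's `DecisionTreeClassifier` stands in for J48graft, the MLP training and pruning steps are omitted, and the function name and thresholds (`delta1`, `delta2`, mirroring the covering-rate and error-rate thresholds of Table 2) are ours, not the authors' implementation.

```python
# Hedged sketch of the Re-RX recursion: grow a shallow tree, then
# recursively subdivide the samples of any leaf whose error rate and
# covering rate both exceed their thresholds.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def re_rx_sketch(X, y, delta1=0.125, delta2=0.125, depth=0, max_depth=3):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    leaves = tree.apply(X)          # leaf index for every training sample
    rules = []
    for leaf in np.unique(leaves):
        idx = leaves == leaf
        cover = idx.mean()                               # covering rate
        err = (tree.predict(X[idx]) != y[idx]).mean()    # error rate
        if err > delta2 and cover > delta1 and depth < max_depth:
            # Subdivide: recurse on only the samples falling in this leaf.
            rules += re_rx_sketch(X[idx], y[idx], delta1, delta2, depth + 1)
        else:
            rules.append((depth, leaf, round(cover, 3), round(err, 3)))
    return rules
```

Each returned tuple records the recursion depth, leaf identifier, covering rate, and error rate of one terminal rule; a real implementation would instead emit the conjunction of split conditions along the path to the leaf.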

### 4.5  Synergy Effects between Grafting and Subdivision in Re-RX with J48graft

Webb (1997) reported various advantages of grafting over standard decision tree algorithms. However, previous papers have overlooked the potential capabilities of J48graft because they evaluated only the advantages of using J48graft independently. When J48graft is used within the Re-RX framework, the synergy effects of both can be realized.

Hayashi (2017) provided a theoretical analysis of synergy effects between grafting and subdivision in Re-RX with J48graft from two points of view: the number of rules increased by subdivisions of the Re-RX algorithm and the number of rules increased by grafting in J48graft. Hayashi (2017) demonstrated that the mechanism of Re-RX with J48graft is also expected to avoid the problem of overfitting for the highly imbalanced Thyroid Disease Data Set (UCI Repository, 2015). It can therefore be assumed that Re-RX with J48graft achieves considerably higher classification accuracy.

## 5  Theory

First, we concretely demonstrate the idea of the simple and general-purpose transfer from weights in DBNs to weights in BPNNs (see Figure 1). The proposed simple transfer is considerably different from that previously mentioned (Abdel-Zaher & Eldeib, 2016). Furthermore, we propose a DBN-based Re-RX with J48graft algorithm in which a DBN is used for preprocessing at initialization: DBN Re-RX with J48graft. A schematic diagram of DBN Re-RX with J48graft is shown in Figure 2. The features extracted by the DBN in this preprocessing step enhance classification accuracy.

Figure 1:

Simple and general-purpose transfer mechanism from weights in DBNs to weights in BPNNs.


Figure 2:

Schematic overview of DBN Re-RX with J48graft.


### 5.1  Simple and General-Purpose Transfer from Weights in DBNs to Weights in BPNNs

Regarding the simple transfer shown in Figure 1, we summarize the following important points:

1. Transfer only the weights between the last hidden layer and the output layer of the DBN to the weights between the single hidden layer and the output layer of the BPNN.

2. Match the number of hidden units in the last hidden layer of the DBN to the number of hidden units in the single hidden layer of the BPNN.

That is, we need only a simple one-to-one transfer between the weights in the last hidden layer of the DBN and the single hidden layer of the BPNN to obtain a better initialization point for BP learning in DBN Re-RX with J48graft. This transfer involves no loss of generality, and for various applications it is an easier and more flexible method than a full transfer (Abdel-Zaher & Eldeib, 2016). No other weights are necessary to provide a better starting point for converging the sum-of-squared error function in BP.

The rationale behind this idea is based on the original setting of the well-known perceptron (Rosenblatt, 1958), in which all weights between the input and hidden layers were randomly generated. In the same manner, all weights between the input and hidden layers in the BPNN are randomly generated to start each session of BP learning.
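The two-step transfer above can be sketched in a few lines of numpy. The shapes are illustrative assumptions (9 WBCD attributes, `h` hidden units, 2 output classes), and the variable names are ours:

```python
# Sketch of the simple transfer: copy only the DBN's last-hidden-to-output
# weights into the BPNN; initialize the BPNN's input-to-hidden weights
# randomly, as in the original perceptron setting.
import numpy as np

rng = np.random.default_rng(0)
n_in, h, n_out = 9, 3, 2   # illustrative sizes (assumption)

# Weights between the last hidden layer and the output layer of a trained DBN.
W_dbn_last = rng.normal(size=(h, n_out))

# BPNN initialization.
W_in_hidden = rng.normal(scale=0.1, size=(n_in, h))  # random start for BP
W_hidden_out = W_dbn_last.copy()                     # the one-to-one transfer
```

Only `W_hidden_out` carries information from the DBN; BP learning then fine-tunes both matrices from this initialization point.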

### 5.2  DBN Re-RX with J48graft

The mechanism of the proposed DBN Re-RX with J48graft is shown in Figure 2. The Re-RX algorithm (Setiono et al., 2008) does not make any assumptions regarding the NN architecture. The same holds for Re-RX with J48graft (Hayashi & Nakano, 2015; Hayashi, 2017), a variant of the Re-RX algorithm that prioritizes interpretability, using the J48graft decision tree to enhance conciseness and accuracy. Therefore, we can structure a BPNN in accordance with a DBN. Ideally, we should restrict ourselves to BPNNs with one hidden layer, as such networks have been shown to possess the universal approximation property (Hornik, Stinchcombe, & White, 1989).

In our experience, however, the number of hidden units in the hidden layer can be varied in Re-RX with J48graft. Pruning (Setiono, 1997) is a crucial component of Re-RX with J48graft: by removing the inputs that are not needed for problem solving, the extracted rule set can be expected to be more concise. When network training terminates, many of the network connections have weights that are very close to zero.
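The pruning idea just described can be sketched as simple magnitude thresholding; this is an assumption for illustration (the cited pruning method of Setiono, 1997, is penalty-based and more involved), and both function names are ours:

```python
# Sketch of pruning: zero out connections whose trained weights are close
# to zero; inputs with no remaining connections drop out of the rules.
import numpy as np

def prune_weights(W, eps=1e-2):
    W = W.copy()
    W[np.abs(W) < eps] = 0.0
    return W

def retained_inputs(W):
    """Indices of input units that still have at least one connection."""
    return np.flatnonzero(np.any(W != 0.0, axis=1))
```

After pruning, only the attributes indexed by `retained_inputs` need appear as antecedents in the extracted rules, which is what keeps the rule set concise.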

Therefore, in our experiments, the number of hidden units in the hidden layer of a BPNN can be set as low as one. Although the number of hidden units can be set arbitrarily, more hidden units require more training time and yield more complex extracted rules. In our experience with various rating category data sets, the number of hidden units in the hidden layer can affect the balance between accuracy and interpretability (Hayashi & Nakano, 2015).

The number of hidden units in the hidden layer of a BPNN must be identical to the number of hidden units in the last hidden layer of a DBN. However, there are no restrictions on the number of hidden layers in a DBN.

## 6  Experiments

First, we explored the effects of the number of layers in a DBN and the number of hidden units in a hidden layer of a DBN for high-level abstraction of the WBCD, Mammographic Mass, and Dermatology data sets, the characteristics of which are shown in Table 1. Second, based on the method described in section 5, after full training for the WBCD, Mammographic Mass, and Dermatology data sets, we transferred the specified weights in the DBN to weights in the NN using Re-RX with J48graft, resulting in a better initialization point to converge the sum-of-squared error function for BP learning.

Table 1:
The Data Sets Used and Their Characteristics.
| Name of Data Set | Number of Features | Number of Classes | Number of Samples |
| --- | --- | --- | --- |
| WBCD | 9 | 2 | 699 |
| Mammographic Mass | 6 | 2 | 961 |
| Dermatology | 34 | 6 | 366 |

Next, we used k-fold cross validation (CV; Salzberg, 1997) to evaluate the classification rule accuracy on the test data sets and guarantee the validity of the results. We trained on the WBCD, Mammographic Mass, and Dermatology data sets using Re-RX with J48graft and computed, for the test data set, the average classification accuracy over 10 runs of 10-fold CV or 10 runs of 5-fold CV (TS ACC), the number of extracted rules (# rules), the average number of antecedents per rule (Ave. # ante.), and the area under the receiver operating characteristic curve (AUC-ROC; Marqués, García, & Sanchéz, 2013). The parameter settings for DBN Re-RX with J48graft are shown in Table 2.
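This repeated-CV protocol is straightforward to reproduce with scikit-learn. The sketch below uses scikit-learn's bundled breast cancer data and a plain decision tree as stand-ins for the WBCD and for DBN Re-RX with J48graft (both substitutions are assumptions for illustration):

```python
# Sketch of the 10 x 10 CV evaluation protocol (10 repeats of 10-fold
# stratified cross validation), reporting the mean test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"TS ACC over {len(scores)} folds: {scores.mean():.4f}")
```

The 10 runs of 5-fold CV used for the Dermatology Data Set correspond to `n_splits=5, n_repeats=10`.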

Table 2:
Parameters Used in Training of Three Data Sets for DBN Re-RX with J48graft.
| Data Set | Number of Hidden Layers for DBN | Number of Hidden Units of the Last Hidden Layer for DBN | Learning Rate for BP | Momentum Factor for BP | Number of Hidden Units for BP | Pruning Rate | $δ1$ | $δ2$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WBCD | 3 | 1 | 0.1 | 0.1 | 1 | 0.895 | 0.125 | 0.125 |
| Mammographic Mass | 2 | 3 | 0.1 | 0.1 | 3 | 0.895 | 0.15 | 0.15 |
| Dermatology | 2 | 6 | 0.1 | 0.1 | 6 | 0.895 | 0.16 | 0.16 |

Note: WBCD: Wisconsin Breast Cancer Data Set; DBN: deep belief network; BP: backpropagation; $δ1$: threshold of covering rate; $δ2$: threshold of error rate.

## 7  Results

In general, DBNs are subject to NFL limitations. Nevertheless, as shown in Figures 3 to 5, the DBN achieved very high accuracy (10 × 10 CV or 10 × 5 CV) across varying numbers of hidden layers and varying numbers of units in each hidden layer. We also show the rule sets extracted using DBN Re-RX with J48graft for the WBCD, Mammographic Mass, and Dermatology data sets.

Figure 3:

TS ACC versus number of layers in the DBN for the WBCD.


### 7.1  WBCD

As shown in Figure 3, the DBN for the WBCD consisting of eight layers with four hidden units achieved a TS ACC of 95.69%, which appeared to plateau as a result of overfitting.

#### 7.1.1  Extracted Rules for WBCD

First, we tabulate the performances for the WBCD using DBN Re-RX with J48graft and, for comparison, DBN Re-RX in Table 3. Note that Re-RX here denotes the combination of the Re-RX algorithm and C4.5. Next, we show the rule set extracted for the WBCD using DBN Re-RX with J48graft:

1. If UCSI $≤$ 3 & BN $≤$ 4 Then Benign

2. If UCSI $≤$ 3 & UCSH $≤$ 1 & BN $>$ 4 Then Benign

3. If UCSI $≤$ 3 & UCSH $>$ 1 & BN $>$ 4 Then Malignant

4. If UCSI $>$ 3 Then Malignant¹
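The four extracted rules above can be written directly as an executable classifier, which illustrates their interpretability; the function name is ours and the attribute abbreviations follow note 1 (UCSI: uniformity of cell size; UCSH: uniformity of cell shape; BN: bare nuclei):

```python
# The extracted WBCD rule set as an if-then classifier over the
# 1-10 graded cytological attributes.
def classify_wbcd(ucsi, ucsh, bn):
    if ucsi <= 3 and bn <= 4:
        return "Benign"        # rule 1
    if ucsi <= 3 and ucsh <= 1 and bn > 4:
        return "Benign"        # rule 2
    if ucsi <= 3 and ucsh > 1 and bn > 4:
        return "Malignant"     # rule 3
    return "Malignant"         # rule 4 (UCSI > 3)
```

Because the antecedents are mutually exclusive and exhaustive, exactly one rule fires for every input.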

Table 3:
Performances for the WBCD Using DBN Re-RX with J48graft and DBN Re-RX.

| WBCD Data Set | TS ACC (%) | Ave. # Rules | AUC-ROC (%) | Ave. # ante. |
| --- | --- | --- | --- | --- |
| DBN Re-RX with J48graft [10 × 10 CV] | 96.83 | 7.3 | 96.8 | 3.04 |
| DBN Re-RX [10 × 10 CV] | 95.18 | 5.18 | 95.0 | 2.44 |

Note: WBCD: Wisconsin Breast Cancer Data Set; CV: cross validation; Re-RX: recursive-rule eXtraction; TS: testing data set; ACC: accuracy; Ave. # rules: average number of rules; AUC-ROC: area under the receiver operating characteristic curve; 10 × 10 CV: 10 runs of 10-fold CV; Ave. # ante.: average number of antecedents.

### 7.2  Mammographic Mass Data Set

As shown in Figure 4, the DBN for the Mammographic Mass Data Set consisting of two layers with three hidden units achieved a TS ACC of 85.01%. Beyond four layers of the DBN, TS ACC declined sharply.

Figure 4:

TS ACC versus number of layers in the DBN for the Mammographic Mass Data Set.


#### 7.2.1  Extracted Rules for Mammographic Mass Data Set

First, we tabulate the performances for the Mammographic Mass Data Set using DBN Re-RX with J48graft and, for comparison, DBN Re-RX in Table 4. Next, we show the rule set extracted for the Mammographic Mass Data Set using DBN Re-RX with J48graft:

1. If BI-RADS $≤$ 4 & Age $≤$ 64 Then Benign

2. If BI-RADS $≤$ 4 & Age $>$ 64 Then Malignant

3. If BI-RADS $>$ 4 Then Malignant²
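As with the WBCD, the three rules above translate directly into code; the function name is ours (BI-RADS: Breast Imaging Reporting and Data System assessment; age in years):

```python
# The extracted Mammographic Mass rule set as an if-then classifier.
def classify_mass(bi_rads, age):
    if bi_rads <= 4 and age <= 64:
        return "Benign"        # rule 1
    if bi_rads <= 4 and age > 64:
        return "Malignant"     # rule 2
    return "Malignant"         # rule 3 (BI-RADS > 4)
```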

Table 4:
Performances for the Mammographic Mass Data Set Using DBN Re-RX with J48graft and DBN Re-RX.
| Mammographic Mass Data Set | TS ACC (%) | Ave. # Rules | AUC-ROC (%) | Ave. # ante. |
| --- | --- | --- | --- | --- |
| DBN Re-RX with J48graft [10 × 10 CV] | 84.18 | 2.79 | 83.93 | 1.46 |
| DBN Re-RX [10 × 10 CV] | 82.99 | 3.16 | 82.79 | 1.70 |

Note: CV: cross validation; Re-RX: recursive-rule eXtraction; TS: testing data set; ACC: accuracy; Ave. # rules: average number of rules; AUC-ROC: area under the receiver operating characteristic curve; 10 × 10 CV: 10 runs of 10-fold CV.

### 7.3  Dermatology Data Set

As shown in Figure 5, the DBN for the Dermatology Data Set consisting of one layer with eight hidden units achieved a TS ACC of 97.54%; after the fourth layer of the DBN, TS ACC declined sharply.

Figure 5:

TS ACC versus number of layers in the DBN for the Dermatology Data Set.


#### 7.3.1  Extracted Rules for the Dermatology Data Set Using DBN Re-RX with J48graft

First, we tabulate the performances for the Dermatology Data Set using DBN Re-RX with J48graft and, for comparison, DBN Re-RX in Table 5. Next, we show the rule set extracted for the Dermatology Data Set using DBN Re-RX with J48graft:

1. If elongation of the rete ridges $>$ 0 Then psoriasis

2. If elongation of the rete ridges $=$ 0 & spongiosis $>$ 0 Then lichen planus

3. If fibrosis of the papillary dermis $>$ 0 & elongation of the rete ridges $=$ 0 & spongiosis $=$ 0 Then chronic dermatitis

4. If follicular papules $>$ 0 & fibrosis of the papillary dermis $=$ 0 & elongation of the rete ridges $=$ 0 & spongiosis $=$ 0 Then pityriasis rubra pilaris

5. If Koebner phenomenon $>$ 0 & follicular papules $=$ 0 & fibrosis of the papillary dermis $=$ 0 & elongation of the rete ridges $=$ 0 & spongiosis $=$ 0 Then pityriasis rosea

6. If Koebner phenomenon $=$ 0 & follicular papules $=$ 0 & fibrosis of the papillary dermis $=$ 0 & parakeratosis $>$ 0 & elongation of the rete ridges $=$ 0 & focal hypergranulosis $>$ 0 & spongiosis $=$ 0 Then psoriasis

7. If Koebner phenomenon $=$ 0 & follicular papules $=$ 0 & fibrosis of the papillary dermis $=$ 0 & exocytosis $>$ 0 & parakeratosis $=$ 0 & elongation of the rete ridges $=$ 0 & focal hypergranulosis $>$ 0 & spongiosis $=$ 0 Then pityriasis rosea

8. If Koebner phenomenon $=$ 0 & follicular papules $=$ 0 & fibrosis of the papillary dermis $=$ 0 & exocytosis $=$ 0 & parakeratosis $=$ 0 & elongation of the rete ridges $=$ 0 & focal hypergranulosis $>$ 0 & spongiosis $=$ 0 Then psoriasis

9. If Koebner phenomenon $=$ 0 & follicular papules $=$ 0 & fibrosis of the papillary dermis $=$ 0 & elongation of the rete ridges $=$ 0 & focal hypergranulosis $=$ 0 & spongiosis $=$ 0 Then seborrheic dermatitis
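The nine rules above can be encoded as an ordered rule list: because each later rule's antecedents negate the earlier ones, evaluating the rules in order and keeping only the new condition in each reproduces the full rule set. The abbreviated feature names and function name are ours; the grades are integers, with 0 meaning absent:

```python
# The extracted Dermatology rule set as an ordered (first-match) rule list
# over the graded histopathological attributes.
DERM_RULES = [
    (lambda f: f["elongation"] > 0, "psoriasis"),                         # rule 1
    (lambda f: f["spongiosis"] > 0, "lichen planus"),                     # rule 2
    (lambda f: f["fibrosis"] > 0, "chronic dermatitis"),                  # rule 3
    (lambda f: f["follicular_papules"] > 0, "pityriasis rubra pilaris"),  # rule 4
    (lambda f: f["koebner"] > 0, "pityriasis rosea"),                     # rule 5
    (lambda f: f["parakeratosis"] > 0 and f["focal_hypergranulosis"] > 0,
     "psoriasis"),                                                        # rule 6
    (lambda f: f["exocytosis"] > 0 and f["focal_hypergranulosis"] > 0,
     "pityriasis rosea"),                                                 # rule 7
    (lambda f: f["focal_hypergranulosis"] > 0, "psoriasis"),              # rule 8
    (lambda f: True, "seborrheic dermatitis"),                            # rule 9
]

def classify_derm(features):
    """Return the diagnosis of the first rule whose condition holds."""
    for condition, diagnosis in DERM_RULES:
        if condition(features):
            return diagnosis
```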

Table 5:
Performances for the Dermatology Data Set Using DBN Re-RX with J48graft and DBN Re-RX.

| Dermatology Data Set | TS ACC (%) | Ave. # Rules | AUC-ROC (%) | Ave. # ante. |
| --- | --- | --- | --- | --- |
| DBN Re-RX with J48graft [10 × 5 CV] | 95.73 | 9.06 | 95.16 | 4.61 |
| DBN Re-RX [10 × 5 CV] | 94.50 | 8.84 | 93.80 | 4.62 |

Note: CV: cross validation; Re-RX: recursive-rule eXtraction; TS: testing data set; ACC: accuracy; Ave. # rules: average number of rules; AUC-ROC: area under the receiver operating characteristic curve; 10 × 5 CV: 10 runs of 5-fold CV.

## 8  Discussion

### 8.1  Transparency of DBN Re-RX with J48graft for the WBCD

To the best of our knowledge, Onan (2015) reported the highest accuracy (99.71%) using 10 CV for the WBCD. However, the competition to achieve better accuracy for the WBCD appears to have plateaued, suggesting that unless diagnostic accuracy can be improved by more than just a couple of percentage points, few significant contributions to interpretable medical diagnosis in clinical settings can be expected (Hayashi & Nakano, 2015).

DBN Re-RX with J48graft achieved the highest accuracy among recent rule extraction methods while extracting concise rules. However, it achieved a TS ACC of 96.83% using 10 × 10 CV, which is only a slight improvement over that obtained using the DBN alone (95.69%).

Compared with the results obtained using DBN Re-RX, the conciseness of the proposed DBN Re-RX with J48graft is at approximately the same level. However, the TS ACC achieved by the proposed method (96.83%) was substantially better than that achieved using DBN Re-RX. This is an important point because the primary purpose of many people who use a DBN is to achieve very high TS ACC, yet deep NNs trained by a DBN confront the black box and nonexplainability problems.

Thus, our proposed method extracted accurate and concise rules for the WBCD; that is, we realized the transparency of deep NNs trained by a DBN.

### 8.2  Transparency of DBN Re-RX with J48graft for the Mammographic Mass Data Set

To the best of our knowledge, Nithya and Santhi (2015) showed the highest accuracy (90%) for the Mammographic Mass Data Set. Benndorf, Burnside, Herda, Langer, and Kotter (2015) achieved the highest AUC-ROC (0.862). However, there is only one published article (Keleş & Keleş, 2013) on rule extraction from the Mammographic Mass Data Set; the authors achieved 80.78% using 10 CV with nine fuzzy rules extracted.

DBN Re-RX with J48graft achieved substantially higher accuracy (84.18%) than DBN Re-RX (82.99%) and the previous rule extraction method (80.78%) while extracting much more concise and interpretable rules (2.79 crisp rules and 1.46 Ave. # ante.) using 10 × 10 CV. The TS ACC obtained by DBN Re-RX with J48graft is approximately the same as that obtained by the DBN alone (85.01%). Furthermore, the 2.79 crisp rules extracted in this letter are remarkably more concise than the nine fuzzy rules obtained by Keleş and Keleş (2013). That is, we realized the transparency of deep NNs trained by a DBN.

### 8.3  Transparency of DBN Re-RX with J48graft for the Dermatology Data Set

To the best of our knowledge, Azar, Inbarani, and Devi (2017) showed the highest TS ACC (97.92%) using 10 CV for the Dermatology Data Set. However, the competition to achieve better accuracy for this data set appears to have plateaued. Chang and Chen (2009) reported a TS ACC of only 90.89% using seven rules.

DBN Re-RX with J48graft achieved a TS ACC of 95.73% and an AUC-ROC of 95.16% using 10 × 5 CV, with 9.06 average rules and 4.61 Ave. # ante.; it achieved a slightly higher TS ACC (95.73%) than DBN Re-RX (94.5%) and a slightly lower TS ACC than the DBN alone (97.54%). Thus, our proposed method achieved considerably higher TS ACC than that reported by Chang and Chen (90.89%) and extracted interpretable rules for the Dermatology Data Set; that is, we realized the transparency of deep NNs trained by a DBN.

### 8.4  Effective Use of High-Level Abstraction Data Sets with Prior Knowledge

The WBCD, Mammographic Mass, and Dermatology data sets are high-level abstraction data sets with prior knowledge rated or graded by pathologists, radiologists, and histopathologists. Similarly, in a recent paper on the detection of diabetic retinopathy using deep learning (Gulshan et al., 2016), all retinal fundus images retrospectively obtained from the United States, India, and France were graded for the presence of diabetic retinopathy according to severity (none, mild, moderate, severe, or proliferative) by many ophthalmologists.

We believe that the empirical results for the WBCD, Mammographic Mass, and Dermatology data sets show that deep NNs cannot offer substantial improvements over shallow NNs, even when considering potential improvements owing to more sophisticated training methods.

### 8.5  Effectively Exploiting Exceptions to the NFL Limitations of DBNs

A recent empirical overview (Gómez & Rojas, 2016) involving small, simple data sets and deep NNs reported finding disappointing results. The models could not produce good results because of the limited number of data sets and, possibly, the low potential for data abstraction.

As Gómez and Rojas (2016) claimed, DBNs are apparently subject to NFL limitations; however, there are exceptions regarding the number of hidden layers in DBNs for the WBCD, Mammographic Mass, and Dermatology data sets, as shown in Figures 3 to 5. To turn these limitations to our advantage, we focus on a limited range of numbers of hidden layers in the DBN. For the WBCD, DBNs with one or two hidden layers maintained a high TS ACC that never decreased, even as the number of hidden units per layer increased to eight, whereas more than two hidden layers contributed little further improvement in TS ACC.

One and two hidden layers in the DBN achieved very high TS ACC for the Mammographic Mass Data Set. The DBN consisting of two hidden layers with three hidden units for the Mammographic Mass Data Set achieved the highest TS ACC (85.01%). After the fourth hidden layer, the TS ACC for the DBN sharply decreased for the Mammographic Mass Data Set.

The DBN of a single hidden layer with eight hidden units for the Dermatology Data Set achieved the highest TS ACC (97.54%), and more than three hidden layers with any numbers of hidden units resulted in a sharp decline in TS ACC. In contrast to the WBCD and Mammographic Mass Data Set with two classes, the Dermatology Data Set has six classes to be correctly classified. As a result, we believe that the DBN for the Dermatology Data Set required the use of many more hidden units to achieve high TS ACC compared with the two-class data sets.

Interestingly, the DBN consisting of one or two hidden layers with more than four hidden units achieved sufficiently high TS ACC. The DBN consisting of three hidden layers with five to eight hidden units still showed much better TS ACC compared with that with fewer hidden units.

These phenomena accord with the discussion of the representational power of DBNs and a puzzling result regarding the best that can be achieved when going from one- to two-layer DBNs (Roux & Bengio, 2008). This is likely because universal approximations (Hornik et al., 1989) by restricted Boltzmann machines (RBMs) (Freund & Haussler, 1991) are constructive and typically not practical, as they would result in RBMs with a potentially very large number of hidden units (Roux & Bengio, 2008). However, the WBCD, Mammographic Mass, and Dermatology data sets require neither many RBMs nor many hidden units.

We believe that the cases involving the WBCD, Mammographic Mass, and Dermatology data sets are not rare, because there are many small, high-level abstraction data sets with prior knowledge in various fields. Straightforward examples include categorical data sets generated from the ratings and gradings of pathologists, radiologists, and histopathologists. Generally, experts can quantitatively rate or grade images in a specific domain to create high-level abstraction data sets with prior knowledge.

To the best of our knowledge, the best classification accuracy for the WBCD is the TS ACC of 99.71% using 10 CV obtained by Onan (2015). Thus, the WBCD does not seem very difficult to classify, as mentioned in the empirical overview (Gómez & Rojas, 2016). By contrast, the German Credit Data Set (UCI Repository, 2015) is a relatively low-level abstraction data set that consists of raw nominal and continuous attributes (mixed attributes) from banking professionals, without prior knowledge. To our knowledge, the best classification accuracy achieved is the TS ACC of 86.47% (10 CV) obtained by Tripathi, Edla, and Cheruku (2018). The German Credit Data Set is well known as one of the most difficult data sets to classify in the machine learning community.

Because the WBCD and the German Credit Data Set belong to entirely different categories in terms of level of abstraction and degree of prior knowledge, they should not be treated as belonging to the same category for classification purposes.

## 9  Conclusion

In this letter, we proposed a rule extraction method that can extract concise and accurate rules directly from deep NNs trained by a DBN. In doing so, we have provided a potential key to filling the gap between the very high learning capability of DBNs and the very high interpretability of rule extraction algorithms such as Re-RX with J48graft.

In future research, we plan to apply DBN Re-RX with J48graft to more practical middle- to high-level abstraction data sets with prior knowledge. The high explainability and interpretability achieved for medical images by rules extracted from DBN-trained networks are promising and are expected to help establish improved technologies, such as digital pathology, radiology, and ophthalmology, that use rule extraction.

## Notes

1. UCSI: uniformity of cell size; UCSH: uniformity of cell shape; BN: bare nuclei.

2. BI-RADS: Breast Imaging Reporting and Data System.

## Acknowledgments

I express my sincere appreciation to Tatsuhiro Oishi and Keisuke Nakajima for their faithful efforts in the numerical experiments conducted during this research. This work was supported in part by the Japan Society for the Promotion of Science through a Grant-in-Aid for Scientific Research (C) (18K11481).

## References

Abdel-Zaher, A. M., & Eldeib, A. M. (2016). Breast cancer classification using deep belief networks. *Expert Systems with Applications*, 46, 139–144.

Andrews, R., Diederich, J., & Tickle, A. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. *Knowledge-Based Systems*, 8, 373–389.

Azar, A. T., Inbarani, M. H., & Devi, K. R. (2017). Improved dominance rough set-based classification system. *Neural Computing and Applications*, 28, 2231–2246.

Benndorf, M., Burnside, E. S., Herda, C., Langer, M., & Kotter, E. (2015). External validation of a publicly available computer assisted diagnostic tool for mammographic mass lesions with two high prevalence research datasets. *Medical Physics*, 42, 4987–4996.

Bengio, Y. (2014). Evolving culture vs local minima. In T. Kowaliw, N. Bredeche, & R. Doursat (Eds.), *Growing adaptive machines* (Studies in Computational Intelligence, Vol. 557, pp. 109–138). Berlin: Springer-Verlag.

Chang, C.-L., & Chen, C.-H. (2009). Applying decision tree and neural network to increase quality of dermatologic diagnosis. *Expert Systems with Applications*, 36, 4035–4041.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., & Vincent, P. (2010). Why does unsupervised pre-training help deep learning? *Journal of Machine Learning Research*, 11, 625–660.

Freund, Y., & Haussler, D. (1991). Unsupervised learning of distributions of binary vectors using 2-layer networks. In J. Moody, S. J. Hanson, & R. Lippmann (Eds.), *Advances in neural information processing systems*, 4. San Francisco: Morgan Kaufmann.

Gómez, D., & Rojas, A. (2016). An empirical overview of the no-free-lunch theorem and its effect on real-world machine learning classification. *Neural Computation*, 28, 216–228.

Gülçehre, C., & Bengio, Y. (2016). Knowledge matters: Importance of prior information for optimization. *Journal of Machine Learning Research*, 17, 1–32.

Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., … Webster, D. R. (2016). Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. *JAMA*, 316, 2402–2410.

Hayashi, Y. (2017). Synergy effects between the grafting and the subdivision in the Re-RX with J48graft for the diagnosis of thyroid disease. *Knowledge-Based Systems*, 131, 170–182.

Hayashi, Y., & Nakano, S. (2015). Use of a recursive-rule extraction algorithm with J48graft to achieve highly accurate and concise rule extraction from a large breast cancer dataset. *Informatics in Medicine Unlocked*, 1, 9–16.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. *Science*, 313, 504–507.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. *Neural Networks*, 2, 359–366.

Keleş, A., & Keleş, A. (2013). Extracting fuzzy rules for the diagnosing of breast cancer. *Turkish Journal of Electrical Engineering and Computer Sciences*, 21, 1495–1503.

LeCun, Y., Cortes, C., & Burges, C. J. C. (1998). *The MNIST database of handwritten digits*. http://yann.lecun.com/exdb/mnist/

Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of deep neural network architectures and their applications. *Neurocomputing*, 234, 11–26.

Luo, C., Wu, D., & Wu, D. (2017). A deep learning approach for credit scoring using credit default swaps. *Engineering Applications of Artificial Intelligence*, 65, 406–420.

Marqués, A. I., García, V., & Sánchez, J. S. (2013). On the suitability of resampling techniques for the class imbalance problem in credit scoring. *Journal of the Operational Research Society*, 64, 1060–1070.

Mohamed, A. A., Luo, Y., Peng, H., Jankowitz, R. C., & Wu, S. (2017). Understanding clinical mammographic breast density assessment: A deep learning perspective. *Journal of Digital Imaging*, 31, 387–392. doi:10.1007/s10278-017-0022-2

Nithya, R., & Santhi, B. (2015). Decision tree classifiers for mass classification. *International Journal of Signal and Imaging Systems Engineering*, 8, 39–45.

Oakden-Rayner, L., Carneiro, G., Bessen, T., Nascimento, J. C., Bradley, A. P., & Palmer, L. J. (2017). Precision radiology: Predicting longevity using feature engineering and deep learning methods in a radiomics framework. *Scientific Reports*, 7, 1648. doi:10.1038/s41598-017-01931-w

Onan, A. (2015). A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer. *Expert Systems with Applications*, 42, 6844–6852.

Quinlan, J. R. (1993). *C4.5: Programs for machine learning*. San Mateo, CA: Morgan Kaufmann.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. *Psychological Review*, 65, 386–408.

Roux, N. L., & Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. *Neural Computation*, 20, 1631–1649.

Salzberg, S. L. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. *Data Mining and Knowledge Discovery*, 1, 317–328.

Setiono, R. (1997). A penalty-function approach for pruning feedforward neural networks. *Neural Computation*, 9, 185–204.

Setiono, R., Baesens, B., & Mues, C. (2008). Recursive neural network rule extraction for data with mixed attributes. *IEEE Transactions on Neural Networks*, 19, 299–307.

Tripathi, D., Edla, D. R., & Cheruku, R. (2018). Hybrid credit scoring model using neighborhood rough set and multi-layer ensemble classification. *Journal of Intelligent and Fuzzy Systems*, 34, 1543–1549.

UCI repository. (2015). *University of California Irvine Machine Learning Repository*. http://archive.ics.uci.edu/

Webb, G. I. (1997). Decision tree grafting. In *Proceedings of the 15th International Joint Conference on Artificial Intelligence* (pp. 846–851). San Mateo, CA: Morgan Kaufmann.

Webb, G. I. (1999). Decision tree grafting from the all-tests-but-one partition. In *Proceedings of the 16th International Joint Conference on Artificial Intelligence* (pp. 702–707). San Mateo, CA: Morgan Kaufmann.

Wolpert, D. H. (1996a). The lack of a priori distinctions between learning algorithms. *Neural Computation*, 8, 1341–1390.

Wolpert, D. H. (1996b). The existence of a priori distinctions between learning algorithms. *Neural Computation*, 8, 1391–1420.