The urban drainage pipe network is the backbone of urban drainage, flood control, and water pollution prevention, and is an essential indicator of the level of urban modernization. A large number of underground drainage pipe networks in aged urban areas were laid long ago and have reached or are approaching the end of their service life, so the repair of drainage pipe networks has attracted extensive attention from all walks of life. Since the Ministry of Ecology and Environment and the National Development and Reform Commission jointly issued the Action Plan for Yangtze River Protection and Restoration in 2019, provinces in the Yangtze River Basin, such as Anhui, Jiangxi, and Hunan, have extensively carried out PPP projects for urban pipeline restoration to improve the quality and efficiency of sewage treatment. Based on the management practice of an urban pipe network restoration project in Wuhu City, Anhui Province, this paper analyzes the problems of lengthy construction periods and repeated operations caused by the mismatch between the design schedule of the restoration scheme and the construction schedule of the pipe network restoration under the existing project management mode, and proposes a model for selecting urban drainage pipe network restoration schemes based on an improved support vector machine. The validity and feasibility of the model are verified with data collected in project practice. The results show that the model performs well in selecting urban drainage pipeline restoration schemes, with an accuracy of up to 90%. The research results can provide methodological guidance and technical support for rapid decision-making in urban drainage pipeline restoration projects.

With rapid urbanization, the shortcomings of urban drainage pipe network construction are becoming apparent daily. Structural and functional defects of the drainage pipe network caused by aging easily lead to problems such as urban waterlogging, sewage overflow, and ground subsidence [1]. Therefore, detecting and repairing the urban drainage pipe network is critical for improving the quality and efficiency of urban sewage treatment, which in turn promotes high-quality urban governance. However, the urban drainage pipe network is long, structurally complicated, subject to many uncertainties, and extensive in range, which significantly influences residents' living environment and urban traffic [2]. Thus, determining the drainage pipe network's status and performance, rapidly designing a reasonable repair scheme from the test results, shortening the overall repair project period, and reducing the construction workload can greatly benefit the social environment and sustainable development.

In recent years, introducing pipe network performance evaluation technology to find the best pipe repair solution has become a research hotspot among domestic and foreign scholars and experts. Existing research mainly uses qualitative and quantitative methods to build prediction models of pipe network performance indicators and formulate maintenance plans. However, a difficult problem remains: the low efficiency of decision making. Literature that applies machine learning technologies to mine historical cases for quick decisions on fixing urban drainage pipe networks is scant.

Thus, this research puts forward an RS-SVM machine learning approach driven by case data for selecting urban drainage network restoration schemes. The main contribution of this study is threefold. First, we combine attribute reduction based on RS technology [3] with SVM technology [4] to give full play to their respective advantages: the reduced data set, which retains strong classification characteristics, is used as the input of the SVM. Second, we propose an RS-SVM model for selecting an urban drainage pipe network repair scheme. The basic idea is to collect historical data sets from urban pipeline repair project management practice, use RS theory to reduce the samples' attributes, use the indirect method of combining binary classifiers to construct a multi-level SVM scheme selection model, and then use the built model to match test samples to repair schemes. Finally, we select case data from the drainage pipeline repair project in Wuhu, Anhui Province for big data analysis, and the effectiveness of the proposed model and method is verified. This study provides decision support for the quick selection of drainage pipeline repair schemes and has practical application value.

As for technology for predicting the state of pipe networks and developing repair strategies, Altarabsheh A et al. [5] used a Markov model to predict the future condition of pipeline networks and chose the most appropriate operational plan with a genetic algorithm (GA), considering construction cost, operation cost, and expected benefits over the whole life cycle of a sewage pipe network. Hernández N et al. [6] used differential evolution as an optimization tool for hyperparameter combinations and combined it with an SVM model for two different management objectives (network and pipe levels); applied to Colombia's main cities of Bogotá and Medellín, the model achieved less than 6% deviation in predicting structural conditions in both cities at the network level. Wang Y J et al. [7] proposed an XGBoost-based model for microbiologically induced concrete corrosion (MICC) with the benefit of automatic hyperparameter optimization. Yu A L et al. [8] put forward five research directions for artificial-intelligence-based forecasting and decision optimization, providing a basis for intelligent decisions on drainage pipeline repair schemes. To predict the future performance of trenchless rehabilitations, Ibrahim B et al. [9] presented condition prediction models for the chemical grouting rehabilitation of pipelines and manholes in the city of Laval, Quebec, Canada. Bakry I et al. [10] presented condition prediction models for CIPP rehabilitation of sewer mains, which can predict the structural and operational conditions of a CIPP rehabilitation from basic inputs such as pipe material and rehabilitation type and date.

As for the development of decision support tools for drainage network repair, Cai X T et al. [11] proposed a sensitivity-based adaptive procedure (SAP) that can be integrated with optimization algorithms; SAP was integrated with the non-dominated sorting genetic algorithm II (NSGA-II) and multiple objective particle swarm optimization (MOPSO) methods. Ulrich A et al. [12] studied a novel solution combining both approaches (pipes and tanks) and proposed a decision support system based on NSGA-II for rehabilitating urban drainage networks through the substitution of pipes and the installation of storage tanks. Debères P et al. [13] used multiple criteria to locate the repair section and made a repair plan according to the pipeline inspection report and economic, social, and environmental indicators. Ramos-Salgado et al. [14] developed a decision support system (DSS) to help water utilities design intervention programs for hydraulic infrastructures. Chen S B et al. [15] summarized the characteristics, causes, and evolution mechanisms of typical defect types of coastal urban drainage networks, providing technical guidance for the evaluation and repair of drainage networks in that area and similar cities. Based on an investigation of the existing underground drainage network in an area of Chongqing, Liu W et al. [16] used the AHP and entropy weight methods to study urban drainage pipe network risk rating and provide decision support for pipeline repair plan design. In view of urban drainage pipe network deterioration, Wang J L et al. [17] used AHP and fuzzy comprehensive evaluation methods to study the status and operational efficiency of urban drainage pipe networks and support decisions on maintenance and repair plans.

This study is inspired by Xie B et al. [18] but differs from previous works, which focus on designing remediation solutions from current drainage inspection results and do not sufficiently explore the value of historical cases for solving the current problem. The choice of an urban drainage pipeline repair plan belongs to the category of multiple attribute decision making [19]. For multiple attribute decision-making problems, policymakers tend to be objective in formulating the optimal alternatives [20]. The support vector machine (SVM) is a machine learning method built on VC dimension theory and the structural risk minimization principle [21]. It has favorable properties, especially for quickly determining the optimal alternative for a current case from multi-attribute case history data [22]. As the SVM's input, case history data sets frequently contain redundant attributes [23], which increase the complexity of SVM training, extend the training time, and reduce decision efficiency. Rough set (RS) theory is a mathematical tool for dealing with uncertainty [24]. The RS attribute reduction algorithm can effectively handle attribute redundancy by removing the redundant attributes that interfere with the SVM. Thus, on the basis of combining rough set and SVM technology, we propose an RS-SVM model for selecting an urban drainage network restoration scheme.

The detailed process of combining RS and SVM to select an urban drainage pipe network repair scheme consists of four components. Its structure is shown in Figure 1, and the role of each component is as follows:

  • Collecting historical data sets and then standardizing them.

  • Using the RS attribute reduction algorithm to reduce the redundant attributes contained in the data set.

  • Training the model of multi-level SVM classification.

  • Using the trained model to match the new detection results.

Figure 1. Detailed process of combining RS and SVM to select an urban drainage pipe network repair scheme.

The related parameters of the model are as follows: {Z1, Z2,…, Zi,…, Zm} is the set of historical drainage pipeline repair cases. The target case, denoted by Z0, is the current case requiring a repair scheme. {C1p, C2p,…, Cjp,…, Cnp} is the attribute set of drainage pipe network detection results. (a01, a02,…, a0j,…, a0n) is the attribute value vector of Z0, where a0j is the value of Cjp for Z0. In this study, the drainage pipe network detection attributes are of two types: numeric and symbolic. For example, pipe length, diameter, and the quantities of various defects are numeric data, whereas pipe material is symbolic data. {S1, S2,…, Sk,…, Sg} is the set of historical repair schemes. According to the characteristics of urban pipeline repair projects, we assume that one scheme may apply to multiple cases, but each case has exactly one final implementation scheme.

3.1 Standardizing

To eliminate the influence of dimension, we need to standardize the data set. Symbolic variables are denoted by {Fi | i ∈ T}, where T = {1, 2, 3, …, t}. We specify the ordering of the symbols in advance; the sequence number of symbol Fi is seq(Fi), where Fi ∈ F and seq(Fi) ∈ T. Symbolic variables are standardized using Equation (1):

(1)

We use min-max normalization to standardize the numeric variables. This method is suitable for indicators with a nonzero range, and the standardized values lie between 0 and 1. Numeric variables are standardized using Equation (2):

x′ = (x − xmin) / (xmax − xmin)    (2)
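As a concrete illustration of the two standardization steps, the following sketch assumes Equation (1) maps each symbol to its normalized sequence number seq(Fi)/t and Equation (2) is the min-max normalization above; the function names and sample values are ours, not the paper's.

```python
import numpy as np

def standardize_symbolic(values, symbol_order):
    """Assumed form of Equation (1): map each symbol to its sequence
    number seq(F_i), normalized by the number of symbols t."""
    t = len(symbol_order)
    seq = {sym: k + 1 for k, sym in enumerate(symbol_order)}
    return np.array([seq[v] / t for v in values])

def standardize_numeric(values):
    """Assumed form of Equation (2): min-max normalization onto [0, 1].
    Requires a nonzero range (max != min)."""
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Example: pipe material (symbolic) and pipe length (numeric)
print(standardize_symbolic(["concrete", "PVC", "HDPE", "concrete"],
                           symbol_order=["concrete", "PVC", "HDPE"]))
print(standardize_numeric([12.0, 30.0, 48.0, 21.0]))
```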

3.2 RS Attribute Reduction

Selecting as many attributes as possible that strongly influence the scheme avoids missing crucial ones. However, if all attributes are input into the SVM model, a large attribute set inevitably increases the model's complexity and reduces its prediction performance. Some attributes have little influence on the selection of repair schemes for urban drainage networks, and the repair schemes are determined to a large extent by a few key attributes that best reflect the characteristics of the categories [25]. Therefore, attribute reduction is conducive to improving the prediction accuracy and efficiency of the SVM model. Because the result of attribute reduction depends on the definition of attribute importance and the reduction rules, it is often not unique, and finding a minimum reduction has been proven to be an NP-hard problem [26]. The general approach is to find an optimal or suboptimal reduction by heuristic search [27]. Thus, we propose an attribute reduction algorithm based on attribute similarity. The basic flow of the algorithm is as follows:

Step 1: All conditional attributes are reduced using the discernibility matrix.

In rough set theory, the tuple S = (U, A, V, f) is defined as an information system, where U is the universe of discourse, A = C∪D is the attribute set, C is the set of conditional attributes, and D is the set of target (decision) attributes. In the information system S, the partition determined by RA is U/RA = {Ki : i ≤ l}, and fl(Ki) denotes the value of attribute cl for the objects in Ki. I(Ki, Kj), defined in Equation (3), is the discernibility matrix entry for Ki and Kj; the main diagonal elements of I are the empty set.

I(Ki, Kj) = {cl ∈ C | fl(Ki) ≠ fl(Kj)}    (3)

The identification equation is as follows:

f = ∧i<j (∨ I(Ki, Kj))    (4)

All attributes of the information system can be reduced using Equation (4). For example, in the information system shown in Table 1, {c1, c2, c3} is the set of conditional attributes and D is the decision attribute.

Table 1.
Information system.

code  c1  c2  c3  D      code  c1  c2  c3  D
Z1                       Z5
Z2                       Z6
Z3                       Z7
Z4                       Z8

According to Table 1, the samples with the same attribute value are merged to obtain the simplified information system shown in Table 2.

Table 2.
Simplified information system.

code            c1  c2  c3  D
K1 = {Z1, Z3}
K2 = {Z2}
K3 = {Z7, Z8}
K4 = {Z4, Z6}
K5 = {Z5}

From Table 2 and Equation (3), the discernibility matrix I shown in Table 3 is obtained.

Table 3.
Discernibility matrix I.

I    K1            K2            K3          K4            K5
K1   Ø             {c1,c2,c3}    {c1,c2}     {c1,c3}       {c1,c3}
K2   {c1,c2,c3}    Ø             {c1,c3}     {c1,c2,c3}    {c1,c2,c3}
K3   {c1,c2}       {c1,c3}       Ø           {c2,c3}       {c2,c3}
K4   {c1,c3}       {c1,c2,c3}    {c2,c3}     Ø             {c3}
K5   {c1,c3}       {c1,c2,c3}    {c2,c3}     {c3}          Ø

According to the discernibility matrix I in Table 3 and Equation (4), all reductions of the information system are obtained as G1 = {c1, c3} and G2 = {c2, c3}.
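Step 1 can be reproduced mechanically: each non-empty matrix entry is read as a disjunction, the whole matrix as a conjunction, and every minimal attribute subset that intersects all entries is a reduction. A minimal sketch over the Table 3 entries (function and variable names are ours); running it also reproduces the core of Step 2 and the relatively necessary attributes of Step 3:

```python
from itertools import combinations

# Non-empty upper-triangle entries of the discernibility matrix I (Table 3)
entries = [
    {"c1", "c2", "c3"}, {"c1", "c2"}, {"c1", "c3"}, {"c1", "c3"},  # vs K1
    {"c1", "c3"}, {"c1", "c2", "c3"}, {"c1", "c2", "c3"},          # vs K2
    {"c2", "c3"}, {"c2", "c3"},                                    # vs K3
    {"c3"},                                                        # vs K4
]
attrs = {"c1", "c2", "c3"}

def hits_all(subset):
    # A subset distinguishes every class pair iff it meets every entry.
    return all(subset & e for e in entries)

reducts = []
for r in range(1, len(attrs) + 1):          # smallest subsets first
    for cand in combinations(sorted(attrs), r):
        s = set(cand)
        if hits_all(s) and not any(known < s for known in reducts):
            reducts.append(s)

print(reducts)                                            # [{c1,c3}, {c2,c3}]
print(set.intersection(*reducts))                         # core: {c3}
print(set.union(*reducts) - set.intersection(*reducts))   # rna: {c1, c2}
```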

Step 2: The core of the information system should be found.

Equation (5) is used to find the core of the information system, where {Gi: i≤l} represents all reductions of the information system.

core() = ∩i≤l Gi    (5)

According to the result of step 1 and Equation (5), the core of the information system can be obtained as core()=G1∩G2={c1, c3}∩{c2, c3}={c3}.

Step 3: The similarity of relatively necessary attributes is calculated with respect to decision attribute D.

If the core is not empty, every reduction contains the core, so the attributes in the core are absolutely necessary. Equation (6) defines the set of relatively necessary attributes, that is, those that appear in some but not all reductions.

rna() = ∪i≤l Gi − ∩i≤l Gi    (6)

According to the result of Step 2 and Equation (6), the relatively necessary attributes of the information system can be obtained: rna()={c1, c3}∪{c2, c3}-{c1, c3}∩{c2, c3}={c1, c2}.

Step 4: Each relatively necessary attribute is added to the core in descending order of its similarity to the decision attribute, forming the set R, until R satisfies the indiscernibility relation IND(R) = IND(C). R is then the relative minimum reduction.

The similarity between conditional attribute {c} and decision attribute D is calculated by Equation (7).

(7)

According to Table 1, S(c1, D) and S(c2, D) are calculated by Equation (7).

By virtue of S(c1, D) > S(c2, D), the relatively necessary attribute c1 is added to core() first, forming the set R1 = {c1, c3}. Since IND(R1) = IND(C) (each relation containing the same 14 object pairs), the indiscernibility relation is satisfied. Thus, {c1, c3} is the relative minimum reduction for this example.

3.3 Selecting Kernel Function

The prerequisite for SVM classification is that the sample space is linearly separable. However, the complexity of the data's spatial distribution increases with the dimension of the sample space. For example, the points in Figure 2 are linearly inseparable in a two-dimensional plane. In Figure 3, a kernel function transforms the samples dimensionally: after the sample set is mapped to a higher-dimensional space, the requirement of linear separability is satisfied. Different kernel functions construct different types of nonlinear decision-surface learning machines in the input space, thus generating different support vector algorithms [28]. We select several kernel functions commonly used in classification problems, expressed as Equations (8)-(11):

Linear kernel: K(xi, xj) = xi·xj    (8)
Poly kernel: K(xi, xj) = (xi·xj + coef0)^degree    (9)
RBF kernel: K(xi, xj) = exp(−gamma·‖xi − xj‖²)    (10)
Sigmoid kernel: K(xi, xj) = tanh(k·(xi·xj) + c)    (11)
Figure 2. Linearly inseparable points.

Figure 3. Linearly separable points.
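In implementation terms, the four kernels (8)-(11) correspond to the kernel options of sklearn.svm.SVC, which the experiments in Section 4 rely on. The following is only a sketch of that mapping: the parameter values are placeholders rather than the optimized values found later, and we assume the paper's penalty coefficient σ corresponds to SVC's C argument and the sigmoid parameters k and c to gamma and coef0.

```python
from sklearn.svm import SVC

# Candidate kernels (8)-(11); parameter values are placeholders only.
candidates = {
    "linear":  SVC(kernel="linear", C=1.0),                   # Eq. (8)
    "poly":    SVC(kernel="poly", degree=3, coef0=1.0),       # Eq. (9)
    "rbf":     SVC(kernel="rbf", gamma=1e-3),                 # Eq. (10)
    "sigmoid": SVC(kernel="sigmoid", gamma=0.05, coef0=3.0),  # Eq. (11)
}
```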

3.4 Training Multi-level SVM Classification Model

3.4.1 Binary Classification Algorithm Based on Historical Cases

Before building a multi-level SVM classification model, a binary classification model should be built. Let the scheme set of the sample be {E, F}. As shown in Figure 4, the binary classification problem is to build a binary classifier that can distinguish schemes E and F from historical cases. The theoretical derivation of the binary classifier is as follows:

Figure 4. Work flow of binary classifier.

We consider sample points in the sample space {xi | i = 1, 2, 3, …, n}, where xi is the vector corresponding to each sample point and yi ∈ {−1, 1} is the class value of each sample point. The equations of the positive hyperplane H1, the negative hyperplane H2, and the decision hyperplane H0 are as follows:

H1: wᵀx + b = 1;  H2: wᵀx + b = −1;  H0: wᵀx + b = 0    (12)

The constrained optimization problem for the maximum value L is written in the following form:

max L = 2/‖w‖  s.t. yi(wᵀxi + b) ≥ 1, i = 1, 2, …, n    (13)

To facilitate calculation, Equation (13) is rewritten as

min (1/2)‖w‖²  s.t. yi(wᵀxi + b) ≥ 1, i = 1, 2, …, n    (14)

The dual problem of Equation (14) is as follows:

maxλ min(w,b) L(w, b, λ) = (1/2)‖w‖² − Σi=1..n λi[yi(wᵀxi + b) − 1],  λi ≥ 0    (15)

Considering the influence of w and b on Lagrange function (15), the following function is constructed.

f(w, b) = (1/2)‖w‖² − Σi=1..n λi[yi(wᵀxi + b) − 1]    (16)

The argument of f(w, b) is unconstrained, and the function is convex with respect to (w1, w2, w3, …, ws, b), so it has a unique minimum point. Its gradient is as follows:

∇f(w, b) = (∂f/∂w, ∂f/∂b) = (w − Σi=1..n λiyixi,  −Σi=1..n λiyi)    (17)

Let ∇f(w, b) = 0; Equation (18) follows from Equation (17):

w = Σi=1..n λiyixi,  Σi=1..n λiyi = 0    (18)

After substituting Equation (18) into Equation (16) and introducing a kernel function, Equation (15) can be written as

maxλ Σi=1..n λi − (1/2) Σi=1..n Σj=1..n λiλjyiyjK(xi, xj)  s.t. Σi=1..n λiyi = 0, λi ≥ 0    (19)

After solving for λ = (λ1, λ2, λ3, …, λn)ᵀ, we obtain w* from wᵀ = Σi=1..n λiyixi; only the support vectors xi* participate in the computation of w*. Positive hyperplane H1: w*·x + b* = 1. Negative hyperplane H2: w*·x + b* = −1. Decision hyperplane H0: w*·x + b* = 0. Once the three hyperplane equations are determined, the binary classifier is constructed.
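For a linear kernel, the dual problem (19) is solved internally by sklearn's SVC, and the fitted object exposes w* (coef_), b* (intercept_), and the support vectors that alone determine them. A minimal sketch with toy data of our own:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-scheme sample set {E, F}, encoded as y in {-1, +1}
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

w_star, b_star = clf.coef_[0], clf.intercept_[0]  # H0: w*.x + b* = 0
print("support vectors:", clf.support_vectors_)   # only these enter w*
print("decision value:", clf.decision_function([[0.5, 0.6]]))
print("predicted scheme:", clf.predict([[0.5, 0.6]]))
```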

3.4.2 Multi-Classification Method Based on Combinatorial Thinking

The repair schemes for urban drainage pipe networks are diverse, so a single binary classifier cannot separate all schemes; multiple binary classifiers must be combined. Common combination approaches include indirect and direct methods [29]. The indirect method constructs a series of binary classifiers in a certain way and combines them to achieve multi-class classification. The direct method folds the parameters of multiple classification surfaces into a single optimization problem and realizes multi-class classification by solving it. Although the direct method looks simple, its optimization involves significantly more variables than the indirect method, and its training speed and classification accuracy are inferior; the problem is more prominent when the training sample size is large. Therefore, as shown in Figure 5, the indirect method and combinatorial thinking are adopted in this study to build a binary-tree multi-level classifier. Let the scheme set of the sample be {scheme 0, scheme 1, scheme 2, …, scheme g−1}, where the number of schemes is g. As shown in Figure 6, the multi-level SVM classification model differentiates all schemes by combining multiple binary classifiers step by step; a sketch of this combination follows the figures.

Figure 5. The algorithm flow of multilevel classifier.

Figure 6. The work flow of multilevel classifier.
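The following is a minimal sketch of the binary-tree combination in Figure 6, assuming each level's classifier SVMk separates scheme k from all higher-numbered schemes (the exact split order is read off Figure 6 and may differ); the class and argument names are ours:

```python
import numpy as np
from sklearn.svm import SVC

class BinaryTreeSVM:
    """Indirect multi-class SVM: classifier SVM_k separates scheme k
    from schemes k+1, ..., g-1 (assumed split order from Figure 6)."""

    def __init__(self, g, **svc_kwargs):
        self.g = g
        self.levels = [SVC(**svc_kwargs) for _ in range(g - 1)]

    def fit(self, X, y):
        for k, clf in enumerate(self.levels):
            mask = y >= k                      # samples not yet split off
            clf.fit(X[mask], (y[mask] > k).astype(int))
        return self

    def predict(self, X):
        out = np.full(len(X), self.g - 1)      # default: last scheme
        undecided = np.ones(len(X), dtype=bool)
        for k, clf in enumerate(self.levels):
            if not undecided.any():
                break
            idx = np.where(undecided)[0]
            pred = clf.predict(X[idx])
            out[idx[pred == 0]] = k            # scheme k decided at level k
            undecided[idx[pred == 0]] = False
        return out

# Usage with the four repair schemes (g = 4), i.e., SVM0, SVM1, SVM2:
# model = BinaryTreeSVM(g=4, kernel="sigmoid", gamma=0.04, coef0=3)
# model.fit(X_train, y_train); model.predict(X_test)
```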

We take the management practice of the urban pipe network restoration project in Wuhu City, Anhui Province as the background. Based on the sklearn.svm and sklearn.metrics packages, we draw conclusions by comparing the RS-SVM machine learning method proposed in this paper with other algorithms (SVM without attribute reduction, the logistic algorithm without attribute reduction, and the logistic algorithm with attribute reduction). The data set selected in this study contains 1500 samples, of which 1000 are randomly selected as the training set and the rest as the test set. All experiments are conducted on a personal desktop computer with a 2 GHz Intel(R) Xeon(R) E5-2620 CPU, 8 GB RAM, and Python 3.8.
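A sketch of the experimental scaffold follows; the split sizes match the paper, but the data are random placeholders for the non-public Wuhu project data set, and plain logistic regression stands in for the Laplacian variant [35], which sklearn does not provide:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Placeholder data: 1500 samples, 68 attributes, 4 repair schemes
rng = np.random.default_rng(0)
X, y = rng.random((1500, 68)), rng.integers(0, 4, 1500)

# 1000 randomly selected training samples, 500 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1000, test_size=500, random_state=0)

models = {
    "SVM (sigmoid kernel)": SVC(kernel="sigmoid", probability=True),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
```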

Referring to the “Technical Specification for Inspection and Evaluation of Urban Drainage Pipes (CJJ 181-2012)”, the 68 conditional attributes {c1p, c2p, …, c68p} are shown in Table 4.

Table 4.
Sixty-eight conditional attributes.

Attribute                          Code        Attribute                          Code
Pipe length                        c1p         Transformation I-IV                c9p-c12p
Buried depth                       c2p         Corrode I-IV                       c13p-c16p
Pipe diameter                      c3p         Wrong mouth I-IV                   c17p-c20p
Material                           c4p         Ups and downs I-IV                 c21p-c24p
Rupture I-IV                       c5p-c8p     Disconnect I-IV                    c25p-c28p
Interface material shedding I-IV   c29p-c32p   Scaling I-IV                       c49p-c52p
Branch pipe dark connection I-IV   c33p-c36p   Obstacle I-IV                      c53p-c56p
Foreign body penetration I-IV      c37p-c40p   Residual wall and dam root I-IV    c57p-c60p
Leakage I-IV                       c41p-c44p   Root I-IV                          c61p-c64p
Deposition I-IV                    c45p-c48p   Scum I-IV                          c65p-c68p

The samples involve four repair schemes. To facilitate differentiation, the schemes are numbered in this study, and the numbers are taken as target attribute values, as shown in Table 5. According to Figure 6, three classifiers are needed for the four schemes: SVM0, SVM1, and SVM2.

Table 5.
Value of each scheme target attribute.

Scheme                             Code        Value
Not repair                         Scheme 0    0
Excavation and reconstruction      Scheme 1    1
Ultraviolet light polymerization   Scheme 2    2
Spot in situ curing                Scheme 3    3

To compress the data distribution, eliminate the impact of dimension, and improve the efficiency of attribute reduction and classification of SVM, we standardize the attribute values of pipeline material (c4p) according to Equation (1) and standardize those of other attributes according to Equation (2). After standardizing, the distribution of sample attribute values corresponding to the four repair schemes is shown in Figure 7. In a repair scheme, the larger the sample size corresponding to the attribute value under an attribute, the darker the color of the area. If the attribute value under an attribute corresponds to a smaller sample size, the color of the area is lighter.

Figure 7. Distribution of attribute values of four restoration schemes.

The pipeline defect types shown in Table 6 do not appear in any sample pipe segment of the pipeline network repair project in Wuhu City, Anhui Province. Therefore, under the existing samples, the attributes in Table 6 do not affect the selection of repair schemes and can be preliminarily reduced.

Table 6.
Type of defect not present in samples.

Attribute                        Code    Attribute                         Code
Transformation III               c11p    Wrong mouth IV                    c20p
Transformation IV                c12p    Disconnect IV                     c28p
Corrode IV                       c16p    Interface material shedding IV    c32p
Wrong mouth III                  c19p    Branch pipe dark connection IV    c36p
Foreign body penetration IV      c40p    Residual wall and dam root I-IV   c57p-c60p
Leakage IV                       c44p    Root III                          c63p
Scaling III                      c51p    Root IV                           c64p
Scaling IV                       c52p    Scum I-IV                         c65p-c68p

According to Equations (3)-(5), the core of the sample information system is obtained: core() = {c2p, c3p, c5p, c6p, c7p, c8p, c9p, c13p, c14p, c15p, c17p, c18p, c21p, c22p, c24p, c25p, c26p, c27p, c29p, c30p, c33p, c34p, c35p, c37p, c38p, c39p, c41p, c42p, c43p, c45p, c46p, c47p, c48p, c49p, c50p, c53p, c54p, c55p, c56p, c61p, c62p}. According to Equation (6), the relatively necessary attributes are obtained: rna() = {c1p, c4p, c10p, c23p, c31p}. All reductions of the sample information system are G1 = {core(), c1p}, G2 = {core(), c4p}, G3 = {core(), c10p}, G4 = {core(), c23p}, and G5 = {core(), c31p}. According to Equation (7), S(c4p,D) > S(c10p,D) > S(c31p,D) > S(c23p,D) > S(c1p,D); therefore, the relatively necessary attribute c4p is added to core() first, forming the set R = {c4p, core()}. As IND(R) = IND(C) satisfies the indiscernibility relation, R = {c4p, core()} is the relative minimum reduction of this example, and {c1p, c10p, c23p, c31p} is further reduced as redundant attributes. Afterward, the 42 attributes in R are sorted by subscript from smallest to largest. The distribution of sample attribute values corresponding to the four repair schemes is shown in Figure 8. Attribute numbers 42 to 68 in Figure 8 correspond to the reduced redundant attributes and show no peak values. Comparing Figures 7 and 8 shows that attribute reduction removes redundancy and compresses the dimension of the data set's distribution.

Figure 8. Sample attribute value distribution of the four restoration schemes after reduction.

In recent years, domestic and foreign experts and scholars have proposed many methods to evaluate the prediction effect of SVM, most based on the Macro-Average ROC curve [30]. These curves can qualitatively evaluate the prediction effect of different SVM classification models [31], and the area under the curve (Macro-Average AUC) can quantitatively evaluate their prediction accuracy. For brevity, we use MAR for Macro-Average-ROC and MAA for Macro-Average-AUC.

Suppose the number of test samples is n and the number of categories is g. After training is completed, the probability of each test sample under each category is calculated, and a matrix P with n rows and g columns is obtained; each row of P gives the probability values of one test sample under each category. Accordingly, the label of each test sample is converted to a binarized form in which each position marks whether the sample belongs to the corresponding category, giving a label matrix L with n rows and g columns.

The basic idea of macro averaging is as follows. Under each category, we obtain the probability that each of the n test samples belongs to that category (a column of the matrix P). From each corresponding column of the probability matrix P and the label matrix L, the false positive rate (FPR) and true positive rate (TPR) under each threshold can be calculated, and a ROC curve drawn; the FPR and TPR of the current class determine the ROC curve of that class, so g ROC curves are plotted in total. FPR_all is obtained by merging, de-duplicating, and sorting the FPR values of all these curves. For each class, linear interpolation [32] is used to fill in the TPR values at the coordinates of FPR_all that do not exist in that class's own FPR, yielding the interpolated TPR′. TPR_mean is obtained by arithmetically averaging the g interpolated TPR′ curves. Finally, the MAR curve is drawn from FPR_all and TPR_mean. The equations of TPR_mean and MAR are as follows:

TPR_mean = (1/g) Σk=1..g TPR′k    (20)
(21)
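The macro-averaging procedure maps directly onto sklearn.metrics. A minimal sketch, assuming P comes from a probability-producing classifier (predict_proba) and L from sklearn.preprocessing.label_binarize:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def macro_average_roc(L, P):
    """MAR curve and MAA from an n-by-g binarized label matrix L and
    probability matrix P, following the procedure described above."""
    g = P.shape[1]
    curves = [roc_curve(L[:, k], P[:, k]) for k in range(g)]

    # FPR_all: merged, de-duplicated, sorted FPRs of all g ROC curves
    fpr_all = np.unique(np.concatenate([fpr for fpr, _, _ in curves]))

    # Interpolate each class's TPR onto FPR_all, then average: TPR_mean
    tpr_mean = np.zeros_like(fpr_all)
    for fpr, tpr, _ in curves:
        tpr_mean += np.interp(fpr_all, fpr, tpr)
    tpr_mean /= g

    return fpr_all, tpr_mean, auc(fpr_all, tpr_mean)  # MAR points, MAA
```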

Among the four SVM kernel functions described in Section 3.3, it is not known in advance which is most suitable for this data set, nor what the best parameter values of each kernel are. Thus, the training set containing 1000 samples is used to select the optimal kernel function and parameters, as shown in Figure 1. Specifically, within each parameter's domain, we plot the MAA of the model after attribute reduction as the parameter values change (Figure 9), so as to determine the optimal parameters and maximum MAA of each kernel function after attribute reduction. To analyze the influence of attribute reduction on the SVM classification effect, we draw the corresponding plot of MAA against parameter values without attribute reduction (Figure 10) and determine the optimal parameters and maximum MAA of each kernel function before reduction.

Figure 9. Parameter optimization results after attribute reduction.

Figure 10. Parameter optimization results before attribute reduction.

(1) Linear kernel parameter optimization

The penalty coefficient σ is an important parameter of the linear kernel function. The greater the σ, the greater the penalty for misclassification: the accuracy on the training set is high, but the generalization ability is weak. The smaller the σ, the lower the penalty for misclassification, so more misclassified points are tolerated as noise and the generalization ability is stronger [33]. For the linear kernel function, before attribute reduction (Figure 10 (a)), if 0<σ<0.4, the value of MAA increases with σ. If 0.4≤σ≤6.4, MAA stabilizes around 0.62. If σ>6.4, the data are overfitted as σ increases, and the value of MAA first decreases and then stabilizes around 0.56. After attribute reduction (Figure 9 (a)), if 0<σ<0.3, the value of MAA increases with σ. If 0.3≤σ≤6.0, MAA stabilizes around 0.66. If σ>6.0, the data are overfitted as σ increases, and the value of MAA first decreases and then stabilizes around 0.58. Thus, before attribute reduction, the optimal parameter interval of the linear kernel is [0.4, 6.4], where MAA reaches 0.612; after attribute reduction, it is [0.3, 6.0], where MAA reaches 0.656.

(2) Poly kernel parameter optimization

The polynomial degree (degree) and independent coefficient (coef0) are important parameters of the poly kernel function. The greater the degree, the higher the dimension of the mapped space and the higher the complexity of computing the polynomial [34]. In particular, if degree=1 and coef0=0, the poly kernel is equivalent to the linear kernel. With coef0=1, we study the influence of degree on the classification accuracy of the poly kernel function. Before attribute reduction (Figure 10 (b)), if 1≤degree≤3, the value of MAA increases with degree. If 3≤degree≤5, the classification accuracy is stable at around 0.59. If 16≤degree≤26, the MAA value fluctuates slightly between 0.51 and 0.53. As degree increases further, the computational complexity grows and the value of MAA stabilizes around 0.51. After attribute reduction (Figure 9 (b)), if 1≤degree≤2, the value of MAA increases with degree. If 2≤degree≤4, the value of MAA is stable around 0.72. If 10≤degree≤14, the MAA values show another peak, slightly lower than the first peak of 0.72. As degree increases further, the computational complexity grows and the value of MAA stabilizes around 0.61. The optimal parameter interval of the poly kernel before attribute reduction is [3, 5], where MAA reaches 0.583; after attribute reduction, it is [2, 4], where MAA reaches 0.716.

(3) RBF kernel parameter optimization

Gamma, an important parameter of the RBF kernel function, controls the penalty threshold range and mainly defines the influence of a single sample on the classification hyperplane. If gamma is small, a single sample has little influence on the whole classification hyperplane, so it is rarely selected as a support vector. Conversely, a single sample has a greater impact on the hyperplane and is more likely to be selected as a support vector, so the model as a whole has more support vectors [33]. In this study, the optimal gamma value is searched within [10⁻³, +∞), and the search ends when the value of MAA becomes stable as gamma increases. Before attribute reduction (Figure 10 (c)), if 0<gamma<3.3×10⁻³, the value of MAA increases with gamma. If 3.3×10⁻³≤gamma≤6.4×10⁻³, MAA is stable around 0.72. If gamma≥2.6×10⁻², the value of MAA is stable around 0.67 as gamma increases. After attribute reduction (Figure 9 (c)), if 0<gamma<1.5×10⁻³, the value of MAA increases with gamma. If 1.5×10⁻³≤gamma≤2.0×10⁻³, MAA is stable around 0.79. If 2.0×10⁻³<gamma<3.2×10⁻³, the value of MAA shows a decreasing trend. If gamma≥3.2×10⁻³, the value of MAA is stable around 0.66. Thus, the optimal interval of the RBF kernel parameter before attribute reduction is [3.3×10⁻³, 6.4×10⁻³], where MAA reaches 0.718; after attribute reduction, it is [1.5×10⁻³, 2.0×10⁻³], where MAA reaches 0.793.

(4) Sigmoid kernel parameter optimization

k (gamma) and c (coef0) are two important parameters of the sigmoid kernel function [34]. To comprehensively consider the influence of k and c on classification accuracy, this study takes (k, c) as the independent variable and the value of MAA as the dependent variable, calculates the MAA of the sigmoid kernel function within the search range 0<k≤0.1 and 0≤c≤10, and draws spatial scatter plots containing 100 points (Figure 9 (d) and Figure 10 (d)). Before attribute reduction (Figure 10 (d)), if 0.04≤k≤0.07 and 3≤c≤6, the MAA value of the sigmoid kernel function reaches 0.735; for other values of (k, c), the MAA values are all below 0.735. After attribute reduction (Figure 9 (d)), if 0.03≤k≤0.05 and 1≤c≤5, the MAA value reaches 0.825; for other values of (k, c), the MAA values are all below 0.825. Thus, before attribute reduction, the optimal interval of the sigmoid kernel parameters is k∈[0.04, 0.07] and c∈[3, 6], where MAA reaches 0.732; after attribute reduction, it is k∈[0.03, 0.05] and c∈[1, 5], where MAA reaches 0.825.

The test set containing 500 samples is used to evaluate the prediction effect of the proposed RS-SVM classification model. To verify the effectiveness and feasibility of the proposed RS-SVM machine learning method for the urban drainage network repair scheme selection model, we also compare the algorithm with the Laplacian logistic regression algorithm [35]. The parameter optimization results of Section 4.4 and the prediction results of logistic regression are shown in Table 7. According to Table 7, we draw the MAR curve of each kernel function with optimal parameters and the MAR curve of logistic regression, as shown in Figure 11. Figure 12 shows how the average time of each kernel function and of logistic regression changes with the sample size.

Figure 11. MAR curves of different classification models.

Figure 12. Average time of different classification models.
Table 7.
Comparison models (values given before / after attribute reduction).

SVM:
Kernel    Optimal parameter (before / after)                    MAA (before / after)    Average time in seconds (before / after)
Linear    [0.4, 6.4] / [0.3, 6.0]                               0.612 / 0.656           129.4 / 85.6
Poly      [3, 5] / [2, 4]                                       0.583 / 0.716           159.3 / 103.9
RBF       [0.0033, 0.0064] / [0.0015, 0.0020]                   0.718 / 0.793           161.3 / 100.7
Sigmoid   k∈[0.04, 0.07], c∈[3, 6] / k∈[0.03, 0.05], c∈[1, 5]   0.732 / 0.825           150.5 / 90.6

Laplacian logistic regression: MAA 0.564 / 0.598; average time 523.3 / 483.1 seconds.

From the perspective of the MAR curve, before attribute reduction (Figure 11 (a)), the four kernel functions sorted by MAA in descending order are Sigmoid, RBF, Linear, Poly. After attribute reduction (Figure 11 (b)), the order is Sigmoid, RBF, Poly, Linear. In terms of average time, both before attribute reduction (Figure 12 (a)) and after it (Figure 12 (b)), the classification time of SVM and Laplacian logistic regression increases with the sample size, and the growth rate of Laplacian logistic regression is significantly faster than that of SVM. Considering the MAR curve and the average time together, although the Sigmoid kernel's classification time is not the lowest, it differs little from that of the other kernels while guaranteeing a high MAA. Therefore, the Sigmoid-kernel algorithm with attribute reduction is more suitable than the other algorithms for selecting an urban drainage network repair scheme under the current data set.

Through comparative analysis, the detailed contributions of the rough set and SVM model are as follows:

  1. The rough set attribute reduction algorithm can effectively reduce the sample attributes and improve the classification efficiency and accuracy of multilevel SVM algorithm. Additionally, comparative analysis reveals that the parameter optimization results of the four SVM kernel functions are different before and after attribute reduction. The classification time of the same classification model under different sample sizes is also different.

  2. Compared with the Laplacian logistic regression algorithm, SVM adopts a kernel function mechanism. When calculating the decision surface, only the samples serving as support vectors participate in the computation, which guarantees higher accuracy and shorter running time overall and effectively reduces overfitting. At the same time, SVM builds a linear learning machine in a high-dimensional feature space, which avoids the 'curse of dimensionality' to some extent.

In view of the difficulty of urban drainage network detection and repair and the mismatch between repair scheme design and construction schedules, this study combines the attribute-similarity-based rough set attribute reduction algorithm with a multi-level support vector machine to construct an RS-SVM model for selecting an urban drainage network repair scheme. We select the case data of the Wuhu pipe network repair project for big data analysis. In general, the scheme selection model proposed in this study can effectively predict the urban drainage network repair scheme and provides a basis for its rapid selection. The method can also be extended to fields such as disease diagnosis, project risk assessment, fire identification, and bank loan risk prediction, and thus has high practical application value. However, in collecting case data, we considered neither multi-scheme combination repair nor the use of different kernel function combinations for optimization in the classification algorithm. Therefore, future studies can proceed from the following aspects:

  1. According to the characteristics of the project, the urban pipeline repair scheme is further subdivided, and the pipeline sections repaired by combining multiple schemes are taken as samples into the training set to study a more accurate pipeline repair construction scheme.

  2. The prediction efficiency and accuracy of RS-SVM machine learning can be further improved by combining multiple kernel functions for SVM multi-level classifiers.

  3. The sample data of multiple pipeline repair projects are selected, and other prediction methods, such as random forest, are selected for comparative experiments, which not only enriches the experimental results but also improves the accuracy of pipeline repair schemes and provides auxiliary support for designers to make quick decisions.

This work is supported by the Funds for the Anhui Provincial Science and Technology Innovation Strategy and Soft Science Research Project (Grant Number 202206f01050017), National Natural Science Foundation of China (Grant Number 72131006, 6201101347, 72071063), Fundamental Research Funds for the Central Universities (Grant Number JS2021ZSPY0037), Research Project of China Three Gorges Corporation (Grant Number 202103355), Yangtze Ecology and Environment Co., Ltd. (Grant Number HB/AH2021039), and Power China Huadong Engineering Corporation Limited (KY2019-ZD-03).

Li Jiang (E-mail: [email protected]): has participated in the proposed model design and writing of the manuscript.

Zheng Geng (E-mail: [email protected]): has participated in the coding, the experiments and analysis, and the writing of the manuscript.

DongXiao Gu (E-mail: [email protected]): has participated in part of the experiments and analysis.

Shuai Guo (E-mail: [email protected]): has participated in part of the experiments and analysis.

RongMin Huang (E-mail: [email protected]): has participated in the writing and revision of the manuscript.

HaoKe Cheng (E-mail: [email protected]): has participated in the writing and revision of the manuscript.

KaiXuan Zhu (E-mail: [email protected]): has participated in the writing and revision of the manuscript.

[1] Zhang, Z. Y.: Can the Sponge City Project improve the stormwater drainage system in China?—Empirical evidence from a quasi-natural experiment. International Journal of Disaster Risk Reduction 5(102980), 1–9 (2022)
[2] Liu, Y. X., Ye, S. Z., Lv, B., et al.: Information solution for intelligent detection of drainage pipe network defects. China Water & Wastewater 37(8), 32–36 (2021)
[3] Yan, C., Li, Z., Boota, M. W., et al.: River pattern discriminant method based on Rough Set theory. Journal of Hydrology: Regional Studies 5(101285), 1–14 (2023)
[4] Boran, S., Yoney, K. E., Kamil, D., et al.: Comparative evaluation and comprehensive analysis of machine learning models for regression problems. Data Intelligence 4(3), 620–652 (2022)
[5] Altarabsheh, A., Kandil, A., et al.: New multi-objective optimization approach to rehabilitate and maintain sewer networks based on whole life-cycle behavior. Journal of Computing in Civil Engineering 32(1), 1–20 (2018)
[6] Hernández, N., Caradot, N., Sonnenberg, H., et al.: Optimizing SVM models as predicting tools for sewer pipes conditions in the two main cities in Colombia for different sewer asset management purposes. Structure and Infrastructure Engineering 17(2), 156–169 (2021)
[7] Wang, Y. J., Su, F., Guo, Y., et al.: Predicting the microbiologically induced concrete corrosion in sewer based on XGBoost algorithm. Case Studies in Construction Materials 17(01649), 1–17 (2022)
[8] Yu, A. L.: Forecasting and decision optimization theory and methods based on artificial intelligence. Journal of Management Science 35(1), 60–66 (2022)
[9] Ibrahim, B., Hani, A., Khalid, K., et al.: Condition prediction for chemical grouting rehabilitation of sewer networks. Journal of Performance of Constructed Facilities 30(04016042), 1–11 (2016)
[10] Bakry, I., Alzraiee, H., Masry, M. E., et al.: Condition prediction for cured-in-place pipe rehabilitation of sewer mains. Journal of Performance of Constructed Facilities 30(04016016), 1–12 (2016)
[11] Cai, X. T., Shirkhani, H., et al.: Sensitivity-based adaptive procedure (SAP) for optimal rehabilitation of sewer systems. Urban Water Journal 19(9), 889–899 (2022)
[12] Ulrich, A., Ngamalieu-Nengoue, F., Javier, M., et al.: Multi-objective optimization for urban drainage or sewer networks rehabilitation through pipes substitution and storage tanks installation. Water 11(5), 935–949 (2019)
[13] Debères, P., Ahmadi, M., et al.: Deploying a sewer asset management strategy using the indigo decision support system. Optics Express 16(9), 5997–6007 (2013)
[14] Ramos, S., Cristobal, M., Jesus, A., et al.: A decision support system to design water supply and sewer pipes replacement intervention programs. Reliability Engineering & System Safety 216(107967), 1–16 (2021)
[15] Chen, S. B., Yang, Y. X., Wang, H., et al.: Typical defect types and cause mechanism of sewer pipelines in southern coastal city. Water & Wastewater Engineering 48(1), 464–470 (2022)
[16] Liu, W., et al.: Risk assessment on the drainage pipe network based on the AHP-entropy weight method. Journal of Safety and Environment 21(3), 949–956 (2021)
[17] Wang, J. L., Xiong, Y. H., Zhang, X. G., et al.: Evaluation of the state and operational effectiveness of urban drainage pipe network based on AHP-fuzzy comprehensive evaluation method: taking Huai'an District of Huai'an City as an example. Journal of Environmental Engineering Technology 12(4), 1162–1169 (2022)
[18] Xie, B., Xiang, T., Liao, X. F., et al.: Achieving privacy-preserving online diagnosis with outsourced SVM in internet of medical things environment. IEEE Transactions on Dependable and Secure Computing 19(6), 4113–4126 (2022)
[19] Wu, F., Li, Z. Q., et al.: Research on application of pipeline repair technology in urban water environment management. Water & Wastewater Engineering 48(1), 471–475 (2022)
[20] Mpimis, T., Kapsis, T. T., Panagopoulos, A. D., et al.: Cooperative D-GNSS aided with multi attribute decision making module: a rigorous comparative analysis. Future Internet 14(7), 195 (2022)
[21] Vapnik, V. N.: The nature of statistical learning theory. Springer Verlag, New York (1995)
[22] Fan, H. W., Xue, C. Y., Ma, J. T., et al.: A novel intelligent diagnosis method of rolling bearing and rotor composite faults based on vibration signal-to-image mapping and CNN-SVM. Measurement Science and Technology 34(4), 44008–44022 (2023)
[23] Yu, R., Kong, X. H., et al.: Optimizing the diagnostic algorithm for pulmonary embolism in acute COPD exacerbation using fuzzy rough sets and support vector machine. COPD: Journal of Chronic Obstructive Pulmonary Disease 20(1), 1–8 (2023)
[24] Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11(2), 341–356 (1982)
[25] Li, Y. J., Quan, J. S., Tan, Y. Y., et al.: Attribute reduction for high-dimensional data based on bi-view of similarity and difference. Journal of Computer Applications 12(5), 1–16 (2022)
[26] Zhang, X. Y., et al.: A novel rough set method based on adjustable-perspective dominance relations in intuitionistic fuzzy ordered decision tables. International Journal of Approximate Reasoning 154(2), 218–241 (2023)
[27] Lu, J. R., Chen, L., Meng, K. M., et al.: Identifying user profile by incorporating self-attention mechanism based on CSDN data set. Data Intelligence 1(2), 160–175 (2019)
[28] Zhang, Y. X., Zhang, W. D., Wang, L. K., et al.: Study on stress state of loaded concrete based on PSO-SVM. World Transportation Convention 2022 (WTC 2022), 334–339 (2022)
[29] Zhang, Y. Y., Chen, Y., Yu, S. K., et al.: Bi-GRU relation extraction model based on keywords attention. Data Intelligence 4(3), 552–572 (2022)
[30] Muschelli, J.: ROC and AUC with a binary predictor: a potentially misleading metric. Journal of Classification 37(3), 696–708 (2020)
[31] Li, B., Gatsonis, C., Dahabreh, I. J., et al.: Estimating the area under the ROC curve when transporting a prediction model to a target population. Biometrics 10(5), 1–12 (2022)
[32] Huang, S. X.: Reading the Moody chart with a linear interpolation method. Scientific Reports 12(1), 6587–6599 (2022)
[33] Wu, S. Z., Wang, X. W., Wang, Z. N., et al.: Prediction model of bank lending risk using rough set and support vector machine. Journal of Chengdu University of Technology (Science & Technology Edition) 49(2), 249–256 (2022)
[34] Ding, X., Zhao, X. D., Wu, X. J., et al.: Landslide susceptibility assessment model based on multi-class SVM with RBF kernel. China Safety Science Journal 32(3), 194–200 (2022)
[35] Tian, X. C., et al.: Cost-sensitive Laplacian logistic regression for ship detention prediction. Mathematics 11(1), 119–134 (2022)
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.