## ABSTRACT

The urban drainage pipe network is the backbone of urban drainage, flood control, and water pollution prevention, and is also an essential indicator of the level of urban modernization. A large number of underground drainage pipe networks in aged urban areas were laid long ago and have reached or are approaching the end of their service life, so the repair of drainage pipe networks has attracted extensive attention from all walks of life. Since the Ministry of Ecology and Environment and the National Development and Reform Commission jointly issued the Action Plan for the Yangtze River Protection and Restoration in 2019, provinces in the Yangtze River Basin, such as Anhui, Jiangxi, and Hunan, have extensively carried out PPP projects for urban pipeline restoration to improve the quality and efficiency of sewage treatment. Based on the management practice of an urban pipe network restoration project in Wuhu City, Anhui Province, this paper analyzes the lengthy construction periods and repeated operations caused by the mismatch between the design schedule of the restoration scheme and the construction schedule of pipe network restoration under the existing project management mode, and proposes a model for urban drainage pipe network restoration scheme selection based on an improved support vector machine. The validity and feasibility of the model are analyzed and verified with data collected in project practice. The results show that the model performs favorably in the selection of urban drainage pipeline restoration schemes, with an accuracy of up to 90%. The research results can provide methodological guidance and technical support for rapid decision-making in urban drainage pipeline restoration projects.

## 1. INTRODUCTION

With rapid urbanization, the shortcomings of urban drainage pipe network construction are becoming apparent daily. Structural and functional defects of the drainage pipe network caused by aging readily lead to problems such as urban waterlogging, sewage overflow, and ground subsidence [1]. Therefore, detecting and repairing the urban drainage pipe network is critical for improving urban sewage quality and efficiency, which helps promote high-quality urban governance. However, the urban drainage pipe network is long, structurally complicated, subject to many uncertainties, and extensive in range, which significantly influences residents' living environment and urban traffic [2]. Thus, determining the drainage pipe network's status and performance from its inspection results, rapidly designing a reasonable repair scheme, shortening the overall pipeline repair project period, and reducing construction disturbance can greatly benefit the social environment and sustainable development.

In recent years, using pipe network performance evaluation technology to determine the best pipe repair solution has become a research hot spot among domestic and foreign scholars and experts. Existing research mainly uses qualitative and quantitative methods to establish prediction models of pipe network performance indicators and formulate maintenance plans. However, a difficult problem remains: the low efficiency of decision making. Literature that applies machine learning technologies to mine historical cases for quick decisions on fixing urban drainage pipe networks is scant.

Thus, this research puts forward an RS-SVM machine learning approach driven by case data for selecting urban drainage network restoration schemes. The main contribution of this study is threefold. First, we combine attribute reduction based on RS technology [3] with SVM technology [4] to give full play to their respective advantages: the minimal data set with excellent classification characteristics is used as the input of the SVM. Second, we propose an RS-SVM model for selecting an urban drainage pipe network repair scheme. The basic idea is to collect historical data sets from urban pipeline repair project management practice, use RS theory to reduce the samples' attributes, use the indirect method of combining binary classifiers to construct a multi-level SVM scheme selection model, and then use the built model to select schemes for the test samples through matching analysis. Finally, we select drainage pipeline repair engineering case data from Wuhu, Anhui Province, for big data analysis, and the effectiveness of the proposed model and method is verified. This study provides decision support for the quick selection of drainage pipeline repair schemes and has application value.

## 2. RELATED WORKS

As for technology for predicting the state of pipe networks and developing repair strategies, Altarabsheh A et al. [5] used the Markov model to predict the future condition of sewer networks and chose the most appropriate operational plan by GA according to the whole life cycle of a sewage pipe network, considering construction cost, operation cost, and expected benefits. Hernández N et al. [6] used the differential evolution method as an optimization tool for hyperparameter combination and combined it with the SVM model for two different management objectives (network and pipe levels); applied to Colombia's main cities of Bogotá and Medellín, this model resulted in less than 6% deviation in the prediction of structural conditions in both cities at the network level. Wang Y J et al. [7] proposed an XGBoost-based MICC model with the benefit of hyperparameter auto-optimization. Yu A L et al. [8] put forward five research directions for drainage pipeline repair schemes, providing a basis for intelligent decision making. To predict the future performance of trenchless rehabilitations, Ibrahim B et al. [9] presented condition prediction models for the chemical grouting rehabilitation of pipelines and manholes in the city of Laval, Quebec, Canada. Bakry I et al. [10] presented condition prediction models for CIPP rehabilitation of sewer mains, which can predict the structural and operational conditions of a CIPP rehabilitation from basic input such as pipe material and rehabilitation type and date.

As for the development of decision support tools for drainage network repair, Cai X T et al. [11] proposed a sensitivity-based adaptive procedure (SAP) that can be integrated with optimization algorithms; SAP was integrated with the non-dominated sorting genetic algorithm II (NSGA-II) and multiple objective particle swarm optimization (MOPSO) methods. Ulrich A et al. [12] studied a novel solution combining both approaches (pipes and tanks) and proposed a decision support system based on NSGA-II for the rehabilitation of urban drainage networks through the substitution of pipes and the installation of storage tanks. Debères P et al. [13] used multiple criteria to locate the repair section and made a repair plan according to the pipeline inspection report and economic, social, and environmental indicators. Ramos-Salgado et al. [14] developed a decision support system (DSS) to help water utilities design intervention programs for hydraulic infrastructures. Chen S B et al. [15] summarized the characteristics, causes, and evolution mechanisms of typical defect types of coastal urban drainage networks, which can provide technical guidance for the evaluation and repair of drainage networks in such areas or similar cities. Based on an investigation of the current situation of the existing underground drainage network in a certain area of Chongqing, Liu W et al. [16] used the AHP and entropy weight methods to study urban drainage pipe network risk rating and provide decision support for pipeline repair plan design. In view of urban drainage pipe network deterioration, Wang J L et al. [17] used AHP and fuzzy comprehensive evaluation methods to study urban drainage pipe network status and operational efficiency to provide decision support for drainage pipe network maintenance and repair planning.

This study is inspired by Xie B et al. [18] but differs from previous works, which focused on designating remediation solutions based on current drainage inspection results and did not sufficiently explore the value of historical cases for solving the current problem. The choice of an urban drainage pipeline repair plan belongs to the category of multiple attribute decision making [19]. For multiple attribute decision-making problems, policymakers tend to be objective in formulating the optimal alternatives [20]. The support vector machine (SVM) is a machine learning method established on VC dimension theory and the structural risk minimization principle [21]. It has good properties, especially in quickly determining the optimal alternative for current cases based on multi-attribute case history data [22]. As the SVM's input, case history data sets frequently have redundant attributes [23], which increase the complexity of SVM training, extend the training time, and reduce decision efficiency. A rough set (RS) is a mathematical tool for dealing with uncertainty [24]. The RS attribute reduction algorithm can effectively handle attribute redundancy by removing the redundant attributes that interfere with the SVM. Thus, on the basis of combining rough set and SVM technology, we propose an RS-SVM model for selecting an urban drainage network restoration scheme in this study.

## 3. METHODOLOGY

The detailed process of combining RS and SVM to select an urban drainage pipe network repair scheme consists of four components, the structure of which is shown in Figure 1. The role of each component is as follows:

1. Collecting historical data sets and then standardizing them.

2. Using the RS attribute reduction algorithm to remove the redundant attributes contained in the data set.

3. Training the multi-level SVM classification model.

4. Using the trained model to match the new detection results.

The related parameters of the model are as follows: {Z_{1}, Z_{2},…, Z_{i},…, Z_{m}} is the set of historical cases of drainage pipeline repair projects. The target case, denoted by Z_{0}, is the current case requiring a repair scheme. {C_{1}^{p}, C_{2}^{p},…, C_{j}^{p},…, C_{n}^{p}} is the attribute set of drainage pipe network detection results. (a_{01}, a_{02},…, a_{0j},…, a_{0n}) is the attribute value vector of Z_{0}, where a_{0j} is the value of C_{j}^{p} corresponding to Z_{0}. In this study, the drainage pipe network detection attributes are divided into two types, numeric and symbolic. For example, the length, diameter, and the various defects and their quantities belong to numeric data, whereas the pipe material belongs to symbolic data. {S_{1}, S_{2},…, S_{k},…, S_{g}} is the set of historical solutions of drainage pipeline repair projects. According to the characteristics of urban pipeline repair projects, we presume that a solution may apply to multiple cases, while each case has only one final implementation scheme.

### 3.1 Standardizing

To eliminate the influence of dimension, we need to standardize the data set. Symbolic variables are denoted by {F_{i} | i ∈ T}, where T = {1, 2, 3, …, t}. Thus, we fix the order of the symbols in advance. The sequence number of the symbol F_{i} is seq(F_{i}), where F_{i} ∈ F and seq(F_{i}) ∈ T. The symbolic variables are standardized using Equation (1):

We use a normalization method to standardize the numeric variables. This method is suitable for indicators with a nonzero range; the standardized variable values lie between 0 and 1. The numeric variables are standardized using Equation (2):
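The two standardization steps can be sketched as follows. Since Equations (1) and (2) are not reproduced here, this sketch assumes ordinal encoding seq(F_{i})/t for symbolic attributes and min-max scaling for numeric attributes; both function names and the example values are illustrative, not the paper's implementation.

```python
import numpy as np

def standardize_symbol(values, order):
    """Ordinal encoding: map each symbol F_i to seq(F_i)/t, with the
    symbol order fixed in advance (t = number of symbols)."""
    seq = {sym: k + 1 for k, sym in enumerate(order)}
    t = len(order)
    return np.array([seq[v] / t for v in values])

def standardize_numeric(values):
    """Min-max scaling onto [0, 1]; suitable for indicators with a
    nonzero range."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# Illustrative pipe-material symbols and numeric pipe lengths
print(standardize_symbol(["PVC", "concrete", "PVC"], ["concrete", "PVC", "steel"]))
print(standardize_numeric([10.0, 20.0, 30.0]))  # → [0.  0.5 1. ]
```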

### 3.2 RS Attributes Reduction

Selecting as many attributes as possible that strongly influence the scheme helps avoid missing crucial ones. However, if all attributes are input into the SVM model, a large number of attributes inevitably increases the model's complexity and reduces its prediction performance. Some attributes have little influence on the selection of repair schemes for urban drainage networks, and the repair schemes are determined to a large extent by a few key attributes that best reflect the characteristics of the categories [25]. Therefore, attribute reduction is conducive to improving the prediction accuracy and efficiency of the SVM model. Owing to the influence of the definition of attribute importance and of the reduction rules, the result of attribute reduction is often not unique, and finding a minimum reduction has been proven to be an NP-hard problem [26]. The general approach is to find an optimal or suboptimal reduction by a heuristic search method [27]. Thus, we propose an attribute reduction algorithm based on attribute similarity. The basic flow of the algorithm is as follows:

Step 1: All conditional attributes are reduced using the discernibility attribute matrix.

In rough set theory, the tuple S=(U, A, V, f) is defined as an information system, where U is the universe of discourse, A=C∪D is the attribute set, C represents the conditional attributes, and D represents the decision attributes. In the information system S, the partition determined by R_{A} is U/R_{A}={K_{i}: i≤l}, and f_{l}(K_{i}) represents the value of attribute c_{l} with respect to the objects in K_{i}. I(K_{i}, K_{j}), given by Equation (3), is the discernibility attribute matrix entry for K_{i} and K_{j}, and the main diagonal elements of **I** are the empty set.

The identification equation is as follows:

All attributes of the information system can be reduced using Equation (4). For example, in the information system shown in Table 1, {c_{1}, c_{2}, c_{3}} is the conditional attribute set and D is the decision attribute.

| Code | c_{1} | c_{2} | c_{3} | D | Code | c_{1} | c_{2} | c_{3} | D |
|---|---|---|---|---|---|---|---|---|---|
| Z_{1} | 2 | 1 | 3 | 0 | Z_{5} | 1 | 1 | 2 | 4 |
| Z_{2} | 3 | 2 | 1 | 1 | Z_{6} | 1 | 1 | 4 | 3 |
| Z_{3} | 2 | 1 | 3 | 0 | Z_{7} | 1 | 2 | 3 | 2 |
| Z_{4} | 1 | 1 | 4 | 3 | Z_{8} | 1 | 2 | 3 | 2 |


| Code | c_{1} | c_{2} | c_{3} | D |
|---|---|---|---|---|
| K_{1} = {Z_{1}, Z_{3}} | 2 | 1 | 3 | 0 |
| K_{2} = {Z_{2}} | 3 | 2 | 1 | 1 |
| K_{3} = {Z_{7}, Z_{8}} | 1 | 2 | 3 | 2 |
| K_{4} = {Z_{4}, Z_{6}} | 1 | 1 | 4 | 3 |
| K_{5} = {Z_{5}} | 1 | 1 | 2 | 4 |


| **I** | K_{1} | K_{2} | K_{3} | K_{4} | K_{5} |
|---|---|---|---|---|---|
| K_{1} | Ø | {c_{1},c_{2},c_{3}} | {c_{1},c_{2}} | {c_{1},c_{3}} | {c_{1},c_{3}} |
| K_{2} | {c_{1},c_{2},c_{3}} | Ø | {c_{1},c_{3}} | {c_{1},c_{2},c_{3}} | {c_{1},c_{2},c_{3}} |
| K_{3} | {c_{1},c_{2}} | {c_{1},c_{3}} | Ø | {c_{2},c_{3}} | {c_{2},c_{3}} |
| K_{4} | {c_{1},c_{3}} | {c_{1},c_{2},c_{3}} | {c_{2},c_{3}} | Ø | {c_{3}} |
| K_{5} | {c_{1},c_{3}} | {c_{1},c_{2},c_{3}} | {c_{2},c_{3}} | {c_{3}} | Ø |


According to the discernibility attribute matrix **I** in Table 3 and Equation (4), all the reductions of the information system are obtained as G_{1}={c_{1}, c_{3}} and G_{2}={c_{2}, c_{3}}.

Step 2: The core of the information system should be found.

Equation (5) is used to find the core of the information system, where {G_{i}: i≤l} represents all reductions of the information system.

According to the result of step 1 and Equation (5), the core of the information system can be obtained as core()=G_{1}∩G_{2}={c_{1}, c_{3}}∩{c_{2}, c_{3}}={c_{3}}.

Step 3: The similarity of relatively necessary attributes is calculated with respect to decision attribute D.

If the core is not empty, every reduction contains the core, so the attributes in the core are absolutely necessary attributes. Equation (6) represents the set of relatively necessary attributes, i.e., those that appear in some but not all reductions.

According to the result of Step 2 and Equation (6), the relatively necessary attributes of the information system can be obtained: rna()={c_{1}, c_{3}}∪{c_{2}, c_{3}}-{c_{1}, c_{3}}∩{c_{2}, c_{3}}={c_{1}, c_{2}}.

Step 4: The relatively necessary attributes are added to the core one by one, in descending order of their similarity to the decision attribute, to form the set R, until R satisfies the indiscernibility relation IND(R) = IND(C). R is then the relative minimum reduction.

The similarity between conditional attribute {c} and decision attribute D is calculated by Equation (7).

According to Table 1, S(c_{1},D) and S(c_{2},D) are calculated by the following equation:

By virtue of S(c_{1}, D)>S(c_{2}, D), the relatively necessary attribute c_{1} is added to core() to form the set R_{1}={c_{1}, c_{3}}. Since IND(R_{1})=IND(C)=14, the indiscernibility relation is satisfied. Thus, {c_{1}, c_{3}} is the relative minimum reduction for this example.
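Steps 1 and 2 can be illustrated on the Table 1 data with a minimal sketch. The brute-force search for minimal hitting sets and the helper names are illustrative assumptions, not the paper's implementation; the similarity measure of Equation (7) is not reproduced because its formula is not shown here.

```python
from itertools import combinations

# Table 1 equivalence classes: (c1, c2, c3) -> decision D
classes = {
    "K1": ((2, 1, 3), 0),
    "K2": ((3, 2, 1), 1),
    "K3": ((1, 2, 3), 2),
    "K4": ((1, 1, 4), 3),
    "K5": ((1, 1, 2), 4),
}
attrs = ("c1", "c2", "c3")

# Step 1: discernibility sets I(Ki, Kj) for pairs with different decisions
disc = []
for a, b in combinations(classes, 2):
    (va, da), (vb, db) = classes[a], classes[b]
    if da != db:
        disc.append({attrs[k] for k in range(3) if va[k] != vb[k]})

# A reduction is a minimal attribute subset hitting every discernibility set
def hits_all(subset):
    return all(subset & entry for entry in disc)

reductions = []
for r in range(1, len(attrs) + 1):
    for combo in combinations(attrs, r):
        s = set(combo)
        if hits_all(s) and not any(red < s for red in reductions):
            reductions.append(s)

# Step 2: the core is the intersection of all reductions
core = set.intersection(*reductions)
print(sorted(map(sorted, reductions)), sorted(core))
# → [['c1', 'c3'], ['c2', 'c3']] ['c3'], matching G1, G2, and core() above
```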

### 3.3 Selecting Kernel Function

The prerequisite for SVM classification is that the sample space is linearly separable. However, the complexity of the spatial distribution of the data increases with the dimension of the sample space. For example, in Figure 2, the points are linearly inseparable in a two-dimensional plane. In Figure 3, a kernel function is adopted to transform the samples dimensionally: the requirement of linear separability is satisfied after the sample set is mapped to a higher-dimensional space. Different kernel functions can be used to construct different types of nonlinear decision-surface learning machines in the input space, thus generating different support vector algorithms [28]. We select several kernel functions commonly used in classification problems, expressed in Equations (8)-(11):
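As a small illustration of kernel choice, the four kernels commonly offered by sklearn can be compared on a linearly inseparable set; the concentric-circle data are an assumption standing in for Figure 2, and the kernels are assumed to correspond to Equations (8)-(11).

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: linearly inseparable in the plane, as in Figure 2
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit one SVM per kernel and compare test accuracy
scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, degree=3, coef0=1).fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)
print(scores)
```

On such data the linear kernel cannot separate the classes, while the RBF kernel, which implicitly maps to a higher-dimensional space, classifies almost perfectly.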

### 3.4 Training Multi-level SVM Classification Model

#### 3.4.1 Binary Classification Algorithm Based on Historical Cases

Before building a multi-level SVM classification model, a binary classification model should be built. Let the scheme set of the sample be {E, F}. As shown in Figure 4, the binary classification problem is to build a binary classifier that can distinguish schemes E and F from historical cases. The theoretical derivation of the binary classifier is as follows:

We set some sample points in the sample space {x_{i} | i = 1, 2, 3, …, n}, where **x**_{i} is the vector corresponding to each sample point and y_{i} ∈ {-1, 1} is the class value corresponding to each sample point. The equations of the positive hyperplane H_{1}, the negative hyperplane H_{2}, and the decision hyperplane H_{0} are as follows:
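In standard SVM notation, consistent with the derivation that follows, these three hyperplanes take the form:

```latex
H_{1}:\ \mathbf{w}\cdot\mathbf{x} + b = +1 \quad\text{(positive hyperplane)}
H_{2}:\ \mathbf{w}\cdot\mathbf{x} + b = -1 \quad\text{(negative hyperplane)}
H_{0}:\ \mathbf{w}\cdot\mathbf{x} + b = 0 \quad\text{(decision hyperplane)}
```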

The constrained optimization problem for the maximum value L is written in the following form:

To facilitate calculation, Equation (13) is rewritten as

The dual problem of Equation (14) is as follows:

Considering the influence of **w** and b on Lagrange function (15), the following function is constructed.

The arguments of the function f(**w**, b) are unconstrained, and the function is convex in (w_{1}, w_{2}, w_{3}, …, w_{s}, b), so it has a unique minimum point. Its gradient expression is as follows:

Let ▿f(**w**, b) = **0**; Equation (18) can be obtained according to Equation (17). After substituting Equation (18) into the function and introducing a kernel function, Equation (15) can be written as

After solving for λ = (λ_{1}, λ_{2}, λ_{3}, …, λ_{n})^{T}, we can solve for **w**^{*} via **w**^{*} = ∑_{i=1}^{n} λ_{i}y_{i}**x**_{i}; only the support vectors **x**^{*}_{i} are needed in the computation of **w**^{*}. The positive hyperplane H_{1} is then **w**^{*}·**x** + b^{*} = 1, the negative hyperplane H_{2} is **w**^{*}·**x** + b^{*} = -1, and the decision hyperplane H_{0} is **w**^{*}·**x** + b^{*} = 0. After the three hyperplane equations are determined, the binary classifier is constructed.

#### 3.4.2 Multi-Classification Method Based on Combinatorial Thinking

The repair schemes of urban drainage pipe networks are diverse, so a single binary classifier cannot separate all schemes; multiple binary classifiers must be combined. The common combination approaches are the indirect and direct methods [29]. The indirect method constructs a series of binary classifiers in a certain way and combines them to achieve multi-class classification. The direct method folds the parameter solutions of multiple classification surfaces into a single optimization problem and realizes multi-class classification by solving that problem. Although the direct method looks simple, its optimization involves significantly more variables than the indirect method, and its training speed and classification accuracy are inferior; this problem is more prominent when the training sample size is large. Therefore, as shown in Figure 5, the indirect method and combinatorial thinking are adopted in this study to build a binary-tree multi-level classifier. Let the scheme set of the sample be {scheme 0, scheme 1, scheme 2, …, scheme g-1}; the number of schemes is g. As shown in Figure 6, the multi-level SVM classification model differentiates all schemes by combining multiple binary classifiers step by step.
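A minimal sketch of the binary-tree (indirect) combination is given below. The class name, the peeling order (separating scheme k from schemes k+1, …, g-1 at level k), and the RBF node kernel are illustrative assumptions, not the paper's exact tree layout.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

class BinaryTreeSVM:
    """Indirect multi-class combination: at level k, a binary SVM peels
    scheme k off from schemes k+1, ..., g-1 (one classifier per level).
    Assumes integer scheme labels 0..g-1, as in Table 5."""

    def fit(self, X, y):
        self.labels = np.sort(np.unique(y))
        self.nodes = []
        for label in self.labels[:-1]:
            mask = y >= label                      # samples not yet peeled off
            clf = SVC(kernel="rbf").fit(X[mask], (y[mask] == label).astype(int))
            self.nodes.append((label, clf))
        return self

    def predict(self, X):
        out = np.full(len(X), self.labels[-1])
        undecided = np.ones(len(X), dtype=bool)
        for label, clf in self.nodes:
            hit = undecided & (clf.predict(X) == 1)
            out[hit] = label
            undecided &= ~hit
        return out

# Toy demonstration on three well-separated clusters (illustrative data)
X, y = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=1.0, random_state=0)
model = BinaryTreeSVM().fit(X, y)
print((model.predict(X) == y).mean())
```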

## 4. EXPERIMENTS

We take the management practice of the urban pipe network restoration project in Wuhu City, Anhui Province as the background. Based on the sklearn.svm and sklearn.metrics packages, we draw conclusions by comparing the RS-SVM machine learning method proposed in this paper with other algorithms (SVM without attribute reduction, the logistic algorithm without attribute reduction, and the logistic algorithm with attribute reduction). The data set selected in this study contains 1500 samples, of which 1000 are randomly selected as the training set and the rest as the test set. All experiments are conducted on a personal desktop computer with a 2 GHz Intel(R) Xeon(R) E5-2620 CPU, 8 GB RAM, and Python 3.8.
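The comparison harness can be sketched as follows. The synthetic data set is a stand-in (the Wuhu project data are not reproduced here), and only two of the four compared algorithms are shown; the 1000/500 split follows the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 1500-sample data set (68 attributes, 4 schemes)
X, y = make_classification(n_samples=1500, n_features=68, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=1000, random_state=0)

# Fit each candidate algorithm and record its test accuracy
results = {}
for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("logistic", LogisticRegression(max_iter=1000))]:
    results[name] = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(results)
```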

### 4.1 Data Set

Referring to the “Technical Specification for Inspection and Evaluation of Urban Drainage Pipes (CJJ 181-2012)”, the 68 conditional attributes {c_{1}^{p}, c_{2}^{p}, …, c_{68}^{p}} are shown in Table 4.

| Attribute | Code | Attribute | Code |
|---|---|---|---|
| Pipe length | c_{1}^{p} | Transformation I-IV | c_{9}^{p}-c_{12}^{p} |
| Buried depth | c_{2}^{p} | Corrode I-IV | c_{13}^{p}-c_{16}^{p} |
| Pipe diameter | c_{3}^{p} | Wrong mouth I-IV | c_{17}^{p}-c_{20}^{p} |
| Material | c_{4}^{p} | Ups and downs I-IV | c_{21}^{p}-c_{24}^{p} |
| Rupture I-IV | c_{5}^{p}-c_{8}^{p} | Disconnect I-IV | c_{25}^{p}-c_{28}^{p} |
| Interface material shedding I-IV | c_{29}^{p}-c_{32}^{p} | Scaling I-IV | c_{49}^{p}-c_{52}^{p} |
| Branch pipe dark connection I-IV | c_{33}^{p}-c_{36}^{p} | Obstacle I-IV | c_{53}^{p}-c_{56}^{p} |
| Foreign body penetration I-IV | c_{37}^{p}-c_{40}^{p} | Residual wall and dam root I-IV | c_{57}^{p}-c_{60}^{p} |
| Leakage I-IV | c_{41}^{p}-c_{44}^{p} | Root I-IV | c_{61}^{p}-c_{64}^{p} |
| Deposition I-IV | c_{45}^{p}-c_{48}^{p} | Scum I-IV | c_{65}^{p}-c_{68}^{p} |


| Scheme | Code | Value |
|---|---|---|
| Not repair | Scheme 0 | 0 |
| Excavation and reconstruction | Scheme 1 | 1 |
| Ultraviolet light polymerization | Scheme 2 | 2 |
| Spot in situ curing | Scheme 3 | 3 |


### 4.2 Data Standardization

To compress the data distribution, eliminate the impact of dimension, and improve the efficiency of attribute reduction and SVM classification, we standardize the attribute values of pipeline material (c_{4}^{p}) according to Equation (1) and those of the other attributes according to Equation (2). After standardization, the distribution of sample attribute values corresponding to the four repair schemes is shown in Figure 7: within a repair scheme, the larger the number of samples taking a given value of an attribute, the darker the color of the corresponding area, and the smaller that number, the lighter the color.

### 4.3 Attributes Reduction

| Attribute | Code | Attribute | Code |
|---|---|---|---|
| Transformation III | c_{11}^{p} | Wrong mouth IV | c_{20}^{p} |
| Transformation IV | c_{12}^{p} | Disconnect IV | c_{28}^{p} |
| Corrode IV | c_{16}^{p} | Interface material shedding IV | c_{32}^{p} |
| Wrong mouth III | c_{19}^{p} | Branch pipe dark connection IV | c_{36}^{p} |
| Foreign body penetration IV | c_{40}^{p} | Residual wall and dam root I-IV | c_{57}^{p}-c_{60}^{p} |
| Leakage IV | c_{44}^{p} | Root III | c_{63}^{p} |
| Scaling III | c_{51}^{p} | Root IV | c_{64}^{p} |
| Scaling IV | c_{52}^{p} | Scum I-IV | c_{65}^{p}-c_{68}^{p} |


According to Equations (3)-(5), the core of the sample information system is obtained: core()={c_{2}^{p}, c_{3}^{p}, c_{5}^{p}, c_{6}^{p}, c_{7}^{p}, c_{8}^{p}, c_{9}^{p}, c_{13}^{p}, c_{14}^{p}, c_{15}^{p}, c_{17}^{p}, c_{18}^{p}, c_{21}^{p}, c_{22}^{p}, c_{24}^{p}, c_{25}^{p}, c_{26}^{p}, c_{27}^{p}, c_{29}^{p}, c_{30}^{p}, c_{33}^{p}, c_{34}^{p}, c_{35}^{p}, c_{37}^{p}, c_{38}^{p}, c_{39}^{p}, c_{41}^{p}, c_{42}^{p}, c_{43}^{p}, c_{45}^{p}, c_{46}^{p}, c_{47}^{p}, c_{48}^{p}, c_{49}^{p}, c_{50}^{p}, c_{53}^{p}, c_{54}^{p}, c_{55}^{p}, c_{56}^{p}, c_{61}^{p}, c_{62}^{p}}. According to Equation (6), the relatively necessary attributes are obtained: rna()={c_{1}^{p}, c_{4}^{p}, c_{10}^{p}, c_{23}^{p}, c_{31}^{p}}. All reductions of the sample information system are G_{1}={core(), c_{1}^{p}}, G_{2}={core(), c_{4}^{p}}, G_{3}={core(), c_{10}^{p}}, G_{4}={core(), c_{23}^{p}}, and G_{5}={core(), c_{31}^{p}}. According to Equation (7), S(c_{4}^{p},D)>S(c_{10}^{p},D)>S(c_{31}^{p},D)>S(c_{23}^{p},D)>S(c_{1}^{p},D); therefore, the relatively necessary attribute c_{4}^{p} is first added to core() to form the set R={c_{4}^{p}, core()}. As IND(R)=IND(C) satisfies the indiscernibility relation, R={c_{4}^{p}, core()} is the relative minimum reduction of this example, and {c_{1}^{p}, c_{10}^{p}, c_{23}^{p}, c_{31}^{p}} is reduced as redundant. Afterward, the 42 attributes in R are sorted by subscript from smallest to largest. The distribution of sample attribute values corresponding to the four repair schemes is shown in Figure 8. The attribute numbers from 42 to 68 in Figure 8 correspond to the reduced redundant attribute set and show no peak value. Comparing Figures 7 and 8 shows that attribute reduction removes redundancy and compresses and reduces the dimensionality of the data set distribution.

### 4.4 Kernel Parameter Optimization

In recent years, domestic and foreign experts and scholars have proposed many methods to evaluate the prediction effect of SVMs, most of which are based on the Macro-Average-ROC curve [30]. These curves can qualitatively evaluate the prediction effect of different SVM classification models [31], and the area under the curve (Macro-Average-AUC) can quantitatively evaluate their prediction accuracy. To simplify the narration, we use MAR for Macro-Average-ROC and MAA for Macro-Average-AUC.

Suppose the number of test samples is n and the number of categories is g. After training, the probability of each test sample under each category is calculated, yielding a matrix **P** with n rows and g columns; each row of **P** represents the probability values of one test sample under each category. Accordingly, the label of each test sample is converted into binarized (one-hot) form, with each position marking whether the sample belongs to the corresponding category. Thus, a label matrix **L** with n rows and g columns is obtained.

The basic idea of macro averaging is as follows. Under each category, the probability that each of the n test samples belongs to that category is given by the corresponding column of **P**. Therefore, from each pair of corresponding columns in the probability matrix **P** and the label matrix **L**, the false positive rate (FPR) and true positive rate (TPR) under each threshold can be calculated, and the FPR and TPR of a class determine that class's ROC curve, so g ROC curves can be plotted in total. FPR_all is obtained by merging, de-duplicating, and sorting the FPR values of all these curves. The linear interpolation method [32] is used to interpolate each curve's TPR at the coordinates of FPR_all missing from its own FPR, yielding the interpolated TPR'. TPR_mean is obtained by arithmetically averaging the g interpolated TPR' curves. Finally, the MAR curve is drawn from FPR_all and TPR_mean. The equations of TPR_mean and MAR are as follows:
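The macro-averaging procedure above can be sketched with sklearn. The toy probability matrix **P** and labels are illustrative stand-ins for the project's test-set predictions.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

# Toy probability matrix P (n x g) with scores biased toward the true class
rng = np.random.default_rng(0)
n, g = 200, 4
y = rng.integers(0, g, n)
P = rng.random((n, g)) + 2.0 * np.eye(g)[y]
P /= P.sum(axis=1, keepdims=True)
L = label_binarize(y, classes=list(range(g)))    # label matrix L (n x g)

# One ROC curve per class from the matching columns of L and P
curves = [roc_curve(L[:, k], P[:, k])[:2] for k in range(g)]

# FPR_all: merged, de-duplicated, sorted FPR grid of all g curves
fpr_all = np.unique(np.concatenate([fpr for fpr, _ in curves]))

# TPR_mean: linearly interpolate each curve onto FPR_all, then average
tpr_mean = np.mean([np.interp(fpr_all, fpr, tpr) for fpr, tpr in curves], axis=0)

maa = auc(fpr_all, tpr_mean)                     # Macro-Average-AUC (MAA)
print(round(maa, 3))
```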

Among the four SVM kernel functions described in Section 3.3, we are uncertain which is most suitable for this data set, and the parameter values of each kernel function are also undetermined. Thus, the training set of 1000 samples is used to select the optimal kernel function and parameters, as shown in Figure 1. Specifically, within the parameter domain, we plot the MAA of the model after attribute reduction as the parameter values change (Figure 9), so as to determine the optimal parameters and maximum MAA of each kernel function after attribute reduction. To analyze the influence of attribute reduction on the SVM classification effect, we also plot the model's MAA against the parameter values without attribute reduction (Figure 10) and determine the optimal parameters and maximum MAA of each kernel function without attribute reduction.

### (1) Linear kernel parameter optimization

The penalty coefficient σ is an important parameter of the linear kernel function. The greater σ is, the greater the penalty for misclassification: accuracy on the training set is high, but the generalization ability is weak. The smaller σ is, the lower the penalty for misclassification; misclassified samples are tolerated as noise points, and the generalization ability is stronger [33]. For the linear kernel, before attribute reduction (Figure 10 (a)), if 0<σ<0.4, MAA increases with σ. If 0.4≤σ≤6.4, MAA stabilizes around 0.62. If σ>6.4, the data are overfitted as σ increases, and MAA first decreases and then stabilizes around 0.56. After attribute reduction (Figure 9 (a)), if 0<σ<0.3, MAA increases with σ. If 0.3≤σ≤6.0, MAA stabilizes around 0.66. If σ>6.0, the data are overfitted as σ increases, and MAA first decreases and then stabilizes around 0.58. Thus, before attribute reduction, the optimal parameter interval of the linear kernel is [0.4, 6.4], where MAA reaches 0.612; after attribute reduction, the optimal interval is [0.3, 6.0], where MAA reaches 0.656.

### (2) Poly kernel parameter optimization

The polynomial order (degree) and the independent term (coef0) are important parameters of the poly kernel function. The greater the degree, the higher the dimension of the mapped space and the higher the complexity of evaluating the polynomial [34]. In particular, if degree=1 and coef0=0, the poly kernel is equivalent to the linear kernel. Fixing coef0=1, we study the influence of degree on the classification accuracy of the poly kernel. Before attribute reduction (Figure 10 (b)), if 1≤degree≤3, MAA increases with degree. If 3≤degree≤5, the classification accuracy is stable around 0.59. If 16≤degree≤26, MAA fluctuates slightly between 0.51 and 0.53. As degree increases further, the computational complexity increases and MAA stabilizes around 0.51. After attribute reduction (Figure 9 (b)), if 1≤degree≤2, MAA increases with degree. If 2≤degree≤4, MAA is stable around 0.72. If 10≤degree≤14, MAA shows a second peak, slightly lower than the first peak of 0.72. As degree increases further, the computational complexity increases and MAA stabilizes around 0.61. The optimal parameter interval of the poly kernel before attribute reduction is [3, 5], where MAA reaches 0.583; after attribute reduction, it is [2, 4], where MAA reaches 0.716.

### (3) RBF kernel parameter optimization

Gamma, an important parameter of the RBF kernel function, controls the penalty threshold range and mainly determines the influence of a single sample on the classification hyperplane. If gamma is small, a single sample has little influence on the hyperplane and is unlikely to be selected as a support vector. Conversely, a single sample has a greater influence on the hyperplane and is more likely to be selected as a support vector, so the model contains more support vectors [33]. In this study, the optimal gamma value is searched within [10^{-3}, +∞), and the search ends when MAA becomes stable as gamma increases. Before attribute reduction (Figure 10 (c)), if 0<gamma<3.3×10^{-3}, MAA increases with gamma. If 3.3×10^{-3}≤gamma≤6.4×10^{-3}, MAA is stable around 0.72. If gamma≥2.6×10^{-2}, MAA is stable around 0.67 as gamma increases. After attribute reduction (Figure 9 (c)), if 0<gamma<1.5×10^{-3}, MAA increases with gamma. If 1.5×10^{-3}≤gamma≤2.0×10^{-3}, MAA is stable around 0.79. If 2.0×10^{-3}<gamma<3.2×10^{-3}, MAA shows a decreasing trend. If gamma≥3.2×10^{-3}, MAA is stable around 0.66. Thus, the optimal parameter interval of the RBF kernel before attribute reduction is [3.3×10^{-3}, 6.4×10^{-3}], where MAA reaches 0.718; after attribute reduction, it is [1.5×10^{-3}, 2.0×10^{-3}], where MAA reaches 0.793.

### (4) Sigmoid kernel parameter optimization

k (gamma) and c (coef0) are two important parameters of the sigmoid kernel function [34]. To consider the influence of k and c on classification accuracy jointly, this study takes (k, c) as the independent variable and MAA as the dependent variable, calculates the MAA of the sigmoid kernel within the search range 0<k≤0.1 and 0≤c≤10, and draws a spatial scatter plot containing 100 points (Figure 9 (d) and Figure 10 (d)). Before attribute reduction (Figure 10 (d)), if 0.04≤k≤0.07 and 3≤c≤6, the MAA of the sigmoid kernel reaches 0.732; for other values of (k, c), the MAA is below 0.732. After attribute reduction (Figure 9 (d)), if 0.03≤k≤0.05 and 1≤c≤5, the MAA of the sigmoid kernel reaches 0.825; for other values of (k, c), the MAA is below 0.825. Thus, before attribute reduction, the optimal parameter interval of the sigmoid kernel is k∈[0.04, 0.07] and c∈[3, 6], where MAA reaches 0.732; after attribute reduction, it is k∈[0.03, 0.05] and c∈[1, 5], where MAA reaches 0.825.
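The joint (k, c) grid search can be sketched as follows. This is again a minimal illustration on synthetic data, and it uses mean cross-validation accuracy as a simplified stand-in for the paper's MAA:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Illustrative data in place of the project's samples.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=1)

# Joint grid over k (gamma) and c (coef0), covering the paper's search ranges.
k_grid = np.linspace(0.01, 0.1, 5)
c_grid = np.linspace(0.0, 10.0, 5)
results = {}
for k in k_grid:
    for c in c_grid:
        clf = SVC(kernel='sigmoid', gamma=k, coef0=c)
        # Mean CV accuracy stands in for MAA at this grid point.
        results[(k, c)] = cross_val_score(clf, X, y, cv=3).mean()

best_k, best_c = max(results, key=results.get)
```

Each `(k, c)` grid point corresponds to one point of the spatial scatter plot described above.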

## 4.5 Comparison Models

The validation set containing 500 samples is used to test the prediction performance of the proposed RS-SVM classification model. To verify the effectiveness and feasibility of the proposed RS-SVM machine learning method for the urban drainage network repair scheme selection model, we also compare the algorithm with the Laplacian logistic regression algorithm [35]. The parameter optimization results of Section 4.4 and the prediction results of logistic regression are shown in Table 7. Based on Table 7, we draw the MAR curve of each kernel function with optimal parameters and the MAR curve of logistic regression, as shown in Figure 11. Figure 12 shows how the average running time of each kernel function and of logistic regression changes with the sample size.

**SVM** (*before*/*after* denote before and after attribute reduction)

| Kernel | Optimal parameter (before) | Optimal parameter (after) | MAA (before) | MAA (after) | Average time (s) (before) | Average time (s) (after) |
|---|---|---|---|---|---|---|
| Linear | [0.4, 6.4] | [0.3, 6.0] | 0.612 | 0.656 | 129.4 | 85.6 |
| Poly | [3, 5] | [2, 4] | 0.583 | 0.716 | 159.3 | 103.9 |
| RBF | [0.0033, 0.0064] | [0.0015, 0.0020] | 0.718 | 0.793 | 161.3 | 100.7 |
| Sigmoid | k∈[0.04, 0.07]; c∈[3, 6] | k∈[0.03, 0.05]; c∈[1, 5] | 0.732 | 0.825 | 150.5 | 90.6 |

**Laplacian Logistic Regression**

| MAA (before) | MAA (after) | Average time (s) (before) | Average time (s) (after) |
|---|---|---|---|
| 0.564 | 0.598 | 523.3 | 483.1 |


In terms of the MAR curve, before attribute reduction (Figure 11 (a)), the four kernel functions sorted by MAA in descending order are: Sigmoid, RBF, Linear, Poly. After attribute reduction (Figure 11 (b)), the descending order is: Sigmoid, RBF, Poly, Linear. In terms of average running time, both before attribute reduction (Figure 12 (a)) and after attribute reduction (Figure 12 (b)), the classification time of SVM and of Laplacian logistic regression increases with sample size, and the time of Laplacian logistic regression grows significantly faster than that of SVM. Considering the MAR curves and average times together, although the classification time of the Sigmoid kernel is not the lowest, it differs little from that of the other kernels, and the Sigmoid kernel guarantees the highest MAA. Therefore, the Sigmoid-kernel algorithm with attribute reduction is more suitable than the other algorithms for selecting an urban drainage network repair scheme on the current data set.

Through comparative analysis, the detailed contributions of the rough set and SVM model are as follows:

The rough set attribute reduction algorithm can effectively reduce the sample attributes and improve the classification efficiency and accuracy of the multilevel SVM algorithm. Additionally, comparative analysis reveals that the parameter optimization results of the four SVM kernel functions differ before and after attribute reduction, and that the classification time of the same classification model differs across sample sizes.

Compared with the Laplacian logistic regression algorithm, SVM adopts a kernel function mechanism. When calculating the decision surface, only the samples that serve as support vectors participate in the calculation, which guarantees higher accuracy and shorter running time overall and effectively reduces overfitting. At the same time, SVM builds a linear learning machine in a high-dimensional feature space, which avoids the 'curse of dimensionality' to some extent.
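A rough way to reproduce the running-time comparison is to fit both models on samples of growing size and time each fit. This sketch uses synthetic data, and scikit-learn's plain `LogisticRegression` as a stand-in for the Laplacian variant, so the absolute times are illustrative only:

```python
import time
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

def fit_time(model, n):
    """Return the wall-clock time to fit `model` on n synthetic samples."""
    X, y = make_classification(n_samples=n, n_features=8, n_informative=5,
                               n_classes=3, n_clusters_per_class=1,
                               random_state=0)
    t0 = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - t0

# Time both classifiers over increasing sample sizes, as in Figure 12.
sizes = [200, 400, 800]
svm_times = [fit_time(SVC(kernel='sigmoid'), n) for n in sizes]
lr_times = [fit_time(LogisticRegression(max_iter=1000), n) for n in sizes]
```

Plotting the two time series against `sizes` gives a curve analogous to Figure 12; the relative growth rates, not the absolute values, are the point of comparison.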

## 5. CONCLUSION AND FUTURE WORK

In view of the difficulty of urban drainage network detection and repair and the mismatch between repair scheme design and construction schedule, this study organically combines a rough set attribute reduction algorithm based on attribute similarity with a multi-level support vector machine to construct an RS-SVM model for selecting urban drainage network repair schemes. We select the case data of the Wuhu pipe network repair project for big data analysis. In general, the scheme selection model proposed in this study can effectively predict the urban drainage network repair scheme and provide a basis for its rapid selection. The method can also be extended to fields such as disease diagnosis, project risk assessment, fire identification, and bank loan risk prediction, and thus has high practical application value. However, in collecting the case data, we considered neither multi-scheme combination repair nor the use of different kernel function combinations for optimization in the classification algorithm. Therefore, future studies can proceed from the following aspects:

According to the characteristics of the project, the urban pipeline repair scheme is further subdivided, and the pipeline sections repaired by combining multiple schemes are taken as samples into the training set to study a more accurate pipeline repair construction scheme.

The prediction efficiency and accuracy of RS-SVM machine learning can be further improved by combining multiple kernel functions for SVM multi-level classifiers.

Sample data from multiple pipeline repair projects can be collected, and other prediction methods, such as random forest, can be used for comparative experiments. This not only enriches the experimental results but also improves the accuracy of pipeline repair scheme selection and provides auxiliary support for designers' rapid decision-making.

## ACKNOWLEDGMENTS

This work is supported by the Funds for the Anhui Provincial Science and Technology Innovation Strategy and Soft Science Research Project (Grant Number 202206f01050017), National Natural Science Foundation of China (Grant Number 72131006, 6201101347, 72071063), Fundamental Research Funds for the Central Universities (Grant Number JS2021ZSPY0037), Research Project of China Three Gorges Corporation (Grant Number 202103355), Yangtze Ecology and Environment Co., Ltd. (Grant Number HB/AH2021039), and Power China Huadong Engineering Corporation Limited (KY2019-ZD-03).

## AUTHOR CONTRIBUTIONS

Li Jiang (E-mail: jiangli@hfut.edu.cn): has participated in the proposed model design and writing of the manuscript.

Zheng Geng (E-mail: gengzheng2023@163.com): has participated in the coding, the experiments and analysis, and the writing of the manuscript.

DongXiao Gu (E-mail: dongxiaogu@yeah.net): has participated in part of the experiments and analysis.

Shuai Guo (E-mail: guoshuai@hfut.edu.cn): has participated in part of the experiments and analysis.

RongMin Huang (E-mail: huang_rongmin@ctg.com.cn): has participated in the writing and revision of the manuscript.

HaoKe Cheng (E-mail: cheng_haoke@ctg.com.cn): has participated in the writing and revision of the manuscript.

KaiXuan Zhu (E-mail: zkx9116@126.com): has participated in the writing and revision of the manuscript.