## Abstract

We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce, for example, monotonicity of the function over selected inputs. The aim is to find models which conform to expected behavior and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: (i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and (ii) a two-population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints, which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also significantly larger models.

## 1 Introduction and Motivation

Dynamical systems and processes in critical application areas such as control engineering require trustworthy, robust, and reliable models that strictly conform to behavioral expectations according to domain-specific knowledge, as well as safety and performance criteria. However, such models cannot always be derived from first principles and theoretical considerations alone. Thus, they must be determined empirically. From this perspective, models used in system identification can be categorized as:

- *White-box*. Derived from first principles, explicitly includes domain knowledge, can be valid for a wide range of inputs even beyond available observations, and allows for far-reaching predictions (extrapolation).
- *Gray-box*. Derived from data, with known internals, open for further inspection. Can be validated against domain knowledge. Only valid for inputs which are similar to the observations used for model fitting.
- *Black-box*. Derived from data, with unknown internals (e.g., neural networks), difficult or impossible to inspect and/or validate.

In between the two extremes, there is a whole spectrum of gray-box models which combine aspects of both extremes (Ljung, 2010). The proponents of “explainable AI” propose to apply models on the white end of this spectrum either directly (Rudin, 2019) or as an approximation of black-box models in AI technologies.

Empirical, data-driven models are required in scenarios where first principles models are infeasible due to engineering or financial considerations. Ensuring that empirical models conform to expected system behavior, correctly reflect physical principles, and can extrapolate well on unseen data remains an open challenge in this area.

Symbolic regression (SR) is especially interesting in this context because SR models are closed-form expressions which are structurally similar to white-box models derived from first principles. The aim in SR is to identify the best function form and parameters for a given data set using common mathematical operators and functions which serve as building blocks for the function form (Koza, 1992). This is in contrast to other forms of regression analysis where the function form is prespecified and only the numerical coefficients are optimized through model fitting.

It should be noted, however, that SR is fundamentally a purely data-driven approach. It does not require prior information about the modelled process or system and leads to gray-box models which can be hard to understand. SR algorithms therefore favor shorter expressions and use simplification to facilitate detailed analysis of their models.

The most well-known solution methods for SR include tree-based genetic programming (GP) variants (Koza, 1992), linear GP (Oltean and Grosan, 2003) including Pareto-GP (Smits and Kotanchek, 2005), Cartesian GP (Miller and Harding, 2008), grammatical evolution (GE) (O'Neill and Ryan, 2001), and gene expression programming (GEP) (Ferreira, 2001). More recent developments of SR include Geometric Semantic GP (Moraglio et al., 2012; Pawlak and Krawiec, 2018; Ruberto et al., 2020), Multiple-regression GP (Arnaldo et al., 2014), and a memetic evolutionary algorithm using a novel fractional representation for SR (Sun and Moscato, 2019). There are also several nonevolutionary algorithms for SR including fast function extraction (FFX) (McConaghy, 2011), SymTree (de França, 2018), and prioritized grammar enumeration (PGE) (Worm and Chiu, 2013). Recently, several neural network architectures have been used for symbolic regression (Sahoo et al., 2018; Kim et al., 2019; Petersen et al., 2019; Udrescu and Tegmark, 2020).

This contribution introduces an extension of SR which allows the inclusion of vague prior knowledge via constraints on the shape of the SR model image and its derivatives. We therefore call this approach shape-constrained SR, as it is a specific form of shape-constrained regression. The main motivation is that, even in data-based modelling tasks, partial knowledge about the modelled system or process is often available and can be helpful to produce better models which conform to expected behavior and might improve extrapolation. Such information could be used for:

- data augmentation to decrease the effects of noisy measurements or for detecting implausible outliers,
- fitting the model in input space regions where observations are scarce or unavailable,
- distinguishing between correlations in observation data and causal dependencies,
- increasing the efficiency of the statistical modelling technique.

### 1.1 Problem Statement

We focus on the integration of prior knowledge in SR solvers in order to find expressions compatible with the expected behavior of the system.

Our assumptions are that we have data in the form of measurements as well as side information about the general behavior of the modeled system or process that, however, is insufficient to formulate an explicit closed-form expression for the model. We additionally assume that the information can be expressed in the form of constraints on the function output as well as its partial derivatives.

The task is therefore to find a closed-form expression which fits the data, possibly with a small error, and conforms to our expectation of the general behavior of the system. In particular, the model must not violate shape constraints when interpolating and extrapolating within a bounded input space.

We investigate the hypotheses that (i) interval arithmetic can be used to approximate bounds for the image of SR models and their partial derivatives and therefore reject nonconforming models within SR solvers, and (ii) that the predictive accuracy of SR models can be improved by using side information in the form of shape constraints.

## 2 Related Work

The issue of knowledge integration into machine learning methods including genetic programming has been discussed before.

Closely related to this work is the work of Bladek and Krawiec (2019), who have discussed using formal constraints in combination with symbolic regression to include domain knowledge about the monotonicity, convexity, or symmetry of functions. Their approach, called Counterexample-Driven GP, utilizes satisfiability modulo theories (SMT) solvers to check whether candidate SR models satisfy the constraints and to extend the training set with counterexamples. Bladek and Krawiec (2019) observe that it is very unlikely for GP to synthesize models which conform to constraints. The difference from this work is that we do not support symmetry constraints, and instead of an SMT solver we use interval arithmetic to calculate bounds for the function image and its partial derivatives. We test our approach on a larger set of harder benchmark problems.

Versino et al. (2017) have specifically studied the generation of flow stress models for copper using GP with and without knowledge integration. They have investigated four different scenarios: (i) a canonical GP solver that returns a Pareto front of model quality and complexity, (ii) knowledge integration by means of transformed variables, sample weighting, introduction of artificial points, and introduction of seed solutions, (iii) searching for the derivative of the expression and then reconstructing the original equation by integration, and (iv) searching for derivatives but with additional artificial data points.

This work emphasized the importance of integrating prior knowledge in the model fitting process. The authors observed that the canonical GP solver presented unpredictable behavior with respect to the physical constraints; thus the sampled data set alone was insufficient to guide the solver into a model with both low error and physical feasibility. In contrast to our approach which guarantees that solutions conform to prior knowledge, all techniques tried in Versino et al. (2017) are optimistic and might produce infeasible solutions. Additionally, adding artificial data points is subjective and does not scale to high-dimensional problems.

Schmidt and Lipson (2009) studied the influence of integrating expert knowledge into the evolutionary search process through a process called seeding. First, the authors generated approximate solutions by either solving a simpler problem or finding an approximate solution to a more complex problem. These solutions are then used during the seeding procedure by inserting approximations into the initial population, shuffled expressions, and building blocks. The authors found that seeding significantly improves convergence and average fitness when compared to no seeding. Among the seeding procedures, the seeding of building blocks was the most successful, allowing faster convergence even on more complex problems. Again, seeding is an optimistic approach and does not guarantee that the final SR solution conforms to prior knowledge.

Stewart and Ermon (2017) investigated how to use prior knowledge in artificial neural networks. Specifically, they incorporate physical knowledge into motion detection and tracking and causal relationships in object detection.

Even though the experiments were all small scale, they showed promising results and opened up the possibility of training an artificial neural network by only describing the properties of the approximation function. This idea is similar to the idea of shape constraints.

Recently, an approach for knowledge integration for symbolic regression with a similar motivation to this work has been described in Li et al. (2019). Instead of shape constraints, the authors make use of semantic priors in the form of leading powers for symbolic expressions. They describe a Neural-Guided Monte Carlo Tree Search (MCTS) algorithm to search for SR models which conform to prior knowledge. The authors use a context-free grammar for generating symbolic expressions and use MCTS for generating solutions, whereby a neural network predicts the next production rule given a sequence of already applied rules and the leading power constraints.

The algorithm was compared against variations of MCTS and a GP implementation using the DEAP framework (Fortin et al., 2012).

The results show a superior performance of neural-guided MCTS with a high success rate for easier instances but much lower rates on harder instances which were, however, still higher than the success rate of GP.

One aim of our work is to improve the extrapolation of SR solutions. Similarly to polynomial regression, SR models might produce extreme output values when extrapolating. Interestingly, the extrapolation behavior of SR models has been largely ignored in the literature, with only a few exceptions.

Castillo et al. (2013) investigated the difference in the response variation of different models generated by GP for three different data sets. They have shown that in some situations a GP algorithm can generate models with different behavior while presenting a similar $R^2$ value. These results motivate the need for a post-hoc analysis of the validity of the model with respect to the system being studied and a verification of how well such models extrapolate.

The extrapolation capability of GP is also studied by Castillo et al. (2003). In this work, the authors evaluate and compare the best solution obtained by a GP algorithm to a linear model which uses the transformed variables found by the GP model.

The experiments suggest that the GP models have good interpolation capabilities but with a moderate extrapolation error. The proposed approach presented good interpolation and extrapolation capabilities, suggesting that at the very least, the GP models can guide the process of building a new transformed space of variables.

Another relevant result was obtained by Kurse et al. (2012) who used Eureqa (a commercial SR solver) to find analytical functions that correctly modelled complex neuromuscular systems. The experiments showed that the GP solver was capable of extrapolating the data set much better than the traditional polynomial regression, used to model such systems.

Stewart and Ermon (2017) trained a neural network by introducing prior knowledge through a penalty term in the loss function such that the generated model is consistent with physical behavior. In contrast to our work, this generates a black-box model of the data. Finally, Zhu et al. (2019) incorporate the appropriate partial differential equations into the loss functions of a neural network and, by doing so, can train their model on unlabeled data; this approach requires more information than is usually available when modeling observed data.

## 3 Methods

In this section, we describe in detail the methods proposed in this article to integrate prior knowledge into SR solvers using interval arithmetic and shape constraints. In summary, we extend two GP variants to support shape-constrained SR: a tree-based GP approach with optional memetic improvement of model parameters and the Interaction-Transformation Evolutionary Algorithm (ITEA) (de Franca and Aldeia, 2021). The extended algorithms are compared against the original versions on a set of benchmark problems. For the comparison, we calculate the number of constraint violations and the training as well as the test errors and qualitatively assess the extrapolation behavior.

### 3.1 Shape-Constrained Regression

Shape-constrained regression allows fitting a model while constraining certain characteristics of the model. It is a general concept that encompasses a diverse set of methods including parametric and nonparametric, as well as univariate and multivariate methods. Examples include isotonic regression (Wright et al., 1980; Tibshirani et al., 2011), monotonic lattice regression (Gupta et al., 2016), nonparametric shape-restricted regression (Guntuboyina et al., 2018), non-negative splines (Papp and Alizadeh, 2014), and shape-constrained polynomial regression (Hall, 2018).

It is important to mention that shape constraints limit the function outputs and partial derivatives, but the optimal function which fits the observed data still needs to be identified by the algorithm. Shape-constrained regression is a general concept which is applicable to different forms of regression analysis. For specific models like multivariate linear regression or tree-based methods (e.g., XGBoost), it is easy to integrate shape constraints. For other models, for instance, polynomial regression, it is harder to incorporate shape constraints (Hall, 2018; Ahmadi and Majumdar, 2019).

Generally, one can distinguish optimistic and pessimistic approaches to shape-constrained regression (Gupta et al., 2016). Optimistic approaches check the constraints only for a finite set of points from the input space and accept that the identified model might violate the constraints for certain elements from the input space. Pessimistic approaches calculate bounds for the outputs of the model and its partial derivatives to guarantee that identified solutions are in fact feasible. However, pessimistic approaches might reject optimal models as a consequence of overly wide bounds. In this article, we study a pessimistic approach and calculate bounds for SR models and their derivatives using interval arithmetic.

### 3.2 Interval Arithmetic

Interval Arithmetic (IA) is a method for calculating output ranges for mathematical expressions (Hickey et al., 2001). It has many uses, such as dealing with uncertainties stemming from inaccurate representations and verifying boundary conditions. An interval is represented as $[a,b]$ with $a \le b$, where $a$ and $b$ are named the lower and upper endpoints, respectively.

If we have a function $f(x_1, x_2) = x_1 + x_2$, for example, and know the intervals for each variable to be $x_1 \in [a,b]$ and $x_2 \in [c,d]$, we can say that the image of function $f$ will be in the interval $[a+c, b+d]$. IA defines the common mathematical operations and functions for such intervals. It is easy to see that IA can only give bounds for the expression instead of an accurate range because it does not track dependencies between arguments of operators. For instance, the IA result for the expression $x_1 - x_1$ with $x_1 \in [a,b]$ is the interval $[a-b, b-a]$ instead of $[0,0]$.
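The interval operations and the dependency problem described above can be reproduced in a few lines of Python. This is an illustrative sketch with hypothetical helper names, not the implementation used in our experiments:

```python
# Minimal interval arithmetic sketch. Intervals are (lo, hi) tuples; each
# operation returns bounds guaranteed to contain the true range, although
# they may be wider than the exact range.

def ia_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def ia_sub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def ia_mul(a, b):
    # All endpoint products are candidates because signs may flip the order.
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

x1 = (1.0, 2.0)
x2 = (3.0, 5.0)

# f(x1, x2) = x1 + x2 has image [4, 7]
print(ia_add(x1, x2))   # (4.0, 7.0)

# x1 - x1 is exactly [0, 0], but IA ignores the dependency and yields [-1, 1]
print(ia_sub(x1, x1))   # (-1.0, 1.0)
```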

IA has previously been used in SR to improve model quality, as it allows detecting and rejecting partially defined functions resulting, for example, from division by zero or taking the logarithm of a non-positive number. The use of IA for SR was first proposed in Keijzer (2003), where IA was used to verify that a given expression is defined within the domain spanned by the variable values in the data set. The author showed that this indeed helps avoid partially defined functions, even when using standard division instead of the commonly used protected division operator.

Pennachin et al. (2010) used Affine Arithmetic, an extension to IA, for finding robust solutions. The authors observed that the models generated by their algorithm were robust with respect to extrapolation error.

In Dick (2017) the approach of Keijzer (2003) was extended by introducing crossover and mutation operators that make use of IA to generate feasible expressions. The experiments showed a faster convergence rate with interval-aware operators.

We also include IA in SR. Our work is in fact a generalization of the idea discussed in Keijzer (2003) and Pennachin et al. (2010), but instead of limiting the use of IA to detecting whether a function is partial, we expand it to test for other properties such as monotonicity. Another difference is that we assume that the considered intervals are derived from prior knowledge instead of being inferred from the data.

### 3.3 Genetic Programming for Shape-Constrained Symbolic Regression

We integrate IA into two solvers for SR. In this section we describe the integration into tree-based GP (Koza, 1992) and in the next section we describe the integration into ITEA.

We use a tree-based GP variation with optional memetic local optimization of SR model parameters which has produced good results for a diverse set of regression benchmark problems (Kommenda et al., 2020).

The pseudocode for our GP algorithm for shape-constrained symbolic regression is shown in Algorithm 1. Solution candidates are symbolic expressions encoded as expression trees. Parameters of the algorithm are the function set $F$, the terminal set $T$, the maximum length (number of nodes) $L_{max}$ and maximum depth $D_{max}$ of expression trees, the number of generations $G_{max}$, the population size $N$, the mutation rate $m$, the tournament group size $p$, and the number of iterations for memetic parameter optimization $n_{Opt}$. The algorithm uses tournament selection, generational replacement, and elitism. New solution candidates are produced via subtree crossover and optional mutation. The initial population is generated with the PTC2 algorithm (Luke, 2000). All evolutionary operators respect the length and depth limits for trees.

Shape constraints are handled during fitness evaluation. First, we calculate intervals for the function output as well as the necessary partial derivatives using the known intervals for each of the input variables and check the output intervals against the constraint bounds. If a solution candidate violates any of the shape constraints, the solution candidate is assigned the worst possible fitness value. In a second step, the prediction error is calculated only for the remaining feasible solution candidates. The solution candidate with the smallest error within a group is selected as the winner in tournament selection.
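The two-step evaluation described above can be sketched as follows, assuming the IA bounds for the function image and partial derivatives have already been computed. All names are illustrative and not taken from our implementation:

```python
# Hedged sketch of the constraint-handling step during fitness evaluation.
# `model_bounds` maps each constrained quantity (the function image or a
# partial derivative) to its IA-computed interval; `constraints` gives the
# admissible interval derived from prior knowledge.

WORST_NMSE = 1.0  # fitness assigned to infeasible solution candidates

def violates(bounds, allowed):
    lo, hi = bounds
    a_lo, a_hi = allowed
    return lo < a_lo or hi > a_hi

def fitness(model_bounds, constraints, nmse):
    # Step 1: reject candidates whose bounds violate any shape constraint.
    for key, allowed in constraints.items():
        if violates(model_bounds[key], allowed):
            return WORST_NMSE
    # Step 2: feasible candidates are ranked by their prediction error.
    return nmse

# Monotonically increasing in x1: df/dx1 must stay within [0, inf)
constraints = {"df/dx1": (0.0, float("inf"))}
print(fitness({"df/dx1": (0.2, 3.0)}, constraints, 0.05))   # 0.05 (feasible)
print(fitness({"df/dx1": (-0.1, 3.0)}, constraints, 0.05))  # 1.0 (infeasible)
```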

We assume that the constraints are provided via a set $C$ as given in Section 3.1.

#### 3.3.1 Fitness Evaluation

Function *Evaluate* calculates the vector of residuals from
the predictions for inputs $X$ and the observed target values $y$.
We use the normalized mean of squared errors (NMSE) as the fitness
indicator. NMSE is the mean of squared residuals scaled with the inverse
variance of $y$.
As described in Keijzer (2003), we
implicitly scale all symbolic regression solutions to match the mean and
variance of $y$.
Therefore, we know that the worst possible prediction (i.e., a constant
value) has an NMSE of one and we use this value for infeasible solution
candidates.
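A minimal sketch of NMSE with linear scaling, following Keijzer (2003); the helper name is our own and the degenerate-case handling is an assumption:

```python
import numpy as np

# Predictions f are linearly scaled (least-squares slope a and intercept b)
# before the error is computed, so a constant prediction can never score
# worse than NMSE = 1.

def nmse_linear_scaling(y, f):
    if np.var(f) == 0:                   # constant model: scaling degenerates
        return 1.0
    a, b = np.polyfit(f, y, 1)           # least-squares slope and intercept
    residuals = y - (a * f + b)
    return float(np.mean(residuals**2) / np.var(y))

rng = np.random.default_rng(0)
y = rng.normal(size=50)
print(nmse_linear_scaling(y, 2.0 * y + 3.0))   # ~0.0: perfect after scaling
print(nmse_linear_scaling(y, np.zeros(50)))    # 1.0: constant prediction
```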

#### 3.3.2 Optional Local Optimization

Procedure *Optimize* locally improves the vector of numerical coefficients $\theta \in \mathbb{R}^{dim}$ of each model via nonlinear least-squares fitting with the Levenberg-Marquardt algorithm. It has been demonstrated that gradient-based local improvement improves symbolic regression performance (Topchy and Punch, 2001; Kommenda et al., 2020). Here, we want to investigate whether it can also be used in combination with shape constraints. To test the algorithms with and without local optimization, we have included the number of local improvement iterations $n_{Opt}$ as a parameter for the GP algorithm.

The local improvement operator extracts the initial parameter values $\theta_{init}$ from the solution candidates and updates the values after optimization. Therefore, improved parameter values $\theta^{\star}$ become part of the genome and can be inherited by new solution candidates (Lamarckian learning; Houck et al., 1997).

### 3.4 Interaction-Transformation Evolutionary Algorithm

The algorithm is a simple mutation-based evolutionary algorithm depicted in Algorithm 2. It starts with a random population of expressions and repeats mutation and selection until convergence.

The mutation procedure chooses one of the following actions at random:

- Remove one random term of the expression.
- Add a new random term to the expression.
- Replace the interaction strengths of a random term of the expression.
- Replace a random term of the expression with the positive interaction of another random term.
- Replace a random term of the expression with the negative interaction of another random term.

The positive and negative interactions of two terms add or subtract the strengths (exponents) of their variables. For example, if we have two terms, $x_1^2 x_2^3$ and $x_1^{-1} x_2^{1}$, and we want to apply the positive interaction to them, we would perform the operation $x_1^2 \cdot x_1^{-1} \cdot x_2^3 \cdot x_2^{1}$, which results in $x_1 x_2^4$. Similarly, with the negative interaction we divide the first term by the other, performing the operation $x_1^2 \cdot x_1^{1} \cdot x_2^3 \cdot x_2^{-1}$ and resulting in $x_1^3 x_2^2$.
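Representing each IT term by its vector of strengths, the two interactions reduce to elementwise addition and subtraction of exponents. A short illustrative sketch (function names are our own):

```python
# IT terms as exponent vectors: [2, 3] stands for x1^2 * x2^3.
# The positive/negative interaction mutations operate on these vectors.

def positive_interaction(t1, t2):
    # multiply the monomials: exponents add
    return [e1 + e2 for e1, e2 in zip(t1, t2)]

def negative_interaction(t1, t2):
    # divide the first monomial by the second: exponents subtract
    return [e1 - e2 for e1, e2 in zip(t1, t2)]

t1 = [2, 3]    # x1^2 * x2^3
t2 = [-1, 1]   # x1^-1 * x2^1

print(positive_interaction(t1, t2))  # [1, 4]  -> x1 * x2^4
print(negative_interaction(t1, t2))  # [3, 2]  -> x1^3 * x2^2
```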

Weights within IT expressions can be efficiently optimized using ordinary least squares (OLS) after the mutation step.
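The OLS step can be sketched as follows for purely polynomial IT terms (no transformation function applied); the helper names and the intercept handling are illustrative assumptions:

```python
import numpy as np

# Each IT term maps the inputs to one feature column; the term weights then
# solve an ordinary linear least-squares problem.

def evaluate_terms(X, terms):
    # terms: list of exponent vectors, one per IT term
    return np.column_stack([np.prod(X**np.array(t), axis=1) for t in terms])

def fit_weights(X, y, terms):
    Phi = evaluate_terms(X, terms)
    Phi = np.column_stack([np.ones(len(y)), Phi])   # intercept column
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

rng = np.random.default_rng(1)
X = rng.uniform(1.0, 2.0, size=(100, 2))
terms = [[2, 3], [1, -1]]                 # x1^2 x2^3 and x1 / x2
y = 0.5 + 2.0 * X[:, 0]**2 * X[:, 1]**3 - 3.0 * X[:, 0] / X[:, 1]
print(fit_weights(X, y, terms))           # approximately [0.5, 2.0, -3.0]
```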

This algorithm was reported to surpass different variants of GP and stay on par with nonlinear regression algorithms (de Franca and Aldeia, 2021).

### 3.5 Feasible-Infeasible Two-Population with Interaction-Transformation

In order to introduce the shape constraints into ITEA, we use the approach called Feasible-Infeasible Two-Population (FI-2POP) (Kimbrough et al., 2008), which works with two populations: one for feasible models and another for infeasible models. Despite its simplicity, it has been reported to work well on different problems (Liapis et al., 2015; Scirea et al., 2016; Covões and Hruschka, 2018). We have chosen this approach to deal with constraints for the ITEA algorithm since it does not demand fine-tuning of a penalty function and a penalty coefficient, does not require any change to the original algorithm (unlike repair mechanisms and constraint-aware operators), and does not introduce a high computational cost (unlike multiobjective optimization).

Specifically for ITEA, the adaptation (named FI-2POP-IT) is depicted in Algorithm 3. The main difference from Algorithm 2 is that in FI-2POP-IT every function now produces two distinct populations, one for feasible solutions and another for the infeasible solutions. For example, the mutation operator when applied to an individual can produce either a feasible solution or an infeasible one; this new individual will be assigned to its corresponding population.

The fitness evaluation differs for each population; the feasible population minimizes the error function while the infeasible population minimizes the constraint violations.

#### 3.5.1 Defining Constraints with Interval Arithmetic

The IT representation allows for a simple algorithm to calculate the $n$-th order partial derivatives of the expression and to perform IA. Additionally, only linear real-valued parameters are allowed for IT expressions. Therefore, parameters can be optimized efficiently using ordinary least squares (OLS) and all shape constraints can be transformed to functions which are linear in the parameters.
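For a purely polynomial IT term, differentiation is simple exponent bookkeeping, which is what makes the representation convenient for IA. A hedged sketch (the chain-rule factor contributed by a surrounding transformation function is omitted, and the names are our own):

```python
# Partial derivative of a polynomial IT term w * prod_i x_i^{e_i} with
# respect to x_j: the result is another term of the same shape, with weight
# w * e_j and the j-th exponent decremented by one.

def term_derivative(weight, exponents, j):
    if exponents[j] == 0:
        # the term does not depend on x_j; derivative vanishes
        return 0.0, [0] * len(exponents)
    new_exps = list(exponents)
    new_exps[j] -= 1
    return weight * exponents[j], new_exps

# d/dx1 of 2 * x1^2 * x2^3  ->  4 * x1^1 * x2^3
print(term_derivative(2.0, [2, 3], 0))   # (4.0, [1, 3])
```

Applying the same rule to the resulting term yields second-order derivatives.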

All of the aforementioned transformation functions have derivatives representable by the provided functions. Following the same rationale, the second-order derivatives can be calculated as well.

## 4 Experimental Setup

For testing the effectiveness of the proposed approach we have created artificial data sets generated from fluid dynamics engineering (FDE) models taken from Chen et al. (2018) and from selected physics models from the *Feynman Symbolic Regression Database*^{1} (Udrescu and Tegmark, 2020) (FEY). Additionally, we have used data sets built upon real-world measurements from physical systems and engineering (RW). We have chosen these data sets because they are prototypical examples of nontrivial nonlinear models which are relevant in practical applications.

For each of the problem instances, we have defined a set of shape constraints that must be fulfilled by the SR solutions. For the FDE and FEY data sets, we have used constraints that can be derived from the known expressions over the given input space. For the physical systems, we have defined the shape constraints based on expert knowledge. The input space and monotonicity constraints for all problem instances are given in the supplementary material.

In the following subsections, we give more details on the problem instances including the constraint definitions and describe the methods to test the modified algorithms.

### 4.1 Problem Instances

All the data sets are described by the input variables names, the target variable, the true expression if it is known, domains of the variables for the training and test data, expected image of the function, and monotonic constraints. All data set specifications are provided as a supplementary material.

Details for the *Aircraft Lift*, *Flow Psi*, and *Fuel Flow* models can be found in Anderson (2010, 1982). Additionally, we have selected a subset of the FEY problem instances (Udrescu and Tegmark, 2020). The selection criteria were those problem instances which have been reported as unsolved without dimensional analysis, or for which a required runtime of more than five minutes was reported by Udrescu and Tegmark (2020). We have not included the three benchmark problems from the closely related work of Bladek and Krawiec (2019) because we could easily solve them in preliminary experiments and, additionally, the results would not be directly comparable because we do not support symmetry constraints.

All constraints for FDE and FEY have been determined by analyzing the known expression and its partial derivatives. The formulas for the FDE and FEY problem instances are shown in Table 1. For all these problem instances we have generated data sets from the known formula by randomly sampling the input space, using 100 data points for training and 100 data points for testing. For each of the data sets we generated two versions: one without noise and one where we added normally distributed noise to the target variable, $y' = y + N(0, 0.05\,\sigma_y)$. Accordingly, the optimally achievable NMSE for the noisy problem instances is 0.25%.
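The stated noise floor follows directly from the noise model: with $y' = y + N(0, 0.05\,\sigma_y)$, a perfect model leaves only the noise, whose variance is $(0.05\,\sigma_y)^2 = 0.0025\,\sigma_y^2$, i.e., an NMSE of 0.25%. A quick numerical sanity check:

```python
import numpy as np

# With y' = y + N(0, 0.05 * sigma_y), a model that predicts y exactly still
# incurs the noise variance, so the expected NMSE is (0.05)^2 = 0.25%.

rng = np.random.default_rng(42)
y = rng.normal(size=1_000_000)
y_noisy = y + rng.normal(scale=0.05 * np.std(y), size=y.size)

nmse = np.mean((y_noisy - y)**2) / np.var(y_noisy)
print(round(100 * nmse, 2))   # close to 0.25 (percent)
```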

**Table 1:** Formulas for the FDE and FEY problem instances.

| Name | Formula |
|---|---|
| Aircraft lift | $C_L = C_{L\alpha}(\alpha + \alpha_0) + C_{L\delta_e}\,\delta_e\,\frac{S_{HT}}{S_{ref}}$ |
| Flow psi | $\Psi = V_\infty\, r \sin(2\pi\theta)\left(1 - \frac{R^2}{r^2}\right) + \frac{\Gamma}{2\pi}\log\frac{r}{R}$ |
| Fuel flow | $\dot{m} = \frac{p_0 A^{\star}}{\sqrt{T_0}}\sqrt{\frac{\gamma}{R}\left(\frac{2}{1+\gamma}\right)^{(\gamma+1)/(\gamma-1)}}$ |
| Jackson 2.11 | $\frac{q}{4\pi\epsilon y^2}\left(4\pi\epsilon\,\mathit{Volt}\,d - \frac{q d y^3}{(y^2-d^2)^2}\right)$ |
| Wave power | $-\frac{32}{5}\,\frac{G^4}{c^5}\,\frac{(m_1 m_2)^2 (m_1+m_2)}{r^5}$ |
| I.6.20 | $\frac{\exp\left(-\frac{(\theta/\sigma)^2}{2}\right)}{\sqrt{2\pi}\,\sigma}$ |
| I.9.18 | $\frac{G m_1 m_2}{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$ |
| I.15.3x | $\frac{x - ut}{\sqrt{1 - u^2/c^2}}$ |
| I.15.3t | $\frac{t - ux/c^2}{\sqrt{1 - u^2/c^2}}$ |
| I.30.5 | $\arcsin\left(\frac{\lambda}{n d}\right)$ |
| I.32.17 | $\frac{1}{2}\epsilon c E_f^2\,\frac{8\pi r^2}{3}\,\frac{\omega^4}{(\omega^2-\omega_0^2)^2}$ |
| I.41.16 | $\frac{h\omega^3}{\pi^2 c^2 \left(\exp\left(\frac{h\omega}{k_b T}\right) - 1\right)}$ |
| I.48.20 | $\frac{m c^2}{\sqrt{1 - v^2/c^2}}$ |
| II.6.15a | $\frac{p_d}{4\pi\epsilon}\,\frac{3z}{r^5}\sqrt{x^2+y^2}$ |
| II.11.27 | $\frac{n\alpha}{1 - n\alpha/3}\,\epsilon E_f$ |
| II.11.28 | $1 + \frac{n\alpha}{1 - n\alpha/3}$ |
| II.35.21 | $n_{\rho}\,\mathit{mom}\tanh\left(\frac{\mathit{mom}\,B}{k_b T}\right)$ |
| III.9.52 | $\frac{p_d E_f t}{h}\,\frac{\sin^2\left((\omega-\omega_0)t/2\right)}{\left((\omega-\omega_0)t/2\right)^2}$ |
| III.10.19 | $\mathit{mom}\sqrt{B_x^2 + B_y^2 + B_z^2}$ |

We have included three real-world data sets (RW). The data for *Friction $\mu_{stat}$* and *Friction $\mu_{dyn}$* have been collected in a standardized testing procedure for friction plates and contain two target variables that we model independently. The *Flow Stress* data set stems from hot compression tests of aluminium cylinders (Kabliman et al., 2019). The constraints for these data sets were set by specialists in the field.

The *Cars* data set has values for displacement, horsepower,
weight, acceleration time and fuel efficiency of almost 400 different car
models. This set was first used by Mammen et al. (2001) and then evaluated by Shah et al. (2016) to assess the performance of Support
Vector Regression with soft constraints.

### 4.2 Testing Methodology

Each one of the following experiments has been repeated 30 times and, for every repetition, we have measured the NMSE in percent for both training and test data. The algorithms receive only the training set as the input and only the best found model is applied to the test set. Runs with and without shape constraints have been executed on the same hardware.

The first experiment tests the performance of the unmodified algorithms, that is, without shape constraints. This test verifies whether unmodified SR algorithms identify conforming models solely from the observed data. A negative result means that constraint-handling mechanisms are required for finding conforming models.

In the second experiment, we test the performance of the two modified algorithms. This will verify whether the algorithms are able to identify models conforming to prior knowledge encoded as shape constraints.

This testing approach allows us to compare the prediction errors (NMSE) of models produced with and without shape constraints to assess whether the prediction error improves or deteriorates. We perform the same analysis for the two groups of instances with and without noise separately.

### 4.3 Algorithm Configurations

In the following, the abbreviation GP refers to tree-based GP without local optimization and GPC refers to tree-based GP with local optimization. Both algorithms can be used with and without shape constraints. Table 2 shows the parameter values used for the experiments with GP and GPC. ITEA refers to the Interaction-Transformation Evolutionary Algorithm and FI-2POP-IT (short form: FIIT) refers to the two-population version of ITEA which supports shape constraints. Table 3 shows the parameter values for ITEA and FI-2POP-IT.

Parameter | Value |
---|---|
Population size | 1000 |
Generations | 200 (20 for GPC with memetic optimization) |
Initialization | PTC2 |
$L_{max}$, $D_{max}$ | 50, 20 |
Fitness evaluation | NMSE with linear scaling |
Selection | Tournament (group size = 5) |
Crossover | Subtree crossover |
Mutation (one of) | Replace subtree with random branch; add $x \sim N(0,1)$ to all numeric parameters; add $x \sim N(0,1)$ to a single numeric parameter; change a single function symbol |
Crossover rate | 100% |
Mutation rate | 15% |
Replacement | Generational with a single elite |
Terminal set | Real-valued parameters and input variables |
Function set | $+, *, \%, \log, \exp, \sin, \cos, \tanh, x^2, \sqrt{x}$ |
GPC | Max. 10 iterations of Levenberg-Marquardt (LM) |


Parameter | Value |
---|---|
Population size | 200 |
Number of iterations | 500 |
Function set | $\sin, \cos, \tanh, \sqrt{\cdot}, \log, \mathrm{log1p}, \exp$ |
Fitness evaluation | RMSE |
Maximum number of terms (init. pop.) | 4 |
Range of strength values (init. pop.) | $[-4, 4]$ |
Min. and max. term length | 2, 6 |
Regression model | OLS |


For comparison we have used auto-sklearn (Feurer et al., 2015) (AML) and an implementation of shape-constrained polynomial regression (SCPR) (Curmei and Hall, 2020).^{2} Auto-sklearn does not allow setting monotonicity constraints; therefore, we can only use it as a comparison for symbolic regression models without constraints. We execute 30 independent repetitions for each problem, with each run limited to one hour.

The degree for SCPR has been determined for each problem instance using 10-fold cross-validation with the best CV-RMSE as the selection criterion. We tested polynomials up to degree eight (except for I.9.18, where we used a maximum degree of five because of runtime limits). The runtime for the grid search was limited to one hour for each problem instance. The final model is then trained with the selected degree on the full training set. SCPR is a deterministic algorithm; therefore, only a single run was required for each problem instance.
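The degree selection is a plain cross-validated grid search. The one-dimensional stand-in below uses ordinary least-squares polynomial fitting in place of the SCPR solver (function names and setup are ours, for illustration only):

```python
import numpy as np

def select_degree(x, y, max_degree=8, n_folds=10, seed=0):
    """Choose the polynomial degree with the best cross-validated RMSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    best_degree, best_rmse = None, np.inf
    for degree in range(1, max_degree + 1):
        fold_errs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([f for j, f in enumerate(folds) if j != k])
            # Ordinary least-squares polynomial fit on the training folds.
            coef = np.polyfit(x[train], y[train], degree)
            resid = y[test] - np.polyval(coef, x[test])
            fold_errs.append(np.sqrt(np.mean(resid ** 2)))
        rmse = float(np.mean(fold_errs))
        if rmse < best_rmse:
            best_degree, best_rmse = degree, rmse
    return best_degree, best_rmse
```

SCPR additionally enforces the shape constraints during fitting, which this unconstrained stand-in omits.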

## 5 Results

In this section, we report the obtained results in tabular and graphical forms focusing on the violations of the shape constraints, goodness of fit, computational overhead, and extrapolation capabilities with and without shape constraints.

### 5.1 Constraint Violations

The procedure for checking constraint violations is as follows: for each problem instance, we sample one million points uniformly from the full input space to produce a data set for checking constraint violations. The final SR solutions as well as their partial derivatives are then evaluated on this data set. If there is at least one point for which the shape constraints are violated, we count the SR solution as infeasible.
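A minimal sketch of this a-posteriori feasibility check for a single monotonicity constraint (helper names are ours):

```python
import numpy as np

def count_violations(grad_fn, bounds, n_samples=1_000_000, seed=1):
    """Sample points uniformly from the full input space and count how many
    violate a monotonicity constraint df/dx_j >= 0, where grad_fn returns
    the partial derivative of the model at each sampled point."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_samples, len(bounds)))
    return int(np.sum(grad_fn(X) < 0.0))

# Example: f(x) = x^2 on [-1, 1] has derivative 2x, which is negative for
# about half of the sampled points, so the model counts as infeasible.
n_bad = count_violations(lambda X: 2.0 * X[:, 0], [(-1.0, 1.0)],
                         n_samples=10_000)
print(n_bad > 0)   # True
```

A single violating sample suffices to mark the solution infeasible; counting all violations additionally indicates how severely the constraint is broken.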

Figures 1a to 1c show, as an example, the number of infeasible SR models for each algorithm and constraint over the 30 runs for the FDE data sets. The picture is similar for all problem instances. We observe that without shape constraints none of the algorithms produces feasible solutions in every run.

Only for the easiest data sets (e.g., *Fuel Flow*) did we rediscover the data-generating functions (which necessarily conform to the constraints) in multiple runs. Over all data sets, ITEA has the highest probability of producing infeasible models. We observe that even a small amount of noise increases the likelihood of producing infeasible models. In short, we cannot hope to find feasible solutions for noisy data sets using SR algorithms without shape constraints. This is consistent with the observation in Bladek and Krawiec (2019).

The results for the extended algorithms show that, with few exceptions, all three SR algorithms are capable of finding feasible solutions most of the time.

### 5.2 Interpolation and Extrapolation

The difference between SR models and shape-constrained SR models becomes clearer when we visualize their outputs.

Figures 2a to 2f show the predictions of every SR model identified for the *Friction* $\mu_{stat}$ data set. The partial dependence plots show the predictions for $\mu_{stat}$ for all 30 models over the complete range of values allowed for the two variables $p$ and $T$. The plot of $\mu_{stat}$ over $p$ shows the predictions when $v$ and $T$ are fixed to their median values, and the plot of $\mu_{stat}$ over $T$ shows the predictions when $p$ and $v$ are fixed. The dashed vertical lines mark the subspace from which we have sampled points for the training and test data sets.

Without constraints, the algorithms produce accurate predictions for the *Friction* data sets (see Figure 2e as well as Table 4). However, the models do not conform to prior knowledge: we would expect $\mu_{stat}$ to decrease with increasing pressure and temperature, yet the models show a slight increase in $\mu_{stat}$ when increasing $p$.

Instance | GP | GPC | ITEA | AML | GP (info) | GPC (info) | FIIT (info) | SCPR (info) |
---|---|---|---|---|---|---|---|---|
Friction $\mu_{dyn}$ | 8.28 | 7.73 | 4.35 | 12.99 | 12.53 | 16.30 | 35.90 | 8.07 |
Friction $\mu_{stat}$ | 7.22 | 5.44 | 4.46 | 6.82 | 9.98 | 7.76 | 11.83 | 1.77 |
Flow stress | 8.36 | 4.66 | – | 0.15 | 34.05 | 26.04 | 68.16 | 19.46 |
Cars | 75.18 | 76.23 | 75.06 | 74.72 | 76.86 | 77.67 | 76.64 | 73.83 |
**without noise** | | | | | | | | |
Aircraft lift | 0.63 | 0.15 | 0.22 | 0.00 | 0.80 | 1.01 | 0.14 | 0.00 |
Flow psi | 0.75 | 0.13 | 0.05 | 0.00 | 4.80 | 5.36 | 2.91 | 0.00 |
Fuel flow | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Jackson 2.11 | 0.00 | 0.00 | 0.00 | 0.62 | 0.00 | 0.00 | 0.00 | 0.90 |
Wave Power | 14.55 | 30.82 | 2.26 | 2.74 | 18.71 | 80.34 | 21.31 | 13.50 |
I.6.20 | 0.46 | 0.00 | 0.31 | 0.00 | 1.61 | 0.42 | 3.20 | 0.01 |
I.9.18 | 2.88 | 2.49 | 0.91 | 4.98 | 4.03 | 16.16 | 0.74 | 1.20 |
I.15.3x | 0.34 | 0.01 | 0.01 | 0.02 | 0.36 | 0.04 | 0.01 | 0.01 |
I.15.3t | 0.21 | 0.01 | 0.00 | 0.00 | 0.15 | 0.03 | 0.00 | 0.00 |
I.30.5 | 0.00 | 0.00 | 0.00 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 |
I.32.17 | 0.76 | 1.13 | 8.07 | 42.36 | 2.07 | 12.76 | 2.42 | 7.79 |
I.41.16 | 2.78 | 2.29 | 1.08 | 15.14 | 8.99 | 17.72 | 5.15 | 1.56 |
I.48.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
II.6.15a | 3.55 | 2.50 | 4.66 | 16.17 | 4.67 | 7.30 | 32.12 | 1.01 |
II.11.27 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 |
II.11.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
II.35.21 | 3.29 | 1.18 | 2.61 | 1.40 | 3.67 | 6.10 | 3.22 | 1.34 |
III.9.52 | 116.25 | 20.44 | 66.16 | 19.88 | 104.56 | 106.48 | 71.93 | 33.41 |
III.10.19 | 0.41 | 0.04 | 0.17 | 0.01 | 0.55 | 0.33 | 0.31 | 0.00 |
**with noise** | | | | | | | | |
Aircraft lift | 0.45 | 0.26 | 0.25 | 0.32 | 1.24 | 1.30 | 0.28 | 0.46 |
Flow psi | 0.75 | 0.29 | 0.21 | 0.37 | 5.90 | 6.02 | 4.65 | 0.57 |
Fuel flow | 0.21 | 0.24 | 0.18 | 0.34 | 0.30 | 0.30 | 0.25 | 0.25 |
Jackson 2.11 | 0.28 | 0.31 | 0.38 | 3.18 | 0.24 | 0.25 | 0.30 | 0.83 |
Wave Power | 21.23 | 51.36 | 99.88 | 44.83 | 22.36 | 68.96 | 21.39 | 11.88 |
I.6.20 | 1.09 | 0.40 | 0.56 | 0.45 | 2.14 | 0.78 | 3.61 | 0.55 |
I.9.18 | 3.77 | 3.55 | 1.56 | 4.02 | 5.25 | 15.70 | 1.33 | 1.62 |
I.15.3x | 0.55 | 0.36 | 0.38 | 0.37 | 0.56 | 0.35 | 0.36 | 0.42 |
I.15.3t | 0.65 | 0.48 | 0.58 | 0.53 | 0.59 | 0.51 | 0.48 | 0.45 |
I.30.5 | 0.34 | 0.35 | 0.62 | 0.81 | 0.32 | 0.33 | 0.34 | 0.39 |
I.32.17 | 0.78 | 3.14 | 8.50 | 47.60 | 3.95 | 14.02 | 2.53 | 6.22 |
I.41.16 | 3.13 | 2.32 | 3.47 | 15.19 | 6.68 | 19.72 | 5.05 | 2.93 |
I.48.20 | 0.37 | 0.36 | 0.51 | 0.35 | 0.32 | 0.32 | 0.34 | 0.32 |
II.6.15a | 3.08 | 2.88 | 7.56 | 19.29 | 3.87 | 6.05 | 45.32 | 1.76 |
II.11.27 | 0.35 | 0.39 | 1.06 | 0.62 | 0.37 | 0.61 | 0.41 | 0.47 |
II.11.28 | 0.38 | 0.44 | 0.39 | 0.51 | 0.27 | 0.29 | 0.38 | 0.30 |
II.35.21 | 3.88 | 1.33 | 2.43 | 2.10 | 4.38 | 7.49 | 4.27 | 1.34 |
III.9.52 | 126.84 | 18.91 | 74.08 | 24.81 | 106.56 | 90.18 | 73.44 | 32.69 |
III.10.19 | 0.85 | 0.38 | 0.70 | 0.64 | 0.91 | 0.64 | 0.70 | 0.46 |


The model predictions for shape-constrained SR are shown in Figures 2b, 2d, and 2f. The visualization clearly shows that there is higher variance and that all algorithms produced a few bad or even constant solutions. This invalidates our hypothesis that shape-constrained SR leads to improved predictive accuracy and instead indicates that there are convergence issues with our approach of including shape constraints. We will discuss this in more detail in the following sections. The visualization also shows that the solutions with shape constraints have better extrapolation behavior and conform to shape constraints.

### 5.3 Goodness-of-Fit

Table 4 shows the median NMSE of best solutions over 30 runs obtained by each algorithm, with and without shape constraints, on the test sets. The training results can be found in the supplementary material.

The table has six blocks: the left block shows the results without side information and the right block shows the results with shape constraints; the top panel shows results for the RW instances, the middle panel the results for the FDE and FYE instances, and the bottom panel the results for the same instances including 0.25% noise.

Analyzing the detailed results, we observe that the best result of all models without information is better for 18 of 42 instances (RW: 2, no noise: 6, noisy: 10), while the best result of models with information is better for 14 of 42 instances (RW: 2, no noise: 3, noisy: 9).

Within both groups there are algorithms with significantly different test errors
(without info: *p*-value: 0.0103, with info: *p*-value: $6.467\xb710-6$,
Friedman's rank sum test with Davenport correction). Pairwise comparison
of results without info shows that GPC is better than GP
(*p*-value: 0.011) and AML (*p*-value: 0.043)
using Friedman's rank sum test with Bergman correction. For the problem
instances with noise, AML produced the best result for only 1 out of 19^{4} instances. For the instances without noise, the
results of AML are similar to results of GP, GPC, and ITEA. Pairwise comparison
of results with info shows that SCPR is better than the other algorithms and no
statistically significant difference was found between GP, GPC and FIIT. The *p*-values for all pairwise tests are shown in Table 5.

Without info (GP, GPC, ITEA, AML):

 . | GP | GPC | ITEA | AML |
---|---|---|---|---|
GP | n/a | 0.011 | 0.325 | 0.499 |
GPC | 0.011 | n/a | 0.325 | 0.043 |
ITEA | 0.325 | 0.325 | n/a | 0.353 |
AML | 0.499 | 0.043 | 0.353 | n/a |

With info (GP, GPC, FIIT, SCPR):

 . | GP | GPC | FIIT | SCPR |
---|---|---|---|---|
GP | n/a | 0.151 | 0.353 | 0.002 |
GPC | 0.151 | n/a | 0.054 | 0.000 |
FIIT | 0.353 | 0.054 | n/a | 0.028 |
SCPR | 0.002 | 0.000 | 0.028 | n/a |
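The omnibus comparison can be reproduced in outline with SciPy's plain Friedman test (without the correction used in our analysis). The values below are a small subset of the without-info columns of Table 4; with only five instances the test is underpowered, whereas the analysis above uses all 42:

```python
from scipy.stats import friedmanchisquare

# Test NMSE values for five instances (rows of Table 4, w/o-info columns:
# Friction mu_dyn, Friction mu_stat, Aircraft lift, Flow psi, Wave Power).
gp   = [8.28, 7.22, 0.63, 0.75, 14.55]
gpc  = [7.73, 5.44, 0.15, 0.13, 30.82]
itea = [4.35, 4.46, 0.22, 0.05, 2.26]
aml  = [12.99, 6.82, 0.00, 0.00, 2.74]

# Friedman's rank sum test: do the algorithms differ across instances?
stat, p = friedmanchisquare(gp, gpc, itea, aml)
print(f"chi-squared = {stat:.2f}, p = {p:.3f}")
```

The pairwise post-hoc tests with correction for multiple comparisons are not part of SciPy and require a dedicated statistics package.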


Comparing the results with and without constraints for each algorithm individually, we find that the results are in general worse when using constraints: GP is better than GP (info) for 27 instances, GPC is better than GPC (info) for 32 instances, and ITEA is better than FIIT for 19 instances. For the RW data sets, ITEA managed to find the best models on two of the four instances. For *Flow Stress*, however, ITEA returned a solution that produced numerical errors for the test set. This is not the case when we include the shape constraints, as we can see in the top-right block: FI-2POP-IT was capable of finding expressions that did not return invalid results for the test set.

### 5.4 Computational Overhead

Another important consideration is the computational overhead that constraint handling introduces in each approach.

For ITEA, the execution time approximately doubles when using constraint handling, because parameter optimization is much easier for the ITEA representation and the calculation of the partial derivatives is a simple mechanical process, as shown in Section 3.5.1. For GP and GPC, the execution time increases by a factor of approximately five when including shape constraints. The increase for GP results from the additional effort of building the partial derivatives for each solution candidate and of the interval evaluation. We observed that the increase in execution time is less extreme for problem instances with a large number of rows, where the relative effort for symbolic derivation of solution candidates becomes smaller.

## 6 Discussion

The results presented in the previous section largely corroborate our initial assumptions for shape-constrained symbolic regression. First of all, when we do not explicitly consider shape constraints within SR algorithms, we are unlikely to find solutions which conform to expected behavior. We showed that the results produced by the two newly introduced algorithms in fact conform to shape constraints. Our assessment of the extrapolation and interpolation behavior of SR models highlighted the bad extrapolation behavior as well as occasional problems even for interpolation. The improved results when including shape constraints support the argument to include interval arithmetic to improve the robustness of SR solutions (Keijzer, 2003; Pennachin et al., 2010).

However, our results also show that including shape constraints via interval arithmetic leads to SR solutions with higher prediction errors on training and test sets. While the increased error on the training set is expected, we hoped we would be able to improve prediction errors on the test set. Assuming that the constraints are correct, this should hypothetically be possible because the shape constraints provide additional information for improving the fit on noisy data sets. Instead, we observed that the results got worse when using information for tree-based GP with and without local optimization (GPC). There are several possible explanations such as slower convergence, more rapid loss of diversity, or rejection of feasible solutions because of the pessimistic bounds produced by interval arithmetic. Another hypothesis is that the positive effect of shape constraints becomes more relevant with higher noise levels. We are not able to give a conclusive answer for the main cause of the higher prediction errors with side information and leave this question open for future research.
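The pessimism of interval arithmetic stems from the so-called dependency problem: repeated occurrences of the same variable are bounded independently. A minimal sketch (illustrative only, not the implementation used in our experiments):

```python
# Interval arithmetic propagates a (lower, upper) bound through each
# operation, treating every operand as independent.

def i_sub(a, b):
    """Difference of two intervals."""
    return (a[0] - b[1], a[1] - b[0])

def i_mul(a, b):
    """Product of two intervals."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

x = (0.0, 1.0)            # x in [0, 1]
one = (1.0, 1.0)

# Exact image of x - x is {0}, but interval evaluation yields [-1, 1]:
print(i_sub(x, x))                 # (-1.0, 1.0)
# Exact image of x * (1 - x) on [0, 1] is [0, 0.25], but:
print(i_mul(x, i_sub(one, x)))     # (0.0, 1.0)
```

Such over-wide bounds can cause a model that actually satisfies a shape constraint to be classified as infeasible, which is one of the candidate explanations listed above.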

Comparison with AutoML as implemented by auto-sklearn showed that GP with parameter
optimization (GPC) produced better test results than AutoML
(*p*-value: 0.011) over the benchmark set without shape constraints.
However, AutoML does not support monotonicity constraints and, because of that, we
cannot use it to compare with the results using side information. Therefore, we
compared the results of our proposed algorithms with shape-constrained polynomial
regression (SCPR). The results show that SCPR performs better than the evolutionary
algorithms for this benchmark set, which indicates that we can find a good
approximation of many of our benchmark instances using polynomials.

An advantage of SCPR is that it is formulated as a convex optimization problem that can be solved efficiently and deterministically by highly tuned solvers. A drawback of SCPR is the potentially large model size. The number of terms of an $n$-variate polynomial of degree $d$ is $\binom{n+d}{n}$. Our experiments found that polynomial degrees of up to eight were required to find a good fit. The studied problems had five variables on average, which led to more than a thousand terms in our polynomial models. The models produced by the evolutionary algorithms were, on average, much smaller, as we used a limit of 50 nodes for GP expression trees and 6 terms for ITEA/FI-2POP-IT.
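As a quick sanity check of the term count, using only Python's standard library:

```python
from math import comb

# Number of monomials of an n-variate polynomial of degree up to d is
# C(n + d, n). For the average benchmark setting (n = 5, d = 8):
print(comb(5 + 8, 5))  # 1287, i.e., more than a thousand terms
```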

## 7 Conclusions

In this article, we have introduced shape-constrained symbolic regression which allows the inclusion of prior knowledge into SR. Shape-constrained symbolic regression allows us to enforce that the model output must be within given bounds, or that outputs must be monotonically increasing or decreasing over selected inputs. The structure and the parameters of the symbolic regression model are, however, still identified by the algorithm.

We have described two algorithms for shape-constrained symbolic regression which are extensions of tree-based genetic programming with optional local optimization and the Interaction-Transformation Evolutionary Algorithm that uses a more restrictive representation with the goal of returning simpler expressions.

The extensions add the ability to calculate partial derivatives for any expression generated by the algorithms and use interval arithmetic to determine bounds and determine if models conform to shape constraints. Two approaches for handling constraint violations have been tested. The first approach simply adjusts fitness for infeasible solutions, while the second approach splits the population into feasible and infeasible solutions.

The results showed the importance of treating the shape constraints inside the algorithms. First of all, we have collected more evidence that without any feasibility control, it is unlikely to find feasible solutions for most of the problems. Next, we verified the efficacy of our approach by measuring the frequency of infeasible solutions and reporting the median numerical error of our models. The modified algorithms were all capable of finding models conforming to the shape constraints. This shows that the introduction of shape constraints can help us find more realistic models. However, we have also found that the extended algorithms with shape constraints produce worse solutions on the test set on average.

For the next steps, we intend to analyze in detail the causes for the worse solutions with shape-constrained SR. The bounds determined via interval arithmetic are very wide and might lead to rejection of feasible solutions as well as premature convergence. This is an issue that could potentially be solved by using more elaborate bound estimation schemes such as affine arithmetic or recursive splitting. Other possibilities for the constraint-handling include multiobjective optimization and penalty functions. Alternatively, optimistic approaches (e.g., using sampling) or a hybridization of pessimistic and optimistic approaches for shape-constrained regression can be used to potentially improve the results. Additionally, it would be worthwhile to study the effects of constraint-handling mechanisms on population diversity in more detail.

## Acknowledgments

This project is partially funded by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), grant number 2018/14173-8. Some of the experiments (ITEA) made use of the Intel® AI DevCloud, to which Intel® provided free access.

## Notes

^{2}

We initially also used XGBoost, a popular implementation of gradient boosted trees, because it has previously been found to compare well with symbolic regression methods (Orzechowski et al., 2018) and it supports monotonicity constraints. However, its results were much worse than the symbolic regression results, which indicates that the method is not particularly well-suited for the problem instances in the benchmark set, which are smooth and of low dimensionality.

## References

*SIAM Journal on Applied Algebra and Geometry*

*Modern compressible flow: With historical perspective*

*Fundamentals of aerodynamics*

*Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation (GECCO)*

*Proceedings of the Genetic and Evolutionary Computation Conference*

*Genetic and Evolutionary Computation Conference*

*Genetic Programming Theory and Practice X*

*Expert Systems with Applications*

*2018 IEEE Congress on Evolutionary Computation*

*Information Sciences*

*Evolutionary Computation*

*Proceedings of the Genetic and Evolutionary Computation Conference Companion*

*Complex Systems*

*Advances in neural information processing systems*

*Journal of Machine Learning Research*

*Statistical Science*

*Journal of Machine Learning Research*

*Journal of the ACM (JACM)*

*Evolutionary Computation*

*AIP Conference Proceedings*

*European Conference on Genetic Programming*

*European Journal of Operational Research*

*Genetic Programming and Evolvable Machines*

*Genetic programming: On the programming of computers by means of natural selection*

*IEEE Transactions on Biomedical Engineering*

*European Conference on the Applications of Evolutionary Computation*

*Annual Reviews in Control*

*IEEE Transactions on Evolutionary Computation*

*Statistical Science*

*Genetic Programming Theory and Practice IX*

*Proceedings of the 10th Annual Conference Companion on Genetic and Evolutionary Computation*

*International Conference on Parallel Problem Solving from Nature*

*Complex Systems*

*IEEE Transactions on Evolutionary Computation*

*Proceedings of the Genetic and Evolutionary Computation Conference*

*Journal of Computational and Graphical Statistics*

*Evolutionary Computation*

*Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation*

*Proceedings of the International Conference on Learning Representations*

*Genetic programming*

*Nature Machine Intelligence*

*Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation*

*International Conference on Computational Intelligence in Music, Sound, Art and Design*

*COMAD*

*Genetic Programming Theory and Practice II*

*Thirty-First AAAI Conference on Artificial Intelligence*

*2019 IEEE Congress on Evolutionary Computation*

*Technometrics*

*Proceedings of the Genetic and Evolutionary Computation Conference (GECCO)*

*Science Advances*

*Computer Methods in Applied Mechanics and Engineering*

*Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation*

*The Annals of Statistics*

*Journal of Computational Physics*