Abstract
Conventional genetic programming (GP) can guarantee only that synthesized programs pass tests given by the provided input-output examples. The alternative to such a test-based approach is synthesizing programs by formal specification, typically realized with exact, non-heuristic algorithms. In this article, we build on our earlier study on Counterexample-Driven Genetic Programming (CDGP), an evolutionary heuristic that synthesizes programs from formal specifications. The candidate programs in CDGP undergo formal verification with a Satisfiability Modulo Theories (SMT) solver, which results in counterexamples that are subsequently turned into tests and used to calculate fitness. The original CDGP is extended here with a fitness threshold parameter that decides which programs should be verified, a more rigorous mechanism for turning counterexamples into tests, and other conceptual and technical improvements. We apply it to 24 benchmarks representing two domains: the linear integer arithmetic (LIA) and the string manipulation (SLIA) problems, showing that CDGP can reliably synthesize provably correct programs in both domains. We also confront it with two state-of-the-art exact program synthesis methods and demonstrate that CDGP effectively trades longer synthesis time for smaller program size.
1 Introduction
Genetic programming (GP) is an inductive program synthesis technique, in which desired program behavior is defined by a set of input-output test cases. While this kind of specification is usually easy to provide, it is by definition incomplete: in general nothing (or at best, very little) can be guaranteed about program behavior for other inputs. Guaranteeing correctness by enumerating all inputs is impossible, except for toy examples. This is limiting, as many applications require a guarantee of correct program behavior for every possible input (of which there may be infinitely many), or a guarantee that a certain property is always met. Examples include hardware design, safety-critical systems, and finding mathematical structures with certain properties.
An alternative to test-based specification is to encode the semantics of a program in some formal logic. Program behavior is then reasoned about within that logic, confronting it with a formal specification which defines the desired behavior. When thus conducted, the reasoning certifies program correctness. If the answer is positive, the program is guaranteed to behave as desired for all inputs allowed by the specification, thereby addressing the problem of incomplete testing. Otherwise, a counterexample can be constructed that exemplifies a flaw in the program. Program verification is nowadays a mature branch of research in computer science, offering a range of efficient tools which facilitate reasoning about program properties (Section 2).
Verification cannot be used directly for synthesis, as it requires a program to work with; it cannot come up with program candidates on its own. In Krawiec et al. (2017), we proposed counterexample-driven genetic programming (CDGP), an approach in which a GP algorithm serves as such a generator. The method, detailed in Section 3, submits the candidate programs to verification, collects the counterexamples produced whenever a program fails to meet the prescribed specification, and uses them as test cases, thereby eliciting the fitness to guide the GP search process. To the best of our knowledge, CDGP is the first GP-based approach utilizing counterexamples obtained via formal verification.
This study extends Krawiec et al. (2017) in several ways. Firstly, the original CDGP came in two variants, namely conservative and non-conservative, which differed in the policy used to decide when a (potentially costly) call to the verifier should be made. Here, we unify those variants and control this aspect with a continuous parameter, which also allows us to explore intermediate strategies of engaging the verifier. Secondly, we provide a more rigorous and systematic approach to evaluation modes, properties of the specification-based problems, and their consequences for the workings of CDGP (Section 3.2). Thirdly, we extend CDGP beyond the integer domain, making it applicable to programs that operate on character strings. Fourthly, in the experimental part (Section 5), we apply CDGP to a broader suite of benchmarks, and provide a more in-depth presentation of results and their analysis. Last but not least, we confront CDGP with formal synthesizers (Section 5.7), which leads to interesting insights about their advantages and disadvantages compared to CDGP and GP in general.
2 Formal Verification
In formal approaches to program verification and synthesis, it is usually assumed that the desired behavior of a program is given in the form of a contract, a pair of logical formulas: a precondition, $Pre$, the constraint imposed on program input, and a postcondition, $Post$, a logical clause that should hold upon program completion.
For the Max2 problem (2), inputs $in$ to the program would be two integer variables $x$ and $y$. If the solver produces an assignment to $in$ that validates (3), $p$ does not behave as specified and is thus incorrect. The variable assignment in question, called a model in propositional logic, forms a counterexample. Otherwise, $p$ meets the contract and is thus correct. Crucially, the solver performs verification without actually running the program, because the properties of the output can be logically inferred from the properties of the input and those of the program code.
In practice, the solver must be equipped with an additional abstraction layer, a theory, in order to be able to reason in terms of, for instance, integer arithmetic (which we already assumed in the above Max2 example). This leads to the concept of satisfiability modulo theories (SMT) used both in program verification and synthesis (Jha et al., 2010; Gulwani et al., 2010; Srivastava et al., 2010). For Max2, the theory of linear integer arithmetic (LIA) (Barrett et al., 2016) may be used.
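To make the verification step described above concrete, the following minimal sketch mimics it for Max2 with an exhaustive check over a small integer range. It is only a stand-in: an SMT solver reasons symbolically over all integers and does not run the program. The names `post_max2` and `find_counterexample`, the search bound, and the exact form of the postcondition (the output is no smaller than either input and equals one of them) are assumptions made for illustration.

```python
from itertools import product

def post_max2(x, y, out):
    # Assumed Max2 postcondition: out bounds both inputs and equals one of them.
    return out >= x and out >= y and (out == x or out == y)

def find_counterexample(program, bound=3):
    # Exhaustive search over [-bound, bound]^2; a bounded stand-in for asking
    # the solver whether some input violates the postcondition.
    for x, y in product(range(-bound, bound + 1), repeat=2):
        if not post_max2(x, y, program(x, y)):
            return (x, y)   # plays the role of "sat" together with a model
    return None             # plays the role of "unsat" (within the bound only)

# Hypothetical candidates: a faulty one and one mirroring (ite (>= x y) x y).
smaller = lambda x, y: y if x >= y else x
larger = lambda x, y: x if x >= y else y

print(find_counterexample(smaller))
print(find_counterexample(larger))
```

Unlike this bounded search, a `None` result from a real verifier would certify correctness for every admissible input, which is what distinguishes verification from testing.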
2.1 Bridging Test-Based and Formal Specifications
Test-based synthesis and spec-based synthesis may seem irreconcilable due to the different ways in which they specify the desired behavior of programs. In fact, they can be united, and illustrating this can be useful in explaining the rationale behind CDGP.
For the Max2 problem (2), the following set of input-output pairs might form a test-based specification:
$x$  $y$  $max(x,y)$ 
0  0  0 
1  0  1 
4  3  4 
$\cdots $  $\cdots $  $\cdots $ 
3 Counterexample-Driven Genetic Programming
Example 1 illustrates that the feedback obtained from verification (whether applied to formal specification or to tests) can be only twofold: success or failure. Program verification is thus of little help, at least in such a simple scenario, for search-based synthesis algorithms (such as GP) that require more graded guidance through the search space. Though in principle the postcondition could be rephrased to calculate also the number of passed constraints (cf. Johnson, 2007; see also related work in Section 4), this can be done much more easily (and more efficiently) by just running the program on tests, as practiced in GP. Therefore, rather than trying to elicit richer feedback from verification, in CDGP we rely on conventional GP fitness. Formal verification is used only to decide whether a given candidate program is correct, while the counterexamples become new test cases and provide the algorithm with a search gradient.
Figure 1 presents the conceptual diagram of CDGP. The GP search module is a conventional GP algorithm (equivalently, any generate-and-test metaheuristic) equipped with arbitrary selection and search operators. The generated solutions are evaluated by the Testing module by running them on the set $Tc$ of test cases collected during the run. After evaluation, some candidate programs are sent to the Verification module, which performs verification using an SMT solver and pushes the resulting counterexamples to $Tc$, gradually increasing the test base. Note that technically counterexamples consist only of program inputs and are therefore not fully formed tests; we detail and handle this issue in Section 3.2, but for the time being use these terms interchangeably.
3.1 Evaluation and Verification
The complete evaluation stage is realized by function CDEval shown in Algorithm 1. CDEval consists of both testing and verification and is launched once per generation. In contrast to conventional GP evaluation that utilizes only a set of tests $Tc$ (which is initially empty here), CDEval accepts also a formal specification $(Pre,Post)$. Upon completion, CDEval returns the evaluated population $P$ and the updated set of tests $Tc \cup T$. Eval$(p,Tc)$ is the conventional fitness function that returns the number of tests from $Tc$ that are passed by $p$. Verify$(p,(Pre,Post))$ executes the solver and returns the counterexample resulting from verifying $p$ on the specification $(Pre,Post)$. We illustrate Verify with the following example.
After execution of check-sat, the get-value statement is used to retrieve the values of $x$ and $y$ found by the solver, and those values form a counterexample. The returned counterexample depends on solver tactics; the Z3 solver (de Moura and Bjørner, 2008) that we use in the experimental section responds here with (x=1, y=0). The reader can verify that the result of $p$ for this input is indeed incorrect with respect to the specification. When a correct program is verified (i.e., (ite (>= x y) x y) or any semantically equivalent program), then the solver returns unsat, the search process stops, and the program is returned. $\square$
In the first generation of CDEval (Algorithm 1) $Tc=\emptyset$, so all programs in $P$ receive zero fitness and the attendant selection of parent programs is random. Nevertheless, this first generation will typically discover a few counterexamples, which provide for some degree of discrimination of programs in the second generation. In this way, the verification outcomes supply CDGP with an increasingly fine-grained fitness function and more precise search gradient.
Which of the evaluated programs should be subject to verification is an important design choice, to which we pay special attention in this study. In the conservative variant which we introduced in Krawiec et al. (2017), we verified only the programs that passed all test cases previously collected in $Tc$. This variant is not only computationally efficient (verification can be costly), but also arguably most natural, as a program that does not pass all available tests cannot by definition be correct. However, by requiring all tests to be passed, the conservative scheme can lead to stagnation: it can happen that the SMT solver produces a test which is particularly difficult, and the GP algorithm may struggle to generate a program capable of passing it (and all other tests in $Tc$ simultaneously). As a result, many generations may elapse before GP produces a program good enough for the next verification and $Tc$ becomes augmented by the resulting counterexample. Such a course of events can be particularly harmful in the initial stages of a CDGP run, when $Tc$ is small and thus provides little search gradient.
To mitigate this problem, in this study we extend CDGP with the fitness threshold $q$, the ratio of passed tests from $Tc$ required to apply verification (line 5 of Algorithm 1). For the conservative variant, $q=1.0$. The other extreme is setting $q=0.0$ (dubbed non-conservative in Krawiec et al. (2017)), which causes all evaluated programs to undergo verification, regardless of fitness value, and is thus rather costly in computation. We anticipate that setting $q$ to intermediate values can be beneficial, avoiding the above-mentioned stagnation on one hand, and excessive cost of verification on the other.
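The evaluation stage with the threshold $q$ can be sketched as follows. This is a simplified reading of the description above, not a transcription of Algorithm 1: the function name `cd_eval`, its signature, the treatment of an empty test set (every program qualifies for verification), and the brute-force stand-ins `verify_max2` and `complete_max2` for the solver are all assumptions made for illustration.

```python
from itertools import product

def cd_eval(population, tests, verify, complete, q=1.0):
    """One evaluation pass; returns (evaluated, updated tests, correct program or None)."""
    new_tests, evaluated = [], []
    for p in population:
        passed = sum(1 for inp, out in tests if p(*inp) == out)   # Eval(p, Tc)
        evaluated.append((p, passed))
        # Assumption: with an empty Tc every program qualifies for verification.
        ratio = passed / len(tests) if tests else 1.0
        if ratio >= q:
            cex = verify(p)                                        # Verify(p, (Pre, Post))
            if cex is None:
                return evaluated, tests + new_tests, p            # verified: stop the search
            test = (cex, complete(cex))
            if test not in tests and test not in new_tests:
                new_tests.append(test)
    return evaluated, tests + new_tests, None

# Stand-ins for the solver on Max2 (bounded brute force, not real verification).
def verify_max2(p, bound=3):
    for x, y in product(range(-bound, bound + 1), repeat=2):
        if p(x, y) != max(x, y):
            return (x, y)
    return None

def complete_max2(inp):
    return max(inp)

smaller = lambda x, y: y if x >= y else x   # hypothetical faulty candidate
first = lambda x, y: x                      # hypothetical faulty candidate
larger = lambda x, y: x if x >= y else y    # correct program
```

With `q=1.0` only programs passing every collected test reach `verify`, whereas with `q=0.0` every program does; intermediate values interpolate between the two policies.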
3.2 Turning Counterexamples into Tests
As signaled above and illustrated in Example 2, counterexamples are instances of program input ($in$), and as such do not form tests, which also require the associated correct program output ($out$). Therefore, we allow $Tc$ to hold two types of tests:

Complete tests. A complete test is a test of the form $(in,out)$. It is equivalent to the notion of test used in conventional GP and can be evaluated by executing the program on $in$ and comparing the obtained result to the expected output $out$. CDGP uses this mode of evaluation for all complete tests.

Incomplete tests. An incomplete test is a test of the form $(in,null)$. The expected output for this test is not defined explicitly. This can happen, for example, if there are many (potentially even infinitely many) correct outputs for $in$.
In conventional test-based GP, incomplete tests are useless, as they do not say anything about the desired program behavior. In the spec-based CDGP, programs can still be evaluated on them by sending an appropriate query to the SMT solver, which we demonstrate in the following example.
For this query, the solver returns unsat, because $p$ does not meet the specification for the considered input. For some other inputs however (e.g., (x=0, y=0)), this incorrect program may pass the specification, causing the solver to return sat. $\square$
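The idea of evaluating a program on an incomplete test can be sketched without a solver when the postcondition can be evaluated directly on concrete values. In the sketch below, the helper `passes_incomplete`, the faulty candidate `smaller`, and the form of the Max2 postcondition are illustrative assumptions; in CDGP this check is posed as a solver query rather than computed in the host language.

```python
def post_max2(x, y, out):
    # Assumed Max2 postcondition: out bounds both inputs and equals one of them.
    return out >= x and out >= y and (out == x or out == y)

def passes_incomplete(program, inp, post):
    # An incomplete test (in, null) carries no expected output, so the
    # program's own output is checked against the postcondition instead.
    return post(*inp, program(*inp))

# Hypothetical faulty candidate that returns the smaller argument.
smaller = lambda x, y: y if x >= y else x

print(passes_incomplete(smaller, (1, 0), post_max2))   # fails for this input
print(passes_incomplete(smaller, (0, 0), post_max2))   # passes for this input
```

The two calls mirror the two outcomes discussed above: the same incorrect program is rejected on one input and accepted on another.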
Though the possibility of testing programs on incomplete tests is appealing, it has a critical downside: calls to the solver are computationally costly, and the above approach becomes prohibitively expensive in the presence of many incomplete tests. Therefore, CDGP transforms incomplete tests into complete ones whenever possible. This, however, requires meeting certain formal properties that we detail in the following.
3.2.1 Single-Output Property
As we aim at collecting complete tests in $Tc$ and so avoiding costly solver calls to determine an output's correctness (Example 3), we must require that a given input has the single-output property, that is, it has only one correct corresponding output. If we ignored that aspect and associated an arbitrarily selected correct output $out$ with a given $in$, the resulting test $(in,out)$ could unfairly fail many programs that return a different correct output for $in$.
To address this issue, prior to applying CDGP to any given problem, we use the SMT solver to determine whether the single-output property holds for it. To illustrate, Listing 3 presents the SMT query that verifies this property for the Max2 problem. When queried, the solver will search for such values of $x$, $y$ for which it is possible to find two different outputs $out1$, $out2$ that meet the specification. If the solver returns sat, such outputs were found and the problem does not have the single-output property. If the solver returns unsat, then the problem has the single-output property. Occasionally, the solver may return unknown, signaling that either the property cannot be verified in a given logic or that the available computational resources (time, memory) have been exhausted. In such cases, we assume that the single-output property does not hold.
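The logic of the single-output check can be illustrated with a bounded enumeration: look for an input that admits two different outputs satisfying the specification. This is only a sketch under stated assumptions; the function `has_single_output`, the bound, and both postconditions are hypothetical, and a bounded search cannot establish the property for all integers the way an unsat answer from the solver does.

```python
from itertools import product

def has_single_output(post, arity, bound=2):
    # Search for an input with two distinct outputs that both satisfy `post`.
    domain = range(-bound, bound + 1)
    for inp in product(domain, repeat=arity):
        outputs = [out for out in domain if post(*inp, out)]
        if len(outputs) > 1:
            return False, inp      # analogous to "sat": a witness input exists
    return True, None              # analogous to "unsat" (within the bound only)

def post_max2(x, y, out):
    # Assumed Max2 postcondition.
    return out >= x and out >= y and (out == x or out == y)

def post_upper_bound(x, y, out):
    # Hypothetical weaker specification: any upper bound of x and y is correct.
    return out >= x and out >= y

print(has_single_output(post_max2, 2))
print(has_single_output(post_upper_bound, 2))
```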
More precisely, the single-output property can be defined in two ways: globally, as a property of a problem (i.e., for all inputs), or locally, as a property of a particular input of a given problem. The former implies the latter. However, the absence of the global single-output property does not imply that a single output cannot be determined for specific inputs. Therefore, in case the single-output property does not hold globally, CDGP attempts to find out if it holds locally for every new counterexample, which is explained in Section 3.2.3.
3.2.2 Single-Invocation Property
If the function to be synthesized is always called with the same arguments in the formal specification, the problem has the single-invocation property (Reynolds et al., 2017). Checking this property is relatively simple and can be done by syntactic analysis of the specification.
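A syntactic check of this kind might look as follows: parse the constraints as S-expressions, collect every call to the synthesized function, and test whether all calls share one argument list. The sketch is an assumption about how such an analysis could be done, not the authors' implementation; the tokenizer is deliberately naive (it would break on string literals containing spaces or parentheses), and both example specifications are hypothetical.

```python
def parse(text):
    """Parse a sequence of S-expressions into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1

    exprs, pos = [], 0
    while pos < len(tokens):
        expr, pos = read(pos)
        exprs.append(expr)
    return exprs

def invocations(expr, fname, found=None):
    """Collect the argument lists of all calls to `fname` inside `expr`."""
    found = [] if found is None else found
    if isinstance(expr, list) and expr:
        if expr[0] == fname:
            found.append(repr(expr[1:]))
        for sub in expr:
            invocations(sub, fname, found)
    return found

def is_single_invocation(spec_text, fname):
    return len(set(invocations(parse(spec_text), fname))) <= 1

max2_spec = ("(constraint (>= (max2 x y) x)) (constraint (>= (max2 x y) y)) "
             "(constraint (or (= x (max2 x y)) (= y (max2 x y))))")
commutative_spec = "(constraint (= (f x y) (f y x)))"   # calls f with two argument orders

print(is_single_invocation(max2_spec, "max2"))
print(is_single_invocation(commutative_spec, "f"))
```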
3.2.3 Finding Outputs for Incomplete Tests
It should be clear from the two previous subsections that the desired output for an incomplete test can be unambiguously determined using a solver if both of the above-mentioned properties hold. We summarize this observation in the following theorem:
To create a complete test $(in,out)$ from an incomplete test $(in,null)$, $in$ must have the single-output property, and the problem (specification) must have the single-invocation property.
The practical upshot of the above observation is that a problem without the single-invocation property would force calling the solver whenever testing a program on any (incomplete) test. Though CDGP can handle such problems, we do not consider them in the experimental part, because calling the solver for each program generated by GP and each test in $Tc$ is prohibitively expensive.^{1}
Determining the local single-output property for an input of the incomplete test is realized by the same query which is responsible for searching for the correct output. This query, presented in Listing 4, simply represents the output to be produced by a program as a single variable, and solves for its value for the provided inputs. The resulting value of the output is then combined with the input to form a complete test. The key observation is that we can search for the correct output twice, the second time excluding the answer we obtained the first time. If the answer for the second query is unsat, then the input has the single-output property and a complete test is created. The second query has an additional constraint (assert (distinct out value)), where value is the output obtained in the first query.
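The two-query scheme can be sketched with a search over a finite set of candidate outputs standing in for the solver. Everything named below is hypothetical: `solve_output` and `complete_test` are illustrative helpers, and `post_search2` is an assumed rendering of a guarded Search-like specification in which every clause is an implication under a sortedness precondition.

```python
def solve_output(post, inp, candidates, exclude=None):
    # Stand-in for a solver query: find an output satisfying the specification,
    # optionally with the extra constraint (distinct out value).
    for out in candidates:
        if out != exclude and post(*inp, out):
            return out
    return None   # plays the role of "unsat"

def complete_test(post, inp, candidates):
    first = solve_output(post, inp, candidates)
    if first is None:
        raise ValueError("specification admits no output for this input")
    second = solve_output(post, inp, candidates, exclude=first)
    if second is None:
        return (inp, first)    # unique output: a complete test
    return (inp, None)         # several correct outputs: the test stays incomplete

def post_search2(a0, a1, key, out):
    # Assumed guarded specification for a two-element sorted "array".
    if not a0 < a1:
        return True            # precondition violated: any output is correct
    if key < a0:
        return out == 0
    if a0 < key < a1:
        return out == 1
    if key > a1:
        return out == 2
    return True                # key equals an element: left unconstrained here

print(complete_test(post_search2, (3, 7, 4), range(3)))
print(complete_test(post_search2, (7, 3, 4), range(3)))
```

The first input is sorted and yields a complete test, while the second violates the precondition, so more than one output is admissible and the test remains incomplete.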
It may also happen that the first query returns unsat. This means that the specification is contradictory and no program can satisfy it, and thus the search process would end with the appropriate message.
3.3 Representation of Solutions
As programs in CDGP need to be verified by the SMT solver, they can use only the semantics (types, operators, instructions, etc.) available in the background theory supported by the solver, for example LIA or Strings (SLIA). In principle, any programming language could be used to represent programs in CDGP, given the appropriate converter to the SMT-LIB language (Barrett et al., 2015, 2016), which is accepted as an input language by most modern SMT solvers. In this study, for simplicity, we decided to evolve programs directly as SMT-LIB expressions. SMT-LIB is a functional language and is similar in many aspects to LISP, so the traditional tree-based GP was most adequate.
4 Related Work
Formal methods for program synthesis have a long history, preceding heuristically informed stochastic methods such as GP by some decades (Cohen, 1994). The literature for formal approaches to synthesis (and verification) is correspondingly vast; for recent overviews see Boca et al. (2009) and Almeida et al. (2011). In contrast, we are aware of only a few program synthesis approaches which combine formal techniques with heuristic search. To our knowledge, the earliest work combining the aspects of evolutionary search and formal approaches was that by Haynes et al. (1996), where GP was used to produce entailment proofs, an initial step for potentially using it to automate the specification refinement process.
An approach due to Johnson (2007) incorporated model checking with the specification of the task expressed via computation tree logic (CTL) to evolve finite state machines, and was used to learn a controller for a simple vending machine. The fitness was computed as the number of CTL properties which were verified to hold for a given program. A similar approach by He et al. (2011), the Hoare logic-based GP, computes fitness as the number of postcondition clauses which can be inferred from the precondition and the program being evaluated. Instead of model checking, the Hoare logic (Hoare, 1969) is used for the specification of the task and verification.
From 2008, Katz and Peled (2008, 2014, 2016) authored a series of papers combining model checking and GP in which they progressively refined their MCGP tool based on linear temporal logic (LTL). They use enhanced model checking to impose a gradient on the fitness function by distinguishing several levels of passing an LTL property (met for all inputs, met for only some inputs, met for no input). Apart from that, this approach is very similar to the two previously described. It is worth noting that Katz and Peled (2014) also briefly considered using an SMT solver for verification in a similar way to model checking, and even, similarly to CDGP, utilized counterexamples to provide for more granular fitness. However, they only reported trying to solve a simple problem, and seemingly abandoned this line of research after that. By utilizing counterexamples, this initial work is the most similar to CDGP out of all related works mentioned here.
The use of coevolutionary GP to synthesize programs from formal specifications was researched by Arcuri and Yao (2007, 2014). They maintained joint populations of tests and programs within a competitive coevolution framework. Fitness of programs was calculated using a heuristic that estimated how close a postcondition was to being satisfied by the program's output for specific tests. While allowing the synthesis of programs with GP from a formal specification, such an approach provides no guarantee that a program deemed correct by their method will be consistent with the specification for all possible inputs.
In the emerging area of genetic improvement (the modification of preexisting program code via search), there have been a number of recent articles incorporating formal approaches. Kocsis et al. (2016) report a 10,000-fold speedup of Apache Spark database queries on terabyte datasets. In Burles et al. (2015), a 24% improvement in energy consumption was achieved for Google's Guava collection library by applying the “Liskov substitution principle,” the formal cornerstone of object-orientation. Some recent work has also used category theory to perform formal transformations on datatypes (Kocsis and Swan, 2014, 2017) in order to join together parts of a program which are otherwise unrelated, a technique applicable to “Grow and Graft Genetic Programming” (Harman et al., 2014).
There are also many well-known systems that perform synthesis under the broad heading of Inductive Logic Programming (Muggleton, 1994). In particular, IGOR II (Hofmann, 2010; Hofmann et al., 2008) is known to perform well on a range of problems. As extended by Katayama (2012), it combines an “analytic” approach based on analysis of fitness cases with the generate-and-test approach more familiar to the GP community.
An alternative to spec-based synthesis is “program sketching” (Solar-Lezama et al., 2006), a technique whereby a program contains “holes” which are automatically filled in (e.g., using an SMT solver) with values satisfying a specification. However, the approach has limited scalability since the exact search method used has exponential runtime in the number of variables. More recently, evolutionary program sketching (EPS) has been proposed (Błądek and Krawiec, 2017). EPS is presented as a GP alternative that evolves partial programs and then uses an SMT solver to complete them, attempting to maximize the number of passing test cases. For the small set of benchmarks under consideration, EPS outperforms conventional GP (e.g., in the number of optimal solutions found).
5 Experiments
In the following sections, we describe the experimental framework, including information about benchmarks, some implementation choices, tested configurations of CDGP, and baselines. Then, in Section 5.6, we analyze the results of the experiments and the characteristics of CDGP dependent on its settings. In Section 5.7, we confront CDGP with exact, non-heuristic algorithms of program synthesis. The source code of CDGP, along with specifications of problems, is available at: https://github.com/kkrawiec/CDGP.
5.1 Benchmark Suite
We consider a range of spec-based synthesis benchmarks of varying difficulty and characteristics, representing two domains: the theory of linear integer arithmetic (LIA) and the theory of strings (SLIA, strings and linear integer arithmetic) (Barrett et al., 2016). Some of the benchmarks come from the SyGuS repository maintained for the annual “Syntax-Guided Synthesis” competition (Alur et al., 2015, 2013). We detail the domains, grammars, and benchmarks in the following, first for LIA and then for SLIA.
5.1.1 LIA Benchmarks
In LIA benchmarks, presented in Table 1, the task is to synthesize a program with a signature $I^n \to I$, where $I$ stands for the integer type and $n$ is the program's arity. Max, Search, and Sum come from the SyGuS competition (Alur et al., 2015, 2013); the remaining benchmarks are of our own design. Some benchmarks (IsSeries, IsSorted, Search) interpret input arguments as a fixed-size ordered sequence of type $I$. In the IsSeries and IsSorted tasks, the program is required to return 1 if the arguments, respectively, form an arithmetic series or are sorted in ascending order, and 0 otherwise. In the Search benchmark, a correct program should find the 0-based index of the last argument in an “array” of length $n$ formed by the remaining arguments (which are constrained by a precondition to be sorted). Hence, for instance Search2(3,7,1)=0, Search2(3,7,4)=1, and Search2(3,7,10)=2, where the index in the benchmark's name refers to the size of the “array.”^{2}
Name  Arity  Semantics 
CountPos  2, 3, 4  The number of positive arguments 
IsSeries  3, 4  Do arguments form an arithmetic series? 
IsSorted  4, 5  Are arguments in ascending order? 
Max  4  The maximum of arguments 
Median  3  The median of arguments 
Range  3  The range of arguments 
Search  2, 3, 4  The index of an argument among the other arguments 
Sum  2, 3, 4  The sum of arguments 
The grammar for LIA programs includes two types, Int (I) and Boolean (B) (see Figure 2). To avoid multiplying the input variables by themselves (and so building programs that involve nonlinearity), we introduce an additional type C for integer constants. The corresponding production also explicitly defines the range of integers that can be used in the programs generated by CDGP.^{3}
Search $k$ is the only group of LIA benchmarks in our suite without the global single-output property (Section 3.2.1) because the desired output is not defined for arrays which are not sorted, and thus any output is correct for such inputs. The single-invocation property holds for every benchmark in our suite.
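As a worked illustration of the Search semantics, the sketch below gives one program consistent with the three Search2 examples quoted earlier in this section. The function `search` is hypothetical and is only one of many acceptable programs: for inputs that violate the sortedness precondition any output is admissible, which is exactly why the global single-output property fails for this group. Its behavior when the key equals an array element is an arbitrary choice made here.

```python
def search(*args):
    # The last argument is the key; the preceding ones form the sorted "array".
    *array, key = args
    # For a sorted array, the index is the number of elements smaller than the key.
    return sum(1 for element in array if element < key)

print(search(3, 7, 1), search(3, 7, 4), search(3, 7, 10))
```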
To illustrate, Listing 5 presents the specification of the Max4 benchmark expressed in the SyGuS language (Raghothaman and Udupa, 2014).^{4} The synth-fun statement defines the signature of the function to be synthesized. The constraint commands define the specification and are combined with logical conjunction by the solver. The declare-var commands declare universally quantified variables, which can then be used in the constraints. Preconditions can be defined by implications with conditions on the values of the variables. In this example there are no implications, so this specification consists of only the postcondition; the precondition is empty, that is, the inputs to Max4 are only required to belong to the type Int.
5.1.2 SLIA Benchmarks
The SLIA benchmarks, presented in Table 2, are based on those from the “Programming by Examples” track in the SyGuS competition. The original benchmarks are all test-based, and our benchmarks extend them to the simplest formal specification that generalises the original set of tests. For example, the original benchmark “drname” included the input-output pairs ("Nancy FreeHafer", "Dr. Nancy"), ("Andrew Cencici", "Dr. Andrew"), ("Jan Kotas", "Dr. Jan"). The corresponding formal specification states that: a) the first token of the output is "Dr." and b) the second token of the output is equal to the first whitespace-delimited token of the input. The other SLIA benchmarks are similarly defined.
Name  Arity  Semantics 
drname  1  Extract first name from full name and prepend it with “Dr.” 
firstname  1  Extract first name from full name 
initials  1  Extract initials from full name 
lastname  1  Extract last name from full name 
combine  2  Combine first and last name into full name 
combine2  2  Combine first and last name into first name followed by initial 
combine3  2  Combine first and last name into initial followed by last name 
combine4  2  Combine first and last name into last name followed by initial 
phone  1  Extract the first triplet of digits from a phone number 
phone1  1  Extract the second triplet of digits from a phone number 
phone2  1  Extract the third triplet of digits from a phone number 
phone3  1  Put first three digits of a phone number in parentheses 
phone4  1  Change all “-” in a phone number to “.” 
The grammar for SLIA programs, shown in Figure 3, includes two types: String (S) and Int (I).^{5} To realize the functionality requested by the benchmarks, relatively simple capabilities are required: splitting a string into words, extracting the first letter from a word, concatenating strings, and combining the input string(s) with some constant characters. However, different benchmarks require different character constants; for instance, the drname benchmark requires the “.” character. Therefore, the SLIA grammar is adapted to individual benchmarks by including the required characters via the $constants$ term.
SLIA benchmarks are mostly guarded by some preconditions and thus the global single-output property is not met, the only exception being the combine benchmark series. The single-invocation property holds for every SLIA benchmark.
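The two-part drname specification described earlier in this section can be written as a predicate over an input-output pair. The sketch below is a hypothetical rendering in Python rather than the SMT-LIB formulation used in the benchmark; in particular, treating an input without any token as outside the precondition (so that any output is accepted) is an assumption made here to show how a guard weakens the specification.

```python
def dr_name_post(inp, out):
    in_tokens = inp.split()     # whitespace-delimited tokens of the input
    out_tokens = out.split()
    if not in_tokens:
        return True             # assumed precondition: input has at least one token
    return (len(out_tokens) >= 2
            and out_tokens[0] == "Dr."            # a) first output token is "Dr."
            and out_tokens[1] == in_tokens[0])    # b) second token is the first input token

print(dr_name_post("Nancy FreeHafer", "Dr. Nancy"))
print(dr_name_post("Jan Kotas", "Dr. Kotas"))
```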
5.2 Search Operators
We guarantee that programs initialized and bred within a run belong to the given domain by using a typed variant of GP and conforming with a theoryspecific grammar.

Initialization recursively traverses the derivation tree from the starting symbol of the grammar and randomly picks expressions from the right-hand sides of productions. Once the depth of any node of the program tree reaches 4, the operator picks the productions that immediately lead to terminals whenever possible. If the depth exceeds 5, the tree is discarded and the process starts anew.

Mutation picks a random node in the parent tree, and replaces the subtree rooted in that node with a subtree generated in the same way as for initialization. To conform to the grammar, the process of subtree construction starts with the grammar production of the type corresponding to the picked location (e.g., if the return type of the picked node is I, generation of the replacing subtree starts with production I of the grammar).

Crossover draws a random node in the first parent program, and builds a list $L$ of the nodes in the second parent that have the same type. If $L$ is empty, it draws a node from the first parent again and repeats this procedure. Otherwise, it draws a node from $L$ uniformly and exchanges the subtree rooted there with the subtree drawn from the first parent. This process is guaranteed to terminate, since both parent trees always feature at least one node of the type associated with the root node (I for LIA and S for SLIA) and the root nodes are also allowed to be swapped.
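
A type-preserving crossover of this kind might look as follows. The sketch assumes a hypothetical tree encoding in which every node is a tuple `(type, label, *children)`; the helper names are illustrative, and returning both offspring is our assumption about what "exchanging" the subtrees means.

```python
import random


def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node of the tree."""
    yield prefix
    for i, child in enumerate(tree[2:]):
        yield from paths(child, prefix + (i,))


def subtree(tree, path):
    """Return the subtree found by following `path` from the root."""
    for i in path:
        tree = tree[2 + i]
    return tree


def replaced(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i = 2 + path[0]
    return tree[:i] + (replaced(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(p1, p2, rng):
    """Swap two subtrees of the same type between parents p1 and p2."""
    while True:
        path1 = rng.choice(list(paths(p1)))
        wanted = subtree(p1, path1)[0]
        # The list L of nodes in the second parent with the same type.
        candidates = [p for p in paths(p2) if subtree(p2, p)[0] == wanted]
        if candidates:  # otherwise redraw a node from the first parent
            path2 = rng.choice(candidates)
            s1, s2 = subtree(p1, path1), subtree(p2, path2)
            return replaced(p1, path1, s2), replaced(p2, path2, s1)


# Example parents: ite(x <= y, y, x) and x + 1. The second has no B-typed node,
# so drawing the condition node in the first parent forces a redraw.
p1 = ("I", "ite", ("B", "<=", ("I", "x"), ("I", "y")), ("I", "y"), ("I", "x"))
p2 = ("I", "+", ("I", "x"), ("I", "1"))
```

Because the root nodes themselves are eligible, the loop always finds a compatible pair eventually, which mirrors the termination argument given above.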
To control bloat, a program tree resulting from any of these search operators is considered feasible unless its height exceeds 12. Should that happen, the program is discarded and the search operator is queried again. Additionally, whenever a tournament selection or deselection is used, it includes lexicographic parsimony pressure (Luke and Panait, 2002), that is, in case of a tie on fitness, the smaller program is preferred.
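
Lexicographic parsimony pressure amounts to a two-level comparison, which the following sketch makes explicit. Individuals are represented here as hypothetical `(program, fitness, size)` triples; the tie-breaking rule for deselection (remove the larger program on a fitness tie) is our assumption, as the text does not spell it out.

```python
import random


def tournament(population, rng, size=7):
    """Pick the fittest of `size` random contestants; ties go to the smaller program."""
    contestants = rng.sample(population, size)
    return max(contestants, key=lambda ind: (ind[1], -ind[2]))


def deselect(population, rng, size=7):
    """Negative tournament: pick the least fit contestant; ties go to the larger program."""
    contestants = rng.sample(population, size)
    return min(contestants, key=lambda ind: (ind[1], -ind[2]))
```

For example, among the contestants `("a", 0.5, 10)`, `("b", 0.5, 3)` and `("c", 0.25, 1)`, selection prefers `"b"` because it ties with `"a"` on fitness but is smaller.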
5.3 Communication with the Solver
Communication with the solver is realized via the SMT-LIB standard (Barrett et al., 2015), recognized by most contemporary SMT solvers. We employ the well-known Microsoft Z3 SMT solver (de Moura and Bjørner, 2008), one of the most performant and widely used non-commercial solvers. This choice was arbitrary; that is, no Z3-specific features were used.
Our implementation of Verify$(p,(Pre,Post))$ in Algorithm 1 translates the tree representation of an evolved program $p$ into a function definition in the SMT-LIB language, combines it with the contract $(Pre,Post)$, and calls the solver to verify whether $p$ meets $(Pre,Post)$ (see Section 3.1 for more details). The runtime of the solver may vary with the size of the verified program and its structure. In the experiments conducted here, the average time the solver needed for verifying a single program ranged from 0.03s to 0.35s, depending on the CDGP variant and fitness threshold. However, occasionally it took much longer, up to 30s. For this reason, we cap the time of a single run at 1 hour, which becomes our additional stopping criterion, on top of the maximum number of generations.
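
To give a flavor of such a translation, the sketch below assembles an SMT-LIB script for a candidate solution to a two-argument maximum problem. It is an assumption-laden illustration rather than the actual CDGP code: the function names, the `RESULT` placeholder, and the encoding of verification as an unsatisfiability check of the negated postcondition are ours.

```python
def to_smtlib(tree):
    """Render a program tree (nested tuples with leaf strings) as an SMT-LIB term."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join([tree[0]] + [to_smtlib(c) for c in tree[1:]]) + ")"


def verification_query(name, params, program, pre, post):
    """Build a script that is unsatisfiable iff the program meets (pre, post).

    `post` refers to the program's output through the placeholder RESULT
    (a convention of this sketch only).
    """
    args = " ".join(f"({p} Int)" for p in params)
    call = "(" + " ".join([name] + params) + ")"
    lines = ["(set-logic LIA)",
             f"(define-fun {name} ({args}) Int {to_smtlib(program)})"]
    lines += [f"(declare-const {p} Int)" for p in params]
    lines += [f"(assert {pre})",
              f"(assert (not {post.replace('RESULT', call)}))",
              "(check-sat)"]
    return "\n".join(lines)


# Candidate program: ite(x >= y, x, y)
program = ("ite", (">=", "x", "y"), "x", "y")
post = "(and (>= RESULT x) (>= RESULT y) (or (= RESULT x) (= RESULT y)))"
query = verification_query("max2", ["x", "y"], program, "true", post)
```

If the solver answers `unsat`, no input violates the contract and the program is correct; if it answers `sat`, a follow-up `(get-model)` or `(get-value (x y))` command would retrieve the violating input, i.e., the counterexample.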
5.4 Configurations
Table 3 presents the settings of our evolutionary algorithm. Our default configuration involves tournament selection, a common choice for GP that proved useful in many past studies. However, recall that the working set of tests $Tc$ is initially empty and may grow slowly. With a handful of tests, the fitness function (Eval in Algorithm 1), which counts the fraction of passed tests, can assume only a few distinct values and has thus little discriminatory power. Ties on fitness are likely, which causes tournament selection to act at random and weakens the selective pressure.
Parameter  Value 
Number of runs  25 
Population size  500 
Maximum height of initial programs  5 
Maximum height of trees inserted by mutation  5 
Maximum height of programs in population  12 
Maximum number of generations  100000 
Maximum runtime in seconds  3600 
Probability of mutation  0.5 
Probability of crossover  0.5 
Tournament size  7 
Therefore, we consider an alternative setup equipped with lexicase selection (Helmuth et al., 2015). In each selection event, lexicase starts with a pool of all programs from the population. A random test $t$ is drawn from $Tc$ without replacement, and programs that do not pass $t$ are discarded from the pool. Drawing tests and discarding programs from the pool is repeated until only one program is left, in which case it is selected; if all tests have been used, or the current test would reduce the pool to the empty set, the winner is drawn uniformly from the remaining pool. We do not use lexicographic parsimony pressure in configurations that involve lexicase selection.
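
The selection procedure just described can be written down compactly. The sketch below follows the textual description, including stopping at the first test that no program in the pool passes; the function signature and the `passes` callback are hypothetical.

```python
import random


def lexicase_select(population, tests, passes, rng):
    """Select one program by filtering the pool on tests taken in random order."""
    pool = list(population)
    order = list(tests)
    rng.shuffle(order)  # equivalent to drawing tests without replacement
    for t in order:
        if len(pool) == 1:
            break
        survivors = [p for p in pool if passes(p, t)]
        if not survivors:  # this test would empty the pool: stop here
            break
        pool = survivors
    return rng.choice(pool)  # uniform draw from whatever remains


# Toy example: program "b" passes every test, so it wins for any test order.
passed = {"a": {1, 2}, "b": {1, 2, 3}, "c": {3}}
winner = lexicase_select(["a", "b", "c"], [1, 2, 3],
                         lambda p, t: t in passed[p], random.Random(0))
```

Note that with an empty set of tests the function degenerates to a uniform random choice from the population, which is what happens at the very beginning of a CDGP run.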
In Section 3, we presented CDGP as a generational evolutionary algorithm. Note that for the initial population, the solver is called for each program being evaluated, as each such program formally passes all tests in $Tc$, which is initially empty (no matter what the fitness threshold $q$ is). Given that the population is quite sizeable (500, Table 3), this may lead to high computational cost. More importantly, however, each such evaluation will produce a counterexample. Many of them, though unique, can be redundant in $Tc$, i.e., verify the same property of programs. For instance, for the Max4 benchmark, the input-output pairs ((2,1,1,1),2) and ((2,0,1,1),2), even though technically distinct, essentially test the same characteristic of programs, that is, their capability of returning the first argument if it happens to be greater than all the remaining ones.
Arguably, neither of the above is desirable, so we consider the steady-state evolutionary algorithm as an alternative to the generational one. In that variant, an iteration (generation) consists of first discarding a poorly performing program from the population (using a negative tournament selection of size 7), and then breeding a new program with a randomly chosen search operator (mutation or crossover). The program created in this way replaces the removed one in the current population and is subject to evaluation. As a result, it may undergo verification if required by Algorithm 1, and the resulting counterexample $tc$ is immediately added to $Tc$. If $tc$ is new to $Tc$, the fitness values of all programs in the population are updated by applying them to $tc$, so that they are consistent with the contents of $Tc$. To make the comparison between steady-state and generational configurations fair, for steady state we multiply the maximum number of generations by the population size (500), so that the maximum number of evaluated solutions in both cases is the same.
The key feature of the steady-state approach is thus that fitness values of all programs in the population are updated promptly, as soon as new tests arrive in $Tc$. We anticipate that this makes the search process more reactive and potentially more efficient.
Additionally, we assess the impact of the fitness threshold $q$ on evolution by testing CDGP on a range of its values: $\{0.0, 0.25, 0.5, 0.75, 1.0\}$. As discussed in Section 3.1, high values of $q$ result in a lower number of performed verifications, while low values provide a better gradient of the fitness function. We find the trade-off between those two effects worth a closer investigation.
We summarize here how the experimental configuration differs from that of Krawiec et al. (2017). Differences include the presence of lexicographic parsimony pressure (only for tournament selection), the initial population not being verified in the steady-state variant, and a smaller interval of integers for drawing random tests in the control approach GPR (described later). In this new series of experiments, all algorithms maintain 500 candidate solutions. Moreover, we also use a newer release of the Z3 solver, which may impact the computing time and the characteristics of returned counterexamples.
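
The role of $q$ can be illustrated with a short sketch. We assume here, for simplicity, that tests are complete input-output pairs and that a program is submitted for verification when the fraction of tests it passes is at least $q$; the function names are hypothetical.

```python
def pass_ratio(program, tests):
    """Fraction of tests passed; an empty test set counts as fully passed."""
    if not tests:
        return 1.0
    return sum(1 for inputs, expected in tests if program(*inputs) == expected) / len(tests)


def should_verify(program, tests, q):
    """Decide whether the program is handed over to the solver."""
    return pass_ratio(program, tests) >= q


# A flawed candidate for the two-argument maximum: it always returns its first argument.
first = lambda x, y: x
tests = [((2, 1), 2), ((0, 3), 3), ((1, 1), 1), ((5, 4), 5)]
```

Here `first` passes three of the four tests, so it would be verified under $q=0.75$ but not under the conservative setting $q=1$; with an empty test set it would be verified for any $q$.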
5.5 Baseline: GPR
Our baseline is GPR (GP Random), which proceeds as CDGP, except for line 9 in Algorithm 1, where it adds to $T$ a randomly generated test, rather than the counterexample returned by the solver. In this way, the dynamics of GPR are similar to the conservative variant of CDGP; that is, the test base gets extended when a program in the population manages to pass all of them. As in CDGP, multiple new tests may be added to $Tc$ in a single generation, duplicates in $Tc$ are eliminated, and $Tc$ may grow indefinitely during a run. We use GPR only with $q \in \{0.75, 1.0\}$, because (i) CDGP fares best for these values and (ii) setting $q$ to lower values leads to exorbitant numbers of tests in $Tc$. By comparing CDGP with GPR, we intend to determine whether synthesizing tests from a specification makes CDGP any better than generating them at random.
In GPR, we create random tests by drawing program inputs at random. For LIA benchmarks, we draw numbers uniformly from $[-100,100]^n$, where $n$ is the input arity of the synthesized function. We anticipate that the width of this interval is not critical, given that in most benchmarks (except for Sum and IsSeries) the functions to be synthesized should interpret their inputs as ordinal variables.
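
Drawing such inputs is straightforward; the sketch below assumes integer inputs and inclusive interval endpoints, and the function name is hypothetical. It covers only the input part of a test, since how the expected output is established is not described in this section.

```python
import random


def random_input(n, rng, low=-100, high=100):
    """Draw an n-tuple of integers uniformly from [low, high]^n (endpoints included)."""
    return tuple(rng.randint(low, high) for _ in range(n))


# Collecting inputs in a set mirrors the elimination of duplicate tests.
rng = random.Random(0)
inputs = {random_input(4, rng) for _ in range(10)}
```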
5.6 Performance Analysis
We discuss the results for LIA and SLIA benchmarks separately due to their volume and varying characteristics. In Tables 4, 5, and 6 we present respectively the success rate, the end-of-run size of $Tc$, and the runtime in seconds, for individual variants of CDGP and GPR on the LIA benchmarks. Recall that for spec-based synthesis, a success means synthesizing a provably correct program that is logically consistent with the specification (in contrast to conventional GP, which is concerned with passing the supplied tests).
CDGP is clearly more likely than GPR to synthesize correct programs, which confirms that counterexamples are more useful than random inputs. Due to the curse of dimensionality, covering the input space becomes increasingly difficult in higher dimensions and GPR's performance quickly degrades with the growing number of inputs. CDGP is affected by this phenomenon too, but to a much lesser extent. We also hypothesize that randomly drawing inputs which test certain “corner cases” (e.g., an array of the same repeated value in IsSorted) is particularly unlikely. In CDGP, on the contrary, the gradually increasing quality of programs forces the solver to come up with more and more sophisticated tests.
Using the fitness threshold $q$ to control when programs should be verified is clearly beneficial when compared to the extremes, that is, to the conservative variant ($q=1$) and the non-conservative one ($q=0$). Setting $q$ to 0.75 yields the best results among the tested values. As anticipated in Section 3, we hypothesize that the conservative approach is too demanding and tends to wait too long for new tests to arrive, depriving itself of potentially valuable search guidance. This is confirmed by the end-of-run size of $Tc$ (Table 5), which is typically between one and two orders of magnitude smaller than for the non-conservative variant.
Verifying all evaluated programs in the non-conservative variant ($q=0$) is also suboptimal. What comes as a bit of a surprise is that this does not seem to lengthen the runtime (Table 6); presumably, if all programs in a population are being verified, many of them have low fitness, and their incorrectness can be quickly proven by the solver. There must thus be another reason why the success rate for $q \leq 0.5$ is systematically worse than for $q=0.75$. We posit that many tests collected in these settings may in fact be redundant, i.e., examine the same properties of programs (recall the earlier example of mutually redundant tests for the Max4 benchmark). Because CDGP cannot detect such redundancy, such pairs of tests (and consequently groups of tests) can coexist in $Tc$. The obvious consequence is that $Tc$ grows large and slows down the evaluation. Even though this does not seem to be challenging given the time budget available here, the presence of many redundant tests decreases the relative importance of the “essential” ones. For the setups equipped with tournament selection, the contribution of non-redundant tests to the overall fitness is low, and so is the likelihood that they affect selection. In lexicase selection, the probability that such tests will reduce the pool of solutions at some point is low.
Though $q=0.75$ seems to provide the right balance, this is not to say that this value should be considered optimal. We speculate that setting $q$ to values closer to, yet still smaller than 1 may have a similar effect.
Concerning the type of evolutionary algorithm engaged, the results invalidate the hypothesized superiority of steady-state evolution, which we attributed to updating solution fitness online, right after a new counterexample arrives in $Tc$. A possible reason is that steady-state runs tend to collect noticeably fewer tests than the generational variant.
Statistical analysis corroborates the above observations. We employ Friedman's test for multiple achievements of multiple subjects (Kanji, 2006) on the success rate of all 28 configurations shown in the tables (20 for CDGP and 8 for GPR). The $p$-value of $2.2\times10^{-38}$ strongly indicates that at least one configuration performs significantly differently from the remaining ones. The following table presents them ordered by the average rank, from best to worst (G/S = generational/steady state, T/L = tournament/lexicase):
GL$75$  SL$75$  GT$75$  ST$75$  GL$5$  GL$1$  GT$1$  GL$25$  SL$5$  GL$0$  SL$1$  GT$5$  SL$0$  ST$5$ 
4.6  5.1  5.2  6.9  7.5  8.3  9.0  9.5  10.4  11.2  11.7  12.6  12.7  14.2 
SL$25$  GPRGL$1$  GPRGT$1$  GT$25$  GT$0$  GPRGL$75$  ST$25$  ST$1$  GPRGT$75$  ST$0$  GPRST$1$  GPRSL$75$  GPRSL$1$  GPRST$75$ 
14.5  16.6  17.1  17.8  18.8  19.2  20.4  20.7  21.0  21.3  21.8  22.2  22.4  23.0 
All CDGP configurations with $q=0.75$ clearly rank at the top, followed closely by the CDGP configurations for other values of $q$. The GPR control configurations, on the other hand, gather at the bottom of the ranking, with a few exceptions of CDGP configurations that use the extreme $q$ of 0 or 1.
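
For readers unfamiliar with how such average ranks arise, the sketch below shows the ranking step that underlies Friedman's test: configurations are ranked separately on each benchmark (rank 1 being the best, ties sharing the mean of their positions), and the ranks are then averaged over benchmarks. The function name and the numbers in the example are made up for illustration and are not taken from our tables.

```python
def average_ranks(success):
    """success[b][c] is the success rate of configuration c on benchmark b."""
    configs = list(next(iter(success.values())))
    totals = dict.fromkeys(configs, 0.0)
    for rates in success.values():
        ordered = sorted(configs, key=lambda c: -rates[c])  # best first
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the group of configurations tied with ordered[i].
            while j + 1 < len(ordered) and rates[ordered[j + 1]] == rates[ordered[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
            for c in ordered[i:j + 1]:
                totals[c] += shared
            i = j + 1
    return {c: totals[c] / len(success) for c in configs}


# Hypothetical success rates of three configurations on two benchmarks.
toy = {"bench1": {"A": 1.0, "B": 0.5, "C": 0.5},
       "bench2": {"A": 0.8, "B": 0.9, "C": 0.1}}
ranks = average_ranks(toy)
```

In the toy data, B and C tie on the first benchmark and therefore both receive rank 2.5 there.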
To determine the significantly different pairs of configurations, we conduct post-hoc analysis using the symmetry test (Hollander et al., 2013). The analysis reveals that all CDGP configurations with $q=0.75$ are better than all GPR configurations $(p<0.05)$, except for ST$75$, which is not significantly better than GPRGL$1$ and GPRGT$1$. A number of other configurations of CDGP (GL$25$, GL$5$, GL$1$, GT$1$) also tend to be statistically better than four or more GPR configurations (particularly than those GPR configurations that use tournament selection).
Though the average success rates for the optimal $q=0.75$ are very similar for both tournament and lexicase selection, the latter typically provides better rates for the remaining values of $q$. We may thus conclude that lexicase has once again proved its usefulness, corroborating many other studies and our results from Krawiec et al. (2017). This is even more impressive given that the lexicase algorithm is actually quite costly in execution compared to tournament selection, which is reflected in the average number of generations elapsed: 1069 for tournament versus 102.5 for lexicase in the generational variant, and 2.8 million versus roughly 100 thousand in the steady-state variant. As a consequence, lexicase runs typically evaluate an order of magnitude fewer solutions than the tournament runs, and yet perform on a par or better.
In Tables 7, 8, and 9, we present the success rate, the end-of-run size of $Tc$, and the runtimes for the string domain SLIA. We do not run GPR this time, as there is no obvious way of automatically generating plausible tests for these benchmarks, which are for the most part guarded by preconditions. A character string generated at random is very unlikely to pass the precondition, and consequently to test any program property that would relate to a given task.
The overall success rates for the SLIA benchmarks turn out to be slightly smaller than for the LIA domain. We observe a similar pattern of sensitivity to the $q$ threshold as for LIA benchmarks: best success rates for fitness thresholds around 0.75, and smaller sizes of $Tc$ for higher fitness thresholds. Interestingly, this time the steady-state variant is noticeably better, which is striking, as the number of tests collected there is often very small, in single digits. This indicates that the good performance of these configurations is due more to visiting a large number of candidate solutions (again, often one or more orders of magnitude more than for the generational variant) than to the usefulness of tests elicited by CDGP from formal verification.
We scrutinize these results statistically using the same apparatus as for the LIA benchmarks, running Friedman's test (Kanji, 2006) on the success rate of the configurations shown in the tables, the number of which is this time 20. The $p$-value amounts to $3.5\times10^{-14}$ and so indicates significant differences. The configurations rank as follows (G/S = generational/steady state, T/L = tournament/lexicase):
ST$1$  SL$75$  ST$75$  SL$1$  ST$5$  GL$75$  GT$1$  SL$5$  GL$1$  GT$75$  ST$25$  GT$5$  GL$5$  SL$25$  GL$25$  SL$0$  GT$25$  ST$0$  GT$0$  GL$0$ 
5.4  5.8  6.2  7.0  7.5  8.6  9.0  9.3  9.4  9.5  10.3  10.7  11.0  12.1  12.7  13.6  15.3  15.3  15.3  15.8 
As for LIA, the CDGP configurations with $q=0.75$ tend to rank at the top, this time, however, accompanied by a few configurations with $q=1$. The superiority of the steady-state approach is evident. However, post-hoc analysis using the symmetry test (Hollander et al., 2013) reveals that most of the pairwise differences are statistically insignificant. For instance, even though ST$1$ and ST$75$ are among the top three configurations in the ranking, each of them significantly outranks only the four CDGP configurations from the very bottom of the ranking (GL$0$, GT$0$, GT$25$, ST$0$). The moderate number of pairwise significant differences was, however, expected, given that there are no dramatic differences between success rates for SLIA benchmarks: the average success rates range in $[0.54,0.82]$ (Table 7).
In summary, CDGP equipped with lexicase selection and admitting programs for verification only if they pass at least 75% of tests is the configuration that tops the success rate on our benchmark suite. This holds for both the generational and steady-state variants, though the latter is noticeably faster than the former on SLIA problems and thus may be preferred in practice.
5.7 Comparison with Formal Approaches
We compare CDGP with two exact solvers for program synthesis: EUSolver (Alur et al., 2017) and CVC4 (Barrett et al., 2011). CVC4 is the latest in the “Cooperating Validity Checker” series of SAT-based solvers, developed over the last 30 years. It is well known that SMT solvers do not perform well in proving universally quantified expressions to be satisfiable. CVC4 therefore supports refutation-based synthesis, for which a model of the function to be synthesized is obtained from the proof that the negation of the synthesis formula is unsatisfiable.
Since naïve enumerative approaches to program synthesis do not scale, EUSolver seeks to provide scalable enumeration via a divide-and-conquer approach that separately enumerates a) predicates for partitioning the inputs and b) small expressions which are correct on a subset of inputs. The problem of combining predicates and expressions is then treated as a multi-label decision tree learning problem. By working with a probability distribution over labels, EUSolver can take advantage of standard information-gain heuristics to induce compact trees.
We apply CVC4 and EUSolver to the LIA benchmarks, and CVC4 to the SLIA benchmarks (EUSolver cannot handle formal String specifications). As the exact algorithms are deterministic, we run them only once on each benchmark. Both methods need only a fraction of a second to synthesize a correct program for all LIA benchmarks; the average runtime of EUSolver is 0.4s (max 1.5s), and for CVC4 it is hardly measurable (below 0.01s). We may conclude that, in terms of efficiency, CDGP is no match for the exact algorithms in the LIA domain. In the SLIA domain, however, purely formal string specifications proved hard for CVC4, which managed to find a correct program only for 2 (namecombine, namecombine3) out of 13 benchmarks.
There are, however, other metrics that make the comparison more interesting. Figure 4 juxtaposes the sizes of the programs produced by the exact methods with the average sizes of the best-of-run programs synthesized by CDGP (in the generational variant with tournament selection and $q=0.75$). We factor these results by benchmark class and present them as a function of instance size, i.e., the number of inputs. The sizes of programs produced by CVC4 and EUSolver grow very fast with instance size, close to exponentially (note the log scale of the vertical axis). For CDGP, on the other hand, the growth is moderate and closer to linear. However, we should add at this point that CDGP simplifies the best-of-run solution using the SMT solver in a semantics-preserving manner. While the simplification utility is not a part of the SMT-LIB standard (Barrett et al., 2015), it is present in some SMT solvers. In Z3, we use the simplify command that checks locally whether a subexpression can be replaced with a shorter one (e.g., (+ 1 1) would be rewritten as 2). The results shown in Figure 4 take this simplification into account, so one might argue that they are biased in favor of CDGP. On the other hand, the preference for shorter programs is to some extent built into CVC4 and EUSolver by design.
6 Discussion
The results indicate that counterexamples collected from verification in CDGP prove more useful as tests than the inputs constructed at random in GPR. On one hand, this was expected, because, in contrast to counterexamples, random tests are not derived from the problem specification and in this sense convey less problemspecific knowledge. On the other hand, SMT solvers follow sophisticated search tactics, reportedly built on years of expert experience and as such involving certain search biases. It is thus not obvious that counterexamples they identify should be effective when used as “search drivers” (Krawiec, 2016) in a stochastic synthesis process.
That said, it is fair to say that the effectiveness of GPR is quite high, particularly on the simpler benchmarks. The success rate of this baseline approach could form a measure of problem difficulty, which does not seem to correlate in any simple way with input arity; compare for instance the staggering differences in success rates for CountPos4 and Search4 (Table 4). This is, however, not to say that GPR could form a competitive alternative to CDGP.
The reader familiar with contemporary software engineering has likely noticed that CDGP can be seen as an automatic analog to test-driven software development (Beck, 2002), where a software developer iteratively constructs tests of gradually increasing difficulty that detect flaws in the current implementation and so help improve it. This analogy holds also for other counterexample-driven methods (Jha et al., 2010; Solar-Lezama et al., 2006), and naturally brings to mind the coevolutionary metaphor, as posited in related works (Katz and Peled, 2016). Indeed, a natural follow-up of this study could involve borrowing the developments from coevolutionary algorithms, in particular coevolving the tests alongside programs, and using measures like distinctions or informativeness to maintain them (Ficici and Pollack, 2001).
7 Conclusions
This contribution builds upon our original study on counterexample-driven genetic programming (Krawiec et al., 2017), a method for synthesizing programs from specifications. We extended CDGP with a fitness threshold parameter that controls the frequency of program verification, found that setting it to a non-extreme value (0.75) tends to systematically improve the odds of successful synthesis, and proposed an explanation for this observation. We introduced a rigorous conceptual framework for turning counterexamples into tests, based on the well-defined notions of the single-output property and the single-invocation property. We updated and improved several technical internals of the method and applied it to a larger suite of benchmarks, showing, among other things, its capability to synthesize both integer-based (LIA) and string-processing (SLIA) programs. Last but not least, we compared CDGP to two state-of-the-art exact methods of formal program synthesis; CDGP, albeit slower, has been shown to produce shorter programs.
With this work, we also hope to help bridge the gap between test-based and spec-based synthesis. As we argued in Section 2.1, these two paradigms, though often perceived as disparate, have certain commonalities and their marriage can be mutually beneficial. Test-based synthesis, by opening to formal specifications, may gain correctness guarantees. Spec-based synthesis, faced with the combinatorial explosion of systematic search, may benefit from including heuristics as search guidance, and thanks to that scale better. From a broader perspective, such “middle ground” approaches address one of the fundamental, if not existential, questions of program synthesis, that is, how should user intent (Gulwani, 2010) be expressed? It is clear that tests and specifications are just the extremes of a conceivably rich spectrum.
CDGP in its current form is admittedly not free from certain challenges. The main shortcoming of the approach presented in this article is its search operators. The way CDGP exploits knowledge obtained through the use of a solver is far from sophisticated, to say the least. The search operators, taken verbatim from standard GP, are unaware of how programs interact with tests. It thus seems desirable to make search operators better informed by the verification process, which we find the most promising direction for further research.
Acknowledgments
I. Błądek and K. Krawiec acknowledge support from grant 2014/15/B/ST6/05205 funded by the National Science Centre, Poland. J. Swan acknowledges the support of the EU H2020 SAFIRE Factories project.
Notes
In contrast, note that the solver is called for verification (Algorithm 1) at most once per evaluated program (the number of tests in $Tc$ has no impact on the number of solver's calls used for verification).
The original LIA grammar does not define such a range, and formal synthesis methods can thus generate programs with arbitrary integers. This becomes relevant when comparing CDGP against formal methods in Section 5.7.
In the original SLIA grammar of the SyGuS competition there was also a production for the Boolean (B) type, but it was not used in other productions, and consequently it was never utilized in our experiments.