In the multi-billion dollar formulated product industry, state of the art continues to rely heavily on experts during the “generate, make and test” steps of formulation design. We propose automation aids to each step with a knowledge graph of relevant information as the central artifact. The generate step usually focuses on coming up with new recipes for intended formulation. We propose to aid the experts who generally carry out this step manually by providing a recommendation system and a templating system on top of the knowledge graph. Using the former, the expert can create a recipe from scratch using historical formulations and related data. With the latter, the expert starts with a recipe template created by our system and substitutes the requisite constituents to form a recipe. In the current state of practice, the three steps mentioned above operate in a fragmented manner wherein observations from one step do not aid other steps in a streamlined manner. Instead of manually operated labs for the make and test steps, we assume automated or robotic labs and in-silico testing, respectively. Using two formulations, namely face cream and an exterior coating, we show how the knowledge graph may help integrate and streamline the communication between the generate, the make, and the test steps. Our initial exploration shows considerable promise.
We encounter formulated products many times in our daily lives. Products like personal, home, industrial care, pharma and health care, coatings (paints) and surfaces (lubricants, adhesives), and confectionery foods and drinks are pervasive in their use. The formulated products industry is an expanding global market of around 1400B Euro. Despite this scale, the state of the art in designing formulated products relies heavily on the experts' experiential knowledge.
The design of formulated products involves distinct steps. First, the expert needs to arrive at a feasible recipe for the product. This step involves searching and selecting ingredients, weights, possible mixtures, and recipe steps containing process actions at certain conditions. Experts carry out these activities manually, i.e., searching for ingredients with specific functionalities, combining them with other ingredients, and deciding what is done to them. The following two steps involve making the product and testing it for its intended purpose, and they consist of considerable experimentation. Overall, significant manual intervention at every step leads to a long time to market, often months or years, and enormous cost. These steps are individually carried out in silos [2, 3, 4, 5, 6, 7]. It is possible to imagine digitalization and automation in each of the steps mentioned above. However, reliance on experts and the siloed nature of formulation design mean that the steps do not fully benefit from automation introduced in any one of them.
We propose aided formulation recipe generation by storing information relevant to formulations as a knowledge graph and creating recommendation and template generation and substitution systems on top of this knowledge graph. Additionally, we show that if formulation making and formulation testing had automated or in-silico realizations, it is possible to use the knowledge graph as the connecting link between the three steps, reducing the siloed nature and benefiting from observations in each step. Our specific contributions are as follows:
We aid the expert in generating formulation recipes in two different ways:
The expert may design the formulation from scratch, receiving recommendations using the knowledge graph on ingredients, mixtures, and weights, actions to be applied to ingredients, and conditions at which to apply the actions.
The expert may start with a templatized recipe of the intended formulated product. We show how such a template can be created using the knowledge graph. In this case, the expert substitutes representative ingredients, also in an aided manner, to achieve the same result.
Following the generate step, as stated above, we expect the make and test steps to take place using automated robotic labs and in silico, respectively. With two examples, a cosmetic formulated product and a paint formulated product, we show how to use the knowledge graph to streamline sharing of information among the three steps.
As illustrated in Figure 1, using a knowledge graph as a central connecting artifact between aided formulation generation, automated formulation making, and in-silico testing, we aim to reduce the over-reliance on experts and help integrate the largely siloed steps in formulation design.
Figure 1: Integrated generate, make, and test for formulated products.
Of the two ways to generate formulations, namely starting from scratch with the intended product and starting with a template of the intended product, we covered the former in our earlier work. In Section 3, we recount it to contrast it with the latter. Both ways require several different kinds of information to be extracted and stored as a knowledge graph, as shown in Figure 1, and various analyses to be performed before proceeding with the generation of recommendations.
We assume interfacing systems in both the robotic lab and in-silico testing to exchange information with the knowledge graph. The generate step provides the base recipe to the robotic lab system representing the make step and receives a validated recipe in return, as shown in Figure 1. We explain this in detail in Section 4. Similarly, in-silico testing provides information about tests conducted, in the form of physicochemical attributes of ingredients and mixtures, to the generate step, which stores it in the form of heuristics. These heuristics can be queried in the future when iteratively conducting the generate, make, and test steps. In Section 4, we show two different formulated products, face cream and an exterior coating, carried through these steps in the manner described above. Section 5 concludes the paper. We begin next with related work.
2. RELATED WORK
2.1 Formulated Products
A formulated product is a product with well-defined target properties that contains a minimum of two selected, processed, and combined ingredients [4, 9, 10, 11, 12]. The word formulation is used in several senses; for instance, it may mean a recipe, i.e., a list of ingredients with processing steps. It also denotes the act of formulating, meaning the combination of processes used for mixing and conditioning of active, protective, or stabilizing ingredients and the know-how that enables the selection of ingredients. Finally, it also indicates the actual blend of ingredients.
Their ubiquitous usage in our daily lives makes the formulated products industry a vast enterprise with a multi-billion-dollar turnover① each year. Chemical products are designed and developed mainly through manual, heuristic-rule-based and trial-and-error, experiment-based approaches [5, 6, 7]. As a subset, this applies to formulated products as well. These products contain ingredients that undergo a step-by-step procedure that may include heating, cooling, stirring, and mixing to obtain specific target properties, both physical and chemical.
Individual ingredients may provide active functionality, as in active pharmaceutical ingredients in medicine, enable enhanced delivery (especially in skin-applicable formulated products such as many cosmetic products or pharmaceutical pastes), or act as a protective or stabilizing agent [9, 14, 15].
Many ingredients used in formulated products are multi-functional [15, 16]; for instance, Cetyl Alcohol is used as an emulsifier and a thickening agent in cosmetic products, whereas, as a food additive, it is used as a flavoring agent. Even though an ingredient may be multi-functional, it is generally used for its primary functionality in specific formulated products. The enormous number of ingredients available to make various kinds of formulated products poses the following concerns [10, 12, 15, 16]: If a specific functionality such as an emulsifier is needed, which representative ingredient should be used? Should the ingredients be processed separately or as parts of a mixture (generally referred to as a phase in formulation texts)? How will the choice of ingredients and relative quantities affect the properties of the product, especially sensory attributes? What steps should be followed, and in what order, to arrive at the final product? Such complexity has led many researchers to propose formulation design frameworks, which we review next.
2.2 Formulation Design Frameworks
Many authors have come up with formulation design frameworks. A generic framework for chemical product design, also applicable to formulated products, starts with the identification of consumer needs, proceeds to translating these needs into chemical/physical properties [2, 18, 19, 20], and ends with manufacturing the product [3, 11, 21]. Design frameworks focusing on formulated product design discuss all three steps in formulated product design, namely generate, make, and test [3, 5, 15, 17, 20, 21, 11], but often without considering automated or robotic labs for the make step and without in-silico realization of the test step.
Nearly all design frameworks discuss process design with equipment and their operating conditions and optimal economic parameters.
Based on our interactions concerning digitalization and automation with some of the largest formulated product companies, we feel the need to consider automated make and in-silico test steps in concert with the aided generate step in formulated product design.
2.3 Automated “Make”
Automated laboratories refer to labs where robots carry out various tasks, from picking up specific ingredients from a particular shelf/container as per the recipe, to measuring the correct quantity of the material to be added, to performing the appropriate action (mixing, cooling, heating, etc.).
The concept of high-throughput (HT) screening has been around for more than five decades [22, 23, 24] and has been applied across a wide range of domains such as electrolytes, catalysts, and biomedical research.
Most recently, there have been specific mentions of high-throughput formulation engines: fully automated robotic systems that can make and test hundreds of formulations per day. Evonik's High Throughput Equipment (HTE)② is 2 meters high, occupies 120 square meters, has 13 robots performing various tasks, and can produce 120 formulations on average in 24 hours.
Similarly, Materials Innovation Factory has recently launched a bespoke facility, viz. Formulation Engine③, with an investment of around 3 million pounds to enable entirely automated making and testing of formulations. It is a modular facility allowing up to 6 separate processing and testing stations to be connected.
The concept of autonomous or self-driving laboratories is comparatively recent; here, automation exists in conjunction with artificial intelligence (AI). AI suggests which experiments to perform next based on past learning experiences, and robotic platforms merely follow instructions. Various research groups have explored this concept for accelerating scientific discovery in a myriad of applications such as battery electrolytes, thin-film materials [29, 30], inorganic compounds [31, 32], scientific instruments, clean energy, natural products, and alloys.
ChemOS is another recent attempt at a generic architecture for autonomous laboratories. It consists of orchestration software with fundamental layers for database management, experiment scheduling and design, feedback, etc., which can coordinate the overall workflow in a typical autonomous lab.
Gromski et al. and Steiner et al. described an abstraction called Chemputer, which mirrors the working of a chemist and enables linkage to physical operations of the automated robotic platform. They also described a program called Chempiler to produce machine-level instructions for the synthesis robot.
A common takeaway is that most of these works essentially solve (or have modules that solve) the experiment design problem by posing it as an optimization problem. The objective function usually describes the deviation between the actual and the desired property of the material/formulation under consideration. The algorithm searches for the optimal set of inputs (weight fractions, volumes, process parameters, etc.) that minimizes the objective function, in turn recommending the next instance of the experiment to be performed. This contrasts with manually operated experiments, which involve a factorial number of experiments under design of experiments (DOE), thereby consuming considerably more resources and time.
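The optimization view of experiment design described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the surrogate model `predicted_viscosity`, its coefficients, the grid bounds, and the function names are all assumptions made for illustration, and an exhaustive grid search stands in for whatever optimizer a real autonomous lab would use.

```python
import itertools


def predicted_viscosity(oil_fraction, thickener_fraction):
    # Hypothetical linear surrogate for a measured property; the coefficients
    # are invented for illustration and carry no physical meaning.
    return 1000.0 + 8000.0 * oil_fraction + 60000.0 * thickener_fraction


def objective(inputs, target):
    # Squared deviation between the predicted and the desired property.
    return (predicted_viscosity(*inputs) - target) ** 2


def recommend_next_experiment(target):
    # Exhaustive search over a coarse grid of weight fractions (assumed
    # bounds); a real system would likely use a more sample-efficient method.
    oil_grid = [round(i * 0.01, 2) for i in range(10, 31)]       # 0.10 .. 0.30
    thickener_grid = [round(i * 0.01, 2) for i in range(0, 6)]   # 0.00 .. 0.05
    candidates = itertools.product(oil_grid, thickener_grid)
    return min(candidates, key=lambda inputs: objective(inputs, target))
```

Calling `recommend_next_experiment(target)` returns the pair of weight fractions on the grid whose predicted property lies closest to the target, which plays the role of the next experiment to run.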
2.4 In-silico “Test”
Testing is one of the most crucial steps in realizing a formulated product, as it decides whether the formulation goes to production. Testing aims at such purposes as evaluating the product against the customer brief and for compliance/registration④. Testing is conducted at various stages across the formulated product design lifecycle, (a) at the ingredient level and (b) at the product (formulation) level, and may result in changes (either in choice of ingredient or process parameters, etc.). The tests conducted can be categorized broadly into tests for evaluating (a) physicochemical properties such as density, viscosity, pH levels, conductivity, and surface tension, (b) safety properties such as toxicity and flammability, (c) efficacy properties like release profile and bioavailability, (d) stability properties such as phase separation and interaction with the environment, and finally but most importantly (e) product brief related properties.
All these tests are performed using specific instruments⑤ by following the specific procedure, which may differ as per standards being followed, such as ASTM⑥ and ISO⑦, which detail the experimental setup and the conditions for the tests.
Although most of these tests are conducted manually (or by robots in the recent attempts of autonomous labs), there have been various attempts at either coming up with empirical relations, mathematical models, or exploring modelling and simulation (based on first principles, multi-scale) to reduce the time and resources utilized during some of these tests.
Our digital skin platform⑧ uses multi-scale modelling and virtual reality (VR) for transdermal pharmaceutical and cosmetics delivery. The platform leverages micro- and macro-scale modelling and is capable of in-silico testing. It can emulate the physicochemical properties of human skin, thereby facilitating the study of how constituents of new formulations are transported through its layers.
We believe that the best way to enable the integration of generate, make, and test activities is to store the knowledge required to design formulated products and share it across the three steps. We review the different kinds of information of interest for this next.
2.5 Kinds of Knowledge in Formulated Products Industry
Figure 2: Kinds of knowledge required for formulated products design.
Depending on the type of formulated products, such as cosmetic products or paints and coatings, several specific products exist within. For instance, cosmetic products may be antiperspirants and deodorants, baby products, bath and shower products, beauty aids, creams, fragrances and perfumes, hair care products, insect repellents, lotions, shampoos, shaving products, soaps, sun care products, and so on. Similarly, paint products could be coatings and topcoats, coil coatings, enamels, exterior and interior paints, lacquers, primers, sealers, stains, texture paints, etc.
Individually each of these could be considered a category in itself. For instance, creams could be cleansing cream, cold cream, night cream, anti-inflammatory cream, anti-wrinkle cream, and so on. Product functionality captures the specific purposes of the product. It is possible to carry out analyses at the required level by storing the details of formulation types and categories.
For individual products, ingredients assume the most vital role. Since individual recipes may refer to an ingredient by any of its synonyms, storing this information is necessary. The next critical piece of information is ingredient functionality. Formulations are the combination of functionalities offered by the ingredients used; for instance, creams, in general, tend to contain an emulsifier, an emollient, and a thickener. Process steps are the actions applied to ingredients, either standalone or as phases/mixtures, to obtain the target product. We discuss ingredient synonyms, functionalities, and process steps later in Section 3.
The ingredient/mixture properties, as well as microstructure descriptors, describe physical properties. Equipment and properties of equipment, along with operating conditions and economic data on fixed and operational costs, are referred to in the make step. We discuss some of these later in Section 4.
Product brief could be considered a qualitative function of the ingredients, often describing consumer preferences with sensory and other fuzzy properties. These are often mapped to a combination of ingredient functionalities, ingredient/mixture properties, and microstructure descriptors and treated as heuristics to arrive at a suitable qualitative requirement [15, 20, 46, 47].
In the next section, we show how we process and analyze various pieces of information shown in Figure 2. Storing this information as a knowledge graph enables recommending ingredients, weights, process actions, and conditions to the expert as he or she generates a formulated product candidate. It is also possible to generate a template for a specific formulated product. We discuss both ways of carrying out the generate step in the following.
3. KNOWLEDGE GRAPHS
We discussed how we extract various formulation constituents in our earlier work. In the following, we briefly touch upon how we extract the relevant information and present it as a knowledge graph. We detail the formulated products data we use in our experiments. We then discuss the kinds of analyses necessary to generate new formulations in an aided manner.
We presented a human-in-the-loop way of generating desired formulations in our earlier work, which we revisit to contrast with another way of generating formulations. This other way uses generic templates created for specific product types as the starting point. Although the core requirements from a system standpoint remain the same, the template-based approach enables a global view of the requisite constituents for a specific type of formulated product. We present these and other key differences between these two ways of generating new formulations.
We begin by reviewing information extraction from our earlier work next.
3.1 Information Extraction
As detailed in our earlier work, we use regular expressions for extracting the ingredients, their weights, phases, and product name.
To extract recipe actions, we use an indicator list of verbs. We compile the list using the formulated products data detailed in Section 3.3. Verbs such as maintain, heat, add, stir, moisturize, cool, extract, demineralize, mix, disperse, blend, emulsify, select, distill, and chelate assume an essential role as process actions.
These verbs, used along with sentence boundary detection⑨, help separate the text listing the ingredients (which contains no action verbs) from the text containing the procedure, which we call the recipe text. As illustrated in Figure 3, the ingredients occur in the part of the text that is NOT a set of sentences (whereas the recipe text is). The verbs also help in processing the recipe text to extract actions and conditions based on subject-verb-object structures of the sentences in the recipe text where the verb is one of the verbs from the list. We create Action-Mixture/Ingredient-Condition (A-M-C) triples from every sentence. The collection of A-M-C triples from the recipe text is similar to the concept of the action graph described in several works such as [48, 49, 50, 51, 52].
Figure 3: Extraction of formulation constituents.
We treat the ellipsis problem, i.e., references to ingredients and/or phases/mixtures mentioned at earlier stages, using a stacking mechanism on top of information extraction techniques. Figure 3 shows an example formulation text along with the A-M-C structures obtained via open information extraction (Open IE⑩) and dependency parsing⑪, respectively. For a detailed discussion and validation of the formulation constituents' extraction, we refer the reader to our earlier work.
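To make the shape of an A-M-C triple concrete, the following is a deliberately simplified sketch. It uses the indicator verb list given above but replaces Open IE and dependency parsing with a regular expression, so it is a stand-in rather than the extraction pipeline described here. The function name `extract_amc`, the condition pattern (a phrase starting with "to", "at", or "until" followed by a number), and the sample recipe text are assumptions; the stacking mechanism for ellipsis is not implemented, and a missing mixture is simply reported as `None`.

```python
import re

# Indicator verbs listed in the text.
ACTION_VERBS = {
    "maintain", "heat", "add", "stir", "moisturize", "cool", "extract",
    "demineralize", "mix", "disperse", "blend", "emulsify", "select",
    "distill", "chelate",
}

# Assumed pattern: a condition starts with "to", "at", or "until" and a digit.
CONDITION = re.compile(r"\b(?:to|at|until)\s+(\d.*)$", re.IGNORECASE)


def extract_amc(recipe_text):
    """Return one (action, mixture, condition) triple per sentence."""
    triples = []
    # Naive sentence boundary detection on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", recipe_text.strip()):
        words = sentence.rstrip(".!?").split()
        for i, word in enumerate(words):
            if word.lower() in ACTION_VERBS:
                rest = " ".join(words[i + 1:])
                match = CONDITION.search(rest)
                condition = match.group(1) if match else None
                mixture = rest[:match.start()].strip() if match else rest
                # An empty mixture signals an elided reference (ellipsis).
                triples.append((word.lower(), mixture or None, condition))
                break  # only the first action verb per sentence is used
    return triples


# Invented sample recipe text, for illustration only.
sample = "Heat phase A to 75 C. Add phase B to phase A. Cool to 40 C."
triples = extract_amc(sample)
```

For the invented sample, the third sentence yields a triple with no mixture, which is exactly the kind of elided reference that the stacking mechanism is meant to resolve.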
We discuss the extraction of additional information such as synonyms and functionalities of ingredients from online sources in Section 3.3. Extracting these details does not require specialized information extraction techniques and relies mainly on packages for accessing Web pages⑫ and pulling requisite data⑬.
We refer to the concepts in the domain model in Figure 4 in Section 3.4 when discussing the analyses we need to conduct to enable aided generation of formulations. We refer to the top-level type of formulated products, such as cosmetic products or paints, as the FormulationType. A specific category, such as creams or lotions within a FormulationType of cosmetic products, is referred to as a FormulationCategory. Within a FormulationCategory like creams, there could be numerous formulations such as face creams, anti-aging creams, and acne creams, each referred to as a Formulation. Ingredient instances partake in Formulation instances.
Figure 4: Graph domain model for formulated products data.
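The containment hierarchy named above (FormulationType, FormulationCategory, Formulation, Ingredient) can be sketched as plain data classes. This is only an illustrative in-memory rendering, not the graph schema of Figure 4: the attribute names and the weight value are assumptions, and a graph database would more naturally attach weight and phase to the relationship between a formulation and an ingredient.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ingredient:
    name: str
    weight_percent: float          # assumed attribute, for illustration
    phase: Optional[str] = None    # e.g. "oil phase"; None when unassigned


@dataclass
class Formulation:
    name: str
    ingredients: List[Ingredient] = field(default_factory=list)


@dataclass
class FormulationCategory:
    name: str
    formulations: List[Formulation] = field(default_factory=list)


@dataclass
class FormulationType:
    name: str
    categories: List[FormulationCategory] = field(default_factory=list)


# The weight below is an invented placeholder, not a value from the data.
cetyl = Ingredient("Cetyl Alcohol", weight_percent=2.0, phase="oil phase")
face_cream = Formulation("face cream", [cetyl])
creams = FormulationCategory("creams", [face_cream])
cosmetics = FormulationType("cosmetic products", [creams])
```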
3.2 Graph Representation
Earlier in Figure 2, we showed the kinds of knowledge necessary for formulated products design. We also briefly touched upon the information extraction techniques we use to extract relevant data from offline and online sources in the previous section. Using the details presented in Figure 2, we can determine what we can refer to as a domain model or a schema for a graph database.
The bottom of Figure 4 shows formulation constituents and other details from offline and online sources of information about formulated products. The top of Figure 4 shows the schema underlying the graph database we use to store the extracted details, which we derive from Figure 2. Numbers 1–6 show how various extracted details map to specific concepts in the schema. Note that currently, we do not extract ingredient/mixture properties like values of solubility parameter, viscosity, and levels of spreading. With the proposed integration of the generate, make, and test steps, however, these details can be obtained similarly from online sources as required, or computed by in-silico models as needed, as discussed later in Section 4.
We experiment with Neo4j⑭ and RDF⑮, which implement labeled property graphs and triple stores, respectively. For querying the Neo4j database, we use the Cypher⑯ query language, and for triple store representation, we use SPARQL⑰ (Figure 5). It is possible to implement the domain model or the schema shown at the top of Figure 4 in multiple ways in both Neo4j and RDF. For instance, in our earlier work, we describe a Neo4j implementation in which we do not deduplicate the ingredients and use separate lists of ingredient synonyms and functionalities. In the RDF-based implementation, we include these details in the graph itself while also deduplicating ingredient details: an ingredient that appears in several formulations is stored only once and referenced. We show an example of such an implementation arrangement in Figure 6 in Section 3.4.
Figure 5: Examples of queries based on domain model in Figure 4 in Cypher and SPARQL.
Figure 6: Relating ingredient synonyms and functionalities.
Figure 5 shows example queries in both Cypher and SPARQL on top of Neo4j and RDF databases, respectively. The first query returns action graphs or collections of A-M-C triples from “all purpose” cream instances in the database. The second and third queries return all formulations containing the ingredient Cetyl Alcohol and its weights, respectively. Our observation is that it is easier to form SPARQL queries than Cypher queries. In contrast, it is easier to represent a complex schema in a labeled property graph database like Neo4j. Since it is possible to convert either kind of graph to the other, the choice between the two becomes qualitative [54, 55] and boils down to ease of implementation in a particular context.
3.3 Formulated Products Data
We demonstrate our approach using 410 cream formulations obtained from Volumes 1 to 8 of the book Cosmetic and Toiletry Formulations by Flick, E. W. (1989–2014) and 67 paint formulations obtained from another such book on paints by the same author.
A total of 2,633 ingredients exist in 410 cosmetic formulations, out of which 1,086 are unique, and 333 ingredients repeat more than once. In the case of paint formulations, we found 303 unique ingredients.
For each ingredient, we gather the following details from online sources:
IUPAC (International Union of Pure and Applied Chemistry) name, synonyms, PubChem CID, and the uses section from Wikipedia.
Chemical formula and PubChem CID from PubChem.
Synonyms from ChEBI and ChemSpider, which we reach through the links given on Wikipedia.
We also extracted ingredient functionalities, especially for cosmetic products, from other online sources㉒. On average, every ingredient was found to have 12 synonyms and at least 2 functionalities associated with it.
Having collected these data and stored them as a graph, we need to carry out several additional analyses before generating formulations. We discuss these analyses next.
3.4 Analyses for Formulated Product Variants
The primary method of coming up with a new formulation for a given formulated product type with specific properties consists of manually deciding and gathering the requisite ingredients based on their functionalities and applying specific actions at specific conditions.
We should be able to relate the functionalities of an ingredient established previously to any of its synonyms. This requirement is more of an implementation-level detail. We address it by storing the ingredient and its synonyms and functionalities (as shown for an example ingredient in Figure 6) and referring to a single ingredient node whenever its synonyms are part of a formulation.
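The idea of referring to a single ingredient node regardless of which synonym a recipe uses can be sketched with a small lookup structure. This is an in-memory illustration, not the RDF or Neo4j implementation described here; the class name `IngredientIndex`, its methods, and the case-insensitive matching are assumptions. The functionalities shown for Cetyl Alcohol are the ones mentioned in Section 2.1.

```python
class IngredientIndex:
    """Maps every name or synonym to one canonical ingredient entry."""

    def __init__(self):
        self._canonical = {}        # lowercased alias -> canonical name
        self._functionalities = {}  # canonical name -> set of functionalities

    def add(self, name, synonyms=(), functionalities=()):
        for alias in (name, *synonyms):
            self._canonical[alias.lower()] = name
        self._functionalities.setdefault(name, set()).update(functionalities)

    def resolve(self, mention):
        # Returns the canonical name, or None for an unknown mention.
        return self._canonical.get(mention.strip().lower())

    def functionalities_of(self, mention):
        return self._functionalities.get(self.resolve(mention), set())


index = IngredientIndex()
index.add(
    "Cetyl Alcohol",
    synonyms=["1-Hexadecanol", "Hexadecan-1-ol", "Palmityl alcohol"],
    functionalities=["emulsifier", "thickening agent"],
)
```

With this arrangement, a recipe that mentions "1-Hexadecanol" resolves to the same entry, and hence the same functionalities, as one that mentions "Cetyl Alcohol".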
To enable the generation of formulations, we need to carry out the further analyses described next:
Phase Naming Reconciliation. Formulations collected from several sources are likely to name the phases or mixtures in different ways. We need to reconcile the naming of phases to relate the ingredients and functionalities to the correct phases. We assume phases to be correct when they carry descriptive names (as in water phase or oil phase in cosmetic formulations) in most formulations of a FormulationCategory available in the data, as against arbitrary names such as phase A and phase B. If the data contain no formulations with appropriately named phases or contain a formulation with no phases, we obtain the most common named phases by consulting with the expert before starting the reconciliation.
We infer the correct phases whenever they are missing, using the mentions of the same ingredients in other formulations where they are part of a specific phase (such as water phase or oil phase in case of creams or other cosmetic products). We show an example of phase naming reconciliation in Figure 7.
In Figure 7, vol∗ indicates different formulations. Having seen the ingredients Propylene Glycol and Cetyl Alcohol in water and oil phases respectively in other formulations of the same FormulationCategory, it is possible to ascribe them to specific phases when the formulation text does not explicitly refer to these phases or refers to them with arbitrary phase naming, such as Part A, as shown in Figure 7. It is also possible to find a particular ingredient, such as a fragrance, that is used in addition to other ingredients in named phases but not as part of these phases. We call such ingredients parts of additional phases.
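The reconciliation described above can be sketched as a majority vote over the named phases in which an ingredient has been seen. The sketch below is one possible reading, not the exact procedure: the data layout (lists of ingredient and phase pairs), the function names, the set of named phases, and the "unresolved" marker for ingredients never seen in a named phase are all assumptions.

```python
from collections import Counter, defaultdict

# Assumed set of descriptive phase names for cream formulations.
NAMED_PHASES = {"water phase", "oil phase"}


def build_phase_votes(formulations):
    """Count, per ingredient, how often it appears in each named phase."""
    votes = defaultdict(Counter)
    for formulation in formulations:
        for ingredient, phase in formulation:
            if phase in NAMED_PHASES:
                votes[ingredient][phase] += 1
    return votes


def reconcile(formulation, votes, fallback="unresolved"):
    """Replace missing or arbitrary phase names with the most common named phase."""
    result = []
    for ingredient, phase in formulation:
        if phase not in NAMED_PHASES:
            seen = votes.get(ingredient)
            phase = seen.most_common(1)[0][0] if seen else fallback
        result.append((ingredient, phase))
    return result


# Invented miniature data set, for illustration only.
history = [
    [("Water", "water phase"), ("Propylene Glycol", "water phase"),
     ("Cetyl Alcohol", "oil phase")],
    [("Propylene Glycol", "water phase"), ("Cetyl Alcohol", "oil phase")],
]
new_formulation = [("Propylene Glycol", "Part A"), ("Cetyl Alcohol", "Part A"),
                   ("Fragrance", None)]
reconciled = reconcile(new_formulation, build_phase_votes(history))
```

Ingredients left as "unresolved" would be candidates for an additional phase or for a decision by the expert.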
Instruction Sequencing. In generating a new variant from scratch or using a template to kick-start this process, we need to order the sequence of instructions correctly. To establish an ordered sequence of instructions, we use the A-M-C triples obtained from the recipe texts of formulated products of a specific FormulationCategory. We compute the counts of Action, Mixture, and Condition instances occurring at specific positions and assign the highest rank to the maximum count at a particular position.
When constructing the ordered sequence, we choose the action at position i that has the highest rank. In the same way, we choose a mixture and a condition for position i. If the highest rank at position i is shared by two or more actions, we break the tie by taking the highest rank of action-mixture pairs.
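The position-wise ranking and the tie-break can be sketched as follows. This is a minimal reading of the description above, with assumptions: each recipe is a list of (action, mixture, condition) tuples, the mixture and condition are chosen independently of the action, and any remaining tie is resolved by first occurrence. The function name and the sample recipes are invented.

```python
from collections import Counter


def sequence_instructions(recipes):
    """Build one ordered instruction list from many A-M-C sequences."""
    ordered = []
    for i in range(max(len(recipe) for recipe in recipes)):
        steps = [recipe[i] for recipe in recipes if i < len(recipe)]
        actions = Counter(action for action, _, _ in steps)
        mixtures = Counter(mixture for _, mixture, _ in steps)
        conditions = Counter(condition for _, _, condition in steps)

        top = max(actions.values())
        tied = [action for action, count in actions.items() if count == top]
        if len(tied) > 1:
            # Tie-break: the most frequent action-mixture pair among tied actions.
            pairs = Counter((a, m) for a, m, _ in steps if a in tied)
            action = pairs.most_common(1)[0][0][0]
        else:
            action = tied[0]

        ordered.append((action,
                        mixtures.most_common(1)[0][0],
                        conditions.most_common(1)[0][0]))
    return ordered


# Invented sample sequences, for illustration only.
recipes = [
    [("heat", "water phase", "75 C"), ("add", "oil phase", "75 C"),
     ("cool", "batch", "40 C")],
    [("heat", "water phase", "75 C"), ("add", "oil phase", "70 C"),
     ("cool", "batch", "40 C")],
    [("mix", "water phase", "75 C"), ("add", "oil phase", "75 C")],
]
ordered = sequence_instructions(recipes)
```

The expert would still review the resulting sequence, since independent per-position choices can combine an action with a mixture or condition it never co-occurred with.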
Functionality Ranking in Phases. As indicated above, ingredients (representing specific Functionality) occur in specific phases. Once we reconcile the phases' names, it is possible to map specific functionalities to the ingredients in the named phases. Many ingredients represent multiple functionalities. To associate a specific functionality from many possible functionalities of an ingredient, we assume that in a particular kind of formulated product, an ingredient is often used for its primary functionality, whereas its secondary functionalities are of no consequence or optional as far as that specific kind of formulated product is concerned. Therefore, we rank these functionalities of an ingredient based on the percentage of functionality occurrences in various phases.
We calculate functionality counts (via ingredients) and take the percentage of the functionalities' occurrence in various phases. This step returns the functionalities that map to mainly water, oil, and additional phases for cosmetic products. We then categorize the functionalities into primary and secondary functionalities per phase based on count percentage; primary functionalities are the ones that occur more than 80% of the time, while secondary functionalities occur in less than 80% of the formulations. We use the 80% threshold based on our observations from recipe variants of many formulated products of the same FormulationCategory.
Figure 8 shows an example of computing primary and secondary functionalities for face creams. We conduct this analysis mainly for the template-based approach, although it is often helpful to the expert to know which functionalities are primary in a specific FormulationCategory even when generating the formulation from scratch.
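The primary/secondary split can be sketched as below. The interpretation is an assumption: the percentage is taken as the share of formulations in which a functionality occurs in a given phase, and a functionality at exactly the threshold is treated as secondary because the text leaves that boundary open. The function name, the data layout, and the sample data are invented.

```python
from collections import defaultdict


def rank_functionalities(formulations, threshold=0.8):
    """formulations: list of dicts mapping phase name -> functionalities."""
    counts = defaultdict(lambda: defaultdict(int))
    for formulation in formulations:
        for phase, functionalities in formulation.items():
            # Count each functionality once per formulation and phase.
            for functionality in set(functionalities):
                counts[phase][functionality] += 1

    total = len(formulations)
    ranking = {}
    for phase, per_phase in counts.items():
        ranking[phase] = {
            "primary": sorted(f for f, n in per_phase.items()
                              if n / total > threshold),
            "secondary": sorted(f for f, n in per_phase.items()
                                if n / total <= threshold),
        }
    return ranking


# Invented data: five formulations of one hypothetical FormulationCategory.
data = [
    {"oil phase": ["emollient", "emulsifier"], "water phase": ["humectant"]},
    {"oil phase": ["emollient", "emulsifier"], "water phase": ["humectant"]},
    {"oil phase": ["emollient", "emulsifier", "antioxidant"],
     "water phase": ["humectant", "preservative"]},
    {"oil phase": ["emollient", "emulsifier", "antioxidant"],
     "water phase": ["humectant", "preservative"]},
    {"oil phase": ["emollient", "emulsifier"],
     "water phase": ["humectant", "preservative"]},
]
ranking = rank_functionalities(data)
```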
Ingredient Correlations. To help the expert decide which representative ingredients to choose in combination, either in a standalone manner or as members of named phases, we perform several correlation analyses.
Figure 9 shows the co-occurrences of ingredients or lack thereof, esp. in the same phase or mixture. Such clusters of ingredients are useful because the membership of an ingredient within a specific phase can inform the choice of other ingredients. Functionalities such as emollients and viscosity controlling dominate in cream formulations, whereas solvents happen to be prominent in paint formulations. Water, Propylene Glycol, Fragrance, Triethanolamine and Cetyl Alcohol are the most common ingredients in cream formulations, whereas Water (deionized), Aqueous Ammonia, and plasticizers like Santicizer 160 are prominent in various paint formulations.
Similarly, Heat, Add, and Cool are some of the most frequently occurring actions in cream formulations, while Add, Blend, Mix, and Mill are the most frequent actions in paint formulations.
We also relate functionalities of ingredients to specific kinds of formulations and thereby to formulation categories. For instance, massage cream instances tend to contain antistatic, binding, buffering, and denaturant functionalities prominently. Similarly, chamomile cream instances tend to contain functionalities such as bulking, humectant, and plasticizer, among others.
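One simple way to obtain the ingredient co-occurrences discussed above is to count how often each pair of ingredients shares a phase. The sketch below assumes plain pair counting, which may differ from the correlation measure behind Figure 9; the function name and the sample phases are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(phases):
    """phases: iterable of ingredient collections, one per phase instance."""
    pairs = Counter()
    for ingredients in phases:
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(set(ingredients)), 2):
            pairs[(a, b)] += 1
    return pairs


# Invented phase contents, for illustration only.
water_phases = [
    ["Water", "Propylene Glycol", "Triethanolamine"],
    ["Water", "Propylene Glycol"],
    ["Water", "Triethanolamine"],
]
pairs = cooccurrence_counts(water_phases)
```

Pairs with high counts suggest ingredients that tend to be chosen together within a phase, while pairs that never appear point to a lack of co-occurrence.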
Weight Ranges of Ingredients Weight ranges need to be associated with finalized ingredients, either when generating a new variant of a FormulationCategory from scratch or when using a template (since in generating a template, we need to associate weight ranges to the ingredients). We calculate the minimum and maximum values of quantities for each ingredient Ij and present as a weight range wmin – wmax.
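A minimal Python sketch of the weight range computation is given below. The function name and the input layout (pairs of ingredient name and weight percentage collected from historical formulations) are assumptions for illustration.

```python
def weight_ranges(records):
    """Compute (wmin, wmax) per ingredient.

    records: iterable of (ingredient, weight_percent) pairs, one pair per
    occurrence of the ingredient in a historical formulation.
    """
    ranges = {}
    for ingredient, weight in records:
        if ingredient in ranges:
            low, high = ranges[ingredient]
            ranges[ingredient] = (min(low, weight), max(high, weight))
        else:
            # First occurrence: the range collapses to a single value.
            ranges[ingredient] = (weight, weight)
    return ranges
```

An ingredient that occurs in only one historical formulation gets a degenerate range with equal bounds, which the expert may wish to widen manually.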
With these analyses in place, generating new variants of a FormulationCategory either from scratch or starting with templates is relatively straightforward. In the next section, we show how we achieve these.
Ranking functionalities of ingredients per phase; Example of face cream ingredients.
Correlations in ingredients, functionalities, and actions for formulations (at the top, for Creams; at the bottom, for Paints).
3.5 Generating Formulation Variants from Scratch
At this point, we essentially transform a predominantly manual set of steps into an aided set of steps as described below:
Given a specific kind of FormulationCategory, the formulation design system queries the functionalities it usually contains and shows them to the expert.
For each functionality, the system presents to the expert all the Ingredient instances associated with it and the weight ranges of these instances.
The expert finalizes the set of ingredients by consulting primary and secondary functionalities that this FormulationCategory generally contains and clusters of ingredients that occur together in the named phases of historical formulation of this FormulationCategory.
The system then queries the actions and conditions generally performed on each finalized ingredient in a standalone manner or as a part of a mixture from the A-M-C structures from the stored instances and applies instruction sequencing. The expert finalizes the sequence of instructions.
For a detailed walk-through of a face cream variant generation from scratch, we request the reader to refer to .
3.6 Product Template Generation
Generating a template for a specific FormulationCategory is now easily achievable, given that the formulation data already contain a few instances of that FormulationCategory. Assuming that all the previously listed analyses have been performed for the constituents of a specific FormulationCategory, we generate a template for it as follows:
The templating system queries the primary and secondary functionalities for that FormulationCategory.
The system then queries the ingredients representative of the various functionalities and associates the ingredients' weight ranges to each functionality. If the weight ranges differ drastically for the ingredients representative of the functionality, the system adds a marker indicating this situation.
The system arrives at an ordered sequence of instructions using the A-M-C structures of the formulations as described earlier.
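The first two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not our templating system: the function name, the dictionary layout, and in particular the `spread_ratio` criterion for "differ drastically" (the largest upper bound exceeding five times the smallest upper bound) are hypothetical. The instruction sequencing step is omitted.

```python
def build_template(functionalities, representatives, ranges, spread_ratio=5.0):
    """Assemble template rows: one row per functionality.

    functionalities: list of (functionality, kind) pairs, where kind is
        "primary" or "secondary".
    representatives: dict mapping a functionality to its ingredient names.
    ranges: dict mapping an ingredient name to a (wmin, wmax) tuple.
    spread_ratio: hypothetical cut-off for flagging drastic differences.
    """
    template = []
    for functionality, kind in functionalities:
        ingredients = representatives.get(functionality, [])
        known = [ranges[name] for name in ingredients if name in ranges]
        row = {"functionality": functionality, "kind": kind,
               "ingredients": list(ingredients),
               "weight_range": None, "check_range": False}
        if known:
            low = min(r[0] for r in known)
            high = max(r[1] for r in known)
            smallest_high = min(r[1] for r in known)
            row["weight_range"] = (low, high)
            # Marker: ranges of the representatives differ drastically.
            row["check_range"] = high > spread_ratio * smallest_high
        template.append(row)
    return template
```

The `check_range` flag plays the role of the marker mentioned above: it tells the expert that the combined range hides very different usage levels among the representative ingredients.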
A known Face Cream template (on the left) and the generated template (on the right).
To aid the expert in instantiating the template, we display related information to the expert, as shown in Figure 11. This information shows possible candidates for substitution over functionalities in the template, along with other ingredients known to co-occur and the description of the functionalities.
Aid to experts in instantiating product template.
Observations for isopropyl palmitate (ISP).
3.7 Current Limitations and Future Work in Aided “Generate” Step
Our current implementation of the aided generate step has the following limitations and ways forward:
As discussed in , the information extraction techniques we use tend to fail in specific linguistic cases of recipe text, like gerund verb forms (mixing in “disperse … using high-speed mixing”) and passive instructional sentences. Although such cases are rare, this problem leads to incomplete A-M-C structures. Since formulated product recipes are similar to cooking recipes, we are working on a transfer learning approach. We plan to train an Open IE model on a large cooking recipe data set [57, 58] and then fine-tune it to our data.
When computing weight ranges, we consider the weights given for an ingredient in various formulations. Since the weights are often in some proportion in a particular recipe, we need to keep track of these proportions (such as in a specific type of cream, the emulsifier is x parts, an emollient is y parts, and the thickener is z parts). We plan to present these proportions as additional information to the expert. This forms part of our ongoing work.
We categorize functionalities into primary and secondary functionalities based on occurrence percentage, which is influenced by historical recipes in our data set and may be incorrect. It is possible to correct an incorrect categorization by consulting with an expert in a one-time check.
3.8 Addressing Concerns in Formulation Generation
3.8.1 Human-machine Coordination and Assistance of Experts
We implement the human-in-the-loop formulation generation considering both i) the expert's inputs and ii) the assistance provided to the expert. We can consider two main stages of expert interaction: building the knowledge graph (before using the knowledge graph in formulation generation) and the formulation generation process. Before the population of the knowledge graph, we consult with the expert at various stages, such as phase naming reconciliation (Section 3.4), asking the expert for the known proportions of ingredients as opposed to weights (Section 3.7), and checking functionality categorization with the expert (Section 3.7). These are the steps where the expert provides inputs that are reflected in the knowledge graph.
During the formulation generation process (Section 3.5), the system assists the expert at various stages. The expert, using the system's information and his/her own knowledge and experience, chooses from the suggested ingredients, weights, and process steps. Thus, our system both queries the expert to enrich the knowledge graph and interacts with the expert during the formulation generation process.
3.8.2 Controlling Weights and Proportions of Ingredients
Although ingredient weights and proportions are core secrets for the formulated product companies, we propose using offline sources such as historical formulations to collect the information on weights and proportions of ingredients. We describe the details in Section 3.1 and Figures 3 and 4. Figure 4 distinguishes the various constituents obtained from online and offline sources and it shows that we obtain the ingredient weights/proportions from offline sources. Online sources generally do not contain this information as they describe information specific to an ingredient irrespective of the type of formulated product in which it can be used.
Another critical concern is deciphering the possible effect of one ingredient on the functionalities of other ingredients. We approach this concern in two ways. First, we use historical formulations from published sources as in Section 3.4 to compute various correlations. Second, later in Sections 4.2 and 4.3, we propose to use in-silico testing, wherein physics-based modelling (typically at molecular scale) can be used to investigate the correlation of different ingredients by performing simulations with varying ingredient proportions and assessing their effect on desired properties. It is possible to add this information to the knowledge graph as additional information for an ingredient to be available for the expert's consideration during formulation generation.
3.8.3 Choice between Formulation Generation Methods
We offer the two methods for formulation generation (described in Sections 3.5 and 3.6, respectively) to the expert who can use either one or both methods simultaneously. The template-based approach gives the expert a comprehensive view of the functionalities needed for the formulation. The method to generate formulation from scratch offers suggestions at every step of generation. Suppose the expert is not aware or has not worked on the formulated product type under consideration. In that case, we expect the expert to refer to the template as a ready reckoner. If the expert has experience dealing with the formulated product type under consideration, he/she may choose to use the step-by-step formulation generation. Since we do not enforce one method's use over the other and let the expert choose, we believe that a decision-making mechanism is not required to decide which method to use. The expert can utilize both methods for their intended purpose.
4. INTEGRATING GENERATE, MAKE, TEST STEPS IN FORMULATION DESIGN
The realization of any formulated product usually takes place by following a 3-step process viz. (a) generate (b) make and (c) test. The generate step comes up with the base recipe for a given product on the desired properties intended for the product, as shown in the previous section.
In a typical new product development cycle, the base recipe as constructed at the generate step is merely a starting point. It needs to be refined further on various aspects (e.g., what is the lowest proportion in which the costliest ingredient in the formulation can be used still maintaining the same effect). The make step involves making the formulation in the lab with a range of ingredient proportions and conditions, followed by conducting requisite testing for each formulation to assess various desired and required properties.
The results of the make and test steps are scrutinized to identify the optimum formulation (possibly one among hundreds or thousands) that satisfies the cost-performance trade-off. This optimum formulation is the one that is finally considered for scale-up and becomes the final product.
Traditionally, all three steps have been performed manually in an experience-guided, document-centric, and experiment-heavy manner resulting in stretched cost and low efficacy and agility. With increasing demand and consumer awareness, reducing turnaround times, and stricter regulations, the formulated product industry needs to embrace digitalization in all three steps to derive real value.
With the base recipe for a cosmetic/coating formulation obtained from the generate step as the starting point, the following sections attempt to exemplify how digital intervention at the make and test steps would transform the traditional trial and error approach of formulated product design to a knowledge guided approach.
4.1 Automated “Make” of a Formulated Product Variant
The automated make step requires creating a detailed design of the experiment chart. Such a chart consists of various combinations of proportions of ingredients and conditions associated with actions; each combination corresponding to a unique formulation, which needs to be synthesized by the robotic equipment and tested subsequently.
There are five ingredients (five factors) in the current case of a face cream variant, as shown in Figure 13. We obtained this recipe by carrying out the generate step as detailed in . Considering only the minimum and maximum values of weights for each ingredient (leaving out the action conditions for simplicity of explanation), there are only two levels per factor, which amounts to a minimum of 2^5 = 32 experiments to be carried out.
Integrated Generate, Make, and Test for a Face Cream variant.
To carry out a more detailed exploration, the weight range for each ingredient could be divided into several intervals (equal or unequal); for example, 10%∼35% for isopropyl myristate could be divided into five intervals of 5% each. For instance, even if we consider only two intermediate levels for each ingredient (in addition to the minimum and maximum, hence four levels in total), the complete factorial design of experiments (excluding conditions for actions) for the face cream formulation would amount to 4^5 = 1,024 experiments.
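The growth in the number of experiments can be reproduced with a short Python sketch of a full factorial design. The function names are assumptions for illustration, and the sketch treats each ingredient weight as an independent factor with equally spaced levels; it does not enforce that the weights of a recipe sum to 100%.

```python
from itertools import product


def levels(low, high, n_levels):
    """Return n_levels equally spaced values from low to high, inclusive."""
    if n_levels < 2:
        raise ValueError("at least two levels (minimum and maximum) are needed")
    step = (high - low) / (n_levels - 1)
    return [round(low + k * step, 6) for k in range(n_levels)]


def full_factorial(ranges, n_levels):
    """Enumerate every combination of levels across all ingredients.

    ranges: dict mapping an ingredient name to a (wmin, wmax) tuple.
    """
    names = list(ranges)
    grids = [levels(ranges[name][0], ranges[name][1], n_levels) for name in names]
    return [dict(zip(names, combo)) for combo in product(*grids)]
```

With five ingredients, `full_factorial(ranges, 2)` yields 2^5 = 32 candidate formulations and `full_factorial(ranges, 4)` yields 4^5 = 1,024, while `levels(10, 35, 6)` gives the six grid points that bound five intervals of 5% each.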
It is important to note here that, to truly reject or accept a particular combination of weight proportion and action condition, it would be required to subject the corresponding formulation to testing; thus, 1,024 formulation make experiments also translate to an equivalent number of testing evaluations.
These numbers provide enough motivation for the industry to adopt high-throughput formulation-making procedures (automated laboratories, i.e., the automated make) and, more recently, to exploit the concept of autonomous laboratories. These numbers represent the worst-case scenario in which no heuristic, experience, or prior knowledge is available regarding ingredients, their effects, and roles in the final product's properties.
On the other hand, if we were to follow the autonomous route, which, as explained earlier, poses the problem of finding the best combination of ingredient proportion and action conditions as an experimental design problem, we can expect to reach the optimum formulation in a comparatively smaller number of trials. Many methods can solve this problem of finding the global optimum of a non-convex objective function, viz. random search [59, 60], systematic grid search, and recently Bayesian optimization-based techniques [62, 63, 64].
In the face cream case, consider that the objective function to be minimized is the difference between the desired property (as derived from the product brief) and the obtained property (as calculated from the corresponding test procedure for the property, e.g., viscosity, density, and pH). Given a starting value for each decision variable (which can be any value within the range defined by the lower and upper bounds, inclusive), the random search algorithm suggests the next set of decision variables. These can be used to prepare the formulation according to the recipe, followed by conducting the requisite tests. The above procedure is repeated until the error between the desired and obtained property is minimized. The same discussion applies to the external coating variant shown in Figure 14.
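The loop described above can be sketched in Python as a plain random search. This is only an illustration: the `measure` callable stands in for the robotic make followed by the property test, `toy_viscosity` is a made-up linear response with no physical meaning, and uniform sampling within the bounds is one simple way of suggesting the next set of decision variables.

```python
import random


def random_search(bounds, measure, target, start, n_trials=200, seed=0):
    """Minimize |measure(x) - target| by uniform sampling within bounds.

    bounds: dict mapping a decision variable to a (lower, upper) tuple.
    measure: callable standing in for "make the formulation and test it".
    start: dict with a starting value for each decision variable.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best_x = dict(start)
    best_err = abs(measure(best_x) - target)
    for _ in range(n_trials):
        candidate = {name: rng.uniform(low, high)
                     for name, (low, high) in bounds.items()}
        err = abs(measure(candidate) - target)
        if err < best_err:  # keep the best formulation seen so far
            best_x, best_err = candidate, err
    return best_x, best_err


def toy_viscosity(x):
    """Hypothetical property model, used here only to exercise the loop."""
    return 2.0 * x["thickener"] + 0.5 * x["emollient"]
```

In practice each call to `measure` is an experiment, so `n_trials` is the experimental budget; this is the reason sample-efficient alternatives such as Bayesian optimization are attractive.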
Integrated generate, make, and test for an External Coating variant. Note: In-silico polymer: multi-scale modelling - adapted and redrawn from  Copyright (2019) Taylor and Francis.
4.2 Completing Integrated Design of a Face Cream Variant with In-silico Testing
Human skin is a layered organ (epidermis, dermis, hypodermis, etc.). The topmost layer of the epidermis viz stratum corneum (SC) acts as a barrier and exhibits selective permeation behavior. Thus, cosmetic and toiletry formulations (creams, soaps, etc.), especially those used for topical applications, often find a need to contain supporting ingredients (e.g., permeation enhancers) that assist the active ingredient(s) in breaching the skin barrier. Hence, to design effective enhancers, it becomes necessary to understand various aspects of these chemicals' interaction with the SC.
At Tata Consultancy Services (TCS) Research, we have developed an in-silico model of human skin based on a multi-scale modelling framework . This framework can simulate the interaction of molecules of interest with human skin, thereby providing insights to various crucial mechanisms at the molecular level and predicting properties of interest to aid in screening/designing these molecules.
For instance, in the face cream variant shown in Figure 13, isopropyl myristate is the second most required ingredient and is supposed to function as an emollient. However, studies show that it also exhibits permeation enhancer properties [65, 66].
Recently, Gupta et al.  have performed detailed coarse-grained (CG) molecular dynamics (MD) simulations to calculate partition and diffusion coefficients of permeation enhancers in the SC lipids. The study involved enhancers from different families viz fatty acids, esters, and alcohol at different concentrations 1%, 3%, and 5% w/v.
Figure 12 summarizes their study's findings in terms of key observations for isopropyl palmitate (ISP). They also provide the enhancement ratio and partition coefficient (log P) obtained from simulations as property values used as selection criteria for enhancers. They also mention other observations that give more specific details about the enhancers' interaction mechanisms with the SC lipid constituents.
Apart from such specific observations, they also state a very generic and key observation that “small hydrophobic molecules partition well into the skin lipid layer and do not agglomerate. On the other hand, bigger hydrophobic molecules partition well and disturb the lipid layer packing significantly, but they sometimes form small clusters and limit permeation by diffusion rate.”.
Since cosmetic products are mainly used for beautification (anti-aging, anti-wrinkle, etc.), they contain active ingredients that interact with the skin's mechanical properties (viscoelasticity, Young's modulus, etc.). To understand the effects of ingredients on the skin's mechanical behavior, Jayabal et al. have recently developed a 1D viscoelastic model applied to experimental data for obtaining viscosity and modulus of elasticity of the skin. The authors also demonstrate that the same model can be extended to predict skin behavior when a polymer layer is applied on top. Polymers are often used in cosmetic formulations as thickening agents, as emulsifying agents, for creating protective films or barriers, and so on.
Figure 13 shows that property values (enhancement ratio, log P, diffusion coefficient, Young's modulus, viscosity, etc.) obtained from these simulations should be added as additional information about ingredients in the knowledge graph. In contrast, the generic and ingredient-specific qualitative observations could be used as heuristics which can be presented to the formulation chemist at the time of ingredient selection or template filling, thus genuinely achieving the knowledge-guided design of formulated products.
4.3 Completing Integrated Design of an External Coating Variant with In-silico Testing
External coatings (applied on automotive, airplanes, bridges, houses, machinery, etc.), although primarily used for decorative purposes (enhancing appearance by imparting color and/or gloss to a surface), are also designed to act as a protective cover for the surface beneath.
There are three major environmental factors viz heat, moisture, and radiation apart from rain, snow, microbial attack, mechanical stress, etc., against which external coatings are designed to provide protection. External coatings are usually applied in various stages across multiple layers and in a particular order, with each layer differing in composition, thickness, and purpose.
These coating systems are intended to retain their functionalities for the entire service life of the object on which they are applied. Thus, it is a norm to evaluate these formulations against the environmental factors by conducting elaborate weathering tests to assess their long-term performance.
Natural weathering tests are carried out in specific locations where formulations are exposed to extreme environmental conditions and are monitored for as long as five years㉔. However, current product development lifecycles are too short to allow waiting times of the order of years to get results from an experiment. Hence, accelerated weathering tests are more sought after, wherein one can induce environmental factors at a controlled rate and study similar phenomena in a time frame of days rather than years.
However, scientists believe that accelerated tests may not capture the precise degradation mechanisms during natural weathering. Environmental factors tend to interact with the external coating systems (especially with the binder and pigments) and degrade them over time, thus deteriorating performance.
One possible way to address the above challenges, that of prolonged testing (natural weathering) and unreliable results (accelerated weathering), is to model the degradation phenomenon by first-principle simulations or statistical methods (MD, density functional theory (DFT), Monte Carlo (MC), kinetic Monte Carlo (KMC), etc.). Makki et al. demonstrated the capability of a multi-scale simulation approach to imitate the photo-degradation process for a model polyester urethane coating system .
To account for the fact that the degradation takes place across a substantial spectrum of time scales (chemical reactions in picoseconds, mechanical failure in years), they propose to couple an event-driven method (KMC) with a time-driven method (Dissipative Particle Dynamics (DPD)) to accurately capture the physics.
These simulations allow calculations of important thermo-mechanical properties, e.g., glass transition temperature, storage modulus, loss modulus, thermal expansion coefficient, and crosslinking density at different degradation times.
Hinderliter et al. developed an MC-based methodology to model erosion of coating surface due to photon flux and utilized surface topography and chemistry changes to predict macroscopic properties like gloss, fracture toughness, and wetting angle . An improved proximity-based molecular dynamics technique has been demonstrated for modelling crosslinking of thermoset polymers . The authors used their methodology to calculate important material properties like glass transition temperature, stiffness, strength (stress vs. strain curves), and Poisson's ratio and capture the effects of curing temperature and crosslinking degrees on these properties.
Crosslinking imparts structural integrity and helps create a barrier for transporting foreign species (chemicals, moisture, etc.). Unreacted moieties in crosslinked coating resins lead to inhomogeneities and phase separations in the crosslinked coating film. At TCS Research, we have developed a multi-scale modelling-based approach to predict hyperelastic properties of elastomers at bulk level . The approach involves obtaining microscopic properties by carrying out MD simulations of the crosslinked system and providing them as inputs to constitutive models for predicting the stress-strain response of the elastomer under different loading conditions.
Figure 14 shows that the properties and heuristics from ingredient interaction with the environmental factors can be added as additional information about ingredients in the knowledge graph and used as described above. For instance, the properties listed in the figure would help one understand different aspects of the coating's performance, e.g., with knowledge of glass transition temperature, one can comment on water uptake, barrier properties, etc. At the same time, the variation in storage and loss modulus allows understanding the viscoelastic behavior. With the in-silico polymer available, acquiring such knowledge becomes inexpensive compared to conducting lengthy experiments. It helps the formulation chemists make informed decisions at the generate step, thus accelerating product development.
4.4 Summarizing Integrated Formulation Design with Knowledge Graph
The current state of practice involves the manual generation of recipes, manually operated making of candidate formulations, and third-party testing (as is the prevalent practice in many formulated product companies). In contrast, we proposed two instantiations of the aided generate, robotic make, and in-silico test steps, integrated by treating the knowledge graph as the central artifact. The generate step aids the expert in creating formulation variants while considering test step observations. The robotic make step stores the validated recipe in the knowledge graph. The knowledge graph continues to grow with these additional pieces of information, which often go unaccounted for in the current state of practice.
With this arrangement, we believe that the three steps in formulation design remain mindful of other steps and observations made in them. As mentioned earlier in Section 1, we assume interfacing systems between the three steps and the knowledge graph. Depending on the realizations of automated make and in-silico test, the implementation details may vary for such interfacing. However, with the aided, automated, and in-silico versions, respectively, the generate, make, and test steps can individually refer to the observations as knowledge in the next round of experiments.
4.5 Addressing Concerns in Integrated Formulation Design
4.5.1 Expert Contribution in Formulation Optimization
As discussed in Sections 4.2 and 4.3, the in-silico tests described involve performing physics-based simulations by modelling the system according to the intended outcome. Computational chemists, who are experts in modelling and simulation, need to decide on various factors such as the appropriate technique (molecular dynamics, Monte Carlo, density functional theory) and corresponding simulation parameters (time step, potential). Additionally, the experts need to interpret the simulation results by postprocessing the output from these physics-based models targeted at formulation optimization.
4.5.2 Verification of Information in Knowledge Graph
We are aware that various information pieces added to the knowledge graph (described in Section 3 and Sections 4.2 and 4.3) need to be verified. Currently, we rely on the correctness of information in published handbooks (offline) and the correctness of information about ingredients in specialty sites (online). In ongoing and future work, we plan to work on knowledge graph fact-checking and verification methods.
4.5.3 Validation of Proposed Experiments
We believe that the current manual steps, when aided with the knowledge graph as described in Section 3 and this section, will lead to cost reduction. Since we currently have only a proposal for integrating the generate, make, and test steps, we have no experimental validation of cost reduction. This is part of our ongoing and future work.
Historical formulations and related data spread over offline and online resources present an opportunity to aid the expert in generating new formulations. We presented an approach to create and analyze a knowledge graph to provide recommendations for building a formulation recipe from scratch and using the template created to arrive at a viable formulation variant.
The knowledge graph we built can also be used as a connecting artifact between the generate, make, and test steps in formulation design. We show two instantiations of such an arrangement. In the first instantiation, the expert generates a face cream variant using the recommender system on top of the knowledge graph. The make step obtains the recipe and makes the formulation as described. The in-silico test step applies the digital skin model to test the requisite properties and feeds the observations back to the knowledge graph where the expert can refer to them as heuristics the next time. In the second instantiation, the expert starts with a template and follows the rest of the steps. In both examples, automated labs and in-silico models replace the traditional manually operated make and test steps. Our approach enables aided generate step with automated/robotic make and in-silico test in formulation design. We believe that such examples pave the way to an aided formulation design, where observations of each step can supplement and aid the other steps, esp. in iterated design steps.
S. Sunkle (firstname.lastname@example.org), D. Jain (email@example.com), B. Rai (firstname.lastname@example.org), and V. Kulkarni (email@example.com) conceptualized and ideated the integrated formulation design. S. Sunkle and D. Jain wrote and reviewed the whole paper. K. Saxena (firstname.lastname@example.org), A. Patil (email@example.com), T. Singh (firstname.lastname@example.org), and S. Sunkle developed the knowledge graph, the recommender, and template generation systems.
EU Formulation Network: Formulated Products at https://formulation-network.eu/about/objectives
Formulation Engine: Materials Innovation Factory https://bit.ly/37RiD3r
EU Regulation for Testing of Cosmetic Products https://bit.ly/3mPGggP
Product Testing Instruments https://www.anton-paar.com/in-en/products/
American Society for Testing and Materials https://www.astm.org/
International Organization for Standardization https://www.iso.org/standards.html
The TCS Digital Skin Twin https://www.tcs.com/tcs-digital-skin-twin-platform
Spacy Sentence Boundary Detection https://spacy.io/usage/spacy-101
AllenNLP Open IE https://demo.allennlp.org/open-information-extraction
Spacy Dependency Parser https://spacy.io/usage/linguistic-features/#dependency-parse
Extensible Library for Opening URLs https://docs.python.org/3/library/urllib.request.html
Scrapping Information from Web Pages https://pypi.org/project/beautifulsoup4/
Neo4j Graph Database https://neo4j.com/
Cypher Query Language https://neo4j.com/developer/cypher/
SPARQL Query Language for RDF https://www.w3.org/TR/rdf-sparql-query/
E.g., cetyl alcohol entry at Wikipedia https://en.wikipedia.org/wiki/Cetyl_alcohol
Cosmetics Ingredient Functionalities: https://bit.ly/3rCXnGz
Available at MakingCosmetics https://bit.ly/3aPrODt
Evaluations of Coatings https://bit.ly/3ryZCKJ
DATA AVAILABILITY STATEMENT
The data sets generated during and/or analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request.