We evaluate and analyse a framework for evolutionary visual exploration (EVE) that guides users in exploring large search spaces. EVE uses an interactive evolutionary algorithm to steer the exploration of multidimensional data sets toward two-dimensional projections that are interesting to the analyst. Our method smoothly combines automatically calculated metrics and user input in order to propose pertinent views to the user. In this article, we revisit this framework and a prototype application that was developed as a demonstrator, and summarise our previous study with domain experts and its main findings. We then report on results from a new user study with a clearly predefined task, which examines how users leverage the system and how the system evolves to match their needs. While we previously showed that using EVE, domain experts were able to formulate interesting hypotheses and reach new insights when exploring freely, our new findings indicate that users, guided by the interactive evolutionary algorithm, are able to converge quickly to an interesting view of their data when a clear task is specified. We provide a detailed analysis of how users interact with an evolutionary algorithm and how the system responds to their exploration strategies and evaluation patterns. Our work aims at building a bridge between the domains of visual analytics and interactive evolution. The benefits are numerous, in particular for evaluating interactive evolutionary computation (IEC) techniques based on user study methodologies.
Information visualization transforms data into interactive visual representations to amplify human cognition (Card et al., 1999). It typically deals with abstract, nonspatial, and high-dimensional data (Chen, 2005). Visualizations can be described as explanatory when the aim is to communicate insights already known from the data, or as exploratory when the focus is on the dynamic discovery of insights hidden in the data. This process is relatively unpredictable because of the lack of a priori knowledge of what the user is looking for. The analyst’s role in this case is to organise, test, develop concepts, look for trends, and define hypotheses (Grinstein, 1996). When the search space is large, as is often the case for multidimensional data sets, the task of exploring and finding interesting patterns in the data becomes tedious. Dimension reduction techniques reduce the search space but often require specifying objective criteria for filtering views prior to user exploration. Other techniques (Brown et al., 2012; Endert et al., 2011) steer the exploration to interesting areas of the search space based on information learned during the exploration. This seems better suited to the free-form nature of exploratory visualization, but in many cases it requires an internal representation of the user or at the very least a mechanism for predicting what the user is interested in.
One way to infer information about users is to examine their interaction logs. Research in user modeling (Fischer, 2001) has shown that much information can be inferred about users from their interactions, such as their exploration strategies, personality characteristics, and cognitive traits (Brown et al., 2014). Research in the field of visual analytics, concerned with building interactive visual interfaces to facilitate analytical reasoning (Thomas and Cook, 2005), showed that feeding knowledge about the user back into the visualization pipeline can have many advantages. For example, it can improve the overall layout of the visualization, reduce the complexity of a model, and help to steer the exploration of large search spaces toward more pertinent views of the data (Endert et al., 2012; Brown et al., 2012; Boukhelifa et al., 2013).
In previous work, we introduced a framework for evolutionary visual exploration (EVE) that combines visual analytics with stochastic optimisation to aid the exploration of multidimensional data sets. Starting from a set of data dimensions, an interactive evolutionary algorithm (IEA) progressively evolves nontrivial viewpoints in the form of linear and nonlinear dimension combinations. These views are built using the classical evolutionary loop of selection of parent dimensions, and the recombination and mutation of these dimensions. The crossover and mutation operators support the exploratory process by introducing users to new views that may help them discover new relationships or interesting structures in their data. The criteria for evolving new dimensions are not known a priori and are partly specified by the user via an interactive interface. Our method leverages automatic tools to detect interesting visual features and human interpretation to derive meaning, validate the findings, and guide the exploration without having to grasp advanced statistical concepts.
The contributions of this article are threefold: (1) a summary of related work in the topic of guided visual exploration; (2) results from a new user study that examines in detail how users leverage the system and how the system evolves to match their needs when a clear task is specified; and (3) a detailed methodology for evaluating an EVE system that takes into account both the effectiveness of the underlying algorithm and users’ behaviour with regard to how they explored the search space and how they evaluated views. This methodology can be applied to other interactive evolutionary computation (IEC) systems.
2 Related Work
Seo and Shneiderman (2005) discuss two different approaches for exploring multidimensional data sets: axis-parallel and non-axis-parallel projections, and the trade-off between the simplicity of the former and the power of the latter. Axis-parallel projection methods use existing dimensions as axes of the projection plane, and thus produce familiar and comprehensible projections. Non-axis-parallel projection methods use linear or nonlinear combinations of two or more dimensions for the axes of the projection plane, and have the advantage of a larger projection space, which can potentially reveal structures not visible in the axis-parallel projection space. It is, however, harder to search for useful projections in such an extended space. Furthermore, combined dimensions are not always easy to interpret. In our article, we use both axis-parallel projections, to explore the space of original dimensions, and non-axis-parallel projections for a more extended search. We hypothesise that this may be useful to users who are, for instance, familiar with PCA (principal component analysis) type of analysis. Throughout the article we refer to non-axis-parallel projections as “combined dimensions.”
Related work is organised in three sections: (1) a brief overview of quality metrics used to describe specific properties of data projections, including metrics we use in this work as part of the automatic evaluation of scatterplots; (2) a general introduction to IEC and the role of visualization in this context; and (3) a brief review of work on guided visual exploration from different perspectives (interaction, optimisation, and dimensionality reduction).
2.1 Quality Metrics
Faced with the overwhelming possibilities of exploration paths in multidimensional visualization, researchers in the field designed quality metrics that automatically evaluate the various projections of the data, in the hope of focusing user search on the most promising views. In a comprehensive survey, Bertini et al. (2011) used the data flow model to classify quality metrics into three types: metrics that draw information from the data space (i.e., data dimensions and values), from the image space (i.e., the views and rendered images presented to the user), or from both. We add to this list metrics that operate at the user level taking into account both user task and perception.
Among metrics calculated at the data space are clustering and outliers. The rank-by-feature framework (Seo and Shneiderman, 2005), for instance, visualises an optimal set of features according to a user-selected quality metric such as correlation or uniformity. They use axis-parallel projections to produce 1D or 2D views and colour brightness to denote ranking scores. Among image-based metrics are scagnostics (Wilkinson and Wills, 2008), which describe measures of interest for pairs of dimensions based on their geometrical appearance on a scatterplot. A mixed metrics approach consists of combining information from the data and image space at the same time. Peng et al. (2004), for example, combine data features such as correlation information with view features such as axes adjacency to measure clutter as a result of reordering visualization axes (Bertini et al., 2011).
Among perception-based metrics, Albuquerque et al. (2011) attempted to find a quality measure for scatterplots where the goodness value assigned to the visualization is based on human observations from paired comparison studies in which users were asked to analyse clusters and separate labeled classes. Our approach differs in that the “goodness” of a view is defined both by an automated set of measures and user evaluation. We use nine scagnostics measures designed by Wilkinson and Wills (2008) as one of our quality metrics to quantify the amount of pattern that exists in a scatterplot. This is the aspect of our system that lends itself to the image-centric quality metrics approach. Additionally, we use a data metric corresponding to the complexity of a proposed dimension and a user metric relating to user satisfaction with a view (see Section 3.2).
Scagnostics are based on geometric graphs, and the measures are calculated from the areas, perimeters, and lengths of these graphs. They include nine measures to characterise scatterplots (Fig. 1) and are useful for quickly discovering regularities and anomalies in scatterplot matrices. The underlying algorithm detects different types of point distributions, including multivariate normal, log normal, multinomial, sparse, dense, convex, and clusters. It does so by binning, detecting outliers, and computing measures based on the following three statistical properties: shape for convex, skinny, and stringy distributions; trend for monotonic distributions; and density for skewed, clumpy, outlying, sparse, and striated distributions. These measures have proven statistical properties and are computable for moderately large data sets (Wilkinson and Wills, 2008).
2.2 IEC and Visualization
Interactive evolutionary computation (IEC) corresponds to evolutionary computational models where humans, via suitable user interfaces, play an active role, implicitly or explicitly, in evaluating the outputs evolved by the evolutionary computation. Applications of IEC are varied, ranging from art to science (Lutton, 2006; Fukumoto et al., 2010; Takagi, 1998). IEC lends itself very well to art applications, such as melody or graphic art generation, where creativity is essential, because of the subjective nature of the fitness evaluation function. For scientific and engineering applications, IEC is interesting when the exact form of a more generalised fitness function is not known or is difficult to compute, for example, for producing a visual pattern that would interest a particular user. Here, the human visual system, together with the emotional and psychological responses of the user, can outperform a pattern detection or learning algorithm.
Visualization has been used in IEC both as representation and exploration tools to help users better evaluate the output of interactive evolutionary algorithms (Hayashida and Takagi, 2000; Llorà et al., 2006). The relation between the visual part and the algorithmic component of the IEC can be characterised using the three-level integration framework of Turkay et al. (2014). In the first level, visualization is mainly a presentation tool (e.g., statistical analysis software such as R); the second level refers to semi-interactive methods (Johansson and Johansson, 2009; Perer and Shneiderman, 2009) where user interactions are typically limited to parameter tuning or altering the “data domain,” for instance, via data filtering and aggregation; and the third level refers to tight integration where the coupling is achieved in a seamless and flexible way (Nam and Mueller, 2013; Ingram et al., 2010).
The tight coupling of the visual and algorithmic parts of an IEC is difficult to achieve despite efforts to design good user interfaces, as human interaction with these systems usually raises several issues, mainly linked to the user bottleneck problem (Poli and Cagnoni, 1997), human fatigue and slowness. Various solutions have been considered (Poli and Cagnoni, 1997; Takagi, 1998; Banzhaf, 1997), such as reducing the population size (micro-EAs), constraining the search space to focus on a priori interesting areas, and deploying approximated user models (also called surrogate functions) to filter obvious bad solutions (Lutton et al., 2005) and only present to the user the most interesting individuals of the population. This model can be learned from past interactions (Lutton et al., 2005). Among the various strategies to address this issue, we choose to use a small population size of suggested dimensions, and to deploy an approximated user model (the surrogate function) based on a series of geometric measures modeled by the scagnostics distributions (Wilkinson and Wills, 2008).
2.3 Guided Visual Search
Chen and Hagen (2010) argue that interaction alone (e.g., zooming and detail on demand) cannot address the challenges posed by complex visualizations, such as the management of large search spaces. Equally, computational tools cannot address these challenges alone, as problem solving remains a human activity. There is a growing body of work in the visualization domain and related disciplines to address these issues, combining knowledge from the visualization process itself (e.g., user preferences and chosen visualization parameters) with computational analysis. Typically, information, other than the data being visualised, is fed to the visualization pipeline. This information can be topological, statistical, geometrical, semantic, or other forms of data captured from interactive and learning algorithms (Chen and Hagen, 2010).
The idea of taking user interactions into account and learning from them is not new, although getting rich information that benefits the user in an exploratory context is challenging. Endert et al. (2011) talk about semantic interaction, where the aim is to get meaning from interactions and use this meaning to close the visual analytics sense-making loop. Their Visual to Parametric Interaction technique (V2PI) describes a framework where the sense-making visualization pipeline becomes bi-directional and users are embedded in the pipeline: users learn from visualizations and the visualizations adjust to expert judgment (Leman et al., 2013).
Work that focuses more on the computational part includes parameter space exploration and optimisation, and is directly related to ours. Matković (2008; 2011), for instance, tried to interactively find an optimal combination of input parameters for a complex diesel engine injection system using visual analysis techniques. In visual parameter space analysis, a typical research goal is the optimisation of the output of model parameters by identifying reasonable input parameter settings while keeping the human in the loop (Sedlmair et al., 2014).
Behrisch et al. (2014) proposed a feedback-driven framework for user exploration of large multidimensional data, which combines both automatically calculated metrics and user feedback. Their notion of “pertinence” is defined by visual relevance and is learned interactively using machine learning techniques (naive Bayes classifier). Their approach is focused on the smooth integration of computational and interactive analysis. Similarly, Stolper et al. (2014) proposed a progressive framework where analytical algorithms are continuously adapted to produce partial results to support what they call “progressive insight.”
Work on interactive dimensionality reduction relates to ours. For example, Johansson and Johansson (2009) and Fernstad et al. (2013) combine user-defined and automatically calculated metrics (namely, for correlation, outliers, and clusters) in order to filter data dimensions. However, our approach is more interactive in that the user does not have to explicitly specify the weights of the quality metrics, but these are learned from and during the exploration. In this sense, the exploration of data and the underlying model are not separate steps in our case. In the same spirit, Brown et al. (2012) implemented the dis-function tool, where multidimensional data are projected onto a 2D scatterplot using multidimensional scaling (MDS), and the users can then move incorrectly positioned points to other similar points in order to reflect their understanding of the data. A feature weight optimization is then applied to calculate a new distance function that is used to reproject the data (by applying PCA to a pairwise distance matrix). These three examples are similar to our work, in that they try to overcome the usability issue mentioned by Turkay et al. (2014) and present in most of the intelligent visual analysis tools, where significant statistical literacy and understanding of the underlying computational methods are required.
When it comes to combining IEC with visual exploration, Mouradian et al. (2012) use a genetic algorithm to create projections of multidimensional data. However, unlike our work, their method focuses on generating linear projections only and tries to preserve as much as possible a predefined data quality metric. More similar to our method, Malinchik and Bonabeau (2004) use an IEC to perform exploratory data analysis, combining computational search with human evaluation. They show how IEC can be used to evolve two-dimensional linear projections that bring insight about the data. Similar to dis-function (Brown et al., 2012), they used the IEC to evolve a distance function in attribute space in order to produce the most compelling or interesting clusters to the viewer using a parametric clustering algorithm. However, they also focused only on linear combined-dimension projections and did not conduct a formal user study to evaluate their work.
3 Evolutionary Visual Exploration
Our proposed framework (Cancino et al., 2013; Boukhelifa et al., 2013) combines visual analytics with stochastic optimisation to aid the exploration of multidimensional data sets characterised by a large number of possible views or projections. Starting from dimensions whose values are automatically calculated by a PCA, an interactive evolutionary algorithm progressively builds (or evolves) nontrivial viewpoints in the form of linear and nonlinear dimension combinations, to help users discover new interesting views and relations in their data. The criteria for evolving new dimensions are not known a priori and are partially specified by the user via an interactive interface. Pertinence of views is modeled using a fitness function that plays the role of a predictor: (1) users select views with meaningful or interesting visual patterns and provide a satisfaction score; and (2) the system calibrates the fitness function—optimised by the evolutionary algorithm—to incorporate user input and then calculates new views.
In order to validate our method, we embedded a genetic engine into an existing scatterplot matrix visualization system (Elmqvist et al., 2008) that manages the various projections of the data. The interface of the resulting prototype, EvoGraphDice, is described in Section 3.1, and the genetic engine and search space as implemented in EvoGraphDice are examined in Section 3.2. Section 3.3 discusses issues related to diversity management.
EvoGraphDice uses GraphDice (Elmqvist et al., 2008; Bezerianos et al., 2010) to manage the various projections of the data. Views are organised in a scatterplot matrix (SPLOM) of 2D projections (Fig. 2a). Users can select a view from the SPLOM (highlighted cells have a green background), which is then displayed in the main plot view (Fig. 2b). They can also perform brushing and linking using a lasso tool to select data points, such that the selection and highlighting of data points in one view is reflected in all other views. EvoGraphDice displays the dimensions proposed by the IEA as additional rows (and columns) in the SPLOM, ranked by their fitness evaluation value such that the higher the system evaluation, the higher the y-position of the proposed dimensions in the matrix. The system initially displays dimensions returned by a PCA, after which the user can evolve new dimensions by pressing the “evolve” button (Fig. 2d).
The proposed views are displayed in a yellow background to differentiate them from other cells where the intensity of the colour is proportional to an a priori assessment of user interest. Thus, the darker the colour, the more pertinent the view. The system provides an initial score (1 to 5) for each new view, but the user can adapt this score using a slider (Fig. 2d). User-evaluated cells are flagged using a small black square to distinguish them from system-evaluated cells. EvoGraphDice can be initialised at any time using the “restart” button, which resets parameters of the IEA. Users can save views (Fig. 2f) and bring them back into the SPLOM if they have been replaced during the exploration. The current population is also displayed as a table (Fig. 2h), where each row corresponds to a combined dimension described by a mathematical expression and various components of the fitness function such as the scagnostics measures. The user can limit the dimension search space using a widget (Fig. 2i), which results in a system reset similar to pressing the “restart” button. They can also edit an individual using a “dimension editor” (Fig. 2j).
3.2 Search Space and Genetic Engine
EvoGraphDice relies on a genetic programming engine to evolve a population of combined dimensions. We describe the main components of this genetic engine, which has classical features. What is original, and complex, is the user interaction, which is indirect: the objects the genetic engine evolves are not directly evaluated by the user, but through the views they generate in conjunction with the other dimensions. Strictly speaking, the algorithm evolves 2D projections in an implicit manner. This has an impact on the learning algorithm used for building the surrogate function.
3.2.1 Search Space
The space searched by the genetic engine is the set of all dimensions that can be built by combining the initial dimensions xi with operators and constants, encoded as trees according to the classical genetic programming (GP) framework (Koza, 1992). These combinations can be complex mathematical expressions containing quadratic, exponential, or logarithmic terms (the evolved expressions can be any combination of operators such as + and −, real constants, and initial dimensions).
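As an illustration of this tree encoding, the following minimal sketch represents a combined dimension as a nested tuple and evaluates it on one data row. The operator set, the protected logarithm, the clamped exponential, and the names `evaluate` and `random_tree` are assumptions made for the example; they are not taken from EvoGraphDice.

```python
import math
import random

# Hypothetical operator set, chosen only for illustration.
BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
UNARY = {
    "exp": lambda a: math.exp(min(a, 50.0)),   # clamp the argument to avoid overflow
    "log": lambda a: math.log(abs(a) + 1e-9),  # protected logarithm
}


def evaluate(tree, row):
    """Evaluate an expression tree on one data row (values of the initial dimensions)."""
    kind = tree[0]
    if kind == "x":       # ("x", i) is the initial dimension x_i
        return row[tree[1]]
    if kind == "const":   # ("const", c) is a real constant
        return tree[1]
    if kind in UNARY:
        return UNARY[kind](evaluate(tree[1], row))
    return BINARY[kind](evaluate(tree[1], row), evaluate(tree[2], row))


def random_tree(n_dims, depth, rng):
    """Grow a random expression tree with at most `depth` operator levels."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_dims))
        return ("const", round(rng.uniform(-1.0, 1.0), 2))
    if rng.random() < 0.25:
        return (rng.choice(sorted(UNARY)), random_tree(n_dims, depth - 1, rng))
    op = rng.choice(sorted(BINARY))
    return (op, random_tree(n_dims, depth - 1, rng), random_tree(n_dims, depth - 1, rng))


# y = 0.5 * x0 + x1 * x1: a combined dimension with a quadratic term
y = ("+", ("*", ("const", 0.5), ("x", 0)), ("*", ("x", 1), ("x", 1)))
value = evaluate(y, [2.0, 3.0])
```

Applying `evaluate` to every row of the data set yields the values of the combined dimension, which can then be plotted against any initial dimension.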
3.2.2 Genetic Engine
We have chosen to evolve a small set of combined dimensions, yi, in order to allow the user to examine all individuals of the population at a glance over the SPLOM: if n is the number of initial dimensions, a population of another n combined dimensions is evolved. At each iteration, that is, each time the user clicks on the “evolve” button, a new generation is produced by applying the selection/crossover/mutation operators and presented to the user, whose judgment (evaluation) is explicitly collected via a slider.
A set of a priori interesting dimensions has been chosen as a starting point. A PCA analysis (Smith, 2002) is performed on the original data set, and the corresponding n linear combinations form the initial population.
3.2.4 The Fitness Function
Evaluating the fitness of the suggested views requires taking into account user interactions and internal metrics. The user interaction criterion tries to adapt user preferences in the fitness function, while the internal metrics evaluate the relation between variables. The fitness function to be optimised by the genetic engine is a sum of three terms:
- A surrogate function fsc that plays the role of a predictor and helps the system to better adapt to user needs. It is based on scagnostics measurements computed for every cell of each combined dimension yi (the xj being the initial dimensions), and the corresponding fitness term is a linear combination, with weights wk, of the highest values of the scagnostics (SCk) of each scatterplot cell (yi, xj) (these cells correspond to the yellow cells in Figure 2, that is, the views of the combined dimensions yi with respect to the initial dimension xj):
  $$f_{sc}(y_i) = \sum_{k=1}^{9} w_k \, \max_{j} SC_k(y_i, x_j) \qquad (1)$$
  Then, as soon as n (the number of original dimensions) interactions are recorded, the weights wk are updated via a simple multilinear regression on the m past user interactions (m corresponds to the length of the memory of the system).
- A user evaluation term that is an average of the user evaluation for each cell corresponding to yi (range of 1 to 5 from “bad” to “excellent”).
- A complexity term, a data metric corresponding to the complexity of the proposed dimension yi (see Section 2.1).
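The surrogate term and its weight update can be sketched as below. The sketch assumes that each user interaction is stored as a scagnostics vector paired with the user score for that cell, that the regression has no intercept, and that a tiny ridge term is added to keep the normal equations solvable; these choices and the names `fit_weights` and `surrogate` are illustrative assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_weights(features, scores, memory, ridge=1e-6):
    """Least-squares weights from the last `memory` interactions (assumed setup)."""
    features, scores = features[-memory:], scores[-memory:]
    k = len(features[0])
    xtx = [[sum(f[a] * f[b] for f in features) + (ridge if a == b else 0.0)
            for b in range(k)] for a in range(k)]
    xty = [sum(f[a] * s for f, s in zip(features, scores)) for a in range(k)]
    return solve(xtx, xty)


def surrogate(cell_scagnostics, weights):
    """Weighted sum of the per-measure maxima over the cells of one combined dimension."""
    best = [max(cell[k] for cell in cell_scagnostics) for k in range(len(weights))]
    return sum(w * b for w, b in zip(weights, best))


# Two measures instead of nine, to keep the toy example readable.
past_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
past_scores = [2.0, 1.0, 3.0, 1.2]          # consistent with weights (2, 1)
weights = fit_weights(past_features, past_scores, memory=4)
f_sc = surrogate([[0.2, 0.9], [0.7, 0.1]], [2.0, 1.0])
```

The `memory` argument mirrors the role of m: only the most recent interactions influence the weights, so the predictor can follow a user whose interests shift during the session.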
3.3 Diversity Management
The evolutionary mechanisms naturally tend to concentrate the population around good solutions. For small population sizes particularly, there is a risk of premature convergence if no diversity preservation is performed. We choose to use a very simple mechanism, similar to the crowding factor scheme (Mengshoel and Goldberg, 2008): each time a new dimension yi is generated by the stochastic operators (mutation or crossover), its distance to each individual of the current population is computed. If yi is too close to one of these individuals, it is replaced by a random individual. The distance is a Euclidean distance on the scagnostics vectors, precisely the ones used for the computation of the surrogate function of Equation (1). In other words, this is a phenotypic distance. The distance threshold that governs this mechanism can be tuned by the user. This allows a full range of exploration/exploitation compromises: if the distance threshold is large, we get a quasi-random search, while if it is 0, we get a genetic engine with no diversity management.
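The replacement rule can be sketched in a few lines. The function names and the `phenotype` and `make_random` callbacks are hypothetical; the strict comparison is chosen so that a threshold of 0 never triggers a replacement, matching the limiting case described above.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def accept_or_replace(candidate, population, threshold, phenotype, make_random):
    """Keep `candidate` unless its phenotype lies within `threshold` of an existing individual.

    `phenotype` maps an individual to its scagnostics vector and `make_random`
    returns a fresh random individual; both are supplied by the caller.
    """
    p = phenotype(candidate)
    if any(euclidean(p, phenotype(ind)) < threshold for ind in population):
        return make_random()
    return candidate


# Toy usage: individuals are identified with their (two-measure) phenotype vectors.
population = [[0.10, 0.20], [0.80, 0.50]]
candidate = [0.10, 0.25]
fresh = [0.90, 0.90]
kept = accept_or_replace(candidate, population, 0.0, lambda ind: ind, lambda: fresh)
swapped = accept_or_replace(candidate, population, 0.1, lambda ind: ind, lambda: fresh)
```

With a very large threshold every offspring is replaced by a random individual, which gives the quasi-random search mentioned in the text.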
EvoGraphDice allows users to fully configure the relevant EA parameters, as shown in Figure 2i. These parameters are crossover and mutation rates to set the probabilities for the GP crossover and mutation operators (default for both); replacement rate refers to the proportion of individuals to be replaced at each EA iteration (default ); minimal distance corresponds to the crowding factor for diversity management (default ); fitness criteria weights to tune the weights of each fitness component criterion (by default user evaluation has weight of 2 and complexity 1); search variable and dimension space for restricting the search to a subset of variables specified by the user (by default all dimensions are selected); data subsets, used to restrict the set of points of the scatterplot on which the fitness is calculated. The search can thus be performed only on a subset of the data corresponding to a selection query made by the user.
4 Case Studies with Expert Users
This section summarises a user study with domain experts that we conducted in previous work (Boukhelifa et al., 2013). We wanted to evaluate the usability and utility of EVE by trying to answer these three questions: (1) Is our tool understandable and can it be learned? (2) Are experts able to confirm known insight in their data? (3) Are experts able to evolve views that contain new insight or allow them to generate a new hypothesis? For this study, we did not analyse the behaviour of the IEA nor compare user evaluation strategies, since our study subjects worked on their own data sets, which were different in type, number of dimensions, and research questions, making a between-subjects comparison infeasible. Instead, we chose a qualitative observational study methodology (Carpendale, 2008; Meyer et al., 2012; Sedlmair et al., 2012) that better suited our evaluation needs.
4.1 Participants, Tasks, and Data
We evaluated our prototype with five domain expert users (two female), ages 27–42 years (mean 34.2 years). Experts were academics and practitioners who had multidimensional data sets related to their domain of expertise (scientific simulation, medicine, and geography) and were interested in further exploration. They consisted of one graduate student, three senior researchers, and one medical surgeon. Participants had previously explored their data sets using graphical tools (e.g., Excel and JMP) or used statistical methods (PCA and regression analysis) but felt there was more to discover in their data than was identifiable by their current tools. Experience with advanced multidimensional visualization tools varied from none to prior use of GraphDice or other SPLOM-based tools (two experts). None of the participants had previously used dimension combination to analyse their data, but three had performed PCA-type analysis. Each session lasted on average hours.
Participants were asked to carry out two main tasks: (T1) Show using the tool what they already know about their data, hypotheses, and questions they wanted answered; and (T2) explore their data in light of these hypotheses and research questions. The main study ran in two parts: a training part similar to the game task described in Section 5.1, and then an open exploration part where participants loaded their own data sets and explored freely. At the end, participants filled in a short questionnaire rating aspects of the tool (on a 5-point Likert scale), such as the ease of performing the two main tasks, open-ended questions regarding their exploration strategy, and helpful features of the tool. Log data of user interactions were gathered for further analysis (see Table 1).
| Expert | T1 | T2 | Q | Data | Size | D | Limit-Search | Evolve | Eval | OVisits | NVisits | Insight |
4.2 Summary and Discussion of the Results
All participants in this study were able to easily confirm prior knowledge about their data except for one expert who found this task challenging because of the lack of data aggregation that her type of analysis requires. Overall, participants confirmed known correlations, clusters, or outliers in their data. In the remainder of this section, we summarise our study findings, highlighting new insight, successful tasks, and exploration strategies. We end this section with a brief discussion on the limitations of our framework from a user evaluation point of view.
4.2.1 Insight Generation and Tasks
If we include hypothesis formation as part of insight generation, similar to Saraiya et al. (2005), EvoGraphDice helped our participants generate new insight in the form of distinct observations about the data (four experts), new hypotheses (one expert), and better formulation of research questions (four experts). Distinct observations found by the experts were clustering, linear, or nonlinear relations, and similarly to generated hypotheses, they always linked a dimension in the original data set and a new proposed dimension. The subjective evaluation of ease of task T2 (Table 1) shows most experts found it easy to reach new insight: 1 “very easy,” 3 “easy,” and 1 “not easy.” Unsurprisingly, those who reached a concrete new finding scored the tool highly in comparison to those who did not.
The found solutions were regarded by the experts as interesting because they had one or more of the following properties: (1) a visual pattern such as those modeled by the scagnostics measures; (2) a simple formula involving few dimensions; (3) a selective choice of dimensions corresponding to an unformulated hypothesis or an inherent aspect of their data model; and (4) a domain value. Regarding the latter point, not all participants were able to state the immediate domain value, but in general they stated that EvoGraphDice helped them (a) interact visually with their data (expert 3), (b) try out alternative scenarios by editing dimensions (experts 1, 2), (c) think laterally (expert 2), (d) quantify a qualitative hypothesis (expert 1), and (e) formulate a new hypothesis or refine an existing one (experts 1–4).
4.2.2 Exploration Strategies
Overall, participants followed the same exploration pattern consisting of first examining the original dimensions, then inspecting and evaluating the first generation of the proposed dimensions (returned by the PCA), followed by one or more iterations of the following steps: (1) limit the search space, (2) select and rank cells, (3) evolve, and (4) interpret and verify. However, the frequency of using some tools (e.g. “evolve” versus “limit the search space”) varied depending on whether the expert had an a priori focused hypothesis (i.e., a research question involving typically 3–4 dimensions). We observed that the looser the initial hypothesis, the more often they tried to change the search space; and the more focused the hypothesis, the more generations they inspected. Indeed, these two strategies of exploration and exploitation are supported by EAs (Banzhaf, 1997), where on the one hand users want to explore new regions of the search space, and on the other hand they also want to look into solutions (combined dimensions) close to one region of the search space. However, our study also highlighted limitations to our approach.
4.2.3 Issues and Limitations
There are issues to using our framework, mainly related to the types of data sets we are able to visualise, the understandability of generated combined dimensions, and issues related to algorithm convergence. Most relevant to this article are the convergence issues (for a detailed discussion on the first two issues, see Boukhelifa et al., 2013). As we are dealing with a small population of dimensions evolved during only a few generations, the algorithm cannot be considered as having converged in the classical sense. Theoretical analysis considers two main mechanisms that govern the behaviour of EAs: focus (convergence or exploitation) and diversity (random search or exploration). In their most classical uses, that is, computationally expensive optimisation, the exploitation mechanism is privileged and the exploration component is used only to ensure the robustness of the results. In the interactive framework, where creativity or new feature discovery are sought, the same mechanisms operate but with a different balance: exploration capability seems to have a bigger impact. Additionally, talking about convergence for EVE systems is even more difficult, as usually the users themselves do not clearly know what they are searching for. We have noticed from this study that the guided exploration ability of EvoGraphDice is exploited in different ways by experts: some explored, thus it can be said they focused on the random search ability of the algorithm; while others exploited with longer runs ( generations), thus focused on guided search (convergence). In both cases, the IEA provides a unified framework for users who sometimes are interested in focused search, for example, if they know what they want, or in explorative suggestions if they do not have a priori precise hypotheses. 
The question of convergence and issues related to the search and exploitation mechanisms of EvoGraphDice are further analysed in a more controlled study where users had a precise and well-defined task to perform.
5 Experimental Analysis of User and Algorithm Behaviour
We conducted a study to collect data about user interactions and the fitness function. In particular, we wanted to understand user strategies in solving an exploratory task, and the IEA’s convergence focusing on the learning behaviour of the algorithm across generations and its ability to adapt to user focus.
5.1 Participants and Tasks
We ran our study with 12 participants (5 female), ages years (mean years). Participants were researchers from two different institutions who had limited experience with SPLOM-based visualizations (only three had previous experience with SPLOMs).
The task was designed as a game. A 5D data set was synthesised with two enclosed curvilinear dependencies between two variables (x0 and x1) and random data for the rest of the dimensions. Participants were asked to evolve a scatterplot where it is possible to separate the two curves in Figure 3 (left) with a straight line (equivalent to separating the two corresponding convex hulls). They were given about minutes to complete the task. Ten participants successfully separated the two curves, while the remaining participants evolved views very close to a correct solution within the allocated time.
We generated two levels of difficulty for the game, where difficulty relates to the amount of enclosure between the curves—the bigger the overlap area between the curves, the more difficult it is to find a solution. Participants started with level 0 and depending on their performance, they also did a level 1 session. When participants could not reach a solution in minutes, we prompted them to restrict the search space to dimensions x0 and x1 to help them find a solution more quickly. Participants stopped the game when they found a solution, or when they felt the tool was no longer proposing interesting views (for struggling participants, a minimum exploration time of minutes was always respected). With the exception of two users, all participants successfully evolved a view separating the two convex hulls for level 0 (average time to find a solution was minutes), but only 6 out of the participants who tried level 1 managed to find a solution (average time minutes), as this level proved to be too difficult.
5.2 Data Collection
Log data gathered and analysed (Table 2) include three types of information related to (1) user interactions with the tool, such as cell selections and evaluations via the slider; (2) genetic engine status at each generation, such as details about the individuals in each generation, including their fitness components and scagnostics scores, and the cells these individuals participated in; and (3) the overall learned scagnostics weights. A total of log files were collected and stored in a database. However, in the following section we only analyse data corresponding to level 0, as it was the session performed by all participants. In total, we analysed 12 game sessions (one per participant).
| Scope | Logged data |
|---|---|
| For each generation | 9 scagnostics weights, wk |
| For each individual | Generation number |
| | Genome (math. formula) |
| | Surrogate function term |
| | Complexity term, fc |
| | Average user evaluation |
| For each evaluated cell | Generation number |
| | Predicted evaluation (= scagnostics weights x cell scagnostics) |
5.3 Data Analysis
We carried out four different types of analysis for this game scenario, two examining user behaviour and two the algorithm’s. In particular, we wanted to study whether people are attracted to important data variables for their task or particular visual patterns in the views, and whether they are distracted by the interface or system suggestions (analyses 1 and 2). We also wanted to assess in detail the exploitation and exploration capabilities of the system by investigating the system’s ability to take user evaluations into account and examining population diversity (analyses 3 and 4).
The types of analysis were as follows:
(1) User strategy analysis to understand the different approaches users took to solve the task.
(2) User focus analysis to highlight hot spots in the user interface and assess user evaluation strategies.
(3) Convergence analysis to assess the algorithm’s ability to steer the exploration toward a focused area of the search space.
(4) Diversity analysis to assess the richness and variability of solutions provided by the algorithm.
For comparison purposes, and since the number of generations per exploration session differed between users, we divided the generations into three bins, corresponding to the start, middle, and end of the game session. We tried to get the same bin size for all groups when possible, and at the very least ensure that the start and end bins always had the exact same size when integer division by 3 was not possible.
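As an illustration of this binning, the following Python sketch splits a session into start, middle, and end bins. The text does not specify how the remainder is distributed when the number of generations is not divisible by 3; the sketch assumes that the middle bin absorbs the imbalance, so that the start and end bins always have the same size. The function names are hypothetical and are not taken from EvoGraphDice.

```python
def split_into_bins(n_generations):
    """Return (start, middle, end) bin sizes for a session of n generations.

    Assumption: the start and end bins always have the same size, and the
    middle bin absorbs the imbalance when n is not divisible by 3.
    """
    if n_generations < 3:
        raise ValueError("need at least three generations to form three bins")
    base, remainder = divmod(n_generations, 3)
    if remainder == 0:
        return base, base, base
    if remainder == 1:
        # One extra generation: give it to the middle bin.
        return base, base + 1, base
    # Two extra generations: give one each to the start and end bins.
    return base + 1, base, base + 1


def assign_bins(n_generations):
    """Map each 0-based generation index to bin 1 (start), 2 (middle) or 3 (end)."""
    start, middle, _ = split_into_bins(n_generations)
    return [1 if g < start else 2 if g < start + middle else 3
            for g in range(n_generations)]
```

Under this assumption, a session of 19 generations is split into bins of 6, 7, and 6 generations, and a session of 20 generations into bins of 7, 6, and 7.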
5.3.1 Learning User Strategy through Scagnostics
Our observational study (see Section 4) revealed that users followed different strategies to solve the game task. We were interested in characterising these strategies, and comparing successful and unsuccessful game attempts. To achieve this, we looked at the scagnostics weights distribution along generation bins for all users that explored at least three generations. Figure 4 displays the types of scagnostics of various exploration sessions for level 0 of the game including sessions where the user restricted the search space (Figs. 4d, 4e, 4m, 4o). In such cases, and since the scagnostics weights are preserved from one subspace to another, we concatenate the generations of the consecutive search spaces. The success rate for this game level was 83% with an average exploration session of 19 generations and a standard deviation of 18 generations (for successful users). The high success rate implies that the algorithm is behaving correctly, that is, overall it is learning from user interactions and providing pertinent views to the user. However, the high variability in how quickly users found a solution may be influenced by the type of the searched visual pattern and the stability of the exploration strategy.
Type. We first looked at the three highest scagnostics weights for all sessions at any bin (Fig. 4). It appears that overall, skinny, convex, and sparse distributions are the dominant patterns of exploration (Fig. 4a). For the task of separating two curved lines, these scagnostics distributions correspond to the following strategies: skinny, trying to straighten the curves (33% of game sessions); convex, preserving the original shapes but the curves are slightly separated laterally (25% of game sessions); and sparse, trying to introduce holes in the view (41% of game sessions). These patterns can also be observed for overall success games (Fig. 4b) and to a slightly lesser extent for unsuccessful sessions (Fig. 4c), where the dominant scagnostics on average are skinny, sparse, and outlying. Since the types of dominant scagnostics are very similar regardless of outcome (although we acknowledge the lack of data for failed sessions for level 0), we hypothesise that the discriminating factor that determines convergence (and its speed) may be more related to the stability of the exploration strategy in relation to the start, middle, and end bins. As we discuss in the next section, users having a more stable strategy early in the exploration session seem to converge more quickly.
Stability. We then looked at the stability of exploration strategies, which we define as the user’s persistence in searching for the same visual pattern from one bin to the next. We identified four levels of persistence (from stability 0 to stability 3), where the level number refers to the average co-occurrence of top scagnostics types (i.e., the highest three) between consecutive bins. Thus stability 0 strategies have no co-occurrences of scagnostics types between bins 1 and 2, and between bins 2 and 3 (none for this study level); stability 1 strategies have at most one co-occurrence on average (session Fig. 4n); and so on for stability 2 (sessions Figs. 4d, 4e, 4h, 4k, 4o) and stability 3 (sessions Figs. 4f, 4g, 4i, 4j, 4l, 4m). When comparing successful and unsuccessful sessions, it seems that successful ones had more stable strategies (stability 2 or stability 3), whereas unsuccessful sessions had a more erratic behaviour (namely, Fig. 4n). The session shown in Figure 4o is an exception in that although a solution to the game was not found, the exploration strategy was stable enough (stability 2). It may well be that the solution was just around the corner (after 23 generations, the user gave up). Looking at only successful attempts, it appears that sessions having stability 2 converged to a solution after 24 generations on average, whereas sessions having stability 3 converged after 16 generations on average. Although we do not have many examples of sessions with low stability levels, session (n) in Figure 4 shows an unclear strategy having only stability 1, which might have led to nonconvergence and thus to a failed game outcome. Although variance across participants is large, it seems that on average the more stable the strategy, the sooner the convergence.
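The stability level defined above can be sketched in Python as follows. This is a minimal illustration under two assumptions that the text does not state: the scagnostics weights are available as one dictionary per bin (e.g., averaged over the generations of that bin), and a fractional average co-occurrence is rounded up to the next level, in line with stability 1 covering "at most one co-occurrence on average". All names are hypothetical.

```python
import math


def top_types(weights, k=3):
    """Names of the k scagnostics types with the highest weight in one bin."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:k])


def stability_level(bin_weights):
    """Stability level (0 to 3) of a session.

    bin_weights: list of dicts (start, middle, end), each mapping a
    scagnostics type name to its weight in that bin.
    Assumption: the average co-occurrence is rounded up to an integer level.
    """
    tops = [top_types(w) for w in bin_weights]
    # Number of shared top-3 types between each pair of consecutive bins.
    overlaps = [len(a & b) for a, b in zip(tops, tops[1:])]
    mean_overlap = sum(overlaps) / len(overlaps)
    return math.ceil(mean_overlap)
```

With three bins, the two overlaps are those between bins 1 and 2 and between bins 2 and 3; a session that keeps the same three dominant scagnostics throughout is assigned stability 3.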
Finally, we investigated whether users who limited the search space (sessions d, e, m, o; see Fig. 5) tended to have a less stable strategy than those who did not. It appears that on average sessions where the user did not limit the search space had a more consistent strategy, with stability 3 observed in five sessions versus only one such session in the limited search space condition. However, looking at the dominant scagnostics weights for each subspace, it appears that on average users who limited the search space were more likely to keep the same method of exploration between subspaces but not by much (five out of the nine subspace changes did not involve a change of target visual pattern described by the top three scagnostics types).
Summary. This analysis showed that EvoGraphDice allows for different types of exploration strategies centred around three dominant scagnostics (skinny, convex, and sparse) that appear to be relevant for the game task. We also found that the stability of the exploration strategy may be an important factor for determining the outcome of the exploration task and the speed of convergence, since successful game sessions had a more consistent strategy when compared to the unsuccessful ones, and they converged more quickly on average. Moreover, users who limit the search space are likely to keep their exploration strategy, implying that perhaps the most important reason for changing a search space is to focus on a specific set of data dimensions.
5.3.2 User Visitation and Evaluation
We analysed user focus of attention in terms of their cell visitation and evaluation patterns. The visitation patterns were examined to verify that participants focused more on cells with proposed dimensions that included the original target dimensions (x0 and x1, located in the first two columns of the SPLOM; see Fig. 2). Similarly, the evaluation patterns were examined to verify that participants ranked cells with the desired variables higher. Patterns to the contrary would indicate that participants’ attention was attracted elsewhere, which could indicate either that our system failed to provide interesting views, or maybe that there was an interface bias.
Visitation. We counted how many times each user visited a cell by selecting it in the yellow matrix quarters of proposed dimensions. We mapped this count to colour intensity; thus the more visited the cell, the more intense its background colour (Fig. 6). In the same cell, we also drew a line graph to show the number of visits across participants per generation for all sessions (Fig. 6(I)), for success (Fig. 6(II)), and for failure (Fig. 6(III)), where visit counts are normalised between the minimum and the maximum values per cell.
As expected, users had a greater tendency to visit the cells where the proposed dimension contained either of the original dimensions x0 or x1 (i.e., columns x0 and x1); 70% of all cell visits were in these two columns. We performed a statistical test to compare the percentage of visits on each column. As our data do not follow a normal distribution, we conducted a Kruskal-Wallis nonparametric test. The analysis revealed a significant effect of column on number of visits (, ). A post hoc pairwise comparison using a Mann-Whitney test showed indeed significantly more visits in cells of column x0 and x1 (mean percentage of visits and , respectively, i.e., more than half of all visits), compared to visits in all other columns (mean percentage of visits , , ). The fact that these two columns are placed in a prominent position in the matrix (i.e., the first columns on the left) may have encouraged participants to visit them. There was, however, no difference in visits between columns x0 and x1, or between the other columns (all ).
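For readers unfamiliar with the Kruskal-Wallis test used here, the following Python sketch computes its H statistic for several groups, such as the per-participant percentage of visits in each column. It is a generic textbook formulation with the standard tie correction, not the analysis script used in the study; the function names are hypothetical, and the p-value (obtained from a chi-square distribution with one degree of freedom fewer than the number of groups) is left out.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    groups: list of lists of observations, one list per group (column).
    Raises ZeroDivisionError if all observations are identical.
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: t is the size of each group of tied values.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - ties / (n ** 3 - n))
```

In practice a library routine would normally be preferred, for example `scipy.stats.kruskal` for the omnibus test and `scipy.stats.mannwhitneyu` for the post hoc pairwise comparisons.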
As we do not have enough data for failure attempts for level 0, we refrained from statistically comparing visitation patterns per outcome. Nevertheless, a visual comparison indicates that overall visitation patterns do not seem to differ for success or failure (focusing on average visitation per cell, i.e., colour saturation). However, we can see that for success there is a tendency to focus more on the first two columns (containing the original dimensions x0 and x1); whereas for failure visitation patterns are more scattered and include column x3 and sometimes x4. As users do not seem to find interesting patterns in the first two columns, their search seems to extend to other dimensions, despite the fact that these will most likely contain noise.
In the first two columns, participants also visited more cells that are highly placed in the bottom left matrix quarter (normalised average visitation values for pertinent cells, from highly placed to bottom placed, equal to 0.9, 0.8, 0.7, 0.7, and 0.6). This last point is not a surprising finding; rather it is likely that cells having these dependent variables also show an interesting pattern and are thus more likely to be visited. These cells are also highly ranked by the system, as indicated by their position in the matrix (the higher the row in the proposed dimension, the higher its fitness value, as described in Section 3.1). The declining slope of the graphs corresponds mainly to the different generation runs of game sessions; there are only a few sessions with more than generations, with most of the sessions having fewer than generations. We have also observed that with time participants based many of their evaluations on quick visual inspections of the SPLOM cells rather than explicitly selecting cells and viewing their contents in the main scatterplot view (effectively making their cell visits count decrease). Indeed, as participants gained confidence in using the tool, they tried to optimise their interactions with the system, presumably to avoid user fatigue.
Evaluation. We also looked at cell evaluations with a similar visualization to the cell visitation plots, but here we mapped the average user satisfaction score to background colour saturation rather than to average cell visits. Thus, more highly scored cells are more intense. The scatterplot in Figure 7(I) shows which user evaluation scores appeared over consecutive generations for each of these cells. These are distinct evaluation scores rather than averages.
We observed again that more cells are highly ranked in the first two columns than the rest (often ranked low). A Kruskal-Wallis nonparametric test revealed a significant effect of column on user evaluation score (, ). A post hoc pairwise comparison using a Mann-Whitney test showed that participants assigned significantly higher evaluation scores in cells of column x0 and x1 (means and out of 5, respectively) compared to the values given to other columns (means , and ). There was no difference in evaluation values between column x0 and x1, nor between the other columns (all ). As we do not have enough data for failure attempts for level 0 of the game, we refrained from statistically comparing evaluation patterns per outcome.
We then grouped user evaluations into three bins corresponding to the start, middle, and end of the exploration session, and we plotted the box plots for all participant scores across bins, for all sessions (Fig. 7(II)), for success (Fig. 7(III)), and for failure (Fig. 7(IV)). Our goal with this box plot visualization over time (bins) was to examine if variability across evaluations tended to stabilise, indicating that the system consistently proposed suggestions that were preferred by users. Looking at the spread of box plots, we can observe that overall most user evaluations for nonpertinent columns (i.e., that do not include x0 or x1) are usually ranked consistently low regardless of outcome, irrespective of bin. Conversely, irrespective of bin, the range of evaluations is larger for the pertinent cells (i.e., cells with proposed dimensions that include x0 or x1 in the first two columns), perhaps because of the large diversity of proposed solutions, and consequently diversity of evaluation strategies adopted by our participants.
Summary. This analysis shows that users are more likely to visit cells showing dimensions relevant to their task. Moreover, these cells are on average ranked highly by the user. Since for this game, the main dimensions relevant to the task appear on the top left side of the proposed cells, users intuitively started navigating that way. What we are seeing in the results is probably a mixture of task relevance and intuitive navigation, as the relevant original dimensions are placed in a prominent position in the matrix.
5.3.3 Algorithm Convergence
A different type of analysis centres on the algorithm’s convergence to a desirable solution or an interesting subspace. We examined the rate of concordance between user scores of evaluated cells and their predicted values, which are calculated from the current scagnostics weights learned at each generation and the scagnostics values of the corresponding cell (see Eq. (1)). Averages of actual user evaluations (1 to 5) throughout all the exploration stages (i.e., for bins 1, 2, and 3) have been plotted against the average system predictions in Figure 8 for all sessions (I), for successful (II), and for failed game sessions (III).
Overall (Fig. 8(I)) system predictions tend to consistently underevaluate, but in a way that follows the ordering of user scores. Thus an ascending pattern can be observed respecting the order of evaluation scores (so, for example, cells scored 5 by the user are also ranked the highest by the system on average, and so on). Although for individual cells user and system evaluations may differ (whether they preferred the proposed cell or not), on average the system prediction order tends to follow that of the user. A Pearson correlation test revealed a weak positive relation between user evaluation scores and predicted ones (, ).
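The comparison between user scores and system predictions can be sketched in Python as follows. The exact form of the prediction is given by Eq. (1), which is not reproduced in this section; the sketch assumes a plain weighted sum of a cell's nine scagnostics values, which is an illustrative assumption only. The grouping function mirrors the averaging per user score level, and the correlation is the standard Pearson coefficient. All function names are hypothetical.

```python
import math


def predicted_evaluation(weights, scagnostics):
    """Predicted score of a cell.

    Assumption: a weighted sum of the cell's scagnostics values, using the
    weights learned at the current generation.
    """
    return sum(w * s for w, s in zip(weights, scagnostics))


def mean_prediction_by_score(user_scores, predictions):
    """Average system prediction for each user score level (1 to 5)."""
    grouped = {}
    for score, pred in zip(user_scores, predictions):
        grouped.setdefault(score, []).append(pred)
    return {s: sum(p) / len(p) for s, p in sorted(grouped.items())}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

An ascending sequence of values in the output of `mean_prediction_by_score` corresponds to the pattern described above, in which the system's prediction order follows that of the user even when the absolute values are lower.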
Visual inspection of values for success (Fig. 8(II)) and failure (Fig. 8(III)) seems to indicate a tendency for the system to recognise plots as mostly good or bad (notice the difference of scores below 2 and over 3), where bad plots are given a score of 1 or 2. For successful attempts (Fig. 8(II)), an ascending pattern is also clearly observed, but the system has fairly similar predictions for cells evaluated by users in the middle scores (3, 4). This may be due to specific users being inconsistent in their search or ranking strategies in these middle-range scores. For failed attempts (Fig. 8(III)), an ascending pattern is present, although the system had more trouble with high scores (where the order is reversed between 4 and 5). This may be due to specific users being inconsistent in their scoring strategies, or lack of data points for this evaluation score. Nevertheless, even here the distinction between good and bad graphs is clear.
As mentioned, a large variety of behaviours can be observed regarding user scoring strategies (Fig. 9). Some participants have a coarse scoring strategy (e.g., Fig. 9(II), e), tending to lump evaluation scores into fewer levels; others provide more fine-tuned ratings covering the five scoring levels (e.g., Fig. 9(II), m) or a combination of the above at various stages of exploration. What can be observed is that for the successful game outcome, more participants adopted the fine-tuned evaluation strategy than the coarse one: sessions (d, g, j, k, l, m) for fine-tuned and (e, f, h, i) for coarse. Interestingly, those who adopted a fine-tuned evaluation strategy did not converge more quickly (in 26 generations on average versus only 9 for coarse).
We next looked at convergence at the generation level, with the predicted values averaged per generation to observe the progression of the algorithm predictions. We focused this exploration on bin 3, that is, the last part of the exploration for all users. We chose to look at bin 3 only because the number of generations across participants differs, and thus they each reach a consistent strategy at different generations. Focusing on bin 3, we can assume that users have a clear strategy at that stage and thus system predictions should clearly follow. These data are plotted as line graphs in Figure 10, where the x-axis corresponds to the generation number (from first generation to last generation in bin 3), and the y-axis refers to the average system prediction. We plot a system prediction line for each actual user evaluation score level (1–5). Note that successful game sessions reached 8 generations at most in the third bin, while unsuccessful ones reached 10.
The order of predicted levels is fairly consistent with that of the user evaluations for successful sessions (Fig. 10(I)), as the ordering of the predicted levels is similar to the user evaluation (e.g., predicted values for cells evaluated by users as 3/5 are higher than the predicted values for cells evaluated by users as 2/5). For failed sessions (Fig. 10(II)), this pattern is noisy after a few generations for most evaluation levels. For successful participant g (Fig. 10(III)), who uses all evaluation values across generations, we see clearly the system evaluating good and bad cells, whereas a failed participant n (Fig. 10(IV)) has a coarse scoring strategy with missing evaluation values, and system predictions fluctuate as it tries to follow the user’s changing strategy (see also Fig. 4n).
Summary. Our analysis shows that on average the surrogate function follows the order of user ranking of scatterplots fairly consistently, even though users seem to take different search strategies (see Section 5.3.1) as well as different evaluation strategies that are either coarse, tending to lump evaluation scores to fewer levels, more fine-tuned covering the five score levels, or a combination of both at different stages of the exploration. Our results seem to suggest a link between user evaluation strategy and outcome of exploration and speed of convergence, where users taking a more consistent approach (either fine-tuned or coarse) seem to converge more quickly.
A final type of analysis focused on the diversity mechanism of the EA, which ensures that each suggested individual is different enough from others in the current population (see Section 3.3). Since we particularly characterise scatterplots in terms of their visual appearances using the scagnostics distributions, we wanted to examine how diverse the proposed views were with regard to their dominant visual pattern and whether this difference can be observed more strongly at a particular stage of the exploration session (with regard to start, middle, or end bin).
We consider 13 diversity factors, consisting of the nine scagnostics distributions (we take values for each generation rather than scagnostics weights) and four factors related to the fitness function evaluation: the overall fitness value, the user evaluation, the complexity evaluation, and the scagnostics evaluation. We used two metrics to quantify the diversity with regard to each of these factors:
The mean difference MD, which measures the statistical dispersion of views by calculating the average absolute difference between each two individuals in the current population.
The Shannon index H (Shannon, 1948), an indirect diversity measure that describes the average degree of uncertainty of predicting the species of an individual picked at random from the community, factoring in both the abundance and evenness of the species in that community. As with the MD measure, a high H value implies a more diverse population.
We found that the Shannon index H was not discriminating enough for our small population data set (each generation having only five individuals); therefore the diversity analysis is based solely on the mean difference MD metric.
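The two metrics can be sketched in Python as follows. The mean difference is computed here over all unordered pairs of individuals in a generation (another common convention averages over all ordered pairs, which changes only the scaling). The Shannon index requires each individual to be assigned to a discrete "species"; how continuous factor values were discretised is not stated above, so the sketch simply takes category labels as input. Function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mean_difference(values):
    """Average absolute difference over all unordered pairs of individuals."""
    pairs = list(combinations(values, 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def shannon_index(labels):
    """Shannon index H = -sum(p_i * ln(p_i)) over the species proportions p_i."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

With only five individuals per generation, the species counts can be distributed in very few ways, so H can take only a handful of distinct values; this is consistent with the observation that the index was not discriminating enough for this data set. A generation in which all individuals share the same value for a factor has a mean difference of zero.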
To give an overview of the diversity information that we calculated, we created a heat map visualization starting from a matrix where columns represent the diversity factors and rows represent generation bins (Fig. 11). We concatenate all game sessions vertically, first successful sessions from (d) to (m), then unsuccessful ones (n) and (o), for level 0 of the game. We had in total 248 generations and 1,275 individuals. A dark horizontal line indicates the start of a new game session, and summary information (the mean) is shown across factors both horizontally and vertically, denoted by a primed factor name (factor’). The X_mean column refers to the cross-factor average values, and the corresponding X_mean’ is the average of all the X_mean values.
Since we are interested in comparing the diversity of generations in relation to the exploration stage, we report the average mean difference per generation bin, BMD, where each game session is split into start, middle, and end bins (Fig. 11). Therefore each cell corresponds to a generation bin and is coloured to reflect the amount of diversity in that generation using a linear scale, such that the more saturated the colour, the more diverse the population (i.e., the bigger the BMD). Black cells represent nondiverse generation bins where individuals have identical values for that diversity factor. This visualization is similar to the diversity map by Pham et al. (2010), where rows are attributes of a multidimensional data set and columns are attribute value buckets.
Overall, it appears that the most diverse fitness factor is scagnostics and the least diverse is user evaluation regardless of outcome. The former observation may be due to the highly variable nature of the scagnostics factor given that it comprises various components (the nine scagnostics measurements and weights that are being updated over time). Out of the nine scagnostics diversity factors, clumpy and monotonic show more variations than the other factors (see Fig. 11, bottom, primed factors). With regard to the low diversity in user evaluations, this might be attributed to the lumping effect (see Section 5.3.3) where some users tended to use fewer ranking scores.
Looking at the individual factors and focusing on the scagnostics distributions, it appears that these factors are diverse at different times of the exploration. On average, most factors are diverse at the start of the exploration (five out of nine scagnostics: clumpy for sessions (d, e, f, n, h, i, j, k, l, m, o), striated for sessions (d, e, f, n, i, j, k, l, m, o), stringy for sessions (d, f, n, i, j, k, l, m), skinny for sessions (d, e, f, h, i, j, k, l, m), and convex for sessions (d, e, f, n, i, j, k, l, m)); one scagnostics type is more diverse in the middle (monotonic for sessions (d, e, g, n, i, l, m)); and only three scagnostics are more diverse at the end of the exploration (outlying for sessions (d, g, n, h, i, k, m, o), skewed for sessions (e, g, n, i, j, k, m, o), and sparse for sessions (d, e, g, h, l, m, o)).
These findings may be difficult to interpret at this level without examining the different exploration strategies. In principle, if the user adopts a particular strategy, say favouring sparse distributions, the system should converge to provide more of these solutions, especially toward the end of exploration for successful sessions, and thus the BMD for this diversity factor should decrease. This can indeed be observed for six out of the ten successful game sessions (see Figs. 4 and 11), with sessions (i, j, k) having convex as the highest scagnostics for bin 3, session (m) with clumpy as the highest scagnostics for bin 3, and sessions (e, f) for skewed. These sessions all saw their BMD drop in bin 3 (at least from bin 2) for their respective dominant scagnostics (Fig. 11).
Summary. The diversity analysis shows that in terms of the visual pattern, the IEA provides more diverse solutions at the beginning of the exploration session (bin 1) before slowly converging to a more focused search space (in bin 2 or 3 for the success outcome) for most sessions. These effects correspond to the exploration component (random search) and the exploitation component (focus) of the genetic engine.
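The binned comparison summarised above can be illustrated with a minimal sketch. It assumes that a session is split into three contiguous bins of roughly equal length and uses the population standard deviation of one scagnostics measure as a stand-in for diversity. This is not the BMD measure used in our analysis, and the function names and sample values are hypothetical.

```python
from statistics import pstdev


def split_into_bins(values, n_bins=3):
    """Split per-generation values into n_bins contiguous bins of near-equal size."""
    size, extra = divmod(len(values), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if b < extra else 0)
        bins.append(values[start:end])
        start = end
    return bins


def spread_per_bin(values, n_bins=3):
    """Population standard deviation per bin (0.0 for an empty bin)."""
    return [pstdev(chunk) if chunk else 0.0
            for chunk in split_into_bins(values, n_bins)]


# Hypothetical values of one scagnostics measure over 12 generations:
# varied at first, then settling on a single value.
session = [0.1, 0.9, 0.2, 0.8, 0.5, 0.6, 0.55, 0.5, 0.5, 0.5, 0.5, 0.5]
spread = spread_per_bin(session)
```

In this made-up session the spread decreases from bin 1 to bin 3, which is the pattern expected when the search moves from exploration to exploitation.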
6 Discussion and Future Directions
To fully evaluate an IEA system, we feel a collection of analysis methods is needed, both user-centered, observing the utility and effectiveness of the system for the end user, and algorithm-centered, analysing the algorithmic behaviour of the system. To this end we previously conducted an observational study with experts analysing their own data using our system (Section 4), and a new controlled user study with synthetic data to analyse different aspects of the algorithmic behaviour and its use (Section 5).
In the observational study, our experts were able to verify known patterns in their data and to generate new insights using our tool. As discussed, because of the SPLOM representation of EvoGraphDice, the system can visually handle only data sets with relatively few data dimensions, and it cannot handle data types such as time series at all (an issue raised by one expert). The algorithmic part of the system, however, has no such limitations. It remains future work to develop visualizations that can express temporal combinations of dimensions proposed by an IEA. Another issue that was raised is the scalability of our matrix representations to a large number of dimensions. Aside from using known dimensionality reduction techniques (such as clustering), there is a need for further research on how to select appropriate visual representations of the original and proposed dimensions, and potentially how to adapt these views on the fly based on the underlying data. As such, we believe there is still a lot of potential in continuing the dialogue between the visualization and IEA communities.
In our new experimental analysis, we were able to compare the use of the system across different users in a more controlled setting and scenario. Our analysis of strategies (see Section 5.3.1) shows that participants adopted different strategies, and in particular different exploration patterns, for successful and unsuccessful sessions. The analysis of the content of the surrogate function, via the observation of the variation of the learned weights of the scagnostics measurements, highlights a difference in users’ focus of attention (i.e., searched visual pattern). For successful game sessions, there are clearly two main strategies: one tending to “unfold” the curved shapes by favouring linear scagnostic measurements (e.g., middle solution in Fig. 3), the other trying to spread the figures laterally by favouring sparse or skinny scagnostics (e.g., right solution in Fig. 3). Our user strategy analysis also showed that stability may be an important factor for determining the outcome of the exploration task and the speed of convergence, since successful game sessions had a more stable strategy when compared to the unsuccessful ones.
The choice of visualization and the order of dimensions relevant to the task may have influenced the way users visited the new views or evaluated them. Our user focus analysis showed that users were more likely to visit cells that included dimensions relevant to the task (in our case, x0 and x1 columns) although not all users were conscious of this choice (from our observational study). It is likely that users found interesting patterns in these cells more often than in others, which also explains why these were ranked higher than the other cells. When these areas of the SPLOM did not show any interesting pattern, the search extended to other areas of the matrix. Nevertheless, there is always the possibility that the placement of proposed dimensions in the interface affects users’ choices and attracts their attention. More work is needed to delineate the influence of interface design on evaluation strategies with regard to different visual search and analytics tasks.
The surrogate function that approximates user evaluations is clearly not able to embed the explicit aim of the game (i.e., separating the convex hulls of two geometrical subsets), as it only performs calculations on the whole set of points of the scatterplot. However, our analysis suggests that the surrogate function is able to predict users’ ranking order of scatterplots fairly consistently and is discriminative enough to allow various search strategies (e.g., favouring linear, convex, or sparse distributions to solve a specific task) as well as different evaluation strategies that are either coarse or more fine-tuned. More extensive experimental analysis is needed in order to characterise and generalise these strategies for tasks other than the game task studied here.
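As an illustration of what predicting users' ranking order means here, the following minimal sketch models the surrogate as a weighted sum of the nine scagnostics measures and scores it by the fraction of scatterplot pairs that it orders in the same way as the user. The weighted-sum form, the agreement measure, and all names and values are assumptions made for this example; the learning of the weights is not shown.

```python
from itertools import combinations

SCAGNOSTICS = ("outlying", "skewed", "clumpy", "sparse", "striated",
               "convex", "skinny", "stringy", "monotonic")


def surrogate_score(measures, weights):
    """Weighted sum of the scagnostics measures (each assumed to lie in [0, 1])."""
    return sum(weights[name] * measures[name] for name in SCAGNOSTICS)


def pairwise_agreement(predicted, observed):
    """Fraction of pairs ordered the same way by both score lists.

    Pairs that the user scored equally are skipped.
    """
    agree = total = 0
    for i, j in combinations(range(len(observed)), 2):
        if observed[i] == observed[j]:
            continue
        total += 1
        if (predicted[i] - predicted[j]) * (observed[i] - observed[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical weights for a user who favours sparse and skinny distributions.
zeros = dict.fromkeys(SCAGNOSTICS, 0.0)
weights = {**zeros, "sparse": 1.0, "skinny": 0.5}
plots = [
    {**zeros, "sparse": 0.9, "skinny": 0.4},
    {**zeros, "sparse": 0.2, "skinny": 0.2},
    {**zeros, "sparse": 0.5, "skinny": 0.9},
]
user_scores = [5, 1, 3]  # hypothetical user evaluations of the three plots
predicted = [surrogate_score(p, weights) for p in plots]
agreement = pairwise_agreement(predicted, user_scores)
```

An agreement close to 1 means that the surrogate reproduces the user's ordering, even though it has no explicit notion of the task itself.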
EvoGraphDice seems to exhibit a learning behaviour controlled by the diversity component of the genetic engine, which aims on the one hand to provide a diverse set of solutions and on the other hand to converge quickly to a more pertinent subspace. The diversity analysis (see Section 5.3.4) shows that on average the EA provides more diverse solutions at the beginning of the exploration session (bin 1) before slowly converging to a more focused search space (in bin 2 or 3 for successful exploration sessions). These effects correspond to the exploration component (random search) and the exploitation component (focus) of the genetic engine. These two mechanisms are transparent to the user. However, it would be interesting to provide users with a metavisualization of their exploration paths highlighting stages where they explored and others where they exploited, and to allow them to roll the system back to previous exploration stages. Such visualization tools may give users a better understanding of their exploration behaviour and help them establish a more stable strategy.
Our general approach for steering visual exploration has the following characteristics: (1) Intuitiveness. It offers a visual approach to interacting with data that requires no prior statistical knowledge. (2) Interactivity. Rather than fitting the data to predefined shapes in a static manner, using an IEA, the user can dynamically steer the exploration process towards a pattern of interest. These patterns can involve dimension concatenations that are not obvious at the outset of the exploration. (3) Adaptability. The system can adjust to the user's change of interest over time. However, there are also limitations to our approach related to the fitness function design (including the surrogate function), the IEA implementation, and user-related issues, some of which we hope to address in future research informed by a deeper analysis of our collected log data.
6.1 Extensions to the Fitness Function
The main challenge in guided search is to determine what views of the data are interesting to the analyst. Currently, our fitness function has three components: the surrogate function, a complexity term, and a user evaluation term. Other terms that help approximate user interest could be considered, in place of or in addition to these, such as data-related quality metrics (e.g., variance) or perception-based metrics, for instance for correlation perception (Rensink and Baldridge, 2010) or similarity perception (Albuquerque et al., 2011). Moreover, these different terms may have varying weights depending on the task at hand and the user’s domain knowledge of the data, emphasising either the automated components or the interactive term.
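The idea of weighting the three terms differently can be sketched as follows. This is a minimal illustration and not the formula implemented in EvoGraphDice: the weighted-sum form, the treatment of complexity as a subtracted penalty, and all names and values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitnessWeights:
    """Hypothetical term weights; the defaults are placeholders."""
    surrogate: float = 1.0
    complexity: float = 1.0
    user: float = 1.0


def combined_fitness(surrogate_score, complexity_penalty, user_score,
                     weights=FitnessWeights()):
    """Weighted sum of the three terms; complexity is subtracted (assumption)."""
    return (weights.surrogate * surrogate_score
            - weights.complexity * complexity_penalty
            + weights.user * user_score)


# Emphasise the interactive term over the automated components.
interactive = FitnessWeights(surrogate=0.5, complexity=0.5, user=2.0)
score = combined_fitness(0.6, 0.2, 0.8, interactive)
```

Additional terms, such as a variance-based quality metric, would enter this sketch as further weighted summands.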
6.2 Robustness of the IEA
In general, we feel that the speed of convergence of the IEA depends on many factors, including the size of the search space, the complexity of the sought pattern, the number of evaluated scatterplots, and how often users change their focus and target search pattern. All these variables make it difficult to predict a convergence ratio or speed. As discussed in Section 4.2, this is not easy to study, as there is no unique solution to converge to; rather, the optimisation is dynamically adapted to follow user interest over time. Here again, visualizing past exploration paths could help users better understand the target pattern and how far from or close to it they have been exploring.
6.3 User-Related Issues
Despite the complexity component of the fitness function that favours combined dimensions with fewer variables and simple formulae, our method can still yield complex dimensions that are difficult to interpret, something that was also observed in our study with experts (see Section 4). Another issue is related to user fatigue, which is a well-documented problem in interactive evolution (Poli and Cagnoni, 1997). Other methods to collect user feedback need to be investigated. Our controlled study (see Section 5) occasionally showed a chunking effect of evaluations to binary values (“good” or “bad”), which does not seem to reduce convergence speed (number of generations evolved before a solution was found). Careful selection of a user evaluation scale or method, such as sketching (Shao et al., 2014), can help reduce user fatigue. There is indeed a trade-off between the accuracy of user evaluations and the cost related to user fatigue.
Guiding users in an exploratory visualization environment involves careful consideration of what views to propose, when to propose them, and how to present them to the user. Thus far, we elaborated on a framework that combines automatic methods and user input in order to steer user exploration (i.e., the what part); much work is still needed to establish when and how interesting views should be best presented to the users without interrupting or distracting them from their main exploration tasks.
7 Conclusion and Reflections
Besides the development of a complete EVE framework, and the experimental proof of its efficiency for various data exploration tasks, this paper proposes a full experimental analysis of an IEC system. Assessing convergence, versatility, and user satisfaction is a very difficult task for interactive systems, and one that has rarely been addressed in a systematic way in IEC. The visual analytics community has developed tools and methodologies for addressing such issues, which have been applied in our collective work on EVE, yielding important insights concerning the behaviour of a genetic programming system in an interactive and changing fitness landscape. It is now clear that, as we are dealing with dynamic landscapes, pure mathematical convergence (as usually defined in EC) is less meaningful than adaptation behaviour for IEC systems. Interactive systems often deal with very small populations, for which premature convergence is a major risk: convergence may then be considered a drawback. Maintaining an exploration and adaptation ability strongly depends on diversity management. For EvoGraphDice, adaptation relies on a surrogate function learned from past user interactions; the efficiency of this mechanism has been proved using a very simple scheme. More sophisticated surrogate functions may be interesting to explore and evaluate in order to improve the versatility of the system. Another advantage may be to use the surrogate function as an underlying optimisation fitness, allowing the use of larger EA populations for which only the best individuals are presented to the user for evaluation.
There are many open questions for EVE systems. Regarding initialisation, for instance, we chose PCA to provide a known initial environment for users who are not familiar with evolutionary approaches. This may not be the best choice for some data sets, in particular for non-numerical ones, although the GP search space still remains convenient for providing combined dimensions. Another challenge that EVE tools are facing is the exploration of highly multidimensional data sets and the related big data issues, where all dimensions cannot be displayed in a single view. Evolutionary search may serve as a preselection tool to filter interesting dimensions for the user. The main difficulty is then to learn users' preferences on data they cannot entirely view. Learning solutions proposed by systems such as VisAsist (Guettala et al., 2012) may be interesting but rely on some a priori or more sophisticated surrogate functions.
Finally, we highlight the possibility of collaborative EVE systems, and in particular crowd-sourcing ones. Crowd-sourcing approaches are indeed very attractive for dealing with very large and complex data sets such as genomic databases. More generally, collaborative EVE systems may serve as a communication framework for multi- to many-user search, as a shared population allows various interesting solutions to be maintained and compared simultaneously. In this context we may have to consider large populations, and a set of user-dependent surrogate functions to yield views adapted to each user.
The evaluation of an IEC system remains a difficult task, as these systems adapt to the user, but at the same time the user also adapts to the system. Getting a clear understanding of the subtle mechanisms of co-adaptation (Mackay, 2000) is challenging. The research domains of visualization and human-computer interaction not only provide tools to help understand complex data sets but they also have study methodologies that can help shed some light on how users interact with evolutionary systems. We expect the synergy between experts from the IEC and the visualization communities to bring forward advances in conducting optimisation with a human-centered approach.
In contrast to its sister domain of scientific visualization (SciVis), which traditionally deals with spatial data.
EVE video and a prototype demo are available at http://www.aviz.fr/EVE
Features drawn from the data, such as clusters, outliers, and correlations, can also be seen in the view.
Available as a free downloadable package in R from http://www.rforge.net/scagnostics/
Dang and Wilkinson (2014) evaluated the feasibility of handling a huge collection of scatterplots in a scagnostics-based system using data sets having up to 3k dimensions and found this to be a bottleneck. We tested scagnostics for data sets up to 12 dimensions (Boukhelifa et al., 2013) and found that we were more limited by the size of the display to accommodate new scatterplots than by the algorithms we used.
In the current version of the tool, we consider views with only one dimension being evolved, but in the future we will extend the fitness function to evolve both dimensions of a view. In that case we could talk about “co-evolution.”
The minimal distance is calculated in the scagnostics space, whose values range between 0 and 1. Thus, the default value of 0.1 corresponds to 10% of the maximum value of a scagnostics score.
The scagnostics fitness component has a weight of 1 and is not currently tunable by the user.
Convergence generation per session having stability level 2: d(59), e(25), h(6), and k(9).
Convergence generation per session having stability level 3: f(3), g(29), i(5), j(13), l(10), and m(40).
No session for this game level had stability 0.
Note that we only report on cell selections occurring in the bottom left quarter in Figure 2, where 87% of cell visits took place. The remaining selections were located in the top right quarter of the matrix, as it is not possible to select cells from the bottom right quarter of the SPLOM in this version of EvoGraphDice.