To help another person, we need to infer his or her goal and intention and then perform the action that he or she was unable to perform to meet the intended goal. In this study, we investigate a computational mechanism for inferring someone's intention and goal from that person's incomplete action to enable the action to be completed on his or her behalf. As a minimal and idealized motor control task of this type, we analyzed single-link pendulum control tasks by manipulating the underlying goals. By analyzing behaviors generated by multiple types of these tasks, we found that a type of fractal dimension of movements is characteristic of the difference in the underlying motor controllers, which reflect the difference in the underlying goals. To test whether an incomplete action can be completed using this property of the action trajectory, we demonstrated that the simulated pendulum controller can perform an action in the direction of the underlying goal by using the fractal dimension as a criterion for similarity in movements.
1 Introduction: Imitation of Action
As a method of social learning from others, children imitate their parents' movements in early development (Meltzoff, 1995). Imitation, as a behavioral basis for understanding others' goals and intentions, is considered a mechanism for preserving social and cultural knowledge. From the perspective of cultural evolution, it plays a key role as a “ratchet,” which preserves the skills and knowledge obtained by our ancestors and prevents human cultural knowledge from slipping backward (Tomasello, 2001). Teaching techniques in which a demonstration is shown and then mimicked are commonly adopted not only in education but also in robot learning (Schaal, 1999).
Consider two scenarios in which a human demonstration ends in failure without completing a given task. In scenario 1, the actor intended to reach a target but could not do so because of inaccurate motor control. In scenario 2, the actor intended to reach the target but could not move her arm freely due to an injury.
Although the action ends in failure in both scenarios, the reasons underlying why the actor failed to complete the given task are totally different. In scenario 1 (see Figure 1, failure by immaturity), the reason for the failure is an insufficiency in the accuracy of the actor's motor control for the given task, which can likely be overcome with additional motor learning. However, in scenario 2 (see Figure 1, failure by infeasibility), the reason for the failure is temporary or permanent inability to perform an appropriate action. By our definition, the action in scenario 2 is considered a hypothetical success because the actor's motor control is optimal for the task in question and her action would be successful if she were not injured. In this study, we refer to the former type of failure as “failure by immaturity” and the latter as “failure by infeasibility.”
While problems closely related to imitation have been studied in robotics, the majority of these have been classified as failure by immaturity. For example, Schaal proposed a scheme called learning from demonstration (LFD; Schaal, 1997, 1999), which aimed to utilize human demonstration to initialize and improve a robot controller. Typically, combined with reinforcement learning (Sutton & Barto, 1998; Doya, 1999), a set of human-demonstrated trajectories is used to provide an initial guess of controller parameters such as the Q function to leverage learning (Schaal, 1997) and/or update trained controllers based on human performance evaluation (Argall, Browning, & Veloso, 2007). The idea of LFD has been extended to inverse reinforcement learning (IRL; Ng & Russell, 2000; Abbeel & Ng, 2004), whose aim is to infer the unknown reward function given a task structure (states, actions, and environment) and a set of trajectories produced by experts of the given task. The recent development of IRL algorithms (Ziebart, Maas, Bagnell, & Dey, 2008; Babes-Vroman, Marivate, Subramanian, & Littman, 2011; Ho, Littman, MacGlashan, Cushman, & Austerweil, 2016) has led to great success in such applications.
LFD and IRL studies have typically examined certain tasks under the failure-by-immaturity class. Both LFD and IRL typically assume that all given action trajectories are successful or fall under the failure-by-immaturity class with respect to an unknown but fixed task. Recent studies (Grollman & Billard, 2011; Shiarlis, Messias, & Whiteson, 2016) on learning from “failed” demonstrations (reviewed in Zhifei & Joo, 2012) have attempted to utilize nonexpert trajectories in failure-by-immaturity.
In detail, Grollman and Billard (2011) studied how a robot can learn only from nonexpert demonstrations of failure by immaturity. Namely, they assumed that the provided “failed” demonstrations are distributed around the “successful” trajectory in space. In this sense, they assumed that the failed trajectories were similar to the successful ones and on average contained information about the successful ones. A subsequent study found that a handcrafted reward function for the target task or human performance evaluation is required to train an acceptable robot controller (Grollman & Billard, 2012). Recently, Shiarlis et al. (2016) studied how nonexpert demonstrations can be used to improve the performance of existing IRL algorithms. They used nonexpert demonstrations of failure by immaturity as auxiliary information: successful and failed demonstrations served as positive and negative samples, respectively, for training IRL systems to behave more like the successful demonstrations and less like the failed ones. They also showed that successful demonstrations must be provided to facilitate learning of a controller from demonstrations.
Most previous studies on learning from (failed) demonstrations assumed that the demonstration was either successful or a case of failure by immaturity rather than failure by infeasibility. Thus, the learner or imitator in this scheme can access an optimal or near-optimal (with noise) demonstration. However, children, even in early development, can go further: they can learn from and complete a demonstration that fails by infeasibility, which contains no action at its goal state. In this case, what the imitator has to infer is the goal underlying the demonstrator's nonoptimal movement, in which the goal state is absent. We think that handling failure by infeasibility (scenario 2) is needed to help others; that is, it is necessary to complete an action in the course of meeting its goal without knowing the goal state. Relevant developmental studies have shown that two-year-olds can “help” others by completing their action (Meltzoff, 1995; Warneken & Tomasello, 2006).
In this letter, we aim to address the two problems we posed above in imitation learning, (1) action recognition and (2) action completion, by performing a numerical study on a task that involves controlling a physical object: a single-link pendulum. We suppose that this simple control task is minimally sufficient to capture the essential aspects of goal imitation: how one recognizes the intention (motor control) behind a given action and how one performs the action. The primary objective of this letter is to provide computational proofs-of-concept for our hypothesis that some measure of the degrees of freedom (DoF) of a movement is critical for characterizing the underlying goal and intention of an observed action (described in the next section). Therefore, in the computer simulation studies in this letter, we suppose that the imitator can obtain sufficient trajectory data from the demonstrator to learn action features. This allows us to explore the primary problem, the principle of the computational possibility of goal inference, separately from other technical problems, such as learning from a small training data set. This assumption may be a limitation of our study and is discussed in section 5.
Although the control task involving a pendulum may be considered overly simple in its structural complexity compared to the human body, we think this task has very similar characteristics to the experimental task reported by Warneken and Tomasello (2006). In their experiment, Warneken and Tomasello exposed 18-month-old children to an adult (experimenter)'s goal-failed action and investigated whether these children could infer the adult's latent goal, which was not demonstrated, and could help the adult complete the goal-failed action. The research suggested that children of this age can infer others' goals and complete others' actions.
In principle, children in such an experiment are required to (1) recognize the adult's failed goal and/or intention and (2) perform their own action by controlling their own body to meet the adult's goal. In this letter, tasks 1 and 2 are called the recognition and completion tasks for goal imitation, respectively. We illustrate how our simulation framework captures the goal imitation behavior and report two simulation studies, one for the recognition task and one for the completion task.
2 Simulation Design
2.1 A Situation That Requires Recognition and Completion of Another's Action
First, we briefly introduce the psychological experiment performed by Warneken and Tomasello (2006) (abbreviated as WT hereafter) as a representative situation against which we modeled our theoretical framework. WT investigated whether children can infer a demonstrator's goal and the intention behind their behavior. In the experimental (goal-failed) condition, called out-of-reach, the demonstrator accidentally dropped a marker on the floor and was unable to reach for it. In the control (goal-achieved) condition, the demonstrator intentionally dropped a marker on the floor. The former condition implicitly calls for the child to help the demonstrator achieve her unsuccessful intention/goal, namely, to pick up the marker, while the latter does not. The experimental and control conditions were designed such that the demonstrator's apparent bodily movements were similar (e.g., both dropped a marker), whereas the underlying intention/goal behind the action was different. WT showed that the children more frequently showed helping behaviors in the experimental condition than the control condition.
2.2 A Model of Recognition and Completion of Another's Action
In this study, we designed a simulation framework to capture the essence of WT's experimental design in minimal form. Specifically, we employed a single-link pendulum as a simplified human body. The imitator (i.e., the hypothetical child) and the demonstrator must both control a pendulum to perform an action (i.e., a goal-directed movement). The demonstrator's goal is to keep the body of the pendulum at the top-most position of its trajectory (opposite to gravity) as much as possible subject to given bodily constraints, a set of physical parameters for the pendulum (e.g., mass and length). The demonstrator's intention is motor control (or policy in terms of reinforcement learning) of the pendulum, which gives angular acceleration (force) as a function of the angle and angular velocity of the pendulum. An action of the demonstrator is to manipulate the movement (trajectory) of the pendulum, as represented by either an orbit of the coordinates or a vector of the angle and angular velocity, which are generated using a given pair of initial conditions and the demonstrator's controller.
We hypothesize that the essential difference between the experimental (goal-failed) and control (goal-achieved) conditions in the study conducted by WT is captured by the degree of optimality of the intention and action with respect to the given goal. Suppose there are controllers A and B, which are optimal for the distinct goals GA and GB, respectively. If the demonstrator uses controller A for goal GA, the generated movement would be optimal and considered a successful action. In contrast, if the demonstrator uses controller B for goal GA, the generated movement would in general be suboptimal and would be considered a failed action. We consider the former case to be analogous to the control (goal-achieved) condition in the experiment conducted by WT in which a successful action was performed and did not lead the child to help the demonstrator and the latter to be analogous to the experimental (goal-failed) condition that led the child to help the demonstrator.
Given these two types of tasks, we defined the goal-failed condition as a mismatch between the task and action in which the demonstrator is performing the swing-up-no-hit task (with the infeasible region) by controlling the pendulum using a controller that is optimal for the swing-up task (see Figure 3C). We consider this goal-failed condition to be analogous to the experimental condition in the study conducted by WT in which a failed action was demonstrated. We also defined the goal-achieved condition, which we consider to be analogous to WT's control condition, as a match between the task and action in which the demonstrator is performing the swing-up-no-hit task by controlling the pendulum using a controller that is optimal for the swing-up-no-hit task (see Figure 3D). We expect that the demonstrator in our goal-failed condition (see Figure 3C), but not the goal-achieved condition (see Figure 3D), will perform an action that is suboptimal for the swing-up-no-hit task, which may appear similar on the surface but is essentially different from the action intended to be optimal for the swing-up task.
The imitator, in turn, observes two types of goal-failed and goal-achieved actions, which are potentially different, and analyzes their potential difference based only on the observed action trajectories. This situation corresponds to simulation I (see section 3), in which we investigated recognition of the potential difference between goal-failed and goal-achieved actions.
After some visual inspection of actions, the imitator is expected to perform his or her own actions to complete the demonstrator's action (i.e., “help” the demonstrator) if the action is incomplete or goal-failed. This situation corresponds to simulation II (see section 4) in which we investigated action generation based on observation of the demonstrator's incomplete or goal-failed actions.
2.3 Pendulum Control
The pendulum swing-up task is classically used in feedback control theory (Doya, 1999), originally used to design a controller that can swing the pendulum and maintain it at about the top-most position where . The controller for this task is defined by the function , which outputs torque for any given state . The goal of the task is implicitly and quantitatively represented by the reward function (see equation 2.4 or 2.5). With this reward function, we can define the goal-meeting action as an action with the maximal reward value (or large enough to be considered an approximation of the maximum) as a function of the controller (see the next section for details). In each run of the simulation, the initial position of the pendulum was set such that and angle drawn from the uniform distribution ranged by .
2.4 Energy-Based Swing-Up Controller
In our previous work (Torii & Hidaka, 2017), we adopted a controller (or policy) based on reinforcement learning to study the action recognition task described in section 3 and obtained qualitatively the same results as those reported in this letter. To study the action completion task described in section 4 as well, we adopted the energy-based controller throughout this study: its scalar parametric form is far more convenient than reinforcement learning, which requires computationally expensive training of the policy function.
2.5 Goal-Achieved and Goal-Failed Action
For both the swing-up and swing-up-no-hit tasks, we applied the energy-based controller (Astrom & Furuta, 2000) with some goal angle introduced in the previous section. Since the energy-based controller with the goal angle was originally designed for the pendulum swing-up task with no angle constraint, this energy-based controller is not optimal for the swing-up-no-hit task with the constrained pendulum: it does not supply sufficient torque to hold the pendulum against gravity. As a result, it produces a repeated swinging movement, unlike the behavior without the infeasible boundary, in which it holds the pendulum still at the goal angle.
When the pendulum collides with the bounds, some loss of mechanical energy occurs because the height of the pendulum forcibly remains unchanged and the body decelerates to an angular velocity of zero, which can be visually observed in both the energy-time series and the trajectory in the angle-velocity plane in the top panel of Figure 4. In contrast, no such loss of energy is observed in the bottom panel because the pendulum rarely collides with the bounds.
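The collision behavior described above can be illustrated with a minimal numerical sketch. The code below is not the controller of equation 2.3. It assumes a generic energy-pumping rule in the spirit of Astrom and Furuta (2000), angles measured from the downward rest position, a symmetric angle bound, and illustrative values for mass, length, gain, and time step. All function and parameter names are hypothetical.

```python
import math


def simulate(goal_angle, bound=None, mass=1.0, length=1.0, gain=0.5,
             theta0=0.1, duration=60.0, dt=0.01, g=9.8):
    """Simulate a pendulum driven by a simple energy-pumping torque.

    Assumptions (not taken from the source): theta = 0 is the downward
    rest position, and the torque is gain * (E_goal - E) * sign(omega).
    If `bound` is given, |theta| is clipped to it and the angular velocity
    is reset to zero on contact, mimicking the energy loss at the bounds.
    """
    inertia = mass * length ** 2

    def energy(theta, omega):
        kinetic = 0.5 * inertia * omega ** 2
        potential = mass * g * length * (1.0 - math.cos(theta))
        return kinetic + potential

    target = energy(goal_angle, 0.0)  # energy of resting at the goal angle
    theta, omega = theta0, 0.0
    thetas, energies, collisions = [], [], 0

    for _ in range(int(round(duration / dt))):
        sign = (omega > 0) - (omega < 0)
        torque = gain * (target - energy(theta, omega)) * sign
        alpha = (-mass * g * length * math.sin(theta) + torque) / inertia
        # Semi-implicit Euler step: update velocity first, then angle.
        omega += dt * alpha
        theta += dt * omega
        if bound is not None and abs(theta) > bound:
            theta = math.copysign(bound, theta)  # stop at the bound
            omega = 0.0                          # kinetic energy is lost
            collisions += 1
        thetas.append(theta)
        energies.append(energy(theta, omega))

    return thetas, energies, collisions, target
```

Calling `simulate(math.pi)` corresponds loosely to the unconstrained swing-up case, and `simulate(math.pi, bound=0.75 * math.pi)` to a constrained case in which the same controller repeatedly hits the bound. The bound value is an arbitrary choice for illustration.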
2.6 Features for Detecting Intentional Differences
According to our definition of the goal-failed and goal-achieved conditions, the intention behind a movement that is optimal for the swing-up task does not match that for the swing-up-no-hit task (see Figure 3C). Beyond this particular case, many other actions that fail by infeasibility, including those in the study by WT, essentially display this type of mismatch between some originally intended task and the task actually performed. One of the critical features common to these cases is that the task actually performed has an additional unexpected obstacle that is absent from the original task, for which the controller is optimal.
Beyond specific differences across different tasks, we hypothesize that these types of failures may be characterized by the existence of some additional factor complicating the originally intended task. In WT's condition in which a marker was accidentally dropped, the demonstrator was not ready for the situation in which he or she was required to pick up the dropped marker; the accidental dropping of the marker introduces an additional complexity to the originally intended task: to carry the marker to some location (without dropping it). This is analogous to the goal-failed condition in our pendulum simulation: the additional obstacle, the limitation in the feasible angle, causes suboptimality of the original motor control in this unexpected new task.
What characteristics can be used to detect such suboptimality in an action? In this study, we hypothesize that this additional factor of complexity can be detected in the degrees-of-freedom (DoF) of the given system.
Let us consider a successful action, for example, the goal-achieved condition of the pendulum control task. Such an action is expected to flow smoothly, without any sudden change in its motion trajectory. Thus, the movement can be closely approximated using a set of differential equations with a relatively small number of variables. In contrast, an action that fails by infeasibility, for example, the goal-failed condition of the pendulum control task, is expected to have some discontinuous or nonsmooth change in its motion trajectory, such as at the time point before or after an unexpected accident for the given system. Thus, before and after this change, such a system would be better described using two or more distinct sets of differential equations.
Although it is technically difficult to identify such differences in the underlying systems (or sets of differential equations) in full detail here, it should be clear that the underlying controller in these two cases would differ in their DoF. This consideration leads us to the hypothesis that some difference in the DoF of movement is diagnostic of successful and failed actions.
In this letter, we specifically employ a type of fractal dimension, called pointwise dimension (see section 3.1), of the actions as an indicator of the DoF of the underlying controller and test whether it is characteristic of the difference in intention underlying the actions. In the following two sections, we examine our hypothesis by analyzing the movement data generated by the simulated pendulum control task. We divided our analyses into action recognition and action completion.
First, we analyzed the recognition task from the imitator's perspective by examining which features of the movements allow the imitator (observer of the actions) to discriminate between the goal-achieved and goal-failed actions. Successful recognition, that is, the ability to tell the difference between two qualitatively different actions, is considered necessary to complete another's failed action.
Second, we analyzed the completion task by asking whether the characteristic features of the intention underlying actions, as identified in the first analysis, are sufficient to generate an action to complete a goal-failed action. Here, completion of the action means that the imitator performs an action that meets the goal that an observed demonstrator's action failed to meet. As such, a goal-failed action is incomplete by definition and not fully observed by the imitator; thus, the imitator is required to extrapolate the observed action to generate the originally intended action. This action completion task needs not just recognition of the qualitative difference in actions but also some identification of the demonstrator's failed action and the imitator's action.
3 Simulation I: Action Recognition Task
In simulation I, we investigate whether the imitator can tell the difference between the two different intentions underlying the actions performed by the demonstrators in the goal-failed and goal-achieved conditions. The goal of this simulation is to analyze and identify the feature that is most characteristic of the latent intention of actions.
Specifically, we listed several features typically used in time series analysis, such as angle (or angular position), angular velocity, angular acceleration, angular jerk, power spectrum, mechanical energy, and pointwise dimension. We hypothesized that pointwise dimension would be most characteristic of the latent intention of actions for this analysis. Angle, angular velocity, and power spectrum are commonly employed features of movements in the literature. They are also fitting for our simulation, as motor control is a function of angle and angular velocity, and the generated movement is periodic. Mechanical energy is the very concept defining the motor control task (see equation 2.3), and we thus expect mechanical energy to be the best possible feature in theory to characterize the intention (motor control). However, a naive imitator, such as a child, who is ignorant of the demonstrator's physical properties may not have direct access to the mechanical energy because of the need for knowledge of the physical parameters of the pendulum (i.e., mass and length in equation 2.1, which are necessary to compute the mechanical energy of the pendulum system). Thus, we treated mechanical energy as an indicator of the best possible (but unlikely to be directly accessible) reference feature for the recognition task in our analysis.
Finally, given that the pointwise dimension indicates the latent DoF of an underlying dynamical system, we hypothesize that it is an indicator of task-system complexity and is characteristic of the intentional difference between movements performed in the goal-failed and goal-achieved conditions. We tested this hypothesis by evaluating recognition performance using the pointwise dimension compared to that using the reference feature, mechanical energy, in the classification of movements with different intentions.
3.1 Pointwise Dimensions
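As an illustration only, the pointwise dimension at a point x of a set with measure μ is standardly defined as the limit of log μ(B_r(x)) / log r as the radius r of the ball B_r(x) shrinks to zero. The sketch below approximates this limit by a two-radius slope of neighbor counts. It is a naive stand-in chosen for brevity, not the Weibull-gamma mixture estimator of Hidaka and Kashyap (2013) used in this letter, and the function name and radii are hypothetical.

```python
import math


def pointwise_dimension(points, center, r_small, r_large):
    """Two-radius slope estimate of the pointwise dimension at `center`.

    Counts neighbors of `center` (excluding the point itself) within two
    radii and returns log(N_large / N_small) / log(r_large / r_small).
    """
    def count_within(radius):
        return sum(1 for p in points
                   if p != center and math.dist(p, center) <= radius)

    n_small = count_within(r_small)
    n_large = count_within(r_large)
    if n_small == 0 or n_large == 0:
        raise ValueError("radii too small: no neighbors found")
    return math.log(n_large / n_small) / math.log(r_large / r_small)


# Points on a line should give a value near 1, points filling a plane near 2.
line = [(float(i), 0.0) for i in range(101)]
plane = [(float(i), float(j)) for i in range(61) for j in range(61)]
d_line = pointwise_dimension(line, (50.0, 0.0), 10.5, 20.5)
d_plane = pointwise_dimension(plane, (30.0, 30.0), 10.5, 20.5)
```

The two synthetic point sets only check that the estimate tracks the intrinsic dimension of the set around a point. Real trajectories would need radii chosen relative to the sampling density.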
3.2 Two-Class Classification
We performed classification analyses of demonstrator types based on each of the features described. Performance of the classification is used as a measure of how well each feature discriminates among demonstrator types. Specifically, for this two-class classification task, the imitator is exposed to a time series of a pair of angles that reflect each movement demonstrated in the goal-achieved and goal-failed conditions. Part of each time series corresponding to the first 10 seconds of the task was excluded from the training data because these were transient periods that were heavily dependent on the initial state. The rest of the time series, corresponding to the last 50 seconds of the movement (5000 sample points), was used as the training data for classification. We used a single, long time series because the system is expected to be ergodic; that is, a time series starting from any initial state is expected to converge eventually to the same stationary, near-periodic dynamics (with some intrinsic noise in motor control).
For classification, we considered the following features: angle (or angular position), angular velocity, angular acceleration, angular jerk (the third derivative of angle), power spectrum, mechanical energy, and pointwise dimension. Given a time series of angle (or angular position), the time series of angular velocity, acceleration, and jerk were calculated by taking the first-, second-, and third-order differences of the angle time series. The third derivative of the position, called “jerk,” is a notable feature that is hypothesized to be critical in the minimum jerk and/or minimum torque-change trajectory for human motor control of reaching (Hogan, 1984; Uno, Kawato, & Suzuki, 1989). The data points for the power spectrum feature were constructed as a collection of frequencies with the largest powers in the power spectrum of angles computed within a moving time window of 5 seconds. Details of the construction of the pointwise dimension feature are described later. In contrast to the features described, which can be computed using only the observable time series, computation of mechanical energy requires knowledge of the physical properties, such as the body mass and length, of the pendulum system, as evident from equation 2.2.
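The difference-based features can be sketched as follows. The sampling interval of 0.01 seconds is inferred from the 5000 sample points spanning 50 seconds. The function names are hypothetical, and forward differences are assumed because the text does not specify the differencing scheme.

```python
def drop_transient(series, dt, transient_seconds):
    """Discard the initial transient period of a sampled time series."""
    return list(series[int(round(transient_seconds / dt)):])


def finite_difference(series, dt, order):
    """Approximate the `order`-th time derivative by repeated forward differences."""
    out = list(series)
    for _ in range(order):
        out = [(b - a) / dt for a, b in zip(out, out[1:])]
    return out


def difference_features(angle, dt):
    """Return angular velocity, acceleration, and jerk series from an angle series."""
    return {
        "velocity": finite_difference(angle, dt, 1),
        "acceleration": finite_difference(angle, dt, 2),
        "jerk": finite_difference(angle, dt, 3),
    }
```

Each differencing step shortens the series by one sample, so the three feature series have slightly different lengths and would need to be aligned before being combined.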
To analyze the degree of contribution of pointwise dimension to recognizing the underlying controller of the system, the pointwise dimension associated with each data point was estimated from a time series of coordinate values of the pendulum. As pointwise dimension is an invariant under arbitrary smooth transformation, we obtained essentially the same estimate as that from the time series of angle. In dynamical systems theory, Takens's embedding theorem states that a diffeomorphism of a smooth attractor in a latent high-dimensional space can be reconstructed from a univariate or low-dimensional time series, which is a projection on a subset of the original high-dimensional state space, using time-delay coordinates with a sufficiently high dimension for embedding the attractor. Therefore, the positional time series was first embedded into the time-delay coordinates of the embedding dimension , . Then the embedded -dimensional time series (of sample points) was used to estimate the pointwise dimension, equation 3.1, for the time series. We mostly adopted for the pendulum system with a controller.
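Time-delay embedding of a univariate series can be sketched in a few lines. The function name and the `lag` parameter are hypothetical. The embedding dimension and delay used in this letter are not restated here and should be supplied by the reader.

```python
def delay_embed(series, dim, lag=1):
    """Map a scalar series x_t to vectors (x_t, x_{t+lag}, ..., x_{t+(dim-1)*lag})."""
    n_vectors = len(series) - (dim - 1) * lag
    if n_vectors <= 0:
        raise ValueError("series too short for this embedding dimension and lag")
    return [tuple(series[i + j * lag] for j in range(dim))
            for i in range(n_vectors)]
```

For example, `delay_embed([0, 1, 2, 3, 4], 3)` yields the three vectors `(0, 1, 2)`, `(1, 2, 3)`, and `(2, 3, 4)`, which can then be passed to a dimension estimator.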
To perform the recognition task using pointwise dimension as a feature, we used the statistical model underlying the dimension estimation method proposed by Hidaka and Kashyap (2013). The method or dimension estimator constructs a model for given data as a mixture of multiple Weibull-gamma distributions, each of which has an associated parameter representing the fractal dimension. This method can also be used to calculate the probability that a sample data point with time-delay embedding belongs to the mixture model. Therefore, for the probability density functions of the pointwise dimension feature, we adopted the Weibull-gamma mixture model rather than the gaussian mixture model used for the other features. The training and test data for, say, and , were both obtained by time-delay embedding the positional time series. For this recognition task, the spatial neighborhood of a sample point was calculated within the feature space spanned by training data (or ) associated with the probability density function (or ). The number of mixture components of the Weibull-gamma mixture model was selected based on the minimum Akaike information criterion (Akaike, 1974).
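The likelihood-based decision rule and the model selection criterion can be sketched in simplified form. The code below assumes a single one-dimensional gaussian per class in place of the gaussian or Weibull-gamma mixture models used in this letter, so it illustrates only the structure of the comparison. The class labels, function names, and sample values are hypothetical.

```python
import math


def fit_gaussian(samples):
    """Maximum likelihood mean and variance of a one-dimensional sample."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def log_density(x, mean, var):
    """Log of the gaussian probability density at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def classify(x, models):
    """Return the label whose class-conditional density is largest at x."""
    return max(models, key=lambda label: log_density(x, *models[label]))


def aic(log_likelihood, n_params):
    """Akaike information criterion; smaller values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical feature values for the two demonstrator types.
models = {
    "goal_achieved": fit_gaussian([0.9, 1.0, 1.1, 1.2]),
    "goal_failed": fit_gaussian([1.8, 2.0, 2.1, 2.3]),
}
```

With mixture models, the same `classify` rule would apply with `log_density` replaced by the log of the mixture density, and `aic` would be evaluated for each candidate number of components.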
3.3 Classification Results
4 Simulation II: Action Completion Task
One of the key observations in the experiment conducted by WT is that the children could perform an action to achieve the demonstrator's “goal” by simply observing their incomplete action. Because the children did not observe the complete action in the experiment, they needed to identify the putative complete action by extrapolating the observed incomplete action. To explore the mechanism of the action completion task, we asked, how does the imitator observing the goal-failed demonstration produce an action that achieves the unobserved goal? As pointwise dimension was found to be reasonably characteristic of the intentions behind observed movements in simulation I, we examined an extended use of the pointwise dimension for the action completion task in this simulation.
In the action completion task, exact identification of the intention is not necessarily required or beneficial because the imitator (e.g., child) does not necessarily have the same body as the demonstrator (e.g., adult), and the motor controllers of different bodies required to meet the same goal may generally differ. Thus, in the action completion task, the imitator needs to identify two actions that have similar goals but may have different physical properties and latent motor control.
4.1 Action Completion Model
Based on the requirement described above, here we propose using the similarity in the dynamic transition patterns in the DoF of the two action-generating systems. Specifically, we hypothesize that the imitator observes an action and extracts the dynamics of the DoFs, defined by pointwise dimension, from the action as estimated for the recognition task in simulation I. Next, the imitator (mentally) simulates a movement by a given pendulum for each set of candidate controllers. Then the imitator performs an action by choosing the controller that can generate the action that is most similar to the demonstrated action. In this way, this action completion model uses a similarity in DoF dynamics rather than a similarity in apparent features such as angle and angular velocity patterns, which were found to be less characteristic of intentional differences in actions.
Specifically, we suppose that the imitator is exposed to one time series of angles generated in the goal-failed condition (see Figure 3C), which is suboptimal for the swing-up-no-hit task. We assume that the imitator performs an action by choosing a controller (see equation 2.3) with goal angle as the parameter. In one condition, the other physical parameters, mass and length , of the pendulum are fixed at , the same values used by the demonstrator. In the two other pendulum conditions, the imitator uses either or , which differ in mass or length from that used by the demonstrator. These three conditions are designed to investigate the robustness of the action completion model to differences in the physical features of the imitator's and demonstrator's pendulums. Given the goal-failed action, the action completion task of the imitator is to choose the controller with the goal angle that will most likely generate a movement similar to the demonstrated movement in terms of DoF dynamics. We let a set of controllers with the goal angles represent the imitator's options. The similarity in DoF dynamics of actions is described in the next section.
4.2 Similarity in DoF Dynamics
In this study, the DoF dynamics of a system are defined by the temporal change in the pointwise dimension estimated in the time series generated by the system. Specifically, a pointwise dimension estimator was constructed for a given demonstrated movement using the method proposed by Hidaka and Kashyap (2013), and used to estimate a series of pointwise dimensions for each of the demonstrated and candidate movements. The constructed pointwise dimension estimator is a mixture of multiple Weibull-gamma distributions, where each probability distribution corresponds to a particular pointwise dimension and assigns for each point in a trajectory the probability that the point belongs to the th distribution.
In our action completion model, the imitator is expected to perform an action controlled by the goal angle that maximizes the log-likelihood function (defined by equation 4.4), which indicates some similarity in DoF dynamics between the demonstrated and simulated trajectory. Specifically, the log-likelihood is defined and maximized using the following steps (illustrated in Figure 9):
1. Given the demonstrated trajectory $x = (x_1, \ldots, x_T)$ as primary data, construct the pointwise dimension estimator/classifier for a reconstructed attractor by the method of time-delay coordinates with sufficiently high embedding dimension. The number $K$ of Weibull-gamma distributions is chosen based on the Akaike information criterion (Akaike, 1974).
2. The demonstrated trajectory is transformed into a state sequence $s = (s_1, \ldots, s_T)$, where each symbol $s_t$ is the index of the most likely distribution:
$$s_t = \arg\max_{k} P(k \mid x_t). \quad (4.1)$$
3. Denote $n_{ij}$ as the number of transitions from state $i$ to state $j$ in $s$. Then the state transition joint probability matrix $P = (p_{ij})$ for all pairs of states is defined by
$$p_{ij} = \frac{n_{ij}}{\sum_{i', j'} n_{i'j'}}. \quad (4.2)$$
4. Given a candidate controller including its parameters (e.g., goal angle $\theta$), a simulated trajectory $y$ is generated and transformed into another state sequence $s'$. To calculate the probability $P(k \mid y_t)$, the simulated trajectory was first transformed by the method of time-delay coordinates with the same embedding dimension, and the spatial neighborhood of sample point $y_t$ was calculated within the feature space spanned by $y$ itself. Then the state transition frequency matrix is defined by
$$M = (m_{ij}), \quad (4.3)$$
where $m_{ij}$ is the number of transitions from state $i$ to state $j$ in $s'$.
5. The likelihood of the candidate controller is defined by the multinomial distribution of a transition frequency generated by the unknown controller showing the transition probability of the pointwise dimension estimator. Specifically, the log-likelihood function of the candidate controller with the goal angle parameter $\theta$ is given by
$$\log L(\theta) = \log C + \sum_{i,j} m_{ij} \log p_{ij}, \quad (4.4)$$
where $C = \left(\sum_{i,j} m_{ij}\right)! \big/ \prod_{i,j} m_{ij}!$ is the multinomial coefficient for the multinomial distribution. To obtain comparable log-likelihoods for trajectories of unequal length, the log-likelihoods should be normalized by the numbers of state transitions, that is, $\log L(\theta) \big/ \sum_{i,j} m_{ij}$.
Repeat steps 4 and 5 for other candidate controllers with different parameters.
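Steps 2 to 5 and the final repetition over candidates can be sketched in a few lines. This is a minimal illustration that assumes the state sequences have already been obtained from a pointwise dimension estimator; the function names, the toy sequences, the candidate goal angles, and the small smoothing constant `eps` (added to avoid taking the logarithm of zero) are assumptions of this example rather than part of the described method.

```python
import math
import numpy as np

def to_states(posteriors):
    """Step 2: index of the most likely distribution for each sample point."""
    return np.argmax(np.asarray(posteriors), axis=1)

def transition_counts(states, n_states):
    """Count transitions from state i to state j along a state sequence."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    return counts

def joint_transition_probs(states, n_states, eps=1e-6):
    """Step 3: joint transition probabilities of the demonstrated sequence.

    Assumption: eps-smoothing keeps unseen transitions from giving log(0).
    """
    counts = transition_counts(states, n_states) + eps
    return counts / counts.sum()

def normalized_log_likelihood(sim_states, probs):
    """Steps 4 and 5: multinomial log-likelihood per state transition."""
    m = transition_counts(sim_states, probs.shape[0])
    n = m.sum()
    # Log of the multinomial coefficient n! / prod(m_ij!).
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in m.ravel())
    return (log_coef + float(np.sum(m * np.log(probs)))) / n

def choose_goal_angle(demo_states, candidates, n_states):
    """Score every candidate and return the best goal angle with all scores."""
    probs = joint_transition_probs(demo_states, n_states)
    scores = {
        angle: normalized_log_likelihood(states, probs)
        for angle, states in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy data: two states, two candidate goal angles.
demo = [0, 1, 0, 1, 0, 1, 0, 1]
candidates = {0.0: [0, 1, 0, 1, 0, 1], 1.0: [0, 0, 0, 1, 1, 1]}
best, scores = choose_goal_angle(demo, candidates, n_states=2)
```

In this toy case the first candidate alternates between states in the same way as the demonstration and therefore receives the higher normalized log-likelihood. Because the score is divided by the number of transitions, candidate sequences of different lengths remain comparable.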
This method is designed to abstract away differences in the absolute value of the pointwise dimension at each step between the two systems and compute similarities in the temporal change in the relative degrees of freedom. In our simulations below, the imitator is expected to generate an action trajectory controlled by the goal angle that maximizes the log-likelihood function, equation 4.4.
We have two remarks on the procedure described above. First, in step 4, the spatial neighborhood of each sample point was calculated within the feature space spanned by the simulated trajectory itself so as to abstract away differences in the physical properties of the pendulums, such as differences in the absolute values of the positional data caused by different pendulum lengths for the imitator and demonstrator. In the previous section, the neighborhood was calculated within the space spanned by the training data for the dimension estimator. However, this is valid only for the recognition task, in which we assumed the same physical properties for the two demonstrators. Second, in this method, we constructed the transition probability matrix from a demonstrated trajectory and the transition frequency matrix from a simulated trajectory. In terms of imitation or mimicry in biology, the demonstrator is actually the “model” or “exemplar” to be imitated by the imitator or “mimic”; hence, the mimic's frequency matrix can be seen as a realization of the model's probability matrix. However, the opposite view is also acceptable: the imitator forms a hypothesis (a probability matrix) and tests it using a realization (a frequency matrix) by the demonstrator. The same result will be obtained using either view because similarity is a symmetric measure between the demonstrator and imitator, or the model and mimic. In the proposed method, we take the former view for the practical reason that the dimension estimation is computationally expensive and is then required only once. This is in contrast to the latter view, which requires as many estimations as there are hypotheses.
4.3 Results of Action Completion
In this simulation, we assumed that the demonstrator had a controller with a particular goal angle but failed to meet the goal because of the infeasible region. Thus, we set the ground truth for the estimated goal angle to the latent goal angle of the controller used by the goal-failed demonstrator. In the action completion task, the imitator is expected to generate an action that matches the latent intention of the demonstrator. The imitator employs the action completion model described above to produce the action most likely to display DoF dynamics similar to the demonstrator's.
To what extent does this action completion model depend on the demonstrator's and imitator's pendulums having the same physical parameters? To examine the robustness of the model against deviations from the identical physical setting, we analyzed the same action completion task when the imitator controls a pendulum whose mass or length differs from the demonstrator's. In both cases, we obtained essentially the same results (green and blue points in Figure 10) as those in the same-pendulum condition (red points in Figure 10): the two groups of log-likelihood values differed significantly on average in each condition. That is, even with physically different pendulums, the imitator can differentiate between the two general types of intentions (i.e., swing-up versus swing-up-no-hit). Thus, by using DoF dynamics as an indicator of similarity in movements, the imitator can successfully abstract away differences in physical features of the two pendulums. Note that with two physically different pendulums, there is no “ground truth” for goal inference, and the imitator's motor controller has no way of producing a movement that is exactly the same as the demonstrator's. Thus, in these different-pendulum conditions, it would be difficult for a simple movement-matching strategy to reproduce some unseen action performed by the demonstrator.
4.4 Goal Inference via Inverse Reinforcement Learning
One of the standard techniques used to infer a goal from observed data on an action is inverse reinforcement learning (IRL) (Ng & Russell, 2000). In IRL, the reward function is inferred from a large number of state-transition sequences based on the assumption that those sequences were sampled from an unknown Markov decision process. We adopted one IRL algorithm (Ziebart et al., 2008) that has been frequently reported in the IRL literature to be robust and efficient. The basic idea of IRL is often represented as frequency matching: in general, IRL algorithms estimate a higher reward for a more frequently visited state. Because the pendulum swing-up task is quite common in reinforcement learning (Sutton & Barto, 1998; Doya, 1999), we adopted the simplest state-space discretization method, called tile coding. The state space is divided into equally spaced tiles. Time series are sampled every three time steps.
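The preprocessing described here can be sketched as follows: a uniform grid over angle and angular velocity, subsampling every three time steps, and the state-visit frequencies that a frequency-matching view of IRL relies on. The number of tiles per axis, the value ranges, and the function names are assumptions chosen for illustration only, and the sketch covers the discretization alone, not the IRL algorithm of Ziebart et al. (2008).

```python
import numpy as np

def discretize(angle, velocity, n_bins,
               angle_range=(-np.pi, np.pi), velocity_range=(-8.0, 8.0)):
    """Map (angle, angular velocity) pairs to tile indices on a uniform grid.

    Assumption: n_bins tiles per axis and the two ranges are hypothetical.
    """
    a_edges = np.linspace(angle_range[0], angle_range[1], n_bins + 1)
    v_edges = np.linspace(velocity_range[0], velocity_range[1], n_bins + 1)
    # digitize returns 1-based bin numbers; clip keeps boundary values in range.
    a_idx = np.clip(np.digitize(angle, a_edges) - 1, 0, n_bins - 1)
    v_idx = np.clip(np.digitize(velocity, v_edges) - 1, 0, n_bins - 1)
    return a_idx * n_bins + v_idx

def subsample(series, step=3):
    """Keep every `step`-th sample of a time series."""
    return np.asarray(series)[::step]

def visit_frequencies(states, n_states):
    """Relative frequency with which each tile is visited."""
    counts = np.bincount(states, minlength=n_states)
    return counts / counts.sum()

# Hypothetical samples of angle (rad) and angular velocity (rad/s).
angles = np.array([-3.0, 0.1, 3.0])
velocities = np.array([-7.9, 0.1, 7.9])
states = discretize(angles, velocities, n_bins=4)
freqs = visit_frequencies(states, n_states=16)
```

With four tiles per axis, the three samples fall into tiles 0, 10, and 15. An IRL method based on frequency matching would tend to assign higher reward to tiles with larger entries in `freqs`.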
Inspired by the psychological experiment conducted by Warneken and Tomasello (2006), we designed a minimal simulation framework to account for the mechanism of action recognition and action completion. We showed that the simulated imitator can discriminate between goal-failed and goal-achieved actions, which have apparently similar movements but different intentions and goals (simulation I). Then we proposed an action completion model that can perform an action comparable to the optimal action for the swing-up task simply by observing the goal-failed action, which is suboptimal with the pendulum in the infeasible region (simulation II). Both recognition and completion can be the basis of goal inference from an unsuccessful demonstration.
In these two simulations, we used DoF dynamics in actions, or a type of abstraction of bodily movements using a dynamical invariant, as a feature of the underlying motor controllers. With this abstraction, the obtained DoF dynamics can effectively ignore apparent positional variation among observed movements while extracting the dynamical/mechanical characteristics behind the movements. Our simulations comparing action completion based on DoF dynamics versus the frequency of spatial/positional states (see Figure 10 versus Figure 13) suggested that our abstraction to DoF dynamics allowed the imitator to identify a range of controllers (parameterized by the goal angle) including the optimal controller for the demonstrator's latent goal. Our additional simulations (see Figure 11) suggested that this abstraction may not allow the imitator to exactly identify the demonstrator's goal. We consider that this limitation of our proposed method is acceptable, as even we humans cannot infer another's hidden goal exactly but can rather identify the general direction of the demonstrator's intended goal. For example, consider the case in which you see a man kicking a closed door many times while both of his hands are full. You may think that he wants to open the door. But how can you infer exactly where he is heading after he goes through the door? It is difficult for you to determine this without any prior knowledge of his goal. In our simple pendulum simulations, the imitator successfully inferred that the demonstrator wanted to go beyond the infeasible bounds (the door) but could not exactly identify the demonstrator's goal angle (where he is heading). Given the theoretical results of this letter, namely that DoF dynamics are effective in both action recognition and action completion, we predict that this feature will also play a crucial role in understanding human action and imitation. This hypothesis will be tested in future work.
Finally, we add two remarks on why a dynamical invariant is effective for completing a goal-failed action compared with existing approaches. First, our approach using a dynamical invariant does not presume any kind of optimality of observed actions, whereas existing approaches, such as inverse optimal control (Wolpert, Doya, & Kawato, 2003) and IRL (Ng & Russell, 2000), do. This difference between the approaches is crucial, because the observed action, to be completed, failed in its original goal in our task.
Second, our approach using a dynamical invariant is likely to be useful for estimating the point-to-control underlying the goal-failed or goal-achieved actions. In general, bodily movements need to be more carefully controlled when the movement is at a state closer to the final goal. Consider reaching, for example. Finer control is needed near the point to be reached, more so than at the beginning of the action. This necessary control gain may be reflected by the observable granularity in the fluctuation of an action and can be quantified using a type of fractal dimension of trajectories.
As Bernstein (1996) pointed out, while the DoF of our bodies can be a source of flexibility in our movements, generally a system with a very large number of DoF is likely to be intractable. Therefore, organisms must reduce their body's DoF to be tractable (Bernstein, 1996). We predict that a reduction in DoF is crucial especially when accuracy of movement is required or when one is close to the task goal state. Thus, we speculate that a dynamic decrease or increase in DoF might inform the imitator about whether the current state of the observed system is near or far from the unknown task goal state. Specifically, the goal-failed actions we adopted in this letter can have their own characteristic dynamic pattern of DoF—for example, the different ways by which the pendulum touches the boundaries of the infeasible space. Therefore, our approach can successfully infer the hidden goal of an observed action, even if the observed action is suboptimal and goal-failed.
A review of IRL (Zhifei & Joo, 2012) proposed that learning from goal-failed, imperfect, or incomplete demonstrations is a challenging new problem in related research fields. Our approach based on a dynamical system is expected to bring new insights to this class of problems.
The proposed model, at least in a minimal physical model such as a pendulum control task, is reasonably effective for an action completion task. Although we hypothesized that DoF can be a commonly effective feature for goal inference, the evidence for this claim (i.e., the results from our simulation models) is limited, as we supposed that the simple pendulum was a physical body and that we had a sufficient amount of training data, among other assumptions. Given that the simple pendulum is a mechanical system with only one DoF, this assumption greatly simplifies the problem of imitation, which may require a large number of DoFs to control the body. In imitating systems with high DoF, an ill-posed problem, in which there are multiple different ways to achieve the same goal, is another fundamental problem that we have not addressed in this letter. Furthermore, we only quantitatively studied simple, single-stage goal-directed actions with no need for explicit subgoals, as opposed to complex actions composed of multiple stages with the need for explicit subgoals. Whether DoF can be generally effective, and how it can be effectively exploited for goal inference from complex actions of high-DoF systems, should be explored by extending the current work to more complex action-generating systems.
We thank the anonymous reviewer. This work was supported by JSPS KAKENHI grants JP 20H04994 and JP 16H05860.