## 1 Introduction

In recent years, great advances have been made toward operation in remote environments through the use of avatars or robotic surrogates. The idea that humans can remotely operate a surrogate is very appealing, but for many reasons, both technical and perceptual, this type of operation has not yet been fully realized. However, many steps have recently been taken toward such telepresence and telexistence technologies. For example, augmented reality (AR) and virtual reality (VR) are becoming increasingly common means of providing visual feedback from a remote location, such as in Martinez-Hernandez, Boorman, and Prescott (2015); Fritsche, Unverzag, Peters, and Calandra (2015); Almeida, Patrao, Menezes, and Dias (2014); and Theofilis, Orlosky, Nagai, and Kiyokawa (2016).

Despite these advances, perceptual problems still interfere with these kinds of interfaces becoming a reality. The usability of the interface, intersensory conflict due to different form factors of humans and robots, and power-control mismatches are all still unsolved problems. Though much research has been done on teleoperation to date, only now are interfaces becoming available that actually allow us to enter the body of the surrogate and control its actions using a head-coupled stereo display.

In this article, our primary goal is to study and improve one of these control mechanisms: head motion. In bidirectional interfaces, vision is perhaps the most important factor in generating intersensory conflict, because the head is extremely sensitive to latency and to degraded or mismatched visual information, especially during rotations or turns. Our contribution in this work is to study exactly how head motion and user perception are affected when humans perform control tasks in remote environments through a surrogate. Other studies on latency reduction are often conducted only with a monitor, provide only a monoscopic view of the environment with no stereo depth cues, or are nonimmersive. We wanted to understand exactly how much influence latency would have on head control, and of what kind, if any.

Various differences in human anatomy and robot physics and mechanics, latency, and mismatched head rotation can all result in uncomfortable sensations for the teleoperator, nausea, frustration, and even damage to the robot (Draper, Viirre, Furness, & Gawron, 2001). Overcoming these challenges is a significant step toward building applications for remote manipulation and increasing the sense of presence for teleoperation. The problem of latency is compounded in the case of remote control since a remote user must first send a movement command to the robot via the network, the robot then executes the movement, the resulting stereoscopic camera images are then captured and sent back to the user, and those images finally have to be rendered on the local display (see Figure 1).

Figure 1.

Teleoperator system with bidirectional control (left) and a stereoscopic image through the surrogate interface using panoramic reconstruction and latency reduction.


Throughout the whole experiment, we recorded performance metrics such as deviation from target, head movement, discrepancies between robot and human head rotations, and time to completion. These were calculated programmatically either using the time stamps and logs from the robot and VR system, or from post-processing using computer vision and fiducial tracking on the recorded eye-perspective videos. We also administered a subjective questionnaire on perceptual experiences and metrics such as fatigue at the end of every trial. Results revealed many interesting tendencies both for performance and perceptual affordances, and several differences from existing work on latency compensation and haptic performance, which are described in the Experiments and Results sections later. Next, we describe prior work and other interfaces that have been developed for different types of robotic control.

## 2 Background and Related Work

Since the advent of modern robotics, operators have struggled to control visuomotor systems in an effective and intuitive way. Most initial work on robotic control focused on static screen-based interaction, which afforded the operator several ways to see through the robot's view and relay commands.

One such framework was outlined by Brooks (1986), which provided a general layered format with which to robustly control a remote, mobile robot. Researchers then started to focus on more specific problems in teleoperation such as delay and control mechanisms. For example, in 1991, Bejczy, Venema, and Kim (1991) proposed using a predictive simulator to compensate for delays when visualizing the future actions of a robot. Another example is the predictive display proposed by Rachmielowski, Birkbeck, and Jägersand (2010) in which participants manipulated a haptic device under delayed and predictive conditions, though head movement and immersion were not studied. Other strategies such as collaborative control have been employed to improve driving actions in multi-robot scenarios such as that of Fong, Thorpe, and Baur (2003). NASA also developed a method for controlling robotic arms using the Oculus Rift DK2 and an Xbox Kinect using wireframe overlays (NASA, 2013). For more fine-grained control, Marin et al. (2005) have proposed multimodal interaction using voice, text, and direct control to improve remote operations. A number of mechanisms to reduce or simplify network latency have also been utilized, such as back-channelling (Jung et al., 2013). A similar example by Lee et al. (2015) called Outatime proposes an algorithm for improving latency of cloud-based games. This predictive set of algorithms reduced perceived latency for users even without the use of IMU-based sensors or positional timestamps. Though delay was still a significant factor, even remote teleneurosurgery has been performed recently (Meng et al., 2004). Panoramic reconstruction methods also exist for monocular camera setups, such as that of Cobzas et al. (2005). However, all of the methods mentioned above either have yet to be applied to AR/VR teleoperation, do not allow for stereoscopic viewing of the remote environment, or do not give us any information about how head control can affect performance. 
Our study focuses on overcoming all three of these limitations, and also provides subjective measures such as fatigue, nausea, and presence.

After the general concept of remote robot control was introduced, several researchers also proposed the use of AR and VR to improve control mechanisms by improving visualization of the physical state of the remote robot. One such system was developed by Milgram et al. (1993). They proposed using graphic overlays as a method to improve depth judgments during remote operation. Though their system was still largely susceptible to network delay, this marked a significant step forward in the use of augmentative environments for improving robot visualization. Several years later, a cockpit-type system was developed that allowed for full control of a remote humanoid robot with stereoscopic image relay. The system, developed by Tachi et al. (2003), employed a full-body cockpit that even allowed for fine-grained control of hand and finger movements. This type of system in particular would benefit from the low latency provided by our panoramic view since good coordination and minimal disorientation are key requirements for the teleoperator.

Such interaction methodologies have also been proposed for human–human tele-assistance, such as the system proposed by Adachi et al. (2005). Another system, which is probably the most similar to our own, was that of Fiala (2005). This method generates a panoramic image from a single 360-degree catadioptric camera to view a remote environment through a small mobile robot. Though this helps deal with latency, it was implemented only for a monoscopic camera with a relatively small field of view (FoV) HMD, and 360-degree cameras are not always present on humanoid robots. Building on the research by Milgram et al. (1993), Hashimoto, Ishida, and Inami (2011) further developed the concept of using AR overlays for improved manipulation of a remote robot in 2011. Though the interaction with the robot was conducted via a 2D static monitor and the evaluation did not compare against a non-AR-assisted task, this study further motivates the use of AR interfaces for robotic control.

With the introduction of affordable VR headsets, VR interfaces to robots have become more common, and have been implemented as prototype systems. One such prototype is the driving embodiment system proposed by Almeida et al. (2014). They tested the interface for interaction using an RGB-D sensor, but also found that visual feedback delays and limited field of view required mental compensation on the part of the user. Okura et al. (2014) developed a free-viewpoint system that allows the user to navigate a 3D space reconstructed by depth-based images. Still, this method requires the presence of a depth camera, is subject to artifacts and image quality limitations, and final robot movement does not follow head movement.

A more recent design by Martinez-Hernandez et al. (2015) incorporated the use of a fully immersive wide field of view display both for control and streaming of a remote humanoid robot viewpoint. Even more recently, Fritsche et al. (2015) added the ability to manipulate objects using this same technique by incorporating a haptic SensorGlove into a VR interface. Soon after, the same group added tactile sensations as a form of feedback for sensing grip or touch in the remote location (Weber et al., 2016). As of 2017, a commercial product called the Hi5 VR Glove (Noitom Hi5 VR Glove, n.d.) is available on the commercial market, which could be used in a similar fashion to control the hands and fingers of a remote surrogate. The implementations described above explore the use of remote manipulation, but do not study or take into account the effects of latency in their designs or experiments.

Some research has targeted the improvement of robotic control, such as the Bayesian model proposed by Calandra (2017), which could be used to improve both remote teleoperation and fully autonomous control. Though these designs represent good initial steps towards embodiment, intersensory conflict induced by latent images is still present, affecting user actions and perception. In order to determine what effects this rendering latency may have on the operator, we have designed a system to control latency and experiments to test its effects in remote control tasks.

## 3 System Framework

### 3.1 Hardware

The hardware used in our system consists of a number of different parts, including the iCub robot and its server, the Oculus Rift DK2 VR headset, the server to run the Oculus, and an intermediary laptop for the teleoperator.

The iCub humanoid robot (Metta et al., 2008) is used primarily to study embodied cognition in artificial systems and for human–robot interaction. The robot's software is based on the YARP (Metta, Fitzpatrick, & Natale, 2006) robotic middleware, which facilitates, among other things, communication between the nodes of the distributed system. For our system, the movement range for the three axes of the head of the robot (pitch, roll, yaw) was, respectively, (−30, 22), (−20, 20), and (−45, 45) degrees, and the joints were set in direct position control mode; that is, there was no minimum-jerk trajectory generator. While the possible movement range of the joints of the robot can exceed the above values, safety guards were in place to ensure that the human teleoperator did not exceed the angular limitations of the neck joints.

### 3.2 Software

The software of our framework includes the Oculus Rift DK2's API as well as the camera rendering setup. In order to display the iCub camera video streams and render virtual objects, we used the lightweight Java gaming library (lwjgl—http://www.lwjgl.org). This allows us to render video streams from the cameras as textures, move them in the 3D virtual environment as necessary, and also render virtual objects on top of the see-through components for AR/MR assistance. Stereo calibration of the two camera planes was accomplished initially by the iCub's calibration module and then by making fine, manual adjustments to the size and position of the rendered planes. The rendering process for these textures has been hand-optimized so that the camera streams, gaming library (including the barrel undistortion and aberration correction for the Oculus Rift), and reconstruction framework can run in real time at over 30 fps. Note that the actual frame rate of the display was still approximately 75 Hz since the Oculus framework did not wait to have a new iCub camera frame before reprojecting in virtual space using the IMU. For standardization and ease of replication of this study, we decided to utilize the internal iCub cameras to prevent other researchers from having to modify and recalibrate an external camera pair.

For the direct streaming implementations, eye camera images carry no information about their orientation at the time of capture, and the images are rendered directly in front of the user's eyes, even if they do not correspond to the robot's real, current view. At the same time, the game world is rendered using only the current camera objects. For the panoramic reconstruction view, the image from each of the robot's eyes that is sent to the network is coupled with the orientation of the robot's head at the time of capture in Tait-Bryan angles, the exact flow and latencies of which are described in more detail in Theofilis et al. (2016). In addition, the main bottlenecks for latency in this system are the time required for actuation and movement of the robot and image acquisition and encoding from the eye-cameras. Network latency accounts for only a small portion of this pipeline, and it is important to note that latency is not a function of bandwidth in this case. The video compression was M-JPEG, transferred through UDP, and the processing time outside of transfers was typically less than 5 ms.
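The pairing of each eye image with the head orientation at capture time can be sketched as a small stamped-frame structure. This is an illustrative sketch only; the names (`StampedFrame`, field names) are hypothetical and not from the actual YARP-based implementation.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class StampedFrame:
    """An eye-camera image paired with the robot head pose at capture time."""
    pixels: List[List[int]]           # placeholder for the RGB image matrix
    pitch: float                      # Tait-Bryan angles of the head (degrees)
    roll: float
    yaw: float
    t_capture: float = field(default_factory=time.time)

# On the robot side, each outgoing frame is stamped before transmission;
# on the teleoperator side, the stamped pose tells the renderer where in
# the panorama the latent frame belongs.
frame = StampedFrame(pixels=[[0] * 4 for _ in range(3)],
                     pitch=-5.0, roll=0.0, yaw=12.5)
```

In direct streaming mode no such stamp is used, which is precisely why the latent image cannot be placed correctly relative to the user's current head rotation.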

We measured the motion-to-photon latencies of the direct and panoramic modes using a camera recording of real and virtual worlds and found the average latencies to be 702 ms and 55 ms, respectively. Network latencies only compose a small portion of these times, but the mechanical latencies of the nature of the drive train of the iCub robot (Theofilis et al., 2016) contribute additional time.

### 3.3 Data Flow

The first step in setting up our software framework was initializing bidirectional communication between the iCub and Oculus Rift through YARP. Across the whole system, data is sent and received through the framework as follows:

• iCub-side input:

• Oculus rift pose data: to control iCub head

• iCub-side output:

• Left and right eye RGB image

• Current head pose (synchronous with images)

• Teleoperator-side input:

• Latent iCub left and right eye image

• Head pose at time of image retrieval

• Teleoperator-side output:

• Oculus Rift user's head pose
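The round trip through the flows above can be sketched with two simple message queues standing in for the YARP ports. All names here are hypothetical; this is a minimal simulation of the data flow, not the actual middleware code.

```python
from collections import deque

# Simplified message queues standing in for YARP ports.
to_robot = deque()      # teleoperator -> iCub: head pose commands
to_operator = deque()   # iCub -> teleoperator: (images, head pose) messages

def operator_send_pose(pose):
    """Teleoperator-side output: the Oculus Rift user's head pose."""
    to_robot.append({"pose": pose})

def robot_step():
    """Robot side: consume the most recent pose command, then reply with
    left/right eye images stamped with the current head pose (assumed here
    to have reached the commanded target)."""
    cmd = to_robot.pop()    # latest command
    to_robot.clear()        # drop any stale commands
    to_operator.append({"left": "L-img", "right": "R-img",
                        "pose": cmd["pose"]})

operator_send_pose((0.0, 0.0, 15.0))    # pitch, roll, yaw in degrees
robot_step()
reply = to_operator.popleft()
```

The key property mirrored here is that the images arriving on the teleoperator side are latent: they carry the head pose at image-retrieval time, not the pose the user's head currently has.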

### 3.4 View Reconstruction

The key to the panoramic reconstruction is decoupling the user's head movement from the current robot eye window. In traditional direct streaming methods, the image presented to the teleoperator will be the same, despite user head movement, until the latent images catch up to the teleoperator's current head rotation, as denoted on the left-hand side of Figure 2. This means that the same image is presented while the user's head is moving, which generates intersensory conflict. A majority of perceived latency relative to a head-worn display is due to rotation since the greatest shifts of the virtual image plane over the retina happen during head rotation. Since the human head does not typically experience severe translations over small time increments, translations and hence parallax were considered negligible for the purposes of this experiment. A simple way to understand how the panoramic view works is to imagine a large pre-generated panorama plane that sits in front of a user in a static position in immersive VR. By static position, we mean that the orientation of the user's head is not coupled to the panorama, letting the Oculus compensate for rotations.

Figure 2.

From left to right: Direct Streaming showing the incorrect rotation of a frame (from orange to red) after a head rotation in a direct streaming implementation (top) and a respective pose diagram (bottom). Panoramic reconstruction showing the reconstruction window in black with current reconstructed data visible in the Rift window (top) and respective layout diagram showing the correctly rotated perspective relative to the latent camera frame (orange frame, bottom). Frame Calculation and Panorama Segment showing a completed reconstruction through the Oculus viewport (top) and the pose label diagram showing matrix and transform labels (bottom).


For this case, the time between a user head movement and the updated image (camera image, not rendered image) is primarily limited by inconsistencies between the frame time stamp and image sensor output on the robot side, although rendering speed and time taken to re-read the image texture from the reconstruction also slightly influence rendering. If the next rendered frame of the display had not received a new video frame at that time, the reprojection algorithm continued to reproject the previous camera frame in its world relative location using the IMU coordinates provided by the display, simulating a higher frame rate. Resolutions in pixels are 640 × 480 for each iCub camera, 960 × 1080 for each Oculus Rift DK2 eye, and 1920 × 1440 for the reconstruction. When designing this strategy, we drew from reconstructive methods such as that of Gauglitz et al. (2012). The panoramic image is generated by our reconstruction functions in the same way, but portions of the panorama corresponding to incoming frames from the iCub are updated as they are received, as shown in the right-hand images of Figure 2. This way, the scene available to the user is much wider than the FoV of the robot's cameras, though it still suffers from minor perspective artifacts.

During startup, the panoramic image shows a center frame (corresponding to (0, 0, 0) pitch, roll, and yaw, respectively, of the iCub head), as outlined by the orange box in the top-left image of Figure 2, prior to generating the reconstruction. Since the borders of the panorama are initialized as a black frame, we start a “warm-up” phase that moves the iCub's head to the four corner points near the limits of its joints to complete an initial reconstruction. After this period, control is relinquished to the user, and he or she will then be able to control robot head movements using his or her own head motions. Creation of the panorama (rightmost images in Figure 2) as new data is received is described as follows.
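The warm-up sweep described above can be sketched as generating the four corner poses just inside the head's joint limits. The limits are those stated in Section 3.1; the function name and the margin value are illustrative assumptions.

```python
# Joint limits of the iCub head used in this study (degrees).
PITCH = (-30, 22)
YAW = (-45, 45)

def warmup_corners(margin=2.0):
    """The four corner poses (pitch, roll, yaw) of the warm-up sweep,
    kept a small safety margin inside the joint limits."""
    return [(p, 0.0, y)
            for p in (PITCH[0] + margin, PITCH[1] - margin)
            for y in (YAW[0] + margin, YAW[1] - margin)]

corners = warmup_corners()
```

Driving the head through these four poses paints an initial image into every reachable region of the panorama, replacing the black initialization frame before control is handed to the user.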

The subregion to be updated for the current frame is denoted by the orange dotted line in the bottom-right image of Figure 2. The relative position of this frame to the panorama matrix $P_M$ is calculated as
$P_M\big(i - P_{Eye}(i),\, j - P_{Eye}(j)\big) = E_M(i,j)(t_L),$
where $P_{Eye}$ represents the eye pose of the latent frame at time $t_L$, and $E_M$ represents the eye camera image matrix. Each time a new frame is received, the panorama is updated on the next frame in the Oculus.
The target pose of the reconstruction texture $P_{MO}$ in the Oculus relative to head movement (the solid blue outlines in Figure 2) is represented by
$P_{MO} = P_M \times P_{Pcalib} \times P_O^{-1},$
where $P_{Pcalib}$ is the initial position of the image plane after stereo calibration and $P_O$ is the current pose of the Oculus. An objective measure of the sum of angular errors $\beta_{off}$ over the number of latent frames $N_L$, in spherical coordinates $\theta$, $\varphi$, can be measured with
$\beta_{off} = \sum_{i=1}^{N_L} \Delta(\theta, \varphi).$
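The pose composition for the reconstruction texture can be illustrated with rotation matrices. This is a minimal sketch under simplifying assumptions (yaw-only rotations, identity stereo-calibration offset); the helper name `rot_yaw` is hypothetical.

```python
import numpy as np

def rot_yaw(deg):
    """3x3 rotation matrix about the vertical (yaw) axis."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

P_M = rot_yaw(10.0)      # pose of the latent panorama texture
P_Pcalib = np.eye(3)     # stereo calibration offset (identity for this sketch)
P_O = rot_yaw(25.0)      # current pose of the Oculus

# Target pose of the reconstruction texture relative to head movement:
# the texture counter-rotates against the user's head so the panorama
# stays world-fixed.
P_MO = P_M @ P_Pcalib @ np.linalg.inv(P_O)
```

With an identity calibration this composition reduces to a net yaw of 10 − 25 = −15 degrees, i.e., the panorama appears rotated opposite to the head's motion, which is what keeps it stationary in the world.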

The portion of the panorama that is updated corresponds to angular values which we have embedded into each of the images sent from the iCub. These image and head position data pairs are received on the teleoperator client side, and the correct portion of the panoramas for both left and right eye data are buffered and written. The subimage of the panorama that is updated corresponds to the position of the iCub at the time the last pixel was pulled from the eye camera images.
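The buffered subimage write described above can be sketched as a simple array assignment at the offset derived from the stamped head pose. The function name and the angle-to-pixel mapping (here collapsed into a precomputed pixel offset) are assumptions of this sketch.

```python
import numpy as np

def write_subimage(panorama, eye_img, offset):
    """Overwrite the panorama subregion corresponding to a latent frame.
    `offset` is the (row, col) pixel position derived from the head pose
    stamped on the frame at capture time."""
    i0, j0 = offset
    h, w = eye_img.shape
    panorama[i0:i0 + h, j0:j0 + w] = eye_img
    return panorama

# Resolutions from this study: 1920 x 1440 reconstruction, 640 x 480 eye image.
pano = np.zeros((1440, 1920), dtype=np.uint8)
eye = np.full((480, 640), 255, dtype=np.uint8)
write_subimage(pano, eye, (480, 640))   # the centered (0, 0, 0) pose region
```

One such write is performed per received frame for each of the left and right eye panoramas, so only the region the robot most recently looked at is ever stale by more than one frame.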

Note that if the teleoperator continues to move his or her head at this point, he or she will be able to view a different part of the panoramic view despite not having received new image data. This is the main difference from other implementations that have to wait for a new frame. Similar methodology has been used to compensate for rendering lag in older head-mounted display systems, such as the work by Kijima and Ojika (2002), though this has not been applied to stereoscopic humanoid systems. Although a number of other methods can provide 3D reconstructions of the environment that account for translations, they still often have artifacts, reconstruction holes, and resolution limitations. Moreover, the main source of intersensory conflict in HMD systems is almost always the turning of the head since the relative speed of the head rotation far exceeds that of translation. Panoramic reconstruction also accounts for misalignments between the teleoperator's head movement and the actual movement of the robot head. Though recent improvements in the iCub's control software can compensate for most of these changes to some extent, other robots may not have the same fine-grained control mechanisms in place.

### 3.5 Pilot Testing

To gather initial feedback and to decide what types of tasks would be appropriate, we first conducted a demo session with 15 users to test the interface. Users tried both modes for a total of approximately 5 minutes, always starting with direct streaming. Some of the more relevant comments included:

• The panoramic view gives the perception of a circular space, whereas direct mode appears flat.

• The direct mode appears to be more consistent than the panoramic mode, but a delay is noticeable.

• When trying to gaze at certain targets, most participants also observed some overshooting in direct mode.

• Perception of depth was enhanced in panoramic mode.

• Four participants commented that they felt nauseous in direct mode, but that the feeling was reduced after switching to panoramic mode. A majority of participants mentioned they felt more comfortable with panoramic mode compared to direct.

These comments give us good subjective evidence that the system would likely alleviate intersensory conflict. Moreover, as head movement and focus were the main topics of the comments, we designed our primary experiments around a controlled user study targeting head/gaze control.

## 4 Primary Experiments

In monitoring operations like factory work or face-to-face customer service, direct eye contact with a target is essential. As such, the general goal of our experiments was to study the effects of latency on head control during remote teleoperation, both physically and perceptually. These results should be used to help design future iterations of systems like those from Martinez-Hernandez et al. (2015) and Theofilis et al. (2016), where remote operation and latency may have adverse effects on user performance and perception, especially with a stereoscopic setup. To determine how much this latency would affect users, we designed the following experiments.

### 4.1 Study Design

Our study was primarily focused on two metrics, performance and perception, as methods of evaluating VR teleoperation and head control in a surrogate robot. To test these, we designed two scenarios in which participants were asked to move their heads in a controlled, specific manner. The first was the static task, shown on the left of Figure 4, where the participant moved his or her head from one number to the next in sequential order. This was designed as a warm-up phase to allow the participant some practice with that particular mode. The second task was a dynamic following task, where the participant had to keep a purple target as close to the central field of vision (a reticle) as possible. A sample target is shown on the right of Figure 4.

As one performance metric, we evaluated the consistency of eye contact, which primarily factors in the distance from the surrogate's physical head position to the target in the remote environment. Moreover, such movements should be performed with minimal head movement on the operator's part, so sensor data from the display was also evaluated. A total of 19 individuals participated in the experiment, including 12 males and 7 females, with an average age of 28.9 (SD 4.82). Although 9 participants had at least some experience with AR or VR, none had consistently used any particular display for more than one hour per week.

### 4.2 Tasks, Conditions, and Procedure

An overview diagram of the physical setup and layout is shown in Figure 3. Participants first sat down and then read and signed the informed consent document. Next, the experimenter gave an explanation of both the display and the system designed to show the world through the eyes of the surrogate. Following this, the experiment tasks were described, but details of the implementation of each mode being tested were not given, to prevent bias.

Figure 3.

Diagram of the experiment environment from an overhead view showing the participant, experimenters, and remote surrogate. The monitoring tasks shown on the display in the lower right are the same as those shown in Figure 4.


The primary conditions that were tested were the direct mode (latent) versus the reconstruction (latency reduced). This would give us both objective and subjective measures of the effects of latency. Participants then ran through a static and dynamic task for each mode, with a small break in between to fill out the subjective questionnaire for that particular mode. A sample order would be: mode 1 (static $→$ dynamic $→$ questionnaire) $→$ mode 2 (static $→$ dynamic $→$ questionnaire). The starting mode was alternated between participants to prevent ordering effects. This gave us a 2 × 2 × 2 within-subjects design, with 2 display modes × 2 sets of tasks (static $→$ dynamic) × 2 speeds for dynamic.

Instructions for the static task were to move the reticle sequentially from 1 to 15. For the dynamic task, participants were simply instructed to keep the reticle as closely over the center of the purple target as possible. The static task lasted only 30 seconds and was followed by a break for up to 1 minute if the participant desired. Next was the dynamic task, which lasted 2 minutes, and was split into slow and fast speeds, each lasting one minute. The purple target shown in Figure 4 was moved randomly about the scene on a specified path, the direction of which was alternated per participant.

Figure 4.

Frames taken from videos of the remote monitoring tasks, showing the stationary task (left), where participants had to orient a reticle viewed through the surrogate's head over each number in order, and the dynamic following task (right), where participants had to control the surrogate's head to fit the purple moving target inside the reticle as closely as possible. Both images are taken directly through the surrogate's left eye camera.


To measure the subjective perceptions of the participants regarding both the interface itself and the basic sense of presence in the remote environment, we administered a subjective questionnaire immediately after each set of tasks for a particular mode. The survey questions on simulator sickness, rated on a 7-point Likert scale ranging from “none” to “severe,” included the following:

• how much fatigue is affecting you right now.

• how much nausea is affecting you right now.

• how much eye strain you have right now.

• how much discomfort you have right now.

Finally, we asked about the user's general sense of presence in the remote environment, also using a 7-point Likert scale but rated from “disagree” to “agree.”

• I felt like I was present in the virtual space.

### 4.3 Fiducial Tracking and Automated Data Collection

To automatically process and measure performance in our experiments, we developed a number of fiducial tracking strategies to assist with analysis. Before conducting the experiments, we affixed digital markers to parts of the scene to be able to track the scene in addition to the targets. For example, white outlines at the screen corners, which can be seen in Figure 4, were used to measure differences in head movement from the camera's perspective. First, we used thresholding to segment the blue circles from the rest of the scene, followed by a centroid calculation that gave us the center of each circle and hence differences in head position frame by frame.
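The threshold-then-centroid step can be sketched as follows. This is a pure-Python illustration of the principle with a hypothetical function name, not the actual computer-vision pipeline (which operated on color video frames).

```python
def centroid_in_band(img, lo, hi):
    """Centroid (row, col) of all pixels whose intensity lies in [lo, hi].
    Thresholding selects the marker pixels; averaging their coordinates
    gives the marker center."""
    pts = [(i, j)
           for i, row in enumerate(img)
           for j, v in enumerate(row)
           if lo <= v <= hi]
    n = len(pts)
    return (sum(i for i, _ in pts) / n, sum(j for _, j in pts) / n)

# A 5x5 toy "image" with a bright 2x2 marker blob in the lower right.
img = [[0] * 5 for _ in range(5)]
for i in (3, 4):
    for j in (3, 4):
        img[i][j] = 200
```

Tracking this centroid frame by frame yields the per-frame differences in head position from the camera's perspective.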

For the static case, we used the red backgrounds of each number as a fiducial marker, and then calculated the center of the entire board by reconstructing marker positions using the entire marker constellation. Even if several of the numbers were occluded from the scene, using a subset of the constellation still allowed us to find scene center.

To track the distance from the user's head to the target in the dynamic tasks, we used the same marker tracking strategy with the purple target shown on the right of Figure 4. The distance from screen center (taken as the surrogate's head position) was subtracted from the coordinates of the centroid of the purple target.

However, because of the user's vergence at approximately one meter, we had to correct for the stereoscopic viewpoint of each eye in the horizontal direction to get an accurate measure of the actual position of each eye relative to the target. To do so, we first ran the tracking on the entire video and computed the average deviation from the target for the entire task. This gave us a good approximation of how far the eye deviated from screen center over the course of the task. In other words, we corrected for the fact that each per-eye screen was calibrated to account for a vergence of one meter. Deviations in the vertical direction were negligible.
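The correction described above amounts to treating the task-long mean horizontal deviation as the fixed per-eye offset introduced by the 1 m vergence calibration and subtracting it out. A minimal sketch, with a hypothetical function name and deviations in pixels:

```python
def vergence_corrected(dx_samples):
    """Remove the constant horizontal offset introduced by the vergence
    calibration by subtracting the task-long mean deviation."""
    bias = sum(dx_samples) / len(dx_samples)
    return [dx - bias for dx in dx_samples]

raw = [12.0, 15.0, 9.0, 14.0, 10.0]   # per-frame horizontal deviations (px)
corrected = vergence_corrected(raw)
```

After correction, only deviation relative to the eye's own calibrated center remains, which is the quantity compared across modes.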

A small green target was also added to the lower portion of the screen to signify when the fast mode was engaged. Though synchronizing the data on the server side was possible, using a fiducial marker gave us better ground truth since the server occasionally dropped packets.

Finally, we ran an algorithm to synchronize the data logs from the surrogate server and display server. This gave us an additional layer of comparison to determine how much head movement actually differed between the local and remote environments. We also algorithmically synchronized the different sensor frequencies of the iCub and Oculus. From the data extracted using computer vision and sensor synchronization, we gathered head movement, static numbers completed, average time to completion, and vergence-corrected deviation from the disk for both speed conditions.
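The synchronization algorithm is not given in the article; one common approach, matching each sample of one stream to the nearest timestamp in the other, might be sketched as follows (the function name and stream layout are assumptions):

```python
import numpy as np

def align_nearest(ts_a, ts_b, vals_b):
    """For each timestamp in stream A, pick the sample of stream B
    with the nearest timestamp -- a simple way to merge sensors
    logged at different frequencies. ts_b must be sorted ascending."""
    ts_a = np.asarray(ts_a, float)
    ts_b = np.asarray(ts_b, float)
    idx = np.searchsorted(ts_b, ts_a)
    idx = np.clip(idx, 1, len(ts_b) - 1)
    left = ts_b[idx - 1]
    right = ts_b[idx]
    # Step back to the left neighbor when it is closer.
    idx -= (ts_a - left) < (right - ts_a)
    return np.asarray(vals_b)[idx]
```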

## 5 Results

Results are organized into three sections: static performance, dynamic performance, and subjective scoring, as described next.

### 5.1 Number Search Task Performance (Static)

This static task served primarily to get participants accustomed to a particular mode, so we did not expect much of a difference in performance. Metrics used for evaluation included time taken per number and total numbers found within the allotted 30-second window. On average, neither metric differed significantly between modes. Time per number for direct versus panoramic was 2.95 versus 2.98 seconds, with a paired t-test showing no significant difference ($t=-0.104$, $p=0.46$). Total numbers found showed a similar result, with an average of 10.25 versus 10.13 for direct versus panoramic, respectively; a paired t-test confirmed no significant difference ($t=0.15$, $p=0.44$).
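The analysis scripts are not included in the article; a minimal SciPy sketch of such a paired t-test, using hypothetical per-participant values (only the group means are published), might look like this:

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant mean times per number (seconds);
# the real per-participant values are not published.
direct    = np.array([2.7, 3.1, 2.9, 3.0, 2.8, 3.2, 2.9, 3.0])
panoramic = np.array([2.8, 3.0, 3.0, 2.9, 2.9, 3.3, 2.9, 3.1])

# Paired test: each participant experienced both modes.
t_stat, p_two_sided = stats.ttest_rel(direct, panoramic)

# The p-values reported above (e.g., p = 0.46 for t = -0.104) are
# consistent with one-tailed tests, i.e., roughly half the
# two-sided p-value.
p_one_sided = p_two_sided / 2
```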

### 5.2 Target-Following Performance (Dynamic)

For the target-following task, we expected more of a difference in performance. Metrics included head movement and deviation from the target over the course of the 2 minutes, which were also split into one-minute slow and fast sections. The results are summarized in Figure 5, which shows bar graphs of the average deviation from the target and head movement.

Figure 5.

Deviations from target (left-hand) and differences in head movement (right-hand) for both modes in the slow and fast conditions.


First, deviation from target differed between the direct and panoramic modes for the slow group. Average pixel deviation from target per frame was 29.77 for direct mode and 27.19 for panoramic, a 9.5% increase for direct. A single-factor analysis of variance (ANOVA) revealed a near-significant effect for slow ($F(1,34)=3.82$, $p=0.058$) but not for fast ($F(1,34)=0.035$, $p=0.85$). Rerunning this ANOVA for the slow condition on each valid frame, rather than on the average deviation per participant, revealed a strong effect of mode on deviation ($F(1,87299)=260.55$, $p≪0.01$). Example pursuit trajectories are shown in Figure 6, giving a visual representation of these differences.
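A single-factor ANOVA of this kind can be run with SciPy; the values below are synthetic per-participant means chosen to match the published group means (18 per mode gives the reported $F(1,34)$ degrees of freedom), so the resulting F and p will not match the published ones:

```python
from scipy import stats

# Hypothetical per-participant mean deviations (pixels), 18 per mode.
direct_devs = [31.2, 28.5, 30.1, 29.9, 28.0, 31.5, 29.3, 30.4, 28.8,
               30.9, 29.0, 30.6, 28.2, 31.0, 29.6, 30.2, 28.9, 29.7]
pano_devs   = [27.4, 26.8, 28.1, 26.5, 27.9, 27.0, 26.2, 28.3, 27.5,
               26.9, 27.2, 28.0, 26.4, 27.8, 27.1, 26.6, 27.7, 26.3]

# One-way ANOVA across the two modes; with two groups this is
# equivalent to an unpaired t-test (F = t^2). The per-frame rerun
# simply passes every frame's deviation as a sample instead.
f_stat, p_val = stats.f_oneway(direct_devs, pano_devs)
```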

Figure 6.

Plot showing about 10 seconds (left) and one minute (right) of trajectory data from a single participant. Notice the larger overshooting tendency for direct mode, especially toward screen borders.


Second, head movement was also reduced in panoramic mode relative to direct mode for the slow condition, with averages of 4.83 and 4.39, respectively, a decrease of 10.02%. ANOVA confirmed this difference for the slow condition ($F(1,34)=7.89$, $p<0.008$), but no significance was found for fast ($F(1,34)=0.87$, $p=0.36$). Interestingly, the histogram of deviations by size in Figure 7 shows that direct mode may be better for small deviations around 10–20 pixels in size, with a cutoff around 25 pixels beyond which performance is better in panoramic view.

As a method of confirmation, we ran a similar analysis on all of the logged data produced by the iCub's logging system and the IMU data pulled from the Oculus Rift. The data were synchronized using the transmission time stamps attached to each packet from the iCub and the receipt stamps attached to each Oculus packet. Using the pitch and yaw data taken from each stream, we also calculated the difference between the Oculus and iCub head orientations. Error (Euclidean distance per frame) between the devices was 2.11 degrees for direct mode and 2.02 degrees for panoramic, which also supports the data calculated using fiducial markers. Confirmation via ANOVA revealed a significant effect of mode ($F(1,36)=6.56$, $p<0.05$).
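The per-frame angular error described above can be computed directly from the synchronized (pitch, yaw) streams; a small sketch (function name and array layout are ours):

```python
import numpy as np

def mean_head_error(hmd_pitch_yaw, robot_pitch_yaw):
    """Per-frame Euclidean distance (degrees) between the HMD's and
    the robot head's (pitch, yaw), averaged over the synchronized
    streams. Both inputs are arrays of shape (n_frames, 2)."""
    d = np.asarray(hmd_pitch_yaw, float) - np.asarray(robot_pitch_yaw, float)
    return float(np.hypot(d[:, 0], d[:, 1]).mean())
```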

Figure 7.

Histogram showing the frequency of deviations with respect to their size over all frames. A trend line shows a clear exponential relationship between frequency and deviation.


Lastly, we tested whether participants were able to reduce their deviations over time (i.e., improve performance) over the course of the dynamic task. To do so, we plotted the deviation in each frame, along with a linear regression, in Figure 8, which shows a decrease in deviations over time only for the direct-slow condition and suggests that performance for the two modes may converge over time.
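A least-squares trend of this kind reduces to fitting a line to per-frame deviations; a negative slope indicates adaptation. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def deviation_trend(t, dev):
    """Fit dev ~= slope * t + intercept by least squares.
    A negative slope means deviations shrink over the task,
    i.e., the participant is adapting."""
    slope, intercept = np.polyfit(np.asarray(t, float),
                                  np.asarray(dev, float), 1)
    return slope, intercept
```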

Figure 8.

Changes in accuracy over time for direct and panoramic modes in slow and fast conditions. Linear regression lines are shown, which suggests a decrease in deviations was present only for the direct-slow condition.


### 5.3 Qualitative Results

Here we present the results of the subjective questionnaires gathered after each mode of operation, which are summarized in Figure 9. The most significant of these were differences in fatigue and discomfort between direct and panoramic operation.

Figure 9.

Bar graph showing a summary of the results from the subjective questionnaire, rated on a Likert scale from 1 to 7.


Fatigue was rated as 3.16 for direct and 2.21 for panorama, and discomfort was rated 2.79 for direct versus 1.63 for panorama. A near-significant difference was found for nausea, with averages of 1.67 and 0.41 for direct and panorama, respectively. These results were confirmed with nonparametric Kruskal–Wallis tests: ($\chi^2_{\text{Fatigue}}=4.61$, $p<0.05$), ($\chi^2_{\text{Discomfort}}=11.97$, $p≪0.01$), and ($\chi^2_{\text{Nausea}}=3.44$, $p=0.064$). Eye strain and, somewhat surprisingly, sense of presence were not found to be significant: ($\chi^2_{\text{EyeStrain}}=1.58$, $p=0.21$) and ($\chi^2_{\text{Presence}}=0.18$, $p=0.672$).
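A Kruskal–Wallis test on Likert ratings is a one-liner with SciPy; the ratings below are hypothetical (only the group means, e.g. 3.16 vs. 2.21 for fatigue, are published):

```python
from scipy import stats

# Hypothetical 7-point Likert fatigue ratings for each mode.
fatigue_direct = [3, 4, 3, 2, 4, 3, 4, 3]
fatigue_pano   = [2, 2, 3, 1, 2, 2, 3, 2]

# Nonparametric test appropriate for ordinal Likert data; the
# H statistic is the chi-square value reported in the text.
h_stat, p_val = stats.kruskal(fatigue_direct, fatigue_pano)
```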

## 6 Discussion and Future Work

Though no adaptation was evident for the fast condition, one interesting result is that some participants were able to adapt to the direct view in the slow condition, as evidenced in Figure 8. Other studies on adaptation in motor control tasks show improvements in response to force perturbations, as shown by Levy et al. (2010), and general motor adaptation during movement and balance tasks, as described by Babič, Oztop, and Kawato (2016). Though it is possible that users would adapt to the fast condition if given more practice, there may well be a threshold limiting adaptation to this interface. One benefit of the reconstruction is that this adaptation is unnecessary from the start. Another possible direction for future research would be to model the effect of increased latency over time. Since we can control the perceived latency in the display, performing a series of shorter tests at different latency increments might reveal a sort of “break point.”

Based on our overall results, latency reduction strategies that use a reconstructive component will likely provide a practical way to reduce the effects of mechanical and network latency on perception. Most likely, latency reduction is the primary mechanism for improving performance. However, our performance results for head control differ considerably from those of Rachmielowski et al. (2010), who found a 48% performance increase for haptic tasks, whereas we found no difference in our number search task. Additionally, participants were also presented with a wider FoV due to the panoramic reconstruction. This FoV may not only have influenced their perceptual ratings, but could also have enabled more freedom of movement of the head and eyes. As shown in previous studies, a wider field of view can improve both target discovery rates and head movement (Kruijff et al., 2018), which may also have affected performance in this experiment.

One more interesting discovery from informal testing was the feeling of presence when looking at the reflection of the surrogate in a darkened monitor. Much like looking at one's reflection in the mirror gives the immediate sensation of embodiment, similar reflection techniques can potentially increase embodiment for teleoperators.

As future work, we plan to study embodiment when the user has not only a visual representation of himself or herself in the remote environment, but also bidirectional control and mirrored eye contact. We would also like to add feedback mechanisms for the robot's physical body and limb positions to give the user a more accurate representation of the surrogate's resulting movements in the remote environment, both to improve control and to provide a better sense of embodiment.

## 7 Conclusion

In this article, we implemented a method for reducing perceived visual latency during remote robot teleoperation and studied its effects on operator performance and perception. A preliminary session with 15 participants showed that this strategy has the potential to reduce simulator sickness and improve the sense of presence and depth perception. A second, more thorough set of experiments showed that this type of reconstructive latency reduction had a moderate influence on head control for slower-speed following tasks, but no significant effect for faster conditions. On the other hand, stereoscopic latency reduction strongly influenced user perceptions of the remote environment, especially with regard to comfort.

## Acknowledgments

This work was partially supported by the MEXT/JSPS Grants (Research Project Numbers: JP24119003, JP24000012, JP24300048, and A15J030230), by the JST CREST “Cognitive Mirroring” (Grant Number: JPMJCR16E2), Japan, and by the United States Department of the Navy, Office of Naval Research, Grant N62909-18-1-2036. Many thanks to all 30+ participants for their time and feedback.

## References

, T., Ogawa, T., Kiyokawa, K., & Takemura, H. (2005). A telepresence system by using live video projection of wearable camera onto a 3D scene model. International Conference on Computational Intelligence and Multimedia Applications. Citeseer.

Almeida, L., Patrao, B., Menezes, P., & Dias, J. (2014). Be the robot: Human embodiment in tele-operation driving tasks. The 23rd International Symposium on Robot and Human Interactive Communication, 477–482.

Babič, J., Oztop, E., & Kawato, M. (2016). Human motor adaptation in whole body motion. Scientific Reports, 6.

Bejczy, A. K., Venema, S., & Kim, W. S. (1991). Role of computer graphics in space telerobotics: Preview and predictive displays. Proceedings of SPIE, 1387, 365–377.

Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1), 14–23.

Calandra, R. (2017). Bayesian modeling for optimization and control in robotics. Doctoral dissertation. Technische Universität.

Cobzas, D., Jagersand, M., & Zhang, H. (2005). A panoramic model for remote robot environment mapping and predictive display. International Journal of Robotics and Automation, 20(1), 25–33.

Draper, M. H., Viirre, E. S., Furness, T. A., & Gawron, V. J. (2001). Effects of image scale and system time delay on simulator sickness within head-coupled virtual environments. Human Factors: The Journal of the Human Factors and Ergonomics Society, 43(1), 129–146.

Fiala, M. (2005). Pano-presence for teleoperation. IEEE/RSJ International Conference on Intelligent Robots and Systems, 3798–3802.

Fong, T., Thorpe, C., & Baur, C. (2003). Multi-robot remote driving with collaborative control. IEEE Transactions on Industrial Electronics, 50(4), 699–704.

Fritsche, L., Unverzag, F., Peters, J., & Calandra, R. (2015). First-person tele-operation of a humanoid robot. 15th International Conference on Humanoid Robots, 997–1002.

Gauglitz, S., Sweeney, C., Ventura, J., Turk, M., & Hollerer, T. (2012). Live tracking and mapping from both general and rotation-only camera motion. IEEE International Symposium on Mixed and Augmented Reality, 13–22. doi: 10.1109/ISMAR.2012.6402532

Hashimoto, S., Ishida, A., & Inami, M. (2011). Touchme: An augmented reality based remote robot manipulation. Proceedings of the 21st International Conference on Artificial Reality and Telexistence.

Jung, M. F., Lee, J. J., DePalma, N., , S. O., Hinds, P. J., & Breazeal, C. (2013). Engaging robots: Easing complex human–robot teamwork using backchanneling. Proceedings of the 2013 Conference on Computer Supported Cooperative Work, 1555–1566. doi: 10.1145/2441776.2441954

Kijima, R., & Ojika, T. (2002). Reflex HMD to compensate lag and correction of derivative deformation. Proceedings of the IEEE Virtual Reality Conference.

Kruijff, E., Orlosky, J., Kishishita, N., Trepkowski, C., & Kiyokawa, K. (2018). The influence of label design on search performance and noticeability in wide field of view augmented reality displays. IEEE Transactions on Visualization and Computer Graphics.

Lee, K., Chu, D., Cuervo, E., Kopf, J., Degtyarev, Y., Grizan, S., … Flinn, J. (2015). Outatime: Using speculation to enable low-latency continuous interaction for mobile cloud gaming. Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, 151–165.

Levy, N., Pressman, A., Mussa-Ivaldi, F. A., & Karniel, A. (2010). Adaptation to delayed force perturbations in reaching movements. PLOS One, 5(8), e12128.

Marin, R., Sanz, P., Nebot, P., & Wirz, R. (2005). A multimodal interface to control a robot arm via the web: A case study on remote programming. IEEE Transactions on Industrial Electronics, 52(6), 1506–1520. doi: 10.1109/TIE.2005.858733

Martinez-Hernandez, U., Boorman, L. W., & Prescott, T. J. (2015). Telepresence: Immersion with the iCub humanoid robot and the Oculus Rift. Conference on Biomimetic and Biohybrid Systems, 461–464.

Meng, C., Wang, T., Chou, W., Luan, S., Zhang, Y., & Tian, Z. (2004). Remote surgery case: Robot-assisted teleneurosurgery. Proceedings of IEEE International Conference on Robotics and Automation, 1, 819–823. doi: 10.1109/ROBOT.2004.1307250

Metta, G., Fitzpatrick, P., & Natale, L. (2006). Yarp: Yet another robot platform. International Journal on Advanced Robotics Systems, 3(1), 43–48.

Metta, G., Sandini, G., Vernon, D., Natale, L., & Nori, F. (2008). The iCub humanoid robot: An open platform for research in embodied cognition. Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems, 50–56.

Milgram, P., Zhai, S., Drascic, D., & Grodski, J. (1993). Applications of augmented reality for human–robot communication. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 3, 1467–1472. doi: 10.1109/IROS.1993.583833

NASA. (2013). Nasa robot arm control with Kinect.

Noitom Hi5 VR Glove. (n.d.).

Okura, F., Ueda, Y., Sato, T., & Yokoya, N. (2014). Free-viewpoint mobile robot teleoperation interface using view-dependent geometry and texture. ITE Transactions on Media Technology and Applications, 2(1), 82–93.

Rachmielowski, A., Birkbeck, N., & Jägersand, M. (2010). Performance evaluation of monocular predictive display. IEEE International Conference on Robotics and Automation, 5309–5314.

Tachi, S., Komoriya, K., , K., Nishiyama, T., Itoko, T., Kobayashi, M., & Inoue, K. (2003). Telexistence cockpit for humanoid robot control. 17(3), 199–217.

Theofilis, K., Orlosky, J., Nagai, Y., & Kiyokawa, K. (2016). Panoramic view reconstruction for stereoscopic teleoperation of a humanoid robot. The 16th International Conference on Humanoid Robots, 242–248.

Weber, P., Rueckert, E., Calandra, R., Peters, J., & Beckerle, P. (2016). A low-cost sensor glove with vibrotactile feedback and multiple finger joint and hand motion sensing for human–robot interaction. 25th IEEE International Symposium on Robot and Human Interactive Communication, 99–104.