Since one of the most important aspects of a Fish Tank Virtual Reality (FTVR) system is how well it provides the illusion of depth to users, we present a study that evaluates users' depth perception in FTVR systems using three tasks. The tasks are based on psychological research on human vision and depth judgments common in VR applications. We find that participants do not perform well under motion parallax cues only, when compared with stereo only or a combination of both kinds of cues. Measurements of participants' head movement during each task prove valuable in explaining the experimental findings. We conclude that FTVR users rely on stereopsis for depth perception in FTVR environments more than they do on motion parallax, especially for tasks requiring depth acuity.
1 Introduction
Virtual reality (VR) applications aim to replicate the sights, sounds, tastes, smells, and so on of real or imagined environments. Researchers have introduced many display technologies and new techniques intended to make us feel immersed in virtual environments (VE). In recent years, VR technology has progressed significantly. For example, head-worn VR systems have become a commodity, and augmented reality (AR) applications have become a common feature in mobile devices.
However, there are other kinds of VR systems besides head-worn VR and mobile AR systems. Each type of VR system is suitable for different applications because of its distinct characteristics. For example, CAVE (Cruz-Neira, Sandin, & DeFanti, 1993) is one type of system that implements both head-coupled and stereoscopy features. It utilizes projection, relying on a large display area (about the size of a room) to immerse the user in a VE. Another kind of VR system is Fish Tank Virtual Reality (FTVR), which uses a combination of stereo imagery and a head-coupled display to provide the illusion of depth and a three-dimensional scene using a monitor-sized display. The 3D illusion allows a FTVR system to extend beyond traditional 2D displays in terms of both user experience and the range of applications.
In this article, we explore how users perceive depth in FTVR systems, with the aim of further expanding the impact of VR technology. Previous studies have proposed new techniques and applications for FTVR systems. (See the section “Extending Fish Tank Virtual Reality” for more detail.) We are particularly interested in FTVR because of its similarity to 2D desktop computer settings. This may lead to a wider range of applications and user experiences in a familiar environment. However, we believe that to reach the necessary level of usability, further understanding of FTVR is needed. Moreover, stereoscopic displays and tracking systems, FTVR's key components, have now become commodities, and computer performance and display technologies have improved significantly. These developments allow FTVR systems to operate with higher image resolution, higher frame rates, and higher tracking accuracy than previously possible. This makes it important to expand and update our fundamental knowledge of FTVR in the context of today's state-of-the-art technologies.
Several previous studies have investigated how motion parallax and stereopsis contribute to the quality of a FTVR system by measuring users' performance in different tasks. (See the section “Comparative Studies between Stereo and Motion Cues in FTVR Environments” for more detail.) However, when the task a user is faced with in a FTVR interface requires the perception of depth, one of the most important things to measure about a FTVR system is the degree to which it provides depth acuity to a user. With respect to depth acuity, we are particularly interested in the use of head movement; clearly, more head movement means more motion parallax. This would lead us to hypothesize that when head tracking is enabled and users move their head more, they should perform better in depth acuity tasks. Similarly, when the task or setup encourages head motion, we would expect the contribution of head tracking to results to be strong.
In this article, we describe an experiment designed to study users' depth perception in a FTVR system. We constructed a system with a commodity stereoscopic display and head tracker. We then instrumented the system to measure the mean variability of participants' head position in the lateral (left-right), superior-inferior (up-down), and anterior-posterior (AP) directions; these data unveil user behavior and indicate the degree to which participants rely on motion parallax cues while performing the experiment. In the experiment, we consider three tasks for evaluating users' depth perception. The first task, a nulling task, is borrowed from psychological studies on visual acuity and depth cues. The second and third tasks are new tasks specifically designed to test users' stereoacuity, discrimination of motion in the depth direction, and users' performance in interactions commonly needed in VEs. We focus on measuring participants' visual acuity and ability to discriminate motion in depth in the FTVR environment, because acuity can be measured objectively, and because it is largely task independent. This means that the results of our experiment should be generalizable to a wide array of applications.
The contribution of this work is an increase in our knowledge of how stereopsis and motion parallax affect users in FTVR environments. We propose new tasks for assessing FTVR depth perception; the tasks are based on visual acuity testing methods from psychological studies and common tasks in VEs. We also provide empirical data on participants' head movement while performing the tasks. In the rest of this article, we put our work in perspective with related work, describe our experiment and results, discuss the findings, then conclude and provide ideas for future research directions.
2 Related Work
In this section, we briefly review previous studies on FTVR and stereoacuity.
2.1 Extending Fish Tank Virtual Reality
Fish Tank Virtual Reality (FTVR) was first introduced by Ware, Arthur, and Booth (1993), who defined it as “a stereo image of a three dimensional (3D) scene viewed on a monitor using a perspective projection coupled to the head position of the observer.” Shortly thereafter, Rekimoto (1995) proposed the use of a vision-based head tracker to create a FTVR system without headgear.
There has been a great deal of work attempting to extend FTVR concepts and capabilities to increase the range of applications. For example, pCubee (Stavness, Lam, & Fels, 2010), a perspective-corrected handheld cubic display, aims to create the illusion of virtual objects that exist in a box, using only motion parallax for depth cues. SnowGlobe (Bolton, Kim, & Vertegaal, 2011) is a spherical FTVR display that allows users to interact with the system through touch gestures while walking around it. Francone and Nigay (2011) developed a perspective-corrected interface for smartphones and tablets that, similar to pCubee, has only motion parallax depth cues. Heinrichs and McPherson (2012) developed a real-time telepresence system using head tracking and a robotic camera. Another group of systems uses a “virtual window” display, which renders a perspective-corrected scene mimicking a real window (Gaver, Smets, & Overbeeke, 1995; Penner & Parker, 2007; IJsselsteijn et al., 2008; de Boer & Verbeek, 2011; Sam & Hattab, 2014). Kellnhofer et al. (2016) developed a process that uses depth information from motion parallax cues in video to enhance the disparity of the stereoscopic video. We (Kongsilp & Dailey, 2017a) proposed an immersive communication technique based on FTVR called Communication Portal. The goal was to enhance long-distance communication to be more like face-to-face communication. Lastly, Rodríguez-Vallejo et al. (2017) developed an iPad application for assessing a user's stereoacuity. They conducted an experiment comparing their method to a standard stereoscopic test (TNO). They report that their method can be replicated more easily than the standard method and that they did not find statistically significant differences between the two tests.
2.2 Comparative Studies between Stereo and Motion Cues in FTVR Environments
Since the main appeal of FTVR is the simulation of depth through stereopsis and motion parallax, many researchers have attempted to improve our understanding of human perception in FTVR environments to achieve the best user experience possible. They mainly study the effects that stereopsis and motion parallax have on a user's task performance. We group the work into three areas, as it seems that the importance of stereo and motion cues is task dependent.
Ware et al. (1993) and Wright et al. (2012) conducted quantitative research on depth perception through questionnaires on users' preferences. Ware et al. (1993) found that head-coupled perspective display is most preferred by users, followed by the combination of stereo and head-coupled perspective, followed by stereo only. Wright et al. (2012) found that the combination of stereo and head-coupled perspective is the most preferable modality, followed by head-coupled perspective only then finally stereo only. Hendrix and Barfield (1996) used questionnaires to study the level of presence (a sense of “being there” in the virtual environment; Slater, 1999) contributed by each cue. They found the two cues to be equally important. Lastly, we (Kongsilp & Dailey, 2017b) used standard questionnaires to measure visual fatigue and subjective perception of presence. We found that users receiving both stereo and motion parallax cues have lower visual fatigue and higher perception of presence than those receiving stereo cues only. Based on these findings, we concluded that motion parallax cues enhance user experience by lowering visual fatigue and increasing the subjective level of presence when combined with a stereoscopic display.
For visual search tasks (Sollenberger & Milgram, 1991), Arthur, Booth, and Ware (1993) found that participants perform significantly faster with both cues than only motion parallax cues. They did not find a significant difference in user performance time between stereo only and motion parallax only conditions. However, the stereo-only condition yielded a 14.7% error rate, while the motion parallax only condition yielded a 2.7% error rate. Ware and Franck (1996) studied how information nets are better understood through the use of a FTVR system. They found that head coupling increases understandability more than stereo. A followup study (Ware & Mitchell, 2005) conducted a similar experiment with a different configuration (increased resolution, frame rate lowered from 30 Hz to 20 Hz, use of a mirror stereoscope instead of frame sequential displays, a fully connected graph, and animated motion parallax cues rather than head-coupled parallax cues). The study showed that stereopsis enables faster response time than motion parallax for both experienced and inexperienced users. Finally, Demiralp et al. (2006) conducted a visual search experiment and found that stereo helped participants locate the target feature faster. They also observed that participants generally did not move their heads during the study.
Stereopsis appears to be very important for tasks that require depth perception and visual guidance. Lion (1993) asked users to perform 3D manual tracking tasks and found that motion parallax had no contribution to performance. Boritz and Booth (1997) obtained similar results in their study. Arsenault and Ware (2004) developed a FTVR system in which the user must perform Fitts' (1954) tapping task. The authors found that disabling stereo increased mean inter-tap interval by 32%, while disabling head tracking only increased it by 1%. Some studies, however, do find some contribution from head tracking in performing such tasks. Arsenault and Ware (2000) studied improvements in performance in the same task due to force feedback mechanisms. They found that force feedback decreased average time by 12%, whereas head tracking decreased average time by 9%. The small effect of head tracking may occur because humans perform little head movement, if any at all, when performing positioning tasks (Stoffregen, Smart, Bardy, & Pagulayan, 1999). Bingham, Bradley, Bailey, and Vinner (2001) suggest that stereopsis is very important for calibration of eye-hand coordination in reaching tasks with visual guidance. Li, Peek, Wünsche, and Lutteroth (2012) asked participants to select the one of four rectangular objects on the screen that was closest to them. They found that participants receiving both stereo and motion parallax cues were able to perform the task faster and more accurately than those receiving motion parallax cues only. Lastly, Albarelli, Celentano, and Cosmo (2015) studied the role of stereo and motion parallax cues in mixed reality contexts. They asked participants to measure virtual objects using physical rulers. They found that participants who used the stereo cues only or the combination of both cues were able to measure virtual objects more accurately than participants who used motion parallax cues only.
2.3 Tests of Stereoacuity
One of the earliest methods for testing stereoscopic acuity is the Howard–Dolman test (Howard, 1919), which was named after its inventors. The test consists of two pegs, one fixed at a distance of 6 m, while the second peg can be moved back and forth using a string. To test stereoscopic acuity, the subject must move the second peg so that it aligns at the same depth horizontally displaced from the first peg. The test is used widely in the aviation industry as well as other areas.
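To make the geometry of such a test concrete, the following minimal sketch converts a depth misalignment between the two pegs into an angular binocular disparity, using the standard small-angle approximation (disparity ≈ interpupillary distance × depth error / viewing distance squared). The function name, the default interpupillary distance of 6.5 cm, and the 1 cm misalignment in the usage note are illustrative assumptions, not values taken from the cited studies.

```python
import math

# Conversion factor from radians to seconds of arc.
ARCSEC_PER_RADIAN = 180.0 / math.pi * 3600.0


def disparity_arcsec(depth_error_m, viewing_distance_m, ipd_m=0.065):
    """Approximate relative disparity (arcseconds) for a depth misalignment.

    Uses the small-angle approximation eta ~= ipd * depth_error / distance**2,
    which is valid when the depth error is small relative to the distance.
    ipd_m is an assumed interpupillary distance (hypothetical default).
    """
    radians = ipd_m * depth_error_m / viewing_distance_m ** 2
    return radians * ARCSEC_PER_RADIAN
```

Under these assumptions, a 1 cm misalignment at the 6 m peg distance corresponds to roughly 3.7 arcseconds of disparity, and the same misalignment viewed from twice the distance yields a quarter of that.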
There have been multiple stereoscopic acuity testing methods that build on the Howard–Dolman test. Among them is the nulling task (Bradshaw, Parton, & Glennerster, 2000), which consists of three pegs resting on a plank. There have also been studies on what factors may influence task performance in a Howard–Dolman test. Example factors include low contrast (Westheimer & Pettet, 1990) and exposure time (Westheimer, 1994).
3 Experiment
Our objective is to determine the relative significance of motion parallax and stereopsis depth cues in FTVR environments. Therefore, we developed a FTVR system and used it as a test platform for experiments with human participants. Participants performed three tasks in the same order. As we were particularly interested in situations that reflect natural interaction between participants and a FTVR system, we conducted the experiment in a small office without windows with the overhead lights on. Only one researcher and one participant were allowed in the room at a time, to prevent interruption and distraction during the experiment.
We designed our tasks based on two factors related to depth perception that we believe are critical to user performance in most FTVR applications: depth acuity and discrimination of motion in the depth direction. Besides recording the performance of each participant during the experiments, we also recorded participants' head movement. Based on our experience with prototype head-coupled perspective displays and, for example, a demo application developed by Francone (Francone & Nigay, 2011), we hypothesized that motion parallax by itself would not be as effective as displays also providing stereopsis for depth information. We also hypothesized that the amount of head movement would be view-mode dependent, as participants would probably move their head more in the motion parallax mode to obtain depth information.
3.1 System Design
The elements of our prototype FTVR system are illustrated in Figure 1. For the stereoscopic display, we used a 23-inch 3D LCD monitor equipped with NVIDIA 3D Vision glasses. The display's resolution was pixels, and the total display area was cm. At runtime, the monitor renders images at 120 Hz while the 3D glasses alternately block each eye at the same frequency, allowing for stereo depth perception. For head tracking, we used TrackIR 5, an infrared tracking system consisting of infrared LEDs, an infrared camera, and infrared reflectors. The tracker has a horizontal field of view of 51.7 degrees and a 120-Hz sample rate. In every experiment, the user was seated one meter away from the screen and was able to lean in any direction but was not able to move his or her chair.
3.1.1 View Modes
The test applications were developed on the Unity 3D platform, which was driven by a PC equipped with an Intel Core i5-4690 CPU and NVIDIA GTX 970 GPU. The application was designed for easy switching among three viewing modes:
Stereopsis: The application displays static stereo images to the user. In this view mode, we use two virtual cameras (one for each eye). We place the virtual cameras in the application at the center of the display in the xy-plane and one meter outside (in front of) the screen along the z-axis. This position represents a user sitting one meter away from the screen. The distance between the two virtual cameras is set to be equal to the user's pupillary distance.
Motion Parallax: The application displays perspective-corrected monocular images to the user. In this view mode, we use a single virtual camera. We set its position to the measured center point between the user's eyes. We track this point and update the position of the camera in real time. Since the goal was to study the role of depth cues in real-life use cases, we did not instruct users as to how they should view the screen. Users were free to use both eyes to view the display or close one eye, according to their preferences.
Both: The application displays perspective-stereo images to a user. For this view mode, we use two virtual cameras (one for each eye). We set the cameras' positions according to the measured positions of the user's eyes. We track the position of the user's eyes and update the camera positions in real time.
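The three view modes differ only in where the virtual cameras are placed and whether they follow the tracker. The sketch below illustrates that logic in Python; it is not the Unity code used in the study. It assumes a coordinate frame with the origin at the screen center, x to the right, y up, and z pointing out of the screen toward the viewer (in cm), and it assumes the eyes are offset horizontally from the tracked head point, ignoring head roll. The `off_axis_frustum` helper shows one common way to obtain a perspective-corrected image for an off-center eye; the screen dimensions passed to it are hypothetical parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec3:
    x: float
    y: float
    z: float


# Assumed default viewpoint: one meter (100 cm) in front of the screen center.
DEFAULT_HEAD = Vec3(0.0, 0.0, 100.0)


def camera_positions(mode, ipd_cm, tracked_head=None):
    """Return the virtual camera positions (cm) for one of the three view modes."""
    half = ipd_cm / 2.0

    def eyes(head):
        # Assumption: eyes lie on a horizontal line through the head point.
        return [Vec3(head.x - half, head.y, head.z),
                Vec3(head.x + half, head.y, head.z)]

    if mode == "stereopsis":
        return eyes(DEFAULT_HEAD)          # static stereo pair, tracker ignored
    if tracked_head is None:
        raise ValueError("head-coupled modes need a tracked head position")
    if mode == "motion_parallax":
        return [tracked_head]              # single camera between the eyes
    if mode == "both":
        return eyes(tracked_head)          # head-coupled stereo pair
    raise ValueError(f"unknown view mode: {mode}")


def off_axis_frustum(eye, screen_w_cm, screen_h_cm, near_cm):
    """Near-plane extents (left, right, bottom, top) of an asymmetric frustum
    whose edges pass through the screen borders as seen from `eye`."""
    scale = near_cm / eye.z
    left = (-screen_w_cm / 2.0 - eye.x) * scale
    right = (screen_w_cm / 2.0 - eye.x) * scale
    bottom = (-screen_h_cm / 2.0 - eye.y) * scale
    top = (screen_h_cm / 2.0 - eye.y) * scale
    return left, right, bottom, top
```

With a centered eye the frustum is symmetric; as the eye moves sideways the frustum becomes skewed, which is what produces the motion parallax cue on a fixed screen.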
The NVIDIA 3D Vision glasses are always active for all view modes, meaning that the resolution, frame rate, and brightness are divided by two for each eye. In the motion parallax view mode, the application renders the same image for subsequent frames for each eye, in order to maintain the same level of brightness, frame rate, and resolution across all view modes. We used 2x multi-sampling for anti-aliasing and bilinear filter mode for texture. We disabled shadow functionality to avoid providing additional depth cues to participants.
3.1.2 Calibration
The main free parameter of our FTVR system is the participant's pupillary distance. Before the experiment began, we measured the distance manually using a ruler and then entered it into the system.
In order to ensure that the participant began each experiment with his/her head at an appropriate origin, we placed a real cuboid sized cm at the center of the screen. We asked the participant to move his or her head so that the cuboid was pointing directly at the center point between his or her eyes. Once aligned, we recentered the virtual origin at the user's head position and rendered a virtual cuboid at the same virtual position as the real cuboid. We then allowed the participant to move his or her head freely and compare the virtual cuboid with the real cuboid from different angles. The participant was then asked if the real and virtual cuboids were perfectly identical in size. Once the participant reported uniformity, we finalized the origin at the participant's detected head position.
3.2 Participants
We recruited 14 participants (seven male and seven female). There were four undergraduates, four master's students, and six researchers, aged from 20 to 27 years old. All participants were frequent users of computers in their daily life and had more than five years of computer experience. There were seven participants with normal eyesight and seven participants who wore eyeglasses to correct their eyesight to normal. None had previous experience using FTVR displays before they participated in the experiment. To ease the learning process, we formally explained the basic concepts of stereopsis and motion parallax, as well as the concept of Fish Tank Virtual Reality, to all participants before they started their tasks.
3.3 Tasks
In this section, we describe the three tasks used in the experiment (Virtual Nulling task, Placing a Virtual Object task, and Detect Moving Object task). For each task, we explain the design concept and its configuration. During the experiment, every participant performed all three tasks in the same order.
3.3.1 Task 1: Virtual Nulling
Our virtual nulling task is based on the classic Howard–Dolman test (Howard, 1919) and the more recent nulling task (Bradshaw et al., 2000) used to test stereoscopic acuity in psychological research on human vision. Although this task may not precisely reflect a common VR task, it is ideal for measuring participants' depth acuity in the FTVR environment. Therefore, rather than testing with real-world devices as did Bradshaw et al. (2000), we implemented a virtual version of the nulling task in our Fish Tank Virtual Reality system.
Refer to Figure 2. Our virtual nulling setup consists of three identical poles (1 cm in diameter and 3.5 cm tall) resting on a plank that is 11 cm wide and 120 cm long. The left and right poles are static and always aligned with each other, whereas the middle pole is placed 20 cm away from alignment with the poles but is movable by an amount as small as 0.1 cm by scrolling the mouse wheel. Participants are required to move the middle pole along the center line of the plank to align it with the others in under five seconds. We did not want to give participants an extensive amount of time because we were concerned that there would be a ceiling effect and because depth judgment usually happens spontaneously in everyday life. Once the five-second limit is exceeded, the system records the accuracy of the alignment (alignment error), which we use as the dependent measurement.
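The trial logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the function name and the event format are hypothetical, each mouse-wheel notch is assumed to move the pole by exactly one 0.1 cm step, and positive notches are assumed to move the pole toward alignment. Positions are tracked as integer step counts so that repeated 0.1 cm increments do not accumulate floating-point drift.

```python
def alignment_error_cm(scroll_events, start_offset_cm=20.0,
                       step_cm=0.1, time_limit_s=5.0):
    """Return the absolute depth misalignment (cm) when the time limit expires.

    scroll_events: iterable of (timestamp_s, notches) pairs, where notches is
    a signed integer count of wheel steps (hypothetical event format).
    """
    # Work in whole steps to avoid floating-point drift.
    offset_steps = round(start_offset_cm / step_cm)
    for timestamp_s, notches in scroll_events:
        if timestamp_s <= time_limit_s:   # input after the limit is ignored
            offset_steps -= notches
    return abs(offset_steps) * step_cm
```

For example, a participant who scrolls 195 notches within the five seconds leaves the pole 5 steps (0.5 cm) short of alignment, while one who scrolls 210 notches overshoots by 1.0 cm; both count as alignment error.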
3.3.2 Task 2: Placing a Virtual Object
The second task is designed to measure depth acuity specifically for virtual environments. Task 1 (Virtual Nulling) is based on stereoscopic acuity testing in psychological research on human vision, but does not reflect a common VR task. Therefore, in Task 2, we measure depth acuity by combining nulling with an activity common in virtual reality environments: placing a virtual object at a specific position in 3D space. We kept the idea of depth alignment from the nulling task (Bradshaw et al., 2000). However, we removed the horizontal alignment and planar surface underlying the objects, so that the environment better resembles how virtual objects may be placed in 3D spaces.
The workspace consists of three balls (2.5 cm in radius), as illustrated in Figure 3. Similar to the previous task, the purple balls are static and rest on the same xy-plane. The orange ball is placed 20 cm away from the target plane in the z direction and is movable along the z direction by an amount as small as 0.1 cm using the mouse wheel. During the task, participants have five seconds to move the orange ball so that it rests on the same xy-plane as the other two balls. As in the case of the previous task, we did not want to give participants an extensive amount of time because we were concerned that there would be a ceiling effect and because depth judgment usually happens spontaneously in everyday life. When the five-second time limit is exceeded, the system records the alignment error.
3.3.3 Task 3: Detect Moving Object
For Task 3, since observation-based tasks and visual search-based tasks are both common activities in virtual environments, especially games and simulations, we designed a task that measures participants' ability to detect subtle changes in depth, through observation and without interaction. Task 3 thus required a different design approach from Tasks 1 and 2, which were designed to measure users' depth perception by allowing the participants to move virtual objects along the z-axis. We kept the object positioning and xy-plane alignment from the previous tasks but removed the interaction with the virtual objects. Moreover, in case the short time limit in Tasks 1 and 2 affected participants' depth information gathering behavior, we gave each participant significantly more time than in the previous tasks: 20 seconds to perform the task.
The setup consists of three balls (2.5 cm in radius), each positioned at a corner of an equilateral triangle. As illustrated in Figure 4, each ball is marked with a number. Initially, all balls are depth-aligned with one another. However, one of the balls moves slowly into the screen at a constant speed of 1 cm per second. Meanwhile, participants have 20 seconds to indicate which ball is changing in depth by entering the number labeled on the ball with the numeric keypad. At the end of each trial, the system recorded the time taken as well as the correctness of the entered number.
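Because the ball recedes at a constant speed, the response time directly determines how far the ball had moved in depth when the participant answered. The sketch below records both quantities for one trial. The function names and the returned dictionary are hypothetical, and treating a missing response as a miss is an assumption made for illustration.

```python
def depth_displacement_cm(elapsed_s, speed_cm_per_s=1.0, time_limit_s=20.0):
    """Depth travelled by the moving ball after elapsed_s seconds (cm)."""
    clamped = min(max(elapsed_s, 0.0), time_limit_s)
    return speed_cm_per_s * clamped


def score_trial(moving_ball, response, response_time_s, time_limit_s=20.0):
    """Score one trial: correctness, response time, and depth displacement.

    response is the ball number typed by the participant, or None if no key
    was pressed before the time limit (assumed here to count as a miss).
    """
    answered = (response is not None
                and response_time_s is not None
                and response_time_s <= time_limit_s)
    return {
        "correct": answered and response == moving_ball,
        "time_s": response_time_s if answered else None,
        "displacement_cm": (depth_displacement_cm(response_time_s)
                            if answered else None),
    }
```

A correct answer after 7.5 s, for instance, means the participant detected the motion once the ball had receded 7.5 cm.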
3.4 Test Conditions
We manipulated two test conditions in the experiment. The first condition is the view mode condition (Stereopsis, Motion Parallax, or Both view modes). (See Subsection 3.1.1 for more information about each view mode.) The second condition is the position of the correct alignment of the virtual objects in each task along the z-axis. We consider the second condition because motion parallax and stereopsis are less useful when observing faraway objects. Therefore, we wanted to detect any interaction between depth cue utility and distance from the observer. We used three positions along the z-axis in all three tasks: 25 cm out of the screen, 20 cm into the screen, and 50 cm into the screen. Therefore, each task followed a within-subject design (three view modes combined with the three positions). For each task, each participant is required to perform the nine tests in a random order. Since there are three tasks, each participant needs to perform 27 tests in total.
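The resulting design can be enumerated in a few lines. The following sketch builds the nine view mode and position combinations for each task and shuffles them independently per task. The identifiers, the sign convention for positions (negative meaning out of the screen), and the use of a seeded generator are assumptions made for illustration; the source does not describe how the random order was generated.

```python
import itertools
import random

VIEW_MODES = ("stereopsis", "motion_parallax", "both")
# Target depth relative to the screen in cm; negative = out of the screen
# (sign convention assumed for this sketch).
POSITIONS_CM = (-25, 20, 50)
TASKS = ("virtual_nulling", "placing_object", "detect_moving_object")


def trial_order(rng):
    """Return the nine (view mode, position) conditions in random order."""
    conditions = list(itertools.product(VIEW_MODES, POSITIONS_CM))
    rng.shuffle(conditions)
    return conditions


def session_plan(seed):
    """Return a per-task randomized plan for one participant."""
    rng = random.Random(seed)   # seeded so that a plan can be reproduced
    return {task: trial_order(rng) for task in TASKS}
```

Calling `session_plan(seed)` with a participant-specific seed yields three lists of nine conditions each, 27 tests in total, with the tasks themselves kept in a fixed order.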
3.5 Procedure
At the beginning of the experiment, the researcher asked each participant the following questions.
How long have you been using a computer?
Do you have normal eyesight? If not, are you wearing eyeglasses to correct your eyesight to normal?
Have you used a FTVR system?
The researcher allowed the participant to proceed if the participant had been using a computer for more than five years, had normal eyesight or wore eyeglasses to correct his/her eyesight to normal, and had never used a FTVR system before. Next, the researcher explained the concept of FTVR to the participant and calibrated the system for each participant (see Subsection 3.1.2 for more detail). Then the researcher asked the participant to start performing the first task.
At the start of the first task, the researcher gave three training conditions to the participant (one in each of the three view modes, with a fixed pole position at 20 cm into the screen). The participant was allowed to practice performing the task until he/she was satisfied. During these training trials, the researcher explained the goal of the first task to the participant. After the participant completed the trials, the researcher gave nine tests to the participant in random order (see Section 3.4 for more information about the test conditions). Between each test, the system paused and advised the participant what view mode was to be used next. For each test, the system gave the participant five seconds to perform the task. When the five-second limit was exceeded, the system moved to the next test without giving any feedback on how the user performed. Once a participant completed all tests, the researcher asked the participant to continue to the second task.
Before the second task, the researcher gave three training conditions to each participant (one in each of the three view modes, with a fixed ball position at 20 cm into the screen). The participants were asked to follow the same procedure as in the first task.
The procedure during the third task was similar to that of the first two tasks. First, the researcher gave the participant three training trials (one in each of the three view modes, with a fixed ball position at 20 cm into the screen). The researcher explained the task to the participant during the training trials. However, in contrast to Tasks 1 and 2, the researcher highlighted the importance of correctness over swiftness and emphasized the extensive amount of time that the participant would have (20 seconds). The participant could practice performing the task until he/she was satisfied. Next, the researcher gave nine tests in random order, after the participant completed the training trials. Between each test, the system paused and advised the participant what view mode was to be used next. For this task, the system gave the participants 20 seconds to perform the task. When the time limit was exceeded, the system moved to the next test without giving any feedback on how the user performed. When the participant completed all tests, the researcher asked the participant to stop using the system.
After each participant completed all three tasks, the researcher asked the participant to recall his/her experience. Then the researcher informally interviewed the participant and asked him or her to rank his/her user experience in each view mode.
3.6 Head Tracking
In addition to measuring accuracy in each task, we also tracked the head position of each participant in each trial while performing the experiment. We used the head position at the beginning of each trial as the reference origin. The head-tracking system reports the user's head position at approximately 120 fps. However, the system only records the user's head position for analysis approximately every 0.04 s because we found that recording every 0.0083 s (every single frame) causes the log file size to grow too quickly, becoming unnecessarily large and causing the recording process to run very slowly. We computed the standard deviation of head position in the anteroposterior (AP, forward and backward), lateral (left-right), and superior-inferior (up and down) axes. This measurement method is similar to that used in psychological field studies of human posture (Stoffregen et al., 1999).
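The measurement described above can be sketched as follows. The code thins a 120 Hz stream to roughly one sample every 0.04 s by keeping every fifth sample, re-expresses positions relative to the first sample of the trial, and reports the standard deviation along each axis. The function names, the mapping of x, y, and z to the lateral, superior-inferior, and anteroposterior axes, and the choice of the population standard deviation are assumptions; the source does not specify them.

```python
import statistics


def downsample(samples, source_hz=120.0, log_interval_s=0.04):
    """Keep about one sample per logging interval from a fixed-rate stream."""
    stride = max(1, round(source_hz * log_interval_s))  # 120 Hz * 0.04 s -> 5
    return samples[::stride]


def head_variability(samples):
    """Standard deviation of head position per axis for one trial.

    samples: list of (x, y, z) positions in cm; the first sample is used as
    the reference origin. Axis mapping is an assumption of this sketch:
    x = lateral, y = superior-inferior, z = anteroposterior.
    """
    ox, oy, oz = samples[0]
    relative = [(x - ox, y - oy, z - oz) for x, y, z in samples]
    return {
        "lateral": statistics.pstdev(p[0] for p in relative),
        "superior_inferior": statistics.pstdev(p[1] for p in relative),
        "anteroposterior": statistics.pstdev(p[2] for p in relative),
    }
```

Subtracting the origin does not change the standard deviation, but it keeps the logged positions comparable across trials, which matches the reference-origin convention described in the text.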
4 Experimental Results
This section presents results for the three tasks in turn.
4.1 Task 1 Results
In Task 1, we dropped the data for one participant from the analysis because of an extreme outlier caused by a phone call interruption during task performance. We carried out the analysis on the remaining 13 participants' data. The mean alignment error under each condition is shown in Figure 5. We conducted a two-way ANOVA to examine the effect of view mode and pole position on alignment error.
The two-way ANOVA did not reveal a main effect of depth cue type or an interaction between the factors. However, the two-way ANOVA did find a main effect of pole position (; ). A Tukey analysis for the different pole positions showed that participants perform better in the 25 cm out of the screen condition than in the 50 cm into the screen condition ().
The mean variability of head position in the lateral (left-right), superior-inferior (up and down), and anteroposterior (AP) directions is summarized in Figure 6. Since we were performing statistical tests on motion measurements in the three directions separately, we applied Bonferroni correction to cope with the additional Type 1 error due to the increased number of tests. For lateral head motion, the two-way ANOVA with Bonferroni correction did not reveal any main effects. However, for superior-inferior variability, the two-way ANOVA with Bonferroni correction did reveal a main effect of pole position (; ). Tukey's range test showed that participants moved their head in the superior-inferior direction more in the 50 cm into the screen condition compared to the 25 cm out of the screen condition (). In the case of anteroposterior variability, the two-way ANOVA with Bonferroni correction revealed a main effect of pole position (; ). A Tukey range test analysis found that participants moved their heads in the anteroposterior direction more in the 50 cm into the screen condition when compared to the 25 cm out of the screen condition ().
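For readers unfamiliar with the correction, the sketch below shows the usual form of a Bonferroni adjustment for a family of three tests, one per movement direction: each p-value is multiplied by the number of tests (capped at 1) and then compared with the significance level. The function name and the significance level of 0.05 are assumptions, and the p-values in the usage note are hypothetical placeholders, not results from this experiment.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, significance flags) for a family of tests.

    Each p-value is multiplied by the number of tests in the family and
    capped at 1.0; a test is flagged when its adjusted value is below alpha.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant
```

With hypothetical unadjusted p-values of 0.20, 0.010, and 0.030 for the lateral, superior-inferior, and anteroposterior directions, the adjusted values are 0.60, 0.03, and 0.09, so only the second would remain significant at the 0.05 level.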
4.2 Task 2 Results
For the second task, we used the data from all 14 participants. (See Figure 7.) We performed a two-way ANOVA to examine the effect of view mode and ball position on accuracy.
The two-way ANOVA revealed a main effect of view mode (; ). Tukey's range test showed that participants performed significantly worse in the motion parallax depth cue mode in comparison to the stereopsis condition () and the both condition (). However, the two-way ANOVA did not find an interaction or a main effect of ball position.
The mean variability of head position in the lateral (left-right), superior-inferior (up-down), and anteroposterior (AP) directions is summarized in Figure 8. A two-way ANOVA on these data did not reveal any significant effects.
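As an illustration of how a per-direction variability measure can be derived from head-tracking samples, the sketch below computes the sample standard deviation of head position along each axis for one trial. Treating variability as the standard deviation is an assumption made for this example, as are the function name `head_variability`, the axis order (lateral, superior-inferior, anteroposterior), and the sample coordinates.

```python
from statistics import stdev


def head_variability(samples):
    """Per-axis sample standard deviation of tracked head positions.

    samples is a sequence of (lateral, superior_inferior, anteroposterior)
    positions, all in the same unit (for example centimeters).
    """
    lateral, sup_inf, ant_post = zip(*samples)
    return {
        "lateral": stdev(lateral),
        "superior_inferior": stdev(sup_inf),
        "anteroposterior": stdev(ant_post),
    }


# Placeholder tracking samples for a single trial (not recorded data).
trial = [(0.0, 0.0, 60.0), (0.4, 0.1, 60.5), (0.2, -0.1, 59.8), (0.1, 0.0, 60.2)]
print(head_variability(trial))
```

Averaging these per-trial values over participants and conditions would give a mean variability per direction of the kind summarized in the figures.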
4.3 Task 3 Results
In the third task, the participants' accuracy in identifying the moving ball in the stereopsis, motion parallax, and both depth cue modes was 100%, 88.1% (37 hits and 5 misses from four participants), and 100%, respectively. For this task, we wanted to measure only the time taken to achieve a correct response, so that the data used for statistical analysis would be consistent. Therefore, we dropped the four participants who did not achieve 100% correctness from further analysis. The results for the remaining 10 participants are summarized in Figure 9. A two-way ANOVA was conducted on these data to examine the effect of view mode and ball position on the participants' time to respond. The two-way ANOVA revealed main effects of both depth cue view mode (; ) and ball position (; ). A Tukey range test showed that participants performed worse in the motion parallax condition than in the both condition (). The test also showed that participants performed better in the 25 cm out of the screen condition than in the 20 cm into the screen condition () and the 50 cm into the screen condition (). There were no significant interactions.
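The accuracy figure and the participant count reported above follow from simple arithmetic, sketched here for reference. The function name `accuracy_percent` is hypothetical; the counts are those given in the text.

```python
def accuracy_percent(hits, misses):
    """Percentage of correct responses among all responses."""
    return 100.0 * hits / (hits + misses)


# Motion parallax mode in Task 3: 37 hits and 5 misses.
motion_parallax_accuracy = round(accuracy_percent(37, 5), 1)
print(motion_parallax_accuracy)  # 88.1

# Four of the 14 participants had at least one miss and were excluded.
remaining_participants = 14 - 4
print(remaining_participants)  # 10
```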
The mean variability of the participants' head position in the anteroposterior (AP) and lateral (left-right) directions is summarized in Figure 10. A two-way ANOVA on these data did not reveal any significant effect of view mode or ball position.
4.4 Participants' Feedback
After each participant completed the three tasks and provided his or her personal preferences for the depth cue modes, the researcher informally interviewed the participant about his or her responses. In aggregate, we found two major factors important to the participants: the level of dizziness they experienced and the sense of depth that they felt in each depth cue mode.
We split participants into three groups in terms of their preference ranking results (see Table 1). The first group ranked the both depth mode as their first preference. The common feedback from these participants was that this mode caused them less dizziness when compared to stereopsis alone and provided a more realistic sense of depth than the motion parallax only condition. The second group ranked motion parallax as their first choice. The common feedback from this group was that motion parallax is more comfortable than the other depth cue modes. They felt a certain level of dizziness and had a hard time acquiring depth information in the stereopsis and both conditions. Lastly, the third group ranked stereopsis as their first option. They reported that stereopsis provided the best sense of depth, whereas the both depth cue mode caused them discomfort and difficulty in retrieving depth information.
Table 1. Participants' preference rankings of the depth cue modes (columns: Rank, Stereopsis, Motion Parallax, Both).
5 Discussion
In this section, we discuss our observations about users' behavior, provide some interpretation of the effects, remark on head-coupled displays, and make some recommendations for future FTVR projects.
5.1 User Head Motion Behavior
At the beginning of the experiment, participants received a short training session for each depth cue mode until they were satisfied. During the training sessions, participants seemed to be exploring the system to determine how it operates and how they could interact with it. When the participants first used the motion parallax and both depth cue modes, they often moved their head to a very large degree, even so far as to leave the field of view of our head-tracking system. This indicates that the users were well aware of the interaction modality and how depth information can be acquired through motion parallax.
However, contrary to our hypotheses that head movement would be view-mode dependent and that participants would move their heads more in motion parallax view mode than in stereo-only mode, most participants' behavior changed once their performance started being measured. The low mean head position variability in all tasks shows that participants rarely moved their heads in any depth cue mode, unlike their behavior during the practice sessions. Statistical analysis did not indicate that view mode affected head movement. We were able to observe only a small number of movements that could be identified as intentional motion to gain depth information from motion parallax cues. Even in Task 3, in which an extensive amount of time was given, participants seemed to be even more focused on the task and moved their head even less. Note that in the motion parallax view mode, without any motion parallax feedback, participants must have been relying mainly on monocular cues such as small changes in the relative size of the virtual objects, rather than using depth cues from motion. Cutting (1997) provides a list of assumptions about sources of depth information that a user may use in a VE.
Furthermore, we could not classify any head movement we observed as a way of getting more accurate depth information; rather, we believe that participants were using head motion to look at the display from different points of view or simply adjusting their posture. We have learned that it is probably more intuitive for users to rely on other monocular cues than on motion parallax from head movement when performing tasks that require depth acuity without stereo vision. It would be better for them to keep head movement to a minimum to reduce any interference caused by head-coupled viewing.
During the experiment, participants could close one eye or use both eyes to view the display in the motion parallax view mode. We should note that when a participant keeps both eyes open, stereoscopic information is not completely removed from the stimulus. As our original goal was to study the role of the depth cues under a real-life use case, and since using a single eye is not realistic, we did not provide any suggestion as to how the participants should look at the display in the motion parallax view mode. During the experiment, we neither formally observed nor recorded whether the participants used a single eye or both eyes in the motion parallax view mode. Nonetheless, it is likely that the participants used both eyes during the motion parallax view mode due to the limited amount of time and random nature of the experiment. However, in retrospect, it would have been better to include a condition in which users performed the experiment strictly monocularly with one eye covered or closed. That way, we would have learned more about user behavior.
5.2 Importance of Stereopsis in FTVR Environments
We had initially hypothesized that motion parallax by itself would not be as effective in conveying depth as displays also providing stereopsis. However, in Task 1 (Virtual Nulling), although there was a trend that participants performed slightly worse under the motion parallax condition than under the stereopsis and both conditions, there was no significant effect of depth cue on user performance. The finding of a main effect of pole position was expected, since any kind of depth cue will be less effective for judging objects at greater distances. We note the rather low performance by participants, which was caused by the small amount of time they had (5 s) to perform the task.
The results of Task 2 (Placing a Virtual Object), on the other hand, confirm our hypothesis more directly: participants performed significantly worse under motion parallax when compared to the other depth cue modes. We believe that the effects of the depth cue mode in this experiment were stronger than in Task 1 (Virtual Nulling) due to a decrease in other sources of depth information. See Cutting (1997) for a list of potential sources of depth information that a user may use in a VE. In the Virtual Nulling task, there were two other sources of depth information besides stereo and motion cues: (1) the relative size of the two poles, which can easily be compared based on the horizontal alignment of the three poles, and (2) the linear perspective provided by the parallel edge of the plank that the poles rest upon. In Task 2, however, we believe the absence of these sources of depth information forced participants to rely more on stereopsis and motion parallax, amplifying the effect and the differences between depth cue modes to detectable levels.
In contrast to the first two tasks, in Task 3 (Detect Moving Object), participants had a considerable amount of time to identify the moving object. Still, participants took significantly longer under the motion parallax condition than under the both condition; the comparison with the stereopsis condition showed a trend in the same direction but did not reach significance. This suggests that the effects of depth cue mode observed in Tasks 1 and 2 do not depend on the amount of time that a participant has to perform the task. In addition to the difference in time to respond, the participants were able to perform the task more accurately in the both and in the stereopsis view modes (100% accuracy), compared to the motion parallax view mode (88.1% accuracy). The absence of the stereo depth cue has a strong influence on a viewer's ability to compare the depth of multiple objects in a FTVR environment.
The small deviations in users' head position data that we observed in the experiment indicate that participants hardly depend on motion parallax cues, even in the motion parallax depth cue mode, where the motion cue is the only explicit depth cue present. Our informal observation that participants remained virtually motionless most of the time during the experiment is consistent with the head-tracking data. To speculate, if we were to observe users of traditional 2D desktops, we would very likely find that they remain stationary most of the time. Therefore, we suggest that the users' behavior may reflect their habituation to traditional 2D environments, in which head movement only distracts users from their task.
In retrospect, it would have been better to include a baseline condition with neither stereo nor motion parallax cues in the experiment. We could then have learned more about the effects of stereo cues relative to that baseline. Still, if we restrict ourselves to the relative significance of motion parallax and stereopsis cues, we can draw conclusions.
As a caveat, we should point out that among the three tasks in the experiment, we only found a significant difference between the stereopsis and motion parallax conditions in Task 2. However, several other findings support the conclusion that users are better able to use stereopsis for depth judgments than motion parallax: (1) there was a significant difference between the both and motion parallax conditions in the last two tasks; (2) there was a lower percentage of correct responses under the motion parallax condition in the third task; (3) there was overall a small degree of head movement; and (4) neither task performance nor user behavior changed when more time was given to perform a task. Based on all of these findings together, we conclude that FTVR users rely on stereopsis for depth perception in FTVR environments more than they do on motion parallax, especially for tasks requiring depth acuity.
Lastly, the participants' experience during the tasks in this experiment can be classified as passive viewing, because the task does not require participants to move their heads. It is likely that participants rely very little on motion parallax cues when the task they are faced with does not require or promote head movement. One might therefore consider using other tasks more likely to promote head movement. However, the goal of the study is not to identify tasks that promote the use of motion parallax and stereopsis depth cues; rather, it is to study the effect of the motion parallax and stereopsis cues under use cases that require depth perception and reflect real-life circumstances. This led to the design of the three tasks. In particular, we did not consider depth perception tasks that require a great deal of head movement, as this would induce micro fatigue and would not reflect common use case scenarios in everyday life.
5.3 Comparison to Results in Previous Experiments
Our findings complement those of other FTVR studies (Lion, 1993; Boritz & Booth, 1997; Arsenault & Ware, 2004; Li et al., 2012; Albarelli et al., 2015) that have found stereopsis to be important for depth perception in FTVR. This is most likely because tasks used in these studies rely primarily on depth perception. Depth information channels influence performance differently in different tasks and different hardware/software setups. On the other hand, as already discussed, some FTVR studies have found motion parallax to be as important as stereopsis for FTVR ergonomics and user experience. These studies are primarily questionnaire-based studies (Ware et al., 1993; Hendrix & Barfield, 1996; Wright et al., 2012; Kongsilp & Dailey, 2017b). Our participants' feedback complements these studies, as it suggests that users may prefer the motion parallax mode over other modes due to visual discomfort caused by the stereopsis and both view modes. We speculate that there are two potential causes of the visual discomfort. The first potential cause is the mismatch between accommodation and convergence. Vienne et al. (2014) present a study that provides a better understanding of how the vergence system behaves in stereoscopic displays. The second potential cause is the static nature of the stereo images in the stereopsis view mode. In other work (Kongsilp & Dailey, 2017b), we have found that adding a perspective-corrected viewing feature to a stereoscopic display helps reduce visual fatigue and increase viewers' subjective level of presence.
Besides the FTVR studies, there have been several studies that investigate user experience in other kinds of VR systems. For example, Jones et al. (2008) studied the effect of motion parallax in VR and AR environments using a HMD-based VR system. Similar to our findings, they found that the motion parallax cue provides an additional interaction modality but does not help improve participants' depth judgments. This result and those of other studies outside the FTVR domain lead to the interesting question as to whether our findings would generalize beyond FTVR to other kinds of VR systems. Answering such questions is beyond the scope of the current work, but we could draw upon related work that studies a single capability across different VR modalities. For example, Renner, Velichkovsky, and Helmert (2013) provide a comprehensive review of work that studies how users perceive distance in different VR systems.
5.4 Distance to Virtual Objects
It is well known that the effectiveness of stereopsis and motion parallax in the real world increases when judging nearby objects and decreases when judging far-away objects. At the same time, binocular disparity (differences between two images of a scene as seen by each eye) also increases with nearness, which can cause double vision and discomfort. This means that a close-range workspace may allow a participant to perform a task better than a far-range workspace would, but a close-range workspace may cause visual discomfort if it is placed too close to the user. Therefore, it is important for FTVR developers to find an optimum distance for their FTVR workspaces. The results of the first and third tasks, which suggest that participants perform better when virtual objects are placed close to them, are consistent with this perspective. However, the statistical analysis of the results of Task 2 (Placing Virtual Object) failed to reject the null hypothesis that workspace positioning does not affect performance. We therefore suspect that the optimum distance for a given task is likely task-dependent and that the virtual objects in Task 2 are being placed too far out of the screen for the specific task.
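The relationship between object distance and binocular disparity can be made concrete with a small geometric sketch. It computes the vergence angle for a point straight ahead of the viewer and takes the disparity of a virtual object as the difference from the vergence angle at the screen plane. The interpupillary distance of 6.5 cm and the viewing distance of 80 cm are assumptions chosen for illustration and are not taken from the experimental setup; the function names are hypothetical. The three object depths correspond to the positions used in the tasks.

```python
import math


def vergence_deg(distance_cm, ipd_cm=6.5):
    """Vergence angle (degrees) when fixating a point straight ahead."""
    return math.degrees(2.0 * math.atan(ipd_cm / (2.0 * distance_cm)))


def disparity_deg(object_cm, screen_cm, ipd_cm=6.5):
    """Angular disparity of an object relative to the screen plane.

    Positive values mean the object lies in front of the screen (crossed
    disparity); negative values mean it lies behind the screen.
    """
    return vergence_deg(object_cm, ipd_cm) - vergence_deg(screen_cm, ipd_cm)


SCREEN_CM = 80.0  # assumed viewing distance, for illustration only

# Offsets relative to the screen: negative is out of the screen (toward the
# viewer), positive is into the screen.
for label, offset_cm in [("25 cm out", -25.0), ("20 cm in", 20.0), ("50 cm in", 50.0)]:
    d = disparity_deg(SCREEN_CM + offset_cm, SCREEN_CM)
    print(f"{label}: {d:+.2f} deg")
```

Because the vergence angle grows quickly as the distance shrinks, moving an object toward the viewer increases the magnitude of the disparity faster than moving it the same amount away, which is consistent with the trade-off between performance and discomfort described above.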
Apart from our work, Bruder, Argelaguet, Olivier, and Lécuyer (2016) have conducted a study that explores the effects of workspace distance in a CAVE-like system. Their results suggest that a workspace should be placed at zero or negative parallax, and that a CAVE display should be placed 6 to 7 meters away from a user. Clearly, we would need to perform a similar study for a FTVR system to go beyond conjecture here, as we believe different kinds of VR systems may have different optimum workspace distances and because we only tested with three specific distances in the experiments.
5.5 Head-Coupled Perspective Displays Are Not Full FTVR Displays
Although FTVR was originally conceived as a system providing both stereo and motion cues, some of the work based on FTVR concepts (see the “Related Work” section) implements only motion parallax cues. This type of display is also known as a head-coupled perspective display. However, in this article, we have shown that stereopsis is more important than motion parallax when it comes to depth acuity: our users perceived depth less accurately when stereopsis was absent. Therefore, we believe it is important to distinguish head-coupled perspective displays from FTVR systems. That is to say, a head-coupled perspective display alone does not make a FTVR system. This is contrary to the common trend of referring to nearly any system incorporating head-coupled perspective displays as a FTVR system.
Interestingly, this trend may be partly due to the prevalence of public 2D videos researchers use to publicize their work. In fact, in our experience, since a monocular screen capture of a head-coupled perspective display interaction looks identical to a monocular screen capture of a FTVR interaction, 2D videos of head-coupled perspective displays look more impressive than the real-world experience of these displays!
These inconsistencies may lead to misdirection in future studies and are best avoided.
6 Conclusion
This article describes a comprehensive study of the importance of stereo and motion cues for FTVR interfaces. In the experiment, we focus on measuring participants' depth acuity and discrimination of motion in the depth direction in a FTVR environment, because these factors can be measured objectively and are somewhat task independent. The tasks in the experiment are based on psychological studies of visual acuity and depth cues as well as common interactions needed in virtual environments. We track participants' head position while performing the experiment in the lateral (left-right), superior-inferior (up-down), and anteroposterior (back-and-forth) directions.
The results of the experiment lead us to the conclusion that FTVR users rely on stereopsis for depth perception in FTVR environments more than they do on motion parallax, especially when performing tasks requiring depth acuity. Our analysis of head position data supports the conclusion that head motion and motion parallax are not the most intuitive way for users to obtain depth information. We also find that participants perform better when the workspace is closer to them.
Based on our findings, we argue that head-coupled perspective displays should not be confused with FTVR displays, despite the misleading similarity of monocular videos often used to demonstrate head-coupled perspective display based systems and FTVR systems. We expect that the study results will be useful to future FTVR system developers. We recommend that FTVR developers carefully consider the role of stereopsis in their systems, especially when dealing with the user experience with respect to depth perception.
Lastly, the experiment in this study revolves mainly around users' viewing experience. However, we suspect that depth cues may take on different levels of importance when combined with particular interaction methods. In future work, we plan to further investigate the role of depth cues when used with direct and indirect interaction and manipulation techniques. We anticipate that such studies will help improve interaction modalities for FTVR systems and enhance our understanding of human factors in FTVR environments.
Acknowledgments
This research was supported by a grant from the Faculty of Engineering Research Fund at Thammasat University. We would also like to thank Takashi Komuro for comments on an earlier version of the manuscript.