## Abstract

Modeling stereo transparency with physiologically plausible mechanisms is challenging because in such frameworks, large receptive fields mix up overlapping disparities, whereas small receptive fields can reliably compute only small disparities. It seems necessary to combine information across scales. A coarse-to-fine disparity energy model, with both position- and phase-shift receptive fields, has already been proposed. However, because each scale decodes only one disparity for each location and uses the decoded disparity to select cells at the next scale, this model cannot represent overlapping surfaces at different depths. We have extended the model to solve stereo transparency. First, we introduce multiplicative connections from cells at one scale to the next to implement coarse-to-fine computation. The connection is the strongest when the presynaptic cell’s preferred disparity matches the postsynaptic cell’s position-shift parameter, encouraging the next scale to encode residual disparities with the more reliable phase-shift mechanism. This modification not only eliminates the artificial decoding and selection steps of the original model but also enables maintenance of complete population responses throughout the coarse-to-fine process. Second, because of this modification, explicit decoding is no longer necessary but rather is for visualization only. We use a simple threshold criterion to decode multiple disparities from population energy responses instead of a single disparity in the original model. We demonstrate our model using simulations on a variety of transparent and nontransparent stereograms. The model also reproduces psychophysically observed disparity interactions (averaging, thickening, attraction, and repulsion) as the depth separation between two overlapping planes varies.

## 1 Introduction

We can see overlapping surfaces at different depths in transparent random-dot stereograms (Julesz, 1971; Prazdny, 1985). Computationally, however, this so-called stereo transparency problem is difficult to solve with physiologically plausible methods such as the disparity energy model (Ohzawa, DeAngelis, & Freeman, 1990; Qian, 1994, 1997). On one hand, cells with large receptive fields (RFs) cover dots carrying different disparities, mixing them in the cells’ responses. On the other hand, cells with small RFs can reliably compute only small disparities; this is true even for position-shift RFs (Chen & Qian, 2004; also see section 4). Consequently, a model has to use RFs that are much smaller than distances between adjacent dots in a stereogram but much larger than the disparities involved. This requires that the disparities be much smaller than the distances between adjacent dots. The transparent random-dot stereogram in Figure 1, for example, violates this requirement, yet we can still perceive two transparent surfaces.

Models of stereo transparency often include nonbiological procedures to get around the above problem. For example, a large class of models follows Marr and Poggio (1976) by starting with a compatibility map that contains all possible matches between features in the two eyes and then introducing constraints to eliminate false matches (Prazdny, 1985; Pollard, Mayhew, & Frisby, 1985; Qian & Sejnowski, 1989; Zhaoping, 2002). Such models are nonphysiological because they do not use any reasonable RFs, and each unit of a compatibility map responds to only one potential match (Qian, 1997). If the compatibility map is replaced by disparity energy responses produced by realistic RFs, the Marr-Poggio style constraints cannot be applied because the energy responses are broadly distributed with multiple peaks (Qian, 1994; Chen & Qian, 2004; Assee & Qian, 2007).

In this study, we solve stereo transparency in the framework of the disparity energy model (Ohzawa et al., 1990; Qian, 1994). Since a single RF scale appears to be inadequate, it seems natural to combine information across scales. Intuitively, although a large scale may average overlapping stimulus disparities, the average could still be a good starting point for smaller scales to resolve multiple disparities. Conversely, a small scale alone cannot reliably compute large disparities but can use larger scales’ guidance to offset stimulus disparities with the position-shift component of RFs and compute the residual disparity of each surface with the more reliable phase-shift component (Chen & Qian, 2004). A coarse-to-fine version of the disparity energy model, with both position- and phase-shift RFs, has already been proposed (Chen & Qian, 2004) and successfully applied to nontransparent stereograms. However, each scale of this model decodes only a single disparity for each location and uses the decoded disparity to select cells in the next scale. Consequently, it cannot represent multiple, transparent surfaces at a location. We have now extended this model to solve stereo transparency and at the same time make it more biologically plausible by eliminating explicit decoding and selection during computation. Preliminary results have been presented in abstract form (Li & Qian, 2014).

## 2 Method

### 2.1 Coarse-to-Fine Disparity Energy Model

*k* (set to 2 in our simulations), and is the preferred spatial frequency. We keep and constant across scales to ensure scale-invariant RF shapes.

*d* and are the position- and phase-shift parameters, respectively. Another simple cell forming a quadrature pair with this cell has RFs given by The responses of these simple cells at position to the left and right images, and , are

*D* evenly divided between the two eyes, the response is approximately (when ; see the appendix) where

*A* is the Fourier amplitude of the local image patch. Thus, the cell’s preferred disparity is approximately

*d* and without mentioning and of differently oriented cells. Note that the orientation pooling occurs after the disparity energy responses are calculated in each orientation-specific channel. Therefore, the pooling scheme does not violate Mansfield and Parker's (1993) finding of an orientation-specific component in noise masking of stereo detection. Specifically, when the masking noise and the disparity signal are in the same orientation channel, the noise will greatly reduce the (quadratic) disparity energy responses, and consequently the pooled responses, and impair signal detection. However, when the noise and signal are in different orientation channels, the signal will produce large energy responses in one orientation channel, whereas the noise will produce small responses in a different orientation channel. Since the pooling is weighted by the responses, the impact of the noise will be smaller in this case.
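The disparity energy computation that this section builds on can be illustrated with a minimal one-dimensional sketch. It is not the letter's implementation: the Gabor form, the way the position shift `d` and phase shift `dphi` are split between the eyes, the sign conventions, and all parameter values are assumptions chosen so that the unit's preferred disparity is approximately `d + dphi / omega`. Orientation and pooling are omitted.

```python
import numpy as np

def complex_gabor_response(image, center, sigma, omega, phase):
    """Complex response of a 1D Gabor RF.

    The real and imaginary parts play the role of a quadrature pair.
    """
    x = np.arange(image.size) - center
    envelope = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return np.sum(image * envelope * np.exp(1j * (omega * x + phase)))

def disparity_energy(left, right, center, sigma, omega, d, dphi):
    """Energy of a binocular unit with position shift d and phase shift dphi.

    Assumed convention: the right RF is displaced by d pixels and the phase
    shift is split between the eyes, so the preferred disparity is roughly
    d + dphi / omega when the right image is the left image shifted by D.
    """
    resp_left = complex_gabor_response(left, center, sigma, omega, +dphi / 2.0)
    resp_right = complex_gabor_response(right, center + d, sigma, omega, -dphi / 2.0)
    return np.abs(resp_left + resp_right) ** 2

# Hypothetical stimulus: a 1D random-dot pattern with a uniform disparity D.
rng = np.random.default_rng(0)
left = rng.choice([0.0, 1.0], size=512)
D = 4
right = np.roll(left, D)  # right[x] = left[x - D]

sigma, omega = 16.0, 2.0 * np.pi / 32.0  # hypothetical RF scale and frequency
phases = np.linspace(-np.pi, np.pi, 16, endpoint=False)

# Position shift exactly offsets the stimulus disparity (d = D), so the
# phase-shift population should peak at dphi = 0.
energies = np.array(
    [disparity_energy(left, right, 256, sigma, omega, D, p) for p in phases]
)
best_phase = phases[np.argmax(energies)]
```

When the position shift fully offsets the stimulus disparity, the two monocular responses differ only by the phase shift, which is why the population peaks at zero phase shift in this sketch.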

Chen and Qian (2004) computed disparity at each location iteratively from large to small RF scales. Each scale selects cells whose position shift *d*’s are all equal to the disparity estimated in the previous scale and whose phase-shift ’s span the whole range of . Consequently, the position-shift RF component offsets stimulus disparity based on the current estimate, whereas the phase-shift RF component estimates any residual stimulus disparity. Therefore, at the end of the iteration, the most responsive cells have position shifts close to stimulus disparity and phase shifts close to 0. This strategy is adopted because the phase-shift RF component estimates stimulus disparity more reliably than the position-shift component when the disparity is made small by offsetting (Chen & Qian, 2004). Unlike the first coarse-to-fine stereo model of Marr and Poggio (1979) that offsets stimulus disparity globally with vergence, this model offsets stimulus disparity locally with the position-shift component of RFs (see Chen & Qian, 2004, for further details). The process is consistent with Menz and Freeman’s (2003) finding that when cells’ RF scales reduce, their preferred disparities do not change. Since the disparity range of the phase-shift component reduces with the scale, the cells must use a position-shift component to offset stimulus disparities and maintain the preferred disparities.
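The single-disparity iteration described above can be sketched with an idealized, noise-free tuning curve in place of real energy responses. This is a toy under stated assumptions, not Chen and Qian's code: the cosine tuning, the function name `coarse_to_fine_single`, the RF periods, and the starting position shift of 0 are all hypothetical.

```python
import numpy as np

def coarse_to_fine_single(stim_disparity, periods, n_phase=64):
    """Track one disparity estimate from coarse to fine scales.

    At each scale the position shift d is fixed to the previous estimate,
    the phase shift spans [-pi, pi), and the peak phase shift gives the
    residual disparity as dphi / omega.
    """
    phases = np.linspace(-np.pi, np.pi, n_phase, endpoint=False)
    d = 0.0  # assumed starting position shift
    history = []
    for period in periods:  # long period (coarse) to short period (fine)
        omega = 2.0 * np.pi / period
        # Idealized tuning: the response peaks where dphi = omega * (D - d).
        energy = 1.0 + np.cos(omega * (stim_disparity - d) - phases)
        dphi_peak = phases[np.argmax(energy)]
        d = d + dphi_peak / omega  # offset more of the stimulus disparity
        history.append(d)
    return history

estimates = coarse_to_fine_single(5.3, periods=[64, 32, 16, 8])
```

Because the phase shifts are sampled discretely, the residual error at each scale is bounded by half a phase step divided by that scale's frequency, so it shrinks as the scale becomes finer.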

As mentioned above, despite its successful application to various stereograms, Chen and Qian’s (2004) model cannot solve stereo transparency because each scale estimates only a single disparity at each location by finding the response peak of a population of disparity energy units and uses this disparity to select cells of the next scale. Figure 1 shows the simulation result of applying this model to a transparent random dot stereogram with two overlapping planes. The model can recover only one of the two disparities at each location rather than two overlapping planes that we perceive. It is also unclear how the selection procedure in the model could be implemented physiologically.

### 2.2 Connectivity Pattern

*d*_{pre}, , *d*_{post}, and , respectively. The connection strength is set to where is the preferred spatial frequency of the presynaptic cell. Thus, the connection is the strongest when the presynaptic cell’s overall preferred disparity (as determined by both its position and phase shifts) equals the postsynaptic cell’s position shift. This is illustrated in Figure 2. controls the spread of connections around the strongest connections. We used in our simulations, but other values work well too (see Figure 12). Note that the connections are local as equation 2.14 applies to cells tuned to each location . For simplicity, the above description uses the pooled responses indexed by *d* and . However, an equivalent description can be made with responses before pooling, which effectively combines the pooling and multiplication steps into one.

This pattern of connectivity encourages the next scale to use the position-shift RF component to offset the disparities estimated in the previous scale and to use the phase-shift RF component to estimate residual disparities (i.e., the differences between the actual disparities and their current estimates). It thus provides a physiologically plausible implementation of the coarse-to-fine computation in Chen and Qian (2004). Figure 3 shows an example of population responses without (top row) and with (bottom row) multiplicative gains for a fixed position in the transparent random dot stereogram of Figure 1. The two left-most panels (for the largest scale) are identical. However, at the finest scale, the responses with and without the coarse-to-fine connections are different. Specifically, the connections help reduce false peaks and enhance the correct peaks in the population responses. Moreover, the response peaks are more focused around zero phase shift, as intended in Chen and Qian’s (2004) coarse-to-fine model.
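The connectivity rule of this section can be sketched as follows. The Gaussian fall-off is an assumed stand-in for equation 2.14, chosen only because it is strongest when the presynaptic preferred disparity equals the postsynaptic position shift; the preferred-disparity expression `d + dphi / omega`, the names `connection_weights` and `coarse_gain`, the value of `sigma_w`, and the sampling grids are likewise hypothetical.

```python
import numpy as np

def connection_weights(d_pre, dphi_pre, omega_pre, d_post, sigma_w):
    """Weights from presynaptic (d, dphi) units to postsynaptic position shifts.

    Assumed Gaussian form: largest when the presynaptic preferred disparity
    d_pre + dphi_pre / omega_pre equals the postsynaptic position shift.
    """
    pref_pre = d_pre[:, None] + dphi_pre[None, :] / omega_pre    # (n_d, n_phi)
    diff = pref_pre[:, :, None] - d_post[None, None, :]          # (n_d, n_phi, n_post)
    return np.exp(-diff ** 2 / (2.0 * sigma_w ** 2))

def coarse_gain(pre_resp, weights):
    """Sum weighted presynaptic responses into one gain per postsynaptic d."""
    return np.tensordot(pre_resp, weights, axes=([0, 1], [0, 1]))

d_pre = np.arange(-8.0, 9.0)                               # position shifts (pixels)
dphi_pre = np.linspace(-np.pi, np.pi, 8, endpoint=False)   # phase shifts
d_post = np.arange(-8.0, 9.0)
weights = connection_weights(d_pre, dphi_pre, 2.0 * np.pi / 32.0, d_post, sigma_w=0.5)

# A single active coarse-scale unit with d = 2 and dphi = 0.
pre_resp = np.zeros((d_pre.size, dphi_pre.size))
pre_resp[10, 4] = 1.0
gain = coarse_gain(pre_resp, weights)

# Multiplicative modulation of the next scale's raw responses (n_post, n_phi).
post_raw = np.ones((d_post.size, dphi_pre.size))
post_resp = post_raw * gain[:, None]
```

Because the gain multiplies the whole population rather than selecting one set of cells, several coarse-scale peaks can each boost their own group of fine-scale cells, which is what allows more than one disparity to survive at a location.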

### 2.3 Decoding Multiple Disparities from Population Responses

*d*- space to perform the integration. A relative threshold as in equation 2.18 is also used to remove small noisy peaks.

Although this method integrates responses to reduce noise, it performs slightly worse than the first method. This is likely because the first method takes advantage of the fact that the energy units encode disparity most accurately when the RF position shifts correctly offset the stimulus disparities and thus the phase shifts of the most responsive cells are around 0 (Chen & Qian, 2004).
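A threshold-based readout of several disparities from one population response might look like the sketch below. It is only one plausible reading of the decoding step: it assumes the input is a 1D response profile over position shifts (for example, the responses of cells with near-zero phase shift), and the function name `decode_disparities`, the toy response, and the threshold value are hypothetical.

```python
import numpy as np

def decode_disparities(resp, d_values, rel_threshold=0.3):
    """Return the position shifts at local response maxima that exceed
    rel_threshold times the global maximum."""
    resp = np.asarray(resp, dtype=float)
    padded = np.concatenate(([-np.inf], resp, [-np.inf]))
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] >= padded[2:])
    strong = resp >= rel_threshold * resp.max()
    return d_values[is_peak & strong]

# Toy population response: two real peaks plus one small noisy peak.
d_values = np.arange(-6.0, 7.0)
resp = (
    np.exp(-(d_values + 3.0) ** 2 / 2.0)
    + 0.8 * np.exp(-(d_values - 2.0) ** 2 / 2.0)
    + 0.1 * np.exp(-(d_values - 6.0) ** 2 / 0.5)
)
decoded = decode_disparities(resp, d_values)
```

The relative threshold removes the small peak at the edge of the range while keeping both large peaks, so the number of decoded disparities is not fixed in advance.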

## 3 Results

We applied our extended model to a variety of stereograms using exactly the same set of parameters. Since the ground truth of the natural-image stereogram in Figure 9 represents near and far disparities as positive and negative, respectively, we use the same convention for all stereograms for consistency.

### 3.1 A Transparent Stereogram with Two Overlapping Fronto-Parallel Planes

We first applied the model to the same transparent random-dot stereogram as in Figure 1 (copied to the top panel of Figure 4). The true disparity map and the decoded disparity maps at each scale are shown at the bottom of Figure 4.

Note that 98.3% of all image positions have two decoded disparities, whereas positions have one decoded disparity and the position has more than two decoded disparities. Thus, the model correctly represented the two transparent planes in most positions. The decoded disparity values are also close to the true values: the root mean square (RMS) error is 0.2 pixel, compared with the 5-pixel separation between the two planes.
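The two summary statistics quoted above can be computed along the following lines. This is a sketch under assumptions: the letter does not state how decoded values are matched to the true planes, so each decoded value is compared with its nearest true disparity, and the function name and the toy data are hypothetical.

```python
import numpy as np

def evaluate_decoding(decoded, true_disparities):
    """Return (fraction of positions with the expected number of decoded
    disparities, RMS error of decoded values against the nearest true plane)."""
    true_disparities = np.asarray(true_disparities, dtype=float)
    n_expected = sum(len(values) == true_disparities.size for values in decoded)
    sq_errors = [
        np.min((true_disparities - value) ** 2)
        for values in decoded
        for value in values
    ]
    fraction = n_expected / len(decoded)
    rms = float(np.sqrt(np.mean(sq_errors)))
    return fraction, rms

# Toy data: three positions, two true planes 5 pixels apart.
decoded = [np.array([-2.3, 2.7]), np.array([-2.7, 2.3]), np.array([-2.5])]
fraction, rms = evaluate_decoding(decoded, [-2.5, 2.5])
```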

The small fluctuations of the decoded disparity values are likely attributable to the fact that our model is completely local, with separate estimation of disparities at each location. Interactions among different positions in higher-level surface representations would likely smooth out the fluctuations.

### 3.2 A Nontransparent Stereogram with a Floating Square

To ensure that our model works on nontransparent stereograms, we applied it to a standard random dot stereogram with a floating square. The result is shown in Figure 5. At the finest scale, our model correctly decoded the floating square.

### 3.3 A Transparent Stereogram with a Floating Square

Next, we tested a transparent version of the standard stereogram in the previous example: we added an overlapping background for the central floating square. This is an interesting test because unlike the uniform transparent stereogram in Figure 4, this stereogram has depth boundaries in addition to transparency. Additionally, the dot density in the central square region is twice that in the surround region. Nevertheless, the model with the fixed set of parameters works well. The results are shown in Figure 6.

### 3.4 A Nontransparent Stereogram with a Slanted Plane

A problem with Marr and Poggio’s (1976) model and related models is that they have difficulty with slanted planes because they consider a small number of fronto-parallel planes and include strong interactions within each plane. In contrast, Chen and Qian’s (2004) coarse-to-fine disparity energy model can compute disparity maps from nontransparent stereograms with slanted planes. We therefore also tested our extension on a nontransparent stereogram with a slanted plane. The result is shown in Figure 7.

### 3.5 A Transparent Stereogram with Overlapping Slanted Planes

We tested a transparent version of the previous stereogram, namely, a transparent stereogram with two overlapping slanted planes. The result is shown in Figure 8.

### 3.6 A Natural Image Stereogram

### 3.7 Disparity Attraction and Repulsion in Transparent Stereograms

Disparities of a few isolated features appear to attract or repel each other depending on the features’ lateral separations (Westheimer, 1986; Westheimer & Levi, 1987). Mikaelian and Qian (2000) applied the disparity energy model to explain this observation. A similar phenomenon occurs for transparent stereograms: disparities of two overlapping planes appear to attract or repel each other depending on the depth separation between the planes (Parker & Yang, 1989; Stevenson, Cormack, & Schor, 1989). Specifically, when the depth separation is small, the two planes appear to merge as a single plane with the average disparity. With increasing separation, the stimulus looks like a thickened slab, a perception termed *pyknostereopsis*. Further depth separation produces two transparent planes with an exaggerated depth separation between them. Finally, at even greater depth separations, the perceived separation between the two planes becomes veridical.

Our model reproduces these observations as shown in Figure 10. We applied our model to a transparent random dot stereogram with various disparity separations between two overlapping planes. The disparities of the two planes always have the same magnitude but opposite signs. In the top panel of Figure 10, each column is a gray-scale histogram (compiled from all positions of the stereogram) of the decoded disparity values for each actual disparity separation between the planes. Brighter colors represent more frequently decoded values. The two actual disparities are indicated by the two dashed black lines. Similar to our perception, the model requires a minimum disparity separation (threshold) between the planes to decode two disparities. This threshold depends on the model’s finest RF scale. Also similar to our perception, the model produces a thickened slab during the transition from decoding one plane to two planes.

Averaging two disparities into one may be viewed as an extreme case of attraction between the two disparities. To examine disparity interactions generally, we plot in the bottom panel of Figure 10 the decoded disparity separation against the actual disparity separation between the two planes (open circles). This was done by searching for the peaks in the histogram of the top panel around the actual disparity values and then subtracting the two peak disparities. The dashed line in the bottom panel marks the equality between the decoded and actual disparity separations. The model predicts smaller-than-actual separations, larger-than-actual separations, and veridical separations as the actual separation increases, in agreement with the observation of Stevenson, Cormack, and Schor (1991).
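The peak-search procedure for the decoded separation can be sketched as below. The bin width, histogram range, search window, and function name are assumptions for illustration, and the toy data are constructed to show a slight repulsion.

```python
import numpy as np

def decoded_separation(decoded_values, true_pair, bin_width=0.25, search=1.5):
    """Histogram all decoded disparities, find the most populated bin near
    each true disparity, and return the difference of the two peak locations."""
    edges = np.arange(-10.0, 10.0 + bin_width, bin_width)
    counts, edges = np.histogram(decoded_values, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks = []
    for true_value in true_pair:
        near = np.abs(centers - true_value) <= search
        peaks.append(centers[near][np.argmax(counts[near])])
    return peaks[1] - peaks[0]

# Toy decoded values clustered slightly outside the true disparities of -2 and +2.
values = np.concatenate([np.full(50, -2.1), np.full(50, 2.1), np.array([0.0, 0.3])])
separation = decoded_separation(values, true_pair=(-2.0, 2.0))
```

In this toy case the decoded separation (4.25 pixels at the chosen bin resolution) exceeds the actual separation of 4 pixels, the pattern the text describes as repulsion.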

We also investigated how, at small disparity separations, the averaged disparity of two overlapping planes is weighted by the contrasts of the dots for the planes. We applied our model to a transparent random dot stereogram with two planes having pixel of disparities but various contrast ratios between the dots of the two planes. The decoded disparity is close to the average disparity weighted by the contrasts but with an S-shaped bias (see Figure 11, left), in agreement with the observation in a related experiment (Rogers & Anstis, 1975).

In addition to contrasts, we also varied the dot density ratio between the two planes. The decoded disparity is very close to the average disparity weighted by the dot densities (see Figure 11, right). This is a prediction that could be tested psychophysically.
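For reference, the weighted average against which the decoded disparity is compared can be written as follows. The symbols are ours rather than the letter's: $D_1$ and $D_2$ are the disparities of the two planes, and $w_1$ and $w_2$ stand for either their dot contrasts or their dot densities.

$$
\bar{D} = \frac{w_1 D_1 + w_2 D_2}{w_1 + w_2}
$$

As a hypothetical worked case, with $D_1 = -1$ pixel, $D_2 = +1$ pixel, and a 3:1 weight ratio, $\bar{D} = (3 \cdot (-1) + 1 \cdot 1)/4 = -0.5$ pixel, so the average is pulled toward the more heavily weighted plane.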

### 3.8 Dependence on Two Key Parameters

Our extension introduced two new parameters, and we examined how the model performance depends on them. They are the spread of the connectivity pattern characterized by in equation 2.14 and the relative threshold for eliminating noisy small peaks in decoding in equation 2.18.

For the transparent stereogram with two fronto-parallel planes in Figure 4, the left panel of Figure 12 shows the proportion of positions with two decoded disparities as a function of and . The curve in the density plot indicates the optimal combination of the two parameters. When , optimal increases quickly as increases. This suggests that as the connections for coarse-to-fine computation are more spread out from the intended ones, the ratio of noisy small peaks to real peaks in population responses becomes larger. For small , a broad range of produces similarly good performances. The standard and used in our simulations are 0.1 pixel and 0.3 (indicated by a star in the figure).

The right panel of Figure 12 shows the decoding RMS error as a function of (with the optimal for each ). The model performance does not vary much as long as is smaller than of the finest scale (2 pixels in our simulations). These results explain why a single parameter set works well for all stereograms in this letter.

## 4 Discussion

We extended Chen and Qian’s (2004) coarse-to-fine disparity energy model to solve the difficult problem of stereo transparency with biologically plausible mechanisms. In the original model, a given scale decodes a single disparity for each location and uses this disparity to select a set of cells for the next scale. We replaced this artificial selection procedure with multiplicative connections from one scale to the next. The connectivity pattern provides a biologically plausible mechanism to achieve the original model’s goal of using cells’ position-shift RF component to offset stimulus disparities and the more reliable phase-shift RF component to estimate residual disparities. More important, whereas each scale of the original model commits to a single decoded disparity at each location, the new model maintains the entire population responses during the coarse-to-fine computation. Consequently, unlike the original model, explicit disparity decoding at each scale is unnecessary for the new model. We can still decode the population responses at each scale for the sole purpose of visualization as we did in this letter. This leads to our second extension: we used a simple threshold criterion capable of decoding multiple disparities instead of single-disparity decoding in the original model. We demonstrated through computer simulations, with a single parameter set, that these extensions allow our model to solve various transparent and nontransparent stereograms in a biologically plausible way. Finally, our model explains disparity interactions (averaging, thickening, attraction, and repulsion) as the separation between two overlapping planes varies.

Both Chen and Qian’s (2004) model and our current extension use the position-shift RF component to offset estimated stimulus disparities and the phase-shift component to estimate the residual disparities. Consequently, at the end of computation, the most responsive cells have position shifts near stimulus disparities and phase shifts near 0. As we noted, this strategy is based on the finding that the phase-shift population response is more reliable than the position-shift population response for disparity computation (Chen & Qian, 2004; Tsang & Shi, 2004). The analysis in the appendix shows that this remains true when stimulus disparity is divided evenly between the two eyes. Position shifts are needed to properly place the limited disparity range of phase shifts. Also note that Read and Cumming (2007) follow Chen and Qian (2004) in searching for the cells whose position shift offsets stimulus disparity and whose phase shift is near 0, albeit with a different algorithm.

It is easy to understand why position-shift RFs are generally less reliable than the phase-shift RFs. Consider disparity encoding at a given location by a set of energy units with a range of preferred disparities. If the units have phase-shift RFs, then the RFs of all the units cover the same left and right image patches. Consequently, variations in the units’ responses are attributable to their different tuning properties. In contrast, if the units have position-shift RFs, then different units cover different left and right image patches, which introduces additional variability in the population responses.

We mentioned in section 1 that cells with small RFs can reliably compute only small disparities. This is easy to understand for phase-shift RFs because phase shift is periodic, and disparity representation is unambiguous only for phase shifts within the range (Qian, 1994). One might argue that because position shift is not periodic, position-shift RFs could represent arbitrarily large disparities. However, this is not the case for the reason discussed. Specifically, by definition, cells with different position shifts are located at different positions. When their RFs are small, they more likely cover completely different image regions. Thus, spatial variations of image properties (e.g., contrast, frequency content, local features such as orientation) may overwhelm the disparity-related signals in population responses.

How does our extended coarse-to-fine disparity energy model solve the stereo transparency problem? We define residual disparity as the difference between an actual stimulus disparity and its current estimate. At the largest scale, cells’ RFs cover many dots carrying different disparities, and thus the most responsive cells are likely those tuned to the average of the stimulus disparities (see Figures 3 and 4). Because of the connectivity pattern, these cells will excite the cells in the next scale whose position-shift components are close to the average disparity. With the offsetting of the average disparity by the position shifts, the cells of the next scale with smaller RFs can better represent the residual disparities with their phase shifts. This process is then repeated to gradually offset more of the stimulus disparities and reduce the residual disparities. At the smallest scale, the most active cells are the ones whose position shifts are close to one of the actual stimulus disparities and whose phase-shift components are near 0 (because the residual disparities are close to 0).

Our model makes specific predictions. There is physiological and psychophysical evidence for coarse-to-fine disparity processing in biological vision (Menz & Freeman, 2003; Smallman & MacLeod, 1994; Wilson, Blake, & Halpern, 1991; Rohaly & Wilson, 1993). Our model suggests a specific implementation of this computation, namely, that the connections from cells with larger RFs to those with smaller RFs are the strongest when a presynaptic cell’s overall preferred disparity (as determined by both its position and phase shifts) matches the postsynaptic cell’s position shift. A second prediction is that the smallest disparity separation between two transparent surfaces that can be resolved perceptually is determined by the RF sizes of the finest scale in the coarse-to-fine process. This could be tested by examining whether the smallest resolvable disparity separation increases with retinal eccentricity. Our model also predicts that disparity averaging should be weighted by dot densities (see Figure 11).

In conclusion, we have extended Chen and Qian’s (2004) coarse-to-fine disparity energy model to solve the difficult problem of stereo transparency with biologically plausible mechanisms. The model uses both position-shift and phase-shift RF components and works well on a variety of transparent and nontransparent stereograms. Although large-scale cells tend to average stimulus disparities and small-scale cells cannot compute large stimulus disparities, combining information through the coarse-to-fine process solves the transparency problem. Our model also makes specific predictions on connectivity between disparity-tuned cells of different scales and on our perception of stereo transparency.

### Appendix: Derivation and Implementation

#### A.1 Quadrature Pair Responses and Preferred Disparities

The derivations here are similar to our previous derivations (Chen & Qian, 2004) but with stimulus disparities evenly divided between the two eyes’ oriented RFs with both position and phase shifts.

*I*(*x*, *y*) with disparity *D*, the images for the two eyes are Without loss of generality, for position (0, 0), equations 2.6 and 2.7 become in which *x*_{1}, *y*_{1}, *x*_{2}, *y*_{2} are rotated coordinates defined as Therefore, the quadrature-pair response is with

*x* as where is the RF aspect ratio. The Fourier component at frequency of *I*_{1} and *I*_{2} is With these notations, along with , the complex cell response is an approximation to the second order of . If the stimulus disparity *D* is largely offset by cells’ position shift *d*, then the second term is small, and the cells’ preferred disparity is determined by the first term, resulting in equation 2.10 in the text.

Equation A.16 also demonstrates that phase-shift population responses (from cells with a fixed *d* but a full range of ) are more reliable than position-shift population responses (from cells with a fixed but a range of *d*) even when disparity is evenly divided between the two eyes. Specifically, the second term of equation A.16 can be made small when *D* is largely offset by a fixed *d*, and the cells with this *d* and the full range of have a reliable peak determined by the first term. In contrast, the second term cannot always be small for a fixed and a range of *d*, contaminating the first term. Also note that when , the position-shift population response is symmetric around *d*−*D* (Read & Cumming, 2007). However, this symmetry holds only for the special case of uniform disparity.

#### A.2 Disparity Decoding in Discrete Form

We explain the detailed implementation of disparity decoding. As mentioned in section 2.3, we aim to find satisfying equations 2.16 to 2.18. We can only approximately achieve this goal since the population responses are sampled from cells with a discrete set of parameters *d* and .

## Acknowledgments

We thank Li Zhaoping for her support and helpful discussions. This work was supported by Tsinghua University 985 grant (Li Zhaoping) and Irving Weinstein Foundation (NQ).

## References

*Solving stereo transparency with an extended coarse-to-fine disparity energy model*.