Abstract

Modeling stereo transparency with physiologically plausible mechanisms is challenging because in such frameworks, large receptive fields mix up overlapping disparities, whereas small receptive fields can reliably compute only small disparities. It seems necessary to combine information across scales. A coarse-to-fine disparity energy model, with both position- and phase-shift receptive fields, has already been proposed. However, because each scale decodes only one disparity for each location and uses the decoded disparity to select cells at the next scale, this model cannot represent overlapping surfaces at different depths. We have extended the model to solve stereo transparency. First, we introduce multiplicative connections from cells at one scale to the next to implement coarse-to-fine computation. The connection is the strongest when the presynaptic cell’s preferred disparity matches the postsynaptic cell’s position-shift parameter, encouraging the next scale to encode residual disparities with the more reliable phase-shift mechanism. This modification not only eliminates the artificial decoding and selection steps of the original model but also enables maintenance of complete population responses throughout the coarse-to-fine process. Second, because of this modification, explicit decoding is no longer necessary but rather is for visualization only. We use a simple threshold criterion to decode multiple disparities from population energy responses instead of a single disparity in the original model. We demonstrate our model using simulations on a variety of transparent and nontransparent stereograms. The model also reproduces psychophysically observed disparity interactions (averaging, thickening, attraction, and repulsion) as the depth separation between two overlapping planes varies.

1  Introduction

We can see overlapping surfaces at different depths in transparent random-dot stereograms (Julesz, 1971; Prazdny, 1985). Computationally, however, this so-called stereo transparency problem is difficult to solve with physiologically plausible methods such as the disparity energy model (Ohzawa, DeAngelis, & Freeman, 1990; Qian, 1994, 1997). On one hand, cells with large receptive fields (RFs) cover dots carrying different disparities, mixing them in the cells’ responses. On the other hand, cells with small RFs can reliably compute only small disparities; this is true even for position-shift RFs (Chen & Qian, 2004; also see section 4). Consequently, a model has to use RFs that are much smaller than distances between adjacent dots in a stereogram but much larger than the disparities involved. This requires that the disparities be much smaller than the distances between adjacent dots. The transparent random-dot stereogram in Figure 1, for example, violates this requirement, yet we can still perceive two transparent surfaces.

Figure 1:

Chen and Qian’s (2004) model applied to a transparent random dot stereogram (top row) with two overlapping planes of 3 and −2 pixels of disparities, respectively. The model can decode only one disparity at each position, resulting in a patch-wise map (bottom row) of the two actual disparities.

Models of stereo transparency often include nonbiological procedures to get around the above problem. For example, a large class of models follows Marr and Poggio (1976) by starting with a compatibility map that contains all possible matches between features in the two eyes and then introducing constraints to eliminate false matches (Prazdny, 1985; Pollard, Mayhew, & Frisby, 1985; Qian & Sejnowski, 1989; Zhaoping, 2002). Such models are nonphysiological because they do not use any reasonable RFs, and each unit of a compatibility map responds to only one potential match (Qian, 1997). If the compatibility map is replaced by disparity energy responses produced by realistic RFs, the Marr-Poggio style constraints cannot be applied because the energy responses are broadly distributed with multiple peaks (Qian, 1994; Chen & Qian, 2004; Assee & Qian, 2007).

In this study, we solve stereo transparency in the framework of the disparity energy model (Ohzawa et al., 1990; Qian, 1994). Since a single RF scale appears to be inadequate, it seems natural to combine information across scales. Intuitively, although a large scale may average overlapping stimulus disparities, the average could still be a good starting point for smaller scales to resolve multiple disparities. Conversely, a small scale alone cannot reliably compute large disparities but can use larger scales’ guidance to offset stimulus disparities with the position-shift component of RFs and compute the residual disparity of each surface with the more reliable phase-shift component (Chen & Qian, 2004). A coarse-to-fine version of the disparity energy model, with both position- and phase-shift RFs, has already been proposed (Chen & Qian, 2004) and successfully applied to nontransparent stereograms. However, each scale of this model decodes only a single disparity for each location and uses the decoded disparity to select cells in the next scale. Consequently, it cannot represent multiple, transparent surfaces at a location. We have now extended this model to solve stereo transparency and at the same time make it more biologically plausible by eliminating explicit decoding and selection during computation. Preliminary results have been presented in abstract form (Li & Qian, 2014).

2  Method

2.1  Coarse-to-Fine Disparity Energy Model

We first briefly describe Chen and Qian’s coarse-to-fine disparity energy model and then explain our extensions. The model employs hybrid binocular cells with both position and phase shifts between the two eyes’ RFs (Zhu & Qian, 1996; Ohzawa, DeAngelis, & Freeman, 1997; Anzai, Ohzawa, & Freeman, 1997, 1999; Livingstone & Tsao, 1999; Prince, Cumming, & Parker, 2002). For convenience, we first define a Gabor function with orientation (measured from horizontal) as
formula
2.1
where is rotated by angle , characterizes the spatial scale, determines the RF aspect ratio k (set to 2 in our simulations), and is the preferred spatial frequency. We keep and constant across scales to ensure scale-invariant RF shapes.
The left and right RFs of a simple cell are then given by
formula
2.2
formula
2.3
where d and are the position- and phase-shift parameters, respectively. Another simple cell forming a quadrature pair with this cell has RFs given by
formula
2.4
formula
2.5
The responses of these simple cells at position to the left and right images, and , are
formula
2.6
formula
2.7
The energy response of the complex cell receiving inputs from this quadrature pair of simple cells is then
formula
2.8
For a stimulus with disparity D evenly divided between the two eyes, the response is approximately (when ; see the appendix)
formula
2.9
where A is the Fourier amplitude of local image patch. Thus, the cell’s preferred disparity is approximately
formula
2.10
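As an illustration of equations 2.2 to 2.10, the following is a minimal one-dimensional Python sketch of a hybrid disparity energy unit. It is not the simulation code of this letter: it drops the orientation and normalization of the Gabor function, places the whole stimulus disparity in the left image instead of dividing it between the two eyes, and assumes that the preferred disparity takes the form position shift plus phase shift divided by preferred frequency. The names `gabor`, `simple_response`, `energy`, `preferred_disparity`, and `make_stereogram`, the symbols `dphi` (phase shift) and `omega` (preferred spatial frequency), and all numerical values are assumptions made for the example.

```python
import math
import random


def gabor(x, sigma, omega, phase):
    """1D Gabor profile: Gaussian envelope times a cosine carrier (unnormalized)."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) * math.cos(omega * x + phase)


def simple_response(left, right, x0, d, dphi, phase, sigma, omega):
    """Binocular simple cell centered at x0 with position shift d and phase shift dphi."""
    total = 0.0
    for x in range(len(left)):
        # Left RF: displaced by d, carrier phase lowered by dphi / 2.
        total += gabor(x - x0 - d, sigma, omega, phase - dphi / 2.0) * left[x]
        # Right RF: not displaced, carrier phase raised by dphi / 2.
        total += gabor(x - x0, sigma, omega, phase + dphi / 2.0) * right[x]
    return total


def energy(left, right, x0, d, dphi, sigma, omega):
    """Complex-cell energy: sum of squares of a quadrature pair of simple cells."""
    s1 = simple_response(left, right, x0, d, dphi, 0.0, sigma, omega)
    s2 = simple_response(left, right, x0, d, dphi, math.pi / 2.0, sigma, omega)
    return s1 * s1 + s2 * s2


def preferred_disparity(d, dphi, omega):
    """Approximate preferred disparity of a hybrid cell (assumed form of eq. 2.10)."""
    return d + dphi / omega


def make_stereogram(n, disparity, seed=0):
    """Random +/-1 pattern; the left image is the right image shifted by `disparity` pixels."""
    rng = random.Random(seed)
    right = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    left = [right[(x - disparity) % n] for x in range(n)]  # circular shift
    return left, right


if __name__ == "__main__":
    sigma, omega = 4.0, math.pi / 8.0  # hypothetical scale and frequency
    left, right = make_stereogram(128, disparity=3, seed=1)
    for dphi in (-math.pi / 2.0, 0.0, math.pi / 2.0):
        e = energy(left, right, 64, 3.0, dphi, sigma, omega)
        print(f"d=3, dphi={dphi:+.2f}: energy={e:.3f}")
```

When the position shift equals the stimulus disparity, the two RFs in this sketch sample identical image patches, so the energy varies with the phase shift as the squared cosine of half the phase shift and is largest at zero phase shift.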
To improve performance, Chen and Qian (2004) pooled energy responses across orientation and space according to
formula
2.11
where the five orientations are
formula
2.12
ensures that the pooled cells of different orientations have the same preferred disparity, and the spatial pooling kernel for scale is
formula
2.13
At each scale and image location, we will index the pooled responses by d and without mentioning and of differently oriented cells. Note that the orientation pooling occurs after the disparity energy responses are calculated in each orientation-specific channel. Therefore, the pooling scheme does not violate Mansfield and Parker's (1993) finding of an orientation-specific component in noise masking of stereo detection. Specifically, when the masking noise and the disparity signal are in the same orientation channel, the noise will greatly reduce the (quadratic) disparity energy responses, and consequently the pooled responses, and impair signal detection. However, when the noise and signal are in different orientation channels, the signal will produce large energy responses in one orientation channel, whereas the noise will produce small responses in a different orientation channel. Since the pooling is weighted by the responses, the impact of the noise will be smaller in this case.

Chen and Qian (2004) computed disparity at each location iteratively from large to small RF scales. Each scale selects cells whose position shift d’s are all equal to the disparity estimated in the previous scale and whose phase-shift ’s span the whole range of . Consequently, the position-shift RF component offsets stimulus disparity based on the current estimate, whereas the phase-shift RF component estimates any residual stimulus disparity. Therefore, at the end of the iteration, the most responsive cells have position shifts close to stimulus disparity and phase shifts close to 0. This strategy is adopted because the phase-shift RF component estimates stimulus disparity more reliably than the position-shift component when the disparity is made small by offsetting (Chen & Qian, 2004). Unlike the first coarse-to-fine stereo model of Marr and Poggio (1979) that offsets stimulus disparity globally with vergence, this model offsets stimulus disparity locally with the position-shift component of RFs (see Chen & Qian, 2004, for further details). The process is consistent with Menz and Freeman’s (2003) finding that when cells’ RF scales reduce, their preferred disparities do not change. Since the disparity range of the phase-shift component reduces with the scale, the cells must use a position-shift component to offset stimulus disparities and maintain the preferred disparities.

As mentioned above, despite its successful application to various stereograms, Chen and Qian’s (2004) model cannot solve stereo transparency because each scale estimates only a single disparity at each location by finding the response peak of a population of disparity energy units and uses this disparity to select cells of the next scale. Figure 1 shows the simulation result of applying this model to a transparent random dot stereogram with two overlapping planes. The model can recover only one of the two disparities at each location rather than two overlapping planes that we perceive. It is also unclear how the selection procedure in the model could be implemented physiologically.

2.2  Connectivity Pattern

We therefore extended Chen and Qian’s (2004) model to resolve the above problems. The first extension is to replace the artificial selection procedure by multiplicative connections from large to small scales. Let the position- and phase-shift parameters of pre- and postsynaptic cells be dpre, , dpost, and , respectively. The connection strength is set to
formula
2.14
where the frequency term is the preferred spatial frequency of the presynaptic cell. Thus, the connection is strongest when the presynaptic cell’s overall preferred disparity (as determined by both its position and phase shifts) equals the postsynaptic cell’s position shift. This is illustrated in Figure 2. The spread parameter in equation 2.14 controls how widely the connections extend around the strongest ones. We used a spread of 0.1 pixel in our simulations, but other values work well too (see Figure 12). Note that the connections are local because equation 2.14 applies separately to the cells tuned to each image location. For simplicity, the above description uses the pooled responses indexed by the position- and phase-shift parameters. However, an equivalent description can be made with responses before pooling, which effectively combines the pooling and multiplication steps into one.
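The connectivity rule can be sketched in a few lines of Python. The text specifies only where the connection is strongest and that a parameter controls its spread; the Gaussian profile used below is an assumption made for illustration, as are the function name `connection_weight` and its argument names.

```python
import math


def connection_weight(d_pre, dphi_pre, omega_pre, d_post, spread):
    """Weight from a presynaptic (coarser) cell to a postsynaptic (finer) cell.

    Assumed Gaussian profile: maximal (1.0) when the presynaptic preferred
    disparity d_pre + dphi_pre / omega_pre equals the postsynaptic position shift.
    The weight does not depend on the postsynaptic phase shift.
    """
    mismatch = d_pre + dphi_pre / omega_pre - d_post
    return math.exp(-mismatch * mismatch / (2.0 * spread * spread))


if __name__ == "__main__":
    omega = math.pi / 8.0  # hypothetical presynaptic preferred frequency
    # Three presynaptic cells whose total preferred disparity is 0
    # (they lie on a negative diagonal in the position-shift/phase-shift plane).
    for d_pre, dphi_pre in ((1.0, -omega), (0.0, 0.0), (-1.0, omega)):
        w = connection_weight(d_pre, dphi_pre, omega, d_post=0.0, spread=0.1)
        print(d_pre, dphi_pre, w)
```

All three presynaptic cells in the usage example project with full strength to postsynaptic cells whose position shift is zero, whatever the postsynaptic phase shift.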
Figure 2:

(Left) Schematic drawing of the multiplicative connections from cells of a larger scale to cells of the next smaller scale (see equation 2.14). For each scale and image location, the cells are indexed by their position-shift and phase-shift parameters. To avoid clutter, only the strongest connections from three presynaptic cells to three postsynaptic cells are shown. The three presynaptic cells lie on a negative diagonal line and thus have the same total preferred disparity (see equation 2.10). The three postsynaptic cells have the same position shift equal to the presynaptic cells’ total preferred disparity. Each cell’s RFs also receive inputs from stimuli (not shown) to compute energy responses. (Right) The actual connection weights from all cells of the fourth scale to a cell of the fifth scale with zero position-shift parameter. Therefore, the presynaptic cells with a total preferred disparity of zero have the strongest connections. In this example, we let pixel in equation 2.14, but other values work well too (see Figure 12).

The final response of a cell is a multiplication of its energy response to the stimulus and the total gain it receives from the previous scale. Similar to the iteration in Chen and Qian (2004), the response is locally determined. For each position , denote the energy response after spatial and orientation pooling as as in equation 2.11 and the activity of each cell after the gain multiplication as , then,
formula
2.15
where is a constant specifying the ratio of two adjacent scales. As in Chen and Qian (2004), we let and used five scales with equal to 8, 5.7, 4, 2.8, and 2 pixels, respectively. For the largest scale, .
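The gain multiplication across scales can be sketched as follows. This is a simplified reading of equation 2.15, not a reproduction of it: the gain is taken to be the weighted sum of the previous scale's activities, the connection weights are assumed Gaussian, the largest scale is assumed to pass its energy responses through unchanged, and the scale-ratio constant is omitted. The names `connection_weight` and `propagate` and the dictionary layout are hypothetical.

```python
import math


def connection_weight(d_pre, dphi_pre, omega_pre, d_post, spread):
    """Assumed Gaussian weight, strongest when the presynaptic preferred
    disparity matches the postsynaptic position shift."""
    mismatch = d_pre + dphi_pre / omega_pre - d_post
    return math.exp(-mismatch * mismatch / (2.0 * spread * spread))


def propagate(energies_by_scale, omegas, spread):
    """Coarse-to-fine gain multiplication at one image location.

    energies_by_scale: list (coarse to fine) of dicts {(d, dphi): pooled energy}.
    omegas: preferred spatial frequency of each scale, in the same order.
    Returns the list of activity dicts after gain multiplication.
    """
    activity = dict(energies_by_scale[0])  # largest scale: no gain applied
    history = [activity]
    for s in range(1, len(energies_by_scale)):
        omega_pre = omegas[s - 1]
        finer = {}
        for (d_post, dphi_post), e in energies_by_scale[s].items():
            # Total gain received from all cells of the previous (coarser) scale.
            gain = sum(
                connection_weight(d_pre, dphi_pre, omega_pre, d_post, spread) * r
                for (d_pre, dphi_pre), r in activity.items()
            )
            finer[(d_post, dphi_post)] = e * gain
        activity = finer
        history.append(activity)
    return history


if __name__ == "__main__":
    coarse = {(2.0, 0.0): 1.0, (0.0, 0.0): 0.1}  # coarse scale favors disparity 2
    fine = {(2.0, 0.0): 1.0, (0.0, 0.0): 1.0}    # fine scale alone is ambiguous
    result = propagate([coarse, fine], [math.pi / 8.0, math.pi / 5.6], spread=0.1)
    print(result[1])
```

In the usage example the fine scale's own energies do not distinguish the two candidate position shifts; the gain inherited from the coarse scale does, without any explicit decoding or cell selection.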

This pattern of connectivity encourages the next scale to use the position-shift RF component to offset the disparities estimated in the previous scale and to use the phase-shift RF component to estimate residual disparities (i.e., the differences between the actual disparities and their current estimates). It thus provides a physiologically plausible implementation of the coarse-to-fine computation in Chen and Qian (2004). Figure 3 shows an example of population responses without (top row) and with (bottom row) multiplicative gains for a fixed position in the transparent random dot stereogram of Figure 1. The two left-most panels (for the largest scale) are identical. However, at the finest scale, the responses with and without the coarse-to-fine connections are different. Specifically, the connections help reduce false peaks and enhance the correct peaks in the population responses. Moreover, the response peaks are more focused around zero phase shift, as intended in Chen and Qian’s (2004) coarse-to-fine model.

Figure 3:

The energy responses (top) and the responses multiplied by the coarse-to-fine gains (bottom) at a fixed position in the transparent random dot stereogram of Figure 1. Different columns show results from different scales. In each panel, the horizontal axis represents the cells’ phase-shift parameter (divided by the preferred spatial frequency to convert to disparity) and the vertical axis represents their position-shift parameter d. Dotted lines indicate combinations of phase and position shifts that equal the true disparities of the stimulus.

2.3  Decoding Multiple Disparities from Population Responses

Our second extension is to replace the single-disparity decoding in Chen and Qian (2004) by multidisparity decoding. For each scale and location, the decoding finds all reliable peaks in the population responses of cells with various position- and phase-shift parameters. Denote the population response at scale and position as . Since the coarse-to-fine computation aims to use RF position shifts to offset stimulus disparities computed by the RF phase shifts so that at the end, the most responsive cells have near 0 (Chen & Qian, 2004), the decoding method should find all s that satisfy
formula
2.16
formula
2.17
To eliminate noisy small peaks, we require
formula
2.18
where the threshold parameter is expressed as a fraction of the highest peak. We set it to 0.3, but its exact value is not important (see Figure 12). In our implementation, we used parabolic interpolation to locate the peaks. (More details are described in the appendix.)
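The thresholded peak search can be sketched as below. The sketch assumes that decoding operates on a one-dimensional slice of the population response, sampled along the position-shift axis at zero phase shift; equations 2.16 and 2.17 are represented only by a discrete local-maximum test, and the function name `decode_peaks`, the sampling grid, and the test profile are hypothetical. The default threshold of 0.3 follows the standard value quoted in section 3.8.

```python
import math


def decode_peaks(d_values, responses, rel_threshold=0.3):
    """Return interpolated peak locations of a sampled response profile.

    d_values: evenly spaced position shifts (assumed slice at zero phase shift).
    responses: population responses sampled at d_values.
    rel_threshold: peaks below this fraction of the highest sample are dropped.
    """
    top = max(responses)
    decoded = []
    for i in range(1, len(responses) - 1):
        y0, y1, y2 = responses[i - 1], responses[i], responses[i + 1]
        is_local_max = y1 > y0 and y1 >= y2
        if is_local_max and y1 >= rel_threshold * top:
            # Vertex of the parabola through the three samples around the peak.
            curvature = y0 - 2.0 * y1 + y2  # strictly negative at a local maximum
            offset = 0.5 * (y0 - y2) / curvature
            step = d_values[i + 1] - d_values[i]
            decoded.append(d_values[i] + offset * step)
    return decoded


if __name__ == "__main__":
    d_values = [-6.0 + 0.5 * i for i in range(25)]
    # Hypothetical profile: two real peaks near -2 and 3 plus a small noisy bump at 0.5.
    responses = [
        math.exp(-(d + 2.0) ** 2 / 2.0)
        + 0.8 * math.exp(-(d - 3.0) ** 2 / 2.0)
        + 0.1 * math.exp(-(d - 0.5) ** 2 / 0.02)
        for d in d_values
    ]
    print(decode_peaks(d_values, responses))
```

In the usage example the small bump at 0.5 pixel is a genuine local maximum, but it falls below 30% of the highest peak and is therefore discarded, leaving two decoded disparities.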
We also tried another decoding method, first integrating responses of the cells with the same preferred disparity (see equation 2.10),
formula
2.19
and then finding local maxima of as the decoded disparity . We applied 2D interpolation in the d- space to perform the integration. A relative threshold as in equation 2.18 is also used to remove small noisy peaks.

Although this method integrates responses to reduce noise, it performs slightly worse than the first method. This is likely because the first method takes advantage of the fact that the energy units encode disparity most accurately when the RF position shifts correctly offset the stimulus disparities and thus the phase shifts of the most responsive cells are around 0 (Chen & Qian, 2004).

3  Results

We applied our extended model to a variety of stereograms using exactly the same set of parameters. Since the ground truth of the natural-image stereogram in Figure 9 represents near and far disparities as positive and negative, respectively, we use the same convention for all stereograms for consistency.

3.1  A Transparent Stereogram with Two Overlapping Fronto-Parallel Planes

We first applied the model to the same transparent random-dot stereogram as in Figure 1 (copied to the top panel of Figure 4). The true disparity map and the decoded disparity maps at each scale are shown in the bottom panel of Figure 4.

Figure 4:

Model performance on the same transparent random-dot stereogram as in Figure 1 with two overlapping fronto-parallel planes. (Top) The stereogram. (Bottom) The true disparity map and computed maps at the five scales.

Note that 98.3% of all image positions have two decoded disparities, whereas positions have one decoded disparity and the position has more than two decoded disparities. Thus, the model correctly represented the two transparent planes in most positions. The decoded disparity values are also close to the true values: the root mean square (RMS) error is 0.2 pixel, compared with the 5-pixel separation between the two planes.
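The two summary statistics quoted above can be computed along the following lines. The text does not spell out how each decoded value is paired with a true plane, so this sketch assumes that each decoded disparity is compared with the nearest true disparity; the function name `transparency_metrics` and the toy input are hypothetical.

```python
import math


def transparency_metrics(decoded_map, true_disparities):
    """Summarize a decoded disparity map for a two-plane transparent stereogram.

    decoded_map: one list of decoded disparities per image position.
    true_disparities: the disparities of the planes in the stimulus.
    Returns (proportion of positions with exactly two decoded disparities,
             RMS error of decoded values against the nearest true disparity).
    """
    n_two = 0
    squared_error = 0.0
    n_values = 0
    for decoded in decoded_map:
        if len(decoded) == 2:
            n_two += 1
        for value in decoded:
            # Assumption: pair each decoded value with the closest true plane.
            nearest = min(true_disparities, key=lambda t: abs(t - value))
            squared_error += (value - nearest) ** 2
            n_values += 1
    proportion = n_two / len(decoded_map)
    rms = math.sqrt(squared_error / n_values) if n_values else float("nan")
    return proportion, rms


if __name__ == "__main__":
    # Toy map with four positions; planes at -2 and 3 pixels as in Figure 4.
    toy_map = [[-2.1, 3.0], [-1.9, 3.2], [3.0], [-2.0, 2.9]]
    print(transparency_metrics(toy_map, (-2.0, 3.0)))
```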

The small fluctuations of the decoded disparity values are likely attributable to the fact that our model is completely local, with separate estimation of disparities at each location. Interactions among different positions in higher-level surface representations would likely smooth out the fluctuations.

3.2  A Nontransparent Stereogram with a Floating Square

To ensure that our model works on nontransparent stereograms, we applied it to a standard random dot stereogram with a floating square. The result is shown in Figure 5. At the finest scale, our model correctly decoded the floating square.

Figure 5:

Model performance on a standard nontransparent stereogram with a floating square.

3.3  A Transparent Stereogram with a Floating Square

Next, we tested a transparent version of the standard stereogram in the previous example: we added an overlapping background for the central floating square. This is an interesting test because unlike the uniform transparent stereogram in Figure 4, this stereogram has depth boundaries in addition to transparency. Additionally, the dot density in the central square region is twice that in the surround region. Nevertheless, the model with the fixed set of parameters works well. The results are shown in Figure 6.

Figure 6:

Model performance on a transparent stereogram with a floating square.

3.4  A Nontransparent Stereogram with a Slanted Plane

A problem with Marr and Poggio’s (1976) model and related models is that they have difficulty with slanted planes because they consider a small number of fronto-parallel planes and include strong interactions within each plane. In contrast, Chen and Qian’s (2004) coarse-to-fine disparity energy model can compute disparity maps from nontransparent stereograms with slanted planes. We therefore also tested our extension on a nontransparent stereogram with a slanted plane. The result is shown in Figure 7.

Figure 7:

Model performance on a nontransparent stereogram with a slanted plane.

3.5  A Transparent Stereogram with Overlapping Slanted Planes

We tested a transparent version of the previous stereogram, namely, a transparent stereogram with two overlapping slanted planes. The result is shown in Figure 8.

Figure 8:

Model performance on a transparent stereogram with overlapping slanted planes.

3.6  A Natural Image Stereogram

Finally, since Chen and Qian’s (2004) model was applied to natural image stereograms, we have also tested our extension on a natural image stereogram in which disparity and contrast covary; the result is shown in Figure 9.

Figure 9:

Model performance on a natural image stereogram. (Top) The image pair of Cloth4 stereogram from Middlebury Stereo Datasets (Hirschmuller & Scharstein, 2007; Scharstein & Pal, 2007). (Bottom) The ground truth and the model performance. The original image pairs were shifted by 125 pixels and downsampled by a factor of 10 so that the disparities are within the range covered by the model cells.

3.7  Disparity Attraction and Repulsion in Transparent Stereograms

Disparities of a few isolated features appear to attract or repel each other depending on the features’ lateral separations (Westheimer, 1986; Westheimer & Levi, 1987). Mikaelian and Qian (2000) applied the disparity energy model to explain this observation. A similar phenomenon occurs for transparent stereograms: disparities of two overlapping planes appear to attract or repel each other depending on the depth separation between the planes (Parker & Yang, 1989; Stevenson, Cormack, & Schor, 1989). Specifically, when the depth separation is small, the two planes appear to merge as a single plane with the average disparity. With increasing separation, the stimulus looks like a thickened slab, a perception termed pyknostereopsis. Further depth separation produces two transparent planes with an exaggerated depth separation between them. Finally, at even greater depth separations, the perceived separation between the two planes becomes veridical.

Our model reproduces these observations as shown in Figure 10. We applied our model to a transparent random dot stereogram with various disparity separations between two overlapping planes. The disparities of the two planes always have the same magnitude but opposite signs. In the top panel of Figure 10, each column is a gray-scale histogram (compiled from all positions of the stereogram) of the decoded disparity values for each actual disparity separation between the planes. Brighter colors represent more frequently decoded values. The two actual disparities are indicated by the two dashed black lines. Similar to our perception, the model requires a minimum disparity separation (threshold) between the planes to decode two disparities. This threshold depends on the model’s finest RF scale. Also similar to our perception, the model produces a thickened slab during the transition from decoding one plane to two planes.

Figure 10:

Disparity interactions in stereo transparency. (Top) Each column shows a decoded-disparity histogram for each actual disparity separation between the two planes in a transparent random-dot stereogram. Brighter colors indicate more frequently decoded values. The two actual disparities are represented by the two black dashed lines. The model explains three observed perceptual regimes with increasing disparity separation: depth averaging (one plane), pyknostereopsis (thickening), and transparency (two planes). (Bottom) The decoded disparity separation, according to the peaks of the histograms, against the actual disparity separation. The dashed line marks equality between the computed and actual disparity separations. The computed separations show attraction (below the dashed line) and repulsion (above the dashed line) depending on the actual disparity separation.

Averaging two disparities into one may be viewed as an extreme case of attraction between the two disparities. To examine disparity interactions generally, we plot in the bottom panel of Figure 10 the decoded disparity separation against the actual disparity separation between the two planes (open circles). This was done by searching for the peaks in the histogram of the top panel around the actual disparity values and then subtracting the two peak disparities. The dashed line in the bottom panel marks the equality between the computed and actual disparity separations. The model predicts smaller-than-actual separations, larger-than-actual separations, and veridical separations as the actual separation increases, in agreement with the observation of Stevenson, Cormack, and Schor (1991).

We also investigated how, at small disparity separations, the averaged disparity of two overlapping planes is weighted by the contrasts of the dots for the planes. We applied our model to a transparent random dot stereogram with two planes having pixel of disparities but various contrast ratios between the dots of the two planes. The decoded disparity is close to the average disparity weighted by the contrasts but with an S-shaped bias (see Figure 11, left), in agreement with the observation in a related experiment (Rogers & Anstis, 1975).

Figure 11:

Disparity averaging weighted by dot contrasts and dot density. We applied our model to a transparent random dot stereogram with two planes at pixel of disparities and varied the contrast (left) and density (right) of the dots of the two planes. Each panel plots the computed disparity against the average disparity weighted by the contrast (left) or density (right).

In addition to contrasts, we also varied the dot density ratio between the two planes. The decoded disparity is very close to the average disparity weighted by the dot densities (see Figure 11, right). This is a prediction that could be tested psychophysically.
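The reference quantity against which the decoded disparity is compared in Figure 11 is a weighted average of the two plane disparities. A minimal sketch follows; the function name `weighted_average_disparity` and the disparity values of plus and minus 0.5 pixel are hypothetical placeholders, not the values used in the simulations.

```python
def weighted_average_disparity(disparities, weights):
    """Average of plane disparities weighted by dot contrast or dot density."""
    total = sum(weights)
    return sum(w * d for d, w in zip(disparities, weights)) / total


if __name__ == "__main__":
    planes = [-0.5, 0.5]  # hypothetical plane disparities in pixels
    for ratio in (1.0, 2.0, 3.0):
        # The second plane has `ratio` times the contrast (or density) of the first.
        print(ratio, weighted_average_disparity(planes, [1.0, ratio]))
```

Equal weights give the midpoint between the planes, and increasing the weight of one plane moves the average toward that plane's disparity.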

3.8  Dependence on Two Key Parameters

Our extension introduced two new parameters, and we examined how the model performance depends on them. They are the spread of the connectivity pattern characterized by in equation 2.14 and the relative threshold for eliminating noisy small peaks in decoding in equation 2.18.

For the transparent stereogram with two fronto-parallel planes in Figure 4, the left panel of Figure 12 shows the proportion of positions with two decoded disparities as a function of the connection spread and the relative threshold. The curve in the density plot indicates the optimal combination of the two parameters. When , the optimal threshold increases quickly as the spread increases. This suggests that as the connections for coarse-to-fine computation are more spread out from the intended ones, the ratio of noisy small peaks to real peaks in population responses becomes larger. For a small spread, a broad range of thresholds produces similarly good performance. The standard spread and threshold used in our simulations are 0.1 pixel and 0.3, respectively (indicated by a star in the figure).

Figure 12:

Dependence of the model performance on parameters and . We used the same transparent random dot stereogram as in Figure 4 with two overlapping planes of disparities −2 and 3 pixels. (Left) The proportion of image positions with exactly two decoded disparities as a function of both and . Brighter colors indicate higher proportions. The black curve marks the optimal for each . The star marks the standard parameters used in all the simulations of this letter. The right panel shows the decoding RMS error as a function of . is chosen to be optimal for each . The two lines are the decoding RMS errors for the two planes. The shaded areas indicate the standard deviations of the errors estimated from 10 different stereograms, and the darker areas indicate overlaps of the shades. The axis is in log scale for both panels.

The right panel of Figure 12 shows the decoding RMS error as a function of (with the optimal for each ). The model performance does not vary much as long as is smaller than of the finest scale (2 pixels in our simulations). These results explain why a single parameter set works well for all stereograms in this letter.

4  Discussion

We extended Chen and Qian’s (2004) coarse-to-fine disparity energy model to solve the difficult problem of stereo transparency with biologically plausible mechanisms. In the original model, a given scale decodes a single disparity for each location and uses this disparity to select a set of cells for the next scale. We replaced this artificial selection procedure with multiplicative connections from one scale to the next. The connectivity pattern provides a biologically plausible mechanism to achieve the original model’s goal of using cells’ position-shift RF component to offset stimulus disparities and the more reliable phase-shift RF component to estimate residual disparities. More important, whereas each scale of the original model commits to a single decoded disparity at each location, the new model maintains the entire population responses during the coarse-to-fine computation. Consequently, unlike the original model, explicit disparity decoding at each scale is unnecessary for the new model. We can still decode the population responses at each scale for the sole purpose of visualization as we did in this letter. This leads to our second extension: we used a simple threshold criterion capable of decoding multiple disparities instead of single-disparity decoding in the original model. We demonstrated through computer simulations, with a single parameter set, that these extensions allow our model to solve various transparent and nontransparent stereograms in a biologically plausible way. Finally, our model explains disparity interactions (averaging, thickening, attraction, and repulsion) as the separation between two overlapping planes varies.

Both Chen and Qian’s (2004) model and our current extension use the position-shift RF component to offset estimated stimulus disparities and the phase-shift component to estimate the residual disparities. Consequently, at the end of computation, the most responsive cells have position shifts near stimulus disparities and phase shifts near 0. As we noted, this strategy is based on the finding that the phase-shift population response is more reliable than the position-shift population response for disparity computation (Chen & Qian, 2004; Tsang & Shi, 2004). The analysis in the appendix shows that this remains true when stimulus disparity is divided evenly between the two eyes. Position shifts are needed to properly place the limited disparity range of phase shifts. Also note that Read and Cumming (2007) follow Chen and Qian (2004) in searching for the cells whose position shift offsets stimulus disparity and whose phase shift is near 0, albeit with a different algorithm.

It is easy to understand why position-shift RFs are generally less reliable than phase-shift RFs. Consider disparity encoding at a given location by a set of energy units with a range of preferred disparities. If the units have phase-shift RFs, then the RFs of all the units cover the same left and right image patches. Consequently, variations in the units’ responses are attributable to their different tuning properties. In contrast, if the units have position-shift RFs, then different units cover different left and right image patches, which introduces additional variability in the population responses.

We mentioned in section 1 that cells with small RFs can reliably compute only small disparities. This is easy to understand for phase-shift RFs because phase shift is periodic, and disparity representation is unambiguous only for phase shifts within the range (Qian, 1994). One might argue that because position shift is not periodic, position-shift RFs could represent arbitrarily large disparities. However, this is not the case, for the reason discussed above. Specifically, by definition, cells with different position shifts are located at different positions. When their RFs are small, they are more likely to cover completely different image regions. Thus, spatial variations of image properties (e.g., contrast, frequency content, local features such as orientation) may overwhelm the disparity-related signals in population responses.

How does our extended coarse-to-fine disparity energy model solve the stereo transparency problem? We define residual disparity as the difference between an actual stimulus disparity and its current estimate. At the largest scale, cells’ RFs cover many dots carrying different disparities, and thus the most responsive cells are likely those tuned to the average of the stimulus disparities (see Figures 3 and 4). Because of the connectivity pattern, these cells will excite the cells in the next scale whose position-shift components are close to the average disparity. With the offsetting of the average disparity by the position shifts, the cells of the next scale with smaller RFs can better represent the residual disparities with their phase shifts. This process is then repeated to gradually offset more of the stimulus disparities and reduce the residual disparities. At the smallest scale, the most active cells are the ones whose position shifts are close to one of the actual stimulus disparities and whose phase-shift components are near 0 (because the residual disparities are close to 0).
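The gating step described above can be illustrated with a minimal NumPy sketch. It is not the model's actual implementation: the gaussian weight profile, the `spread` parameter, the normalization of the gain, and all function and variable names are assumptions made for illustration. The presynaptic preferred disparities are taken as given rather than computed from position and phase shifts.

```python
import numpy as np

def connection_weights(pre_pref, post_pos_shift, spread):
    """Hypothetical gaussian weights, largest where a presynaptic cell's
    preferred disparity equals a postsynaptic cell's position shift."""
    diff = post_pos_shift[:, None] - pre_pref[None, :]
    return np.exp(-diff ** 2 / (2.0 * spread ** 2))

def gate_next_scale(pre_resp, pre_pref, post_energy, post_pos_shift, spread=0.5):
    """Multiply each finer-scale cell's energy by a gain pooled from the
    coarser scale. post_energy has shape (n_position_shifts, n_phase_shifts)."""
    w = connection_weights(pre_pref, post_pos_shift, spread)  # (n_post, n_pre)
    gain = w @ pre_resp                                       # one gain per position shift
    gain = gain / gain.max()                                  # assumed normalization
    return post_energy * gain[:, None]

# Toy example (made-up numbers): the coarse scale peaks at 0.5 pixel,
# the average of two planes at -2 and 3 pixels.
pre_pref = np.linspace(-4.0, 4.0, 17)            # coarse-scale preferred disparities
pre_resp = np.exp(-(pre_pref - 0.5) ** 2 / 2.0)  # coarse-scale population response
post_pos_shift = np.linspace(-4.0, 4.0, 17)      # fine-scale position shifts
post_energy = np.ones((17, 5))                   # flat feedforward energy, 5 phase shifts
gated = gate_next_scale(pre_resp, pre_pref, post_energy, post_pos_shift)
```

In this toy case, the finer-scale cells whose position shift is nearest the coarse-scale peak receive the largest gain, while the whole population response is carried forward rather than a single decoded value.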

Our model makes specific predictions. There is physiological and psychophysical evidence for coarse-to-fine disparity processing in biological vision (Menz & Freeman, 2003; Smallman & MacLeod, 1994; Wilson, Blake, & Halpern, 1991; Rohaly & Wilson, 1993). Our model suggests a specific implementation of this computation, namely, that the connections from cells with larger RFs to those with smaller RFs are the strongest when a presynaptic cell’s overall preferred disparity (as determined by both its position and phase shifts) matches the postsynaptic cell’s position shift. A second prediction is that the smallest disparity separation between two transparent surfaces that can be resolved perceptually is determined by the RF sizes of the finest scale in the coarse-to-fine process. This could be tested by examining whether the smallest resolvable disparity separation increases with retinal eccentricity. Our model also predicts that disparity averaging should be weighted by dot densities (see Figure 11).

In conclusion, we have extended Chen and Qian’s (2004) coarse-to-fine disparity energy model to solve the difficult problem of stereo transparency with biologically plausible mechanisms. The model uses both position-shift and phase-shift RF components and works well on a variety of transparent and nontransparent stereograms. Although large-scale cells tend to average stimulus disparities and small-scale cells cannot compute large stimulus disparities, combining information through the coarse-to-fine process solves the transparency problem. Our model also makes specific predictions on connectivity between disparity-tuned cells of different scales and on our perception of stereo transparency.

Appendix:  Derivation and Implementation

A.1  Quadrature Pair Responses and Preferred Disparities

The derivations here are similar to our previous derivations (Chen & Qian, 2004) but with stimulus disparities evenly divided between the two eyes’ oriented RFs with both position and phase shifts.

The RFs of simple cells in a quadrature pair are defined in equations 2.2, 2.3, 2.4, and 2.5 of the text. For a stimulus I(x, y) with disparity D, the images for the two eyes are
formula (A.1)
formula (A.2)
Without loss of generality, for position (0, 0), equations 2.6 and 2.7 become
formula (A.3)
formula (A.4)
in which x1, y1, x2, y2 are rotated coordinates defined as
formula (A.5)
formula (A.6)
Therefore, the quadrature-pair response is
formula (A.7)
with
formula (A.8)
formula (A.9)
The first-order approximation of with respect to is
formula (A.10)
Define a gaussian envelope as
formula (A.11)
and define the original image filtered by this gaussian envelope and its scaled first partial derivative with respect to x as
formula (A.12)
formula (A.13)
where is the RF aspect ratio. The Fourier component at frequency of I1 and I2 is
formula (A.14)
formula (A.15)
With these notations, along with , the complex cell response is
formula (A.16)
an approximation to the second order of . If the stimulus disparity D is largely offset by cells’ position shift d, then the second term is small, and the cells’ preferred disparity is determined by the first term, resulting in equation 2.10 in the text.

Equation A.16 also demonstrates that phase-shift population responses (from cells with a fixed d but a full range of ) are more reliable than position-shift population responses (from cells with a fixed but a range of d) even when disparity is evenly divided between the two eyes. Specifically, the second term of equation A.16 can be made small when D is largely offset by a fixed d, and the cells with this d and the full range of have a reliable peak determined by the first term. In contrast, the second term cannot always be small for a fixed and a range of d, contaminating the first term. Also note that when , the position-shift population response is symmetric around d = D (Read & Cumming, 2007). However, this symmetry holds only for the special case of uniform disparity.

A.2  Disparity Decoding in Discrete Form

We explain the detailed implementation of disparity decoding. As mentioned in section 2.3, we aim to find satisfying equations 2.16 to 2.18. We can only approximately achieve this goal since the population responses are sampled from cells with a discrete set of parameters d and .

For a given scale () and spatial location (x and y), local population responses are stored in a 2D array,
formula
in which di and indicate the position- and phase-shift parameters of the cells. For convenience, we use j0 to index the cell whose .
The algorithm first finds all i’s satisfying:
formula
Then, for each di so determined, it is reasonable to assume that falls within . Define . We search for j over according to and . Applying parabolic interpolation on , and , we find the peak position of , and let
formula
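The peak-refinement step above can be illustrated with a simplified one-dimensional sketch. It is not the authors' algorithm, which searches a 2D array over position and phase shifts. The relative-threshold criterion, the default value of `rel_threshold`, the function names, and the toy response are all assumptions made for illustration. The parabolic step is the standard three-point vertex formula for equally spaced samples.

```python
import numpy as np

def parabolic_peak(x, y, k):
    """Refine a sampled local maximum at index k by fitting a parabola
    through samples k-1, k, k+1 (x is assumed equally spaced)."""
    y_m, y_0, y_p = y[k - 1], y[k], y[k + 1]
    denom = y_m - 2.0 * y_0 + y_p
    if denom == 0.0:  # flat top: keep the sampled position
        return x[k]
    step = x[k + 1] - x[k]
    return x[k] + 0.5 * step * (y_m - y_p) / denom

def decode_peaks(x, y, rel_threshold=0.3):
    """Return refined positions of interior local maxima whose height is at
    least rel_threshold times the global maximum (hypothetical criterion)."""
    floor = rel_threshold * y.max()
    peaks = []
    for k in range(1, len(y) - 1):
        if y[k] >= floor and y[k] > y[k - 1] and y[k] >= y[k + 1]:
            peaks.append(parabolic_peak(x, y, k))
    return peaks

# Toy population response (made-up data) with bumps at -2 and 3 pixels.
d = np.arange(-6.0, 6.5, 0.5)
resp = np.exp(-(d + 2.0) ** 2 / 0.5) + 0.8 * np.exp(-(d - 3.0) ** 2 / 0.5)
decoded = decode_peaks(d, resp)
```

Because every sufficiently large local maximum is kept, the same routine returns one value for a single surface and two values for two well-separated transparent planes.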

Acknowledgments

We thank Li Zhaoping for her support and helpful discussions. This work was supported by a Tsinghua University 985 grant (Li Zhaoping) and the Irving Weinstein Foundation (NQ).

References

Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proceedings of the National Academy of Sciences, 94(10), 5438–5443.

Anzai, A., Ohzawa, I., & Freeman, R. D. (1999). Neural mechanisms for processing binocular information I. Simple cells. Journal of Neurophysiology, 82(2), 891–908.

Assee, A., & Qian, N. (2007). Solving da Vinci stereopsis with depth-edge-selective V2 cells. Vision Research, 47(20), 2585–2602.

Chen, Y., & Qian, N. (2004). A coarse-to-fine disparity energy model with both phase-shift and position-shift receptive field mechanisms. Neural Computation, 16(8), 1545–1577.

Hirschmuller, H., & Scharstein, D. (2007). Evaluation of cost functions for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). Piscataway, NJ: IEEE.

Julesz, B. (1971). Foundations of Cyclopean perception. Chicago: University of Chicago Press.

Li, Z., & Qian, N. (2014). Solving stereo transparency with an extended coarse-to-fine disparity energy model. Talk at Vision Science Society, St. Pete Beach, FL.

Livingstone, M. S., & Tsao, D. Y. (1999). Receptive fields of disparity-selective neurons in macaque striate cortex. Nature Neuroscience, 2(9), 825–832. doi:10.1038/12199

Mansfield, J. S., & Parker, A. J. (1993). An orientation-tuned component in the contrast masking of stereopsis. Vision Research, 33(11), 1535–1544.

Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194(4262), 283–287.

Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proceedings of the Royal Society of London, Series B, Biological Sciences, 204(1156), 301–328.

Menz, M. D., & Freeman, R. D. (2003). Stereoscopic depth processing in the visual cortex: A coarse-to-fine mechanism. Nature Neuroscience, 6(1), 59–65. doi:10.1038/nn986

Mikaelian, S., & Qian, N. (2000). A physiologically-based explanation of disparity attraction and repulsion. Vision Research, 40(21), 2999–3016.

Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249(4972), 1037–1041.

Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat’s visual cortex. Journal of Neurophysiology, 77(6), 2879–2909.

Parker, A. J., & Yang, Y. (1989). Spatial properties of disparity pooling in human stereo vision. Vision Research, 29(11), 1525–1538.

Pollard, S. B., Mayhew, J. E. W., & Frisby, J. P. (1985). PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14(4), 449–470.

Prazdny, K. (1985). Detection of binocular disparities. Biological Cybernetics, 52(2), 93–99.

Prince, S., Cumming, B. G., & Parker, A. J. (2002). Range and mechanism of encoding of horizontal disparity in macaque V1. Journal of Neurophysiology, 87(1), 209–221.

Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6(3), 390–404.

Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18(3), 359–368.

Qian, N., & Sejnowski, T. J. (1989). Learning to solve random-dot stereograms of dense and transparent surfaces with recurrent backpropagation. In Proceedings of the 1988 Connectionist Models Summer School (pp. 435–443). San Mateo, CA: Morgan Kaufmann.

Read, J. C. A., & Cumming, B. G. (2007). Sensors for impossible stimuli may solve the stereo correspondence problem. Nature Neuroscience, 10(10), 1322–1328. doi:10.1038/nn1951

Rogers, B. J., & Anstis, S. M. (1975). Reversed depth from positive and negative stereograms. Perception, 4(2), 193–201.

Rohaly, A. M., & Wilson, H. R. (1993). Nature of coarse-to-fine constraints on binocular fusion. Journal of the Optical Society of America A, 10(12), 2433–2441.

Scharstein, D., & Pal, C. (2007). Learning conditional random fields for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). Piscataway, NJ: IEEE.

Smallman, H. S., & MacLeod, D. I. A. (1994). Size-disparity correlation in stereopsis at contrast threshold. Journal of the Optical Society of America A, 11(8), 2169–2183.

Stevenson, S. B., Cormack, L. K., & Schor, C. M. (1989). Hyperacuity, superresolution and gap resolution in human stereopsis. Vision Research, 29(11), 1597–1605.

Stevenson, S. B., Cormack, L. K., & Schor, C. M. (1991). Depth attraction and repulsion in random dot stereograms. Vision Research, 31(5), 805–813.

Tsang, E. K. C., & Shi, B. E. (2004). A preference for phase-based disparity in a neuromorphic implementation of the binocular energy model. Neural Computation, 16(8), 1579–1600.

Westheimer, G. (1986). Spatial interaction in the domain of disparity signals in human stereoscopic vision. Journal of Physiology, 370(1), 619–629.

Westheimer, G., & Levi, D. M. (1987). Depth attraction and repulsion of disparate foveal stimuli. Vision Research, 27(8), 1361–1368.

Wilson, H. R., Blake, R., & Halpern, D. L. (1991). Coarse spatial scales constrain the range of binocular fusion on fine scales. Journal of the Optical Society of America A, 8(1), 229–236.

Zhaoping, L. (2002). Preattentive segmentation and correspondence in stereo. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 357(1428), 1877–1883.

Zhu, Y.-D., & Qian, N. (1996). Binocular receptive field models, disparity tuning, and characteristic disparity. Neural Computation, 8(8), 1611–1641.