## Abstract

A common practice to account for psychophysical biases in vision is to frame them as consequences of a dynamic process relying on optimal inference with respect to a generative model. The study presented here details the complete formulation of such a generative model intended to probe visual motion perception with a dynamic texture model. It is derived in a set of axiomatic steps constrained by biological plausibility. We extend previous contributions by detailing three equivalent formulations of this texture model. First, the composite dynamic textures are constructed by the random aggregation of warped patterns, which can be viewed as three-dimensional gaussian fields. Second, these textures are cast as solutions to a stochastic partial differential equation (sPDE). This essential step enables real-time, on-the-fly texture synthesis using time-discretized autoregressive processes. It also allows for the derivation of a local motion-energy model, which corresponds to the log likelihood of the probability density. The log likelihoods are essential for the construction of a Bayesian inference framework. We use the dynamic texture model to psychophysically probe speed perception in humans using zoom-like changes in the spatial frequency content of the stimulus. The human data replicate previous findings showing perceived speed to be positively biased by spatial frequency increments. A Bayesian observer who combines a gaussian likelihood centered at the true speed and a spatial frequency dependent width with a “slow-speed prior” successfully accounts for the perceptual bias. More precisely, the bias arises from a decrease in the observer's likelihood width estimated from the experiments as the spatial frequency increases. Such a trend is compatible with the trend of the dynamic texture likelihood width.

## 1 Introduction

### 1.1 Modeling Visual Motion Perception

However, gaussian likelihoods and priors do not always fit with psychophysical results (Wei & Stocker, 2012; Hassan & Hammett, 2015). Thus, a major challenge is to refine the construction of generative models so that they are consistent with the widest variety of empirical results.

### 1.2 Previous Work in Context

#### 1.2.1 Dynamic texture synthesis

The model defined in equation 1.2 is quite simplistic compared to the complexity of natural scenes. It is therefore useful here to discuss generative models associated with texture synthesis methods previously proposed in the computer vision and computer graphics community. Indeed, the literature on the subject of static textures synthesis is abundant (see, e.g., Wei, Lefebvre, Kwatara, & Turk, 2009). Of particular interest for us is the work by Galerne, Gousseau, and Morel (2011a) and Galerne (2011), which proposes a stationary gaussian model restricted to static textures. This provides an equivalent generative model based on Poisson shot noise. Realistic dynamic texture models have received less attention, and the most prominent method is the nonparametric gaussian autoregressive (AR) framework developed by Doretto, Chiuso, Wu, and Soatto (2003), which has been thoroughly explored (Xia, Ferradans, Peyré, & Aujol, 2014; Yuan, Wen, Liu, & Shum, 2004; Costantini, Sbaiz, & Süsstrunk, 2008; Filip, Haindl, & Chetverikov, 2006; Hyndman, Jepson, & Fleet, 2007; Abraham, Camps, & Sznaier, 2005). These works generally consist in finding an appropriate low-dimensional feature space in which an AR process models the dynamics. Many of these approaches focus on the feature space where the decomposition is efficiently performed using singular value decomposition (SVD) or its higher-order version (HOSVD) (Doretto et al., 2003; Costantini et al., 2008). In Abraham et al. (2005), the feature space is the Fourier frequency domain, and the AR recursion is carried independently over each frequency, which defines the space-time stationary processes. A similar approach is used in Xia et al. (2014) to compute the average of several dynamic texture models. Properties of these AR models have been studied by Hyndman et al. (2007), who find that higher-order AR processes are able to capture perceptible temporal features. A different approach aims at learning the manifold structure of a given dynamic texture (Liu, Lin, Ahuja, & Yang, 2006), while yet another deals with motion statistics (Rahman & Murshed, 2008). What all these works have in common is the aim to reproduce the natural spatiotemporal behavior of dynamic textures with rigorous mathematical tools. Similarly, our concern is to design a dynamic texture model that is precisely parameterized for experimental purposes in visual neuroscience and psychophysics.

#### 1.2.2 Stochastic Differential Equations

Stochastic ordinary differential equations (sODE) and their higher-dimensional counterparts, stochastic partial differential equations (sPDE), can be viewed as continuous-time versions of these one-dimensional or higher-dimensional AR models. Conversely, AR processes can therefore also be used to compute numerical solutions to these sPDE using finite difference approximations of time derivatives. Informally, these equations can be understood as partial differential equations perturbed by a random noise. The theoretical and numerical study of these sDEs is of fundamental interest in fields as diverse as physics and chemistry (Van & Nicolaas, 1992), finance (El Karoui, Peng, & Quenez, 1997), or neuroscience (Fox, 1997). They allow for the dynamic study of complex, irregular, and random phenomena such as particle interactions, stock or saving prices, or ensembles of neurons. In psychophysics, sODEs have been used to model decision-making tasks in which the stochastic variable represents the accumulation of knowledge until the decision is made, thus providing detailed information about predicted response times (Smith, 2000). In imaging sciences, sPDE with sparse nongaussian driving noise have been proposed as models of natural signals and images (Unser & Tafti, 2014). As described above, the simple motion energy model 1.3 can similarly be demonstrated to rely on the sPDE equation, 1.2, of a stochastic model of visual sensory input. This has not previously been presented in a formal way in the literature. One key goal of this letter is to comprehensively formulate a parametric family of gaussian sPDEs that describes the modeling of moving images (and the corresponding synthesis of visual stimulation) and thus allows for a finely grained systematic exploration of psychophysical behavior.

#### 1.2.3 Inverse Bayesian Inference

Importantly, these dynamic stochastic models are closely related to the
likelihood and prior models, which serve to infer motion estimates from the
dynamic visual stimulation. In order to account for perceptual bias, a now
well-accepted methodology in the field of psychophysics is to assume that
observers are “ideal observers” and therefore make decisions
using optimal statistical inference, typically a maximum a posteriori (MAP)
estimator, using Bayes' formula to combine this likelihood with some
internal prior (see equation 1.1). Several experimental studies have used this hypothesis as
a justification for the observed perceptual biases by proposing some
adjusted likelihood and prior models (Doya, 2007; Colombo & Seriès, 2012), and more recent work pushes these ideas
even further. Observing some perceptual bias, is it possible to invert this
forward Bayesian decision-making process, and infer the (unknown) internal
prior that best fits a set of observed experimental choices made by
observers? Following Stocker and Simoncelli (2006), we coined this promising methodology *inverse
Bayesian inference*. This is, of course, an ill-posed and highly
nonlinear inverse problem, making it necessary to add constraints on both
the prior and the likelihood to make it tractable. For instance
Sotiropoulos, Seitz, and Seriès (2014), Stocker and Simoncelli (2006), and Jogan and Stocker (2015) impose smoothness constraints in order to be able to
locally fit the slope of the prior. Here, we propose to use visual
stimulations generated by the (forward) generative model to test these
inverse Bayesian models. To allow for a simple yet mathematically rigorous
analysis of this approach within the context of speed discrimination, in
this study, we use a restricted parametric set of descriptors for the
likelihood and priors. This provides a self-consistent approach to test the
visual system, from stimulation to behavior analysis.

### 1.3 Contributions

In this letter, we lay the foundations that we hope will enable a better understanding of human motion perception by improving generative models for dynamic texture synthesis. From that perspective, we motivate the generation of visual stimulation within a stationary gaussian dynamic texture model.

We develop our current model by extending, mathematically detailing, and testing in psychophysical experiments previously introduced dynamic noise textures (Sanz-Leon, Vanzetta, Masson, & Perrinet, 2012; Simoncini, Perrinet, Montagnini, Mamassian, & Masson, 2012; Vacher, Meso, Perrinet, & Peyré, 2015; Gekas, Masson, & Mamassian, 2017) coined motion clouds (MC). Our first contribution is a complete axiomatic derivation of the model, seen as a shot noise aggregation of dynamically warped “textons.” Within our generative model, the parameters correspond to average spatial and temporal transformations (zoom, orientation, and translation speed) and associated standard deviations of random fluctuations, as illustrated in Figure 1, with respect to external (objects) and internal (observer) movements. The second main contribution is the explicit demonstration of the equivalence between this model and a class of linear sPDEs. This shows that our model is a generalization of the well-known luminance conservation (see equation 1.2). This sPDE formulation has two chief advantages: it allows for a real-time synthesis using an AR recurrence (in the form of a GPU implementation) and allows one to recast the log likelihood of the model as a generalization of the classical motion energy model, which is crucial to allow for Bayesian modeling of perceptual biases. Our last contribution follows from the Bayesian approach and is an illustrative application of this model to the psychophysical study of motion perception in humans. This example of the model development constrains the likelihood, which enables a simple fitting procedure to be performed using both an empirical and a larger Monte Carlo–derived synthetic data set to determine the prior driving the perceptual biases. The code associated with this work is available at https://github.com/JonathanVacher/projects/tree/master/bayesian_observer.

### 1.4 Notations

## 2 Axiomatic Construction of the Dynamic Textures

Dynamic textures that are efficient to probe visual perception should be generated from low-dimensional yet naturalistic parametric stochastic models. They should embed meaningful physical parameters (such as the effect of head rotations or whole-field scene movements; see Figure 1) into the local or global dependencies of the random field (e.g., the covariance). In the luminance conservation model, equation 1.2, the generative model is parameterized by a spatiotemporal coupling encoded in the covariance $\Sigma $ of the driving noise and the motion flow $v0$.

This localized space-time coupling (e.g., the covariance, if one restricts one's attention to gaussian fields) is essential, as it quantifies the extent of the spatial integration area, as well as the integration dynamics. This is an important issue in neuroscience when considering the implementation of spatiotemporal integration mechanisms from very small to very large scales, that is, going from local to global visual features (Rousselet, Thorpe, & Fabre-Thorpe, 2004; Born & Bradley, 2005; DiCarlo, Zoccolan, & Rust, 2012). In particular, this is crucial to understand the modular sensitivity within the different lower visual areas. In primates for instance, the primary visual cortex (V1) generally encodes small features in a given range of spatiotemporal scales. In contrast, ascending the processing hierarchy, the middle temporal (V5/MT) area exhibits selectivity for larger visual features. For instance, by varying the spatial frequency bandwidth of such dynamic textures, distinct mechanisms for perception and action have been identified in humans (Simoncini et al., 2012). Our goal here is to develop a principled axiomatic definition of these dynamic textures.

### 2.1 From Shot Noise to Motion Clouds

We propose a derivation of a general parametric model of dynamic textures. This model is defined by aggregation, through summation, of a basic spatial “texton” template $g(x)$. The summation reflects a transparency hypothesis, which has been adopted, for instance, by Galerne et al. (2011b). While one could argue that this hypothesis is overly simplistic and does not model occlusions or edges, it leads to a tractable framework of stationary gaussian textures, which has proved useful to model static microtextures (Galerne et al., 2001b) and dynamic natural phenomena (Xia, Ferradans, Peyré, & Aujol, 2014). The simplicity of this framework allows for a fine tuning of frequency-based (Fourier) parameterization, which is desirable for the interpretation of psychophysical experiments with respect to underlying spatiotemporal neural sensitivity.

Intuitively, this model equation 2.1, corresponds to a dense mixing of stereotyped static textons as in the work of Galerne et al. (2011b). In addition to the extension to the temporal domain, the originality of our approach is twofold. First, the components of this mixing are derived from the texton by visual transformations $\varphi Ap$, which may correspond to arbitrary transformations such as zooms or rotations (in which case, $Ap$ is a vector containing the scale factor and the rotation angle). (See the illustration in Figure 1.) Second, we explicitly model the motion (position $Xp$ and speed $Vp$) of each individual texton.

In the following, we denote $PA$ the common distribution of the independent and identically distributed (i.i.d) $(Ap)p$, and we denote $PV$ the distribution in $R2$ of the speed vectors $(Vp)p$. Section 2.3 instantiates this model and proposes canonical choices for these variabilities.

The following result shows that the model, equation 2.1, converges for high point density $\lambda \u2192+\u221e$ to a stationary gaussian field and gives the parameterization of the covariance. Its proof follows from a specialization of theorem 3.1 in Galerne (2011) to our setting.

This proposition enables us to give a precise definition of an MC:

A motion cloud (MC) is a stationary gaussian field whose covariance is given by equation 2.2.

Note that following Galerne et al. (2011a), the convergence result of proposition ^{1} could be used in practice to simulate a motion
cloud $I$ using a high but finite value of $\lambda $ in order to generate a realization of $I\lambda $. We do not use this approach and instead rely on the sPDE
characterization proved in section 3, which
is well tailored for an accurate and computationally efficient dynamic
synthesis.

### 2.2 Toward Motion Clouds for Experimental Purposes

^{2}) that is characterized by $cg,\varphi a,PA$, and $PV$. In order to have better control of the covariance $\gamma $, one needs to resort to a low-dimensional representation of these parameters. We further study this model in the specific case where the warps $\varphi a$ are rotations and scalings (see Figure 1). They account for the characteristic orientations and sizes (or spatial scales) of a scene, in relation to the observer. We thus set

^{3}. It is fully parameterized by the distributions $(PZ,P\Theta ,PV)$ and the central frequency and speed $(\xi 0,v0)$. Note that it is possible to consider any arbitrary textons $g$, which would give rise to more complicated parameterizations for the power spectrum $g^$, but here we decided to stick to the simple asymptotic case of gratings.

^{1}converges toward a stationary gaussian field of covariance having the power spectrum

**Proof.** See appendix A for the proof.

Note that the envelope of $\gamma ^$ as defined in equation 2.7 is constrained to lie within a cone in the spatiotemporal domain with the apex at zero (see the division by $||\xi ||$ in the argument of $L(P||V-v0||)$). This is an important and novel contribution when compared to a classical Gabor. Basing the generation of the textures on distributions of translations, rotations, and zooms, we provide a principled approach to show that speed bandwidth gets scaled with spatial frequency to provide a scale-invariant model of moving texture transformations.

### 2.3 Biologically Inspired Parameter Distributions

We now give meaningful specialization for the probability distributions $PZ$, $P\Theta $, and $P||V-v0||$, which are inspired by some known scaling properties of the visual transformations relevant to dynamic scene perception.

#### 2.3.1 Parameterization of $PZ$

#### 2.3.2 Parameterization of $P\Theta $

#### 2.3.3 Parameterization of $P||V-v0||$

#### 2.3.4 Bringing Everything Together

Plugging expressions 2.9 to 2.11 into the definition of the power spectrum of the defintion of MCs, equation 2.7, one obtains a parameterization that shares similarities with the one originally introduced in Simoncini et al. (2012).

Table 1 recaps the parameters of the biologically inpired MC models. It is composed of the central parameters $v0$ for the speed, $\theta 0$ for orientation, and $z0$ for the central spatial frequency modulus, as well as corresponding dispersion parameters $(\sigma V,\sigma \Theta ,BZ)$ that account for the typical deviation around these central values. Figure 2 graphically shows the influence of these parameters on the shape of the MC power spectrum $\gamma ^$.

Parameter Name | Translation Speed | Orientation Angle | Spatial Frequency Modulus |

(mean, dispersion) | $(v0,\sigma V)$ | $(\theta 0,\sigma \Theta )$ | $(z0,BZ)$ |

Parameter Name | Translation Speed | Orientation Angle | Spatial Frequency Modulus |

(mean, dispersion) | $(v0,\sigma V)$ | $(\theta 0,\sigma \Theta )$ | $(z0,BZ)$ |

We show in Figure 3 two examples of such stimuli for two different spatial frequency bandwidths. This is particularly relevant as it is possible to dissociate the respective roles of broader or narrower spatial frequency bandwidths in action and perception (Simoncini et al., 2012). Using this formulation to extend the study of visual perception to other dimensions like orientation or speed bandwidths should provide a means to systematically titrate their respective role in motion integration and obtain a quantitative assessment of their respective contributions in experimental data.

## 3 sPDE Formulation and Synthesis Algorithm

In this section, we show that the MC model (see definition ^{2}) can equally be described as the stationary
solution of a stochastic partial differential equation (sPDE). This sPDE formulation
is important since we aim to deal with dynamic stimulation, which should be
described by a causal equation that is local in time. This is crucial for numerical
simulations, since this allows us to perform real-time synthesis of stimuli using an
autoregressive time discretization. This is a significant departure from previous
Fourier-based implementation of dynamic stimulations (Sanz-Leon et al., 2012; Simoncini et al., 2012). Moreover, this is also important to simplify the
application of MC inside a Bayesian model of psychophysical experiments (see section 4). In particular, the derivation of an
equivalent sPDE model exploits a spectral formulation of MCs as gaussian random
fields. The full proof along with the synthesis algorithm follows.

To be mathematically correct, all the sPDEs in this letter are written in the sense of generalized stochastic processes (GSP), which are to stochastic processes what generalized functions are to functions. This allows for the consideration of linear transformations of stochastic processes, like differentiation or Fourier transforms as for generalized functions. We refer to Unser, Tafti, and Sun (2014) for a recent use of GSP and to Gel'fand, Vilenkin, and Feinstein (1964) for the foundation of the theory. The connection between GSP and stochastic processes has been described in previous work (Meidan, 1980).

### 3.1 Dynamic Textures as Solutions of sPDE

#### 3.1.1 Using a sPDE without Global Translation, $v0=0$

We first give the definition of an sPDE cloud $I$ making use of another cloud $I0$ without translation speed. This allows us to restrict our attention to the case $v0=0$ in order to define a simple sPDE and then to explicitly extend that result to the general case.

The random field $I0$ solving equation 3.2 thus corresponds to an sPDE cloud with no translation speed, $v0=0$. The filters $(\alpha ,\beta )$ parameterizing this sPDE cloud aim at enforcing an additional correlation of the model in time. Section 3.2 explains how to choose $(\alpha ,\beta ,\sigma W)$ so that these sPDE clouds, which are stationary solutions of equation 3.2, have the power spectrum given in equation 2.7 (in the case that $v0=0$), that is, are motion clouds. Defining a causal equation that is local in time is crucial for numerical simulation (as explained in section 3.3) but also to simplify the application of MC inside a Bayesian model of psychophysical experiments (see section 4.3.2).

While equation 3.3 should hold for all time $t\u2208R$, the construction of stationary solutions (hence, sPDE clouds) of this equation is obtained by solving the sODE, equation 3.3, forward for time $t>t0$ with arbitrary boundary conditions at time $t=t0$, and letting $t0\u2192-\u221e$. This is consistent with the numerical scheme detailed in section 3.3.

The theoretical study of equation 3.2 is beyond the scope of this letter; however, one can show the existence and uniqueness of stationary solutions for this class of sPDE under stability conditions on the filters $(\alpha ,\beta )$ (see, e.g., Unser & Tafti, 2014, and Brockwell & Lindner, 2009, and appendix theorem 1). These conditions are automatically satisfied for the particular case of section 3.2.

#### 3.1.2 sPDE with Global Translation

The easiest way to define and synthesize an sPDE cloud $I$ with nonzero translation speed $v0$ is to first define $I0$ solving equation 3.3 and then translating it with constant speed using equation 3.1. An alternative way is to derive the sPDE satisfied by $I$, as detailed in the following proposition. This is useful to define motion energy in section 4.3.2.

**Proof.** See appendix A for the proof.

### 3.2 Equivalence between the Spectral and sPDE Formulations

Since both MCs and sPDE clouds are obtained by a uniform translation with speed $v0$ of a motionless cloud, we can restrict our analysis to the case $v0=0$ without loss of generality.

The following proposition shows that the sPDE cloud model, equation 3.2, and the motion cloud model, equation 2.7, are identical for an appropriate choice of function $P||V-v0||$.

**Proof.** See appendix A for the proof.

#### 3.2.1 Expression for $P||V-v0||$

### 3.3 AR(2) Discretization of the sPDE

Most previous work on gaussian texture synthesis (such as Galerne et al., 2011b, for static and Sanz-Leon et al., 2012, Simoncini et al., 2012, for dynamic textures) has used a global Fourier-based approach and the explicit power spectrum expression, equation 2.7. The main drawbacks of such an approach are that (1) it introduces an artificial periodicity in time and thus can only be used to synthesize a finite number of frames; (2) these frames must be synthesized at once, before the stimulation, which prevents real-time synthesis; and (3) the discrete computational grid may introduce artifacts—in particular, when one of the included frequencies is of the order of the discretization step or a bandwidth is too small.

To address these issues, we follow the previous works of Doretto et al. (2003) and Xia et al. (2014) and make use of an autoregressive (AR) discretization of the sPDE, equation 3.2. In contrast with this previous work, we use a second-order AR(2) regression (instead of a first-order AR(1) model). Using higher-order recursions is crucial to make the output consistent with the continuous formulation equation, 3.2. Indeed, numerical simulations show that AR(1) iterations lead to unacceptable temporal artifacts. In particular, the time correlation of AR(1) random fields typically decays too fast in time.

#### 3.3.1 AR(2) Synthesis without Global Translation, $v0=0$

One can show that when $\u21130\u2192-\u221e$ (to allow for a long enough warmup phase to reach approximate time stationarity) and $\Delta \u21920$, then $I0\Delta $ defined by interpolating $I0\Delta (\xb7,\Delta \u2113)=I(\u2113)$ converges (in the sense of finite-dimensional distributions) toward a solution $I0$ of the sPDE, equation 3.2. Here we choose to use the standard finite difference. However, we refer to Unser, Tafti, Amini, and Kirshner (2014) and Brockwell, Davis, and Yang (2007) for more advanced discretization schemes. We implement the recursion equation 3.9 by computing the 2D convolutions with FFTs on a GPU, which allows us to generate high-resolution videos in real time, without the need to explicitly store the synthesized video.

#### 3.3.2 AR(2) Synthesis with Global Translation

## 4 An Empirical Study of Visual Speed Discrimination

To exploit the useful parametric transformation features of our MC model and provide a generalizable proof of concept based on motion perception, we consider here the problem of judging the relative speed of moving dynamical textures. The overall aim is to characterize the impact of both average spatial frequency and average duration of temporal correlations on perceptual speed estimation, based on empirical evidence.

### 4.1 Methods

Case | $t\u2605$ | $\sigma Z$ | $BZ$ | $v\u2605$ | $z\u2605$ | $\Delta Z$ |

A1 | $200ms$ | $1.0c/\u2218$ | $\xd7$ | $5\u2218/s$ | $0.78c/\u2218$ | ${-0.31,-0.16,0,0.16,0.47}$ |

A2 | $200ms$ | $1.0c/\u2218$ | $\xd7$ | $5\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

A3 | $200ms$ | $\xd7$ | 1.28 | $5\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

A4 | $100ms$ | $\xd7$ | 1.28 | $5\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

A5 | $200ms$ | $\xd7$ | 1.28 | $10\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

Case | $t\u2605$ | $\sigma Z$ | $BZ$ | $v\u2605$ | $z\u2605$ | $\Delta Z$ |

A1 | $200ms$ | $1.0c/\u2218$ | $\xd7$ | $5\u2218/s$ | $0.78c/\u2218$ | ${-0.31,-0.16,0,0.16,0.47}$ |

A2 | $200ms$ | $1.0c/\u2218$ | $\xd7$ | $5\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

A3 | $200ms$ | $\xd7$ | 1.28 | $5\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

A4 | $100ms$ | $\xd7$ | 1.28 | $5\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

A5 | $200ms$ | $\xd7$ | 1.28 | $10\u2218/s$ | $1.25c/\u2218$ | ${-0.47,-0.31,0,0.31,0.63}$ |

Note: A1 and A2 in the first two rows are both bandwidth controlled in $c/\u2218$ and A3 to A5 are bandwidth controlled in octaves with high (A3 and A5) and low (A4) $t\u2605$.

We tested different scenarios summarized in Table 2. Each row corresponds to approximately 35 minutes of testing per participant and was always performed by at least three of the participants. Stimuli were generated using Matlab 7.10.0 on a Mac running OS 10.6.8 and displayed on a 20” Viewsonic p227f monitor with resolution $1024\xd7768$ at 100 Hz. Psychophysics routines were written using Matlab, and Psychtoolbox 3.0.9 controlled the stimulus display. Observers sat 57 cm from the screen in a dark room. Five male observers with normal or corrected-to-normal vision took part in these experiments. They gave their informed consent, and the experiments received ethical approval from the Aix-Marseille Ethics Committee in accordance with the declaration of Helsinki.

### 4.2 Psychometric Results

Estimating speed in dynamic visual scenes is undoubtedly a crucial skill for successful interaction with the visual environment. Human judgments of perceived speed have therefore generated much interest and been studied with a range of psychophysics paradigms. The different results obtained in these studies suggest that rather than computing a veridical estimate, the visual system generates speed judgments influenced by contrast (Thompson, 1982), speed range (Thompson, Brooks, & Hammett, 2006), luminance (Hassan & Hammett, 2015), spatial frequency (Brooks, Morris, & Thompson, 2011; Simoncini et al., 2012; Smith, Majaj, & Movshon, 2010), and retinal eccentricity (Hassan, Thompson, & Hammett, 2016). There are currently no theoretical models of the underlying mechanisms serving speed estimation that capture this dependence on such a broad range of image characteristics. One of the reasons for this might be that the simplified grating stimuli used in most of the previous studies do not allow experimenters to shed light on the possible elaborations in neural processing that arise when more complex natural or naturalistic stimulation is used. Such elaborations, like nonlinearities in spatiotemporal frequency space, can be seen in their simplest form even with a superposition of a pair of gratings (Priebe, Cassanello, & Lisberger, 2003). In the current work, we use our formulation of motion cloud stimuli, which allows for separate parametric manipulation of peak spatial frequency ($z$), spatial frequency bandwidth ($Bz,\sigma z$), and stimulus lifetime ($t\u2605$), which is inversely related to the temporal variability. The stimuli are all broadband, closely resembling the frequency properties under natural stimulation. Our approach is to test five participants under several parametric conditions given in Table 2 and using a large number of trials.

#### 4.2.1 Psychometric Function Estimation

#### 4.2.2 Cycle-Controlled Bandwidth Conditions

The main manipulation in each case is the direct comparison of the speed of a range of five stimuli in which the central spatial frequency varies between five values, but all other parameters are equated under the different conditions. In a first manipulation in which bandwidth is controlled by fixing it at a value of 1 $c/\u2218$ for all stimuli (conditions A1 and A2 in Table 2), we find that lower frequencies are consistently perceived to be moving slower than higher frequencies (see Figure 4a). This trend is the same for both the lower and the higher spatial frequency ranges used in the tasks, yet the biases are smaller for the higher frequency range (see A1 and A2 in Table 2 for details). This suggests that the effect generalizes across the two scales used, but that shifting the central spatial frequency value of the stimulus, which forms the reference scale, results in a change in sensitivity during speed discrimination. For example, comparing A1 and A2 performance in Figure 4, when the five stimuli of different speed that make up the reference scale are changed from $z\u2605=0.78$ (A1) to $z\u2605=1.25$ (A2), speed estimates seem to become less reliable. The same comparison is using a different psychometric measurement scale in each case. The sensitivity to the discrimination of stimuli measured in the inverse of the psychometric slope is found to remain approximately constant across the range of frequency tested for each of the tested spatial frequencies (see Figure 4e). However, the sensitivity increases significantly ($\Sigma z\u2605,z$ decreases) from condition A1 to condition A2. Such an effect suggests that an increasing trend in sensitivity may exist (see section 4.2.4).

#### 4.2.3 Octave-Controlled Bandwidth Conditions

The octave-bandwidth-controlled stimuli of conditions A3 to A5 (see Table 2) allow us to vary the spatial frequency manipulations ($z$) in a way that generates scale-invariant bandwidths exactly as would be expected from zooming movements toward or away from scene objects (see Figure 1). Thus, if trends seen in Figure 4e were purely the result of ecologically invalid fixing of bandwidths at 1 $c/\u2218$ in the manipulations, this would be corrected in the current manipulation. Only the higher-frequency comparison range from conditions A2 is used because trends are seen to be consistent across conditions A1 and A2. We find that the trends are generally the same as in Figure 4a. Indeed higher spatial frequencies are consistently perceived as faster than lower ones, as shown in Figures 4b to 4d. Interestingly, for the degree bandwidth-controlled stimuli, the biases are lower than those for the equivalent octave-controlled stimuli (e.g., compare Figure 4a with 4b). This can also be seen in Figure 10 (conditions A2 and A3). A change in the bias is also seen with the manipulation of $t\u2605$, as increasing temporal frequency variability when going from biases in Figure 4b to those in Figure 4c entails a reduction in measured biases, with an effect of about 25%, which is also visible in Figure 10 (conditions A3 and A4 for M1).

#### 4.2.4 Is Sensitivity Dependent on Stimulus Spatial Frequency?

To explore further the sensitivity trend, we fit the data with a psychometric function by assuming a linear model for $\Sigma z,z\u2605$ and test for a significant negative slope. None of the slopes are significantly different from 0 at the population level. At the individuals' level, among all conditions and subjects, we find that 13 out of 21 slopes were significantly decreasing. Therefore, we interpret this as a possible decrease in sensitivity at higher $z$ seen in 13 out of 21 of the cases, but one that shows large individual differences in sensitivity trends.

#### 4.2.5 Qualitative Results Summary

- •
Spatial frequency has a positive effect on perceived speed ($\mu z,z\u2605$ increases as $z$ increases).

- •
The inverse sensitivity remains constant or is decreasing with spatial frequency (resp. $\Sigma z,z\u2605$ does not depend on $z$ or decreases as $z$ increases), but there are large individual differences in this sensitivity change.

In the next section, we detail a Bayesian observer model to account for these observed effects.

### 4.3 Observer Model

We list here the general assumptions underlying our model:

The observer performs abstract measurement of the stimulus denoted by a real random variable $M$.

The observer estimates speed using an estimator based on the posterior speed distribution $PV|M$.

The posterior distribution is implicit; Bayes' rule states that $PV|M\u221dPM|VPV$.

The observer knows all other stimulus parameters (in particular, the spatial frequency $z$).

The observer takes a decision without noise.

These asssumptions correspond to the ideal Bayesian observer model. We detail below the relation between this model and the psychometric bias and inverse sensitivity $(\mu z,z\u2605,\Sigma z,z\u2605)$. We also give details to derive the likelihood directly from the MC model and discuss the expected consequences.

#### 4.3.1 Ideal Bayesian Observer

We initially assume that the posterior speed distribution is conditioned on spatial frequency; thus, the likelihood and prior distributions also depend on spatial frequency. However, there is currently no conclusive support in favor of a spatial frequency-dependent speed prior in the literature, but evidence of spatial frequency influencing speed estimation is discussed in the previous paragraph. Therefore, only the likelihood width $\sigma z$ depends on spatial frequency $z$, and the log-prior slope $a$ does not. We discuss in more detail the choice of the likelihood and its dependence on spatial frequency in section 4.3.2.

Figure 5 shows an example of how the likelihood and prior described in equation 4.3 combine into a posterior distribution that resembles a shifted version of the likelihood. In practice, we are able to compute the distribution of the estimates $v^(Mv,z)$ as stated in the following proposition:

^{10}, in our special case of gaussian likelihood and Laplacian prior, the psychometric curve can be computed in closed form.

**Proof.** See appendix A for the proof.

^{11}provides the connection between the Bayesian model parameters and the classical psychometric measures of bias and sensitivity. In particular, it explains the heuristic sigmoidal templates commonly used in psychophysics (see section 4.2). An example of two psychometric curves is shown in Figure 6. We have the following relations:

The experiment allows us to estimate bias and inverse sensitivity $(\mu z,z\u2605,\Sigma z,z\u2605)$. Knowing these parameters, it is possible to recover parameters of the ideal Bayesian observer model. Equation 4.7 has a unique solution, and equation 4.6 can be solved using the least square estimator.

Under this model, a positive bias comes from a decrease in likelihood width and a negative log-prior slope. As concluded in section 4.2, we observe a significant decrease in inverse sensitivity in 13 of 21 subjects and conditions. Therefore, the model, when fitted to the data, will force the likelihood width to decrease. Further experiments will be necessary to verify the significance of this observation. Yet, the model is well supported by the literature (see Stocker & Simoncelli, 2006; Jogan & Stocker, 2015; Sotiropoulos et al., 2014) and is compatible with the properties of the stimuli (see section 4.3.2).

#### 4.3.2 Discussion: Likelihood

The solution to this optimization problem with respect to $v$ is computed using the Newton-CG optimization method implemented in the Python library scipy. In Figure 7a, we show a histogram of speed estimates $v^MLE(Iv,z)$ performed over 200 motion clouds generated with speed $v=6\u2218/s$ and spatial frequency $z=0.78c/\u2218$. In Figure 7b, we show the evolution of the standard deviation of speed estimates $v^MLE(Iv,z)$ as a function of spatial frequencies $z\u2208{0.47c/\u2218,0.62c/\u2218,0.78c/\u2218,0.94c/\u2218,1.28c/\u2218}$. For each spatial frequency, estimates are again similarly obtained over a set of 200 motion clouds generated with speed $v=6\u2218/s$. First, we observed that $v^MLE(Iv,z)$ is well approximated by a gaussian random variable with mean $v$. Second, the standard deviation of these estimates decreases when the spatial frequency increases. The two conclusions follow the fact that our model is gaussian and that we impose the relation $\sigma V=1/(t\u2605z)$—that standard deviation of speed is inversely proportional to spatial frequency. The decreasing trend combined with a prior for slow speed $a<0$ would reproduce the positive bias of spatial frequency over speed perception observed in section 4.2. If a human subject were estimating speed in such an optimal way, equation 4.7 indicates that inverse sensitivity $\Sigma z,z\u2605$ would also be inversely proportional to spatial frequency. Yet the primary analysis conducted in section 4.2 does not give a clear trend for the inverse sensitivity. As a consequence, that analysis is ambiguous and does not allow us to make definitive conclusions about the compatibility of the MC model and the existing literature with the observed human performances.

### 4.4 Likelihood and Prior Estimation

In order to fit this model to our data, we use an iterative two-step method, each minimizing the Kullback-Leibler divergence between the model and its samples. This process is the equivalent of a maximum likelihood estimate. The first step is to fit each psychometric curve individually and the second step to use the results as a starting point to fit all the psychometric curves together. Numerically, we used the Nelder-Mead simplex method as implemented in the Python library scipy.

- For all $z,z\u2605$, initialized at a random point, computewhere $\varphi v\u2605,z\u2605\mu ,\Sigma $ is defined in equation 4.1.$(\mu \xaf,\Sigma \xaf)=argmin\mu ,\Sigma \u2211vKL(\varphi ^v\u2605,z\u2605|\varphi v\u2605,z\u2605\mu ,\Sigma ),$
- Solve equations 4.6 and 4.7 between $(\mu \xaf,\Sigma \xaf)$ and $(a\xaf,\sigma \xaf)$, initialize at $(a\xaf,\sigma \xaf)$, and computewhere $\varphi v\u2605,z\u2605a,\sigma $ is defined in equation 4.5.$(a^,\sigma ^)=argmina,\sigma \u2211z,z\u2605\u2211vKL(\varphi ^v\u2605,z\u2605|\varphi v\u2605,z\u2605a,\sigma ),$

We use a repeated stochastic initialization in the first step in order to overcome the presence of local minima encountered during the fitting process. The approach was found to exhibit better results than a direct and global fit (third point).

### 4.5 Modeling Results

We use the Bayesian formulation detailed in section 4.3.1 and the fitting process described in section 4.4 to estimate, for each subject, the likelihood widths and the corresponding log-prior slopes under the tested experimental conditions. We plot in Figure 8 the fit of bias and inverse sensitivity for the sigmoid model (see section 4.2) and Bayesian model (see section 4.3.1) averaged over subjects. Figure 9 displays the corresponding likelihood widths and log-prior slopes for the Bayesian model also averaged over subjects. Finally, Figure 10 summarizes the total bias between extremal tested spatial frequencies for each experimental condition and for both models. Error bars correspond to the standard deviation of the mean.

#### 4.5.1 Measured Biases and Inverse Sensitivity

As shown in Figure 8, both models M2 and M3 correctly account for the biases and inverse sensitivity estimated with model M1 (see section 4.2) except for conditions A1 and A2. For condition A1 (see Figure 8a), the bias is underestimated by models M2 and M3 compared to model M1. For condition A2 (see Figure 8e), the inverse sensitivity is overestimated by models M2 and M3 compared to model M1. The observed differences come from the fact that in models M2 and M3, the overlapping spatial frequencies of conditions A1 and A2 are pulled together. As a consequence, the fit is more constrained than for model M1 and is therefore smoother. While that discrepancy does not affect our conclusion, it raises the question of pulling different overlapping conditions together. The overlapping tested spatial frequencies are together, whereas they were collected with different reference spatial frequencies such that the sensitivity of each of the psychometric speed measurement scales appears to have been different. Despite averaging over subjects, the Bayesian estimates of inverse sensitivity appear smoother than the sigmoid estimates (see Figures 8f and 8h). Finally, a clearer decreasing trend is visible in the Bayesian estimates of inverse sensitivity.

#### 4.5.2 Corresponding Sensory Likelihood Widths

There is a systematic decreasing trend within the likelihood width fits in Figures 9a, 9b, and 9d, which shows an inverted U-shape. The fact that all subjects did not run all experimental conditions explains this difference (two subjects out of four show a U-shape bias; see Figure 8c). Subject-to-subject variability is similar for all conditions except for the least temporal variability for which it is smaller (see Figure 9c).

#### 4.5.3 Corresponding Log-Prior Slopes

The log-prior slope estimates have a high subject-to-subject variability for conditions A3 to A5 (see Figures 9f to 9h) compared to conditions A1–A2 (see Figure 9e). The high intersubject variability is expected in speed discrimination tasks, and in the case of conditions A4 and A5, this is particularly magnified by two subjects that have an extremely low value for $az$ (their small markers are not visible in Figures 9e and 9f).

### 4.6 Insights into Human Speed Perception

We exploited the principled and ecologically motivated parameterization of MC to study biases in human speed judgements under a range of parametric conditions. Primarily, we consider the effect of scene scaling on perceived speed, manipulated via central spatial frequencies in a similar way to previous experiments that have shown spatial frequency-induced perceived speed biases (Brooks et al., 2011; Smith & Edgar, 1990). In general, our experimental result confirms that higher spatial frequencies are consistently perceived to be moving faster than compared lower frequencies, which is the same result as reported in a previous study using both simple gratings and compounds of paired gratings, the second of which can be considered as a relatively broadband bandwidth stimulus (Brooks et al., 2011) compared to single grating stimuli, without considering the inhibitive interactions we know to occur when multiple gratings are superimposed (Priebe et al., 2003). In that work, they noted that biases were present but slightly reduced in the compound (broadband) stimuli. That conclusion is consistent with a more recent psychophysics manipulation in which up to four distinct composite gratings were used in relative speed judgments. Estimates were found to be closer to veridical as bandwidth was increased by adding components from the set of four, but increasing spatial frequencies generally biased toward faster perceived speed, even if individual participants showed different trends (Jogan & Stocker, 2015). Indeed, findings from primate neurophysiology studies have also noted that while responses are biased by spatial frequency, the tendency toward true speed sensitivity (measured as the proportion of individual neurons showing speed sensitivity) increases when broadband stimulation is used (Priebe et al., 2003; Perrone & Thiele, 2001). A model of visual motion sensitivity with a hierarchical framework, selectively reading from and optimally decoding V1 inputs in an MT layer, has also been tested. It was found to be consistent with human speed sensitivity to natural images (Burge & Geisler, 2015).

It is increasingly being recognized that linear systems approaches to interrogating visual processing with single sinusoidal luminance grating inputs represent a powerful but limited approach to study speed perception, as they fail to capture the fact that naturalistic broadband frequency distributions may support speed estimation (Brooks et al., 2011; Meso & Simoncini, 2014; Meso & Zanker, 2009; Gekas et al., 2017). A linear consideration, for example, may not fully account for the fact that estimation in the presence of multiple sinusoidal components results in linear optimal combination performing best among alternatives (Jogan & Stocker, 2015). In that case, the simple monotonic increase in perceived speed predicted by the optimal model when components were added to the compound is not seen in the data, particularly in the difference between three and four components. This may be due to interaction between components that are not fully captured by this optimal linear model. Our work seeks to extend the body of previous studies by looking at spatial-frequency-induced biases, using a parametric configuration in the form of motion clouds, which allow a manipulation across a continuous scale of frequency and bandwidth parameters. The effect of frequency interactions across the broadband stimulus defined along the two-dimensional spatiotemporal luminance plane allows us to measure the perceptual effect of the projection of different areas (e.g., see Figure 2) onto the same speed line. The measurement would be the result of proposed inhibitory interactions, which occur during spatiotemporal frequency integration for speed perception (Simoncini et al., 2012; Gekas et al., 2017), which cannot be observed with component stimuli separated by several octaves (Jogan & Stocker, 2015).

We use a slower and a faster speed because previous work using sinusoidal grating stimuli has shown that below the slower range ($<8\u2218/s$), uncertainty manipulated through lower contrasts causes an underestimation of speeds, while at faster speeds ($>16\u2218/s$), it causes an overestimation, an effect that is not fully explained by Bayesian models with a prior encouraging slow speeds. (Thompson et al., 2006; Hassan & Hammett, 2015). Our findings show that biases are larger at the faster speed than the slower one. Biases are also generally lower for the octave-controlled than for the cycle-controlled stimuli, indicating that the underlying system was better at processing the former.

The Bayesian fitting identifies a decrease in the likelihood width estimates, which could explain the biases in over half of our fitted psychometric functions. For cases of the same frequency range where comparable likelihoods are estimated, some conditions—like the low and high $t\u2605$ cases—have very different prior estimates. This result can be interpreted in light of recent work (Gekas et al., 2017): biases might act along the speed line and an orthogonal scale line within the spatiotemporal space, depending on the spread or bandwidth of the stimulus. While the current work does not resolve some of the ongoing gaps in our understanding of speed perception mechanisms, particularly as it does not tackle contrast-related biases, it shows that known frequency biases in speed perception also arise from orthogonal spatial and temporal uncertainties when RMS contrast is controlled—as it is within the MC stimuli. Bayesian models such as the one we apply, which effectively project distributions in the spatiotemporal plane onto a given speed line in which a linear low speed prior applies (Stocker & Simoncelli, 2006), may be insufficient to capture the effect of spatiotemporal priors, which may underlie some of the broad set of empirical results. Individual differences, which are pervasive in these experiments, may also be associated with internal assumptions that can be considered as priors. Indeed for Bayesian models to fully predict speed perception with more complex or composite stimuli, they often require various elaborations away from the simplistic combination of likelihood and low speed prior (Hassan & Hammett, 2015; Gekas et al., 2017; Jogan & Stocker, 2015; Sotiropoulos et al., 2014). Indeed even imaging studies considering the underlying mechanisms fail to find definitive evidence for the encoding of a slow speed prior (Vintch & Gardner, 2014).

## 5 Conclusion

In this work, we have proposed and detailed a generative model for the estimation of the motion of dynamic images based on a formalization of small perturbations from the observer's point of view and parameterized by rotations, zooms, and translations. We connected these transformations to descriptions of ecologically motivated movements of both observers and the dynamic world. The fast synthesis of naturalistic textures optimized to probe motion perception was then demonstrated through fast GPU implementations applying autoregression techniques with much potential for future experimentation. This extends previous work from Sanz-Leon et al. (2012) by providing an axiomatic formulation. Finally, we used the stimuli in a psychophysical task and showed that these textures allow one to further understand the processes underlying speed estimation. We used broadband stimulation to study frequency-induced biases in visual perception, using various stimulus configurations, including octave bandwidth and RMS contrast-controlled manipulations, which allowed us to manipulate central frequencies as scale-invariant stimulus zooms. We showed that measured biases under these controlled conditions were qualitatively the same at both a faster and a slower tested speed. By linking the stimulation directly to the standard Bayesian formalism, we demonstrated that the sensory representation of the stimulus (the likelihoods) in such models is compatible with the generative MC model in over half of the collected empirical data cases. Together with a slow speed prior, the inference framework correctly accounts for the observed bias. We foresee that more experiments with naturalistic stimuli such as MCs and a consideration of more generally applicable priors will be needed in the future.

## Acknowledgments

We thank Guillaume Masson for fruitful discussions during the project. We also acknowledge Manon Bouyé and Élise Amfreville for proofreading. Finally, we strongly thank the reviewers for their useful comments that helped us improve the readability and quality of this letter. L.U.P. was supported by EC FP7-269921, BrainScaleS, and BalaV1 ANR-13-BSV4-0014-02. The work of J.V. and G.P. was supported by the European Research Council (ERC project SIGMA-Vision). A.I.M. and L.U.P. were supported by SPEED ANR-13-SHS2-0006.