Tracking the Emergence of Location-based Spatial Representations in Human Scene-Selective Cortex

Abstract Scene-selective regions of the human brain form allocentric representations of locations in our environment. These representations are independent of heading direction and allow us to know where we are regardless of our direction of travel. However, we know little about how these location-based representations are formed. Using fMRI representational similarity analysis and linear mixed models, we tracked the emergence of location-based representations in scene-selective brain regions. We estimated patterns of activity for two distinct scenes, taken before and after participants learnt they were from the same location. During a learning phase, we presented participants with two types of panoramic videos: (1) an overlap video condition displaying two distinct scenes (0° and 180°) from the same location and (2) a no-overlap video displaying two distinct scenes from different locations (which served as a control condition). In the parahippocampal cortex (PHC) and retrosplenial cortex (RSC), representations of scenes from the same location became more similar to each other only after they had been shown in the overlap condition, suggesting the emergence of viewpoint-independent location-based representations. Whereas these representations emerged in the PHC regardless of task performance, RSC representations only emerged for locations where participants could behaviorally identify the two scenes as belonging to the same location. The results suggest that we can track the emergence of location-based representations in the PHC and RSC in a single fMRI experiment. Further, they support computational models that propose the RSC plays a key role in transforming viewpoint-independent representations into behaviorally relevant representations of specific viewpoints.


INTRODUCTION
Rapidly learning the spatial layout of a new environment is a critical function that supports flexible navigation. This ability is thought to depend on the emergence of location-based representations in scene-selective brain regions that signal where we are irrespective of our current heading direction. As we are unable to sample all possible viewpoints from a given location simultaneously, the formation of location-based representations requires the integration of scenes from differing viewpoints. Despite evidence for the existence of location-based representations in scene-selective regions (e.g., Marchette, Vass, Ryan, & Epstein, 2015;Vass & Epstein, 2013), we know little about how such representations emerge.
In support of these models, human fMRI studies using representational similarity analyses (RSA) have found evidence for viewpoint-independent representations of specific locations (henceforth referred to as "location-based representations") in a network of brain regions including the PHC and RSC (Marchette, Vass, Ryan, & Epstein, 2014;Vass & Epstein, 2013). More recently, panoramic videos have been used to experimentally induce the formation of location-based representations (Robertson, Hermann, Mynick, Kravitz, & Kanwisher, 2016). Assessing pattern similarity for distinct scenes taken from the same location, Robertson et al. provided evidence for greater pattern similarity in the RSC and occipital place area (OPA) after participants had seen a panoramic video showing that two scenes were from the same location. This effect was not evident when participants could not learn that two scenes were from the same location. Interestingly, they also provided evidence for an effect in the PHC that occurred inboth video conditions-that is, regardless of whether participants could learn the scenes were from the same location-suggesting a more general associative role for the PHC.
Despite these results, we still know little about (1) how quickly such representations are formed, (2) what types of spatial information they encode, and (3) under what conditions they are evoked. First, it remains unclear whether location-based representations emerge rapidly after short exposures to a new environment or whether they only develop after prolonged experience. Robertson et al. had participants watch videos outside the scanner, over the course of 2 days, before assessing pattern similarity inside the scanner. To test whether location-based representations can form rapidly, we developed a protocol that permitted us to scan participants before and after a short learning phase, allowing us to estimate changes in pattern similarity as a function of learning in a single fMRI experiment. Second, without tracking the formation of location-based representations, it is difficult to determine exactly what type of information they are representing. For instance, shared representations across viewpoints may relate to long-term semantic knowledge that is invoked when seeing different views of a well-known location (see Marchette, Ryan, & Epstein, 2017). In contrast, rapidly learning representations that are shared across different viewpoints of a new environment implies that the information being encoded is more likely to be spatial rather than semantic in nature.
Third, we do not know whether location-based representations are involuntary retrieved during visual processing. Computational models of spatial navigation predict that allocentric representations are automatically activated and updated by egocentric viewpoints (Bicanski & Burgess, 2018;Byrne et al., 2007). Furthermore, electrophysiological studies in rodents have shown that allocentric representations are automatically activated and updated during exploration (e.g., Monaco, Rao, Roth, & Knierim, 2014;O'Keefe & Dostrovsky, 1971). However, evidence in humans is lacking. Robertson et al. required participants to recall whether scenes were presented on the left or right of the screen, introducing a task that explicitly required them to recall the panorama, and the position of the specific scene within the panorama. Suggesting some level of involuntary retrieval, one fMRI study found that viewpoint-independent representations of specific buildings may be activated when participants judge whether the building is well known to them (Marchette et al., 2014). In the current study, participants performed an unrelated low-level attentional task as the scenes were presented. The activation of location-based representations under these conditions would suggest that they can be retrieved in a relatively automatic manner.
Here, we test whether location-based representations of novel environments can be learnt by integrating visual information across different scenes. Although location-based representations are predicted by models of spatial navigation, they may also be consistent with various other cognitive models (see Discussion). As such, we define locationbased representations to be any type of information that encodes the relationship between different, nonoverlapping views of the same location. We recorded patterns of BOLD activity as participants passively observed a number of scenes depicting different views of novel locations. Subsequently, using an experimental manipulation introduced by Robertson et al. (2016), participants watched videos showing these scenes as part of a wider panorama. Half of the videos allowed participants to learn the spatial relationship between two scenes from the same location (overlap condition). The remaining videos acted as a control by presenting scenes from different locations (no-overlap condition). After the videos, we again recorded patterns of activity for each of the scenes. Whereas Robertson et al. (2016) only assessed scene representations after video presentation, we also scanned before and during the videos; see Clarke, Pell, Ranganath, and Tyler (2016) for a similar preexperimental versus postexperimental design focused on changes in object representations. This allowed us to track the potential emergence of locationbased representations using RSAs as well as assess neural activity when these representations were being formed.
Using generalized linear mixed models, we show that patterns evoked by different scenes become more similar in scene-selective regions of the PHC and RSC after the presentation of the video panoramas. This increase in similarity was specific to the "overlap" video condition, where scenes from the same location were presented together, and was not observed in the no-overlap condition. This suggests the emergence of location-based representations in the PHC and RSC. Importantly, whereas this increase in pattern similarity emerged in the PHC regardless of behavioral performance, the same pattern was only present in the RSC when participants could remember which scenes came from the same location. This finding supports computational models that propose the RSC is critical in translating viewpoint-independent representations in the medial temporal lobe into more behaviorally relevant egocentric representations.

METHODS Participants
Twenty-eight right-handed participants were recruited from the University of York, United Kingdom. These participants had no prior familiarity with the locations used as stimuli in the experiment (see below). All participants gave written informed consent and were reimbursed for their time. Participants had either normal or corrected-tonormal vision and reported no history of neurological or psychiatric illness. Data from five participants could not be included in the final sample because of problems with fMRI data acquisition (one participant), excess of motion-related artifacts in the imaging data (three participants), and a failure to respond during one of the in-scanner tasks (one participant). As such, analyses included 23 participants (10 men) with a mean age of 21.96 years (SD = 3.22 years). The study was approved by a local research ethics committee at the University of York.

Stimuli
We generated 12 panoramic images of different urban locations from the City of Sunderland, and Middlesbrough town center, United Kingdom (Figure 1; osf.io/cgy97). These panoramas spanned a 210°field-of-view horizontally but were restricted in the vertical direction to limit the appearance of proximal features (<2 m from the camera). Throughout the experiment, 24 "endpoint images" displaying 30°scenes taken from either end of each panorama were shown (i.e., centered at 0°and 180°; Figure 1A). These images were shown both inside and outside the scanner to assess participants' spatial knowledge of the depicted locations and for the RSA (see below).
Endpoints were also shown in a series of videos (see osf.io/cgy97 ). In overlap videos, Images A1 and A2 (taken from opposite ends of the same panorama) were presented such that their spatial relationship could be inferred ( Figure 1B). Here, a camera panned from each endpoint to the center of the panorama showing that A1 and A2 belonged to the same location. In contrast, a no-overlap video featured endpoints from two unrelated panoramas (Images B1 and C2). Again, these videos showed an endto-center camera pan from each image. However, because there was no visual overlap between the video segments, observers could only infer that Endpoints B1 and C2 belonged to different locations. The no-overlap condition acted as a control condition, ensuring Endpoints B1 and C2 were seen in a similar video to endpoints in the overlap condition (A1 and A2), with the same overall exposure and temporal proximity. To ensure that the occurrence of a visual overlap was easily detectable, all videos alternated the end-to-center sweep from each endpoint over two repetitions.
Pairs of endpoints from the same panorama were grouped into sets of three. The first pair in each set was assigned to the overlap video condition (A1 and A2). Two endpoints from different panoramas were assigned to the no-overlap video condition (B1 and C2). The remaining endpoints belonged to an "unseen video" condition as they were not shown during any video (B2 and C1). These assignments were counterbalanced across participants such that each image appeared in all three conditions an equal number of times. The order of camera pans during videos (e.g., A1 first vs. A2 first) was also counterbalanced both within and across participants. Analyses showed the visual similarity of image endpoints was matched across experimental conditions as measured by the Gist descriptor (Oliva & Torralba, 2001) and local correlations in luminance and color information (osf.io/6sr9p/ ). Pilot data Figure 1. Stimuli used during, and analyses performed on, the in-scanner tasks. (A) An example location panorama with two endpoint images. Single endpoints were shown during the in-scanner target detection task. As in Robertson et al. (2016), full panoramas were never shown as whole images but were presented during the in-scanner videos. (B) Depiction of the two video conditions: overlap versus no-overlap videos. Overlap videos showed camera pans from each endpoint of a given panorama (denoted A1 and A2) to the center of that panorama. The central overlap allowed participants to learn a spatially coherent representation that included both A1 and A2. No-overlap videos involved pans between endpoints from two different locations (denoted B1 and C2), meaning that there was no visual overlap. (C) Similarity contrast matrices used to model changes in representational similarity between endpoints (i.e., between A1, A2, B1, B2, C1, and C2). Red squares indicate positively weighted correlations, and blue squares indicate zero-weighted correlations. From left to right, the matrices account for the representational similarity of endpoints: (1) from the same location regardless of video condition, (2) that were seen in the same video (including overlap and no-overlap videos), (3) in the unseen condition specifically, and (4) in the overlap condition specifically. Linear combinations of these matrices, along with their interactions with a session regressor (pre-videos vs. post-videos), accounted for each RSA effect across all experimental conditions. revealed that participants could not reliably identify which endpoints belonged to the same location without having seen the videos (osf.io/kgv64/).

Procedure
Before entering the scanner, participants performed a behavioral task to assess their ability to infer which image endpoints were from the same location. Once in the scanner, they undertook a functional localizer task to identify scene-selective regions of the PHC and RSC. They were then shown each image endpoint multiple times (performing a low-level attentional task) to assess baseline representational similarity between each image endpoint (i.e., before learning). During a learning phase, overlap and no-overlap videos were presented, with participants instructed to identify whether the endpoints in each video belonged to the same location or not. After this video learning phase, each image endpoint was again presented multiple times to assess postlearning representational similarity between image endpoints. Finally, outside the scanner, participants performed the same behavioral task (as before scanning) to assess the extent to which participants had learnt which image endpoints belonged to the same location (and a further test of associative memory; see below). A figure illustrating the order and approximate duration of each experimental task is available at osf.io/zh8f2/.

Prescanner/Postscanner Tasks
Participants were tested on their ability to identify which endpoints belonged to the same location both before and after scanning (both outside the scanner). On each trial, one endpoint surrounded by a red box was presented for 3 sec. After this, five other endpoints were displayed in a random sequence, each shown alongside a number denoting the order of appearance (i.e., 1-5; Figure 2A; 2 sec per image, 500-msec ISI). One image in the sequence (the target) was taken from the same panorama as the cue. The remaining four endpoints (lures) belonged to panoramas in the same set of stimuli. As such, if Endpoint B1 was presented as the cue, B2 would be the target, and A1, A2, C1, and C2 would be lures (i.e., a five-alternative forced choice, 5-AFC). After the five alternatives had been shown, participants were prompted to select the target using a numeric key press (1-5). Across 24 trials, each endpoint was used as a cue image.
After scanning and the second block of the location identity task described above, participants were also asked to identify which images appeared together in the same video. Note that this is slightly different to the previous task because participants could have known that Endpoints B1 and C2 appeared in the same video, despite not knowing which endpoints were from the same location (i.e., B2 and C1, respectively). Using a similar procedure to that described above, endpoints from either the overlap or no-overlap video conditions were cued and participants were asked to select the appropriate endpoint from the five alternatives in the same set.

In-scanner Tasks
Functional localizer. Before the main experimental task, participants undertook a functional localizer scan with the purpose of identifying four scene-selective ROIs-in particular, the left and right PHC as well as the left and right RSC. This involved presenting four blocks of scene images (coasts, mountains, streets, and woodlands) interleaved with four blocks of face images (male and female). In each block, 10 unique images were shown in quick succession with a display time of 700 msec per image and an ISI of 200 msec. Blocks were separated with a 9-sec interblock interval, and their running order was counterbalanced across participants. The scene images used here were different to those in the main experiment, and none was repeated during the localizer itself. All images were shown in grayscale and were presented with a visual angle of ∼14°. To ensure localizer images were being attended to, participants were tasked with detecting an oddball target that Figure 2. Behavioral task and results (A) A schematic illustration of the pre-video and post-video behavioral task. One endpoint was first presented as a cue (enclosed by red box), followed by five numbered alternatives. Participants were then prompted to select which one of the alternatives belonged to the same location as the cue. (B) Performance on the pre-video and post-video behavioral tasks plotted by video condition. Error bars represent 95% confidence intervals, and the dashed line at p = .2 reflects 5-AFC chance level.
was superimposed onto one of the images in each block. The target was a small red dot with a 3-pixel radius. When this was seen, participants were required to respond with a simple button press as quickly as possible (mean detection performance: d 0 = 3.116, SD = 0.907).
Presentation of endpoint images. Participants were shown all 24 endpoint images during an event-related functional imaging task. The task was optimized to measure multivariate patterns of BOLD activity specific to individual endpoints and was run both before and after participants had seen the panoramic videos (Session 1: pre-videos; Session 2: post-videos). All endpoints were presented nine times for both the pre-video and post-video functional run. Images were displayed for 2.5 sec with an ISI of 2 sec. The order of stimuli in each functional run was optimized to facilitate the decoding of unique BOLD patterns across endpoints (optimization algorithm available at osf.io/eh78w/ ). No image was presented on successive trials to avoid large adaptation effects, and the design included 12 null events in each functional run (i.e., 10% of all events). Like the functional localizer, participants were tasked with detecting an oddball target that was superimposed onto a small proportion of the images. Here, the target was a group of three small red dots (3-pixel radius, <0.2°), with each dot drawn at a random position on the image. Targets were present on one of every nine trials such that eight repetitions of each endpoint image were target free (target trials were not used to estimate BOLD patterns). As above, participants were required to respond to these targets with a simple button press (mean detection performance, d 0 , were 3.362 [SD = 0.642, pre-videos] and 3.659 [SD = 0.485, post-videos]).
Panoramic video task. Participants watched all video clips from the overlap and no-overlap video conditions while being scanned. Each video lasted 20 sec and was followed by a 10-sec rest period. In the first 3 sec of this rest period, participants were prompted to indicate whether each video segment depicted scenes from the same or different locations. Responses were recorded with a left/right button press. This question was asked to ensure that participants were attending to the visual overlap across segments (mean discrimination performance: d 0 = 3.220, SD = 0.373). All videos were repeated three times in a pseudorandom order to allow for sufficient learning. Before entering the scanner, participants were asked to remember which endpoints were seen together in the same video, even if they appeared in a no-overlap video. Participants were told that a test after the scan would assess their knowledge of this.

MRI Acquisition
All functional and structural volumes were acquired on a 3-T Siemens MAGNETOM Prisma scanner equipped with a 64-channel phased array head coil. T2*-weighted scans were acquired with EPI, 35 axial slices (approximately 0°t o the AC-PC line; interleaved), and the following parameters: repetition time = 2000 msec, echo time = 30 msec, flip angle = 80°, slice thickness = 3 mm, and in-plane resolution = 3 × 3 mm. The number of volumes acquired during (a) the functional localizer, (b) the video task, and (c) each run of the endpoint presentation task was 75, 363, and 274, respectively. To allow for T1 equilibrium, the first three EPI volumes were acquired before the task started and then discarded. Subsequently, a field map was captured to allow the correction of geometric distortions caused by field inhomogeneity (see the MRI Preprocessing section below). Finally, for purposes of coregistration and image normalization, a whole-brain T1-weighted structural scan was acquired with a 1-mm 3 resolution using a magnetization prepared rapid gradient echo pulse sequence.

MRI Preprocessing
Image preprocessing was performed in SPM12 (www.fil.ion .ucl.ac.uk/spm). This involved spatially realigning all EPI volumes to the first image in the time series. At the same time, images were corrected for field inhomogeneity based geometric distortions (as well as the interaction between motion and such distortions) using the Realign and Unwarp algorithms in SPM (Hutton et al., 2002;Andersson, Hutton, Ashburner, Turner, & Friston, 2001). For the RSA, multivariate BOLD patterns of interest were taken as t statistics from a first-level general linear model (GLM) of unsmoothed EPI data in native space. Aside from regressors of interest, each first-level GLM included a set of nuisance regressors: six affine motion parameters, their first-order derivatives, regressors censoring periods of excessive motion (rotations > 1°and translations > 1 mm), and a Fourier basis set implementing a 1/128-Hz high-pass filter. For the analyses of univariate BOLD activations, EPI data were warped to Montreal Neurological Institute space with transformation parameters derived from structural scans (using the DARTEL toolbox; Ashburner, 2007). Subsequently, the EPI data were spatially smoothed with an isotropic 8-mm FWHM Gaussian kernel before GLM analysis (regressors included the same nuisance effects noted above).

ROIs
We generated four binary masks per participant to represent each ROI in native space. To do this, a first-level GLM of the functional localizer data modeled BOLD responses to scene and face stimuli presented during the localizer task. Each ROI was then defined as the conjunction between a "scene > face" contrast and an anatomical mask of each region that had been warped to native space (left/right PHC sourced from Tzourio-Mazoyer et al., 2002; left/right RSC sourced from Julian, Fedorenko, Webster, & Kanwisher, 2012). Thus, the ROIs were functionally defined but constrained to anatomical regions known to be spatially selective. Normalized group averages of these ROIs are available at osf.io/gbznp/ and neurovault.org/collections /4819. Recent evidence suggests that the RSC is composed of at least two functionally distinct subregions, both of which may be scene selective: (1) a retinotopically organized medial place area in posterior sections of the RSC and (2) a more anterior region corresponding to BAs 29 and 30 associated with more integrative mnemonic processes (Silson, Steel, & Baker, 2016). In the current study, we focus on the functionally defined RSC as a whole and do not differentiate between these subregions. However, the functional ROIs that we identified for each participant principally cover anterior sections of the RSC corresponding to BAs 29 and 30 and show little overlap with the retinotopic areas identified by Silson et al. The OPA has also been implicated as a critical sceneselective region (e.g., Robertson et al., 2016;Marchette et al., 2015). Recent research suggests that this region is principally involved in representing environmental boundaries and navigable paths during visual perception (Malcolm, Silson, Henry, & Baker, 2018;Bonner & Epstein, 2017;Julian, Ryan, Hamilton, & Epstein, 2016). However, computational models of spatial navigation do not predict that the OPA maintains location-based representations that are viewpoint invariant (Bicanski & Burgess, 2018;Byrne et al., 2007). In addition, we were only able to reliably delineate the OPA bilaterally in 6 of the 23 participants in our sample. As such, we did not focus on this region in the current study; instead, we restricted our main analyses and family-wise error (FWE) corrections to the PHC and RSC bilaterally. Nonetheless, for completeness, we generated an OPA mask using a normalized group-level contrast and ran the location-based RSA analyses reported below on this region separately (statistical outputs available at osf.io /d8ucj/). No effects of interest were identified in either the left or right OPA.

RSAs
Our general approach to the RSA involved modeling the observed similarity between different BOLD patterns as a linear combination of effects of interest and nuisance variables. Here, the similarity between BOLD responses was taken as the correlation of normalized voxel intensities (t statistics) across all voxels in an ROI. The resulting correlation coefficients were then Fisher-transformed before being subjected to statistical analysis. This transform ensures that the sampling distribution of similarity scores is approximately normal to meet the assumption of normality for statistical inference. We then entered all the transformed similarity scores under test from each participant and stimulus set into a general linear mixedeffects regression model. Although underused in the neurosciences (although see Motley et al., 2018), these models are common in the psychological literature as they offer a robust method of modeling nonindependent observations with few statistical assumptions (Baayen, Davidson, & Bates, 2008). Here, we used mixed-effects models to predict observed representational similarity between endpoints with a set of fixed-effects and randomeffects predictors (discussed below).
Importantly, mixed-effects models allow us to include estimates of pattern similarity across individual items (endpoints) and participants in the same statistical model. The fixed-effects predictors in each model specified key hypotheses of interest. The random effects accounted for statistical dependencies between related observations at both the item and participant levels. RSAs of fMRI data typically either assess patterns across all items (regardless of condition) or average across items in the same condition, meaning that important variation within conditions is ignored. Our modeling approach allows us to examine changes in representational similarity at the level of both items and conditions simultaneously while controlling for statistical dependencies between related observations. Raw similarity data and mean similarity matrices are available on the Open Science Framework (osf.io/cgy97). This page also includes MATLAB functions for estimating each statistical model as well as the model outputs.

Visual Representations of Specific Endpoints
We first examined whether the passive viewing of endpoint images evoked stimulus-specific visual representations in each of our four ROIs (left and right PHC and RSC). Multivariate BOLD responses to the endpoints were estimated for Session 1 (pre-videos) and Session 2 (post-videos) separately. We then computed the similarity of these responses across sessions by correlating BOLD patterns in Session 1 with patterns in Session 2. This resulted in a nonsymmetric, 24 × 24 correlation matrix representing the similarity between all BOLD patterns observed in Session 1 and those observed in Session 2. The correlation coefficients (n = 576 per participant) were then Fisher-transformed and entered as a dependent variable into a mixed-effects regression model with random effects for participants and endpoints. The main predictor of interest was a fixed effect that contrasted correlations between the same endpoints (e.g., A1-A1, B1-B1; n = 24 per participant) with correlations between different endpoints (e.g., A1-A2, A1-B1; n = 552 per participant) across the two sessions.
As well as running this analysis in each ROI, we performed a complementary searchlight analysis to detect endpoint-specific representations in other brain regions. Here, local pattern similarity was computed for each brain voxel using spherical searchlights with a 3-voxel radius (the mean number of voxels per searchlight was 105.56; searchlights were not masked by gray-/white-matter tissue probability maps). Fisher-transformed correlations for same versus different endpoints were contrasted at the first level before running a group-level random-effects analysis.

Location-based Memory Representations
We next tested our principal hypothesis-whether representations of Endpoints A1 and A2 became more similar to one another as a result of watching the overlap videos-in each ROI. Using the multivariate BOLD responses from Sessions 1 and 2, we computed the neural similarity between endpoints that were presented within the same image set and the same session. This resulted in eight symmetric, 6 × 6 correlation matrices for each participant-one per set in Session 1 and one per set in Session 2. All the correlation coefficients from the lower triangle of these matrices (n = 15) were then Fisher-transformed and entered as a dependent variable into a mixed-effects regression model (see Figure 1C). As such, the model included 120 correlation coefficients per participant (2 sessions × 4 sets × 15 similarity scores).
One fixed-effects predictor modeled unspecific changes in similarity between sessions (hereafter referred to the session effect) by coding whether similarity scores were recorded in Session 1 or Session 2. Similarly, a further three fixed-effects predictors modeled similarity differences attributable to (1) endpoints in the overlap condition (i.e., A1-A2), (2) endpoints shown in the same video (A1-A2, B1-C2), and (3) endpoints that were not shown in any video (C1-B2)-shown in Figure 1C. Together, these predictors and their interactions constituted a 2 × 3 factorial structure (Session [1 vs. 2] × Condition [overlap vs. no-overlap vs. unseen]) and so were tested with a Session × Condition F test. Nonetheless, our principal hypothesis holds that there will be a specific interaction between the Session and Overlap predictors (referred to as the Session × Overlap effect), which we report alongside the F test. The model also included a predictor indicating whether endpoints were from the same location (A1-A2, B1-B2, C1-C2), thereby allowing us to estimate changes in similarity between them. This ensured that variance loading onto the Session × Overlap effect was properly attributable to the learning of spatially coherent representations rather than some combination of other factors (e.g., same location + seen in the same video). Note that this model term quantifies similarity differences between overlap endpoints and all other endpoints that "change" between Session 1 and Session 2. A positive effect may indicate either an increase in similarity in the overlap condition or a decrease across all other similarity scores regardless of condition (or both). As such, the model is structured to account for any systematic change in the baseline level of similarity across sessions (see Results). Furthermore, the Session × Overlap term is only sensitive to a learning effect that causes relative shifts in similarity scores specific to the overlap condition and cannot be attributed to any other combination of effects. Finally, the model included a behavioral predictor specifying whether participants were able to match Endpoints A1-A2 in the postscanner task (mean centered with three levels: 0, 1, or 2 correct responses per pair). This examined whether changes in representational similarity were dependent on participants' ability to identify that endpoints from the overlap condition belonged to the same location after scanning (i.e., a three-way interaction: Session × Overlap × Behavior). Random effects in the model accounted for statistical dependencies across image sets, sessions, and participants.
To complement the ROI analyses, we ran a searchlight analysis that tested for RSA effects across the whole brain (searchlight radius: 3 voxels). Here, first-level contrast estimates compared the Fisher-transformed correlations between overlap endpoints (i.e., A1-A2) and all other endpoint correlations (e.g., B1-B2, B1-C1). A group-level analysis then compared these similarity contrasts between sessions to test the Session × Overlap interaction. To test for a Session × Overlap × Behavior interaction, the grouplevel model also included a behavioral predictor specifying a participant's average performance in matching A1 to A2 during the postscanner task (mean centered). Note that this searchlight analysis is not able to control for the potential contributions of other important factors (i.e., same location, same video) that our mixed-effects approach explicitly controls. It is complementary, but secondary, to the ROI analyses.

Statistical Validation and Inference
To ensure that each mixed-effects regression model was not unduly influenced by outlying data points, we systematically excluded observations that produced unexpectedly large residuals more than 2.5 SDs above or below model estimates. This was conducted regardless of condition and so did not bias the analyses to finding an effect (if no effect were present). Furthermore, a highly similar pattern of results was seen when not excluding outliers, supporting the robustness of our findings (see osf.io/dzy3p). Following these exclusions, Kolmogorov-Smirnov tests indicated that residuals were normally distributed across all the linear mixed-effects models. In addition, visual inspection of scatterplots showing residual versus predicted scores indicated no evidence of heteroscedasticity or nonlinearity. Where effects size estimates are contrasted across different models, we report the result as an unequal variance t test with the degrees of freedom being approximated using the Welch-Satterthwaite equation (Welch, 1947).
All p values are reported as two-tailed statistics. FWE corrections related to the multiple comparisons across our four ROIs are made for each a priori hypothesis (denoted p FWE ). In addition, we report whole-brain effects from searchlight and mass univariate analyses when they survive FWE-corrected thresholds ( p FWE < .05) at the cluster level (cluster-defining threshold: p < .001 uncorrected). All other p values are noted at uncorrected levels. As well as reporting null hypothesis significance tests, we present the results of complimentary Bayesian analyses. Unlike the frequentist statistics, these indicate whether the null is statistically preferred over the alternative hypothesis.
As such, we use the Bayesian analyses to determine whether there is evidence for the null when frequentist tests are nonsignificant. For each t test, a Bayes factor in favor of the null hypothesis (BF 01 ) was computed with a Cauchy prior centered at zero (i.e., no effect) and a scale parameter (r) of ffiffiffiffiffiffi ffi 0:5 p (see Gelman, Jakulin, Pittau, & Su, 2008). Bayes factors greater than 3 are taken as evidence in favor of the null hypothesis, whereas those less than 1/3 are taken as evidence in favor of the alternative (Kass & Raftery, 1995). Finally, alongside the inferential statistics, we report Cohen's d effect sizes for each t test. When effects are tested in the context of a mixed-effects model, estimates of Cohen's d are computed from the fixed effects only and exclude variance attributed to random effects.

Behavioral Performance
We first analyzed behavioral responses to the prescanner and postscanner tasks to determine (a) whether participants were able to identify which endpoints belonged to the same location and (b) whether performance increased as a result of watching the overlap videos. A generalized linear mixed-effects analysis modeled correct versus incorrect matches between cue and target endpoints as a function of Session (pre-videos vs. postvideos) and Experimental Condition (overlap, no overlap, and unseen). As such, the model constituted a 2 × 3 factorial design with random intercepts and slopes for both participants and endpoints.
Participants' increased ability to match endpoints in the overlap condition was not characteristic of a general tendency to match endpoints that appeared in the same video (i.e., selecting B1 when cued with C2). This was evident because matches between no-overlap endpoints were not more likely in Session 2 compared with Session 1, t(366) = 0.646, p = .519, BF 01 = 3.785, d = 0.135. In contrast, performance increases in the overlap condition (i.e., the post-video > pre-video effect reported above) were significantly larger than this general effect of matching all endpoints that appeared in the same video, t(949.20) = 5.027, p < .001, BF 01 = 0.002, d = 1.048. In addition, participants were unable to explicitly match no-overlap endpoints shown in the same video during the final behavioral task (comparison to 0.2 chance level: t(334) = −0.467, p = .641, BF 01 = 4.141, d = −0.097). In summary, participants rapidly learnt which scenes were from the same location; however, this was only seen in the overlap condition (and not in the no-overlap condition).
The searchlight analysis that tested for consistent representations of specific endpoints across the whole brain revealed representations in one large cluster that peaked in the right occipital lobe (area V1; t(22) = 11.50, p FWE < .001, k = 5202, BF 01 < 0.001, d = 2.398) and extended into the areas V2, V3, and V4 and the fusiform gyri bilaterally. Three smaller clusters were also detected in the right precuneus, t(22) = 4.64, p FWE = .011, k = 44, BF 01 = 0.005, d = 0.968, right inferior parietal lobule, t(22) = 4.40, p FWE = .028, k = 37, BF 01 = 0.008, d = 0.918, and right RSC, t(22) = 4.32, p FWE = .025, k = 38, BF 01 = 0.008, d = 0.901. The latter effect overlapped considerably with the right RSC ROI identified for each participant. However, the effect size estimated in the ROI analysis was weaker than the peak searchlight effect, principally because it was variable across endpoints and as such largely accounted for by random effects in the model. Unthresholded statistical maps of these effects are available at neurovault.org/collections/4819. In summary, we find evidence that the PHC (bilaterally), the right RSC, and a number of early visual areas maintained consistent representations of specific endpoints across scanning sessions. Note that whether a region codes such representations across scanning sessions is independent of whether it may learn location-based memory representations in the second session; these effects are, in principle, dissociable.
As part of a supplementary analysis, we also tested for visual representations of specific scenes that remained stable within (but not necessarily across) scanning sessions (see osf.io/exzba/). To quantify the BOLD similarity of specific scenes within each session, we required two independent pattern representations per session. Thus, across both sessions, we estimated voxel patters derived from four distinct periods: (a) first half of Session 1 (pre-videos), (b) second half of Session 1 (pre-videos), (c) first half of Session 2 (post-videos), and (d) second half of Session 2 (post-videos). As a result, each of these voxel representations was only derived from four endpoint presentations. Nonetheless, when similarity scores were modeled in a mixed-effects regression, each ROI showed greater levels of similarity between representations of the same endpoint relative to the similarity between different endpoints (weakest effect in the left PHC: t(26492) = 2.211, p = .027, BF 01 = 0.606, d = 0.461). Furthermore, this analysis revealed that representations of the same endpoints became more similar to one another after the videos in the right RSC and left PHC (weakest effect: t(26492) = 2.598, p = .009, BF 01 = 0.308, d = 0.542). This latter effect was insensitive in the right PHC and left RSC (weakest effect: t(26492) = 1.671, p = .095, BF 01 = 1.375, d = 0.348).

Effects in the Right PHC
Next, we report the results of the mixed-effects model examining whether pattern similarity between different endpoints changed across sessions as a result of watching the videos. This revealed a significant Session × Condition interaction in the right PHC, F(2, 2739) = 6.827, p FWE = .004 (average similarity matrices shown in Figure 3; condition estimates and confidence intervals plotted in Figure 4A). Post hoc tests showed that this effect was driven by a difference between pre-video to post-video sessions for endpoints in the overlap condition, t(2739) = 2.923, p = .004, BF 01 = 0.167, d = 0.610. This difference was not observed in any other condition (no overlap: t(2739) = 0.756, p = .450, BF 01 = 3.533, d = 0.156; unseen: t(2739) = −0.970, p = .332, BF 0 1 = 3.001, d = −0.202). Furthermore, a significant Session × Overlap interaction highlighted that the similarity differences in the overlap condition were attributable to the video manipulation alone rather than some combination of other factors, t (2739) = 2.549, p FWE = .043, BF 01 = 0.337, d = 0.532. Importantly, before the videos were shown, pairs of endpoints from the same location (i.e., A1-A2, B1-B2, and C1-C2) were found to evoke neural patterns that were more similar to each other than pairs of endpoints from different locations in the right PHC (e.g., A1-B2, B1-C2), t(2739) = 2.498, p FWE = .050, BF 01 = 0.369, d = 0.521 (see osf.io/uxhs9 for a plot of this effect). This "same-location" effect suggests that, even before the spatial relationship between scenes were known, the right PHC encoded visual properties of those scenes that generalized across different views. These data demonstrate that, despite controlling for similarity across stimuli using both the GIST descriptor and a pixel-wise correlation, and despite participants being unable to infer which endpoints were from the same location before watching the videos, we still found evidence for a "same-location" effect in the right PHC. This underlies the critical role of estimating pattern similarity before learning to identify significant increases in similarity post-video relative to pre-video (cf. Robertson et al., 2016). Note that this "same-location" effect is only seen when collapsing across Similarity between endpoints before the panoramic videos were shown (i.e., in Session 1). (B) Similarity between endpoints after the panoramic videos were shown (i.e., in Session 2). (C) Change in similarity that followed the panoramic videos (i.e., Session 2 minus Session 1). Color bars indicate both raw and baseline-adjusted Fisher z statistics (above and below the color bar, respectively). Adjusted statistics account for trivial differences in similarity across scanning sessions caused by motion and scanner drift. This is achieved by subtracting out a baseline level of similarity between nonassociated endpoints (i.e., endpoints that were not from the same location, video, or experimental condition). Note that the baseline-adjusted statistics are shown for illustrative purposes only; each RSA was conducted on the raw Fisher z statistics alone. Crosshatchings along the diagonal elements represent perfect correlations between identical BOLD responses and so were not included in the analyses. all endpoint pairs and is not evident in the Session 1 Overlap condition alone (osf.io/uxhs9).

Effects in the Right RSC
The Session × Condition and Session × Overlap interactions were not significant in any other ROI (Fs < 2.775, p FWE s > .250; similarity estimates for the right RSC plotted in Figure 4C). However, we saw a significant Session × Overlap × Behavior interaction in the right RSC, t(2728) = 2.886, p FWE = .016, BF 01 = 0.179, d = 0.602 ( Figure 4D). This suggests that the RSC only encoded viewpointindependent representations when the spatial relationships between endpoints could be retrieved during the postscanner test. No other ROIs showed a significant Session × Overlap × Behavior interaction (largest effect: t = 0.050, p FWE = 1, BF 01 = 4.567, d = −0.010).

Differentiating the PHC and RSC
We next assessed whether there was evidence for dissociable roles of the right PHC and RSC, given that both represented location-based information but were differently associated with behavioral performance. Specifically, we assessed whether location-based representations in the RSC were significantly more associated with participants' ability to match endpoints from the same location compared to representations in the PHC. This would suggest that the RSC plays a greater role in guiding behavioral performance than the PHC. We therefore tested whether the Session × Overlap × Behavior (three-way) effect was larger in the RSC than the PHC. A comparison of effect sizes did show evidence for such a dissociation, t(5311.9) = 3.931, p < .001, BF 01 = 0.021, d = 0.820.
This implies that the right PHC might have exhibited above-baseline pattern similarity between A1 and A2 endpoints even when those endpoints were not subsequently remembered as belonging to the same location. We directly tested this by rerunning the RSA having excluded A1-A2 pairs that were consistently remembered as belonging to the same location (i.e., having two correct responses during the postscanner test). Despite these exclusions, pattern similarity differences in the overlap condition remained significant, t(1188) = 2.364, p = .018, all ps > .156, BF 01 s > 1.895, ds < 0.296). Nonetheless, there was a significant association between behavioral performance and similarity changes in the overlap condition, t(2728) = 2.886, p = .004, BF 01 = 0.179, d = 0.602. All bars plot baseline-corrected similarity estimates having subtracted out correlations between nonassociated endpoints (e.g., A1-B1, A1-B2). As such, the zero line in A and C denotes the average similarity of these nonassociated endpoints in each session. Error bars indicate 95% confidence intervals.
In summary, we saw an increase in pattern similarity in the right PHC and right RSC between different scenes of the same location after they had been presented in an overlap video. Furthermore, we observed a dissociation between the PHC and the RSC. Whereas the PHC showed increased pattern similarity regardless of performance on the postscanner test, the RSC only showed increased pattern similarity when participants were able to subsequently identify those scenes as belonging to the same location.

Across-Session Decreases in Pattern Similarity
Our mixed-effects regression models were conducted on the raw Fisher z scores computed from each pair of endpoints. This ensured that effects were not driven by complex data manipulation or scaling, and so the data were not adjusted to account for across-session shifts in the similarity of all multivariate patterns (see Methods). Interestingly however, we did observe that Fisher z scores decreased from pre-video to post-video across all pairs of endpoints regardless of condition, in each ROI (see figure at osf.io/2y3pm). This is reflected by a notable session effect in each mixed-effects model indicating reduced levels of similarity between nonassociated endpoints (i.e., endpoints not belonging to the same location, video, or experimental condition; minimum effect size: t(2736) = −1.529, p = .126, BF 01 = 1.655, d = 0.319). As the size of this session effect was relatively large, the Session × Overlap and Session × Overlap × Behavior interactions involved less of a decrease in similarity scores relative to all other conditions (see Figure 3).
Given that similarity scores decrease across all endpoint pairs, it is unlikely that the Session effect was a direct result of our video manipulation (i.e., learning-induced neural differentiation). A mass differentiation on this scale would imply implausibly large amounts of information gain as the uniqueness (or entropy) of all neural representations would have to increase. Instead, it is more likely that the reduced levels of similarity were caused by systematically higher levels of noise in the second session. Most significantly, increases in temperature caused by radio frequency absorption during scanning will shift the thermal equilibrium that governs how many hydrogen nuclei are aligned to the external magnetic field (B 0 ) and can therefore contribute to the MR signal (see osf.io/8kns6/). In this case, we would expect to see similar shifts in the level of similarity across the entire brain. To test this, we measured pattern similarity in the genu of the corpus callosum, a region that should exhibit negligible levels of BOLD activity. On the basis of a seed voxel at Montreal Neurological Institute of [0, 26, 6], multivariate patterns were taken from the 122 whitematter voxels closest to that seed in native space. The size of this ROI was chosen to reflect the average size of our a priori ROIs. A mixed-effects regression model of these data did indeed show reduced levels of neural similarity from Session 1 to Session 2, t(2739) = −2.167, p = .030, BF 01 = 0.651, d = −0.452 (similar in magnitude to the session effect in all other regions; see osf.io/p9qzx/).
In summary, we conclude that the overall decrease in pattern similarity across sessions was not driven by any meaningful change in neural representations and, once controlled for, reveals a significant increase in pattern similarity in both the right PHC and RSC in the overlap condition, indicative of viewpoint-independent representations.

Laterality of RSA Effects
The above analyses identified location-based representations in both right-hemisphere ROIs but no similar effects in the left hemisphere. Given this, we explored whether each RSA effect was significantly stronger in the right versus left hemisphere. Comparing the Session × Overlap effects in the PHC did indeed reveal a significantly stronger effect in the right hemisphere, t(5390.5) = 3.798, p < .001, BF 01 = 0.028, d = 0.792. Similarly, comparing the Session × Overlap × Behavior interactions in the RSC revealed a significantly stronger effect in the right hemisphere, t(5427.4) = 2.708, p = .007, BF 01 = 0.251, d = 0.565. Note that Robertson et al. (2016) collapsed their analyses across hemisphere, potentially masking laterality effects. These results are consistent with observations and theoretical models that the right hemisphere may preferentially process spatial information in humans as a consequence of predominantly left-lateralized language function (Shulman et al., 2010;Vallortigara & Rogers, 2005;Smith & Milner, 1981).

Searchlight RSA
The searchlight analysis that tested for a Session × Overlap interaction across the whole brain revealed one small cluster in the right inferior occipital gyrus (Area V4), t(21) = 4.78, p FWE = .010, k = 38, BF 01 < 0.003, d = 0.997. However, when BOLD similarity in the cluster was modeled with the full mixed-effects analysis described above, the Session × Overlap effect was found to not be significant, t(2740) = 1.734, p = .083 (uncorrected), BF 01 = 1.259, d = 0.361. Model parameter estimates suggested that the searchlight effect was driven by below-baseline BOLD similarity in the overlap condition before the videos were shown (95% CI [−0.116, −0.021]), a result that is not consistent with any effect of interest. No other areas showed a significant Session × Overlap or Session × Overlap × Behavior interaction in the searchlight analysis. Nonetheless, both of the previously reported effects in the PHC and RSC are evident in the searchlight analysis at subthreshold levels, t(21) > 2, d > 0.417 (see neurovault.org/collections/4819/).

Univariate Responses to Endpoints
We investigated whether each of our ROIs produced univariate BOLD activations consistent with a Session × Condition interaction or a three-way interaction with behavior. No such effects were found, all Fs < 1.140, ps > .288. Furthermore, a mass univariate analysis testing for these effects at the whole-brain level yielded no significant activations.

Univariate Responses to Videos
Finally, we investigated whether univariate BOLD responses to the video clips differed between the overlap and no-overlap conditions or as a function of scene memory in the postscanner test. A group-level model was specified with predictors for (1) video type (overlap vs. no-overlap), (2) post-video performance in matching A1 and A2 endpoints, and (3) the interaction between video type and behavioral performance. This revealed two clusters that produced significantly greater BOLD responses during overlap versus no-overlap videos ( Figure 5, hot colors). The largest of these peaked in the medial pFC and extended into the anterior cingulate, left frontal pole, and left middle frontal gyrus, t(21) = 5.53, p FWE < .001, k = 600, BF 01 < 0.001, d = 1.153. The second cluster peaked in the left supramarginal gyrus, t(21) = 5.40, p FWE = .004, k = 185, BF 01 = 0.001, d = 1.126, adjacent to a smaller, subthreshold effect in the left angular gyrus.
No effects for the reverse contrast (i.e., no overlap > overlap) reached statistical significance at the whole-brain level (subthreshold effects shown in Figure 5, cool colors). However, a small volume correction for the PHC and RSC bilaterally revealed two clusters with a significant nooverlap > overlap effect. These were found in the right RSC, t(21) = −4.84, p FWE = .032, k = 26, BF 01 = 0.003, d = −1.001, and right PHC, t(21) = −4.77, p FWE = .026, k = 30, BF 01 = 0.003, d = −0.995, extending into the fusiform gyrus. Subthreshold effects for the no-overlap > overlap contrast were also evident in the left RSC and PHC. These results were mirrored by a linear mixed-effects model contrasting overlap and no-overlap video responses averaged across each ROI in native space. Here, both the right PHC and right RSC exhibited greater BOLD activity in the no-overlap video condition relative to the overlap condition, t (42)  In summary, we saw greater activity in the medial pFC during the overlap videos relative to the no-overlap videos. In contrast, the PHC and RSC showed greater activity during the no-overlap relative to overlap videos. In other words, the medial posterior regions that showed increased pattern similarity after presentation of the overlap video showed decreased activity while participants were watching the videos.

DISCUSSION
We show that scene-selective brain regions rapidly learn location-based representations of novel environments by integrating information across different viewpoints. Once participants observed the spatial relationship between two viewpoints from a given location, BOLD pattern similarity between viewpoints increased in the right PHC and RSC, implying the emergence of location-based representations. In the right PHC, these representations appeared regardless of whether participants could identify which scenes were from the same location. In contrast, representations in the right RSC only emerged for scene pairs that participants could subsequently identify as being from the same location.
The results provide further evidence that the PHC and RSC support spatial representations that are not solely driven by visual features in a scene (Robertson et al., 2016;Marchette et al., 2015;Vass & Epstein, 2013;cf. Watson, Hartley, & Andrews, 2017). Using a similar panoramic video manipulation, Robertson et al. (2016) suggested that the RSC and OPA maintain viewpoint-independent representations but found a more general associative effect in the PHC. Our results further identify the PHC in this process and highlight that RSC representations are more tightly linked to behavior. Note that the OPA was not one of our a priori ROIs, and we therefore make no claims in relation to this region supporting location-based representations (see ROIs section for further details). Our results also place constraints on models that describe how location-based representations are used. Unlike Robertson et al., we show that viewpoint-independent representations are evoked during passive viewing, in the absence of any explicit memory task (although we cannot rule out the possibility that participants engaged in active imagery, as explicitly required in Robertson et al.; see below).
Furthermore, we show that the learning of locationbased representations can take place rapidly (in a single scanning session), with few exposures to the spatial layout of a location. Consistent with this, the firing fields of place cells have been shown to emerge rapidly in the rodent hippocampus (Monaco et al., 2014). Novel locations, where rats engaged in head-scanning behavior (i.e., exploration), were associated with place fields the next time the rat visited the same location. Our results provide evidence that location-based representations form after only three learning exposures to the videos. Although we were specifically interested in the emergence of viewpoint-independent spatial representations, a similar approach could be used to track the emergence of viewpoint-independent representations of other stimulus categories (e.g., objects or faces; see Clarke et al., 2016, for a similar approach), opening the door to understanding how such representations are formed, or modulated, across the visual system.
We also found that the right RSC only exhibited locationbased representations when participants were able to identify which scenes belonged to that location in a postscanner test (PHC representations emerged regardless of behavioral performance on the postscanner test). This implies that the ability to identify differing scenes as from the same location is perhaps more dependent on representations in the RSC than PHC. Computational models hold that medial posterior and temporal regions (including the PHC and RSC) perform distinct but complementary functions in support of spatial navigation and imagery (Bicanski & Burgess, 2018;Byrne et al., 2007). Specifically, the PHC is thought to represent allocentric information related to the spatial geometry of the environment. Conversely, the posterior parietal cortex supports egocentric representations that allow the organism to actively navigate. The RSC transforms allocentric representations in the MTL into egocentric representations in the parietal cortex (and vice versa). Critically, the models predict that spatial navigation and planning is carried out in an egocentric reference frame. Thus, the RSC is critical to the translation of allocentric to more behaviorally relevant, egocentric information.
Our task required participants to match distinct scenes from the same location. This likely requires transformation from the presented egocentric viewpoint to an allocentric representation (egocentric-to-allocentric; i.e., A1 to the allocentric representation A*). In turn, the allocentric representation may allow for the retrieval of the associated viewpoint from the same location (allocentric-toegocentric; i.e., A* to A2). Under this assumption, the RSC is likely to be more tightly coupled to behavior relative to the PHC, as shown in the present data. This is because allocentric representations in the PHC only require the initial egocentric-to-allocentric transformation to be retrieved (A1-A*). If only the egocentric-to-allocentric transformation occurs, participants will not be able to perform the task. As such, it is possible to see evidence for allocentric PHC representations in the absence of accurate behavior. For allocentric representations in the RSC to be retrieved, both the initial egocentric-to-allocentric (A1-A*) and subsequent allocentric-to-egocentric (A*-A2), transformation is required. If both transformations occur, then participants should be able to perform the task accurately. Thus, location-based representations in the RSC may only be seen in the presence of accurate behavior and may reflect the transformation between reference frames rather than reflecting an allocentric representation per se.
A related possibility is that, during the passive viewing of specific scenes, participants engaged in active imagery of the associated scenes, leading to subsequent improvements in behavior for scenes from the same location. However, we note that the task did not explicitly require memory retrieval; participants responded to oddball targets leaving little time for active imagery (see Linde-Domingo, Treder, Kerrén, & Wimber, 2019). In addition, participants would only be able to engage in active imagery on the overlap trials alone. Despite this, we did not observe any univariate BOLD effects indicative of additional processing on these trials. As such, the activation of these representations does not appear to depend on any taskspecific memory demands. It is possible that the retrieval of PHC representations (i.e., egocentric-to-allocentric mapping) occurs relatively automatically, consistent with the proposal that allocentric representations in the MTL are automatically updated during self-motion in an environment (Bicanski & Burgess, 2018;Byrne et al., 2007). However, the retrieval of associated egocentric representations (i.e., allocentric-to-egocentric mapping) may not occur automatically during passive viewing, consistent with the observation that viewpoint-independent representations in the RSC are abolished when participants engage in a task that prevents them from active retrieval of spatial information (Marchette et al., 2015). Importantly, both of the above accounts are consistent with the proposal that the RSC plays a critical role in mapping between allocentric and egocentric representations.
Although consistent with models of allocentric processing, it is possible that the location-based representations we observed reflect other forms of associative learning (e.g., O'Reilly & Rudy, 2001). On this view, Scene A1 may become bound to A2 via a simple associative representation such that, after seeing the videos, A2 is covertly retrieved when presented with A1 (leading to increased pattern similarity). However, contrary to our findings, this simple account may also predict increased similarity in the no-overlap condition, where B1 and C2 are shown in the same video-particularly given that models of associative learning often rely on prediction error signals to account for incidental encoding (Den Ouden, Friston, Daw, McIntosh, & Stephan, 2009), which could be strongest in the no-overlap condition. A second possibility is that the overlapping content in the overlap videos (relative to the no-overlap videos) increases the probability of a direct association between A1 and A2. Indeed, it is the overlapping content that likely drives the increase in pattern similarity between overlap endpoints. Our current study is not able to discern whether the resulting "locationbased" representations are associative, or truly allocentric, in nature.
In terms of associative learning, a related possibility is that the overlapping content supports a more complex transitive representation (e.g., A1-AX and A2-AX where X is the overlapping scene in the center of the panorama). On this account, presentation of A1 cues the retrieval of AX and subsequently A2 (similar in nature to AB-AC inference paradigms; see Joensen, Gaskell, & Horner, 2020;Schlichting, Mumford, & Preston, 2015;Horner & Burgess, 2014;Schlichting, Zeithamova, & Preston, 2014;Zeithamova, Dominick, & Preston, 2012). Representations that encode these transitive relationships between scenes are possible and may support spatial navigation but are not directly predicted by models of spatial memory (Bicanski & Burgess, 2018;Byrne et al., 2007). Furthermore, the hippocampus and medial PFC (mPFC) are more typically associated with transitive inference (Schlichting et al., 2014(Schlichting et al., , 2015Zeithamova et al., 2012), yet we only found evidence of location-based representations in scene-selective regions. In addition, Robertson et al. have demonstrated that associative memory for scenes belonging to different locations is poor (comparable to their no-overlap condition) even when those scenes are presented in a "morphed" panorama such that they are associated with a common context. As such, our data are suggestive of processes that go beyond associative or transitive learning and provide support for models of allocentric processing, although we cannot rule out an "associative" explanation.
Finally, it is noteworthy that certain nonspatial models may be able to account for our findings. In particular, models of directed attention may predict increased levels of pattern similarity in the overlap condition if the overlap videos alerted participants to visual features that are shared across scenes (e.g., Luo, Roads, & Love, 2020;Mack, Preston, & Love, 2013). Further work will be needed to fully establish the true nature of the location-based representations that we report here. To fully match all visual features across scenes in each condition, one possibility would be to experimentally manipulate the central section of continuous panoramas so that no coherent spatial representation can be learned. Furthermore, to fully distinguish between allocentric and transitive (A1-AX-A2) representations, an imaging study incorporating the panoramic morph manipulation used by Robertson et al. may be used.
Although we directly link to computational models of spatial navigation and imagery, as well as rodent studies on spatial navigation, it is important to note that we have assessed pattern similarity during visual presentation of static scenes. This is a common approach in human fMRI (Bonner & Epstein, 2017;Robertson et al., 2016;Marchette et al., 2015;Marchette et al., 2014), as it allows one to control for many potential experimental confounds that might be present in a more ecologically valid experimental setting (e.g., using virtual reality; Julian et al., 2016;Doeller, Barry, & Burgess, 2010). However, this approach has the issue of being further removed from real-world spatial navigation. Interestingly, we saw evidence for increases in pattern similarity despite using a low-level attentional task, speaking to the potential automaticity of retrieving more location-based representations. Across the literature, there are numerous examples of evidence for spatial learning in humans and rodents during goal-directed navigation (Aoki, Igata, Ikegaya, & Sasaki, 2019;Howard et al., 2014), non-goaldirected navigation (e.g., O'Keefe & Dostrovsky, 1971;Tolman, 1948), mental imagery (e.g., Bellmund, Deuker, Schröder, & Doeller, 2016;Horner, Bisby, Zotow, Bush, & Burgess, 2016), and viewing of static images (e.g., Robertson et al., 2016;Marchette et al., 2015;Vass & Epstein, 2013). Our study adds to this growing literature suggesting that these representations can be assessed across diverse experimental environments with multiple methodologies.
The PHC has been proposed to represent several complementary spatial representations, including geometric information regarding one's location in relation to bearings and distances to environmental features (e.g., boundaries; Park, Brady, Greene, & Oliva, 2011). The representations that we observed in PHC may reflect enriched spatial representations relating specific scenes to environmental features outside the current field of view. Also consistent with our results, the PHC may represent spatial contexts more broadly . The experimental manipulation used here could be modified to learn novel locations in the same spatial context, potentially dissociating between the above accounts. A further proposal is that viewpoint-independent representations in the PHC reflect prominent landmarks that are visible across viewpoints (Marchette et al., 2015). Although this proposal yields similar predictions to above, it is less able to account for our finding of shared representations of views that did not contain any of the same landmarks.
Our PHC results are somewhat inconsistent with those of Robertson et al. (2016). Whereas our similarity increases were specific to the overlap condition, Robertson et al. saw effects in both their overlap and no-overlap conditions. One possibility is that our results reflect a Type II error, in that we failed to find an effect in the no-overlap condition when one is present. A second possibility is that Robertson et al. either (1) found an effect in the no-overlap condition when one is not present (i.e., a Type I error) or (2) failed to find a similarity effect in the overlap condition that was significantly larger than in the no-overlap condition (a Type II error). Notably, the "overlap > no-overlap" effect size that we observed in the PHC is considerably larger than the same effect reported by Robertson et al. (0.610 relative to −0.062) and more in line with their RSC and OPA effects (0.470 and 0.415, respectively). Thus, it seems plausible that the disagreement stems from a Type I error in Robertson et al. Despite this, without further information, it is not possible to draw clear conclusions.
However, one important caveat is that we also saw evidence for a "same-location" effect in the PHC before learning had occurred. This effect was seen despite controlling for visual similarity across stimuli using the GIST descriptor, accounting for pixel-wise correlations in luminance and color content, and despite participants being unable to identify which endpoints were from the same location before the videos. It is therefore possible that the PHC effects in Robertson et al. could have been driven by a similar effect not dependent on learning. This underlines the importance of including a prelearning versus postlearning estimate of pattern similarity, to definitively rule out trivial effects driven by preexisting similarities between images that are difficult to control for.
RSC representations may reflect the retrieval of spatial or conceptual information associated with the environment (Marchette et al., 2015). Further evidence suggests that the RSC contains multiple viewpoint-dependent and viewpoint-independent (Vass & Epstein, 2013), as well as local and global (Jacob et al., 2017;Marchette et al., 2014), spatial representations. This multitude of representations fits with the proposed role of the RSC as a transformation circuit, mapping between allocentric and egocentric representations. The heterogeneity of representations, relative to the PHC, may also be a further reason why we did not see clear evidence for location-based representations without taking behavior into account. Our RSC results are consistent with those of Robertson et al. in that they saw a clear effect after more extensive learning across 2 days (where behavioral performance was likely higher than in our study). However, we extend these findings to show that these effects are specifically associated with the locations that each individual participant has learned (i.e., a within-participant correlation that is consistent across participants). Regardless of the exact nature of such representations, our results provide clear evidence that we can track their emergence in both the PHC and RSC.
Although more explorative, we also examined activity during learning of new spatial relationships (i.e., video presentation). BOLD activations in medial posterior brain regions (including but not limited to the PHC and RSC ROIs) were greater for no-overlap videos compared to overlap videos. This effect perhaps reflects greater fMRI adaptation during the overlap videos because they presented the central viewpoint of the panorama more frequently than no-overlap videos (Figure 1). However, it is interesting that the same cortical regions that showed increased pattern similarity after presentation of the overlap video showed decreased activity when participants were watching the videos. This underlines the complex relationship between univariate activity during learning and resultant changes in patterns of activity after learning. More theoretically driven research would be needed to provide a robust explanation for this finding.
In addition, we found that mPFC showed a greater BOLD response in the overlap than no-overlap condition. This may reflect a mnemonic integration process that guides the learning of viewpoint-independent representations. Similar effects in mPFC have been observed in tasks that require integrating overlapping memories to support inference and generalization (Milivojevic, Vicente-Grabovetsky, & Doeller, 2015;Schlichting et al., 2015). Indeed, mPFC has been implicated in detecting new information that is congruent with previously learnt materials so that it can be integrated into a generalized representation (van Kesteren, Ruiter, Fernández, & Henson, 2012). Our results are broadly in line with this proposal, where mPFC may be detecting the presence of overlapping spatial information during the overlap videos, resulting in the integration of previously learnt representations into more coherent viewpoint-independent representations in posterior medial regions. Despite this, our results do not exclude the possibility that mPFC activations reflect disinhibition from medial-posterior inputs (which showed reduced activity) or attentional differences related to the behavioral task.
We have shown that brain regions in the scene network, specifically the right PHC and RSC, rapidly learn representations of novel environments by integrating information across different viewpoints. They appear to be relatively viewpoint-independent in that they become active regardless of which part of an environment is in the current field of view. We show that the PHC and RSC have potentially dissociable roles, consistent with models that propose the RSC plays a role in translating viewpoint-independent representations into a behaviorally relevant egocentric reference frame. Finally, our experimental approach allows for tracking the emergence of viewpoint-independent representations across the visual system.