## Abstract

Computer vision algorithms are often limited in their application by the large amount of data that must be processed. Mammalian vision systems mitigate this high bandwidth requirement by prioritizing certain regions of the visual field with neural circuits that select the most salient regions. This work introduces a novel and computationally efficient visual saliency algorithm for performing this neuromorphic attention-based data reduction. The proposed algorithm has the added advantage that it is compatible with an analog CMOS design while still achieving comparable performance to existing state-of-the-art saliency algorithms. This compatibility allows for direct integration with the analog-to-digital conversion circuitry present in CMOS image sensors. This integration leads to power savings in the converter by quantizing only the salient pixels. Further system-level power savings are gained by reducing the amount of data that must be transmitted and processed in the digital domain. The analog CMOS compatible formulation relies on a pulse width (i.e., time mode) encoding of the pixel data that is compatible with pulse-mode imagers and slope based converters often used in imager designs. This letter begins by discussing this time-mode encoding for implementing neuromorphic architectures. Next, the proposed algorithm is derived. Hardware-oriented optimizations and modifications to this algorithm are proposed and discussed. Next, a metric for quantifying saliency accuracy is proposed, and simulation results of this metric are presented. Finally, an analog synthesis approach for a time-mode architecture is outlined, and postsynthesis transistor-level simulations that demonstrate functionality of an implementation in a modern CMOS process are discussed.

## 1 Introduction

The primate brain can rapidly perceive and react to complex scenes intelligently because it does not process the entire scene at once. Instead, the early stages of the visual system prioritize a subset of the visual field for more immediate high-level processing. Visual saliency is the subjective property of these selected regions that makes them distinct from their neighboring areas (Koch & Ullman, 1987). Visual saliency has been incorporated into computational models of attention that are used to study perception (Koch & Tsuchiya, 2007). From an engineering perspective, saliency is a biologically inspired attentional operator that can be used to compress image data prior to more computationally complex algorithms. By prioritizing the processing of salient regions in an image, it may be possible to improve the performance of artificial vision (or other sensor array) systems with a minimal reduction in the accuracy of tasks such as object recognition and classification.

Visual saliency in biological visual systems is realized physically as networks of neuronal cells such as simple cells and complex cells (Itti, Koch, & Niebur, 1998; Russell, Mihalaş, von der Heydt, Niebur, & Etienne-Cummings, 2014). The function of these networks can be accurately emulated for engineering applications using software or hardware. While software implementations are useful for prototyping and understanding neuromorphic algorithms, direct hardware implementations are more practical for real-time performance in high-bandwidth applications such as image processing. Therefore, this letter discusses the codesign of a computationally efficient saliency algorithm with the underlying analog hardware because the computational and physical design constraints cannot be fully decoupled.

This letter introduces a novel, computationally efficient bottom-up saliency algorithm. Although the focus of this letter is the algorithm itself, the algorithm's structure is inspired by the advantages of an analog computation framework based on time-mode (TM) circuits. Therefore, we discuss both computational and analog hardware considerations. With respect to this proposed algorithm, a primary contribution of this letter is an open source implementation, dubbed pysaliency, written in Python. The source code for this package has been made publicly available: git://code.ece.tufts.edu/nanolab/pysaliency. The following features are included in the package:

- An implementation of a state-of-the-art saliency algorithm (JHU POIS) with a large number of easily adjustable parameters
- A multiprocess implementation of a distributed, hardware-optimized saliency algorithm with support for introducing various sources of error
- Theano functions and scripts for training a network representing the serialized version of the saliency algorithm as well as running the forward model on a GPU

This letter is outlined as follows. Section 2 reviews the state of the art of visual saliency algorithms. Section 3 introduces the hardware-optimized saliency algorithm. Section 4 analyzes the effects of the various optimizations on hardware and computational cost. Section 5 presents simulation results quantifying the accuracy of the proposed algorithms. Section 6 covers the implementation details of the algorithm in time-mode analog hardware. Finally, section 7 provides comparisons and closing remarks.

## 2 Background

### 2.1 Visual Saliency Algorithms as Preprocessors for Machine Vision

The goal of a visual saliency algorithm is to highlight the most distinct regions of an image, often mimicking the equivalent process in a biological vision system. From the perspective of designing vision systems, visual saliency can be viewed as a type of region of interest (ROI) detection or attentional operator, informing downstream processing stages which parts of a scene warrant the highest priority. Some of the earliest algorithms for computing visual saliency were proposed by Itti and Koch for modeling attention in humans (Itti & Koch, 2001; Itti et al., 1998). More recently, Russell et al. (2014) proposed an algorithm that attempts to improve the biological plausibility of the model with respect to Gestalt principles as well as take steps toward a hardware-friendly formulation (Russell et al., 2014).

Bottom-up models of saliency, such as those discussed here, are typically not used in isolation due to the fact that their accuracy is limited when the background scene is complex and forms features that appear salient (Itti et al., 1998; Lee, Kim, Kim, Kim, & Yoo, 2010; Park, Hong, Park, & Yoo, 2016). Bottom-up saliency often acts only as a preprocessor for more computationally complex object recognition algorithms such as convolutional neural networks (LeCun, Bengio, & Hinton, 2015; Park et al., 2016). Therefore, in a real-time vision system, it is vital that a bottom-up saliency algorithm should have as little latency and energy consumption as possible while still providing a data-reduction step for algorithms later in the processing pipeline.

### 2.2 A State-of-the-Art Saliency Algorithm

Figure 1 illustrates a high-level overview of the steps involved in the saliency algorithms discussed in this letter. The general approach depicted here is derived from the proto-object saliency algorithm introduced in Russell et al. (2014). This approach is the basis of the analog saliency algorithm proposed here. Therefore, we briefly review the proto-object saliency algorithm.

The Johns Hopkins University (JHU) proto-object image saliency algorithm (JHU POIS) offers a biologically realistic model of saliency that is potentially realizable in hardware. The algorithm consists of four primary steps: edge detection, object detection, border ownership, and grouping. Each of these steps is repeated for three types of channels: intensity, color, and orientation. Within each channel, the steps operate on a gaussian image pyramid rather than just the image itself to provide scale invariance. Furthermore, the algorithm has several normalization steps that share activations between channels. The channels are combined into a final saliency map.

A brief overview of the algorithmic steps (omitting the normalization steps for clarity) follows to provide context for the reformulation; for a full description and analysis, we refer readers to Russell et al. (2014). First, edge detection at several different orientations is performed using even and odd Gabor filters, which have a biological analog (Jones & Palmer, 1987). The even and odd responses are combined with an $\ell_2$ norm. Next, center-surround filters are used to identify light-on-dark and dark-on-light objects. An additional annular filtering step (dubbed von Mises in the JHU work) is performed on each of these contrast types, and then the result is mixed with the edge responses and summed over the pyramid levels (for each pyramid level) to create border responses. The ownership of each border response is determined by a grouping step in which the argmax over the different orientations (corresponding to the original edge detection orientations) is computed. Finally, the same annular filtering as before is performed on each pyramid level of the result. Each resulting pyramid is upsampled with bilinear interpolation, and all resulting levels are summed. These steps are performed for each channel, and then all channels are averaged together to form the final saliency map.
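As an illustration of the edge-detection step, the following sketch builds even (cosine-phase) and odd (sine-phase) Gabor kernels and combines their responses with an $\ell_2$ norm. The kernel size and the `wavelength` and `sigma` parameters are illustrative choices, not the values used in Russell et al. (2014):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, theta, wavelength, sigma, phase):
    """Build a size x size Gabor kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter orientation.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + phase)
    return envelope * carrier

def edge_responses(image, thetas, size=5, wavelength=4.0, sigma=2.0):
    """Combine even and odd Gabor responses with an l2 norm per orientation."""
    responses = []
    for theta in thetas:
        even = gabor_kernel(size, theta, wavelength, sigma, phase=0.0)
        odd = gabor_kernel(size, theta, wavelength, sigma, phase=np.pi / 2)
        e = convolve2d(image, even, mode='same', boundary='symm')
        o = convolve2d(image, odd, mode='same', boundary='symm')
        responses.append(np.sqrt(e**2 + o**2))  # l2 combination
    return np.stack(responses)  # shape: (n_orientations, H, W)
```

The $\ell_2$ combination makes the response insensitive to edge polarity, which is why the even and odd filters are always used as a pair.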

The algorithm we have described has two key advantages. First, it is a significant step toward a functional but biologically plausible model of saliency compared to previous saliency algorithms since it incorporates several biologically plausible computational structures such as Gabor filters and winner-take-all (WTA) networks (Russell et al., 2014). Second, several of the steps are easily realized in hardware. However, in the context of using saliency for hardware-based compression and a software-side computational speed-up, an all-analog signal path promotes power savings by reducing the energy required for quantization. The JHU POIS algorithm as proposed has several steps that are not practical for an analog implementation. Nevertheless, a few modifications to this algorithm can lead to a formulation that is analog friendly but still largely preserves the accuracy of the saliency map as measured by a standard labeled saliency data set (Cheng, Mitra, Huang, Torr, & Hu, 2015). These modifications are described in detail in section 3.

## 3 Hardware-Optimized Saliency Algorithm

Although it is theoretically possible to implement the JHU POIS algorithm in analog hardware, there are many practical constraints. Figure 2 shows a block diagram of the algorithm. With the exception of the subtractors ($-$), multipliers ($\times$), and argmax operations, each block represents a weighted sum corresponding to either a convolution or a pyramid sum. The steps are derived later in this section. Note that this version of the algorithm cannot be distributed due to shared computations among the convolutions. The algorithm proposed in this letter (see Figures 1 and 4) can be made fully distributed.

This section introduces an alternative to the JHU POIS algorithm that mitigates these issues. The modified algorithm, shown in Figure 4, is largely motivated by circuit-level considerations. Therefore, an example design in a 45 nm complementary metal-oxide-semiconductor (CMOS) process is presented to demonstrate the proposed algorithm's benefits in context.

### 3.1 Computational Primitives in Neuromorphic Hardware

Hardware implementations of neuromorphic architectures have traditionally focused on constructing integrate-and-fire neurons (IFN) in silicon (Bartolozzi & Indiveri, 2009; Oster, Wang, Douglas, & Liu, 2008; Sonnleithner & Indiveri, 2012). These artificial IFN structures can be used to realize weighted addition and winner-take-all (WTA) functions, which in turn can be used to implement universal approximators (e.g., perceptrons, neural networks; Cybenko, 1989; Rosenblatt, 1957). Traditionally, IFN circuits use pulse frequency modulation (PFM). However, pulse width modulation (PWM) can be more convenient, offering constant throughput as well as the ability to efficiently realize many functions such as min and max with only logic gates despite the analog signal representation (Miyashita et al., 2014; Ravinuthula, Garg, Harris, & Fortes, 2009; Roberts & Ali-Bakhshian, 2010). Furthermore, the saliency algorithm introduced in this work requires a hardware multiplier. Beyond this algorithm's requirements, the ability to construct networks with multipliers can lead to richer functional representations and is biologically plausible (Schmitt, 2002). Artificial neural networks can be used to approximate nonlinear functions such as multiplication (Ravinuthula, 2006), but several unit cells and stages are required, and the hardware cost is high. A common way to implement multiplication in neuromorphic systems is the coincidence detector circuit for multiplying two PFM-encoded signals (Srinivasan & Bernard, 1976). This circuit, however, assumes a probabilistic representation of spike timing and has a high error, especially for low-amplitude signals (Srinivasan & Bernard, 1976; Tal & Schwartz, 1997). Another approach is to exploit the nonlinear effects of the refractory period in a traditional IFN to coarsely approximate multiplication via a logarithmic transform; however, this approach is also inaccurate (>5% error) and low bandwidth (<20 Hz) (Tal & Schwartz, 1997).
Recently, a translinear principle has been introduced for PWM-coded signals, and it allows analog synthesis of arbitrary nonlinear functions without expensive neural network approximations at real-time data rates (D'Angelo & Sonkusale, 2014, 2015a, 2015b, 2016). This discovery combined with the benefits already noted motivates the use of a PWM architecture for the algorithm introduced here. The circuits used here assume a discrete time PWM coding, sometimes referred to as a time-mode signal representation. Therefore, we use the terms *PWM* and *time mode* interchangeably in this discussion. The time-mode signal representation used in this work is illustrated in Figure 3 and compared with a voltage-mode representation of analog signals.

### 3.2 Image Channels and Normalization

The JHU POIS algorithm first processes the input image to extract the intensity and multiple color channels. The algorithm proposed here also uses these channels exactly as reported in Russell et al. (2014). The channels used are intensity (one channel), color opponency (four channels), and orientation (four channels).

In addition to the channels, the JHU POIS algorithm includes cross-communication between channels in the form of normalization. Normalization is also applied in the grouping step, discussed later in this section. In the algorithm presented in this letter, however, these normalization steps were found to have a minimal effect on the data set results. Furthermore, it was unclear whether the effect was positive or negative across the data set. Therefore, for the proposed hardware-oriented algorithm, channels are intended to operate independently in parallel as separate networks with no internal cross-connectivity. A single neuron with nine inputs, one from each channel (four color, four orientation, and one intensity), would be used to perform the final averaging of these channels. An implementation of the normalization algorithm used to inform this additional omission from the algorithm is provided in the pysaliency library.

### 3.3 Pyramid Generation

The next step of the algorithm is to compute an image pyramid, which consists of $L$ images such that image $l$ has been downsampled by a factor of $2^l$ for all $l \in [1, L]$. In JHU POIS, this image pyramid is downsampled using an interpolation technique such as bilinear interpolation. However, implementing a 2D interpolation in analog hardware is challenging because it requires averaging circuits between each pixel in the downsampled image. Moreover, if the algorithm is serialized with respect to the input pixels, downsampling without interpolation allows each pixel to be computed individually from a small subset of the pixels in the image. This configuration greatly simplifies the circuit design because a homogeneous cell that computes the saliency of one pixel from its neighbors can be arrayed in parallel. The input pixels can then be serialized through the array, allowing the area and bandwidth of the network to be traded off as needed.
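A minimal sketch of interpolation-free pyramid generation using NumPy strided slicing; `decimated_pyramid` is a hypothetical helper (not part of pysaliency), and here level 0 is taken to be the full-resolution image:

```python
import numpy as np

def decimated_pyramid(image, levels):
    """Image pyramid built by plain decimation (every 2**l-th pixel).

    Unlike bilinear downsampling, each pyramid pixel is a single input
    pixel, so every output can be traced back to a small, fixed set of
    input pixels -- the property that lets the circuit serialize them.
    """
    return [image[::2**l, ::2**l] for l in range(levels)]
```

Because no averaging is involved, each downsampled pixel is just an index into the original image, which is what makes the rastering scheme of section 6 possible.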

### 3.4 Image Convolutions

At this point, the algorithm computes a series of convolutions. The JHU POIS algorithm uses $5 \times 5$ floating-point kernels generated by the equations reported in Russell et al. (2014). In the algorithm performance results reported in this letter, $3 \times 3$ integer kernels are used. The kernel size was reduced to trade off some accuracy for circuit area. The resolution of the kernel weights was limited to integer values to improve device matching through a more uniform device layout, at the expense of dynamic range in the weights. Furthermore, the output of each convolution in the analog formulation would be half-wave rectified, which cuts the circuit area in half because a differential circuit is required to implement signed PWM/time-mode computation. The single-ended computation was found to have only a modest impact on the performance of the algorithm. All convolutions are performed with self-padding of the image; that is, the outermost rows and columns at the image perimeter are duplicated outward $\lfloor K/2 \rfloor$ times, where $K$ is the kernel width. The following sections discuss the computational steps of the algorithm with additional details about the convolution kernels.
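The self-padding and half-wave rectification described above can be sketched as follows. The Laplacian-like `KERNEL` is a placeholder integer kernel for illustration, not one of the actual quantized kernels:

```python
import numpy as np
from scipy.signal import convolve2d

# Placeholder 3x3 integer kernel (hypothetical; stands in for the
# quantized kernels derived from the floating-point JHU versions).
KERNEL = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]])

def rectified_conv(image, kernel):
    """Self-padded convolution followed by half-wave rectification."""
    pad = kernel.shape[0] // 2                 # floor(K/2) rings of padding
    padded = np.pad(image, pad, mode='edge')   # duplicate perimeter outward
    out = convolve2d(padded, kernel, mode='valid')
    return np.maximum(out, 0.0)                # single-ended (rectified) output
```

On a constant image this zero-sum kernel produces an all-zero response, which the edge-replicating padding preserves at the borders as well.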

### 3.5 Edge Detection

### 3.6 Center-Surround Filters

### 3.7 Border Ownership

### 3.8 Border Grouping

The intuition for why replacing the argmax over the differences with a max over the ownership responses themselves is an effective approximation is as follows. A large edge response at a particular orientation at a certain location, combined with a large center-surround response (indicating an object) at that location, indicates a high probability that the edge belongs to that object. By simply taking the maximum of these modulated responses across the orientations and summing them over the levels, the algorithm groups objects together in a scale-invariant manner.
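A sketch of this grouping rule, assuming the per-level border responses have already been upsampled to a common resolution:

```python
import numpy as np

def grouped_saliency(border_responses):
    """Max over orientations, then sum over pyramid levels.

    border_responses: list over pyramid levels; each entry has shape
    (n_orientations, H, W). Replaces the argmax-based WTA of the
    original formulation with a plain max, which requires far fewer
    logic gates in a time-mode circuit (see section 4.1).
    """
    return sum(level.max(axis=0) for level in border_responses)
```

The max keeps only the strongest ownership evidence at each location, and the sum over levels accumulates that evidence across scales.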

The full algorithm with both von Mises kernels as well as the argmax-based WTA implementation is depicted in Figure 2. The fully hardware-optimized algorithm without a von Mises step and with the max-based WTA selection is shown in Figure 4. Four levels were chosen for the transistor-level design example. The block diagram in Figure 4 illustrates the full analog-friendly formulation of the algorithm. This formulation will be used in a motivating design example in section 6.

### 3.9 Annular Filter-Based Saliency Algorithm

The term *annular filter* is used to describe this approach because the filter traces a ring around the pixel of interest. Consider a $3 \times 3$ patch, $P_{in}$, selected from an input image, $I_{in}$:

## 4 Cost Analysis and Hardware Considerations

The saliency algorithm has many potential design trade-offs in terms of computational complexity, hardware cost, and design complexity. This section analyzes several of these trade-offs.

### 4.1 Time-Mode Circuits for Computing Max and Argmax

The JHU POIS algorithm calls for an argmax operation over the kernel orientations to determine the outputs that will be passed along to a second step of von Mises filtering. This operation was simplified by summing the maximum response over the orientations for each level and using the result as the final saliency score. However, as there may be a benefit to a CMOS implementation of the original argmax-based algorithm, we introduce and analyze a time-mode argmax circuit here and compare its gate count with that of the time-mode max circuit.

Figure 7 plots the total number of gates for the max and argmax circuits as a function of the number of inputs assuming a standard two flip-flop implementation of the phase detector, as well as a simpler XOR-based phase detector. It can be seen from these relationships that the max function requires far fewer logic gates as the number of inputs increases. This fact motivates the reformulation of the saliency algorithm from that requiring an argmax function to that requiring only a max function. Even if it were the case that the overall accuracy was degraded by this change, if the output remains an acceptable classifier, this performance loss may be justifiable by the much higher efficiency of the max-based saliency network.

### 4.2 Computational Complexity and a Distributed Saliency Algorithm

One of the critical design decisions that makes an analog ASIC implementation of a saliency algorithm practical to implement cost-effectively is the elimination of the von Mises or annular filtering stage. More generally, this omission simplifies the design because a second layer of convolutions requires a significant number of additional convolution calculations in the first stage. In the case of the annular filters, this effect is exacerbated by the fact that several angles of rotation of the kernels must be computed. The degradation in accuracy is reported in Figure 5b, but this section discusses the improvement in computational complexity of the algorithm that directly translates to a reduction in power and area in a hardware implementation. Furthermore, the effect of parallelization of the architecture is also analyzed.

The algorithm can be parallelized in two ways. First, the saliency of each pixel can be computed separately, ignoring the redundant computations among neighboring pixels in the computation of each one's saliency. In this form, each network computes one pixel at a time with no connections to other networks. This approach has the advantage of simplifying the architecture, but some energy is wasted on duplicating the convolution computations that overlap among adjacent pixels. The alternative is to allow overlapping connections between adjacent networks, which saves power and area by eliminating the redundancy but increases the wiring complexity of the implementation. In other words, computing the saliency of groups of pixels reduces the number of computations because some computations will be identical. However, accounting for these requires connections between the neighboring computational units, preventing distributed computation in a hardware implementation.
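A back-of-envelope count of the convolution multiply-accumulates (MACs) under the two schemes helps make the trade-off concrete. The helper below and its parameters are illustrative assumptions; it ignores the pyramid-sum and grouping stages that the counting code in pysaliency also accounts for:

```python
def conv_macs(width, height, K, levels, n_kernels):
    """Multiply-accumulate counts for the two parallelization styles.

    Fully redundant: every output pixel recomputes all K*K products per
    kernel and per level for itself, with no sharing.
    Shared: each convolution output is computed once per (level, kernel)
    pair and reused by every pixel that needs it.
    """
    pixels = width * height
    redundant = pixels * n_kernels * levels * K * K
    shared = sum(n_kernels * (width // 2**l) * (height // 2**l) * K * K
                 for l in range(levels))
    return redundant, shared
```

The shared count shrinks geometrically with pyramid level, so sharing saves computation, but, as noted above, at the cost of wiring between neighboring computational units.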

Four versions of the algorithm are analyzed in Figure 8. The source code used to compute the number of computations in this plot is included in the pysaliency library. It can be seen that the most efficient implementation considers redundancy and skips the von Mises algorithm. Also, skipping the von Mises algorithm but opting for the less complex network (i.e., the simplest of the four designs) provides the same improvement as using the von Mises algorithm with the more complex network with shared computations. From this analysis, it was determined that the redundant and fully parallel architecture (i.e., no von Mises, redundant in Figure 8) is the most suitable for an analog circuit implementation.

For an analog implementation of an algorithm, noise, mismatch, and cross talk introduce challenges to fully parallelized computation. Conversely, serializing all of the input data limits the device bandwidth. Therefore, it is useful to formulate the algorithm such that it is fully distributed with respect to each pixel of the saliency map so that there is flexibility in the parallelization and serialization of the data flow through the circuit. Fortunately, the analog-friendly reformulation of the algorithm lends itself to a fully distributed implementation of the saliency algorithm. This benefit comes from removing the von Mises or annular filter in the second stage of the algorithm as described in section 3.7. A consequence of this omission is that there will only be one stage of image convolutions. This simplified structure allows each pixel's saliency to be computed individually from a subset of only $K^2L$ pixels, where $K$ is the width of the kernel and $L$ is the number of levels in the image pyramid. This set of neighboring pixels around the current pixel of interest can be inferred from the image pyramid. A rastering algorithm was designed to precompute this subset of pixels that is needed for computing each output pixel. This rastering algorithm uses simple logic to step through each pixel in the image and sequentially select the correct $K^2L$ neighbors. This rastering pattern can be stored in a look-up table (LUT) that drives the row/column decoder of an image sensor in a fully integrated implementation.
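One way to sketch such a rastering pattern in software (a hypothetical stand-in for the LUT contents, assuming decimation-based pyramids and edge clamping that mimics the self-padding of the convolutions):

```python
def raster_neighbors(x, y, K, levels, width, height):
    """Indices of the K*K*levels input pixels feeding output pixel (x, y).

    At pyramid level l the image is decimated by 2**l, so the K x K
    neighborhood around (x, y) steps through the full-resolution image
    in strides of 2**l. Indices are clamped to the image bounds, which
    mirrors the edge-replicating self-padding of the convolutions.
    """
    half = K // 2
    coords = []
    for l in range(levels):
        stride = 2**l
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                cx = min(max(x + dx * stride, 0), width - 1)
                cy = min(max(y + dy * stride, 0), height - 1)
                coords.append((cy, cx))
    return coords  # length K*K*levels, e.g., 36 for K = 3, levels = 4
```

Each output pixel thus depends on a fixed, precomputable address list, which is exactly what allows the pattern to live in a LUT driving the sensor's row/column decoder.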

### 4.3 Local Averaging and Ensemble Networks

Saliency algorithms can be top down or bottom up in both computer science and neuroscience. Bottom-up saliency quantifies the distinctiveness of a pixel relative to its neighbors (Melloni, van Leeuwen, Alink, & Müller, 2012; Rauss & Pourtois, 2013). Conversely, top-down saliency selects salient pixels based on higher-level goals (Melloni et al., 2012; Rauss & Pourtois, 2013). Consequently, there are fundamental limitations to how accurate bottom-up saliency can be in the context of subjective higher-order reasoning, and the best attentional operator would combine both. The algorithm presented in this letter is a bottom-up approach to saliency. Therefore, to boost the accuracy of the results in the bottom-up context, we propose two modifications.

In an analog circuit, the averaging could be implemented with $M$ time-mode latches (Ali-Bakhshian & Roberts, 2012) and an additional time-mode adder that stores the previous $M$ saliency values onto $M$ capacitors and reads out the sum. However, this structure was not included in the transistor-level design. Nevertheless, this averaging could be performed in the digital domain to demonstrate the concept. If the rastering scheme is used in an analog design, this local averaging can be seen as a type of memory. The results tagged *-mem* in Figures 5a and 5b refer to those using this local averaging technique.
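A digital-domain sketch of the proposed local averaging memory; `SaliencyMemory` is a hypothetical name, and $M = 4$ matches the "4 pixels" averaging used by the *-mem* variants (the analog version would instead use time-mode latches and capacitors):

```python
from collections import deque

class SaliencyMemory:
    """Running average over the last M saliency values.

    Software stand-in for the proposed analog structure of M time-mode
    latches plus a time-mode adder.
    """
    def __init__(self, M=4):
        self.values = deque(maxlen=M)  # oldest value drops out automatically

    def update(self, saliency):
        self.values.append(saliency)
        return sum(self.values) / len(self.values)
```

Because the rastering scheme visits neighboring output pixels consecutively, this simple memory effectively smooths the saliency map along the raster path.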

## 5 Saliency Algorithm Results

Various versions of the saliency algorithm were implemented, and their performance was assessed on a data set. Table 1 lists the differences between the six algorithms used in this study. The primary differences are the use of annular filters, the use of the local averaging technique, and the introduction of expected hardware errors and nonidealities into the simulation. These errors were incorporated into behavioral circuit models from transistor-level simulations and include jitter, charge injection, clock feed-through, device mismatch, nonlinearity, and leakage.

| Algorithm | Description | Errors | Averaging | Normalize | Channels | Annular Filter | WTA | Kernels | Norm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| POIS | Mimic of JHU | Ideal | None | Yes | Intensity, Color, Orientation | von Mises | argmax | Float | L2 |
| HW-ideal | Serialized | Rectified | None | No | Intensity | No | max | Integer | L1 |
| HW-errors | Serialized | Circuit | None | No | Intensity | No | max | Integer | L1 |
| HW-memory | Serialized | Rectified | 4 pixels | No | Intensity | No | max | Integer | L1 |
| Annular | Serialized | Rectified | None | No | Intensity | Custom | max | Integer | NA |
| Annular-memory | Serialized | Rectified | 4 pixels | No | Intensity | Custom | max | Binary | NA |

The data set performance was quantified by using a binary classification view of the saliency algorithm. The data set in Cheng et al. (2015) was restructured to fit this formulation. The data set was randomly split into standard sets for training and cross-validation. In order to keep the training time reasonable, a random subset of images was chosen, and from these images, a random subset of pixels was chosen. The rastering algorithm was then used to compute the input examples and output labels.

Because saliency is a somewhat subjective measure, metrics on the data set were used to quantify the results. In light of the binary classification formulation, the area under the curve (AUC) of the receiver operator characteristic (ROC) was used for quantification (Fogarty, Baker, & Hudson, 2005). First, a set of 40 random images was chosen from the data set for verification. The rastering algorithm was applied to each image, the saliency map of each image was computed by six different versions of the algorithm, and eight different classification thresholds were applied to each pixel. A confusion matrix and the accuracy of the saliency decision were then computed for each image. These confusion matrices were averaged across the images to generate the parametric ROC curves as a function of the threshold, as well as the accuracy over threshold depicted in Figure 5a. A naive threshold value was also computed as a baseline comparison point. This curve represents choosing saliency based on the magnitude of the pixel intensity instead of using a saliency algorithm.
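The threshold sweep described above can be sketched as follows. This is a simplified, single-image version (the reported results additionally average confusion matrices across 40 images and eight thresholds):

```python
import numpy as np

def roc_points(saliency, labels, thresholds):
    """(FPR, TPR, accuracy) per classification threshold.

    saliency: saliency map scaled to [0, 1]; labels: binary
    ground-truth mask of the same shape.
    """
    points = []
    for t in thresholds:
        pred = saliency >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        tpr = tp / max(tp + fn, 1)          # true-positive rate
        fpr = fp / max(fp + tn, 1)          # false-positive rate
        acc = (tp + tn) / labels.size       # classification accuracy
        points.append((fpr, tpr, acc))
    return points
```

Sweeping the threshold traces out the parametric ROC curve, and the area under it (the ROC AUC) summarizes classifier quality independent of any single threshold choice.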

The saliency metrics did vary significantly with the images themselves, as some scenes are more suitable for saliency than others. To show this variation, histograms are plotted in Figure 5b showing the ROC AUC score and the accuracy across images. It can be seen that the algorithm that incorporates the hardware nonidealities does exhibit reduced accuracy, but at a threshold of 0.4, it maintains an accuracy above 70% with a true-positive rate of nearly 80% and a false-positive rate of about 60%, indicating better-than-chance classification behavior. The use of the local averaging technique maintains the accuracy of the error-corrupted hardware result but with a significant reduction in the false-positive rate. Therefore, the performance is boosted with this simple additional processing.

A key advantage of this formulation of the saliency algorithm is that it is fully distributed across the pixels of the image and therefore lends itself to parallel processing. In the context of time-mode circuits, this formulation allows the groups of neighboring pixels to be rastered through a single analog circuit. This technique reduces the overall area substantially by taking advantage of the high bandwidth of the time-mode building blocks relative to the required real-time frame rate. Furthermore, by using the same analog circuit, matching across the image is greatly improved. In the digital domain, the same rastering technique could be used for a digital application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA) implementation. An improvement can also be gained in a GPU implementation or in a CPU implementation using multiprocessing. Figure 9 shows the timing benchmarks for the different versions of the algorithm tested in Figure 5a. In this benchmarking test, the saliency map of a single frame was computed for two different aspect ratios with and without CPU parallelization. The algorithms (except for JHU POIS) were parallelized across eight processes, resulting in a speedup of about six times. The implementation of the JHU POIS algorithm utilizes SciPy's optimized convolution method, whereas the other hardware-optimized algorithms distribute the computations into dot products of a weight matrix with a subset of neighboring pixels. The SciPy implementation results in a speedup through the use of C-level computations via Cython (Behnel et al., 2011), whereas the distributed computations allow multiprocessing to be used. Despite not using the optimized SciPy convolutions, the distributed processing results in faster computation of the saliency map by utilizing multiple CPU processes without any data sharing required between them.
However, the distributed algorithm is most suitable for specialized hardware implementations such as on a GPU, and most notably for a custom analog or digital hardware implementation, because performing convolutions on the entire image in parallel can be prohibitive from a power and area perspective in an ASIC design.
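The row-parallel decomposition described above can be sketched as follows. The simple center-surround kernel and the function names here are illustrative stand-ins for the actual saliency computation; the point is that each output pixel depends only on a small neighborhood, so rows can be farmed out to worker processes with no data sharing between them.

```python
import numpy as np
from multiprocessing import Pool

def saliency_row(args):
    """Per-row kernel: a center-surround proxy stands in for the
    actual hardware-optimized saliency computation."""
    image, r = args
    h, w = image.shape
    out = np.zeros(w)
    for c in range(w):
        # 3x3 neighborhood with edge clamping
        patch = image[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
        out[c] = abs(image[r, c] - patch.mean())
    return r, out

def saliency_map(image, processes=None):
    """Distribute rows across worker processes; no inter-worker
    communication is needed. processes=None computes serially."""
    result = np.zeros(image.shape, dtype=float)
    tasks = [(image, r) for r in range(image.shape[0])]
    if processes:
        with Pool(processes) as pool:
            rows = pool.map(saliency_row, tasks)
    else:
        rows = map(saliency_row, tasks)
    for r, row in rows:
        result[r] = row
    return result
```

Because each task carries its own copy of the data it needs, this structure maps directly onto the rastered hardware implementation as well as onto CPU multiprocessing.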

Figure 10 shows an example image from the data set introduced in Cheng et al. (2015) processed with the HW-errors algorithm and compared with the custom implementation of the JHU POIS algorithm. The JHU POIS algorithm performs well, but it is computationally complex and more challenging to implement in analog hardware, as discussed in previous sections. The HW-errors model uses the compact analog algorithm and incorporates conservative circuit errors extracted from Spectre simulations. A threshold of 40% is used to create the masks for compressing the images shown on the bottom right. Only 24% of the pixels were retained in the analog version's mask, compared to 37% for the JHU POIS algorithm. The full shape of the vehicle is preserved while the background is rejected, despite the complex texture of the scene. This result suggests that this algorithm may have applications to machine vision for automated vehicles.

## 6 Circuit Considerations for the Saliency Algorithm

This section discusses circuit-level considerations for an integrated circuit design of the saliency algorithm introduced thus far. In this design example, the algorithm takes as input a vector of 36 pixels selected from the scene by a rastering algorithm, whose source code is provided in the online appendix. The intensity of each input pixel is represented as an analog pulse width. The saliency algorithm circuit produces a single output pulse whose pulse width is proportional to the saliency value.

### 6.1 Input Rastering Scheme

The goal of the hardware-friendly algorithm is to compute a saliency map across an entire image one pixel at a time. In a fully parallelized version of the algorithm, the image is first downsampled by a factor of 2 several times to create an image pyramid. In the transistor-level design, the first four layers of this pyramid are used. The algorithm depends on several convolutions, each with $3\times 3$ kernels. Therefore, in order to serialize the algorithm, $3\times 3\times 4$ pixels are required to compute each output pixel. To compute the saliency map for an entire image, each set of 36 pixels is serialized through the saliency circuit at 4.4 MHz, which is sufficient for computing $256\times 256$ resolution saliency maps from 60 Hz video. Higher input video resolutions can be supported by downsampling appropriately and assuming that neighboring pixels share the same saliency value.
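A minimal sketch of this rastering scheme is shown below. Plain $2\times 2$ averaging stands in for the actual downsampling filter, and the function names are ours; the final line checks that one 36-pixel vector per clock cycle at 4.4 MHz covers the required $256\times 256$ at 60 Hz throughput.

```python
import numpy as np

def build_pyramid(image, levels=4):
    """Downsample by 2 repeatedly (2x2 averaging as a stand-in for
    the actual antialiasing filter) to form the image pyramid."""
    pyr = [image.astype(float)]
    for _ in range(levels - 1):
        p = pyr[-1]
        h, w = p.shape[0] // 2 * 2, p.shape[1] // 2 * 2
        pyr.append(p[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def raster_vector(pyramid, r, c):
    """Assemble the 3x3x4 = 36-pixel input vector for output pixel
    (r, c); coordinates are scaled into each level, edges clamped."""
    vec = []
    for lvl, p in enumerate(pyramid):
        pr, pc = r >> lvl, c >> lvl
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr = min(max(pr + dr, 0), p.shape[0] - 1)
                cc = min(max(pc + dc, 0), p.shape[1] - 1)
                vec.append(p[rr, cc])
    return np.array(vec)

# Required serial rate: 256 * 256 * 60 ≈ 3.93 M vectors/s,
# within the 4.4 MHz clock with margin to spare.
required_rate = 256 * 256 * 60
```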

### 6.2 Analog Computation Circuits

The time-mode neuron circuit proposed in Ravinuthula et al. (2009) is used for computing the convolutions and the two levels of sums. XOR gates perform the subtractions that are not part of the convolution operations. The multiplication stage is implemented using the translinear pulse width multiplier introduced in D'Angelo and Sonkusale (2015a, 2016). Finally, the logic gate–based max circuit used is depicted in Figure 7a. The convolution weights were hard-coded in the sizing of the current sources to save chip area, reduce the number of pads, and reduce the wiring complexity. The weights can be adjusted prior to fabrication via the layout automation process detailed in section 6.3.
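A behavioral sketch of the weighted addition performed by such a time-mode neuron is given below. The charge-based model follows the spirit of the Ravinuthula et al. (2009) circuit, but the device values and the rectifying behavior on negative net charge are illustrative assumptions, not the actual circuit sizing.

```python
def time_mode_weighted_sum(pulse_widths, weights, i_unit=1e-6):
    """Behavioral model of a time-mode weighted adder: each input
    pulse of width t_i gates a current w_i * I onto a shared
    capacitor, depositing charge q_i = w_i * I * t_i. A unit current
    I then discharges the capacitor, so the output pulse width is
    q / I = sum_i w_i * t_i. Negative net charge produces no output
    pulse (rectification), an assumption of this simple model."""
    q = sum(w * i_unit * t for w, t in zip(weights, pulse_widths))
    return max(q / i_unit, 0.0)
```

For example, inputs of 10 ns and 20 ns with weights 0.5 and 0.25 yield a 10 ns output pulse, independent of the unit current, since the current scale cancels between the charge and discharge phases.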

### 6.3 Synthesis Approach

An advantage of using quasi-digital pulses to represent analog information is that synthesis techniques can be used to automate the circuit layout, because standard timing constraints and skew-reduction techniques apply to time-mode circuits. The saliency algorithm designed here was synthesized using such techniques. The analog network was designed and synthesized in the IBM soi12s0 45 nm SOI CMOS process, and a 1 V supply was used for the reported transistor-level simulation data.

First, the algorithm was described mathematically in the Python programming language. A modular object-oriented design paradigm was used such that various aspects of the algorithm could be effectively removed or modified. This modular design was used with expert knowledge of analog circuit design to limit the algorithm's operations to those that could be performed efficiently with time-mode circuits. Furthermore, errors mimicking the nonidealities from the simulated hardware were introduced into the Python implementation to verify the error resiliency of the algorithm.
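A minimal sketch of the error-injection step is shown below, assuming for illustration that the dominant nonidealities are the additive timing jitter and the minimum resolvable pulse width reported in section 6.4; the function name and parameter defaults are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def with_hw_errors(pulse_widths, jitter_std=223e-12, min_width=50e-12):
    """Corrupt ideal pulse widths with simulated hardware
    nonidealities: additive Gaussian timing jitter (223 ps std
    assumed here) and a minimum pulse width (~50 ps) below which the
    time-mode neuron produces no output."""
    noisy = np.asarray(pulse_widths, dtype=float)
    noisy = noisy + rng.normal(0.0, jitter_std, noisy.shape)
    noisy[noisy < min_width] = 0.0  # sub-threshold pulses are dropped
    return noisy
```

Running the software model with and without this corruption step is how the error resiliency of the algorithm can be verified before committing to silicon.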

The weights in the convolution network were extracted from the software implementation of the algorithm for circuit synthesis. A Python program was then used to convert the weight description and the higher-level algorithm description into a Verilog netlist (see Figure 11). The Verilog netlist was imported into Cadence Virtuoso. A Cadence SKILL-based parameterized cell (PCell) was designed to read in the weight description and generate the layouts for the weighted adder and multiplier circuits. This process involves first designing the core of the adder and multiplier by manual analog design, then combining these cores with the cells generated by the PCell using a digital synthesis approach.
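The netlist-generation step can be illustrated with a toy generator. The `tm_adder` macro, its parameters, the port names, and the 8-bit weight encoding are all invented here for illustration; the real flow encodes the weights in current-source sizing via the SKILL PCell described above.

```python
def weights_to_verilog(module_name, weights):
    """Toy illustration of converting a weight description into a
    Verilog netlist: one hypothetical weighted-adder macro is
    instantiated per weight, with the weight encoded as an 8-bit
    fixed-point parameter plus a sign bit."""
    n = len(weights)
    lines = [f"module {module_name} (input [{n - 1}:0] pulse_in, "
             f"output pulse_out);",
             f"  wire [{n - 1}:0] net;"]
    for i, w in enumerate(weights):
        code = int(round(abs(w) * 255))  # 8-bit fixed-point weight code
        sign = 1 if w < 0 else 0
        lines.append(
            f"  tm_adder #(.WEIGHT({code}), .SIGN({sign})) "
            f"u{i} (.in(pulse_in[{i}]), .out(net[{i}]));"
        )
    lines.append("  assign pulse_out = |net;  // placeholder combining stage")
    lines.append("endmodule")
    return "\n".join(lines)
```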

The top-level layout of the saliency algorithm transistor-level design was synthesized in Cadence's Encounter suite using automated place and route of macros representing the computational building blocks. The 1.5 mm $\times$ 1.5 mm layout is shown in Figure 12. A flip-chip pad array was used to achieve a larger input/output count and reduce parasitic inductance and capacitance.

### 6.4 Transistor-Level Simulations of the Saliency Network

Postlayout simulations were performed on the synthesized saliency network with the Spectre circuit simulator from the Cadence Design Suite to verify correct network synthesis. The results show an average power consumption of 3.7 mW at 60 Hz for $256\times 256$ saliency maps. Furthermore, Monte Carlo simulations demonstrate less than a 1 ns standard deviation in the absolute output pulse width. Direct observation of the transient waveforms indicates that computations propagate through the network successfully. A summary of the specifications is shown in Table 2. The minimum input pulse width that produces an output in one of the time-mode neurons is approximately 50 ps. For the output dynamic range calculation, we assumed 223 ps of jitter, extrapolated from a 1 ms transient noise simulation to account for accumulation over the five stages.
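As a consistency check, the energy-per-pixel figure quoted in the comparison table of section 6.5 follows directly from these numbers:

```python
# Energy per pixel from the simulated power and pixel throughput.
power = 3.7e-3                  # W, average from postlayout simulation
pixel_rate = 256 * 256 * 60     # saliency pixels per second at 60 fps
energy_per_pixel = power / pixel_rate
# ≈ 941 pJ/pix, consistent with the 940 pJ/pix reported in Table 3
```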

Table 2: Summary of Specifications.

| Specs | Cadence Spectre Simulation |
| --- | --- |
| Area | 1.5 mm $\times$ 1.5 mm |
| Frame rate | 60 fps |
| Clock frequency | 4.4 MHz |
| Power | 3.7 mW |
| Resolution | $256\times 256$ pixels |
| Chip inputs | 36 pulses |
| Input dynamic range | 70.10 dB (160 ns max) |
| Output dynamic range | 42.58 dB (30 ns max) |
| Output jitter | 223 ps (0.74%) |
| Output deviation | 1 ns (3.3%) |


### 6.5 Comparison with Existing Saliency Hardware

Dedicated hardware for computing saliency maps falls under the broader category of visual attention systems. Several architectures have been proposed in both the analog and digital domains. However, to date, most analog implementations of saliency rely on WTA operations on the temporal derivatives of pixel intensities or the spatial derivatives of neighboring pixels (Bartolozzi & Indiveri, 2009; Horiuchi & Niebur, 1999; Sonnleithner & Indiveri, 2012). The latter approach is comparable to the annular filter-based algorithm presented in section 3.9. Furthermore, the existing architectures often assume a saliency map as an input to an attentional tracker (Morris, Horiuchi, & DeWeerth, 1998). The algorithm presented in this work is the first analog implementation of the border ownership modulation scheme requiring an analog multiplier. Nevertheless, the existing work makes use of the same fundamental building blocks, namely neuron circuits (i.e., weighted adders) and WTA circuits, so these architectures can serve as a reasonable point of comparison. Table 3 compares several architectures from the literature with the one introduced here. Previous analog architectures have been limited in their real-time performance by the bandwidth constraints of the spiking neuron, as well as the lower bandwidth of larger-length transistors. The closest point of comparison that could be found is the bottom-up portion of the digital implementation presented in Barranco et al. (2014). A fairer comparison can be made by extrapolating its energy per pixel down to a single feature channel and adjusting for the differences in supply voltage and capacitive loading. This extrapolation yields 4.77 nJ/pix, still about five times larger than the transistor-level simulations predict. Furthermore, significant quantization energy would still be required to digitize the entire frames.

Table 3: Comparison with Existing Attention and Saliency Hardware.

| Specification | Sonnleithner and Indiveri (2012) | Morris, Horiuchi, and DeWeerth (1998) | Oster et al. (2008) | Barranco, Diaz, Pino, and Ros (2014) | Horiuchi and Niebur (1999) | This work |
| --- | --- | --- | --- | --- | --- | --- |
| Frame rate | 10 fps | NA | 4 fps | 180 fps | 4 fps | 60 fps |
| Pixels/frame | $32\times 32$ | 20 | $32\times 32$ | $640\times 480$ | 23 | $256\times 256$ |
| Energy/pix | NA | NA | NA | 47.74 nJ/pix* | $58.70\ \mu$J/pix | 940 pJ/pix |
| Dynamic range | 45 dB | 41 dB | NA | 54 dB | NA | 70.10 dB |
| Algorithm type | Attention | Attention | WTA network | Saliency | Saliency | Saliency |
| Processing | Bottom up | Bottom up | NA | Top down | Bottom up | Bottom up |
| Features | $\delta I/\delta t$ | Saliency | Intensity | I, $\theta$, RGB, $\delta I/\delta t$ | $\delta I/\delta t$, $\delta I/\delta x$ | Intensity |
| Implementation | Analog | Analog | Mixed signal | Digital | Analog | Analog CMOS |
| Circuit type | Spiking | Current | Spiking | FPGA | Current | Time mode |
| Area (mm$^2$) | $2\times 2$ | $2.25\times 2.25$ | $1.8\times 1.8$ | NA | $2.22\times 2.22$ | $1.5\times 1.5$ |
| CMOS process | $0.35\ \mu$m | $2\ \mu$m | $0.35\ \mu$m | 65 nm | $2\ \mu$m | 45 nm |


*Bottom-up energy only.

## 7 Conclusion

A compact visual saliency algorithm has been presented. This algorithm has been shown to be effective at computing saliency maps despite aggressive optimization and reduced computational complexity compared to the prior art. We have shown that this algorithm can be implemented in standard CMOS processes with a purely analog signal path in a semiautomated manner. The architecture shows promise as a practical preprocessor for machine vision applications that demand ultra-low power and small area while still requiring real-time video processing of scenes.

## Acknowledgments

This work was funded by Draper Laboratory in Cambridge, Massachusetts.