## Abstract

Computer vision algorithms are often limited in their application by the large amount of data that must be processed. Mammalian vision systems mitigate this high bandwidth requirement by prioritizing certain regions of the visual field with neural circuits that select the most salient regions. This work introduces a novel and computationally efficient visual saliency algorithm for performing this neuromorphic attention-based data reduction. The proposed algorithm has the added advantage that it is compatible with an analog CMOS design while still achieving comparable performance to existing state-of-the-art saliency algorithms. This compatibility allows for direct integration with the analog-to-digital conversion circuitry present in CMOS image sensors. This integration leads to power savings in the converter by quantizing only the salient pixels. Further system-level power savings are gained by reducing the amount of data that must be transmitted and processed in the digital domain. The analog CMOS compatible formulation relies on a pulse width (i.e., time mode) encoding of the pixel data that is compatible with pulse-mode imagers and slope based converters often used in imager designs. This letter begins by discussing this time-mode encoding for implementing neuromorphic architectures. Next, the proposed algorithm is derived. Hardware-oriented optimizations and modifications to this algorithm are proposed and discussed. Next, a metric for quantifying saliency accuracy is proposed, and simulation results of this metric are presented. Finally, an analog synthesis approach for a time-mode architecture is outlined, and postsynthesis transistor-level simulations that demonstrate functionality of an implementation in a modern CMOS process are discussed.

## 1  Introduction

The primate brain can rapidly perceive and react to complex scenes intelligently because it does not process the entire scene at once. Instead, the early stages of the visual system prioritize a subset of the visual field for more immediate high-level processing. Visual saliency is the subjective property of these selected regions that makes them distinct from their neighboring areas (Koch & Ullman, 1987). Visual saliency has been incorporated into computational models of attention that are used to study perception (Koch & Tsuchiya, 2007). From an engineering perspective, saliency is a biologically inspired attentional operator that can be used to compress image data prior to more computationally complex algorithms. By prioritizing the processing of salient regions in an image, it may be possible to improve the performance of artificial vision (or other sensor array) systems with a minimal reduction in the accuracy of tasks such as object recognition and classification.

Visual saliency in biological visual systems is realized physically as networks of neuronal cells such as simple cells and complex cells (Itti, Koch, & Niebur, 1998; Russell, Mihalaş, von der Heydt, Niebur, & Etienne-Cummings, 2014). The function of these networks can be accurately emulated for engineering applications using software or hardware. While software implementations are useful for prototyping and understanding neuromorphic algorithms, direct hardware implementations are more practical for real-time performance in high-bandwidth applications such as image processing. Therefore, this letter discusses the codesign of a computationally efficient saliency algorithm with the underlying analog hardware because the computational and physical design constraints cannot be fully decoupled.

This letter introduces a novel, computationally efficient bottom-up saliency algorithm. Although the focus of this letter is the algorithm itself, the algorithm's structure is inspired by the advantages of an analog computation framework based on time-mode (TM) circuits. Therefore, we discuss both computational and analog hardware considerations. With respect to this proposed algorithm, a primary contribution of this letter is an open source implementation, dubbed pysaliency, written in Python. The source code for this package has been made publicly available: git://code.ece.tufts.edu/nanolab/pysaliency. The following features are included in the package:

• An implementation of a state-of-the-art saliency algorithm (JHU POIS) with a large number of easily adjustable parameters

• A multiprocess implementation of a distributed, hardware-optimized saliency algorithm with support for introducing various sources of error

• Theano functions and scripts for training a network representing the serialized version of the saliency algorithm as well as running the forward model on a GPU

This letter is outlined as follows. Section 2 reviews the state of the art in visual saliency algorithms. Section 3 introduces the hardware-optimized saliency algorithm. Section 4 analyzes the effects of the various optimizations on hardware and computational cost. Section 5 presents the proposed accuracy metric and simulation results. Section 6 covers the implementation details of the algorithm in time-mode analog hardware. Finally, section 7 provides comparisons and closing remarks.

## 2  Background

### 2.1  Visual Saliency Algorithms as Preprocessors for Machine Vision

The goal of a visual saliency algorithm is to highlight the most distinct regions of an image, often mimicking the equivalent process in a biological vision system. From the perspective of designing vision systems, visual saliency can be viewed as a type of region of interest (ROI) detection or attentional operator, informing downstream processing stages which parts of a scene warrant the highest priority. Some of the earliest algorithms for computing visual saliency were proposed by Itti and Koch for modeling attention in humans (Itti & Koch, 2001; Itti et al., 1998). More recently, Russell et al. (2014) proposed an algorithm that attempts to improve the biological plausibility of the model with respect to Gestalt principles as well as take steps toward a hardware-friendly formulation (Russell et al., 2014).

Bottom-up models of saliency, such as those discussed here, are typically not used in isolation because their accuracy is limited when the background scene is complex and forms features that appear salient (Itti et al., 1998; Lee, Kim, Kim, Kim, & Yoo, 2010; Park, Hong, Park, & Yoo, 2016). Bottom-up saliency often acts only as a preprocessor for more computationally complex object recognition algorithms such as convolutional neural networks (LeCun, Bengio, & Hinton, 2015; Park et al., 2016). Therefore, in a real-time vision system, it is vital that a bottom-up saliency algorithm have as little latency and energy consumption as possible while still providing a data-reduction step for algorithms later in the processing pipeline.

### 2.2  A State-of-the-Art Saliency Algorithm

Figure 1 illustrates a high-level overview of the steps involved in the saliency algorithms discussed in this letter. The general approach depicted here is derived from the proto-object saliency algorithm introduced in Russell et al. (2014). This approach is the basis of the analog saliency algorithms proposed here. Therefore, we briefly review the proto-object saliency algorithm.

Figure 1:

Conceptual diagram of the saliency algorithm proposed in this letter.


The Johns Hopkins University (JHU) proto-object image saliency algorithm (JHU POIS) offers a biologically realistic model of saliency that is potentially realizable in hardware. The algorithm consists of four primary steps: edge detection, object detection, border ownership, and grouping. Each of these steps is repeated for three types of channels: intensity, color, and orientation. Within each channel, the steps operate on a Gaussian image pyramid rather than just the image itself to provide scale invariance. Furthermore, the algorithm has several normalization steps that share activations between channels. The channels are combined into a final saliency map.

A brief overview of the algorithmic steps (omitting the normalization steps for clarity) follows for proper context for the reformulation, but for a full description and analysis, we refer readers to Russell et al. (2014). First, edge detection at several different orientations is performed using even and odd Gabor filters, which have a biological analog (Jones & Palmer, 1987). The even and odd responses are combined with an $ℓ$2 norm. Next, center-surround filters are used to identify light-on-dark and dark-on-light objects. An additional annular filtering step (dubbed von Mises in the JHU work) is performed on each of these contrast types, and then the result is mixed with the edge responses and summed over the pyramid levels (for each pyramid level) to create border responses. The ownership of each border response is determined by a grouping step in which the argmax over the different orientations (corresponding to the original edge detection orientations) is computed. Finally, the same annular filtering as before is performed on each pyramid level of the result. Each resulting pyramid is upsampled with bilinear interpolation, and all resulting levels are summed. These steps are performed for each channel, and then all channels are averaged together to form the final saliency map.

The algorithm we have described has two key advantages. First, it is a significant step toward a functional but biologically plausible model of saliency compared to previous saliency algorithms since it incorporates several biologically plausible computational structures such as Gabor filters and winner-take-all (WTA) networks (Russell et al., 2014). Second, several of the steps are easily realized in hardware. However, in the context of using saliency for hardware-based compression and a software-side computational speed-up, an all-analog signal path promotes power savings by reducing the energy required for quantization. The JHU POIS algorithm as proposed has several steps that are not practical for an analog implementation. Nevertheless, a few modifications to this algorithm can lead to a formulation that is analog friendly but still largely preserves the accuracy of the saliency map as measured by a standard labeled saliency data set (Cheng, Mitra, Huang, Torr, & Hu, 2015). These modifications are described in detail in section 3.

## 3  Hardware-Optimized Saliency Algorithm

Although it is theoretically possible to implement the JHU POIS algorithm in analog hardware, there are many practical constraints. Figure 2 shows a block diagram of the algorithm. With the exception of the subtractors ($-$), multipliers ($\times$), and argmax operations, each block represents a weighted sum corresponding to either a convolution or a pyramid sum. The steps were reviewed in section 2.2. Note that this version of the algorithm cannot be distributed due to shared computations among the convolutions. The algorithm proposed in this letter (see Figures 1 and 4) can be made fully distributed.

Figure 2:

Block diagram of one channel of the JHU POIS algorithm without normalization. This diagram shows an example parallel implementation with a patch of $15×9$ input pixels serialized at a time. Each connection is annotated with the number of wires in that particular bus (e.g., $8×5$) and the number of orientation duplicates of the bus in parentheses under the corresponding bus. The levels correspond to an image pyramid constructed from successively downsampled versions of the input image.


This section introduces an alternative to the JHU POIS algorithm that mitigates these issues. The modified version of the algorithm is shown in Figure 4. The modified version is largely motivated by circuit-level considerations. Therefore, an example design in a 45 nm complementary metal-oxide-semiconductor (CMOS) process is shown here to demonstrate the proposed algorithm's benefits in context.

### 3.1  Computational Primitives in Neuromorphic Hardware

Hardware implementations of neuromorphic architectures have traditionally focused on constructing integrate-and-fire neurons (IFNs) in silicon (Bartolozzi & Indiveri, 2009; Oster, Wang, Douglas, & Liu, 2008; Sonnleithner & Indiveri, 2012). These artificial IFN structures can be used to realize weighted addition and winner-take-all (WTA) functions, which in turn can be used to implement universal approximators (e.g., perceptrons, neural networks; Cybenko, 1989; Rosenblatt, 1957). Traditionally, IFN circuits use pulse frequency modulation (PFM). However, it can be convenient to use pulse width modulation (PWM) instead, for its constant throughput as well as the ability to efficiently realize many functions, such as min and max, with only logic gates despite the analog signal representation (Miyashita et al., 2014; Ravinuthula, Garg, Harris, & Fortes, 2009; Roberts & Ali-Bakhshian, 2010). Furthermore, the saliency algorithm introduced in this work requires a hardware multiplier. Beyond this algorithm's requirements, being able to construct networks with multipliers can lead to richer functional representations and has biological plausibility (Schmitt, 2002). Artificial neural networks can be used to approximate nonlinear functions such as multiplication (Ravinuthula, 2006), but several unit cells and stages are required, making the hardware cost high. A common way to implement multiplication in neuromorphic systems is the coincidence detector circuit for multiplying two PFM-encoded signals (Srinivasan & Bernard, 1976). This circuit, however, assumes a probabilistic representation of spike timing and has a high error, especially for low-amplitude signals (Srinivasan & Bernard, 1976; Tal & Schwartz, 1997). Another approach is to exploit the nonlinear effects of the refractory period in a traditional IFN to coarsely approximate multiplication via a logarithmic transform; however, this approach is also inaccurate (>5% error) and low bandwidth (<20 Hz) (Tal & Schwartz, 1997).
Recently, a translinear principle has been introduced for PWM-coded signals that allows analog synthesis of arbitrary nonlinear functions at real-time data rates without expensive neural network approximations (D'Angelo & Sonkusale, 2014, 2015a, 2015b, 2016). This discovery, combined with the benefits already noted, motivates the use of a PWM architecture for the algorithm introduced here. The circuits used here assume a discrete-time PWM coding, sometimes referred to as a time-mode signal representation. Therefore, we use the terms PWM and time mode interchangeably in this discussion. The time-mode signal representation used in this work is illustrated in Figure 3 and compared with a voltage-mode representation of analog signals.
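To make the time-mode representation concrete, the following sketch (our own illustration, not the paper's circuit) encodes samples as pulse widths within a fixed clock frame and shows why min and max reduce to plain logic gates when the pulses share an aligned rising edge:

```python
import numpy as np

FRAME = 64  # time steps per clock frame (assumed resolution)

def encode(value):
    """Encode a value in [0, 1] as a pulse: high for round(value * FRAME) steps."""
    wave = np.zeros(FRAME, dtype=bool)
    wave[: int(round(value * FRAME))] = True
    return wave

def decode(wave):
    """Recover the value from a pulse waveform (width / frame length)."""
    return wave.sum() / FRAME

# With rising edges aligned at the frame start, AND of the waveforms yields
# the narrowest pulse (min) and OR yields the widest pulse (max).
a, b, c = encode(0.25), encode(0.5), encode(0.75)
assert decode(a & b & c) == 0.25   # AND -> min
assert decode(a | b | c) == 0.75   # OR  -> max
```

The gate-level max circuit analyzed in section 4 uses a rising-edge encoding instead, but the principle, analog values manipulated with digital gates, is the same.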

Figure 3:

Comparison of voltage-mode (dashed lines) and time-mode/PWM (solid lines) signal representations. Voltage-mode circuits have inputs $V_{in_i}$ and produce one or more output voltages, $V_{out}$. Time-mode circuits take in pulses with widths $t_i$ and produce an output pulse width $t_{out}$. Each voltage represents one voltage-mode sample (i.e., the instantaneous intensity of a pixel). Each pulse width represents one time-mode sample of the pixel intensity.


### 3.2  Image Channels and Normalization

The JHU POIS algorithm first processes the input image to extract the intensity and multiple color channels. The algorithm proposed here also uses these channels exactly as reported in Russell et al. (2014). The channels used are intensity (one channel), color opponency (four channels), and orientation (four channels).

In addition to the channels, the JHU POIS algorithm includes cross-communication between channels in the form of normalization. Normalization is also applied in the grouping step, discussed later in this section. In the algorithm presented in this letter, however, these normalization steps were found to have a minimal effect on the data set results. Furthermore, it was unclear whether the effect was positive or negative across the data set. Therefore, in the proposed hardware-oriented algorithm, the channels operate independently in parallel as separate networks with no internal cross-connectivity. A single neuron with nine inputs, one from each channel (four color, four orientation, and one intensity), would perform the final averaging of these channels. An implementation of the normalization algorithm, used to inform this omission, is provided in the pysaliency library.

### 3.3  Pyramid Generation

The next step of the algorithm is to compute an image pyramid, which consists of $L$ images such that image $l$ has been downsampled by a factor of $2^l$ for all $l \in [1, L]$. In JHU POIS, this image pyramid is downsampled using an interpolation technique such as bilinear interpolation. However, implementing a 2D interpolation in analog hardware is challenging because it requires averaging circuits between each pixel in the downsampled image. Therefore, the proposed algorithm downsamples by decimation alone. If the algorithm is serialized with respect to the input pixels, downsampling without interpolation also allows each pixel to be computed individually from a small subset of the pixels in the image. This configuration greatly simplifies the circuit design because a homogeneous cell that computes the saliency of one pixel from its neighbors can be arrayed in parallel. The input pixels can then be serialized through the array, allowing the area and bandwidth of the network to be traded off as needed.
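A minimal software sketch of this interpolation-free pyramid (function name ours) is:

```python
import numpy as np

def pyramid(image, levels):
    """Build a pyramid by pure decimation: level l keeps every 2**l-th pixel,
    so each downsampled pixel depends on exactly one input pixel (no
    interpolation or averaging circuitry is implied)."""
    return [image[:: 2 ** l, :: 2 ** l] for l in range(levels)]

img = np.arange(64.0).reshape(8, 8)
print([lv.shape for lv in pyramid(img, 4)])  # [(8, 8), (4, 4), (2, 2), (1, 1)]
```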

### 3.4  Image Convolutions

At this point, the algorithm computes a series of convolutions. The JHU POIS algorithm used $5×5$ floating-point-resolution kernels generated by equations reported in Russell et al. (2014). In the algorithm performance results reported in this letter, $3×3$ integer kernels are used. The kernel size was reduced to trade some accuracy for circuit area savings. The resolution of the kernel weights was limited to integer values to improve device matching through a more uniform device layout, at the expense of dynamic range in the weights. Furthermore, the output of each convolution in the analog formulation would be half-wave rectified, which cuts the circuit area in half because a differential circuit would otherwise be required to implement signed PWM/time-mode computation. The single-ended computation was found to have only a modest impact on the performance of the algorithm. All convolutions are performed with self-padding of the image; that is, the outermost columns and rows of the image perimeter are duplicated outward $\lfloor K/2 \rfloor$ times, where $K$ is the kernel dimension. The following sections discuss the computational steps of the algorithm with additional details about the convolution kernels.

### 3.5  Edge Detection

The first set of kernels applied by the algorithm comprises edge-detection kernels, which compute approximations of the derivative of the image pixels, highlighting the edges. The JHU POIS algorithm called for Gabor edge detectors, but Sobel filters are used here for their simplicity and integer weights. Four orientations, $\theta$, of each of the following even and odd Sobel edge detectors are convolved with the input image:
$g_{even} = \begin{bmatrix} -1 & 2 & -1 \\ -2 & 4 & -2 \\ -1 & 2 & -1 \end{bmatrix},$
(3.1)
$g_{odd} = \begin{bmatrix} 2 & 0 & -2 \\ 4 & 0 & -4 \\ 2 & 0 & -2 \end{bmatrix}.$
(3.2)
The orientations, $0, \frac{\pi}{4}, \frac{\pi}{2}, \frac{3\pi}{4}$, are generated by simply rotating the kernels in steps of $\frac{\pi}{4}$. In the $3×3$ kernel, this is equivalent to rotating the outer pixels counterclockwise by 1 pixel. The JHU POIS algorithm computes the $\ell_2$ norm of these two responses. This computation was reduced to the $\ell_1$ norm, that is, the absolute value of the sum, which reduces the requirement from three translinear circuits to a single linear circuit,
$C_{even,\theta}^{(l)} = I_{in} * g_{even,\theta},$
(3.3)
$C_{odd,\theta}^{(l)} = I_{in} * g_{odd,\theta},$
(3.4)
$C_{\theta}^{(l)} = \left| C_{even,\theta}^{(l)} + C_{odd,\theta}^{(l)} \right|,$
(3.5)
where the notation $C_{\theta}^{(l)}$ indicates the image that results from the convolution of the kernel with orientation $\theta$ at the $l$th level of the pyramid. The parentheses in the superscript differentiate the level index from exponent notation.
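The edge-detection step above can be sketched in NumPy as follows; the function names, the edge-replicate padding via `np.pad`, and the rotation helper are our own illustrative choices:

```python
import numpy as np

# 3x3 integer kernels of equations 3.1 and 3.2.
g_even = np.array([[-1,  2, -1],
                   [-2,  4, -2],
                   [-1,  2, -1]])
g_odd  = np.array([[ 2,  0, -2],
                   [ 4,  0, -4],
                   [ 2,  0, -2]])

def rotate45(k):
    """Rotate a 3x3 kernel by pi/4: shift the eight outer entries one
    position around the center (the center entry is unchanged)."""
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    out = k.copy()
    for (r0, c0), (r1, c1) in zip(ring, ring[1:] + ring[:1]):
        out[r1, c1] = k[r0, c0]
    return out

def conv3x3_selfpad(image, k):
    """3x3 filtering with self-padding: the outermost rows and columns are
    replicated outward once (np.pad mode 'edge')."""
    p = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w))
    for dr in range(3):
        for dc in range(3):
            out += k[dr, dc] * p[dr:dr + h, dc:dc + w]
    return out

def edge_response(image, steps):
    """C_theta for theta = steps * pi/4, per equations 3.3-3.5 (l1 combine)."""
    ke, ko = g_even, g_odd
    for _ in range(steps):
        ke, ko = rotate45(ke), rotate45(ko)
    return np.abs(conv3x3_selfpad(image, ke) + conv3x3_selfpad(image, ko))
```

Because both kernels sum to zero, a uniform image produces zero response at every orientation, which is a quick sanity check on the weights.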

### 3.6  Center-Surround Filters

The concept of a center-surround filter is inspired by the structure of retinal ganglion cells in the optic nerve. The filter consists of a center circle and a concentric annular surround. This structure allows the saliency algorithm to distinguish between light-on-dark and dark-on-light contrasts of objects in a scene by modulating the edge response with the center-surround response. The filters used here are integer approximations of a truncated version of the filters used in JHU POIS. The light-on-dark contrast filter used here is the following:
$CS_L = \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}.$
(3.6)
The dark-on-light contrast is just the inverted version of the light-on-dark filter:
$CS_D = -CS_L.$
(3.7)
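For reference, these kernels can be written directly as integer arrays; a minimal sketch:

```python
import numpy as np

# Integer center-surround kernels of equations 3.6 and 3.7.
CS_L = np.array([[ 1, -2,  1],
                 [-2,  4, -2],
                 [ 1, -2,  1]])
CS_D = -CS_L  # dark-on-light is the negation of light-on-dark

# Both kernels sum to zero, so a uniform region produces no response; only
# contrast between the center and its surround is passed on.
assert CS_L.sum() == 0 and CS_D.sum() == 0
```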

### 3.7  Border Ownership

The next step of the algorithm is to compute border ownership. The goal of this step is to address the problem that objects in a scene often partially occlude one another from the perspective of the image sensor. The result of this occlusion is that borders are ambiguous, and a decision must be made about which objects own which borders or edges. Segments of an annular (i.e., ring-shaped) filter are used to map the activity from the output of the center-surround filters to the edges of different objects in the scene. These mappings are scaled and summed upward across the $L$ pyramid levels for each contrast type, $c \in \{L, D\}$, and orientation, $\theta$,
$V_{\theta,c}^{(l)} = \sum_{j \geq l}^{L} \frac{1}{2^{j}} \, v_{\theta} * CS_{c}^{(j)},$
(3.8)
where higher levels of the pyramid (i.e., larger $l$) correspond to more downsampling, that is, a factor of $2^l$ fewer pixels in each dimension. The von Mises filters used in the JHU POIS algorithm are complex, large filters with floating-point precision. The general effect provided by these filters was approximated by the following filters:
$v_{0} = \begin{bmatrix} 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&1&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{bmatrix}, \quad v_{\pi/4} = \begin{bmatrix} 0&0&0&0&1 \\ 0&0&0&1&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{bmatrix}, \quad v_{\pi/2} = \begin{bmatrix} 0&0&1&0&0 \\ 0&0&1&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{bmatrix}, \quad v_{3\pi/4} = \begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{bmatrix}.$
(3.9)
In the analog-friendly formulation of the algorithm, the $v_\theta$ filters were omitted, as the error incurred by skipping this step was not significant relative to the large power and area savings, even in the reduced form shown. The rationale behind this design decision is discussed in detail in section 4.2. The revised form of equation 3.8 for the analog-friendly formulation is then
$V_{c}^{(l)} = \sum_{j \geq l}^{L} CS_{c}^{(j)}.$
(3.10)
The level sums of the light-on-dark and dark-on-light center-surround responses are used to modulate the edge responses, $C_{\theta}^{(l)}$, to compute a measure of border ownership for each orientation:
$B_{\theta,L}^{(l)} = C_{\theta}^{(l)} + C_{\theta}^{(l)} \left( V_{\theta+\pi,L}^{(l)} - V_{\theta,D}^{(l)} \right),$
(3.11)
$B_{\theta,D}^{(l)} = C_{\theta}^{(l)} + C_{\theta}^{(l)} \left( V_{\theta+\pi,D}^{(l)} - V_{\theta,L}^{(l)} \right),$
(3.12)
$B_{\theta}^{(l)} = B_{\theta,L}^{(l)} - B_{\theta,D}^{(l)}.$
(3.13)
If the von Mises filtering is skipped, the border response reduces to
$B_{\theta}^{(l)} = C_{\theta}^{(l)} + C_{\theta}^{(l)} \left( V_{L}^{(l)} - V_{D}^{(l)} \right)$
(3.14)
because the level sums retain only contrast information and no longer have separate orientation pathways. This omission physically reduces the size of the network.

### 3.8  Border Grouping

The next and final step of the core algorithm is a grouping step used to combine border responses into saliency maps. This step uses a winner-take-all structure to determine which orientation of the border signals responds the strongest. This orientation signal is then convolved with another von Mises kernel at the same angle. Therefore, an argmax circuit over the border orientation signals is required. Furthermore, the existence of another convolution with an angle parameter, $θ$, necessitates the parallel computation of all pixels that would be inputs to any of these convolutions due to the fact that the “winner” is unknown prior to this border grouping step. The massive parallelization required by this step of the algorithm is challenging for an efficient analog implementation. For each level, the grouping step will be
$\theta_{max} = \arg\max_{\theta} B_{\theta}^{(l)},$
(3.15)
$S_{img} = \sum_{l}^{L} v_{\theta_{max}} * B_{\theta_{max}}^{(l)}.$
(3.16)
In the simulation studies, another optimization was found to produce a saliency map with acceptable accuracy relative to the data set used for evaluation. In keeping with the serialization of the algorithm, and to avoid the additional parallel convolutions required by the second von Mises step, the argmax function can be reduced to the max function. This modification requires the second von Mises kernel to be eliminated, since the maximum angle will be unknown; only the maximum response signal is passed on to the next stage of computation. The maximum border response summed over the levels is taken as the saliency map. This simplification actually appears to improve the results of the algorithm on the evaluation data set compared to the Python implementation of the JHU POIS algorithm, although a direct comparison with the original algorithm's implementation would be necessary to truly verify this result. Regardless, the performance of the simplified algorithm was sufficient to justify the massive area and power savings. The max rule for the saliency thus becomes
$B^{(l)} = \max_{\theta} B_{\theta}^{(l)},$
(3.17)
$S_{img} = \sum_{l}^{L} B^{(l)}.$
(3.18)

The intuition for why reducing the argmax over the border differences to the max over the ownership responses themselves is an effective approximation is as follows. A large edge response at a particular orientation at a certain location, combined with a large center-surround response (indicating an object) at that location, indicates a high probability that the edge belongs to that object. By simply taking the maximum of these modulated responses across the orientations and adding them up over the levels, the algorithm groups objects together in a scale-invariant manner.
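The simplified border-ownership and max-grouping steps can be sketched as follows, with stand-in inputs in place of the real edge and center-surround stages:

```python
import numpy as np

# Sketch of equations 3.14, 3.17, and 3.18 for one channel. C[th][l] are
# per-level edge responses and V_L[l], V_D[l] the center-surround level sums;
# all arrays are assumed already upsampled to a common shape so that levels
# can be summed directly. The inputs below are random stand-ins.

def saliency_map(C, V_L, V_D):
    S = None
    for l in range(len(V_L)):
        # B_theta^(l) = C + C * (V_L - V_D) = C * (1 + V_L - V_D)  (eq. 3.14)
        B = [C[th][l] * (1.0 + V_L[l] - V_D[l]) for th in C]
        B_max = np.maximum.reduce(B)            # max over orientations (3.17)
        S = B_max if S is None else S + B_max   # sum over levels (3.18)
    return S

rng = np.random.default_rng(0)
C = {th: [rng.random((8, 8)) for _ in range(4)] for th in range(4)}
V_L = [rng.random((8, 8)) for _ in range(4)]
V_D = [rng.random((8, 8)) for _ in range(4)]
print(saliency_map(C, V_L, V_D).shape)  # (8, 8)
```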

The full algorithm with both von Mises kernels as well as the argmax-based WTA implementation is depicted in Figure 2. The fully hardware-optimized algorithm without a von Mises step and with the max-based WTA selection is shown in Figure 4. Four levels were chosen for the transistor-level design example. The block diagram in Figure 4 illustrates the full analog-friendly formulation of the algorithm. This formulation will be used in a motivating design example in section 6.

Figure 4:

Block diagram of the hardware-optimized saliency algorithm. Each tick mark represents the number of analog wires on each bus. The numbers in parentheses underneath represent the number of distinct orientation buses there are for that set of nodes in the computation. A comparison with the algorithm in Figure 2 is also shown.


### 3.9  Annular Filter-Based Saliency Algorithm

This section describes an exceptionally simple saliency algorithm as an alternative to the primary saliency algorithm presented thus far. Its performance is only moderate, but its attractiveness lies in its simplicity. The term annular filter is used to describe this approach because it traces a ring around the pixel of interest. Consider a $3×3$ patch, $P_{in}$, selected from an input image, $I_{in}$:
$P_{in}^{(l)}(x,y) = \begin{bmatrix} p_0 & p_1 & p_2 \\ p_3 & p_4 & p_5 \\ p_6 & p_7 & p_8 \end{bmatrix},$
(3.19)
where $p_4 = I_{in}(x,y)$ is the pixel at the $(x,y)$ coordinate of the image. The saliency decision is computed as follows. For each level of the image pyramid with $L$ levels, the absolute difference between each pair of opposing pixels around the center pixel is computed. The maximum of these differences is summed across levels, and the sign of the result is taken as the saliency decision for the center pixel, $p_4 = I(x,y)$:
$S(x,y) = \mathrm{sgn}\left( \sum_{l=0}^{L} \max_{i \in \left[0, \frac{N-1}{2}-1\right]} \left| p_i^{(l)} - p_{N-i-1}^{(l)} \right| \right).$
(3.20)
In the context of time-mode circuit implementations, this algorithm requires $L$ adders, $L$ 4-input max circuits, $4L$ subtraction circuits, and one phase detector (e.g., two flip-flops and one inverter) for the $\mathrm{sgn}(x)$ function (i.e., a comparator). In a time-mode implementation, the entire circuit could be implemented using only digital gates (although the signal representation remains analog) with a small area footprint and high bandwidth. However, the saliency classification performance is significantly reduced compared to the full algorithm, as shown in Figure 5a.
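A minimal sketch of this per-pixel decision (function name ours) is:

```python
import numpy as np

# Sketch of the annular-filter decision of equation 3.20 for one pixel.
# `patches` holds one 3x3 neighborhood (p0..p8, row-major) per pyramid level;
# the max absolute difference over the four opposing pixel pairs is summed
# across levels and the sign taken (in hardware, a phase-detector comparator).

def annular_decision(patches):
    N = 9
    total = 0.0
    for P in patches:
        p = np.asarray(P, dtype=float).ravel()
        # pairs (p0,p8), (p1,p7), (p2,p6), (p3,p5) oppose each other
        total += max(abs(p[i] - p[N - 1 - i]) for i in range((N - 1) // 2))
    return np.sign(total)
```

A perfectly uniform neighborhood at every level yields 0 (not salient); any opposing-pair contrast yields 1.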
Figure 5:

Overview of results on the data set reported in Cheng et al. (2015). (a) Eight different thresholds and six different versions of the saliency algorithm simulation (see Table 1). Each data point represents the mean TPR/accuracy averaged across pixels within each image and averaged again across the data set images. (b) Averages of simulation results over 40 random images. The $y$-axis represents image counts over the data set, and the $x$-axes correspond to AUC in the top plot and accuracy in the bottom plot.


## 4  Cost Analysis and Hardware Considerations

The saliency algorithm has many potential design trade-offs in terms of computational complexity, hardware cost, and design complexity. This section analyzes several of these trade-offs.

### 4.1  Time-Mode Circuits for Computing Max and Argmax

The JHU POIS algorithm calls for an argmax operation over the kernel orientations to determine the outputs that will be passed along to a second step of von Mises filtering. This operation was simplified by summing the maximum response over the orientations for each level and using the result as the final saliency score. However, as there may be a benefit to a CMOS implementation of the original argmax-based algorithm, we introduce and analyze a time-mode argmax circuit here and compare its gate count with that of the time-mode max circuit.

Assuming a synchronous, single-ended clocking scheme and assuming that the times of the rising edges of the inputs relative to the beginning of the clock frame encode the data, a max function implemented with time-mode circuits requires an $N$-input NAND gate, which for a large number of inputs is typically implemented with $N_{NAND}$ two-input NAND gates:
$N_{NAND} = \sum_{b=1}^{\lceil \log_2(N) \rceil} \frac{N}{2^{b}}.$
(4.1)
This relationship comes from the fact that there will be $\lceil \log_2(N) \rceil$ stages of NAND gates, where stage $b$ contains $N/2^{b}$ gates, a factor of 2 fewer than the previous stage, until only one NAND gate remains. The structure, depicted in Figure 6a, equalizes the delay through the NAND gates, minimizing the offset error (i.e., nonuniform skew) introduced by the circuit.
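A quick sketch that evaluates equation 4.1 numerically (using integer division to model the halving of gates per stage; function name ours) is:

```python
import math

def n_nand(N):
    """Two-input NAND gates needed for an N-input NAND realized as a
    balanced tree, per equation 4.1."""
    return sum(N // (2 ** b) for b in range(1, math.ceil(math.log2(N)) + 1))

print([n_nand(N) for N in (2, 4, 8)])  # [1, 3, 7]
```

For power-of-two $N$, this reproduces the familiar $N-1$ gates of a binary reduction tree.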
Figure 6:

Time-mode (i.e., discrete-time pulse width modulation) based implementations of max and argmax functions. (a) An inverted version of the signal with the widest pulse width is passed to the output. (b) The output is one-hot encoded: a 1 appears on the wire $i_x$, the index of the input signal with the largest pulse width.

Under the same clocking assumptions, argmax can be implemented with a combination of phase detectors (PDs), that is, time-mode comparators, and AND gates. An example of such a circuit is shown in Figure 6b. First, each pair of the $N$ inputs is compared, requiring $N_{pd}$ phase detectors:
$N_{pd}=\binom{N}{2}=\frac{N!}{2!\,(N-2)!}.$
(4.2)
Once each input pair has been compared, two stages of gates are required to determine which index corresponds to the maximum pulse width. First, consider the input $x_0$ with index $i=0$. $x_0$ is an input to $N-1$ phase detectors. If $x_0$ is the max input, all of these phase detectors will output logic HIGH; if it is not, some will output logic LOW. Therefore, computing the $(N-1)$-input AND of the phase detectors that take $x_0$ as an input determines whether $x_0$ is the max. Let the output of this $(N-1)$-input AND gate be called $y_0$. The next input, $x_1$, is also an input to $N-1$ phase detectors, but the logical AND of only $N-2$ of their outputs, named $y_1$, is needed to determine whether $x_1$ is the max, because in the second stage of logic, $\overline{y_0}\wedge y_1$ indicates whether $x_1$ is the max. Continuing this pattern for $N$ inputs, assuming that the phase detectors have active-low outputs so that the AND gates can be replaced with NOR gates, and applying the relationship derived in equation 4.1 for converting $N$-input logic to two-input logic, $N_{NOR}^{(1)}$ gates will be required in the first stage:
$N_{NOR}^{(1)}=\sum_{n=1}^{N}\sum_{b=1}^{\lceil \log_2(N) \rceil}\frac{n-1}{2^b}.$
(4.3)
The second stage can be realized with the minimum number of physical gates by changing the AND circuits to NOR circuits and connecting the inverted outputs, $\overline{y_i}$, appropriately. The second stage requires $N-2$ inverters and $N_{NOR}^{(2)}$ NOR gates:
$N_{NOR}^{(2)}=\sum_{n=1}^{N}\sum_{b=1}^{\lceil \log_2(N) \rceil}\frac{n}{2^b}.$
(4.4)

Figure 7 plots the total number of gates for the max and argmax circuits as a function of the number of inputs, assuming a standard two flip-flop implementation of the phase detector as well as a simpler XOR-based phase detector. It can be seen from these relationships that the max function requires far fewer logic gates as the number of inputs increases. This fact motivates reformulating the saliency algorithm from one requiring an argmax function to one requiring only a max function. Even if the overall accuracy is degraded by this change, the performance loss may be justified by the much higher efficiency of the max-based saliency network, provided the output remains an acceptable classifier.
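The trend in Figure 7 can be reproduced directly from equations 4.1 through 4.4. In the sketch below, the per-phase-detector cost `gates_per_pd` is our assumption (an XOR-based PD costs a few two-input gates; a two flip-flop PD costs more), not a figure from the text:

```python
import math

def max_gate_count(n):
    # Equation 4.1: two-input NAND gates in a balanced n-input NAND tree.
    stages = math.ceil(math.log2(n))
    return sum(math.ceil(n / 2**b) for b in range(1, stages + 1))

def argmax_gate_count(n, gates_per_pd=4):
    # gates_per_pd is an assumed per-phase-detector cost (see Figure 7 for
    # the two PD variants compared in the text).
    pds = math.comb(n, 2)  # equation 4.2
    stages = math.ceil(math.log2(n))
    nor1 = sum(math.ceil((k - 1) / 2**b)
               for k in range(1, n + 1) for b in range(1, stages + 1))  # eq. 4.3
    nor2 = sum(math.ceil(k / 2**b)
               for k in range(1, n + 1) for b in range(1, stages + 1))  # eq. 4.4
    return pds * gates_per_pd + nor1 + nor2 + (n - 2)  # plus n - 2 inverters

for n in (4, 8, 16):
    print(n, max_gate_count(n), argmax_gate_count(n))
```

The quadratic growth of the pairwise phase-detector count dominates, which is why the max circuit wins decisively at larger $N$.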

Figure 7:

Comparison of circuit area required by TM max and argmax circuits.

### 4.2  Computational Complexity and a Distributed Saliency Algorithm

One of the critical design decisions that makes an analog ASIC implementation of a saliency algorithm practical to implement cost-effectively is the elimination of the von Mises or annular filtering stage. More generally, this omission simplifies the design because a second layer of convolutions would require a significant number of additional convolution outputs from the first stage. In the case of the annular filters, this effect is exacerbated by the fact that several angles of rotation of the kernels must be computed. The degradation in accuracy is reported in Figure 5b, but this section discusses the improvement in computational complexity of the algorithm, which directly translates to a reduction in power and area in a hardware implementation. The effect of parallelizing the architecture is also analyzed.

The algorithm can be parallelized in two ways. First, the saliency of each pixel can be computed separately, ignoring the computations that are redundant among neighboring pixels. In this form, each network computes one pixel at a time with no connections to other networks. This approach simplifies the architecture, but some energy is wasted on duplicating the convolution computations that overlap among adjacent pixels. The alternative is to allow overlapping connections between adjacent networks, which saves power and area by eliminating the redundancy but increases the wiring complexity of the implementation. In other words, computing the saliency of groups of pixels reduces the number of computations because some computations are identical; however, exploiting this sharing requires connections between neighboring computational units, preventing fully distributed computation in a hardware implementation.
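The trade-off between the two parallelization styles can be illustrated with a back-of-the-envelope count (our illustration, not the paper's exact cost model): for $P$ adjacent pixels in a row and a $K\times K$ kernel, independent networks touch $P K^2$ input pixels, while shared networks need only the union of the overlapping windows:

```python
def inputs_independent(p, k=3):
    # P independent single-pixel networks each read a full K x K window.
    return p * k * k

def inputs_shared(p, k=3):
    # P adjacent pixels in a row share overlapping columns, so the union of
    # their windows spans only (K + P - 1) x K input pixels.
    return (k + p - 1) * k

print(inputs_independent(4), inputs_shared(4))  # 36 vs. 18
```

The factor-of-two saving here is what the shared-wiring design buys, at the cost of the inter-network connections described above.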

Four versions of the algorithm are analyzed in Figure 8. The source code used to compute the number of computations in this plot is included in the pysaliency library. It can be seen that the most efficient implementation considers redundancy and skips the von Mises algorithm. Also, skipping the von Mises algorithm but opting for the less complex network (i.e., the simplest of the four designs) provides the same improvement as using the von Mises algorithm with the more complex network with shared computations. From this analysis, it was determined that the redundant and fully parallel architecture (i.e., no von Mises, redundant in Figure 8) is the most suitable for an analog circuit implementation.

Figure 8:

Computational complexity of the analog formulation of the saliency algorithm as a function of network parallelization (i.e., the number of pixels whose saliency is computed in parallel). This plot illustrates the effect of two design decisions. First, the required energy (GOPS/W) for computing the saliency for a real-time video application can be reduced by increasing the number of pixels computed in parallel. When computing adjacent pixels in parallel, the convolutions can be shared to save energy but increase wiring complexity. Second, eliminating the von Mises kernel (or second-stage kernels in a more generalized network) results in a substantial reduction in the required computing energy.

For an analog implementation of an algorithm, noise, mismatch, and cross talk introduce challenges to fully parallelized computation. Conversely, serializing all of the input data limits the device bandwidth. Therefore, it is useful to formulate the algorithm such that it is fully distributed with respect to each pixel of the saliency map so that there is flexibility in the parallelization and serialization of the data flow through the circuit. Fortunately, the analog-friendly reformulation of the algorithm lends itself to a fully distributed implementation of the saliency algorithm. This benefit comes from removing the von Mises or annular filter in the second stage of the algorithm as described in section 3.7. A consequence of this omission is that there will only be one stage of image convolutions. This simplified structure allows each pixel's saliency to be computed individually from a subset of only $K2L$ pixels, where $K$ is the width of the kernel and $L$ is the number of levels in the image pyramid. This set of neighboring pixels around the current pixel of interest can be inferred from the image pyramid. A rastering algorithm was designed to precompute this subset of pixels that is needed for computing each output pixel. This rastering algorithm uses simple logic to step through each pixel in the image and sequentially select the correct $K2L$ neighbors. This rastering pattern can be stored in a look-up table (LUT) that drives the row column decoder of an image sensor in a fully integrated implementation.
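The rastering idea described above can be sketched in a few lines. This is a minimal illustration, not the pysaliency implementation: the helper names are ours, the downsampling is naive decimation, and borders are clamped, which the paper does not specify:

```python
import numpy as np

def build_pyramid(img, levels):
    # Naive 2x downsampling by decimation; a real imager would low-pass first.
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(pyr[-1][::2, ::2])
    return pyr

def raster_neighbors(img, k=3, levels=4):
    # For each output pixel, yield the K*K*levels input pixels its saliency
    # depends on, one serialized step at a time.
    pyr = build_pyramid(img, levels)
    r = k // 2
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            patch = []
            for lvl, p in enumerate(pyr):
                py, px = y >> lvl, x >> lvl  # position at this pyramid level
                for dy in range(-r, r + 1):
                    for dx in range(-r, r + 1):
                        yy = min(max(py + dy, 0), p.shape[0] - 1)
                        xx = min(max(px + dx, 0), p.shape[1] - 1)
                        patch.append(p[yy, xx])
            yield (y, x), np.array(patch)
```

With $K=3$ and $L=4$, each yielded vector holds $K^2L = 36$ values, matching the 36-pixel input of the circuit in section 6. In hardware, the same traversal order would be frozen into the LUT that drives the row column decoder.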

### 4.3  Local Averaging and Ensemble Networks

Saliency algorithms can be top down or bottom up in both computer science and neuroscience. Bottom-up saliency quantifies the distinctiveness of a pixel relative to its neighbors (Melloni, van Leeuwen, Alink, & Müller, 2012; Rauss & Pourtois, 2013). Conversely, top-down saliency selects salient pixels based on higher-level goals (Melloni et al., 2012; Rauss & Pourtois, 2013). Consequently, there are fundamental limits to how accurate bottom-up saliency can be in the context of subjective higher-order reasoning, and the best attentional operator would combine the two. The algorithm presented in this letter is a bottom-up approach to saliency. Therefore, to boost the accuracy of the results in the bottom-up context, we propose two modifications.

First, we propose locally averaging the real-valued saliency of neighboring pixels prior to applying the threshold. We show below that this strategy improves the results when measured against an image data set. The following equation describes the modified saliency update:
$S_n=\sum_{i=n-M}^{n}S_i,$
(4.5)
where $n$ is the current discrete-time index, which is also the index of the current pixel, $S$ is the saliency value of the current pixel, and $M$ is the number of local pixels to average. The order in which pixels are rastered can then be adjusted to implement averaging over different regions of pixels. In the results presented in this letter, averaging occurs only on column-wise adjacent pixels in the same row. At the left edges of the image, no average is computed, as the right edge of the image should not affect saliency on the left edge in the bottom-up approach. It is likely that a more complex rastering scheme that includes averaging of row and column neighbors would produce superior results.
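A minimal sketch of the update in equation 4.5, applied row-wise as described (the function name is ours; near the left edge only the available pixels are summed, so the right edge of one row never influences the left edge of the next):

```python
import numpy as np

def local_sum(saliency_row, m=4):
    # Equation 4.5: each pixel becomes the sum of itself and its M
    # predecessors in the same row; the left edge is truncated, not wrapped.
    out = np.empty_like(saliency_row)
    for n in range(len(saliency_row)):
        out[n] = saliency_row[max(0, n - m):n + 1].sum()
    return out

print(local_sum(np.ones(6)))  # [1. 2. 3. 4. 5. 5.]
```

The classification threshold is then applied to this summed value rather than to the raw per-pixel saliency.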

In an analog circuit, the averaging could be implemented with $M$ time-mode latches (Ali-Bakhshian & Roberts, 2012) and an additional time-mode adder that stores the previous $M$ saliency values onto $M$ capacitors and reads out the sum. However, this structure was not included in the transistor-level design. Nevertheless, this averaging could be performed in the digital domain to demonstrate the concept. If the rastering scheme is used in an analog design, this local averaging can be seen as a type of memory. The results tagged *-mem in Figures 5a and 5b refer to those using this local averaging technique.

## 5  Saliency Algorithm Results

Various versions of the saliency algorithm were implemented, and their performance was assessed on a data set. Table 1 lists the differences between the six algorithms used in this study. The primary differences are the use of annular filters, the use of the local averaging technique, and the introduction of expected hardware errors and nonidealities into the simulation. These errors were incorporated into behavioral circuit models from transistor-level simulations and include jitter, charge injection, clock feed-through, device mismatch, nonlinearity, and leakage.

Table 1:
Overview of the Different Versions of the Six Saliency Algorithms Tested on the Data Set from Cheng et al. (2015).
| Algorithm | Description | Errors | Averaging | Normalize | Channels | Annular Filter | WTA | Kernels | Norm |
|---|---|---|---|---|---|---|---|---|---|
| POIS | Mimic of JHU | Ideal | None | Yes | Intensity, Color, Orientation | von Mises | argmax | Float | L2 |
| HW-ideal | Serialized | Rectified | None | No | Intensity | No | max | Integer | L1 |
| HW-errors | Serialized | Circuit | None | No | Intensity | No | max | Integer | L1 |
| HW-memory | Serialized | Rectified | 4 pixels | No | Intensity | No | max | Integer | L1 |
| Annular | Serialized | Rectified | None | No | Intensity | Custom | max | Integer | NA |
| Annular-memory | Serialized | Rectified | 4 pixels | No | Intensity | Custom | max | Binary | NA |

The data set performance was quantified by using a binary classification view of the saliency algorithm. The data set in Cheng et al. (2015) was restructured to fit this formulation. The data set was randomly split into standard sets for training and cross-validation. In order to keep the training time reasonable, a random subset of images was chosen, and from these images, a random subset of pixels was chosen. The rastering algorithm was then used to compute the input examples and output labels.

Because saliency is a somewhat subjective measure, metrics on the data set were used to quantify the results. In light of the binary classification formulation, the area under the curve (AUC) of the receiver operator characteristic (ROC) was used for quantification (Fogarty, Baker, & Hudson, 2005). First, a set of 40 random images was chosen from the data set for verification. The rastering algorithm was applied to each image, the saliency map of each image was computed by six different versions of the algorithm, and eight different classification thresholds were applied to each pixel. A confusion matrix and the accuracy of the saliency decision were then computed for each image. These confusion matrices were averaged across the images to generate the parametric ROC curves as a function of the threshold, as well as the accuracy over threshold depicted in Figure 5a. A naive threshold value was also computed as a baseline comparison point. This curve represents choosing saliency based on the magnitude of the pixel intensity instead of using a saliency algorithm.
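The evaluation procedure above (threshold sweep, confusion counts, parametric ROC, trapezoidal AUC) can be sketched as follows. This is our illustration of the metric, not the paper's evaluation code, shown on a toy perfectly separable example:

```python
import numpy as np

def confusion_sweep(scores, labels, thresholds):
    # For each threshold, classify score >= t as salient and collect
    # (false-positive rate, true-positive rate, accuracy).
    pts = []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        tn = np.sum(~pred & ~labels)
        pts.append((fp / (fp + tn), tp / (tp + fn), (tp + tn) / labels.size))
    return pts

def roc_auc(pts):
    # Trapezoidal area under the ROC, with points ordered by (FPR, TPR).
    fpr, tpr, _ = map(np.array, zip(*pts))
    order = np.lexsort((tpr, fpr))
    fpr, tpr = fpr[order], tpr[order]
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr)))

scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([True, True, False, False])
pts = confusion_sweep(scores, labels, [0.0, 0.5, 1.1])
print(roc_auc(pts))  # 1.0 for this perfectly separable toy example
```

In the paper's procedure, the confusion matrices are additionally averaged across images before the parametric ROC curve is formed.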

The saliency metrics did vary significantly with the images themselves, as some scenes are more suitable for saliency than others. To show this variation, histograms are plotted in Figure 5b showing the ROC AUC score and the accuracy across images. It can be seen that the algorithm that incorporates the hardware nonidealities does exhibit reduced accuracy, but at a threshold of 0.4, it maintains an accuracy above 70% with a true-positive rate at nearly 80% and a false-positive rate of about 60%, indicating classification behavior from the algorithm. The use of the local averaging technique maintains the accuracy of the error-corrupted hardware result but with a significant reduction in the false-positive rate. Therefore, the performance is boosted with this simple additional processing.

A key advantage of this formulation of the saliency algorithm is that it is fully distributed across the pixels of the image and therefore lends itself to parallel processing. In the context of time-mode circuits, this formulation allows groups of neighboring pixels to be rastered through a single analog circuit. This technique reduces the overall area substantially by taking advantage of the high bandwidth of the time-mode building blocks relative to the required real-time frame rate. Furthermore, by reusing the same analog circuit, matching across the image is greatly improved. In the digital domain, the same rastering technique could be used in a digital application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA) implementation. An improvement can also be gained in a GPU implementation or in a CPU implementation using multiprocessing. Figure 9 shows the timing benchmarks for the different versions of the algorithm tested in Figure 5a. In this benchmarking test, the saliency map of a single frame was computed for two different aspect ratios with and without CPU parallelization. The algorithms (except for JHU POIS) were parallelized across eight processes, resulting in a speedup of about six times. The implementation of the JHU POIS algorithm utilizes SciPy's optimized convolution method, whereas the other hardware-optimized algorithms distribute the computations into dot products of a weight matrix with a subset of neighboring pixels. The SciPy implementation gains speed through C-level computations via Cython (Behnel et al., 2011), whereas the distributed computations allow multiprocessing to be used. Despite not using the optimized SciPy convolutions, the distributed approach computes the saliency map faster by utilizing multiple CPU processes without any data sharing required between them.
However, the distributed algorithm is most suitable for specialized hardware implementations such as on a GPU, and most notably for a custom analog or digital hardware implementation, because performing convolutions on the entire image in parallel can be prohibitive from a power and area perspective in an ASIC design.

Figure 9:

Benchmarks for each algorithm on a single test image with multiprocessing for different resolutions. Ran on AMD Opteron Processor 6380.

Figure 10 shows an example image from the data set introduced in Cheng et al. (2015) processed with the HW-errors algorithm and compared with the custom implementation of the JHU POIS algorithm. The JHU POIS algorithm performs well, but it is computationally complex and more challenging to implement in analog hardware, as discussed in previous sections. The HW-errors model uses the compact analog algorithm and incorporates conservative circuit errors extracted from Spectre simulations. A threshold of 40% is used to create the masks for compressing the images shown on the bottom right. Only 24% of the pixels were retained to create the mask in the analog version compared to 37% in the JHU POIS algorithm. The full shape of the vehicle is preserved, rejecting the background despite the complex texture of the scene. This result suggests that this algorithm may have applications to machine vision for automated vehicles.
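The masking step described above amounts to thresholding the saliency map and zeroing the remaining pixels; the fraction of pixels retained is the compression ratio. A minimal sketch (our helper, shown on synthetic data rather than the data set image):

```python
import numpy as np

def compress_with_saliency(img, sal, threshold=0.4):
    # Keep only pixels whose saliency meets the threshold; the compression
    # ratio (CR) is the fraction of pixels preserved by the mask.
    mask = sal >= threshold
    return np.where(mask, img, 0), float(mask.mean())

img = np.ones((4, 4))
sal = np.zeros((4, 4))
sal[1:3, 1:3] = 0.9          # a 2 x 2 salient region
_, cr = compress_with_saliency(img, sal)
print(cr)  # 0.25: one quarter of the pixels retained
```

In Figure 10, the same operation with a threshold of 40% retains 24% of the pixels for the analog version and 37% for JHU POIS.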

Figure 10:

The top row shows the original image and its ideal saliency map from the data set (Cheng et al., 2015). The second row shows the results with the Python implementation of the JHU POIS algorithm. The bottom row shows the behavioral simulation results of the analog saliency algorithm with noise and mismatch parameters set to $2σ$ larger than the RMS deviation measured from the Cadence Spectre simulation. The compression ratio (CR) is shown as a percentage of pixels preserved by masking the image with a thresholded saliency map.

## 6  Circuit Considerations for the Saliency Algorithm

This section discusses circuit-level considerations for an integrated circuit design of the saliency algorithm introduced thus far. In this design example, the algorithm takes a vector of 36 pixels as input as selected from the scene by a rastering algorithm, whose source code is provided in the online appendix. The intensity of each input pixel is represented as an analog pulse width. The saliency algorithm circuit produces a single output pulse whose pulse width is proportional to the saliency value.

### 6.1  Input Rastering Scheme

The goal of the hardware-friendly algorithm is to compute a saliency map across an entire image one pixel at a time. In a fully parallelized version of the algorithm, the image is first downsampled by a factor of 2 several times to create an image pyramid. In the transistor-level design, the first four layers of this pyramid are used. The algorithm depends on several convolutions, each with $3×3$ kernels. Therefore, in order to serialize the algorithm, $3×3×4$ pixels are required to compute each output pixel. In order to compute the saliency map for an entire image, each set of 36 pixels is serialized through the saliency circuit at 4.4 MHz, which is sufficient for computing $256×256$ resolution saliency maps from 60 Hz video. The input video resolution can be higher by appropriately downsampling and assuming that neighboring pixels share the same saliency value.
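The serialization rate quoted above can be sanity-checked: one 36-pixel vector must pass through the circuit per output pixel, so the pixel clock must exceed the output pixel rate of the map:

```python
# Output pixels per second for a 256 x 256 saliency map at 60 fps.
pixels_per_second = 256 * 256 * 60        # 3,932,160
clock_hz = 4.4e6
print(clock_hz / pixels_per_second)       # ~1.12: about 12% timing margin
```

The 4.4 MHz clock therefore covers 60 Hz operation at $256\times256$ with roughly 12% margin.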

### 6.2  Analog Computation Circuits

The time-mode neuron circuit proposed in Ravinuthula et al. (2009) is used for computing the convolutions and the two level sums. XOR gates perform the subtractions that are not part of the convolution operations. The multiplication stage is implemented using the translinear pulse width multiplier introduced in D'Angelo and Sonkusale (2015a, 2016). Finally, the logic gate–based max circuit used is the one depicted in Figure 6a. The convolution weights were hard-coded in the sizing of the current sources to save chip area, reduce the number of pads, and reduce the wiring complexity. The weights can be adjusted prior to fabrication via a layout automation process detailed in section 6.3.

### 6.3  Synthesis Approach

An advantage of using quasi-digital pulses to represent analog information is that synthesis techniques can be used to automate the circuit layout because standard-timing constraints and skew-reduction techniques apply to time-mode circuits. The saliency algorithm designed here was synthesized using such techniques. The analog network was designed and synthesized in IBM soi12s0 45 nm SOI CMOS process, and a 1 V supply was used for the reported data from the transistor-level simulations.

First, the algorithm was described mathematically in the Python programming language. A modular object-oriented design paradigm was used such that various aspects of the algorithm could be effectively removed or modified. This modular design was used with expert knowledge of analog circuit design to limit the algorithm's operations to those that could be performed efficiently with time-mode circuits. Furthermore, errors mimicking the nonidealities from the simulated hardware were introduced into the Python implementation to verify the error resiliency of the algorithm.

The weights in the convolution network were extracted from the software implementation of the algorithm for circuit synthesis. A Python program was then used to convert the weight description and the higher-level algorithm description into a Verilog netlist (see Figure 11). The Verilog netlist was imported into Cadence Virtuoso. A Cadence SKILL-based parameterized cell (PCell) was designed to read in the weight description and generate the layouts for the weighted adder and multiplier circuits. This process involves first designing the core of the adder and multiplier by manual analog design, then combining these cores with the cells generated by the PCell using a digital synthesis approach.

Figure 11:

Schematic of the all-analog single-pixel saliency algorithm. A Python program converts a numpy/matrix representation of the algorithm into a Verilog netlist that can be imported by ASIC electronic design automation tools.

The top-level layout of the saliency algorithm transistor-level design was synthesized in Cadence's Encounter suite using automated place and route of macros representing the computational building blocks. The 1.5 mm $×$ 1.5 mm layout is shown in Figure 12. A flip-chip pad array was used to achieve a larger input/output count and reduce parasitic inductance and capacitance.

Figure 12:

Synthesized layout of the single-pixel saliency algorithm implemented with analog time-mode circuits in IBM soi12s0 45 nm SOI CMOS process. Total area is 1.5 mm $×$ 1.5 mm, but this area value is limited by the pad I/O. The top-level layout was placed and routed using Cadence Encounter. The layouts of the kernel circuits are generated using a PCell that takes a weight matrix as input and produces design rule checking/layout versus schematic clean layouts of the kernels.

### 6.4  Transistor-Level Simulations of the Saliency Network

Postlayout simulations were performed on the synthesized saliency network with the Spectre circuit simulator from the Cadence Design Suite to verify correct network synthesis. The results show an average power consumption of 3.7 mW at 60 Hz for $256×256$ saliency maps. Furthermore, Monte Carlo simulations demonstrate less than a 1 ns standard deviation in the absolute output pulse width. Direct observation of the transient waveforms indicates that computations propagate through the network successfully. A summary of the specifications is given in Table 2. The minimum input pulse width that produces an output in one of the time-mode neurons is approximately 50 ps. For the output dynamic range calculation, we assumed 223 ps of jitter, extrapolated from a 1 ms transient noise simulation to account for accumulation over the five stages.
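The dynamic-range entries in Table 2 follow from $20\log_{10}$ of the ratio between the maximum pulse width and the smallest resolvable one (the ~50 ps minimum input pulse at the input; the 223 ps accumulated jitter at the output):

```python
import math

def dynamic_range_db(max_width, min_width):
    # Pulse-width dynamic range in dB: 20*log10(max/min).
    return 20 * math.log10(max_width / min_width)

print(round(dynamic_range_db(160e-9, 50e-12), 2))   # 70.1  dB (input)
print(round(dynamic_range_db(30e-9, 223e-12), 2))   # 42.58 dB (output)
```

Both values match the 70.10 dB input and 42.58 dB output figures reported in Table 2.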

Table 2:
Specifications of Single Pixel Saliency Transistor-Level Design.
| Specs | Cadence Spectre Simulation |
|---|---|
| Area | 1.5 mm $×$ 1.5 mm |
| Frame rate | 60 fps |
| Clock frequency | 4.4 MHz |
| Power | 3.7 mW |
| Aspect ratio | 256 $×$ 256 pixels |
| Chip inputs | 36 pulses |
| Input dynamic range | 70.10 dB (160 ns max) |
| Output dynamic range | 42.58 dB (30 ns max) |
| Output jitter | 223 ps ($0.74\%$) |
| Output deviation | 1 ns ($3.3\%$) |

### 6.5  Comparison with Existing Saliency Hardware

Dedicated hardware for computing saliency maps falls under the broader category of visual attention systems. Several architectures have been proposed in both the analog and digital domains. However, to date, most analog implementations of saliency rely on WTA operations on the temporal derivatives of pixel intensities or the spatial derivatives of neighboring pixels (Bartolozzi & Indiveri, 2009; Horiuchi & Niebur, 1999; Sonnleithner & Indiveri, 2012). The latter approach is comparable to the annular filter-based algorithm presented in section 3.9. Furthermore, the existing architectures often assume a saliency map as an input to an attentional tracker (Morris, Horiuchi, & DeWeerth, 1998). The algorithm presented in this work is the first analog implementation of the border ownership modulation scheme requiring an analog multiplier. Nevertheless, the existing work uses the same fundamental building blocks, namely neuron circuits (i.e., weighted adders) and WTA circuits, so these architectures can serve as a reasonable point of comparison. Table 3 compares several architectures found in the literature with the one introduced here. Previous analog architectures have been limited in their real-time performance by the bandwidth constraints of the spiking neuron, as well as the lower bandwidth of longer-length transistors. The closest point of comparison that could be found is the bottom-up portion of the digital implementation presented in Barranco et al. (2014). A fairer comparison can be made by extrapolating the energy per pixel down to a single feature channel and adjusting for the differences in supply voltage and capacitive loading. This extrapolation yields an energy per pixel of 4.77 nJ/pix, still five times larger than the transistor-level simulations predict for this work. Furthermore, significant quantization energy would still be required to digitize the entire frames.
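The comparison arithmetic can be reproduced roughly as follows. The division by ten feature channels is our assumption (it is what brings 47.74 nJ/pix to the quoted 4.77 nJ/pix); the 940 pJ/pix figure is this work's simulated energy from Table 3:

```python
# Assumed: the 47.74 nJ/pix bottom-up figure of Barranco et al. (2014)
# scales down across ten feature channels to a single-channel energy.
per_channel = 47.74e-9 / 10               # ~4.77 nJ/pix for one channel
ratio = per_channel / 940e-12             # vs. this work's 940 pJ/pix
print(round(per_channel * 1e9, 2), round(ratio, 1))  # 4.77 5.1
```

Even after this favorable scaling, the digital implementation remains about five times more energy per pixel than the transistor-level simulations predict here.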

Table 3:
Saliency Hardware Comparison.
| Specification | Sonnleithner et al. (2012) | Morris, Horiuchi, and DeWeerth (1998) | Oster et al. (2008) | Barranco, Diaz, Pino, and Ros (2014) | Horiuchi et al. (1999) | This work |
|---|---|---|---|---|---|---|
| Frame rate | 10 fps | NA | 4 fps | 180 fps | 4 fps | 60 fps |
| Pixels/frame | $32×32$ | 20 | $32×32$ | $640×480$ | 23 | $256×256$ |
| Energy/pix | NA | NA | NA | 47.74 nJ/pix$^*$ | 58.70 $μ$J/pix | 940 pJ/pix |
| Dynamic range | 45 dB | 41 dB | NA | 54 dB | NA | 70.10 dB |
| Algorithm type | Attention | Attention | WTA network | Saliency | Saliency | Saliency |
| Processing | Bottom up | Bottom up | NA | Top down | Bottom up | Bottom up |
| Features | $δI/δt$ | Saliency | Intensity | I, $θ$, RGB, $δI/δt$ | $δI/δt$, $δI/δx$ | Intensity |
| Implementation | Analog | Analog | Mixed signal | Digital | Analog | Analog CMOS |
| Circuit type | Spiking | Current | Spiking | FPGA | Current | Time mode |
| Area (mm$^2$) | $2×2$ | $2.25×2.25$ | $1.8×1.8$ | NA | $2.22×2.22$ | $1.5×1.5$ |
| CMOS process | $0.35μ$m | $2μ$m | $0.35μ$m | 65 nm | $2μ$m | 45 nm |

$*$Bottom-up energy only.

## 7  Conclusion

A compact visual saliency algorithm has been presented. This algorithm has been shown to be effective at computing saliency maps despite aggressive optimization and reduced computational complexity compared to the prior art. We have shown that this algorithm can be implemented in standard CMOS processes with a purely analog signal path in a semiautomated manner. The architecture shows promise as a practical preprocessor for machine vision applications that require ultra-low power and low area but also require real-time video processing of scenes.

## Acknowledgments

This work was funded by Draper Laboratory in Cambridge, Massachusetts.

## References

Ali-Bakhshian, M., & Roberts, G. W. (2012). A digital implementation of a dual-path time-to-time integrator. IEEE Transactions on Circuits and Systems I: Regular Papers, 59(11), 2578–2591.

Barranco, F., Diaz, J., Pino, B., & Ros, E. (2014). Real-time visual saliency architecture for FPGA with top-down attention modulation. IEEE Transactions on Industrial Informatics, 10(3), 1726–1735.

Bartolozzi, C., & Indiveri, G. (2009). Selective attention in multi-chip address-event systems. Sensors, 9(7), 5076–5098.

Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D. S., & Smith, K. (2011). Cython: The best of both worlds. Computing in Science and Engineering, 13(2), 31–39.

Cheng, M.-M., Mitra, N. J., Huang, X., Torr, P. H., & Hu, S.-M. (2015). Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 569–582.

Cybenko, G. (1989). Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems, 2(4), 303–314.

D'Angelo, R., & Sonkusale, S. (2014). A time-mode translinear principle for implementing analog multiplication. In Proceedings of the 2014 IEEE International Symposium on Circuits and Systems (pp. 73–76). Piscataway, NJ: IEEE.

D'Angelo, R., & Sonkusale, S. (2015a). Analogue multiplier using passive circuits and digital primitives with time-mode signal representation. Electronics Letters, 51(22), 1754–1756.

D'Angelo, R. J., & Sonkusale, S. R. (2015b). A time-mode translinear principle for nonlinear analog computation. IEEE Transactions on Circuits and Systems I: Regular Papers, 62(9), 2187–2195.

D'Angelo, R., & Sonkusale, S. (2016). Precise time mode multiplier using digital primitives and passive components. In Proceedings of the 2016 IEEE International Symposium on Circuits and Systems (pp. 1802–1805). Piscataway, NJ: IEEE.

Fogarty, J., Baker, R. S., & Hudson, S. E. (2005). Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In Proceedings of Graphics Interface 2005 (pp. 129–136). Mississauga, ON.

Horiuchi, T., & Niebur, E. (1999). Conjunction search using a 1-D, analog VLSI-based, attentional search/tracking chip. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI
(pp.
276
290
).
Mississauga, ON
:
.
Itti
,
L.
, &
Koch
,
C.
(
2001
).
Computational modelling of visual attention
.
Nature Reviews Neuroscience
,
2
(
3
),
194
.
Itti
,
L.
,
Koch
,
C.
, &
Niebur
,
E.
(
1998
).
A model of saliency-based visual attention for rapid scene analysis
.
IEEE Transactions on Pattern Analysis and Machine Intelligence
,
20
(
11
),
1254
1259
.
Jones
,
J. P.
, &
Palmer
,
L. A.
(
1987
).
An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex
.
Journal of Neurophysiology
,
58
(
6
),
1233
1258
.
Koch
,
C.
, &
Tsuchiya
,
N.
(
2007
).
Attention and consciousness: Two distinct brain processes
.
Trends in Cognitive Sciences
,
11
(
1
),
16
22
.
Koch
,
C.
, &
Ullman
,
S.
(
1987
). Shifts in selective visual attention: Towards the underlying neural circuitry. In
L. M.
Vaina
(Ed.),
Matters of intelligence
(pp.
115
141
).
Berlin
:
Springer
.
LeCun
,
Y.
,
Bengio
,
Y.
, &
Hinton
,
G.
(
2015
).
Deep learning
.
Nature
,
521
(
7553
),
436
444
.
Lee
,
S.
,
Kim
,
K.
,
Kim
,
J.-Y.
,
Kim
,
M.
, &
Yoo
,
H.-J.
(
2010
).
Familiarity based unified visual attention model for fast and robust object recognition
.
Pattern Recognition
,
43
(
3
),
1116
1128
.
Melloni
,
L.
,
van Leeuwen
,
S.
,
,
A.
, &
Müller
,
N. G.
(
2012
).
Interaction between bottom-up saliency and top-down control: How saliency maps are created in the human brain
.
Cerebral Cortex
,
22
(
12
),
2943
2952
.
Miyashita
,
D.
,
Yamaki
,
R.
,
Hashiyoshi
,
K.
,
Kobayashi
,
H.
,
Kousai
,
S.
,
Oowaki
,
Y.
, &
Unekawa
,
Y.
(
2014
).
An LDPC decoder with time-domain analog and digital mixed-signal processing
.
IEEE Journal of Solid-State Circuits
,
49
(
1
),
73
83
.
Morris
,
T. G.
,
Horiuchi
,
T. K.
, &
DeWeerth
,
S. P.
(
1998
).
Object-based selection within an analog VLSI visual attention system
.
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing
,
45
(
12
),
1564
1572
.
Oster
,
M.
,
Wang
,
Y.
,
Douglas
,
R.
, &
Liu
,
S.-C.
(
2008
).
Quantification of a spike-based winner-take-all VLSI network
.
IEEE Transactions on Circuits and Systems I: Regular Papers
,
55
(
10
),
3160
3169
.
Park
,
S.
,
Hong
,
I.
,
Park
,
J.
, &
Yoo
,
H.-J.
(
2016
).
An energy-efficient embedded deep neural network processor for high speed visual attention in mobile vision recognition SoC
.
IEEE Journal of Solid-State Circuits
,
51
(
10
),
2380
2388
.
Rauss
,
K.
, &
Pourtois
,
G.
(
2013
).
What is bottom-up and what is top-down in predictive coding
?
Frontiers in Psychology
,
4
.
Ravinuthula
,
V.
(
2006
).
Time-mode circuits for analog computation
.
Ph.D diss., University of Florida
.
Ravinuthula
,
V.
,
Garg
,
V.
,
Harris
,
J. G.
, &
Fortes
,
J. A.
(
2009
).
Time-mode circuits for analog computation
.
International Journal of Circuit Theory and Applications
,
37
(
5
),
631
659
.
Roberts
,
G. W.
, &
Ali-Bakhshian
,
M.
(
2010
).
A brief introduction to time-to-digital and digital-to-time converters
.
IEEE Transactions on Circuits and Systems II: Express Briefs
,
57
(
3
),
153
157
. doi:10.1109/TCSII.2010.2043382
Rosenblatt
,
F.
(
1957
).
The perceptron, a perceiving and recognizing automaton Project Para
.
Ithaca, NY
:
Cornell Aeronautical Laboratory
.
Russell
,
A. F.
,
Mihalaş
,
S.
,
von der Heydt
,
R.
,
Niebur
,
E.
, &
Etienne-Cummings
,
R.
(
2014
).
A model of proto-object based saliency
.
Vision Research
,
94
,
1
15
. doi:10.1016/j.visres.2013.10.005
Schmitt
,
M.
(
2002
).
On the complexity of computing and learning with multiplicative neural networks
.
Neural Computation
,
14
(
2
),
241
301
.
Sonnleithner
,
D.
, &
Indiveri
,
G.
(
2012
). A real-time event-based selective attention system for active vision. In
U.
Rückert
,
S.
Joaquin
, &
W.
Felix
(Eds.),
(pp.
205
219
).
Berlin
:
Springer
.
Srinivasan
,
M. V.
, &
Bernard
,
G. D.
(
1976
).
A proposed mechanism for multiplication of neural signals
.
Biological Cybernetics
,
21
(
4
),
227
236
.
Tal
,
D.
, &
Schwartz
,
E. L.
(
1997
).
Computing with the leaky integrate-and-fire neuron: Logarithmic computation and multiplication
.
Neural Computation
,
9
(
2
),
305
318
.