Spikes are the next digits

Przemysław Gralewicz

1 year ago

Remember the anxiety felt back in the 1990s after the publication of the first quantum algorithms by Deutsch and Jozsa (1992), Shor (1994), and Grover (1996)? Most of us expected quantum computers to be of practical use within a decade, at most two, and numerous popular publications depicted the soon-to-arrive bright era of unlimited computing power under everyone's desk. Three decades later, we still have only proofs of concept, claims to fame, and bulky contraptions, plus sporadic articles about another “important step forward” along a seemingly endless road. What was, and still is, the biggest promise turns out to be the biggest nightmare as well: exponential speed-up also means exponentially growing operators to be precisely designed and controlled, and exponentially growing noise to be battled against. Perhaps one day all these obstacles will be removed, but until then it's an ongoing lesson in patience.

Meanwhile, another revolution is quietly sprouting without much hype, because thankfully the media is busy foaming elsewhere. Independently, from two directions, scientists and engineers are converging on the idea of moving from digitally processed probabilities to probabilistically processed digits. Spikes are Mother Nature's choice for macroscopic quanta of information, and we are left to ponder why this is the case.

Into the world of spikes

Spikes may have appeared accidentally in some cells that later specialized into neural tissue, a specialization facilitated by the information-transmission properties of the invention. But one can also argue the opposite: that early neurons communicated via electrochemical potentials only with their direct neighbors, and that axons, dendrites, and eventually spikes evolved gradually as a means of communication over increasing distances. There are still such cells present in the sensory tissues which pre-date brains.

Spikes are quite different from the kind of processing used in electronics, where the two binary states of 0 and 1 are symmetric in terms of stability. Rather than the states themselves, spikes may represent the derivatives of transitions between them.

From a technical point of view, the appearance of spikes might have come from their ability to carry information even when traveling long distances across a body, in the same way that digital signals can be correctly decoded despite attenuation along cables. Propagating analogue signals is not feasible in the long run due to the exponential spreading of probability distributions. In contrast, spikes are like decisions, each marking a new stepping stone to support oneself on and dart further from. Language models are a good example: at each step of text generation, one token is selected based on the estimated probability over the vocabulary. If we tried to avoid these decisions, then instead of an indicator, a whole probability distribution would be propagated to the next step as an initial condition. After a few steps like this, the distribution would become so uniform that it would carry no meaning at all. This is fairly similar to forecasting the weather, a system which is so chaotic that predicting it more than several days ahead is bound to fail.
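The spreading argument can be illustrated with a toy model. The sketch below is not a language model: it assumes a hypothetical four-state Markov chain with a made-up transition matrix `T`, and compares propagating the whole distribution with committing to one sampled state per step. The entropy of the propagated distribution climbs towards the maximum of 2 bits, while the committed state always carries zero uncertainty.

```python
import math
import random

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def step(p, T):
    """Propagate a whole distribution one step: p_next[j] = sum_i p[i] * T[i][j]."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

# Hypothetical transition matrix: stay with probability 0.6, move to a neighbor otherwise.
T = [
    [0.6, 0.2, 0.0, 0.2],
    [0.2, 0.6, 0.2, 0.0],
    [0.0, 0.2, 0.6, 0.2],
    [0.2, 0.0, 0.2, 0.6],
]

def propagate_distribution(T, steps, start=0):
    """No decisions: the full distribution is handed to the next step."""
    p = [1.0 if i == start else 0.0 for i in range(len(T))]
    for _ in range(steps):
        p = step(p, T)
    return p

def propagate_decisions(T, steps, start=0, seed=0):
    """A decision at every step: sample one state and continue from it."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        state = rng.choices(range(len(T)), weights=T[state])[0]
    return [1.0 if i == state else 0.0 for i in range(len(T))]

soft = propagate_distribution(T, steps=20)
hard = propagate_decisions(T, steps=20)
print(round(entropy_bits(soft), 4), entropy_bits(hard) == 0)
```

The committed path may of course end up in the wrong state, but whatever it carries remains a definite statement that the next stage can build on.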

Our digital revolution is based on the same premise: that discrete signals are far more durable than analogue ones. The difference lies in the encoding of states and communication protocols.

For a long time scientists have been trying to decode the language of spikes. Some of the neuromorphic hardware (NMHW) that has been built since then serves purely to simulate the brain. To many of those working in the field of machine learning this sounds like squandering resources, because simulating one system with another is typically some orders of magnitude more expensive. So they build their own neuromorphic chips, but instead of simulating, the aim is to probe the advantages of alien talk reduced to the bare minimum.

Power efficiency

In an overheating world where never-satiated users demand ever more power-consuming gadgets, the only way forward is to significantly boost the efficiency of the devices. So, rather unsurprisingly, the most mundane reason to replace ANNs running on GPUs with spiking neural networks (SNNs) on NMHWs is their low energy consumption. Just as graphics processors outperformed CPUs in terms of computational efficiency, so are NMHWs destined to take over neural inference from GPUs:

First, the major difference between a GPU and a CPU is that the former comprises a multitude of far simpler and less capable "little CPUs", whose sheer number and orchestrated parallelism more than compensate for their individual weakness. Likewise, NMHW divides the work into even smaller processing units, the neurons, each with just one simple algorithm to execute around the clock.
Second, the processing bottleneck of the von Neumann architecture is the memory access. Modern chips improve on the original idea by adding successive levels of caches (L1, L2, L3...), which significantly reduce transfers. Basically, the more local a computation, the more efficient it is. Neurons, whether biological or electronic, draw information only from their connected neighbors, and this makes them very local. And because all connections are stored on the chip, it's a bit like using exclusively the L1-level cache.
Third, spikes are events, and a network of spiking neurons is event-driven. This means that only the necessary information is sent through the wires at a time. By contrast, the networks implemented on GPUs are like trains running on a schedule: no matter whether packed with passengers or empty, they keep running, with most of the power being consumed on moving the massive carriages around.
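The third point can be made concrete by counting operations. The following is a minimal sketch with made-up integer weights, not a model of any particular chip: both functions compute the same weighted sum of binary input spikes, but the clocked version visits every synapse on every step, while the event-driven version only touches the synapses of inputs that actually fired. The operation count is only a rough proxy for energy.

```python
def dense_forward(weights, spikes):
    """Clocked evaluation: every synapse is visited, whether its input fired or not."""
    n_out = len(weights[0])
    out = [0] * n_out
    ops = 0
    for i, s in enumerate(spikes):
        for j in range(n_out):
            out[j] += weights[i][j] * s  # multiply-accumulate, even when s == 0
            ops += 1
    return out, ops

def event_forward(weights, spikes):
    """Event-driven evaluation: only inputs that spiked trigger any work."""
    n_out = len(weights[0])
    out = [0] * n_out
    ops = 0
    for i, s in enumerate(spikes):
        if not s:
            continue  # a silent input costs nothing
        for j in range(n_out):
            out[j] += weights[i][j]  # a binary spike needs an addition only
            ops += 1
    return out, ops

# Hypothetical layer: 4 inputs, 3 outputs, one active input.
weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
spikes = [0, 1, 0, 0]

print(dense_forward(weights, spikes))  # same output, 12 operations
print(event_forward(weights, spikes))  # same output, 3 operations
```

The sparser the activity, the larger the gap between the two counts.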

Any electronic device that cannot enjoy the comfort of mains electricity and needs a battery is a potential beneficiary of neuromorphic hardware. The range of such utilities is broad, including for example:

medical wearables that are used for vital function monitoring.
military or rescue equipment, such as drones, navigation, sensory enhancement.
precision agriculture and environmental monitoring which require field sensors in remote areas.
dormant systems designed for low-rate event detection.

Reducing power consumption by an order of magnitude can prolong battery life several times, which in turn may result in reduced environmental pollution, increased range, or durability of equipment. These are no mean feats, and can sometimes even save lives.

But the applications of NMHW are not limited to power-constrained systems. For example, SpiNNcloud is going to use them in data centers. Indeed, for a computing farm that consumes megawatts, any improvement in efficiency translates to millions in savings, which is well worth taking into consideration.

The human brain is estimated to consume 12 to 20 watts, even during intense intellectual effort, while an average gaming GPU gobbles an order of magnitude more. Any comparison of their computational capability is loaded with significant uncertainty and a grain of arbitrariness: CPUs, GPUs, and TPUs are typically gauged in FLOPs; for concreteness let’s take FLOP-32, the 32-bit IEEE floating-point operations. On the other hand, the efficiency of NMHWs is estimated in SOPs, the synaptic operations. The relation between the two units varies by orders of magnitude depending on whether we want to simulate spikes on a GPU, or floating-point arithmetic with spikes. Probably the most sensible translation is by comparison of performance on specific tasks and algorithms, for example: a) NorthPole claims 25x better efficiency, measured in image frames per joule, than Nvidia's V100 GPU on the same task; b) SpiNNaker-2 is over 18x more efficient than the A100 GPU; and c) Loihi-2 appears 24x more efficient than the Jetson Orin Nano running NsNet2. Given such numbers, we can estimate the number of SOPs per FLOP to be somewhere between 2 and 45, depending on who does the counting and how. On the graph below we assume the ratio to be ~10, and every blue point should be smeared vertically by more than an order of magnitude.

Looking at the mature commercial platforms, AMD Radeon leads the pack. Its mobile RX line, which took over from the discontinued Embedded family, is near the top of the efficiency scoreboard. It is on a straight path to enter the human brain efficiency (HBE) zone in about 5 years. Perhaps somewhat unexpectedly, given that they consume tens to hundreds of kilowatts, the biggest systems are among the best performers too. Their HBE prediction is about a decade from now.

Towards the bottom of the graph, we find the shining star of capital markets, whose monopolist position seems to nourish complacency rather than competitive edge. Its GTX/RTX gaming line shows no signs of being under pressure; if the trend continues, we will use HBE Nvidia graphics cards sometime in the 2060s. And the Jetson family is heading nowhere in particular, definitely not into the future of IoT, as advertised. The company is green only by its paint, and focused more on holding its grip on the key AI software, than improving its devices.

Finally, neuromorphic hardware is still very diverse and immature, but set on a steep course to achieve the objective of HBE. And because of that diversity and growing competition, the target may be reached even before the 2030s. A very promising direction appears to be Josephson junctions, which according to simulations, could reduce the costs by a further six orders of magnitude, therefore surpassing the brain in efficiency by a wide margin. In the table below, we compile a list of the NMHWs that went public in the last decade.

platform	developer	release	lithography (nm)	neuron count ↑	neuron type	synapse count ↑	synapse type	on-device training	frequency (MHz)	efficiency J/SOP ↓	efficiency W/synapse ↓	library
SpiNNaker-1	Manchester	2013-08-15	130	1.60·10⁴	prog.	1.60·10⁷	-	yes	150	1.13·10⁻⁸	6.25·10⁻⁸	PyNN
TrueNorth	IBM	2015-08-28	28	1·10⁶	exp-LIF	2.56·10⁸	CuBa, Delta	no	async	4.86·10⁻¹¹	2.54·10⁻¹⁰	CoreLet
ODIN	Louvain	2018-06-01	28	2.56·10²	LIF	6.5·10⁴	-	STDP	75	8.40·10⁻¹²	7.34·10⁻⁹	-
Loihi-1	Intel	2018-07-10	14	1.28·10⁵	LIF	1.28·10⁸	CuBa, Delta	STDP	async	2.36·10⁻¹¹	2.34·10⁻⁹	Lava
Tianjic	Tsinghua	2019-08-01	28	4.00·10⁴	prog.	1.00·10⁷	-	no	300	7.82·10⁻¹³	9.50·10⁻⁸	TJSim
µBrain	Eindhoven	2021-06-01	40	3.36·10²	-	3.7·10⁴	-	no	1.4	6.27·10⁻¹³	1.97·10⁻⁹	-
SpiNNaker-2	SpiNNcloud	2021-08-15	22	1.25·10⁵	prog.	1.52·10⁸	-	yes	200	1.50·10⁻¹¹	4.67·10⁻¹⁰	Py-spinnaker-2
Loihi-2	Intel	2021-10-30	7	1·10⁶	prog.	1.28·10⁸	prog.	prog.	async	1.61·10⁻¹¹	-	Lava
ReckOn	Zurich	2022-02-20	28	2.56·10²	-	1.32·10⁵	-	e-prop	13	5.47·10⁻¹²	6.06·10⁻¹⁰	-
BrainScaleS-2	Heidelberg	2022-02-24	65	5.12·10²	AdEx-IF	1.32·10⁵	CuBa, Delta	STDP	-	1.00·10⁻¹¹	-	PyNN, hxTorch
SNE	Zurich, Bologna	2022-04-29	22	8.19·10³	LIF	-	-	no	400	2.21·10⁻¹³	-	SLAYER
THOR	Eindhoven, Delft	2022-12-01	28	2.56·10²	LIF	6.5·10⁴	-	STDP	400	1.40·10⁻¹²	1.69·10⁻⁷	-
ANP-I	Tsinghua, SynSense	2023-07-17	28	5.22·10²	LIF	5.17·10⁵	CuBa, Delta	S-TP	40	1.50·10⁻¹²	-	-
NorthPole	IBM	2023-10-23	12	1·10⁶	-	2.56·10⁸	-	no	async	8.76·10⁻¹¹	-	NorthPole
Darwin-3	Zhejiang	2023-12-29	22	2.35·10⁶	prog.	-	prog.	prog.	333	5.47·10⁻¹²	-	-

When comparing the above devices, there are several factors worth taking into account:

First, the more neurons and synapses the better, because that lets you upload more advanced neural models. You can run simple CNNs on smaller chips, but who needs another MNIST digits classification these days?
Second, the ratio of synapses to neurons is potentially limiting. Many dense layers require mappings between ~1000 neurons, therefore a good ratio is about 3 orders of magnitude, which is, not coincidentally, similar to the brain. Few devices satisfy this condition.
Third, there is the flexibility of implementation: fixed hardware that offers just one type of neuron (mostly LIF) or a single training method (STDP) may be good for the purposes of demonstration, but in practice it's like having only one shoe size for everyone. At least several NMHWs offer programming assemblers which allow you to define their neural operations, at least to some extent.
Last but not least, there is the issue of power efficiency, but unfortunately there are many ways that one can measure this. The joule per synaptic operation (J/SOP), which is the same as watt per synaptic operation per second (W/SOPS), has become the de facto standard. In addition to the expense of individual operations, every platform incurs an overall energy cost, therefore it also makes sense to calculate the average power consumption per neuron, or better still, per synapse. Usually, the bigger a system, and the more connections it offers, the more efficient it is with respect to this measure.
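As a quick check of the second criterion, the synapse-to-neuron ratio can be computed directly from the table. The sketch below copies the neuron and synapse counts of a few of the larger platforms from the table above; the cut-off of 1000 synapses per neuron is only a literal reading of "about 3 orders of magnitude" and is an assumption of this example.

```python
import math

# (neuron count, synapse count) copied from the table above.
chips = {
    "SpiNNaker-1": (1.60e4, 1.60e7),
    "TrueNorth": (1e6, 2.56e8),
    "Loihi-1": (1.28e5, 1.28e8),
    "SpiNNaker-2": (1.25e5, 1.52e8),
    "Loihi-2": (1e6, 1.28e8),
    "NorthPole": (1e6, 2.56e8),
}

def synapses_per_neuron(neurons, synapses):
    """Average number of synapses available per neuron."""
    return synapses / neurons

for name, (neurons, synapses) in chips.items():
    ratio = synapses_per_neuron(neurons, synapses)
    print(f"{name:12s} {ratio:7.0f} synapses/neuron (10^{math.log10(ratio):.1f})")

# Assumed cut-off: at least 1000 synapses per neuron.
meets = [name for name, (n, s) in chips.items() if synapses_per_neuron(n, s) >= 1000]
print(meets)
```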

The actual list of NMHWs is much longer; most of them are on the front line of academic research and don't even have a catchy name. But some, like Intel Loihi-2, IBM NorthPole, and SpiNNaker-2, have evident commercial ambitions, and are designed with scaling capability in mind. It is very likely that in a couple of years we will witness the first SNN applications in IoT, mobile, and the cloud.

Coding schemes

The first hypothesis about the transmission protocol employed by neurons was a rather unsophisticated rate code, where significance, or probability, was proportional to the frequency of spiking. Of course this went up to the limit imposed by the refractory period of ~10 ms, or equivalently ~100 Hz. Unfortunately, the simplest does not mean the most efficient, so a number of more realistic coding schemes have been invented since then, for example: sparse temporal code, intensity to latency, inter-spike interval (ISI), time-to-first-spike (TTFS), phase code, burst code, or phasors on complex-valued neurons. All of them lie somewhere between the rate code and a dense temporal code where every bit matters, as in our digital communications. The in-built negligence of neural synapses makes strict codes unlikely candidates. We don’t know whether this faultiness is just an inescapable property of biological circuitry or an inherent part of information processing adapted to noisy environments. The most likely schemes seem to be those striking a balance between efficiency, in terms of spikes per transmitted bit, and the length of the decoding temporal window. Such codes may even depend on neuron type and species: the gap in reaction times between e.g. birds and big whales spans several orders of magnitude.
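To make the trade-off between spike count and decoding window tangible, here is a minimal sketch of two of the schemes named above. It assumes an input intensity x in [0, 1] and a discrete time window; the function names, the even spacing of spikes, and the linear latency mapping are choices made for this illustration, not a standard.

```python
def rate_encode(x, window=20):
    """Rate code: the number of spikes in the window is proportional to x."""
    n_spikes = round(x * window)
    train = [0] * window
    for k in range(n_spikes):
        train[(k * window) // n_spikes] = 1  # spread the spikes evenly
    return train

def rate_decode(train):
    """Recover x as the fraction of time steps that carry a spike."""
    return sum(train) / len(train)

def ttfs_encode(x, window=21):
    """Time-to-first-spike: a single spike, the stronger the input the earlier it fires."""
    train = [0] * window
    if x > 0:
        train[round((1 - x) * (window - 1))] = 1
    return train

def ttfs_decode(train):
    """Recover x from the latency of the only spike; silence means zero."""
    if 1 not in train:
        return 0.0
    return 1 - train.index(1) / (len(train) - 1)

x = 0.75
print(sum(rate_encode(x)), rate_decode(rate_encode(x)))  # many spikes
print(sum(ttfs_encode(x)), ttfs_decode(ttfs_encode(x)))  # a single spike
```

Both trains decode to the same value in this example, but the rate code spends 15 spikes where the latency code spends one; the price of the latter is that the receiver must know when the window started.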

Discovering all of the neural codes may take scientists a long time, but meanwhile we can build spiking neural networks, implement algorithms, and observe how they transmit information. This is what many researchers already do.

The spherical cow

Back in 1952 Hodgkin & Huxley published a neuron model, which was so good that with some additions and modifications it is still used today. But what is good for simulation is often bad for computing efficiency, and since then researchers and engineers have reduced the model dramatically from an elaborate system of differential equations down to a couple of difference equations, resulting in a kind of spherical cow model.

Integrate and Fire (IF) – This is the principal philosophy behind SNNs. It’s the point of contact between the two main ingredients: potentials v_it and actions |s_it| ∈ {0, 1}. There are more than a few variants of the IF theme, but the core idea remains unchanged: a cell sums up the weighted spikes of its afferent neighbors to modify its own potential, then probabilistically emits its own spike, or keeps quiet. A spike is a discharge, so it significantly alters the cell potential. Depending on the model, it lowers the potential by a certain quantum (soft reset), down to a resting value (hard reset), or to an even lower value (reset with refractory period).

In order to make the network sub-critical, it is common to use potential decay (leaky IF) with the decay constant |λ| ≤ 1 as a hyper-parameter. The activation function that maps cell potential to action probability can be linear, quadratic, exponential, and so on.

There are variants that go beyond binary outputs, such as for instance graded spikes: these cells can output any positive integer value, so that their nonlinearity is at least a discretization of potentials. To some extent this can be seen as the time-saving form of spike bursts observed in the natural neurons.

Another variant is the Resonate and Fire model: in its original form it integrates only real spikes, but it naturally extends to complex ones, which makes it convenient for handling the oscillatory signals that often feature in audio processing.
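A minimal sketch of a resonate-and-fire neuron is given below. It assumes a complex decay constant λ = exp(-1/τ + iω), a threshold applied to the real part of the potential, and a hard reset to zero; the parameter values and the function name are illustrative only. Two neurons receive the same train of real input spikes with a period of 8 steps: the one whose characteristic period matches the input accumulates the pulses in phase and fires, while the one tuned to a period of 16 steps sees consecutive pulses cancel and stays silent.

```python
import cmath
import math

def resonate_and_fire(input_spikes, omega, tau=20.0, weight=0.5, v_thr=1.0):
    """Simulate one resonate-and-fire neuron; returns its output spike train."""
    lam = cmath.exp(complex(-1.0 / tau, omega))  # decay and rotation per time step
    v = 0j
    out = []
    for s_in in input_spikes:
        v_star = lam * v + weight * s_in  # rotate, decay, then integrate the input
        s = 1 if v_star.real >= v_thr else 0  # threshold on the real part
        v = 0j if s else v_star  # hard reset after a spike
        out.append(s)
    return out

# Input: one spike every 8 steps, for 64 steps.
pulses = [1 if t % 8 == 0 else 0 for t in range(64)]

tuned = resonate_and_fire(pulses, omega=2 * math.pi / 8)      # period matches the input
detuned = resonate_and_fire(pulses, omega=2 * math.pi / 16)   # half the input frequency

print(sum(tuned), sum(detuned))
```

With these settings the tuned neuron needs three in-phase pulses to cross the threshold, so its output is itself a sparse, periodic spike train.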

We can consider SNNs as an instance of ANNs with a specific kind of activation function, or conversely perceive ANNs as short-time averages over SNNs. The latter interpretation is consistent with the rate coding scheme and seems prevalent in the literature.

Most of the practical neuron models operate according to the general algorithm:

Imitate the internal dynamics of true neurons by v → λv, where system stability requires |λ| ≤ 1;
Accumulate weighted signals from afferent neurons into excitatory potential v → v*;
Probabilistically generate action potential v* → s (spike);
Relax or reset excitatory potential (v*, s) → v conditioned on emitted spike.

Note, however, that decay, accumulation, and reset are three independent processes affecting the state potential v, which can be executed in arbitrary order, even stochastically, without a qualitative change in network functionality.

More formally, it can be put in the following form:

v*_it = λ_i v_{i(t-1)} + ∑_j w_ij s_{j(t-1)},

s_it = n · v*_it / |v*_it| with probability p_spike(v*_it, n), n = 0, 1, 2, ...,

v_it = r(v*_it, s_it),

where p_spike is a probability function that estimates the chances of firing an n-graded spike given the excitatory potential v*_it. By default n = 1, and p_spike(v) = θ(v - v_thr) is the Heaviside step function with a conveniently set threshold v_thr = 1. The function r is a reset that reduces the potential after discharge. In hard reset models r(v*, s) = (1 - s)v*, whereas soft reset could mean a quantized discharge r(v*, s) = v* - s. In the case of resonant neurons, the state v_it, excitation potentials v*_it, spikes s_it, synaptic weights w_ij, and decay constant λ_i are complex-valued. Note that while the decay rate |λ_i| is largely neuron-independent, i.e. either constant or shared among large groups of neurons, the phase ω_i := arg(λ_i) is characteristic of an individual neuron. In practice, λ_i = exp(-1/τ_m + iω_i), where τ_m ≥ 0 is the membrane constant, and ω_i ∈ [0, 2π) is the neuron's characteristic frequency.
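The three equations translate almost line by line into code. The sketch below covers only the default real-valued case with n = 1 and a deterministic Heaviside threshold (taken here to fire when v* ≥ v_thr); the function names, the list-based layout, and the parameter values are choices made for this example.

```python
def lif_step(v, spikes_in, weights, lam=0.9, v_thr=1.0, reset="hard"):
    """Advance a layer of leaky integrate-and-fire neurons by one time step.

    v         -- potentials v_i(t-1), one per neuron
    spikes_in -- afferent spikes s_j(t-1), each 0 or 1
    weights   -- weights[i][j] is the synaptic weight w_ij
    """
    v_next, spikes_out = [], []
    for v_i, w_i in zip(v, weights):
        # Decay and accumulate: v* = lam * v + sum_j w_ij * s_j
        v_star = lam * v_i + sum(w * s for w, s in zip(w_i, spikes_in))
        # Spike generation: Heaviside step at the threshold
        s = 1 if v_star >= v_thr else 0
        # Reset: hard r = (1 - s) * v*, soft r = v* - s
        v_next.append((1 - s) * v_star if reset == "hard" else v_star - s)
        spikes_out.append(s)
    return v_next, spikes_out

def run(inputs, weights, **kwargs):
    """Feed a sequence of input spike vectors and collect the output raster."""
    v = [0.0] * len(weights)
    raster = []
    for spikes_in in inputs:
        v, s = lif_step(v, spikes_in, weights, **kwargs)
        raster.append(s)
    return raster

# One neuron driven by a constant input spike through a weight of 0.75, no leak.
inputs = [[1]] * 4
print(run(inputs, [[0.75]], lam=1.0, reset="hard"))  # [[0], [1], [0], [1]]
print(run(inputs, [[0.75]], lam=1.0, reset="soft"))  # [[0], [1], [1], [1]]
```

The two rasters differ because the soft reset keeps the charge in excess of the threshold, whereas the hard reset throws it away.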

Training with spikes

The first methods to train ANNs were, like the ANNs themselves, inspired by nature. The synaptic weights were trained using Hebb’s rule, which famously states that those who fire together, wire together as well. But over the years this method was superseded by the much more efficient, although hardly natural, backpropagation (BP) of output errors towards the origin. Unlike Hebbian learning, which is local, BP is global, which means that a single adaptation step can modify, albeit minutely, every parameter of the model. Since its first applications back in the 1980s, BP has become the dominant training algorithm for all ANN models. Major ANN development frameworks offer no alternative to BP.

SNN researchers quickly realized that they had a problem with their spherical herd. What appeared to be a beautiful simplification turned into a major stumbling block. One of the conditions for BP to be feasible is that the whole transformation is composed of differentiable functions. Point-wise discontinuities of the derivatives can somehow be tolerated as long as they remain defined in the vicinity. That’s the case, for instance, with the very popular and successful ReLU and MaxPool operations. But in the case of SNNs the situation is far worse: the default spike generator is a Heaviside step function p_spike(v) = θ(Re(v) - 1), which is not only non-differentiable but also discontinuous. This means that the gradient at the firing threshold is a Dirac delta and vanishes elsewhere, which completely ruins BP.

At first, researchers tried to get back to the Hebbian method, or to bypass the problem and follow the approach of quantized networks by converting ANNs to SNNs, but the resulting models were evidently worse than the originals.

Those outcomes were hardly encouraging, but thankfully, Mother Nature came to the rescue: in biological computers nothing is exact, neither the potentials v_it, nor the firing thresholds v_thr. And probably this is for a good reason. Because if we assume that, for instance, due to inherent noise either of the two is normally distributed with standard deviation σ, then the sharp step changes to a familiar sigmoidal function used in ANNs, as illustrated below:

p_spike = θ(v - 1)

→ ½[1 + erf((v - 1)/(σ√2))]

In practice, for the sake of computational simplicity it is common to use arctan for the sigmoid, which corresponds to assuming a Cauchy distribution instead of a Gaussian. Many mathematicians would object to such a maneuver, but from an engineering point of view the particular choice doesn't make much of a difference.

Likewise, if our model permits graded spikes, or is running replicas, then the staircase output of the expectation E[n] changes to a soft version of the cherished ReLU activation, which is also differentiable:

E[n] = ∑_{n ≥ 1} θ(v - n)

→ ½ ∑_{n ≥ 1} [1 + erf((v - n)/(σ√2))]

≈ log[1 + exp(e·(v - ½))]/e

This procedure is known as the surrogate gradient method, and has become the de facto standard in SNN training. However, despite the suggestive graphics above, choosing the best surrogate is not a simple matter. That’s because most SNNs still use the sharp step function instead of a smooth probability, so the whole effect depends on the distributions of potentials v_it, which are a priori unknown.
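The mechanics can be shown on the smallest possible example: one neuron, one weight, one input. The sketch below assumes Gaussian noise with standard deviation sigma, so that the surrogate derivative is the normal density centred on the threshold; the squared-error loss, the learning rate, and all names are choices made for this illustration. The forward pass always uses the sharp step, and only the backward pass swaps in the smooth derivative.

```python
import math

def heaviside(v, v_thr=1.0):
    """Forward pass: the sharp spike generator."""
    return 1.0 if v >= v_thr else 0.0

def surrogate_grad(v, v_thr=1.0, sigma=0.5):
    """Backward pass: derivative of the Gaussian-smoothed step (a normal density)."""
    z = (v - v_thr) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def train_weight(w, x=1.0, target=1.0, lr=0.5, steps=100, use_surrogate=True):
    """Gradient descent on the loss (s - target)^2 / 2 for a single synapse."""
    for _ in range(steps):
        v = w * x
        s = heaviside(v)
        # Away from the threshold the true derivative of the step is zero.
        ds_dv = surrogate_grad(v) if use_surrogate else 0.0
        w -= lr * (s - target) * ds_dv * x
    return w

w_true = train_weight(0.2, use_surrogate=False)  # stuck: no gradient signal
w_surr = train_weight(0.2, use_surrogate=True)   # climbs until the neuron fires

print(w_true, heaviside(w_true))
print(round(w_surr, 3), heaviside(w_surr))
```

With the true derivative the weight never moves, because the error has no gradient to follow; with the surrogate the weight is pushed across the threshold within a handful of steps, after which the error, and hence the update, is zero.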

There are alternative approaches to training SNNs with gradients. We can use, for example, the adjoint state variables derived via Pontryagin’s minimum principle; however, such methods remain computationally rather intensive and, unsurprisingly, have not caught on.

Recent advances

Recurrence

The gradient is not the only problem that we face when trying to train SNNs. Since these networks are a special kind of temporal RNN, it should not be surprising to encounter vanishing and exploding gradients when using backpropagation through time (BPTT). As a remedy, one of the NMHW platforms offers eligibility propagation (e-prop) for on-device training. Another, more biologically plausible approach, though perhaps limited in terms of architecture, is equilibrium propagation (EP), where the network uses two relaxation phases, each with a different energy function, to determine the parameter update without explicitly computing the gradient. Perhaps the most viable BPTT alternative is forward propagation through time (FPTT): by eliminating the costly propagation of gradients through a temporally unrolled network, it achieves a significant reduction in both memory footprint and training time. FPTT means augmenting the loss function with a crafted dynamic regularizer term, and then limiting the gradient to just one step. The authors report experiments with significantly higher accuracies than for reference RNN networks (especially LSTM and GRU) trained with BPTT. The same strategy was successfully applied to spiking networks.

Residual Networks and Transformers

Vanishing gradients mean that local minima are separated by long plateaus that are not feasible to cross. In SNNs, even shallow ones, the issue is particularly pronounced due to the aforementioned non-differentiability, which makes even the surrogate gradients very localized. In deep architectures, the remedy was found in the form of residual connections, which directly link distant parts of the network, allowing signals to “skip” the heavy sub-modules until they are actually needed. This mechanism was quickly adapted to SNNs, with image classification results similar to those of ANNs. Since then, several teams have reported improvements on this subject.

In the temporal domain, instead of residuals, a somewhat different approach proved successful: the transformer. By utilizing attention across long time intervals, these architectures are able to avoid information decay due to unrolling through time, and hence keep the gradients alive. Two years after residual networks, spiking self-attention was demonstrated. And this year, a very nice, fully spike-based approach of stochastic spiking attention was proposed and implemented on an FPGA, beating a GPU ~20 times in terms of power consumption, and outdoing earlier implementations along the way.

Reversible Blocks

Deep Transformers are beasts so memory-hungry that training them directly would be prohibitively expensive. In order to perform the back-propagation step it is necessary to keep track of all the network states from input up to output, which can quickly exhaust your memory budget. The idea introduced in the NICE trick was to make the architecture reversible, so that instead of using memorized states, we could easily re-compute them when needed. This was one of the breakthroughs that allowed present LLMs to flourish. Soon afterwards, reversible blocks in recurrent architectures were proposed, and recently, such reversible blocks were also implemented with spikes in both Transformers and RNNs.
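The reversibility trick rests on additive coupling, which can be sketched in a few lines. The example below uses the two-way variant commonly found in reversible residual networks, which is an assumption about the exact form used in the works mentioned above; the sub-networks `f` and `g` are arbitrary placeholder functions and never need to be invertible themselves.

```python
import math

def f(x):
    """Placeholder sub-network (e.g. attention); any function works here."""
    return [math.tanh(2.0 * xi) for xi in x]

def g(x):
    """Placeholder sub-network (e.g. a feed-forward layer)."""
    return [0.5 * xi * xi for xi in x]

def forward(x1, x2):
    """Additive coupling: each half is updated using only the other half."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    """Recover the inputs from the outputs by undoing the updates in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2], [0.7, 0.1]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(r1, r2)  # matches x1, x2 up to floating-point rounding
```

Because the inputs of every block can be recomputed from its outputs, the backward pass does not need the stored activations of the whole stack, trading memory for a second evaluation of `f` and `g`.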

The downside

There is a somewhat worrying issue: many of these architectures employ operations like batch-norm, which are neither genuinely local and event-driven, nor implementable on the current neuromorphic hardware. Non-local potentials like the average and standard deviation present in normalization layers could perhaps be compared to neurotransmitter modulation in the brain, a mechanism that is beyond the ability of our spherical cow. Its absence means that no network using batch-norm, neither the ResNet nor the Transformer, could be trained on NMHW.

Another issue is the inference of probabilities. If anybody hoped to decipher the neural language by observing communication in artificial spiking networks spontaneously self-organizing into an optimal code, then the current state of the art could be disillusioning. It turns out that the vast majority of works use the unimpressive rate code. But who knows, perhaps if left talking to themselves, artificial SNNs may some day discover better ways of exchanging information.

Perspectives

Every new topic in science and engineering starts with a phase of exponential growth in interest. As the arXiv service testifies, that seems to be the case with spiking neural networks and neuromorphic hardware, which are on an exponential upward curve in the number of publications.

Engineering follows shortly after research, and in the last year alone we have seen probably as many applications as in the previous two decades combined, for example: reinforcement learning, eye-tracking, micro-gesture recognition, audio compression, audio source localisation, voice activity detection, Hebbian learning in lateral circuits, or quadratic programming. If this trend continues, in a year or two NMHW platforms will hit the market and enable low-power SNN AI on mobile mini-devices. And services offering online access to NMHWs, not only for development work but also for cloud AI, should appear soon afterwards.
