Actual performance data on current ASIC processes is hard to find, but even with the quite old and slow Skywater 130 nm process, you can make a 2.5+ GHz oscillator.[1] I strongly suspect that with the current 3 nm process you could easily do a 4LUT and latch that could clock well above 10 GHz. If you tiled those in a grid (a BitGrid)... you've got a non-Von Neumann general purpose compute engine that can clock above 10 GHz, without the need for photonics.
It's only when you expect data to be able to cross a chip in a single clock cycle that you need to slow down to the 5 GHz or so that CPUs have trouble exceeding.
The idea of RAM itself is the bottleneck. If you can load data in one end of a process, and get results out the other end, without ever touching RAM, you can do wonders.
[1] https://github.com/adithyasunil26/2.87GHz-MWG-MPW5
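To make the "4LUT and latch" idea concrete, here is a minimal software sketch of one such cell. The class and constant names (`LutCell`, `XOR4`, `AND4`) are illustrative assumptions, not taken from any BitGrid implementation, and the model says nothing about achievable clock speed.

```python
class LutCell:
    """One 4-input lookup table followed by a latch (hypothetical model)."""

    def __init__(self, truth_table):
        # 16-bit truth table: bit i is the output for input pattern i.
        self.truth_table = truth_table & 0xFFFF
        self.latch = 0

    def clock(self, inputs):
        """On a clock edge, look up the four input bits and latch the result."""
        index = sum((bit & 1) << i for i, bit in enumerate(inputs))
        self.latch = (self.truth_table >> index) & 1
        return self.latch


XOR4 = 0x6996  # parity of the four inputs
AND4 = 0x8000  # 1 only when all four inputs are 1

parity_cell = LutCell(XOR4)
and_cell = LutCell(AND4)
```

Because each cell only talks to its own inputs and its own latch, a grid of them never needs a signal to travel farther than one neighbor per clock, which is the property the comment relies on.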
I don't think any CPUs depend on data crossing them in a clock cycle, though? The reason to clock lower is so you can do powerful arithmetic in a cycle or two. If you have to build math units out of 4LUTs or 6LUTs your latency is going to suck and your space efficiency won't be very good either.
Something I've never quite understood is where, on the spectrum of mainstream vs niche, in-memory computing approaches lie. What are the proposed use cases?
I understand that you can get highly power efficient XORs, for example. But if we go down this path, would they help with a matrix multiply? Or the bias term of an FFN? Would there be any improvement (i.e. is there anything to offload) in regular business logic? Should I think of it as a more efficient but highly limited DSP? Or a fixed function accelerator replacement (e.g. "we want to encrypt this segment of memory")?
The main promises in optical computing are energy consumption, latency and single core speed.
For example, in this work Lin, Z., Shastri, B.J., Yu, S. et al. 120 GOPS Photonic tensor core in thin-film lithium niobate for inference and in situ training. Nat Commun 15, 9081 (2024). https://doi.org/10.1038/s41467-024-53261-x
they achieve a "weight update speed of 60 GHz", which is much faster than the average ~3-4 GHz CPU clock.
The von Neumann architecture is not ideal for all use cases; ML training and inference is hugely memory bound and a ton of energy is spent moving network weights around for just a few OPs. Our own squishy neural networks can be viewed as a form of in-memory computing: synapses both store network properties and execute the computation (there's no need to read out synapse weights for calculation elsewhere).
It's still very niche but could offer enormous power savings for ML inference.
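One way to see the "memory bound" point is to count operations per byte of weights moved. The sketch below assumes a plain dense matrix-vector (or small-batch) product where every weight is fetched from memory once per pass; the function name and the example sizes are hypothetical.

```python
def ops_per_weight_byte(rows, cols, batch, bytes_per_weight):
    """Arithmetic intensity of Y = W @ X, counting weight traffic only.

    Assumes each weight is read once and reused across the batch;
    activations and outputs are ignored for simplicity.
    """
    ops = 2 * rows * cols * batch            # one multiply and one add per weight per sample
    weight_bytes = rows * cols * bytes_per_weight
    return ops / weight_bytes


single = ops_per_weight_byte(4096, 4096, batch=1, bytes_per_weight=2)
batched = ops_per_weight_byte(4096, 4096, batch=8, bytes_per_weight=2)
```

With a batch of one, every byte of weights fetched supports only about one operation, so the cost of moving data dominates; computing where the weights are stored removes that traffic.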
Sooner or later we'll get an NRAM (neural RAM) as an extension, which is basically this neuromorphic lattice that can be wired at a very low level, perhaps also at the photonic level, and then the whole AI thing trains/lives in it.
IBM is experimenting in this direction, or at least they claim to here: https://www.ibm.com/think/topics/neuromorphic-computing
There is another CPU that was recently featured, which again has a lattice that is sort of an FPGA but very fast, where different modules are loaded with tasks and each marble pumps data to another, while the orchestrator decides how and what goes into each of these.
You're referring to Evolution, seems to be a CGRA
https://news.ycombinator.com/item?id=44685050
Yes, thank you. So much news these months.
I keep thinking of a DRAM with a row of MAC units and registers along the row outputs. A vector is then an entire DRAM row. Access takes longer than the math, so slower/smaller multi-cycle circuits could be used. This would probably require OS-level allocation of vectors in DRAM, and management of the accumulator vector (it really should be a row, but we need a huge register to avoid extra reads and writes). The DRAM will also need some kind of command interface.
Perhaps one use case is fully homomorphic encryption (FHE), which performs functions on encrypted data without ever decrypting it. This paper appears to be about how in-memory processing can improve the performance of FHE: https://arxiv.org/abs/2311.16293
This might not be used in actual computing the way you're thinking; it might end up in a network switch or transceiver, increasing speeds and reducing power usage.
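A toy model of the DRAM-with-MACs idea, as a sketch only: one row holds a whole vector, an operand register and an accumulator register sit along the row outputs, and a tiny command interface drives them. Every name here (`MacRowDram`, `load_operand`, `mac`, `ROW_WIDTH`) is a hypothetical choice for illustration.

```python
ROW_WIDTH = 8  # elements per DRAM row (assumed; real rows are far wider)


class MacRowDram:
    def __init__(self, num_rows):
        self.rows = [[0] * ROW_WIDTH for _ in range(num_rows)]
        self.operand = [0] * ROW_WIDTH  # register along the row outputs
        self.acc = [0] * ROW_WIDTH      # row-wide accumulator register

    def write_row(self, r, values):
        assert len(values) == ROW_WIDTH
        self.rows[r] = list(values)

    def load_operand(self, r):
        """Command: open row r and copy it into the operand register."""
        self.operand = list(self.rows[r])

    def clear_acc(self):
        """Command: zero the accumulator register."""
        self.acc = [0] * ROW_WIDTH

    def mac(self, r):
        """Command: open row r and do acc[i] += operand[i] * row[i] per column."""
        self.acc = [a + x * y for a, x, y in zip(self.acc, self.operand, self.rows[r])]

    def read_acc(self):
        return list(self.acc)


dram = MacRowDram(num_rows=4)
dram.write_row(0, [1, 2, 3, 4, 5, 6, 7, 8])
dram.write_row(1, [2] * ROW_WIDTH)
dram.load_operand(0)
dram.clear_acc()
dram.mac(1)
dram.mac(1)
```

Keeping the accumulator in a register rather than in a row is what avoids the extra row reads and writes between successive `mac` commands.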
This came from funding from the following DARPA program: [0]
(I did a Google search on the grant acknowledged in the paper; I have no connection to it.)
[0] https://sam.gov/opp/e0fb2b2466cd470481b0ca5cab3d210d/view
Geez, if this works, it makes TCAMs free.
Ouch, found the killer: it takes up 0.1 mm^2 in area. That's a showstopper. Hopefully they can scale it down or use it for server infra.
I don't understand how that is a showstopper.
> bitcell achieves at least 10 GHz read, write, and compute operations entirely in the optical domain
> Validated on GlobalFoundries' 45SPCLO node
> X-pSRAM consumed 13.2 fJ energy per bit for XOR computation
Don't only think about area.
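For context on why an XOR-capable bitcell is relevant to the TCAM remark earlier in the thread: a ternary match is an XOR of the key against the stored value, masked by the "care" bits. The sketch below is a generic software model of that comparison, not the paper's circuit, and `tcam_lookup` is a hypothetical name.

```python
def tcam_lookup(entries, key):
    """Return the index of the first entry matching key, or None.

    Each entry is (value, care_mask); bits where care_mask is 0 are "don't care".
    A hardware TCAM compares all entries in parallel; this loop is sequential.
    """
    for i, (value, care_mask) in enumerate(entries):
        if ((key ^ value) & care_mask) == 0:
            return i
    return None


entries = [
    (0b1010, 0b1111),  # exact match on 1010
    (0b1000, 0b1100),  # matches 10xx
]
```

For example, `tcam_lookup(entries, 0b1011)` misses the first entry and hits the second, because only the top two bits of that entry are compared.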
Memories are about density. If my memory isn't playing tricks on me, a TCAM cell is about ~300-400 F^2, where F is the feature size of the node. On a per-bit level, that means this bit is about 4E10 times bigger.
Put another way, the TLB in a CPU is relatively small and definitely a hotspot, but you could estimate the TLB in a CPU at ≈ 0.0003 – 0.002 mm^2, which is ~50-300 times smaller than the single bit in this paper. To get to 10 GHz we could just make 10 copies of an existing TLB operating at 1 GHz and still have a ton of headroom. There is also an electro-optical conversion penalty that you need to take into account with most optical systems.
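A rough way to check these area comparisons, as a sketch rather than a definitive figure: the snippet assumes F = 45 nm (taken from the 45SPCLO quote; the comment does not say which F it used) and otherwise uses the 0.1 mm^2, 300-400 F^2, and TLB numbers from the thread. The TCAM ratio is very sensitive to the assumed F, so it may not match the multiplier quoted above.

```python
BITCELL_AREA_MM2 = 0.1          # photonic bitcell area quoted in the thread
FEATURE_NM = 45                 # assumption: F taken from the 45 nm node
TCAM_CELL_F2 = (300, 400)       # TCAM cell size in F^2, as recalled in the comment
TLB_AREA_MM2 = (0.0003, 0.002)  # TLB area estimate from the comment

NM2_PER_MM2 = 1e12              # 1 mm = 1e6 nm, so 1 mm^2 = 1e12 nm^2


def tcam_cell_area_mm2(f2, feature_nm=FEATURE_NM):
    """Area of one TCAM cell in mm^2 for a cell size given in units of F^2."""
    return f2 * feature_nm ** 2 / NM2_PER_MM2


# How many times larger the photonic bitcell is than one TCAM cell.
ratios_vs_tcam = [BITCELL_AREA_MM2 / tcam_cell_area_mm2(f2) for f2 in TCAM_CELL_F2]

# How many times larger the photonic bitcell is than an entire TLB.
ratios_vs_tlb = [BITCELL_AREA_MM2 / area for area in TLB_AREA_MM2]

print(ratios_vs_tcam, ratios_vs_tlb)
```

Under the 45 nm assumption the per-bit ratio lands on the order of 10^5; a smaller F shrinks the TCAM cell and pushes the ratio higher.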
Not trying to be a Debbie downer. It's a cool result, no doubt incredibly useful for fully optical systems. Probably something really useful here for optical switching at the datacenter infrastructure level.