"I believe there are two main things holding it back."
He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.
I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.
What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.
I will take how things are today over how things used to be in a heartbeat. I really believe I need to spend two weeks requiring students to write code on an Amiga, with all the programs running at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.
One of the most important steps of my career was being forced to write code for an 8051 microcontroller. Then writing firmware for an ARM microcontroller to make it pretend it was that same 8051 microcontroller.
I was made to witness the horrors of archaic computer architecture in such depth that I could reproduce them on totally unrelated hardware.
I tell students today that the best way to learn is by studying the mistakes others have already made. Dismissing the solutions they found isn’t being independent or smart; it’s arrogance that sets you up to repeat the same failures.
Sounds like you had a good mentor. Buy them lunch one day.
I had a similar experience. Our professor in high school would have us program a Z80 system entirely by hand: flow chart, assembly code, computing jump offsets by hand, writing the hex code by hand (looking up opcodes from the Z80 data sheet), and then loading the opcodes one byte at a time on a hex keypad.
It took three hours and four of us to code an integer division start to finish (we were like 17, though).
The amount of understanding it gave has been unrivalled so far.
> I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with.
So the designers of the Cell processor made some mistakes and therefore the entire concept is bunk? Because you've seen a concept done badly, you can't imagine it done well?
To be clear, I'm not criticising those designers, they probably did a great job with what they had, but technology has moved on a long way from then... The theoretical foundations for memory models, etc. are much more advanced. We've figured out how to design languages to be memory safe without significantly compromising on performance or usability. We have decades of tooling for running and debugging programs on GPUs and we've figured out how to securely isolate "users" of the same GPU from each other. Programmers are as abstracted from the hardware as they've ever been with emulation of different architectures so fast that it's practical on most consumer hardware.
None of the things you mentioned are inherently at odds with more parallel computation. Whether something is a good idea can change. At one point in time electric cars were a bad idea. Decades of incremental improvements to battery and motor technology means they're now pretty practical. At one point landing and reusing a rocket was a bad idea. Then we had improvements to materials science, control systems, etc. that collectively changed the equation. You can't just apply the same old equation and come to the same conclusion.
> and we've figured out how to securely isolate "users" of the same GPU from each other
That's the problem, isn't it.
I don't want my programs to act independently; they need to exchange data with each other (copy-paste, drag and drop). Also, I cannot do many things in parallel. Some things must be done sequentially.
So, there are books out there. I use Computer Architecture: A Quantitative Approach by Hennessy and Patterson. Recent revisions have removed historical information. I understand why they removed it. I wanted to use Stallings' book, but the department had already made arrangements with the publisher.
The biggest reason we don't write books is that people don't buy them. They take the PDF and stick it on GitHub. Publishers don't respond to authors' takedown requests, GitHub doesn't care about authors, so why spend the time publishing a book? We can chase grant money instead. I'm fortunate enough to not have to chase grant money.
While financial incentives are important to some, a lot of people write books to share their knowledge and give the book out for free. I think more people are doing this now, and there are also open collaborative textbook projects.
And I personally think that it is weird to write books during your working hours and also get money from selling that book.
This is the most ignorant response I've seen yet. We don't expect monetary gain from publishing a book. We expect our costs to be covered.
This is about the consumer, not the publisher. If we lived in a socialist system, they would still pirate our publications and we would still be in debt over it.
> What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on.
Isn't it much more plausible that the people who love to play with exotic (or retro), complicated architectures (in this case with high-performance opportunities) are different people from those who love to "set up or work in an assembly line for shipping stable software"?
> I really believe I need to spend two weeks requiring students to write code on an Amiga, with all the programs running at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.
I rather believe that among those who love this kind of programming, a hatred for the incompetent fellow students would develop (including wishes that they be weeded out by brutal exams).
Those students would all drop out and start meditating. That would be a fun course. Speed run developing for all the prickly architectures of the 80s and 90s.
I loved and really miss the cell. It did take quite a bit of work to shuffle things in and out of the SPUs correctly (so yeah, it took much longer to write code and greater care), but it really churned through data.
We had a generic job mechanism with the same restrictions on all platforms. This usually meant if it ran at all on Cell it would run great on PC because the data would generally be cache friendly. But it was tough getting the PowerPC to perform.
I understand why the PS4 was basically a PC after that - because it's easier. But I wish there were still SPUs off to the side to take advantage of. Happy to have them off-die like GPUs are.
> They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences.
Is there any reason why GPU-style parallelism couldn't have memory protection?
Do you mean accessing data outside of your app's framebuffer, or just accessing neighboring pixels during a shader pass? Because those are _very_ different things. GPU MMUs mean that you can't access a buffer that doesn't belong to your app; that's it. It's not about restricting pixel access within your own buffers.
On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM. Do that stuff in software not hardware.
You could also do things like having the JIT optimize the entire running system dynamically like one program, eliminating syscall and context switch overhead not to mention most MMU overhead.
Would it be faster? Maybe. The JIT would have to generate its own safety and bounds checking stuff. I’m sure some work loads would benefit a lot and others not so much.
What it would do is allow CPUs to be simpler, potentially resulting in cheaper lower power chips or more cores on a die with the same transistor budget. It would also make portability trivial. Port the core kernel and JIT and software doesn’t care.
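To give a rough flavor of "the JIT generates its own safety and bounds checking": here is a minimal C++ sketch of the kind of software-guarded memory access a WASM-style runtime could emit instead of relying on MMU page protection. The `LinearMemory` type and the trap behavior are illustrative assumptions, not any particular VM's implementation.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Sketch of a WASM-style "linear memory": every access goes through a
// software bounds check, the kind of guard a JIT could inline (and often
// elide when it can prove an index is in range), instead of relying on
// per-process MMU protection.
struct LinearMemory {
    std::vector<uint8_t> bytes;

    uint32_t load32(uint32_t addr) const {
        if (uint64_t(addr) + 4 > bytes.size())  // bounds check done in software
            std::abort();                       // stand-in for a VM "trap"
        uint32_t v;
        std::memcpy(&v, bytes.data() + addr, 4);
        return v;
    }
};
```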
> On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM.
GPU drivers take SPIR-V code (either "kernels" for OpenCL/SYCL drivers, or "shaders" for Vulkan Compute), which is not that different, at least in principle. There is also an LLVM-based software implementation that will just compile your SPIR-V code to run directly on the CPU.
What the ever loving hell, it was a perfectly reasonable idea in response to another idea.
They weren't saying it should be done, and went out of the way to make it explicit that they are not claiming it would be better.
It was a thought exploration, and a valid one, even if it would not pan out if carried all the way to execution at scale. Yes it was handwaving. So what? All ideas start as mere thoughts, and it is useful, productive, and interesting to trade them back and forth in these things called conversations. Even "fantasy" and "handwavy" ones. Hell especially those. It's an early stage in the pollination and generation of new ideas that later become real engineering. Or not, either way the conversation and thought was entertaining. It's a thing humans do, in case you never met any or aren't one yourself.
The brainstorming was a hell of a lot more valid, interesting, and valuable than this shit. "Just go away" indeed.
Have people really never used a higher level execution environment?
The JVM and the CLR are the most popular ones. Have people never looked at their internals? Then there's the LISP machines, Erlang, Smalltalk, etc., not to mention a lot of research work on abstract machines that just don't have the problems you get with direct access to pointers and hardware.
Some folks in the graphics programming community are allergic to these kinds of modern ideas.
They are now putting up with JITs in GPGPU, thanks to market pressure from folks using languages like Julia and Python, who would rather keep using those languages than rewrite their algorithms in C or C++.
These are communities where even adopting C over assembly, and C++ over C, has been an uphill battle; something like a JIT is like calling for the pitchforks and torches.
By the way, one of the key languages used on the Connection Machine mentioned in the article was StarLisp.
I'm going to call this out. The entire post obviously has bucketloads of aggression, which can be taken as just communication style, but the last line was just uncalled for.
I have seen you make high quality responses to crazy posts.
> I really believe I need to spend two weeks requiring students to write code on an Amiga, with all the programs running at the same time. If any one of them crashes, they all fail my course.
The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.
- You need to compile shader source/bytecode at runtime; you can't just "run" a program.
- On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.
- You need to synchronize data access between CPU-GPU and GPU workloads.
- You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.
- You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.
What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
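To make the complaints above concrete, this is roughly what the "printer over a COM port" dance looks like with the plain OpenCL C API: compile the kernel source at runtime, copy the data across, launch, synchronize, copy back. A hedged sketch with error handling omitted, not code for any particular vendor's stack:

```cpp
#include <CL/cl.h>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    cl_int err;

    // 1. Discover a device and build all the machinery around it.
    cl_platform_id platform;  clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // 2. Compile the kernel source *at runtime* -- you can't just "run" a program.
    const char* src =
        "__kernel void scale(__global float* x) {"
        "    size_t i = get_global_id(0); x[i] *= 2.0f;"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);

    // 3. The GPU can't touch the CPU's data structures: copy everything over.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, data.size() * sizeof(float), nullptr, &err);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, data.size() * sizeof(float), data.data(), 0, nullptr, nullptr);

    // 4. Launch, then explicitly synchronize and copy the results back.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = data.size();
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, data.size() * sizeof(float), data.data(), 0, nullptr, nullptr);
    clFinish(q);
    return 0;
}
```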
>What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does.
For "embarrassingly parallel" jobs vector extensions start to eat tiny bits of the GPU pie.
Unfortunately, just slapping thousands of cores works poorly in practice. You quickly get into the synchronization wall caused by unified memory. GPUs cleverly work around this issue by using numerous tricks often hidden behind extremely complex drivers (IIRC CUDA exposes some of this complexity).
The future may be in a more explicit NUMA, i.e. in the "network of cores". Such hardware would expose a lot of cores with their own private memory (explicit caches, if you will) and you would need to explicitly transact with the bigger global memory. But, unfortunately, programming for such hardware would be much harder (especially if code has to be universal enough to target different specs), so I don't have high hopes for such paradigm to become massively popular.
Seems to me there's a trend toward applying explicit distributed systems: networks of small-SRAM cores, each with some SIMD, explicit high-bandwidth message passing between them, maybe some specialized ASICs such as tensor cores, FFT blocks... Looking at Tenstorrent, Cerebras, even Kalray, and out of the CUDA/GPU world, accelerators seem to be converging a bit. We're going to need a whole lot of tooling, hopefully relatively 'meta'.
Networks of cores ... Congrats you have just taken a computer and shrunk it so there are many on a single chip ... Just gonna say here AWS does exactly this network of computers thing ... Might be profitable
What I want is a linear algebra interface, as Gilbert Strang taught it. I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.
I'm not willing to even know about the HW at all; the higher-level my code, the more opportunities for the JIT to optimize it.
What I really want is something like Mathematica that can JIT to GPU.
As another commenter mentioned all the API's assume you're a discrete GPU off the end of a slow bus, without shared memory. I would kill for an APU that could freely allocate memory for GPU or CPU and change ownership with the speed of a pagefault or kernel transition.
To expand on this link, this is probably the closest you're going to get to 'I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.' right now. JAX implements a good portion of the Numpy interface - which is the most common interface for linear algebra-heavy code in Python - so you can often just write Numpy code, but with `jax.numpy` instead of `numpy`, then wrap it in a `jax.jit` to have it run on the GPU.
You can have that today. Just go out and buy more CPUs until they have enough cores to equal the number of SMs in your GPU (or memory bandwidth, or whatever). The problem is that the overhead of being general purpose -- prefetch, speculative execution, permissions, complex shared cache hierarchies, etc -- comes at a cost. I wish it was free, too. Everyone does. But it just isn't. If you have a workload that can jettison or amortize these costs due to being embarrassingly parallel, the winning strategy is to do so, and those workloads are common enough that we have hardware for column A and hardware for column B.
Larrabee was something like that; it didn't take off.
IMHO, the real issue is cache coherence. GPUs are spared from doing a lot of extra work by relaxing coherence guarantees quite a bit.
Regarding the vendor situation - that's basically how most of computing hardware is, save for the PC platform. And this exception is due to Microsoft successfully commoditizing their complements (which caused quite some woe on the software side back then).
Is cache coherence a real issue, absent cache contention? AIUI, cache coherence protocols are sophisticated enough that they should readily adapt to workloads where the same physical memory locations are mostly not accessed concurrently except in pure "read only" mode. So even with a single global address space, it should be possible to make this work well enough if the programs are written as if they were running on separate memories.
It is, because cache coherence requires extra communication to make sure that the caches are coherent. There are cute strategies for reducing the traffic, but ultimately you need to broadcast reservations to all of the other cache-coherent nodes, so there's an N^2 scaling at play.
> What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory...
I too am very interested in this model. The Linux kernel supports up to 4,096 cores [1] on a single machine. In practice, you can rent a c7a.metal-48xl [2] instance on AWS EC2 with 192 vCPU cores. As for programming models, I personally find the Java Streams API [3] extremely versatile for many programming workloads. It effectively gives a linear speedup on serial streams for free (with some caveats). If you need something more sophisticated, you can look into OpenMP [4], an API for shared-memory parallelization.
I agree it is time for some new ideas in this space.
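For reference, a minimal sketch of the OpenMP shared-memory style mentioned above: all cores address the same arrays and the runtime carves up the loop across threads. Compile with `-fopenmp` (GCC/Clang); the array sizes and arithmetic here are just placeholders.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    // Every thread sees the same memory; OpenMP just splits the index range.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i) {
        a[i] = a[i] * 3.0 + b[i];
        sum += a[i];
    }

    std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}
```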
Yep, and those printers are proprietary and mutually incompatible, and there are buggy, mutually incompatible serial drivers on all the platforms, which results in unique code paths, debugging, and workarounds for app-breaking bugs for each (platform, printer brand, printer model year) tuple combo.
(That was idealized - actually there may be ~5 alternative driver APIs even on a single platform each with its own strengths)
I really would like you to sketch out the DX you are expecting here, purely for my understanding of what it is you are looking for.
I find needing to write separate code in a different language annoying, but the UX of it is very explicit about what is happening in memory, which is very useful. With really high-performance compute across multiple cores, ensuring you don't get arbitrary cache misses is a pain. If we could address CPUs like we address current GPUs (well, you can, but it's not generally done), it would make things much, much simpler.
Want to alter something in parallel? Copy it to memory allocated to a specific core, which is guaranteed to only be addressed by that core, and then do the operations on it.
To do that currently, you need to be pedantic about alignment, manually indicate thread affinity to the scheduler, etc., which is entirely as annoying as GPU programming.
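As a concrete (Linux-specific) sketch of that pedantry: pin a worker to one core and pad its private state to a cache line so no other core shares the line. The struct, core count, and workload are illustrative assumptions, not a recommendation.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pad each worker's private state to a cache line so cores don't
// accidentally share (and ping-pong) the same line.
struct alignas(64) WorkerState {
    double partial_sum = 0.0;
};

void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Linux/glibc-specific: restrict this thread to a single core.
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    const int cores = 4;  // illustrative
    std::vector<WorkerState> state(cores);
    std::vector<std::thread> workers;
    for (int c = 0; c < cores; ++c) {
        workers.emplace_back([&state, c] {
            for (int i = 0; i < 1000000; ++i)
                state[c].partial_sum += i * 0.5;  // touches only "its" cache line
        });
        pin_to_core(workers.back(), c);
    }
    for (auto& t : workers) t.join();
    return 0;
}
```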
Your wish sounds to me a lot like Larrabee/Xeon Phi or manycore CPUs. Maybe I am misunderstanding something, but it sounds like a good idea to me and I don’t totally see why it inherently can’t compete with GPUs.
I think Intel should have made more of an effort to get cheap Larrabee boards to developers; they could have been ones with chips that had some broken cores or couldn't make the design speed.
Doesn't matter. The issues you raise are abstractable at the language level, or maybe even the runtime. Unfortunately, there are others, like which of the many kinds of parallelism to use (ILP, thread, vector/SIMD, distributed memory with much lower performance, etc.), that are harder to hide behind a compiler with acceptable performance.
Having worked for a company that made a "hundreds of small CPUs on a single chip", I can tell you now that they're all going to fail because the programming model is too weird, and nobody will write software for them.
Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.
The key transformation required to make any parallel architecture work is going to be taking a program that humans can understand, and translating it into a directed acyclic graph of logical Boolean operations. This type of intermediate representation could then be broken up into little chunks for all those small CPUS. It could be executed very slowly using just a few logic gates and enough ram to hold the state, or it could run at FPGA speeds or better on a generic sea of LUTs.
Mill Computing's proposed architecture is more like VLIW, with lots of custom "tricks" in the ISA and programming model to make it nearly as effective as the usual out-of-order execution, than a "generic sea" of small CPUs. VLIW CPUs are far from 'tiny' in a general sense.
Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Imagine a sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors. The programming for this, even as virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.
> Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
> Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Modern parallel scheduling systems still have "queues" to manage these concerns; they're just handled in software, with patterns like "work stealing" that describe what happens when unexpected mismatches in execution time must somehow be handled. Even your "sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors" has queues, only the queue is called a "pipeline" and a mismatch in execution speed leads to "pipeline bubbles" and "stalls". You can't really avoid these issues.
The CM architecture or programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was, however, much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.
Yep, transputers failed miserably. I wrote a ton of code for them. Everything had to be solved over a serial bus, which defeated the purpose of the transputer.
Not in those terms, but an autobiography is coming, and bits and pieces are being explained. I expect about 10 people to buy the book, as all of the socialists will want it for free. I am negotiating with a publisher as we speak on the terms.
Could you elaborate on this? How does many-small-CPUs make for a weirder programming model than a GPU?
I'm no expert, but I've done my fair share of parallel HPC stuff using MPI, and a little bit of CUDA. And to me the GPU programming model is far, far "weirder" and harder to code for than the many-CPUs model. (Granted, I'm assuming you're describing a different regime?)
In CUDA you don't really manage the individual compute units, you start a kernel, and the drivers take care of distributing that to the compute cores and managing the data flows between them.
When programming CPUs however you are controlling and managing the individual threads. Of course, there are libraries which can do that for you, but fundamentally it's a different model.
The GPU equivalent of a single CPU "hardware thread" is called a "warp" or a "wavefront". GPUs can run many warps/wavefronts per compute unit by switching between warps to hide memory access latency. A CPU core can do this with two hardware threads, using Hyperthreading/2-way SMT, and some CPUs have 4-way SMT, but GPUs push that quite a bit further.
What you say has nothing to do with CPU vs. GPU, or with CUDA, which is basically equivalent to the older OpenMP.
When you have a set of concurrent threads, each thread may run a different program. There are many applications where this is necessary, but such applications are hard to scale to very high levels of concurrency, because each thread must be handled individually by the programmer.
Another case is when all the threads run the same program, but on different data. This is equivalent to a concurrent execution of a "for" loop, which is always possible when the iterations are independent.
The execution of such a set of threads that execute the same program has been named "parallel DO instruction" by Melvin E. Conway in 1963, "array of processes" by C. A. R. Hoare in 1978, "replicated parallel" in the Occam programming language in 1985, SPMD around the same time, "PARALLEL DO" in the OpenMP Fortran language extension in 1997, "parallel for" in the OpenMP C/C++ language extension in 1998, and "kernel execution" in CUDA, which has also introduced the superfluous acronym SIMT to describe it.
When a problem can be solved by a set of concurrent threads that run the same program, then it is much simpler to scale the parallelism to extremely high levels and the parallel execution can usually be scheduled by a compiler or by a hardware controller without the programmer having to be concerned with the details.
There is no inherent difficulty in making a compiler that provides exactly the same programming model as CUDA, but which creates a program for a CPU, not for a GPU. Such compilers exist, e.g. ispc, which is mentioned in the parent article.
The difference between GPUs and CPUs is that the former appear to have some extra hardware support for what you describe as "distributing that to the compute cores and managing the data flows between them", but nobody is able to tell exactly what is done by this extra hardware support and whether it really matters, because it is a part of the GPUs that has never been documented publicly by the GPU vendors.
From the point of view of the programmer, this possible hardware advantage of the GPUs does not really matter, because there are plenty of programming language extensions for parallelism and libraries that can take care of the details of thread spawning and work distribution over SIMD lanes, regardless if the target is a CPU or a GPU.
Whenever you write a program equivalent to a "parallel for", which is the same as writing for CUDA, you do not manage individual threads, because what you write, the "kernel" in CUDA lingo, can be executed by thousands of threads, also on a CPU, not only on a GPU. A desktop CPU like the Ryzen 9 9950X has the same product of threads by SIMD lanes as a big integrated GPU (obviously, discrete GPUs can be many times bigger).
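A small illustration of that point in standard C++: the "kernel" below is just a function of the element index, and the parallel algorithms library plays the role of the launch machinery, spreading it over however many hardware threads (and, with `par_unseq`, possibly SIMD lanes) the CPU has. This is a sketch of the programming model, not a performance claim.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);

    // The "kernel": one element's worth of work, written once,
    // executed by however many threads/lanes the runtime decides to use.
    auto kernel = [&](std::size_t i) { y[i] = 2.0f * x[i] + y[i]; };

    // Parallel + vectorizable execution policy: a CPU-side analogue of
    // launching the kernel over a 1-D grid.
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(), kernel);
    return 0;
}
```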
I'm guessing you just don't have the computational power to compete with a real GPU. It would be relatively easy for a top-end graphics programmer to write the front-end graphics API for your chip. I'm guessing that if they did this, you would just end up with a very poor-performing GPU.
My take from reading this is that it's more about programming abstractions than any particular hardware instantiation. The part of the Connection Machine that remains interesting is not building machines with CPUs with transistor counts in the hundreds running off a globally synchronous clock, but that there was a whole family of SIMD languages that let you do general-purpose programming in parallel. And those languages were still relevant when the architecture changed to a MIMD machine with a bunch of vector units behind each CPU.
They aren't similar; they couldn't be more different. One is about lots of small threads of execution communicating with each other and synchronizing; the other is about a few instructions being run in parallel because, implicitly, the CPU has different pipelines inside.
They aren't just different, they are at completely opposite ends of the programming spectrum. They are literally the two extremes of trying to make throughput faster.
> The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
What other workloads would benefit from a GPU?
Computers are so fast that in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's because that program's code is particularly bad, and the solution to make the code less bad is simpler than re-writing it for the GPU.
For example, GUIs have had imperceptible latency to user input for over 20 years. If an app's GUI feels sluggish, the problem is that the app's actions and rendering aren't on separate coroutines, or the action's coroutine is blocking (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it is today; I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.
In some cases, parallelizing a task intrinsically makes it slower, because the number of sequential operations required to handle coordination means there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads but they only run on 8-16 processors, so the program would be faster if it spawned fewer threads, because it would still use all the processors.
I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.
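As a small aside on the oversubscription point: a common fix is simply to size the worker pool to the hardware rather than to the number of tasks. A hedged sketch (the task count and chunking are placeholders):

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // Spawn only as many workers as there are hardware threads, then hand
    // each a contiguous chunk of the task range instead of one thread per task.
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t tasks = 100000;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([=] {
            const std::size_t begin = tasks * w / workers;
            const std::size_t end   = tasks * (w + 1) / workers;
            for (std::size_t t = begin; t < end; ++t) {
                // ... do task t ...
            }
        });
    }
    for (auto& t : pool) t.join();
    std::printf("ran %zu tasks on %u workers\n", tasks, workers);
    return 0;
}
```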
Possibly compilation and linking. That's very slow for big programs like Chromium. There's really interesting work on GPU compilers (co-dfns and Voetter's work).
Optimization problems like scheduling and circuit routing. Search in theorem proving (the classical parts like model checking, not just LLM).
There's still a lot that is slow and should be faster, or at the very least made to run using less power. GPUs are good at that for graphics, and I'd like to see those techniques applied more broadly.
All of these things you mention are "thinking", meaning they require complex algorithms with a bunch of branches and edge cases.
The tasks that GPUs are good at right now - graphics, number crunching, etc. - are all very simple algorithms at the core (mostly elementary linear algebra), and the problems are, in most cases, embarrassingly parallel.
CPUs are not very good at branching either - see all the effort being put towards getting branch prediction right - but they are way better at it than GPUs. The main appeal of GPGPU programming is, in my opinion, that if you can get the CPU to efficiently divide the larger problem into a lot of small, simple subtasks, you can achieve faster speeds.
You mentioned compilers. For a related example, see all the work Daniel Lemire has been doing on SIMD parsing: the algorithms he (co)invented are all highly specialized to the language, and highly nontrivial. Branchless programming requires an entirely different mindset/intuition than "traditional" programming, and I wouldn't expect the average programmer to come up with such novel ideas.
A GPU is a specialized tool that is useful for a particular purpose, not a silver bullet to magically speed up your code. There is a reason that we are using it for its current purposes.
A big one is video encoding. It seems like GPUs would be ideal for it but in practice limitations in either the hardware or programming model make it hard to efficiently run on GPU shader cores. (GPUs usually include separate fixed-function video engines but these aren't programmable to support future codecs.)
Video encoding is done with fixed-function hardware for power efficiency. A popular new codec like H.26x appears every 5-10 years; there is no real need to support future ones.
Video encoding is two domains. And there's surprisingly little overlap between them.
You have your real time video encoding. This is video conferencing, live television broadcasts. This is done fixed-function not just for power efficiency, but also latency.
The second domain is encoding at rest. This is youtube, netflix, blu-ray, etc. This is usually done in software on the CPU for compression ratio efficiency.
The problem with fixed function video encoding is that the compression ratio is bad. You either have enormous data, or awful video quality, or both. The problem with software video encoding is that it's really slow. OP is asking why we can't/don't have the best of both worlds. Why can't/don't we write a video encoder in OpenCL/CUDA/ROCm. So that we have the speed of using the GPU's compute capability but compression ratio of software.
I haven't yet read the full blog post, but so far my response is: you can have this good parallel computer. See my previous HN comments from the past months on building an M4 Mac mini supercomputer.
For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets, the IOMMU, and the page tables that prevent you from programming all processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and making your own abstract-syntax-tree-to-assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.
There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need to have a large community of people who can run their code. Great projects die all the time because a slightly worse, but more ubiquitous technology prevents flowering of new approaches. There are economies of scale that feed back into ever-improving iterations of existing systems.
Simply porting existing successful codes from CPU to GPU can be a major undertaking and if there aren't any experts who can write something that drive immediate sales, a project can die on the vine.
See for example https://en.wikipedia.org/wiki/Cray_MTA. When I was first asked to try this machine, it was pitched as "run a million threads, the system will context switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it to GPUs.
AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.
I've found the best strategy is to target my development for what the high-end consumers are buying in 2 years - this is similar to many games, which launch with terrible performance on the fastest commercially available card, then run great 2 years later when the next gen of cards arrives ("Can it run Crysis?").
Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
Now, 3D renderers, we need all the help we can get.
In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.
Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust, incidentally, is an wrapper to extend that concept to cross-platform (Mac, Android, browsers, etc.)
While it seems you can write a general 3D renderer that works in a wide variety of situations, that does not work well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.
There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.
In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.
The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process.
This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.
Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
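A hedged sketch of what that lambda-based contract could look like (the names here are hypothetical, not from any existing engine): the renderer asks the caller to enumerate objects near each light, so the caller's spatial structures stay on the caller's side of the API.

```cpp
#include <functional>
#include <vector>

// Hypothetical types standing in for the renderer's view of the scene.
struct Light    { float pos[3]; float radius; };
struct ObjectId { unsigned index; };

// Supplied by the caller: "call `emit` once for every object within range of
// the light", using whatever spatial index the engine/game already maintains.
using ObjectsNearLight =
    std::function<void(const Light&, const std::function<void(ObjectId)>& emit)>;

struct Renderer {
    ObjectsNearLight query;  // provided by the caller at setup time

    void shade_lights(const std::vector<Light>& lights) {
        for (const Light& l : lights) {
            // Instead of O(lights * objects), only touch the objects the
            // caller's spatial structure says are in range of this light.
            query(l, [&](ObjectId obj) {
                // ... accumulate this light's contribution for `obj` ...
                (void)obj;
            });
        }
    }
};
```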
I think a dynamic, fully vector-based 2D interface with fluid zoom and transformations at 120Hz+ is going to need all the GPU help it can get. Take mapping as an example: even Google Maps routinely struggles with performance on a top-of-the-line iPhone.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.
A ton of 2D applications could benefit from further GPU parallelization. Games, GUIs, blurs & effects, 2D animations, map apps, text and symbol rendering, data visualization...
Canvas2D in Chrome is already hardware accelerated, so most users get better performance and reduced load on main UI & CPU threads out of the box.
Fast light transport is an incredibly hard problem to solve.
Raytracing (in its many forms) is one solution. Precomputing lightmaps, probes, occluder volumes, or other forms of precomputed visibility are another.
In the end it comes down to a combination of target hardware, art direction and requirements, and technical skill available for each game.
There's not going to be one general purpose renderer you can plug into anything, _and_ expect it to be fast, because there's no general solution to light transport and geometry processing that fits everyone's requirements. Precomputation doesn't work for dynamic scenes, and for large games leads to issues with storage size and workflow slow downs across teams. No precomputation at all requires extremely modern hardware and cutting edge research, has stability issues, and despite all that is still very slow.
It's why game engines offer several different forms of lighting methods, each with as many downsides as they have upsides. Users are supposed to pick the one that best fits their game, and hope it's good enough. If it's not, you write something custom (if you have the skills for that, or can hire someone who can), or change your game to fit the technical constraints you have to live with.
> Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
Some games may have their own acceleration structures. Some won't. Some will only have them on the GPU, not the CPU. Some will have an approximate structure used only for specialized tasks (culling, audio, lighting, physics, etc), and cannot be generalized to other tasks without becoming worse at their original task.
Fully generalized solutions will be slow but flexible, and fully specialized solutions will be fast but inflexible. Game design is all about making good tradeoffs.
The same argument could be made against Vulkan, or OpenGL, or even SQL databases. The whole NoSQL era was based on the concept that performance would be better with less generality in the database layer. Sometimes it helped. Sometimes trying to do database stuff with key/value stores made things worse.
I'm trying to find a reasonable medium. I have a hard scaling problem - big virtual world, dynamic content - and am trying to make that work well. If that works, many games with more structured content can use the same approach, even if it is overkill.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
You can have an unlimited number of polygons overlapping a pixel, for instance if you zoom out a lot. Imagine you converted a layer map of a modern CPU design to SVG and tried to open it in Inkscape. Or a map of NYC. Wouldn't you think a bit of extra processing power would be welcome?
At Vulkanised 2025 someone mentioned it is a HAL for writing GPU drivers, and they have acknowledged it has gotten as messy as OpenGL; there is now a plan in place to try to sort out the complexity mess.
> I believe there are two main things holding it back. One is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs … struggle when the workload is dynamic
This sacrifice is a purposeful cornerstone of what allows GPUs to be so high throughput in the first place.
It is odd that he talks about Larrabee so much, but doesn't mention the Xeon Phis. (Or is it Xeons Phi?)
> As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.
I’ve always been slightly annoyed by the concept of E cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E cores, give them their AVX-512 back, and give them higher throughput memory. Maybe try and pull the Phi trick of less OoO capabilities but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.
I've always wondered if you could use iGPU compute cores with unified memory as "transparent" E-cores when needed.
Something like OpenCL/CUDA except it works with pthreads/goroutines and other (OS) kernel threading primitives, so code doesn't need to be recompiled for it. Ideally the OS scheduler would know how to split the work, similar to how E-core and P-core scheduling works today.
I don't do HPC professionally, so I assume I'm ignorant to why this isn't possible.
It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
The "Larrabee New Instructions" is an instruction set that has been designed before AVX and also its first hardware implementation has been introduced before AVX, in 2010 (AVX was launched in 2011, with Sandy Bridge).
Unfortunately, while the hardware design of Sandy Bridge with the inferior AVX ISA was done by the Intel A team, the hardware implementations of Larrabee were done by some C or D teams, which were also not able to design new CPU cores for it but had to reuse some obsolete x86 cores, initially a Pentium core and later an Atom Silvermont core, to which the Larrabee instructions were grafted.
"Larrabee New Instructions" have been renamed to "Many Integrated Cores" ISA, then to AVX-512, while passing through 3 generations of chips, Knights Ferry, Knights Corner and Knights Landing. A fourth generation, Knights Mill, was only intended for machine learning/AI applications. The successor of Knights Landing has been Skylake Server, when the AVX-512 ISA has come to standard Xeons, marking the disappearance of Xeon Phi.
Already in 2013, Intel Haswell has added to AVX a few of the more important instructions that were included in the Larrabee New Instructions, but which were missing in AVX, e.g. fused multiply-add and gather instructions. The 3-address FMA format, which has caused problems to AMD, who had implemented in Bulldozer a 4-address format, has also come to AVX from Larrabee, replacing the initial 4-address specification.
At each generation until Skylake Server, some of the original Larrabee instructions have been deleted, by assuming that they might be needed only for graphics, which was no longer the intended market. However a few of those instructions were really useful for some applications in which I am interested, e.g. for computations with big numbers, so I regret their disappearance.
Since Skylake Server, there have been no other instruction removals, with the exception of those introduced by Intel Tiger Lake, which are now supported only by AMD Zen 5. A few days ago Intel committed to keeping complete compatibility in the future with the ISA implemented today by Granite Rapids, so there will be no other instruction deletions.
> It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
This is an odd claim. Clearly Xeon Phi is the shipping version of Larrabee, while Zen 4 is a completely different chip design that happens to run AVX-512. The first shipping Xeon Phi (Knights Corner) used the exact same P54C cores as Larrabee, while as you point out later versions of Xeon Phi switched to Atom.
It is extremely common to refer to all these as Larrabee, for example the Ian Cutress article on the last Xeon Phi chip was entitled "The Larrabee Chapter Closes: Intel's Final Xeon Phi Processors Now in EOL" [1]. Pat Gelsinger's recent interview at GTC [2] also refers to Larrabee. The section from around 44:00 has a discussion of workloads becoming more dynamic, and at 53:36 there's a section on Larrabee proper.
I think it is not right to say that Larrabee and Phi are as distant as Larrabee and Zen. But they did retreat a bit from the "graphics card"-like functionality and scale back the ambitions to become something a bit more familiar.
Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified - which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.
You can. There are API extensions for persistently mapping memory, and it's up to you to ensure that you never write to a buffer at the same time the GPU is reading from it.
At least for Vulkan/DirectX12. Metal is often weird, I don't know what's available there.
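For example, with Vulkan on unified-memory hardware you can pick a memory type that is both DEVICE_LOCAL and HOST_VISIBLE, map it once, and keep writing into it from the CPU. The sketch below omits buffer creation/binding and error handling, and assumes you handle the CPU-write vs. GPU-read hazards yourself, as noted above.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: allocate GPU-visible memory the CPU can write directly (typical on
// unified-memory parts), map it once, and keep the pointer around.
void* map_shared_allocation(VkPhysicalDevice phys, VkDevice device,
                            VkDeviceSize size, VkDeviceMemory* out_mem) {
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;

    uint32_t type = UINT32_MAX;
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
        if ((props.memoryTypes[i].propertyFlags & wanted) == wanted) { type = i; break; }

    VkMemoryAllocateInfo alloc{};
    alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.allocationSize = size;
    alloc.memoryTypeIndex = type;
    vkAllocateMemory(device, &alloc, nullptr, out_mem);

    // Persistent map: no per-frame copies, just write into this pointer.
    // Synchronizing with GPU reads is entirely your problem.
    void* ptr = nullptr;
    vkMapMemory(device, *out_mem, 0, VK_WHOLE_SIZE, 0, &ptr);
    return ptr;
}
```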
Let’s be honest, saying “just fix the page tables” is like telling someone they can fly if they “just rewrite gravity.”
Yes, on Apple Silicon, the hardware supports shared physical memory, and with enough “convincing”, you can rig up a contiguous virtual address space for both the CPU and GPU. Apple’s unified memory architecture makes that possible, but Apple’s APIs and memory managers don’t expose this easily or safely for a reason. You’re messing with MMU-level mappings on a tightly integrated system that treats memory as a first-class citizen of the security model.
Oh yes, I programmed all the Amiga models, mostly at the assembly level. I reprogrammed the ROMs. I also published a magazine on all the Commodore computers' internals and built lots of hardware for these machines.
We had the parallel Inmos Transputer systems during the heyday of the Amiga; they were much better designed than any of the custom Amiga chips.
Inmos was a disaster. No application ever shipped on one. EVER. It used a serial bus to resolve the problems that should have never been problems. Clearly you never wrote code for one. Each oslink couldn't reach more than 3 feet. What a disaster that entire architecture was.
I shipped 5 applications on an 800 Inmos Transputer supercomputer. Sold my parallel C compilers and macro assembler, as well as an OS, a Macintosh NuBus interface card, Transputer graphics cards, a full paper copier, and a laser printer. I know of dozens of successful products.
Hey don't shit on my retro alternative timeline nostalgia. We were all writing Lisp programs on 64 CPU Transputer systems with FPGA coprocessors, dynamically reconfigured in realtime with APL.
/s/LISP/Prolog and you've basically described the old "Fifth Generation" research project. Unfortunately it turns out that trying to parallelize Prolog is quite a nightmare, the language is really, really not built for it. So the whole thing was a dead-end in practice. Arguably we didn't have a real "fifth-gen" programming language prior to Rust, given how it manages to uniquely combine ease of writing parallel+concurrent code with bare-metal C like efficiency. (And Rust is now being used to parallelize database query, which comfortably addresses the actual requirement that Prolog had been intended for back then - performing "search" tasks on large and complex knowledge bases.)
I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.
I implemented some evolutionary computation stuff on the Cell BE in college. It was a really interesting machine and could be very fast for its time but it was somewhat painful to program.
The main cores were PPC and the Cell cores were… a weird proprietary architecture. You had to write kernels for them like GPGPU, so in that sense it was similar. You couldn’t use them seamlessly or have mixed work loads easily.
Larrabee and Xeon Phi are closer to what I’d want.
I’ve always wondered about many—many-many-core CPUs too. How many tiny ARM32 cores could you put on a big modern 5nm die? Give each one local RAM and connect them with an on die network fabric. That’d be an interesting machine for certain kinds of work loads. It’d be like a 1990s or 2000s era supercomputer on a chip but with much faster clock, RAM, and network.
Each AIE tile can stream 64 Gbps in and out and perform 1024 bit SIMD operations. Each shares memory with its neighbors and the streams can be interconnected in various ways.
Clearly the author never worked with a CM2 - I did though.
The CM2 was more like a co-processor which had to be controlled by a (for that age) rather beefy SUN workstation/server. The program itself ran on the workstation which then sent the data-parallel instructions to the CM2.
The CM2 was an extreme form of a SIMD design (that is why it was called data parallel). You worked with a large rectangular array (I cannot recall up to how many dimensions) whose size had to be a multiple of the number of physical processors (in your partition). All cells typically performed exactly the same operation. If you wanted to perform an operation on a subset, you had to "mask" the other cells (which were essentially idling during that time).
AMD Strix Halo APU is a CPU with very powerful integrated GPU.
It's faster at AI than an Nvidia RTX 4090, because 96GB of the 128GB can be allocated to the GPU memory space. This means it doesn't have the same swapping/memory thrashing that a discrete GPU experiences when processing large models.
16 CPU cores and 40 GPU compute units sounds pretty parallel to me.
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
Not definitely. The RTX 4090 definitely uses fast graphics RAM (though it is usually previous generation, but overclocked and on a very wide bus).
AMD Strix Halo definitely uses standard DDR5, which is not as fast.
And yes, the Strix Halo GPU uses "3D cache", but as officials said, the CPU doesn't have access to the GPU cache, because they "have not seen any app significantly benefit from such access".
So the internal SoC bus probably has less latency than a discrete GPU on PCIe, but it's not too different.
It looks like it will be available in the Framework Desktop! I would love to see it in a more budget mini PC at some point from another company. (Framework is great but not in my price range.)
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
I love AMD's Ryzen chips and will recommend their laptops over an Nvidia model all day. However, this is a pretty facetious comparison that falls apart when you normalize the memory. Any chip can be memory bottlenecked, and if we take away that arbitrary precondition the Strix Halo gets trounced in terms of compute capacity. You can look at the TDP of either chip and surmise this pretty easily.
> However, this is a pretty facetious comparison that falls apart when you normalize the memory
Why would you normalize though? You can't buy a 96 GB RTX4090. So it's fair to compare the whole deal, slowish APU with large RAM versus very fast GPU with limited RAM.
It is fair, it should just be contextualized with a comparison of 13B or 32B models as well. This is one of those Apple marketing moves where a very specific benchmark has been cherry-picked for a "2.2x improvement!" headline that people online misconstrue.
“AMD also claims its Strix Halo APUs can deliver 2.2x more tokens per second than the RTX 4090 when running the Llama 70B LLM (Large Language Model) at 1/6th the TDP (75W).”
This is still a memory-constrained benchmark. The smallest Llama 70B model (gguf-q2) doesn't fit in-memory so is bottlenecked by your PCIe connector. It's a valid benchmark, but it's still guilty of being stacked in the exact way I described before.
A comparison of 7B/13B/32B model performance would actually test the compute performance of either card. AMD is appealing to the consumers that don't feel served by Nvidia's gaming lineup, which is fine but also doomed if Nvidia brings their DGX Spark lineup to the mobile form factor.
Are you arguing for a better software abstraction, a different hardware abstraction or both? Lots of esoteric machines are name dropped, but it isn't clear how that helps your argument.
I think a stronger essay would at the end give the reader a clear view of what Good means and how to decide if a machine is closer to Good than another machine and why.
SIMD machines can be turned into MIMD machines. Even hardware problems still need a software solution. The hardware is there to offer the right affordances for the kinds of software you want to write.
Lots of words that are in the eye of the beholder. We need a checklist, or that Good parallel computer won't be built.
Personal opinion: it's the software (and software tooling).
The hardware is good enough (even if we're only talking 10x efficiency).
Part of the issue seems slightly cultural, i.e. repetitively putting down the idea of traditional task parallelism (not-super-SIMD/data parallelism) on GPUs.
Obviously, one would lose a lot of efficiency if we literally ran 1 thread per warp. But it could be useful for lightly-data-parallel tasks (like typical CPU vectorization), or maybe using warp-wide semantics to implement something like a "software" microcode engine. Dumb example: implementing division with long division using multiplications and shifts.
Other things a GPU gives: insanely high memory bandwidth, programmable cache (shared memory), and (relatively) great atomic operations.
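To make that dumb example concrete, here is a hedged sketch of "division as a software routine": classic shift-and-subtract long division built only from shifts, compares, and subtracts (rather than the multiply-based variant), the kind of primitive a "software microcode" layer could supply on hardware without an integer divide unit.

```cpp
#include <cstdint>

// Restoring long division: computes n / d using only shifts, compares, and
// subtracts.
uint32_t soft_div_u32(uint32_t n, uint32_t d) {
    if (d == 0) return UINT32_MAX;  // pick a convention for divide-by-zero
    uint32_t q = 0;
    uint64_t r = 0;
    for (int bit = 31; bit >= 0; --bit) {
        r = (r << 1) | ((n >> bit) & 1u);  // bring down the next bit
        if (r >= d) {                      // does the divisor fit?
            r -= d;
            q |= (1u << bit);
        }
    }
    return q;                              // remainder is left in r if needed
}
```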
> Are you arguing for a better software abstraction, a different hardware abstraction or both?
I don't speak for Raph, but imo it seems like he was arguing for both, and I agree with him.
On the hardware side, GPUs have struggled with dynamic workloads at the API level (not e.g. thread-level dynamism, that's a separate topic) for around a decade. Indirect commands gave you some of that so at least the size of your data/workload can be variable if not the workloads themselves, then mesh shaders gave you a little more access to geometry processing, and finally workgraphs and device generated commands lets you have an actually dynamically defined workload (e.g. completely skipping dispatches for shading materials that weren't used on screen this frame). However it's still very early days, and the performance issues and lack of easy portability are problematic. See https://interplayoflight.wordpress.com/2024/09/09/an-introdu... for instance.
On the software side shading languages have been garbage for far longer than hardware has been a problem. It's only in the last year or two that a proper language server for writing shaders has even existed (Slang's LSP). Much less the innumerable driver compiler bugs, lack of well defined semantics and memory model until the last few years, or the fact that we're still manually dividing work into the correct cache-aware chunks.
I had hoped the GPU API would go away, and the entire thing would become fully programmable, but so far we just keep using these shitty APIs and horrible shader languages.
Personally I would like to use the same language I write the application in to write the rendering code (C++). Preferably with shared memory, not some separate memory system that takes forever to transfer anything, either. Something along the lines of the new AMD 360 Max chips, but with graphics written in explicit C++.
I was always fascinated by the prospects of the 1024-core Epiphany-V from Parallella.. https://parallella.org/2016/10/05/epiphany-v-a-1024-core-64-...
But it seems whatever the DARPA connection was has led to it not being for scruffs like me and is likely powering god knows what military systems..
The problem isn't address space or program counters. It's that each processor is going to need instruction memory stored in SRAM or an extremely efficient multi port memory for a shared instruction cache.
GPUs get around this limitation by executing identical instructions over multiple threads.
Instructions are the problem, you have to have an architecture which just operates on data flows all in parallel and all at once, like an FPGA, but without all the fiddly special sauce parts.
There are designs like Tilera and Phalanx that have tons of cores. Then, NUMA machines used to have 128-256 sockets in one machine with coherent memory. The SGI machines let you program them like it was one machine. Languages like Chapel were designed to make parallel programming easier.
Making more things like that at the lowest possible unit prices could help a lot.
It lacks support for the serial portions of the execution graph, but yes. You should play around with ONNX, it can be used for a lot more than just ML stuff.
If we had distributed operating systems and SSI kernels, your computer could use the idle cycles of other computers [that aren't on battery power]. People talk about a grid of solar houses, but we could've had personal/professional grid computing like 15 years ago. Nobody wanted to invest in it, I guess because chips kept getting faster.
SSI is an interesting idea, but the actual advantage is mostly to improve efficiency when running your distributed code on a single, or few nodes. You still have to write your code with some very real awareness of the relevant issues when running on many nodes, but now you are also free to "scale down" and be highly efficient on a single node, since your code is still "natively" written for running on that kind of system. You are not going to gain much by opportunistically running bad single-node codes on larger systems, since that will be quite inefficient anyway.
Also, running a large multi-node SSI system means you mostly can't partition those nodes ever, otherwise the two now-separated sets of nodes could both progress in ways that cannot be cleanly reconciled later. This is not what people expect most of the time when connecting multiple computers together.
You could say the same thing about multiple cores or CPUs. A lot of people write apps that aren't useful past a single core or CPU. Doesn't mean we don't build OSes & hardware for multiple cores... (Remember back when nobody had an SMP kernel, because, hey, who the hell's writing their apps for more than one CPU?! Our desktops aren't big iron!)
In the worst-case, your code is just running on the CPU you already have. If you have another node/CPU, you can schedule your whole process on that one, which frees up your current CPU for more work. If you design your app to be more scalable to more nodes/CPUs, you get more benefits. So even in the worst case, everything would just be... exactly the way it is today. But there are many cases that would be benefited, and once the platform is there, more people would take advantage of it.
There is still a massive opportunity in general parallel computing that we haven't explored. Plenty of research, but along specific kinds of use cases, and with not nearly enough investment, so the little work that got done took decades. I think we could solve all the problems and make it generally useful, which could open up a whole new avenue of computing / applications, the way more bandwidth did.
(I'm referring to consumer use-cases above, but in the server world alone, a distributed OS with simple parallel computing would transform billion-dollar markets in software, making a whole lot of complicated solutions obsolete. It might take a miracle for the code to get adopted upstream by the Linux Mafia, though)
> It might take a miracle for the code to get adopted upstream by the Linux Mafia, though
The basic building block is containerization/namespacing, which has been adopted upstream. If your app is properly containerized, you can use the CRIU featureset (which is also upstream) to checkpoint it and migrate it to another node.
What about unified memory? I know these APUs are slower than traditional GPUs but still it seems like the simpler programming model will be worth it.
The biggest problem is that most APUs don't even support full unified memory (system SVM in OpenCL). From my research only Apple M series, some Qualcomm Adreno and AMD APUs support them.
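For reference, here is roughly what the simpler model buys you where it is supported. This is a minimal CUDA sketch using managed memory (NVIDIA's closest widely available analogue to system SVM, not the same thing): on an APU the pointer really is the same physical memory, while on a discrete card the driver migrates pages behind the scenes, so it is convenient but not free.

#include <cstdio>
#include <cuda_runtime.h>

// Double every element in place; the same pointer is valid on host and device.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation visible to both CPU and GPU: no explicit staging buffers or copies.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // plain CPU writes
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU reads/writes the same pointer
    cudaDeviceSynchronize();                         // wait before touching it on the CPU again

    printf("data[0] = %f\n", data[0]);               // prints 2.0
    cudaFree(data);
    return 0;
}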
I wonder if CDN server applications could use something like this, if every core had a hardware TCP/TLS stack and there was a built-in IP router to balance the load, or something like that.
There's a lot here that seems to misunderstand GPUs and SIMD.
Note that raytracing is a very dynamic problem, where the GPU isn't sure if a ray hits a geometry or if it misses. When it hits, the ray needs to bounce, possibly multiple times.
Various implementations of raytracing, recursion, dynamic parallelism, or whatever. It's all there.
Now the software / compilers aren't ready (outside of specialized situations like Microsoft's DirectX Raytracing, which compiles down to a very intriguing threading model). But what was accomplished with DirectX can be done in other situations.
-------
Connection Machine is before my time, but there's no way I'd consider that 80s hardware to be comparable to AVX2 let alone a modern GPU.
Connection Machine was a 1-bit computer for crying out loud, just 4096 of them in parallel.
Xeon Phi (70 core Intel Atoms) is slower and weaker than 192 core Modern EPYC chips.
-------
Today's machines are better. A lot better than the past machines. I cannot believe any serious programmer would complain about the level of parallelism we have today and wax poetic about historic and archaic computers.
The problems I'm having are very different from those for raytracing. Sure, it's dynamic, but at a fine granularity, so the problems you run into are divergence, and often also wanting function pointers, which don't work well in a SIMT model. By contrast, the way I'm doing 2D there's basically no divergence (monoids are cool that way), but there is a need to schedule dynamically at a coarser (workgroup) level.
But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
The problem with the GPU raytracing work is that they built hardware and driver support for the specific problem, rather than more general primitives on which you could build not only raytracing but other applications. The same story goes for video encoding. Continuing that direction leads to unmanageable complexity.
Of course today's machines are better, they have orders of magnitude more transistors, and crystallize a ton of knowledge on how to build efficient, powerful machines. But from a design aesthetic perspective, they're becoming junkheaps of special-case logic. I do think there's something we can learn from the paths not taken, even if, quite obviously, it doesn't make sense to simply duplicate older designs.
> But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
True. Allocation just seems to be a "forced sequential" operation. A "stop the world, figure out what RAM is available" kind of thing.
If you can work with pre-allocated buffers, then GPUs work by reading from lists (consume operations), and then outputting to lists (append operations). Which can be done with gather / scatter, or more precisely stream-expansion and stream-compaction in a grossly parallel manner.
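A minimal CUDA sketch of the append side, with made-up names and a made-up survival test: each thread that produces a result claims a unique slot with an atomic counter and scatters its element there, and the next launch reads the counter to know how much work survived. Production versions usually aggregate the atomic per warp or per block (or use a scan) to cut contention, but the idea is the same.

#include <cuda_runtime.h>

// Stream-compaction style "append buffer".
__global__ void compact(const float* __restrict__ in, int in_count,
                        float* __restrict__ out, int* out_count,
                        float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= in_count) return;

    float v = in[i];
    if (v > threshold) {                     // "this element still needs processing"
        int slot = atomicAdd(out_count, 1);  // claim a slot in the append list
        out[slot] = v;                       // scatter the surviving element there
    }
}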
---------
If that's not enough "memory management" for you, then yeah, CPU is the better device to work with. At which point I again point back to the 192-core EPYC Zen5c example, we have grossly parallel CPUs today if you need them. Just a few clicks away to rent from cloud providers like Amazon or Azure.
GPUs are good at certain things (and I consider them the pinnacle of "Connection Machine" style programming; today's GPUs are just far more parallel, far easier to program and far faster than the old 1980s stuff).
Some problems cannot be split up (ex: web requests are so unique I cannot imagine they'd ever be programmed into a GPU due to their divergence). However CPUs still exist for that.
> But the biggest problem I'm having is management of buffer space for intermediate objects
My advice for right now (barring new APIs), if you can get away with it, is to pre-allocate a large scratch buffer for as big of a workload as you will have over the program's life, and then have shaders virtually sub-allocate space within that buffer.
If you are willing to have lots of loops inside of a shader (not always possible due to Windows's 2 second maximum), you can write while(hits_array is not empty) kind of code, allowing your 1024-thread workgroup to keep pulling hits and efficiently processing all of the rays recursively.
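A hedged CUDA sketch of that pattern (names, sizes, and the bounce heuristic are invented for illustration). To stay simple and race-free it keeps the drain loop on the host and ping-pongs two pre-allocated append buffers rather than looping inside the kernel; note that the "allocation" is nothing more than an atomicAdd on a counter into pre-allocated storage, which is also how sub-allocating from one big scratch buffer works.

#include <cuda_runtime.h>

struct Ray { float energy; };

// Each thread handles one ray from the current list; rays that survive the bounce
// are appended to the next list by claiming a slot with atomicAdd (the "push").
__global__ void bounce(const Ray* in, int in_count, Ray* out, int* out_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= in_count) return;

    Ray r = in[i];
    r.energy *= 0.5f;                        // stand-in for "shade the hit"
    if (r.energy > 0.01f) {                  // still alive: schedule another bounce
        out[atomicAdd(out_count, 1)] = r;
    }
}

// Host-side drain loop: ping-pong two buffers until no work is left. Assumes both
// buffers were pre-allocated large enough for the worst-case wavefront of work.
void drain(Ray* d_bufA, Ray* d_bufB, int* d_count, int initial_count) {
    Ray* cur = d_bufA;
    Ray* nxt = d_bufB;
    int count = initial_count;
    while (count > 0) {
        cudaMemset(d_count, 0, sizeof(int));
        bounce<<<(count + 255) / 256, 256>>>(cur, count, nxt, d_count);
        cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);  // implicit sync
        Ray* tmp = cur; cur = nxt; nxt = tmp;  // swap producer/consumer roles
    }
}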
--------
The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:
if (func1 needs to be called next){
    push(func1, dataToContinue);
} else if (func2 needs to be called next){
    push(func2, dataToContinue);
} else if (func3 needs to be called next){
    push(func3, dataToContinue);
} else if (func4 needs to be called next){
    push(func4, dataToContinue);
} else if (func5 needs to be called next){
    push(func5, dataToContinue);
}
Now of course we can't grow "too far", GPUs can't handle divergence very well. But for "small" numbers of next-arrays and "small" amounts of divergence (ie: I'm assuming that func1 is the most common here, like 80%+ so that the buffers remain full), then this technique works.
If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (ex: any HTTP server code will be awful on GPUs) and you're forced to use a CPU.
Having talked to many engineers using distributed compute today, they seem to think that (single-node) parallel compute hasn't changed much since ~2010 or so.
It's quite frustrating, and exacerbated by frequent intro-level CUDA blog posts which often just repeat what they've read.
re: raytracing, this might be crazy but, do you think we could use RT cores to accelerate control flow on the GPU? That would be hilarious!
But there is seemingly a generalization here to the Raytracing software ecosystem. I dunno how much software / hardware needs to advance here, but we are at the point where Intel RT cores are passing the stack pointers / instruction pointers between shaders (!!!). Yes through specialist hardware but surely this can be generalized to something awesome in the future?
------
For now, I'm happy with stream expansion / stream compaction and looping over consume buffers and producer/append buffers.
I have never done GPU programming or graphics, but what feels frustrating looking from the outside is that the designs and constraints seem so arbitrary. They don't feel like they come from actual hardware constraints/problems. It just looks like pure path dependency going all the way back to the fixed-function days, with tons of accidental complexity and half-finished generalizations ever since.
"I believe there are two main things holding it back."
He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.
I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.
What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.
I will take how things are today over how things used to be in a heart beat. I really believe I need to spend 2-weeks requiring students write code on an Amiga, and the programs have to run at the same time. If anyone of them crashes, they all will fail my course. A new found appreciation may flourish.
One of the most important steps of my career was being forced to write code for an 8051 microcontroller. Then writing firmware for an ARM microcontroller to make it pretend it was that same 8051 microcontroller.
I was made to witness the horrors of archaic computer architecture in such depth that I could reproduce them on totally unrelated hardware.
I tell students today that the best way to learn is by studying the mistakes others have already made. Dismissing the solutions they found isn’t being independent or smart; it’s arrogance that sets you up to repeat the same failures.
Sounds like you had a good mentor. Buy them lunch one day.
I had a similar experience. Our professor in high school would have us program a z80 system entirely by hand: flow chart, assembly code, computing jump offsets by hand, writing the hex code by hand (looking up op-codes from the z80 data sheet) and the loading the opcodes one byte at the time on a hex keypads.
It took three hours and your of us to code an integer division start to finish (we were like 17 though).
The amount of understanding it gave has been unrivalled so far.
> I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with.
So the designers of the Cell processor made some mistakes and therefore the entire concept is bunk? Because you've seen a concept done badly, you can't imagine it done well?
To be clear, I'm not criticising those designers, they probably did a great job with what they had, but technology has moved on a long way from then... The theoretical foundations for memory models, etc. are much more advanced. We've figured out how to design languages to be memory safe without significantly compromising on performance or usability. We have decades of tooling for running and debugging programs on GPUs and we've figured out how to securely isolate "users" of the same GPU from each other. Programmers are as abstracted from the hardware as they've ever been with emulation of different architectures so fast that it's practical on most consumer hardware.
None of the things you mentioned are inherently at odds with more parallel computation. Whether something is a good idea can change. At one point in time electric cars were a bad idea. Decades of incremental improvements to battery and motor technology means they're now pretty practical. At one point landing and reusing a rocket was a bad idea. Then we had improvements to materials science, control systems, etc. that collectively changed the equation. You can't just apply the same old equation and come to the same conclusion.
> and we've figured out how to securely isolate "users" of the same GPU from each other
That's the problem, isn't it.
I don't want my programs to act independently; they need to exchange data with each other (copy-paste, drag and drop). Also I cannot do many things in parallel. Some things must be done sequentially.
> There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.
Nobody teaches it, and nobody writes books about it (not that anyone reads anymore)
So, there are books out there. I use Computer Architecture: A Quantitative Approach by Hennessy and Patterson. Recent revisions have removed historical information. I understand why they did remove it. I wanted to use Stallings book, but the department had already made arrangements with the publisher.
The biggest reason we don't write books is that people don't buy them. They take the PDF and stick it on GitHub. Publishers don't respond to authors' takedown requests, GitHub doesn't care about authors, so why spend the time publishing a book? We can chase grant money instead. I'm fortunate enough to not have to chase grant money.
While financial incentives are important to some, a lot of people write books to share their knowledge and give them out for free. I think more people are doing this now, and there are also open collaborative textbook projects.
And I personally think that it is weird to write books during your working hours and also get money from selling that book.
"financial incentives"
This is the most ignorant response I've seen yet. We don't expect monetary gain from publishing a book. We expect our costs to be covered.
This is about the consumer, not the publisher. If we lived in a socialist system, they would still pirate our publications and we would still be in debt over it.
That's a financial incentive, I'm not sure what your rejection is exactly.
> What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on.
Isn't it much more plausible that the people who love to play with exotic (or also retro), complicated architectures (with in this case high performance opportunities) are different people than those who love to "set up or work in an assembly line for shipping stable software"?
> I really believe I need to spend 2-weeks requiring students write code on an Amiga, and the programs have to run at the same time. If anyone of them crashes, they all will fail my course. A new found appreciation may flourish.
I rather believe that among those who love this kind of programming, a hatred for the incompetent fellow students will develop (including wishes that they get weeded out by brutal exams).
The problem is that the exotic complexity enthusiasts cluster in places like HN and sometimes they overwhelm the voices of reason.
Those students would all drop out and start meditating. That would be a fun course. Speed run developing for all the prickly architectures of the 80s and 90s.
I see what you did there.
Guru meditation, for the uninitiated.
I loved and really miss the Cell. It did take quite a bit of work to shuffle things in and out of the SPUs correctly (so yeah, it took much longer to write code and required greater care), but it really churned through data.
We had a generic job mechanism with the same restrictions on all platforms. This usually meant if it ran at all on Cell it would run great on PC because the data would generally be cache friendly. But it was tough getting the PowerPC to perform.
I understand why the PS4 was basically a PC after that - because it's easier. But I wish there were still SPUs off to the side to take advantage of. Happy to have it off-die like GPUs are.
> They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences.
Is there any reason why GPU-style parallelism couldn't have memory protection?
It does. GPUs have full MMUs.
They do? Then how do i do the forbidden stuff by accessing neighboring pixel data?
Do you mean accessing data outside of your app's framebuffer, or just accessing neighboring pixels during a shader pass? Those are _very_ different things. GPU MMUs mean that you can't access a buffer that doesn't belong to your app; that's it. It's not about restricting pixel access within your own buffers.
TIL. Thank you.
Have you actually done that recently?
On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM. Do that stuff in software not hardware.
You could also do things like having the JIT optimize the entire running system dynamically like one program, eliminating syscall and context switch overhead not to mention most MMU overhead.
Would it be faster? Maybe. The JIT would have to generate its own safety and bounds checking stuff. I’m sure some work loads would benefit a lot and others not so much.
What it would do is allow CPUs to be simpler, potentially resulting in cheaper lower power chips or more cores on a die with the same transistor budget. It would also make portability trivial. Port the core kernel and JIT and software doesn’t care.
> On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM.
GPU drivers take SPIR-V code (either "kernels" for OpenCL/SYCL drivers, or "shaders" for Vulkan Compute) which is not that different at least in principle. There is also a LLVM-based soft-implementation that will just compile your SPIR-V code to run directly on the CPU.
We end up relying on software for this so much anyway. Your examples plus the use of containers and the like at OS level.
"The birth and death of JavaScript"
[flagged]
What the ever loving hell, it was a perfectly reasonable idea in response to another idea.
They weren't saying it should be done, and went out of the way to make it explicit that they are not claiming it would be better.
It was a thought exploration, and a valid one, even if it would not pan out if carried all the way to execution at scale. Yes it was handwaving. So what? All ideas start as mere thoughts, and it is useful, productive, and interesting to trade them back and forth in these things called conversations. Even "fantasy" and "handwavy" ones. Hell especially those. It's an early stage in the pollination and generation of new ideas that later become real engineering. Or not, either way the conversation and thought was entertaining. It's a thing humans do, in case you never met any or aren't one yourself.
The brainstorming was a hell of a lot more valid, interesting, and valuable than this shit. "Just go away" indeed.
It wasn't handwaving or brainstorming. Microsoft even built a research OS like this:
https://www.microsoft.com/en-us/research/project/singularity...
Have people really never used a higher level execution environment?
The JVM and the CLR are the most popular ones. Have people never looked at their internals? Then there's the LISP machines, Erlang, Smalltalk, etc., not to mention a lot of research work on abstract machines that just don't have the problems you get with direct access to pointers and hardware.
Some folks in the graphics programming community are allergic to these kind of modern ideas.
They are now putting up with JITs in GPGPU, thanks to market pressure from folks using languages like Julia and Python, who would rather keep using those languages than rewrite their algorithms in C or C++.
These are communities where even adopting C over Assembly, and C++ over C, has been an uphill battle; something like a JIT is like calling for the pitchforks and torches.
By the way, one of the key languages used in the Connection Machine mentioned on the article was StarLisp.
https://en.wikipedia.org/wiki/*Lisp
I'm going to call this out. The entire post obviously has bucket loads of aggression, which can be taken as just communication style, but the last line was just uncalled for.
I have seen you make high quality responses to crazy posts.
Do better.
Don't worry, with LLMs, we're moving away from anything that remotely looks like "stable software" :)
Also, yeah, I recall the dreaded days of cooperative multitasking between apps. Moving from Windows 3.x to Linux was a revelation.
With LLMs it is just more visible. When the age of "updates" began, the age of stable software died.
True. The quality of code yielded by LLMs would have been deemed entirely unacceptable 30 years ago.
> I really believe I need to spend 2-weeks requiring students write code on an Amiga, and the programs have to run at the same time. If anyone of them crashes, they all will fail my course.
Fortran is memory-safe, right? ;-)
The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.
- You need to compile shader source/bytecode at runtime; you can't just "run" a program.
- On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.
- You need to synchronize data access between CPU-GPU and GPU workloads.
- You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.
- You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.
What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
>What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does.
For "embarrassingly parallel" jobs vector extensions start to eat tiny bits of the GPU pie.
Unfortunately, just slapping thousands of cores works poorly in practice. You quickly get into the synchronization wall caused by unified memory. GPUs cleverly work around this issue by using numerous tricks often hidden behind extremely complex drivers (IIRC CUDA exposes some of this complexity).
The future may be in a more explicit NUMA, i.e. in the "network of cores". Such hardware would expose a lot of cores with their own private memory (explicit caches, if you will) and you would need to explicitly transact with the bigger global memory. But, unfortunately, programming for such hardware would be much harder (especially if code has to be universal enough to target different specs), so I don't have high hopes for such paradigm to become massively popular.
It’s weird that no one mentioned xeon phi cards… that’s essentially what they were. Up to 188 (iirc?) x86 atom cores, fully generically programmable.
I consider Xeon Phi to be the shipping version of Larrabee. I've updated the post to mention it.
Seems to me there's a trend of applying explicit distributed systems (network of small-SRAM-ed cores each with some SIMD, explicit high-bandwidth message-passing between them, maybe some specialized ASICs such as tensor cores, FFT blocks...) looking at tenstorrent, cerebras, even kalray... out of the CUDA/GPU world, accelerators seem to be converging a bit. We're going to need a whole lot of tooling, hopefully relatively 'meta'.
Networks of cores ... Congrats you have just taken a computer and shrunk it so there are many on a single chip ... Just gonna say here AWS does exactly this network of computers thing ... Might be profitable
What I want is a Linear Algebra interface - As Gilbert Strang taught it. I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.
I'm not willing to even know about the HW at all, the higher level my code the more opportunities for the JIT to optimize my code.
What I really want is something like Mathematica that can JIT to GPU.
As another commenter mentioned all the API's assume you're a discrete GPU off the end of a slow bus, without shared memory. I would kill for an APU that could freely allocate memory for GPU or CPU and change ownership with the speed of a pagefault or kernel transition.
> What I really want is something like Mathematica that can JIT to GPU.
https://juliagpu.org/
https://github.com/jax-ml/jax
To expand on this link, this is probably the closest you're going to get to 'I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.' right now. JAX implements a good portion of the Numpy interface - which is the most common interface for linear algebra-heavy code in Python - so you can often just write Numpy code, but with `jax.numpy` instead of `numpy`, then wrap it in a `jax.jit` to have it run on the GPU.
I was about to say that it is literally just Jax.
It genuinely deserves to exist alongside pytorch. It's not just Google's latest framework that you're forced to use to target TPUs.
Like, PyTorch? And the new Mac minis have 512gb of unified memory
You can have that today. Just go out and buy more CPUs until they have enough cores to equal the number of SMs in your GPU (or memory bandwidth, or whatever). The problem is that the overhead of being general purpose -- prefetch, speculative execution, permissions, complex shared cache hierarchies, etc -- comes at a cost. I wish it was free, too. Everyone does. But it just isn't. If you have a workload that can jettison or amortize these costs due to being embarrassingly parallel, the winning strategy is to do so, and those workloads are common enough that we have hardware for column A and hardware for column B.
> The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths.
To me it feels somewhat like programming for the segmented memory model with its near and far pointers, back in the old days. What a nightmare.
Larrabee was something like that; it didn't take off.
IMHO, the real issue is cache coherence. GPUs are spared from doing a lot of extra work by relaxing coherence guarantees quite a bit.
Regarding the vendor situation - that's basically how most of computing hardware is, save for the PC platform. And this exception is due to Microsoft successfully commoditizing their complements (which caused quite some woe on the software side back then).
Is cache coherence a real issue, absent cache contention? AIUI, cache coherence protocols are sophisticated enough that they should readily adapt to workloads where the same physical memory locations are mostly not accessed concurrently except in pure "read only" mode. So even with a single global address space, it should be possible to make this work well enough if the programs are written as if they were running on separate memories.
It is because cache coherence requires extra communication to make sure that the cache is coherent. There are cute strategies for reducing the traffic, but ultimately you need to broadcast out reservations to all of the other cache-coherent nodes, so there's an N^2 scaling at play.
I miss, not exactly Larrabee, but what it could have become. I want just an insane number of very fast, very small cores with their own local memory.
In the field usually nothing takes off on the first attempt, so this is just a reason to ask "what's different this time" on the following attempts.
> What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory...
I too am very interested in this model. The Linux kernel supports up to 4,096 cores [1] on a single machine. In practice, you can rent a c7a.metal-48xl [2] instance on AWS EC2 with 192 vCPU cores. As for programming models, I personally find the Java Streams API [3] extremely versatile for many programming workloads. It effectively gives a linear speedup on serial streams for free (with some caveats). If you need something more sophisticated, you can look into OpenMP [4], an API for shared-memory parallelization.
I agree it is time for some new ideas in this space.
[1]: https://www.phoronix.com/news/Perf-Support-2048-To-4096-Core...
[2]: https://aws.amazon.com/ec2/instance-types/c7a/
[3]: https://docs.oracle.com/en/java/javase/24/docs/api/java.base...
[4]: https://docs.alliancecan.ca/wiki/OpenMP
Yep, and those printers are proprietary and mutually incompatible, and there are buggy mutually incompatible serial drivers on all the platforms which results in unique code paths and debugging & workarounds for app breaking bugs for each (platform, printer brand, printer model year) tuple combo.
(That was idealized - actually there may be ~5 alternative driver APIs even on a single platform each with its own strengths)
I really would like you to sketch out the DX you are expecting here, purely for my understanding of what it is you are looking for.
I find needing to write separate code in a different language annoying, but the UX of it is very explicit about what is happening in memory, which is very useful. With really high performance compute across multiple cores, ensuring you don't get arbitrary cache misses is a pain. If we could address CPUs like we address current GPUs (well, you can, but it's not generally done) it would make it much, much simpler.
Want to alter something in parallel? Copy it to memory allocated to a specific core, which is guaranteed to only be addressed by that core, and then do the operations on it.
To do that currently you need to be pedantic about alignment and manually indicate thread affinity to the scheduler, etc., which is entirely as annoying as GPU programming.
Your wish sounds to me a lot like Larrabee/Xeon Phi or manycore CPUs. Maybe I am misunderstanding something, but it sounds like a good idea to me and I don’t totally see why it inherently can’t compete with GPUs.
I think Intel should have made more of an effort to get cheap Larrabee boards to developers, they could have been ones with chips that had some broken cores or unable to make the design speed.
RAM size seems to have been a problem; the lowest-end Phi only had 6 GB of GDDR5 for its 57 cores (228 threads).
Doesn't matter. The issues you raise are abstractable at the language level, or maybe even the runtime. Unfortunately there are others, like which of the many kinds of parallelism to use (ILP, thread, vector/SIMD, distributed memory with much lower performance, etc.), that are harder to hide behind a compiler with acceptable performance.
So greenarrays F18? :)
Please explain how these "worker cores" should operate.
"want to protect their turd" - golden!
Having worked for a company that made a "hundreds of small CPUs on a single chip", I can tell you now that they're all going to fail because the programming model is too weird, and nobody will write software for them.
Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.
The key transformation required to make any parallel architecture work is going to be taking a program that humans can understand, and translating it into a directed acyclic graph of logical Boolean operations. This type of intermediate representation could then be broken up into little chunks for all those small CPUS. It could be executed very slowly using just a few logic gates and enough ram to hold the state, or it could run at FPGA speeds or better on a generic sea of LUTs.
This sounds like graph reduction as done by https://haflang.github.io/ and that flavor of special purpose CPU.
The downside of reducing a large graph is the need for high bandwidth low latency memory.
The upside is that tiny CPUs attached directly to the memory could do reduction (execution).
Reminds me of Mill Computing's stuff.
https://millcomputing.com/
Mill Computing's proposed architecture is more like VLIW with lots of custom "tricks" in the ISA and programming model to make it nearly as effective as the usual out-of-order execution than a "generic sea" of small CPU's. VLIW CPU's are far from 'tiny' in a general sense.
Like interaction nets?
Isn't that the Connection Machine architecture?
Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Imagine a sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors. The programming for this, even as virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.
> Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
> Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Modern parallel scheduling systems still have "queues" to manage these concerns; they're just handled in software, with patterns like "work stealing" that describe what happens when unexpected mismatches in execution time must somehow be handled. Even your "sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors" has queues, only the queue is called a "pipeline" and a mismatch in execution speed leads to "pipeline bubbles" and "stalls". You can't really avoid these issues.
The CM architecture or programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was however much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.
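To be fair, the individual primitives are all available piecemeal on modern GPUs through libraries, even if the language is gone. A minimal sketch with Thrust (the values and the flags/gather example are made up for illustration):

#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/gather.h>

int main() {
    // Blelloch-style exclusive scan: offsets[i] = number of kept elements before i.
    std::vector<int> h_flags = {1, 0, 1, 1, 0, 1};
    thrust::device_vector<int> flags(h_flags.begin(), h_flags.end());
    thrust::device_vector<int> offsets(flags.size());
    thrust::exclusive_scan(flags.begin(), flags.end(), offsets.begin());

    // Generalized gather: out[i] = data[idx[i]], computed on the device.
    std::vector<int> h_data = {10, 20, 30, 40, 50, 60};
    std::vector<int> h_idx  = {0, 2, 3, 5};
    thrust::device_vector<int> data(h_data.begin(), h_data.end());
    thrust::device_vector<int> idx(h_idx.begin(), h_idx.end());
    thrust::device_vector<int> out(idx.size());
    thrust::gather(idx.begin(), idx.end(), data.begin(), out.begin());

    printf("kept before last element: %d\n", (int)offsets.back());  // 3
    printf("gathered[1] = %d\n", (int)out[1]);                       // 30
    return 0;
}

What's missing relative to *Lisp and the CM is the whole-tensor surface syntax and the nearest-neighbor shift operations as first-class citizens, which is the "language like this" part.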
I would not complain about getting my own personal Connection Machine.
So long as Tamiko Thiel does the design.
there’s a differentiable version of this that compiles to C or CUDA: difflogic
Yep, transputers failed miserably. I wrote a ton of code for them. Everything had to be solved in a serial bus, which defeated the purpose of the transputer.
Quite fascinating. Did you write about your experiences in that area? Would love to read it!
Not in those terms, but an autobiography is coming, and bits and pieces are being explained. I expect about 10 people to buy the book, as all of the socialists will want it for free. I am negotiating with a publisher as we speak on the terms.
Could you elaborate on the “serial bus” bit?
Could you elaborate on this? How does many-small-CPUs make for a weirder programming model than a GPU?
I'm no expert, but I've done my fair share of parallel HPC stuff using MPI, and a little bit of CUDA. And to me the GPU programming model is far, far "weirder" and harder to code for than the many-CPUs model. (Granted, I'm assuming you're describing a different regime?)
In CUDA you don't really manage the individual compute units, you start a kernel, and the drivers take care of distributing that to the compute cores and managing the data flows between them.
When programming CPUs however you are controlling and managing the individual threads. Of course, there are libraries which can do that for you, but fundamentally it's a different model.
The GPU equivalent of a single CPU "hardware thread" is called a "warp" or a "wavefront". GPU's can run many warps/wavefronts per compute unit by switching between warps to hide memory access latency. A CPU core can do this with two hardware threads, using Hyperthreading/2-way SMT, some CPU's have 4-way SMT, but GPU's push that quite a bit further.
What you say has nothing to do with CPU vs. GPU, or with CUDA, which is basically equivalent with the older OpenMP.
When you have a set of concurrent threads, each thread may run a different program. There are many applications where this is necessary, but such applications are hard to scale to very high levels of concurrency, because each thread must be handled individually by the programmer.
Another case is when all the threads run the same program, but on different data. This is equivalent with a concurrent execution of a "for" loop, which is always possible when the iterations are independent.
The execution of such a set of threads that execute the same program has been named "parallel DO instruction" by Melvin E. Conway in 1963, "array of processes" by C. A. R. Hoare in 1978, "replicated parallel" in the Occam programming language in 1985, SPMD around the same time, "PARALLEL DO" in the OpenMP Fortran language extension in 1997, "parallel for" in the OpenMP C/C++ language extension in 1998, and "kernel execution" in CUDA, which has also introduced the superfluous acronym SIMT to describe it.
When a problem can be solved by a set of concurrent threads that run the same program, then it is much simpler to scale the parallelism to extremely high levels and the parallel execution can usually be scheduled by a compiler or by a hardware controller without the programmer having to be concerned with the details.
There is no inherent difficulty in making a compiler that provides exactly the same programming model as CUDA, but which creates a program for a CPU, not for a GPU. Such compilers exist, e.g. ispc, which is mentioned in the parent article.
The difference between GPUs and CPUs is that the former appear to have some extra hardware support for what you describe as "distributing that to the compute cores and managing the data flows between them", but nobody is able to tell exactly what is done by this extra hardware support and whether it really matters, because it is a part of the GPUs that has never been documented publicly by the GPU vendors.
From the point of view of the programmer, this possible hardware advantage of the GPUs does not really matter, because there are plenty of programming language extensions for parallelism and libraries that can take care of the details of thread spawning and work distribution over SIMD lanes, regardless if the target is a CPU or a GPU.
Whenever you write a program equivalent with a "parallel for", which is the same as writing for CUDA, you do not manage individual threads, because what you write, the "kernel" in CUDA lingo, can be executed by thousands of threads, also on a CPU, not only on a GPU. A desktop CPU like Ryzen 9 9950X has the same product of threads by SIMD lanes like a big integrated GPU (obviously, discrete GPUs can be many times bigger).
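To make the equivalence concrete, here is a minimal hypothetical example (saxpy, not taken from the article): the "kernel" is literally the loop body, with the loop index replaced by the global thread index. The same source structure can target a CPU through OpenMP or ispc.

#include <cuda_runtime.h>

// Sequential version: the classic loop.
void saxpy_cpu(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// "Parallel for" / SPMD / kernel version: the loop body becomes the kernel and
// the loop index becomes the global thread index.
__global__ void saxpy_gpu(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch with one thread per element, e.g.:
//   saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);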
I mean weird compared to what already exists.
While acknowledging that it's theoretically possible other approaches might succeed, it seems quite clear the author agrees with you.
I'm guessing you just don't have the computational power to compete with a real GPU. It would be relatively easy for a top-end graphics programmer to write the front-end graphics API for your chip. I'm guessing that if they did this you would just end up with a very poorly performing GPU.
My take from reading this is that it's more about programming abstractions than any particular hardware instantiation. The part of the Connection Machine that remains interesting is not building machines with CPUs with transistor counts in the hundreds running off a globally synchronous clock, but that there was a whole family of SIMD languages that let you do general-purpose programming in parallel. And that those languages were still relevant when the architecture changed to a MIMD machine with a bunch of vector units behind each CPU.
Reminds me of Itanium
How is that at all like Itanium except for the superficial headline level where people say they are hard to program?
Because the main feature that made Itanium hard to program for was its explicit instruction-level parallelism.
They weren't talking about instruction level parallelism.
Similarities between two things don't require them to be identical.
They aren't similar, they couldn't be more different. One is about lots of small threads of execution communicating with each other and synchronizing, one is about a few instructions being able to be run in parallel because implicitly within the CPU there are different pipelines.
They aren't just different, they are at completely opposite ends of the programming spectrum. There are literally the two extremes of trying to make throughput faster.
Picochip?
> The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
What other workloads would benefit from a GPU?
Computers are so fast that in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's because that program's code is particularly bad, and the solution to make the code less bad is simpler than re-writing it for the GPU.
For example, GUIs have responded to user input with imperceptible latency for over 20 years. If an app's GUI feels sluggish, the problem is that the app's actions and rendering aren't on separate coroutines, or the action's coroutine is blocking (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it is today; I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.
In some cases, parallelizing a task intrinsically makes it slower, because the number of sequential operations required to handle coordination means there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads but they only run on 8-16 processors, so the program would be faster if it spawned fewer threads because it would still use all processors.
I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.
Possibly compilation and linking. That's very slow for big programs like Chromium. There's really interesting work on GPU compilers (co-dfns and Voetter's work).
Optimization problems like scheduling and circuit routing. Search in theorem proving (the classical parts like model checking, not just LLM).
There's still a lot that is slow and should be faster, or at the very least made to run using less power. GPUs are good at that for graphics, and I'd like to see those techniques applied more broadly.
All of these things you mention are "thinking", meaning they require complex algorithms with a bunch of branches and edge cases.
The tasks that GPUs are good at right now - graphics, number crunching, etc - are all very simple algorithms at the core (mostly elementary linear algebra), and the problems are, in most cases, embarrassingly parallel.
CPUs are not very good at branching either - see all the effort being put towards getting branch prediction right - but they are way better at it than GPUs. The main appeal of GPGPU programming is, in my opinion, that if you can get the CPU to efficiently divide the larger problem into a lot of small, simple subtasks, you can achieve faster speeds.
You mentioned compilers. See a related example, for reference all the work Daniel Lemire has been doing on SIMD parsing: the algorithms he (co)invented are all highly specialized to the language, and highly nontrivial. Branchless programming requires an entirely different mindset/intuition than "traditional" programming, and I wouldn't expect the average programmer to come up with such novel ideas.
A GPU is a specialized tool that is useful for a particular purpose, not a silver bullet to magically speed up your code. There is a reason that we are using it for its current purposes.
> Possibly compilation and linking. That's very slow for big programs like Chromium.
So instead of fixing the problem (Chromium's bloat) we just throw more memory and computing power at it, hoping that the problem will go away.
Maybe we should teach programmers to program. /s
A big one is video encoding. It seems like GPUs would be ideal for it but in practice limitations in either the hardware or programming model make it hard to efficiently run on GPU shader cores. (GPUs usually include separate fixed-function video engines but these aren't programmable to support future codecs.)
Video encoding is done with fixed-function for power efficiency. A new popular codec like H26x codec appears every 5-10 years, there is no real need to support future ones.
Video encoding is two domains. And there's surprisingly little overlap between them.
You have your real time video encoding. This is video conferencing, live television broadcasts. This is done fixed-function not just for power efficiency, but also latency.
The second domain is encoding at rest. This is youtube, netflix, blu-ray, etc. This is usually done in software on the CPU for compression ratio efficiency.
The problem with fixed function video encoding is that the compression ratio is bad. You either have enormous data, or awful video quality, or both. The problem with software video encoding is that it's really slow. OP is asking why we can't/don't have the best of both worlds. Why can't/don't we write a video encoder in OpenCL/CUDA/ROCm. So that we have the speed of using the GPU's compute capability but compression ratio of software.
I haven't yet read the full blog post but so far my response is you can have this good parallel computer. See my previous HN comments the past months on building an M4 Mac mini supercomputer.
For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets and the IOMMU and page tables that prevent you from programming all processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and making your own Abstract Syntax Tree to assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.
https://en.wikipedia.org/wiki/Roofline_model
There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need to have a large community of people who can run their code. Great projects die all the time because a slightly worse, but more ubiquitous technology prevents flowering of new approaches. There are economies of scale that feed back into ever-improving iterations of existing systems.
Simply porting existing successful codes from CPU to GPU can be a major undertaking and if there aren't any experts who can write something that drive immediate sales, a project can die on the vine.
See for example https://en.wikipedia.org/wiki/Cray_MTA when I was first asked to try this machine, it was pitched as "run a million threads, the system will context switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it to GPUs.
AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.
I've found the best strategy is to target my development for what the high end consumers are buying in 2 years - this is similar to many games, which launch with terrible performance on the fastest commercially available card, then run great 2 years later when the next gen of cards arrives ("Can it run Crysis?")
Interesting article.
Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.
In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.
Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust, incidentally, is an wrapper to extend that concept to cross-platform (Mac, Android, browsers, etc.)
While it seems you can write a general 3D renderer that works in a wide variety of situations, that does not work well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.
There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.
In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.
The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process. This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.
Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
[1] https://github.com/linebender/vello/
2D rendering is harder in fact, because antialiased curves are harder than triangle soup.
It's an issue of code complexity, not fill rate
https://faultlore.com/blah/text-hates-you/
I think a dynamic, fully vector-based 2D interface with fluid zoom and transformations at 120Hz+ is going to need all the GPU help it can get. Take mapping as an example: even Google Maps routinely struggles with performance on a top-of-the-line iPhone.
> even Google Maps routinely struggles with performance on a top-of-the-line iPhone.
It has to download the jpegs.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.
A ton of 2D applications could benefit from further GPU parallelization. Games, GUIs, blurs & effects, 2D animations, map apps, text and symbol rendering, data visualization...
Canvas2D in Chrome is already hardware accelerated, so most users get better performance and reduced load on main UI & CPU threads out of the box.
Fast light transport is an incredibly hard problem to solve.
Raytracing (in its many forms) is one solution. Precomputing lightmaps, probes, occluder volumes, or other forms of precomputed visibility are another.
In the end it comes down to a combination of target hardware, art direction and requirements, and technical skill available for each game.
There's not going to be one general purpose renderer you can plug into anything, _and_ expect it to be fast, because there's no general solution to light transport and geometry processing that fits everyone's requirements. Precomputation doesn't work for dynamic scenes, and for large games leads to issues with storage size and workflow slow downs across teams. No precomputation at all requires extremely modern hardware and cutting edge research, has stability issues, and despite all that is still very slow.
It's why game engines offer several different forms of lighting methods, each with as many downsides as they have upsides. Users are supposed to pick the one that best fits their game, and hope it's good enough. If it's not, you write something custom (if you have the skills for that, or can hire someone who can), or change your game to fit the technical constraints you have to live with.
> Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
Some games may have their own acceleration structures. Some won't. Some will only have them on the GPU, not the CPU. Some will have an approximate structure used only for specialized tasks (culling, audio, lighting, physics, etc), and cannot be generalized to other tasks without becoming worse at their original task.
Fully generalized solutions will be slow but flexible, and fully specialized solutions will be fast but inflexible. Game design is all about making good tradeoffs.
The same argument could be made against Vulkan, or OpenGL, or even SQL databases. The whole NoSQL era was based on the concept that performance would be better with less generality in the database layer. Sometimes it helped. Sometimes trying to do database stuff with key/value stores made things worse.
I'm trying to find a reasonable medium. I have a hard scaling problem - big virtual world, dynamic content - and am trying to make that work well. If that works, many games with more structured content can use the same approach, even if it is overkill.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
Depends on how complicated your artwork is.
There are only so many screen pixels.
You can have an unlimited number of polygons overlapping a pixel. For instance, if you zoom out a lot. Imagine you converted a layer map of a modern CPU design to svg, and tried to open it in Inkscape. Or a map of NYC. Wouldn't you think a bit of extra processing power would be welcomed?
At Vulkanised 2025 someone mentioned that it is a HAL for writing GPU drivers, and they have acknowledged it has gotten as messy as OpenGL and that there is now a plan in place to try to sort out the complexity mess.
> Modern GPUs are overkill for 2D
That explains why modern GUIs are crap: because they are not able to draw a bloody rectangle and fill it with colour. /s
> I believe there are two main things holding it back. One is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs … struggle when the workload is dynamic
This sacrifice is a purposeful cornerstone of what allows GPUs to be so high-throughput in the first place.
It is odd that he talks about Larrabee so much, but doesn’t mention the Xeon Phis. (Or is it Xeons Phi?)
> As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.
I’ve always been slightly annoyed by the concept of E cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E cores, give them their AVX-512 back, and give them higher throughput memory. Maybe try and pull the Phi trick of less OoO capabilities but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.
I've always wondered if you could use iGPU compute cores with unified memory as "transparent" E-cores when needed.
Something like OpenCL/CUDA except it works with pthreads/goroutines and other (OS) kernel threading primitives, so code doesn't need to be recompiled for it. Ideally the OS scheduler would know how to split the work, similar to how E-core and P-core scheduling works today.
I don't do HPC professionally, so I assume I'm ignorant to why this isn't possible.
Isn't Xeon Phi just an instance of Larrabee?
It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
The "Larrabee New Instructions" were an instruction set designed before AVX, and their first hardware implementation also arrived before AVX, in 2010 (AVX launched in 2011, with Sandy Bridge).
Unfortunately, while the hardware design of Sandy Bridge with the inferior AVX ISA was done by the Intel A team, the hardware implementations of Larrabee were done by C or D teams, which were also not able to design new CPU cores for it; they had to reuse obsolete x86 cores, initially a Pentium core and later an Atom Silvermont core, onto which the Larrabee instructions were grafted.
The "Larrabee New Instructions" were renamed to the "Many Integrated Cores" ISA, then to AVX-512, while passing through 3 generations of chips: Knights Ferry, Knights Corner and Knights Landing. A fourth generation, Knights Mill, was only intended for machine learning/AI applications. The successor of Knights Landing was Skylake Server, when the AVX-512 ISA came to standard Xeons, marking the disappearance of Xeon Phi.
Already in 2013, Intel Haswell added to AVX a few of the more important instructions that were included in the Larrabee New Instructions but missing from AVX, e.g. fused multiply-add and gather instructions. The 3-address FMA format, which caused problems for AMD, who had implemented a 4-address format in Bulldozer, also came to AVX from Larrabee, replacing the initial 4-address specification.
At each generation until Skylake Server, some of the original Larrabee instructions were deleted, on the assumption that they would be needed only for graphics, which was no longer the intended market. However, a few of those instructions were really useful for some applications in which I am interested, e.g. computations with big numbers, so I regret their disappearance.
Since Skylake Server there have been no other instruction removals, with the exception of those introduced by Intel Tiger Lake, which are now supported only by AMD Zen 5. A few days ago Intel committed to keeping complete compatibility in the future with the ISA implemented today by Granite Rapids, so there will be no other instruction deletions.
> It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
This is an odd claim. Clearly Xeon Phi is the shipping version of Larrabee, while Zen 4 is a completely different chip design that happens to run AVX-512. The first shipping Xeon Phi (Knights Corner) used the exact same P54C cores as Larrabee, while as you point out later versions of Xeon Phi switched to Atom.
It is extremely common to refer to all these as Larrabee, for example the Ian Cutress article on the last Xeon Phi chip was entitled "The Larrabee Chapter Closes: Intel's Final Xeon Phi Processors Now in EOL" [1]. Pat Gelsinger's recent interview at GTC [2] also refers to Larrabee. The section from around 44:00 has a discussion of workloads becoming more dynamic, and at 53:36 there's a section on Larrabee proper.
[1]: https://www.anandtech.com/show/14305/intel-xeon-phi-knights-...
[2]: https://www.youtube.com/live/pgLdJq9FRBQ
I think it is not right to say that Larrabee and Phi are as distant as Larrabee and Zen. But they did retreat a bit from the “graphics card”-like functionality and scale back the ambitions, becoming something a bit more familiar.
Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified - which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.
You can. There are API extensions for persistently mapping memory, and it's up to you to ensure that you never write to a buffer at the same time the GPU is reading from it.
At least for Vulkan/DirectX12. Metal is often weird, I don't know what's available there.
Unified memory doesn't mean unified address space. It frustrates me when no one understands unified memory.
If you fix the page tables (partial tutorial online) you can have a contiguous unified address space on Apple Silicon.
Let’s be honest, saying “just fix the page tables” is like telling someone they can fly if they “just rewrite gravity.”
Yes, on Apple Silicon, the hardware supports shared physical memory, and with enough “convincing”, you can rig up a contiguous virtual address space for both the CPU and GPU. Apple’s unified memory architecture makes that possible, but Apple’s APIs and memory managers don’t expose this easily or safely for a reason. You’re messing with MMU-level mappings on a tightly integrated system that treats memory as a first-class citizen of the security model.
I can tell you never programmed on an Amiga.
Oh yes, I programmed all the Amiga models, mostly in assembly. I reprogrammed the ROMs. I also published a magazine on the internals of all the Commodore computers and built lots of hardware for these machines.
We had the parallel Inmos Transputer systems during the heyday of the Amiga; they were much better designed than any of the custom Amiga chips.
Inmos was a disaster. No application ever shipped on one. EVER. It used a serial bus to resolve the problems that should have never been problems. Clearly you never wrote code for one. Each oslink couldn't reach more than 3 feet. What a disaster that entire architecture was.
I shipped 5 applications on an 800 Inmos Transputer supercomputer. Sold my parallel C compilers and macro assembler. Also an OS, a Macintosh NuBus interface card, Transputer graphics cards, a full paper copier, and a laser printer. I know of dozens of successful products.
Sure you did. What were they? The only successful transputer was the T414 and it never made it outside academia.
Well, I believe there were military radar projects that shipped in reasonable quantities and served for reasonable lifetimes.
I think I remember some medical imaging products as well?
I don’t dispute that the Transputer was ultimately unsuccessful, but it wasn’t completely unused in real-world products.
See also TI’s C40, which was quite similar and similarly successful.
Hey don't shit on my retro alternative timeline nostalgia. We were all writing Lisp programs on 64 CPU Transputer systems with FPGA coprocessors, dynamically reconfigured in realtime with APL.
s/LISP/Prolog/ and you've basically described the old "Fifth Generation" research project. Unfortunately, it turns out that trying to parallelize Prolog is quite a nightmare; the language is really, really not built for it. So the whole thing was a dead end in practice. Arguably we didn't have a real "fifth-gen" programming language prior to Rust, given how it manages to uniquely combine ease of writing parallel+concurrent code with bare-metal, C-like efficiency. (And Rust is now being used to parallelize database queries, which comfortably addresses the actual requirement that Prolog had been intended for back then: performing "search" tasks on large and complex knowledge bases.)
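As a toy illustration of that last parenthetical (a parallel "search" over a knowledge base in Rust), here is a sketch using the rayon crate. The facts and the query are made up; only the pattern matters.

```rust
// Toy parallel "knowledge base search" using rayon (add `rayon = "1"` to
// Cargo.toml). The facts and the query are invented for illustration.
use rayon::prelude::*;

#[derive(Debug)]
struct Fact {
    subject: &'static str,
    relation: &'static str,
    object: &'static str,
}

fn main() {
    let facts = vec![
        Fact { subject: "socrates", relation: "is_a", object: "human" },
        Fact { subject: "human", relation: "is_a", object: "mortal" },
        Fact { subject: "rust", relation: "is_a", object: "language" },
    ];

    // "Query": find every fact asserting that something is a mortal.
    // `par_iter` splits the scan across threads with no manual locking.
    let answers: Vec<&Fact> = facts
        .par_iter()
        .filter(|f| f.relation == "is_a" && f.object == "mortal")
        .collect();

    println!("{:?}", answers);
}
```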
You probably already know about https://github.com/mthom/scryer-prolog. Are you saying we should accelerate scryer-prolog with a Grayskull board?
Also, the Fifth Generation computer project subscribed to the old AI paradigm, not based on statistics.
Not true. "just fix the page tables" took me 4 hours. And only 15 minutes with the Linux kernel on Apple Silicon.
Obviously you missed the sarcasm.
I know the APIs don't make it easy, that's precisely why I want different APIs.
I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.
I implemented some evolutionary computation stuff on the Cell BE in college. It was a really interesting machine and could be very fast for its time but it was somewhat painful to program.
The main cores were PPC and the Cell cores were… a weird proprietary architecture. You had to write kernels for them like GPGPU, so in that sense it was similar. You couldn’t use them seamlessly or have mixed work loads easily.
Larrabee and Xeon Phi are closer to what I’d want.
I’ve always wondered about many—many-many-core CPUs too. How many tiny ARM32 cores could you put on a big modern 5nm die? Give each one local RAM and connect them with an on die network fabric. That’d be an interesting machine for certain kinds of work loads. It’d be like a 1990s or 2000s era supercomputer on a chip but with much faster clock, RAM, and network.
When this topic comes up, I always think of uFork [1]. They are even working on an FPGA prototype.
[1] https://ufork.org/
The AIE arrays on Versal and Ryzen with XDNA are a big grid of cores (400 in an 8 x 50 array) that you program with streaming work graphs.
https://docs.amd.com/r/en-US/am009-versal-ai-engine/Overview
Each AIE tile can stream 64 Gbps in and out and perform 1024 bit SIMD operations. Each shares memory with its neighbors and the streams can be interconnected in various ways.
Clearly the author never worked with a CM2 - I did though. The CM2 was more like a co-processor which had to be controlled by a (for that age) rather beefy Sun workstation/server. The program itself ran on the workstation, which then sent the data-parallel instructions to the CM2. The CM2 was an extreme form of a SIMD design (that is why it was called data parallel). You worked with a large rectangular array (I cannot recall up to how many dimensions) which had to be a multiple of the physical processors (in your partition). All cells typically performed exactly the same operation. If you wanted to perform an operation on a subset, you had to "mask" the other cells (which were essentially idling during that time).
That is hardly what the author describes.
Did you use StarLisp? It is always a bit hard to find testimonials about the experience.
AMD Strix Halo APU is a CPU with very powerful integrated GPU.
It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space. This means it doesn’t have the same swapping/memory thrashing that a discrete GPU experiences when processing large models.
16 CPU cores and 40 GPU compute units sounds pretty parallel to me.
Doesn’t that fit the bill?
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
Not exactly. The RTX4090 uses fast graphics RAM (usually a previous generation, but overclocked and on a very wide bus), while the AMD Strix Halo uses standard DDR5, which is not as fast.
And yes, the Strix Halo GPU uses "3D cache", but as officials said, the CPU doesn't have access to the GPU cache, because they "have not seen any app significantly benefit from such access".
So the internal SoC bus probably has less latency than a discrete GPU on PCIe, but the difference isn't huge.
It looks like it will be available in the Framework Desktop! I would love to see it in a more budget mini PC at some point from another company. (Framework is great but not in my price range.)
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
I love AMD's Ryzen chips and will recommend their laptops over an Nvidia model all day. However, this is a pretty facetious comparison that falls apart when you normalize the memory. Any chip can be memory bottlenecked, and if we take away that arbitrary precondition the Strix Halo gets trounced in terms of compute capacity. You can look at the TDP of either chip and surmise this pretty easily.
> However, this is a pretty facetious comparison that falls apart when you normalize the memory
Why would you normalize though? You can't buy a 96 GB RTX4090. So it's fair to compare the whole deal, slowish APU with large RAM versus very fast GPU with limited RAM.
> You can't buy a 96 GB RTX4090
You can now buy a 96 GB RTX5090.[1] NVidia gives it a "Pro" designation and charges more, but it's the same chip.
[1] https://www.tomshardware.com/pc-components/gpus/nvidia-rtx-p...
It is fair, it should just be contextualized with a comparison of 13B or 32B models as well. This is one of those Apple marketing moves where a very specific benchmark has been cherry-picked for a "2.2x improvement!" headline that people online misconstrue.
“AMD also claims its Strix Halo APUs can deliver 2.2x more tokens per second than the RTX 4090 when running the Llama 70B LLM (Large Language Model) at 1/6th the TDP (75W).”
https://www.tomshardware.com/pc-components/cpus/amd-slides-c...
You could argue it's an invalid claim because it's from AMD, not an independent source.
This is still a memory-constrained benchmark. The smallest Llama 70B model (gguf-q2) doesn't fit in memory, so it's bottlenecked by your PCIe connection. It's a valid benchmark, but it's still guilty of being stacked in exactly the way I described before.
A comparison of 7B/13B/32B model performance would actually test the compute performance of either card. AMD is appealing to the consumers that don't feel served by Nvidia's gaming lineup, which is fine but also doomed if Nvidia brings their DGX Spark lineup to the mobile form factor.
This essay needs more work.
Are you arguing for a better software abstraction, a different hardware abstraction or both? Lots of esoteric machines are name dropped, but it isn't clear how that helps your argument.
Why not link to Vello? https://github.com/linebender/vello
I think a stronger essay would at the end give the reader a clear view of what Good means and how to decide if a machine is closer to Good than another machine and why.
SIMD machines can be turned into MIMD machines. Even hardware problems still need a software solution. The hardware is there to offer the right affordances for the kinds of software you want to write.
Lots of words that are in the eye of the beholder. We need a checklist, or that Good parallel computer won't be built.
Personal opinion: it's the software (and software tooling).
The hardware is good enough (even if we're only talking 10x efficiency). Part of the issue seems slightly cultural, i.e. repeatedly putting down the idea of traditional task parallelism (not-super-SIMD/data parallelism) on GPUs. Obviously, one would lose a lot of efficiency if we literally ran 1 thread per warp. But it could be useful for lightly-data-parallel tasks (like typical CPU vectorization), or maybe for using warp-wide semantics to implement something like a "software" microcode engine. Dumb example: implementing division with long division using multiplications and shifts.
Other things a GPU gives: insanely high memory bandwidth, programmable cache (shared memory), and (relatively) great atomic operations.
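For the curious, a minimal Rust sketch of that dumb example above: restoring shift-and-subtract long division, the textbook algorithm rather than any vendor's actual microcode. The multiply-based variants (e.g. a Newton-Raphson reciprocal) follow the same spirit.

```rust
// Binary long division built only from shifts, compares, and subtracts.
fn divmod_u32(dividend: u32, divisor: u32) -> (u32, u32) {
    assert!(divisor != 0, "division by zero");
    let mut quotient: u32 = 0;
    let mut remainder: u64 = 0; // wide enough that the shift below cannot overflow
    // Walk the dividend from its most significant bit to its least.
    for bit in (0..32).rev() {
        // Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | u64::from((dividend >> bit) & 1);
        // If the divisor now fits, subtract it and record a 1 in the quotient.
        if remainder >= u64::from(divisor) {
            remainder -= u64::from(divisor);
            quotient |= 1 << bit;
        }
    }
    (quotient, remainder as u32)
}

fn main() {
    assert_eq!(divmod_u32(1_000_003, 7), (1_000_003 / 7, 1_000_003 % 7));
    assert_eq!(divmod_u32(u32::MAX, 3_000_000_000), (1, u32::MAX - 3_000_000_000));
    println!("ok");
}
```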
I agree.
Many things in software fall into "you're doing it wrong" territory, but what counts as the wrong way is subjective and arbitrary.
> maybe using warp-wide semantics to implement something like a "software" microcode engine.
https://github.com/beehive-lab/ProtonVM
Thanks for the share (and reminder)! Turns out I had this bookmarked somehow, lol.
email is on profile, drop me a line if you want to discuss gpu meta machines
The domain doesn't seem to be available?
I was being too obtuse in my obfuscation; it is a proton.me email address.
> Are you arguing for a better software abstraction, a different hardware abstraction or both?
I don't speak for Raph, but imo it seems like he was arguing for both, and I agree with him.
On the hardware side, GPUs have struggled with dynamic workloads at the API level (not e.g. thread-level dynamism, that's a separate topic) for around a decade. Indirect commands gave you some of that so at least the size of your data/workload can be variable if not the workloads themselves, then mesh shaders gave you a little more access to geometry processing, and finally workgraphs and device generated commands lets you have an actually dynamically defined workload (e.g. completely skipping dispatches for shading materials that weren't used on screen this frame). However it's still very early days, and the performance issues and lack of easy portability are problematic. See https://interplayoflight.wordpress.com/2024/09/09/an-introdu... for instance.
On the software side shading languages have been garbage for far longer than hardware has been a problem. It's only in the last year or two that a proper language server for writing shaders has even existed (Slang's LSP). Much less the innumerable driver compiler bugs, lack of well defined semantics and memory model until the last few years, or the fact that we're still manually dividing work into the correct cache-aware chunks.
Absolutely. And the fact that we need to evolve both is one of the reasons progress has been difficult.
I had hoped the GPU API would go away, and the entire thing would become fully programmable, but so far we just keep using these shitty APIs and horrible shader languages.
Personally I would like to use the same language I write the application in to write the rendering code (C++). Preferably with shared memory, not some separate memory system that takes forever to transfer anything to. Something along the lines of the new AMD 360 Max chips, but with graphics written in explicit C++.
I was always fascinated by the prospects of the 1024-core Epiphany-V from Parallella: https://parallella.org/2016/10/05/epiphany-v-a-1024-core-64-... But it seems whatever the DARPA connection was has led to it not being for scruffs like me, and it is likely powering god knows what military systems.
Any computing model that tries to parallelize von Neumann machines, that is, has program counters or address space, just isn't going to scale.
The problem isn't address space or program counters. It's that each processor is going to need instruction memory stored in SRAM, or an extremely efficient multi-port memory for a shared instruction cache.
GPUs get around this limitation by executing identical instructions over multiple threads.
Instructions are the problem: you have to have an architecture which just operates on data flows, all in parallel and all at once, like an FPGA, but without all the fiddly special-sauce parts.
There are designs like Tilera and Phalanx that have tons of cores. Then, NUMA machines used to have 128-256 sockets in one machine with coherent memory. The SGI machines let you program them like it was one machine. Languages like Chapel were designed to make parallel programming easier.
Making more things like that, at the lowest possible unit prices, could help a lot.
Isn't the ONNX standard already going into the direction of programming a GPU using a computation graph? Could it be made more general?
It lacks support for the serial portions of the execution graph, but yes. You should play around with ONNX, it can be used for a lot more than just ML stuff.
What do you mean by serial portions? Aren't operations automatically serialized if there are dependencies between them?
s/serial/scalar
ONNX doesn't have the direct capabilities to be the compilation target for regular imperative code.
If we had distributed operating systems and SSI kernels, your computer could use the idle cycles of other computers [that aren't on battery power]. People talk about a grid of solar houses, but we could've had personal/professional grid computing like 15 years ago. Nobody wanted to invest in it, I guess because chips kept getting faster.
SSI is an interesting idea, but the actual advantage is mostly to improve efficiency when running your distributed code on a single, or few nodes. You still have to write your code with some very real awareness of the relevant issues when running on many nodes, but now you are also free to "scale down" and be highly efficient on a single node, since your code is still "natively" written for running on that kind of system. You are not going to gain much by opportunistically running bad single-node codes on larger systems, since that will be quite inefficient anyway.
Also, running a large multi-node SSI system means you mostly can't partition those nodes ever, otherwise the two now-separated sets of nodes could both progress in ways that cannot be cleanly reconciled later. This is not what people expect most of the time when connecting multiple computers together.
You could say the same thing about multiple cores or CPUs. A lot of people write apps that aren't useful past a single core or CPU. Doesn't mean we don't build OSes & hardware for multiple cores... (Remember back when nobody had an SMP kernel, because, hey, who the hell's writing their apps for more than one CPU?! Our desktops aren't big iron!)
In the worst-case, your code is just running on the CPU you already have. If you have another node/CPU, you can schedule your whole process on that one, which frees up your current CPU for more work. If you design your app to be more scalable to more nodes/CPUs, you get more benefits. So even in the worst case, everything would just be... exactly the way it is today. But there are many cases that would be benefited, and once the platform is there, more people would take advantage of it.
There is still a massive opportunity in general parallel computing that we haven't explored. Plenty of research, but along specific kinds of use cases, and with not nearly enough investment, so the little work that got done took decades. I think we could solve all the problems and make it generally useful, which could open up a whole new avenue of computing / applications, the way more bandwidth did.
(I'm referring to consumer use-cases above, but in the server world alone, a distributed OS with simple parallel computing would transform billion-dollar markets in software, making a whole lot of complicated solutions obsolete. It might take a miracle for the code to get adopted upstream by the Linux Mafia, though)
> It might take a miracle for the code to get adopted upstream by the Linux Mafia, though
The basic building block is containerization/namespacing, which has been adopted upstream. If your app is properly containerized, you can use the CRIU featureset (which is also upstream) to checkpoint it and migrate it to another node.
What about unified memory? I know these APUs are slower than traditional GPUs but still it seems like the simpler programming model will be worth it.
The biggest problem is that most APUs don't even support full unified memory (system SVM in OpenCL). From my research, only Apple M series, some Qualcomm Adreno, and AMD APUs support it.
Huh. The Blelloch mentioned in the Thinking Machines section taught my parallel algorithms class in 1994 or so.
I wonder if CDN server applications could use something like this, if every core had a hardware TCP/TLS stack and there was a built-in IP router to balance the load, or something like that.
I think Tim was right, it's 2025, Nvidia just released their 50 series, but I don't see any cards, let alone GPUs.
There's a lot here that seems to misunderstand GPUs and SIMD.
Note that raytracing is a very dynamic problem, where the GPU isn't sure if a ray hits geometry or misses. When it hits, the ray needs to bounce, possibly multiple times.
Various implementations of raytracing, recursion, dynamic parallelism, or whatever. It's all there.
Now the software / compilers aren't ready (outside of specialized situations like Microsoft's DirectX Raytracing, which compiles down to a very intriguing threading model). But what was accomplished with DirectX can be done in other situations.
-------
Connection Machine is before my time, but there's no way I'd consider that 80s hardware to be comparable to AVX2 let alone a modern GPU.
The Connection Machine was a 1-bit computer, for crying out loud, just 65,536 of them in parallel.
Xeon Phi (~70 Intel Atom cores) is slower and weaker than modern 192-core EPYC chips.
-------
Today's machines are better. A lot better than the past machines. I cannot believe any serious programmer would complain about the level of parallelism we have today and wax poetic about historic and archaic computers.
The problems I'm having are very different than those for raytracing. Sure, it's dynamic, but at a fine granularity, so the problems you run into are divergence, and often also wanting function pointers, which don't work well in a SIMT model. By contrast, the way I'm doing 2D there's basically no divergence (monoids are cool that way), but there is a need to schedule dynamically at a coarser (workgroup) level.
But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
The problem with the GPU raytracing work is that they built hardware and driver support for the specific problem, rather than more general primitives on which you could build not only raytracing but other applications. The same story goes for video encoding. Continuing that direction leads to unmanageable complexity.
Of course today's machines are better, they have orders of magnitude more transistors, and crystallize a ton of knowledge on how to build efficient, powerful machines. But from a design aesthetic perspective, they're becoming junkheaps of special-case logic. I do think there's something we can learn from the paths not taken, even if, quite obviously, it doesn't make sense to simply duplicate older designs.
> But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
True. Allocation just seems to be a "forced sequential" operation. A "stop the world, figure out what RAM is available" kind of thing.
If you can work with pre-allocated buffers, then GPUs work by reading from lists (consume operations) and then outputting to lists (append operations). This can be done with gather / scatter, or more precisely stream expansion and stream compaction, in a grossly parallel manner.
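For readers who haven't seen it, here is a sequential Rust sketch of what stream compaction does with those lists: flag, exclusive prefix sum, scatter. On a GPU each of those steps runs in parallel across thousands of elements; the names here are invented for illustration.

```rust
// CPU-side illustration of stream compaction: keep only the "surviving" rays
// and pack them densely, the way a GPU would with a parallel prefix sum plus
// scatter. Everything here is sequential and just shows the data movement.
#[derive(Clone, Copy, Debug)]
struct Ray {
    id: u32,
    alive: bool, // e.g. "this ray hit something and needs another bounce"
}

fn compact(rays: &[Ray]) -> Vec<Ray> {
    // Step 1: one flag per element (1 = keep, 0 = drop).
    let flags: Vec<u32> = rays.iter().map(|r| u32::from(r.alive)).collect();

    // Step 2: an exclusive prefix sum of the flags gives each survivor its
    // destination index. On the GPU this is a parallel scan.
    let mut offsets = vec![0u32; rays.len()];
    let mut running = 0u32;
    for (i, &f) in flags.iter().enumerate() {
        offsets[i] = running;
        running += f;
    }

    // Step 3: scatter survivors to their computed slots (an "append" into a
    // densely packed output list).
    let mut out = vec![Ray { id: 0, alive: false }; running as usize];
    for (i, &r) in rays.iter().enumerate() {
        if r.alive {
            out[offsets[i] as usize] = r;
        }
    }
    out
}

fn main() {
    let rays = [
        Ray { id: 0, alive: true },
        Ray { id: 1, alive: false },
        Ray { id: 2, alive: true },
        Ray { id: 3, alive: true },
    ];
    println!("{:?}", compact(&rays)); // ids 0, 2, 3 packed together
}
```

Stream expansion is the mirror image: each input writes a variable number of outputs, with a prefix sum of the counts giving the write offsets.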
---------
If that's not enough "memory management" for you, then yeah, the CPU is the better device to work with. At which point I again point back to the 192-core EPYC Zen5c example: we have grossly parallel CPUs today if you need them, just a few clicks away to rent from cloud providers like Amazon or Azure.
GPUs are good at certain things (and I consider them the pinnacle of "Connection Machine"-style programming; today's GPUs are just far more parallel, far easier to program, and far faster than the old 1980s stuff).
Some problems cannot be split up (ex: web requests are so unique I cannot imagine they'd ever be programmed into a GPU due to their divergence). However CPUs still exist for that.
> But the biggest problem I'm having is management of buffer space for intermediate objects
My advice for right now (barring new APIs), if you can get away with it, is to pre-allocate a large scratch buffer for as big of a workload as you will have over the program's life, and then have shaders virtually sub-allocate space within that buffer.
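A CPU-side Rust analogue of that pattern, heavily simplified: one big pre-allocated scratch buffer and an atomic bump cursor. In an actual shader the cursor would live in a storage buffer and be advanced with an atomic add; all names below are made up.

```rust
// Illustration of "pre-allocate one big scratch buffer and have workers
// sub-allocate out of it by bumping an atomic cursor".
use std::sync::atomic::{AtomicUsize, Ordering};

struct ScratchBuffer {
    storage: Vec<u32>,   // stands in for a big GPU buffer of 32-bit words
    cursor: AtomicUsize, // next free word index
}

impl ScratchBuffer {
    fn new(words: usize) -> Self {
        Self { storage: vec![0; words], cursor: AtomicUsize::new(0) }
    }

    /// Reserve `words` words; returns the start offset, or None if the scratch
    /// buffer is exhausted (the caller must handle that, e.g. by splitting the
    /// workload or sizing the buffer for the worst case).
    fn sub_alloc(&self, words: usize) -> Option<usize> {
        let start = self.cursor.fetch_add(words, Ordering::Relaxed);
        if start + words <= self.storage.len() {
            Some(start)
        } else {
            None
        }
    }

    /// Throw everything away between frames: just reset the cursor.
    fn reset(&mut self) {
        *self.cursor.get_mut() = 0;
    }
}

fn main() {
    let mut scratch = ScratchBuffer::new(1 << 20); // 1M words, ~4 MiB
    let a = scratch.sub_alloc(256).expect("scratch exhausted");
    let b = scratch.sub_alloc(1024).expect("scratch exhausted");
    println!("allocation a at word {a}, b at word {b}");
    scratch.reset();
}
```

The appeal is that "allocation" becomes a single atomic add and freeing is a whole-buffer reset between frames; the cost is that you must size the scratch buffer for the worst case or handle the overflow path.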
Agreed, there are two different problems being described here.
1. Divergence of threads within a workgroup/SM/whatever
2. Dynamically scheduling new workloads (i.e. dispatches, draws, etc) in response to the output of a previous workload
Raytracing is problem #1 (and has its own solutions, like shader execution reordering), while Raph is talking about problem #2.
> Raytracing is problem #1 (and has its own solutions, like shader execution reordering)
The "solution" to Raytracing (ignoring hardware acceleration like shader reordering), is stream compaction and stream expansion.
If you are willing to have lots of loops inside of a shader (not always possible due to Windows's 2-second maximum), you can write "while (hits_array is not empty)" style code, allowing your 1024-thread workgroup to keep calling all of the hits and efficiently processing all of the rays recursively.

--------
The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:
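Roughly the following shape. This is a CPU-side Rust sketch of the pattern with invented names and toy functions; on a GPU each Vec would be an append/consume buffer and each loop a dispatch over that buffer's contents.

```rust
// Multi-worklist pattern: instead of one hits_array there is one pending list
// per follow-up function, and we keep draining them until all are empty.
#[derive(Clone, Copy, Debug)]
struct WorkItem {
    payload: u32,
}

// Which follow-up "function" a produced item should go to next.
#[derive(Clone, Copy)]
enum Next {
    Func1,
    Func2,
    Done,
}

// Toy rules chosen only so the example terminates: the payload shrinks
// every step until it reaches zero.
fn func1(item: WorkItem) -> (WorkItem, Next) {
    if item.payload == 0 {
        (item, Next::Done)
    } else if item.payload % 2 == 0 {
        (WorkItem { payload: item.payload / 2 }, Next::Func2)
    } else {
        (WorkItem { payload: item.payload - 1 }, Next::Func1)
    }
}

fn func2(item: WorkItem) -> (WorkItem, Next) {
    if item.payload == 0 {
        (item, Next::Done)
    } else {
        (WorkItem { payload: item.payload - 1 }, Next::Func1)
    }
}

fn main() {
    // One "append buffer" per follow-up function, instead of a single hits_array.
    let mut queue1: Vec<WorkItem> = (0..16).map(|payload| WorkItem { payload }).collect();
    let mut queue2: Vec<WorkItem> = Vec::new();
    let mut finished: Vec<WorkItem> = Vec::new();

    // while (any work list is not empty) ...
    while !queue1.is_empty() || !queue2.is_empty() {
        let mut next1 = Vec::new();
        let mut next2 = Vec::new();

        // "Dispatch" func1 over everything waiting for it ...
        for item in queue1.drain(..) {
            let (out, next) = func1(item);
            match next {
                Next::Func1 => next1.push(out),
                Next::Func2 => next2.push(out),
                Next::Done => finished.push(out),
            }
        }
        // ... and func2 over everything waiting for it.
        for item in queue2.drain(..) {
            let (out, next) = func2(item);
            match next {
                Next::Func1 => next1.push(out),
                Next::Func2 => next2.push(out),
                Next::Done => finished.push(out),
            }
        }
        queue1 = next1;
        queue2 = next2;
    }
    println!("{} items finished", finished.len());
}
```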
Now of course we can't grow "too far"; GPUs can't handle divergence very well. But for "small" numbers of next-arrays and "small" amounts of divergence (i.e. I'm assuming that func1 is the most common here, like 80%+, so that the buffers remain full), this technique works. If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (ex: any HTTP server code will be awful on GPUs) and you're forced to use a CPU.
Having talked to many engineers using distributed compute today, they seem to think that (single-node) parallel compute haven't changed much since ~2010 or so.
It's quite frustrating, and exacerbated by frequent intro-level CUDA blog posts which often just repeat what they've read.
re: raytracing, this might be crazy but, do you think we could use RT cores to accelerate control flow on the GPU? That would be hilarious!
RT cores? No. Too primitive and specific.
But there is seemingly a generalization here to the Raytracing software ecosystem. I dunno how much software / hardware needs to advance here, but we are at the point where Intel RT cores are passing the stack pointers / instruction pointers between shaders (!!!). Yes through specialist hardware but surely this can be generalized to something awesome in the future?
------
For now, I'm happy with stream expansion / stream compaction and looping over consume buffers and producer/append buffers.
Are they really too specific? https://arxiv.org/abs/2303.01139
Well, it's not that the results were good though, lol.
No mention of the Transputer :(
Agreed with the premise here
I have never done GPU programming or graphics, but what feels frustrating looking from the outside is that the designs and constraints seem so arbitrary. They don't feel like they come from actual hardware constraints/problems. It just looks like pure path dependency going all the way back to the fixed-function days, with tons of accidental complexity and half-finished generalizations ever since.