"I believe there are two main things holding it back."
He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.
I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.
What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.
I will take how things are today over how things used to be in a heartbeat. I really believe I need to spend two weeks requiring students to write code on an Amiga, with all the programs running at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.
One of the most important steps of my career was being forced to write code for an 8051 microcontroller. Then writing firmware for an ARM microcontroller to make it pretend it was that same 8051 microcontroller.
I was made to witness the horrors of archaic computer architecture in such depth that I could reproduce them on totally unrelated hardware.
I tell students today that the best way to learn is by studying the mistakes others have already made. Dismissing the solutions they found isn’t being independent or smart; it’s arrogance that sets you up to repeat the same failures.
Sounds like you had a good mentor. Buy them lunch one day.
I had a similar experience. Our professor in high school would have us program a Z80 system entirely by hand: flow chart, assembly code, computing jump offsets by hand, writing the hex code by hand (looking up opcodes from the Z80 data sheet), and then loading the opcodes one byte at a time on a hex keypad.
It took three hours and four of us to code an integer division start to finish (we were like 17, though).
The amount of understanding it gave has been unrivalled so far.
> I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with.
So the designers of the Cell processor made some mistakes and therefore the entire concept is bunk? Because you've seen a concept done badly, you can't imagine it done well?
To be clear, I'm not criticising those designers, they probably did a great job with what they had, but technology has moved on a long way from then... The theoretical foundations for memory models, etc. are much more advanced. We've figured out how to design languages to be memory safe without significantly compromising on performance or usability. We have decades of tooling for running and debugging programs on GPUs and we've figured out how to securely isolate "users" of the same GPU from each other. Programmers are as abstracted from the hardware as they've ever been with emulation of different architectures so fast that it's practical on most consumer hardware.
None of the things you mentioned are inherently at odds with more parallel computation. Whether something is a good idea can change. At one point in time electric cars were a bad idea. Decades of incremental improvements to battery and motor technology means they're now pretty practical. At one point landing and reusing a rocket was a bad idea. Then we had improvements to materials science, control systems, etc. that collectively changed the equation. You can't just apply the same old equation and come to the same conclusion.
> and we've figured out how to securely isolate "users" of the same GPU from each other
That's the problem, isn't it.
I don't want my programs to act independently; they need to exchange data with each other (copy-paste, drag and drop). Also, I cannot do many things in parallel. Some things must be done sequentially.
So, there are books out there. I use Computer Architecture: A Quantitative Approach by Hennessy and Patterson. Recent revisions have removed historical information. I understand why they removed it. I wanted to use Stallings' book, but the department had already made arrangements with the publisher.
The biggest reason we don't write books is that people don't buy them. They take the PDF and stick it on GitHub. Publishers don't respond to authors' takedown requests, GitHub doesn't care about authors, so why spend the time publishing a book? We can chase grant money instead. I'm fortunate enough to not have to chase grant money.
While financial incentives are important to some, a lot of people write books to share their knowledge and give the book out for free. I think more people are doing this now, and there are also open collaborative textbook projects.
And I personally think that it is weird to write books during your working hours and also get money from selling that book.
This is the most ignorant response I've seen yet. We don't expect monetary gain from publishing a book. We expect our costs to be covered.
This is about the consumer, not the publisher. If we lived in a socialist system, they would still pirate our publications and we would still be in debt over it.
> What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on.
Isn't it much more plausible that the people who love to play with exotic (or retro), complicated architectures (in this case with high-performance opportunities) are different people from those who love to "set up or work in an assembly line for shipping stable software"?
> I really believe I need to spend two weeks requiring students to write code on an Amiga, with all the programs running at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.
I rather believe that among those who love this kind of programming, a hatred for the incompetent fellow students would develop (including wishes that they be weeded out by brutal exams).
Those students would all drop out and start meditating. That would be a fun course. Speed run developing for all the prickly architectures of the 80s and 90s.
I loved and really miss the cell. It did take quite a bit of work to shuffle things in and out of the SPUs correctly (so yeah, it took much longer to write code and greater care), but it really churned through data.
We had a generic job mechanism with the same restrictions on all platforms. This usually meant if it ran at all on Cell it would run great on PC because the data would generally be cache friendly. But it was tough getting the PowerPC to perform.
I understand why the PS4 was basically a PC after that - because it's easier. But I wish there were still SPUs off to the side to take advantage of. Happy to have them off-die like GPUs are.
> They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences.
Is there any reason why GPU-style parallelism couldn't have memory protection?
Do you mean accessing data outside of your app's framebuffer, or just accessing neighboring pixels during a shader pass? Because those are _very_ different things. GPU MMUs mean that you can't access a buffer that doesn't belong to your app; that's it. It's not about restricting pixel access within your own buffers.
On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM. Do that stuff in software not hardware.
You could also do things like having the JIT optimize the entire running system dynamically like one program, eliminating syscall and context switch overhead not to mention most MMU overhead.
Would it be faster? Maybe. The JIT would have to generate its own safety and bounds checking stuff. I’m sure some work loads would benefit a lot and others not so much.
What it would do is allow CPUs to be simpler, potentially resulting in cheaper lower power chips or more cores on a die with the same transistor budget. It would also make portability trivial. Port the core kernel and JIT and software doesn’t care.
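To give a rough flavor of "the JIT generates its own safety and bounds checking": here is a minimal C++ sketch of the kind of software-guarded memory access a WASM-style runtime could emit instead of relying on MMU page protection. The `LinearMemory` type and the trap behavior are illustrative assumptions, not any particular VM's implementation.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Sketch of a WASM-style "linear memory": every access goes through a
// software bounds check, the kind of guard a JIT could inline (and often
// elide when it can prove an index is in range), instead of relying on
// per-process MMU protection.
struct LinearMemory {
    std::vector<uint8_t> bytes;

    uint32_t load32(uint32_t addr) const {
        if (uint64_t(addr) + 4 > bytes.size())  // bounds check done in software
            std::abort();                       // stand-in for a VM "trap"
        uint32_t v;
        std::memcpy(&v, bytes.data() + addr, 4);
        return v;
    }
};
```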
> On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM.
GPU drivers take SPIR-V code (either "kernels" for OpenCL/SYCL drivers, or "shaders" for Vulkan Compute), which is not that different, at least in principle. There is also an LLVM-based software implementation that will just compile your SPIR-V code to run directly on the CPU.
What the ever loving hell, it was a perfectly reasonable idea in response to another idea.
They weren't saying it should be done, and went out of the way to make it explicit that they are not claiming it would be better.
It was a thought exploration, and a valid one, even if it would not pan out if carried all the way to execution at scale. Yes it was handwaving. So what? All ideas start as mere thoughts, and it is useful, productive, and interesting to trade them back and forth in these things called conversations. Even "fantasy" and "handwavy" ones. Hell especially those. It's an early stage in the pollination and generation of new ideas that later become real engineering. Or not, either way the conversation and thought was entertaining. It's a thing humans do, in case you never met any or aren't one yourself.
The brainstorming was a hell of a lot more valid, interesting, and valuable than this shit. "Just go away" indeed.
Have people really never used a higher level execution environment?
The JVM and the CLR are the most popular ones. Have people never looked at their internals? Then there's the LISP machines, Erlang, Smalltalk, etc., not to mention a lot of research work on abstract machines that just don't have the problems you get with direct access to pointers and hardware.
Some folks in the graphics programming community are allergic to these kinds of modern ideas.
They are now putting up with JITs in GPGPU, thanks to market pressure from folks using languages like Julia and Python, who would rather keep using those languages than rewrite their algorithms in C or C++.
These are communities where even adopting C over assembly, and C++ over C, has been an uphill battle; something like a JIT is like calling for the pitchforks and torches.
By the way, one of the key languages used on the Connection Machine mentioned in the article was StarLisp.
I'm going to call this out. The entire post obviously has bucketloads of aggression, which can be taken as just communication style, but the last line was just uncalled for.
I have seen you make high quality responses to crazy posts.
> I really believe I need to spend two weeks requiring students to write code on an Amiga, with all the programs running at the same time. If any one of them crashes, they all fail my course.
The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.
- You need to compile shader source/bytecode at runtime; you can't just "run" a program.
- On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.
- You need to synchronize data access between CPU-GPU and GPU workloads.
- You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.
- You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.
What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
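To make the complaints above concrete, this is roughly what the "printer over a COM port" dance looks like with the plain OpenCL C API: compile the kernel source at runtime, copy the data across, launch, synchronize, copy back. A hedged sketch with error handling omitted, not code for any particular vendor's stack:

```cpp
#include <CL/cl.h>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    cl_int err;

    // 1. Discover a device and build all the machinery around it.
    cl_platform_id platform;  clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // 2. Compile the kernel source *at runtime* -- you can't just "run" a program.
    const char* src =
        "__kernel void scale(__global float* x) {"
        "    size_t i = get_global_id(0); x[i] *= 2.0f;"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);

    // 3. The GPU can't touch the CPU's data structures: copy everything over.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, data.size() * sizeof(float), nullptr, &err);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, data.size() * sizeof(float), data.data(), 0, nullptr, nullptr);

    // 4. Launch, then explicitly synchronize and copy the results back.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = data.size();
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, data.size() * sizeof(float), data.data(), 0, nullptr, nullptr);
    clFinish(q);
    return 0;
}
```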
>What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does.
For "embarrassingly parallel" jobs vector extensions start to eat tiny bits of the GPU pie.
Unfortunately, just slapping thousands of cores works poorly in practice. You quickly get into the synchronization wall caused by unified memory. GPUs cleverly work around this issue by using numerous tricks often hidden behind extremely complex drivers (IIRC CUDA exposes some of this complexity).
The future may be in a more explicit NUMA, i.e. in the "network of cores". Such hardware would expose a lot of cores with their own private memory (explicit caches, if you will) and you would need to explicitly transact with the bigger global memory. But, unfortunately, programming for such hardware would be much harder (especially if code has to be universal enough to target different specs), so I don't have high hopes for such paradigm to become massively popular.
Seems to me there's a trend toward applying explicit distributed systems: networks of small-SRAM cores, each with some SIMD, explicit high-bandwidth message passing between them, maybe some specialized ASICs such as tensor cores, FFT blocks... Looking at Tenstorrent, Cerebras, even Kalray, and out of the CUDA/GPU world, accelerators seem to be converging a bit. We're going to need a whole lot of tooling, hopefully relatively 'meta'.
Networks of cores ... Congrats you have just taken a computer and shrunk it so there are many on a single chip ... Just gonna say here AWS does exactly this network of computers thing ... Might be profitable
What I want is a linear algebra interface, as Gilbert Strang taught it. I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.
I'm not willing to even know about the HW at all; the higher-level my code, the more opportunities for the JIT to optimize it.
What I really want is something like Mathematica that can JIT to GPU.
As another commenter mentioned all the API's assume you're a discrete GPU off the end of a slow bus, without shared memory. I would kill for an APU that could freely allocate memory for GPU or CPU and change ownership with the speed of a pagefault or kernel transition.
To expand on this link, this is probably the closest you're going to get to 'I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.' right now. JAX implements a good portion of the Numpy interface - which is the most common interface for linear algebra-heavy code in Python - so you can often just write Numpy code, but with `jax.numpy` instead of `numpy`, then wrap it in a `jax.jit` to have it run on the GPU.
You can have that today. Just go out and buy more CPUs until they have enough cores to equal the number of SMs in your GPU (or memory bandwidth, or whatever). The problem is that the overhead of being general purpose -- prefetch, speculative execution, permissions, complex shared cache hierarchies, etc -- comes at a cost. I wish it was free, too. Everyone does. But it just isn't. If you have a workload that can jettison or amortize these costs due to being embarrassingly parallel, the winning strategy is to do so, and those workloads are common enough that we have hardware for column A and hardware for column B.
Larrabee was something like that; it didn't take off.
IMHO, the real issue is cache coherence. GPUs are spared from doing a lot of extra work by relaxing coherence guarantees quite a bit.
Regarding the vendor situation - that's basically how most of computing hardware is, save for the PC platform. And this exception is due to Microsoft successfully commoditizing their complements (which caused quite some woe on the software side back then).
Is cache coherence a real issue, absent cache contention? AIUI, cache coherence protocols are sophisticated enough that they should readily adapt to workloads where the same physical memory locations are mostly not accessed concurrently except in pure "read only" mode. So even with a single global address space, it should be possible to make this work well enough if the programs are written as if they were running on separate memories.
It is, because cache coherence requires extra communication to make sure that the caches are coherent. There are cute strategies for reducing the traffic, but ultimately you need to broadcast reservations to all of the other cache-coherent nodes, so there's an N^2 scaling at play.
> What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory...
I too am very interested in this model. The Linux kernel supports up to 4,096 cores [1] on a single machine. In practice, you can rent a c7a.metal-48xl [2] instance on AWS EC2 with 192 vCPU cores. As for programming models, I personally find the Java Streams API [3] extremely versatile for many programming workloads. It effectively gives a linear speedup on serial streams for free (with some caveats). If you need something more sophisticated, you can look into OpenMP [4], an API for shared-memory parallelization.
I agree it is time for some new ideas in this space.
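For reference, a minimal sketch of the OpenMP shared-memory style mentioned above: all cores address the same arrays and the runtime carves up the loop across threads. Compile with `-fopenmp` (GCC/Clang); the array sizes and arithmetic here are just placeholders.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    // Every thread sees the same memory; OpenMP just splits the index range.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i) {
        a[i] = a[i] * 3.0 + b[i];
        sum += a[i];
    }

    std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}
```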
Yep, and those printers are proprietary and mutually incompatible, and there are buggy, mutually incompatible serial drivers on all the platforms, which results in unique code paths, debugging, and workarounds for app-breaking bugs for each (platform, printer brand, printer model year) tuple combo.
(That was idealized - actually there may be ~5 alternative driver APIs even on a single platform each with its own strengths)
I really would like you to sketch out the DX you are expecting here, purely for my understanding of what it is you are looking for.
I find needing to write separate code in a different language annoying, but the UX of it is very explicit about what is happening in memory, which is very useful. With really high-performance compute across multiple cores, ensuring you don't get arbitrary cache misses is a pain. If we could address CPUs like we address current GPUs (well, you can, but it's not generally done), it would make things much, much simpler.
Want to alter something in parallel? Copy it to memory allocated to a specific core, which is guaranteed to only be addressed by that core, and then do the operations on it.
To do that currently, you need to be pedantic about alignment, manually indicate thread affinity to the scheduler, etc., which is entirely as annoying as GPU programming.
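As a concrete (Linux-specific) sketch of that pedantry: pin a worker to one core and pad its private state to a cache line so no other core shares the line. The struct, core count, and workload are illustrative assumptions, not a recommendation.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pad each worker's private state to a cache line so cores don't
// accidentally share (and ping-pong) the same line.
struct alignas(64) WorkerState {
    double partial_sum = 0.0;
};

void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Linux/glibc-specific: restrict this thread to a single core.
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    const int cores = 4;  // illustrative
    std::vector<WorkerState> state(cores);
    std::vector<std::thread> workers;
    for (int c = 0; c < cores; ++c) {
        workers.emplace_back([&state, c] {
            for (int i = 0; i < 1000000; ++i)
                state[c].partial_sum += i * 0.5;  // touches only "its" cache line
        });
        pin_to_core(workers.back(), c);
    }
    for (auto& t : workers) t.join();
    return 0;
}
```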
Your wish sounds to me a lot like Larrabee/Xeon Phi or manycore CPUs. Maybe I am misunderstanding something, but it sounds like a good idea to me and I don’t totally see why it inherently can’t compete with GPUs.
I think Intel should have made more of an effort to get cheap Larrabee boards to developers; they could have been ones with chips that had some broken cores or couldn't make the design speed.
Doesn't matter. The issues you raise are abstractable at the language level, or maybe even the runtime. Unfortunately, there are others, like which of the many kinds of parallelism to use (ILP, thread, vector/SIMD, distributed memory with much lower performance, etc.), that are harder to hide behind a compiler with acceptable performance.
Having worked for a company that made a "hundreds of small CPUs on a single chip", I can tell you now that they're all going to fail because the programming model is too weird, and nobody will write software for them.
Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.
The key transformation required to make any parallel architecture work is going to be taking a program that humans can understand, and translating it into a directed acyclic graph of logical Boolean operations. This type of intermediate representation could then be broken up into little chunks for all those small CPUS. It could be executed very slowly using just a few logic gates and enough ram to hold the state, or it could run at FPGA speeds or better on a generic sea of LUTs.
Mill Computing's proposed architecture is more like VLIW, with lots of custom "tricks" in the ISA and programming model to make it nearly as effective as the usual out-of-order execution, than a "generic sea" of small CPUs. VLIW CPUs are far from 'tiny' in a general sense.
Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Imagine a sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors. The programming for this, even as virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.
> Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
> Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Modern parallel scheduling systems still have "queues" to manage these concerns; they're just handled in software, with patterns like "work stealing" that describe what happens when unexpected mismatches in execution time must somehow be handled. Even your "sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors" has queues, only the queue is called a "pipeline" and a mismatch in execution speed leads to "pipeline bubbles" and "stalls". You can't really avoid these issues.
The CM architecture or programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was, however, much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.
Yep, transputers failed miserably. I wrote a ton of code for them. Everything had to be solved over a serial bus, which defeated the purpose of the transputer.
Not in those terms, but an autobiography is coming, and bits and pieces are being explained. I expect about 10 people to buy the book, as all of the socialists will want it for free. I am negotiating with a publisher as we speak on the terms.
Could you elaborate on this? How does many-small-CPUs make for a weirder programming model than a GPU?
I'm no expert, but I've done my fair share of parallel HPC stuff using MPI, and a little bit of CUDA. And to me the GPU programming model is far, far "weirder" and harder to code for than the many-CPUs model. (Granted, I'm assuming you're describing a different regime?)
In CUDA you don't really manage the individual compute units, you start a kernel, and the drivers take care of distributing that to the compute cores and managing the data flows between them.
When programming CPUs however you are controlling and managing the individual threads. Of course, there are libraries which can do that for you, but fundamentally it's a different model.
The GPU equivalent of a single CPU "hardware thread" is called a "warp" or a "wavefront". GPUs can run many warps/wavefronts per compute unit by switching between warps to hide memory access latency. A CPU core can do this with two hardware threads, using Hyperthreading/2-way SMT, and some CPUs have 4-way SMT, but GPUs push that quite a bit further.
What you say has nothing to do with CPU vs. GPU, or with CUDA, which is basically equivalent to the older OpenMP.
When you have a set of concurrent threads, each thread may run a different program. There are many applications where this is necessary, but such applications are hard to scale to very high levels of concurrency, because each thread must be handled individually by the programmer.
Another case is when all the threads run the same program, but on different data. This is equivalent to a concurrent execution of a "for" loop, which is always possible when the iterations are independent.
The execution of such a set of threads that execute the same program has been named "parallel DO instruction" by Melvin E. Conway in 1963, "array of processes" by C. A. R. Hoare in 1978, "replicated parallel" in the Occam programming language in 1985, SPMD around the same time, "PARALLEL DO" in the OpenMP Fortran language extension in 1997, "parallel for" in the OpenMP C/C++ language extension in 1998, and "kernel execution" in CUDA, which has also introduced the superfluous acronym SIMT to describe it.
When a problem can be solved by a set of concurrent threads that run the same program, then it is much simpler to scale the parallelism to extremely high levels and the parallel execution can usually be scheduled by a compiler or by a hardware controller without the programmer having to be concerned with the details.
There is no inherent difficulty in making a compiler that provides exactly the same programming model as CUDA, but which creates a program for a CPU, not for a GPU. Such compilers exist, e.g. ispc, which is mentioned in the parent article.
The difference between GPUs and CPUs is that the former appear to have some extra hardware support for what you describe as "distributing that to the compute cores and managing the data flows between them", but nobody is able to tell exactly what is done by this extra hardware support and whether it really matters, because it is a part of the GPUs that has never been documented publicly by the GPU vendors.
From the point of view of the programmer, this possible hardware advantage of the GPUs does not really matter, because there are plenty of programming language extensions for parallelism and libraries that can take care of the details of thread spawning and work distribution over SIMD lanes, regardless if the target is a CPU or a GPU.
Whenever you write a program equivalent to a "parallel for", which is the same as writing for CUDA, you do not manage individual threads, because what you write, the "kernel" in CUDA lingo, can be executed by thousands of threads, also on a CPU, not only on a GPU. A desktop CPU like the Ryzen 9 9950X has the same product of threads by SIMD lanes as a big integrated GPU (obviously, discrete GPUs can be many times bigger).
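A small illustration of that point in standard C++: the "kernel" below is just a function of the element index, and the parallel algorithms library plays the role of the launch machinery, spreading it over however many hardware threads (and, with `par_unseq`, possibly SIMD lanes) the CPU has. This is a sketch of the programming model, not a performance claim.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);

    // The "kernel": one element's worth of work, written once,
    // executed by however many threads/lanes the runtime decides to use.
    auto kernel = [&](std::size_t i) { y[i] = 2.0f * x[i] + y[i]; };

    // Parallel + vectorizable execution policy: a CPU-side analogue of
    // launching the kernel over a 1-D grid.
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(), kernel);
    return 0;
}
```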
I'm guessing you just don't have the computational power to compete with a real GPU. It would be relatively easy for a top-end graphics programmer to write the front-end graphics API for your chip. I'm guessing that if they did this, you would just end up with a very poor-performing GPU.
My take from reading this is that it's more about programming abstractions than any particular hardware instantiation. The part of the Connection Machine that remains interesting is not building machines with CPUs with transistor counts in the hundreds running off a globally synchronous clock, but that there was a whole family of SIMD languages that let you do general-purpose programming in parallel. And those languages were still relevant when the architecture changed to a MIMD machine with a bunch of vector units behind each CPU.
They aren't similar; they couldn't be more different. One is about lots of small threads of execution communicating with each other and synchronizing; the other is about a few instructions being run in parallel because, implicitly, the CPU has different pipelines inside.
They aren't just different, they are at completely opposite ends of the programming spectrum. They are literally the two extremes of trying to make throughput faster.
> The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
What other workloads would benefit from a GPU?
Computers are so fast that in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's because that program's code is particularly bad, and the solution to make the code less bad is simpler than re-writing it for the GPU.
For example, GUIs have had imperceptible latency to user input for over 20 years. If an app's GUI feels sluggish, the problem is that the app's actions and rendering aren't on separate coroutines, or the action's coroutine is blocking (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it is today; I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.
In some cases, parallelizing a task intrinsically makes it slower, because the number of sequential operations required to handle coordination means there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads but they only run on 8-16 processors, so the program would be faster if it spawned fewer threads, because it would still use all the processors.
I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.
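As a small aside on the oversubscription point: a common fix is simply to size the worker pool to the hardware rather than to the number of tasks. A hedged sketch (the task count and chunking are placeholders):

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // Spawn only as many workers as there are hardware threads, then hand
    // each a contiguous chunk of the task range instead of one thread per task.
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t tasks = 100000;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([=] {
            const std::size_t begin = tasks * w / workers;
            const std::size_t end   = tasks * (w + 1) / workers;
            for (std::size_t t = begin; t < end; ++t) {
                // ... do task t ...
            }
        });
    }
    for (auto& t : pool) t.join();
    std::printf("ran %zu tasks on %u workers\n", tasks, workers);
    return 0;
}
```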
Possibly compilation and linking. That's very slow for big programs like Chromium. There's really interesting work on GPU compilers (co-dfns and Voetter's work).
Optimization problems like scheduling and circuit routing. Search in theorem proving (the classical parts like model checking, not just LLM).
There's still a lot that is slow and should be faster, or at the very least made to run using less power. GPUs are good at that for graphics, and I'd like to see those techniques applied more broadly.
All of these things you mention are "thinking", meaning they require complex algorithms with a bunch of branches and edge cases.
The tasks that GPUs are good at right now - graphics, number crunching, etc. - are all very simple algorithms at the core (mostly elementary linear algebra), and the problems are, in most cases, embarrassingly parallel.
CPUs are not very good at branching either - see all the effort being put towards getting branch prediction right - but they are way better at it than GPUs. The main appeal of GPGPU programming is, in my opinion, that if you can get the CPU to efficiently divide the larger problem into a lot of small, simple subtasks, you can achieve faster speeds.
You mentioned compilers. For a related example, see all the work Daniel Lemire has been doing on SIMD parsing: the algorithms he (co)invented are all highly specialized to the language, and highly nontrivial. Branchless programming requires an entirely different mindset/intuition than "traditional" programming, and I wouldn't expect the average programmer to come up with such novel ideas.
A GPU is a specialized tool that is useful for a particular purpose, not a silver bullet to magically speed up your code. There is a reason that we are using it for its current purposes.
A big one is video encoding. It seems like GPUs would be ideal for it but in practice limitations in either the hardware or programming model make it hard to efficiently run on GPU shader cores. (GPUs usually include separate fixed-function video engines but these aren't programmable to support future codecs.)
Video encoding is done with fixed-function hardware for power efficiency. A popular new codec like H.26x appears every 5-10 years; there is no real need to support future ones.
Video encoding is two domains. And there's surprisingly little overlap between them.
You have your real time video encoding. This is video conferencing, live television broadcasts. This is done fixed-function not just for power efficiency, but also latency.
The second domain is encoding at rest. This is youtube, netflix, blu-ray, etc. This is usually done in software on the CPU for compression ratio efficiency.
The problem with fixed function video encoding is that the compression ratio is bad. You either have enormous data, or awful video quality, or both. The problem with software video encoding is that it's really slow. OP is asking why we can't/don't have the best of both worlds. Why can't/don't we write a video encoder in OpenCL/CUDA/ROCm. So that we have the speed of using the GPU's compute capability but compression ratio of software.
I haven't yet read the full blog post, but so far my response is: you can have this good parallel computer. See my previous HN comments from the past months on building an M4 Mac mini supercomputer.
For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets, the IOMMU, and the page tables that prevent you from programming all processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and making your own abstract-syntax-tree-to-assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.
There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need to have a large community of people who can run their code. Great projects die all the time because a slightly worse, but more ubiquitous technology prevents flowering of new approaches. There are economies of scale that feed back into ever-improving iterations of existing systems.
Simply porting existing successful codes from CPU to GPU can be a major undertaking and if there aren't any experts who can write something that drive immediate sales, a project can die on the vine.
See for example https://en.wikipedia.org/wiki/Cray_MTA. When I was first asked to try this machine, it was pitched as "run a million threads, the system will context switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it to GPUs.
AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.
I've found the best strategy is to target my development for what the high-end consumers are buying in 2 years - this is similar to many games, which launch with terrible performance on the fastest commercially available card, then run great 2 years later when the next gen of cards arrives ("Can it run Crysis?").
Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
Now, 3D renderers, we need all the help we can get.
In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.
Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust, incidentally, is an wrapper to extend that concept to cross-platform (Mac, Android, browsers, etc.)
While it seems you can write a general 3D renderer that works in a wide variety of situations, that does not work well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.
There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.
In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.
The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process.
This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.
Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
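A hedged sketch of what that lambda-based contract could look like (the names here are hypothetical, not from any existing engine): the renderer asks the caller to enumerate objects near each light, so the caller's spatial structures stay on the caller's side of the API.

```cpp
#include <functional>
#include <vector>

// Hypothetical types standing in for the renderer's view of the scene.
struct Light    { float pos[3]; float radius; };
struct ObjectId { unsigned index; };

// Supplied by the caller: "call `emit` once for every object within range of
// the light", using whatever spatial index the engine/game already maintains.
using ObjectsNearLight =
    std::function<void(const Light&, const std::function<void(ObjectId)>& emit)>;

struct Renderer {
    ObjectsNearLight query;  // provided by the caller at setup time

    void shade_lights(const std::vector<Light>& lights) {
        for (const Light& l : lights) {
            // Instead of O(lights * objects), only touch the objects the
            // caller's spatial structure says are in range of this light.
            query(l, [&](ObjectId obj) {
                // ... accumulate this light's contribution for `obj` ...
                (void)obj;
            });
        }
    }
};
```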
I think a dynamic, fully vector-based 2D interface with fluid zoom and transformations at 120Hz+ is going to need all the GPU help it can get. Take mapping as an example: even Google Maps routinely struggles with performance on a top-of-the-line iPhone.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.
A ton of 2D applications could benefit from further GPU parallelization. Games, GUIs, blurs & effects, 2D animations, map apps, text and symbol rendering, data visualization...
Canvas2D in Chrome is already hardware accelerated, so most users get better performance and reduced load on main UI & CPU threads out of the box.
Fast light transport is an incredibly hard problem to solve.
Raytracing (in its many forms) is one solution. Precomputing lightmaps, probes, occluder volumes, or other forms of precomputed visibility are another.
In the end it comes down to a combination of target hardware, art direction and requirements, and technical skill available for each game.
There's not going to be one general purpose renderer you can plug into anything, _and_ expect it to be fast, because there's no general solution to light transport and geometry processing that fits everyone's requirements. Precomputation doesn't work for dynamic scenes, and for large games leads to issues with storage size and workflow slow downs across teams. No precomputation at all requires extremely modern hardware and cutting edge research, has stability issues, and despite all that is still very slow.
It's why game engines offer several different forms of lighting methods, each with as many downsides as they have upsides. Users are supposed to pick the one that best fits their game, and hope it's good enough. If it's not, you write something custom (if you have the skills for that, or can hire someone who can), or change your game to fit the technical constraints you have to live with.
> Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
Some games may have their own acceleration structures. Some won't. Some will only have them on the GPU, not the CPU. Some will have an approximate structure used only for specialized tasks (culling, audio, lighting, physics, etc), and cannot be generalized to other tasks without becoming worse at their original task.
Fully generalized solutions will be slow but flexible, and fully specialized solutions will be fast but inflexible. Game design is all about making good tradeoffs.
The same argument could be made against Vulkan, or OpenGL, or even SQL databases. The whole NoSQL era was based on the concept that performance would be better with less generality in the database layer. Sometimes it helped. Sometimes trying to do database stuff with key/value stores made things worse.
I'm trying to find a reasonable medium. I have a hard scaling problem - big virtual world, dynamic content - and am trying to make that work well. If that works, many games with more structured content can use the same approach, even if it is overkill.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
You can have an unlimited number of polygons overlapping a pixel, for instance if you zoom out a lot. Imagine you converted a layer map of a modern CPU design to SVG and tried to open it in Inkscape. Or a map of NYC. Wouldn't you think a bit of extra processing power would be welcome?
At Vulkanised 2025 someone mentioned it is a HAL for writing GPU drivers, and they have acknowledged it has gotten as messy as OpenGL; there is now a plan in place to try to sort out the complexity mess.
> I believe there are two main things holding it back. One is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs … struggle when the workload is dynamic
This sacrifice is a purposeful cornerstone of what allows GPUs to be so high throughput in the first place.
It is odd that he talks about Larrabee so much, but doesn't mention the Xeon Phis. (Or is it Xeons Phi?)
> As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.
I’ve always been slightly annoyed by the concept of E cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E cores, give them their AVX-512 back, and give them higher throughput memory. Maybe try and pull the Phi trick of less OoO capabilities but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.
I've always wondered if you could use iGPU compute cores with unified memory as "transparent" E-cores when needed.
Something like OpenCL/CUDA except it works with pthreads/goroutines and other (OS) kernel threading primitives, so code doesn't need to be recompiled for it. Ideally the OS scheduler would know how to split the work, similar to how E-core and P-core scheduling works today.
I don't do HPC professionally, so I assume I'm ignorant to why this isn't possible.
It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
The "Larrabee New Instructions" is an instruction set that has been designed before AVX and also its first hardware implementation has been introduced before AVX, in 2010 (AVX was launched in 2011, with Sandy Bridge).
Unfortunately, while the hardware design of Sandy Bridge with the inferior AVX ISA was done by the Intel A team, the hardware implementations of Larrabee were done by some C or D teams, which were also not able to design new CPU cores for it but had to reuse some obsolete x86 cores, initially a Pentium core and later an Atom Silvermont core, to which the Larrabee instructions were grafted.
"Larrabee New Instructions" have been renamed to "Many Integrated Cores" ISA, then to AVX-512, while passing through 3 generations of chips, Knights Ferry, Knights Corner and Knights Landing. A fourth generation, Knights Mill, was only intended for machine learning/AI applications. The successor of Knights Landing has been Skylake Server, when the AVX-512 ISA has come to standard Xeons, marking the disappearance of Xeon Phi.
Already in 2013, Intel Haswell has added to AVX a few of the more important instructions that were included in the Larrabee New Instructions, but which were missing in AVX, e.g. fused multiply-add and gather instructions. The 3-address FMA format, which has caused problems to AMD, who had implemented in Bulldozer a 4-address format, has also come to AVX from Larrabee, replacing the initial 4-address specification.
At each generation until Skylake Server, some of the original Larrabee instructions have been deleted, by assuming that they might be needed only for graphics, which was no longer the intended market. However a few of those instructions were really useful for some applications in which I am interested, e.g. for computations with big numbers, so I regret their disappearance.
Since Skylake Server, there have been no other instruction removals, with the exception of those introduced by Intel Tiger Lake, which are now supported only by AMD Zen 5. A few days ago Intel committed to keeping complete compatibility in the future with the ISA implemented today by Granite Rapids, so there will be no other instruction deletions.
> It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
This is an odd claim. Clearly Xeon Phi is the shipping version of Larrabee, while Zen 4 is a completely different chip design that happens to run AVX-512. The first shipping Xeon Phi (Knights Corner) used the exact same P54C cores as Larrabee, while as you point out later versions of Xeon Phi switched to Atom.
It is extremely common to refer to all these as Larrabee, for example the Ian Cutress article on the last Xeon Phi chip was entitled "The Larrabee Chapter Closes: Intel's Final Xeon Phi Processors Now in EOL" [1]. Pat Gelsinger's recent interview at GTC [2] also refers to Larrabee. The section from around 44:00 has a discussion of workloads becoming more dynamic, and at 53:36 there's a section on Larrabee proper.
I think it is not right to say that Larrabee and Phi are as distant as Larrabee and Zen. But they did retreat a bit from the "graphics card"-like functionality and scale back the ambitions to become something a bit more familiar.
Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified - which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.
You can. There are API extensions for persistently mapping memory, and it's up to you to ensure that you never write to a buffer at the same time the GPU is reading from it.
At least for Vulkan/DirectX12. Metal is often weird, I don't know what's available there.
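For example, with Vulkan on unified-memory hardware you can pick a memory type that is both DEVICE_LOCAL and HOST_VISIBLE, map it once, and keep writing into it from the CPU. The sketch below omits buffer creation/binding and error handling, and assumes you handle the CPU-write vs. GPU-read hazards yourself, as noted above.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: allocate GPU-visible memory the CPU can write directly (typical on
// unified-memory parts), map it once, and keep the pointer around.
void* map_shared_allocation(VkPhysicalDevice phys, VkDevice device,
                            VkDeviceSize size, VkDeviceMemory* out_mem) {
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;

    uint32_t type = UINT32_MAX;
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
        if ((props.memoryTypes[i].propertyFlags & wanted) == wanted) { type = i; break; }

    VkMemoryAllocateInfo alloc{};
    alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.allocationSize = size;
    alloc.memoryTypeIndex = type;
    vkAllocateMemory(device, &alloc, nullptr, out_mem);

    // Persistent map: no per-frame copies, just write into this pointer.
    // Synchronizing with GPU reads is entirely your problem.
    void* ptr = nullptr;
    vkMapMemory(device, *out_mem, 0, VK_WHOLE_SIZE, 0, &ptr);
    return ptr;
}
```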
Let’s be honest, saying “just fix the page tables” is like telling someone they can fly if they “just rewrite gravity.”
Yes, on Apple Silicon, the hardware supports shared physical memory, and with enough “convincing”, you can rig up a contiguous virtual address space for both the CPU and GPU. Apple’s unified memory architecture makes that possible, but Apple’s APIs and memory managers don’t expose this easily or safely for a reason. You’re messing with MMU-level mappings on a tightly integrated system that treats memory as a first-class citizen of the security model.
Oh yes, I programmed all the Amiga models, mostly at the assembly level. I reprogrammed the ROMs. I also published a magazine on all the Commodore computers' internals and built lots of hardware for these machines.
We had the parallel Inmos Transputer systems during the heyday of the Amiga; they were much better designed than any of the custom Amiga chips.
Inmos was a disaster. No application ever shipped on one. EVER. It used a serial bus to resolve the problems that should have never been problems. Clearly you never wrote code for one. Each oslink couldn't reach more than 3 feet. What a disaster that entire architecture was.
I shipped 5 applications on an 800 Inmos Transputer supercomputer. Sold my parallel C compilers and macro assembler, as well as an OS, a Macintosh NuBus interface card, Transputer graphics cards, a full paper copier, and a laser printer. I know of dozens of successful products.
Hey don't shit on my retro alternative timeline nostalgia. We were all writing Lisp programs on 64 CPU Transputer systems with FPGA coprocessors, dynamically reconfigured in realtime with APL.
/s/LISP/Prolog and you've basically described the old "Fifth Generation" research project. Unfortunately it turns out that trying to parallelize Prolog is quite a nightmare, the language is really, really not built for it. So the whole thing was a dead-end in practice. Arguably we didn't have a real "fifth-gen" programming language prior to Rust, given how it manages to uniquely combine ease of writing parallel+concurrent code with bare-metal C like efficiency. (And Rust is now being used to parallelize database query, which comfortably addresses the actual requirement that Prolog had been intended for back then - performing "search" tasks on large and complex knowledge bases.)
I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.
I implemented some evolutionary computation stuff on the Cell BE in college. It was a really interesting machine and could be very fast for its time but it was somewhat painful to program.
The main cores were PPC and the Cell cores were… a weird proprietary architecture. You had to write kernels for them like GPGPU, so in that sense it was similar. You couldn’t use them seamlessly or have mixed work loads easily.
Larrabee and Xeon Phi are closer to what I’d want.
I’ve always wondered about many—many-many-core CPUs too. How many tiny ARM32 cores could you put on a big modern 5nm die? Give each one local RAM and connect them with an on die network fabric. That’d be an interesting machine for certain kinds of work loads. It’d be like a 1990s or 2000s era supercomputer on a chip but with much faster clock, RAM, and network.
Each AIE tile can stream 64 Gbps in and out and perform 1024 bit SIMD operations. Each shares memory with its neighbors and the streams can be interconnected in various ways.
Clearly the author never worked with a CM2 - I did though.
The CM2 was more like a co-processor which had to be controlled by a (for that age) rather beefy SUN workstation/server. The program itself ran on the workstation which then sent the data-parallel instructions to the CM2.
The CM2 was an extreme form of a SIMD design (that is why it was called data parallel). You worked with a large rectangular array (I cannot recall up to how many dimensions) whose size had to be a multiple of the number of physical processors (in your partition). All cells typically performed exactly the same operation. If you wanted to perform an operation on a subset, you had to "mask" the other cells (which were essentially idling during that time).
AMD Strix Halo APU is a CPU with very powerful integrated GPU.
It's faster at AI than an Nvidia RTX 4090, because 96GB of the 128GB can be allocated to the GPU memory space. This means it doesn't have the same swapping/memory thrashing that a discrete GPU experiences when processing large models.
16 CPU cores and 40 GPU compute units sounds pretty parallel to me.
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
Not definitely. The RTX 4090 definitely uses fast graphics RAM (though it is usually previous generation, but overclocked and on a very wide bus).
AMD Strix Halo definitely uses standard DDR5, which is not as fast.
And yes, the Strix Halo GPU uses "3D cache", but as officials said, the CPU doesn't have access to the GPU cache, because they "have not seen any app significantly benefit from such access".
So the internal SoC bus probably has less latency than a discrete GPU on PCIe, but it's not too different.
It looks like it will be available in the Framework Desktop! I would love to see it in a more budget mini PC at some point from another company. (Framework is great but not in my price range.)
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
I love AMD's Ryzen chips and will recommend their laptops over an Nvidia model all day. However, this is a pretty facetious comparison that falls apart when you normalize the memory. Any chip can be memory bottlenecked, and if we take away that arbitrary precondition the Strix Halo gets trounced in terms of compute capacity. You can look at the TDP of either chip and surmise this pretty easily.
> However, this is a pretty facetious comparison that falls apart when you normalize the memory
Why would you normalize though? You can't buy a 96 GB RTX4090. So it's fair to compare the whole deal, slowish APU with large RAM versus very fast GPU with limited RAM.
It is fair, it should just be contextualized with a comparison of 13B or 32B models as well. This is one of those Apple marketing moves where a very specific benchmark has been cherry-picked for a "2.2x improvement!" headline that people online misconstrue.
“AMD also claims its Strix Halo APUs can deliver 2.2x more tokens per second than the RTX 4090 when running the Llama 70B LLM (Large Language Model) at 1/6th the TDP (75W).”
This is still a memory-constrained benchmark. The smallest Llama 70B model (gguf-q2) doesn't fit in-memory so is bottlenecked by your PCIe connector. It's a valid benchmark, but it's still guilty of being stacked in the exact way I described before.
A comparison of 7B/13B/32B model performance would actually test the compute performance of either card. AMD is appealing to the consumers that don't feel served by Nvidia's gaming lineup, which is fine but also doomed if Nvidia brings their DGX Spark lineup to the mobile form factor.
Are you arguing for a better software abstraction, a different hardware abstraction or both? Lots of esoteric machines are name dropped, but it isn't clear how that helps your argument.
I think a stronger essay would at the end give the reader a clear view of what Good means and how to decide if a machine is closer to Good than another machine and why.
SIMD machines can be turned into MIMD machines. Even hardware problems still need a software solution. The hardware is there to offer the right affordances for the kinds of software you want to write.
Lots of words that are in the eye of the beholder. We need a checklist, or that Good parallel computer won't be built.
Personal opinion: it's the software (and software tooling).
The hardware is good enough (even if we're only talking 10x efficiency).
Part of the issue seems slightly cultural, i.e. repetitively putting down the idea of traditional task parallelism (not-super-SIMD/data parallelism) on GPUs.
Obviously, one would lose a lot of efficiency if we literally ran 1 thread per warp. But it could be useful for lightly-data-parallel tasks (like typical CPU vectorization), or maybe using warp-wide semantics to implement something like a "software" microcode engine. Dumb example: implementing division with long division using multiplications and shifts.
Other things a GPU gives: insanely high memory bandwidth, programmable cache (shared memory), and (relatively) great atomic operations.
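To make that dumb example concrete, here is a hedged sketch of "division as a software routine": classic shift-and-subtract long division built only from shifts, compares, and subtracts (rather than the multiply-based variant), the kind of primitive a "software microcode" layer could supply on hardware without an integer divide unit.

```cpp
#include <cstdint>

// Restoring long division: computes n / d using only shifts, compares, and
// subtracts.
uint32_t soft_div_u32(uint32_t n, uint32_t d) {
    if (d == 0) return UINT32_MAX;  // pick a convention for divide-by-zero
    uint32_t q = 0;
    uint64_t r = 0;
    for (int bit = 31; bit >= 0; --bit) {
        r = (r << 1) | ((n >> bit) & 1u);  // bring down the next bit
        if (r >= d) {                      // does the divisor fit?
            r -= d;
            q |= (1u << bit);
        }
    }
    return q;                              // remainder is left in r if needed
}
```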
> Are you arguing for a better software abstraction, a different hardware abstraction or both?
I don't speak for Raph, but imo it seems like he was arguing for both, and I agree with him.
On the hardware side, GPUs have struggled with dynamic workloads at the API level (not e.g. thread-level dynamism, that's a separate topic) for around a decade. Indirect commands gave you some of that so at least the size of your data/workload can be variable if not the workloads themselves, then mesh shaders gave you a little more access to geometry processing, and finally workgraphs and device generated commands lets you have an actually dynamically defined workload (e.g. completely skipping dispatches for shading materials that weren't used on screen this frame). However it's still very early days, and the performance issues and lack of easy portability are problematic. See https://interplayoflight.wordpress.com/2024/09/09/an-introdu... for instance.
On the software side shading languages have been garbage for far longer than hardware has been a problem. It's only in the last year or two that a proper language server for writing shaders has even existed (Slang's LSP). Much less the innumerable driver compiler bugs, lack of well defined semantics and memory model until the last few years, or the fact that we're still manually dividing work into the correct cache-aware chunks.
I had hoped the GPU API would go away, and the entire thing would become fully programmable, but so far we just keep using these shitty APIs and horrible shader languages.
Personally I would like to use the same language I write the application in to write the rendering code (C++). Preferably with shared memory, not some separate memory system that takes forever to transfer anything, either. Something along the lines of the new AMD 360 Max chips, but with graphics written in explicit C++.
I was always fascinated by the prospects of the 1024-core Epiphany-V from Parallella.. https://parallella.org/2016/10/05/epiphany-v-a-1024-core-64-...
But it seems whatever the DARPA connection was has led to it not being for scruffs like me and is likely powering god knows what military systems..
The problem isn't address space or program counters. It's that each processor is going to need instruction memory stored in SRAM or an extremely efficient multi port memory for a shared instruction cache.
GPUs get around this limitation by executing identical instructions over multiple threads.
Instructions are the problem, you have to have an architecture which just operates on data flows all in parallel and all at once, like an FPGA, but without all the fiddly special sauce parts.
There are designs like Tilera and Phalanx that have tons of cores. Then, NUMA machines used to have 128-256 sockets in one machine with coherent memory. The SGI machines let you program them like it was one machine. Languages like Chapel were designed to make parallel programming easier.
Making more things like that at the lowest possible unit prices could help a lot.
It lacks support for the serial portions of the execution graph, but yes. You should play around with ONNX, it can be used for a lot more than just ML stuff.
If we had distributed operating systems and SSI kernels, your computer could use the idle cycles of other computers [that aren't on battery power]. People talk about a grid of solar houses, but we could've had personal/professional grid computing like 15 years ago. Nobody wanted to invest in it, I guess because chips kept getting faster.
SSI is an interesting idea, but the actual advantage is mostly to improve efficiency when running your distributed code on a single, or few nodes. You still have to write your code with some very real awareness of the relevant issues when running on many nodes, but now you are also free to "scale down" and be highly efficient on a single node, since your code is still "natively" written for running on that kind of system. You are not going to gain much by opportunistically running bad single-node codes on larger systems, since that will be quite inefficient anyway.
Also, running a large multi-node SSI system means you mostly can't partition those nodes ever, otherwise the two now-separated sets of nodes could both progress in ways that cannot be cleanly reconciled later. This is not what people expect most of the time when connecting multiple computers together.
You could say the same thing about multiple cores or CPUs. A lot of people write apps that aren't useful past a single core or CPU. Doesn't mean we don't build OSes & hardware for multiple cores... (Remember back when nobody had an SMP kernel, because, hey, who the hell's writing their apps for more than one CPU?! Our desktops aren't big iron!)
In the worst-case, your code is just running on the CPU you already have. If you have another node/CPU, you can schedule your whole process on that one, which frees up your current CPU for more work. If you design your app to be more scalable to more nodes/CPUs, you get more benefits. So even in the worst case, everything would just be... exactly the way it is today. But there are many cases that would be benefited, and once the platform is there, more people would take advantage of it.
There is still a massive opportunity in general parallel computing that we haven't explored. Plenty of research, but along specific kinds of use cases, and with not nearly enough investment, so the little work that got done took decades. I think we could solve all the problems and make it generally useful, which could open up a whole new avenue of computing / applications, the way more bandwidth did.
(I'm referring to consumer use-cases above, but in the server world alone, a distributed OS with simple parallel computing would transform billion-dollar markets in software, making a whole lot of complicated solutions obsolete. It might take a miracle for the code to get adopted upstream by the Linux Mafia, though)
> It might take a miracle for the code to get adopted upstream by the Linux Mafia, though
The basic building block is containerization/namespacing, which has been adopted upstream. If your app is properly containerized, you can use the CRIU featureset (which is also upstream) to checkpoint it and migrate it to another node.
What about unified memory? I know these APUs are slower than traditional GPUs but still it seems like the simpler programming model will be worth it.
The biggest problem is that most APUs don't even support full unified memory (system SVM in OpenCL). From my research only Apple M series, some Qualcomm Adreno and AMD APUs support them.
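For reference, here is roughly what the simpler model buys you where it is supported. This is a minimal CUDA sketch using managed memory (NVIDIA's closest widely available analogue to system SVM, not the same thing): on an APU the pointer really is the same physical memory, while on a discrete card the driver migrates pages behind the scenes, so it is convenient but not free.

#include <cstdio>
#include <cuda_runtime.h>

// Double every element in place; the same pointer is valid on host and device.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation visible to both CPU and GPU: no explicit staging buffers or copies.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // plain CPU writes
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU reads/writes the same pointer
    cudaDeviceSynchronize();                         // wait before touching it on the CPU again

    printf("data[0] = %f\n", data[0]);               // prints 2.0
    cudaFree(data);
    return 0;
}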
I wonder if CDN server applications could use something like this, if every core had a hardware TCP/TLS stack and there was a built-in IP router to balance the load, or something like that.
There's a lot here that seems to misunderstand GPUs and SIMD.
Note that raytracing is a very dynamic problem, where the GPU isn't sure if a ray hits a geometry or if it misses. When it hits, the ray needs to bounce, possibly multiple times.
Various implementations of raytracing, recursion, dynamic parallelism, or whatever. It's all there.
Now the software / compilers aren't ready (outside of specialized situations like Microsoft's DirectX Raytracing, which compiles down to a very intriguing threading model). But what was accomplished with DirectX can be done in other situations.
-------
Connection Machine is before my time, but there's no way I'd consider that 80s hardware to be comparable to AVX2 let alone a modern GPU.
Connection Machine was a 1-bit computer for crying out loud, just 4096 of them in parallel.
Xeon Phi (70 core Intel Atoms) is slower and weaker than 192 core Modern EPYC chips.
-------
Today's machines are better. A lot better than the past machines. I cannot believe any serious programmer would complain about the level of parallelism we have today and wax poetic about historic and archaic computers.
The problems I'm having are very different from those for raytracing. Sure, it's dynamic, but at a fine granularity, so the problems you run into are divergence, and often also wanting function pointers, which don't work well in a SIMT model. By contrast, the way I'm doing 2D there's basically no divergence (monoids are cool that way), but there is a need to schedule dynamically at a coarser (workgroup) level.
But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
The problem with the GPU raytracing work is that they built hardware and driver support for the specific problem, rather than more general primitives on which you could build not only raytracing but other applications. The same story goes for video encoding. Continuing that direction leads to unmanageable complexity.
Of course today's machines are better, they have orders of magnitude more transistors, and crystallize a ton of knowledge on how to build efficient, powerful machines. But from a design aesthetic perspective, they're becoming junkheaps of special-case logic. I do think there's something we can learn from the paths not taken, even if, quite obviously, it doesn't make sense to simply duplicate older designs.
> But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
True. Allocation just seems to be a "forced sequential" operation. A "stop the world, figure out what RAM is available" kind of thing.
If you can work with pre-allocated buffers, then GPUs work by reading from lists (consume operations), and then outputting to lists (append operations). Which can be done with gather / scatter, or more precisely stream-expansion and stream-compaction in a grossly parallel manner.
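A minimal CUDA sketch of the append side, with made-up names and a made-up survival test: each thread that produces a result claims a unique slot with an atomic counter and scatters its element there, and the next launch reads the counter to know how much work survived. Production versions usually aggregate the atomic per warp or per block (or use a scan) to cut contention, but the idea is the same.

#include <cuda_runtime.h>

// Stream-compaction style "append buffer".
__global__ void compact(const float* __restrict__ in, int in_count,
                        float* __restrict__ out, int* out_count,
                        float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= in_count) return;

    float v = in[i];
    if (v > threshold) {                     // "this element still needs processing"
        int slot = atomicAdd(out_count, 1);  // claim a slot in the append list
        out[slot] = v;                       // scatter the surviving element there
    }
}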
---------
If that's not enough "memory management" for you, then yeah, CPU is the better device to work with. At which point I again point back to the 192-core EPYC Zen5c example, we have grossly parallel CPUs today if you need them. Just a few clicks away to rent from cloud providers like Amazon or Azure.
GPUs are good at certain things (and I consider them the pinnacle of "Connection Machine" style programming; today's GPUs are just far more parallel, far easier to program and far faster than the old 1980s stuff).
Some problems cannot be split up (ex: web requests are so unique I cannot imagine they'd ever be programmed into a GPU due to their divergence). However CPUs still exist for that.
> But the biggest problem I'm having is management of buffer space for intermediate objects
My advice for right now (barring new APIs), if you can get away with it, is to pre-allocate a large scratch buffer for as big of a workload as you will have over the program's life, and then have shaders virtually sub-allocate space within that buffer.
If you are willing to have lots of loops inside of a shader (not always possible due to Windows's 2 second maximum), you can write while(hits_array is not empty) kind of code, allowing your 1024-thread workgroup to keep pulling hits and efficiently processing all of the rays recursively.
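A hedged CUDA sketch of that pattern (names, sizes, and the bounce heuristic are invented for illustration). To stay simple and race-free it keeps the drain loop on the host and ping-pongs two pre-allocated append buffers rather than looping inside the kernel; note that the "allocation" is nothing more than an atomicAdd on a counter into pre-allocated storage, which is also how sub-allocating from one big scratch buffer works.

#include <cuda_runtime.h>

struct Ray { float energy; };

// Each thread handles one ray from the current list; rays that survive the bounce
// are appended to the next list by claiming a slot with atomicAdd (the "push").
__global__ void bounce(const Ray* in, int in_count, Ray* out, int* out_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= in_count) return;

    Ray r = in[i];
    r.energy *= 0.5f;                        // stand-in for "shade the hit"
    if (r.energy > 0.01f) {                  // still alive: schedule another bounce
        out[atomicAdd(out_count, 1)] = r;
    }
}

// Host-side drain loop: ping-pong two buffers until no work is left. Assumes both
// buffers were pre-allocated large enough for the worst-case wavefront of work.
void drain(Ray* d_bufA, Ray* d_bufB, int* d_count, int initial_count) {
    Ray* cur = d_bufA;
    Ray* nxt = d_bufB;
    int count = initial_count;
    while (count > 0) {
        cudaMemset(d_count, 0, sizeof(int));
        bounce<<<(count + 255) / 256, 256>>>(cur, count, nxt, d_count);
        cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);  // implicit sync
        Ray* tmp = cur; cur = nxt; nxt = tmp;  // swap producer/consumer roles
    }
}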
--------
The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:
if (func1 needs to be called next){
    push(func1, dataToContinue);
} else if (func2 needs to be called next){
    push(func2, dataToContinue);
} else if (func3 needs to be called next){
    push(func3, dataToContinue);
} else if (func4 needs to be called next){
    push(func4, dataToContinue);
} else if (func5 needs to be called next){
    push(func5, dataToContinue);
}
Now of course we can't grow "too far", GPUs can't handle divergence very well. But for "small" numbers of next-arrays and "small" amounts of divergence (ie: I'm assuming that func1 is the most common here, like 80%+ so that the buffers remain full), then this technique works.
If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (ex: any HTTP server code will be awful on GPUs) and you're forced to use a CPU.
Having talked to many engineers using distributed compute today, they seem to think that (single-node) parallel compute hasn't changed much since ~2010 or so.
It's quite frustrating, and exacerbated by frequent intro-level CUDA blog posts which often just repeat what they've read.
re: raytracing, this might be crazy but, do you think we could use RT cores to accelerate control flow on the GPU? That would be hilarious!
But there is seemingly a generalization here to the Raytracing software ecosystem. I dunno how much software / hardware needs to advance here, but we are at the point where Intel RT cores are passing the stack pointers / instruction pointers between shaders (!!!). Yes through specialist hardware but surely this can be generalized to something awesome in the future?
------
For now, I'm happy with stream expansion / stream compaction and looping over consume buffers and producer/append buffers.
I have never done GPU programming or graphics, but what feels frustrating looking from the outside is that the designs and constraints seem so arbitrary. They don't feel like they come from actual hardware constraints/problems. It just looks like pure path dependency going all the way back to the fixed-function days, with tons of accidental complexity and half-finished generalizations ever since.
"I believe there are two main things holding it back."
He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.
I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.
What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.
I will take how things are today over how things used to be in a heart beat. I really believe I need to spend 2-weeks requiring students write code on an Amiga, and the programs have to run at the same time. If anyone of them crashes, they all will fail my course. A new found appreciation may flourish.
One of the most important steps of my career was being forced to write code for an 8051 microcontroller. Then writing firmware for an ARM microcontroller to make it pretend it was that same 8051 microcontroller.
I was made to witness the horrors of archaic computer architecture in such depth that I could reproduce them on totally unrelated hardware.
I tell students today that the best way to learn is by studying the mistakes others have already made. Dismissing the solutions they found isn’t being independent or smart; it’s arrogance that sets you up to repeat the same failures.
Sounds like you had a good mentor. Buy them lunch one day.
I had a similar experience. Our professor in high school would have us program a z80 system entirely by hand: flow chart, assembly code, computing jump offsets by hand, writing the hex code by hand (looking up op-codes from the z80 data sheet) and the loading the opcodes one byte at the time on a hex keypads.
It took three hours and your of us to code an integer division start to finish (we were like 17 though).
The amount of understanding it gave has been unrivalled so far.
> I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with.
So the designers of the Cell processor made some mistakes and therefore the entire concept is bunk? Because you've seen a concept done badly, you can't imagine it done well?
To be clear, I'm not criticising those designers, they probably did a great job with what they had, but technology has moved on a long way from then... The theoretical foundations for memory models, etc. are much more advanced. We've figured out how to design languages to be memory safe without significantly compromising on performance or usability. We have decades of tooling for running and debugging programs on GPUs and we've figured out how to securely isolate "users" of the same GPU from each other. Programmers are as abstracted from the hardware as they've ever been with emulation of different architectures so fast that it's practical on most consumer hardware.
None of the things you mentioned are inherently at odds with more parallel computation. Whether something is a good idea can change. At one point in time electric cars were a bad idea. Decades of incremental improvements to battery and motor technology means they're now pretty practical. At one point landing and reusing a rocket was a bad idea. Then we had improvements to materials science, control systems, etc. that collectively changed the equation. You can't just apply the same old equation and come to the same conclusion.
> and we've figured out how to securely isolate "users" of the same GPU from each other
That's the problem, isn't it.
I don't want my programs to act independently; they need to exchange data with each other (copy-paste, drag and drop). Also I cannot do many things in parallel. Some things must be done sequentially.
> There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.
Nobody teaches it, and nobody writes books about it (not that anyone reads anymore)
So, there are books out there. I use Computer Architecture: A Quantitative Approach by Hennessy and Patterson. Recent revisions have removed historical information. I understand why they did remove it. I wanted to use Stallings book, but the department had already made arrangements with the publisher.
The biggest reason we don't write books is that people don't buy them. They take the PDF and stick it on GitHub. Publishers don't respond to authors' takedown requests, GitHub doesn't care about authors, so why spend the time publishing a book? We can chase grant money instead. I'm fortunate enough to not have to chase grant money.
While financial incentives are important to some, a lot of people write books to share their knowledge and give them out for free. I think more people are doing this now, and there are also open collaborative textbook projects.
And I personally think that it is weird to write books during your working hours and also get money from selling that book.
"financial incentives"
This is the most ignorant response I've seen yet. We don't expect monetary gain from publishing a book. We expect our costs to be covered.
This is about the consumer, not the publisher. If we lived in a socialist system, they would still pirate our publications and we would still be in debt over it.
That's a financial incentive, I'm not sure what your rejection is exactly.
> What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on.
Isn't it much more plausible that the people who love to play with exotic (or also retro), complicated architectures (with in this case high performance opportunities) are different people than those who love to "set up or work in an assembly line for shipping stable software"?
> I really believe I need to spend 2-weeks requiring students write code on an Amiga, and the programs have to run at the same time. If anyone of them crashes, they all will fail my course. A new found appreciation may flourish.
I rather believe that among those who love this kind of programming, a hatred for the incompetent fellow students will develop (including wishes that they get weeded out by brutal exams).
The problem is that the exotic complexity enthusiasts cluster in places like HN and sometimes they overwhelm the voices of reason.
Those students would all drop out and start meditating. That would be a fun course. Speed run developing for all the prickly architectures of the 80s and 90s.
I see what you did there.
Guru meditation, for the uninitiated.
I loved and really miss the Cell. It did take quite a bit of work to shuffle things in and out of the SPUs correctly (so yeah, it took much longer to write code and required greater care), but it really churned through data.
We had a generic job mechanism with the same restrictions on all platforms. This usually meant if it ran at all on Cell it would run great on PC because the data would generally be cache friendly. But it was tough getting the PowerPC to perform.
I understand why the PS4 was basically a PC after that - because it's easier. But I wish there were still SPUs off to the side to take advantage of. Happy to have it off-die like GPUs are.
> They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences.
Is there any reason why GPU-style parallelism couldn't have memory protection?
It does. GPUs have full MMUs.
They do? Then how do i do the forbidden stuff by accessing neighboring pixel data?
Do you mean accessing data outside of your app's framebuffer, or just accessing neighboring pixels during a shader pass? Those are _very_ different things. GPU MMUs mean that you can't access a buffer that doesn't belong to your app; that's it. It's not about restricting pixel access within your own buffers.
TIL. Thank you.
Have you actually done that recently?
On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM. Do that stuff in software not hardware.
You could also do things like having the JIT optimize the entire running system dynamically like one program, eliminating syscall and context switch overhead not to mention most MMU overhead.
Would it be faster? Maybe. The JIT would have to generate its own safety and bounds checking stuff. I’m sure some work loads would benefit a lot and others not so much.
What it would do is allow CPUs to be simpler, potentially resulting in cheaper lower power chips or more cores on a die with the same transistor budget. It would also make portability trivial. Port the core kernel and JIT and software doesn’t care.
> On flattening address spaces: the road not taken here is to run everything in something akin to the JVM, CLR, or WASM.
GPU drivers take SPIR-V code (either "kernels" for OpenCL/SYCL drivers, or "shaders" for Vulkan Compute) which is not that different at least in principle. There is also a LLVM-based soft-implementation that will just compile your SPIR-V code to run directly on the CPU.
We end up relying on software for this so much anyway. Your examples plus the use of containers and the like at OS level.
"The birth and death of JavaScript"
[flagged]
What the ever loving hell, it was a perfectly reasonable idea in response to another idea.
They weren't saying it should be done, and went out of the way to make it explicit that they are not claiming it would be better.
It was a thought exploration, and a valid one, even if it would not pan out if carried all the way to execution at scale. Yes it was handwaving. So what? All ideas start as mere thoughts, and it is useful, productive, and interesting to trade them back and forth in these things called conversations. Even "fantasy" and "handwavy" ones. Hell especially those. It's an early stage in the pollination and generation of new ideas that later become real engineering. Or not, either way the conversation and thought was entertaining. It's a thing humans do, in case you never met any or aren't one yourself.
The brainstorming was a hell of a lot more valid, interesting, and valuable than this shit. "Just go away" indeed.
It wasn't handwaving or brainstorming. Microsoft even built a research OS like this:
https://www.microsoft.com/en-us/research/project/singularity...
Have people really never used a higher level execution environment?
The JVM and the CLR are the most popular ones. Have people never looked at their internals? Then there's the LISP machines, Erlang, Smalltalk, etc., not to mention a lot of research work on abstract machines that just don't have the problems you get with direct access to pointers and hardware.
Some folks in the graphics programming community are allergic to these kind of modern ideas.
They are now putting up with JITs in GPGPU, thanks to market pressure from folks using languages like Julia and Python, who would rather keep using those languages than rewrite their algorithms in C or C++.
These are communities where even adopting C over Assembly, and C++ over C, has been an uphill battle; something like a JIT is like calling for the pitchforks and torches.
By the way, one of the key languages used in the Connection Machine mentioned on the article was StarLisp.
https://en.wikipedia.org/wiki/*Lisp
I'm going to call this out. The entire post obviously has bucket loads of aggression, which can be taken as just communication style, but the last line was just uncalled for.
I have seen you make high quality responses to crazy posts.
Do better.
Don't worry, with LLMs, we're moving away from anything that remotely looks like "stable software" :)
Also, yeah, I recall the dreaded days of cooperative multitasking between apps. Moving from Windows 3.x to Linux was a revelation.
With LLMs it is just more visible. When the age of "updates" began, the age of stable software died.
True. The quality of code yielded by LLMs would have been deemed entirely unacceptable 30 years ago.
> I really believe I need to spend 2-weeks requiring students write code on an Amiga, and the programs have to run at the same time. If anyone of them crashes, they all will fail my course.
Fortran is memory-safe, right? ;-)
The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.
- You need to compile shader source/bytecode at runtime; you can't just "run" a program.
- On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.
- You need to synchronize data access between CPU-GPU and GPU workloads.
- You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.
- You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.
What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
>What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does.
For "embarrassingly parallel" jobs vector extensions start to eat tiny bits of the GPU pie.
Unfortunately, just slapping thousands of cores works poorly in practice. You quickly get into the synchronization wall caused by unified memory. GPUs cleverly work around this issue by using numerous tricks often hidden behind extremely complex drivers (IIRC CUDA exposes some of this complexity).
The future may be in a more explicit NUMA, i.e. in the "network of cores". Such hardware would expose a lot of cores with their own private memory (explicit caches, if you will) and you would need to explicitly transact with the bigger global memory. But, unfortunately, programming for such hardware would be much harder (especially if code has to be universal enough to target different specs), so I don't have high hopes for such paradigm to become massively popular.
It’s weird that no one mentioned xeon phi cards… that’s essentially what they were. Up to 188 (iirc?) x86 atom cores, fully generically programmable.
I consider Xeon Phi to be the shipping version of Larrabee. I've updated the post to mention it.
Seems to me there's a trend of applying explicit distributed systems (network of small-SRAM-ed cores each with some SIMD, explicit high-bandwidth message-passing between them, maybe some specialized ASICs such as tensor cores, FFT blocks...) looking at tenstorrent, cerebras, even kalray... out of the CUDA/GPU world, accelerators seem to be converging a bit. We're going to need a whole lot of tooling, hopefully relatively 'meta'.
Networks of cores ... Congrats you have just taken a computer and shrunk it so there are many on a single chip ... Just gonna say here AWS does exactly this network of computers thing ... Might be profitable
What I want is a Linear Algebra interface - As Gilbert Strang taught it. I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.
I'm not willing to even know about the HW at all, the higher level my code the more opportunities for the JIT to optimize my code.
What I really want is something like Mathematica that can JIT to GPU.
As another commenter mentioned all the API's assume you're a discrete GPU off the end of a slow bus, without shared memory. I would kill for an APU that could freely allocate memory for GPU or CPU and change ownership with the speed of a pagefault or kernel transition.
> What I really want is something like Mathematica that can JIT to GPU.
https://juliagpu.org/
https://github.com/jax-ml/jax
To expand on this link, this is probably the closest you're going to get to 'I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.' right now. JAX implements a good portion of the Numpy interface - which is the most common interface for linear algebra-heavy code in Python - so you can often just write Numpy code, but with `jax.numpy` instead of `numpy`, then wrap it in a `jax.jit` to have it run on the GPU.
I was about to say that it is literally just Jax.
It genuinely deserves to exist alongside pytorch. It's not just Google's latest framework that you're forced to use to target TPUs.
Like, PyTorch? And the new Mac minis have 512gb of unified memory
You can have that today. Just go out and buy more CPUs until they have enough cores to equal the number of SMs in your GPU (or memory bandwidth, or whatever). The problem is that the overhead of being general purpose -- prefetch, speculative execution, permissions, complex shared cache hierarchies, etc -- comes at a cost. I wish it was free, too. Everyone does. But it just isn't. If you have a workload that can jettison or amortize these costs due to being embarrassingly parallel, the winning strategy is to do so, and those workloads are common enough that we have hardware for column A and hardware for column B.
> The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths.
To me it feels somewhat like programming for the segmented memory model with its near and far pointers, back in the old days. What a nightmare.
Larrabee was something like that; it didn't take off.
IMHO, the real issue is cache coherence. GPUs are spared from doing a lot of extra work by relaxing coherence guarantees quite a bit.
Regarding the vendor situation - that's basically how most of computing hardware is, save for the PC platform. And this exception is due to Microsoft successfully commoditizing their complements (which caused quite some woe on the software side back then).
Is cache coherence a real issue, absent cache contention? AIUI, cache coherence protocols are sophisticated enough that they should readily adapt to workloads where the same physical memory locations are mostly not accessed concurrently except in pure "read only" mode. So even with a single global address space, it should be possible to make this work well enough if the programs are written as if they were running on separate memories.
It is because cache coherence requires extra communication to make sure that the cache is coherent. There are cute strategies for reducing the traffic, but ultimately you need to broadcast out reservations to all of the other cache-coherent nodes, so there's an N^2 scaling at play.
I miss, not exactly Larrabee, but what it could have become. I want just an insane number of very fast, very small cores with their own local memory.
In the field usually nothing takes off on the first attempt, so this is just a reason to ask "what's different this time" on the following attempts.
> What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory...
I too am very interested in this model. The Linux kernel supports up to 4,096 cores [1] on a single machine. In practice, you can rent a c7a.metal-48xl [2] instance on AWS EC2 with 192 vCPU cores. As for programming models, I personally find the Java Streams API [3] extremely versatile for many programming workloads. It effectively gives a linear speedup on serial streams for free (with some caveats). If you need something more sophisticated, you can look into OpenMP [4], an API for shared-memory parallelization.
I agree it is time for some new ideas in this space.
[1]: https://www.phoronix.com/news/Perf-Support-2048-To-4096-Core...
[2]: https://aws.amazon.com/ec2/instance-types/c7a/
[3]: https://docs.oracle.com/en/java/javase/24/docs/api/java.base...
[4]: https://docs.alliancecan.ca/wiki/OpenMP
Yep, and those printers are proprietary and mutually incompatible, and there are buggy mutually incompatible serial drivers on all the platforms which results in unique code paths and debugging & workarounds for app breaking bugs for each (platform, printer brand, printer model year) tuple combo.
(That was idealized - actually there may be ~5 alternative driver APIs even on a single platform each with its own strengths)
I really would like you to sketch out the DX you are expecting here, purely for my understanding of what it is you are looking for.
I find needing to write separate code in a different language annoying, but the UX of it is very explicit about what is happening in memory, which is very useful. With really high performance compute across multiple cores, ensuring you don't get arbitrary cache misses is a pain. If we could address CPUs like we address current GPUs (well, you can, but it's not generally done) it would make it much, much simpler.
Want to alter something in parallel? Copy it to memory allocated to a specific core, which is guaranteed to only be addressed by that core, and then do the operations on it.
To do that currently you need to be pedantic about alignment and manually indicate thread affinity to the scheduler, etc., which is entirely as annoying as GPU programming.
Your wish sounds to me a lot like Larrabee/Xeon Phi or manycore CPUs. Maybe I am misunderstanding something, but it sounds like a good idea to me and I don’t totally see why it inherently can’t compete with GPUs.
I think Intel should have made more of an effort to get cheap Larrabee boards to developers, they could have been ones with chips that had some broken cores or unable to make the design speed.
RAM size seems to have been a problem; the lowest-end Phi only had 6 GB of GDDR5 for its 57 cores (228 threads).
Doesn't matter. The issues you raise are abstractable at the language level, or maybe even the runtime. Unfortunately there are others, like which of the many kinds of parallelism to use (ILP, thread, vector/SIMD, distributed memory with much lower performance, etc.), that are harder to hide behind a compiler with acceptable performance.
So greenarrays F18? :)
Please explain how these "worker cores" should operate.
"want to protect their turd" - golden!
Having worked for a company that made a "hundreds of small CPUs on a single chip", I can tell you now that they're all going to fail because the programming model is too weird, and nobody will write software for them.
Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.
The key transformation required to make any parallel architecture work is going to be taking a program that humans can understand, and translating it into a directed acyclic graph of logical Boolean operations. This type of intermediate representation could then be broken up into little chunks for all those small CPUS. It could be executed very slowly using just a few logic gates and enough ram to hold the state, or it could run at FPGA speeds or better on a generic sea of LUTs.
This sounds like graph reduction as done by https://haflang.github.io/ and that flavor of special purpose CPU.
The downside of reducing a large graph is the need for high bandwidth low latency memory.
The upside is that tiny CPUs attached directly to the memory could do reduction (execution).
Reminds me of Mill Computing's stuff.
https://millcomputing.com/
Mill Computing's proposed architecture is more like VLIW with lots of custom "tricks" in the ISA and programming model to make it nearly as effective as the usual out-of-order execution than a "generic sea" of small CPU's. VLIW CPU's are far from 'tiny' in a general sense.
Like interaction nets?
Isn't that the Connection Machine architecture?
Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Imagine a sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors. The programming for this, even as virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.
> Most practical parallel computing hardware had queues to handle the mismatch in compute speed for various CPUs to run different algorithms on part of the data.
> Eliminating the CPU bound compute, and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Modern parallel scheduling systems still have "queues" to manage these concerns; they're just handled in software, with patterns like "work stealing" that describe what happens when unexpected mismatches in execution time must somehow be handled. Even your "sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors" has queues, only the queue is called a "pipeline" and a mismatch in execution speed leads to "pipeline bubbles" and "stalls". You can't really avoid these issues.
The CM architecture or programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was however much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.
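To be fair, the individual primitives are all available piecemeal on modern GPUs through libraries, even if the language is gone. A minimal sketch with Thrust (the values and the flags/gather example are made up for illustration):

#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/gather.h>

int main() {
    // Blelloch-style exclusive scan: offsets[i] = number of kept elements before i.
    std::vector<int> h_flags = {1, 0, 1, 1, 0, 1};
    thrust::device_vector<int> flags(h_flags.begin(), h_flags.end());
    thrust::device_vector<int> offsets(flags.size());
    thrust::exclusive_scan(flags.begin(), flags.end(), offsets.begin());

    // Generalized gather: out[i] = data[idx[i]], computed on the device.
    std::vector<int> h_data = {10, 20, 30, 40, 50, 60};
    std::vector<int> h_idx  = {0, 2, 3, 5};
    thrust::device_vector<int> data(h_data.begin(), h_data.end());
    thrust::device_vector<int> idx(h_idx.begin(), h_idx.end());
    thrust::device_vector<int> out(idx.size());
    thrust::gather(idx.begin(), idx.end(), data.begin(), out.begin());

    printf("kept before last element: %d\n", (int)offsets.back());  // 3
    printf("gathered[1] = %d\n", (int)out[1]);                       // 30
    return 0;
}

What's missing relative to *Lisp and the CM is the whole-tensor surface syntax and the nearest-neighbor shift operations as first-class citizens, which is the "language like this" part.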
I would not complain about getting my own personal Connection Machine.
So long as Tamiko Thiel does the design.
there’s a differentiable version of this that compiles to C or CUDA: difflogic
Yep, transputers failed miserably. I wrote a ton of code for them. Everything had to be solved in a serial bus, which defeated the purpose of the transputer.
Quite fascinating. Did you write about your experiences in that area? Would love to read it!
Not in those terms, but an autobiography is coming, and bits and pieces are being explained. I expect about 10 people to buy the book, as all of the socialists will want it for free. I am negotiating with a publisher as we speak on the terms.
Could you elaborate on the “serial bus” bit?
Could you elaborate on this? How does many-small-CPUs make for a weirder programming model than a GPU?
I'm no expert, but I've done my fair share of parallel HPC stuff using MPI, and a little bit of CUDA. And to me the GPU programming model is far, far "weirder" and harder to code for than the many-CPUs model. (Granted, I'm assuming you're describing a different regime?)
In CUDA you don't really manage the individual compute units, you start a kernel, and the drivers take care of distributing that to the compute cores and managing the data flows between them.
When programming CPUs however you are controlling and managing the individual threads. Of course, there are libraries which can do that for you, but fundamentally it's a different model.
The GPU equivalent of a single CPU "hardware thread" is called a "warp" or a "wavefront". GPU's can run many warps/wavefronts per compute unit by switching between warps to hide memory access latency. A CPU core can do this with two hardware threads, using Hyperthreading/2-way SMT, some CPU's have 4-way SMT, but GPU's push that quite a bit further.
What you say has nothing to do with CPU vs. GPU, or with CUDA, which is basically equivalent with the older OpenMP.
When you have a set of concurrent threads, each thread may run a different program. There are many applications where this is necessary, but such applications are hard to scale to very high levels of concurrency, because each thread must be handled individually by the programmer.
Another case is when all the threads run the same program, but on different data. This is equivalent with a concurrent execution of a "for" loop, which is always possible when the iterations are independent.
The execution of such a set of threads that execute the same program has been named "parallel DO instruction" by Melvin E. Conway in 1963, "array of processes" by C. A. R. Hoare in 1978, "replicated parallel" in the Occam programming language in 1985, SPMD around the same time, "PARALLEL DO" in the OpenMP Fortran language extension in 1997, "parallel for" in the OpenMP C/C++ language extension in 1998, and "kernel execution" in CUDA, which has also introduced the superfluous acronym SIMT to describe it.
When a problem can be solved by a set of concurrent threads that run the same program, then it is much simpler to scale the parallelism to extremely high levels and the parallel execution can usually be scheduled by a compiler or by a hardware controller without the programmer having to be concerned with the details.
There is no inherent difficulty in making a compiler that provides exactly the same programming model as CUDA, but which creates a program for a CPU, not for a GPU. Such compilers exist, e.g. ispc, which is mentioned in the parent article.
The difference between GPUs and CPUs is that the former appear to have some extra hardware support for what you describe as "distributing that to the compute cores and managing the data flows between them", but nobody is able to tell exactly what is done by this extra hardware support and whether it really matters, because it is a part of the GPUs that has never been documented publicly by the GPU vendors.
From the point of view of the programmer, this possible hardware advantage of the GPUs does not really matter, because there are plenty of programming language extensions for parallelism and libraries that can take care of the details of thread spawning and work distribution over SIMD lanes, regardless if the target is a CPU or a GPU.
Whenever you write a program equivalent with a "parallel for", which is the same as writing for CUDA, you do not manage individual threads, because what you write, the "kernel" in CUDA lingo, can be executed by thousands of threads, also on a CPU, not only on a GPU. A desktop CPU like Ryzen 9 9950X has the same product of threads by SIMD lanes like a big integrated GPU (obviously, discrete GPUs can be many times bigger).
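To make the equivalence concrete, here is a minimal hypothetical example (saxpy, not taken from the article): the "kernel" is literally the loop body, with the loop index replaced by the global thread index. The same source structure can target a CPU through OpenMP or ispc.

#include <cuda_runtime.h>

// Sequential version: the classic loop.
void saxpy_cpu(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// "Parallel for" / SPMD / kernel version: the loop body becomes the kernel and
// the loop index becomes the global thread index.
__global__ void saxpy_gpu(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch with one thread per element, e.g.:
//   saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);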
I mean weird compared to what already exists.
While acknowledging that it's theoretically possible other approaches might succeed, it seems quite clear the author agrees with you.
I'm guessing you just don't have the computational power to compete with a real GPU. It would be relatively easy for a top-end graphics programmer to write the front-end graphics API for your chip. I'm guessing that if they did this you would just end up with a very poorly performing GPU.
My take from reading this is that it's more about programming abstractions than any particular hardware instantiation. The part of the Connection Machine that remains interesting is not building machines with CPUs with transistor counts in the hundreds running off a globally synchronous clock, but that there was a whole family of SIMD languages that let you do general-purpose programming in parallel. And that those languages were still relevant when the architecture changed to a MIMD machine with a bunch of vector units behind each CPU.
Reminds me of Itanium
How is that at all like Itanium except for the superficial headline level where people say they are hard to program?
Because the main feature that made Itanium hard to program for was its explicit instruction-level parallelism.
They weren't talking about instruction level parallelism.
Similarities between two things don't require them to be identical.
They aren't similar, they couldn't be more different. One is about lots of small threads of execution communicating with each other and synchronizing, one is about a few instructions being able to be run in parallel because implicitly within the CPU there are different pipelines.
They aren't just different, they are at completely opposite ends of the programming spectrum. There are literally the two extremes of trying to make throughput faster.
Picochip?
> The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
What other workloads would benefit from a GPU?
Computers are so fast that in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's because that program's code is particularly bad, and the solution to make the code less bad is simpler than re-writing it for the GPU.
For example, GUIs have responded to user input with imperceptible latency for over 20 years. If an app's GUI feels sluggish, the problem is that the app's actions and rendering aren't on separate coroutines, or the action's coroutine is blocking (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it is today; I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.
In some cases, parallelizing a task intrinsically makes it slower, because the number of sequential operations required to handle coordination means there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads but they only run on 8-16 processors, so the program would be faster if it spawned fewer threads because it would still use all processors.
I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.
Possibly compilation and linking. That's very slow for big programs like Chromium. There's really interesting work on GPU compilers (co-dfns and Voetter's work).
Optimization problems like scheduling and circuit routing. Search in theorem proving (the classical parts like model checking, not just LLM).
There's still a lot that is slow and should be faster, or at the very least made to run using less power. GPUs are good at that for graphics, and I'd like to see those techniques applied more broadly.
All of these things you mention are "thinking", meaning they require complex algorithms with a bunch of branches and edge cases.
The tasks that GPUs are good at right now - graphics, number crunching, etc - are all very simple algorithms at the core (mostly elementary linear algebra), and the problems are, in most cases, embarrassingly parallel.
CPUs are not very good at branching either - see all the effort being put towards getting branch prediction right - but they are way better at it than GPUs. The main appeal of GPGPU programming is, in my opinion, that if you can get the CPU to efficiently divide the larger problem into a lot of small, simple subtasks, you can achieve faster speeds.
You mentioned compilers. See a related example, for reference all the work Daniel Lemire has been doing on SIMD parsing: the algorithms he (co)invented are all highly specialized to the language, and highly nontrivial. Branchless programming requires an entirely different mindset/intuition than "traditional" programming, and I wouldn't expect the average programmer to come up with such novel ideas.
A GPU is a specialized tool that is useful for a particular purpose, not a silver bullet to magically speed up your code. There is a reason that we are using it for its current purposes.
> Possibly compilation and linking. That's very slow for big programs like Chromium.
So instead of fixing the problem (Chromium's bloat) we just throw more memory and computing power at it, hoping that the problem will go away.
Maybe we should teach programmers to program. /s
A big one is video encoding. It seems like GPUs would be ideal for it but in practice limitations in either the hardware or programming model make it hard to efficiently run on GPU shader cores. (GPUs usually include separate fixed-function video engines but these aren't programmable to support future codecs.)
Video encoding is done with fixed-function for power efficiency. A new popular codec like H26x codec appears every 5-10 years, there is no real need to support future ones.
Video encoding is two domains. And there's surprisingly little overlap between them.
You have your real time video encoding. This is video conferencing, live television broadcasts. This is done fixed-function not just for power efficiency, but also latency.
The second domain is encoding at rest. This is youtube, netflix, blu-ray, etc. This is usually done in software on the CPU for compression ratio efficiency.
The problem with fixed function video encoding is that the compression ratio is bad. You either have enormous data, or awful video quality, or both. The problem with software video encoding is that it's really slow. OP is asking why we can't/don't have the best of both worlds. Why can't/don't we write a video encoder in OpenCL/CUDA/ROCm. So that we have the speed of using the GPU's compute capability but compression ratio of software.
I haven't yet read the full blog post but so far my response is you can have this good parallel computer. See my previous HN comments the past months on building an M4 Mac mini supercomputer.
For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets and the IOMMU and page tables that prevent you from programming all processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and making your own Abstract Syntax Tree to assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.
https://en.wikipedia.org/wiki/Roofline_model
There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need to have a large community of people who can run their code. Great projects die all the time because a slightly worse, but more ubiquitous technology prevents flowering of new approaches. There are economies of scale that feed back into ever-improving iterations of existing systems.
Simply porting existing successful codes from CPU to GPU can be a major undertaking and if there aren't any experts who can write something that drive immediate sales, a project can die on the vine.
See for example https://en.wikipedia.org/wiki/Cray_MTA when I was first asked to try this machine, it was pitched as "run a million threads, the system will context switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it to GPUs.
AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.
I've found the best strategy is to target my development for what the high end consumers are buying in 2 years - this is similar to many games, which launch with terrible performance on the fastest commercially available card, then run great 2 years later when the next gen of cards arrives ("Can it run Crysis?")
Interesting article.
Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.
In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.
Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust, incidentally, is an wrapper to extend that concept to cross-platform (Mac, Android, browsers, etc.)
While it seems you can write a general 3D renderer that works in a wide variety of situations, that does not work well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.
There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.
In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.
The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process. This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.
Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
[1] https://github.com/linebender/vello/
2D rendering is harder in fact, because antialiased curves are harder than triangle soup.
It's an issue of code complexity, not fill rate
https://faultlore.com/blah/text-hates-you/
I think a dynamic, fully vector-based 2D interface with fluid zoom and transformations at 120Hz+ is going to need all the GPU help it can get. Take mapping as an example: even Google Maps routinely struggles with performance on a top-of-the-line iPhone.
> even Google Maps routinely struggles with performance on a top-of-the-line iPhone.
It has to download the jpegs.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.
A ton of 2D applications could benefit from further GPU parallelization. Games, GUIs, blurs & effects, 2D animations, map apps, text and symbol rendering, data visualization...
Canvas2D in Chrome is already hardware accelerated, so most users get better performance and reduced load on main UI & CPU threads out of the box.
Fast light transport is an incredibly hard problem to solve.
Raytracing (in its many forms) is one solution. Precomputing lightmaps, probes, occluder volumes, or other forms of precomputed visibility are another.
In the end it comes down to a combination of target hardware, art direction and requirements, and technical skill available for each game.
There's not going to be one general purpose renderer you can plug into anything, _and_ expect it to be fast, because there's no general solution to light transport and geometry processing that fits everyone's requirements. Precomputation doesn't work for dynamic scenes, and for large games leads to issues with storage size and workflow slow downs across teams. No precomputation at all requires extremely modern hardware and cutting edge research, has stability issues, and despite all that is still very slow.
It's why game engines offer several different forms of lighting methods, each with as many downsides as they have upsides. Users are supposed to pick the one that best fits their game, and hope it's good enough. If it's not, you write something custom (if you have the skills for that, or can hire someone who can), or change your game to fit the technical constraints you have to live with.
> Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
Some games may have their own acceleration structures. Some won't. Some will only have them on the GPU, not the CPU. Some will have an approximate structure used only for specialized tasks (culling, audio, lighting, physics, etc), and cannot be generalized to other tasks without becoming worse at their original task.
Fully generalized solutions will be slow but flexible, and fully specialized solutions will be fast but inflexible. Game design is all about making good tradeoffs.
The same argument could be made against Vulkan, or OpenGL, or even SQL databases. The whole NoSQL era was based on the concept that performance would be better with less generality in the database layer. Sometimes it helped. Sometimes trying to do database stuff with key/value stores made things worse.
I'm trying to find a reasonable medium. I have a hard scaling problem - big virtual world, dynamic content - and am trying to make that work well. If that works, many games with more structured content can use the same approach, even if it is overkill.
> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
Depends on how complicated your artwork is.
There are only so many screen pixels.
You can have an unlimited number of polygons overlapping a pixel. For instance, if you zoom out a lot. Imagine you converted a layer map of a modern CPU design to svg, and tried to open it in Inkscape. Or a map of NYC. Wouldn't you think a bit of extra processing power would be welcomed?
At Vulkanised 2025 someone mentioned that it is a HAL for writing GPU drivers, and they have acknowledged it has gotten as messy as OpenGL and that there is now a plan in place to try to sort out the complexity mess.
> Modern GPUs are overkill for 2D
That explains why modern GUIs are crap: because they are not able to draw a bloody rectangle and fill it with colour. /s
> I believe there are two main things holding it back. One is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs … struggle when the workload is dynamic
This sacrifice is a purposeful cornerstone of what allows GPUs to be so high-throughput in the first place.
It is odd that he talks about Larrabee so much, but doesn’t mention the Xeon Phis. (Or is it Xeons Phi?)
> As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.
I’ve always been slightly annoyed by the concept of E cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E cores, give them their AVX-512 back, and give them higher throughput memory. Maybe try and pull the Phi trick of less OoO capabilities but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.
I've always wondered if you could use iGPU compute cores with unified memory as "transparent" E-cores when needed.
Something like OpenCL/CUDA except it works with pthreads/goroutines and other (OS) kernel threading primitives, so code doesn't need to be recompiled for it. Ideally the OS scheduler would know how to split the work, similar to how E-core and P-core scheduling works today.
I don't do HPC professionally, so I assume I'm ignorant to why this isn't possible.
Isn't Xeon Phi just an instance of Larrabee?
It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
The "Larrabee New Instructions" were an instruction set designed before AVX, and their first hardware implementation also arrived before AVX, in 2010 (AVX launched in 2011, with Sandy Bridge).
Unfortunately, while the hardware design of Sandy Bridge with the inferior AVX ISA was done by the Intel A team, the hardware implementations of Larrabee were done by C or D teams, which were also not able to design new CPU cores for it; they had to reuse obsolete x86 cores, initially a Pentium core and later an Atom Silvermont core, onto which the Larrabee instructions were grafted.
The "Larrabee New Instructions" were renamed to the "Many Integrated Cores" ISA, then to AVX-512, while passing through 3 generations of chips: Knights Ferry, Knights Corner and Knights Landing. A fourth generation, Knights Mill, was only intended for machine learning/AI applications. The successor of Knights Landing was Skylake Server, when the AVX-512 ISA came to standard Xeons, marking the disappearance of Xeon Phi.
Already in 2013, Intel Haswell added to AVX a few of the more important instructions that were included in the Larrabee New Instructions but missing from AVX, e.g. fused multiply-add and gather instructions. The 3-address FMA format, which caused problems for AMD, who had implemented a 4-address format in Bulldozer, also came to AVX from Larrabee, replacing the initial 4-address specification.
At each generation until Skylake Server, some of the original Larrabee instructions were deleted, on the assumption that they would be needed only for graphics, which was no longer the intended market. However, a few of those instructions were really useful for some applications in which I am interested, e.g. computations with big numbers, so I regret their disappearance.
Since Skylake Server there have been no other instruction removals, with the exception of those introduced by Intel Tiger Lake, which are now supported only by AMD Zen 5. A few days ago Intel committed to keeping complete compatibility in the future with the ISA implemented today by Granite Rapids, so there will be no other instruction deletions.
> It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.
This is an odd claim. Clearly Xeon Phi is the shipping version of Larrabee, while Zen 4 is a completely different chip design that happens to run AVX-512. The first shipping Xeon Phi (Knights Corner) used the exact same P54C cores as Larrabee, while as you point out later versions of Xeon Phi switched to Atom.
It is extremely common to refer to all these as Larrabee, for example the Ian Cutress article on the last Xeon Phi chip was entitled "The Larrabee Chapter Closes: Intel's Final Xeon Phi Processors Now in EOL" [1]. Pat Gelsinger's recent interview at GTC [2] also refers to Larrabee. The section from around 44:00 has a discussion of workloads becoming more dynamic, and at 53:36 there's a section on Larrabee proper.
[1]: https://www.anandtech.com/show/14305/intel-xeon-phi-knights-...
[2]: https://www.youtube.com/live/pgLdJq9FRBQ
I think it is not right to say that Larrabee and Phi are as distant as Larrabee and Zen. But they did retreat a bit from the “graphics card”-like functionality and scale back the ambitions, becoming something a bit more familiar.
Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified - which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.
You can. There are API extensions for persistently mapping memory, and it's up to you to ensure that you never write to a buffer at the same time the GPU is reading from it.
At least for Vulkan/DirectX12. Metal is often weird, I don't know what's available there.
Unified memory doesn't mean unified address space. It frustrates me when no one understands unified memory.
If you fix the page tables (partial tutorial online) you can have a contiguous unified address space on Apple Silicon.
Let’s be honest, saying “just fix the page tables” is like telling someone they can fly if they “just rewrite gravity.”
Yes, on Apple Silicon, the hardware supports shared physical memory, and with enough “convincing”, you can rig up a contiguous virtual address space for both the CPU and GPU. Apple’s unified memory architecture makes that possible, but Apple’s APIs and memory managers don’t expose this easily or safely for a reason. You’re messing with MMU-level mappings on a tightly integrated system that treats memory as a first-class citizen of the security model.
I can tell you never programmed on an Amiga.
Oh yes, I programmed all the Amiga models, mostly in assembly. I reprogrammed the ROMs. I also published a magazine on the internals of all the Commodore computers and built lots of hardware for these machines.
We had the parallel Inmos Transputer systems during the heyday of the Amiga; they were much better designed than any of the custom Amiga chips.
Inmos was a disaster. No application ever shipped on one. EVER. It used a serial bus to resolve the problems that should have never been problems. Clearly you never wrote code for one. Each oslink couldn't reach more than 3 feet. What a disaster that entire architecture was.
I shipped 5 applications on an 800 Inmos Transputer supercomputer. Sold my parallel C compilers and macro assembler. Also an OS, a Macintosh NuBus interface card, Transputer graphics cards, a full paper copier, and a laser printer. I know of dozens of successful products.
Sure you did. What were they? The only successful transputer was the T414 and it never made it outside academia.
Well, I believe there were military radar projects that shipped in reasonable quantities and served for reasonable lifetimes.
I think I remember some medical imaging products as well?
I don’t dispute that the Transputer was ultimately unsuccessful, but it wasn’t completely unused in real-world products.
See also TI’s C40, which was quite similar and similarly successful.
Hey don't shit on my retro alternative timeline nostalgia. We were all writing Lisp programs on 64 CPU Transputer systems with FPGA coprocessors, dynamically reconfigured in realtime with APL.
s/LISP/Prolog/ and you've basically described the old "Fifth Generation" research project. Unfortunately, it turns out that trying to parallelize Prolog is quite a nightmare; the language is really, really not built for it. So the whole thing was a dead end in practice. Arguably we didn't have a real "fifth-gen" programming language prior to Rust, given how it manages to uniquely combine ease of writing parallel+concurrent code with bare-metal, C-like efficiency. (And Rust is now being used to parallelize database queries, which comfortably addresses the actual requirement that Prolog had been intended for back then: performing "search" tasks on large and complex knowledge bases.)
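As a toy illustration of that last parenthetical (a parallel "search" over a knowledge base in Rust), here is a sketch using the rayon crate. The facts and the query are made up; only the pattern matters.

```rust
// Toy parallel "knowledge base search" using rayon (add `rayon = "1"` to
// Cargo.toml). The facts and the query are invented for illustration.
use rayon::prelude::*;

#[derive(Debug)]
struct Fact {
    subject: &'static str,
    relation: &'static str,
    object: &'static str,
}

fn main() {
    let facts = vec![
        Fact { subject: "socrates", relation: "is_a", object: "human" },
        Fact { subject: "human", relation: "is_a", object: "mortal" },
        Fact { subject: "rust", relation: "is_a", object: "language" },
    ];

    // "Query": find every fact asserting that something is a mortal.
    // `par_iter` splits the scan across threads with no manual locking.
    let answers: Vec<&Fact> = facts
        .par_iter()
        .filter(|f| f.relation == "is_a" && f.object == "mortal")
        .collect();

    println!("{:?}", answers);
}
```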
You probably already know about https://github.com/mthom/scryer-prolog. Are you saying we should accelerate scryer-prolog with a Grayskull board?
Also, the Fifth Generation computer project subscribed to the old AI paradigm, not based on statistics.
Not true. "just fix the page tables" took me 4 hours. And only 15 minutes with the Linux kernel on Apple Silicon.
Obviously you missed the sarcasm.
I know the APIs don't make it easy, that's precisely why I want different APIs.
I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.
I implemented some evolutionary computation stuff on the Cell BE in college. It was a really interesting machine and could be very fast for its time but it was somewhat painful to program.
The main cores were PPC and the Cell cores were… a weird proprietary architecture. You had to write kernels for them like GPGPU, so in that sense it was similar. You couldn’t use them seamlessly or have mixed work loads easily.
Larrabee and Xeon Phi are closer to what I’d want.
I’ve always wondered about many—many-many-core CPUs too. How many tiny ARM32 cores could you put on a big modern 5nm die? Give each one local RAM and connect them with an on die network fabric. That’d be an interesting machine for certain kinds of work loads. It’d be like a 1990s or 2000s era supercomputer on a chip but with much faster clock, RAM, and network.
When this topic comes up, I always think of uFork [1]. They are even working on an FPGA prototype.
[1] https://ufork.org/
The AIE arrays on Versal and Ryzen with XDNA are a big grid of cores (400 in an 8 x 50 array) that you program with streaming work graphs.
https://docs.amd.com/r/en-US/am009-versal-ai-engine/Overview
Each AIE tile can stream 64 Gbps in and out and perform 1024 bit SIMD operations. Each shares memory with its neighbors and the streams can be interconnected in various ways.
Clearly the author never worked with a CM2 - I did though. The CM2 was more like a co-processor which had to be controlled by a (for that age) rather beefy Sun workstation/server. The program itself ran on the workstation, which then sent the data-parallel instructions to the CM2. The CM2 was an extreme form of a SIMD design (that is why it was called data parallel). You worked with a large rectangular array (I cannot recall up to how many dimensions) which had to be a multiple of the physical processors (in your partition). All cells typically performed exactly the same operation. If you wanted to perform an operation on a subset, you had to "mask" the other cells (which were essentially idling during that time).
That is hardly what the author describes.
Did you use StarLisp? It is always a bit hard to find testimonials about the experience.
AMD Strix Halo APU is a CPU with very powerful integrated GPU.
It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space. This means it doesn’t have the same swapping/memory thrashing that a discrete GPU experiences when processing large models.
16 CPU cores and 40 GPU compute units sounds pretty parallel to me.
Doesn’t that fit the bill?
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
Not exactly. The RTX4090 uses fast graphics RAM (usually a previous generation, but overclocked and on a very wide bus), while the AMD Strix Halo uses standard DDR5, which is not as fast.
And yes, the Strix Halo GPU uses "3D cache", but as officials said, the CPU doesn't have access to the GPU cache, because they "have not seen any app significantly benefit from such access".
So the internal SoC bus probably has less latency than a discrete GPU on PCIe, but the difference isn't huge.
It looks like it will be available in the Framework Desktop! I would love to see it in a more budget mini PC at some point from another company. (Framework is great but not in my price range.)
> It’s faster at AI than an Nvidia RTX4090, because 96GB of the 128GB can be allocated to the GPU memory space
I love AMD's Ryzen chips and will recommend their laptops over an Nvidia model all day. However, this is a pretty facetious comparison that falls apart when you normalize the memory. Any chip can be memory bottlenecked, and if we take away that arbitrary precondition the Strix Halo gets trounced in terms of compute capacity. You can look at the TDP of either chip and surmise this pretty easily.
> However, this is a pretty facetious comparison that falls apart when you normalize the memory
Why would you normalize though? You can't buy a 96 GB RTX4090. So it's fair to compare the whole deal, slowish APU with large RAM versus very fast GPU with limited RAM.
> You can't buy a 96 GB RTX4090
You can now buy a 96 GB RTX5090.[1] NVidia gives it a "Pro" designation and charges more, but it's the same chip.
[1] https://www.tomshardware.com/pc-components/gpus/nvidia-rtx-p...
It is fair, it should just be contextualized with a comparison of 13B or 32B models as well. This is one of those Apple marketing moves where a very specific benchmark has been cherry-picked for a "2.2x improvement!" headline that people online misconstrue.
“AMD also claims its Strix Halo APUs can deliver 2.2x more tokens per second than the RTX 4090 when running the Llama 70B LLM (Large Language Model) at 1/6th the TDP (75W).”
https://www.tomshardware.com/pc-components/cpus/amd-slides-c...
You could argue it's an invalid claim because it's from AMD, not an independent source.
This is still a memory-constrained benchmark. The smallest Llama 70B model (gguf-q2) doesn't fit in memory, so it's bottlenecked by your PCIe connection. It's a valid benchmark, but it's still guilty of being stacked in exactly the way I described before.
A comparison of 7B/13B/32B model performance would actually test the compute performance of either card. AMD is appealing to the consumers that don't feel served by Nvidia's gaming lineup, which is fine but also doomed if Nvidia brings their DGX Spark lineup to the mobile form factor.
This essay needs more work.
Are you arguing for a better software abstraction, a different hardware abstraction or both? Lots of esoteric machines are name dropped, but it isn't clear how that helps your argument.
Why not link to Vello? https://github.com/linebender/vello
I think a stronger essay would at the end give the reader a clear view of what Good means and how to decide if a machine is closer to Good than another machine and why.
SIMD machines can be turned into MIMD machines. Even hardware problems still need a software solution. The hardware is there to offer the right affordances for the kinds of software you want to write.
Lots of words that are in the eye of the beholder. We need a checklist, or that Good parallel computer won't be built.
Personal opinion: it's the software (and software tooling).
The hardware is good enough (even if we're only talking 10x efficiency). Part of the issue seems slightly cultural, i.e. repeatedly putting down the idea of traditional task parallelism (not-super-SIMD/data parallelism) on GPUs. Obviously, one would lose a lot of efficiency if we literally ran 1 thread per warp. But it could be useful for lightly-data-parallel tasks (like typical CPU vectorization), or maybe for using warp-wide semantics to implement something like a "software" microcode engine. Dumb example: implementing division with long division using multiplications and shifts.
Other things a GPU gives: insanely high memory bandwidth, programmable cache (shared memory), and (relatively) great atomic operations.
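For the curious, a minimal Rust sketch of that dumb example above: restoring shift-and-subtract long division, the textbook algorithm rather than any vendor's actual microcode. The multiply-based variants (e.g. a Newton-Raphson reciprocal) follow the same spirit.

```rust
// Binary long division built only from shifts, compares, and subtracts.
fn divmod_u32(dividend: u32, divisor: u32) -> (u32, u32) {
    assert!(divisor != 0, "division by zero");
    let mut quotient: u32 = 0;
    let mut remainder: u64 = 0; // wide enough that the shift below cannot overflow
    // Walk the dividend from its most significant bit to its least.
    for bit in (0..32).rev() {
        // Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | u64::from((dividend >> bit) & 1);
        // If the divisor now fits, subtract it and record a 1 in the quotient.
        if remainder >= u64::from(divisor) {
            remainder -= u64::from(divisor);
            quotient |= 1 << bit;
        }
    }
    (quotient, remainder as u32)
}

fn main() {
    assert_eq!(divmod_u32(1_000_003, 7), (1_000_003 / 7, 1_000_003 % 7));
    assert_eq!(divmod_u32(u32::MAX, 3_000_000_000), (1, u32::MAX - 3_000_000_000));
    println!("ok");
}
```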
I agree.
Many things in software fall into "you're doing it wrong" territory, but what counts as the wrong way is subjective and arbitrary.
> maybe using warp-wide semantics to implement something like a "software" microcode engine.
https://github.com/beehive-lab/ProtonVM
Thanks for the share (and reminder)! Turns out I had this bookmarked somehow, lol.
email is on profile, drop me a line if you want to discuss gpu meta machines
The domain doesn't seem to be available?
I was being too obtuse in my obfuscation; it is a proton.me email address.
> Are you arguing for a better software abstraction, a different hardware abstraction or both?
I don't speak for Raph, but imo it seems like he was arguing for both, and I agree with him.
On the hardware side, GPUs have struggled with dynamic workloads at the API level (not e.g. thread-level dynamism, that's a separate topic) for around a decade. Indirect commands gave you some of that so at least the size of your data/workload can be variable if not the workloads themselves, then mesh shaders gave you a little more access to geometry processing, and finally workgraphs and device generated commands lets you have an actually dynamically defined workload (e.g. completely skipping dispatches for shading materials that weren't used on screen this frame). However it's still very early days, and the performance issues and lack of easy portability are problematic. See https://interplayoflight.wordpress.com/2024/09/09/an-introdu... for instance.
On the software side shading languages have been garbage for far longer than hardware has been a problem. It's only in the last year or two that a proper language server for writing shaders has even existed (Slang's LSP). Much less the innumerable driver compiler bugs, lack of well defined semantics and memory model until the last few years, or the fact that we're still manually dividing work into the correct cache-aware chunks.
Absolutely. And the fact that we need to evolve both is one of the reasons progress has been difficult.
I had hoped the GPU API would go away, and the entire thing would become fully programmable, but so far we just keep using these shitty APIs and horrible shader languages.
Personally I would like to use the same language I write the application in to write the rendering code (C++). Preferably with shared memory, not some separate memory system that takes forever to transfer anything to. Something along the lines of the new AMD 360 Max chips, but with graphics written in explicit C++.
I was always fascinated by the prospects of the 1024-core Epiphany-V from Parallella: https://parallella.org/2016/10/05/epiphany-v-a-1024-core-64-... But it seems whatever the DARPA connection was has led to it not being for scruffs like me, and it is likely powering god knows what military systems.
Any computing model that tries to parallelize von Neumann machines, that is, has program counters or address space, just isn't going to scale.
The problem isn't address space or program counters. It's that each processor is going to need instruction memory stored in SRAM, or an extremely efficient multi-port memory for a shared instruction cache.
GPUs get around this limitation by executing identical instructions over multiple threads.
Instructions are the problem: you have to have an architecture which just operates on data flows, all in parallel and all at once, like an FPGA, but without all the fiddly special-sauce parts.
There are designs like Tilera and Phalanx that have tons of cores. Then, NUMA machines used to have 128-256 sockets in one machine with coherent memory. The SGI machines let you program them like it was one machine. Languages like Chapel were designed to make parallel programming easier.
Making more things like that, at the lowest possible unit prices, could help a lot.
Isn't the ONNX standard already going into the direction of programming a GPU using a computation graph? Could it be made more general?
It lacks support for the serial portions of the execution graph, but yes. You should play around with ONNX, it can be used for a lot more than just ML stuff.
What do you mean by serial portions? Aren't operations automatically serialized if there are dependencies between them?
s/serial/scalar
ONNX doesn't have the direct capabilities to be the compilation target for regular imperative code.
If we had distributed operating systems and SSI kernels, your computer could use the idle cycles of other computers [that aren't on battery power]. People talk about a grid of solar houses, but we could've had personal/professional grid computing like 15 years ago. Nobody wanted to invest in it, I guess because chips kept getting faster.
SSI is an interesting idea, but the actual advantage is mostly to improve efficiency when running your distributed code on a single, or few nodes. You still have to write your code with some very real awareness of the relevant issues when running on many nodes, but now you are also free to "scale down" and be highly efficient on a single node, since your code is still "natively" written for running on that kind of system. You are not going to gain much by opportunistically running bad single-node codes on larger systems, since that will be quite inefficient anyway.
Also, running a large multi-node SSI system means you mostly can't partition those nodes ever, otherwise the two now-separated sets of nodes could both progress in ways that cannot be cleanly reconciled later. This is not what people expect most of the time when connecting multiple computers together.
You could say the same thing about multiple cores or CPUs. A lot of people write apps that aren't useful past a single core or CPU. Doesn't mean we don't build OSes & hardware for multiple cores... (Remember back when nobody had an SMP kernel, because, hey, who the hell's writing their apps for more than one CPU?! Our desktops aren't big iron!)
In the worst-case, your code is just running on the CPU you already have. If you have another node/CPU, you can schedule your whole process on that one, which frees up your current CPU for more work. If you design your app to be more scalable to more nodes/CPUs, you get more benefits. So even in the worst case, everything would just be... exactly the way it is today. But there are many cases that would be benefited, and once the platform is there, more people would take advantage of it.
There is still a massive opportunity in general parallel computing that we haven't explored. Plenty of research, but along specific kinds of use cases, and with not nearly enough investment, so the little work that got done took decades. I think we could solve all the problems and make it generally useful, which could open up a whole new avenue of computing / applications, the way more bandwidth did.
(I'm referring to consumer use-cases above, but in the server world alone, a distributed OS with simple parallel computing would transform billion-dollar markets in software, making a whole lot of complicated solutions obsolete. It might take a miracle for the code to get adopted upstream by the Linux Mafia, though)
> It might take a miracle for the code to get adopted upstream by the Linux Mafia, though
The basic building block is containerization/namespacing, which has been adopted upstream. If your app is properly containerized, you can use the CRIU featureset (which is also upstream) to checkpoint it and migrate it to another node.
What about unified memory? I know these APUs are slower than traditional GPUs but still it seems like the simpler programming model will be worth it.
The biggest problem is that most APUs don't even support full unified memory (system SVM in OpenCL). From my research, only Apple M series, some Qualcomm Adreno, and AMD APUs support it.
Huh. The Blelloch mentioned in the Thinking Machines section taught my parallel algorithms class in 1994 or so.
I wonder if CDN server applications could use something like this, if every core had a hardware TCP/TLS stack and there was a built-in IP router to balance the load, or something like that.
I think Tim was right, it's 2025, Nvidia just released their 50 series, but I don't see any cards, let alone GPUs.
There's a lot here that seems to misunderstand GPUs and SIMD.
Note that raytracing is a very dynamic problem, where the GPU isn't sure if a ray hits geometry or misses. When it hits, the ray needs to bounce, possibly multiple times.
Various implementations of raytracing, recursion, dynamic parallelism, or whatever. It's all there.
Now the software / compilers aren't ready (outside of specialized situations like Microsoft's DirectX Raytracing, which compiles down to a very intriguing threading model). But what was accomplished with DirectX can be done in other situations.
-------
Connection Machine is before my time, but there's no way I'd consider that 80s hardware to be comparable to AVX2 let alone a modern GPU.
The Connection Machine was a 1-bit computer, for crying out loud, just 65,536 of them in parallel.
Xeon Phi (~70 Intel Atom cores) is slower and weaker than modern 192-core EPYC chips.
-------
Today's machines are better. A lot better than the past machines. I cannot believe any serious programmer would complain about the level of parallelism we have today and wax poetic about historic and archaic computers.
The problems I'm having are very different than those for raytracing. Sure, it's dynamic, but at a fine granularity, so the problems you run into are divergence, and often also wanting function pointers, which don't work well in a SIMT model. By contrast, the way I'm doing 2D there's basically no divergence (monoids are cool that way), but there is a need to schedule dynamically at a coarser (workgroup) level.
But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
The problem with the GPU raytracing work is that they built hardware and driver support for the specific problem, rather than more general primitives on which you could build not only raytracing but other applications. The same story goes for video encoding. Continuing that direction leads to unmanageable complexity.
Of course today's machines are better, they have orders of magnitude more transistors, and crystallize a ton of knowledge on how to build efficient, powerful machines. But from a design aesthetic perspective, they're becoming junkheaps of special-case logic. I do think there's something we can learn from the paths not taken, even if, quite obviously, it doesn't make sense to simply duplicate older designs.
> But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.
True. Allocation just seems to be a "forced sequential" operation. A "stop the world, figure out what RAM is available" kind of thing.
If you can work with pre-allocated buffers, then GPUs work by reading from lists (consume operations) and then outputting to lists (append operations). This can be done with gather / scatter, or more precisely stream expansion and stream compaction, in a grossly parallel manner.
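For readers who haven't seen it, here is a sequential Rust sketch of what stream compaction does with those lists: flag, exclusive prefix sum, scatter. On a GPU each of those steps runs in parallel across thousands of elements; the names here are invented for illustration.

```rust
// CPU-side illustration of stream compaction: keep only the "surviving" rays
// and pack them densely, the way a GPU would with a parallel prefix sum plus
// scatter. Everything here is sequential and just shows the data movement.
#[derive(Clone, Copy, Debug)]
struct Ray {
    id: u32,
    alive: bool, // e.g. "this ray hit something and needs another bounce"
}

fn compact(rays: &[Ray]) -> Vec<Ray> {
    // Step 1: one flag per element (1 = keep, 0 = drop).
    let flags: Vec<u32> = rays.iter().map(|r| u32::from(r.alive)).collect();

    // Step 2: an exclusive prefix sum of the flags gives each survivor its
    // destination index. On the GPU this is a parallel scan.
    let mut offsets = vec![0u32; rays.len()];
    let mut running = 0u32;
    for (i, &f) in flags.iter().enumerate() {
        offsets[i] = running;
        running += f;
    }

    // Step 3: scatter survivors to their computed slots (an "append" into a
    // densely packed output list).
    let mut out = vec![Ray { id: 0, alive: false }; running as usize];
    for (i, &r) in rays.iter().enumerate() {
        if r.alive {
            out[offsets[i] as usize] = r;
        }
    }
    out
}

fn main() {
    let rays = [
        Ray { id: 0, alive: true },
        Ray { id: 1, alive: false },
        Ray { id: 2, alive: true },
        Ray { id: 3, alive: true },
    ];
    println!("{:?}", compact(&rays)); // ids 0, 2, 3 packed together
}
```

Stream expansion is the mirror image: each input writes a variable number of outputs, with a prefix sum of the counts giving the write offsets.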
---------
If that's not enough "memory management" for you, then yeah, the CPU is the better device to work with. At which point I again point back to the 192-core EPYC Zen5c example: we have grossly parallel CPUs today if you need them, just a few clicks away to rent from cloud providers like Amazon or Azure.
GPUs are good at certain things (and I consider them the pinnacle of "Connection Machine"-style programming; today's GPUs are just far more parallel, far easier to program, and far faster than the old 1980s stuff).
Some problems cannot be split up (ex: web requests are so unique I cannot imagine they'd ever be programmed into a GPU due to their divergence). However CPUs still exist for that.
> But the biggest problem I'm having is management of buffer space for intermediate objects
My advice for right now (barring new APIs), if you can get away with it, is to pre-allocate a large scratch buffer for as big of a workload as you will have over the program's life, and then have shaders virtually sub-allocate space within that buffer.
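A CPU-side Rust analogue of that pattern, heavily simplified: one big pre-allocated scratch buffer and an atomic bump cursor. In an actual shader the cursor would live in a storage buffer and be advanced with an atomic add; all names below are made up.

```rust
// Illustration of "pre-allocate one big scratch buffer and have workers
// sub-allocate out of it by bumping an atomic cursor".
use std::sync::atomic::{AtomicUsize, Ordering};

struct ScratchBuffer {
    storage: Vec<u32>,   // stands in for a big GPU buffer of 32-bit words
    cursor: AtomicUsize, // next free word index
}

impl ScratchBuffer {
    fn new(words: usize) -> Self {
        Self { storage: vec![0; words], cursor: AtomicUsize::new(0) }
    }

    /// Reserve `words` words; returns the start offset, or None if the scratch
    /// buffer is exhausted (the caller must handle that, e.g. by splitting the
    /// workload or sizing the buffer for the worst case).
    fn sub_alloc(&self, words: usize) -> Option<usize> {
        let start = self.cursor.fetch_add(words, Ordering::Relaxed);
        if start + words <= self.storage.len() {
            Some(start)
        } else {
            None
        }
    }

    /// Throw everything away between frames: just reset the cursor.
    fn reset(&mut self) {
        *self.cursor.get_mut() = 0;
    }
}

fn main() {
    let mut scratch = ScratchBuffer::new(1 << 20); // 1M words, ~4 MiB
    let a = scratch.sub_alloc(256).expect("scratch exhausted");
    let b = scratch.sub_alloc(1024).expect("scratch exhausted");
    println!("allocation a at word {a}, b at word {b}");
    scratch.reset();
}
```

The appeal is that "allocation" becomes a single atomic add and freeing is a whole-buffer reset between frames; the cost is that you must size the scratch buffer for the worst case or handle the overflow path.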
Agreed, there are two different problems being described here.
1. Divergence of threads within a workgroup/SM/whatever
2. Dynamically scheduling new workloads (i.e. dispatches, draws, etc) in response to the output of a previous workload
Raytracing is problem #1 (and has its own solutions, like shader execution reordering), while Raph is talking about problem #2.
> Raytracing is problem #1 (and has its own solutions, like shader execution reordering)
The "solution" to Raytracing (ignoring hardware acceleration like shader reordering), is stream compaction and stream expansion.
If you are willing to have lots of loops inside of a shader (not always possible due to Windows's 2-second maximum), you can write "while (hits_array is not empty)" style code, allowing your 1024-thread workgroup to keep calling all of the hits and efficiently processing all of the rays recursively.

--------
The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:
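Roughly the following shape. This is a CPU-side Rust sketch of the pattern with invented names and toy functions; on a GPU each Vec would be an append/consume buffer and each loop a dispatch over that buffer's contents.

```rust
// Multi-worklist pattern: instead of one hits_array there is one pending list
// per follow-up function, and we keep draining them until all are empty.
#[derive(Clone, Copy, Debug)]
struct WorkItem {
    payload: u32,
}

// Which follow-up "function" a produced item should go to next.
#[derive(Clone, Copy)]
enum Next {
    Func1,
    Func2,
    Done,
}

// Toy rules chosen only so the example terminates: the payload shrinks
// every step until it reaches zero.
fn func1(item: WorkItem) -> (WorkItem, Next) {
    if item.payload == 0 {
        (item, Next::Done)
    } else if item.payload % 2 == 0 {
        (WorkItem { payload: item.payload / 2 }, Next::Func2)
    } else {
        (WorkItem { payload: item.payload - 1 }, Next::Func1)
    }
}

fn func2(item: WorkItem) -> (WorkItem, Next) {
    if item.payload == 0 {
        (item, Next::Done)
    } else {
        (WorkItem { payload: item.payload - 1 }, Next::Func1)
    }
}

fn main() {
    // One "append buffer" per follow-up function, instead of a single hits_array.
    let mut queue1: Vec<WorkItem> = (0..16).map(|payload| WorkItem { payload }).collect();
    let mut queue2: Vec<WorkItem> = Vec::new();
    let mut finished: Vec<WorkItem> = Vec::new();

    // while (any work list is not empty) ...
    while !queue1.is_empty() || !queue2.is_empty() {
        let mut next1 = Vec::new();
        let mut next2 = Vec::new();

        // "Dispatch" func1 over everything waiting for it ...
        for item in queue1.drain(..) {
            let (out, next) = func1(item);
            match next {
                Next::Func1 => next1.push(out),
                Next::Func2 => next2.push(out),
                Next::Done => finished.push(out),
            }
        }
        // ... and func2 over everything waiting for it.
        for item in queue2.drain(..) {
            let (out, next) = func2(item);
            match next {
                Next::Func1 => next1.push(out),
                Next::Func2 => next2.push(out),
                Next::Done => finished.push(out),
            }
        }
        queue1 = next1;
        queue2 = next2;
    }
    println!("{} items finished", finished.len());
}
```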
Now of course we can't grow "too far"; GPUs can't handle divergence very well. But for "small" numbers of next-arrays and "small" amounts of divergence (i.e. I'm assuming that func1 is the most common here, like 80%+, so that the buffers remain full), this technique works. If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (ex: any HTTP server code will be awful on GPUs) and you're forced to use a CPU.
Having talked to many engineers using distributed compute today, they seem to think that (single-node) parallel compute haven't changed much since ~2010 or so.
It's quite frustrating, and exacerbated by frequent intro-level CUDA blog posts which often just repeat what they've read.
re: raytracing, this might be crazy but, do you think we could use RT cores to accelerate control flow on the GPU? That would be hilarious!
RT cores? No. Too primitive and specific.
But there is seemingly a generalization here to the Raytracing software ecosystem. I dunno how much software / hardware needs to advance here, but we are at the point where Intel RT cores are passing the stack pointers / instruction pointers between shaders (!!!). Yes through specialist hardware but surely this can be generalized to something awesome in the future?
------
For now, I'm happy with stream expansion / stream compaction and looping over consume buffers and producer/append buffers.
Are they really too specific? https://arxiv.org/abs/2303.01139
Well, it's not that the results were good though, lol.
No mention of the Transputer :(
Agreed with the premise here
I have never done GPU programming or graphics, but what feels frustrating looking from the outside is that the designs and constraints seem so arbitrary. They don't feel like they come from actual hardware constraints/problems. It just looks like pure path dependency going all the way back to the fixed-function days, with tons of accidental complexity and half-finished generalizations ever since.