Keep in mind the significant cost savings, given that it doubles as a home heating solution.
Just keep in mind that a heat pump will often be 300-400% efficient at adding heat. This is 100% efficient and for once that's not actually very good.
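To put rough numbers on it (electricity price and COP values are assumed, purely illustrative):

    # Cost of 1 kWh of delivered heat under assumed numbers.
    price_per_kwh = 0.30      # assumed electricity price, $/kWh
    resistive_cop = 1.0       # a GPU turns electricity into heat 1:1
    heat_pump_cop = 3.5       # a typical heat pump, i.e. "350% efficient"
    print(price_per_kwh / resistive_cop)   # $0.30 per kWh of heat
    print(price_per_kwh / heat_pump_cop)   # ~$0.09 per kWh of heat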
Every room in my house with a desktop computer is consistently about 5°F warmer than the rest. There's some difference because of monitors and sun exposure, but yeah.
I actually did have to alter my HVAC automation to account for it, lol.
The more you buy, the more you save!
More to the point, I hope the electrical certification for computers is getting a going-over, as these GPUs make more heat than the smallest electric heaters on sale. Can't help but consider a whole different mounting option here: I keep toying with the idea of doing a solar-powered crypto mine that pre-heats water for a coin-op laundromat business. There are water-to-water heat pumps made for boosting the water temperature up to what's required for commercial hot water, while cooling the pre-heat tank back down to a good temperature for the GPUs.

Laundromats are one of those places that are truly timeless, with a particular and wide cross-section of society rubbing shoulders. Proper diners are similar, but are now mostly gone, which is unfortunate, as both enforce a strangely formal civility and even camaraderie on people who rarely otherwise share space.
We had a GPU cluster in an office building and were asked to vacate because the HVAC for the entire floor was overwhelmed. It smelled like a tropical beach dumpster most of the time, and the lights would dim when a new job was queued.
ML is so boring, =3
No price indicated. If you have to ask, you're not the market.
A blank check to Nvidia will get you a spot in their buying-invitation raffle.
Wasn't the Ada edition like $6k or so?
Good luck finding it for less than $9K, and it has half the VRAM and an older chip. I predict an MSRP of no less than $15K, with IRL prices above $20K.
Ouch, I was just asking for a friend.
Send the bill to my manager
It still has to compete with renting an actual professional card(s).
These are workstation cards and a lot of professionals need them to do other work than ML. The elitist attitude is pretty amusing, though.
It's not elitist. The Nvidia 'pro' cards (Quadro etc.) have always been a slightly unlocked, wildly more expensive version of the consumer cards. The V100, A100, and H100 are meaningful hardware upgrades over the consumer line.
> need them to do other work than ML
I need more frames in CS2.
Whatever your profession is, mate, I hate to break it to you, but the 6000 series is usually ~20% slower in gaming performance than the best consumer card of the generation.
vram mafia
What's the limitation that keeps memory limited to 96GB? Could one put 512GB of memory on a card? I'm curious about what is the limiting factor.
GDDR memory buses are so fast that the RAM chips have to be packed tightly around the GPU core to maintain signal integrity, so the limit is more or less how many chips they can physically fit multiplied by the biggest chip capacity their suppliers can provide.
But in theory the RAM chips don't need to be synchronous with each other. More than that, the data lanes on a chip don't need to be synchronous either: you could treat each lane as an independent serial channel. And GDDR latency is already high enough that longer lanes wouldn't change anything.
You can't quite treat each lane as an independent serial channel; DRAM chips are at least 8 bits wide, and this GPU needs to either connect 16 bits to each die or connect groups of 32 bits to two dies, and the DRAM die does want to get the whole word on the same clock cycle. There's no SERDES at either end like you get with PCIe or Ethernet. Just 512 PHYs doing 32+Gbps PAM3. If you want those to be long-reach PHYs, you're not going to have much die space or power budget left for compute.
Does that mean it's perfectly feasible to have more if one accepts a higher latency? Seems like there could be plenty of use cases where that's preferable.
Not exactly. The name of the game with GDDR memory is "speed on the cheap." To do this, it uses a parallel bus with data rates pushed to the max. Not much headroom for things that could compromise signal integrity like socketed parts, or even board traces longer than they absolutely need to be. That's why the DRAM modules are close to the GPU and they're always soldered down.
Also, the latency with GDDR7 is pretty terrible. It uses PAM3 signaling with a cursed packet encoding scheme. At least they were nice enough to add in a static data scrambler this time around! The lack of RLL was kind of a pain in GDDR6.
Like the GTX 970 with its 3.5 + 0.5 GB memory.
GDDR chips already have very high latency.
Also limited by heat dissipation.
Apple sells a Mac Studio with the M3 Ultra chip and 512GB VRAM (unified memory between CPU and GPU). It costs $9,500.
Their secret is that the memory is manufactured within the chip package.
LPDDR5X ~550GB/s vs GDDR7 which is ~1.8 TB/s
Why can't NVIDIA manufacture 512 Gb chips and put 16 of them on the board?
Apple's architecture comes with its own trade-offs: it gives them huge capacity and pretty good bandwidth, but not nearly as much bandwidth as Nvidia's architectures have. The M3 Ultra is 800GB/sec, the RTX 5090 is 1.8TB/sec, and the H200 is 4.8TB/s(!). Huge capacity with middling bandwidth is in vogue because it's a good fit for AI inference, but AI training and most other applications of GPUs need as much bandwidth as they can get.
Well, if you have 16 M3-equivalent chips you can multiply the bandwidth by 16, right? Also, as I understand it, ML is basically matrix multiplication, which does O(N³) operations on O(N²) numbers, so bandwidth might not be as important as the number of ALUs.
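A rough sketch of that arithmetic-intensity argument, with assumed ballpark GPU figures (not this card's specs):

    # FLOPs per byte for an N x N matmul vs. what a GPU needs to stay compute-bound.
    N = 8192
    flops = 2 * N**3                # multiply-accumulate operations
    bytes_moved = 3 * N * N * 2     # read A and B, write C, fp16, ignoring cache reuse
    print(flops / bytes_moved)      # ~2700 FLOPs per byte of traffic

    peak_flops = 200e12             # assumed dense fp16 throughput, FLOP/s
    bandwidth = 1.8e12              # assumed memory bandwidth, bytes/s
    print(peak_flops / bandwidth)   # ~111 FLOPs per byte needed to keep the ALUs fed

So for big enough matrices the ALUs, not the memory bus, become the limit; small batches tip the balance back toward bandwidth.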
This doesn’t really answer the parent’s question.
That the memory is on a PCB close to the CPU/GPU certainly helps with signal integrity, but that's not the key factor here. The Apple platform has high memory bandwidth compared to x86 PCs because the CPU has a wide memory bus. You can get similar memory bandwidth out of high-end Epyc and Xeon CPUs, which use standard DIMMs but with many more memory channels than a regular desktop computer.
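The back-of-the-envelope is just bus width times transfer rate; the configurations below are assumed, illustrative ones:

    # Peak bandwidth = (bus width in bytes) x (transfer rate).
    def peak_bw_gb_s(bus_bits, mega_transfers_per_s):
        return (bus_bits / 8) * mega_transfers_per_s / 1000   # GB/s

    print(peak_bw_gb_s(128, 5600))    # dual-channel desktop DDR5-5600: ~90 GB/s
    print(peak_bw_gb_s(768, 4800))    # 12-channel server DDR5-4800: ~460 GB/s
    print(peak_bw_gb_s(1024, 6400))   # 1024-bit LPDDR5-6400 (Ultra-class ballpark): ~819 GB/s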
It's actually not within the chip's package. It's soldered to the board. It's just regular, fairly high spec LPDDR5X IIRC, there are just a TON of memory channels.
It's not in the package? TIL... My misunderstanding seems to be common across the interwebs.
I remember Apple used to show slides depicting the M1 SoC as one unit containing a CPU, GPU, Neural Engine, cache, and DRAM all together. But slides shown at an Apple event definitely qualify for artistic license.
The memory is soldered on top of the package, here's an actual real life photo of the M1 package with two LPDDR packages sitting on it:
https://en.wikipedia.org/wiki/Apple_M1#/media/File:Mac_Mini_...
You could call that on-package, but it's not on-package in the same way that GPU HBM is, where the main die and the memory dies are packaged together on the same substrate. That's a much more difficult and expensive process; apparently packaging is the main bottleneck for H100/H200 production.
https://i.imgur.com/hFTPyjk.jpeg
For that H100 image, how do they get the top surface of all the chips on the same plane? Do they solder them "upside down", so the chips are sitting on a reference plane, and the PCB is allowed to "float"? Or is the soldering controlled enough that they can just heat and be done?
Nor is it a chiplet, in which case it would qualify as "in-SoC".
Wait till you learn that unified memory was in common use by PCs long before Apple "invented" it.
The statement is correct; it's not on the substrate like AMD's 3D V-Cache, and it doesn't use an interposer like HBM.
You can think of it as a small PCB with the CPU die and the memory soldered very close by (and, as mentioned, many more memory channels).
While there are definitely physical limits, the core limitation here is greed. They would sell fewer cards. Same reason their consumer cards are limited to ridiculously low amounts of VRAM (16GB on an RTX 5080, only 8 on the RTX 4060, etc.), so if you want to do any serious AI you have to buy their overpriced enterprise cards.
Not just fewer cards; by giving their cards less RAM, they are helping prop up cloud-based inference, which in turn generates revenue for their most expensive line. It's the reason you used to be able to get 12GB of RAM on a 3060, and now you have to move up three increasingly expensive models to get the same, restricted only by drivers and not capability. They made it clear this was all intentional because they didn't want consumer hardware in data centers, as it costs them profit.
Bandwidth, and GPU real estate in terms of board area. The biggest GDDR7 chips are 3GB with a 32-bit interface, and this card has a 512-bit-wide bus. And even this thing is going to moonlight as a space heater.
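Which is roughly where the 96GB figure falls out; a back-of-the-envelope sketch, assuming the usual clamshell layout with two chips per 32-bit channel:

    # Capacity limit from bus width and die size (assumed clamshell layout).
    bus_width_bits = 512
    chip_width_bits = 32                               # one 32-bit channel per GDDR7 chip
    chip_capacity_gb = 3                               # largest GDDR7 die shipping today
    channels = bus_width_bits // chip_width_bits       # 16 channels
    clamshell = 2                                      # chips on both sides of the board
    print(channels * chip_capacity_gb * clamshell)     # 96 GB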
Why can Apple pack something like 64 GB on the CPU chip when NVIDIA cannot?
Apple uses LPDDR4 which comes in densities of up to 16Gb AFAICT. So they can have ~5X as much memory using the same bus width.
Apple uses LPDDR5X, not LPDDR4, iirc
Of course it's ddr5 not 4.
From a PCB perspective, the LPDDR5(X) interface is quite different from regular DDR5. Same with DDR4 and LPDDR4. Source: I have designed a few boards with different memory interfaces.
LPDDR dies can stack but GDDR cannot?
Wiring, too; GDDR7 chips look like they have a lot of pads, and you need matching pads on the graphics chip for them as well.
What's the use case for 512 GB of memory that you cannot achieve through multiple GPUs? Maybe you can make it a bit cheaper since you don't need multiple chips, but I'd say that's only a maybe, because the chip is not the costliest part for Nvidia to manufacture; the memory is[1].
[1]: https://www.nextplatform.com/2024/02/27/he-who-can-pay-top-d...
One card with twice the memory would let you run the model on half as many cards, at half the speed.
Planned obsolescence, got to sell 7xxx cards somehow.
What a beast. Like past generations there is a variant with a blower-style cooler which is limited to 300W, but now they're also doing variants modelled after their gaming cards but with even higher TDPs. Triple the memory of the gaming flagship too, you used to only get double.
This looks like the same TDP as the gaming flagship of the same chip (5090, also GB202 based).
The 5090 reference design is 575W. Not a huge difference, but the workstation card is slightly more.
Will be fun to see if it has all the ROPs it should[1], or if NVIDIA has really gone all in on making this the worst product launch ever...
[1]: https://www.techpowerup.com/332884/nvidia-geforce-rtx-50-car...
I just hope whatever organ I’m selling to afford this GPU is one of the paired ones.
Too bad SLI isn't a thing anymore.
As far as I can tell, it uses the same infamous power connector as 5090. I wonder if there are any differences there, maybe some additional balancing/safety features?
Is this a real 600W, or 750W in burst mode? I'm too accustomed to the TDP lies from CPUs.
The T is for thermal; it's mostly about needing to dissipate that much heat on average, not peak (or transient) power.
> 600W
is this a sign that semiconductor scaling is completely dead now?
Not yet, but the NVIDIA RTX 5000 (Blackwell) series does not use a newer manufacturing process, so its energy efficiency is slightly worse than that of the RTX 4000 SUPER (Ada) series, which remain the most efficient GPUs (e.g. the RTX 4080 SUPER).
The RTX 5000 (Blackwell) series has increased performance only by using bigger chips and higher power consumption. The RTX Pro Blackwell series uses the same chips as the consumer series.
600W of power? Would you sell a car with "35l/100km of power"?
Because the people buying these things have power supplies that are also rated by their wattage.
If someone has a 750W PSU they'll need to replace it with something higher, e.g., 1000W, to run this card and the rest of the computer components.
Having power ratings in these simple terms is helpful.
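A rough sizing sketch, with the non-GPU figures being assumptions rather than measurements:

    # Ballpark PSU sizing for a system built around this card.
    gpu_w = 600       # card TDP from the article
    cpu_w = 200       # assumed high-end desktop CPU under load
    rest_w = 100      # assumed board, RAM, drives, fans
    margin = 1.2      # headroom for transients and PSU efficiency sweet spot
    print((gpu_w + cpu_w + rest_w) * margin)   # 1080 W, so a ~1000-1200 W unit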
what on earth is wrong with watts as a unit of power?
Would you rather have it say 50A @ 12V? The headline is written incorrectly (because the article writer didn’t write it), but the article says ‘needs 600W of power’.
Headlines are misleading, film at 11.