Renaming fopen64 to intercept library calls feels like a brittle hack masquerading as "sandboxing." Why not just upstream this hardware support to nvtop instead of fragmenting the ecosystem?
It looks cool, and I was excited to get monitoring for the NPU on my Ryzen AI 395+, but unfortunately it does not show up. NPU support in Linux really seems to be an afterthought.
Looks cool!
nvtop can actually support TPUs too via https://github.com/rdyro/libtpuinfo/ https://github.com/Syllo/nvtop/blob/76890233d759199f50ad3bdb...
It would be nice to have CPU usage added so I have everything in one place.
Currently I use btop, which shows basic GPU load along with CPU, network, etc.
Look into all-smi: https://github.com/lablup/all-smi It supports every GPU imaginable, including Apple Silicon and many AI accelerator cards.
Is it capable of exposing metrics in Prometheus format?
consider it done
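For anyone unfamiliar: Prometheus scrapes a plain-text exposition format over HTTP, so a `/metrics` endpoint for a GPU monitor would emit something roughly like this (metric and label names here are illustrative, not the tool's actual output):

```
# HELP gpu_utilization_percent GPU compute utilization per device
# TYPE gpu_utilization_percent gauge
gpu_utilization_percent{gpu="0"} 87
gpu_utilization_percent{gpu="1"} 12
# HELP gpu_memory_used_bytes GPU memory currently in use
# TYPE gpu_memory_used_bytes gauge
gpu_memory_used_bytes{gpu="0"} 17179869184
```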
> Renaming fopen64 to intercept library calls feels like a brittle hack masquerading as "sandboxing." Why not just upstream this hardware support to nvtop instead of fragmenting the ecosystem?
Sadly, sandboxing is something that can't be upstreamed; this way, sandboxing is kept in zml instead of patching Mesa.
As for nvtop: great program, but it was missing a few features we needed (such as sandboxing).
> It looks cool and I was excited to get monitoring for the NPU on my Ryzen AI 395+, unfortunately it does not show. NPU support in linux really seems to be an afterthought.
Weird, because we tried it. It doesn't show anything at all?
We use amdsmi to get the metrics. I'll investigate.
If this logic were pushed into nvtop, wouldn't the codebase become unmaintainable? Each vendor's interception method is going to be different.
"NPU" seems to refer to trainium only?