> Built in Rust for performance and in Python for extensibility
Omg, a team that knows how to selectively use tech as needed. Looking at the Rust web developers in the corner.
I do think Rust would be better for web dev if it had GC, but it doesn't, and no other language comes close to matching its ergonomics otherwise. And the memory management is something you learn once; after that it's a bit verbose, but no big deal. If you feel like you absolutely have to write custom data structures with circular references for your web server, then I tentatively suggest that maybe you're doing web dev wrong.
On my team we onboarded a data scientist, used to working in Python and with no prior Rust experience, onto a Rust project, and it was just not a big deal. Maybe I'm just fortunate when it comes to colleagues.
Sounds like nonsense. Again, I asked in another comment for an example of some Rust “web” code that exemplifies what you are talking about. You mentioned “ergonomics”, and you mentioned “noob friendly”. I’d love to see some of this Rust code.
Show us, we’ll discuss.
I feel like some devs are so insecure that they really think... ugh, I can’t even fully explain the pathology of the Rust people without cursing them out.
You are not a better developer, that’s ALL I want to say to the Rust people. In fact, most of you are bad developers for doing what you have been doing with this language. You ALL must find a better way to show your intellectual prowess.
I heard you guys are even bugging the Linux people.
Unsure if the implication is that Rust is poorly suited for web development or what.
It is, in my opinion (as an avid Rust user!). The type errors from most of the major web frameworks/ORMs (diesel, sqlx) are just awful, more often than not. Usually some inscrutable thing involving Send/Sync. Or some hilariously complicated type/trait hackery on the part of the library, attempting to save me from the former, that I'm never going to figure out.
Great language in many other settings, but not this one. At least not right now, but given my experience with async Rust in general, I'm not sure it ever will be.
> Usually some inscrutable thing involving Send/Sync.
What kind of queries are you writing? I never see SQLx emit anything like that. I always get back SQL errors.
Column mismatches are what I hit most in development, and they're pretty explicit.
The hairiest thing I see with SQLx is when I try to write custom type conversions for my own types to SQL fields. I sometimes have to delve into macros. But those errors are pretty self-explanatory too.
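For reference, the simplest flavor of that kind of conversion doesn't even need a manual impl. A minimal sketch (the UserId newtype is invented for illustration, assuming a BIGINT column):

    // Deriving sqlx::Type with `transparent` makes the wrapper
    // encode/decode exactly like its inner column type.
    #[derive(Debug, sqlx::Type)]
    #[sqlx(transparent)]
    struct UserId(i64);

    // UserId can now appear directly in .bind() calls and query_as rows, e.g.:
    // sqlx::query_as::<_, (UserId, String)>("SELECT id, name FROM users")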
Rust is emerging as one of the best web programming languages out there.
Actix and Axum feel like Python's Flask.
Rust has decent Redis and connection pool libraries, but the SQL space needs more work. Diesel is too ORM-y (I've never liked ORMs). While SQLx allows you to write "typechecked" SQL, it still has really annoying edge cases (WHERE IN clauses can't be typechecked, as sketched below; type bindings can get hairy; etc.)
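For the WHERE IN case specifically, the usual Postgres workaround is to bind the whole list as one array parameter and compare with = ANY($1). A sketch (table and column names invented for illustration):

    use sqlx::PgPool;

    // IN ($1, $2, ...) needs a variable number of placeholders, which the
    // compile-time macros can't check; = ANY($1) is a single array bind.
    async fn users_by_ids(pool: &PgPool, ids: &[i32]) -> sqlx::Result<Vec<(i32, String)>> {
        sqlx::query_as::<_, (i32, String)>("SELECT id, name FROM users WHERE id = ANY($1)")
            .bind(ids)
            .fetch_all(pool)
            .await
    }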
I'm not very happy about the state of Rust's Elasticsearch libraries, either.
Rust probably needs a Rails/Django-like framework too for those that prefer a framework-oriented development lifecycle.
Rust also needs some observability frameworks. There are a few, but the choices are sparse.
I'd give Rust a 7.5/10 for web programming, and as far as the promise of the language goes, I'd give it an 11/10. Developing in Actix and Axum feels amazing. It's honestly better than Go and Python. The other pieces (database, API clients, etc.) will presumably get better in time.
And because of the way HTTP request flow logic is typically structured, 99.9% of the time you'll never hit Rust's borrow checker or have to worry about lifetimes. It's as if you've been given one of the best typed languages, best package managers, and nearly no tradeoffs. The server compiles down to a single static binary. It's multithreaded, and it's blindingly fast.
I'm picking Rust for every new web service I write these days.
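To give a flavor of the ergonomics, here's roughly what a minimal Axum service looks like. A sketch with invented route and type names, assuming axum 0.7, tokio, and serde:

    use axum::{extract::Path, routing::get, Json, Router};
    use serde::Serialize;

    #[derive(Serialize)]
    struct Greeting {
        message: String,
    }

    // Typed path extraction and JSON responses, Flask-style; no
    // lifetimes or borrow-checker ceremony in sight.
    async fn greet(Path(name): Path<String>) -> Json<Greeting> {
        Json(Greeting { message: format!("hello, {name}") })
    }

    #[tokio::main]
    async fn main() {
        let app = Router::new().route("/greet/:name", get(greet));
        let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
        axum::serve(listener, app).await.unwrap();
    }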
This is obviously heavily biased, because there is no way any reasonable person would think Axum or Actix are like Flask. That's just not possible with a language like Rust. The Rust standard library is horrible compared to Python or Go.
You need more dependencies to build a simple API in Rust than you need in Python and Go combined.
Axum, tokio, serde, serde_json, anyhow, sqlx and probably 5 more to fix the bad standard library.
In Python and Go you can build a web app with just the standard library.
TBH, after adding in a database, Rust is probably not that much faster than Go. And Go has everything you need in the standard library, compiles to a binary, and the package manager doesn't matter because you don't need one with Go.
The standard lib is not "bad"; it's just modular, to avoid C++-like pitfalls.
People doing web dev in Python could use Python's native json module, for example, but they rarely do because there are far more performant options.
Nowhere close to the plethora of tooling and frameworks available in the Java and .NET ecosystems for all kinds of distributed computing scenarios.
And if one misses an advanced ML-style type system, Scala, Kotlin, and F# are there.
Show me some Rust web code that you think exemplifies what you are talking about. I think your example will speak for itself and close this argument. No reason to go back and forth.
Same here; Rust with Actix can even replace nginx.
This. People are sleeping on one of the biggest developments to hit backend. They'll know soon enough.
I'm too tired to respond to the two detractors, but it's hilarious that one of the arguments against Rust is pulling in packages. Some of the best packages, at that. I wonder if that's their argument against most other languages.
Big standard libraries are a mistake, because the language is forever left with shitty old design decisions. Python's standard library is full of crap.
I might just put together a side-by-side Flask / Go / Rust comparison. It'll be so damning against Python and Go. Rust is the same LOC count and complexity, yet it's a nicer language with a better type system, and it's as fast as nginx.
People don't know how good Rust webdev is.
The complaint about serde in particular... Everyone and their dog includes Jackson for webdev in Java.
Jackson is awful. Serde on the other hand is the smoothest JSON library I've ever used.
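Concretely, the whole JSON round-trip is one derive. A sketch (the User struct is invented for illustration):

    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize, Debug)]
    struct User {
        id: u64,
        name: String,
    }

    fn main() -> serde_json::Result<()> {
        // One derive gives you both directions, with field types checked.
        let user: User = serde_json::from_str(r#"{"id": 1, "name": "ada"}"#)?;
        println!("{}", serde_json::to_string(&user)?); // {"id":1,"name":"ada"}
        Ok(())
    }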
They actually implemented a decent amount of the HTTP stuff in Rust, if you look at the docs.
Nvidia not name products after existing things in the ML space challenge: IMPOSSIBLE
More seriously, though:
> OpenAI Compatible Frontend – High performance OpenAI compatible http api server written in Rust.
Is this normal in this space? I know everyone has settled on copying the S3 API for object storage but I’m unsure if we’ve done the same for LLM serving.
Increasingly so. Many other popular inference tools in this space also expose an OpenAI-compatible API: vLLM, llama.cpp, and LiteLLM all do.
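That convergence is the point: client code only needs its base URL swapped to move between servers. A rough sketch in Rust (reqwest + serde_json; the URL and model name are placeholders):

    use serde_json::json;

    #[tokio::main]
    async fn main() -> Result<(), reqwest::Error> {
        // Any OpenAI-compatible server works here: vLLM, llama.cpp, Dynamo, ...
        let base = "http://localhost:8000/v1";
        let body = reqwest::Client::new()
            .post(format!("{base}/chat/completions"))
            .json(&json!({
                "model": "placeholder-model",
                "messages": [{ "role": "user", "content": "hello" }]
            }))
            .send()
            .await?
            .text()
            .await?;
        println!("{body}");
        Ok(())
    }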
So this replaces Triton for LLMs, or?
This is very narrowly focused on LLMs, whereas Triton is still useful for running all kinds of ML models. In practice, Triton is a very poor choice for LLMs specifically because it has none of the required, non-negotiable features like KV caching built in.
Same question here. Just asked Grok for a comparison: https://grok.com/share/bGVnYWN5_fa210574-f27b-45ae-9d95-19ed...
As someone who spent the better part of a year trying to get various Nvidia inference products to work _at all_ even with a direct line to their developers, I will simply say "beware".
Just curious what your issues with Triton were. We've done OK with it using it to serve LLM models w/ a classifier head via HF Transformers pipeline & Flash Attention 2, as well as serving text generation models with the vLLM back-end.
Triton is not that bad at all, considering the wide scope of systems it has to support (TensorRT, ONNX, multiple generations of PyTorch, CUDA, Python). It was much nicer than the old TorchServe project, which was JVM-based.
I've done very little with Nvidia software, but what I have done puts me off ever doing it again. I quit a job partially because it involved trying to get their shit to work. (There were other factors, but that was definitely on the 'GTFO' side)
Can you share some of your wisdom on setting up a scalable inference infrastructure?
Use Ray Serve. https://docs.ray.io/en/latest/serve/index.html
As someone who has run LLMs in production, using Ray is probably the worst idea. It's not optimized for language models, and is extremely slow. It has no KV caching, no model parallelism, and none of the other basic table-stakes features offered by Dynamo and other open-source inference frameworks. Useful only if you have <1 QPS.
Use SGLang, vLLM, or text-generation-inference instead.
It really depends on the task. If you have 1 massive job, Ray sucks and doesn't provide table stakes. If you have 50M tiny jobs, Ray and KubeRay are great and serve as the backbone of several billion-dollar products.
Good for the goose, good for the gander...
This is probably true, but unlike every Nvidia product we tried, it did, you know, reply to inference requests with actual output. That said, you can serve vLLM with Ray Serve. https://docs.ray.io/en/latest/serve/tutorials/vllm-example.h...
Ray doesn't offer anything if you use vLLM on top of Ray Serve though.
It does if you need pipeline parallelism across multiple nodes.
is this in reference to Triton?
And NIM, yes.