Read Lamport's 1978 "Time, Clocks, and the Ordering of Events in a Distributed System". The Raft dissertation never really explains why using a logical clock is the critical, central thing; it just glosses over it.
And then read Lampson's 1996 "How to Build a Highly Available System Using Consensus". It will teach you how to prove equivalence of your variation back to an established, proven model. It is very clearly written.
I heard on good authority that one of the earliest implementations of consensus at Google involved Sanjay Ghemawat tangentially. Sanjay's PhD on Harp, with Barbara Liskov, had also been based on Viewstamped Replication, and so, internally, this became the first flavor at Google. Later, it would come to be referred to externally as Paxos, only because that was more well known at the time.
Check out Alex Miller's Data Replication Design Spectrum for what you might use instead of Raft (for replication specifically), or what tweaks you might make to Raft for better throughput or space efficiency (for replication).
https://transactional.blog/blog/2024-data-replication-design...
Here's a recent Raft story from when I was trying to learn how to use it. I'm working on coding a multiplayer io game for fun. One key requirement was replication of game state, so that live games can fail over if one of the servers crashes or is restarted.
I was like, "wait, that's what raft does, and my game state is already a state machine ... let's do this thing". I then ended up putting my entire game state into raft, and even abused it as a generic pubsub engine.
It was fun, and it worked, and was actually not hard to set up using the Rust implementation. Then when I was done, I realized how pathological it was to put literally all game state and every single event into Raft, so I decided to stop indulging myself by having fun trying to hack Raft into an unintended use case, and just used Redis for game state and pubsub.
I never load tested the Raft implementation, but once I do have some perf testing tools, it'd be interesting to run the Raft code vs. Redis comparison to see how throughput differs. For this use case, at some point Raft should fall over while Redis would be chugging along.
It depends!!! If you do something like Consul does, where you cram “events” into a radix tree that you send as one big batch, it should scale quite well. I wouldn’t expect batched Redis performance, but it should be within 70-80% of it, at my guess.
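To make the batching intuition concrete, here's a toy sketch (not Consul's actual code; the class and all names are made up for illustration) of amortizing consensus cost by committing many buffered events as one replicated entry:

```python
# Toy sketch: instead of paying one replication round per event, buffer
# events and submit them as a single log entry. `replicate` stands in
# for one full consensus round (e.g. a Raft AppendEntries round-trip),
# which is the expensive operation being amortized.

class BatchingLog:
    def __init__(self, replicate):
        self.replicate = replicate
        self.pending = []
        self.rounds = 0  # how many consensus rounds we actually paid for

    def submit(self, event):
        # Cheap: just buffer locally.
        self.pending.append(event)

    def flush(self):
        # One consensus round commits every buffered event at once.
        if self.pending:
            batch = list(self.pending)
            self.pending.clear()
            self.rounds += 1
            self.replicate(batch)

committed = []
log = BatchingLog(replicate=committed.extend)
for i in range(1000):
    log.submit(("player_moved", i))
log.flush()
assert len(committed) == 1000 and log.rounds == 1  # 1000 events, one round
```

The point is that the expensive part - the replication round-trip - is paid once per batch rather than once per event, which is why batched throughput can get within shouting distance of a non-consensus store.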
> I'm working on coding a multiplayer io game for fun.
I know what a multiplayer game is, but what's a multiplayer io game?
(not being snarky, i loved your comment)
There's a trend of "io games", so named because the first one was hosted on a site called "agar.io". It was then followed by slither.io and so on. They are characterized by a few distinctive qualities: they're entirely browser-based, players are instantly dropped into a big lobby with tons of other players, they typically have simplistic graphics (e.g. powered by canvas), they have no long-term progression (once you die you restart from the beginning), and they have some kind of leaderboard where you can see the players in your shard who have progressed the farthest since the last time they died.
It's just a browser based multi-player game with a .io TLD. Hope to make a Show HN soon :)
Raft played a huge role in improving how consensus is taught, compared to Paxos.
It’s hinted at in the introduction to the Raft paper that Raft is "most notably" similar to Viewstamped Replication from Brian Oki, Barbara Liskov and James Cowling at MIT. The reason being that Raft’s primary goal was not so much to reinvent or contribute consensus as to repackage consensus to be "more understandable than Paxos". Nevertheless, it’s not widely known, and probably could be emphasized a bit more, that the protocol is otherwise pretty much ‘88 Oki VSR (same essential protocol, different terms), itself predating Paxos by a year.
For example, if you compare '14 Raft (Stanford) with '12 VSR (MIT’s revision to their ‘88 original), the resemblance is striking:
2012: http://pmg.csail.mit.edu/papers/vr-revisited.pdf [1]
2014: https://raft.github.io/raft.pdf
Presentation aside, the only major difference is that Raft’s view change (called leader election in the Raft paper) preserves the flavor of Oki’s ‘88 Viewstamped Replication view change, electing the candidate with the longest log, and misses out on the improved round-robin view change from MIT in ‘12, which brings better stability and availability, lower latency, and no dueling leaders.
Raft has had a tremendous impact, compared to Paxos, in helping engineers to understand consensus. But once the training wheels are there, it’s important that we also understand and preserve the history of consensus, the people who pioneered these protocols.
[1] It's an accident of history that '12 VSR wasn't promoted as much as Raft would be, two years later. Otherwise, '12 VSR from James Cowling is arguably just as understandable, if not more. For example, because of the improved view change, the paper can begin immediately with normal replication. And there's no discussion, for example, of the need for randomized timeouts, it's elegantly designed away.
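As a rough illustration of the two view-change styles being contrasted above - a toy sketch, not either paper's pseudocode; the replica names are made up, and the timeout range is the example range from the Raft paper:

```python
import random

REPLICAS = ["a", "b", "c"]

def vsr_next_leader(view_number):
    # '12 VSR style: the leader is a deterministic function of the view
    # number, so every replica agrees on who the next leader is with no
    # election race and no randomized timeouts.
    return REPLICAS[view_number % len(REPLICAS)]

def raft_election_timeout(rng):
    # Raft style: each follower waits a randomized timeout before
    # becoming a candidate, which makes split votes unlikely but does
    # not rule them out entirely.
    return rng.uniform(0.150, 0.300)  # seconds; the Raft paper's example range

# Round-robin succession: view 0 -> a, 1 -> b, 2 -> c, 3 -> a, ...
assert [vsr_next_leader(v) for v in range(4)] == ["a", "b", "c", "a"]
assert 0.150 <= raft_election_timeout(random.Random(42)) <= 0.300
```

This is the sense in which the randomized-timeout discussion is "elegantly designed away": with a deterministic successor there is nothing to randomize.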
Re: "It's an accident of history that '12 VSR wasn't promoted as much as Raft would be, two years later. " - more accurate to say MIT vs Stanford difference.
one more difference I would add is how a leader is elected in Raft (randomized) vs VSR (deterministic succession)
Yes, exactly. That difference came in with '12 VSR's view change (deterministic, round-robin), which upgraded '88 VSR's view change.
'14 Raft missed this upgrade, unfortunately (despite citing the '12 paper), thus preserving Oki's '88 "pick the candidate with longest log".
I find the history of consensus fascinating, and was fortunate to interview both Brian Oki and James Cowling (so many anecdotes): https://www.youtube.com/watch?v=ps106zjmjhw
The Raft authors didn't "miss" it, they just decided that randomized was simpler--and simplicity was a deliberate design goal.
They found randomized was easier to reason about compared to trying to figure out how to explain and prove correctness when the next "determined" leader has died or been partitioned.
Lacking formal methods, the original VSR work seems less rigorous/proven. Maybe rigorous proof was just not possible at the time; after all Lamport had to work really hard to prove Paxos correct.
Consider that on page 31 of Brian Oki's dissertation, http://www.pmg.csail.mit.edu/papers/MIT-LCS-TR-423.pdf , he completely misses the problem that is discussed in Figure 3.7 of the Raft dissertation.
That is, the "new view" may be missing information from a leader who had a higher term/view number, but is temporarily offline. Does the original VSR provide the same fix that Raft does for that scenario? Does it address other possible scenarios? Mechanical proofs seem conspicuously absent, and even human proofs aren't in ready evidence.
> “seems” / “seem”
Are you intending to suggest there are no formal proofs since?
For example, you compare ‘14 Raft against ‘88 VSR, ignoring the interim progress made by MIT with ‘12 VSR?
It would seem fair to compare the latest version of both, no?
:) I'm only using "seem" to indicate the limits of my knowledge.
I was just hoping someone would chime in with a link to a stronger/formal proof for VSR. Are you aware of any?
So, yes, to my limited knowledge, I've not found any existing formal proofs for VSR.
The 2012 VSR-revisited paper clearly labels its arguments informal. In Section 8: "In this section we provide an informal discussion of the correctness of the protocol."
I'd be delighted to learn of any formal / machine-checked proofs of VSR; something equivalent to the Verdi project for Raft.
These are elaborate and subtle protocols, easy to get wrong.
In particular when it comes to things like reconfiguration, even Raft had the famous 2016 bug in the simpler of its two reconfiguration protocols.
Note that Verdi did not attempt to verify the reconfiguration protocol; apparently it was too difficult.
There was an attempt earlier this year to give a proven reconfiguration protocol for Raft, called ReCraft[1], based on an earlier paper called Adore[2]. They discuss why reconfiguration is so difficult to prove. It has to do with circularity.
"ReCraft: Self-Contained Split, Merge, and Membership Change of Raft Protocol" by Kezhi Xiong, Soonwon Moon, Joshua Kang, Bryant Curto, Jieung Kim, Ji-Yong Shin. last revised 28 Apr 2025, v2.
[1] https://arxiv.org/abs/2504.14802
[2] https://dl.acm.org/doi/pdf/10.1145/3519939.3523444
I'm not completely convinced yet that ReCraft works; at one point I thought they assumed away certain scenarios -- but I need to revisit it with a close reading.
At a minimum, reading the Adore paper's discussion of how much subtlety is involved is pretty compelling.
My conclusion is that formal proof is an absolute necessity to have a fighting chance at a correct implementation--especially when it comes to reconfiguration.
There are at least two formal proofs.
Have you tried Googling for them, instead of creating a throwaway account to comment anonymously here? :)
Per the site guidelines[1], please avoid gratuitous negativity.
[1] https://www.ycombinator.com/blog/new-hacker-news-guideline
> https://www.ycombinator.com/blog/new-hacker-news-guideline
Certainly I've googled. I have found no proofs.
Surely by now you would actually exhibit the formal proof if you had one, right? (He asks, for the 3rd time.)
Note that a TLA+ spec is not a formal proof. Also note that a model checking run is not a formal proof.
I'm still hoping to learn something new about how well proven, say, the reconfiguration or recovery protocols in VSR are.
So far I can only conclude that my research has turned up nothing as far as a formal proof for VSR goes. Please show me I'm wrong with a link to one. :)
I don't mean to trap you, my anonymous friend :) but does the formal proof for Raft in Coq via Verdi not apply—at least in spirit—to the essential view change and SMR protocol for Viewstamped Replication? And, similarly, would you say that Viewstamped Replication's core view change and SMR protocol is really that different from Multi-Paxos as a superset—such that that proof also wouldn’t carry over?
I agree that proofs are sensitive to modeling choices. But the reason I ask is that the literature generally treats the core of these protocols (view change + SMR—reconfiguration aside for now) as essentially equivalent.
For example, I'm sure you're aware of Heidi Howard's work here, which unifies consensus under one framework, the main differences being election style (i.e. Raft does random, VSR does round-robin) and terminology, not fundamental mechanics. The upside being that optimizations and sub-protocols, such as reconfiguration, can then be shared across protocols.
To your point about reconfiguration, reconfiguration sub-protocols are a field in themselves, and here it’s common to mix and match. To be clear, I'm not aware of a proof for the reconfiguration sub-protocol in '12 VRR (and I've found a bug in its client crash recovery sub-protocol—with Dan Ports finding another in its recovery sub-protocol), but again, as Howard notes, since the SMR cores are equivalent, you can adopt a reconfiguration sub-protocol or session sub-protocol that has been proven—at least this is common practice in production systems.
I hope the spirit of the argument is clear. And trust that none of this changes the OP point: that VSR pioneered the field and that Raft (in the authors' own words) is "most notably" similar to Viewstamped Replication.
(Let's not get into the subject of actual implementation correctness, which is orders of magnitude harder than formal design proofs, or the fact that the formal proofs in question still lack for a storage fault model—for example, many Raft implementations violate the findings of “Protocol-Aware Recovery for Consensus-Based Storage”)
https://transactional.blog/talk/enough-with-all-the-raft
Probably a FAQ, but the example about network partition leaves me wondering: if two clients talk to different leaders in different subnetworks, the algorithm guarantees that eventually one of the leaders will step down, and both clients will eventually see the same operation log.
Does it mean that the client should implicitly wait a bit before "trusting" their server, to be sure? What happens if you make a wrong decision based on an outdated log that is eventually rolled back?
(Or is it simply the same thing as with any eventually consistent system - you should _not_ have irreversible side effects that depend on the value of a log in the system, at all?)
There is only one version of the committed log.
That's the definition of committed. So you won't get a committed write or read from a partitioned "zombie" leader.
In good (non-buggy) Raft implementations, things are set up so there can only ever be at most one leader at a time.
That is the point of the election, the pre-vote, the leader-leases, the sticky-leader, and the leader-must-step-down rules (from chapter 6 of the dissertation).
Yes those optimizations are described as optional in the Raft dissertation. But really they are not.
You have to read the whole dissertation to get the details and the full picture.
You should have all of them as a part of a complete and sane implementation.
Pre-vote and sticky-leader are needed for liveness.
Leader leases are possible because you have pre-vote and sticky leader on; and they give you fast local-only leader reads.
The leader step-down rule protects client liveness and prevents multiple leaders at once.
Even if you do have multiple leaders at once, only one will actually be able to commit, so it's not a problem with safety or consistency, just with liveness/not wasting the client's time talking to "a dead man walking".
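The commit rule that makes the "dead man walking" leader harmless is simple enough to sketch (a toy illustration; the cluster sizes are made up):

```python
# Toy sketch of the majority-commit rule: an entry counts as committed
# only once more than half the cluster has acknowledged it (the leader's
# own copy counts as one ack).

def is_committed(acks, cluster_size):
    return acks > cluster_size // 2

# 5-node cluster partitioned 3/2: the stale leader stranded on the
# 2-node side can keep appending locally, but it can never collect 3
# acks, so none of its writes ever commit. The 3-node side can elect a
# new leader and keep committing.
assert not is_committed(acks=2, cluster_size=5)  # zombie leader's side
assert is_committed(acks=3, cluster_size=5)      # majority side
```

That asymmetry is why a partitioned minority leader is a liveness nuisance for its clients but never a safety problem for the committed log.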
I recall doing a rather large technical case for using Raft and a replicated log for the water industry, back in 2017 it must have been.
That's from knowing nothing about the shape of consensus algorithms to almost getting one adopted.
To me Raft's brilliance is how easy, clear and comprehensible they made thinking about it.
Kudos to the raft authors for making distributed consensus accessible. Structuring the presentation in terms of RPCs and making the algorithm well-suited for implementing replicated state machines may not sound like a big deal, but those decisions really helped to make it approachable.
In the span of a decade, consensus transformed from an esoteric algorithm that you could maaaaybe try implementing if you were a Google engineer into something widely deployed across many storage systems and readily accessible libraries, and Raft played a big part in that.
> if you are a google engineer
Do people really have this kind of inferiority complex? What exactly do you think Google engineers are? People with 5 eyes and four hands with 10 fingers each? They read the papers (Paxos or whatever), struggle to understand them, implement beta versions, and iterate just like everyone else.
Anyone can implement Raft. There are plenty of implementations not written by Google engineers, including a custom one in the product I work on. And developers in the Software Internals Discord are constantly in there asking questions on the road to implementing Raft or Viewstamped Replication.
I believe the parent is referring to pre-raft consensus algorithms like Paxos. I recall the explanation of Paxos being a lengthy PDF while the explanation of Raft is a single webpage, mostly visual.
Other way around.
Step 1 of Raft is for the distributed nodes to come to consensus on a fact - i.e. who the leader is.
ALL of Paxos is the distributed nodes coming to consensus on a fact.
Raft just sounds easier because its descriptions use nice-sounding prose and gloss over the details.
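For the curious, single-decree Paxos - the "nodes agreeing on one fact" core - fits in a few dozen lines. This is a hedged toy sketch (synchronous, failure-free, proposer and acceptors in one process, message loss and retries elided), not a production implementation:

```python
# Minimal single-decree Paxos sketch: a majority of acceptors agree on
# exactly one value, and once chosen it can never be overwritten.

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number promised
        self.accepted = None    # (number, value) or None

    def prepare(self, n):       # phase 1
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value): # phase 2
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    # Phase 1: gather promises from a majority.
    replies = [a.prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, we must propose the one
    # with the highest proposal number -- the heart of Paxos safety.
    prior = [p[1] for p in promises if p[1] is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask a majority to accept.
    acks = sum(1 for a in acceptors if a.accept(n, value) == "accepted")
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
assert propose(acceptors, 1, "leader=a") == "leader=a"
# A later, competing proposal cannot overwrite the chosen fact:
assert propose(acceptors, 2, "leader=b") == "leader=a"
```

The final assertion is the whole game: a later proposer is forced to adopt the already-chosen value, which is exactly the "consensus on a fact" that Raft's leader election performs as its step 1.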
Could be, it was a little ambiguously worded. That said, single-decree Paxos is much simpler than Raft, though I agree The Part-Time Parliament's analogy is a pain to read. It's better if you just skip the beginning chunk of the paper and read the appendix; A1 The Basic Protocol is simpler to understand.
There’s also the side-by-side Paxos/Raft comparison in Howard & Mortier’s “Consensus on consensus”[1] paper, which is not enough to understand either by itself, but a great help if you have a longer explanation you’re going through.
[1] https://dl.acm.org/doi/10.1145/3380787.3393681
Both the Raft algorithm and its explanation are excellent, including this little animated demo that Diego Ongaro (who is also a great guy) made to help explain his invention. While Paxos was first and is still popular, I am not sure I would count it against a senior engineer to be unable to explain it to others. With Raft, one ought to be able to do it. Great to see this on HN.
I just want to clarify that, strictly speaking, Paxos is simpler than Raft. Paxos is essentially Raft minus two things. First, delete the requirement that a new leader be fully caught up (a requirement that causes extra weirdness in Raft when a leader gets a new log entry accepted by a majority but fails to commit it due to a network partition). Second, delete the part where an election chooses the leader: you can add it back, and that's called Multi-Paxos, but if your use case isn't big enough you can instead just run consensus on each log entry directly. The decided value doesn't have to take the form "this is our new leader, all hail" (though in practice that is what everyone does).
I think the Raft paper is more approachable for newcomers, but if you are finding Paxos hard to explain and Raft easy to explain, just use the Raft lingo to explain Paxos.
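For reference, the "fully caught-up" requirement mentioned above is Raft's election restriction (section 5.4.1 of the paper): a voter grants its vote only if the candidate's log is at least as up-to-date as its own, comparing the terms of the last entries first and log lengths second. Single-decree Paxos has no such check; a proposer instead learns any previously accepted values during the prepare phase. A sketch of that one comparison (function and parameter names are mine):

```python
def candidate_log_ok(voter_last_term, voter_last_index,
                     cand_last_term, cand_last_index):
    """Raft's election restriction: grant a vote only if the candidate's
    log is at least as up-to-date as the voter's. Later last term wins;
    on a tie, the longer log wins."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index
```

This single guard is what guarantees a Raft leader already holds every committed entry, so it never needs to fetch missing entries from followers the way a new Multi-Paxos leader does.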
Worth noting, not quite "everyone" does this. Cassandra uses "leaderless" (single decree) paxos, which has some advantages and some disadvantages (for instance, 1RT WAN reads from any region).
I agree with you that Paxos is simpler than Raft. The problem with Paxos IMO is that Lamport's original paper is impenetrable; lots of later writing is easier to understand, including those that describe more complex protocols. The intuitions are actually pretty straightforward, and transfer to all of the extensions to Paxos (which are not as straightforwardly compatible with Raft).
Raft may have helped more people get comfortable with distributed consensus, and sped its adoption, but being a sort of dangling branch of the tech tree I wonder if this may have stalled progress beyond it.
Interesting way to think about it. I am not sure I quite agree, but good points and surrounding discussion for sure.
It is darkly amusing to me that Paxos is still cited as a "good algorithm" to choose, and as understandable (complete with the cutesy island metaphor from the paper to appear approachable), when the Raft paper opens with a paragraph more or less saying "we had a bunch of people try to understand Paxos and none of them could figure it out". Anytime I see someone recommend Paxos, I silently judge whether they've actually read the paper or are just parroting advice from when it was the only game in town.
Do you ever bother to ask whether they are recommending it for a use case other than a replicated log?
This YouTube video by the Raft creators (linked at the bottom of the page) gives a solid explanation of Raft internals - https://m.youtube.com/watch?v=YbZ3zDzDnrw
I highly recommend that anyone interested in high availability read the Raft specification.
Related. Others?
Raft: Understandable Distributed Consensus (2014) - https://news.ycombinator.com/item?id=41669850 - Sept 2024 (87 comments)
The Raft Consensus Algorithm (2015) - https://news.ycombinator.com/item?id=37369826 - Sept 2023 (76 comments)
Implementing a distributed key-value store on top of implementing Raft in Go - https://news.ycombinator.com/item?id=36070426 - May 2023 (79 comments)
Strong Consistency with Raft and SQLite - https://news.ycombinator.com/item?id=35246228 - March 2023 (42 comments)
Raft Is So Fetch: The Raft Consensus Algorithm Explained Through Mean Girls - https://news.ycombinator.com/item?id=33071069 - Oct 2022 (53 comments)
Raft Consensus Animated (2014) - https://news.ycombinator.com/item?id=32484584 - Aug 2022 (67 comments)
Why use Paxos instead of Raft? - https://news.ycombinator.com/item?id=32467962 - Aug 2022 (45 comments)
In Search of an Understandable Consensus Algorithm (2014) [pdf] - https://news.ycombinator.com/item?id=29837995 - Jan 2022 (12 comments)
Raft Consensus Protocol - https://news.ycombinator.com/item?id=29079079 - Nov 2021 (51 comments)
Paxos vs. Raft: Have we reached consensus on distributed consensus? - https://news.ycombinator.com/item?id=27831576 - July 2021 (48 comments)
In Search of an Understandable Consensus Algorithm (2014) [pdf] - https://news.ycombinator.com/item?id=23113419 - May 2020 (26 comments)
Paxos vs. Raft: Have we reached consensus on distributed consensus? - https://news.ycombinator.com/item?id=22994420 - April 2020 (65 comments)
Raft Is So Fetch: The Raft Consensus Algorithm Explained Through Mean Girls - https://news.ycombinator.com/item?id=22520040 - March 2020 (4 comments)
In Search of an Understandable Consensus Algorithm [pdf] - https://news.ycombinator.com/item?id=14724883 - July 2017 (14 comments)
Raft Consensus Algorithm - https://news.ycombinator.com/item?id=9613493 - May 2015 (24 comments)
The Raft Consensus Algorithm - https://news.ycombinator.com/item?id=8527440 - Oct 2014 (27 comments)
Raft: Understandable Distributed Consensus - https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79 comments)
For related reading, I would list two papers. If you really want to understand these algorithms, start with these two:
1) https://lamport.azurewebsites.net/pubs/pubs.html#time-clocks
Lamport's 1978 "Time, Clocks, and the Ordering of Events in a Distributed System". The Raft dissertation never really explains why using a logical clock is the critical, central thing. It just glosses over it.
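For a flavor of what that paper introduces: a Lamport clock is just a counter that ticks on every local event and jumps to max(local, received) + 1 on message receipt, producing timestamps that respect causality. A minimal sketch (my own naming, not the paper's notation):

```python
class LamportClock:
    """Logical clock from Lamport (1978): if event a causally precedes
    event b, then a's timestamp is strictly less than b's."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Sending counts as an event; attach the timestamp to the message.
        return self.tick()

    def receive(self, msg_time):
        # Merge rule: jump past both our clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

Raft's term numbers follow the same merge rule (a node seeing a higher term adopts it), which is the connection the dissertation glosses over.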
2) https://www.microsoft.com/en-us/research/wp-content/uploads/...
Lampson's 1996 "How to Build a Highly Available System Using Consensus". It will teach you how to prove the equivalence of your variation back to an established, proven model. It is very clearly written.
not again!
Hey... it is a very useful algorithm for looking smart in design interviews.
And etcd does use it.
So it is a democratic vote and they are all on the same raft, is that the word play they were going for?
You take a raft to get off the island of Paxos.
What is used at Google?
I heard on good authority that one of the earliest implementations of consensus at Google involved Sanjay Ghemawat tangentially. Sanjay's PhD on Harp, with Barbara Liskov, had also been based on Viewstamped Replication, and so, internally, this became the first flavor at Google. Later, it would come to be referred to externally as Paxos, only because that was more well known at the time.
Back in my day it was all Paxos (Chubby, Borgmaster, ...).
Probably both?
Paxos is used in Spanner.