distributed-systems · python · raft · consensus · fault-tolerance

Implementing Raft Consensus from Scratch in Python

A Python implementation of the Raft consensus algorithm — leader election with randomized timeouts, log replication via AppendEntry RPC, and a majority-quorum commit protocol that tolerates network partitions and replica crashes.


Raft is a consensus algorithm designed to be understandable. Its paper — "In Search of an Understandable Consensus Algorithm" (Ongaro & Ousterhout, USENIX ATC 2014) — lays out every state transition explicitly. Even so, implementing it takes longer than expected.

Source: github.com/YuqiaoSu/Raft-Distributed-Key-Value-Database

What Raft Does

A Raft cluster is a group of replicas that agree on an ordered log of commands. Any replica can fail and rejoin; as long as a majority (floor(N/2) + 1) are alive, the cluster makes progress. The log is the database — applying committed entries in order produces the key-value state.

Figure 1 from the Raft paper — replicated state machine architecture. Each server holds a log; the consensus module keeps logs in sync; the state machine applies committed entries.

Three roles:

  • Follower — passive, processes RPCs from leader and candidates
  • Candidate — running for election; needs majority votes to win
  • Leader — handles all client writes, replicates to followers
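A minimal sketch of the per-replica state behind those roles; the field names match the snippets below, though the layout in the linked repository may differ:

from dataclasses import dataclass

@dataclass
class LogEntry:
    term: int      # term in which the entry was created
    command: str   # opaque client command, e.g. "SET x 3"

class Replica:
    def __init__(self, node_id, peers):
        self.id = node_id
        self.peers = peers          # IDs of the other replicas
        self.state = "Follower"     # "Follower", "Candidate", or "Leader"
        self.current_term = 0       # latest term this replica has seen
        self.voted_for = None       # who we voted for in current_term, if anyone
        self.log = []               # list of LogEntry
        self.commit_index = -1      # highest log index known to be committed
        self.next_index = {}        # leader only: next log index to send each peer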

Leader Election

Each replica runs an election timer, reset whenever it hears from a valid leader:

self.election_timeout = random.randint(150, 300) * 0.01  # 1.5–3.0 s, randomized per replica

The randomized range is the key insight. If all nodes had the same timeout, they'd all become candidates simultaneously and split votes forever. Randomization means one node almost always fires first, wins quickly, and suppresses the others.
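One way to drive this is a background thread that checks how long it has been since the last valid leader contact. Here last_heartbeat is an assumed timestamp that reset_election_timer would refresh; the repository may structure this differently:

import time

def run_election_timer(self):
    # Followers and candidates start an election whenever election_timeout
    # elapses without hearing from a valid leader; leaders never do.
    while True:
        time.sleep(0.01)
        if self.state == "Leader":
            continue
        if time.monotonic() - self.last_heartbeat > self.election_timeout:
            self.start_election()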

When the timer fires with no leader heartbeat:

def start_election(self):
    self.state = "Candidate"
    self.current_term += 1            # new term for this election
    self.voted_for = self.id          # vote for ourselves
    self.votes_received = {self.id}
    self.reset_election_timer()       # restart the timer in case this election stalls
    self.broadcast_request_vote()     # ask every peer for its vote

A candidate sends RequestVote RPCs to all peers. Each peer grants at most one vote per term (first-come-first-served, with a log-freshness check). When a candidate collects votes from a majority:

if len(self.votes_received) > len(self.peers) // 2:
    self.become_leader()
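On the receiving side, a vote handler might look roughly like the sketch below. It enforces one vote per term and the log-freshness rule (compare last log terms, then last log indices); the request and reply field names are assumptions for illustration, not the repository's actual RPC format:

from dataclasses import dataclass

@dataclass
class VoteReply:
    term: int
    granted: bool

def handle_request_vote(self, req):
    # A newer term forces us back to follower and clears any stale vote.
    if req.term > self.current_term:
        self.current_term = req.term
        self.state = "Follower"
        self.voted_for = None

    # Reject candidates from stale terms outright.
    if req.term < self.current_term:
        return VoteReply(term=self.current_term, granted=False)

    # Log-freshness check: the candidate's last entry must be at least as
    # up to date as ours (compare terms first, then indices).
    our_last_term = self.log[-1].term if self.log else 0
    candidate_fresh = (req.last_log_term, req.last_log_index) >= (our_last_term, len(self.log) - 1)

    # Grant at most one vote per term, first come first served.
    if self.voted_for in (None, req.candidate_id) and candidate_fresh:
        self.voted_for = req.candidate_id
        self.reset_election_timer()
        return VoteReply(term=self.current_term, granted=True)
    return VoteReply(term=self.current_term, granted=False)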

Heartbeats and Log Replication

The leader sends heartbeats every 250ms to suppress follower elections:

self.heartbeat_timeout = 0.25  # seconds

Each AppendEntry RPC carries a consistency check — prevLogIndex and prevLogTerm. A follower only accepts the new entry if its log matches at that position.
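A follower-side sketch of that check, with the RPC field names again assumed: reject a stale term, reject if the log has no matching entry at prevLogIndex, otherwise drop any conflicting suffix and append.

from dataclasses import dataclass

@dataclass
class AppendReply:
    term: int
    success: bool

def handle_append_entry(self, req):
    # Reject a stale leader (older term) so it steps down.
    if req.term < self.current_term:
        return AppendReply(term=self.current_term, success=False)

    # A message from a valid leader resets our election timer.
    self.state = "Follower"
    self.current_term = req.term
    self.reset_election_timer()

    # Consistency check: our log must contain an entry at prev_log_index
    # with the same term, otherwise the leader has to back up and retry.
    if req.prev_log_index >= 0:
        if (req.prev_log_index >= len(self.log)
                or self.log[req.prev_log_index].term != req.prev_log_term):
            return AppendReply(term=self.current_term, success=False)

    # Drop any conflicting suffix and append the leader's entries.
    self.log = self.log[:req.prev_log_index + 1] + list(req.entries)

    # Advance our commit index as far as the leader says is safe.
    if req.leader_commit > self.commit_index:
        self.commit_index = min(req.leader_commit, len(self.log) - 1)
        self.apply_committed()
    return AppendReply(term=self.current_term, success=True)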

Figure 6 from the Raft paper — log structure. Each entry stores a term number and command. The leader's log and follower replicas are shown. Committed entries (applied to state machine) are shaded.

Client writes go through the leader, which appends to its log and replicates:

def append_entry(self, command):
    entry = LogEntry(term=self.current_term, command=command)
    self.log.append(entry)
    # Remember which log index belongs to this client request so the
    # leader can reply once the entry commits.
    self.pending_requests[len(self.log) - 1] = command
    self.replicate_to_followers()
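replicate_to_followers is where prevLogIndex and prevLogTerm come from: for each peer the leader consults next_index and ships everything from that point on. A sketch, with send_append_entry standing in for the actual RPC transport:

def replicate_to_followers(self):
    # next_index[peer] starts at the leader's log length when it wins the
    # election and backs up whenever a follower rejects the RPC.
    for peer in self.peers:
        next_idx = self.next_index[peer]
        prev_index = next_idx - 1
        prev_term = self.log[prev_index].term if prev_index >= 0 else 0
        self.send_append_entry(
            peer,
            term=self.current_term,
            prev_log_index=prev_index,
            prev_log_term=prev_term,
            entries=self.log[next_idx:],      # everything the peer is missing
            leader_commit=self.commit_index,
        )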

Commit and Majority Quorum

An entry is committed when the leader has acknowledgment from a majority of replicas:

def check_commit(self):
    for index in sorted(self.pending_requests.keys()):
        # Count how many peers are known to have replicated this entry:
        # a peer's next_index past it means the entry was acknowledged.
        acks = sum(
            1 for peer_next in self.next_index.values()
            if peer_next > index
        )
        if acks > len(self.peers) // 2:
            self.commit_index = index
            self.apply_committed()

Only then does the leader reply to the client.
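A sketch of what apply_committed might do once commit_index advances: apply entries to the key-value store in order, then answer any waiting clients. The last_applied, store, and reply_to_client names are assumed for this illustration:

def apply_committed(self):
    # last_applied tracks how far the state machine has advanced; store is
    # the in-memory key-value dict. Both names are assumed for this sketch.
    while self.last_applied < self.commit_index:
        self.last_applied += 1
        entry = self.log[self.last_applied]
        op, key, *rest = entry.command.split()
        if op == "SET":
            self.store[key] = rest[0]
        # Leader only: answer the client whose write produced this entry.
        if self.last_applied in self.pending_requests:
            self.reply_to_client(self.last_applied, success=True)
            del self.pending_requests[self.last_applied]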

Figure 8 from the Raft paper — a time sequence showing why a leader cannot safely commit an entry from a previous term by counting replicas alone. S1 leads in (a)–(b), crashes, and S5 in (c)–(d) shows the dangerous case where term-2 entries get overwritten.

This is what makes Raft safe: a committed entry has been written to a majority of logs. Any future leader must have won election from a majority containing at least one node with that entry, guaranteeing the entry appears in every future leader's log.

Handling Log Divergence

When a leader crashes and a new one is elected, follower logs may diverge:

Figure 7 from the Raft paper — ways in which followers' logs can differ from a new leader. Cases (a)–(b) show missing entries; (c)–(d) show extra uncommitted entries from prior terms.

The new leader fixes this by sending AppendEntry RPCs that overwrite conflicting entries. It walks each follower's log backwards until finding a matching position, then replays forward. Entries that were never committed (never acknowledged to clients) are silently overwritten — no data loss from the client's perspective.
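In code, that backward walk is just decrementing next_index each time a follower rejects the RPC and retrying; a leader-side sketch, where the last_sent_index parameter is assumed for illustration:

def handle_append_reply(self, peer, reply, last_sent_index):
    if reply.success:
        # The follower's log now matches ours through last_sent_index.
        self.next_index[peer] = last_sent_index + 1
        self.check_commit()
    else:
        # Consistency check failed: back up one entry and retry from there.
        self.next_index[peer] = max(0, self.next_index[peer] - 1)
        self.replicate_to_followers()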

Implementation Notes

The full implementation is ~600 lines of Python. The trickiest part is the threading model. The election timer, heartbeat sender, and RPC handlers all run concurrently and share state (current_term, state, log, commit_index). Without careful locking, a timer firing mid-RPC produces subtle races.

The solution: a single lock guards all state mutations. This serializes all state transitions and makes the algorithm's invariants easy to verify.
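Concretely, every entry point (RPC handler, timer thread, client request) acquires the same threading.Lock before touching shared fields. A minimal sketch of the pattern:

import threading
import time

class Replica:
    def __init__(self):
        self.lock = threading.Lock()   # one lock guards all shared Raft state
        self.state = "Follower"
        self.current_term = 0

    def handle_request_vote(self, req):
        with self.lock:                # every RPC handler serializes on the lock
            ...                        # vote-granting logic from earlier

    def run_election_timer(self):
        while True:
            time.sleep(0.01)
            with self.lock:            # timer-driven transitions take it too
                ...                    # start_election() if the timeout elapsed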

What the Paper Gets Right

Raft is easier to implement correctly than Paxos precisely because it decomposes the problem. Leader election is fully separate from log replication. Term numbers give you a total ordering on time. The paper's Figure 2 is a complete specification — every state variable, every invariant, every condition under which each RPC is sent or rejected.

Reading Figure 2 carefully before writing a line of code is the right approach. Every bug I hit traced back to a condition I'd skimmed over on first read.

Algorithm Reference

Figure 2 from the Raft paper — complete summary of the Raft consensus algorithm. Covers persistent and volatile server state, AppendEntries RPC, RequestVote RPC, and the rules each server role must follow.
