Welcome! On this page, I'll be sharing my experience and diving into the key topics of Distributed Systems (DS).
Distributed Systems are the backbone of modern technology. Think of the World Wide Web, online gaming, or your favorite cloud services—they're all complex DS in action. For any serious programmer or architect, understanding DS isn't just helpful—it's critical.
At its core, a system is 'distributed' when its components reside on different machines (each with its own distinct and asynchronous clock), forcing them to communicate solely via message passing. This fundamental nature is what makes building reliable DS both powerful and challenging.
I'll start by defining some basic terms before we deep dive into the subject to describe it in more detail.
To successfully design and work with any Distributed System, we must first operate at an abstract level. Since DS are inherently complex and messy, the best way to abstract them is by using a model—a simplified representation that stands in for the messy 'real thing'.
A model intentionally leaves out most of the system's detail, focusing only on the important attributes and the set of rules that govern how those attributes interact. This is why models are essential tools in computer science and engineering.
A useful model must satisfy two critical criteria:
Accurate: The model is accurate to the extent that analyzing it yields truths about the object of interest. If the model tells you A will happen, A should happen in the real system.
Tractable: The model is tractable if such an analysis is actually possible (i.e., not requiring infinite time or computation).
Essentially, a good model lets us analyze the system's behavior without having to build it first—a crucial step when dealing with the scale and complexity of distributed systems.
Time seems like a simple, fundamental concept. In our day-to-day lives, we use it to establish causality—if event A happened at 1:00 PM and event B at 1:05 PM, we know A occurred before B and could have caused it.
But in a distributed system, this framework collapses. Why?
Asynchronous Communication: Messages take a non-zero, variable amount of time to travel across the network. A message sent now from Machine 1 might arrive at Machine 2 after Machine 2 has already processed several local events.
Clock Drift: Even the most precise physical clocks on separate machines are never perfectly synchronized; they drift independently.
Because of this, we cannot rely on comparing the timestamps generated by different computers to determine the true order of events. There is no single, shared concept of "now." The "global clock" is an illusion.
This leads to the critical question: Do we really need a system-level global clock to define the relationship between events?
The answer, proven by Leslie Lamport, is a resounding No. We only need to know about the causal relationship between events. If event A could not possibly have influenced event B, we don't need to worry about their absolute physical timing.
Lamport formalized this minimal requirement with the "Happened Before" relation (→):
Local Order: If two events, A and B, occur within the same process, and A is executed before B, then A→B.
Message Order: If event A is the sending of a message and event B is the receipt of that same message, then A→B. (The send event caused the receive event).
Transitivity: If A→B and B→C, then A→C.
If neither A→B nor B→A holds, the events are concurrent (A∥B). The system has no way of knowing their true physical order, and crucially, they cannot causally affect each other.
To track this partial ordering, we use a simple counter: the Lamport Logical Clock. This is not a measure of physical time, but a mechanism that ensures our clock values are consistent with the causal → relation.
Each process Pi maintains a local counter Ci, which is updated using two rules:
Internal Event: Before executing any local event, Pi increments its counter: Ci:=Ci+1.
Sending/Receiving:
When Pi sends a message, it includes its current time Ci.
When Pj receives a message with timestamp Cm, it updates its clock to: Cj:=max(Cj,Cm)+1.
By adhering to these rules, the logical clock guarantees a fundamental property: If event A happened before event B (A→B), then C(A)<C(B).
to be continued ...