These terms measure a system's ability to recover from failure and remain operational.
Recovery Point Objective (RPO): The maximum acceptable amount of data (measured in time) that can be lost after a disruption.
Example: An RPO of 1 hour means you must be able to recover data up to a state that is no older than one hour before the failure occurred. This dictates backup and replication frequency.
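As a quick sketch of how an RPO translates into operations (the 1-hour figure from the example above is kept; the function name and timestamps are invented), a monitoring job might flag when the most recent backup no longer satisfies the objective:

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=1)  # maximum tolerable data loss, per the example above

def rpo_satisfied(last_backup_at: datetime, now: datetime) -> bool:
    # Anything written after the last backup would be lost if a failure hit now,
    # so the age of the newest backup must stay within the RPO.
    return now - last_backup_at <= RPO

print(rpo_satisfied(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 45)))  # True
print(rpo_satisfied(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 13, 30)))  # False
```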
Reliability: The probability that a system or component will perform its required functions under stated conditions for a specified period of time.
In Practice: Often measured by MTBF (Mean Time Between Failures) or uptime percentages (e.g., "four nines" is 99.99% availability).
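A small worked example shows how these numbers relate (the MTBF and MTTR figures are made up for illustration):

```python
# Availability from MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair).
mtbf_hours = 5000.0   # average time between failures (illustrative)
mttr_hours = 0.4      # average time to restore service (illustrative)

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Availability: {availability:.5%}")   # ~99.992%

# Downtime budget implied by "four nines" over a year:
minutes_per_year = 365 * 24 * 60
print(f"99.99% allows ~{minutes_per_year * 0.0001:.1f} minutes of downtime per year")  # ~52.6 min
```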
Replication: The process of copying data across multiple servers or data stores to ensure consistency and improve availability and fault tolerance.
Types: Replication can be synchronous (the write is acknowledged only after all replicas have applied it) or asynchronous (the write is acknowledged immediately and replicas are updated afterward).
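A toy sketch of the difference, using in-memory dicts as stand-ins for replica servers (this is not a real replication protocol, just the shape of the trade-off):

```python
import threading

class ReplicatedStore:
    """Toy sketch contrasting synchronous and asynchronous replication."""

    def __init__(self, replicas):
        self.primary = {}
        self.replicas = replicas  # dicts standing in for remote replica servers

    def write_sync(self, key, value):
        # Synchronous: acknowledge only after every replica has applied the write.
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value  # stands in for a blocking network call
        return "ack"

    def write_async(self, key, value):
        # Asynchronous: acknowledge immediately and let replicas catch up later,
        # so a failure in between can lose the not-yet-replicated write.
        self.primary[key] = value
        def replicate():
            for replica in self.replicas:
                replica[key] = value
        threading.Thread(target=replicate, daemon=True).start()
        return "ack"
```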
These concepts focus on handling growth, performance, and distributing data loads.
Scalability: A measure of a system's ability to handle an increasing workload or growing amount of data.
Types: Vertical Scaling (adding more resources, like CPU or RAM, to a single server) and Horizontal Scaling (adding more servers to the resource pool).
Partitioning: The act of dividing a single logical database or index into distinct, independent parts (partitions).
Goal: To manage large volumes of data by spreading the load across multiple physical machines.
Sharding: A specific type of horizontal partitioning in which data is distributed across independent databases (shards) based on a sharding key (e.g., user ID or geographic region).
Benefit: It allows the system to scale beyond the capacity limits of a single database server.
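A minimal hash-based sharding sketch, assuming a user ID is the sharding key and a fixed shard count (both are illustrative choices):

```python
import hashlib

NUM_SHARDS = 4  # illustrative; chosen to match capacity planning in practice

def shard_for(user_id: str) -> int:
    # Hash the sharding key and map it to one of the independent databases.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user-1042"))  # every lookup for this user routes to the same shard
```

Hash-based routing spreads load evenly; range-based routing (e.g., by geographic region) keeps related rows together but can create hot shards.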
These concepts guide how systems are designed, structured, and assessed.
CAP Theorem: A fundamental principle in distributed computing stating that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:
Consistency (C): Every read receives the most recent write or an error.
Availability (A): Every request receives a non-error response, without guaranteeing it is the latest write.
Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
Architectural Choice: Because network partitions cannot be ruled out, P is effectively mandatory, so most systems must choose between C and A when a partition occurs.
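A toy sketch of that choice (the class and method names are invented): a CP-style read refuses to answer while the node is cut off from its peers, whereas an AP-style read answers with possibly stale data:

```python
class Node:
    """Toy node illustrating the C-vs-A trade-off during a network partition."""

    def __init__(self):
        self.value = None
        self.partitioned = False  # True when this node cannot reach the other replicas

    def read_cp(self):
        # CP choice: return an error rather than risk serving stale data.
        if self.partitioned:
            raise RuntimeError("unavailable: cannot confirm latest write")
        return self.value

    def read_ap(self):
        # AP choice: always answer, even though the value may not reflect the latest write.
        return self.value
```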
Abstraction: The process of hiding complex implementation details and exposing only the essential information to the user or caller.
In Architecture: Defining clean interfaces (APIs) for services so consuming systems don't need to know the underlying technology or complexity.
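For instance, a hypothetical payment interface (all names here are invented) lets consumers depend on the contract rather than on any concrete provider or protocol:

```python
from abc import ABC, abstractmethod

class PaymentGateway(ABC):
    """Hypothetical interface: consumers depend only on this contract."""

    @abstractmethod
    def charge(self, amount_cents: int, customer_id: str) -> str:
        """Charge a customer and return a transaction ID."""

class MockGateway(PaymentGateway):
    def charge(self, amount_cents: int, customer_id: str) -> str:
        # Provider-specific detail (HTTP calls, auth, retries) stays hidden here.
        return f"txn-{customer_id}-{amount_cents}"

def checkout(gateway: PaymentGateway) -> str:
    # The caller never needs to know which implementation it is using.
    return gateway.charge(4999, "cust-42")

print(checkout(MockGateway()))  # txn-cust-42-4999
```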
Complexity: A measure of the difficulty of understanding, maintaining, and testing a piece of code.
Metrics: Often assessed using quantitative measures like Cyclomatic Complexity, which counts the number of linearly independent paths through a program's source code.
Architectural Impact: High complexity in core modules leads to higher risk and maintenance costs.
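As an illustration of how the count works (the function itself is invented), the routine below has a cyclomatic complexity of 4: one base path plus three decision points:

```python
def shipping_cost(weight_kg: float, express: bool, international: bool) -> float:
    # Cyclomatic complexity = 4: the two `if`s and the ternary each add a path.
    cost = 5.0
    if weight_kg > 10:          # decision 1
        cost += 7.5
    if international:           # decision 2
        cost *= 2
    return cost * 1.5 if express else cost   # decision 3
```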