High Availability and Scaling

Cluster options

Single-node cluster: A single-node cluster has only one node, called the primary node. This node accepts customer connections and performs read/write operations. It is a single point of truth as well as a single point of failure.

Multi-node cluster: A multi-node cluster consists of a primary node and multiple standby nodes for maximum resilience. If the primary node fails, the system promotes a standby node to the primary role. Currently, standby nodes operate in warm standby mode and do not serve read requests. The roadmap includes hot standby functionality, which will let standby nodes serve read requests as active replicas.

Database scaling

You can scale existing clusters in two ways:

  • Horizontal scaling: Lets you configure the number of instances that run in parallel.

    • The number of nodes in a cluster can be increased or decreased.

    • Increasing the number of instances does not cause disruption. However, decreasing the number of instances may trigger a switchover if the operation removes the current primary node.

Note: Horizontal scaling provides high availability; it does not increase performance.

  • Vertical scaling: Lets you configure the size of individual instances to handle more data and queries.

    • You can adjust the number of CPU cores and the amount of memory to match your requirements. Each instance runs on a dedicated node. When you scale up or down, the system creates a new node for each instance.

    • Once the new node becomes available, the system switches the instance from the old node to the new node and then removes the old node. If the cluster contains multiple nodes, the system performs this process sequentially. It always replaces the standby nodes first, then the primary node, resulting in only one switchover.

    • When the system performs the switch, it terminates any application connections to the database. The system also aborts all ongoing queries, which causes some disruption. For this reason, you should perform scaling operations outside of peak hours.

    • You can also increase the storage size. However, you cannot decrease the storage size or change the storage type. The system performs storage increases on the fly without disruption.
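
The scaling rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the provider's implementation; the `Node` type, both function names, and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    is_primary: bool


def removal_triggers_switchover(nodes, removed_names):
    """Horizontal scale-in: a switchover is needed only if the primary is removed."""
    return any(n.is_primary and n.name in removed_names for n in nodes)


def vertical_scaling_order(nodes):
    """Vertical scaling: replace standbys first and the primary last.

    Because the primary is replaced last, the primary role moves only once.
    """
    primaries = [n for n in nodes if n.is_primary]
    if len(primaries) != 1:
        raise ValueError("a cluster must have exactly one primary node")
    standbys = [n for n in nodes if not n.is_primary]
    return [n.name for n in standbys] + [primaries[0].name]


# Hypothetical three-node cluster.
cluster = [Node("node-a", True), Node("node-b", False), Node("node-c", False)]

order = vertical_scaling_order(cluster)
print("replacement order:", order)
print("removing node-c needs switchover:", removal_triggers_switchover(cluster, {"node-c"}))
print("removing node-a needs switchover:", removal_triggers_switchover(cluster, {"node-a"}))
```

The single switchover during vertical scaling happens at the last step, which is also when application connections are terminated.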

Replication modes

The synchronization_mode determines how transactions are replicated between multiple nodes before a transaction is confirmed to the client. IONOS Cloud DBaaS supports two modes of replication:

  • Asynchronous (default)

  • Strictly Synchronous

In either mode, the transaction is first committed on the primary and then replicated to the standby node(s).

Asynchronous replication

Asynchronous replication does not wait for the standby before confirming a transaction back to the client. Transactions are confirmed after being written to disk on the primary node, and replication takes place in the background. In asynchronous mode, the cluster is allowed to lose some committed (not yet replicated) transactions during a failover to ensure availability.

The benefit of asynchronous replication is lower latency. The downside is that recent transactions might be lost if a standby is promoted to primary. The lag between the primary and a standby tends to be a few milliseconds.

Strictly Synchronous replication

Strictly synchronous replication is the same as synchronous replication, except that standalone mode is not permitted. This mode prevents PostgreSQL from switching off synchronous replication on the primary when no synchronous standby candidates are available. If no standby is available, the cluster accepts no further writes, so this mode sacrifices availability for replicated durability.

If the replication mode is set to synchronous (either strict or non-strict), data loss cannot occur during failovers, such as those caused by node failures. The additional benefit of strict replication is that data is not lost if the storage of the primary node fails while all standby nodes are also unavailable.

Synchronous replication

Synchronous replication ensures that a transaction is committed to at least one standby before it is confirmed to the client. This standby is known as the synchronous standby. If the primary node fails, only a synchronous standby can take over as primary, which ensures that committed transactions are not lost during a failover. If the synchronous standby fails and another standby is available, that standby takes over the synchronous standby role. If no standby is available, the primary can continue in standalone mode. In standalone mode, the primary role cannot change until at least one standby has caught up (regained the role of synchronous standby). Latency is generally higher than with asynchronous replication, but no data is lost during a failover.

At any time there will be at most one synchronous standby. If the synchronous standby fails then another healthy standby is automatically selected as the synchronous standby.
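
The commit behavior of the three modes described in this section can be summarized in a small model. This is a simplified sketch of the prose above, not the actual cluster logic; the `Mode` and `Commit` enums and the `commit_outcome` function are hypothetical names.

```python
from enum import Enum


class Mode(Enum):
    ASYNCHRONOUS = "asynchronous"
    SYNCHRONOUS = "synchronous"
    STRICTLY_SYNCHRONOUS = "strictly synchronous"


class Commit(Enum):
    AFTER_PRIMARY_WRITE = "confirmed after write on primary"
    AFTER_STANDBY = "confirmed after synchronous standby has it"
    NOT_ACCEPTED = "write not accepted"


def commit_outcome(mode, available_standbys):
    """How a commit is handled for a given mode and number of usable standbys."""
    if mode is Mode.ASYNCHRONOUS:
        # Never waits for a standby; replication happens in the background.
        return Commit.AFTER_PRIMARY_WRITE
    if available_standbys >= 1:
        # Both synchronous modes wait for the synchronous standby.
        return Commit.AFTER_STANDBY
    if mode is Mode.SYNCHRONOUS:
        # Non-strict mode falls back to standalone operation.
        return Commit.AFTER_PRIMARY_WRITE
    # Strict mode has no standalone fallback.
    return Commit.NOT_ACCEPTED


for mode in Mode:
    for standbys in (1, 0):
        outcome = commit_outcome(mode, standbys)
        print(f"{mode.value:22} standbys={standbys}: {outcome.value}")
```

The only case in which the modes differ once all standbys are gone is the last branch: synchronous mode keeps accepting writes in standalone mode, while strictly synchronous mode stops accepting them.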

Synchronization mode considerations

The synchronization mode can impact DBaaS in several ways:

| Aspect | Asynchronous | Strictly Synchronous |
| --- | --- | --- |
| Primary failure | A healthy standby will be promoted if the primary node becomes unavailable. | Only standby nodes that contain all confirmed transactions can be promoted. |
| Standby failure | No effect on primary. Standby catches up once it is back online. | At least one standby must be available to accept write requests. There is a short delay in transaction processing if the synchronous standby changes. |
| Consistency model | Strongly consistent (except for lost data). | Strongly consistent (except for lost data). |
| Data loss during failover | Non-replicated data is lost. | Not possible. |
| Data loss during primary storage failure | Non-replicated data is lost. | Not possible. |
| Latency | Limited by the performance of the primary. | Limited by the performance of the primary, the strictly synchronous standby, and the latency between them (usually below 1 ms). |

The performance penalty of synchronous over asynchronous replication depends on the workload. The primary handles transactions the same way in all replication modes, except for COMMIT statements (including implicit transactions). When synchronous replication is enabled, the commit can be confirmed to the client only after it is replicated. Thus, there is a constant latency overhead per transaction, independent of its size or duration.
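
Because the extra wait is a fixed cost per commit, its relative impact depends on how long the rest of the transaction takes. The arithmetic below illustrates this with made-up numbers; the 1 ms wait and both workload durations are assumptions chosen for the example, not measurements, and the function names are hypothetical.

```python
def transaction_latency_ms(work_ms, sync_wait_ms=0.0):
    """Client-visible latency: time spent on statements plus the wait at COMMIT."""
    return work_ms + sync_wait_ms


def relative_overhead(work_ms, sync_wait_ms):
    """Fraction by which the synchronous wait lengthens one transaction."""
    return sync_wait_ms / work_ms


SYNC_WAIT_MS = 1.0  # assumed constant wait for the standby per commit

# Hypothetical workloads: a tiny single-row write and a long batch transaction.
for label, work_ms in [("short transaction", 0.5), ("long transaction", 200.0)]:
    total = transaction_latency_ms(work_ms, SYNC_WAIT_MS)
    overhead = relative_overhead(work_ms, SYNC_WAIT_MS)
    print(f"{label}: {work_ms} ms -> {total} ms ({overhead:.1%} slower)")
```

Under these assumptions, workloads made of many small transactions pay proportionally far more for synchronous replication than workloads that batch their writes into fewer, larger transactions.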

Change the commit guarantees per transaction

By default, the database cluster's replication mode determines the guarantees for a committed transaction. However, some workloads have very different requirements regarding acceptable data loss versus performance. To address this need, commit guarantees can be changed on a per-transaction basis. For more information, refer to the PostgreSQL Documentation.
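
In PostgreSQL, the per-transaction mechanism is the `synchronous_commit` setting, which can be overridden with `SET LOCAL` inside a transaction. The sketch below only builds the SQL text so it can run without a database; the helper name and the `page_views` table are hypothetical, and it assumes the cluster allows sessions to change this setting.

```python
# Values accepted by PostgreSQL's synchronous_commit setting.
VALID_LEVELS = ("on", "off", "local", "remote_write", "remote_apply")


def with_commit_guarantee(statements, level):
    """Wrap statements in a transaction that overrides synchronous_commit.

    SET LOCAL lasts only until the end of the enclosing transaction, so
    later transactions in the same session use the default again.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown synchronous_commit level: {level!r}")
    return [
        "BEGIN",
        f"SET LOCAL synchronous_commit TO {level}",
        *statements,
        "COMMIT",
    ]


# Hypothetical low-value write where losing the latest rows is acceptable.
script = with_commit_guarantee(
    ["INSERT INTO page_views (path) VALUES ('/pricing')"],
    "off",
)
print(";\n".join(script) + ";")
```

A relaxed level such as `off` trades durability of that one transaction for lower commit latency, so it suits data that can be regenerated or lost without harm.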
