Upgrade Node Pools

A node pool upgrade generally happens automatically during weekly maintenance. You can also trigger it manually, for example when upgrading to a higher version of Kubernetes. In either case, the upgrade rebuilds all nodes belonging to the node pool.

During the upgrade, an "old" node in a node pool is replaced by a new node. This may be necessary for several reasons:

  • Software updates: Since the nodes are considered immutable, IONOS Cloud does not install software updates on the running nodes, but replaces them with new ones.

  • Configuration changes: Some configuration changes require replacing all nodes in the node pool.

Considerations:

  • Multiple node pools of the same cluster can be upgraded at the same time.

  • A node pool upgrade locks the affected node pool, and you cannot make any changes until the upgrade is complete.

  • During a node pool upgrade, all of its nodes are replaced one by one, starting with the oldest one.

  • Depending on the number of nodes and your workload, the upgrade can take several hours.

If the upgrade was initiated as part of weekly maintenance, some nodes may be left unreplaced so that the upgrade does not exceed the maintenance window.

Rebuilding a node

Please make sure that you have not exceeded your contract quota for servers. Otherwise, a new node cannot be provisioned to replace an existing one.

The rebuilding process consists of the following steps:

  1. Provision a new node to replace the "old" one and wait for it to register in the control plane.

  2. Exclude the "old" node from scheduling to avoid deploying additional pods to it.

  3. Drain all existing workload from the "old" node.

     • First, IONOS Cloud tries to gracefully drain the node.

       - PodDisruptionBudgets are enforced for up to 1 hour.

       - The pods' termination grace period (terminationGracePeriodSeconds) is respected for up to 1 hour.

     • If the process takes more than 1 hour, all remaining pods are deleted.

  4. Delete the "old" node from the node pool.
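The rebuilding steps can be sketched as a small simulation. This is an illustrative model only, not the IONOS Cloud implementation: the Node class, the rebuild_node_pool function, the "-new" naming scheme, and moving every drained pod straight onto the replacement node are assumptions made for the example. In a real cluster, the scheduler places evicted pods on any schedulable node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical model of a node; not an IONOS Cloud or Kubernetes type.
    name: str
    created_at: int  # smaller value means older node
    schedulable: bool = True
    pods: list = field(default_factory=list)


def rebuild_node_pool(nodes):
    """Replace every node one by one, oldest first.

    Returns the resulting pool and a log of the steps taken.
    """
    pool = list(nodes)
    log = []
    next_timestamp = max((n.created_at for n in pool), default=0) + 1

    for old in sorted(nodes, key=lambda n: n.created_at):
        # Step 1: provision the replacement before touching the old node.
        new = Node(name=f"{old.name}-new", created_at=next_timestamp)
        next_timestamp += 1
        pool.append(new)
        log.append(f"provision {new.name}")

        # Step 2: exclude the old node from scheduling.
        old.schedulable = False
        log.append(f"cordon {old.name}")

        # Step 3: drain the old node (simplified: pods move to the new node).
        new.pods.extend(old.pods)
        old.pods.clear()
        log.append(f"drain {old.name}")

        # Step 4: delete the old node from the pool.
        pool = [n for n in pool if n is not old]
        log.append(f"delete {old.name}")

    return pool, log


if __name__ == "__main__":
    pool, log = rebuild_node_pool([
        Node("node-b", created_at=2, pods=["web-1"]),
        Node("node-a", created_at=1, pods=["db-0"]),
    ])
    print("\n".join(log))
```

Because the replacement is provisioned before the old node is drained, the pool temporarily holds one extra node, which is why the server quota matters.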

Draining nodes

Please consider the following node drain updates and their impact on the maintenance procedure:

Under the current platform setup, a node drain considers PodDisruptionBudgets (PDBs). If evicting a pod would violate an existing PDB, the drain fails. If the drain of a node fails, the attempt to delete that node also fails.

In the past, we observed problems with unprepared workloads or misconfigured PDBs, which often led to failed drains and node deletions and, as a result, to failed node pool maintenance.

To prevent this, the node drain will be split into two stages. In the first stage, the system will continue to try to gracefully evict the pods from the node. If this fails, the second stage will forcefully drain the node by deleting all remaining pods. This deletion will bypass PDB checks, which prevents the node drain from failing.
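The two stages can be illustrated with a minimal sketch. The drain_node function and its arguments are hypothetical names chosen for this example, and the model is deliberately simplified: a real drain retries blocked evictions until the timeout expires, whereas the sketch assumes that a PDB's budget does not recover in the meantime.

```python
def drain_node(pods, budgets):
    """Simulate a two-stage drain of a single node.

    pods:    list of (pod_name, app_label) tuples running on the node
    budgets: dict mapping app_label to the number of evictions its PDB
             currently allows; apps without an entry have no PDB

    Returns (gracefully_evicted, force_deleted) lists of pod names.
    """
    remaining = dict(budgets)  # copy so the caller's dict is not modified
    evicted, blocked = [], []

    # Stage 1: graceful eviction, which respects PDBs.
    for name, app in pods:
        allowed = remaining.get(app)
        if allowed is None:
            evicted.append(name)        # no PDB covers this pod
        elif allowed > 0:
            remaining[app] = allowed - 1
            evicted.append(name)        # eviction fits within the budget
        else:
            blocked.append(name)        # eviction would violate the PDB

    # Stage 2: once the graceful stage has timed out, the remaining pods
    # are deleted without checking PDBs.
    force_deleted = blocked
    return evicted, force_deleted


if __name__ == "__main__":
    pods = [("web-1", "web"), ("web-2", "web"), ("job-1", "batch")]
    print(drain_node(pods, {"web": 1}))
```

In this example, the PDB for "web" allows only one eviction, so the second "web" pod is blocked in the first stage and deleted in the second.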

How does this affect node pool maintenance?

As a result of the two-stage procedure, maintenance will no longer fail because of unprepared workloads or misconfigured PDBs. However, please note that this change may still cause interruptions to workloads that are not prepared for maintenance. During maintenance, nodes are replaced one by one. For each node in a node pool, a new node is created. After that, the old node is drained and then deleted.

Previously, a pod sometimes did not return to READY after being evicted from a node during maintenance. If a PDB was in place for that pod's workload, further evictions were blocked, so maintenance failed and the rest of the workload was left untouched. With the force drain behavior, maintenance proceeds, so all parts of the workload are evicted and may end up in a non-READY state, which can interrupt the workload. To prevent this, please ensure that your workload's pods are prepared for eviction at any time.
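One way to check whether a PDB leaves room for graceful eviction is to compute how many disruptions it currently allows. The sketch below follows the way Kubernetes derives the allowed disruptions from the number of healthy pods, restricted to integer values (percentages are not handled). The function name and parameters are hypothetical and chosen for illustration.

```python
def disruptions_allowed(healthy, expected, min_available=None, max_unavailable=None):
    """Return how many pods may be evicted right now without violating the PDB.

    healthy:  number of pods of the workload that are currently ready
    expected: number of pods the workload is supposed to run
    Exactly one of min_available or max_unavailable must be set (integers only).
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")

    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable

    # A result of 0 means every graceful eviction is blocked.
    return max(0, healthy - desired_healthy)


if __name__ == "__main__":
    # minAvailable equal to the replica count never allows an eviction.
    print(disruptions_allowed(healthy=3, expected=3, min_available=3))
    # One pod of headroom allows one eviction at a time.
    print(disruptions_allowed(healthy=3, expected=3, min_available=2))
```

A workload whose PDB yields 0 here cannot be drained gracefully, so its pods would be removed in the forced stage.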
