A node pool upgrade generally happens automatically during weekly maintenance. You can also trigger it manually, for example, when upgrading to a higher version of Kubernetes. In either case, the node pool upgrade rebuilds all nodes belonging to the node pool.
During the upgrade, an old node in a node pool is replaced by a new node. This may be necessary for several reasons:
Software updates: Since the nodes are considered immutable, IONOS Cloud does not install software updates on the running nodes but replaces them with new ones.
Configuration changes: Some configuration changes require replacing all included nodes.
Considerations: Multiple node pools of the same cluster can be upgraded at the same time. A node pool upgrade locks the affected node pool, and you cannot make any changes until the upgrade is complete. During a node pool upgrade, all of its nodes are replaced one by one, starting with the oldest one. Depending on the number of nodes and your workload, the upgrade can take several hours.
If the upgrade is initiated as a part of weekly maintenance, some nodes may not be replaced to avoid exceeding the maintenance window.
Make sure that you have not exceeded your contract quota for servers; otherwise, you will not be able to provision a new node to replace an existing one.
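The quota consideration above can be checked with simple arithmetic. The following is a minimal sketch, assuming that nodes are replaced one at a time so that exactly one additional server exists temporarily. The function name and parameters are hypothetical and not part of any IONOS Cloud API. Other contract limits, such as cores or RAM, may also apply and are not modeled here.

```python
def can_replace_node(server_quota: int, servers_in_use: int, extra_nodes: int = 1) -> bool:
    """Return True if the contract leaves room for the temporary replacement node.

    Assumption: the new node is provisioned before the old one is deleted,
    so the upgrade briefly needs `extra_nodes` servers on top of current usage.
    """
    return servers_in_use + extra_nodes <= server_quota


if __name__ == "__main__":
    # Hypothetical numbers: a quota of 10 servers with 10 already in use
    # leaves no room for the replacement node.
    print(can_replace_node(server_quota=10, servers_in_use=10))
    print(can_replace_node(server_quota=10, servers_in_use=9))
```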
The rebuilding process consists of the following steps:
Provision a new node to replace the old one and wait for it to register in the control plane.
Exclude the old node from scheduling to avoid deploying additional pods to it.
Drain all existing workload from the old node.
At first, IONOS Cloud tries to drain the node gracefully.
- PodDisruptionBudgets (PDBs) are enforced for up to 1 hour. For more information, see Specifying a Disruption Budget for your Application.
- GracefulTerminationPeriod for pods (the terminationGracePeriodSeconds setting of a pod) is respected for up to 1 hour. For more information, see Termination of Pods.
If the process takes more than 1 hour, all remaining pods are deleted.
Delete the old node from the node pool.
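The order of operations described above can be sketched as a small planning function. This is an illustrative model only, not the platform's implementation: the Node class, the STEPS tuple, and plan_rebuild are hypothetical names, and the sketch does not model the maintenance window or the 1-hour drain limit.

```python
from dataclasses import dataclass
from datetime import datetime

# The four per-node steps, in the order listed above.
STEPS = (
    "provision replacement and wait for registration",
    "exclude old node from scheduling",
    "drain old node",
    "delete old node",
)


@dataclass
class Node:
    name: str
    created_at: datetime


def plan_rebuild(nodes):
    """Return (node name, step) pairs for a one-by-one rebuild, oldest node first."""
    plan = []
    for old_node in sorted(nodes, key=lambda n: n.created_at):
        for step in STEPS:
            plan.append((old_node.name, step))
    return plan


if __name__ == "__main__":
    pool = [
        Node("node-b", datetime(2024, 3, 1)),
        Node("node-a", datetime(2024, 1, 1)),
    ]
    for name, step in plan_rebuild(pool):
        print(f"{name}: {step}")
```

Because each node runs through all four steps before the next node starts, the node pool temporarily holds one node more than its configured size.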
You need to consider the following node drain updates and their impact on the maintenance procedure:
Under the current platform setup, a node drain considers PDBs. If evicting a pod would violate an existing PDB, the drain fails. If the drain of a node fails, the attempt to delete this node also fails.
Unprepared workloads or misconfigured PDBs can lead to failing drains and node deletions and, as a result, to failed node pool maintenance. To prevent this issue, the node drain will be split into two stages. In the first stage, the system will continue to try to gracefully evict the pods from the node. If this fails, the second stage will forcefully drain the node by deleting all remaining pods. This deletion will bypass PDB checks, which prevents the drain from failing.
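The two stages can be illustrated with a simplified model. This is a sketch under stated assumptions, not the actual drain logic: two_stage_drain and its arguments are hypothetical, the graceful stage is modeled as a single pass instead of retries for up to 1 hour, and PDB budgets are not replenished when replacement pods become ready.

```python
def two_stage_drain(pods, allowed_disruptions):
    """Simulate a two-stage drain.

    pods: list of (pod_name, app) tuples running on the node.
    allowed_disruptions: dict mapping app -> evictions its PDB still permits.
        Apps without an entry are treated as having no PDB.
    Returns (gracefully_evicted, force_deleted) as lists of pod names.
    """
    budget = dict(allowed_disruptions)  # copy so the caller's dict is not modified
    evicted, remaining = [], []

    # Stage 1: graceful eviction, which respects PDBs.
    for name, app in pods:
        if app not in budget:
            evicted.append(name)       # no PDB: eviction always allowed
        elif budget[app] > 0:
            budget[app] -= 1           # eviction consumes one allowed disruption
            evicted.append(name)
        else:
            remaining.append(name)     # eviction would violate the PDB

    # Stage 2: forced drain, which deletes whatever is left and bypasses PDBs.
    force_deleted = list(remaining)
    return evicted, force_deleted


if __name__ == "__main__":
    node_pods = [("web-1", "web"), ("db-1", "db"), ("db-2", "db")]
    print(two_stage_drain(node_pods, {"db": 1}))
```

In this model, a pod is force-deleted only when its PDB has no budget left, which corresponds to the workloads the text calls unprepared or misconfigured.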
As a result of the two-stage procedure, the process will stop failing due to unprepared workloads or misconfigured PDBs. However, this change may cause interruptions to workloads that are not prepared for maintenance. During maintenance, nodes are replaced one by one. For each node in a node pool, a new node is created. After that, the old node is drained and then deleted.
At times, a pod is not able to return to its READY state after being evicted from a node during maintenance. If a PDB is in place for the pod's workload, the current behavior leads to failed maintenance, and the rest of the workload is left untouched. With the force drain behavior, the maintenance process will proceed, and all parts of the workload will be evicted and potentially end up in a non-READY state. This might lead to an interruption of the workload. To prevent this, make sure that your workload pods are prepared for eviction at any time.
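One way to check whether a workload is prepared for eviction is to compute how many disruptions its PDB permits. The sketch below assumes a PDB that uses an absolute minAvailable value. The function name is hypothetical, and percentage values or maxUnavailable are not handled.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Return how many pods may be evicted without violating the PDB.

    Assumption: the PDB specifies minAvailable as an absolute number.
    The result is never negative.
    """
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # A single-replica workload with minAvailable 1 can never be evicted
    # gracefully, so it would only leave the node in the forced stage.
    print(allowed_disruptions(healthy_pods=1, min_available=1))
    # Three replicas with minAvailable 2 tolerate one eviction at a time.
    print(allowed_disruptions(healthy_pods=3, min_available=2))
```

A result of 0 with all replicas healthy indicates a PDB that blocks every graceful eviction. Running more replicas than minAvailable requires leaves room for nodes to be drained one by one.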