By default manager nodes also act as worker nodes. This means the scheduler
can assign tasks to a manager node. For small and non-critical swarms,
assigning tasks to managers is relatively low-risk as long as you schedule
services using resource constraints for CPU and memory.

However, because manager nodes use the Raft consensus algorithm to replicate data
in a consistent way, they are sensitive to resource starvation. You should
isolate managers in your swarm from processes that might block swarm
operations such as swarm heartbeats or leader elections.

To avoid interference with manager node operation, you can drain manager nodes
to make them unavailable as worker nodes:

```console
$ docker node update --availability drain <NODE>
```

When you drain a node, the scheduler reassigns any tasks running on the node to
other available worker nodes in the swarm. It also prevents the scheduler from
assigning tasks to the node.
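
For example, assuming a manager named `manager1` (a hypothetical node name),
you could drain it, and later return it to active availability once it should
accept tasks again:

```console
$ docker node update --availability drain manager1
manager1

$ docker node update --availability active manager1
manager1
```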

## Add worker nodes for load balancing

[Add nodes to the swarm](join-nodes.md) to balance your swarm's
load. Replicated service tasks are distributed across the swarm as evenly as
possible over time, as long as the worker nodes are matched to the requirements
of the services. If you limit a service to run only on specific types of nodes,
such as nodes with a specific number of CPUs or amount of memory, remember that
worker nodes that do not meet these requirements cannot run the service's tasks.
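
As a sketch, assuming a manager reachable at `192.0.2.10:2377` (a placeholder
address), you could print the worker join command on a manager and then run
the printed command on the new node:

```console
$ docker swarm join-token worker
To add a worker to this swarm, run the following command:

    docker swarm join --token <worker-token> 192.0.2.10:2377
```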

## Monitor swarm health

You can monitor the health of manager nodes by querying the Docker `nodes` API
in JSON format through the `/nodes` HTTP endpoint. Refer to the
[nodes API documentation](/reference/api/engine/v1.25/#tag/Node)
for more information.
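
For example, on a host where the Docker Engine listens on its default Unix
socket (an assumption about your setup), you could query the endpoint
directly:

```console
$ curl --unix-socket /var/run/docker.sock http://localhost/v1.25/nodes
```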

From the command line, run `docker node inspect <id-node>` to query the nodes.
For instance, to query the reachability of the node as a manager:

```console
$ docker node inspect manager1 --format "{{ .ManagerStatus.Reachability }}"
reachable
```

To query the status of the node as a worker that accepts tasks:

```console
$ docker node inspect manager1 --format "{{ .Status.State }}"
ready
```

These commands show that `manager1` has status `reachable` as a manager and
`ready` as a worker.

An `unreachable` health status means that this particular manager node is
unreachable from the other manager nodes. In that case, you need to take action
to restore it:

- Restart the daemon and see if the manager comes back as reachable.
- Reboot the machine.
- If neither restarting nor rebooting works, you should add another manager node or promote a worker to be a manager node. You also need to cleanly remove the failed node entry from the manager set with `docker node demote <NODE>` and `docker node rm <id-node>`, as shown in the sketch after this list.
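
A minimal sketch of that cleanup, assuming the failed manager is named
`node07` (a hypothetical name) and running the commands from a healthy
manager:

```console
$ docker node demote node07
$ docker node rm node07
```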

Alternatively, you can get an overview of the swarm health from a manager
node with `docker node ls`:

```console
$ docker node ls
ID                           HOSTNAME  MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
1mhtdwhvsgr3c26xxbnzdc3yp    node05    Accepted    Ready   Active
516pacagkqp2xc3fk9t1dhjor    node02    Accepted    Ready   Active        Reachable
9ifojw8of78kkusuc4a6c23fx *  node01    Accepted    Ready   Active        Leader
ax11wdpwrrb6db3mfjydscgk7    node04    Accepted    Ready   Active
bb1nrq2cswhtbg4mrsqnlx1ck    node03    Accepted    Ready   Active        Reachable
di9wxgz8dtuh9d2hn089ecqkf    node06    Accepted    Ready   Active
```
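
To narrow that listing to managers only, you could filter on the node role
(`role` is one of the filters `docker node ls` supports):

```console
$ docker node ls --filter role=manager
```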

## Troubleshoot a manager node

You should never restart a manager node by copying the `raft` directory from another node. The data directory is unique to a node ID. A node can only use a node ID once to join the swarm. The node ID space should be globally unique.

To cleanly re-join a manager node to a cluster:

1. Demote the node to a worker using `docker node demote <NODE>`.
2. Remove the node from the swarm using `docker node rm <NODE>`.
3. Re-join the node to the swarm with a fresh state using `docker swarm join`, as sketched below.
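
A minimal sketch of those steps, assuming the node is named `node07` and that
a manager listens at `192.0.2.10:2377` (both hypothetical; retrieve a real
join token with `docker swarm join-token manager`):

```console
# Steps 1 and 2, run from a healthy manager:
$ docker node demote node07
$ docker node rm node07

# Step 3, run on node07 itself. Clearing the stale local swarm state with
# `docker swarm leave --force` gives the node the fresh state it needs:
$ docker swarm leave --force
$ docker swarm join --token <manager-token> 192.0.2.10:2377
```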

For more information on joining a manager node to a swarm, refer to
[Join nodes to a swarm](join-nodes.md).
