1mhtdwhvsgr3c26xxbnzdc3yp    node05    Accepted    Ready   Active
516pacagkqp2xc3fk9t1dhjor    node02    Accepted    Ready   Active        Reachable
9ifojw8of78kkusuc4a6c23fx *  node01    Accepted    Ready   Active        Leader
ax11wdpwrrb6db3mfjydscgk7    node04    Accepted    Ready   Active
bb1nrq2cswhtbg4mrsqnlx1ck    node03    Accepted    Ready   Active        Reachable
di9wxgz8dtuh9d2hn089ecqkf    node06    Accepted    Ready   Active
```

## Troubleshoot a manager node

You should never restart a manager node by copying the `raft` directory from another node. The data directory is unique to a node ID. A node can only use a node ID once to join the swarm. The node ID space should be globally unique.

To cleanly re-join a manager node to a cluster:

1. Demote the node to a worker using `docker node demote <NODE>`.
2. Remove the node from the swarm using `docker node rm <NODE>`.
3. Re-join the node to the swarm with a fresh state using `docker swarm join`.
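
For example, as a hedged sketch with a hypothetical manager named `node3`, a placeholder join token, and a placeholder manager address, the sequence might look like this (run the first two commands on a healthy manager and the last two on `node3` itself; a local `docker swarm leave` is assumed here so the node rejoins with a fresh state):

```none
$ docker node demote node3      # on a healthy manager: step node3 down to a worker
$ docker node rm node3          # on a healthy manager: remove node3 from the swarm

$ docker swarm leave            # on node3: discard its old local swarm state
$ docker swarm join --token <join-token> <manager-ip>:2377    # on node3: rejoin fresh
```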

For more information on joining a manager node to a swarm, refer to
[Join nodes to a swarm](join-nodes.md).

## Forcibly remove a node

In most cases, you should shut down a node before removing it from a swarm with
the `docker node rm` command. If a node becomes unreachable, unresponsive, or
compromised, you can forcefully remove the node without shutting it down by
passing the `--force` flag. For instance, if `node9` becomes compromised:

```none
$ docker node rm node9

Error response from daemon: rpc error: code = 9 desc = node node9 is not down and can't be removed

$ docker node rm --force node9

Node node9 removed from swarm
```

Before you forcefully remove a manager node, you must first demote it to the
worker role. Make sure that you always have an odd number of manager nodes if
you demote or remove a manager.
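
Continuing the hypothetical `node9` example above, demoting first might look like this:

```none
$ docker node demote node9          # step the compromised manager down to a worker
$ docker node rm --force node9      # then forcibly remove it from the swarm
```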

## Back up the swarm

Docker manager nodes store the swarm state and manager logs in the
`/var/lib/docker/swarm/` directory. This data includes the keys used to encrypt
the Raft logs. Without these keys, you cannot restore the swarm.

You can back up the swarm using any manager. Use the following procedure.

1.  If the swarm has auto-lock enabled, you need the unlock key
    to restore the swarm from backup. Retrieve the unlock key if necessary and
    store it in a safe location. If you are unsure, read
    [Lock your swarm to protect its encryption key](swarm_manager_locking.md).

2.  Stop Docker on the manager before backing up the data, so that no data is
    being changed during the backup. It is possible to take a backup while the
    manager is running (a "hot" backup), but this is not recommended and your
    results are less predictable when restoring. While the manager is down,
    other nodes continue generating swarm data that is not part of this backup.

    > [!NOTE]
    > 
    > Be sure to maintain the quorum of swarm managers. During the
    > time that a manager is shut down, your swarm is more vulnerable to
    > losing the quorum if further nodes are lost. The number of managers you
    > run is a trade-off. If you regularly take down managers to do backups,
    > consider running a five manager swarm, so that you can lose an additional
    > manager while the backup is running, without disrupting your services.

3.  Back up the entire `/var/lib/docker/swarm` directory.

4.  Restart the manager.
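
As a rough sketch of this procedure on a systemd-based host (the service name, archive path, and use of `tar` are assumptions rather than requirements):

```none
$ docker swarm unlock-key                 # step 1: record the unlock key if auto-lock is enabled
$ sudo systemctl stop docker              # step 2: stop Docker so the data is not changing
$ sudo tar czvf /tmp/swarm-backup.tar.gz -C /var/lib/docker swarm    # step 3: archive the swarm directory
$ sudo systemctl start docker             # step 4: restart the manager
```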

To restore, see [Restore from a backup](#restore-from-a-backup). 

## Recover from disaster

### Restore from a backup

After backing up the swarm as described in
[Back up the swarm](#back-up-the-swarm), use the following procedure to
restore the data to a new swarm.

1.  Shut down Docker on the target host machine for the restored swarm.

2.  Remove the contents of the `/var/lib/docker/swarm` directory on the new
    swarm.

3.  Restore the `/var/lib/docker/swarm` directory with the contents of the
    backup.

    > [!NOTE]
    > 
    > The new node uses the same encryption key for on-disk
    > storage as the old one. It is not possible to change the on-disk storage
    > encryption keys at this time.
