    > losing the quorum if further nodes are lost. The number of managers you
    > run is a trade-off. If you regularly take down managers to do backups,
    > consider running a five manager swarm, so that you can lose an additional
    > manager while the backup is running, without disrupting your services.

3.  Back up the entire `/var/lib/docker/swarm` directory.

4.  Restart the manager.
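
As a sketch, on a host where Docker runs under systemd, the whole backup sequence might look like the following (the archive path `/tmp/swarm-backup.tar.gz` is illustrative, not part of the official procedure):

```console
$ sudo systemctl stop docker
$ sudo tar -czf /tmp/swarm-backup.tar.gz -C /var/lib/docker swarm
$ sudo systemctl start docker
```

Store the resulting archive somewhere off the manager host, since it contains the full Raft state of the swarm.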

To restore, see [Restore from a backup](#restore-from-a-backup). 

## Recover from disaster

### Restore from a backup

After backing up the swarm as described in
[Back up the swarm](#back-up-the-swarm), use the following procedure to
restore the data to a new swarm.

1.  Shut down Docker on the target host machine for the restored swarm.

2.  Remove the contents of the `/var/lib/docker/swarm` directory on the new
    swarm.

3.  Restore the `/var/lib/docker/swarm` directory with the contents of the
    backup.

    > [!NOTE]
    > 
    > The new node uses the same encryption key for on-disk
    > storage as the old one. It is not possible to change the on-disk storage
    > encryption keys at this time.
    >
    > In the case of a swarm with auto-lock enabled, the unlock key is also the
    > same as on the old swarm, and the unlock key is needed to restore the
    > swarm.

4.  Start Docker on the new node. Unlock the swarm if necessary. Re-initialize
    the swarm using the following command, so that this node does not attempt
    to connect to nodes that were part of the old swarm, which presumably no
    longer exist.

    ```console
    $ docker swarm init --force-new-cluster
    ```

5.  Verify that the state of the swarm is as expected. This may include
    application-specific tests or simply checking the output of
    `docker service ls` to be sure that all expected services are present.

6.  If you use auto-lock,
    [rotate the unlock key](swarm_manager_locking.md#rotate-the-unlock-key).

7.  Add manager and worker nodes to bring your new swarm up to operating
    capacity.

8.  Reinstate your previous backup regimen on the new swarm.
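
Assuming the backup was taken as a tar archive of `/var/lib/docker/swarm` on a systemd-based host (the archive path below is illustrative), steps 1 through 4 might look like this:

```console
$ sudo systemctl stop docker
$ sudo rm -rf /var/lib/docker/swarm
$ sudo tar -xzf /tmp/swarm-backup.tar.gz -C /var/lib/docker
$ sudo systemctl start docker
$ docker swarm init --force-new-cluster
```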

### Recover from losing the quorum

Swarm is resilient to failures and can recover from any number
of temporary node failures (such as machine reboots, or crashes followed by a
restart) or other transient errors. However, a swarm cannot automatically
recover if it loses a
quorum. Tasks on existing worker nodes continue to run, but administrative
tasks are not possible, including scaling or updating services and joining or
removing nodes from the swarm. The best way to recover is to bring the missing
manager nodes back online. If that is not possible, continue reading for some
options for recovering your swarm.

In a swarm of `N` managers, a quorum (a majority) of manager nodes must always
be available. For example, in a swarm with five managers, a minimum of three must be
operational and in communication with each other. In other words, the swarm can
tolerate up to `(N-1)/2` permanent failures; beyond that, requests involving
swarm management cannot be processed. These types of failures include data
corruption and hardware failures.
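
The arithmetic for a five-manager swarm can be checked directly with shell arithmetic (this is just the majority formula from above, not a Docker command):

```console
$ N=5
$ echo "quorum: $(( N/2 + 1 )), tolerated failures: $(( (N-1)/2 ))"
quorum: 3, tolerated failures: 2
```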

If you lose the quorum of managers, you cannot administer the swarm. If you have
lost the quorum and you attempt to perform any management operation on the swarm,
an error occurs.
