Swarm Mode Overview & Key Concepts

Docker Swarm Mode is a feature within Docker that allows you to manage a cluster of Docker nodes (computers running Docker) as if they were a single machine. This is extremely useful for deploying applications that require multiple containers distributed across various servers. It provides built-in tools for clustering, service orchestration, load balancing, and scaling without needing extra software.

In simple terms, Swarm Mode turns a collection of computers running Docker into a "swarm," allowing you to manage services across these machines as though they were one system.

What is a Swarm? / What are Roles?

A swarm is a group of Docker hosts (servers running Docker) that are connected and work together to run containerized applications. Each host can play one of two roles:

- Manager: A node that controls the swarm. It handles the cluster management tasks, such as assigning workloads (tasks) to worker nodes and maintaining the desired state of the services.
- Worker: A node that does the actual work by running containers. The worker nodes execute the tasks assigned by the manager.

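Any Docker host in the swarm can be a manager, a worker, or both. By default, manager nodes also run tasks just as workers do, although they can be configured to handle management duties only.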

Key Concepts in Docker Swarm Mode

1. Nodes

A node is any machine that is part of a Docker Swarm cluster. Nodes can either be manager nodes (which control the swarm) or worker nodes (which run containers). In a real-world production environment, nodes are often spread across multiple physical servers or cloud machines.

Manager Node: Manages the cluster by keeping track of tasks and assigning them to nodes. The manager also works to keep the desired number of containers running at all times.

Worker Node: Receives and executes tasks given by the manager. Workers run the containers but do not manage the swarm.

2. Services and Tasks

A service is a definition of what needs to be run in the swarm. When you create a service, you specify things like the container image to use and how many copies (replicas) of the service should run.

There are two types of services:
- Replicated Services: The swarm manager assigns a set number of replica tasks to run across the available nodes.
- Global Services: A task for this service runs on every node in the swarm.

A task is the unit of scheduling in a swarm: it represents one container and the command to run inside it. The swarm manager assigns each task to a node. Once assigned, a task cannot move to another node; it runs there until it completes or fails, and a failed task of a replicated service is replaced by a new task rather than moved.
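To make the difference between the two service types concrete, here is a minimal Python sketch of how each one expands into tasks. It is a toy model, not Docker's scheduler: the `Task` class, the function names, the node names, and the simple round-robin placement are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a swarm task: one container on one node."""
    service: str
    slot: int
    node: str


def schedule_replicated(service: str, replicas: int, nodes: List[str]) -> List[Task]:
    # Replicated service: a fixed number of tasks, spread across the nodes.
    # Round-robin placement is an assumption of this sketch.
    return [Task(service, slot + 1, nodes[slot % len(nodes)]) for slot in range(replicas)]


def schedule_global(service: str, nodes: List[str]) -> List[Task]:
    # Global service: exactly one task on every node.
    return [Task(service, 1, node) for node in nodes]


nodes = ["node1", "node2", "node3"]
replicated = schedule_replicated("web", 5, nodes)
global_tasks = schedule_global("monitor", nodes)

for task in replicated + global_tasks:
    print(task)
```

With three nodes, the replicated service yields five tasks no matter how many nodes exist, while the global service yields one task per node and would grow automatically if a fourth node joined.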

3. Load Balancing

Docker Swarm has built-in load balancing to distribute traffic between the containers of a service. External users can reach a published service through any node in the swarm, even one that is not running a container for that service; the receiving node forwards the request to a node that is. Swarm uses ingress load balancing for external traffic and internal, DNS-based load balancing for traffic between services inside the swarm.
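The idea that a request can enter at any node and still reach a container can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `RoutingMesh` class, its method names, and the round-robin choice of backend are hypothetical and do not describe how Docker implements ingress routing internally.

```python
import itertools
from typing import Dict, Iterator, List, Tuple


class RoutingMesh:
    """Toy model: any node accepts a request and forwards it to a task."""

    def __init__(self, task_locations: Dict[str, List[str]]) -> None:
        # Maps a service name to a repeating sequence of the nodes
        # that currently host a task for that service.
        self._backends: Dict[str, Iterator[str]] = {
            service: itertools.cycle(hosts)
            for service, hosts in task_locations.items()
        }

    def route(self, entry_node: str, service: str) -> Tuple[str, str]:
        # entry_node may be any swarm node, including one with no task.
        # Round-robin backend selection is an assumption of this sketch.
        return entry_node, next(self._backends[service])


# The "web" service has tasks on node1 and node2 only.
mesh = RoutingMesh({"web": ["node1", "node2"]})

# All requests arrive at node3, which hosts no "web" task.
hops = [mesh.route("node3", "web") for _ in range(4)]
for entry, backend in hops:
    print(f"request entered at {entry}, served by task on {backend}")
```

Even though every request enters at node3, the work alternates between node1 and node2, which is the behavior the paragraph above describes from the user's point of view.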

4. Desired State Reconciliation

One of the most important features of Docker Swarm is its ability to maintain the desired state. The manager nodes constantly monitor the swarm and automatically adjust the number of containers to match what you have defined. For example, if one of the worker nodes fails, the manager will ensure that new containers are created on other nodes to maintain the required number of replicas.
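The reconciliation idea can be sketched as a single pass of a control loop: compare the desired replica count with what is actually running on healthy nodes, then add or remove tasks to close the gap. The Python below is an illustrative model only; the `reconcile` function, the `task-N` naming, and the "least loaded node first" placement are assumptions, not Docker's actual algorithm.

```python
from typing import Dict, List


def reconcile(desired: int, running: Dict[str, List[str]], healthy: List[str]) -> Dict[str, List[str]]:
    """One reconciliation pass for a single replicated service.

    running maps node name -> task ids on that node.
    healthy lists the nodes that are still reachable.
    """
    if not healthy:
        return {}

    # Tasks on failed nodes are considered lost.
    state = {node: list(running.get(node, [])) for node in healthy}
    next_id = 1 + sum(len(tasks) for tasks in running.values())
    actual = sum(len(tasks) for tasks in state.values())

    # Too few tasks: start new ones on the least loaded healthy node.
    while actual < desired:
        target = min(healthy, key=lambda node: len(state[node]))
        state[target].append(f"task-{next_id}")
        next_id += 1
        actual += 1

    # Too many tasks: remove from the most loaded healthy node.
    while actual > desired:
        target = max(healthy, key=lambda node: len(state[node]))
        state[target].pop()
        actual -= 1

    return state


before = {"node1": ["task-1"], "node2": ["task-2"], "node3": ["task-3"]}
# node3 fails, so only node1 and node2 remain healthy.
after = reconcile(3, before, ["node1", "node2"])
print(after)
```

In this run the task that lived on node3 is gone, and the pass starts a replacement on a surviving node so that three replicas are running again.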

Example: Creating a 3-Node Docker Swarm with All Manager Nodes

Let’s walk through an example where we set up a Docker Swarm with three nodes, all acting as manager nodes. This scenario is useful when you want high availability and fault tolerance in your cluster: managers must keep a majority (quorum) to make decisions, so with three managers the swarm can lose one of them and the remaining two can continue managing the swarm. It cannot survive the loss of two.

Why Make All Nodes Managers?

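In a Docker Swarm, manager nodes are responsible for handling the cluster's state, scheduling tasks, and distributing containers across the nodes. By making all three nodes managers, you ensure that your swarm can tolerate the failure of one node and still function: the two remaining managers still form a majority, so they can elect a new leader if needed and keep managing the swarm. If two of the three managers fail, the swarm loses quorum. Containers that are already running keep running, but the swarm cannot schedule, scale, or update services until a majority of managers is restored. This is what high availability means here: management survives a single failure, not any number of failures.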
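How many manager failures a swarm can absorb follows from the majority rule used by Raft, the consensus algorithm behind swarm managers: a majority of managers must be available to agree on changes. The short Python sketch below works through that arithmetic. The function names are hypothetical and chosen for illustration.

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """Managers that can fail while a majority still remains."""
    return (managers - 1) // 2


for n in (1, 3, 4, 5, 7):
    print(f"{n} managers: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

For three managers this gives a quorum of two and a tolerance of one failure. It also shows why odd manager counts are the usual choice: four managers tolerate one failure, the same as three, while five tolerate two.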