12.4. Shard Management

12.4.1. Introduction

This document discusses how sharding works in CouchDB along with how to safely add, move, remove, and create placement rules for shards and shard replicas.

A shard is a horizontal partition of data in a database. Partitioning data into shards and distributing copies of each shard (called “shard replicas” or just “replicas”) to different nodes in a cluster gives the data greater durability against node loss. CouchDB clusters automatically shard databases and distribute the subsets of documents that compose each shard among nodes. Modifying cluster membership and sharding behavior must be done manually.

12.4.1.1. Shards and Replicas

How many shards and replicas each database has can be set at the global level, or on a per-database basis. The relevant parameters are q and n.

q is the number of database shards to maintain. n is the number of copies of each document to distribute. The default value for n is 3, and for q is 8. With q=8, the database is split into 8 shards. With n=3, the cluster distributes three replicas of each shard. Altogether, that’s 24 shard replicas for a single database. In a default 3-node cluster, each node would receive 8 shards. In a 4-node cluster, each node would receive 6 shards. We recommend in the general case that the number of nodes in your cluster should be a multiple of n, so that shards are distributed evenly.
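The arithmetic above can be checked with a short Python sketch. The function names are illustrative only, and the per-node figure assumes replicas are spread perfectly evenly across the nodes.

```python
def total_shard_replicas(q: int, n: int) -> int:
    """Total shard replicas for one database: q shards, n copies of each."""
    return q * n


def replicas_per_node(q: int, n: int, nodes: int) -> float:
    """Average shard replicas per node, assuming an even spread."""
    return total_shard_replicas(q, n) / nodes


print(total_shard_replicas(8, 3))   # 24
print(replicas_per_node(8, 3, 3))   # 8.0
print(replicas_per_node(8, 3, 4))   # 6.0
```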

CouchDB nodes have an etc/local.ini file with a section named cluster which looks like this:

[cluster]
q=8
n=3

These settings can be modified to set sharding defaults for all databases, or they can be set on a per-database basis by specifying the q and n query parameters when the database is created. For example:

$ curl -X PUT "$COUCH_URL:5984/database-name?q=4&n=2"

That creates a database that is split into 4 shards with 2 replicas of each, yielding 8 shard replicas distributed throughout the cluster.

12.4.1.2. Quorum

Depending on the size of the cluster, the number of shards per database, and the number of shard replicas, not every node may have access to every shard, but every node knows where all the replicas of each shard can be found through CouchDB’s internal shard map.

Each request that comes in to a CouchDB cluster is handled by any one random coordinating node. This coordinating node proxies the request to the other nodes that have the relevant data, which may or may not include itself. The coordinating node sends a response to the client once a quorum of database nodes have responded; 2, by default. The default required size of a quorum is equal to r=w=((n+1)/2) where r refers to the size of a read quorum, w refers to the size of a write quorum, and n refers to the number of replicas of each shard. In a default cluster where n is 3, ((n+1)/2) would be 2.
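As a worked instance of the formula above, here is a minimal Python sketch. The function name is illustrative, and it deliberately handles only odd n: for even n the formula does not yield a whole number, and how that case is rounded is not stated here.

```python
def default_quorum(n: int) -> int:
    """Default r and w per the formula (n + 1) / 2, for odd n only."""
    if n < 1 or n % 2 == 0:
        # Assumption of this sketch: even n is out of scope, since the
        # text does not say how a fractional result would be rounded.
        raise ValueError("this sketch only covers odd, positive n")
    return (n + 1) // 2


print(default_quorum(3))  # 2
```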

Note

Each node in a cluster can be a coordinating node for any one request. There are no special roles for nodes inside the cluster.

The size of the required quorum can be configured at request time by setting the r parameter for document and view reads, and the w parameter for document writes. For example, here is a request that directs the coordinating node to send a response once at least two nodes have responded:

$ curl "$COUCH_URL:5984/<db>/<doc>?r=2"

Here is a similar example for writing a document:

$ curl -X PUT "$COUCH_URL:5984/<db>/<doc>?w=2" -d '{...}'

Setting r or w equal to n (the number of replicas) means you will receive a response only once all nodes with relevant shards have responded or timed out. Because a timed-out node also ends the wait, this approach does not guarantee ACID consistency. Setting r or w to 1 means you will receive a response after only one relevant node has responded.

12.4.2. Moving a shard

This section describes how to manually place and replace shards. These activities are critical steps when you determine your cluster is too big or too small, and want to resize it successfully, or you have noticed from server metrics that database/shard layout is non-optimal and you have some “hot spots” that need resolving.

Consider a three-node cluster with q=8 and n=3. Each database has 24 shard replicas, distributed across the three nodes. If you add a fourth node to the cluster, CouchDB will not redistribute existing database shards to it. This leads to unbalanced load, as the new node will only host shards for databases created after it joined the cluster. To balance the distribution of shards from existing databases, they must be moved manually.

Moving shards between nodes in a cluster involves the following steps:

  1. Ensure the target node has joined the cluster.
  2. Copy the shard(s) and any secondary index shard(s) onto the target node.
  3. Set the target node to maintenance mode.
  4. Update cluster metadata to reflect the new target shard(s).
  5. Monitor internal replication to ensure up-to-date shard(s).
  6. Clear the target node’s maintenance mode.
  7. Update cluster metadata again to remove the source shard(s).
  8. Remove the shard file(s) and secondary index file(s) from the source node.

12.4.2.1. Copying shard files

Note

Technically, copying database and secondary index shards is optional. If you proceed to the next step without performing this data copy, CouchDB will use internal replication to populate the newly added shard replicas. However, copying files is faster than internal replication, especially on a busy cluster, which is why we recommend performing this manual data copy first.

Shard files live in the data/shards directory of your CouchDB install, in subdirectories named for each shard range. For instance, for a q=8 database called abc, here are its database shard files:

data/shards/00000000-1fffffff/abc.1529362187.couch
data/shards/20000000-3fffffff/abc.1529362187.couch
data/shards/40000000-5fffffff/abc.1529362187.couch
data/shards/60000000-7fffffff/abc.1529362187.couch
data/shards/80000000-9fffffff/abc.1529362187.couch
data/shards/a0000000-bfffffff/abc.1529362187.couch
data/shards/c0000000-dfffffff/abc.1529362187.couch
data/shards/e0000000-ffffffff/abc.1529362187.couch
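The eight directory names above are consistent with dividing the 32-bit space 00000000-ffffffff into q equal slices. The Python sketch below reproduces them under that assumption. The function name is illustrative, and values of q that do not divide 2**32 evenly are rejected rather than guessed at.

```python
from typing import List


def shard_ranges(q: int) -> List[str]:
    """Return q range names that evenly cover 00000000-ffffffff."""
    space = 2 ** 32
    if q < 1 or space % q != 0:
        # Assumption of this sketch: only evenly dividing q is handled.
        raise ValueError("this sketch assumes q divides 2**32 evenly")
    width = space // q
    # Each range is formatted as <start>-<end> in 8-digit lowercase hex.
    return [f"{i * width:08x}-{(i + 1) * width - 1:08x}" for i in range(q)]


for name in shard_ranges(8):
    print(name)
```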

Secondary indexes (including JavaScript views, Erlang views and Mango indexes) are also sharded, and their shards should be moved to save the new node the effort of rebuilding the view. View shards live in data/.shards. For example:

data/.shards
data/.shards/e0000000-ffffffff/_replicator.1518451591_design
data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview
data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view
data/.shards/c0000000-dfffffff
data/.shards/c0000000-dfffffff/_replicator.1518451591_design
data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview
data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view
...

Since they are files, you can use cp, rsync, scp or another file-copying command to copy them from one node to another. For example:

# on the target node, create the destination directories
$ mkdir -p data/.shards/<range>
$ mkdir -p data/shards/<range>
# on the source node, copy view shards first, then database shards
$ scp <couch-dir>/data/.shards/<range>/<database>.<datecode>* \
  <node>:<couch-dir>/data/.shards/<range>/
$ scp <couch-dir>/data/shards/<range>/<database>.<datecode>.couch \
  <node>:<couch-dir>/data/shards/<range>/

Note

Remember to move view files before database files! If a view index is ahead of its database, the database will rebuild it from scratch.

12.4.2.2. Set the target node to "true" maintenance mode

Before telling CouchDB about these new shards on the node, the node must be put into maintenance mode. Maintenance mode instructs CouchDB to return a 404 Not Found response on the /_up endpoint, and ensures it does not participate in normal interactive clustered requests for its shards. A properly configured load balancer that uses GET /_up to check the health of nodes will detect this 404 and remove the node from circulation, preventing requests from being sent to that node. For example, to configure HAProxy to use the /_up endpoint, use:

http-check disable-on-404
option httpchk GET /_up

If you do not set maintenance mode, or the load balancer ignores this maintenance mode status, after the next step is performed the cluster may return incorrect responses when consulting the node in question. You don’t want this! In the next steps, we will ensure that this shard is up-to-date before allowing it to participate in end-user requests.

To enable maintenance mode, put "true" to the node’s maintenance mode configuration endpoint.
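A minimal Python sketch of that request follows. The endpoint path /_node/<nodename>/_config/couchdb/maintenance_mode is an assumption, as it is not spelled out in this section, and the base URL and node name in the example are hypothetical. The helper only builds the request, so you can inspect it before sending.

```python
import json
import urllib.request


def maintenance_mode_request(base_url: str, node_name: str, enabled: bool) -> urllib.request.Request:
    """Build (but do not send) the PUT that toggles maintenance mode."""
    # Assumed endpoint path; verify it against your CouchDB version.
    url = f"{base_url}/_node/{node_name}/_config/couchdb/maintenance_mode"
    # The body is the JSON string "true" or "false", not a JSON boolean.
    body = json.dumps("true" if enabled else "false").encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical base URL and node name, for illustration only.
req = maintenance_mode_request("http://localhost:5984", "couchdb@127.0.0.1", True)
print(req.get_method(), req.full_url, req.data)
# To send it: urllib.request.urlopen(req), adding credentials as needed.
```

Passing enabled=False builds the matching request for leaving maintenance mode.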

Then, verify that the node is in maintenance mode by performing a GET /_up on that node’s individual endpoint; it should return a 404 Not Found response.

Finally, check that your load balancer has removed the node from the pool of available backend nodes.

12.4.2.3. Updating cluster metadata to reflect the new target shard(s)

Now we need to tell CouchDB that the target node (which must already be joined to the cluster) should be hosting shard replicas for a given database.

To update the cluster metadata, use the special /_dbs database, which is an internal CouchDB database that maps databases to shards and nodes. This database is replicated between nodes. It is accessible only via a node-local port, usually at port 5986. By default, this port is only available on the localhost interface for security purposes.

First, retrieve the database’s current metadata:

$ curl http://localhost:5986/_dbs/{name}
{
  "_id": "{name}",
  "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a",
  "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54],
  "changelog": [
    ["add", "00000000-1fffffff", "node1@xxx.xxx.xxx.xxx"],
    ["add", "00000000-1fffffff", "node2@xxx.xxx.xxx.xxx"],
    ["add", "00000000-1fffffff", "node3@xxx.xxx.xxx.xxx"],
    …
  ],
  "by_node": {
    "node1@xxx.xxx.xxx.xxx": [
      "00000000-1fffffff",
      …
    ],
    …
  },
  "by_range": {
    "00000000-1fffffff": [
      "node1@xxx.xxx.xxx.xxx",
      "node2@xxx.xxx.xxx.xxx",
      "node3@xxx.xxx.xxx.xxx"
    ],
    …
  }
}

Here is a brief anatomy of that document:

  • _id: The name of the database.
  • _rev: The current revision of the metadata.
  • shard_suffix: The database’s creation time, in seconds since the Unix epoch, stored as a list of ASCII code points (a leading period followed by the digits).
  • changelog: History of the database’s shards.
  • by_node: List of shards on each node.
  • by_range: List of nodes hosting each shard.
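To make the shard_suffix encoding concrete, the Python sketch below decodes the list from the example document back into text. The function name is illustrative.

```python
from typing import List


def decode_shard_suffix(codepoints: List[int]) -> str:
    """Turn a list of ASCII code points into the suffix string."""
    return "".join(chr(c) for c in codepoints)


suffix = decode_shard_suffix([46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54])
print(suffix)                   # .1530232526
print(int(suffix.lstrip(".")))  # 1530232526 (seconds since the Unix epoch)
```

The decoded suffix has the same form as the datecode that appears in shard file names such as abc.1529362187.couch.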

To reflect the shard move in the metadata, there are three steps:

  1. Add appropriate changelog entries.
  2. Update the by_node entries.
  3. Update the by_range entries.

Warning

Be very careful! Mistakes during this process can irreparably corrupt the cluster!

As of this writing, this process must be done manually.

To add a shard to a node, add entries like this to the database metadata’s changelog attribute:

["add", "<range>", "<node-name>"]

The <range> is the specific shard range for the shard. The <node-name> should match the name and address of the node as displayed in GET /_membership on the cluster.

Note

When removing a shard from a node, specify remove instead of add.

Once you have figured out the new changelog entries, you will need to update the by_node and by_range attributes to reflect which nodes store which shards. The data in the changelog entries and these attributes must match. If they do not, the database may become corrupted.
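Because a mismatch is so costly, it can be worth checking an edited document before you PUT it. The Python sketch below is one way to do that, not a CouchDB tool. It assumes a complete (not elided) metadata document whose changelog contains only the add and remove operations described here. The function names and node names are hypothetical.

```python
def replay_changelog(changelog):
    """Return the set of (range, node) pairs left after every entry is applied."""
    placed = set()
    for op, shard_range, node in changelog:
        if op == "add":
            placed.add((shard_range, node))
        elif op == "remove":
            placed.discard((shard_range, node))
        else:
            # Assumption of this sketch: other operations are not handled.
            raise ValueError(f"unexpected changelog operation: {op!r}")
    return placed


def metadata_is_consistent(doc):
    """True if changelog, by_node and by_range describe the same placement."""
    from_changelog = replay_changelog(doc["changelog"])
    from_by_node = {(r, node) for node, ranges in doc["by_node"].items() for r in ranges}
    from_by_range = {(r, node) for r, nodes in doc["by_range"].items() for node in nodes}
    return from_changelog == from_by_node == from_by_range


example = {
    "changelog": [
        ["add", "00000000-1fffffff", "node1@host"],
        ["add", "00000000-1fffffff", "node2@host"],
    ],
    "by_node": {
        "node1@host": ["00000000-1fffffff"],
        "node2@host": ["00000000-1fffffff"],
    },
    "by_range": {"00000000-1fffffff": ["node1@host", "node2@host"]},
}
print(metadata_is_consistent(example))  # True
```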

Continuing our example, here is an updated version of the metadata above that adds shards to an additional node called node4:

{
  "_id": "{name}",
  "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a",
  "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54],
  "changelog": [
    ["add", "00000000-1fffffff", "node1@xxx.xxx.xxx.xxx"],
    ["add", "00000000-1fffffff", "node2@xxx.xxx.xxx.xxx"],
    ["add", "00000000-1fffffff", "node3@xxx.xxx.xxx.xxx"],
    ...
    ["add", "00000000-1fffffff", "node4@xxx.xxx.xxx.xxx"]
  ],
  "by_node": {
    "node1@xxx.xxx.xxx.xxx": [
      "00000000-1fffffff",
      ...
    ],
    ...
    "node4@xxx.xxx.xxx.xxx": [
      "00000000-1fffffff"
    ]
  },
  "by_range": {
    "00000000-1fffffff": [
      "node1@xxx.xxx.xxx.xxx",
      "node2@xxx.xxx.xxx.xxx",
      "node3@xxx.xxx.xxx.xxx",
      "node4@xxx.xxx.xxx.xxx"
    ],
    ...
  }
}
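The edit shown above touches three places at once, which is easy to get wrong by hand. The Python sketch below applies all three changes together on a copy of the document. The function name and node names are hypothetical, and the sketch expects a complete metadata document rather than the elided one shown here.

```python
import copy


def add_shard_replica(doc, shard_range, node):
    """Return a copy of doc in which node also hosts shard_range."""
    if node in doc["by_range"].get(shard_range, []):
        raise ValueError(f"{node} already hosts {shard_range}")
    updated = copy.deepcopy(doc)  # leave the caller's document untouched
    updated["changelog"].append(["add", shard_range, node])
    updated["by_node"].setdefault(node, []).append(shard_range)
    updated["by_range"].setdefault(shard_range, []).append(node)
    return updated


metadata = {
    "_id": "abc",
    "changelog": [["add", "00000000-1fffffff", "node1@host"]],
    "by_node": {"node1@host": ["00000000-1fffffff"]},
    "by_range": {"00000000-1fffffff": ["node1@host"]},
}
updated = add_shard_replica(metadata, "00000000-1fffffff", "node4@host")
print(updated["changelog"][-1])
print(updated["by_range"]["00000000-1fffffff"])
```

Working on a copy keeps the original document available for comparison if the update needs to be reviewed first.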

Now you can PUT this new metadata:

$ curl -X PUT http://localhost:5986/_dbs/{name} -d '{...}'

12.4.2.4. Monitor internal replication to ensure up-to-date shard(s)

After you complete the previous step, as soon as CouchDB receives a write request for a shard on the target node, CouchDB will check if the target node’s shard(s) are up to date. If it finds they are not up to date, it will trigger an internal replication job to complete this task. You can observe this happening by triggering a write to the database (update a document, or create a new one), while monitoring the /_node/<nodename>/_system endpoint, which includes the internal_replication_jobs metric.

Once this metric has returned to the baseline from before you wrote the document, or is 0, the shard replica is ready to serve data and we can bring the node out of maintenance mode.
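The wait described above can be sketched as a small polling loop in Python. The function and parameter names are illustrative. The sketch also assumes that internal_replication_jobs is a top-level key in the decoded JSON body of GET /_node/<nodename>/_system. The fetch function is passed in so that the loop itself makes no assumptions about authentication or HTTP client.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_internal_replication(
    fetch_stats: Callable[[], Mapping[str, Any]],
    baseline: int = 0,
    interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until internal_replication_jobs is at or below baseline.

    Returns True once the metric has settled, or False if it has not
    done so after max_polls readings.
    """
    for _ in range(max_polls):
        if int(fetch_stats()["internal_replication_jobs"]) <= baseline:
            return True
        sleep(interval)
    return False


# Simulated readings, standing in for real responses from the node.
readings = iter([
    {"internal_replication_jobs": 5},
    {"internal_replication_jobs": 2},
    {"internal_replication_jobs": 0},
])
print(wait_for_internal_replication(lambda: next(readings), sleep=lambda _: None))  # True
```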

12.4.2.5. Clear the target node’s maintenance mode

You can now let the node start servicing data requests by putting "false" to the maintenance mode configuration endpoint, just as you put "true" when enabling maintenance mode.

Verify that the node is not in maintenance mode by performing a GET /_up on that node’s individual endpoint.

Finally, check that your load balancer has returned the node to the pool of available backend nodes.

12.4.2.6. Update cluster metadata again to remove the source shard

Now, remove the source shard from the shard map the same way that you added the new target shard to the shard map earlier. Be sure to add the ["remove", "<range>", "<source-node-name>"] entry to the end of the changelog as well as modifying both the by_node and by_range sections of the database metadata document.

12.4.2.7. Remove the shard and secondary index files from the source node

Finally, you can remove the source shard replica by deleting its file from the command line on the source host, along with any view shard replicas.

Congratulations! You have moved a database shard replica. By adding and removing database shard replicas in this way, you can change the cluster’s shard layout, also known as a shard map.

12.4.3. Specifying database placement

You can configure CouchDB to put shard replicas on certain nodes at database creation time using placement rules.

First, each node must be labeled with a zone attribute. This defines which zone each node is in. You do this by editing the node’s document in the /_nodes database, which is accessed through the node-local port. Add a key-value pair of the form:

"zone": "{zone-name}"

Do this for all of the nodes in your cluster. For example:

$ curl -X PUT http://localhost:5986/_nodes/<node-name> \
    -d '{
        "_id": "<node-name>",
        "_rev": "<rev>",
        "zone": "<zone-name>"
        }'

In the local config file (local.ini) of each node, define a consistent cluster-wide setting like:

[cluster]
placement = <zone-name-1>:2,<zone-name-2>:1

In this example, CouchDB will ensure that two replicas for a shard will be hosted on nodes with the zone attribute set to <zone-name-1> and one replica will be hosted on a node with the zone attribute set to <zone-name-2>.
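To illustrate how such a placement string reads, the Python sketch below splits it into (zone, replica count) pairs. This is an illustrative parser, not CouchDB's own, and the zone names are hypothetical.

```python
from typing import List, Tuple


def parse_placement(placement: str) -> List[Tuple[str, int]]:
    """Split 'zone:count,zone:count' into a list of (zone, count) pairs."""
    rules = []
    for part in placement.split(","):
        zone, _, count = part.strip().rpartition(":")
        if not zone or not count.isdigit():
            raise ValueError(f"malformed placement rule: {part!r}")
        rules.append((zone, int(count)))
    return rules


rules = parse_placement("zone-a:2,zone-b:1")
print(rules)                             # [('zone-a', 2), ('zone-b', 1)]
print(sum(count for _, count in rules))  # 3
```

In this example the counts add up to three replicas of each shard, two in one zone and one in the other.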

This approach is flexible, since you can also specify zones on a per-database basis by specifying the placement setting as a query parameter when the database is created, using the same syntax as the ini file:

$ curl -X PUT "$COUCH_URL:5984/<dbname>?placement=<zone-name-1>:2,<zone-name-2>:1"

Note that you can also use this system to ensure certain nodes in the cluster do not host any replicas for newly created databases, by giving them a zone attribute that does not appear in the [cluster] placement string.

12.4.4. Resharding a database to a new q value

The q value for a database can only be set when the database is created, precluding live resharding. Instead, to reshard a database, it must be regenerated. Here are the steps:

  1. Create a temporary database with the desired shard settings, by specifying the q value as a query parameter during the PUT operation.
  2. Stop clients accessing the database.
  3. Replicate the primary database to the temporary one. Multiple replications may be required if the primary database is under active use.
  4. Delete the primary database. Make sure nobody is using it!
  5. Recreate the primary database with the desired shard settings.
  6. Clients can now access the database again.
  7. Replicate the temporary back to the primary.
  8. Delete the temporary database.

Once all steps have completed, the database can be used again. The cluster will create and distribute its shards according to placement rules automatically.

Downtime can be avoided in production if the client application(s) can be instructed to use the new database instead of the old one, and a cut-over is performed during a very brief outage window.