11.5. Sharding

11.5.1. Scaling out

Normally you start small and grow over time. In the beginning you might do just fine with one node, but as your data and number of clients grows, you need to scale out.

For simplicity we will start fresh and small.

Start node1 and add a database to it. To keep it simple we will have 2 shards and no replicas.

curl -X PUT "http://xxx.xxx.xxx.xxx:5984/small?n=1&q=2" --user daboss

If you look in the directory data/shards you will find the 2 shards.

data/
+-- shards/
|   +-- 00000000-7fffffff/
|   |    -- small.1425202577.couch
|   +-- 80000000-ffffffff/
|        -- small.1425202577.couch

Now, go to the admin panel

http://xxx.xxx.xxx.xxx:5986/_utils

and look in the database _dbs, it is here that the metadata for each database is stored. As the database is called small, there is a document called small there. Let us look in it. Yes, you can get it with curl too:

curl -X GET "http://xxx.xxx.xxx.xxx:5986/_dbs/small"

{
    "_id": "small",
    "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
    "shard_suffix": [
        46,
        49,
        52,
        50,
        53,
        50,
        48,
        50,
        53,
        55,
        55
    ],
    "changelog": [
    [
        "add",
        "00000000-7fffffff",
        "node1@xxx.xxx.xxx.xxx"
    ],
    [
        "add",
        "80000000-ffffffff",
        "node1@xxx.xxx.xxx.xxx"
    ]
    ],
    "by_node": {
        "node1@xxx.xxx.xxx.xxx": [
            "00000000-7fffffff",
            "80000000-ffffffff"
        ]
    },
    "by_range": {
        "00000000-7fffffff": [
            "node1@xxx.xxx.xxx.xxx"
        ],
        "80000000-ffffffff": [
            "node1@xxx.xxx.xxx.xxx"
        ]
    }
}
  • _id The name of the database.
  • _rev The current revision of the metadata.
  • shard_suffix The numbers after small and before .couch. This is seconds after UNIX epoch when the database was created. Stored as ASCII characters.
  • changelog Self explaining. Mostly used for debugging.
  • by_node List of shards on each node.
  • by_range On which nodes each shard is.

Nothing here, nothing there, a shard in my sleeve

Start node2 and add it to the cluster. Check in /_membership that the nodes are talking with each other.

If you look in the directory data on node2, you will see that there is no directory called shards.

Go to Fauxton and edit the metadata for small, so it looks like this:

{
    "_id": "small",
    "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
    "shard_suffix": [
        46,
        49,
        52,
        50,
        53,
        50,
        48,
        50,
        53,
        55,
        55
    ],
    "changelog": [
    [
        "add",
        "00000000-7fffffff",
        "node1@xxx.xxx.xxx.xxx"
    ],
    [
        "add",
        "80000000-ffffffff",
        "node1@xxx.xxx.xxx.xxx"
    ],
    [
        "add",
        "00000000-7fffffff",
        "node2@yyy.yyy.yyy.yyy"
    ],
    [
        "add",
        "80000000-ffffffff",
        "node2@yyy.yyy.yyy.yyy"
    ]
    ],
    "by_node": {
        "node1@xxx.xxx.xxx.xxx": [
            "00000000-7fffffff",
            "80000000-ffffffff"
        ],
        "node2@yyy.yyy.yyy.yyy": [
            "00000000-7fffffff",
            "80000000-ffffffff"
        ]
    },
    "by_range": {
        "00000000-7fffffff": [
            "node1@xxx.xxx.xxx.xxx",
            "node2@yyy.yyy.yyy.yyy"
        ],
        "80000000-ffffffff": [
            "node1@xxx.xxx.xxx.xxx",
            "node2@yyy.yyy.yyy.yyy"
        ]
    }
}

Then press Save and marvel at the magic. The shards are now on node2 too! We now have n=2!

If the shards are large, then you can copy them over manually and only have CouchDB syncing the changes from the last minutes instead.

11.5.2. Moving Shards

Add, then delete

In the world of CouchDB there is no such thing as “moving” shards, only adding and removing shard replicas. You can add a new replica of a shard and then remove the old replica, thereby creating the illusion of moving. If you do this for a database that has n=1, you might be caught by the following mistake:

  1. Copy the shard onto a new node.
  2. Update the metadata to use the new node.
  3. Delete the shard on the old node.
  4. Oh, no!: You have lost all writes made between 1 and 2.

To avoid this mistake, you always want to make sure that both shards have been live for some time and that the shard on your new node is fully caught up before removing a shard on an old node. Since “moving” is a more conceptually (if not technically) accurate description of what you want to do, we’ll use that word in this documentation as well.

Moving

When you get to n=3 you should start moving the shards instead of adding more replicas.

We will stop on n=2 to keep things simple. Start node number 3 and add it to the cluster. Then create the directories for the shard on node3:

mkdir -p data/shards/00000000-7fffffff

And copy over data/shards/00000000-7fffffff/small.1425202577.couch from node1 to node3. Do not move files between the shard directories as that will confuse CouchDB!

Edit the database document in _dbs again. Make it so that node3 have a replica of the shard 00000000-7fffffff. Save the document and let CouchDB sync. If we do not do this, then writes made during the copy of the shard and the updating of the metadata will only have n=1 until CouchDB has synced.

Then update the metadata document so that node2 no longer have the shard 00000000-7fffffff. You can now safely delete data/shards/00000000-7fffffff/small.1425202577.couch on node 2.

The changelog is nothing that CouchDB cares about, it is only for the admins. But for the sake of completeness, we will update it again. Use delete for recording the removal of the shard 00000000-7fffffff from node2.

Start node4, add it to the cluster and do the same as above with shard 80000000-ffffffff.

All documents added during this operation was saved and all reads responded to without the users noticing anything.

11.5.3. Views

The views needs to be moved together with the shards. If you do not, then CouchDB will rebuild them and this will take time if you have a lot of documents.

The views are stored in data/.shards.

It is possible to not move the views and let CouchDB rebuild the view every time you move a shard. As this can take quite some time, it is not recommended.

11.5.4. Reshard? No, Preshard!

Reshard? Nope. It cannot be done. So do not create databases with too few shards.

If you can not scale out more because you set the number of shards too low, then you need to create a new cluster and migrate over.

  1. Build a cluster with enough nodes to handle one copy of your data.
  2. Create a database with the same name, n=1 and with enough shards so you do not have to do this again.
  3. Set up 2 way replication between the 2 clusters.
  4. Let it sync.
  5. Tell clients to use both the clusters.
  6. Add some nodes to the new cluster and add them as replicas.
  7. Remove some nodes from the old cluster.
  8. Repeat 6 and 7 until you have enough nodes in the new cluster to have 3 replicas of every shard.
  9. Redirect all clients to the new cluster
  10. Turn off the 2 way replication between the clusters.
  11. Shut down the old cluster and add the servers as new nodes to the new cluster.
  12. Relax!

Creating more shards than you need and then move the shards around is called presharding. The number of shards you need depends on how much data you are going to store. But, creating too many shards increases the complexity without any real gain. You might even get lower performance. As an example of this, we can take the author’s (15 year) old lab server. It gets noticeably slower with more than one shard and high load, as the hard drive must seek more.

How many shards you should have depends, as always, on your use case and your hardware. If you do not know what to do, use the default of 8 shards.