Partitioning strategies for Cosmos Database SQL API – Part one

If you’re considering using Cosmos Database (DB) SQL API for your next project, one of the fundamental design decisions you will face is how to partition your data.

In this post, we’ll look at some of the factors that may influence your partitioning strategy, then walk through several example strategies and compare their pros and cons.

Considerations

It is important to understand your data before deciding on a partition strategy, to avoid issues further down the road. For example, consider the following:

  • How fast is your data growing?
  • What is the expected rate of read and write operations?

Partition limits

There are limits to how much you can store in a logical and physical partition. At the time of writing, Azure Cosmos DB supports up to 20GB for a logical partition and 50GB for a physical partition.

When a physical partition approaches its limit, Cosmos DB splits it into new physical partitions under the hood without affecting availability. When a logical partition hits its limit, Cosmos DB throws the error “Partition key reached maximum size of 20 GB” and no more records can be inserted into it.

Therefore, be mindful of how your data grows with respect to the partition key you choose.

Read use cases

In general, you want an even distribution of partition key values for write operations to avoid hot partitions. It is also important to design for your read operations, because inefficient queries eat up Request Units significantly.

Identify what type of queries you need, e.g.

  • Is the set of data you are accessing within a single partition, or does it span partitions? Cross-partition queries always cost more Request Units, and the cost also depends on how many physical partitions there are in your container.
  • Do you require pagination? Most Cosmos DB SDKs support a continuation token when your query targets a single partition. However, the same support for cross-partition queries is not available in all SDKs.
  • Aggregate functions are expensive for cross-partition queries. For example: SELECT COUNT(1), c.group FROM c GROUP BY c.group
  • Filtering by a single partition key and item id (point filtering) is desirable, as many SDKs will know exactly which partition to read your records from.
  • If you have multiple other query criteria, including range filters and sorting, designing a proper indexing policy will help efficiency.
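
To make the continuation-token pattern concrete, here is a minimal sketch in plain JavaScript. The `makePagedSource` helper is a hypothetical stand-in for a real query iterator (the Node.js SDK’s iterator returns an opaque token string; this mock simply uses the next array offset):

```javascript
// Minimal sketch of continuation-token pagination. `makePagedSource` is a
// stand-in for a real Cosmos DB query iterator, not part of any SDK.
function makePagedSource(items, pageSize) {
  return {
    fetchPage(continuationToken) {
      const start = continuationToken ?? 0;
      const resources = items.slice(start, start + pageSize);
      const nextToken =
        start + pageSize < items.length ? start + pageSize : null;
      return { resources, continuationToken: nextToken };
    },
  };
}

// Walk every page: keep passing the token back until it is null.
function fetchAllPages(source) {
  const pages = [];
  let token = null;
  do {
    const page = source.fetchPage(token);
    pages.push(page.resources);
    token = page.continuationToken;
  } while (token !== null);
  return pages;
}
```

The key design point is that the caller round-trips the token unchanged between requests; with a real SDK the token is opaque and must be treated as such.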

SDK limitations

There are various SDKs available to integrate with Cosmos DB. For the complete list, refer to the official documentation from Microsoft.

Your options depend on which Cosmos DB API you use (e.g. SQL API, Cassandra API, etc.) and which language your application code is written in. Not all SDKs are implemented the same and they have different feature limitations.

Cross-partition queries

While working with the Node.js SDK, we found that pagination is not supported for cross-partition queries. This caused us to redesign the container’s partitioning strategy to fit our need for pagination, rather than migrating our application from Node.js to the .NET platform.

Multi-tenancy systems

In a multi-tenancy system, all customers’ data goes into a single Cosmos DB container. Some active customers may generate high request rates to the container; these are known as busy tenants. If the data is not segregated into different partitions by customer, other customers will be affected by the busy tenants, and their requests have a much higher chance of getting throttled.

Note that the provisioned Request Units are split evenly between the underlying physical partitions. For example, if you provision 1000 RU/s for a container that currently has 5 physical partitions, each physical partition supports a rate of 200 RU/s.
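
As a quick sanity check on that arithmetic (the helper name is illustrative, not part of any SDK):

```javascript
// Each physical partition receives an even share of the container's
// provisioned throughput (illustrative helper only).
function ruPerPhysicalPartition(provisionedRu, physicalPartitionCount) {
  return provisionedRu / physicalPartitionCount;
}

// 1000 RU/s across 5 physical partitions -> 200 RU/s each
const perPartition = ruPerPhysicalPartition(1000, 5);
```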

How partition keys are distributed across physical partitions is not in our control, so we might still have busy tenants affecting others.

Data retention

Physical partitions have a limit of 50GB of data. As data grows over time, more physical partitions are created under the hood. This also means that the provisioned throughput (RU/s) is split among more physical partitions. As a result, you will need to provision more throughput (at a higher cost) to maintain the same request rate.

A data retention policy is something you should consider to keep data growth in check. Cosmos DB has a Time to Live (TTL) feature that lets us specify how long to retain an item. Items are deleted automatically by Cosmos DB using left-over Request Units.
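
As a sketch of how this might look, a container-level default can be combined with a per-item override. Cosmos DB expresses TTL in seconds, via `defaultTtl` on the container definition and `ttl` on the item; the `id` and partition key path below are made up for illustration:

```javascript
// Container definition with a 30-day default TTL (in seconds).
// `id` and the partition key path are illustrative values.
const containerDefinition = {
  id: "events",
  partitionKey: { paths: ["/tenantId"] },
  defaultTtl: 30 * 24 * 60 * 60, // items expire 30 days after last write
};

// A per-item override: this item expires after 7 days instead.
const item = {
  id: "evt-001",
  tenantId: "acme",
  ttl: 7 * 24 * 60 * 60,
};
```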

If your data grows very large and is meant to be stored indefinitely, consider moving less-accessed data to a cheaper alternative storage.

Some examples of partitioning strategies

In this section we’ll look at several examples of partitioning strategies, though this is by no means an exhaustive list.

Random

The partition key is composed wholly or partly of a random value, for example a randomly-generated GUID.

Pros

  • Avoids ‘hot’ partitions, throughput is evenly spread
  • Highly scalable – won’t come close to hitting the logical partition size limit

Cons

  • Querying more than one document is unavoidably a cross-partition query. This makes queries less efficient and more expensive, and some Cosmos DB SDKs, such as the Node.js SDK, cannot use native pagination with a continuation token.

Per tenant

If you have a multi-tenant system this means using the unique tenant identifier for the partition key.

Pros

  • All queries for a tenant are within a single partition.

Cons

  • May not be scalable – because each tenant is limited to a maximum of 20GB of data, busy tenants with a lot of document writes and/or a long time-to-live (TTL) could easily run out of space.

Per tenant + category

This is somewhat similar to the ‘Per tenant’ scheme discussed above, but with additional identifiers for the category of data being stored.

An example might be {tenantid}_workflow to store all workflow-related documents for the tenant.
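
A tiny illustrative helper for composing such a key (the function name and sample values are our own, not from any SDK):

```javascript
// Build a `{tenantid}_{category}` partition key, e.g. "acme_workflow".
function buildCategoryKey(tenantId, category) {
  return `${tenantId}_${category}`;
}
```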

Pros

  • One advantage over the ‘Per tenant’ scheme is that the 20GB limit is now per category, so if you have many categories this effectively increases your storage capacity.
  • Queries scoped to a single tenant and category avoid cross-partition queries.

Cons

  • Scalability is possibly still not great – for those categories where you are storing a lot of documents, sooner or later you may hit the 20GB limit.

Per tenant + entity

This is an extension of the previous scheme, adding a unique identifier for the parent entity. It is suited to hierarchical data, where an entity has a collection of child entities.

For example, if we have a Workflow entity that contains multiple “WorkflowInstance” entities, we could use a partition scheme like this for the workflow instance documents: {tenantid}_workflowinstances_{parentworkflowid}
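
Sketching this as a helper (again, names and sample values are illustrative):

```javascript
// Build a `{tenantid}_workflowinstances_{parentworkflowid}` partition key,
// so all instances of one workflow land in one logical partition.
function buildWorkflowInstanceKey(tenantId, parentWorkflowId) {
  return `${tenantId}_workflowinstances_${parentWorkflowId}`;
}
```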

Pros

  • One advantage over the ‘Per tenant + category’ scheme is that the 20GB limit is now per entity.
  • Queries scoped to a single tenant and parent entity avoid cross-partition queries. Using our example, querying all workflow instances for a specified parent workflow would be confined to one partition.

Cons

  • Although scalability is better than the previous scheme, it might not be ideal if you have a relatively low number of parent entities but a large number of child entities.

Per tenant + date period

This is where each partition corresponds to a time period (a ‘bucket’) for a tenant. Say we want separate buckets for each month; we could use the scheme {tenantid}_{yyyyMM}.

This is suited for time-series event data where the document contains a fixed date property that determines what bucket the document goes into.
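
A hedged sketch of deriving the bucket from a document’s date property, assuming UTC-based monthly buckets (names are illustrative):

```javascript
// Build a `{tenantid}_{yyyyMM}` partition key from the document's event
// date (UTC), so each tenant gets one bucket per month.
function buildMonthlyBucketKey(tenantId, eventDate) {
  const year = eventDate.getUTCFullYear();
  const month = String(eventDate.getUTCMonth() + 1).padStart(2, "0");
  return `${tenantId}_${year}${month}`;
}
```

Pinning the bucket to a fixed, immutable date property matters: if the property could change, a document might logically belong to a different partition than the one it was written to.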

Pros

  • More scalable than the previous two strategies, though you still need to take into account the expected frequency of document creation and document size to see if this is suitable.
  • Cross-partition queries are still a possibility if you query across a date range that spans more than one bucket, but there are techniques for managing this – we’ll look at them in Part 2.

Cons

  • If you need pagination, this is only really suitable for queries that sort by the date property that determines the document’s bucket.

Next

In Part 2 we’ll take a deeper look at the ‘Per tenant + date period’ partitioning strategy and discover ways of managing pagination as we query across bucket boundaries.

Ryan Williams and Kong Yuan Shing

Ryan Williams is a Principal Engineer at Nintex with over 20 years of engineering experience. Yuan Shing is a Senior Engineer at Nintex, based in the Malaysia office.
