Do's and Don'ts of DynamoDB autoscaling

This is the part-II of the DynamoDB Autoscaling blog post. While the Part-I blog post talks about how to accomplish DynamoDB autoscaling, this one talks about when to use and when not to use it.

Before you proceed further with auto scaling, make sure to read
Amazon DynamoDB guidelines
for working with tables and internal partitions.

SCALING GUIDELINES

First off all, let's define the variables before we jump into more details:

Definitions:
  • R = Provisioned Read IOPS per second for a table
  • W = Provisioned Write IOPS per second for a table
How to estimate number of partitions for your table:
  • Approximate number of internal DynamoDB partitions = (R + W * 3) / 3000. This assumes the each partition size is < 10 GB. If a given partition exceeds 10 GB of storage space, DynamoDB will automatically split the partition into 2 separate partitions. For more details refer to this AWS DynamoDB article.
  • Another hack for computing number of internal DynamoDB Partitions is by enabling streams for table and checking the number of shards, which is approximately equal to the number of partitions. You can disable the streams feature immediately after you’ve an idea about the number of partitions.

SCALING UP GUIDELINES

  • For any tables of any throughput/storage sizes, scaling up can be done with one-click in Neptune! It’s easy and doesn’t require much thought.
  • Only exception to this if you’ve a hot key workload problem, where scaling up based on your throughput limits will not fix the root of the problem. The only way to address hot key problem is to either change your workload so that it becomes uniform across all your DynamoDB internal partitions or use a separate caching layer outside of DynamoDB.

SCALING DOWN GUIDELINES

You’ve to look carefully at your access partners, throughput and storage sizes before you can turn on throughput downscaling for your tables. While in some cases if downscaling helps you save your costs, but in other cases, it actually worsens your latency or error rates if you don’t really understand the implications. So, be sure to understand your specific case before jumping on downscaling!

Scenario1: (Safe Zone) Safely perform throughput downscaling if:

All the following three conditions are true:

  • R + 3 * W <= 3000 IOPS
  • Size of table is less than 10GB (will continue to be so)
  • Reads & write access partners are uniformly distributed across all DynamoDB partitions (i.e. no hot keys)
Scenario2: (Cautious Zone) Validate whether throughput downscaling actually helps by checking if:

Here is where you’ve to consciously strike the balance between performance and cost savings.

  • R <= 5000/sec, and W <= 5000/sec
  • Verify if the approximate number of internal DynamoDB partitions is relative small (< 10 partitions).
  • Verify that your tables are not growing too quickly (it typically takes a few months to hit 10-20GB)
  • Read/Write access patterns are uniform, so scaling down wouldn’t increase the throttled request count despite no changes in internal DynamoDB partition count

Let’s consider a table with the below configuration:
Auto scale R upper limit = 5000
Auto scale W upper limit = 4000
R = 3000
W = 2000
(Assume every partition is less than 10 GB for simplicity in this example)

If you followed the best practice of provisioning for the peak first (do it once and scale it down immediately to your needs), DynamoDB would have created 5000 + 3000 * 3 = 14000 = 5 partitions with 2800 IOPS/sec for each partition. Now if you were to downscale to 3000 and 2000 read and writes respectively, new partitions will have 1800 IOPS/sec each for each partition. This means each partition has another 1200 IOPS/sec of reserved capacity before more partitions are created internally.

Scenario3: (Risky Zone) Use downscaling at your own risk if:
  • Storage size of your tables is significantly higher than > 10GB
  • Reads and writes are NOT uniformly distributed across the key space (i.e. if your workload has some hot keys). The exception is that if you’ve an external caching solution explicitly designed to address this need.
  • You are scaling up and down way too often and your tables are big in terms of throughput and storage. By scaling up and down often, you can potentially increase the #internal partitions and this could result in more throttled requests if you’ve hot-key based workload. It will also increase query and scan latencies since your query + scan calls are spread across multiple partitions. To be specific, if your read and write throughput rates are above 5000*, we don’t recommend you use auto scaling. This is just a cautious recommendation; you can still continue to use it at your own risk of understanding the implications.
  • If your table already has too many internal partitions in DyanmoDB, auto scaling actually might worsen your situation.

Best Practices

  • When you create your table for the time, set provisioned throughput capacity based on 12-month peak. Then you can scale down to what you want. This will ensure that DynamoDB will internally create the correct # partitions for your peak traffic. Let’s say you want to create the table with 4000 reads/sec and 4000 writes/sec. and Let’s assume your peak is 10,000 reads/sec and 8000 writes/second. Our proposal is to create the table with R = 10000, and W = 8000, then bring them to down R = 4000 and W=4000 respectively.
  • Have a custom metric for tracking application level #failed requests not just throttled request count exposed by CloudWatch/DynamoDB. This will also help you understand direct impact to your customers whenever you hit throughput limits (regardless of your use of Neptune). Note that Amazon SDK performs a retry for every throttled request (Provisioned Throughput Exceeded Exception)
  • Why the 5000 limit? This is purely based on our empirical understanding. This is something we are learning from our customers so would love your feedback as we are iterating on this. We just know below 5000 read/write throughput IOPS, you are likely not run into issues. But beyond read/write 5000 IOPS, we are not just so sure, so we are taking a cautious stance. You can still find it valuable beyond 5000 as well, but you need to really understand your workload and verify that it doesn't actually worsen your situation by creating too many unnecessary partitions.

Limitations

  • Neptune’s DynamoDB auto scaling is designed for scaling up aggressively and scaling down slowly. Scaling up can be made optimal according to your workload needs. While scaling down works pretty well, it’s not optimal and we don’t strive to make it optimal either just because the complications involved in 1) too much down scaling, and 2) limits placed on the DynamoDB in terms of you can scale down on a given day.
  • Neptune cannot respond to bursts shorter than 1 minute since 1 minute is the minimum level of granularity provided by the CloudWatch for DynamoDB metrics.
  • Right now, you’ve to manually configure alarms for throttled requests
  • If both read and write UpdateTable operations roughly happen at the same time, we don't batch those operations to optimize for #downscale scenarios/ day. However, in practice, we expect customers to run into this very less often. It's definitely a feature on our roadmap.
  • We explicitly restrict your scale up/down throughput factor ranges in UI and this is by design. By enforcing these constraints, we explicitly avoid cyclic up/down flapping.

Conclusions

In summary, you can use Neptune’s DynamoDB scale up throughput anytime (without thinking much). But, before signing up for throughput down scaling, you should:

  • Understand your provisioned throughput limits
  • Understand your access patterns and get a handle on your throttled requests (uniform or hot-key based workloads)
  • Understand table storage sizes (less than or greater than 10 GB)
  • Understand the number of DynamoDB internal partitions your tables might create
  • Be aware of the limitation of auto scaling tool (what it is designed for and what it’s not)

You can try DynamoDB autoscaling at www.neptune.io

We would love to hear your comments and feedback below.

comments powered by Disqus
Subscribe to get DevOps best practices delivered to you directly