We recently decoupled our platform's search functionality from our main application. Search required deep knowledge of specialized systems (like Elasticsearch), was inaccessible to other services that wanted to offer text search, and had very different throughput requirements than the rest of the application.
We decided to run the indexing and search functionality as a separate service. Indexing consumes data changes from a Kafka event stream and a REST API supports performing text searches on that data.
We made some performance and scalability tweaks, including a new Postgres database that stores a copy of the Elasticsearch data. The database copy provides faster recovery if Elasticsearch has issues, simplifies admin tasks like adjusting an index's shard count, and allows us to denormalize data across entities, even if those entities are owned by different services.
Running in Production
After moving over a majority of customer traffic, we began noticing significant slowdowns in the time it took to index changes from the Kafka event stream. This process reads an event from Kafka, applies the event's data to the corresponding model in the Postgres database, and finally indexes the updated model into Elasticsearch.
Metrics during the periods of slowness indicated that Postgres response time jumped 10x-100x despite normal event and API traffic. Something was causing these periodic spikes in Postgres response time and we needed to dig in.
To reframe the problem more technically: our database processes a homogeneous query workload but slows down significantly for hours at a time, then recovers and processes normally again.
A Few Bad Queries
Inconsistent performance didn't sound like a typical inefficient-query problem. An expensive query can cause a tail latency amplification effect during sporadic slowness, or it can slow down other queries by consuming a disproportionate amount of resources during execution. All of our queries should have adequate indexes, but we didn’t want to miss a simple fix by relying on what “should” happen.
We selected a few slower queries and others with high disk IO, then used Postgres’ `EXPLAIN ANALYZE` to capture traces of each query during slow periods. Nothing suggested a cause: the execution plans were identical to those from healthy periods, yet execution time jumped 100x during periods of slowness.
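For illustration, this is roughly the kind of trace we collected; the table and filter below are placeholders rather than our actual schema:

```sql
-- Capture the execution plan plus runtime and buffer statistics for a suspect
-- query. Running this during both healthy and slow periods lets you diff the
-- plans and see where the extra time goes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name, updated_at
FROM products
WHERE organization_id = 42
ORDER BY updated_at DESC
LIMIT 50;
```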
Digging into RDS Performance
We use Amazon Web Services (AWS) to run almost all of our infrastructure at Salsify, so our problematic Postgres database was running on their Relational Database Service (RDS) offering. RDS greatly simplifies running a database in production and reduces the knowledge required to do so. The flip side of that benefit is that you might not have as much context when things go wrong.
AWS suggested monitoring the following to identify performance issues with RDS:
- High CPU or RAM Consumption
- Disk Space Consumption
- Network Traffic
- Database Connections
- IOPS Metrics
Our CPU, RAM and Disk Space consumption peaked around 50% of total capacity, which ruled them out as limiting factors. Network traffic and database connections were closer to our instance size’s maximum but never climbed to more than ~70%. Finally, our Read + Write IOPS spiked to within ~95% of the instance size’s limit and hovered at that level. Great! We found a metric with conspicuous behavior during degraded performance! Now we needed to determine whether it was a cause or just a symptom.
We decided to raise our database's IOPS limit: a simple and relatively cheap change that should indicate whether IOPS was causing the slowness. We already used provisioned IOPS to provide consistent performance, so we weren’t surprised that we might need to increase the allocation. After a few days, the slowdowns continued, and our database never consumed more than ~80% of its new IOPS limit. Whatever was causing the slowdowns was probably also preventing us from reaching our new IOPS limit.
Postgres Vacuuming and TOAST Data
Salsify uses Postgres throughout our platform, so we're familiar with its vacuuming functionality. Postgres uses multi-version concurrency control (MVCC) to provide transaction isolation. A byproduct of MVCC is "dead" tuples: row versions that are no longer visible to any transaction. Postgres' vacuum process scans the database for dead tuples and marks their space for reuse.
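A quick way to see this accumulation (a generic sketch, not our exact monitoring queries) is to look at per-table dead tuple counts and the last autovacuum run:

```sql
-- Tables with the most dead tuples, plus when autovacuum last processed them.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```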
A handful of configuration options control when (via the autovacuumer) and how the vacuum process runs. Our database used a configuration that we apply for write-heavy applications: write-heavy workloads generate lots of dead tuples, so it's more effective to run frequent, smaller vacuums that each operate on a reasonable amount of data.
| Workload | `autovacuum_vacuum_scale_factor` (fraction of table size) | `autovacuum_vacuum_threshold` (number of tuples) |
| --- | --- | --- |
| Postgres Default | 0.2 | 50 |
| Long-lived data, many writes per row | 0.01 | 50 |
| Short-lived data, insert then delete row | 0 | 5000 |
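Settings like these can also be applied per table as storage parameters; here's a minimal sketch using a placeholder table name and the values from the write-heavy row above:

```sql
-- Vacuum once ~1% of the table has changed instead of the default 20%,
-- so each run has a manageable amount of work.
ALTER TABLE products SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 50
);
```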
Our monitoring showed that Postgres was vacuuming our largest table whenever response times spiked. Vacuums also ran during periods of normal performance, but we figured vacuuming is complicated enough that some nuance we didn't understand could explain the discrepancy. We managed to catch a vacuum process during a slowdown, and performance improved dramatically when we manually killed it.
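Catching it in the act looked roughly like this (a sketch of the approach, not our exact commands):

```sql
-- Find in-flight autovacuum workers and how far along they are.
SELECT a.pid, a.query, p.phase, p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_activity a
JOIN pg_stat_progress_vacuum p USING (pid)
WHERE a.query LIKE 'autovacuum:%';

-- Terminate a specific worker; the table will simply be vacuumed again later.
SELECT pg_terminate_backend(12345);  -- replace 12345 with the pid from above
```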
So we had a lead: slowdowns coincided with vacuums. We didn't expect vacuuming to impact user queries and had never observed a similar issue in our other databases. But this database held a lot of large data blobs, which we hadn't dealt with elsewhere, and those blobs lived in TOAST storage.
Postgres uses TOAST storage when a row has a column value larger than about 2 kB. TOAST storage is essentially a companion table that stores the oversized value in chunks and is referenced by the tuple in the original table. Our largest table had a few hundred million rows, and many of those rows used TOAST storage.
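A catalog query like this shows the companion table and how much space it takes (the table name is a placeholder):

```sql
-- Find the TOAST relation backing a table and its on-disk size.
SELECT c.relname AS main_table,
       t.relname AS toast_table,
       pg_size_pretty(pg_relation_size(t.oid)) AS toast_size
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relname = 'products';
```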
Vacuum statistics indicated that it was taking hours to vacuum our largest TOAST table. We constantly updated data in that table, but those updates often did not change the TOAST'ed value. A Postgres update typically creates a new tuple that contains the current version of a row's data: new values come from the query, and unchanged values are copied from the most recent tuple for the row. Since a TOAST'ed value is stored in another table, only the TOAST reference is rewritten on a row update, as long as the TOAST'ed value did not change.
We avoided even loading the TOAST values when we knew they would not change, but this custom handling also prevented our ORM (Rails’ ActiveRecord) from detecting whether the data had actually changed when it was loaded. Postgres' TOAST documentation does not indicate whether new TOAST rows are created when the same value is written repeatedly, so we put together a quick and dirty script to verify [Github gist]. Postgres does generate new rows in a TOAST table when an update writes the same value, so sending unchanged TOAST values was generating a lot of unnecessary vacuum work.
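The check boils down to something like the following (a simplified sketch, not the actual gist):

```sql
-- A throwaway table whose payload column is stored out of line. STORAGE
-- EXTERNAL disables compression so the value reliably lands in the TOAST table.
CREATE TABLE toast_test (id int PRIMARY KEY, payload text);
ALTER TABLE toast_test ALTER COLUMN payload SET STORAGE EXTERNAL;
INSERT INTO toast_test VALUES (1, repeat('x', 10000));

-- Rewrite the exact same value...
UPDATE toast_test SET payload = repeat('x', 10000) WHERE id = 1;

-- ...then check dead tuples on the backing TOAST relation. The stats collector
-- updates asynchronously, so give it a moment before checking.
SELECT s.relname, s.n_dead_tup
FROM pg_stat_all_tables s
JOIN pg_class c ON c.reltoastrelid = s.relid
WHERE c.relname = 'toast_test';
```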
Normally a row update creates a single dead tuple, but because TOAST values are chunked in the TOAST table, a single row update that includes a TOAST value generates one dead tuple in the main table plus (value size / TOAST_MAX_CHUNK_SIZE) dead tuples in the TOAST table. For example, an update that includes a 10 kB value with a TOAST_MAX_CHUNK_SIZE of 2 kB generates six dead tuples: one in the main table and five in the TOAST table.
Our service processed millions of events per hour that did not change the TOAST value, forcing Postgres to unnecessarily read and write millions of TOAST rows, which had to hurt performance. We had uncovered a clear issue with our usage of Postgres that could explain the slowness, but we still didn’t understand why vacuuming would slow down user queries.
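In SQL terms, the waste comes down to whether the large column appears in the UPDATE at all (the table and column names below are placeholders):

```sql
-- Rewriting the TOASTed column, even with an identical value, writes a fresh
-- set of TOAST chunks and leaves the old ones behind as dead tuples.
UPDATE products
SET name = $1, large_blob = $2, updated_at = now()
WHERE id = $3;

-- Leaving the unchanged column out of the UPDATE rewrites only the main tuple
-- and keeps the existing TOAST chunks.
UPDATE products
SET name = $1, updated_at = now()
WHERE id = $3;
```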
We tweaked our ORM's dirty tracking to omit the TOAST’ed value when it did not change and excitedly pushed the change to production. Slowdowns continued to occur, albeit less frequently, so we were puzzled: dead TOAST tuples contributed to the slowdowns, but they weren't the primary cause. Time to do some more digging.
EBS Throughput
Unpacking Postgres internals like vacuuming hadn't fully explained the problem, so we continued to sift through RDS metrics and documentation. Reading more about IOPS indicated that we should see consumption come close to 100% of our limit. But even after our limit increased from 20k to 30k, database IOPS never exceeded 19k. It hovered around 15k during periods of slowness, with periodic spikes that approached 19k. We weren’t getting anywhere near 30k.
In tandem with the IOPS oddity was RDS’s disk queue depth metric. Articles online suggested that disk queue depth should generally only rise above 0 when an instance exceeds its IOPS limit, yet we saw a queue depth of 50 to 100 during periods of slowness while at only 50-70% of the IOPS limit.
We homed in on this invisible ceiling of 19k IOPS and looked for an explanation online. We found our answer in an AWS doc about EBS-optimized instances, which our RDS instance class used for storage. We were running an r5.2xlarge RDS instance, which has an IOPS limit of 18,750. This limit explained the 19k IOPS ceiling our database briefly hit, but it did not explain why the database consistently operated well below that limit, around 15k IOPS, during periods of slowness.
This doc also indicated that our instance could “burst” to a throughput of 593.75 MB/s for at least 30 minutes (and possibly more, as the zoomed-out chart below suggests) every 24 hours, but had a much lower guaranteed throughput maximum of 287.5 MB/s. Bandwidth, throughput and IOPS limits were all listed in the EBS limits doc, but we were only familiar with IOPS. Not sure exactly what the others meant, we turned to CloudWatch.
r5.2xlarge resource maximums
| Metric | Burst | Baseline |
| --- | --- | --- |
| Bandwidth (Mbps) | 4,750 | 2,300 |
| Throughput (MB/s) | 593.75 | 287.5 |
| IOPS | 18,750 | 12,000 |
Oddly enough, CloudWatch does not report these metrics directly. Instead, they’re broken down into read and write components. Looking at read and write throughput individually didn’t suggest any pattern. However, stacking the metrics showed we were hitting a clear ceiling:

Our read + write throughput was pegged at 213 MB/s. Zooming out to the last 24 hours showed more patterns that suggested a ceiling:

Throughput spiked and hovered around 443 MB/s overnight, then stayed pegged at the 213 MB/s limit for the remainder of the day. Our baseline maximum of 213 MB/s does not completely line up with the table above because our instance was launched prior to December 3, 2019; AWS instructs you to stop and start older instances to pick up the current maximums.
We reached out to AWS and confirmed that our interpretation of the metrics was correct: our RDS instance was hitting its EBS throughput limit. We scheduled an upgrade to the r5.4xlarge instance size for that evening, since it would double our baseline throughput maximum.
The day after the upgrade, we didn’t experience any slowdowns! We’ve been running this instance size for almost 6 months now and are beginning to see occasional throughput spikes that hit the limit. We’ve started exploring changes to reduce the amount of data moving around, but for now we know we can grow the instance to buy ourselves some headroom.
DB performance
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Avg. Read Latency | 5 ms | 1 ms | 5x faster |
| Avg. Write Latency | 100 ms | 15 ms | 6.67x faster |
| Avg. Throughput | 213 MB/s | 325 MB/s | ~53% more |
Conclusion
Consuming the maximum amount of IOPS, throughput or bandwidth will likely prevent you from maximizing your consumption of the other resources. Hitting our throughput maximum prevented us from ever reaching our IOPS limit. Since we only monitored memory, CPU and IOPS, we did not suspect that our slowdowns were caused by a resource limit.
IOPS, throughput and bandwidth should be foundational metrics of RDS monitoring, similar to CPU and memory. They’re hard limits imposed by RDS (via its use of EBS) and can rear their heads at unexpected times.
More generally, you can learn a lot by digging into performance issues with the tooling available to you. Although the issue didn’t end up being a consequence of some internal Postgres design decision, we developed a much stronger understanding of vacuuming, MVCC and TOAST storage. Using these concepts effectively is crucial to scaling Postgres successfully.