Moving to PostgreSQL on Amazon Aurora simplifies all the replication issues list...

icheishvili · on April 4, 2020

We were quite unhappy with Aurora because of the terrible performance of writes against tables with secondary indices. What does your workload look like?

phamilton · on April 4, 2020

We trimmed out most of our secondary indexes long before we moved to Aurora because performance of postgres writes with secondary indexes is always a bit rough. The few we have remaining do well enough.

Overall our workload is very read heavy. At peak, if we compare our CPU on the writer vs the aggregate CPU on the readers, we do about 10x more read work than writes.

Remind is an education messaging program, so our workload is partially user management (which users belong to which schools and which classes) and partially user generated content (messages being sent). Our user generated content (more like 2-3x read vs write) is all backed by DynamoDB and our user management is in a couple of Aurora database clusters.

MuffinFlavored · on April 4, 2020

How much do you pay a month for what size?

phamilton · on April 4, 2020

Cost is complicated because we autoscale from 2-3 readers up 10+ readers during peak traffic. Instead of running 10 replicas 24/7, we can spin one up and within about 10 minutes it's handling reads. So we have a few instances that only run for about 3 hours a day and others that run maybe 6 hours. That is a big cost savings over running them all 24 hours a day. We couldn't bring up new replicas like that when we were on normal Postgres RDS.

Pricing is also surprising compared to vanilla Postgres RDS because reader nodes double as spare writers. A multi-az deployment of Postgres RDS plus two single-az replicas is more expensive than a 3 node (1 writer and two readers) Aurora cluster. E.g. on 2xlarge instances, this Aurora setup is $3.48/hour vs $4/hr on RDS for similar effective hardware and fault tolerance.

Running directly on EC2 is going to be much cheaper obviously. $1.51/hr for three 2xlarge instances(if you want to failover to an active replica) or $2.01/hr for 4 if you want a dedicated failover instance (like RDS does).

agacera · on April 4, 2020

Aurora is indeed amazing. However, my experience with it regarding costs is that main cost you have running it is not compute, but IO ($0.2 per million requets [1]).

This cost is kind of hidden since to estimate this in the early stages of a project is an art. In one project on my team the IO cost is about 8x more than cost of instances. But imo it is still worth and I never actually calculated how much we would pay if we were running on RDS + provisioned IOPS.

phamilton · on April 4, 2020

It is really hard to estimate because it isn't quite apples to apples. With Aurora we pay for what we use, not what we provision. We used to have some big write spikes for about 5 minutes each hour and so had to massively over-provision IOPS. With Aurora we just pay for the IOPs we use on those spikes.

Right now, on a multiple of the traffic we had before we moved to Aurora we are paying less than half what we used to for IO.

nopurpose · on April 4, 2020

How much knoweldge is transferable? Isn't Aurora just protocol and SQL dialect compatible, but underneath it has nothing to do with postgres?

phamilton · on April 4, 2020

Underneath it is postgres, with a few key things rewritten.

It has a custom storage layer that isn't too different from a very fancy SAN. Replication is where things get to be very different. All instances use the same underlying store, so replication of storage isn't part of the postgres layer. However, reader nodes need to invalidate cache when writes occur. So they use postgres replication, but the readers skip writing to storage.

Lots of rewritten components under the hood to support this different storage paradigm, but the engine itself is still postgres.

perfectspiral · on April 4, 2020

Each flavor (and version) of Aurora is compatible with a corresponding version of the open source software. For example Aurora MySQL 1.* is compatible with MySQL 5.6

At my current gig we use it in Prod, but we are also able to run our software during development pointing to locally installed open-source versions of MySQL just fine. I imagine it's the same for Postgres.

setr · on April 4, 2020

I believe he was asking whether an understanding of "under-the-hood" of postgres transfers to Aurora -- that is, does your tuning knowledge transfer as well, or is it just you can migrate your codebase transparently

Though now that I think about it, I think Aurora gives relatively little in tuning accees, so it's more of whether the hueristics transfer (eg the ol' avoid all joins, which I've always been suspicious of, but still don't know if it's a useful saying)

phamilton · on April 4, 2020

There are a number of gotchas around tuning though. Buffer cache, for example, was fully redesigned for Aurora postgres. Aurora defaults are pretty good, but in our case we had some tuning of cache settings in place before moving over and the result was terrible. Wiping out our tuning and just using the defaults was a good fix. Also, pgbouncer didn't play very well with the aurora reader endpoints and we had to mess around with that.

But from the application perspective, it all runs pretty seamlessly. We've never had a behavioral difference between development against postgres and production with Aurora. Perf has really been the only difference, and perf in development never represents perf in prod at large scale anyway.

setr · on April 4, 2020

I believe he was asking whether an understanding of "under-the-hood" of postgres transfers to Aurora -- that is, does your tuning knowledge transfer as well, or is it just you can migrate your codebase transparently

kqr · on April 4, 2020

What are latencies like compared to a pg server on physical servers?

phamilton · on April 4, 2020

Query latency is about the same, especially if you are serving from cache. If you have to hit the storage layer it probably slows down but we have a 99%+ buffer cache hit rate so we don't see it that often.

Replication lag is pretty steady at 15ms.