Availability is trickier than scalability. An async replica can lose a few recent writes during a failover, and a synchronous replica is safer but slower. A company using some platform might not even know which one they're using until it bites them.
99.9% of companies also aren't going to feel the performance difference of synchronous replication.
That being said, the setups I typically see don't even go that far. Most companies don't mitigate against the database going down in the first place. If the DB goes down, they just eat the downtime and fix it.
Skype open-sourced their architecture way back, using PL/Proxy to route calls based on shard. It works, is quite elegant, handled 50% of all international phone calls in the noughties. My old company used it to provide real-time analytics on about 300M mobile devices.
If the database team designed a thoughtful API with stored procedures, this can actually be a quite nice way to interact with a database for specific uses.
Being 100% hard and fast on that rule seems like a bad idea though.
I think it's a perfectly fine approach in 2025 now that CI/CD has proliferated and you're less likely to run into a human DBA arbitrarily blocking deployments. It was always the feudal-lord mentality of DB operations that made relying so much on stored procedures undesirable.
You already can [0]. C, PL/Perl, PL/Python, and PL/Tcl exist out of the box, in addition to PL/pgSQL, which I assume you were implying isn’t a “real programming language.”
Frankly, I think devs need to stop being such prima donnas about DX. Not everything has a slick UI that does most of the work for you. Take Linux, for example: you have to email your PR as a patch, inline as plaintext, to a mailing list. Yet somehow they manage, and it has an incredibly active community.
It is indeed, but I’ll be the first to admit it’s a bit clunky to use beyond a few variables and actions.
Sorry for snapping. I’m exhausted with devs complaining that some older and well-established piece of tech (Postgres, HAProxy, nginx to name a few) isn’t easy enough to use, and then using something demonstrably worse, or writing their own terrible version of it. Work trauma.
One of the benefits of stored procedures they don't mention is SECURITY DEFINER, which is like setuid.
You can for instance have a user table with login and hashed password, and a stored procedure that can verify a login and password, without giving SELECT access on the user table to the database user your application uses.
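A minimal sketch of that pattern (the table, role, and function names here are made up, and pgcrypto is assumed for the hashing):

  CREATE EXTENSION IF NOT EXISTS pgcrypto;

  CREATE TABLE app_user (
      login   text PRIMARY KEY,
      pw_hash text NOT NULL
  );

  CREATE FUNCTION check_login(p_login text, p_password text)
  RETURNS boolean
  LANGUAGE sql
  SECURITY DEFINER                 -- runs with the function owner's privileges
  SET search_path = public         -- avoid search_path tricks from callers
  AS $$
      SELECT EXISTS (
          SELECT 1 FROM app_user
          WHERE login = p_login
            AND pw_hash = crypt(p_password, pw_hash)
      );
  $$;

  REVOKE ALL ON app_user FROM app_role;                          -- app role can't read the table
  GRANT EXECUTE ON FUNCTION check_login(text, text) TO app_role; -- but it can call the check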
Stored procedures also help defend against SQL injection attacks.
I would much rather this than letting devs write their own queries, ORM or not. But then, I own the DBs, and I am constantly dealing with the result of poor schema and query decisions.
Probably as useful is the overview of what pgdog is and the docs. From their docs[1]: "PgDog is a sharder, connection pooler and load balancer for PostgreSQL. Written in Rust, PgDog is fast, reliable and scales databases horizontally without requiring changes to application code."
Another option is going full-scale with CockroachDB. We had a Django application backed by PostgreSQL, which we migrated to CockroachDB using their official backend.
The data migration was a pain, but it was still less painful than manually sharding the data or dealing with 3rd party extensions. Since then, we’ve had a few hiccups with autogenerated migration scripts, but overall, the experience has been quite seamless. We weren’t using any advanced PostgreSQL features, so CockroachDB has worked well.
Unless their pricing has changed, it’s quite exorbitant when you need a lot of data. To the point that one year of cockroachdb would cost 5x the cost of the server it was running on.
This is still true. I wouldn’t use Cockroach if it were my own business. Also, they don’t offer any free version to try out the product. All you get is a short trial period and that’s it.
> Also, they don’t offer any free version to try out the product.
The site makes it seem as if I can install CockroachDB on Mac, Linux, or Windows and try it out for as long as I like. https://www.cockroachlabs.com/docs/v25.1/install-cockroachdb... Additionally, they claim CockroachDB Cloud is free for use "up to 10 GiB of storage and 50M RUs per organization per month".
Licenses are also renewed yearly. It's not like you get one license and can use it forever. No, you need to enter a new license into your instance every year; if you do not, after 2 weeks it goes into cripple mode (as in, you cannot run anything that is not a personal blog).
And those free licenses have this dirty little clause that you are not entitled to a license; they need to APPROVE each free license. Now that is even more scary.
They pull all this because people kept using the free core version and simply never upgraded or wanted more. That is why all these changes happened. Coincidentally, the buzz around CRDB has died down to the point that most talk about CRDB is these rare mentions here (even Reddit is as good as dead). 98% of the mentions of how great CRDB is originate from Cockroach Labs. Do a Google search limited by time range and go page by page... They are an enterprise-only company at this point.
> Additionally, they claim CockroachDB Cloud is free for use "up to 10 GiB of storage and 50M RUs per organization per month".
Take into account that the constant CPU/memory/query monitoring that CRDB does eats up around 20 to 30M RUs per month. Some people have complained about why their free instances lost so much capacity. And those RUs are not 1:1: you do an insert and it's not 1 RU, it's more like 9 to 12 RUs or something.
It's very easy to eat all those RUs on a simple website, let alone something that needs to scale. Trust me, you're better off self-deploying, but then you run into the issue of the new licenses / forced telemetry / forced phone-home, or you spend $125+/vCPU (good luck finding out the price; we only know these numbers from people breaking NDA'd offers). They are very aggressive in sales.
It's not worth tying your company to a product that can change licenses on a whim and charges Oracle prices (and uses the same tactics). I am fairly sure that some of their sales staff are ex-Oracle employees. ;)
I also wonder what limitations you get on queries. In Spanner, for instance, there's something about cross-directory queries, and Citus can't do cross-shard FKs IIRC. There's no magic way to shard a relational DB; you have to deal with it one way or another.
Cockroach really doesn’t have limitations when it comes to queries. It is kinda magic in that regard. The single partition queries are the special case and are treated as rare. Cockroach ranges are quite small compared to other systems (~256MiB).
> Just beware that CockroachDB is not a drop-in replacement for PostgreSQL.
Oh, it's way worse than that... A ton of small functionality tends to be missing or work differently. Sometimes it's even small stuff like column[1:2] that does not exist in CRDB, and other times it's things like ROW LEVEL SECURITY ... you know, something you may want on a massively distributed database that will be used for tenant setups (I hear they are finally going to implement it this year).
The main issue is the performance... You are spending close to ~4x the resources on a similarly performing setup. And this is 2x Postgres (sync) vs 3x CRDB (where every node is just a replica). It's not the replication itself but the massive overhead of the Raft protocol plus the much less optimized query planner.
To match Postgres, you're required to deploy around 10 to 12 nodes, on hardware with about twice the resources. That is the point where CRDB performs at a similar level.
The issue is, well, you add resources to your PostgreSQL server and the gap is back. You can scale PostgreSQL to insane levels with 64 or 96 CPUs ... Or imagine, you know, just planning your app in advance in such a way that you spread your load over multiple PostgreSQL instances. Rocket science, folks ;)
CRDB is really fun to watch, with the built-in GUI and the replication being very visual, but it's a resource hog. The storage system (Pebble) eats a ton of resources compacting the data, when you can simply solve that with a PostgreSQL instance on ZFS (which, ironically, often compresses better).
I do not joke when I say that seeing those steep CPU spikes in the night hours from the compactor working is painful. Even a basic, empty CRDB instance, just logging its own usage, constantly runs at between 7 and 50% CPU on a quad-core ARM N1.
PostgreSQL? You do not even know it's running. Barely any CPU usage, and what memory usage?
And we have not even talked about the license/payment issues ... a "free enterprise" version with FORCED telemetry on, AND you're not allowed to hide the server (i.e. if it cannot call home, it goes into restricted mode in 7 days, limited to something like 50 queries/second ... i.e. the same as just shutting down your DB). By the way, some people have reported pricing of $125+ per vCore. Given that even the most basic CRDB setup gets you 3 instances with a minimum of 4 cores each ... do the math. Yes, that only applies above $10M in income, but they change their licenses every few years, so who knows about next year or the year after that.
Interesting product, shitty sales company behind it. I am more interested in seeing when the PostgreSQL storage extension OrioleDB comes out, so it can solve the main issue that prevents PostgreSQL from scaling even more, namely the write/vacuum issue. And of course a better way to upgrade PostgreSQL versions; there, CRDB is leaps and bounds better.
I'm glad you brought up the migration, because one of the main goals behind our project is to automate migrating to a sharded deployment. You can think of your DB as the base case (num_shards = 1), and PgDog as the recursive solution.
Automatic data transfer would be super cool to get out of the box. We had a custom multi-tenancy solution for our application that heavily used PostgreSQL schemas. One schema for each customer.
It was a pain to get that to work with Cockroach since it doesn't optimize cross-schema queries and suggests one DB per customer. This was a deal breaker for us and we had to duplicate data to avoid cross-schema queries.
Being able to live within Postgres has its advantages.
Another (battle tested * ) solution is to deploy the (open source) Postgres distribution created by Citus (subsidiary of Microsoft) on nodes running on Ubuntu, Debian or Red Hat and you are pretty much done: https://www.citusdata.com/product/community
It's not. We tried. Plus, it doesn't work on RDS, where most production databases are. I think Citus was a great first step in the right direction, but it's time to scale the 99% of databases that don't already run on Azure Citus.
That's because Amazon wants to do whatever they like themselves... you apparently can get stuff to work by running your own masters (w/ citus extension) in EC2 backed by workers (Postgres RDS) in RDS:
Depends on your schema, really. The hard part is choosing a distribution key to use for sharding: if you've got something like tenant ID that's in most of your queries and big tables, it's pretty easy, but it can be a pain otherwise.
For a multi-tenant use case, yeah, pretty close to thinking about partitioning.
For other use cases, there can be big gains from cross-shard queries that you can't really match with partitioning, but that's super use case dependent and not a guaranteed result.
I’m working with several Postgres databases that share identical schemas, and I want to make their data accessible from a single interface.
Currently, I’m using Postgres FDWs to import the tables from those databases. I then create views that UNION ALL the relevant tables, adding a column to indicate the source database for each row.
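Roughly what that looks like today, for reference (the server, schema, and table names below are made up for illustration):

  CREATE EXTENSION IF NOT EXISTS postgres_fdw;
  CREATE SCHEMA fdw_a;

  CREATE SERVER db_a FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'db-a.internal', dbname 'app');
  CREATE USER MAPPING FOR CURRENT_USER SERVER db_a
      OPTIONS (user 'readonly', password 'secret');
  IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
      FROM SERVER db_a INTO fdw_a;

  -- ...repeat for db_b into fdw_b, and so on...

  CREATE VIEW all_orders AS
      SELECT 'db_a' AS source_db, * FROM fdw_a.orders
      UNION ALL
      SELECT 'db_b' AS source_db, * FROM fdw_b.orders;

On PG 14+, marking the foreign servers with the async_capable option is supposed to let the planner run the UNION ALL branches concurrently, which may already cover the "query in parallel" part, depending on the plan shape.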
This works, but I’m wondering if there’s a better way — ideally something that can query multiple databases in parallel and merge the results with a source database column included.
Would tools like pgdog, pgcat, pganimal be a good fit for this? I’m open to suggestions for more efficient approaches.
I'm kind of interested in why we can't make a better database with all of our modern technology.
Postgres is a fantastic workhorse, but it was also released in the late 80s. Who, who among you will create the database of the future... And not lock it behind bizarro licenses which force me to use telemetry.
There are "better" databases but they're better given some particular definition that may not be relevant to your needs. If SQL/the relational model and ACID semantics is what you need then postgres is simply the best in class. The fact it dates back to the 80s is probably an advantage (requirement?) when it comes to solving a problem really well
I guess I’d ask why is something having been first released in the late 80s, or any decade, as positive or negative? Some things are still used because they solve the problems people have. Some things are still used just because of industry capture. I’m not honestly sure where I’d put Postgres.
Are there specific things you’d want from a modern database?
Relating to the article, better scaling. Saying run it on a bigger box is a very brute force way to optimize an application.
While they come up with some other tricks here, that's ultimately what scaling Postgres means.
If I imagine a better database, it would have native support for scaling, a Postgres-compatible data layer, as well as first-party support for NoSQL (JSONB columns don't cut it, since unpredictable behavior tends to occur if you have simultaneous writes).
What does permissible license mean? If you mean open source, no such database exists AFAIK. If you mean you can run it locally for free for dev purposes, on prem without telemetry etc, then Oracle is clearly the best option.
Compared to Postgres, Oracle DB:
• Scales horizontally with full SQL and transactional consistency. That means both write and read masters, not replicas - you can use database nodes with storage smaller than your database, or with no storage, and they are fully ACID.
• Has full transactional MQ support, along with many other features.
• Can scale elastically.
• Doesn't require vacuuming or have problems with XID wraparound. These are all Postgresisms that don't affect Oracle due to its better MVCC engine design.
• Has first party support for NoSQL that resolves your concern (see SODA and JSON duality views).
I should note that I have a COI because I work part time at Oracle Labs (and this post is my own yadda yadda), but you're asking why does no such database exist and whether anyone can make one. The database you're asking for not only exists but is one of the world's most popular databases. It's also actually quite cheap thanks to the elastic scaling. Spec out a managed Oracle DB in Oracle's cloud using the default elastic scaling option and you'll find it's cheaper than an Amazon Postgres RDS for similar max specs!
Oracle Cloud (called OCI) has an always-free offering, you don't need credits. You can just sign up and use some small quantity of resources for nothing indefinitely. That includes a managed Oracle database:
There's also some sort of startup credits program with a brochure here, apparently you can just fill out a form and get some credits with an option to apply for more. But I don't know much about that. I've used the always-free programme for some personal stuff and it worked fine so I never needed to think about credits.
I have to admit I'm not really familiar with Firebase, I thought that was some managed service for mobile apps, but Oracle DB comes with some stuff that sounds similar. And Oracle Cloud is an AWS-style cloud, it has a ton of high level services for things.
ORDS is a REST binding layer that lets you export tables, views, stored procedures and NoSQL JSON document stores over HTTP without writing a middleman server yourself. You can drive it directly from the browser. ORDS supports OAuth2 or can be integrated with custom auth schemes from what I understand. I've not used ORDS myself yet but probably will in the near future.
Firebase IIRC when it first launched was known for push streaming of changes. Oracle DB lets you subscribe to the results of SQL queries and get push notifications when they change, either directly via driver callbacks or into a message queue for async processing later. It's pretty easy to hook such notifications up to web sockets or SSE or similar, in fact I've done that in my current project.
There's also a thing called APEX which is a bundled visual low-code app builder. I've never used it but I've used apps built with it, and it must be quite flexible as they all had a lot of features and looked very different. You can tell you're using an APEX app because they have a lot of colons in the URLs for some reason. Here's a random example of one from outside of Oracle that exports a database of dubious scientific research papers:
I'm not holding it up as a great example, there are probably better examples out there, it's just one that came to mind that's public and that I've used before.
Firebase is basically a back-end server in a box. Authentication, check. File storage, check. Databases, check. Firebase Functions, which allow for complex logic written in JavaScript, check. Hosting, check. Supabase comes really close, but has been much more difficult in actual use.
I'm working on a fully open source game right now, and I can't ethically tell people to hook into a closed source service. But I must admit, having to use supabase instead of firebase has made this much harder than it needs to be.
Hm right well you can certainly write apps that run fully inside the database and use its services, so with ORDS doing OAuth2 I guess that gets close to something like Firebase. You can run logic written in JS easily too, it has an embedding of graaljs which is a fast JS engine that's well integrated with the rest of the database environment.
Can you please expand on the JSONB unpredictable behavior? We are about to embark on a journey to move some of our data from MongoDB to postgres (with some JSONB). While we don't have significant concurrent writes to a table, would be very helpful to understand the issues
From what I can tell, unlike mongo, some postgres queries will try to update the entire JSONB data object vs a single field. This can lead to race conditions.
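A sketch of the failure mode and the usual workarounds (hypothetical profiles table, not tied to any particular ORM):

  -- Risky: the application reads the document, modifies it, and writes the
  -- whole thing back. Two concurrent writers can silently drop each other's
  -- changes (lost update).
  UPDATE profiles SET data = '{"name": "Bob", "theme": "dark"}' WHERE id = 1;

  -- Safer: change only the key you care about, atomically, in one statement.
  UPDATE profiles SET data = jsonb_set(data, '{theme}', '"dark"') WHERE id = 1;

  -- Or take a row lock around the read-modify-write cycle.
  BEGIN;
  SELECT data FROM profiles WHERE id = 1 FOR UPDATE;
  -- ...recompute the new document in the application...
  UPDATE profiles SET data = '{"name": "Bob", "theme": "dark"}' WHERE id = 1;
  COMMIT;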
Ah, thanks.
The first link seems to be something specific to Shrine. The bottom line is that concurrent updates to different parts of a JSONB value need row-level locking for correct behavior in PostgreSQL. This is not an important issue for us. Thank you for the pointers.
The problem with Postgres scaling is that you have to have a single master, which means horizontal scaling really only gives you more reads and a failover. Eventually you won't be able to find a server big enough to handle all the writes, and if you get enough reads with even a small number of writes, single-master setups fall over. Distributed computing gets complicated very quickly, but the gist here is basically that you need to be able to have multiple instances that can accept writes. Lots of literature on this, but good starting points IMO would be Lamport's "Time, Clocks" paper https://lamport.azurewebsites.net/pubs/time-clocks.pdf and the Dynamo paper https://www.allthingsdistributed.com/files/amazon-dynamo-sos...
Postgres is the most worked-on database in the world right now. Its original release date doesn't mean work stopped. NoSQL was the new thing a decade-ish ago, but most companies probably don't need it. New data platforms focused on scaling, like Snowflake and Cockroach, have come along too, but again, for most use cases Postgres is better.
Came here to say this. Cockroach solves the sharding issue by adopting consistent hashing-based data distribution, as described in the Dynamo paper. However, their cloud solution is a bit expensive to get started with.
Something I don't see in the pgdog documentation is how cross-shard joins work. Okay, if I do a simple `select * from users order by id`, you'll in-memory order the combined results for me. But if I have group by and aggregations and such? Will it resolve that correctly?
Heh, no chance I can introduce this at work and hard to have a personal project requiring it. :)
I think you probably need some documentation to the effect of the current state of affairs, as well as prescriptions for how to work around it. _Most_ live workloads, even if the total dataset is huge, have a pretty small working set. So limiting DB operations to simple fetches and doing any complex operations in memory is viable, but it should be prescribed as the solution or people will consider its omission a fault instead of a choice.
I had it in the back of my mind for a while, nice to have it in code. Works pretty well, as long as columns in GROUP BY are present in the result set. Otherwise, we would need to rewrite the query to include them, and remove them once we're done.
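Concretely, the kind of rewrite I mean (hypothetical orders table):

  -- As written, the grouping column is not in the result set:
  SELECT count(*) FROM orders GROUP BY customer_id;

  -- Rewritten before fan-out, so per-shard partial results can be merged
  -- per group; the extra column is stripped before returning to the client:
  SELECT customer_id, count(*) FROM orders GROUP BY customer_id;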
No worries. It's early days, the code and docs are just a few months in the making. I'm happy to keep you updated on the progress. If you want, send your contact info to hi@pgdog.dev.
People talk about scale frequently as a single dimension (usually volume as it relates to users), but that can be oversimplifying for many kinds of applications. For instance, when you are thinking about non-trivial partitioning schemes (like when there is high coupling between entities of the same kind, as you see in graphs) is when you should consider alternatives like the Bigtable-inspired DBs, since those are (relatively) more batteries-included for you.
> It’s funny to write this. The Internet contains at least 1 (or maybe 2) meaty blog posts about how this is done
It would’ve been great to link those here. I’m guessing one refers to StackOverflow which has/had one of the more famous examples of scaled Postgres.
I run a PostgreSQL DB with a few billion rows at about 2TB right now. We don't need sharding yet, but for when we do, I was considering Citus. Does anyone have experience implementing Citus who could comment?
It can be great, depending on your schema and planned growth. Questions I'd be asking in your shoes:
1. Does the schema have an obvious column to use for distribution? You'll probably want to fit one of the 2 following cases, but these aren't exclusive:
1a. A use case where most traffic is scoped to a subset of data (e.g. a multitenant system). This is the easiest use case: just make sure most of your queries contain the column (most likely tenant ID or equivalent), and partially denormalize to have it in tables where it's currently implicit, to make your life easier (see the sketch at the end of this comment). Do not use a timestamp.
1b. A rollup/analytics-based use case that needs heavy parallelism (e.g. a large IoT system where you want to do analytics across a fleet). For this, you're looking for a column that has high cardinality without too many major hot spots: in the IoT use case mentioned, this would probably be a device ID or similar.
2. Are you sure you're going to grow to the scale where you need Citus? Depending on workload, it's not too hard to have a 20TB single-server PG database, and that's more than enough for a lot of companies these days.
3. When do you want to migrate? Logical replication into Citus should work these days (I haven't tested it myself), but the higher the update rate and the larger the database, the more painful this gets. There aren't a lot of tools that are very useful for the more difficult scenarios here, but the landscape has changed since I last had to do this.
4. Do you want to run this yourself? Azure does offer a managed service, and Crunchy offers Citus on any cloud, so you have options.
5. If you're running this yourself, how are you managing HA? pg_auto_failover has some Citus support, but can be a bit tricky to get started with.
I did get my Citus cluster over 1 PB at my previous job, and that's not the biggest out there, so there's definitely room to scale, but the migration can be tricky.
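To make 1a concrete, the sketch mentioned above. Table names are made up, but create_distributed_table and create_reference_table are the actual Citus calls:

  -- Every tenant-scoped table carries tenant_id and is distributed on it.
  SELECT create_distributed_table('accounts', 'tenant_id');
  SELECT create_distributed_table('orders',   'tenant_id', colocate_with => 'accounts');

  -- Small shared lookup tables can be replicated to every worker instead.
  SELECT create_reference_table('plans');

  -- Queries that filter on the distribution column are routed to one worker:
  SELECT count(*) FROM orders WHERE tenant_id = 42;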
Interesting technology. Similar to Citus but not built as an extension. The Citus coordinator, which is a Postgres database with the Citus extension, is replaced by a proxy layer written in Rust. That might provide more flexibility and velocity implementing distributed planning and execution than being tied to the extension ecosystem. It would indeed be a journey to catch up with Postgres on compatibility, but it's a good start.
Tangentially related: is there a good guide or setup scripts to run self hosted Postgres with backups and secondary standby? Like I just want something I can deploy to a VPS/dedicated box for all my side projects.
If not is supabase the most painless way to get started?
Funny how everyone eventually hits this point and thinks they're inventing fire - but then again, pushing your trusty old tools way past comfort is where cool engineering actually happens.
Seems like this is a similar philosophy, but is missing a bunch of things the Citus coordinator provides. From the article, I'm guessing Citus is better at cross-shard queries, SQL support, central management of workers, keeping schemas in sync, and keeping small join tables in sync across the fleet, and provides a single point of ingestion.
That being said, this does seem to handle replicas better than Citus ever really did, and most of the features it's lacking aren't relevant for the sort of multitenant use case this blog is describing, so it's not a bad tradeoff. This also avoids the coordinator as a central point of failure for both outages and connection count limitations, but we never really saw those be a problem often in practice.
We certainly have a way to go to support all cross-shard use cases, especially complex aggregates (like percentiles). In OLTP, which PgDog will focus on first, it's good to have a sharding key and a single shard in mind 99% of the time. The 1% will be divided between easy things we already support, like sorting, and slightly more complex things like aggregates (avg, count, max, min, etc.), which are on the roadmap.
For everything else, and until we cover what's left, postgres_fdw can be a fallback. It actually works pretty well.
Citus works really well *if* you have your schema well defined and slightly denormalized (meaning you have the shard key materialized on every table), and you ensure you're always joining on that as part of querying. For a lot of existing applications that were not designed with this in mind, it can be several months of database and application code changes to get things into shape to work with Citus.
If you're designing from scratch and make it work with Citus then (specifically for a multi-tenant/SaaS sharded app) it can make scaling seem a bit magical.
Great question and we will publish something for this later on. TLDR: Citus doesn't provide an easy migration path for existing DBs and the managed deployments are pretty much exclusive to Azure. A large percentage of production DBs run elsewhere and don't have an easy way to shard.
Replace it with a function that makes sure the id coming out matches the sharding schema. Assuming it's coming from a sequence, we're consuming it until we get a matching number. It would be good to know what is behind the generated column in your use case.
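For illustration, a simplified sketch of that idea, assuming a plain modulo placement and a sequence called orders_id_seq (the real hashing scheme may well differ):

  -- Keep drawing from the sequence until the value lands on the target shard.
  CREATE FUNCTION next_id_for_shard(p_shard int, p_num_shards int)
  RETURNS bigint
  LANGUAGE plpgsql
  AS $$
  DECLARE
      v bigint;
  BEGIN
      LOOP
          v := nextval('orders_id_seq');
          EXIT WHEN v % p_num_shards = p_shard;
      END LOOP;
      RETURN v;
  END;
  $$;

  -- e.g. as a column default on shard 3 of 8:
  -- id bigint DEFAULT next_id_for_shard(3, 8)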
We actually found Aurora to be about 3x slower than community PG for small queries. That was back then, maybe things are better now. Migrating to another database (and Aurora is Postgres-compatible, it's not Postgres) is very risky when you've been using yours for years and know where the edge cases are.
I’ve consistently found Aurora MySQL and PG to be slower than everything, including my 12 year old Dell R620s. You can’t beat data locality, and the 4/6 quorum requirement of Aurora combined with the physical distance kills any hope of speed.
Aurora is postgres but with a different storage layer, no? It uses the postgres engine, which other postgres-compatible databases like cockroach do not, right?
I run a 100 billion+ row Postgres database [0] that is around 16TB, and it's pretty painless!
There are a few tricks that make it run well (PostgreSQL compiled with a non-standard block size, ZFS, careful VACUUM planning). But nothing too out of the ordinary.
ATM, I insert about 150,000 rows a second, run 40,000 transactions a second, and read 4 million rows a second.
Isn't "Postgres does not scale" a strawman?
[0] https://www.merklemap.com/
People run Postgres at scale much "larger" than what you are running, which isn't to say that your workload isn't substantial (it is). But there are folks who push it harder, both in terms of writes/rows and workload complexity. It's one thing to write a ton of rows into a single table; it's another thing entirely to write into many tables, index all of them and then query them at scale. You didn't mention much about your workload complexity, so I'm not suggesting it isn't complex, but there are certainly plenty of folks on this forum who can attest to how tricky it is to scale high-write workloads in large organizations with many stakeholders and complex query loads.
> it's another thing entirely to write into many tables, index all of them and then query them at scale
Well, that’s pretty much what I am doing.
> PostgreSQL compiled with a non-standard block size
Do you think this could become less important for your use case with the new PG17 "I/O combining" stuff?
https://medium.com/@hnasr/combining-i-os-in-postgresql-17-39...
No, because the custom block size is about reducing zfs r/w amplification.
One of the things I find challenging is understanding the meaning of the word "scales". It is sometimes used differently in different contexts.
Can it be performant in high-load situations? Certainly. Can it elastically scale up and down based on demand? As far as I'm aware, it cannot.
What I'm most interested in is how operations are handled. For example, if it's deployed in a cloud environment and you need more CPU and/or memory, you have to eat the downtime to scale it up. What if it's deployed to bare metal and it cannot handle the increasing load anymore? How costly (in terms of both time and money) is it to migrate it to bigger hardware?
When it "scales", it usually means "scales up". A scalable solution is one that can withstand a large and increasing load, past the usual limitations of obvious solutions.
Being elastic is nice, but not always needed. In most cases of database usage, downsizing never happens, nor is it expected to happen: logically, data are only added, and any packaging and archiving only exist to keep the size manageable.
You’re conflating things. The question was about scaling compute and memory up and down based on load and you’re commenting about never needing to downsize on storage.
A database scaling dramatically up and down /under load/, while being expected to perform the same as at steady state, seems a bit weird vs. a single beefy database with a beefy ingest job and a bunch of read-only clients searching it?
Like, you're more likely to encounter two phases (building the DB in heavy-growth mode, and using the DB in light-growth, heavy-read mode).
A business that doesn't quite yet know what size the DB needs to be has a frightening RDS bill incoming.
And here I am, with ERP software at hand whose database I "should not query" with a few (3-6) joins, and likely on the order of 100,000s of records in total in the larger tables, because "the DB server will go down".
PeopleSoft?
Progress
Damn, that’s a chonky database. Have you written anything about the setup? I’d love to know more— is it running on a single machine? How many reader and writer DBs? What does the replication look like? What are the machine specs? Is it self-hosted or on AWS?
By the way, really cool website.
I'll try to get a blog post out soon!
> Damn, that’s a chonky database. Have you written anything about the setup? I’d love to know more— is it running on a single machine? How many reader and writer DBs? What does the replication look like? What are the machine specs? Is it self-hosted or on AWS?
It's self-hosted on bare metal, with standby replication, normal settings, nothing "weird" there.
6 NVMe drives in raidz-1, 1024GB of memory, a 96-core AMD EPYC CPU.
A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).
> By the way, really cool website.
Thank you!
> A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).
That's kind of where I'm at now... you can vertically scale a server so much now (compared to even a decade ago) that there's really no need, IMO, to bring in a lot of complexity for databases. Simple read replicas or a hot spare should be sufficient for the vast majority of use cases, and the hardware is way cheaper than a few years ago, relatively speaking.
I spent a large part of the past decade and a half using and understanding all the no-sql options (including sharding with pg) and where they're better or not. At this point my advice is start with PG, grow that DB as far as real hardware will let you... if you grow to the point you need more, then you have the money to deal with your use case properly.
So few applications need to serve beyond a few million simultaneous users, and if you avoid certain pitfalls, it's not that hard. Especially if you're flexible enough to leverage JSONB and a bit of denormalization for fewer joins, you'll go a very, very long way.
> you can vertically scale a server so much now
And you often don't really need to.
Just last week, for some small application, I was checking the performance of some queries and had to generate random data on a dev setup, which is a dockerized Postgres (with no tuning at all) in a VM on a basic Windows laptop. I inserted enough data to represent what could maybe be there in 20 years (some tables got half a billion rows; it's a small internal app). Still no problem chugging along.
It is crazy when you compare what you can do with databases now on modern hardware with how other software does not feel like it has benefited as much. Especially on the frontend side.
Considering the front end today is an amazingly flexible client app platform with flexible rendering, styling, and accessibility compared to a VB app a few decades ago... It's kind of amazing.
Only if you remember that my favorite websites in the late 90s took 15 seconds to load, because that's how long people would wait for a webpage to load at the time. Things have improved dramatically.
> accessibility
I'd like to see those accessible frontends. The majority is not usable keyboard-only.
I have done a lot of work for eLearning, govt and banking which required good usability. Also, mui is very good out of the box.
https://mui.com/material-ui/all-components/
I'm gonna doubt you when, on their own website, navigating is really bad.
Only tab and shift-tab work. Arrow keys are a bust. And the only visible shortcut is ctrl-K for the search input, and I think that's only because it comes as an Algolia default.
For something better I only have to look around the page at the browser itself: underlined letters in the menus tell me which alt+letter will open each menu. Then I can navigate using arrow keys, and most menu items are shown with a keyboard shortcut.
If I could show you some of the apps I've built with it, it would probably change your mind. A few had to go through testing and validation for accessibility. That, and I'm pretty firm on keyboard navigation for all things. I had to tweak the corporate colors a little bit to meet WCAG compliance on contrast.
One thing that was crazy was having to go through verification for blind usability, when the core function (validating scanned documents) requires a well-sighted user.
I won't say MUI is perfect... it isn't... but the only real point is that you can definitely go a lot farther in a browser than you can with what's in the box with most UI component libraries.
> I'll try to get a blog post out soon!
Please do.
> It’s self-hosted on bare metal, with standby replication, normal settings, nothing “weird” there.
16TB with nothing weird is pretty impressive. Our devops team reached for Aurora way before that.
> 6 NVMe drives in raidz-1, 1024GB of memory, a 96-core AMD EPYC CPU.
Since you're self-hosted, I'm guessing you aren't on AWS. How much is this setup costing you now, if you don't mind sharing?
> A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).
Beautiful!
> Since you're self-hosted, I'm guessing you aren't on AWS. How much is this setup costing you now, if you don't mind sharing?
About 28K euros of hardware per replica IIRC + colo costs.
Yearly 28k euros, I presume?
Damn. I hope you make enough revenue to continue. This is pretty impressive.
No, one time + ongoing colocation costs.
So that's about 467 eur per month per server assuming a 5 year term. Anyone know what it would be on AWS with Aurora? I had a quick go with https://calculator.aws/ and ended up with a 5-figure sum per month.
I tried for fun:
https://calculator.aws/#/estimate?id=cfc9b9e8207961f777766e1...
Seems like it would be 160k USD a month.
I could not input my actual IO stats there, I was getting:
Baseline IO rate can't be more than 1000000000 per hour.
The CPU itself is around $8-10k for a top-end AMD EPYC; $15-20k for the rest of the server, including memory and storage, is probably about right. There are still $100k+ servers, but they tend to be AI equipment at this point, not general-purpose stuff, which is sub-$30-50k now.
I mean no disrespect, but it is stunning how hard the idea of owning your own hardware is to a large percentage of the tech population.
You can just… own servers. I have five in a rack in my house. I could pay a colo a relatively small fee per month for a higher guarantee of power and connectivity. This idea also scales.
> 16TB with nothing weird is pretty impressive. Our devops team reached for Aurora way before that.
Probably depends on the usage patterns too. Our developers commit atrocities in their 'microservices' (which are not micro, or services, but that's another discussion).
I continue to find horror shows during incidents; sometimes not even the cause, merely a tangential rabbit hole I wandered down.
“Are you… are you storing images in BLOBS?”
“Yes. Is that bad?”
Your replies are really valuable and informative. Thank you so much.
Question - what is your peak utilization % like? How close are you to saturating these boxes in terms of CPU etc?
I’d say 60-70% overall cpu usage, including database, ingest workers, web app and search.
> Your replies are really valuable and informative. Thank you so much.
Thank you!
> 6 NVMe drives in raidz-1
Did you benchmark io rate with different ZFS layouts?
6 NVMe drives in mirrored pairs would probably give substantially lower latency and higher throughput.
Though you'd probably need more pairs of drives to match your current storage size. Or get higher capacity NVMe drives. :)
I'm also self-hosting Postgres, and the project is getting to the point where a standby would be a good idea to ensure higher availability.
Did you use any particular guide for setting up replication? Also, how do you handle failover/fallback to/from standby please?
Not OP, but I managed a ~1TB Postgres install for years. You should use something like pgbackrest or barman to help with replication (replicas can pull WAL from your backups when catching up), backups, and failovers.
At least for pgbackrest, set up a spool directory which allows async wal push / fetch.
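A rough sketch of the Postgres side of that (the stanza name is made up; the async and spool settings themselves live in pgbackrest.conf, not in Postgres):

  -- Ship WAL through pgbackrest so backups and replicas share one repository.
  ALTER SYSTEM SET archive_mode = 'on';   -- needs a restart to take effect
  ALTER SYSTEM SET archive_command = 'pgbackrest --stanza=main archive-push %p';
  SELECT pg_reload_conf();

  -- On replicas, WAL that streaming can't supply is fetched from the repo,
  -- e.g. in postgresql.conf:
  --   restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
  -- In pgbackrest.conf, archive-async=y and spool-path=/var/spool/pgbackrest
  -- enable the async WAL push/fetch mentioned above.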
> It's self-hosted on bare metal, with standby replication, normal settings, nothing "weird" there.
I can build scalable data storage without a flexible scalable redundant resilient fault-tolerant available distributed containerized serverless microservice cloud-native managed k8-orchestrated virtualized load balanced auto-scaled multi-region pubsub event-based stateless quantum-ready vectorized private cloud center? I won't believe it.
+1 as I'm hoping this is sarcastic humor.
from riches to RAG.
"Postgres does not scale" means that you can't just give it more machines, which is true. At some point you've got the fastest available machine, maybe you've played with the settings, and that's it. Then you can embark on the kind of manual sharding journey the article describes.
But most of the time, an RDBMS is the right tool for the job anyway, you just have to deal with it.
> "Postgres does not scale" means that you can't just give it more machines, which is true.
Well, it’s only true for writes.
It's still true for reads if you need them to be fully consistent, but yeah, read replicas are the answer sometimes.
Regarding MerkleMap: Ouch! I never realized Let's Encrypt is leaking all of the subdomains I'm creating. :-/ I thought the names were giving me a bit of extra security through obscurity.
Why does it do that? I thought only revocations need to be published?
It does that because it's required to. The system is called Certificate Transparency[1] and browsers require certificates to be in CT logs to be accepted.
If you want to hide what subdomains you have you can use a wildcard certificate, though it can be a bit harder to set up.
[1]: https://developer.mozilla.org/en-US/docs/Web/Security/Certif...
Thanks for explaining!
Having some issues with your numbers, but I’m probably just missing something…
If you insert 150K rows per second, that’s roughly 13 Billion rows per day.
So you’re inserting 10%+ of your database size every day?
That seems weird to me. Are you pruning somewhere? If not, is your database less than a month old? I’m confused.
Well, that's why I said "ATM", it's not a sustained rate, all the time. And yes, there's a bunch of DELETEs too.
> ATM, I insert about 150,000 rows a second, run 40,000 transactions a second, and read 4 million rows a second.
Silly question but is this at the same time, regular daily numbers or is that what you've benchmarked?
You have a decent documentation portal at https://www.merklemap.com/documentation with related articles and API examples, great work! By the way, did you use a markdown-to-HTML converter to build the developer portal, or is this just a simplified Redocly version of it?
It’s mdx, with nextjs.
Thanks for the kind words!
I used to run a bunch of Postgres nodes at a similar scale. The most painful parts (by far) were restoring to a new node and major version upgrades.
Any tricks you used for those parts?
> were restoring to a new node
Zfs send / recv or replication.
> I used to run a bunch of Postgres nodes at a similar scale. The most painful parts (by far) were restoring to a new node and major version upgrades. Any tricks you used for those parts?
Replication makes this pretty painless :)
So run a replica? Is there more literature on that?
> So run a replica?
Basically, yes.
https://www.postgresql.org/docs/current/runtime-config-repli...
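A minimal streaming-replication sketch of what that involves (host, user, and slot names are made up):

  -- On the primary:
  ALTER SYSTEM SET wal_level = 'replica';   -- already the default since PG 10
  SELECT pg_create_physical_replication_slot('standby1');

  -- Seed the standby with pg_basebackup (its -R flag writes standby.signal
  -- and primary_conninfo for you), or set it up by hand in postgresql.conf:
  --   primary_conninfo  = 'host=primary.internal user=replicator application_name=standby1'
  --   primary_slot_name = 'standby1'
  --   hot_standby       = on

Failover is then essentially promoting the standby (SELECT pg_promote(), or pg_ctl promote) and repointing clients.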
Super interesting. Compiling PG, I assume, with the same block size as ZFS! It was always on our todo list to try, but we never got around to it. If possible, what block size did you end up with? Have you tried ZFS direct IO in 2.3.x, and if so, could you share any findings? Thanks for sharing - and cool website!
I don’t think Postgres will be able to benefit from direct io? I might be wrong though!
I use Postgres with 32K BLKSZ.
I am actually using the default 128K ZFS recordsize. In a mixed workload, I found overall performance nicer than matching at 32K, and compression is way better.
> Thanks for sharing - and cool website!
Thank you!
Curious. At that scale and transaction rate, are you deleting / offloading rows to another storage solution after some amount of time? I’m assuming you’re not just letting 150,000 rows a second accumulate indefinitely.
> I insert about 150,000 rows a second
That's amazing - I would love to know if you have done careful data modeling, indexing, etc. that allows you to get to this, and what kind of data is being inserted?
I am not optimizing too much around insertion speed. I avoid GIN, GIST and hash indexes.
The schema is nicely normalized.
I had troubles with hash indexes requiring hundreds of gigabytes of memory to rebuild.
Postgres B-Trees are painless and (very) fast.
Eg. querying one table by id (redacted):
Here’s a zpool iostat 1 (output redacted).
Have you measured _transactions_ per second? For both “insert only” and “mixed insert/update/delete”? This might be a more telling metric in terms of the stress it puts on the system. There’s a big difference between 1 transaction with 150k inserts, and 150k transactions with 1 insert each – especially when the workload shifts to include updates in the middle of those.
If you’re doing insert only, you might benefit from copying directly to disk rather than going through the typical query interface.
thank you for the response and stats! very cool
150k rows/second is pretty ordinary if you do batching, especially for a machine of that size. I was able to get 250k+ row inserts on a 16vCPU on a table with 16 columns, a third of which are strings >30 bytes. Pretty sure you can push that much higher with COPY.
that's true, you can get very far with COPY
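For a sense of the difference (a sketch with a made-up table; both statements are standard Postgres):

    -- Multi-row INSERT: one round trip per batch instead of per row.
    INSERT INTO events (ts, source, payload)
    VALUES (now(), 'a', '{}'), (now(), 'b', '{}'), (now(), 'c', '{}');

    -- COPY streams rows in bulk and skips most per-statement overhead.
    -- Driven from a client library, or via \copy in psql.
    COPY events (ts, source, payload) FROM STDIN WITH (FORMAT csv);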
Curious as to why it’s that many inserts? The whole CT ecosystem sees about 30-40,000 new certs/precerts a minute, IIRC. Love MerkleMap though!
There’s backlog I still need to process :)
Because no one else cares to ask, are you running FreeBSD? I ask because you use ZFS.
Debian!
Damn, a stud! ZFS on Linux.
Ahaha :)
Can you please share some tips and tricks for achieving such high throughput?
That's amazing! Could you elaborate more on your VACUUM planning?
Mostly, be careful with long-running transactions (hours long), and modify your autovacuum settings to be as aggressive as possible based on your workload. Be careful with the freezing threshold too.
I ran into that quite a bit when starting out :)
Can you recommend a good rule of thumb for autovacuum settings given a large workload like yours?
Here's mine:
- autovacuum_max_workers to my number of tables (Only do so if you have enough IO capacity and CPU...).
- autovacuum_naptime 10s
- autovacuum_vacuum_cost_delay 1ms
- autovacuum_vacuum_cost_limit 2000
You probably should read https://www.postgresql.org/docs/current/routine-vacuuming.ht... it's pretty well written and easy to parse!
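Those settings as SQL, in case it helps (a sketch of the values listed above; the max_workers number is just an example, and that one needs a restart rather than a reload):

    ALTER SYSTEM SET autovacuum_max_workers = 16;           -- example value; restart required
    ALTER SYSTEM SET autovacuum_naptime = '10s';
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '1ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    SELECT pg_reload_conf();                                 -- applies the reloadable ones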
Do you ever need to VACUUM FULL?
It’s too slow at that scale, pg_squeeze works wonders though.
I only “need” that because one of my tables requires batch deletion, and I want to reclaim the space. I need to refactor that part. Otherwise nothing like that would be required.
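For anyone curious, the basic pg_squeeze call looks roughly like this (the signature has changed across releases and the table name here is made up, so check your version's docs):

    -- pg_squeeze must be listed in shared_preload_libraries first.
    CREATE EXTENSION pg_squeeze;

    -- Rewrites the table with only a brief exclusive lock at the end,
    -- instead of the long lock VACUUM FULL would hold.
    SELECT squeeze.squeeze_table('public', 'my_bloated_table');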
Yes, I was asking because you had mentioned deletes. Thanks for the answer, cool stuff!
99.9% of the companies in the world will never need more than 1 beefy box running postgres with a replica for a manual failover and/or reads.
Availability is trickier than scalability. An async replica can lose a few recent writes during a failover, and a synchronous replica is safer but slower. A company using some platform might not even know which one they're using until it bites them.
99.9% of companies also aren't going to feel the performance difference of synchronous replication.
That being said, the setups I typically see don't even go that far. Most companies don't mitigate for the database going down in the first place. If the db goes down they just eat the downtime and fix it.
Skype open-sourced their architecture way back, using PL/Proxy to route calls based on shard. It works, is quite elegant, handled 50% of all international phone calls in the noughties. My old company used it to provide real-time analytics on about 300M mobile devices.
https://wiki.postgresql.org/images/2/28/Moskva_DB_Tools.v3.p...
https://s3.amazonaws.com/apsalar_docs/presentations/Apsalar_...
Skype has had from the beginning the requirement that all database access must be implemented through stored procedures.
That presentation starts with hard violence.
If the database team designed a thoughtful API with stored procedures, this can actually be a quite nice way to interact with a database for specific uses.
Being 100% hard and fast on that rule seems like a bad idea though.
Fashionable 20 years ago but thankfully everyone who had that bee in their bonnet seems to have retired.
I think it's a perfectly fine approach in 2025 now that CI/CD has proliferated and you're less likely to run into a human DBA arbitrarily blocking deployments. It was always the feudal-lord mentality of DB operations that made relying so much on stored procedures undesirable.
It'll make a comeback once stored procedures can be easily written in real programming languages using standard tools.
You already can [0]. C, PL/Perl, PL/Python, and PL/Tcl exist out of the box, in addition to PL/pgSQL, which I assume you were implying isn’t a “real programming language.”
[0]: https://www.postgresql.org/docs/current/server-programming.h...
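For instance, a trivial function in PL/Python (a sketch; assumes the plpython3u extension is available, which is an untrusted language, so creating such functions needs a superuser):

    CREATE EXTENSION IF NOT EXISTS plpython3u;

    CREATE FUNCTION word_count(txt text) RETURNS integer AS $$
      # Plain Python running inside the server process.
      return len(txt.split())
    $$ LANGUAGE plpython3u;

    SELECT word_count('stored procedures in a real language');  -- returns 6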
Sure, even PL/Rust is available. But what does the deployment pipeline look like, from version control to a live instance?
That is described in the link, in section 36.18, albeit for C. If you have another language, I assume you get to figure out how to make it work.
[0]: https://www.postgresql.org/docs/current/extend-pgxs.html
Few people seem to be using these. At a guess the dev experience is not great.
Frankly, I think devs need to stop being such prima donnas about DX. Not everything has a slick UI that does most of the work for you. Take Linux, for example: you have to email your PR as a patch, inline as plaintext to a mailing list. Yet somehow they manage, and it has an incredibly active community.
Alternative programming languages are all about DX. PL/pgSQL is Turing complete, after all.
It is indeed, but I’ll be the first to admit it’s a bit clunky to use beyond a few variables and actions.
Sorry for snapping. I’m exhausted with devs complaining that some older and well-established piece of tech (Postgres, HAProxy, nginx to name a few) isn’t easy enough to use, and then using something demonstrably worse, or writing their own terrible version of it. Work trauma.
Righteous violence.
One of the benefits of stored procedures they don't mention is SECURITY DEFINER, which is like setuid.
You can, for instance, have a user table with login and hashed password, and a stored procedure that can verify login and password, without giving SELECT access on the user table to the database user your application uses.
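A minimal sketch of that pattern (the table, role and function names are made up; assumes pgcrypto for crypt()):

    -- Owned by a privileged role; the app role gets EXECUTE but no SELECT on users.
    CREATE FUNCTION check_login(p_login text, p_password text)
    RETURNS boolean
    LANGUAGE sql
    SECURITY DEFINER
    SET search_path = public
    AS $$
        SELECT EXISTS (
            SELECT 1
            FROM users
            WHERE login = p_login
              AND password_hash = crypt(p_password, password_hash)
        );
    $$;

    REVOKE ALL ON users FROM app_user;
    GRANT EXECUTE ON FUNCTION check_login(text, text) TO app_user;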
Stored procedures (with bound parameters) also help block SQL injection attacks.
I would much rather this than letting devs write their own queries, ORM or not. But then, I own the DBs, and I am constantly dealing with the result of poor schema and query decisions.
Probably just as useful is the overview of what PgDog is and the docs. From their docs[1]: "PgDog is a sharder, connection pooler and load balancer for PostgreSQL. Written in Rust, PgDog is fast, reliable and scales databases horizontally without requiring changes to application code."
[1] https://docs.pgdog.dev/
Another option is going full-scale with CockroachDB. We had a Django application backed by PostgreSQL, which we migrated to CockroachDB using their official backend.
The data migration was a pain, but it was still less painful than manually sharding the data or dealing with 3rd party extensions. Since then, we’ve had a few hiccups with autogenerated migration scripts, but overall, the experience has been quite seamless. We weren’t using any advanced PostgreSQL features, so CockroachDB has worked well.
Unless their pricing has changed, it’s quite exorbitant when you need a lot of data. To the point that one year of cockroachdb would cost 5x the cost of the server it was running on.
I think you're referring to the CockroachDB Cloud DBaaS offering vs. CockroachDB itself, correct?
Not the parent but yeah, most likely. But then again, you probably don’t want to maintain your own deployment of Cockroach fleet.
This is still true. I wouldn’t use Cockroach if it were my own business. Also, they don’t offer any free version to try out the product. All you get is a short trial period and that’s it.
> Also, they don’t offer any free version to try out the product.
The site makes it seems as if I can install CockroachDB on Mac, Linux, or Windows and try it out for as long as I like. https://www.cockroachlabs.com/docs/v25.1/install-cockroachdb... Additionally, they claim CockroachDB Cloud is free for use "up to 10 GiB of storage and 50M RUs per organization per month".
> The site makes it seems as if I can install CockroachDB on Mac, Linux, or Windows and try it out for as long as I like.
There are limitations in terms of licenses [0].
[0]: https://www.cockroachlabs.com/docs/v25.1/licensing-faqs
Licenses are also renewed yearly. It's not like you get one license and can use it forever. No, you need to enter the new license into your instance every year; if you do not, after 2 weeks it goes into crippled mode (as in, you cannot run anything that is not a personal blog).
And those free licenses have this dirty little clause that you are not entitled to a license, they need to APPROVE a free license. Now that is even more scary.
They pulled all this because people kept using the free core version and simply never upgraded/wanted more. That is why all these changes happened. Coincidentally, the buzz around CRDB has died down to the point that most talk about CRDB consists of these rare mentions here (even reddit is as good as dead). 98% of the mentions of how great CRDB is originate from Cockroach Labs. Do a Google search, limit the time range, and go page by page... They are an enterprise-only company at this point.
> Additionally, they claim CockroachDB Cloud is free for use "up to 10 GiB of storage and 50M RUs per organization per month".
Take into account that the constant CPU/mem/query monitoring CRDB does eats up around 20 to 30M RUs per month. Some people have complained about why their free instances lost so much capacity. And those RUs are not 1:1, like, you do an insert and it's 1 RU? No, it's more like 9 to 12 RUs or something.
It's very easy to eat all those RUs on a simple website, let alone something that needs to scale. Trust me, you're better off self-deploying, but then you get to enjoy the issue of the new licenses / forced telemetry / forced phone-home, or spend $125+ per vCPU (good luck finding out the price; we only know these numbers from people breaking NDA'd offers). They are very aggressive in sales.
It's not worth it to tie your company to a product that can change licenses on a whim and charges Oracle prices (and uses the same tactics). I am very sure that some of their sales staff are ex-Oracle employees. ;)
Oh yeah, you can run docker-compose and play with the local version as long as you want. But their cloud offers are limited and quite expensive.
You can use official binary in production deployment - you just need to manage it yourself like you would manage Postgres.
> Another option is going full-scale with CockroachDB
Just beware that CockroachDB is not a drop-in replacement for PostgreSQL.
Last time I looked it was missing basic stuff. Like stored functions. I don't call stored functions an "advanced feature".
I also wonder what limitations you get on queries. Cause in Spanner, there's something about cross-directory queries, or Citus can't do cross-shard FKs iirc. There's no magic way to shard a relational DB, you have to deal with it one way or another.
Cockroach really doesn’t have limitations when it comes to queries. It is kinda magic in that regard. The single partition queries are the special case and are treated as rare. Cockroach ranges are quite small compared to other systems (~256MiB).
Interesting, I'll have to try it out at some point.
> Just beware that CockroachDB is not a drop-in replacement for PostgreSQL.
Oh, it's way worse than that... A ton of small functionality tends to be missing or work differently. Sometimes it's small stuff like column[1:2] simply not existing in CRDB, and other times it's things like ROW LEVEL SECURITY... you know, something you may want on a massively distributed database that will be used for tenant setups (I hear they are finally going to implement it this year).
The main issue is the performance... You are spending close to ~4x the resources for a similarly performing setup. And that is 2x Postgres (sync) vs 3x CRDB (where every node is just a replica). It's not the replication itself but the massive overhead of the Raft protocol plus the much less optimized query planner.
To match Postgres, you're required to deploy around 10 to 12 nodes, on hardware with about twice the resources. That is the point where CRDB performs on a similar level.
The issue is, well, you increase the resources of your PostgreSQL server and the gap is back. You can scale PostgreSQL to insane levels with 64, 96 CPUs... Or imagine, you know, just planning your app in advance in such a way that you spread your load over multiple PostgreSQL instances. Rocket science, folks ;)
CRDB is really fun to watch, with the built-in GUI and the very visual replication, but it's a resource hog. The storage system (Pebble) eats a ton of resources compacting data, when you can simply solve that with a PostgreSQL instance on ZFS (which, ironically, often compresses better).
I do not joke when I say that seeing those steep CPU spikes in the night hours, from the compactor working, is painful. Even a basic empty CRDB instance, just logging its own usage, constantly runs between 7 and 50% CPU on a quad ARM N1.
PostgreSQL? You do not even know it's running. Barely any CPU usage, and what memory usage?
And we have not even talked about license/payment issues... a "free enterprise" version with FORCED telemetry on, AND you're not allowed to hide the server (aka, if it cannot call home, it goes into restricted mode in 7 days, with something like 50 queries/second... aka the same as just shutting down your DB). By the way, some people reported that it's $125+ per vcore. Given that even the most basic CRDB setup means 3 instances with a minimum of 4 cores each... do the math. Yes, that only kicks in after $10M income, but they change their licenses every few years, so who knows about next year or the year after that.
Interesting product, shitty sales company behind it. I am more interested to see when the PostgreSQL storage extension OrioleDB comes out, so it solves the main issue that prevents PostgreSQL from scaling even more, namely the write/vacuum issue. And of course a better solution for upgrading PostgreSQL versions; there CRDB is leaps and bounds better.
I'm glad you brought up the migration, because one of the main goals behind our project is to automate migrating to a sharded deployment. You can think of your DB as the base case (num_shards = 1), and PgDog as the recursive solution.
Automatic data transfer would be super cool to get out of the box. We had a custom multi-tenancy solution for our application that heavily used PostgreSQL schemas. One schema for each customer.
It was a pain to get that to work with Cockroach since it doesn’t optimize cross-schema queries and suggests one DB per customer. This was a deal breaker for us and we had to duplicate data to avoid cross-schema queries.
Being able to live within Postgres has its advantages.
Well, of course it does! :)
Another (battle tested * ) solution is to deploy the (open source) Postgres distribution created by Citus (subsidiary of Microsoft) on nodes running on Ubuntu, Debian or Red Hat and you are pretty much done: https://www.citusdata.com/product/community
Slap good old trusty PgBouncer in front of it if you want/need (and you probably do) connection pooling: https://www.citusdata.com/blog/2017/05/10/scaling-connection...
*) Citus was purchased by Microsoft more or less solely to provide easy scale out on Azure through Cosmos DB for PostgreSQL
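"Pretty much done" then looks roughly like this on the coordinator (a sketch; create_distributed_table is Citus' API, the table and column are made up):

    -- citus must be in shared_preload_libraries on every node.
    CREATE EXTENSION citus;

    CREATE TABLE events (
        tenant_id bigint NOT NULL,
        id        bigserial,
        payload   jsonb,
        PRIMARY KEY (tenant_id, id)   -- the distribution column must be part of unique constraints
    );

    -- Hash-distribute the table by tenant_id across the worker nodes.
    SELECT create_distributed_table('events', 'tenant_id');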
Is it really that easy? What are the edge cases?
It's not. We tried. Plus, it doesn't work on RDS, where most production databases are. I think Citus was a great first step in the right direction, but it's time to scale the 99% of databases that don't already run on Azure Citus.
That's because Amazon wants to do whatever they like themselves... you can apparently get things to work by running your own masters (with the Citus extension) on EC2, backed by Postgres RDS workers:
https://www.citusdata.com/blog/2015/07/15/scaling-postgres-r... (note that this is an old blog post -- pg_shard has been succeeded by Citus, but the architecture diagram still applies)
And I say "apparently" because I have no experience dealing with large databases on AWS.
Personally I've had no issues with Citus either, both on bare metal/VMs and as SaaS on Azure...
Depends on your schema, really. The hard part is choosing a distribution key to use for sharding- if you've got something like tenant ID that's in most of your queries and big tables, it's pretty easy, but can be a pain otherwise.
Same pain as with good old (native) partitioning, right? :)
As with partitioning, in my experience something like a common key (identifying data sets), tenant id and/or partial date (yyyy-mm) work pretty great
For a multi-tenant use case, yeah, pretty close to thinking about partitioning.
For other use cases, there can be big gains from cross-shard queries that you can't really match with partitioning, but that's super use case dependent and not a guaranteed result.
I’m working with several Postgres databases that share identical schemas, and I want to make their data accessible from a single interface.
Currently, I’m using Postgres FDWs to import the tables from those databases. I then create views that UNION ALL the relevant tables, adding a column to indicate the source database for each row.
This works, but I’m wondering if there’s a better way — ideally something that can query multiple databases in parallel and merge the results with a source database column included.
Would tools like pgdog, pgcat, pganimal be a good fit for this? I’m open to suggestions for more efficient approaches.
Thanks!
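For reference, the FDW + UNION ALL approach described above looks roughly like this (server, schema and table names are all made up):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER region_a FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'db-a.internal', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER region_a
        OPTIONS (user 'readonly', password 'secret');

    CREATE SCHEMA region_a_schema;
    IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
        FROM SERVER region_a INTO region_a_schema;

    -- Repeat for region_b, then stitch the tables together with a source column.
    CREATE VIEW all_orders AS
        SELECT 'region_a' AS source_db, * FROM region_a_schema.orders
        UNION ALL
        SELECT 'region_b' AS source_db, * FROM region_b_schema.orders;

On Postgres 14+, setting async_capable on the foreign servers may let the branches of the UNION ALL be scanned concurrently, if I recall correctly, which gets at the parallelism part of the question.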
I'm kind of interested in why we can't make a better database with all of our modern technology.
Postgres is a fantastic workhorse, but it was also released in the late 80s. Who, who among you will create the database of the future... And not lock it behind bizarro licenses which force me to use telemetry.
There are "better" databases but they're better given some particular definition that may not be relevant to your needs. If SQL/the relational model and ACID semantics is what you need then postgres is simply the best in class. The fact it dates back to the 80s is probably an advantage (requirement?) when it comes to solving a problem really well
I guess I’d ask why something having been first released in the late 80s, or any decade, is a positive or a negative? Some things are still used because they solve the problems people have. Some things are still used just because of industry capture. I’m not honestly sure where I’d put Postgres.
Are there specific things you’d want from a modern database?
Relating to the article, better scaling. Saying run it on a bigger box is a very brute force way to optimize an application.
While they come up with some other tricks here, that's ultimately what scaling Postgres means.
If I imagine a better database, it would have native support for scaling, a Postgres-compatible data layer, as well as first-party support for NoSQL (JSONB columns don't cut it, since if you have simultaneous writes, unpredictable behavior tends to occur).
It needs to also have a permissible license
What does permissible license mean? If you mean open source, no such database exists AFAIK. If you mean you can run it locally for free for dev purposes, on prem without telemetry etc, then Oracle is clearly the best option.
Compared to Postgres, Oracle DB:
• Scales horizontally with full SQL and transactional consistency. That means both write and read masters, not replicas - you can use database nodes with storage smaller than your database, or with no storage, and they are fully ACID.
• Has full transactional MQ support, along with many other features.
• Can scale elastically.
• Doesn't require vacuuming or have problems with XID wraparound. These are all Postgresisms that don't affect Oracle due to its better MVCC engine design.
• Has first party support for NoSQL that resolves your concern (see SODA and JSON duality views).
I should note that I have a COI because I work part time at Oracle Labs (and this post is my own yadda yadda), but you're asking why does no such database exist and whether anyone can make one. The database you're asking for not only exists but is one of the world's most popular databases. It's also actually quite cheap thanks to the elastic scaling. Spec out a managed Oracle DB in Oracle's cloud using the default elastic scaling option and you'll find it's cheaper than an Amazon Postgres RDS for similar max specs!
Does Oracle offer a Firebase-like application layer with authentication and stuff?
Can you get me some Oracle cloud credits ?
First positive thing I’ve ever heard about an oracle project !
Oracle Cloud (called OCI) has an always-free offering, you don't need credits. You can just sign up and use some small quantity of resources for nothing indefinitely. That includes a managed Oracle database:
https://docs.oracle.com/en/cloud/paas/autonomous-database/se...
There's also some sort of startup credits program with a brochure here, apparently you can just fill out a form and get some credits with an option to apply for more. But I don't know much about that. I've used the always-free programme for some personal stuff and it worked fine so I never needed to think about credits.
https://www.oracle.com/a/ocom/docs/free-cloud.pdf
I have to admit I'm not really familiar with Firebase, I thought that was some managed service for mobile apps, but Oracle DB comes with some stuff that sounds similar. And Oracle Cloud is an AWS-style cloud, it has a ton of high level services for things.
ORDS is a REST binding layer that lets you export tables, views, stored procedures and NoSQL JSON document stores over HTTP without writing a middleman server yourself. You can drive it directly from the browser. ORDS supports OAuth2 or can be integrated with custom auth schemes from what I understand. I've not used ORDS myself yet but probably will in the near future.
https://www.oracle.com/database/technologies/appdev/rest.htm...
Firebase IIRC when it first launched was known for push streaming of changes. Oracle DB lets you subscribe to the results of SQL queries and get push notifications when they change, either directly via driver callbacks or into a message queue for async processing later. It's pretty easy to hook such notifications up to web sockets or SSE or similar, in fact I've done that in my current project.
There's also a thing called APEX which is a bundled visual low-code app builder. I've never used it but I've used apps built with it, and it must be quite flexible as they all had a lot of features and looked very different. You can tell you're using an APEX app because they have a lot of colons in the URLs for some reason. Here's a random example of one from outside of Oracle that exports a database of dubious scientific research papers:
https://dbrech.irit.fr/pls/apex/f?p=9999:1::::::
I'm not holding it up as a great example, there are probably better examples out there, it's just a one that came to mind that's public and I used before.
Firebase is basically a back-end server in a box. Authentication, check. File storage, check. Databases, check. Firebase Functions, which allow for complex logic written in JavaScript, check. Hosting, check. Supabase comes really close, but has been much more difficult in actual use.
I'm working on a fully open source game right now, and I can't ethically tell people to hook into a closed source service. But I must admit, having to use supabase instead of firebase has made this much harder than it needs to be.
Hm right well you can certainly write apps that run fully inside the database and use its services, so with ORDS doing OAuth2 I guess that gets close to something like Firebase. You can run logic written in JS easily too, it has an embedding of graaljs which is a fast JS engine that's well integrated with the rest of the database environment.
> Supabase comes really close, but has been much more difficult in actual use
supabase team here - can you share more about the challenges? We'd love your feedback so that we can fix any difficulties for the future
Can you please expand on the JSONB unpredictable behavior? We are about to embark on a journey to move some of our data from MongoDB to postgres (with some JSONB). While we don't have significant concurrent writes to a table, would be very helpful to understand the issues
https://github.com/shrinerb/shrine/discussions/665
I've never personally encountered this, but I've seen other HN contributors mention it. https://news.ycombinator.com/item?id=43189535
From what I can tell, unlike mongo, some postgres queries will try to update the entire JSONB data object vs a single field. This can lead to race conditions.
Ah, thanks. The first link seems something specific to Shrine. The bottomline is concurrent updates to different parts of JSONB need row level locking for correct behavior in Postgresql. This is not an important issue for us. Thank you for the pointers
You can trivially do it without locking using a version column.
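A sketch of that, combining a targeted jsonb_set with an optimistic version check (the table is made up):

    -- Only touches one key, and only wins if nobody updated the row since we read it.
    UPDATE documents
    SET data    = jsonb_set(data, '{settings,theme}', '"dark"'),
        version = version + 1
    WHERE id = 42
      AND version = 7;  -- the version the client read; 0 rows updated means retry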
You may be interested in FerretDB[1].
[1] https://www.ferretdb.com/
What's an example of non-brute force scaling?
The problem with Postgres scaling is that you have to have a single master, which means horizontal scaling really only gives you more reads and a failover. Eventually you won't be able to find a server big enough to handle all the writes, and if you get enough reads with even a small number of writes, single-master setups fall over. Distributed computing gets complicated very quickly, but the gist here is basically that you need to be able to have multiple instances that can accept writes. Lots of literature on this, but good starting points imo would be Lamport's "Time, Clocks" paper https://lamport.azurewebsites.net/pubs/time-clocks.pdf and the Dynamo paper https://www.allthingsdistributed.com/files/amazon-dynamo-sos...
Something like a queuing system, re-architecting the database to handle a higher load without just throwing more machines on it. A cache.
Why is it brute force and why is it bad ?
Postgres 17.4 was released last month. Show some respect to the devs.
Anyone who creates a better database is going to want to get paid for it which either means DBaaS or those weird licenses.
Postgres is the most worked-on database in the world right now. Its original release date doesn't mean work stopped. NoSQL was a new thing a decade-ish ago, but most companies probably don't need it. New data platforms focused on scaling, like Snowflake and Cockroach, have come too, but again, for most use cases Postgres is better.
Look at AWS presentation/talk about Aurora DSQL [1]
It’s a Postgres facade. But everything beyond that is a complete reimagining and a rewrite to scale independently.
I personally think it’s going to eat a lot of market share when it solves some remaining limitations.
[1] https://youtu.be/huGmR_mi5dQ?si=ALw4XjdDJBxkZWRv
Have you looked at CockroachDB? PostgreSQL compatibility with modern comforts (e.g. easy cloud deployments, horizontal scaling, memory safe language)
Came here to say this. Cockroach solves the sharding issue by adopting consistent hashing-based data distribution, as described in the Dynamo paper. However, their cloud solution is a bit expensive to get started with.
Does CockroachdB already support ltree?
It doesn’t. It doesn’t support most of the Postgres extensions. We got away with it because we weren’t doing anything beyond vanilla Postgres.
Something I don't see in the pgdog documentation is how cross-shard joins work. Okay, if I do a simple `select * from users order by id`, you'll in-memory order the combined results for me. But if I have group by and aggregations and such? Will it resolve that correctly?
Aggregates are a work in progress. We're going to implement them in this order:
1. count
2. max, min, sum
3. avg (needs a query rewrite to include count; see the sketch below)
Eventually, we'll do all of these: https://www.postgresql.org/docs/current/functions-aggregate..... If you've got a specific use case, reach out and we'll prioritize.
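To illustrate the avg rewrite mentioned in point 3 (a sketch of the general cross-shard technique, not PgDog's actual code):

    -- What the client sends:
    SELECT avg(amount) FROM payments;

    -- What each shard can answer on its own:
    SELECT sum(amount) AS s, count(amount) AS c FROM payments;

    -- The proxy then combines the per-shard rows:
    -- avg = sum(s) / sum(c), which is why the rewrite has to add count.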
Heh, no chance I can introduce this at work and hard to have a personal project requiring it. :)
I think you probably need some documentation to the effect of the current state of affairs, as well as prescriptions for how to work around it. _Most_ live workloads, even if the total dataset is huge, have a pretty small working set. So limiting DB operations to simple fetches and doing any complex operations in memory is viable, but it should be prescribed as the solution, or people will consider its omission a fault instead of a choice.
I ended up adding support for GROUP BY: https://github.com/pgdogdev/pgdog/pull/43
I had it in the back of my mind for a while, nice to have it in code. Works pretty well, as long as columns in GROUP BY are present in the result set. Otherwise, we would need to rewrite the query to include them, and remove them once we're done.
No worries. It's early days, the code and docs are just a few months in the making. I'm happy to keep you updated on the progress. If you want, send your contact info to hi@pgdog.dev.
- Lev
People talk about scale frequently as a single dimension (and usually volume as it relates to users) but that can be oversimplifying for many kinds of applications. For instance, as you are thinking about non-trivial partitioning schemes (like if there is high coupling between entities of the same kind - as you see in graphs) is when you should consider alternatives like the Bigtable-inspired DBs, since those are (relatively) more batteries included for you.
> It’s funny to write this. The Internet contains at least 1 (or maybe 2) meaty blog posts about how this is done
It would’ve been great to link those here. I’m guessing one refers to StackOverflow which has/had one of the more famous examples of scaled Postgres.
Notion wrote about their experience before:
https://www.notion.com/blog/sharding-postgres-at-notion
https://www.notion.com/blog/the-great-re-shard
I was thinking of the Instagram post years ago. And maybe the Instacart one.
Maybe this one for Instagram? https://instagram-engineering.com/sharding-ids-at-instagram-...
I run a postgresql db with a few billion rows at about 2TB right now. We don't need sharding yet but when we do I was considering Citus. Does anyone have experience implementing Citus that could comment?
It can be great, depending on your schema and planned growth. Questions I'd be asking in your shoes:
1. Does the schema have an obvious column to use for distribution? You'll probably want to fit one of the 2 following cases, but these aren't exclusive:
2. Are you sure you're going to grow to the scale where you need Citus? Depending on workload, it's not too hard to have a 20TB single-server PG database, and that's more than enough for a lot of companies these days.
3. When do you want to migrate? Logical replication into Citus should work these days (haven't tested myself), but the higher the update rate and the larger the database, the more painful this gets. There's not a lot of tooling that is very useful for the more difficult scenarios here, but the landscape has changed since I last had to do this.
4. Do you want to run this yourself? Azure does offer a managed service, and Crunchy offers Citus on any cloud, so you have options.
5. If you're running this yourself, how are you managing HA? pg_auto_failover has some Citus support, but can be a bit tricky to get started with.
I did get my Citus cluster over 1 PB at my previous job, and that's not the biggest out there, so there's definitely room to scale, but the migration can be tricky.
Disclaimer: former Citus employee
Interesting technology. Similar to Citus but not built as an extension. The Citus coordinator, which is a Postgres database with the Citus extension, is replaced by a proxy layer written in Rust. That might provide more flexibility and velocity implementing distributed planning and execution than being tied to the extension ecosystem. It would indeed be a journey to catch up with Postgres on compatibility, but it's a good start.
So like the MySQL proxies of long ago?
There are definitely advantages of not running inside the system you wish to orchestrate.
Better keep up with the parser though!
We use the Postgres parser directly, thanks to the great work of pg_query [1].
[1] https://github.com/pganalyze/pg_query.rs
Tangentially related: is there a good guide or setup scripts to run self hosted Postgres with backups and secondary standby? Like I just want something I can deploy to a VPS/dedicated box for all my side projects.
If not is supabase the most painless way to get started?
Apart from being backed by Postgres instead of MySQL, is this different from Vitess (and its commercial vendor PlanetScale)?
https://vitess.io/
The goal for this project is to be analogous to Vitess for Postgres.
Funny how everyone eventually hits this point and thinks they're inventing fire - but then again, pushing your trusty old tools way past comfort is where cool engineering actually happens.
Never heard of pgdog before. How does it compare to citus?
Seems like this is a similar philosophy, but is missing a bunch of things the Citus coordinator provides. From the article, I'm guessing Citus is better at cross-shard queries, SQL support, central management of workers, keeping schemas in sync, and keeping small join tables in sync across the fleet, and provides a single point of ingestion.
That being said, this does seem to handle replicas better than Citus ever really did, and most of the features it's lacking aren't relevant for the sort of multitenant use case this blog is describing, so it's not a bad tradeoff. This also avoids the coordinator as a central point of failure for both outages and connection count limitations, but we never really saw those be a problem often in practice.
We certainly have a way to go to support all cross-shard use cases, especially complex aggregates (like percentiles). In OLTP, which PgDog will focus on first, it's good to have a sharding key and a single shard in mind 99% of the time. The 1% will be divided between easy things we already support, like sorting, and slightly more complex things like aggregates (avg, count, max, min, etc.), which are on the roadmap.
For everything else, and until we cover what's left, postgres_fdw can be a fallback. It actually works pretty well.
What are the drawbacks of Citus/why isn't it perfect/what would you look for from a competitor/alternative?
Citus works really well *if* you have your schema well defined and slightly denormalized (meaning you have the shard key materialized on every table), and you ensure you're always joining on that as part of querying. For a lot of existing applications that were not designed with this in mind, it can be several months of database and application code changes to get things into shape to work with Citus.
If you're designing from scratch and make it work with Citus (specifically for a multi-tenant/SaaS sharded app), then it can make scaling seem a bit magical.
Great question and we will publish something for this later on. TLDR: Citus doesn't provide an easy migration path for existing DBs and the managed deployments are pretty much exclusive to Azure. A large percentage of production DBs run elsewhere and don't have an easy way to shard.
DynamoDB is most certainly not the way.
How does PgDog handle a column like:
`id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY`
Replace it with a function that makes sure the id coming out matches the sharding scheme. Assuming it's coming from a sequence, we're consuming it until we get a matching number. It would be good to know what is behind the generated column in your use case.
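A rough sketch of that idea, assuming a simple modulo-based sharding scheme (hypothetical function, not PgDog's actual implementation):

    -- Keep drawing from the sequence until the value lands on this node's shard.
    CREATE FUNCTION next_local_id(seq regclass, shard int, num_shards int)
    RETURNS bigint LANGUAGE plpgsql AS $$
    DECLARE
        v bigint;
    BEGIN
        LOOP
            v := nextval(seq);
            IF v % num_shards = shard THEN
                RETURN v;
            END IF;
        END LOOP;
    END;
    $$;

    -- e.g. on shard 2 of 4, the column's default could call:
    SELECT next_local_id('users_id_seq', 2, 4);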
Shouldn't the title be 'Learning to scale Postgres'?
Logical replication works both ways, that's a good start.
i was reading through this and was going "huh this sounds familiar" until i read who wrote it :)
neat piece of tech! excited to try it out.
Couldn't they have just moved to Aurora DSQL and saved all the headache?
We actually found Aurora to be about 3x slower than community PG for small queries. That was back then, maybe things are better now. Migrating to another database (and Aurora is Postgres-compatible, it's not Postgres) is very risky when you've been using yours for years and know where the edge cases are.
I’ve consistently found Aurora MySQL and PG to be slower than everything, including my 12 year old Dell R620s. You can’t beat data locality, and the 4/6 quorum requirement of Aurora combined with the physical distance kills any hope of speed.
Aurora is postgres but with a different storage layer, no? It uses the postgres engine, which other postgres-compatible databases like cockroach do not, right?
That’s right, Aurora Postgres is quite close to vanilla Postgres. Aurora DSQL is a different story though.
For a more scientific answer, there is this project: https://pgscorecard.com/
Note that Aurora scores 93% while Cockroach scores only 40%.
I actually wasn't aware of Aurora DSQL, that's incredibly bad product naming.
This!
Migrations from Postgres to pg-compatible DBs are never as smooth as they're advertised to be.
> Aurora DSQL
There isn't even pricing on it...