Scaling Postgres to the Next Level at OpenAI
At PGConf.Dev 2025, Bohan Zhang from OpenAI presented a session titled “Scaling Postgres to the Next Level at OpenAI”, giving us a peek into database operations at one of the world’s most prominent AI companies.
OpenAI has successfully demonstrated that PostgreSQL can scale to support massive read-heavy workloads—even without sharding—using a single primary writer with dozens of read replicas.
Architecture Foundation
OpenAI operates PostgreSQL on Azure using a classic primary-replica replication model. The infrastructure includes one primary database and over a dozen read replicas distributed across regions. This approach supports several hundred million active users while maintaining critical system reliability.
The key insight here is that sharding is not necessary for scaling reads—you can scale horizontally by adding more replicas. The single primary handles all writes, while read traffic is distributed across replicas.
Key Challenges Identified
Write Bottlenecks
The primary challenge centers on write request limitations. While read scalability performs well through replication, write operations remain constrained to the single primary instance. This is the fundamental trade-off of the single-primary architecture.
MVCC Design Issues
PostgreSQL’s multi-version concurrency control creates table and index bloat concerns. Every write generates a new row version, requiring additional heap fetches for visibility checks during index access. Autovacuum tuning becomes critical at scale.
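At this scale, autovacuum tuning is usually done per table rather than globally. A minimal sketch, assuming a hypothetical write-heavy table named `events`:

```sql
-- Illustrative per-table autovacuum settings for a write-heavy table.
-- Lower scale factors make autovacuum trigger much earlier on large
-- tables, keeping dead-tuple bloat and heap-fetch overhead in check.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum after ~1% dead rows
    autovacuum_analyze_scale_factor = 0.005,  -- refresh statistics more often
    autovacuum_vacuum_cost_delay    = 2       -- ms; throttle vacuum less
);
```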
Replication Lag
Higher WAL (write-ahead log) volume correlates with greater replication lag, and each additional replica increases network bandwidth demands. Managing this becomes crucial as the replica count grows.
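Replication lag is directly observable from the primary. A query along these lines (using the standard `pg_stat_replication` view, PostgreSQL 10+) shows per-replica lag in both bytes and time:

```sql
-- Per-replica lag as seen from the primary.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag                                        -- time-based lag
FROM   pg_stat_replication;
```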
Optimization Strategies Implemented
Load Reduction
OpenAI systematically offloads writes where possible:
- Minimize unnecessary application-level writes
- Implement lazy write patterns
- Control data backfill rates
- Move read requests to replicas whenever feasible
- Only read-write transaction queries remain on the primary
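One common way to enforce this split (a sketch of the general technique, not necessarily OpenAI's mechanism) is to open read traffic in explicitly read-only transactions, which a router can send to replicas and which fail fast if a write sneaks in; `pg_is_in_recovery()` lets health checks distinguish primary from replica:

```sql
-- Reads declared read-only can be routed to a replica safely.
BEGIN TRANSACTION READ ONLY;
SELECT count(*) FROM events;   -- "events" is a hypothetical table
COMMIT;

-- Returns true on a replica, false on the primary.
SELECT pg_is_in_recovery();
```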
Query Optimization
The team employs aggressive timeout settings to prevent “idle in transaction” states:
- Session, statement, and client-level timeouts prevent resource exhaustion
- Multi-table joins receive special attention (ORMs frequently generate inefficient queries)
- Long-running queries are strictly controlled
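A representative set of timeout settings (the values are illustrative, not OpenAI's actual configuration) might look like:

```sql
-- Server-wide defaults; reload applies them to new sessions.
ALTER SYSTEM SET statement_timeout = '5s';
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s';
ALTER SYSTEM SET lock_timeout = '2s';
SELECT pg_reload_conf();

-- Timeouts can also be scoped per role, e.g. looser limits for a
-- hypothetical "reporting" user running long analytical queries.
ALTER ROLE reporting SET statement_timeout = '5min';
```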
Failure Mitigation
Recognizing the primary as a single point of failure, OpenAI:
- Segregates high-priority requests onto dedicated read-only replicas
- Prevents low-priority workloads from impacting critical operations
- Maintains separate replica pools for different workload types
Schema Management Restrictions
The cluster operates under strict schema change limitations:
- Operations requiring full table rewrites are forbidden
- Index operations must use the `CONCURRENTLY` syntax
- Column additions or removals require a 5-second timeout maximum
- No heavy DDL during business hours
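In practice these rules translate into DDL patterns like the following (a sketch, again using a hypothetical `events` table):

```sql
-- Build the index without blocking writes; CONCURRENTLY cannot run
-- inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_user_id
    ON events (user_id);

-- Add a column under a strict lock budget: fail fast rather than
-- queue behind long-running transactions.
SET lock_timeout = '5s';
ALTER TABLE events ADD COLUMN flags bigint;  -- nullable, no volatile
                                             -- default: no table rewrite
```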
Results Achieved
The approach has proven highly successful:
- Millions of QPS: Scaled PostgreSQL to handle millions of queries per second
- Dozens of replicas: Added dozens of replicas without increasing replication lag
- Geographic distribution: Deployed geo-distributed read replicas while maintaining low latency
- High reliability: Only one SEV0 (highest severity) PostgreSQL-related incident in nine months
- Headroom remaining: Substantial capacity for future growth
Feature Requests to PostgreSQL Community
OpenAI raised several enhancement suggestions for the PostgreSQL community:
Index Disabling
A formal capability to disable an index while DML operations continue to maintain it. This would allow safe validation before deletion; today it requires superuser access to manipulate pg_index.indisvalid.
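The workaround this request would formalize is the catalog flip itself. A sketch of what that looks like today (superuser only, hypothetical index name):

```sql
-- Mark the index invalid: the planner stops using it, but DML still
-- maintains it, so re-enabling is instant if performance regresses.
UPDATE pg_index
SET    indisvalid = false
WHERE  indexrelid = 'idx_events_user_id'::regclass;

-- If nothing regresses after observation, DROP INDEX;
-- otherwise set indisvalid back to true.
```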
Real-Time Observability
Current pg_stat_statements provides average metrics but lacks percentile latency distributions (p95, p99). Histogram-style latency metrics would dramatically improve monitoring capabilities.
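What is available today can be queried directly. A rough sketch (column names per PostgreSQL 13+ `pg_stat_statements`) that estimates tail latency from the standard deviation, under a normality assumption that rarely holds for real query latencies:

```sql
SELECT queryid,
       calls,
       mean_exec_time,
       stddev_exec_time,
       mean_exec_time + 2 * stddev_exec_time AS rough_p95_estimate
FROM   pg_stat_statements
ORDER  BY mean_exec_time DESC
LIMIT  10;
```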
Schema Change History
SQL-queryable views tracking DDL operations would improve audit and historical reference capabilities. Currently requires parsing logs or using EVENT TRIGGERs.
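A minimal event-trigger sketch that records DDL into a queryable table (all names are illustrative):

```sql
CREATE TABLE IF NOT EXISTS ddl_history (
    happened_at timestamptz DEFAULT now(),
    username    text        DEFAULT current_user,
    command_tag text,
    object_id   text
);

CREATE OR REPLACE FUNCTION log_ddl() RETURNS event_trigger AS $$
DECLARE r record;
BEGIN
    -- pg_event_trigger_ddl_commands() is only callable from within
    -- a ddl_command_end event trigger.
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO ddl_history (command_tag, object_id)
        VALUES (r.command_tag, r.object_identity);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER track_ddl ON ddl_command_end
    EXECUTE FUNCTION log_ddl();
```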
Monitoring View Semantics
Clarification needed regarding session states reporting “active” while waiting for client I/O, particularly regarding whether idle_in_transaction_session_timeout applies in that state.
Default Parameters
Request for heuristic-based default configuration detection based on hardware specifications—smarter out-of-the-box settings.
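The rules of thumb such a heuristic would encode are well established; for a hypothetical 64 GB machine they might produce something like:

```sql
ALTER SYSTEM SET shared_buffers = '16GB';        -- ~25% of RAM
ALTER SYSTEM SET effective_cache_size = '48GB';  -- ~75% of RAM
ALTER SYSTEM SET maintenance_work_mem = '2GB';
```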
Commentary
Having managed comparable infrastructure at Tantan (supporting approximately 2.5 million QPS across multiple clusters, with individual primary instances handling 400K QPS using 33 read replicas), I can contextualize OpenAI’s experience within broader PostgreSQL deployment history.
The difference reflects hardware advancement over eight years, enabling modern startups to serve entire operations through single PostgreSQL instances. What required dozens of clusters in 2016 can now be handled by a single, well-tuned PostgreSQL deployment.
Most features OpenAI requested already exist within the PostgreSQL ecosystem but remain unavailable in either vanilla PostgreSQL or managed cloud database environments:
- Index disabling: Manipulate the `pg_index.indisvalid` catalog field (requires superuser)
- Query latency percentiles: Use the `pg_stat_monitor` extension or estimate from `pg_stat_statements` standard deviation
- DDL history tracking: Enable `log_statement = 'ddl'` or use EVENT TRIGGER mechanisms
- Connection timeout management: Configure at the load balancer (PgBouncer/HAProxy) or application connection pool level
- Parameter optimization: Use initialization tools with hardware detection
The presentation demonstrates that, even with premier cloud provider support, extreme-scale PostgreSQL operations still require substantial operational expertise and DBA capability despite the constraints of managed database services.
Infrastructure Insights
During conference discussions, OpenAI revealed:
- Reliance on Azure’s highest-specification managed PostgreSQL service
- Dedicated engineering support from Azure
- Database access routes through Kubernetes via application-side PgBouncer connection pooling
- Monitoring utilizes Datadog for observability across the deployment
This confirms that even at the highest cloud tiers, organizations still need deep PostgreSQL expertise to succeed at scale.
