Scaling Postgres to the Next Level at OpenAI
At PGConf.Dev 2025, Bohan Zhang from OpenAI presented a session titled “Scaling Postgres to the Next Level at OpenAI”, giving us a peek into database operations at one of the world’s most prominent AI companies.
OpenAI has successfully demonstrated that PostgreSQL can scale to support massive read-heavy workloads—even without sharding—using a single primary writer with dozens of read replicas.
Architecture Foundation
OpenAI operates PostgreSQL on Azure using a classic primary-replica replication model. The infrastructure includes one primary database and over a dozen read replicas distributed across regions. This approach supports several hundred million active users while maintaining critical system reliability.
The key insight here is that sharding is not necessary for scaling reads—you can scale horizontally by adding more replicas. The single primary handles all writes, while read traffic is distributed across replicas.
Key Challenges Identified
Write Bottlenecks
The primary challenge centers on write request limitations. While read scalability performs well through replication, write operations remain constrained to the single primary instance. This is the fundamental trade-off of the single-primary architecture.
MVCC Design Issues
PostgreSQL’s multi-version concurrency control creates table and index bloat concerns. Every write generates a new row version, requiring additional heap fetches for visibility checks during index access. Autovacuum tuning becomes critical at scale.
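At this scale, autovacuum tuning is usually done per table rather than globally. A minimal sketch, assuming a hypothetical write-heavy table named `events`:

```sql
-- Illustrative per-table autovacuum settings for a write-heavy table.
-- Lower scale factors make autovacuum trigger much earlier on large
-- tables, keeping dead-tuple bloat and heap-fetch overhead in check.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum after ~1% dead rows
    autovacuum_analyze_scale_factor = 0.005,  -- refresh statistics more often
    autovacuum_vacuum_cost_delay    = 2       -- ms; throttle vacuum less
);
```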
Replication Lag
Higher WAL (write-ahead log) volume correlates with greater replication lag, and each additional replica increases network bandwidth demands. Managing this becomes crucial as the replica count grows.
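Replication lag is directly observable from the primary. A query along these lines (using the standard `pg_stat_replication` view, PostgreSQL 10+) shows per-replica lag in both bytes and time:

```sql
-- Per-replica lag as seen from the primary.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag                                        -- time-based lag
FROM   pg_stat_replication;
```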
Optimization Strategies Implemented
Load Reduction
OpenAI systematically offloads writes where possible:
- Minimize unnecessary application-level writes
- Implement lazy write patterns
- Control data backfill rates
- Move read requests to replicas whenever feasible
- Only read-write transaction queries remain on the primary
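One common way to enforce this split (a sketch of the general technique, not necessarily OpenAI's mechanism) is to open read traffic in explicitly read-only transactions, which a router can send to replicas and which fail fast if a write sneaks in; `pg_is_in_recovery()` lets health checks distinguish primary from replica:

```sql
-- Reads declared read-only can be routed to a replica safely.
BEGIN TRANSACTION READ ONLY;
SELECT count(*) FROM events;   -- "events" is a hypothetical table
COMMIT;

-- Returns true on a replica, false on the primary.
SELECT pg_is_in_recovery();
```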
Query Optimization
The team employs aggressive timeout settings to prevent “idle in transaction” states:
- Session, statement, and client-level timeouts prevent resource exhaustion
- Multi-table joins receive special attention (ORMs frequently generate inefficient queries)
- Long-running queries are strictly controlled
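A representative set of timeout settings (the values are illustrative, not OpenAI's actual configuration) might look like:

```sql
-- Server-wide defaults; reload applies them to new sessions.
ALTER SYSTEM SET statement_timeout = '5s';
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s';
ALTER SYSTEM SET lock_timeout = '2s';
SELECT pg_reload_conf();

-- Timeouts can also be scoped per role, e.g. looser limits for a
-- hypothetical "reporting" user running long analytical queries.
ALTER ROLE reporting SET statement_timeout = '5min';
```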
Failure Mitigation
Recognizing the primary as a single point of failure, OpenAI:
- Segregates high-priority requests onto dedicated read-only replicas
- Prevents low-priority workloads from impacting critical operations
- Maintains separate replica pools for different workload types
Schema Management Restrictions
The cluster operates under strict schema change limitations:
- Operations requiring full table rewrites are forbidden
- Index operations must use the `CONCURRENTLY` syntax
- Column additions or removals require a 5-second timeout maximum
- No heavy DDL during business hours
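In practice these rules translate into DDL patterns like the following (a sketch, again using a hypothetical `events` table):

```sql
-- Build the index without blocking writes; CONCURRENTLY cannot run
-- inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_user_id
    ON events (user_id);

-- Add a column under a strict lock budget: fail fast rather than
-- queue behind long-running transactions.
SET lock_timeout = '5s';
ALTER TABLE events ADD COLUMN flags bigint;  -- nullable, no volatile
                                             -- default: no table rewrite
```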
Results Achieved
The approach has proven highly successful:
- Millions of QPS: Scaled PostgreSQL to handle millions of queries per second
- Dozens of replicas: Added dozens of replicas without increasing replication lag
- Geographic distribution: Deployed geo-distributed read replicas while maintaining low latency
- High reliability: Only one SEV0 (highest severity) PostgreSQL-related incident in nine months
- Headroom remaining: Substantial capacity for future growth
Feature Requests to PostgreSQL Community
OpenAI raised several enhancement suggestions for the PostgreSQL community:
Index Disabling
A formal capability to disable an index while DML operations continue to maintain it. This would allow safe validation before deletion; today it requires superuser access to manipulate pg_index.indisvalid.
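The workaround this request would formalize is the catalog flip itself. A sketch of what that looks like today (superuser only, hypothetical index name):

```sql
-- Mark the index invalid: the planner stops using it, but DML still
-- maintains it, so re-enabling is instant if performance regresses.
UPDATE pg_index
SET    indisvalid = false
WHERE  indexrelid = 'idx_events_user_id'::regclass;

-- If nothing regresses after observation, DROP INDEX;
-- otherwise set indisvalid back to true.
```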
Real-Time Observability
Current pg_stat_statements provides average metrics but lacks percentile latency distributions (p95, p99). Histogram-style latency metrics would dramatically improve monitoring capabilities.
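What is available today can be queried directly. A rough sketch (column names per PostgreSQL 13+ `pg_stat_statements`) that estimates tail latency from the standard deviation, under a normality assumption that rarely holds for real query latencies:

```sql
SELECT queryid,
       calls,
       mean_exec_time,
       stddev_exec_time,
       mean_exec_time + 2 * stddev_exec_time AS rough_p95_estimate
FROM   pg_stat_statements
ORDER  BY mean_exec_time DESC
LIMIT  10;
```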
Schema Change History
SQL-queryable views tracking DDL operations would improve audit and historical reference capabilities. Currently requires parsing logs or using EVENT TRIGGERs.
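A minimal event-trigger sketch that records DDL into a queryable table (all names are illustrative):

```sql
CREATE TABLE IF NOT EXISTS ddl_history (
    happened_at timestamptz DEFAULT now(),
    username    text        DEFAULT current_user,
    command_tag text,
    object_id   text
);

CREATE OR REPLACE FUNCTION log_ddl() RETURNS event_trigger AS $$
DECLARE r record;
BEGIN
    -- pg_event_trigger_ddl_commands() is only callable from within
    -- a ddl_command_end event trigger.
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO ddl_history (command_tag, object_id)
        VALUES (r.command_tag, r.object_identity);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER track_ddl ON ddl_command_end
    EXECUTE FUNCTION log_ddl();
```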
Monitoring View Semantics
Clarification needed regarding session states reporting “active” while waiting for client I/O, particularly regarding whether idle_in_transaction_session_timeout applies in that state.
Default Parameters
Request for heuristic-based default configuration detection based on hardware specifications—smarter out-of-the-box settings.
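The rules of thumb such a heuristic would encode are well established; for a hypothetical 64 GB machine they might produce something like:

```sql
ALTER SYSTEM SET shared_buffers = '16GB';        -- ~25% of RAM
ALTER SYSTEM SET effective_cache_size = '48GB';  -- ~75% of RAM
ALTER SYSTEM SET maintenance_work_mem = '2GB';
```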
Commentary
Having managed comparable infrastructure at Tantan (supporting approximately 2.5 million QPS across multiple clusters, with individual primary instances handling 400K QPS using 33 read replicas), I can contextualize OpenAI’s experience within broader PostgreSQL deployment history.
The difference reflects hardware advancement over eight years, enabling modern startups to serve entire operations through single PostgreSQL instances. What required dozens of clusters in 2016 can now be handled by a single, well-tuned PostgreSQL deployment.
Most features OpenAI requested already exist within the PostgreSQL ecosystem but remain unavailable in either vanilla PostgreSQL or managed cloud database environments:
- Index disabling: Manipulate the `pg_index.indisvalid` catalog field (requires superuser)
- Query latency percentiles: Use the `pg_stat_monitor` extension or estimate from `pg_stat_statements` standard deviation
- DDL history tracking: Enable `log_statement = 'ddl'` or use EVENT TRIGGER mechanisms
- Connection timeout management: Configure at the load balancer (PgBouncer/HAProxy) or application connection pool level
- Parameter optimization: Use initialization tools with hardware detection
The presentation demonstrates that, even with premier cloud provider support, extreme-scale PostgreSQL operations still require substantial operational expertise and DBA capability despite the constraints of managed database services.
Infrastructure Insights
During conference discussions, OpenAI revealed:
- Reliance on Azure’s highest-specification managed PostgreSQL service
- Dedicated engineering support from Azure
- Database access routes through Kubernetes via application-side PgBouncer connection pooling
- Monitoring utilizes Datadog for observability across the deployment
This confirms that even at the highest cloud tiers, organizations still need deep PostgreSQL expertise to succeed at scale.
