Optimizing PostgreSQL Queries for Large-Scale Data Applications

Over the past two years, I've been working on scaling database performance for a high-traffic application that handles millions of daily transactions. What started as a straightforward PostgreSQL setup quickly became a performance bottleneck as our data grew exponentially. This post shares the practical optimization techniques that helped us reduce query times from 15+ seconds to under 200ms.

The Problem: When Good Queries Go Bad

Our application was initially handling around 50,000 records. Fast forward to today, and we're managing over 10 million records with complex relationships. Here's what we learned the hard way.

Initial Performance Issues

-- Our original "simple" query
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE o.created_at >= '2024-01-01'
ORDER BY o.created_at DESC;

-- Query time: 18 seconds ❌

This query seemed innocent enough, but it was killing our application's responsiveness.

Strategy 1: Index Optimization

Understanding Index Types

PostgreSQL offers several index types, and choosing the right one makes a massive difference.

1. B-tree Indexes (Default)

Best for equality and range queries:

-- Create B-tree index on frequently queried columns
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

-- Composite index for multi-column queries
CREATE INDEX idx_orders_customer_date 
ON orders(customer_id, created_at);

Performance Impact: Query time dropped to 8 seconds
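
One thing to keep in mind with composite indexes: column order matters. A short illustration using the index defined above (the customer_id value is just illustrative; the behavior described in the comments is the standard B-tree rule):

-- idx_orders_customer_date (customer_id, created_at) is usable when the
-- leading column is constrained
SELECT id, total_amount
FROM orders
WHERE customer_id = 42
  AND created_at >= '2024-01-01';

-- A filter on created_at alone cannot use the composite index efficiently;
-- that case is what idx_orders_created_at is for.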

2. Partial Indexes

Index only relevant rows:

-- Only index active orders (not archived ones)
CREATE INDEX idx_orders_active 
ON orders(created_at) 
WHERE status = 'active';

-- Reduced index size by 60%

Performance Impact: Query time dropped to 4.5 seconds
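
One caveat: the planner only uses a partial index when the query's WHERE clause clearly implies the index predicate, so the condition has to be repeated in the query. A minimal example against the index above:

-- Matches the predicate, so idx_orders_active is a candidate
SELECT id, total_amount
FROM orders
WHERE status = 'active'
  AND created_at >= '2024-01-01';

-- A query that doesn't filter on status = 'active' cannot use this index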

3. GIN Indexes for JSONB

For JSON data columns:

-- Index JSONB column for fast lookups
CREATE INDEX idx_orders_metadata 
ON orders USING GIN (metadata);

-- Now we can query JSON efficiently
SELECT * FROM orders 
WHERE metadata @> '{"priority": "high"}';
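
If the JSONB column is only ever queried with containment (@>), the jsonb_path_ops operator class is worth a try; it usually produces a smaller, faster GIN index, at the cost of not supporting key-existence operators (the index name below is just illustrative):

-- Leaner GIN index specialized for @> containment lookups
CREATE INDEX idx_orders_metadata_path
ON orders USING GIN (metadata jsonb_path_ops);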

Index Maintenance

-- Monitor index usage
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Remove unused indexes (idx_scan = 0)
DROP INDEX IF EXISTS unused_index_name;

-- Rebuild bloated indexes without blocking writes (requires PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_created_at;
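
Index size is the other half of the maintenance picture. A query along these lines lists indexes by on-disk size and scan count, which helps decide what is worth dropping or rebuilding:

-- Index sizes, largest first
SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;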

Strategy 2: Query Restructuring

Select Only What You Need

-- ❌ BAD: Fetching all columns
SELECT * FROM orders WHERE status = 'active';

-- ✅ GOOD: Fetch only required columns
SELECT id, customer_id, total_amount, created_at 
FROM orders 
WHERE status = 'active';

Performance Impact: 40% reduction in data transfer time

Use EXPLAIN ANALYZE

EXPLAIN (ANALYZE, BUFFERS) 
SELECT o.id, o.total_amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days';

Key metrics to watch (see the annotated plan sketch below):

  • Seq Scan → usually bad on large tables (a supporting index is missing)
  • Index Scan → good
  • Bitmap Heap Scan → good for large result sets
  • Nested Loop → slow when both inputs are large (fine with a small, indexed inner side)
  • Hash Join → good for joining large tables
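
To make those node names concrete, here is a rough, annotated sketch of the plan shape to hope for from the query above; it is not literal output, and real plans include costs, timings, and row counts:

-- Sketch of a healthy plan shape (not literal output):
--
-- Hash Join
--   -> Index Scan using idx_orders_created_at on orders o   -- date filter uses the index
--   -> Hash
--        -> Seq Scan on customers c                         -- fine: customers is the small side
--
-- A "Seq Scan on orders" at the top level instead would mean the created_at
-- index is not being used and deserves a closer look.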

Optimize Joins

-- ❌ BAD: SELECT * across several LEFT JOINs pulls every column plus unmatched rows
SELECT o.*, c.*, p.*, s.*
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
LEFT JOIN products p ON o.product_id = p.id
LEFT JOIN shipments s ON o.id = s.order_id;

-- ✅ GOOD: Use INNER JOIN when possible
SELECT o.id, o.total_amount, c.name, p.title
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed';

Strategy 3: Partitioning Large Tables

Time-Based Partitioning

-- Create parent table
CREATE TABLE orders (
    id SERIAL,
    customer_id INTEGER,
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP NOT NULL,
    status VARCHAR(50)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE orders_2024_01 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_2024_02 PARTITION OF orders
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Create indexes on each partition
CREATE INDEX idx_orders_2024_01_customer 
ON orders_2024_01(customer_id);

Performance Impact: Queries scanning specific date ranges became 10x faster
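
A quick way to sanity-check that pruning is actually kicking in: EXPLAIN a date-bounded query and confirm the plan only touches the matching partition (here, orders_2024_02):

EXPLAIN
SELECT count(*)
FROM orders
WHERE created_at >= '2024-02-01'
  AND created_at < '2024-03-01';

-- The plan should reference orders_2024_02 only, not every partition.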

List-Based Partitioning

-- Partition by status
CREATE TABLE orders (
    id SERIAL,
    customer_id INTEGER,
    status VARCHAR(50),
    created_at TIMESTAMP
) PARTITION BY LIST (status);

CREATE TABLE orders_active PARTITION OF orders
FOR VALUES IN ('pending', 'processing', 'shipped');

CREATE TABLE orders_completed PARTITION OF orders
FOR VALUES IN ('delivered', 'completed');
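
With list partitioning it is easy to forget a status value; since PostgreSQL 11 a DEFAULT partition can catch anything that does not match an explicit list (the table name below is just a suggestion):

-- Rows with any other status land here instead of failing the INSERT
CREATE TABLE orders_other PARTITION OF orders DEFAULT;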

Strategy 4: Connection Pooling

PgBouncer Configuration

[databases]
myapp_db = host=localhost port=5432 dbname=production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5

Implementation in Node.js:

const { Pool } = require('pg');

const pool = new Pool({
  host: 'localhost',
  port: 6432, // PgBouncer port
  database: 'myapp_db',
  user: 'app_user',
  password: process.env.DB_PASSWORD,
  max: 20, // max connections in pool
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// Efficient query execution
async function getActiveOrders() {
  const client = await pool.connect();
  try {
    const result = await client.query(
      'SELECT id, total_amount FROM orders WHERE status = $1',
      ['active']
    );
    return result.rows;
  } finally {
    client.release(); // Always release!
  }
}

Performance Impact: Reduced connection overhead by 70%

Strategy 5: Caching Strategies

Application-Level Caching (Redis)

const redis = require('redis');

// node-redis v4+: the client must be connected before use
const client = redis.createClient();
client.connect().catch(console.error);

async function getCustomerOrders(customerId) {
  const cacheKey = `customer:${customerId}:orders`;

  // Check cache first
  const cached = await client.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Query database if not cached
  const result = await pool.query(
    'SELECT * FROM orders WHERE customer_id = $1',
    [customerId]
  );

  // Cache for 5 minutes (setEx in node-redis v4; v3 used setex)
  await client.setEx(cacheKey, 300, JSON.stringify(result.rows));

  return result.rows;
}

Materialized Views

-- Create materialized view for expensive aggregations
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT 
    DATE(created_at) as sale_date,
    COUNT(*) as total_orders,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as avg_order_value
FROM orders
WHERE status = 'completed'
GROUP BY DATE(created_at);

-- Create a unique index on the materialized view
-- (required for REFRESH ... CONCURRENTLY below)
CREATE UNIQUE INDEX idx_daily_sales_date ON daily_sales_summary(sale_date);

-- Refresh the view (can be scheduled)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;

Performance Impact: Dashboard queries went from 12s to 100ms
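
The refresh can be scheduled from outside the database (cron, a job runner) or, if the pg_cron extension happens to be installed, from inside it. A sketch assuming pg_cron is available; the job name and interval are only examples:

-- Refresh the summary every 15 minutes via pg_cron
SELECT cron.schedule(
  'refresh-daily-sales',
  '*/15 * * * *',
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary$$
);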

Strategy 6: Batch Operations

Efficient Bulk Inserts

// ❌ BAD: Multiple individual inserts
for (let i = 0; i < 10000; i++) {
  await pool.query(
    'INSERT INTO orders (customer_id, amount) VALUES ($1, $2)',
    [customerId, amount]
  );
}
// Time: 45 seconds

// ✅ GOOD: Single batch insert
const values = [];
const params = [];
for (let i = 0; i < 10000; i++) {
  const offset = i * 2;
  values.push(`($${offset + 1}, $${offset + 2})`);
  params.push(customerId, amount);
}

await pool.query(
  `INSERT INTO orders (customer_id, amount) VALUES ${values.join(',')}`,
  params
);
// Time: 2 seconds ✅
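
The multi-VALUES approach uses two bind parameters per row, and PostgreSQL caps a single statement at 65,535 parameters, so very large batches need chunking. An alternative that keeps the parameter count constant is to send two arrays and UNNEST them server-side (a sketch; the statement would be passed to pool.query with an int array and a numeric array as its two parameters):

INSERT INTO orders (customer_id, amount)
SELECT *
FROM unnest(
  $1::int[],        -- array of customer ids
  $2::numeric[]     -- array of amounts, same length
) AS t(customer_id, amount);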

Using COPY for Massive Imports

const { pipeline } = require('stream/promises'); // awaitable pipeline (Node 15+)
const { from } = require('pg-copy-streams');

async function bulkImport(dataStream) {
  const client = await pool.connect();
  try {
    const stream = client.query(
      from('COPY orders (customer_id, amount, created_at) FROM STDIN CSV')
    );

    await pipeline(dataStream, stream);
  } finally {
    client.release();
  }
}

Strategy 7: Configuration Tuning

Key PostgreSQL Settings

# postgresql.conf optimizations

# Memory settings
shared_buffers = 4GB                # 25% of RAM
effective_cache_size = 12GB         # 75% of RAM
work_mem = 64MB                     # Per operation
maintenance_work_mem = 1GB          # For VACUUM, CREATE INDEX

# Checkpoint settings
checkpoint_completion_target = 0.9
wal_buffers = 16MB
max_wal_size = 4GB
min_wal_size = 1GB

# Query planner
random_page_cost = 1.1              # For SSD
effective_io_concurrency = 200      # For SSD

# Parallelism
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
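
Most of these can be inspected and changed without hand-editing the file; a couple of statements that are handy while tuning (shared_buffers still needs a restart to take effect):

-- Inspect current values and where they come from
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'random_page_cost');

-- Persist a change and reload the configuration
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();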

Vacuum and Analyze

-- Manual vacuum
VACUUM ANALYZE orders;

-- Autovacuum configuration (postgresql.conf)
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 30s

-- Monitor vacuum progress
SELECT * FROM pg_stat_progress_vacuum;
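
Global autovacuum settings are a blunt instrument; for a hot table like orders it often pays to tighten the thresholds per table so dead rows are cleaned up sooner (the 5% scale factor below is a starting point, not a measured recommendation):

-- Vacuum orders once ~5% of its rows are dead instead of the 20% default
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);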

Real-World Results: Before & After

Query Performance Comparison

Operation             Before   After    Improvement
Complex JOIN query    18s      0.18s    100x faster
Aggregation report    12s      0.10s    120x faster
Bulk insert (10k)     45s      2s       22x faster
Full-text search      8s       0.3s     26x faster
Dashboard load        15s      0.5s     30x faster

Infrastructure Impact

  • Server Cost: Reduced from 3 servers to 1
  • CPU Usage: Dropped from 80% to 25%
  • Memory Usage: Optimized from 90% to 45%
  • Storage I/O: Reduced by 60%

Monitoring & Maintenance

Essential Queries for Monitoring

-- 1. Find slow queries
SELECT query, 
       mean_exec_time, 
       calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

-- 2. Find indexes that are never used (pg_stat_user_indexes uses relname/indexrelname)
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, tablename;

-- 3. Largest tables by total size (growth here is a proxy for bloat)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

-- 4. Active connections
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;
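
Note that query 1 relies on the pg_stat_statements extension, which has to be preloaded (server restart) and then created in the database:

-- postgresql.conf, restart required:
-- shared_preload_libraries = 'pg_stat_statements'

-- Then, once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;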

Automated Monitoring Script

const { Pool } = require('pg');
const pool = new Pool(/* config */);

async function healthCheck() {
  const checks = {
    slowQueries: await pool.query(`
      SELECT COUNT(*) FROM pg_stat_statements 
      WHERE mean_exec_time > 1000
    `),
    connections: await pool.query(`
      SELECT COUNT(*) FROM pg_stat_activity 
      WHERE state = 'active'
    `),
    tableSize: await pool.query(`
      SELECT pg_size_pretty(pg_database_size(current_database()))
    `)
  };

  console.log('Database Health:', checks);

  // Alert if thresholds exceeded (COUNT(*) comes back as a string)
  if (parseInt(checks.slowQueries.rows[0].count, 10) > 10) {
    console.error('⚠️ Too many slow queries detected!');
  }
}

// Run every 5 minutes
setInterval(healthCheck, 5 * 60 * 1000);

Common Pitfalls to Avoid

1. N+1 Query Problem

// ❌ BAD: N+1 queries
const orders = await pool.query('SELECT * FROM orders LIMIT 100');
for (let order of orders.rows) {
  const customer = await pool.query(
    'SELECT * FROM customers WHERE id = $1',
    [order.customer_id]
  );
  order.customer = customer.rows[0];
}

// ✅ GOOD: Single JOIN query
const result = await pool.query(`
  SELECT o.*, c.name, c.email
  FROM orders o
  JOIN customers c ON o.customer_id = c.id
  LIMIT 100
`);

2. Not Using Prepared Statements

// ❌ BAD: SQL injection risk + no query plan caching
const query = `SELECT * FROM orders WHERE id = ${userId}`;

// ✅ GOOD: Prepared statement
const query = 'SELECT * FROM orders WHERE id = $1';
await pool.query(query, [userId]);

3. Ignoring EXPLAIN Output

Always analyze your queries before deploying to production:

EXPLAIN (ANALYZE, BUFFERS, COSTS, VERBOSE, TIMING) 
SELECT * FROM orders 
WHERE created_at >= NOW() - INTERVAL '7 days';

Advanced Techniques

Full-Text Search Optimization

-- Add tsvector column
ALTER TABLE products ADD COLUMN search_vector tsvector;

-- Create GIN index
CREATE INDEX idx_products_search 
ON products USING GIN(search_vector);

-- Update trigger to maintain search_vector
CREATE TRIGGER tsvector_update 
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(search_vector, 'pg_catalog.english', title, description);

-- Efficient full-text search
SELECT * FROM products
WHERE search_vector @@ to_tsquery('english', 'laptop & gaming');
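
On PostgreSQL 12+, a stored generated column is a simpler alternative to the trigger, since the database keeps search_vector up to date by itself (this replaces the ALTER TABLE and trigger above rather than adding to them):

-- Generated column maintained automatically by PostgreSQL (12+)
ALTER TABLE products
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
  to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
) STORED;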

Write-Ahead Log (WAL) Tuning

# For high-write workloads
wal_compression = on
wal_writer_delay = 200ms
commit_delay = 10
commit_siblings = 5

Lessons Learned

  1. Measure First: Use EXPLAIN ANALYZE before optimizing
  2. Index Wisely: More indexes ≠ better performance
  3. Monitor Continuously: pg_stat_statements is your best friend
  4. Test in Production-Like Environment: Your local DB won't show scaling issues
  5. Incremental Changes: Optimize one thing at a time
  6. Document Everything: Future you will thank present you

Conclusion

Optimizing PostgreSQL for large-scale applications is an iterative process. The techniques I've shared here reduced our query times by 100x and cut infrastructure costs by 60%.

The key takeaways:

  • ✅ Strategic indexing (not over-indexing)
  • ✅ Query restructuring and proper JOINs
  • ✅ Partitioning for time-series data
  • ✅ Connection pooling with PgBouncer
  • ✅ Caching at multiple levels
  • ✅ Regular monitoring and maintenance

Resources & Tools

  • pgAdmin - GUI management
  • pg_stat_statements - Query statistics
  • PgBouncer - Connection pooling
  • pgBadger - Log analyzer
  • explain.depesz.com - Visual EXPLAIN analyzer

Have you faced similar scaling challenges? What optimization techniques worked for your use case? Let me know in the comments!

#performance #optimization #sql #backend #devops #scalability
