Over the past two years, I've been working on scaling database performance for a high-traffic application that handles millions of daily transactions. What started as a straightforward PostgreSQL setup quickly became a performance bottleneck as our data grew exponentially. This post shares the practical optimization techniques that helped us reduce query times from 15+ seconds to under 200ms.
The Problem: When Good Queries Go Bad
Our application was initially handling around 50,000 records. Fast forward to today, and we're managing over 10 million records with complex relationships. Here's what we learned the hard way.
Initial Performance Issues
-- Our original "simple" query
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE o.created_at >= '2024-01-01'
ORDER BY o.created_at DESC;
-- Query time: 18 seconds ❌
This query seemed innocent enough, but it was killing our application's responsiveness.
Strategy 1: Index Optimization
Understanding Index Types
PostgreSQL offers several index types, and choosing the right one makes a massive difference.
1. B-tree Indexes (Default)
Best for equality and range queries:
-- Create B-tree index on frequently queried columns
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Composite index for multi-column queries
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, created_at);
Performance Impact: Query time dropped to 8 seconds
2. Partial Indexes
Index only relevant rows:
-- Only index active orders (not archived ones)
CREATE INDEX idx_orders_active
ON orders(created_at)
WHERE status = 'active';
-- Reduced index size by 60%
Performance Impact: Query time dropped to 4.5 seconds
3. GIN Indexes for JSONB
For JSON data columns:
-- Index JSONB column for fast lookups
CREATE INDEX idx_orders_metadata
ON orders USING GIN (metadata);
-- Now we can query JSON efficiently
SELECT * FROM orders
WHERE metadata @> '{"priority": "high"}';
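If containment (@>) is the only JSONB operator you query with, the jsonb_path_ops operator class typically yields a smaller, faster GIN index than the default:
-- Smaller GIN index that supports only the @> containment operator
CREATE INDEX idx_orders_metadata_path
ON orders USING GIN (metadata jsonb_path_ops);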
Index Maintenance
-- Monitor index usage
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Remove unused indexes (idx_scan = 0)
DROP INDEX IF EXISTS unused_index_name;
-- Rebuild bloated indexes (REINDEX ... CONCURRENTLY requires PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_created_at;
Strategy 2: Query Restructuring
Select Only What You Need
-- ❌ BAD: Fetching all columns
SELECT * FROM orders WHERE status = 'active';
-- ✅ GOOD: Fetch only required columns
SELECT id, customer_id, total_amount, created_at
FROM orders
WHERE status = 'active';
Performance Impact: 40% reduction in data transfer time
Use EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total_amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days';
Key metrics to watch:
- Seq Scan → usually fine on small tables, a red flag when filtering a large one (often a missing index)
- Index Scan → good for selective filters
- Bitmap Heap Scan → good for larger result sets
- Nested Loop → can be slow when both sides of the join are large
- Hash Join → usually the better choice for joining large tables
Optimize Joins
-- ❌ BAD: Multiple LEFT JOINs returning nulls
SELECT o.*, c.*, p.*, s.*
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
LEFT JOIN products p ON o.product_id = p.id
LEFT JOIN shipments s ON o.id = s.order_id;
-- ✅ GOOD: Use INNER JOIN when possible
SELECT o.id, o.total_amount, c.name, p.title
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed';
Strategy 3: Partitioning Large Tables
Time-Based Partitioning
-- Create parent table
CREATE TABLE orders (
id SERIAL,
customer_id INTEGER,
total_amount DECIMAL(10,2),
created_at TIMESTAMP NOT NULL,
status VARCHAR(50)
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE orders_2024_01 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Create indexes on each partition
CREATE INDEX idx_orders_2024_01_customer
ON orders_2024_01(customer_id);
Performance Impact: Queries scanning specific date ranges became 10x faster
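To confirm the planner is actually pruning, EXPLAIN a date-bounded query and check that only the matching partition appears in the plan (the dates here are just an example):
-- Should touch orders_2024_01 only, not every monthly partition
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';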
List-Based Partitioning
-- Alternative layout: partition the same table by status instead of date
CREATE TABLE orders (
id SERIAL,
customer_id INTEGER,
status VARCHAR(50),
created_at TIMESTAMP
) PARTITION BY LIST (status);
CREATE TABLE orders_active PARTITION OF orders
FOR VALUES IN ('pending', 'processing', 'shipped');
CREATE TABLE orders_completed PARTITION OF orders
FOR VALUES IN ('delivered', 'completed');
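One caveat: an insert whose status matches neither list will fail outright. On PostgreSQL 11+ a catch-all default partition avoids that:
-- Rows that match no other partition land here
CREATE TABLE orders_other PARTITION OF orders DEFAULT;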
Strategy 4: Connection Pooling
PgBouncer Configuration
[databases]
myapp_db = host=localhost port=5432 dbname=production
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
; note: session-level features (e.g. named prepared statements) don't work in transaction mode
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
Implementation in Node.js:
const { Pool } = require('pg');
const pool = new Pool({
host: 'localhost',
port: 6432, // PgBouncer port
database: 'myapp_db',
user: 'app_user',
password: process.env.DB_PASSWORD,
max: 20, // max connections in pool
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000,
});
// Efficient query execution
async function getActiveOrders() {
const client = await pool.connect();
try {
const result = await client.query(
'SELECT id, total_amount FROM orders WHERE status = $1',
['active']
);
return result.rows;
} finally {
client.release(); // Always release!
}
}
Performance Impact: Reduced connection overhead by 70%
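To verify the pooler is doing its job, PgBouncer's admin console (connect to the special pgbouncer database, e.g. psql -p 6432 pgbouncer) exposes live statistics:
SHOW POOLS;   -- client/server connections per pool, including clients left waiting
SHOW STATS;   -- per-database query and transaction throughput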
Strategy 5: Caching Strategies
Application-Level Caching (Redis)
const redis = require('redis');
const client = redis.createClient();
client.connect(); // node-redis v4+: establish the connection before issuing commands
async function getCustomerOrders(customerId) {
const cacheKey = `customer:${customerId}:orders`;
// Check cache first
const cached = await client.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// Query database if not cached
const result = await pool.query(
'SELECT * FROM orders WHERE customer_id = $1',
[customerId]
);
// Cache for 5 minutes
await client.setEx(cacheKey, 300, JSON.stringify(result.rows));
return result.rows;
}
Materialized Views
-- Create materialized view for expensive aggregations
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE(created_at) as sale_date,
COUNT(*) as total_orders,
SUM(total_amount) as total_revenue,
AVG(total_amount) as avg_order_value
FROM orders
WHERE status = 'completed'
GROUP BY DATE(created_at);
-- REFRESH ... CONCURRENTLY requires a UNIQUE index on the materialized view
CREATE UNIQUE INDEX idx_daily_sales_date ON daily_sales_summary(sale_date);
-- Refresh the view (can be scheduled)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;
Performance Impact: Dashboard queries went from 12s to 100ms
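The refresh itself can be scheduled however you like; one sketch, assuming the pg_cron extension is available (a plain cron job running psql works just as well):
-- Refresh every 15 minutes; CONCURRENTLY relies on the unique index above
SELECT cron.schedule(
  '*/15 * * * *',
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary$$
);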
Strategy 6: Batch Operations
Efficient Bulk Inserts
// ❌ BAD: Multiple individual inserts (newOrders holds ~10,000 rows to insert)
for (const order of newOrders) {
  await pool.query(
    'INSERT INTO orders (customer_id, amount) VALUES ($1, $2)',
    [order.customerId, order.amount]
  );
}
// Time: 45 seconds
// ✅ GOOD: Single batch insert
const values = [];
const params = [];
newOrders.forEach((order, i) => {
  const offset = i * 2;
  values.push(`($${offset + 1}, $${offset + 2})`);
  params.push(order.customerId, order.amount);
});
await pool.query(
  `INSERT INTO orders (customer_id, amount) VALUES ${values.join(',')}`,
  params
);
// Time: 2 seconds ✅
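One caveat on very large batches: the PostgreSQL wire protocol allows at most 65,535 bind parameters per statement, so chunk the insert once you approach roughly 30,000 two-column rows.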
Using COPY for Massive Imports
const { pipeline } = require('node:stream/promises'); // promise-based pipeline, so it can be awaited
const { from } = require('pg-copy-streams');
async function bulkImport(dataStream) {
const client = await pool.connect();
try {
const stream = client.query(
from('COPY orders (customer_id, amount, created_at) FROM STDIN CSV')
);
await pipeline(dataStream, stream);
} finally {
client.release();
}
}
Strategy 7: Configuration Tuning
Key PostgreSQL Settings
# postgresql.conf optimizations
# Memory settings
shared_buffers = 4GB # 25% of RAM
effective_cache_size = 12GB # 75% of RAM
work_mem = 64MB # Per operation
maintenance_work_mem = 1GB # For VACUUM, CREATE INDEX
# Checkpoint settings
checkpoint_completion_target = 0.9
wal_buffers = 16MB
max_wal_size = 4GB
min_wal_size = 1GB
# Query planner
random_page_cost = 1.1 # For SSD
effective_io_concurrency = 200 # For SSD
# Parallelism
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
Vacuum and Analyze
-- Manual vacuum
VACUUM ANALYZE orders;
-- Autovacuum configuration (postgresql.conf)
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 30s
-- Monitor vacuum progress
SELECT * FROM pg_stat_progress_vacuum;
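For a hot table like orders, the global thresholds can still be too lax; per-table settings are worth a look (the scale factors below are assumptions to tune against your own write rate):
-- Vacuum/analyze orders after ~5% / ~2% of rows change, instead of the 20% / 10% defaults
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.05,
  autovacuum_analyze_scale_factor = 0.02
);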
Real-World Results: Before & After
Query Performance Comparison
| Operation | Before | After | Improvement |
|---|---|---|---|
| Complex JOIN query | 18s | 0.18s | 100x faster |
| Aggregation report | 12s | 0.10s | 120x faster |
| Bulk insert (10k) | 45s | 2s | 22x faster |
| Full-text search | 8s | 0.3s | 26x faster |
| Dashboard load | 15s | 0.5s | 30x faster |
Infrastructure Impact
- Server Cost: Reduced from 3 servers to 1
- CPU Usage: Dropped from 80% to 25%
- Memory Usage: Optimized from 90% to 45%
- Storage I/O: Reduced by 60%
Monitoring & Maintenance
Essential Queries for Monitoring
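The first query below relies on the pg_stat_statements extension, which isn't enabled by default; a minimal setup sketch (the conf change requires a restart):
-- postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;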
-- 1. Find slow queries
SELECT query,
mean_exec_time,
calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
-- 2. Check index usage
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, relname;
-- 3. Largest tables by total size (a quick first pass at spotting bloat)
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
-- 4. Active connections
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;
Automated Monitoring Script
const { Pool } = require('pg');
const pool = new Pool(/* config */);
async function healthCheck() {
const checks = {
slowQueries: await pool.query(`
SELECT COUNT(*) FROM pg_stat_statements
WHERE mean_exec_time > 1000
`),
connections: await pool.query(`
SELECT COUNT(*) FROM pg_stat_activity
WHERE state = 'active'
`),
tableSize: await pool.query(`
SELECT pg_size_pretty(pg_database_size(current_database()))
`)
};
console.log('Database Health:', checks);
// Alert if thresholds exceeded (COUNT(*) comes back as a string)
if (Number(checks.slowQueries.rows[0].count) > 10) {
console.error('⚠️ Too many slow queries detected!');
}
}
// Run every 5 minutes
setInterval(healthCheck, 5 * 60 * 1000);
Common Pitfalls to Avoid
1. N+1 Query Problem
// ❌ BAD: N+1 queries
const orders = await pool.query('SELECT * FROM orders LIMIT 100');
for (let order of orders.rows) {
const customer = await pool.query(
'SELECT * FROM customers WHERE id = $1',
[order.customer_id]
);
order.customer = customer.rows[0];
}
// ✅ GOOD: Single JOIN query
const result = await pool.query(`
SELECT o.*, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.id
LIMIT 100
`);
2. Not Using Prepared Statements
// ❌ BAD: SQL injection risk + no query plan caching
const query = `SELECT * FROM orders WHERE id = ${userId}`;
// ✅ GOOD: Prepared statement
const query = 'SELECT * FROM orders WHERE id = $1';
await pool.query(query, [userId]);
3. Ignoring EXPLAIN Output
Always analyze your queries before deploying to production:
EXPLAIN (ANALYZE, BUFFERS, COSTS, VERBOSE, TIMING)
SELECT * FROM orders
WHERE created_at >= NOW() - INTERVAL '7 days';
Advanced Techniques
Full-Text Search Optimization
-- Add tsvector column
ALTER TABLE products ADD COLUMN search_vector tsvector;
-- Create GIN index
CREATE INDEX idx_products_search
ON products USING GIN(search_vector);
-- Update trigger to maintain search_vector
CREATE TRIGGER tsvector_update
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(search_vector, 'pg_catalog.english', title, description);
-- Efficient full-text search
SELECT * FROM products
WHERE search_vector @@ to_tsquery('english', 'laptop & gaming');
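On PostgreSQL 12+, a stored generated column can replace both the manual column and the trigger above (assuming title and description are plain text columns):
-- Keeps search_vector in sync on every write, no trigger needed
ALTER TABLE products
  ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
  ) STORED;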
Write-Ahead Log (WAL) Tuning
# For high-write workloads
wal_compression = on
wal_writer_delay = 200ms
commit_delay = 10              # in microseconds
commit_siblings = 5            # commit_delay only applies with this many concurrent transactions
Lessons Learned
- Measure First: Use EXPLAIN ANALYZE before optimizing
- Index Wisely: More indexes ≠ better performance
- Monitor Continuously: pg_stat_statements is your best friend
- Test in Production-Like Environment: Your local DB won't show scaling issues
- Incremental Changes: Optimize one thing at a time
- Document Everything: Future you will thank present you
Conclusion
Optimizing PostgreSQL for large-scale applications is an iterative process. The techniques I've shared here reduced our query times by 100x and cut infrastructure costs by 60%.
The key takeaways:
- ✅ Strategic indexing (not over-indexing)
- ✅ Query restructuring and proper JOINs
- ✅ Partitioning for time-series data
- ✅ Connection pooling with PgBouncer
- ✅ Caching at multiple levels
- ✅ Regular monitoring and maintenance
Resources & Tools
- pgAdmin - GUI management
- pg_stat_statements - Query statistics
- PgBouncer - Connection pooling
- pgBadger - Log analyzer
- explain.depesz.com - Visual EXPLAIN analyzer
Have you faced similar scaling challenges? What optimization techniques worked for your use case? Let me know in the comments!